Implementing full-text search with Lucene.Net

Author：Eve Cole Update Time：2009-06-30 17:02:06

After a few busy days, I finally implemented a simple full-text search. This article reviews and summarizes what I learned: what Lucene.Net is, what it can do, and how to use it. It ends with an example of full-text search implemented with Lucene.Net.

1. What is Lucene.Net?

Lucene.net started as an open source project and later went commercial. Lucene.net 2.0 has been released, but you have to pay for it. Its fate is somewhat similar to that of FreeTextBox, which took the commercial route starting with version 2.0, the release after version 1.6.5. FreeTextBox 2.0 offers a free version only as a DLL, and the source code requires a commercial license. The source code of version 1.6.5 remains available, so you can still see most of the internal details, but the Mozilla browser support added in version 2.0 is visible only through the HTML and JavaScript it generates.

Lucene is a commonly used indexing API in the Java world. Using the methods it provides, you can build indexes over text material and search them. (Reference: NLucene and Lucene .NET) NLucene was the first .NET port, a .NET-style version that follows .NET naming conventions and class library design. However, for lack of time and energy, the NLucene project lead released only a 1.2 beta version. After the Lucene.NET project appeared, there were no new plans for NLucene.

Lucene.NET originally billed itself as an up-to-date .NET port of Lucene. It followed .NET conventions only in naming, and its main goal was compatibility with Java Lucene in two respects: a compatible index format, so that the two could work on the same index, and closely matching names (with only minor differences such as capitalization), so that developers could easily reuse code and documentation written for Java Lucene.

At some point, I don't know exactly when, the Lucene.NET project gave up its open source plans and went commercial. It even deleted the open source files on SourceForge. At the same time, the dotLucene project appeared on SourceForge. In protest against Lucene.NET, dotLucene took the Lucene.NET code almost unchanged as its starting point ( https://sourceforge.net/forum/forum.php?thread_id=1153933&forum_id=408004 ).

Put simply, Lucene.Net is an information retrieval library. You can use it to add indexing and search functions to your application.

Lucene users do not need in-depth knowledge of full-text retrieval; they only need to learn to use the library. If you know how to call its functions, you can add full-text search to your application.

But don't expect Lucene to be a search engine like Google or Baidu. It is just a tool, a library. You can think of it as a simple, easy-to-use API that encapsulates indexing and search. You can do many search-related things with this API, and it is convenient enough to give an application simple full-text search. For an application developer (as opposed to a professional search engine developer), its features are more than sufficient.

2. What can Lucene.Net do?

Lucene can index and search any data. Regardless of the format of the data source, as long as it can be converted into text, Lucene can analyze and use it. In other words, whether a file is MS Word, HTML, PDF, or any other format, you can use Lucene to index and search it as long as you can extract text content from it.

3. How to use Lucene.Net?

It boils down to two things: creating an index and using an index. Creating an index means analyzing or storing the key information from the data source to be searched, leaving markers for later searches, much like building a table of contents in Word (my personal understanding). Using the index means consulting that index information at search time to locate and extract the information we need.
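To make the "create an index, then use it" idea concrete, here is a minimal sketch of a toy inverted index in plain C#. It is only an illustration of the concept and makes no claim about how Lucene.Net stores its index internally; the ToyIndex class and its methods are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;

// Toy inverted index: maps each word to the ids of the documents that contain it.
// Illustration only; this is not how Lucene.Net stores its index.
public class ToyIndex
{
    private readonly Dictionary<string, SortedSet<int>> postings =
        new Dictionary<string, SortedSet<int>>();

    // "Creating the index": split the text into words and record the document id for each word.
    public void Add(int docId, string text)
    {
        char[] separators = { ' ', ',', '.', ';', ':', '!', '?', '\t', '\n', '\r' };
        foreach (string word in text.ToLowerInvariant()
                     .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            SortedSet<int> ids;
            if (!postings.TryGetValue(word, out ids))
            {
                ids = new SortedSet<int>();
                postings[word] = ids;
            }
            ids.Add(docId);
        }
    }

    // "Using the index": look the word up directly instead of scanning every document.
    public IEnumerable<int> Search(string word)
    {
        SortedSet<int> ids;
        return postings.TryGetValue(word.ToLowerInvariant(), out ids)
            ? (IEnumerable<int>)ids
            : new int[0];
    }
}

public static class ToyIndexDemo
{
    public static void Main()
    {
        var index = new ToyIndex();
        index.Add(1, "Lucene.Net is a search library");
        index.Add(2, "A full-text search example");
        Console.WriteLine(string.Join(",", index.Search("search")));   // prints 1,2
        Console.WriteLine(string.Join(",", index.Search("library")));  // prints 1
    }
}
```

The point of the sketch is the split of work: the expensive pass over the text happens once, when documents are added, and each search is then a single dictionary lookup.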

Please take a look at the example, a class for creating an index:

using System.IO;
using System.Text.RegularExpressions;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;

public class IntranetIndexer
{
    // Index writer
    private IndexWriter writer;

    // Root directory of the files to be indexed
    private string docRootDirectory;

    // File name patterns to match
    private string[] pattern;

    /// <summary>
    /// Initializes the index writer. The third constructor argument, true,
    /// means a new index is created, overwriting any index that already exists.
    /// </summary>
    /// <param name="directory">Path of the directory where the index is created, passed as a string. It is created automatically if it does not exist.</param>
    public IntranetIndexer(string directory)
    {
        writer = new IndexWriter(directory, new StandardAnalyzer(), true);
        writer.SetUseCompoundFile(true);
    }

    public void AddDirectory(DirectoryInfo directory, string[] pattern)
    {
        this.docRootDirectory = directory.FullName;
        this.pattern = pattern;
        addSubDirectory(directory);
    }

    // Indexes every matching file in the directory, then recurses into its subdirectories
    private void addSubDirectory(DirectoryInfo directory)
    {
        foreach (string p in pattern)
        {
            foreach (FileInfo fi in directory.GetFiles(p))
            {
                AddHtmlDocument(fi.FullName);
            }
        }
        foreach (DirectoryInfo di in directory.GetDirectories())
        {
            addSubDirectory(di);
        }
    }

    public void AddHtmlDocument(string path)
    {
        string exname = Path.GetExtension(path).ToLower();
        bool isHtml = exname == ".html" || exname == ".htm";
        Document doc = new Document();

        // .html, .htm and .txt files are read with the system default encoding,
        // all other files as Unicode
        System.Text.Encoding encoding = (isHtml || exname == ".txt")
            ? System.Text.Encoding.Default
            : System.Text.Encoding.Unicode;

        string html;
        using (StreamReader sr = new StreamReader(path, encoding))
        {
            html = sr.ReadToEnd();
        }

        // Path of the file relative to the document root
        int relativePathStartsAt = this.docRootDirectory.EndsWith("\\") ? this.docRootDirectory.Length : this.docRootDirectory.Length + 1;
        string relativePath = path.Substring(relativePathStartsAt);
        string title = Path.GetFileName(path);

        // For a web page, strip the tags first; otherwise index the text as it is
        if (isHtml)
        {
            doc.Add(Field.UnStored("text", parseHtml(html)));
        }
        else
        {
            doc.Add(Field.UnStored("text", html));
        }
        doc.Add(Field.Keyword("path", relativePath));
        //doc.Add(Field.Text("title", getTitle(html)));
        doc.Add(Field.Text("title", title));
        writer.AddDocument(doc);
    }

    /// <summary>
    /// Removes the tags from a web page
    /// </summary>
    /// <param name="html">Web page</param>
    /// <returns>The text of the web page with the tags removed</returns>
    private string parseHtml(string html)
    {
        string temp = Regex.Replace(html, "<[^>]*>", "");
        // Replace non-breaking space entities with ordinary spaces
        return temp.Replace("&nbsp;", " ");
    }

    /// <summary>
    /// Gets the page title
    /// </summary>
    /// <param name="html">Web page</param>
    /// <returns>The content of the title tag, or a placeholder if there is none</returns>
    private string getTitle(string html)
    {
        Match m = Regex.Match(html, "<title>(.*)</title>");
        if (m.Success)
            return m.Groups[1].Value;
        return "Document title unknown";
    }

    /// <summary>
    /// Optimizes the index and closes the writer
    /// </summary>
    public void Close()
    {
        writer.Optimize();
        writer.Close();
    }
}

The code first creates a Document object and then adds Field objects to it. You can think of the Document as a virtual file from which information will later be retrieved, and of each Field as metadata describing that virtual file. There are four types of Field:

Keyword
Data of this type is not analyzed, but it is indexed and stored in the index.

UnIndexed
Data of this type will not be analyzed or indexed, but will be stored in the index.

UnStored
Just the opposite of UnIndexed, it is analyzed and indexed, but not saved.

Text
Similar to UnStored in that the data is analyzed and indexed. If the value is a string, it is also stored; if the value is a Reader, it is not stored, just like UnStored.
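The four field types can be put side by side in a short sketch. It assumes the same early Lucene.Net factory methods that the indexer above calls (Field.Keyword, Field.UnStored, Field.Text) plus Field.UnIndexed for the remaining type; the field names and values are made up for illustration, and newer releases of the library construct fields differently.

```csharp
using Lucene.Net.Documents;

public static class FieldTypesSketch
{
    // Builds one Document that uses each of the four field types once.
    // Field names and values are hypothetical.
    public static Document BuildExample()
    {
        Document doc = new Document();
        doc.Add(Field.Keyword("path", "docs/readme.html"));  // not analyzed, indexed, stored
        doc.Add(Field.UnIndexed("size", "2048"));            // not analyzed, not indexed, stored
        doc.Add(Field.UnStored("text", "body text"));        // analyzed, indexed, not stored
        doc.Add(Field.Text("title", "Read me"));             // analyzed, indexed, stored (string value)
        return doc;
    }
}
```

As a rule of thumb under these definitions: use Keyword for identifiers you match exactly, UnIndexed for values you only display, UnStored for large bodies of text you search but can reload from disk, and Text for short text you both search and display.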

Finally, each Document is added to the index.
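A short usage sketch for the IntranetIndexer class shown earlier. Both paths are hypothetical placeholders, and the file patterns are just one plausible choice that matches the extensions the class treats specially.

```csharp
using System.IO;

public static class BuildIndexSketch
{
    public static void Main()
    {
        // Both paths are hypothetical; replace them with your own.
        IntranetIndexer indexer = new IntranetIndexer(@"C:\search\index");
        try
        {
            // Index every .html, .htm and .txt file under the document root, recursively.
            indexer.AddDirectory(new DirectoryInfo(@"C:\search\documents"),
                                 new string[] { "*.html", "*.htm", "*.txt" });
        }
        finally
        {
            // Optimizes the index and closes the writer, even if indexing failed part-way.
            indexer.Close();
        }
    }
}
```

Because the constructor passes true to IndexWriter, every run of this sketch rebuilds the index from scratch rather than updating it.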

The following code searches the index:

// Create a searcher
IndexSearcher searcher = new IndexSearcher(indexDirectory);
// Parse the query against the "text" field of the index
Query query = QueryParser.Parse(this.Q, "text", new StandardAnalyzer());
// Run the search and put the results in hits
Hits hits = searcher.Search(query);
// Total number of matching records
this.total = hits.Length();
// Highlighter for the matched terms
QueryHighlightExtractor highlighter = new QueryHighlightExtractor(query, new StandardAnalyzer(), "<font color=red>", "</font>");

The first step uses IndexSearcher to open the index for subsequent searches; its argument is the path of the index.

The second step uses QueryParser to convert a human-readable query string (a single word such as lucene, or a more advanced expression such as lucene AND .net) into the query object that Lucene uses internally.
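To see what the parser makes of different inputs, a small sketch like the following can help. It assumes the same static QueryParser.Parse call used in the search code above and the Lucene.Net.QueryParsers namespace; the sample query strings are only illustrations of the standard Lucene query syntax (boolean operators, quoted phrases, field prefixes).

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public static class QuerySyntaxSketch
{
    public static void Main()
    {
        string[] inputs = { "lucene", "lucene AND net", "\"full text\" OR search", "title:lucene" };
        foreach (string input in inputs)
        {
            // "text" is the default field, used when the query string names no field itself.
            Query query = QueryParser.Parse(input, "text", new StandardAnalyzer());
            // ToString shows the query as Lucene understands it after parsing and analysis.
            Console.WriteLine(input + "  =>  " + query.ToString("text"));
        }
    }
}
```

Printing the parsed form is a handy debugging step, because the analyzer may lowercase or drop words before the query is run.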

The third step performs the search and returns the results in the hits collection. Note that Lucene does not load all the results into hits at once; to save memory, it fetches them a batch at a time.

The search results are then processed and displayed on the page:
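The display loop relies on two bounds, startAt and resultsCount, that this article does not define. A minimal sketch of how they might be computed for paged results follows; the PageBounds helper and its pageSize and page parameters are assumptions made for illustration, not part of the original code.

```csharp
using System;

public static class PagingSketch
{
    // Computes the half-open range [startAt, resultsCount) of hit indexes
    // for a 1-based page number. Hypothetical helper, not from the article.
    public static void PageBounds(int total, int pageSize, int page,
                                  out int startAt, out int resultsCount)
    {
        if (pageSize <= 0) throw new ArgumentOutOfRangeException("pageSize");
        if (page < 1) page = 1;
        // Clamp both ends to total so a page past the end yields an empty range.
        startAt = Math.Min((page - 1) * pageSize, total);
        resultsCount = Math.Min(startAt + pageSize, total);
    }

    public static void Main()
    {
        int startAt, resultsCount;
        PageBounds(25, 10, 3, out startAt, out resultsCount);
        Console.WriteLine(startAt + ".." + resultsCount);  // prints 20..25
    }
}
```

With bounds like these in place, the display loop reads: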

for (int i = startAt; i < resultsCount; i++)
{
    Document doc = hits.Doc(i);

    string path = doc.Get("path");
    string location = Server.MapPath("documents") + "\\" + path;
    // Lower-case the extension so the check matches the one used when indexing
    string exname = Path.GetExtension(path).ToLower();

    string plainText;
    if (exname == ".html" || exname == ".htm" || exname == ".txt")
    {
        using (StreamReader sr = new StreamReader(location, System.Text.Encoding.Default))
        {
            plainText = parseHtml(sr.ReadToEnd());
        }
    }
    else
    {
        using (StreamReader sr = new StreamReader(location, System.Text.Encoding.Unicode))
        {
            plainText = sr.ReadToEnd();
        }
    }

    // Add a row to the results DataTable
    DataRow row = this.Results.NewRow();
    row["title"] = doc.Get("title");
    string IP = Request.Url.Host; // Host name (or IP) of the server
    // Request.Url.Port;
    row["path"] = "http://" + IP + "/WebUI/Search/documents/" + path;
    row["sample"] = highlighter.GetBestFragments(plainText, 80, 2, "");
    this.Results.Rows.Add(row);
}
searcher.Close(); // Close the searcher

If you want a more advanced, comprehensive, and in-depth understanding of Lucene.Net, please refer to the following websites:

http://www.alphatom.com/

http://blog.tianya.cn/blogger/view_blog.asp?BlogName=aftaft