1. My original program
In fact, my original program was quite simple: it was adapted directly from the SearchFiles and IndexFiles classes in the Lucene demo. The only difference was that it used the SmartCN analyzer.
I'll post just the code for the parts I changed.
IndexChinese.java:
Date start = new Date();
try {
    IndexWriter writer = new IndexWriter(FSDirectory.open(INDEX_DIR),
            new SmartChineseAnalyzer(Version.LUCENE_CURRENT),
            true, IndexWriter.MaxFieldLength.LIMITED);
    System.out.println("Indexing to directory '" + INDEX_DIR + "'...");
    indexDocs(writer, docDir);
    System.out.println("Optimizing...");
    // writer.optimize();
    writer.close();
    Date end = new Date();
    System.out.println(end.getTime() - start.getTime() + " total milliseconds");
} catch (IOException e) {
    System.out.println(" caught a " + e.getClass() + "\n with message: " + e.getMessage());
}
SearchChinese.java:
Analyzer analyzer = new SmartChineseAnalyzer(Version.LUCENE_CURRENT);
BufferedReader in = null;
if (queries != null) {
    in = new BufferedReader(new FileReader(queries));
} else {
    in = new BufferedReader(new InputStreamReader(System.in, "GBK"));
}
Here I specify that queries read from standard input are GBK-encoded.
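For context, here is a minimal sketch of how the rest of the search loop might look, following the Lucene 3.0 demo. The "contents" field name, the result count of 10, and the reuse of INDEX_DIR and analyzer from above are my assumptions, and the checked ParseException and IOException still need to be handled by the enclosing method:

IndexSearcher searcher = new IndexSearcher(FSDirectory.open(INDEX_DIR), true);
QueryParser parser = new QueryParser(Version.LUCENE_CURRENT, "contents", analyzer);
String line = in.readLine();
if (line != null && line.trim().length() > 0) {
    // the GBK-decoded query string is tokenized by SmartChineseAnalyzer here
    Query query = parser.parse(line.trim());
    TopDocs results = searcher.search(query, 10);
    System.out.println(results.totalHits + " total matching documents");
}
searcher.close();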
Then I ran it confidently... and found that Chinese text could not be retrieved, while English retrieval worked fine.
2. Finding the problem
That left me stumped. I was not very familiar with Java or Lucene, and there wasn't much discussion of the 3.0.0 version I was using, so I fiddled around for a while and eventually found that if I saved the file in ANSI format, its Chinese content could be retrieved (the file had been UTF-8). That pointed to a file-encoding problem. After some searching, I found the following code in IndexChinese.java:
static void indexDocs(IndexWriter writer, File file) throws IOException {
    // do not try to index files that cannot be read
    if (file.canRead()) {
        if (file.isDirectory()) {
            String[] files = file.list();
            // an IO error could occur
            if (files != null) {
                for (int i = 0; i < files.length; i++) {
                    indexDocs(writer, new File(file, files[i]));
                }
            }
        } else {
            System.out.println("adding " + file);
            try {
                writer.addDocument(FileDocument.Document(file));
            }
            // at least on windows, some temporary files raise this exception
            // with an "access denied" message; checking if the file can be read doesn't help
            catch (FileNotFoundException fnfe) {
                ;
            }
        }
    }
}
The key part is this line:
try {
    writer.addDocument(FileDocument.Document(file));
}
The code that reads the file should be behind this call, so I traced into it:
public static Document Document(File f)
        throws java.io.FileNotFoundException, UnsupportedEncodingException {
    Document doc = new Document();
    doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("modified",
            DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
            Field.Store.YES, Field.Index.NOT_ANALYZED));
    doc.add(new Field("contents", new FileReader(f)));
    // return the document
    return doc;
}

private FileDocument() {}
This is a helper class from the Lucene demo. Its job is to build a Document from a text file. The generated Document has three fields by default: path, modified, and contents, where contents holds the text content of the file. The culprit is new FileReader(f): FileReader always reads with the platform default encoding and offers no way to specify one, so I simply modified it here:
FileInputStream fis = new FileInputStream(f);
// decode the byte stream into characters with an explicitly specified encoding
InputStreamReader isr = new InputStreamReader(fis, "UNICODE");
// buffer the character stream
BufferedReader br = new BufferedReader(isr);
doc.add(new Field("contents", br));
The "UNICODE" here can be replaced by any encoding Java supports; once I changed it to "UTF-8", my UTF-8 files were retrieved normally.
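A cleaner variant would be to pass the encoding in from the caller instead of hard-coding it. Here is a minimal sketch of how the modified FileDocument might look; the two-argument Document overload and its charsetName parameter are my own additions, not part of the demo:

import java.io.*;
import org.apache.lucene.document.DateTools;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class FileDocument {
    // Hypothetical overload: build a Document from a file, decoding its
    // contents with an explicit encoding instead of the platform default
    // that FileReader silently uses.
    public static Document Document(File f, String charsetName)
            throws FileNotFoundException, UnsupportedEncodingException {
        Document doc = new Document();
        doc.add(new Field("path", f.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("modified",
                DateTools.timeToString(f.lastModified(), DateTools.Resolution.MINUTE),
                Field.Store.YES, Field.Index.NOT_ANALYZED));
        // decode the byte stream with the caller-supplied encoding
        doc.add(new Field("contents", new BufferedReader(
                new InputStreamReader(new FileInputStream(f), charsetName))));
        return doc;
    }

    private FileDocument() {}
}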
3. Some guesses:
When Lucene indexes files, the source encoding itself does not matter: as long as the encoding is declared correctly when the file is read, the content can be retrieved normally. In other words, indexing the same text stored in different encodings should produce identical results (this still needs verification).
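A minimal sketch of how that guess could be checked (the file names and the sample Chinese string are made up for the test; it assumes the indexing side reads each file with its own declared encoding, as modified above):

import java.io.*;

// Write identical Chinese text in two encodings; if the guess holds,
// indexing each file with its matching encoding and searching for a
// word from the text should return both documents.
public class EncodingTest {
    public static void main(String[] args) throws IOException {
        String text = "中文编码测试"; // sample Chinese text
        writeWithEncoding(new File("doc-utf8.txt"), text, "UTF-8");
        writeWithEncoding(new File("doc-gbk.txt"), text, "GBK");
        // ...then index doc-utf8.txt as "UTF-8" and doc-gbk.txt as "GBK",
        // search for "中文", and check that both documents match.
    }

    static void writeWithEncoding(File f, String text, String enc) throws IOException {
        Writer w = new OutputStreamWriter(new FileOutputStream(f), enc);
        w.write(text);
        w.close();
    }
}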