This article summarizes experiences with the Lucene text search library put to work on a large XML Wikipedia article data file. This page may be of interest to you if you are a Java developer who wants to get to know the basics of using Lucene through an easy-to-follow explanatory text and ready-to-use Java code. There are more extensive tutorials on Lucene than this one, even a book. The software developed and available below is meant for demonstration purposes. This is not a full-blown Wikipedia reader.
This page is about two small utilities that do full text search on Wikipedia article data. Wikipedia is a free online encyclopedia that lets anyone edit its content. Sounds like a recipe for disaster, but works surprisingly well.
Anyway, what's interesting about Wikipedia in the context of this project: its volume of data is huge, and it can be downloaded for free. My goal was to play with the Lucene library. Lucene is a text search engine library written in Java (it has been ported to other languages as well).
I worked with the German language version of Wikipedia,
which is the second-largest one after the English version.
The German data files can be downloaded at
http://download.wikimedia.org/wikipedia/de/.
If you replace de with another language
abbreviation (e.g. en for English),
you'll get the other language versions.
I picked the file 20051105_pages_current.xml.bz2.
Actually, I downloaded only about 100 MB of the compressed version,
decompressed it ignoring bzip2's error message and
fixed the resulting broken XML file by removing the last half-downloaded
article and adding a closing </mediawiki> element.
If you can afford the traffic or if you have a lot of patience,
you may want to play with a complete file.
My data file finally was located at h:\20051105_pages_current.xml.
Apart from the data file you also need the code I've written.
Here it is: lucene-wikipedia.zip (13 KB).
Unpack it into some directory like c:\wikilucene\.
It has a subdirectory wikipedia which is also the package name
of my classes.
There are two Windows batch files, wikiindex.bat and wikisearch.bat, which
can be trivially adapted to become shell scripts.
The code is available under the LGPL version 2.1 (February 1999).
You'll also need an installed Java Runtime Environment (JRE). There are problems with the XML parser in 1.4.2, so you might want to use a 1.5 JRE. The other thing is the Lucene JAR file. Mine is called lucene-1.4.3.jar and I got it from the Lucene binaries page.
You'll have to create an index from the XML data file once; after that you can run searches on it. Call wikiindex.bat to create the index. This takes some time, depending on the size of the data file, your computer's CPU speed and its current load. You may want to adapt the batch file to your needs. It looks like this in its original form:
java -DentityExpansionLimit=200000 -cp .;h:\lucene-1.4.3\lucene-1.4.3.jar
wikipedia.IndexWikipedia h:\20051105_pages_current.xml h:\wikipedia
I'll ignore the entityExpansionLimit variable for now. The -cp switch sets the classpath (the current directory . and the path to the Lucene JAR file), then comes the indexer class name (including its package name wikipedia), then the file name of the article data, then the name of the directory where the index will be put. That directory must be readable and writeable. For my 334 MB data file I got an index of 410 MB. Before optimizing it was even larger, so take that into account.
The indexer spits out the title of every hundredth article to have some primitive progress report. Once it's done reading the XML file it prints the number of articles read (in my case Articles: 72639). It then optimizes the index, which takes some time.
Now for the searching part. Call the wikisearch.bat file followed by some words you want to find. I tried Autobahn and Marco Polo, which both returned ten reasonable results, including the articles on those two terms. If, like me, you did not download the complete XML file, be aware that a search may easily fail even if the article is in the real Wikipedia, simply because it is not included in the part that you downloaded. Here's the search batch file's content, which you may have to adapt:
java -cp .;h:\lucene-1.4.3\lucene-1.4.3.jar wikipedia.SearchWikipedia h:\wikipedia %1 %2 %3 %4 %5 %6 %7 %8
The -cp switch for the class path has remained the same, followed by the search program's class name, followed by the index directory and finally variables for up to eight query terms which will be passed to the actual program by this batch file.
I'll walk through the code in the same order the JVM does when running the two programs. The code is similar to the minimal example program in Lucene; it just works on a bigger text corpus.
The class IndexWikipedia does the indexing work. In its main method, it gets two arguments, XML file and index directory. It then creates a Lucene IndexWriter with the given directory and a StandardAnalyzer. A SAX parser (which does not keep the complete XML tree in memory, important for working with large XML files) is created, a WikipediaXmlHandler is given to its parse method. After the parsing, the optimization method of IndexWriter is called and the index is then closed. The program then terminates.
If you don't know how the SAX parsing system works—it is event-based and calls
methods in the Handler class which have to be overridden to do some
actual work.
In WikipediaXmlHandler, we override the characters method to intercept all
character data and put it into a StringBuffer.
The startElement method is overridden to create a new WikipediaArticle object when
a page element is started.
In the endElement method we retrieve text and title content and put it into the
article object.
Besides, when we encounter a closing page tag,
the article is given to a class putting that article into the index.
I've created an interface WikipediaArticleSink for this purpose; it only demands one method to add a WikipediaArticle. The implementation which was given to the Handler at construction time is called WikipediaArticleIndexer. It creates a Lucene Document object from the article and feeds it to the IndexWriter we created at startup.
So the mechanism is quite simple: wait for events indicating text and title information of an article, put it into a WikipediaArticle object, once the article is passed convert that object to a Document which is understood by Lucene and finally add the document to the index. At the end, optimize and close the index.
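The event flow described above can be sketched with nothing but the JDK's SAX classes. This is a minimal illustration, not the code from the zip file: the class names ArticleSketch, ArticleSinkSketch and PageHandlerSketch are made up for this example, the sink simply collects articles in a list instead of writing to a Lucene index, and the element names page, title and text are assumed to match the dump format.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical article holder; the real WikipediaArticle may look different.
class ArticleSketch {
    String title;
    String text;
}

// Receives each finished article (stands in for WikipediaArticleSink).
interface ArticleSinkSketch {
    void add(ArticleSketch article);
}

class PageHandlerSketch extends DefaultHandler {
    private final ArticleSinkSketch sink;
    private final StringBuilder buffer = new StringBuilder();
    private ArticleSketch current; // non-null only while inside a <page>

    PageHandlerSketch(ArticleSinkSketch sink) {
        this.sink = sink;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        buffer.setLength(0); // start collecting character data for this element
        if ("page".equals(qName)) {
            current = new ArticleSketch();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // The parser may deliver one element's text in several chunks.
        buffer.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (current == null) {
            return; // element outside of any page, nothing to store
        }
        if ("title".equals(qName)) {
            current.title = buffer.toString();
        } else if ("text".equals(qName)) {
            current.text = buffer.toString();
        } else if ("page".equals(qName)) {
            sink.add(current); // article complete, hand it over
            current = null;
        }
    }

    // Demo helper: parses an XML string and returns the collected articles.
    static List<ArticleSketch> parse(String xml) {
        List<ArticleSketch> articles = new ArrayList<>();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new PageHandlerSketch(articles::add));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return articles;
    }

    public static void main(String[] args) {
        String xml = "<mediawiki><page><title>Autobahn</title>"
            + "<revision><text>Some article text</text></revision></page></mediawiki>";
        for (ArticleSketch article : parse(xml)) {
            System.out.println(article.title);
        }
    }
}
```

In a real indexer the sink passed to the handler would build a Lucene Document from each article instead of adding it to a list.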
The SearchWikipedia program is a bit simpler than the indexing program. It opens an existing index; the first argument must be the index directory. It then compiles a query from the remaining program arguments. An IndexSearcher object is created on the index directory. A StandardAnalyzer is given to the QueryParser along with the query String and the field to search (the text field). Then up to ten hits are collected and their titles printed to standard output. The searcher object is closed and the program terminates.
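The argument handling of such a search program might look roughly like the following sketch. It deliberately leaves out all Lucene calls and only shows how a query string could be assembled from the command line and how the number of printed hits could be capped at ten. The class QuerySketch and its methods are hypothetical names and are not taken from the downloadable code.

```java
class QuerySketch {
    // Upper bound on the number of hits to print, as described in the text.
    static final int MAX_HITS = 10;

    // args[0] is the index directory; every argument after it is a query term.
    static String buildQuery(String[] args) {
        StringBuilder query = new StringBuilder();
        for (int i = 1; i < args.length; i++) {
            if (query.length() > 0) {
                query.append(' ');
            }
            query.append(args[i]);
        }
        return query.toString();
    }

    // Number of titles to print for a given total number of hits.
    static int hitsToPrint(int totalHits) {
        return Math.min(MAX_HITS, totalHits);
    }

    public static void main(String[] args) {
        String[] demo = {"h:\\wikipedia", "Marco", "Polo"};
        System.out.println(buildQuery(demo)); // prints "Marco Polo"
    }
}
```

The resulting string is what would be handed to the query parser together with the analyzer and the name of the field to search.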
I was most impressed by the speed of indexing and searching. Unfortunately I haven't taken any measurements yet, but it seems to me that the index was created at about 1 MB/s, and searches work without noticeable delay.
As I said before, I basically adapted a small demo program, so this was not much work. Most of the time I spent fighting with the SAX parsing. First of all, with the event system it's a nightmare to find out at what point in the XML file the parser is failing (e.g. with a NullPointerException) because all RuntimeExceptions are caught in the parser and rethrown with only the original exception's class name in it. Not helpful. That's where Eclipse in debugging mode comes in handy. I found a place where I assumed the article object would exist when in fact it was null.
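One possible way to make such failures easier to locate, which I did not use at the time, is to keep the Locator object that a SAX parser may hand to the handler and attach it to any exception thrown inside a callback. The sketch below is an illustration under that assumption: LocatingHandlerSketch is a made-up class, and it provokes a NullPointerException on purpose when a title element appears outside a page element. A parser is not obliged to supply a Locator, although the one built into the JDK does.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class LocatingHandlerSketch extends DefaultHandler {
    private Locator locator;    // supplied by the parser before parsing starts
    private StringBuilder page; // non-null only while inside a <page> element

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if ("page".equals(qName)) {
            page = new StringBuilder();
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        try {
            if ("title".equals(qName)) {
                page.append("title seen"); // NullPointerException if no <page> is open
            }
            if ("page".equals(qName)) {
                page = null;
            }
        } catch (RuntimeException e) {
            // Re-throw with the parser's current position attached.
            throw new SAXParseException("Handler failed: " + e, locator, e);
        }
    }

    // Demo helper: returns the line where the handler failed, or -1 if it did not fail.
    static int failingLine(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new LocatingHandlerSketch());
            return -1;
        } catch (SAXParseException e) {
            return e.getLineNumber();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String broken = "<mediawiki>\n<title>A</title>\n</mediawiki>";
        System.out.println("Handler failed on line " + failingLine(broken));
    }
}
```

With the line number in hand, the offending article can be looked up in the XML file directly instead of stepping through the parse in a debugger.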
The next problem was the built-in parser's failure to deal with exotic characters. I first thought that the Wikipedia file was incorrectly encoded (which was the case for the first version I tried, a data file from September which I had lying around, missing the initial <?xml version="1.0" encoding="utf-8"?> line). As it turns out, a character reference such as &#xd800; is a correct way to encode a character, but it resulted in the message Parser has reached the entity expansion limit "64.000" set by the Application. The solution is to give the JVM a higher value for that limit; that's what the -DentityExpansionLimit=200000 switch in wikiindex.bat is for. The decimal value for hexadecimal d800 is 55296, way below 64000, so I never really understood why the default value of 64000 isn't enough. Presumably the limit counts how many entities are expanded over the whole document rather than the numeric value of any single character, which would explain why a large file exceeds it.
Now for the next problem—the indexer ran longer, but after about 40,000 articles it gave up with the message Exception in thread "main" java.lang.InternalError: fillbuf. Not very helpful, but a search for the message revealed that it was some bug of the Crimson parser which had been patched a long time ago, but was still present in a 1.4.2 JRE. After searching for alternative libraries, download locations and usage examples I decided that it would be less trouble to upgrade to Java 1.5 (or 5.0 as it's called now), and indeed, the error did not occur anymore.