Create the Index
So, step one is to create the index for our set of Word documents. To do this, we need to write some code that takes the information from the Word documents and turns them into a searchable index. The only way to do this is by brute force. We’ll have to iterate over each of the Word documents, examing each and converting each into the pieces that Lucene needs to work with when it creates the index.
What are the pieces that Lucene needs to create the index? There are two.
These two abstractions are so key to Lucene that Lucene represents them with two top level Java classes, Document andField. A Document, not to be confused with our actual Word documents, is a Java class that represents a searchable item in Lucene. By searchable item, we mean that a Document is the thing that you find when you search. It’s up to you to create these Documents.
Lucky for us, it’s a pretty clear step from an actual Word document to a Lucene Document. I think anyone would agree that it will be the Word documents that our users will want to find when they conduct a search. This makes our processing rather simple, we will simply create a single Lucene Document for each of our actual Word documents.
Create the Document and its Fields
But how do we do that? It’s actually very easy. First, we make the Document object, with the new operator — nothing more. But at this point the Document is meaningless. We now have to decide what Fields to add to the Document. This is the part where we have to think. A Document is made of any number of Fields, and each Field has a name and a value. That’s all there is to it.
Two fields are created almost universally by developers creating Lucene indexes. The most important field will be the “content” field. This the Field that holds the content the Word document for which we are creating the Lucene Document. Bear in mind, the name of the Field is entirely arbitrary, but most people call one of the Fields “content” and they stick the actual content of the real world searchable object, the Word document in our case, into the value of that Field. In essense, a Field is just a name: value pair.
Another very common Field that developers create is the “title” Field. This field’s value will be the title of the Word document. What other information about the Word document might we want to keep in our index. Other common fields are things like “author”, “creation_date”, “keywords”, etc. The identification of the fields that you will need is entirely driven by your business requirements.
So, for each Word document that we want to make searchable, we will have to create a Lucene Document, with Fields such as those we outlined above. Once we have created the Document with those Fields, we then add it the Lucene index writer and ask it to write our Index. That’s it! We now have a searchable index. This is true, but we may have glossed over a couple of Field details. Let’s take a closer look at Fields.
Field Details: Stored or Indexed?
A Field may be kept in the index in more than one way. The most obvious way, and perhaps the only way that you might at first suspect the existence of, is the searchable way. In our example, we fully expect that if the user types in a word that exists in the contents of one of the Word documents, then the search will return that Word document in the search results. To do this, Lucene must index that Field. The nomenclature is a bit confusing a first, but, note, it is entirely possible to “store” a Field in the index without making it searchable. In other words, it’s possible to “store” a Field but not “index” it. Why? You’ll see shortly.
The first distiniction that Lucene makes between the way it can keep a Field in the index is whether it is stored or indexed. If we expect a match on a Field’s value to cause the Document to be hitby the search, then we must index the Field. If we only store the Field, it’s value can’t be reached by the search queries. Why thenstore a Field? Simple, when we hit the Document, via one of theindexed fields, Lucene will return us the entire Document object. All stored Fields will be available on that Document object; indexedFields will not be on that object. An indexed Field is information used to find an Document, a stored Field is information returned with the Document. Two different things.
This means that while we might not make searches based upon the contents of a given Field, we might still be able to make use of that Field’s value when the Document is returned by the search. The most obvious use case I can think of is a “url” Field for a web based Document. It makes no sense to search for the value of aURL, but you will definitely want to know the URL for the documents that your search returns. How else would your results page be able to steer the user to the hit page? This is a very important point: a stored Field’s value will be available on the Document returned by a search, but only an indexed Field’s value can actually be used as the target of a search.
Technically, stored Fields are kept within the Lucene index. But we must keep track of the fact that an indexed Field is different than a stored Field. Unfortunate nomenclature. This is why words matter. They can save on a lot of confusion.
Indexed Fields: Analyzed or Not Analyzed?
For the next wrinkle, we must point out that an indexed Field can be indexed in two different fashions. First, we can index the value of the Field in a single chunk. In other words, we might have a “phone number” Field. When we search for phone numbers, we need to match the entire value or nothing. This makes perfect sense. So, for a Field like phone number, we index the entire value ATOMICALLY into the Lucene index.
But let’s consider the “content” Field of the Word document. Do we want the user to have to match that entire Field? Certainly not. We want the contents of the Word document to be broken down into searchable tokens. This process is know as analyzation.We can start by throwing out all of the unimportant words like, “a”, “the”, “and”, etc. There are many other optimizations we can make, but the bottom line is that the content of a Field like “contents” should be analyzed by Lucene. This produces a targeted lightweight index. This is how search becomes efficient and powerful.
In the APIs, this comes down to the fact that when we create a Field, we must specify
- Whether to STORE it or not
- Whether to INDEX it or not
- If indexing, whether to ANALYZE it or not
Now, you should be clear on the details of Fields. Importantly, we can both store and index a given Field. It’s not an either or choice.