Replace indexed but not stored data in Lucene

I was trying to replace indexed but not stored data in Lucene, and found a mailing list thread describing the same issue:

> > > I have a strange problem with Field.Store.NO and Field.Index.ANALYZED fields with Lucene 3.0.1.
> > >
> > > I'm testing my app with twenty test documents. Each has about ten fields. All fields except one, "Content", are set as Field.Store.YES. The "Content" field is set as Field.Store.NO and Field.Index.ANALYZED. Using Luke, I discovered that this "Content" field is not persisted to the disk, except on one document (neither the first nor the last in the list). This always happens for exactly the same document. When I examine the Document object before writing it, it has the "Content" field I expect.
> > >
> > > When I change the "Content" field from Field.Store.NO to Field.Store.YES, everything starts working. Every document has the "Content" field exactly as I expect, and searches produce the hits I expect to see. I really don't want to save the full "Content" data in the Lucene index, though. I'm baffled why Field.Store.NO results in nothing being written to the index even with Field.Index.ANALYZED.
> >
> > I finally had time to go back and look at this problem. I discovered that the analyzed fields work fine for searching until I use IndexWriter.updateDocument().
> >
> > The way my application runs, it has to update documents several times to update one specific field. The update code queries out Document objects using a unique identifier and updates the field. The problem is in the Document objects returned by the query. The querying code runs a search, and eventually calls IndexSearcher.doc(int). According to the API documentation, that method only returns Document objects with stored fields from the underlying index.
> >
> > I tried calling IndexSearcher.doc(int i, FieldSelector fieldSelector) with fieldSelector set to null: the documentation states that this returns Document objects with all fields, but that also only seems to return stored fields.
> >
> > So my question becomes: how can I update a document which contains non-stored analyzed fields without clobbering the analyzed-only fields? Note that I do not need to update the analyzed-only fields. I have found nothing helpful in the documentation.
>
> You cannot retrieve non-stored fields. They are analyzed and tokenized during indexing, and this is a one-way transformation. If you update documents, you have to reindex the contents. If you do not have access to the original contents anymore, you may consider adding a stored-only "raw document" field that contains everything needed to rebuild the indexed fields. In our installation, we have a stored field containing the JSON/XML source document to do this.
Adding to Uwe's comment, you may be operating under a false
assumption. Lucene has no capability to update fields in a document.
Period. This is one of the most frequently requested changes, but
the nature of an inverted index makes this...er...tricky. Updates
are really a document delete followed by a document add. And as
a bonus, the new document won't even have the same internal
Lucene doc id as the one it replaces.

So if you're reading a document from the index, non-stored fields
are not part of the new update and your results will be...uhmmmm....
not what you expect...
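
A minimal sketch of Uwe's "stored raw document" approach under the Lucene 3.x API used in this post. It assumes an open IndexWriter writer and Directory dir; the "id"/"raw" field names and the rebuildDocument() helper are hypothetical:

Term id = new Term("id", "42");
IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir, true));
TopDocs hits = searcher.search(new TermQuery(id), 1);
Document old = searcher.doc(hits.scoreDocs[0].doc);

// only stored fields survive the round trip, so keep a stored copy of the
// original source and rebuild every indexed-only field from it
String rawJson = old.get("raw");            // stored-only source document
Document fresh = rebuildDocument(rawJson);  // hypothetical: re-adds the analyzed fields
fresh.add(new Field("status", "updated", Field.Store.YES, Field.Index.NOT_ANALYZED));

// an "update" in Lucene is really a delete followed by an add
writer.updateDocument(id, fresh);
writer.commit();
searcher.close();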

Also, this is a good link about common mistakes when using Lucene.

This is a good article for Lucene coding reference.


Tags: lucene index, store, analyzed, not_analyzed, Document, Fields

Create the Index

So, step one is to create the index for our set of Word documents. To do this, we need to write some code that takes the information from the Word documents and turns it into a searchable index. The only way to do this is by brute force. We’ll have to iterate over each of the Word documents, examining each and converting it into the pieces that Lucene needs to work with when it creates the index.

What are the pieces that Lucene needs to create the index? There are two.

  1. Documents
  2. Fields

These two abstractions are so key to Lucene that Lucene represents them with two top-level Java classes, Document and Field. A Document, not to be confused with our actual Word documents, is a Java class that represents a searchable item in Lucene. By searchable item, we mean that a Document is the thing that you find when you search. It’s up to you to create these Documents.

Lucky for us, it’s a pretty clear step from an actual Word document to a Lucene Document. I think anyone would agree that it will be the Word documents that our users will want to find when they conduct a search. This makes our processing rather simple: we will create a single Lucene Document for each of our actual Word documents.

Create the Document and its Fields

But how do we do that? It’s actually very easy. First, we make the Document object, with the new operator — nothing more. But at this point the Document is meaningless. We now have to decide what Fields to add to the Document. This is the part where we have to think. A Document is made of any number of Fields, and each Field has a name and a value. That’s all there is to it.

Two fields are created almost universally by developers creating Lucene indexes. The most important field will be the “content” field. This is the Field that holds the content of the Word document for which we are creating the Lucene Document. Bear in mind, the name of the Field is entirely arbitrary, but most people call one of the Fields “content” and they stick the actual content of the real-world searchable object, the Word document in our case, into the value of that Field. In essence, a Field is just a name:value pair.

Another very common Field that developers create is the “title” Field. This field’s value will be the title of the Word document. What other information about the Word document might we want to keep in our index? Other common fields are things like “author”, “creation_date”, “keywords”, etc. The identification of the fields that you will need is entirely driven by your business requirements.

So, for each Word document that we want to make searchable, we will have to create a Lucene Document, with Fields such as those we outlined above. Once we have created the Document with those Fields, we then add it to the Lucene index writer and ask it to write our index. That’s it! We now have a searchable index. This is true, but we may have glossed over a couple of Field details.
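
As a concrete illustration, here is a minimal sketch of that step with the Lucene 3.x API used later in this post. Here wordFiles (a collection of java.io.File objects) and the extractTitle() and extractText() helpers are hypothetical stand-ins for whatever Word-parsing library you use:

Directory dir = FSDirectory.open(new File("/lucene/word-index"));
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_36),
        IndexWriter.MaxFieldLength.UNLIMITED);
for (File wordFile : wordFiles) {
    Document doc = new Document();   // one Lucene Document per Word document
    doc.add(new Field("title", extractTitle(wordFile), Field.Store.YES, Field.Index.ANALYZED));
    doc.add(new Field("content", extractText(wordFile), Field.Store.NO, Field.Index.ANALYZED));
    writer.addDocument(doc);         // hand the Document to the index writer
}
writer.close();                      // the index is now searchable

With that in place, let’s take a closer look at Fields.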

Field Details: Stored or Indexed?

A Field may be kept in the index in more than one way. The most obvious way, and perhaps the only way you might at first suspect the existence of, is the searchable way. In our example, we fully expect that if the user types in a word that exists in the contents of one of the Word documents, then the search will return that Word document in the search results. To do this, Lucene must index that Field. The nomenclature is a bit confusing at first, but, note, it is entirely possible to “store” a Field in the index without making it searchable. In other words, it’s possible to “store” a Field but not “index” it. Why? You’ll see shortly.

The first distinction that Lucene makes in how it can keep a Field in the index is whether the Field is stored or indexed. If we expect a match on a Field’s value to cause the Document to be hit by the search, then we must index the Field. If we only store the Field, its value can’t be reached by search queries. Why then store a Field? Simple: when we hit the Document, via one of the indexed fields, Lucene will return us the entire Document object. All stored Fields will be available on that Document object; indexed-only Fields will not be on that object. An indexed Field is information used to find a Document; a stored Field is information returned with the Document. Two different things.

This means that while we might not make searches based upon the contents of a given Field, we might still be able to make use of that Field’s value when the Document is returned by the search. The most obvious use case I can think of is a “url” Field for a web-based Document. It makes no sense to search for the value of a URL, but you will definitely want to know the URL for the documents that your search returns. How else would your results page be able to steer the user to the hit page? This is a very important point: a stored Field’s value will be available on the Document returned by a search, but only an indexed Field’s value can actually be used as the target of a search.

Technically, stored Fields are kept within the Lucene index. But we must keep track of the fact that an indexed Field is different from a stored Field. Unfortunate nomenclature. This is why words matter. They can save a lot of confusion.
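
To make the distinction concrete, here is a small sketch (assuming an open IndexWriter writer on Directory dir; the field names are illustrative). The “url” Field is stored but not indexed; the “content” Field is indexed but not stored:

Document doc = new Document();
doc.add(new Field("url", "http://example.com/page", Field.Store.YES, Field.Index.NO));
doc.add(new Field("content", "the page text ...", Field.Store.NO, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir, true));
TopDocs hits = searcher.search(new TermQuery(new Term("content", "page")), 1);
Document found = searcher.doc(hits.scoreDocs[0].doc);
System.out.println(found.get("url"));      // "http://example.com/page" -- stored, so it comes back
System.out.println(found.get("content"));  // null -- indexed only, nothing to return

Searching on “url” would find nothing, since its terms were never indexed.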

Indexed Fields: Analyzed or Not Analyzed?

For the next wrinkle, we must point out that an indexed Field can be indexed in two different fashions. First, we can index the value of the Field in a single chunk. In other words, we might have a “phone number” Field. When we search for phone numbers, we need to match the entire value or nothing. This makes perfect sense. So, for a Field like phone number, we index the entire value ATOMICALLY into the Lucene index.

But let’s consider the “content” Field of the Word document. Do we want the user to have to match that entire Field? Certainly not. We want the contents of the Word document to be broken down into searchable tokens. This process is known as analysis. We can start by throwing out all of the unimportant words like “a”, “the”, “and”, etc. There are many other optimizations we can make, but the bottom line is that the content of a Field like “content” should be analyzed by Lucene. This produces a targeted, lightweight index. This is how search becomes efficient and powerful.

In the APIs, this comes down to the fact that when we create a Field, we must specify

  1. Whether to STORE it or not
  2. Whether to INDEX it or not
    • If indexing, whether to ANALYZE it or not

Now, you should be clear on the details of Fields. Importantly, we can both store and index a given Field. It’s not an either/or choice. The sketch below shows the common combinations.
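
In the 3.x API, those decisions are the last two constructor arguments of Field. A quick sketch of the common combinations (the values are placeholders):

new Field("title", title, Field.Store.YES, Field.Index.ANALYZED);      // stored and indexed, tokenized
new Field("phone", phone, Field.Store.YES, Field.Index.NOT_ANALYZED);  // indexed atomically, exact match only
new Field("content", text, Field.Store.NO, Field.Index.ANALYZED);      // searchable, but no copy kept
new Field("url", url, Field.Store.YES, Field.Index.NO);                // stored only, returned with hits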

copy FSDirectory to InfinispanDirectory

FSDirectory vs InfinispanDirectory vs RAMDirectory

The RAMDirectory provided by Lucene is not really meant for high performance. The filesystem based implementations using NIO and memory map are likely more efficient, unless you’re dealing with indexes meant for proof of concepts and unit tests.

The Infinispan Directory is – like the filesystem one – tuned for good performance; it is in fact a bit faster than the filesystem ones to perform write operations (for obvious reasons); the speed race on read performance is a subtle battle, strongly depending on your actual use case.

The main reason to use the Infinispan Directory is not raw performance but:

  1. its capability to replicate and distribute the index across multiple nodes: using shared filesystems for FSDirectory is usually problematic and not fast at all.
  2. It’s able to work as a caching write-through store to slower persistence services. FS is one, but it might be a relational database, Cassandra, cloud storage services such as S3, … [write your plugin]
  3. It’s (optionally) transactional. You could have it participate in XA transactions if needed.

Migration steps:

  • Get the Infinispan Lucene directory configured via lucene-infinispan-config.xml; sample configurations can be found online.
    1. Pay attention to the <transport> tag, for which I use JGroups.
    2. For the DB cache store: in a web environment, use org.infinispan.loaders.jdbc.connectionfactory.ManagedConnectionFactory as the connectionFactoryClass property and add the JNDI settings; in a standalone environment, use org.infinispan.loaders.jdbc.connectionfactory.PooledConnectionFactory and add the JDBC settings.
  • Create an InfinispanDirectory using lucene-infinispan-config.xml. In the constructor, specify the chunk size if the index files are huge; otherwise it uses the default buffer size of 16 KB, which may kill the DB if we have a 2 GB or larger index file. I use 30 MB.
  • Get the files from the source FSDirectory ready.
  • Use the Directory API (srcDir.copy(destDir, src, dest)) to do the migration, as in the snippet below.
  • Last step: test.
Directory dir;
Cache<Object, Object> cache = new DefaultCacheManager("lucene-infinispan-config.xml").getCache("INDX_MD");
// 31457280 bytes = 30 MB chunk size, instead of the 16 KB default
Directory infinispanDirectory = new InfinispanDirectory(cache, "tempTestIndex", 31457280, new NoopSegmentReadLocker());
if (args == null || args.length != 1)
{
    throw new IllegalArgumentException("please specify index file path as the 1st argument.");
}
dir = FSDirectory.open(new File(args[0]));
for (String file : dir.listAll())
{
    System.out.println("--> " + file + "  copying ... ");
    dir.copy(infinispanDirectory, file, file);
}

For testing:

import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.infinispan.Cache;
import org.infinispan.lucene.InfinispanDirectory;
import org.infinispan.manager.DefaultCacheManager;
import org.junit.Assert;
import org.junit.Test;

/**
 * Created with IntelliJ IDEA. User: LiHa Date: 11/5/12
 */
public class InfinispanSearchTest
{
    private static File fsDirectoryBase = new File("C:\\Users\\LiHa\\Desktop\\indexFiles\\");
    private static File fsDirectoryTest1 = new File(fsDirectoryBase, "test1");

    /**
     * write doc into FSDirectory in the file system
     *
     * @throws IOException
     */
    @Test
    public void beforeMigrationTest() throws IOException
    {
        Directory dir = FSDirectory.open(fsDirectoryBase);
        StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_35);
        IndexWriter iw = new IndexWriter(dir, sa, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < 10; i++)
        {
            Document d = new Document();
            d.add(new Field("ID", Integer.toString(i), Field.Store.YES, Field.Index.NO));
            d.add(new Field("X", Integer.toString(i), Field.Store.NO, Field.Index.ANALYZED));
            d.add(new Field("Y", Integer.toString(i), Field.Store.NO, Field.Index.ANALYZED));
            d.add(new Field("VALID", Boolean.TRUE.toString(), Field.Store.NO, Field.Index.ANALYZED));

            iw.addDocument(d);
        }
        iw.commit();
        iw.waitForMerges();
        iw.close(true);

        IndexReader ir = IndexReader.open(dir, true);
        IndexSearcher is = new IndexSearcher(ir);
        long t = System.currentTimeMillis();
        TopDocs td = is.search(new TermQuery(new Term("VALID", "true")), 1);
        System.out.println(System.currentTimeMillis() - t);

        Assert.assertEquals(10, td.totalHits);
        is.close();
        ir.close();

    }

    /**
     * after migration. test index existence.
     *
     * @throws IOException
     */
    @Test
    public void afterMigrationTest() throws IOException, ParseException
    {

        Cache cache = new DefaultCacheManager("lucene-infinispan-config.xml").getCache("INDX_MD");

        Directory dir = new InfinispanDirectory(cache, "tempTestIndex");

        IndexReader ir = IndexReader.open(dir, true);

        IndexSearcher is = new IndexSearcher(ir);
        long t = System.currentTimeMillis();
        TopDocs td = is.search(new TermQuery(new Term("VALID", "true")), 1);
        System.out.println("search time (ms): " + (System.currentTimeMillis() - t));
        Assert.assertEquals(10, td.totalHits);

        //        TopDocs td = is.search(new TermQuery(new Term("formType", "RPR")), 1);
        //        System.out.println("total hits : "+td.totalHits);

        StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_35);
        QueryParser parser = new QueryParser(Version.LUCENE_35, "key", sa);
        Query query = parser.parse("formType:RPR +key:2*");
        for (int i = 0; i < 10; i++)
        {
            t = System.currentTimeMillis();
            TopDocs topDocs = is.search(query, 1);
            System.out.println("search " + i + " time: " + (System.currentTimeMillis() - t));
            System.out.println("search " + i + " total hits : " + topDocs.totalHits);
        }
        is.close();
        ir.close();

    }

    /**
     * add doc to an Infinispan directory.
     *
     * @throws IOException
     */
    @Test
    public void addDocToInfinispanTest() throws IOException
    {
        Cache cache = new DefaultCacheManager("lucene-infinispan-config.xml").getCache("INDX_MD");
        Directory dir = new InfinispanDirectory(cache, "tempTestIndex");

        StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_35);
        IndexWriter iw = new IndexWriter(dir, sa, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < 10; i++)
        {
            Document d = new Document();
            d.add(new Field("ID", Integer.toString(i), Field.Store.YES, Field.Index.NO));
            d.add(new Field("X", Integer.toString(i), Field.Store.NO, Field.Index.ANALYZED));
            d.add(new Field("Y", Integer.toString(i), Field.Store.NO, Field.Index.ANALYZED));
            d.add(new Field("VALID", Boolean.TRUE.toString(), Field.Store.NO, Field.Index.ANALYZED));

            iw.addDocument(d);
        }
        iw.commit();
        iw.waitForMerges();
        iw.close(true);

        IndexReader ir = IndexReader.open(dir, true);
        IndexSearcher is = new IndexSearcher(ir);
        TopDocs td = is.search(new TermQuery(new Term("VALID", "true")), 1);
        Assert.assertEquals(10, td.totalHits);
        is.close();
        ir.close();

    }

    /**
     * add doc into a test directory.
     *
     * @throws IOException
     */
    @Test
    public void addDocTest() throws IOException
    {
        Directory dir = FSDirectory.open(fsDirectoryTest1);
        StandardAnalyzer sa = new StandardAnalyzer(Version.LUCENE_35);
        IndexWriter iw = new IndexWriter(dir, sa, IndexWriter.MaxFieldLength.UNLIMITED);
        for (int i = 0; i < 10; i++)
        {
            Document d = new Document();
            d.add(new Field("ID", Integer.toString(i), Field.Store.YES, Field.Index.NO));
            d.add(new Field("X", Integer.toString(i), Field.Store.NO, Field.Index.ANALYZED));
            d.add(new Field("Y", Integer.toString(i), Field.Store.NO, Field.Index.ANALYZED));
            d.add(new Field("VALID", Boolean.TRUE.toString(), Field.Store.NO, Field.Index.ANALYZED));

            iw.addDocument(d);
        }
        iw.commit();
        iw.waitForMerges();
        iw.close(true);

        IndexReader ir = IndexReader.open(dir, true);
        IndexSearcher is = new IndexSearcher(ir);
        TopDocs td = is.search(new TermQuery(new Term("VALID", "true")), 1);
        Assert.assertEquals(10, td.totalHits);
        is.close();
        ir.close();

    }

}

Apache Lucene – Index File Formats

Summary of File Extensions

The following list summarizes the names and extensions of the files in a Lucene index:

  • Segments File (segments.gen, segments_N): stores information about segments
  • Lock File (write.lock): the write lock prevents multiple IndexWriters from writing to the same index
  • Compound File (.cfs): an optional “virtual” file consisting of all the other index files, for systems that frequently run out of file handles
  • Fields (.fnm): stores information about the fields
  • Field Index (.fdx): contains pointers to field data
  • Field Data (.fdt): the stored fields for documents
  • Term Infos (.tis): part of the term dictionary; stores term info
  • Term Info Index (.tii): the index into the Term Infos file
  • Frequencies (.frq): contains the list of docs which contain each term, along with frequency
  • Positions (.prx): stores position information about where a term occurs in the index
  • Norms (.nrm): encodes length and boost factors for docs and fields
  • Term Vector Index (.tvx): stores offsets into the term vector document data file
  • Term Vector Documents (.tvd): contains information about each document that has term vectors
  • Term Vector Fields (.tvf): the field-level info about term vectors
  • Deleted Documents (.del): info about which documents are deleted

More information is available in the official Apache Lucene index file formats documentation.

Lucene 3.6 Getting Started Guide

From HERE

I. Introduction


What is Lucene? Lucene is a sub-project of the Apache Software Foundation’s Jakarta project. It is an open-source full-text search engine toolkit: not a complete full-text search engine, but a full-text search engine architecture, providing a complete query engine and indexing engine plus partial text-analysis engines (for two Western languages, English and German). Lucene’s goal is to give software developers a simple, easy-to-use toolkit for conveniently adding full-text search to a target system, or to serve as the foundation for building a complete full-text search engine.

Lucene is a Java-based full-text search library, not a complete search application: it is a code library and API that makes it easy to add search to an application. In practice, Lucene’s job is to build an index over a set of strings supplied by the developer and then provide a full-text search service: the user hands the search service a keyword, and the service reports which strings contain it.

II. Basic Workflow


As you can see, Lucene consists of two parts: index building and the search service. Building the index means writing sources (essentially strings) into the index, or deleting them from it; searching means offering users a full-text search service in which they locate sources via keywords.

1. Indexing workflow

  • Process the source string with an analyzer: tokenize it into individual words and, optionally, remove stop words.
  • Put the useful information from the source into a Document as various Fields, and add the Document to the index, so that the meaningful Fields are recorded in the index.
  • Write the index to storage (memory or disk).

2. Search workflow

  • The user supplies search keywords, which are processed by the analyzer.
  • The index is searched with the processed keywords to find the matching Documents.
  • The user extracts the Fields they need from the Documents that were found.

III. Basic Concepts


1. Analyzer

The Analyzer’s job is tokenization: splitting a string into words and removing the meaningless ones.

The purpose of tokenization is to divide a string into words according to some semantic rule. This is fairly easy for English, which is already word-based and separated by spaces; Chinese, by contrast, requires some method of splitting a continuous sentence into individual words. Meaningless words, such as “of” and “the” in English or “的” and “地” in Chinese, appear in large numbers in text but carry no key information; removing them shrinks the index files and improves hit rate and efficiency.
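
A small sketch of what an Analyzer does to a string, using StandardAnalyzer from the 3.6 API: stop words such as “the” disappear, and the remaining words come out lowercased:

TokenStream ts = new StandardAnalyzer(Version.LUCENE_36)
        .tokenStream("content", new StringReader("The Quick Brown Fox"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    System.out.println(term.toString()); // prints: quick, brown, fox
}
ts.close();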

2. Document

A source supplied by the user may be a text file, a string, a record in a database table, and so on. Once a source string has been indexed, it is stored in the index files in the form of a Document. Search results are likewise returned as a list of Documents.

Index
  Document 1
    Field A (name/value)
    Field B (name/value)
  Document 2
    Field A (name/value)
    Field B (name/value)

3. Field

A Document can contain multiple information fields. For example, an article can carry fields such as “title”, “body”, and “last modified time”; these are kept in the Document as Fields.

A Field has two attributes: stored and indexed. The stored attribute controls whether the Field is stored; the indexed attribute controls whether it is indexed. This may look redundant, but in practice the right combination of these two attributes is important.

An example: an article needs full-text search over its title and body, so the indexed attribute of both Fields is set to true. We also want to pull the article title straight from the search results, so the title Field’s stored attribute is set to true as well. The body Field, however, is too large; to shrink the index file, its stored attribute is set to false, and the body is read directly from the original file when needed. We want the last modified time available in the search results but never need to search on it, so its stored attribute is set to true and its indexed attribute to false.

The two attributes of a Field may not both be false, since that combination would be meaningless for the index.
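
A minimal sketch of exactly that configuration with the 3.x Field API (the title, body, and modified variables are placeholders):

Document doc = new Document();
doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));  // searchable and returned with hits
doc.add(new Field("body", body, Field.Store.NO, Field.Index.ANALYZED));     // searchable, not stored, keeps the index small
doc.add(new Field("modified", modified, Field.Store.YES, Field.Index.NO));  // returned with hits, never searched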

4. Segment

When an index is built, documents are not immediately added to a single index file; they are first written into separate small files, which are later merged into one large index file. Each of these small files is a segment.

5. Term

A term represents a word from a document and is the smallest unit of search. A term consists of two parts: the word itself and the Field in which it occurs.

6. Token

A token is one occurrence of a term. It contains the term text, the corresponding start and end offsets, and a type string. The same word can appear several times in a sentence; all of the occurrences share the same term but get different tokens, and each token marks the position where that occurrence appears.
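
A quick sketch of the difference: below, “be” is one term but produces two tokens, each with its own offsets (WhitespaceAnalyzer is used so that no words are filtered out):

TokenStream ts = new WhitespaceAnalyzer(Version.LUCENE_36)
        .tokenStream("f", new StringReader("to be or not to be"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
OffsetAttribute offsets = ts.addAttribute(OffsetAttribute.class);
ts.reset();
while (ts.incrementToken()) {
    // "be" prints twice: one term, two tokens, at offsets [3,5) and [16,18)
    System.out.println(term + " [" + offsets.startOffset() + "," + offsets.endOffset() + ")");
}
ts.close();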

IV. Structure of Lucene


Lucene consists of two parts, core and sandbox. Core is the heart of Lucene, while sandbox contains add-ons such as the highlighter and various analyzers. Lucene core contains eight packages: analysis, collation, document, index, queryParser, search, store, and util.

1. The analysis package

analysis provides the various built-in Analyzers, such as WhitespaceAnalyzer, which splits on whitespace; StopAnalyzer, which adds stop-word filtering; SmartChineseAnalyzer, which supports Chinese tokenization; and the most commonly used, StandardAnalyzer.

2. The collation package

Contains CollationKeyFilter and CollationKeyAnalyzer, two classes with the same function: they convert every token into a CollationKey, encode it together with IndexableBinaryStringTools, and store it as a term.

3. The document package

The document package holds the data structures related to Documents, such as the Document and Field classes.

4. The index package

The index package holds the classes for reading and writing indexes. The commonly used ones are IndexWriter, which writes, merges, and optimizes the segments of the index files, and IndexReader, which reads and deletes from the index. IndexWriter cares only about how to write the index into segments and merge and optimize them; IndexReader is concerned with how the individual documents in the index files are organized.

5. The queryParser package

The queryParser package holds the classes for parsing query expressions (the commonly used one is QueryParser), along with the Token class.

6. The search package

The search package holds the various Query classes for searching the index (such as TermQuery and BooleanQuery) and the Hits result-set class.

7. The store package

The store package holds the index storage classes: the Directory class defines the storage structure of the index files, FSDirectory stores the index on the filesystem (i.e., disk), RAMDirectory stores it in memory, and MmapDirectory uses memory mapping.

8. The util package

The util package holds common utility classes, for example converters between times and strings.

V. Environment Setup


Download:

  1. http://lucene.apache.org/core/downloads.html
  2. http://mirror.bjtu.edu.cn/apache/lucene/java/3.6.1/lucene-3.6.1.zip

Add lucene-core-3.6.1.jar to the project.

Quick start (Hello World)

Before starting the hello-world example, let’s use Lucene’s index-creation flow to get acquainted with a few concepts.

  1. Information source: to collect information there must be a source; here we read files (File) from disk as the source.
  2. Processing: the collected information has to be placed into the index store in the form Lucene requires, so we create the corresponding Document objects. We have to decide what to put into the document so that it is complete yet free of junk. For a web page, for example, we would store its title, content, URL, and so on, while the advertisements need not be stored. Here we use Fields to hold the individual pieces of content.
  3. Analysis: should the prepared documents be tokenized? Certainly. With which analyzer? English and Chinese probably need different analyzers; that is a topic for later. Here we use the standard analyzer (StandardAnalyzer).
  4. Index store: the documents must be written into an index store and, via the analyzer, tokenized and indexed. That store is Lucene’s Directory, which can live in memory or on disk.
  5. Everything is ready; all that’s left is writing the documents into the index store. With what? IndexWriter, of course.

That is Lucene’s index-creation flow. Now let’s see how it looks in code.

VI. Code Examples


Lucene provides introductory sample code in lucene-3.6.1-src/contrib/demo/src/java/org/apache/lucene/demo:

  • IndexFiles.java is an example of building an index.
  • SearchFiles.java is an example of searching.

1. Building an index on the filesystem

String indexPath = "/lucene/myindex";
Directory dir = FSDirectory.open(new File(indexPath));
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
iwc.setOpenMode(OpenMode.CREATE); // CREATE builds a new index; OpenMode.CREATE_OR_APPEND creates or appends to an existing one
IndexWriter writer = new IndexWriter(dir, iwc);
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

2. Building an index directly in memory

Directory dir = new RAMDirectory();
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_36, analyzer);
iwc.setOpenMode(OpenMode.CREATE); // CREATE builds a new index; OpenMode.CREATE_OR_APPEND creates or appends to an existing one
IndexWriter writer = new IndexWriter(dir, iwc);
Document doc = new Document();
doc.add(new Field("title", "lucene introduction", Field.Store.YES, Field.Index.ANALYZED));
doc.add(new Field("content", "lucene works well", Field.Store.YES, Field.Index.ANALYZED));
writer.addDocument(doc);
writer.close();

3. Indexing an entire text file, my.txt

...
File file = new File("/home/hanxb/my.txt");
FileInputStream fis = new FileInputStream(file);
Document doc = new Document();
Field pathField = new Field("path", file.getPath(), Field.Store.YES, Field.Index.NOT_ANALYZED_NO_NORMS);
pathField.setIndexOptions(IndexOptions.DOCS_ONLY);
doc.add(pathField);
NumericField modifiedField = new NumericField("modified"); // the index key is "modified"
modifiedField.setLongValue(file.lastModified());           // the file's last modified time
doc.add(modifiedField);
doc.add(new Field("contents", new BufferedReader(new InputStreamReader(fis, "UTF-8"))));
writer.addDocument(doc); // adds a new document to the index
// to create-or-append instead, use: writer.updateDocument(new Term("path", file.getPath()), doc);
fis.close();
writer.close();

4. Searching for the keywords “Cloud Computing”

IndexReader reader = IndexReader.open(FSDirectory.open(new File(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
QueryParser parser = new QueryParser(Version.LUCENE_36, field, analyzer);
Query query = parser.parse("Cloud Computing"); // search for the keywords "Cloud Computing"
TopDocs results = searcher.search(query, 10);  // keep only the top 10 hits
ScoreDoc[] hits = results.scoreDocs;
for (int i = 0; i < hits.length; i++) {
    Document doc = searcher.doc(hits[i].doc);
    String path = doc.get("path");
    String modified = doc.get("modified"); // null unless the field was stored, e.g. new NumericField("modified", Field.Store.YES, true)
    String contents = doc.get("contents"); // null here: "contents" was indexed from a Reader and never stored
}
searcher.close();
reader.close();