Lucene 4 codec analysis

fwuwen


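One of the biggest changes in Lucene 4 is its pluggable codec architecture. You can now define the index structure yourself, including terms, postings lists, stored fields, term vectors, deleted documents, segment info, and field infos.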

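About codecs:

Lucene 4 already ships with several codec implementations:

Lucene40: the default codec. Lucene40Codec

Lucene3x: read-only. It can read indexes created with 3.x, but cannot be used to create an index. Lucene3xCodec

SimpleText: stores the index as plain text. Good for learning, not recommended for production. SimpleTextCodec

Appending: for file systems that are written in append-only fashion, such as HDFS. AppendingCodec

......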

About formats:

A codec is really just a set of formats. Each codec contains 8 formats in total:

PostingsFormat, DocValuesFormat, StoredFieldsFormat, TermVectorsFormat, FieldInfosFormat, SegmentInfoFormat, NormsFormat, LiveDocsFormat

For example, StoredFieldsFormat handles stored fields and TermVectorsFormat handles term vectors. In Lucene 4 you can supply your own implementation of each format.
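To make the "a codec is a set of formats" idea concrete, here is a minimal toy sketch in plain Java. It does not use the Lucene API: ToyCodec, its with method, and the plain strings standing in for format implementations are all made up for illustration. It only shows that swapping one of the eight parts leaves the other seven untouched.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model only: ToyCodec is a hypothetical class, not part of Lucene.
class ToyCodec {
    // The eight index parts named in the text, each handled by one format.
    static final String[] PARTS = {
        "PostingsFormat", "DocValuesFormat", "StoredFieldsFormat", "TermVectorsFormat",
        "FieldInfosFormat", "SegmentInfoFormat", "NormsFormat", "LiveDocsFormat"
    };

    // Maps each part to the name of the implementation that handles it.
    private final Map<String, String> formats = new LinkedHashMap<>();

    /** Starts with the same implementation name for every part. */
    ToyCodec(String defaultImplementation) {
        for (String part : PARTS) {
            formats.put(part, defaultImplementation);
        }
    }

    /** Returns a copy of this codec with exactly one part swapped out. */
    ToyCodec with(String part, String implementation) {
        if (!formats.containsKey(part)) {
            throw new IllegalArgumentException("unknown part: " + part);
        }
        ToyCodec copy = new ToyCodec("unused");
        copy.formats.putAll(formats);
        copy.formats.put(part, implementation);
        return copy;
    }

    String implementationOf(String part) {
        return formats.get(part);
    }

    public static void main(String[] args) {
        ToyCodec base = new ToyCodec("Lucene40");
        ToyCodec custom = base.with("PostingsFormat", "Memory");
        for (String part : PARTS) {
            System.out.println(part + " -> " + custom.implementationOf(part));
        }
    }
}
```

In real Lucene 4 code the corresponding step is subclassing a codec and overriding one of its methods, which is what the test code further down does with getPostingsFormatForField.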

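Lucene 4 also ships with several PostingsFormat implementations:

Memory: loads all terms and postings lists into an in-memory FST. MemoryPostingsFormat

Direct: writes with the default Lucene40PostingsFormat, then loads the terms and postings lists into memory at read time. DirectPostingsFormat

Pulsing: by default, inlines the postings of terms whose document frequency is at most 1 into the terms dictionary. PulsingPostingsFormat

BloomFilter: adds a Bloom filter for a chosen field on each segment. This gives a "fast-fail" check for whether a segment contains a given key. The best fit is an index with many documents and many segments, with the Bloom filter placed on the primary key. BloomFilteringPostingsFormat has to wrap another PostingsFormat. There is a Bloom filter benchmark here: https://docs.google.com/spreadsheet/ccc?key=0AsKVSn5SGg_wdFNpNTl3R1cxLTluTTcya2hDRnlfdHc#gid=3

Block: compresses the index and also improves search performance. It may become the default PostingsFormat in a future release. If you want to use it now, note that it is still experimental in this release and backward compatibility of the index format is not guaranteed. Unlike Lucene40, BlockPostingsFormat does not create .frq and .prx files. It creates .doc and .pos files instead.

....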
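The "fast-fail" behaviour in the BloomFilter entry above can be sketched in plain Java. This is a toy model and not Lucene's implementation: BloomFastFail, Bloom, ToySegment, the filter size (1024 bits), the number of probes (3), and the hashing scheme are all assumptions chosen for illustration. The point it shows is that a segment whose filter says "definitely absent" is skipped without touching its term dictionary, while a "possibly present" answer still needs the real lookup.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: none of these classes are part of Lucene.
class BloomFastFail {

    /** A tiny Bloom filter: several bit positions per key, derived from two hash values. */
    static final class Bloom {
        private final BitSet bits;
        private final int size;
        private final int hashes;

        Bloom(int size, int hashes) {
            this.bits = new BitSet(size);
            this.size = size;
            this.hashes = hashes;
        }

        private int position(String key, int i) {
            int h1 = key.hashCode();
            int h2 = (h1 >>> 16) | 1; // force a non-zero step between probes
            return Math.floorMod(h1 + i * h2, size);
        }

        void add(String key) {
            for (int i = 0; i < hashes; i++) {
                bits.set(position(key, i));
            }
        }

        /** false means "definitely absent"; true means "possibly present". */
        boolean mightContain(String key) {
            for (int i = 0; i < hashes; i++) {
                if (!bits.get(position(key, i))) {
                    return false;
                }
            }
            return true;
        }
    }

    /** One segment: a set of primary keys plus a Bloom filter built over them. */
    static final class ToySegment {
        final Set<String> keys = new HashSet<>();
        final Bloom bloom = new Bloom(1024, 3);
        int termLookups = 0; // counts the "expensive" dictionary lookups

        void add(String key) {
            keys.add(key);
            bloom.add(key);
        }

        boolean contains(String key) {
            if (!bloom.mightContain(key)) {
                return false; // fast-fail: the dictionary is never consulted
            }
            termLookups++;
            return keys.contains(key);
        }
    }

    /** Returns the index of the first segment that holds the key, or -1 if none does. */
    static int findSegment(List<ToySegment> segments, String key) {
        for (int i = 0; i < segments.size(); i++) {
            if (segments.get(i).contains(key)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<ToySegment> segments = new ArrayList<>();
        for (int s = 0; s < 8; s++) {
            ToySegment segment = new ToySegment();
            for (int k = 0; k < 50; k++) {
                segment.add("doc-" + s + "-" + k);
            }
            segments.add(segment);
        }
        int hit = findSegment(segments, "doc-5-7");
        int lookups = 0;
        for (ToySegment segment : segments) {
            lookups += segment.termLookups;
        }
        System.out.println("found in segment " + hit + " after " + lookups + " term lookups");
    }
}
```

A Bloom filter never reports a stored key as absent, so the result of findSegment is the same with or without the filter. Only the number of dictionary lookups changes, which is why the benefit grows with the number of segments.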

Test code:

package test;

import java.io.File;  
import java.util.ArrayList;  
import java.util.List;  
import java.util.UUID;  
  
import org.apache.lucene.analysis.Analyzer;  
import org.apache.lucene.analysis.cjk.CJKAnalyzer;  
import org.apache.lucene.codecs.Codec;  
import org.apache.lucene.codecs.PostingsFormat;  
import org.apache.lucene.codecs.appending.AppendingCodec;  
import org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat;  
import org.apache.lucene.codecs.lucene3x.Lucene3xCodec;  
import org.apache.lucene.codecs.lucene40.Lucene40Codec;  
import org.apache.lucene.codecs.lucene40.Lucene40PostingsFormat;  
import org.apache.lucene.codecs.simpletext.SimpleTextCodec;  
import org.apache.lucene.document.Document;  
import org.apache.lucene.document.Field;  
import org.apache.lucene.document.StringField;  
import org.apache.lucene.document.TextField;  
import org.apache.lucene.index.IndexWriter;  
import org.apache.lucene.index.IndexWriterConfig;  
import org.apache.lucene.store.Directory;  
import org.apache.lucene.store.FSDirectory;  
import org.apache.lucene.util.Version;  
  
/**
 * Writes the same three documents once per codec / postings format,
 * using one index directory per name.
 *
 * @author wuwen
 * @date 2013-1-14 04:54:17 PM
 */
public class LuceneCodecTest {  
  
    /** Maps a name to a codec; returns null for an unknown name. */
    static Codec getCodec(String codecname) {
        Codec codec = null;
        if ("Lucene40".equals(codecname)) {
            codec = new Lucene40Codec();
        } else if ("Lucene3x".equals(codecname)) {
            // Read-only codec: writing with it is expected to fail with
            // UnsupportedOperationException("this codec can only be used for reading").
            codec = new Lucene3xCodec();
        } else if ("SimpleText".equals(codecname)) {
            codec = new SimpleTextCodec();
        } else if ("Appending".equals(codecname)) {
            codec = new AppendingCodec();
        } else if ("Pulsing40".equals(codecname)) {
            // The remaining cases keep the default codec and only swap the postings format.
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Pulsing40");
                }
            };
        } else if ("Memory".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Memory");
                }
            };
        } else if ("BloomFilter".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    // The Bloom filter format must wrap another postings format.
                    return new BloomFilteringPostingsFormat(new Lucene40PostingsFormat());
                }
            };
        } else if ("Direct".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Direct");
                }
            };
        } else if ("Block".equals(codecname)) {
            codec = new Lucene40Codec() {
                @Override
                public PostingsFormat getPostingsFormatForField(String field) {
                    return PostingsFormat.forName("Block");
                }
            };
        }
        return codec;
    }
      
    public static void main(String[] args) {
        String[] codecs = {"Lucene40", "Lucene3x", "SimpleText", "Appending", "Pulsing40", "Memory", "BloomFilter", "Direct", "Block"};
        String basePath = "E:\\lucene\\codec\\"; // parent directory; each codec gets its own subdirectory
        for (String codecname : codecs) {
            String indexPath = basePath + codecname;
            Codec codec = getCodec(codecname);
            Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_40);
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_40, analyzer);
            config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
            config.setCodec(codec);     // set the codec
            IndexWriter writer = null;
            try {  
                Directory luceneDir = FSDirectory.open(new File(indexPath));  
                writer = new IndexWriter(luceneDir, config);  
                List<Document> list = new ArrayList<Document>();  
                
                Document doc = new Document();  
                doc.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));  
                doc.add(new TextField("Content", "北京时间1月14日04:00(西班牙当地时间13日21:00)，2012/13赛季西班牙足球甲级联赛第19轮一场焦点战在纳瓦拉国王球场展开争夺.", Field.Store.YES));
                list.add(doc);  
                
                Document doc1 = new Document();  
                doc1.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));  
                doc1.add(new TextField("Content", "巴萨超皇马18分毁了西甲？媒体惊呼 克鲁伊夫看不下去.", Field.Store.YES));
                list.add(doc1);  
                
                Document doc2 = new Document();  
                doc2.add(new StringField("GUID", UUID.randomUUID().toString(), Field.Store.YES));  
                doc2.add(new TextField("Content", "what changes in lucene4.", Field.Store.YES));  
                list.add(doc2);  
  
                writer.addDocuments(list);  
            } catch (Exception e) {  
                e.printStackTrace();  
            } finally {  
                if (writer != null) {  
                    try {  
                        writer.close();  
                    } catch (Exception e) {  
                        e.printStackTrace();  
                    }  
                }  
            }  
              
        }  
    }  
}
