
Book: 自己動手寫分布式搜索引擎 (Write Your Own Distributed Search Engine)
Author: 羅剛 (Luo Gang)
Updated: 2020-11-28 15:52:47

3.2.10 Customizing the Index Storage Structure

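Building an efficient search engine is challenging work, and Lucene's low-level code is often hard to read. Instead of modifying that code, you can use a codec to customize both the encoding and the structure of the index.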

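Compared with earlier versions, one major change in Lucene 4 is its pluggable codec architecture. It lets you define the index structure yourself, including the terms, postings lists, stored fields, term vectors, deleted documents, segment info, and field infos. Figure 3-7 shows where the codec sits in Lucene.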

Figure 3-7 Structure of the codec in Lucene

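The codec is passed directly to SegmentReader, which uses it to decode the index format. The codec supplies SegmentReader with the implementations of the enumeration classes, and it supplies IndexWriter with the writers for the index files.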

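Lucene 4 already ships with several codec implementations. The default is Lucene40Codec, registered under the name Lucene40. For compatibility with earlier versions there is also a read-only Lucene3xCodec: it can read indexes created by Lucene 3.x, but it cannot be used to create Lucene 3.x indexes.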

PerFieldCodec lets different fields use different read and write formats.

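lucene-codecs-6.3.0.jar contains some additional codecs. Unlike the other codecs, which write compressed binary files, SimpleTextCodec writes all postings lists to human-readable text files. SimpleTextCodec is well suited to learning and is not recommended for production use.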

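        import java.nio.file.Paths;
        import org.apache.lucene.analysis.Analyzer;
        import org.apache.lucene.analysis.standard.StandardAnalyzer;
        import org.apache.lucene.codecs.simpletext.SimpleTextCodec;
        import org.apache.lucene.index.IndexWriter;
        import org.apache.lucene.index.IndexWriterConfig;
        import org.apache.lucene.store.Directory;
        import org.apache.lucene.store.FSDirectory;

        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);

        // Write every index file as readable text, one file per type (no compound file)
        iwc.setCodec(new SimpleTextCodec());
        iwc.setUseCompoundFile(false);

        // Since Lucene 5, FSDirectory.open() takes a java.nio.file.Path instead of a File
        Directory directory = FSDirectory.open(Paths.get("F:/lucene/index"));
        IndexWriter writer = new IndexWriter(directory, iwc);

        // Index a few documents; createDocument(id, contents) is a helper not shown here
        writer.addDocument(createDocument("1", "青菜雞肉"));
        writer.addDocument(createDocument("2", "老鴨粉絲湯"));
        writer.addDocument(createDocument("3", "辣子雞丁"));
        writer.close();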

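Five main files are produced: _0.len, _0.fld, _0.inf, _0.pst, and _0.si. The .pst file holds the inverted index, the .fld file holds the original values of the stored fields, and the .inf file records how each field is indexed.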

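The inverted index is in _0.pst. The postings of one field are written first, followed by the postings of the next field. For example, the postings of the contents field and the id field are written like this: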

        field contents
          term 丁
            doc 2
              freq 1
              pos 3
          term 絲
            doc 1
              freq 1
              pos 3
          term 子
            doc 2
              freq 1
              pos 1
          term 湯
            doc 1
              freq 1
              pos 4
          term 粉
            doc 1
              freq 1
              pos 2
          term 老
            doc 1
              freq 1
              pos 0
          term 肉
            doc 0
              freq 1
              pos 3
          term 菜
            doc 0
              freq 1
              pos 1
          term 辣
            doc 2
              freq 1
              pos 0
          term 青
            doc 0
              freq 1
              pos 0
          term 雞
            doc 0
              freq 1
              pos 2
            doc 2
              freq 1
              pos 2
          term 鴨
            doc 1
              freq 1
              pos 1
        field id
          term 1
            doc 0
          term 2
            doc 1
          term 3
            doc 2
        END
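
To make the layout of the listing easier to follow, here is a minimal plain-Java sketch of a positional inverted index. It is not Lucene code: the class ToyPostings and its methods are hypothetical names introduced only for illustration, and it assumes every character becomes one term, as in the listing above. Terms are sorted in Java string order, so their order may differ from the listing.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical toy index: term -> (doc id -> positions), printed as indented text. */
class ToyPostings {
    // TreeMap keeps terms and doc ids sorted, like a postings file
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    /** Indexes one document, treating every character as a single term (an assumption). */
    void add(int docId, String text) {
        for (int pos = 0; pos < text.length(); pos++) {
            String term = String.valueOf(text.charAt(pos));
            postings.computeIfAbsent(term, t -> new TreeMap<>())
                    .computeIfAbsent(docId, d -> new ArrayList<>())
                    .add(pos);
        }
    }

    /** Returns the ids of the documents that contain the term, in ascending order. */
    List<Integer> docs(String term) {
        Map<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null) {
            return new ArrayList<>();
        }
        return new ArrayList<>(byDoc.keySet());
    }

    /** Returns the positions of the term inside one document. */
    List<Integer> positions(String term, int docId) {
        Map<Integer, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null || !byDoc.containsKey(docId)) {
            return new ArrayList<>();
        }
        return byDoc.get(docId);
    }

    /** Renders the postings of one field as field / term / doc / freq / pos lines. */
    String dump(String field) {
        StringBuilder sb = new StringBuilder("field " + field + "\n");
        postings.forEach((term, byDoc) -> {
            sb.append("  term ").append(term).append('\n');
            byDoc.forEach((doc, positions) -> {
                sb.append("    doc ").append(doc).append('\n');
                sb.append("      freq ").append(positions.size()).append('\n');
                for (int p : positions) {
                    sb.append("      pos ").append(p).append('\n');
                }
            });
        });
        return sb.append("END\n").toString();
    }

    public static void main(String[] args) {
        ToyPostings index = new ToyPostings();
        index.add(0, "青菜雞肉");
        index.add(1, "老鴨粉絲湯");
        index.add(2, "辣子雞丁");
        System.out.print(index.dump("contents"));
    }
}
```

Looking up 雞 in this sketch returns documents 0 and 2, each with position 2, which matches the term 雞 entry in the listing.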

The contents of _0.fld are as follows:

    doc 0
      numfields 2
      field 0
        name id
        type string
        value 1
      field 1
        name contents
        type string
        value 青菜雞肉
    doc 1
      numfields 2
      field 0
        name id
        type string
        value 2
      field 1
        name contents
        type string
        value 老鴨粉絲湯
    doc 2
      numfields 2
      field 0
        name id
        type string
        value 3
      field 1
        name contents
        type string
        value 辣子雞丁
    END

The contents of _0.inf are as follows:

    number of fields 2
      name id
      number 0
      indexed true
      index options DOCS_ONLY
      term vectors false
      payloads false
      norms false
      norms type false
      doc values false
      attributes 0
      name contents
      number 1
      indexed true
      index options DOCS_AND_FREQS_AND_POSITIONS
      term vectors false
      payloads false
      norms true
      norms type NUMERIC
      doc values false
      attributes 0

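A codec is in fact a collection of formats. One codec contains eight formats: PostingsFormat, DocValuesFormat, StoredFieldsFormat, TermVectorsFormat, FieldInfosFormat, SegmentInfoFormat, NormsFormat, and LiveDocsFormat. For example, StoredFieldsFormat handles fields whose values are stored, and TermVectorsFormat handles term vectors. In Lucene 4 you can supply your own implementation of each format.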

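Indexes written with other codecs can be converted into this plain-text form. The SimpleTextCodec file format has no lookup structure of its own, so it cannot find a given term quickly, but it is useful for debugging and learning.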

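IndexWriterConfig has a setCodec() method for choosing the codec, and you then create an IndexWriter from that IndexWriterConfig. The IndexReader class has no such method. The codec is specified when the index is written, and the name of the codec used is recorded in every segment of the index.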

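When reading the index (that is, when opening an IndexReader), you can no longer change the codec. All you can do is make sure that every codec used by the index is on the CLASSPATH. IndexReader inspects each segment to determine which codec wrote it, finds that codec on the CLASSPATH, and uses it to open the segment.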
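
The name-based lookup described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Lucene's implementation: ToyCodec, ToyCodecRegistry, and ToySegmentOpener are hypothetical names, and the registry map merely stands in for whatever codecs are discoverable on the CLASSPATH.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for a codec; only its registered name matters here. */
interface ToyCodec {
    String name();
}

/** Hypothetical registry that plays the role of "codecs found on the CLASSPATH". */
class ToyCodecRegistry {
    private final Map<String, ToyCodec> byName = new LinkedHashMap<>();

    void register(ToyCodec codec) {
        byName.put(codec.name(), codec);
    }

    Set<String> available() {
        return byName.keySet();
    }

    /** Resolves a codec by name and fails loudly when it is not registered. */
    ToyCodec forName(String name) {
        ToyCodec codec = byName.get(name);
        if (codec == null) {
            throw new IllegalArgumentException(
                "No codec named '" + name + "' is registered; available: " + byName.keySet());
        }
        return codec;
    }
}

class ToySegmentOpener {
    /** Maps each segment to the codec whose name was recorded when the segment was written. */
    static Map<String, ToyCodec> open(Map<String, String> segmentToCodecName,
                                      ToyCodecRegistry registry) {
        Map<String, ToyCodec> resolved = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : segmentToCodecName.entrySet()) {
            resolved.put(e.getKey(), registry.forName(e.getValue()));
        }
        return resolved;
    }

    public static void main(String[] args) {
        ToyCodecRegistry registry = new ToyCodecRegistry();
        registry.register(() -> "Lucene42");
        registry.register(() -> "SimpleText");

        // Segment name -> codec name, as recorded at write time
        Map<String, String> segments = new LinkedHashMap<>();
        segments.put("_0", "SimpleText");
        segments.put("_1", "Lucene42");

        System.out.println(registry.available());
        open(segments, registry)
            .forEach((segment, codec) -> System.out.println(segment + " -> " + codec.name()));
    }
}
```

The point of the sketch is that the reader never chooses a codec: each segment carries a name, and opening fails if no codec with that name can be found.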

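        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);

        // availableCodecs() and forName() are static methods of org.apache.lucene.codecs.Codec
        System.out.println(Codec.availableCodecs());
        // "Lucene42" resolves only if a codec registered under that name is on the CLASSPATH
        String name = "Lucene42";
        iwc.setCodec(Codec.forName(name));
        IndexWriter writer = new IndexWriter(directory, iwc);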

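The index can also store a byte array associated with each term occurrence, called a payload. Payloads are generated during text analysis. You do this with PayloadAttribute: add that attribute to the token's attributes during analysis. PayloadHelper encodes numbers as payload bytes, which can then be set on the PayloadAttribute. For example, to encode a floating-point number: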

        Payload p = new Payload(PayloadHelper.encodeFloat(42));

Note that the PayloadHelper class is not in the core package but in contrib/common/lucene-analyzers-3.x. Its decodeInt() method reads an integer from a byte array.

        public static final int decodeInt(byte [] bytes, int offset){
            return ((bytes[offset] & 0xFF) << 24) | ((bytes[offset + 1] & 0xFF) << 16)
                | ((bytes[offset + 2] & 0xFF) <<  8) |  (bytes[offset + 3] & 0xFF);
        }

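The expression bytes[offset] & 0xFF converts the signed byte into an int in the range 0 to 255 before it takes part in the shift that follows. Without the mask, a byte whose high bit is set would be sign-extended to a negative int and corrupt the result.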
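
The effect of the mask is easiest to see in a round trip. The sketch below is plain Java and does not depend on Lucene. PayloadBytes is a hypothetical class name; its decodeInt() copies the method shown above, while encodeInt(), encodeFloat(), decodeFloat(), and decodeIntWithoutMask() are illustrative helpers that assume the same big-endian layout.

```java
/** Hypothetical helper showing big-endian int/float encoding and why the 0xFF mask matters. */
class PayloadBytes {
    /** Encodes an int as 4 bytes, most significant byte first. */
    static byte[] encodeInt(int value) {
        return new byte[] {
            (byte) (value >> 24), (byte) (value >> 16), (byte) (value >> 8), (byte) value
        };
    }

    /** Same logic as the decodeInt() shown above: mask each byte, then shift. */
    static int decodeInt(byte[] bytes, int offset) {
        return ((bytes[offset] & 0xFF) << 24) | ((bytes[offset + 1] & 0xFF) << 16)
            | ((bytes[offset + 2] & 0xFF) << 8) | (bytes[offset + 3] & 0xFF);
    }

    /** Deliberately wrong variant without the mask, to show the sign-extension bug. */
    static int decodeIntWithoutMask(byte[] bytes, int offset) {
        return (bytes[offset] << 24) | (bytes[offset + 1] << 16)
            | (bytes[offset + 2] << 8) | bytes[offset + 3];
    }

    /** Encodes a float through its IEEE 754 bit pattern. */
    static byte[] encodeFloat(float value) {
        return encodeInt(Float.floatToIntBits(value));
    }

    static float decodeFloat(byte[] bytes, int offset) {
        return Float.intBitsToFloat(decodeInt(bytes, offset));
    }

    public static void main(String[] args) {
        byte[] encoded = encodeInt(200);          // last byte is 0xC8, i.e. -56 as a signed byte
        System.out.println(decodeInt(encoded, 0));            // prints 200
        System.out.println(decodeIntWithoutMask(encoded, 0)); // prints -56
        System.out.println(decodeFloat(encodeFloat(42f), 0)); // prints 42.0
    }
}
```

For the value 200 the lowest byte is 0xC8, which Java treats as -56. With the mask it contributes 200, and without it the sign-extended -56 overwrites the higher bits.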
