
  • Lucene 4 Cookbook
  • Edwood Ng Vineeth Mohan
  • 300 words
  • 2021-07-16 14:07:51

Defining custom tokenizers

Although Lucene ships with several excellent built-in tokenizers, you may still need one that behaves slightly differently, in which case you will have to build a custom Tokenizer. Lucene provides a character-based tokenizer called CharTokenizer that is suitable for most of these cases. You override its isTokenChar method to decide which characters are part of a token and which are delimiters. Note that both LetterTokenizer and WhitespaceTokenizer extend CharTokenizer.

How to do it…

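In this example, we will create our own tokenizer that splits text on space characters only. It is similar to WhitespaceTokenizer, but it tests each character with Character.isSpaceChar rather than Character.isWhitespace, so tabs and line breaks are not treated as delimiters. Here is the sample code: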

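// Package locations assume Lucene 4.10.x
import java.io.Reader;
import org.apache.lucene.analysis.util.CharTokenizer;
import org.apache.lucene.util.AttributeFactory;

public class MyTokenizer extends CharTokenizer {

    public MyTokenizer(Reader input) {
        super(input);
    }

    public MyTokenizer(AttributeFactory factory, Reader input) {
        super(factory, input);
    }

    // Return true for characters that belong to a token and
    // false for characters that act as delimiters.
    @Override
    protected boolean isTokenChar(int c) {
        return !Character.isSpaceChar(c);
    }
}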

How it works…

In this example, we extend the abstract class CharTokenizer. As described earlier, this is a character-based tokenizer. To use it, you override the isTokenChar method, which is called for each character read from the input Reader and decides whether that character belongs to a token or acts as a delimiter. CharTokenizer handles the complexity of extracting tokens from the Reader, so you can focus on the logic of how text should be split. We want a tokenizer that splits text on spaces only, so we use the isSpaceChar method of the Character class to test each character. If the character is a space, isTokenChar returns false and the character is treated as a delimiter that ends the current token. Otherwise, isTokenChar returns true and the character becomes part of the current token.
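To see how the choice of predicate changes the result without setting up Lucene, the following is a minimal plain-Java sketch of the same idea. The class CharSplitSketch and its tokenize helper are hypothetical names introduced for illustration and are not part of Lucene. The sketch only mimics the way an isTokenChar-style test groups characters into tokens, and it ignores what the real CharTokenizer also does, such as tracking offsets and limiting token length.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

public class CharSplitSketch {

    // Hypothetical helper, not a Lucene API: groups consecutive code points
    // accepted by isTokenChar into tokens and treats all others as delimiters.
    public static List<String> tokenize(String text, IntPredicate isTokenChar) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int c = text.codePointAt(i);
            i += Character.charCount(c);
            if (isTokenChar.test(c)) {
                // Token character: extend the current token.
                current.appendCodePoint(c);
            } else if (current.length() > 0) {
                // Delimiter: close the current token, if there is one.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "custom tokenizers\tin  Lucene";
        // Same test as MyTokenizer: a tab is not a Unicode space separator.
        System.out.println(tokenize(text, c -> !Character.isSpaceChar(c)));
        // Test based on Character.isWhitespace: the tab is a delimiter too.
        System.out.println(tokenize(text, c -> !Character.isWhitespace(c)));
    }
}
```

With the isSpaceChar test, the sample text yields three tokens, because the tab stays inside the second one. With the isWhitespace test it yields four. If tabs and line breaks should also separate tokens, the second predicate is the one to use in isTokenChar.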
