
  • Lucene 4 Cookbook
  • Edwood Ng Vineeth Mohan
  • 2021-07-16 14:07:51

Defining custom TokenFilters

Sometimes the search behavior we need is so specific that no built-in filter provides it, and we have to write a custom TokenFilter. To create a custom filter, we extend the TokenFilter class and override the incrementToken() method.

We will create a simple word-expanding TokenFilter that expands courtesy titles from the short form to the full word. For example, Dr expands to doctor.

How to do it…

Here is the sample code:

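import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Declared final: Lucene expects TokenStream subclasses (or their incrementToken()) to be final.
public final class CourtesyTitleFilter extends TokenFilter {
    // Short form -> full word; keys are case-sensitive.
    private final Map<String, String> courtesyTitleMap = new HashMap<String, String>();
    private final CharTermAttribute termAttr;

    public CourtesyTitleFilter(TokenStream input) {
        super(input);
        termAttr = addAttribute(CharTermAttribute.class);
        courtesyTitleMap.put("Dr", "doctor");
        courtesyTitleMap.put("Mr", "mister");
        courtesyTitleMap.put("Mrs", "missus");
    }

    @Override
    public boolean incrementToken() throws IOException {
        // No more tokens upstream: signal the end of the stream.
        if (!input.incrementToken()) {
            return false;
        }
        String term = termAttr.toString();
        String expanded = courtesyTitleMap.get(term);
        if (expanded != null) {
            // Replace the term text in place with the expanded word.
            termAttr.setEmpty().append(expanded);
        }
        return true;
    }
}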

How it works…

We create the CourtesyTitleFilter class by extending TokenFilter. In its constructor, we register a CharTermAttribute, which lets us read and modify the text of the current token, and populate courtesyTitleMap with the short-form-to-word mappings for our conversion. In the overridden incrementToken() method, we first advance the wrapped TokenStream (input). If it has no more tokens, the method returns false. Otherwise, we look up the current token in courtesyTitleMap. If a mapping is found, we replace the token text through the CharTermAttribute: setEmpty() clears the term, and append() writes the new value from courtesyTitleMap. The method then returns true to signal that a token is available.

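When you run this code as part of an analysis chain that splits text on whitespace and applies a lowercase filter at the end, the string Dr Watson becomes [doctor] [watson] in the output. The lowercase filter has to come after CourtesyTitleFilter because the keys in courtesyTitleMap (Dr, Mr, Mrs) are case-sensitive.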
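To make the ordering of the chain concrete, here is a minimal plain-Java sketch that only simulates the three stages (whitespace split, title expansion, lowercasing) without using any Lucene classes. The class CourtesyTitleChainSketch and its methods analyze and analyzeLowercaseFirst are hypothetical names introduced for illustration, and the map holds just two of the recipe's mappings. It is meant to show why the position of the lowercase step matters, not how Lucene wires an Analyzer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for the analysis chain; no Lucene classes are used.
public class CourtesyTitleChainSketch {
    // Two of the recipe's mappings; keys are case-sensitive, as in the filter.
    private static final Map<String, String> TITLES = new HashMap<>();
    static {
        TITLES.put("Dr", "doctor");
        TITLES.put("Mr", "mister");
    }

    // Order described in the text: split on whitespace, expand titles, lowercase last.
    public static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // blank input yields no tokens
            }
            String expanded = TITLES.getOrDefault(raw, raw);
            tokens.add(expanded.toLowerCase(Locale.ROOT));
        }
        return tokens;
    }

    // Wrong order for this map: lowercasing first means "dr" no longer matches the key "Dr".
    public static List<String> analyzeLowercaseFirst(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue;
            }
            String lowered = raw.toLowerCase(Locale.ROOT);
            tokens.add(TITLES.getOrDefault(lowered, lowered));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("Dr Watson"));               // [doctor, watson]
        System.out.println(analyzeLowercaseFirst("Dr Watson")); // [dr, watson]
    }
}
```

If the lowercase step had to run first in a real chain, one option would be to store lowercase keys (dr, mr) in the map instead, at the cost of also expanding ordinary lowercase words that happen to match.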
