淘先锋技术网

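Pinyin analyzer

When a user types characters into the search box, we should suggest search terms related to those characters, as shown in the figure:

To complete suggestions from typed letters, the documents must be tokenized by pinyin. There is a pinyin analysis plugin for elasticsearch on GitHub. Address: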

https://github.com/medcl/elasticsearch-analysis-pinyin

It is installed the same way as the IK analyzer, in three steps:

Unzip the plugin package

Upload it to elasticsearch's plugin directory on the virtual machine, usually "/var/lib/docker/volumes/es-plugins/_data"

Restart elasticsearch

Test

POST /_analyze
{
  "text": ["如家酒店真不错"],
  "analyzer": "pinyin"
}

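Custom analyzer

An analyzer in elasticsearch is made up of three parts:

character filters: process the text before the tokenizer, for example by deleting or replacing characters

tokenizer: splits the text into terms according to some rule, for example keyword, which does not split at all, or ik_smart

token filter: further processes the terms emitted by the tokenizer, for example case conversion, synonym handling, or pinyin conversion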
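To make the three stages concrete, here is a toy pipeline in plain Java. It is only a sketch and not how elasticsearch implements analysis: ToyAnalyzer and its methods are made-up names, and the rules (replace "&", split on whitespace, lowercase) are chosen just to show the order in which the stages run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ToyAnalyzer {
    // Stage 1: character filter, applied to the raw text before tokenizing.
    static String charFilter(String text) {
        return text.replace("&", " and ");
    }

    // Stage 2: tokenizer, splits the filtered text into terms on whitespace.
    static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) terms.add(t);
        }
        return terms;
    }

    // Stage 3: token filter, post-processes each term (here: lowercasing).
    static List<String> tokenFilter(List<String> terms) {
        List<String> out = new ArrayList<>();
        for (String t : terms) out.add(t.toLowerCase(Locale.ROOT));
        return out;
    }

    // The stages always run in this order.
    static List<String> analyze(String text) {
        return tokenFilter(tokenize(charFilter(text)));
    }

    public static void main(String[] args) {
        System.out.println(analyze("Home&Inn Hotel")); // [home, and, inn, hotel]
    }
}
```

In a real custom analyzer the same three slots are filled by configuration: char_filter, tokenizer and filter.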

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { 
        "my_analyzer": { 
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": { 
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  }
}

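The pinyin analyzer is suited to building the inverted index, but it should not be used at search time: if the query text is also converted to pinyin, the search will match homophones as well.

So the field should be analyzed with my_analyzer when the inverted index is built and with ik_smart at search time. If the test index created above still exists, run DELETE /test before recreating it: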

PUT /test
{
  "settings": {
    "analysis": {
      "analyzer": { 
        "my_analyzer": { 
          "tokenizer": "ik_max_word",
          "filter": "py"
        }
      },
      "filter": {
        "py": { 
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name":{
        "type": "text",
        "analyzer": "my_analyzer",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
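The homophone problem can be sketched with a toy inverted index in plain Java. Everything here is an assumption made for illustration: the class HomophoneDemo, the two-entry pinyin table, and the sample words 狮子 (lion) and 虱子 (louse), which share the pinyin shizi. The real plugin uses a full dictionary and emits more terms than this.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class HomophoneDemo {
    // Hypothetical two-entry pinyin table; the real plugin has a full dictionary.
    static final Map<String, String> PINYIN = Map.of("狮子", "shizi", "虱子", "shizi");

    // Index-time analysis: keep the original term and add its joined pinyin.
    static Set<String> indexTerms(String text) {
        Set<String> terms = new LinkedHashSet<>();
        terms.add(text);
        String py = PINYIN.get(text);
        if (py != null) terms.add(py);
        return terms;
    }

    // Toy inverted index: term -> ids of the documents that contain it.
    static Map<String, Set<Integer>> buildIndex(String... docs) {
        Map<String, Set<Integer>> index = new TreeMap<>();
        for (int id = 0; id < docs.length; id++) {
            for (String term : indexTerms(docs[id])) {
                index.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return index;
    }

    // A document is a hit when any query term appears in the index for it.
    static Set<Integer> search(Map<String, Set<Integer>> index, Set<String> queryTerms) {
        Set<Integer> hits = new TreeSet<>();
        for (String term : queryTerms) {
            hits.addAll(index.getOrDefault(term, Set.of()));
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, Set<Integer>> index = buildIndex("狮子", "虱子");
        // Query analyzed with pinyin: both documents match.
        System.out.println(search(index, indexTerms("狮子"))); // [0, 1]
        // Query analyzed without pinyin: only the exact word matches.
        System.out.println(search(index, Set.of("狮子")));     // [0]
        // Typing pinyin still works, because pinyin terms were indexed.
        System.out.println(search(index, Set.of("shizi")));    // [0, 1]
    }
}
```

This is why the mapping sets analyzer and search_analyzer separately: pinyin terms are written once at index time, and the query side stays free of them.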

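Autocomplete query

elasticsearch provides the Completion Suggester query for autocompletion. It matches terms that begin with the user's input and returns them. To keep completion queries efficient, the field being queried has some constraints:

The field taking part in completion must be of type completion.

The field's content is usually an array of the terms to complete against.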

# Index for autocompletion
PUT test1
{
  "mappings": {
    "properties": {
      "title":{
        "type": "completion"
      }
    }
  }
}

# Sample data
POST test1/_doc
{
  "title": ["Sony", "WH-1000XM3"]
}
POST test1/_doc
{
  "title": ["SK-II", "PITERA"]
}
POST test1/_doc
{
  "title": ["Nintendo", "switch"]
}

# Autocomplete query
POST /test1/_search
{
  "suggest": {
    "title_suggest": {
      "text": "s",
      "completion": {
        "field": "title",
        "skip_duplicates": true,
        "size": 10
      }
    }
  }
}

Field requirements for autocompletion:

The type is completion

The field value is an array of terms
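The idea of "match the stored terms that start with the input" can be sketched in plain Java with a sorted map. This is a simplified illustration, not elasticsearch's implementation, and its ordering (alphabetical) is not a claim about the order elasticsearch returns. PrefixSuggester and its methods are made-up names; the sample values are the ones indexed above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

public class PrefixSuggester {
    // Lowercased input -> original input; sorted, so a prefix is a contiguous range.
    private final TreeMap<String, String> inputs = new TreeMap<>();

    // Each document contributes every element of its array as a separate input.
    void add(String... values) {
        for (String v : values) inputs.put(v.toLowerCase(Locale.ROOT), v);
    }

    // Returns up to `size` stored inputs whose lowercased form starts with the prefix.
    List<String> suggest(String prefix, int size) {
        String p = prefix.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, String> e : inputs.tailMap(p, true).entrySet()) {
            if (!e.getKey().startsWith(p) || out.size() >= size) break;
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        PrefixSuggester s = new PrefixSuggester();
        s.add("Sony", "WH-1000XM3");
        s.add("SK-II", "PITERA");
        s.add("Nintendo", "switch");
        System.out.println(s.suggest("s", 10)); // [SK-II, Sony, switch]
    }
}
```

Storing the values as an array matters for the same reason here as in the index: "WH-1000XM3" can only be suggested for "w" if it is its own entry rather than the tail of "Sony WH-1000XM3".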

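Case study: autocomplete and pinyin search for the hotel index

Modify the hotel index structure and configure a custom pinyin analyzer

Change the index's name and all fields to use the custom analyzer

Add a new field suggestion to the index, of type completion, using the custom analyzer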

# Hotel data index
PUT /hotel
{
  "settings": {
    "analysis": {
      "analyzer": {
        "text_anlyzer": {
          "tokenizer": "ik_max_word",
          "filter": "py"
        },
        "completion_analyzer": {
          "tokenizer": "keyword",
          "filter": "py"
        }
      },
      "filter": {
        "py": {
          "type": "pinyin",
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "remove_duplicated_term": true,
          "none_chinese_pinyin_tokenize": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "id":{
        "type": "keyword"
      },
      "name":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart",
        "copy_to": "all"
      },
      "address":{
        "type": "keyword",
        "index": false
      },
      "price":{
        "type": "integer"
      },
      "score":{
        "type": "integer"
      },
      "brand":{
        "type": "keyword",
        "copy_to": "all"
      },
      "city":{
        "type": "keyword"
      },
      "starName":{
        "type": "keyword"
      },
      "business":{
        "type": "keyword",
        "copy_to": "all"
      },
      "location":{
        "type": "geo_point"
      },
      "pic":{
        "type": "keyword",
        "index": false
      },
      "all":{
        "type": "text",
        "analyzer": "text_anlyzer",
        "search_analyzer": "ik_smart"
      },
      "suggestion":{
        "type": "completion",
        "analyzer": "completion_analyzer"
      }
    }
  }
}

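Add a suggestion field to the HotelDoc class, containing brand and business

Re-import the data into the hotel index

name and all are tokenized (ik_max_word), whereas brand and business, which feed autocompletion, must stay whole (keyword tokenizer), so the two need different analyzer combinations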
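A minimal sketch of the HotelDoc change, assuming a cut-down class with only the two fields involved. The real HotelDoc has many more fields and is presumably built from the hotel entity; the constructor signature, the null handling and the sample values below are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class HotelDoc {
    private final String brand;
    private final String business;
    // Maps to the completion-typed "suggestion" field of the hotel index.
    private final List<String> suggestion;

    public HotelDoc(String brand, String business) {
        this.brand = brand;
        this.business = business;
        // Collect the non-empty values as separate completion inputs.
        List<String> inputs = new ArrayList<>();
        if (brand != null && !brand.isEmpty()) inputs.add(brand);
        if (business != null && !business.isEmpty()) inputs.add(business);
        this.suggestion = List.copyOf(inputs);
    }

    public String getBrand() { return brand; }

    public String getBusiness() { return business; }

    public List<String> getSuggestion() { return suggestion; }

    public static void main(String[] args) {
        // Hypothetical sample values.
        HotelDoc doc = new HotelDoc("如家", "外滩");
        System.out.println(doc.getSuggestion());
    }
}
```

Because suggestion is serialized as a JSON array, each element becomes its own completion input once the data is re-imported.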

Test

GET /hotel/_search
{
  "suggest": {
    "suggestions": {
      "text": "sh",
      "completion": {
        "field": "suggestion",
        "skip_duplicates":true,
        "size":10
      }
    }
  }
}

Implementing autocomplete for the hotel search box

Backend controller endpoint

    // Search autocompletion
    @GetMapping("/suggestion")
    public List<String> suggestion(String key){
        return hotelService.suggestion(key);
    }

Service layer code

    @Override
    public List<String> suggestion(String key) {
        // Build a completion suggestion on the "suggestion" field, using key as the prefix
        SearchRequest request = new SearchRequest("hotel");
        request.source().suggest(new SuggestBuilder().addSuggestion(
                "mySuggestion",
                SuggestBuilders.completionSuggestion("suggestion")
                        .prefix(key)
                        .skipDuplicates(true)
                        .size(10)
        ));
        SearchResponse response;
        try {
            response = restHighLevelClient.search(request, RequestOptions.DEFAULT);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        // Look the result up under the same name that was used in the request
        Suggest suggest = response.getSuggest();
        CompletionSuggestion mySuggestion = suggest.getSuggestion("mySuggestion");
        // Collect the text of every returned option
        List<String> list = new ArrayList<>();
        for (CompletionSuggestion.Entry.Option option : mySuggestion.getOptions()) {
            list.add(option.getText().toString());
        }
        return list;
    }
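For completeness, a sketch of how a caller might build the request to this endpoint. The base URL and the class name SuggestionClient are assumptions; the path /suggestion and the parameter name key come from the controller above, but any class-level path prefix on the controller is not shown in the source and would have to be added. The point of the sketch is that the typed text must be URL-encoded, which matters for Chinese input.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class SuggestionClient {
    // Builds (but does not send) a GET request for the suggestion endpoint.
    static HttpRequest buildRequest(String baseUrl, String key) {
        // Encode the user's input so spaces and non-ASCII characters are safe in the query string.
        String encoded = URLEncoder.encode(key, StandardCharsets.UTF_8);
        URI uri = URI.create(baseUrl + "/suggestion?key=" + encoded);
        return HttpRequest.newBuilder(uri).GET().build();
    }

    public static void main(String[] args) {
        // Hypothetical local address of the backend.
        HttpRequest request = buildRequest("http://localhost:8080", "sh");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

Sending it would be a matter of passing the request to java.net.http.HttpClient and parsing the JSON array of strings that the controller returns.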