An Elasticsearch Chinese word-segmentation (analysis) plugin.

QQ communication group: 743457803

See here for how to develop an ES word-segmentation plugin. The design mainly draws on IK and HanLP.
- Supports complex Chinese characters: some Chinese characters have a `String` length greater than 1 in Java (surrogate pairs), which IK and other segmenters do not support.
- Supports single-character segmentation and search, but does not provide an ik_max_word-style mode.
- Supports custom-length word recognition, suitable for identifying person names in short texts: a run of Chinese characters delimited by spaces, punctuation, letters, digits, etc., whose length is <= autoWordLength is automatically recognized as one word.
- Supports emoji search.
Compared with IK, it is smarter and more accurate. ik_max_word exhaustively enumerates all possible words, so irrelevant searches can match: 任性冲动过 is segmented into 任性 / 性冲动 / 动过, so a search for 性冲动 matches that document; 南京市长江大桥 is segmented into 南京市 / 市长 / 长江大桥, so a search for 市长 matches that document. The hao segmenter avoids this: it computes the shortest path based on word frequency and keeps the segmentation with the highest probability, and you can adjust word frequencies to fit your own scenario. Note also that ik_smart's output is not a subset of ik_max_word's output, whereas hao_search_mode's output is a subset of hao_index_mode's output.
Compared with HanLP, it is lighter-weight and its segmentation is more controllable. It has none of HanLP's intelligent prediction features such as person-name recognition, which can make segmentation unstable and inaccurate: machine-learning models predict long and short texts differently, so the resulting segmentations differ too. HanLP also has no official ES plugin.
It computes the shortest path based on word frequency, enumerating plausible words rather than all possible words. If an enumerated word comes out wrong, you can adjust its frequency to correct it. The word-frequency file is a plain txt file, which keeps it readable and editable.
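To illustrate how a frequency-driven shortest path picks the most probable segmentation, here is a minimal self-contained sketch. The dictionary, frequencies, and dynamic program below are illustrative toys of my own, not the plugin's actual code or data:

```python
import math

# Hypothetical toy frequency dictionary; the real plugin loads base_dictionary.txt.
FREQ = {
    "南京": 800, "南京市": 900, "市长": 700,
    "长江": 800, "大桥": 500, "长江大桥": 900,
    "市": 100, "南": 50, "京": 50, "长": 50, "江": 50, "大": 50, "桥": 50,
}
TOTAL = sum(FREQ.values())

def segment(text):
    """Shortest-path (max-probability) segmentation via dynamic programming:
    a word's cost is -log(freq/TOTAL); unseen single chars get a high cost."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i] = min cost to segment text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i] = start of the last word in the best path
    for i in range(1, n + 1):
        for j in range(max(0, i - 8), i):   # cap candidate word length at 8 chars
            word = text[j:i]
            freq = FREQ.get(word)
            if freq is None and i - j > 1:  # unknown multi-char span: not a word
                continue
            cost = -math.log((freq or 1) / TOTAL)
            if best[j] + cost < best[i]:
                best[i] = best[j] + cost
                back[i] = j
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("南京市长江大桥"))  # → ['南京市', '长江大桥']
```

Because each word on the path adds a positive cost, the two high-frequency words 南京市 and 长江大桥 beat any path through 市长, which is how the ambiguity above is resolved.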
Supports meta-words: for example, 俄罗斯 is no longer split into 俄 and 罗斯 (罗斯 is a common personal name), so 罗斯 will not recall documents related to 俄罗斯.

Part-of-speech tagging is not supported.
Provided analyzers: hao_search_mode, hao_index_mode. Provided tokenizers: hao_search_mode, hao_index_mode.
| Git tag | ES version |
|---|---|
| master | latest stable ES version |
| v7.17.1 | 7.17.1 |
| vX.Y.Z | X.Y.Z |
Method 1: `bin/elasticsearch-plugin install file:///Users/xiaoming/Download/analysis-hao.zip`

Method 2: unzip the package and place it in the ES plugins directory, making sure the structure is `{ES_HOME}/plugins/analysis-hao/` (containing the various jar and other files). The directory must not contain any zip file.

Finally, restart ES.
If the ES version you need is not provided, modify a few places: change `elasticsearch.version` in `pom.xml` to the corresponding version, and adjust `HaoTokenizerFactory.java` accordingly. Then run `mvn clean package -Dmaven.test.skip=true` to produce the plugin's zip installation package.

The following configuration items are available for custom tokenizers:
| Configuration item | Function | Default |
|---|---|---|
| enableIndexMode | Whether to use index mode, which is fine-grained. Fine granularity suits Term queries; coarse granularity suits Phrase queries. | `false` for hao_search_mode, `true` for hao_index_mode |
| enableFallBack | Whether to fall back to the most fine-grained segmentation (single characters) if segmentation throws an error. Recommended for search_mode so user searches are not affected; leave it off for index_mode so error alerts are reported promptly. | `false` (no fallback) |
| enableFailDingMsg | Whether to send DingTalk notifications on failure, to the `dingWebHookUrl` configured in `HttpAnalyzer.cfg.xml`. | `false` |
| enableSingleWord | Whether to emit single-character terms. For example, with 体力值, when disabled the result stores only 体力值 and 体力, not 值. | `false` |
| autoWordLength | A run of Chinese characters delimited by spaces, punctuation, letters, digits, etc., whose length is <= autoWordLength is automatically recognized as one word. `-1` (the default) disables the feature; a value >= 2 enables it. | `-1` |
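As a sketch, these options are passed as tokenizer settings when declaring a custom tokenizer in the index settings (the name `my_token` is a placeholder; a full index-creation example appears further below):

```json
{
  "tokenizer": {
    "my_token": {
      "type": "hao_search_mode",
      "enableFallBack": "true",
      "enableFailDingMsg": "false",
      "autoWordLength": 3
    }
  }
}
```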
**hao_index_mode**

Words are segmented recursively according to the terms and weights of the vocabulary until they cannot be split further. If `enableSingleWord=true` is set, they are split all the way down to single characters. For example, the text 南京市长江大桥:

- 南京市长江大桥 ==> 南京市, 长江大桥
- 南京市 ==> 南京, 市
- 长江大桥 ==> 长江, 大桥

With `enableSingleWord=false`, the recursion stops at dictionary words, yielding 南京市, 南京, 市, 长江大桥, 长江, 大桥.

With `enableSingleWord=true`, the recursion continues down to single characters, yielding 南京市, 南京, 南, 京, 市, 长江大桥, 长江, 长, 江, 大桥, 大, 桥.
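The recursion described above can be sketched as follows. This is an illustrative toy, not the plugin's code: it uses greedy longest-match in place of the plugin's frequency-based shortest path, and `DICT` is a made-up vocabulary:

```python
# Toy illustration of hao_index_mode's recursive splitting (not the plugin's code).
DICT = {"南京市", "南京", "市", "长江大桥", "长江", "大桥"}

def split_once(word):
    """Split `word` into smaller parts by greedy longest match against DICT.
    Returns [] if no split into strictly smaller parts exists."""
    if len(word) < 2:
        return []
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if j - i < len(word) and (piece in DICT or j - i == 1):
                parts.append(piece)
                i = j
                break
    return parts if len(parts) > 1 else []

def index_mode(word, enable_single_word=False, out=None):
    """Recursively emit `word` and its sub-words, mimicking hao_index_mode."""
    out = [] if out is None else out
    out.append(word)
    for part in split_once(word):
        if len(part) > 1:
            index_mode(part, enable_single_word, out)
        elif part in DICT or enable_single_word:
            # single characters are kept only if they are dictionary words
            # (e.g. 市) or enableSingleWord is on
            out.append(part)
    return out

print(index_mode("南京市"))                           # → ['南京市', '南京', '市']
print(index_mode("南京市", enable_single_word=True))  # → ['南京市', '南京', '南', '京', '市']
```

Note how the dictionary word 市 is emitted even without `enable_single_word`, matching the example above, while 南 and 京 appear only when the flag is on.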
**hao_search_mode**

In this mode the text is segmented as if hao_index_mode had recursed only once, giving 南京市, 长江大桥. This is because `enableIndexMode=false` in this mode; changing it to `true` makes it behave exactly like hao_index_mode.
| Parameter | Function | Remarks |
|---|---|---|
| baseDictionary | Base dictionary file name | Place it in the plugin config directory or the ES config directory; do not change it. |
| customerDictionaryFile | User-defined remote dictionary files; separate multiple files with an English semicolon `;` | They are stored in the plugin config directory or the ES config directory. |
| remoteFreqDict | Remote user-defined dictionary file | Convenient for hot updates; hot updates are scheduled via the two parameters below. |
| syncDicTime | Next synchronization time of the remote dictionary, `hh:mm:ss` | If left blank, syncDicPeriodTime alone determines the next sync time. |
| syncDicPeriodTime | Synchronization interval of the remote dictionary in seconds; minimum 30 | For example, `syncDicTime=20:00:00,syncDicPeriodTime=86400` syncs at 20:00 every day. |
| dingWebHookUrl | DingTalk bot webhook URL | Used for notifications of segmentation exceptions and dictionary-sync failures/successes. |
| dingMsgContent | Bot notification text | When configuring the DingTalk bot, the keyword must match this text, otherwise the message will fail to send. |
Dictionary files in the `{ES_HOME}/config/analysis-hao/` directory are read first; if they exist, the files in the `{ES_HOME}/plugins/analysis-hao/config` directory are not read.
`base_dictionary.txt` is comma-separated; the number following each word is its frequency. For example, if 奋发图强 is segmented as 奋, 发图, 强 because the frequency of 发图 is too high (it occurs very often), you can lower that frequency by manually editing the `base_dictionary.txt` file.

`customerDictionaryFile` is overwritten automatically. The remote dictionary file format is `{词},{词频},{是否元词}` (word, frequency, meta-word flag), for example `俄罗斯,1000,1`. The meta-word flag works as follows: `1` means the word is a meta-word and will not be split further, so 俄罗斯 is not split into 俄 and 罗斯 (罗斯 is a common personal name) and 罗斯 will not recall documents related to 俄罗斯; `0` means the word can be split further, e.g. 奋发图强.
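A remote dictionary file following the `{词},{词频},{是否元词}` format above might look like this (the words and frequencies are illustrative, not from the plugin's shipped dictionaries):

```text
俄罗斯,1000,1
奋发图强,500,0
体力值,300,0
```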
Build the index:

```
PUT test/
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "search_analyzer": {
            "filter": [
              "lowercase"
            ],
            "char_filter": [
              "html_strip"
            ],
            "type": "custom",
            "tokenizer": "my_search_token"
          },
          "index_analyzer": {
            "filter": [
              "lowercase"
            ],
            "char_filter": [
              "html_strip"
            ],
            "type": "custom",
            "tokenizer": "my_index_token"
          }
        },
        "tokenizer": {
          "my_index_token": {
            "enableFailDingMsg": "true",
            "type": "hao_index_mode",
            "enableSingleWord": "true",
            "enableFallBack": "true",
            "autoWordLength": 3
          },
          "my_search_token": {
            "enableFailDingMsg": "true",
            "type": "hao_search_mode",
            "enableSingleWord": "true",
            "enableFallBack": "true",
            "autoWordLength": 3
          }
        }
      },
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "index_options": "offsets",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}
```
Test the word segmentation:

```
test/_analyze
{
  "analyzer": "index_analyzer",
  "text": "徐庆年 奋发图强打篮球有利于提高人民生活,有的放矢,中华人民共和国家庭宣传委员会宣。?"
}
```

```
test/_analyze
{
  "analyzer": "search_analyzer",
  "text": "徐庆年 奋发图强打篮球有利于提高人民生活,有的放矢,中华人民共和国家庭宣传委员会宣。?"
}
```
徐庆年 is not in the vocabulary, but it is recognized as a word via `autoWordLength`.