An Elasticsearch Chinese word-segmentation (analysis) plugin.

QQ communication group: 743457803

See here for how to develop an ES word-segmentation plugin. The design mainly draws on IK and HanLP.
- Supports complex Chinese characters: some Chinese characters have a `String` length greater than 1 in Java (surrogate pairs), which IK and other segmenters do not support.
- Supports single-character segmentation and search, but does not provide an ik_max_word-style mode.
- Supports custom-length word recognition, suitable for identifying person names in short texts: a run of Chinese characters delimited by spaces, punctuation, letters, digits, etc., whose length is <= autoWordLength is automatically recognized as one word.
- Supports emoji search.
Compared with IK, it is smarter and more accurate. ik_max_word exhaustively enumerates all possible words, so irrelevant searches can match: 任性冲动过 is segmented into 任性 / 性冲动 / 动过, so a search for 性冲动 matches that document; 南京市长江大桥 is segmented into 南京市 / 市长 / 长江大桥, so a search for 市长 matches that document. The hao segmenter avoids this: it computes the shortest path based on word frequency and keeps the segmentation with the highest probability, and you can adjust word frequencies to fit your own scenario. Note also that ik_smart's output is not a subset of ik_max_word's output, whereas hao_search_mode's output is a subset of hao_index_mode's output.
Compared with HanLP, it is lighter-weight and its segmentation is more controllable. It has none of HanLP's intelligent prediction features such as person-name recognition, which can make segmentation unstable and inaccurate: machine-learning models predict long and short texts differently, so the resulting segmentations differ too. HanLP also has no official ES plugin.
It computes the shortest path based on word frequency, enumerating plausible words rather than all possible words. If an enumerated word comes out wrong, you can adjust its frequency to correct it. The word-frequency file is a plain txt file, which keeps it readable and editable.
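To illustrate how a frequency-driven shortest path picks the most probable segmentation, here is a minimal self-contained sketch. The dictionary, frequencies, and dynamic program below are illustrative toys of my own, not the plugin's actual code or data:

```python
import math

# Hypothetical toy frequency dictionary; the real plugin loads base_dictionary.txt.
FREQ = {
    "南京": 800, "南京市": 900, "市长": 700,
    "长江": 800, "大桥": 500, "长江大桥": 900,
    "市": 100, "南": 50, "京": 50, "长": 50, "江": 50, "大": 50, "桥": 50,
}
TOTAL = sum(FREQ.values())

def segment(text):
    """Shortest-path (max-probability) segmentation via dynamic programming:
    a word's cost is -log(freq/TOTAL); unseen single chars get a high cost."""
    n = len(text)
    best = [math.inf] * (n + 1)   # best[i] = min cost to segment text[:i]
    best[0] = 0.0
    back = [0] * (n + 1)          # back[i] = start of the last word in the best path
    for i in range(1, n + 1):
        for j in range(max(0, i - 8), i):   # cap candidate word length at 8 chars
            word = text[j:i]
            freq = FREQ.get(word)
            if freq is None and i - j > 1:  # unknown multi-char span: not a word
                continue
            cost = -math.log((freq or 1) / TOTAL)
            if best[j] + cost < best[i]:
                best[i] = best[j] + cost
                back[i] = j
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))

print(segment("南京市长江大桥"))  # → ['南京市', '长江大桥']
```

Because each word on the path adds a positive cost, the two high-frequency words 南京市 and 长江大桥 beat any path through 市长, which is how the ambiguity above is resolved.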
Supports meta-words: for example, 俄罗斯 is no longer split into 俄 and 罗斯 (罗斯 is a common personal name), so 罗斯 will not recall documents related to 俄罗斯.

Part-of-speech tagging is not supported.
Provided analyzers: hao_search_mode, hao_index_mode. Provided tokenizers: hao_search_mode, hao_index_mode.
| Git tag | ES version |
|---|---|
| master | latest stable ES version |
| v7.17.1 | 7.17.1 |
| vX.Y.Z | X.Y.Z |
Method 1: `bin/elasticsearch-plugin install file:///Users/xiaoming/Download/analysis-hao.zip`

Method 2: unzip the package and place it in the ES plugins directory, making sure the structure is `{ES_HOME}/plugins/analysis-hao/` (containing the various jar and other files). The directory must not contain any zip file.

Finally, restart ES.
If the ES version you need is not provided, modify a few places: change `elasticsearch.version` in `pom.xml` to the corresponding version, and adjust `HaoTokenizerFactory.java` accordingly. Then run `mvn clean package -Dmaven.test.skip=true` to produce the plugin's zip installation package.

The following configuration items are available for custom tokenizers:
| Configuration item | Function | Default |
|---|---|---|
| enableIndexMode | Whether to use index mode, which is fine-grained. Fine granularity suits Term queries; coarse granularity suits Phrase queries. | `false` for hao_search_mode, `true` for hao_index_mode |
| enableFallBack | Whether to fall back to the most fine-grained segmentation (single characters) if segmentation throws an error. Recommended for search_mode so user searches are not affected; leave it off for index_mode so error alerts are reported promptly. | `false` (no fallback) |
| enableFailDingMsg | Whether to send DingTalk notifications on failure, to the `dingWebHookUrl` configured in `HttpAnalyzer.cfg.xml`. | `false` |
| enableSingleWord | Whether to emit single-character terms. For example, with 体力值, when disabled the result stores only 体力值 and 体力, not 值. | `false` |
| autoWordLength | A run of Chinese characters delimited by spaces, punctuation, letters, digits, etc., whose length is <= autoWordLength is automatically recognized as one word. `-1` (the default) disables the feature; a value >= 2 enables it. | `-1` |
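As a sketch, these options are passed as tokenizer settings when declaring a custom tokenizer in the index settings (the name `my_token` is a placeholder; a full index-creation example appears further below):

```json
{
  "tokenizer": {
    "my_token": {
      "type": "hao_search_mode",
      "enableFallBack": "true",
      "enableFailDingMsg": "false",
      "autoWordLength": 3
    }
  }
}
```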
**hao_index_mode**

Words are segmented recursively according to the terms and weights of the vocabulary until they cannot be split further. If `enableSingleWord=true` is set, they are split all the way down to single characters. For example, the text 南京市长江大桥:

- 南京市长江大桥 ==> 南京市, 长江大桥
- 南京市 ==> 南京, 市
- 长江大桥 ==> 长江, 大桥

With `enableSingleWord=false`, the recursion stops at dictionary words, yielding 南京市, 南京, 市, 长江大桥, 长江, 大桥.

With `enableSingleWord=true`, the recursion continues down to single characters, yielding 南京市, 南京, 南, 京, 市, 长江大桥, 长江, 长, 江, 大桥, 大, 桥.
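The recursion described above can be sketched as follows. This is an illustrative toy, not the plugin's code: it uses greedy longest-match in place of the plugin's frequency-based shortest path, and `DICT` is a made-up vocabulary:

```python
# Toy illustration of hao_index_mode's recursive splitting (not the plugin's code).
DICT = {"南京市", "南京", "市", "长江大桥", "长江", "大桥"}

def split_once(word):
    """Split `word` into smaller parts by greedy longest match against DICT.
    Returns [] if no split into strictly smaller parts exists."""
    if len(word) < 2:
        return []
    parts, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if j - i < len(word) and (piece in DICT or j - i == 1):
                parts.append(piece)
                i = j
                break
    return parts if len(parts) > 1 else []

def index_mode(word, enable_single_word=False, out=None):
    """Recursively emit `word` and its sub-words, mimicking hao_index_mode."""
    out = [] if out is None else out
    out.append(word)
    for part in split_once(word):
        if len(part) > 1:
            index_mode(part, enable_single_word, out)
        elif part in DICT or enable_single_word:
            # single characters are kept only if they are dictionary words
            # (e.g. 市) or enableSingleWord is on
            out.append(part)
    return out

print(index_mode("南京市"))                           # → ['南京市', '南京', '市']
print(index_mode("南京市", enable_single_word=True))  # → ['南京市', '南京', '南', '京', '市']
```

Note how the dictionary word 市 is emitted even without `enable_single_word`, matching the example above, while 南 and 京 appear only when the flag is on.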
**hao_search_mode**

In this mode the text is segmented as if hao_index_mode had recursed only once, giving 南京市, 长江大桥. This is because `enableIndexMode=false` in this mode; changing it to `true` makes it behave exactly like hao_index_mode.
| Parameter | Function | Remarks |
|---|---|---|
| baseDictionary | Base dictionary file name | Place it in the plugin config directory or the ES config directory; do not change it. |
| customerDictionaryFile | User-defined remote dictionary files; separate multiple files with an English semicolon `;` | They are stored in the plugin config directory or the ES config directory. |
| remoteFreqDict | Remote user-defined dictionary file | Convenient for hot updates; hot updates are scheduled via the two parameters below. |
| syncDicTime | Next synchronization time of the remote dictionary, `hh:mm:ss` | If left blank, syncDicPeriodTime alone determines the next sync time. |
| syncDicPeriodTime | Synchronization interval of the remote dictionary in seconds; minimum 30 | For example, `syncDicTime=20:00:00,syncDicPeriodTime=86400` syncs at 20:00 every day. |
| dingWebHookUrl | DingTalk bot webhook URL | Used for notifications of segmentation exceptions and dictionary-sync failures/successes. |
| dingMsgContent | Bot notification text | When configuring the DingTalk bot, the keyword must match this text, otherwise the message will fail to send. |
Dictionary files in the `{ES_HOME}/config/analysis-hao/` directory are read first; if they exist, the files in the `{ES_HOME}/plugins/analysis-hao/config` directory are not read.
`base_dictionary.txt` is comma-separated; the number following each word is its frequency. For example, if 奋发图强 is segmented as 奋, 发图, 强 because the frequency of 发图 is too high (it occurs very often), you can lower that frequency by manually editing the `base_dictionary.txt` file.

`customerDictionaryFile` is overwritten automatically. The remote dictionary file format is `{词},{词频},{是否元词}` (word, frequency, meta-word flag), for example `俄罗斯,1000,1`. The meta-word flag works as follows: `1` means the word is a meta-word and will not be split further, so 俄罗斯 is not split into 俄 and 罗斯 (罗斯 is a common personal name) and 罗斯 will not recall documents related to 俄罗斯; `0` means the word can be split further, e.g. 奋发图强.
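A remote dictionary file following the `{词},{词频},{是否元词}` format above might look like this (the words and frequencies are illustrative, not from the plugin's shipped dictionaries):

```text
俄罗斯,1000,1
奋发图强,500,0
体力值,300,0
```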
Build the index:

```
PUT test/
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "search_analyzer": {
            "filter": [
              "lowercase"
            ],
            "char_filter": [
              "html_strip"
            ],
            "type": "custom",
            "tokenizer": "my_search_token"
          },
          "index_analyzer": {
            "filter": [
              "lowercase"
            ],
            "char_filter": [
              "html_strip"
            ],
            "type": "custom",
            "tokenizer": "my_index_token"
          }
        },
        "tokenizer": {
          "my_index_token": {
            "enableFailDingMsg": "true",
            "type": "hao_index_mode",
            "enableSingleWord": "true",
            "enableFallBack": "true",
            "autoWordLength": 3
          },
          "my_search_token": {
            "enableFailDingMsg": "true",
            "type": "hao_search_mode",
            "enableSingleWord": "true",
            "enableFallBack": "true",
            "autoWordLength": 3
          }
        }
      },
      "number_of_replicas": "0"
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "index_options": "offsets",
        "analyzer": "index_analyzer",
        "search_analyzer": "search_analyzer"
      }
    }
  }
}
```
Test the word segmentation:

```
test/_analyze
{
  "analyzer": "index_analyzer",
  "text": "徐庆年 奋发图强打篮球有利于提高人民生活,有的放矢,中华人民共和国家庭宣传委员会宣。?"
}
```

```
test/_analyze
{
  "analyzer": "search_analyzer",
  "text": "徐庆年 奋发图强打篮球有利于提高人民生活,有的放矢,中华人民共和国家庭宣传委员会宣。?"
}
```
徐庆年 is not in the vocabulary, but it is recognized as a word via `autoWordLength`.