What is Chinese word segmentation
What is word segmentation, and how does Chinese word segmentation differ from segmentation in other languages? Word segmentation is the process of recombining a continuous sequence of characters into a sequence of words according to a given specification. In English writing, spaces act as natural delimiters between words. In Chinese, only characters, sentences and paragraphs are marked off by obvious delimiters; words have no formal delimiter. English also has the problem of delimiting phrases, but at the word level Chinese is much more complicated and difficult than English.
There are currently three mainstream Chinese word segmentation algorithms:
1. Word segmentation method based on string matching
This method is also called the mechanical word segmentation method. It matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string matching methods are divided into forward matching and reverse matching; by which length is tried first, into maximum (longest) matching and minimum (shortest) matching; and by whether they are combined with part-of-speech tagging, into simple segmentation methods and integrated methods that combine segmentation with tagging. Several commonly used mechanical word segmentation methods are as follows:
1) Forward maximum matching method (direction from left to right);
2) Reverse maximum matching method (direction from right to left);
3) Minimum segmentation (minimize the number of words in each sentence).
The methods above can also be combined with each other. For example, forward maximum matching and reverse maximum matching can be combined into a two-way (bidirectional) matching method. Because a single Chinese character can be a word by itself, forward minimum matching and reverse minimum matching are rarely used. Generally speaking, reverse matching is slightly more accurate than forward matching and runs into fewer ambiguities. Statistical results show an error rate of 1/169 for forward maximum matching used alone and 1/245 for reverse maximum matching used alone. This accuracy is still far from meeting practical needs. Word segmentation systems in actual use all treat mechanical segmentation as a preliminary step and then improve accuracy further by drawing on various other kinds of linguistic information.
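The following is a minimal sketch of forward and reverse maximum matching as described above. The function names, the toy dictionary and the max_len limit of 4 characters are assumptions made for illustration, and the sample string 研究生命起源 ("research the origin of life") is a commonly cited example, not one taken from this article.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary, max_len=4):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    words.reverse()  # pieces were collected from the end of the string
    return words


# Toy dictionary (hypothetical, for illustration only).
toy_dict = {"研究", "研究生", "生命", "起源"}

print(forward_max_match("研究生命起源", toy_dict))  # ['研究生', '命', '起源']
print(reverse_max_match("研究生命起源", toy_dict))  # ['研究', '生命', '起源']
```

On this input the forward scan greedily takes 研究生 (graduate student) and leaves a stray 命, while the reverse scan yields 研究/生命/起源, which illustrates why reverse matching tends to run into fewer ambiguities.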
One approach is to improve the scanning method, which is called feature scanning or mark segmentation. It first identifies and cuts out words with obvious characteristics in the string to be analyzed. Using these words as breakpoints, the original string is divided into smaller strings, which are then segmented mechanically, reducing the matching error rate. Another approach is to combine word segmentation with part-of-speech tagging: rich part-of-speech information helps the segmentation decisions, and the segmentation results are in turn checked and adjusted during tagging, which greatly improves segmentation accuracy.
2. Word segmentation method based on understanding
This word segmentation method recognizes words by having the computer simulate human understanding of sentences. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. It usually consists of three parts: a word segmentation subsystem, a syntax and semantics subsystem, and an overall control part. Under the coordination of the overall control part, the word segmentation subsystem obtains syntactic and semantic information about words, sentences, etc. to judge segmentation ambiguities, that is, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so general and complex, it is difficult to organize it into a form that machines can read directly. Word segmentation systems based on understanding are therefore still at the experimental stage.
3. Word segmentation method based on statistics
From a formal point of view, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which adjacent characters co-occur therefore reflects fairly well how credible they are as a word. The frequency of each combination of adjacent characters in a corpus can be counted and their mutual information calculated. Mutual information is defined for a pair of characters from the adjacent co-occurrence probability of two Chinese characters X and Y, and it reflects how tightly the two characters are bound together. When this tightness exceeds a certain threshold, the character group can be considered a possible word. This method only needs to count the frequency of character groups in the corpus and does not need a segmentation dictionary, so it is also called the dictionary-free word segmentation method or the statistical word extraction method. However, the method has limitations. It often extracts character groups that co-occur frequently but are not words, such as "this", "one", "some", "my", "many", etc.; its recognition accuracy for common words is poor; and its time and space overhead is large. Practical statistical word segmentation systems therefore use a basic segmentation dictionary (a dictionary of common words) for string matching and at the same time use statistical methods to identify new words. In other words, they combine string frequency statistics with string matching, which keeps the speed and efficiency of matching-based segmentation while also taking advantage of dictionary-free segmentation's ability to recognize new words from context and automatically eliminate ambiguities.
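The statistic described above can be sketched as follows. The text does not give the formula, so this sketch assumes the common pointwise definition M(X, Y) = log2( P(X, Y) / (P(X) P(Y)) ), where P(X, Y) is the probability that X is immediately followed by Y. The function names, the four-sentence corpus and the threshold value are hypothetical and only illustrate the idea.

```python
import math
from collections import Counter


def adjacent_mutual_information(corpus):
    """Return {two-character string: mutual information} for adjacent pairs."""
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)  # counts every single character
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    scores = {}
    for pair, count in pair_counts.items():
        p_xy = count / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        scores[pair] = math.log2(p_xy / (p_x * p_y))
    return scores


def extract_candidates(corpus, threshold):
    """Keep the pairs whose mutual information exceeds the threshold."""
    scores = adjacent_mutual_information(corpus)
    return {pair for pair, score in scores.items() if score > threshold}


# Tiny hypothetical corpus; a real system would count over a large one.
corpus = ["我爱北京", "北京很大", "我去北京", "我很好"]
for pair, score in sorted(adjacent_mutual_information(corpus).items()):
    print(pair, round(score, 2))
```

On a corpus this small, pairs seen only once can score as high as genuine words, which is one reason a threshold alone is not enough and practical systems add a dictionary.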
Some points to note about word segmentation:
1. The word segmentation algorithm must be fast. Web search in particular has demanding real-time requirements, so word segmentation, as the foundation of Chinese information processing, must first of all take as little time as possible.
2. Improving segmentation accuracy does not necessarily improve retrieval performance. Once segmentation reaches a certain accuracy, its effect on Chinese information retrieval (CIR) is no longer significant. Some effect remains, but segmentation is no longer the performance bottleneck of CIR. A segmentation algorithm that one-sidedly pursues high accuracy is therefore not well suited to large-scale Chinese information retrieval. When speed and accuracy conflict, a suitable balance between the two has to be found.
3. Segmentation granularity can still follow the long-word-first principle, but corresponding follow-up processing is needed at the query expansion level. For information retrieval, the segmentation algorithm only needs to focus on eliminating crossing (overlapping) ambiguities. Covering (combinational) ambiguities can be resolved through a secondary index of the dictionary and query expansion.
4. For unregistered (out-of-dictionary) word recognition, precision matters more than recall. When identifying unregistered words, wrong combinations should be avoided as far as possible so that no incorrect unregistered words are cut out. If single characters are wrongly combined into an unregistered word, the corresponding documents may not be retrieved correctly.
Baidu word segmentation
First, the query is split on delimiters. The query "information retrieval theory tools" becomes <information retrieval, theory, tools> after segmentation.
Next, it checks for duplicate strings. If there are any, the extra copies are discarded and only one is kept. The query "theory tools theory" becomes <tools, theory> after segmentation. Google does not perform this merging.
Then it checks for English words or numbers. If there are any, each English word or number is kept as a whole and cut apart from the Chinese characters before and after it. The query "movie BT download" becomes <movie, BT, download> after segmentation.
If a string contains 3 or fewer Chinese characters, it is left unchanged. Only when the string is longer than 4 Chinese characters does Baidu's word segmentation program go to work and break it up.
Types of word segmentation algorithm: forward maximum matching, reverse maximum matching, two-way maximum matching, the language model method, and the shortest path algorithm. Whether a word segmentation system is good comes down to two key points. One is its ability to eliminate ambiguity; the other is its recognition of words that are not registered in the dictionary, such as names of people, places and organizations.
Baidu word segmentation uses at least two dictionaries: a general dictionary and a special dictionary (names of people, place names, new words, etc.). The special dictionary is applied first, and the remaining fragments are then segmented with the general dictionary.
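A rough sketch of the "special dictionary first" pass might look like this. The function name, the longest-match strategy and the one-entry special dictionary are all assumptions for illustration; the article does not say how Baidu actually performs this lookup. The sample string 毛泽东北京华烟云 is an assumed Chinese rendering of the "Mao Zedong" query discussed in this article.

```python
def cut_proper_names(text, special_dict):
    """Split text into (fragment, is_proper_name) pieces by longest match."""
    max_len = max(map(len, special_dict), default=0)
    pieces, buffer, i = [], "", 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in special_dict:
                if buffer:  # flush the ordinary text collected so far
                    pieces.append((buffer, False))
                    buffer = ""
                pieces.append((text[i:i + size], True))
                i += size
                break
        else:
            # No proper name starts here; leave the character for the
            # general-dictionary segmenter.
            buffer += text[i]
            i += 1
    if buffer:
        pieces.append((buffer, False))
    return pieces


# Hypothetical special dictionary containing a single person name.
print(cut_proper_names("毛泽东北京华烟云", {"毛泽东"}))
# [('毛泽东', True), ('北京华烟云', False)]
```

The fragments marked False are the "remaining fragments" that would then be segmented with the general dictionary.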
Baidu's word segmentation uses a two-way maximum matching algorithm.
Example: for the query "Mao Zedong Beijing Hua Yanyun" (毛泽东北京华烟云), Baidu's segmentation result is "Mao Zedong / Bei / Jinghua Yanyun" (毛泽东/北/京华烟云).
Baidu word segmentation can identify people's names, and it can also identify "Jinghua Yanyun" (京华烟云), which shows that it is able to recognize words that are not registered in the dictionary.
First, the special dictionary (names of people, some place names, etc.) is consulted and the proper names are cut out. A two-way segmentation strategy is then applied to the remaining parts. If the two results (forward maximum matching and reverse maximum matching) are the same, there is no ambiguity and the segmentation result is output directly.
If they differ, the shortest-path result is output, that is, the fewer fragments the better. For example, between <Cuba, Bi, Ethics> (古巴, 比, 伦理) and <Ancient Babylon, Li> (古巴比伦, 理) the latter is chosen, and between <Beijing, Hua, Yanyun> (北京, 华, 烟云) and <Bei, Jinghua Yanyun> (北, 京华烟云) the latter is chosen.
If the number of fragments is the same, the result with fewer single-character words is selected. The query "distant ancient ancient Babylon" (遥远古古巴比伦) is segmented by Baidu into <遥远, 古古, 巴比伦>, which contains no single-character word, rather than into 遥远/古/古巴比伦.
If the number of single-character words is also the same, the forward segmentation result is selected. For the query "Wang Qiang Da Xiao:" (王强大小:), Baidu segments it as 王/强大/小 (the forward result) rather than as the reverse result 王/强/大小.
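The selection rules described above can be summarized in a short sketch. The function name is hypothetical, and the candidate lists are hand-written inputs in Chinese characters (my reading of the translated examples in the text), not the output of Baidu's segmenter.

```python
def choose_segmentation(forward, backward):
    """Pick between forward and reverse maximum matching results."""
    # Rule 1: identical results mean no ambiguity.
    if forward == backward:
        return forward
    # Rule 2: prefer the result with fewer fragments ("shortest path").
    if len(forward) != len(backward):
        return min(forward, backward, key=len)
    # Rule 3: prefer the result with fewer single-character words.
    singles_forward = sum(1 for word in forward if len(word) == 1)
    singles_backward = sum(1 for word in backward if len(word) == 1)
    if singles_backward < singles_forward:
        return backward
    # Rule 4: otherwise fall back to the forward result.
    return forward


# Fewer fragments wins.
print(choose_segmentation(["古巴比伦", "理"], ["古巴", "比", "伦理"]))
# Same fragment count: fewer single characters wins.
print(choose_segmentation(["遥远", "古古", "巴比伦"], ["遥远", "古", "古巴比伦"]))
# Complete tie: the forward result wins.
print(choose_segmentation(["王", "强大", "小"], ["王", "强", "大小"]))
```

Each rule is only consulted when the previous one fails to separate the two candidates, which matches the order in which the text presents them.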
Baidu has always promoted its advantages in Chinese processing. Judging from the above, there is nothing special about its word segmentation algorithm, and its disambiguation is not ideal. Even if Baidu uses an algorithm more complex than the one described here, it is hard to call that an advantage. If Baidu has an advantage at all, it is its large special dictionary, which contains names of people (such as Dae Jang Geum), titles (such as "the old lady") and some place names (such as the United Arab Emirates). Baidu presumably uses a relatively recent named entity recognition algorithm published by academia to keep identifying out-of-dictionary words in its corpus and gradually expand this special dictionary.
This article comes from China SEO Forum; original post address: http://www.web520.com/bbs/thread-2742-1-1.html
Author information: Lao Chen, one of the founders of China SEO Forum (www.web520.com/bbs)