Understanding how search engines segment words matters a great deal for SEO work. Both keyword layout and link structure are closely tied to word segmentation. Here Xiao Han discusses Baidu's Chinese word segmentation (it is not limited to Baidu; other search engines work in a similar way). This article has two parts: first a summary of existing explanations of word segmentation, and then my own extended thoughts on it.
What is Chinese word segmentation?
English sentences are made up of words separated by spaces, so segmenting them is comparatively easy. Chinese sentences are made up of characters written one after another with no separators, so the task is more complicated. Chinese word segmentation is the process of cutting a Chinese sentence into individual words and reassembling them into a word sequence according to certain rules.
Word segmentation plays a major role in search engines and is the foundation of text mining. It helps a program automatically work out the meaning of a sentence so that search results match the query closely. The quality of segmentation directly affects the accuracy of search results. At present, search engines rely mainly on two kinds of segmentation methods: dictionary matching and statistics.
1. Word segmentation method based on dictionary matching
This method first requires a very large dictionary, which serves as the word segmentation index. The string to be segmented is then matched against the words in that dictionary according to certain rules. If a word is found in the dictionary, the match succeeds. There are four matching methods:
1. Forward maximum matching method (direction from left to right);
2. Reverse maximum matching method (direction from right to left);
3. Minimum segmentation (minimize the number of words in each sentence);
4. Bidirectional maximum matching method (scanning twice, once from left to right and once from right to left).
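The forward, reverse, and bidirectional matching methods listed above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not how Baidu or any other search engine actually implements it: the five-word dictionary, the function names, and the tie-breaking rule in the bidirectional version (fewer words first, then fewer single characters, then the reverse result) are all assumptions made for the example.

```python
# Toy dictionary (an assumption); a real segmenter loads a very large word list.
DICTIONARY = {"研究", "研究生", "生命", "命", "起源"}
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def forward_max_match(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Scan left to right, always taking the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def reverse_max_match(text, dictionary=DICTIONARY, max_len=MAX_WORD_LEN):
    """Scan right to left, always taking the longest dictionary word."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            candidate = text[j - size:j]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                j -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


def count_single_chars(words):
    return sum(1 for word in words if len(word) == 1)


def bidirectional_max_match(text):
    """Run both scans; prefer fewer words, then fewer single characters."""
    fwd = forward_max_match(text)
    bwd = reverse_max_match(text)
    if len(fwd) != len(bwd):
        return fwd if len(fwd) < len(bwd) else bwd
    if count_single_chars(fwd) < count_single_chars(bwd):
        return fwd
    return bwd  # on a full tie, keep the reverse result


print(forward_max_match("研究生命起源"))        # ['研究生', '命', '起源']
print(reverse_max_match("研究生命起源"))        # ['研究', '生命', '起源']
print(bidirectional_max_match("研究生命起源"))  # ['研究', '生命', '起源']
```

The two scans disagree on this sentence, which is exactly the kind of ambiguity that pure dictionary matching runs into.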
Search engines typically combine several of these methods. Dictionary matching still leaves hard problems, above all ambiguity, which stems from the richness of the Chinese language itself. To improve matching accuracy, search engines also try to recognize words by simulating how a person understands a sentence. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use that syntactic and semantic information to resolve ambiguity. Such a system usually has three parts: a word segmentation subsystem, a syntax and semantics subsystem, and an overall control component. Coordinated by the control component, the segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to judge ambiguous segmentations, which mimics the way a person understands a sentence. This approach requires a large amount of linguistic knowledge and information. Search engines, of course, keep improving in this area.
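Minimum segmentation, the third method in the list above, is one dictionary-only way to choose among competing segmentations: it keeps the one with the fewest words. The sketch below uses dynamic programming to do that. The function name and the small dictionary are assumptions for illustration, and it does none of the syntactic or semantic analysis described in the previous paragraph.

```python
def min_word_segmentation(text, dictionary):
    """Return a segmentation with the fewest words (dynamic programming).

    Single characters are always allowed, so every input can be segmented.
    """
    n = len(text)
    max_len = max((len(word) for word in dictionary), default=1)
    best = [0] + [None] * n  # best[i]: fewest words needed to cover text[:i]
    back = [0] * (n + 1)     # back[i]: start index of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            piece = text[j:i]
            if i - j == 1 or piece in dictionary:
                count = best[j] + 1
                if best[i] is None or count < best[i]:
                    best[i], back[i] = count, j
    # Walk the back-pointers from the end to recover the words.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return words[::-1]


toy_dictionary = {"研究", "研究生", "生命", "命", "起源"}  # assumed word list
print(min_word_segmentation("研究生命起源", toy_dictionary))
```

Several segmentations can share the same minimum word count, so a word count alone does not remove every ambiguity; that is one reason additional information is needed.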
2. Word segmentation method based on statistics
A segmentation dictionary solves many problems, but it is far from enough. A search engine must also be able to keep discovering new words, and it does so by calculating how likely adjacent characters or words are to appear together, which tells it whether they form a word in their own right. The more context it has, the more accurately it understands the sentence and the more precise the segmentation becomes. For example, dictionary matching may split "search engine optimization" as search/engine/optimization or as search/index/engine/optimization. If later probability calculations show that the parts of "search engine optimization" appear next to each other very often in context, the phrase is added to the word index on statistical grounds.
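As a rough sketch of the statistical idea, the code below counts how often adjacent tokens occur together in already-segmented text and proposes frequent, strongly associated pairs as candidate new words. Pointwise mutual information (PMI) is used here as the association measure; that choice, the thresholds, the function name, and the tiny corpus are all assumptions for illustration, since the article only says that the probability of adjacent words appearing together is calculated.

```python
import math
from collections import Counter


def candidate_new_words(segmented_docs, min_count=2, min_pmi=1.0):
    """Propose merged words from adjacent tokens that co-occur unusually often."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in segmented_docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())
    candidates = {}
    for (left, right), count in bigrams.items():
        if count < min_count:
            continue  # too rare to say anything reliable
        # PMI = log2( p(left, right) / (p(left) * p(right)) )
        p_pair = count / total_bigrams
        p_left = unigrams[left] / total_unigrams
        p_right = unigrams[right] / total_unigrams
        pmi = math.log2(p_pair / (p_left * p_right))
        if pmi >= min_pmi:
            candidates[left + right] = pmi
    return candidates


# Hypothetical corpus, already split by a dictionary-based segmenter.
docs = [
    ["搜索", "引擎", "优化", "很", "重要"],
    ["学习", "搜索", "引擎", "优化"],
    ["搜索", "引擎", "很", "快"],
]
print(sorted(candidate_new_words(docs)))
```

On this toy corpus the pairs 搜索+引擎 and 引擎+优化 both pass the thresholds, so repeating the process on the merged tokens could eventually surface the full phrase 搜索引擎优化 ("search engine optimization").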
Application of Chinese word segmentation
Segmentation accuracy is very important to search engines, but speed matters just as much. If segmentation is too slow, no level of accuracy makes it usable, because search engines need to process hundreds of millions of web pages. If word segmentation takes too long, it seriously slows down how quickly the search engine can update its content. For search engines, therefore, both the accuracy and the speed of word segmentation must meet very high standards.
As SEO practitioners, we should understand the principles and methods of word segmentation so that we can design our websites in a way that lets search engines easily determine their topical relevance. Suppose our website is about SEO training. When a user searches for "SEO training", the search engine first segments the query, for example into "SEO" and "training", and then matches each term separately in the index database. One more point, which is my own summary rather than an established rule: after segmentation, each query has a main term and a secondary term. Usually the main term is matched first and the secondary term after it. Here "SEO" is clearly the main term, so it is matched first, followed by the secondary term "training". With that in mind, I leave it to you to think about how our websites should be laid out and structured.
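To make the "segment the query, then match each term in the index" step concrete, here is a minimal sketch of an inverted index lookup. The page names, their term lists, and the scoring rule (rank pages by how many query terms they contain) are assumptions for illustration only; real search engine ranking is far more involved.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map each term to the set of page ids whose term list contains it."""
    index = defaultdict(set)
    for page_id, terms in pages.items():
        for term in terms:
            index[term].add(page_id)
    return index


def search(query_terms, index):
    """Look up each segmented query term and rank pages by terms matched."""
    hits = defaultdict(int)
    for term in query_terms:
        for page_id in index.get(term, ()):
            hits[page_id] += 1
    # Most matched terms first; page id breaks ties for a stable order.
    return sorted(hits, key=lambda page_id: (-hits[page_id], page_id))


# Hypothetical pages, each already segmented into terms.
pages = {
    "page_a": ["SEO", "training", "course"],
    "page_b": ["SEO", "tools"],
    "page_c": ["training", "plan"],
}
index = build_inverted_index(pages)
print(search(["SEO", "training"], index))  # ['page_a', 'page_b', 'page_c']
```

The page that contains both segmented terms ranks first, which is why a page should make its main terms easy for the segmenter to recognize.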
Author: Xiao Han. First published on the Xiao Han SEO blog.
Original address: http://www.xiaohan86.com/2011061149.html Please indicate the source when reprinting.
Thank you Xiao Han for your contribution