This PHP Chinese word segmentation code uses a Unicode-based lexicon and segments text by reverse matching. In theory it is compatible with a wide range of encodings, and it is particularly convenient for UTF-8. Because PHPAnalysis is a component-less system, it is slightly slower than segmenters that use components. When segmenting a large amount of text, however, the dictionary is loaded as segmentation proceeds, so the more content there is, the faster processing becomes; this is normal. On servers that support PHP-APC, the program can cache the dictionary, after which its speed should in theory be no slower than that of component-based word segmentation programs.
The word segmentation system uses a method based on string matching, also called mechanical word segmentation. It matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string-matching segmentation divides into forward matching and reverse matching; by which length is tried first, into maximum (longest) matching and minimum (shortest) matching; and by whether it is combined with part-of-speech tagging, into simple segmentation methods and integrated methods that combine segmentation with tagging. Several commonly used mechanical word segmentation methods are as follows:
1) Forward maximum matching method (direction from left to right);
2) Reverse maximum matching method (direction from right to left);
3) Minimum segmentation (minimize the number of words in each sentence).
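The first two methods can be sketched as follows. This is a minimal illustration in Python, not the PHPAnalysis source: the function names, the tiny lexicon, and the `max_len` limit are all assumptions made for the example, and any character that matches no entry is emitted as a single-character word.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Scan left to right, taking the longest lexicon entry at each position."""
    words = []
    start = 0
    while start < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - start), 0, -1):
            candidate = text[start:start + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                start += size
                break
    return words


def reverse_max_match(text, lexicon, max_len=4):
    """Scan right to left, taking the longest lexicon entry ending at each position."""
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    words.reverse()  # words were collected from the end of the string
    return words


# Hypothetical four-entry lexicon, for illustration only.
lexicon = {"研究", "研究生", "生命", "起源"}
print(forward_max_match("研究生命起源", lexicon))  # -> ['研究生', '命', '起源']
print(reverse_max_match("研究生命起源", lexicon))  # -> ['研究', '生命', '起源']
```

With this sample lexicon the forward scan greedily takes "研究生" and leaves "命" stranded, while the reverse scan recovers "研究 / 生命 / 起源".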
The methods above can also be combined with each other. For example, forward maximum matching and reverse maximum matching can be combined into a two-way matching method. Because a single Chinese character can form a word by itself, forward minimum matching and reverse minimum matching are rarely used. Generally speaking, reverse matching is slightly more accurate than forward matching and runs into fewer ambiguities. Statistical results show that the error rate of forward maximum matching alone is 1/169, and that of reverse maximum matching alone is 1/245. Even so, this accuracy is far from meeting practical needs. Word segmentation systems in actual use all treat mechanical segmentation as a preliminary step and further improve accuracy by drawing on various other kinds of linguistic information.
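A two-way matching method might look like the sketch below, written in Python for illustration rather than taken from PHPAnalysis. The rule for choosing between the two results (fewer words, then fewer single-character words, then the reverse result on a tie) is a common heuristic assumed here, not something the text specifies; `max_match`, `bidirectional_match`, and the sample lexicon are hypothetical names and data.

```python
def max_match(text, lexicon, max_len=4, reverse=False):
    """Maximum matching in either direction; unmatched characters become words."""
    words = []
    if reverse:
        end = len(text)
        while end > 0:
            for size in range(min(max_len, end), 0, -1):
                candidate = text[end - size:end]
                if size == 1 or candidate in lexicon:
                    words.insert(0, candidate)  # keep left-to-right order
                    end -= size
                    break
    else:
        start = 0
        while start < len(text):
            for size in range(min(max_len, len(text) - start), 0, -1):
                candidate = text[start:start + size]
                if size == 1 or candidate in lexicon:
                    words.append(candidate)
                    start += size
                    break
    return words


def bidirectional_match(text, lexicon, max_len=4):
    """Run both scans and keep the result that looks less fragmented."""
    forward = max_match(text, lexicon, max_len)
    backward = max_match(text, lexicon, max_len, reverse=True)
    if forward == backward:
        return forward

    def cost(words):
        # Assumed heuristic: fewer words, then fewer single-character words.
        singles = sum(1 for word in words if len(word) == 1)
        return (len(words), singles)

    # On a tie, prefer the reverse result, which tends to be more accurate.
    return forward if cost(forward) < cost(backward) else backward


# Hypothetical four-entry lexicon, for illustration only.
lexicon = {"研究", "研究生", "生命", "起源"}
print(bidirectional_match("研究生命起源", lexicon))  # -> ['研究', '生命', '起源']
```

Both scans produce three words here, but the forward result contains the stray single character "命", so the reverse result is kept.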
One approach is to improve the scanning method, known as feature scanning or mark segmentation. It first identifies and segments words with obvious characteristics in the string to be analyzed. Using these words as breakpoints, the original string is divided into smaller strings, and mechanical word segmentation is then performed on each of them, which reduces the matching error rate. Another approach is to combine word segmentation with part-of-speech tagging: rich part-of-speech information helps the segmentation decisions, and the tagging process in turn checks and adjusts the segmentation results, thereby greatly improving accuracy.
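The feature-scanning idea can be sketched in Python as below. The choice of markers is an assumption made for the example (runs of Latin letters and digits, plus a few Chinese punctuation marks); what PHPAnalysis actually treats as a breakpoint is not stated here. The names `MARKERS` and `segment_with_markers` and the sample lexicon are likewise hypothetical.

```python
import re

# Assumed "obvious" tokens: Latin/digit runs and common Chinese punctuation.
# The capturing group makes re.split() keep the markers in its output.
MARKERS = re.compile(r"([A-Za-z0-9]+|[，。！？、；：])")


def reverse_max_match(text, lexicon, max_len=4):
    """Right-to-left maximum matching; unmatched characters become words."""
    words = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                end -= size
                break
    words.reverse()
    return words


def segment_with_markers(text, lexicon, max_len=4):
    """Cut the text at marker tokens, then segment each remaining piece."""
    words = []
    for piece in MARKERS.split(text):
        if not piece:
            continue  # re.split() yields empty strings around adjacent markers
        if MARKERS.fullmatch(piece):
            words.append(piece)  # a marker is already a finished token
        else:
            words.extend(reverse_max_match(piece, lexicon, max_len))
    return words


# Hypothetical lexicon, for illustration only.
lexicon = {"分词", "研究", "研究生", "生命", "起源"}
print(segment_with_markers("PHP5分词，研究生命起源。", lexicon))
# -> ['PHP5', '分词', '，', '研究', '生命', '起源', '。']
```

Because each piece between markers is short, a wrong match cannot spread across a breakpoint, which is how this step lowers the matching error rate.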