A few months ago I found a Chinese lexicon file (several hundred KB) on the Internet, and it made me want to write a word segmentation program. I have not studied Chinese word segmentation, so I wrote it purely from my own ideas. If there are experts in the field, please give me your comments.
1. Lexicon
The lexicon has about 50,000 words (you can find it by searching on Google, and similar lexicons will work as well). An excerpt looks like this:
Region 82
Important 81
Xinhua News Agency 80
Technology 80
Meeting 80
myself 79
Cadre 78
Employees 78
Mass 77
No 77
Today 76
Comrade 76
Department 75
Strengthen 75
Organization 75
The first column is the word, and the second column is the weight. The word segmentation algorithm I wrote does not currently use the weight.
2. Design idea
A brief description of the algorithm:
For a string S, scan from front to back and, at each character reached, find the longest match in the lexicon. For example, suppose S = "我是中华人民共和国公民" ("I am a citizen of the People's Republic of China") and the lexicon includes "中华人民共和国" (People's Republic of China), "中华" (China), "公民" (citizen), "人民" (people), "共和国" (republic), and other words. When the scanner reaches "中", it takes 1, 2, 3, ... characters starting from that character ("中", "中华", "中华人", "中华人民", "中华人民共", "中华人民共和", "中华人民共和国", "中华人民共和国公"). The longest of these that matches the lexicon is "中华人民共和国", so the scanner then advances to "公".
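The longest-match idea described above can be sketched on its own as follows. This is only an illustration, not the program listed later in this post: the class name LongestMatchSketch, the HashSet lexicon, and the maxWordLength parameter are all assumptions made for the example, and it tries the longest candidate first instead of growing the candidate one character at a time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class LongestMatchSketch
{
    // Repeatedly takes the longest lexicon word starting at the current
    // position; a character that starts no lexicon word becomes its own token.
    public static List<string> Segment(string text, HashSet<string> lexicon, int maxWordLength)
    {
        var result = new List<string>();
        int i = 0;
        while (i < text.Length)
        {
            int best = 1; // fall back to a single character
            int limit = Math.Min(maxWordLength, text.Length - i);
            for (int len = limit; len >= 2; len--)
            {
                if (lexicon.Contains(text.Substring(i, len)))
                {
                    best = len;
                    break; // first hit is the longest, since we count down
                }
            }
            result.Add(text.Substring(i, best));
            i += best;
        }
        return result;
    }
}
```

Called with the example lexicon from the paragraph above, the string "我是中华人民共和国公民", and a maximum word length of 7, this yields the tokens "我", "是", "中华人民共和国", "公民".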
Data structure:
The choice of data structure has a great impact on performance. I use a Hashtable, _rootTable, to record the lexicon. Each entry is a (key, number of insertions) pair. For each word of N characters, its prefixes of length 1, 2, 3, ..., N are used as keys and inserted into _rootTable. If the same key is inserted again, its value is incremented. (In the code below, _rootTable is keyed by prefix length, and each value is a second Hashtable that maps the prefixes of that length to their counts.)
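As a minimal sketch of the prefix-count idea, here is a flattened variant that uses a single generic Dictionary instead of the two-level Hashtable. The class name PrefixCountSketch is an assumption for this example, and unlike the program in this post, its GetCount returns 0 (not -1) for a string that was never inserted.

```csharp
using System.Collections.Generic;

// Hypothetical simplified variant for illustration only.
public class PrefixCountSketch
{
    private readonly Dictionary<string, int> _counts = new Dictionary<string, int>();

    // Inserts every prefix of the word (lengths 1..N) and bumps its count.
    public void InsertWord(string word)
    {
        for (int len = 1; len <= word.Length; len++)
        {
            string prefix = word.Substring(0, len);
            int current;
            _counts.TryGetValue(prefix, out current); // current stays 0 when absent
            _counts[prefix] = current + 1;
        }
    }

    // Number of inserted words that start with s; 0 if there are none.
    public int GetCount(string s)
    {
        int count;
        return _counts.TryGetValue(s, out count) ? count : 0;
    }
}
```

After inserting "中华", "中华人民共和国" and "中国", the count for "中" is 3, for "中华" it is 2, and for "中华人" it is 1, while "人民" has a count of 0 because no inserted word starts with it.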
3. Program
The program is listed below. It carries information such as the weight and the number of insertions that the current algorithm does not use; these could be used to write a more effective word segmentation algorithm.
ChineseWordUnit.cs // struct: (word, weight) pair
public struct ChineseWordUnit
{
    private string _word;
    private int _power;

    /// <summary>
    /// The Chinese word held by this unit.
    /// </summary>
    public string Word
    {
        get
        {
            return _word;
        }
    }

    /// <summary>
    /// The weight of the Chinese word.
    /// </summary>
    public int Power
    {
        get
        {
            return _power;
        }
    }

    /// <summary>
    /// Initializes the struct.
    /// </summary>
    /// <param name="word">The Chinese word</param>
    /// <param name="power">The weight of the word</param>
    public ChineseWordUnit(string word, int power)
    {
        this._word = word;
        this._power = power;
    }
}
ChineseWordsHashCountSet.cs // lexicon container
using System.Collections;

/// <summary>
/// A dictionary class that records how many times a string appears as a prefix of the words in the Chinese lexicon. For example, "中" is a prefix of "中国", so one occurrence is recorded for "中".
/// </summary>
public class ChineseWordsHashCountSet
{
    /// <summary>
    /// Maps a prefix length to a Hashtable; that inner Hashtable maps each prefix of that length to the number of times it was inserted.
    /// </summary>
    private Hashtable _rootTable;

    /// <summary>
    /// Initializes the container.
    /// </summary>
    public ChineseWordsHashCountSet()
    {
        _rootTable = new Hashtable();
    }

    /// <summary>
    /// Queries how many times the specified string appears as a prefix of the words in the lexicon.
    /// </summary>
    /// <param name="s">The string to look up</param>
    /// <returns>The number of times the string appears as a prefix; -1 if it never appears.</returns>
    public int GetCount(string s)
    {
        if (!this._rootTable.ContainsKey(s.Length))
        {
            return -1;
        }
        Hashtable _tempTable = (Hashtable)this._rootTable[s.Length];
        if (!_tempTable.ContainsKey(s))
        {
            return -1;
        }
        return (int)_tempTable[s];
    }

    /// <summary>
    /// Inserts a word into the count dictionary: every prefix of the word (lengths 1..N) is recorded.
    /// </summary>
    /// <param name="s">The word to insert.</param>
    public void InsertWord(string s)
    {
        for (int i = 0; i < s.Length; i++)
        {
            string _s = s.Substring(0, i + 1);
            this.InsertSubString(_s);
        }
    }

    /// <summary>
    /// Records one more insertion of a string in the count dictionary.
    /// </summary>
    /// <param name="s">The inserted string.</param>
    private void InsertSubString(string s)
    {
        // Guard: an empty string has no table and would otherwise cause a null reference below.
        if (s.Length == 0)
        {
            return;
        }
        if (!_rootTable.ContainsKey(s.Length))
        {
            _rootTable.Add(s.Length, new Hashtable());
        }
        Hashtable _tempTable = (Hashtable)_rootTable[s.Length];
        if (!_tempTable.ContainsKey(s))
        {
            _tempTable.Add(s, 1);
        }
        else
        {
            _tempTable[s] = (int)_tempTable[s] + 1;
        }
    }
}
ChineseParse.cs // word segmenter
using System;
using System.Collections;
using System.IO;
using System.Text.RegularExpressions;

/// <summary>
/// Chinese word segmenter.
/// </summary>
public class ChineseParse
{
    private static ChineseWordsHashCountSet _countTable;

    static ChineseParse()
    {
        _countTable = new ChineseWordsHashCountSet();
        InitFromFile("ChineseDictionary.txt");
    }

    /// <summary>
    /// Initializes the prefix count dictionary from the specified lexicon file.
    /// </summary>
    /// <param name="fileName">File name, relative to the current directory</param>
    private static void InitFromFile(string fileName)
    {
        // The path separator was lost in the original listing; Path.Combine restores it portably.
        string path = Path.Combine(Directory.GetCurrentDirectory(), fileName);
        if (File.Exists(path))
        {
            using (StreamReader sr = File.OpenText(path))
            {
                string s = "";
                while ((s = sr.ReadLine()) != null)
                {
                    ChineseWordUnit _tempUnit = InitUnit(s);
                    _countTable.InsertWord(_tempUnit.Word);
                }
            }
        }
    }

    /// <summary>
    /// Parses one "word weight" line into a ChineseWordUnit.
    /// </summary>
    /// <param name="s">A line from the lexicon file</param>
    /// <returns>The parsed ChineseWordUnit</returns>
    private static ChineseWordUnit InitUnit(string s)
    {
        // Split on whitespace; the backslash in \s+ was lost in the original listing.
        Regex reg = new Regex(@"\s+");
        string[] temp = reg.Split(s.Trim());
        if (temp.Length != 2)
        {
            throw new Exception("String parsing error: " + s);
        }
        return new ChineseWordUnit(temp[0], Int32.Parse(temp[1]));
    }

    /// <summary>
    /// Analyzes the input string and cuts it into words.
    /// </summary>
    /// <param name="s">The string to be cut</param>
    /// <returns>The array of words obtained by cutting</returns>
    public static string[] ParseChinese(string s)
    {
        ArrayList _words = new ArrayList();

        for (int i = 0; i < s.Length; )
        {
            int length = 1;
            // Grow the candidate while it is still a prefix of some lexicon word.
            // Note: the table counts prefixes, so the result can be a prefix of a
            // word rather than a complete lexicon word.
            // The original listing also required GetCount(first character) > 1 here,
            // which skipped characters that begin exactly one lexicon word.
            while (i + length < s.Length
                && _countTable.GetCount(s.Substring(i, length + 1)) > 0)
            {
                length++;
            }
            _words.Add(s.Substring(i, length));
            i += length;
        }

        string[] _tempStringArray = new string[_words.Count];
        _words.CopyTo(_tempStringArray);
        return _tempStringArray;
    }
}
4. Test
Comparison with the Massive word segmentation demonstration program:
Case 1: Sina Sports News: After being eliminated by Juventus, Real Madrid coach Vicente del Bosque refused to accept the media's criticism of the team's defense, and he also defended his starting lineup. "The defeat is the responsibility of the whole team, not just the defense," Del Bosque said. "I don't think we played a mess." "We made it to the semifinals and fought hard on the way there. Even in today's game we had a few chances to turn things around, but the opponents we faced were very strong and they played very well." "Our fans should be proud of our performance in the Champions League over the past few seasons," Del Bosque also said. After the game, some reporters questioned Del Bosque's absence from Cambiasso in the starting lineup, believing that another player in the team, Pavin, should be sent to strengthen the back line. Regarding this suspicion, Del Bosque refused to take the so-called "responsibility" and believed that there was no problem with the team's starting lineup. "We have done it the way we have done it all season, and I have nothing to say about the changes in personnel." Regarding the team's prospects this season, Del Bosque said that Real Madrid still has the La Liga championship as its goal. "Real Madrid fought to the end in the Champions League, and we will do the same in the league."
Massive word segmentation results:
Sina Sports News After being eliminated by Juventus, Real Madrid coach Del Bosque refused to accept media criticism of the team's defense. He also defended his starting lineup. "The defeat is the responsibility of the whole team, not just the defense," Bosque said. "I don't think we played in a mess." "We entered the semifinals and fought hard on the way to promotion. Even in today's game we had several chances to turn around, but the opponents we faced were very strong and they played very well. "Our fans should be proud of our performance in the Champions League over the past few seasons. ” Bosque also said. After the game, some reporters questioned Bosque's absence from Cambiasso in the starting lineup, believing that Pavin, another player in the team, should be sent to strengthen the back line. Regarding this suspicion, Del Bosque refused to take the so-called "responsibility" and believed that there was no problem with the team's starting lineup. "We have done it the way we have done it all season, and I have nothing to say about the changes in personnel." Regarding the team's prospects this season, Del Bosque said that Real Madrid still has the La Liga championship as its goal. "Real Madrid fought to the end in the Champions League, and we will do the same in the league."
ChineseParse segmentation results:
Sina Sports News After being eliminated by Juventus, Real Madrid coach Del Bosque refused to accept media criticism of the team's defense. He also defended his starting lineup. "The defeat is the responsibility of the whole team, not just the defense," Bosque said. "I don't think we played a mess." "We entered the semifinals and fought hard on the way to promotion. Even in today's game we had several chances to turn around, but the opponents we faced were very strong and they played very well. "Our fans should be proud of our performance in the Champions League over the past few seasons. ” Bosque also said. After the game, some reporters questioned Bosque's absence from Cambiasso in the starting lineup, believing that Pavin, another player in the team, should be sent to strengthen the back line. Regarding this suspicion, Del Bosque refused to take the so-called "responsibility" and believed that there was no problem with the team's starting lineup. "We have done it the way we have done it all season, and I have nothing to say about the changes in personnel." Regarding the team's prospects this season, Del Bosque said that Real Madrid still has the La Liga championship as its goal. "Real Madrid fought to the end in the Champions League, and we will do the same in the league."
Because the lexicon contains no specialized sports vocabulary or personal names, ChineseParse cannot recognize these terms.
Case 2: The first major transformation of China's automobile society took more than ten years. In the "Automobile Industry Industrial Policy" issued in 1994, the most eye-catching item is to "gradually change the consumption structure of public funds to purchase and use cars, mainly administrative agencies, groups, institutions and state-owned enterprises." From mainly purchasing cars with public funds to cars gradually entering households, the first major transformation has brought huge improvements to people's quality of life. The main driving forces for this transformation are clear-cut industrial policies, sustained and rapid growth of the national economy, and the booming domestic automobile industry. However, as we rapidly move into an automobile society dominated by private cars, we are also facing new situations and new tests: the central government emphasizes the establishment and implementation of the scientific outlook on development and requires domestic enterprises to improve their independent innovation capabilities; during this year's "Two Sessions", the central government also proposed the spirit of building a harmonious society and a conservation-oriented society; at the same time, our country's automobile society is facing many unfavorable factors such as energy shortage, rising fuel prices, and limited land resources. Against this background, it is urgent to carry out the second major transformation.
Massive word segmentation results:
The first major transformation of China's automobile society took more than ten years. In the "Automobile Industry Industrial Policy" issued in 1994, the most eye-catching item is to "gradually change the consumption structure of public funds to purchase and use cars, mainly administrative agencies, groups, institutions and state-owned enterprises." From mainly purchasing cars with public funds to cars gradually entering households, the first major transformation has brought huge improvements to people's quality of life. The main driving forces for this transformation are clear-cut industrial policies, sustained and rapid growth of the national economy, and the booming domestic automobile industry. However, as we rapidly move into an automobile society dominated by private cars, we are also facing new situations and new tests: The central government emphasizes the establishment and implementation of the scientific outlook on development and requires domestic enterprises to improve their independent innovation capabilities; During this year's "Two Sessions" , the central government also proposed the spirit of building a harmonious society and a conservation-oriented society; at the same time, my country's automobile society is facing many unfavorable factors such as energy shortage, rising fuel prices, and limited land resources. Against this background, it is urgent to carry out the second major transformation.
ChineseParse word segmentation results:
The first major transformation of China's automobile society took more than ten years. In the "Automotive Industry Industrial Policy" issued in 1994, the most eye-catching item is to "gradually change the consumption structure of public funds to purchase and use cars, mainly administrative agencies, groups, institutions and state-owned enterprises." From mainly purchasing cars with public funds to cars gradually entering households, the first major transformation has brought huge improvements to people's quality of life. The main driving forces for this transformation are clear-cut industrial policies, sustained and rapid growth of the national economy, and the booming domestic automobile industry. However, as we rapidly move into an automobile society dominated by private cars, we are also facing new situations and new tests: The central government emphasizes the establishment and implementation of the scientific outlook on development and requires domestic enterprises to improve their independent innovation capabilities; During this year's "Two Sessions" , the central government also proposed the spirit of building a harmonious society and a conservation-oriented society; at the same time, my country's automobile society is facing many unfavorable factors such as energy shortage, rising fuel prices, and limited land resources. Against this background, it is urgent to carry out the second major transformation.
It can be seen that ChineseParse cannot intelligently handle expressions such as "first time" and "second time", and it has no ability to recognize numbers, but the basic word segmentation result is still acceptable.
(After all, I completed the program in 3 hours; how can it be compared with what others have accumulated over ten years?)
Performance test (Centrino 1.5M): 677,000 words per second.
With some optimization of the program, this figure should be higher.
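For anyone who wants to reproduce a throughput figure like the one above, a measurement could be sketched as follows. This is not how the number in this post was obtained; the class name ThroughputSketch and its parameters are assumptions for the example, and it counts characters of input per second.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper for illustration only.
public static class ThroughputSketch
{
    // Runs the segmenter over the same text several times and returns
    // the number of input characters processed per second.
    public static double CharsPerSecond(Func<string, string[]> segment, string text, int repetitions)
    {
        Stopwatch watch = Stopwatch.StartNew();
        for (int r = 0; r < repetitions; r++)
        {
            segment(text);
        }
        watch.Stop();
        // Avoid dividing by zero when the run is too short to measure.
        double seconds = Math.Max(watch.Elapsed.TotalSeconds, 1e-9);
        return (double)text.Length * repetitions / seconds;
    }
}
```

It could be called as ThroughputSketch.CharsPerSecond(ChineseParse.ParseChinese, text, 1000), with a first untimed call beforehand so that loading the lexicon is not included in the measurement.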
5. Summary
What should be done next:
1. Recognize simple foreign-language words and numbers
2. Add some simple intelligence
3. Expand the lexicon
After that, the program will have practical value.
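As one possible starting point for the first item, the input could be split into runs of ASCII letters and digits and runs of everything else before segmentation, so that a number or a foreign word stays in one piece. This is only a sketch under that assumption; the class AsciiRunSketch and its method names are made up for the example and are not part of the program above.

```csharp
using System.Collections.Generic;

// Hypothetical pre-processing step for illustration only.
public static class AsciiRunSketch
{
    private static bool IsAsciiLetterOrDigit(char c)
    {
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
    }

    // Groups consecutive characters of the same class (ASCII letter/digit
    // versus anything else) into runs, preserving their order.
    public static List<string> SplitRuns(string text)
    {
        var runs = new List<string>();
        int start = 0;
        for (int i = 1; i <= text.Length; i++)
        {
            bool atEnd = i == text.Length;
            if (atEnd || IsAsciiLetterOrDigit(text[i]) != IsAsciiLetterOrDigit(text[start]))
            {
                runs.Add(text.Substring(start, i - start));
                start = i;
            }
        }
        return runs;
    }
}
```

For the input "1994年发布的Windows系统" this gives the runs "1994", "年发布的", "Windows", "系统"; the ASCII runs can be emitted as tokens directly and the other runs passed to the segmenter.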
Note: Most of what I have written in the past few months are small Chinese text processing programs, such as traditional/simplified conversion, automatic typesetting, batch replacement, and Chinese word segmentation. If I have time, I will collect these programs and package them into a practical Chinese processing tool. If you have other needs, feel free to tell me.