New-word discovery and hot-word mining in Java with the Nagao algorithm

Author: Eve Cole  Updated: 2025-03-01 19:16:01

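The Nagao algorithm is used to count the frequency of every substring. From these counts, the program then computes each string's term frequency, number of left and right neighbors, left and right entropy, and mutual information (internal cohesion).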

Terminology:

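Nagao algorithm: an algorithm for quickly counting the frequency of every substring in a text. For details of the algorithm, see http://www.doc88.com/p-664123446503.html
Term frequency: the number of times a string appears in the document. The more often it appears, the more important it is.
Number of left and right neighbors: the number of distinct characters that appear immediately to the left and to the right of the string in the document. The more distinct neighbors a string has, the more likely it is to be a word.
Left and right entropy: the entropy of the distribution of the characters that appear immediately to the left and to the right of the string. It is similar to the neighbor count above, with one difference: the count only records how many distinct neighbors there are, while the entropy also reflects how evenly the occurrences are spread among them.
Mutual information: for every way of splitting the string into a left part and a right part, compute the probability that the two parts occur together divided by the product of the probabilities of each part occurring on its own, then take the minimum over all splits. The larger this value, the higher the internal cohesion of the string and the more likely it is to be a word.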
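The two scores defined above can be illustrated with a minimal sketch. The class name NeighborScoreSketch, its method names, and the sample counts are assumptions made for illustration and are not part of the article's code; the entropy uses the natural logarithm.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper class, not part of the article's sources.
class NeighborScoreSketch {

    // Shannon entropy (natural log) of a neighbor-count distribution.
    static double entropy(Map<Character, Integer> neighborCounts) {
        int total = 0;
        for (int count : neighborCounts.values()) total += count;
        if (total == 0) return 0.0;
        double h = 0.0;
        for (int count : neighborCounts.values()) {
            double p = (double) count / total;
            h -= p * Math.log(p);
        }
        return h;
    }

    // Minimum over all split points of P(word) / (P(left) * P(right)),
    // where each probability is a term frequency divided by totalChars.
    // Assumes tf contains the word and every prefix and suffix of it.
    static double cohesion(String word, Map<String, Integer> tf, double totalChars) {
        if (word.length() <= 1) return 0.0;
        double pWord = tf.get(word) / totalChars;
        double min = Double.MAX_VALUE;
        for (int pos = 1; pos < word.length(); pos++) {
            double pLeft = tf.get(word.substring(0, pos)) / totalChars;
            double pRight = tf.get(word.substring(pos)) / totalChars;
            min = Math.min(min, pWord / (pLeft * pRight));
        }
        return min;
    }

    public static void main(String[] args) {
        // Made-up counts: two distinct right neighbors, seen twice each.
        Map<Character, Integer> rightNeighbors = new HashMap<>();
        rightNeighbors.put('x', 2);
        rightNeighbors.put('y', 2);
        System.out.println("right entropy = " + entropy(rightNeighbors));

        // Made-up term frequencies over a text of 10 characters.
        Map<String, Integer> tf = new HashMap<>();
        tf.put("ab", 2);
        tf.put("a", 4);
        tf.put("b", 2);
        System.out.println("cohesion = " + cohesion("ab", tf, 10.0));
    }
}
```

A string whose neighbors are spread evenly gets a higher entropy than one that is almost always followed by the same character, even when both have the same number of distinct neighbors.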

Algorithm steps:

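1. Read the input file line by line and split each line into strings on runs of non-Chinese characters ([^\u4E00-\u9FA5]+) and on a set of common stop characters. The code is:
String[] phrases = line.split("[^\u4E00-\u9FA5]+|[" + stopwords + "]");
The stop characters can be modified.
2. Take every right substring (suffix) and every left substring (suffix of the reversed string) of each split string and add them to the right and left PTable respectively.
3. Sort each PTable and compute its LTable. For each entry in the sorted PTable, the LTable records how many leading characters it shares with the previous entry.
4. Traverse the PTable and LTable to obtain the term frequency and the left and right neighbors of every substring.
5. From the term frequency and the left and right neighbors of every substring, output each string's term frequency, number of left and right neighbors, left and right entropy, and mutual information.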
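Steps 1 to 4 can be sketched on a tiny input as follows. This is a simplified illustration that builds only the right PTable and counts frequencies, without neighbors. The class name PTableSketch, the method names, and the sample line are assumptions, and an ASCII character class stands in for the Chinese one so that the sample is easy to check by hand.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of the article's sources.
class PTableSketch {

    // Length of the common prefix of two strings.
    static int commonPrefixLength(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) i++;
        return i;
    }

    // Counts every substring of length <= maxLen in the given phrases.
    static Map<String, Integer> countSubstrings(List<String> phrases, int maxLen) {
        // Step 2: collect every suffix of every phrase (the right PTable).
        List<String> pTable = new ArrayList<>();
        for (String phrase : phrases) {
            for (int i = 0; i < phrase.length(); i++) {
                pTable.add(phrase.substring(i));
            }
        }

        // Step 3: sort, then record the common-prefix length with the previous entry.
        Collections.sort(pTable);
        int[] lTable = new int[pTable.size()];
        for (int i = 1; i < pTable.size(); i++) {
            lTable[i] = commonPrefixLength(pTable.get(i - 1), pTable.get(i));
        }

        // Step 4: a prefix of length len first appears at index p only when
        // len > lTable[p]; its frequency is 1 plus the length of the run of
        // following entries whose lTable value is at least len.
        Map<String, Integer> tf = new TreeMap<>();
        for (int p = 0; p < pTable.size(); p++) {
            String suffix = pTable.get(p);
            for (int len = lTable[p] + 1; len <= maxLen && len <= suffix.length(); len++) {
                int count = 1;
                for (int q = p + 1; q < lTable.length && lTable[q] >= len; q++) count++;
                tf.put(suffix.substring(0, len), count);
            }
        }
        return tf;
    }

    public static void main(String[] args) {
        // Step 1: split on runs of characters outside the alphabet of interest
        // (lowercase ASCII here, as a stand-in for the Chinese character class).
        String line = "abab, baba";
        String[] phrases = line.split("[^a-z]+");
        System.out.println(countSubstrings(Arrays.asList(phrases), 3));
    }
}
```

Because the suffixes are sorted, all occurrences of a substring sit next to each other, so one pass over the LTable is enough to count them without comparing strings again.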

1. NagaoAlgorithm.java

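package com.algo.word;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NagaoAlgorithm {

    // Maximum substring length (n-gram size) to count.
    private int N;

    private List<String> leftPTable;
    private int[] leftLTable;
    private List<String> rightPTable;
    private int[] rightLTable;
    // Total number of characters added to the PTable.
    private double wordNumber;

    private Map<String, TFNeighbor> wordTFNeighbor;

    // Stop characters used as split points; edit this list as needed.
    private static final String stopwords = "的很了么呢是嘛个都也比还这于不与才上用就好在和对挺去后没说";

    private NagaoAlgorithm() {
        // Default N = 5.
        N = 5;
        leftPTable = new ArrayList<String>();
        rightPTable = new ArrayList<String>();
        wordTFNeighbor = new HashMap<String, TFNeighbor>();
    }

    // Reverse a phrase.
    private String reverse(String phrase) {
        StringBuilder reversePhrase = new StringBuilder();
        for (int i = phrase.length() - 1; i >= 0; i--)
            reversePhrase.append(phrase.charAt(i));
        return reversePhrase.toString();
    }

    // Length of the common prefix of s1 and s2.
    private int coPrefixLength(String s1, String s2) {
        int coPrefixLength = 0;
        for (int i = 0; i < Math.min(s1.length(), s2.length()); i++) {
            if (s1.charAt(i) == s2.charAt(i)) coPrefixLength++;
            else break;
        }
        return coPrefixLength;
    }

    // Add the substrings of a line to the PTables.
    private void addToPTable(String line) {
        // Split the line on runs of non-Chinese characters and on stop characters.
        String[] phrases = line.split("[^\u4E00-\u9FA5]+|[" + stopwords + "]");
        for (String phrase : phrases) {
            for (int i = 0; i < phrase.length(); i++)
                rightPTable.add(phrase.substring(i));
            String reversePhrase = reverse(phrase);
            for (int i = 0; i < reversePhrase.length(); i++)
                leftPTable.add(reversePhrase.substring(i));
            wordNumber += phrase.length();
        }
    }

    // Sort the PTables and count the LTables.
    private void countLTable() {
        Collections.sort(rightPTable);
        rightLTable = new int[rightPTable.size()];
        for (int i = 1; i < rightPTable.size(); i++)
            rightLTable[i] = coPrefixLength(rightPTable.get(i - 1), rightPTable.get(i));

        Collections.sort(leftPTable);
        leftLTable = new int[leftPTable.size()];
        for (int i = 1; i < leftPTable.size(); i++)
            leftLTable[i] = coPrefixLength(leftPTable.get(i - 1), leftPTable.get(i));

        System.out.println("Info: [Nagao Algorithm Step 2]: having sorted PTable and counted left and right LTable");
    }

    // From the PTables and LTables, count the statistics: TF and neighbor distribution.
    private void countTFNeighbor() {
        // Get TF and right neighbors.
        for (int pIndex = 0; pIndex < rightPTable.size(); pIndex++) {
            String phrase = rightPTable.get(pIndex);
            for (int length = 1 + rightLTable[pIndex]; length <= N && length <= phrase.length(); length++) {
                String word = phrase.substring(0, length);
                TFNeighbor tfNeighbor = new TFNeighbor();
                tfNeighbor.incrementTF();
                if (phrase.length() > length)
                    tfNeighbor.addToRightNeighbor(phrase.charAt(length));
                for (int lIndex = pIndex + 1; lIndex < rightLTable.length; lIndex++) {
                    if (rightLTable[lIndex] >= length) {
                        tfNeighbor.incrementTF();
                        String coPhrase = rightPTable.get(lIndex);
                        if (coPhrase.length() > length)
                            tfNeighbor.addToRightNeighbor(coPhrase.charAt(length));
                    } else break;
                }
                wordTFNeighbor.put(word, tfNeighbor);
            }
        }
        // Get left neighbors.
        for (int pIndex = 0; pIndex < leftPTable.size(); pIndex++) {
            String phrase = leftPTable.get(pIndex);
            for (int length = 1 + leftLTable[pIndex]; length <= N && length <= phrase.length(); length++) {
                String word = reverse(phrase.substring(0, length));
                TFNeighbor tfNeighbor = wordTFNeighbor.get(word);
                if (phrase.length() > length)
                    tfNeighbor.addToLeftNeighbor(phrase.charAt(length));
                for (int lIndex = pIndex + 1; lIndex < leftLTable.length; lIndex++) {
                    if (leftLTable[lIndex] >= length) {
                        String coPhrase = leftPTable.get(lIndex);
                        if (coPhrase.length() > length)
                            tfNeighbor.addToLeftNeighbor(coPhrase.charAt(length));
                    } else break;
                }
            }
        }
        System.out.println("Info: [Nagao Algorithm Step 3]: having counted TF and Neighbor");
    }

    // From wordTFNeighbor, count the MI of a word.
    private double countMI(String word) {
        if (word.length() <= 1) return 0;
        double coProbability = wordTFNeighbor.get(word).getTF() / wordNumber;
        List<Double> mi = new ArrayList<Double>(word.length());
        for (int pos = 1; pos < word.length(); pos++) {
            String leftPart = word.substring(0, pos);
            String rightPart = word.substring(pos);
            double leftProbability = wordTFNeighbor.get(leftPart).getTF() / wordNumber;
            double rightProbability = wordTFNeighbor.get(rightPart).getTF() / wordNumber;
            mi.add(coProbability / (leftProbability * rightProbability));
        }
        return Collections.min(mi);
    }

    // Save TF, (left and right) neighbor number, neighbor entropy, and mutual information.
    private void saveTFNeighborInfoMI(String out, String stopList, String[] threshold) {
        Set<String> stopWords = new HashSet<String>();
        // Read the stop-word file, one word per line.
        try (BufferedReader br = new BufferedReader(new FileReader(stopList))) {
            String line;
            while ((line = br.readLine()) != null) {
                if (line.length() > 1)
                    stopWords.add(line);
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        // Output each word's TF, neighbor info, and MI.
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(out))) {
            for (Map.Entry<String, TFNeighbor> entry : wordTFNeighbor.entrySet()) {
                if (entry.getKey().length() <= 1 || stopWords.contains(entry.getKey())) continue;
                TFNeighbor tfNeighbor = entry.getValue();

                int tf = tfNeighbor.getTF();
                int leftNeighborNumber = tfNeighbor.getLeftNeighborNumber();
                int rightNeighborNumber = tfNeighbor.getRightNeighborNumber();
                double mi = countMI(entry.getKey());
                if (tf > Integer.parseInt(threshold[0]) && leftNeighborNumber > Integer.parseInt(threshold[1])
                        && rightNeighborNumber > Integer.parseInt(threshold[2]) && mi > Integer.parseInt(threshold[3])) {
                    StringBuilder sb = new StringBuilder();
                    sb.append(entry.getKey());
                    sb.append(",").append(tf);
                    sb.append(",").append(leftNeighborNumber);
                    sb.append(",").append(rightNeighborNumber);
                    sb.append(",").append(tfNeighbor.getLeftNeighborEntropy());
                    sb.append(",").append(tfNeighbor.getRightNeighborEntropy());
                    sb.append(",").append(mi).append("\n");
                    bw.write(sb.toString());
                }
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        System.out.println("Info: [Nagao Algorithm Step 4]: having saved to file");
    }

    // Apply the Nagao algorithm to the input files with the defaults N = 5 and filter "20,3,3,5".
    public static void applyNagao(String[] inputs, String out, String stopList) {
        applyNagao(inputs, out, stopList, 5, "20,3,3,5");
    }

    public static void applyNagao(String[] inputs, String out, String stopList, int n, String filter) {
        NagaoAlgorithm nagao = new NagaoAlgorithm();
        nagao.setN(n);
        String[] threshold = filter.split(",");
        if (threshold.length != 4) {
            System.out.println("ERROR: filter must have 4 numbers, separated with ','");
            return;
        }
        // Step 1: add phrases to the PTable.
        for (String in : inputs) {
            try (BufferedReader br = new BufferedReader(new FileReader(in))) {
                String line;
                while ((line = br.readLine()) != null) {
                    nagao.addToPTable(line);
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }
        System.out.println("Info: [Nagao Algorithm Step 1]: having added all left and right substrings to PTable");
        // Step 2: sort the PTable and count the LTable.
        nagao.countLTable();
        // Step 3: count TF and neighbors.
        nagao.countTFNeighbor();
        // Step 4: save TF, neighbor info, and MI.
        nagao.saveTFNeighborInfoMI(out, stopList, threshold);
    }

    private void setN(int n) {
        N = n;
    }

    public static void main(String[] args) {
        String[] ins = {"E://test/ganfen.txt"};
        applyNagao(ins, "E://test/out.txt", "E://test/stoplist.txt");
    }
}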

2. TFNeighbor.java

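package com.algo.word;

import java.util.HashMap;
import java.util.Map;

public class TFNeighbor {

    private int tf;
    private Map<Character, Integer> leftNeighbor;
    private Map<Character, Integer> rightNeighbor;

    TFNeighbor() {
        leftNeighbor = new HashMap<Character, Integer>();
        rightNeighbor = new HashMap<Character, Integer>();
    }

    // Add a character to the left-neighbor counts.
    public void addToLeftNeighbor(char word) {
        leftNeighbor.put(word, 1 + leftNeighbor.getOrDefault(word, 0));
    }

    // Add a character to the right-neighbor counts.
    public void addToRightNeighbor(char word) {
        rightNeighbor.put(word, 1 + rightNeighbor.getOrDefault(word, 0));
    }

    // Increment the term frequency.
    public void incrementTF() {
        tf++;
    }

    public int getLeftNeighborNumber() {
        return leftNeighbor.size();
    }

    public int getRightNeighborNumber() {
        return rightNeighbor.size();
    }

    public double getLeftNeighborEntropy() {
        return entropy(leftNeighbor);
    }

    public double getRightNeighborEntropy() {
        return entropy(rightNeighbor);
    }

    public int getTF() {
        return tf;
    }

    // Entropy of a neighbor distribution: log(sum) - (sum of n * log(n)) / sum.
    private static double entropy(Map<Character, Integer> neighbor) {
        double entropy = 0;
        int sum = 0;
        for (int number : neighbor.values()) {
            entropy += number * Math.log(number);
            sum += number;
        }
        if (sum == 0) return 0;
        return Math.log(sum) - entropy / sum;
    }
}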

3. Main.java

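package com.algo.word;

public class Main {

    public static void main(String[] args) {
        // With 3 arguments:
        // the first argument is the list of input files, separated with ','
        // the second argument is the output file, which has 7 columns separated with ',':
        // word, term frequency, left neighbor number, right neighbor number,
        // left neighbor entropy, right neighbor entropy, mutual information
        // the third argument is the stop-word list
        if (args.length == 3)
            NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2]);

        // With 5 arguments:
        // the fourth argument is the n-gram parameter N
        // the fifth argument is the threshold for output words, default "20,3,3,5":
        // output TF > 20 && left neighbor number > 3 && right neighbor number > 3 && MI > 5
        else if (args.length == 5)
            NagaoAlgorithm.applyNagao(args[0].split(","), args[1], args[2], Integer.parseInt(args[3]), args[4]);
    }
}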
