similarity, compute similarity score between text strings, Java written.
similarity,相似度計算工具包,可用於文本相似度計算、情感傾向分析等,Java編寫。
similarity是由一系列演算法組成的Java版相似度計算工具包,目標是傳播自然語言處理中相似度計算方法。 similarity具備工具實用、效能高效、架構清晰、語料時新、可自訂的特性。
similarity提供下列功能:
詞語相似度計算
短語相似度計算
句子相似度計算
段落相似度計算
知網義原
情緒分析
近似詞
在提供豐富功能的同時, similarity內部模組堅持低耦合、模型堅持惰性加載、字典堅持明文發布,使用方便,幫助用戶訓練自己的語料。
引入Jar包
< repositories >
< repository >
< id >jitpack.io</ id >
< url >https://jitpack.io</ url >
</ repository >
</ repositories >
< dependency >
< groupId >com.github.shibing624</ groupId >
< artifactId >similarity</ artifactId >
< version >1.1.6</ version >
</ dependency >
gradle的引進:
import org . xm . Similarity ;
import org . xm . tendency . word . HownetWordTendency ;
public class demo {
public static void main ( String [] args ) {
double result = Similarity . cilinSimilarity ( "电动车" , "自行车" );
System . out . println ( result );
String word = "混蛋" ;
HownetWordTendency hownetWordTendency = new HownetWordTendency ();
result = hownetWordTendency . getTendency ( word );
System . out . println ( word + " 词语情感趋势值:" + result );
}
}
文字長度:詞語粒度
建議使用詞林相似度: org.xm.Similarity.cilinSimilarity
,是基於同義詞詞林的相似度計算方法
example: src/test/java/org.xm/WordSimilarityDemo.java
package org . xm ;
public class WordSimilarityDemo {
public static void main ( String [] args ) {
String word1 = "教师" ;
String word2 = "教授" ;
double cilinSimilarityResult = Similarity . cilinSimilarity ( word1 , word2 );
double pinyinSimilarityResult = Similarity . pinyinSimilarity ( word1 , word2 );
double conceptSimilarityResult = Similarity . conceptSimilarity ( word1 , word2 );
double charBasedSimilarityResult = Similarity . charBasedSimilarity ( word1 , word2 );
System . out . println ( word1 + " vs " + word2 + " 词林相似度值:" + cilinSimilarityResult );
System . out . println ( word1 + " vs " + word2 + " 拼音相似度值:" + pinyinSimilarityResult );
System . out . println ( word1 + " vs " + word2 + " 概念相似度值:" + conceptSimilarityResult );
System . out . println ( word1 + " vs " + word2 + " 字面相似度值:" + charBasedSimilarityResult );
}
}
文字長度:短語粒度
推薦使用短語相似度: org.xm.Similarity.phraseSimilarity
,本質是透過兩個短語具有的相同字符,和相同字符的位置計算其相似度的方法
example: src/test/java/org.xm/PhraseSimilarityDemo.java
public static void main ( String [] args ) {
String phrase1 = "继续努力" ;
String phrase2 = "持续发展" ;
double result = Similarity . phraseSimilarity ( phrase1 , phrase2 );
System . out . println ( phrase1 + " vs " + phrase2 + " 短语相似度值:" + result );
}
文字長度:句子粒度
推薦使用詞形詞序句子相似度: org.xm.similarity.morphoSimilarity
,一種既考慮兩個句子相同文本字面,也考慮相同文本出現的前後順序的相似度方法
example: src/test/java/org.xm/SentenceSimilarityDemo.java
public static void main ( String [] args ) {
String sentence1 = "中国人爱吃鱼" ;
String sentence2 = "湖北佬最喜吃鱼" ;
double morphoSimilarityResult = Similarity . morphoSimilarity ( sentence1 , sentence2 );
double editDistanceResult = Similarity . editDistanceSimilarity ( sentence1 , sentence2 );
double standEditDistanceResult = Similarity . standardEditDistanceSimilarity ( sentence1 , sentence2 );
double gregeorEditDistanceResult = Similarity . gregorEditDistanceSimilarity ( sentence1 , sentence2 );
System . out . println ( sentence1 + " vs " + sentence2 + " 词形词序句子相似度值:" + morphoSimilarityResult );
System . out . println ( sentence1 + " vs " + sentence2 + " 优化的编辑距离句子相似度值:" + editDistanceResult );
System . out . println ( sentence1 + " vs " + sentence2 + " 标准编辑距离句子相似度值:" + standEditDistanceResult );
System . out . println ( sentence1 + " vs " + sentence2 + " gregeor编辑距离句子相似度值:" + gregeorEditDistanceResult );
}
文字長度:段落粒度(一段話,25字元< length(text) < 500字元)
建議使用詞形詞序句子相似度: org.xm.similarity.text.CosineSimilarity
,一種考慮兩個段落中相同的文本,經過切詞,詞頻和詞性權重加權,並用餘弦計算相似度的方法
example: src/test/java/org.xm/similarity/text/CosineSimilarityTest.java
@ Test
public void getSimilarityScore () throws Exception {
String text1 = "对于俄罗斯来说,最大的战果莫过于夺取乌克兰首都基辅,也就是现任总统泽连斯基和他政府的所在地。目前夺取基辅的战斗已经打响。" ;
String text2 = "迄今为止,俄罗斯的入侵似乎没有完全按计划成功执行——英国国防部情报部门表示,在乌克兰军队激烈抵抗下,俄罗斯军队已经损失数以百计的士兵。尽管如此,俄军在继续推进。" ;
TextSimilarity cosSimilarity = new CosineSimilarity ();
double score1 = cosSimilarity . getSimilarity ( text1 , text2 );
System . out . println ( "cos相似度分值:" + score1 );
TextSimilarity editSimilarity = new EditDistanceSimilarity ();
double score2 = editSimilarity . getSimilarity ( text1 , text2 );
System . out . println ( "edit相似度分值:" + score2 );
}
cos相似度分值:0.399143
edit相似度分值:0.0875
example: src/test/java/org/xm/tendency/word/HownetWordTendencyTest.java
@ Test
public void getTendency () throws Exception {
HownetWordTendency hownet = new HownetWordTendency ();
String word = "美好" ;
double sim = hownet . getTendency ( word );
System . out . println ( word + ":" + sim );
System . out . println ( "混蛋:" + hownet . getTendency ( "混蛋" ));
}
本例是基於義原樹的詞語粒度情感極性分析,關於文本情緒分析有pytextclassifier,利用深度神經網路模型、SVM分類演算法實現的效果較好。
example: src/test/java/org/xm/word2vec/Word2vecTest.java
@ Test
public void testHomoionym () throws Exception {
List < String > result = Word2vec . getHomoionym ( RAW_CORPUS_SPLIT_MODEL , "武功" , 10 );
System . out . println ( "武功 近似词:" + result );
}
@ Test
public void testHomoionymName () throws Exception {
String model = RAW_CORPUS_SPLIT_MODEL ;
List < String > result = Word2vec . getHomoionym ( model , "乔帮主" , 10 );
System . out . println ( "乔帮主 近似词:" + result );
List < String > result2 = Word2vec . getHomoionym ( model , "阿朱" , 10 );
System . out . println ( "阿朱 近似词:" + result2 );
List < String > result3 = Word2vec . getHomoionym ( model , "少林寺" , 10 );
System . out . println ( "少林寺 近似词:" + result3 );
}
Word2vec詞向量訓練用的java版word2vec訓練工具Word2VEC_java,訓練語料是小說天龍八部,透過詞向量實現得到近義詞。 使用者可以訓練自訂語料,也可以用中文維基百科訓練通用詞向量。
文本相似性度量
授權協議為The Apache License 2.0,可免費用做商業用途。請在產品說明中附加similarity的連結和授權協議。
項目代碼還很粗糙,如果大家對程式碼有所改進,歡迎提交回本項目,在提交之前,請注意以下兩點:
test
中加入對應的單元測試之後即可提交PR。