gaoya
v0.2.0
This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.
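For background, MinHash estimates the Jaccard similarity of two token sets: each set is reduced to a signature of per-hash-function minima, and the fraction of positions where two signatures agree estimates the Jaccard similarity. The toy sketch below is plain Python and independent of gaoya's own implementation; the example that follows does the same end to end with gaoya's MinHashStringIndex.

import random

# Toy MinHash, for illustration only (gaoya's real implementation lives in the Rust crate).
# One signature entry per seed: the minimum hash of any token under that seed.
def minhash_signature(tokens, seeds):
    return [min(hash((seed, t)) for t in tokens) for seed in seeds]

# The fraction of agreeing positions estimates the Jaccard similarity of the token sets.
def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(42)
seeds = [random.getrandbits(32) for _ in range(256)]
a = set('this is the first document'.split())
b = set('this document is the second document'.split())
print(estimated_jaccard(minhash_signature(a, seeds), minhash_signature(b, seeds)))
# True Jaccard of the two token sets is 4/6 ~= 0.67; the printed estimate should be close.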
>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
                                             jaccard_threshold=0.5,
                                             num_bands=42,
                                             band_size=3,
                                             num_hashes=42*3,
                                             analyzer='word',
                                             lowercase=True,
                                             ngram_range=(1, 1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>>
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>>
Install the Python package from PyPI:
$ pip3 install gaoya
Document deduplication with Gaoya in Rust
use gaoya::minhash::{MinHashIndex, MinHasher, MinHasher32};
use gaoya::text::whitespace_split;
use fxhash::FxHashSet;

let corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third document.",
    "Is this the first document?",
    "This not the first nor the second nor the third, but the fourth document",
];
// 42 bands of width 3 gives 126 hashes per signature.
let (num_bands, band_width) = (42, 3);
let minhasher = MinHasher32::new(num_bands * band_width);
let mut index = MinHashIndex::new(num_bands, band_width, 0.5);

// Index every document under its position in the corpus.
for (i, doc) in corpus.iter().enumerate() {
    index.insert(i, minhasher.create_signature(whitespace_split(&doc.to_lowercase())));
}

// The first four documents are near-duplicates of each other;
// the fifth document only matches itself.
for (i, doc) in corpus.iter().enumerate() {
    let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
    if i < 4 {
        let mut expected = FxHashSet::default();
        expected.extend(vec![0, 1, 2, 3].into_iter());
        assert_eq!(index.query_owned(&signature), expected);
    } else {
        let mut expected = FxHashSet::default();
        expected.insert(4);
        assert_eq!(index.query_owned(&signature), expected);
    }
}
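Since deduplication is the primary use case, here is a minimal Python sketch of streaming near-duplicate filtering built only from the insert_document and query calls shown in the Python example above; the deduplicate helper and its parameter values are illustrative, not part of gaoya's API.

import gaoya

# Illustrative helper (not part of gaoya's API): keep a document only if the
# index returns no near-duplicate for it, then add it to the index.
def deduplicate(docs, jaccard_threshold=0.5):
    index = gaoya.minhash.MinHashStringIndex(
        hash_size=32, jaccard_threshold=jaccard_threshold,
        num_bands=42, band_size=3, num_hashes=42 * 3,
        analyzer='word', lowercase=True, ngram_range=(1, 1))
    unique = []
    for doc_id, doc in enumerate(docs):
        if index.query(doc):        # non-empty result: a near-duplicate was already kept
            continue
        index.insert_document(doc_id, doc)
        unique.append(doc)
    return unique

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'Is this the first document?',
]
print(deduplicate(corpus))          # later near-duplicates of kept documents are dropped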