gaoya
v0.2.0
This project implements Locality Sensitive Hashing algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.
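For background, MinHash estimates the Jaccard similarity of two token sets: each set is reduced to a signature of per-hash-function minima, and the fraction of positions where two signatures agree estimates the Jaccard similarity. The toy sketch below is plain Python and independent of gaoya's own implementation; the example that follows does the same end to end with gaoya's MinHashStringIndex.

import random

# Toy MinHash, for illustration only (gaoya's real implementation lives in the Rust crate).
# One signature entry per seed: the minimum hash of any token under that seed.
def minhash_signature(tokens, seeds):
    return [min(hash((seed, t)) for t in tokens) for seed in seeds]

# The fraction of agreeing positions estimates the Jaccard similarity of the token sets.
def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

random.seed(42)
seeds = [random.getrandbits(32) for _ in range(256)]
a = set('this is the first document'.split())
b = set('this document is the second document'.split())
print(estimated_jaccard(minhash_signature(a, seeds), minhash_signature(b, seeds)))
# True Jaccard of the two token sets is 4/6 ~= 0.67; the printed estimate should be close.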
>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
                                             jaccard_threshold=0.5,
                                             num_bands=42,
                                             band_size=3,
                                             num_hashes=42*3,
                                             analyzer='word',
                                             lowercase=True,
                                             ngram_range=(1, 1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>>
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>>
Install the Python package from PyPI:
$ pip3 install gaoya
Document deduplication with Gaoya in Rust
use gaoya::minhash::{MinHashIndex, MinHasher, MinHasher32};
use gaoya::text::whitespace_split;
use fxhash::FxHashSet;

let corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third document.",
    "Is this the first document?",
    "This not the first nor the second nor the third, but the fourth document",
];
// 42 bands of width 3 gives 126 hashes per signature.
let (num_bands, band_width) = (42, 3);
let minhasher = MinHasher32::new(num_bands * band_width);
let mut index = MinHashIndex::new(num_bands, band_width, 0.5);

// Index every document under its position in the corpus.
for (i, doc) in corpus.iter().enumerate() {
    index.insert(i, minhasher.create_signature(whitespace_split(&doc.to_lowercase())));
}

// The first four documents are near-duplicates of each other;
// the fifth document only matches itself.
for (i, doc) in corpus.iter().enumerate() {
    let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
    if i < 4 {
        let mut expected = FxHashSet::default();
        expected.extend(vec![0, 1, 2, 3].into_iter());
        assert_eq!(index.query_owned(&signature), expected);
    } else {
        let mut expected = FxHashSet::default();
        expected.insert(4);
        assert_eq!(index.query_owned(&signature), expected);
    }
}
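Since deduplication is the primary use case, here is a minimal Python sketch of streaming near-duplicate filtering built only from the insert_document and query calls shown in the Python example above; the deduplicate helper and its parameter values are illustrative, not part of gaoya's API.

import gaoya

# Illustrative helper (not part of gaoya's API): keep a document only if the
# index returns no near-duplicate for it, then add it to the index.
def deduplicate(docs, jaccard_threshold=0.5):
    index = gaoya.minhash.MinHashStringIndex(
        hash_size=32, jaccard_threshold=jaccard_threshold,
        num_bands=42, band_size=3, num_hashes=42 * 3,
        analyzer='word', lowercase=True, ngram_range=(1, 1))
    unique = []
    for doc_id, doc in enumerate(docs):
        if index.query(doc):        # non-empty result: a near-duplicate was already kept
            continue
        index.insert_document(doc_id, doc)
        unique.append(doc)
    return unique

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'Is this the first document?',
]
print(deduplicate(corpus))          # later near-duplicates of kept documents are dropped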