gaoya
v0.2.0
This project implements locality-sensitive hashing (LSH) algorithms and data structures for indexing and querying text documents. The primary use cases for Gaoya are deduplication and clustering.
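To make the idea concrete, here is a minimal pure-Python sketch of MinHash, the kind of signature that LSH indexes for text are commonly built on. It does not use Gaoya and is not Gaoya's implementation: the helper names (`token_hash`, `minhash_signature`, `estimate_jaccard`) and the choice of salted BLAKE2b as the hash family are assumptions made for illustration only.

```python
import hashlib

def token_hash(token: str, seed: int) -> int:
    """64-bit hash of a token; each seed stands in for one hash function."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")

def minhash_signature(tokens, num_hashes=128):
    """For each hash function keep the minimum hash over the token set.

    Assumes the token collection is non-empty.
    """
    unique = set(tokens)
    return [min(token_hash(t, seed) for t in unique) for seed in range(num_hashes)]

def estimate_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

def jaccard(a, b):
    """Exact Jaccard similarity of two token collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

doc_a = "this is the first document".split()
doc_b = "this document is the second document".split()
sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)

print(f"exact:     {jaccard(doc_a, doc_b):.3f}")
print(f"estimated: {estimate_jaccard(sig_a, sig_b):.3f}")
```

The estimate gets closer to the exact value as `num_hashes` grows, at the cost of longer signatures.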
>>> import gaoya
>>> index = gaoya.minhash.MinHashStringIndex(hash_size=32,
...                                          jaccard_threshold=0.5,
...                                          num_bands=42,
...                                          band_size=3,
...                                          num_hashes=42 * 3,
...                                          analyzer='word',
...                                          lowercase=True,
...                                          ngram_range=(1, 1))
>>> corpus = [
...     'This is the first document.',
...     'This document is the second document.',
...     'And this is the third document.',
...     'Is this the first document?',
...     'This not the first nor the second nor the third, but the fourth document'
... ]
>>>
>>> for i, doc in enumerate(corpus): index.insert_document(i, doc)
...
>>> index.query('This is the first document.')
[0, 1, 2, 3]
>>>
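As a sanity check on the query result above, the exact Jaccard similarity can be computed without Gaoya. This sketch assumes that `analyzer='word'` with `lowercase=True` behaves roughly like lowercasing and splitting on word characters; the regex tokenizer and the helper names `word_set` and `jaccard` are illustrative assumptions, not Gaoya's tokenizer.

```python
import re

def word_set(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    return len(a & b) / len(a | b)

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third document.',
    'Is this the first document?',
    'This not the first nor the second nor the third, but the fourth document',
]

query = word_set('This is the first document.')
scores = [jaccard(query, word_set(doc)) for doc in corpus]
for i, score in enumerate(scores):
    print(f"doc {i}: {score:.3f}")

# Documents whose exact similarity reaches the 0.5 threshold
matches = [i for i, score in enumerate(scores) if score >= 0.5]
print(matches)  # [0, 1, 2, 3]
```

Under this tokenization the fourth-document sentence scores 4/11 against the query, which is below the 0.5 threshold, so only documents 0 through 3 qualify.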
$ pip3 install gaoya
Document Deduplication with Gaoya
use gaoya::minhash::{MinHashIndex, MinHasher32, MinHasher};
use gaoya::text::whitespace_split;
use fxhash::FxHashSet;

let corpus = [
    "This is the first document.",
    "This document is the second document.",
    "And this is the third document.",
    "Is this the first document?",
    "This not the first nor the second nor the third, but the fourth document"];

// The signature length must equal num_bands * band_width.
let (num_bands, band_width) = (42, 3);
let minhasher = MinHasher32::new(num_bands * band_width);
let mut index = MinHashIndex::new(num_bands, band_width, 0.5);

// Index every document under its position in the corpus.
for (i, doc) in corpus.iter().enumerate() {
    index.insert(i, minhasher.create_signature(whitespace_split(&doc.to_lowercase())));
}

// Query with each document: the first four match one another, the fifth matches only itself.
for (i, doc) in corpus.iter().enumerate() {
    if i < 4 {
        let mut expected = FxHashSet::default();
        expected.extend(vec![0, 1, 2, 3].into_iter());
        let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
        assert_eq!(index.query_owned(&signature), expected);
    } else {
        let mut expected = FxHashSet::default();
        expected.insert(4);
        let signature = minhasher.create_signature(whitespace_split(&doc.to_lowercase()));
        assert_eq!(index.query_owned(&signature), expected);
    }
}
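Both examples use 42 bands of width 3. In the standard banding analysis of MinHash LSH, two documents with Jaccard similarity s share at least one identical band with probability 1 - (1 - s^r)^b, where b is the number of bands and r the band width. The Python sketch below evaluates that textbook formula for these parameters. It assumes Gaoya follows the standard banding scheme and says nothing about any further filtering the library may apply to candidates; `candidate_probability` is a hypothetical helper, not part of Gaoya.

```python
def candidate_probability(similarity, num_bands, band_size):
    """Probability that two signatures agree on every row of at least one band."""
    return 1.0 - (1.0 - similarity ** band_size) ** num_bands

num_bands, band_size = 42, 3

for s in (0.1, 0.25, 0.5, 0.75):
    p = candidate_probability(s, num_bands, band_size)
    print(f"similarity {s:.2f} -> P(candidate) = {p:.3f}")

# Rule-of-thumb location of the steepest part of the S-curve: (1/b)^(1/r)
approx_threshold = (1.0 / num_bands) ** (1.0 / band_size)
print(f"approximate S-curve threshold: {approx_threshold:.3f}")
```

With these settings a pair at similarity 0.5 is almost always retrieved as a candidate, while many pairs well below 0.5 are retrieved too, which is why a separate similarity threshold is useful for pruning candidates.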
[1] Chapter 3, Mining of Massive Datasets
[2] Similarity Estimation Techniques from Rounding Algorithms
[3] Detecting Near-Duplicates for Web Crawling