拼寫校正和模糊搜索:通過對稱刪除拼寫校正算法快100萬倍
對稱刪除拼寫校正算法降低了給定的Damerau-Levenshtein距離的編輯候選生成和字典查找的複雜性。它是六個數量級的速度(比刪除 + transposes +替換 +插入的標準方法和語言無關。
與其他算法相反,僅需要刪除,不需要轉移 +替換 +插入。輸入項的轉置 +替換 +插入件被轉換為字典項的刪除。替換和插入量昂貴,語言依賴性:例如,中文具有70,000個Unicode Han字符!
速度來自廉價的刪除僅編輯候選人的生成和預計。
一個平均5個字母單詞在最大編輯距離為3,大約有300萬個可能的拼寫錯誤,
但是Symspell只需生成25個刪除即可覆蓋所有刪除,無論是在預計還是在查找時間時。魔法!
如果您喜歡Symspell,請嘗試SeekStorm- Rust(開源)中的子毫秒全文搜索庫和多租戶服務器。
Copyright (c) 2022 Wolf Garbe
Version: 6.7.2
Author: Wolf Garbe <[email protected]>
Maintainer: Wolf Garbe <[email protected]>
URL: https://github.com/wolfgarbe/symspell
Description: https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
Copyright (c) 2022 Wolf Garbe
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
https://opensource.org/licenses/MIT
查找提供了非常快速的單詞拼寫校正。
0.033毫秒/單詞(編輯距離2)和0.180毫秒/單詞(編輯距離3)(2012年MacBook Pro上的單核)
1,870倍的速度比BK-Tree快(請參見基準1:字典大小= 500,000,最大編輯距離= 3,與隨機編輯距離的查詢術語= 0 ...最大編輯距離,冗長= 0)
比Norvig的算法快100萬倍(請參閱基準2:字典大小= 29,157,最大編輯距離= 3,固定編輯距離的查詢項=最大編輯距離,冗長= 0)
1000x拼寫校正算法更快
在大數據中與大距離的快速近似字符串匹配
產品名稱,公司名稱和街道名稱的非常快的數據清潔
亞毫米複合自動拼寫校正
Symspell vs. BK-Tree:100倍更快的模糊字符串搜索和拼寫檢查
噪音文本的快速單詞細分
修剪radix trie-類固醇上的radix trie
LookupCompound支持複合意識到的自動拼寫校正多字輸入字符串。
1。複合分裂和分解
Lookup()將每個輸入字符串作為單個項。 LookupCompound還支持三種情況下的複合分裂 /分解:
分裂錯誤,串聯錯誤,替換錯誤,換位錯誤,刪除錯誤和插入錯誤可以通過在同一單詞中混合。
2。自動拼寫校正
示例:
- whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
+ where is the love he had dated for much of the past who couldn't read in sixth grade and inspired him (9 edits)
- in te dhird qarter oflast jear he hadlearned ofca sekretplan
+ in the third quarter of last year he had learned of a secret plan (9 edits)
- the bigjest playrs in te strogsommer film slatew ith plety of funn
+ the biggest players in the strong summer film slate with plenty of fun (9 edits)
- Can yu readthis messa ge despite thehorible sppelingmsitakes
+ can you read this message despite the horrible spelling mistakes (9 edits)
0.2毫秒 /單詞(編輯距離2)5000個單詞 /秒(2012年MacBook Pro上的單核)
詞分割通過在適當的位置插入缺失空間來將字符串劃分為單詞。
示例:
- thequickbrownfoxjumpsoverthelazydog
+ the quick brown fox jumps over the lazy dog
- itwasabrightcolddayinaprilandtheclockswerestrikingthirteen
+ it was a bright cold day in april and the clocks were striking thirteen
- itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness
+ it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
應用程式:
表現:
4毫秒,用於將185個char字符串分為53個單詞(2012年MacBook Pro上的單核)
單詞 + enter:顯示拼寫建議
輸入沒有輸入:終止程序
多個單詞 + enter:顯示拼寫建議
輸入沒有輸入:終止程序
沒有空格的字符串 + enter:顯示單詞段文本
輸入沒有輸入:終止程序
可以使用免費的Visual Studio Code構建演示,DemoCompound和SemengeationDemo項目,該項目可以在Windows,MacOS和Linux上運行。
//create object
int initialCapacity = 82765 ;
int maxEditDistanceDictionary = 2 ; //maximum edit distance per dictionary precalculation
var symSpell = new SymSpell ( initialCapacity , maxEditDistanceDictionary ) ;
//load dictionary
string baseDirectory = AppDomain . CurrentDomain . BaseDirectory ;
string dictionaryPath = baseDirectory + " ../../../../SymSpell/frequency_dictionary_en_82_765.txt " ;
int termIndex = 0 ; //column of the term in the dictionary text file
int countIndex = 1 ; //column of the term frequency in the dictionary text file
if ( ! symSpell . LoadDictionary ( dictionaryPath , termIndex , countIndex ) )
{
Console . WriteLine ( " File not found! " ) ;
//press any key to exit program
Console . ReadKey ( ) ;
return ;
}
//lookup suggestions for single-word input strings
string inputTerm = " house " ;
int maxEditDistanceLookup = 1 ; //max edit distance per lookup (maxEditDistanceLookup<=maxEditDistanceDictionary)
var suggestionVerbosity = SymSpell . Verbosity . Closest ; //Top, Closest, All
var suggestions = symSpell . Lookup ( inputTerm , suggestionVerbosity , maxEditDistanceLookup ) ;
//display suggestions, edit distance and term frequency
foreach ( var suggestion in suggestions )
{
Console . WriteLine ( suggestion . term + " " + suggestion . distance . ToString ( ) + " " + suggestion . count . ToString ( " N0 " ) ) ;
}
//load bigram dictionary
string dictionaryPath = baseDirectory + " ../../../../SymSpell/frequency_bigramdictionary_en_243_342.txt " ;
int termIndex = 0 ; //column of the term in the dictionary text file
int countIndex = 2 ; //column of the term frequency in the dictionary text file
if ( ! symSpell . LoadBigramDictionary ( dictionaryPath , termIndex , countIndex ) )
{
Console . WriteLine ( " File not found! " ) ;
//press any key to exit program
Console . ReadKey ( ) ;
return ;
}
//lookup suggestions for multi-word input strings (supports compound splitting & merging)
inputTerm = " whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixtgrade and ins pired him " ;
maxEditDistanceLookup = 2 ; //max edit distance per lookup (per single word, not per whole input string)
suggestions = symSpell . LookupCompound ( inputTerm , maxEditDistanceLookup ) ;
//display suggestions, edit distance and term frequency
foreach ( var suggestion in suggestions )
{
Console . WriteLine ( suggestion . term + " " + suggestion . distance . ToString ( ) + " " + suggestion . count . ToString ( " N0 " ) ) ;
}
//word segmentation and correction for multi-word input strings with/without spaces
inputTerm = " thequickbrownfoxjumpsoverthelazydog " ;
maxEditDistance = 0 ;
suggestion = symSpell . WordSegmentation ( input ) ;
//display term and edit distance
Console . WriteLine ( suggestion . correctedString + " " + suggestion . distanceSum . ToString ( " N0 " ) ) ;
//press any key to exit program
Console . ReadKey ( ) ;
Symspell Targets .NET標準v2.0,可以使用:
可以使用免費的Visual Studio代碼構建Symspell,Demo,Demopompound和Benchmark項目,該項目可以在Windows,MacOS和Linux上運行。
字典質量對於校正質量至關重要。為了實現這兩個數據源,通過交集結合了:Google Books Ngram數據,提供了代表性的單詞頻率(但包含許多帶有拼寫錯誤的條目)和皺眉- 拼寫面向檢查器的單詞列表可確保真正的英語詞彙(但不包含Word頻率在相同的編輯距離內的建議排名所必需的)。
通過與下面提到的兩個列表相交而創建的頻率_dictionary_en_82_765.txt。通過相互濾波僅使用出現在兩個列表中的那些單詞。應用了其他過濾器,並將結果列表截斷為≈80,000個最常見的單詞。
您可以為您的語言或專門的技術領域構建自己的頻率詞典。 Symspell拼寫校正算法支持具有非拉丁字符的語言,例如西里爾,中文或喬治亞語。
Symspell包括英文頻率詞典
中文,英語,法語,德語,希伯來語,意大利語,俄語和西班牙語的字典位於這裡:
symspell.frequencyDictionary
許多其他語言中的頻率字典可以在此處找到:
頻率存儲庫
頻率詞典
頻率詞典
C# (原始源代碼)
https://github.com/wolfgarbe/symspell
.NET (Nuget軟件包)
https://www.nuget.org/packages/symspell
以下第三方端口或對其他編程語言的重新實現尚未由我自己測試,無論它們是確切的端口,沒有錯誤,提供相同的結果還是與原始算法一樣快。
大多數端口目標Symspell版本3.0 。但是版本6.1。提供更高的速度和較低的內存消耗!
WebAssembly
https://github.com/justinwilaby/spellchecker-wasm
Web API(Docker)
https://github.com/leonerath/symspellapi(版本6.3)
C ++
https://github.com/athes21/symspellcpp(版本6.5)
https://github.com/erhanbaris/symspellplusplus(版本6.1)
水晶
https://github.com/chenkovsky/aha/blob/master/src/src/aha/sym_spell.cr
去
https://github.com/sajari/fuzzy
https://github.com/eskriett/spell
哈斯克爾
https://github.com/cbeav/symspell
爪哇
https://github.com/mightguy/customized-symspell(版本6.6)
https://github.com/rxp90/jsymspell(版本6.6)
https://github.com/lundez/javasymspell(版本6.4)
https://github.com/rxp90/jsymspell
https://github.com/gpranav88/symspell
https://github.com/searchhub/predict
https://github.com/jpsingarayar/spellblaze
JavaScript
https://github.com/mathieuloutre/node-symspell(版本6.6,需要node.js)
https://github.com/itslenny/symspell.js
https://github.com/dongyuwei/symspell
https://github.com/icecreamyou/symspell
https://github.com/yomguithereal/mnemonist/blob/master/symspell.js
朱莉婭
https://github.com/arkoniak/symspell.jl
科特林
https://github.com/wavesonics/symspellkt
Objective-C
https://github.com/amitbhavsariphone/symspell(版本6.3)
Python
https://github.com/mammothb/symspellpy(版本6.7)
https://github.com/viig99/symspellcpppy(版本6.5)
https://github.com/zoho-labs/symspell(Rust版本的Python綁定)
https://github.com/ne3x7/pysymspell/(版本6.1)
https://github.com/ayyuriss/symspell
https://github.com/ppgmg/github_public/blob/master/spell/symspell_python.py
https://github.com/rcourivaud/symspellcompound
https://github.com/esukhia/sympound-python
https://www.kaggle.com/yk1598/symspell-spell-corrator
紅寶石
https://github.com/philt/symspell
銹
https://github.com/reneklacan/symspell(版本6.6,編譯到WebAssembly)
https://github.com/luketpeterson/fuzzy_rocks(由rocksdb支持的持久數據存儲)
Scala
https://github.com/semkath/symspell
迅速
https://github.com/gdetari/symspellswift
用於用戶查詢的上下文多語言拼寫檢查器
Sanat Sharma,Josep Valls-Vargas,Tracy Holloway King,Francois Guerin,Chirag Arora(Adobe)
https://arxiv.org/abs/2305.01082
上下文敏感的實時拼寫檢查器,具有語言適應性
Prabhakar Gupta(亞馬遜)
https://arxiv.org/abs/1910.11242
演講者:德國州和聯邦議會的元數據豐富的言論
Kai-Robin Lange和Carsten Jentsch
https://arxiv.org/pdf/2410.17886
語法誤差校正的擴展序列標記詞彙
Stuart Mesham,Christopher Bryant,Marek Rei,Zheng Yuan
https://arxiv.org/abs/2302.05913
德國議會語料庫(Gerparcor)
Giuseppe Abrami,MevlütBagci,Leon Hammerla,Alexander Mehler
https://arxiv.org/abs/2204.10422
IOCR:選舉選票的知情光學特徵識別
肯尼斯·U·奧伊波(Kenneth U. Oyibo),讓·路易斯(Jean D. Louis),胡安·E·吉爾伯特
https://arxiv.org/abs/2208.00865
Amazigh Spell Checker使用Damerau-Levenshtein算法和N-gram
youness chaabi,fadoua ataa allah
https://www.sciendirect.com/science/article/pii/s1319157821001828
泰國面向業務信息檢索的查詢校正調查
Phongsathorn Kittiworapanya,Nuttapong Saelek,Anuruth Lertpiya,Tawunrat Chalothorn
https://ieeexplore.ieee.org/document/9376809
泰米爾語的Symspell和LSTM的拼寫
selvakumar murugantamil arasan bakthavatchalamtamil arasan bakthavatchalammalaikannan sankarasubbu
https://www.researchgate.net/publication/349924975_symspell_and_lstm_based_based_spell-_checkers_for_tamil
Symspell4burmese:對稱刪除拼寫校正算法(Symspell)用於緬甸拼寫檢查
EI Phyu Phyu Mon; ye kyaw thu;比yu; Aye Wai Oo
https://ieeexplore.ieee.org/document/9678171
拼寫檢查印度尼西亞Menggunakan Norvig Dan Symspell
Yasir Abdur Rohman
https://medium.com/@yasirabd/spell-check-indonesia-menggunakan-norvig-dan-symspell-4fa583d62c24
Analisis Perbandingan Metode Burkhard Keller樹Dan Symspell Dalam Spell校正Bahasa印度尼西亞
Muhammad Hafizh Ferdiansyah,我Kadek Dwi Nuryana
https://ejournal.unesa.ac.id/index.php/jinacs/article/download/50989/41739
通過對弱和製造的印度尼西亞人翻譯的拼寫校正來改善文件檢索
穆罕默德Zaky Ramadhankemas M Lhaksmanakemas M Lhaksmana
https://www.researchgate.net/publication/342390145_improving_document_retrieval_with_with_with_with_with_correction_correction_for_weak_for_weak_and_fabricated_indonesian-translated_hadith
symspell을을한글맞춤법교정교정
https://heegyukim.medium.com/symspell%EC%9D%84-%EC%9D%B4%EC%9A%A9%95%95%9C-%95%9C%9C%B8%80-80- %EB%A7%9E%EC%B6%A4%EB%B2%95-%EA B5%90%EC%A0%95-3DEF9CA00805
修補斷裂文本。糾正OCR數據的啟發式程序
Jens Bjerring-Hansen,Ross Deans Kristensen-McLachla2,Philip Diderichsen和Dorte Haltrup Hansen
https://ceur-ws.org/vol-3232/paper14.pdf
邁向自然語言處理作為離線手寫文本識別系統的拼寫校正
Arthur Flor de Sousa Neto;拜倫·萊特·丹塔斯·貝塞拉(Byron Leite Dantas Bezerra);和亞歷杭德羅·霍克特·托塞利
https://www.mdpi.com/2076-3417/10/21/7711
何時將OCR後糾正用於指定實體識別?
Vinh-nam Huynh,艾哈邁德·漢迪(Ahmed Hamdi),安托萬·杜塞特(Antoine Doucet)
https://hal.science/hal-03034484v1/
自動錯誤校正:評估咒語檢查器工具的性能
A. Tolegenova
https://journals.sdu.edu.kz/index.php/nts/article/view/690
Zhaw-cai:瑞士德語演講的合奏方法對標準德語文本
Malgorzata Anna Ulasik,Manuela Hurlimann,Bogumila Dubel,Yves Kaufmann,
Silas Rudolf,Jan Deriu,Katsiaryna Mlynchyk,Hans-Peter Hutter和Mark Cieliebak
https://ceur-ws.org/vol-2957/sg_paper3.pdf
基於機器學習的西里爾單詞錯誤程序
Battumur,K.,U.Dulamragchaa,U.,Enkhbat,S.
https://mongoliajol.info/index.php/jimdt/article/view/2661
快速近似字符串搜索Wikification
szymon olewniczak,朱利安·西曼斯基(Julian Szymanski)
https://www.iccs-meeting.org/archive/iccs2021/papers/1274440334.pdf
RumedSpellchecker:使用機器學習技術在電子健康記錄中糾正自然俄語的拼寫錯誤
Dmitrii Pogrebnoi,Anastasia Funkner,Sergey Kovalchuk
https://link.springer.com/chapter/10.1007/978-3-031-36024-4_16
語法誤差校正的擴展序列標記詞彙
Stuart Mesham,Christopher Bryant,Marek Rei,Zheng Yuan
https://aclanthology.org/2023.findings-eacl.119.pdf
通過對稱刪除查找,閃電自適應免疫受體相似性搜索
Touchchai Chotisorayuth,Andreas Tiffeau-Mayer
https://arxiv.org/html/2403.09010v1
揭示偽裝的毒性:一種新穎的預處理模塊,用於增強內容的適度
約翰尼·陳(Johnny Chan),li
https://www.sciendirect.com/science/article/pii/s2215016124001225
bycycle
> bicycle
(而不是by cycle
)inconvient
- > inconvenient
(而不是i convent
)Symspell由Seekstorm貢獻 - 高性能搜索作為服務和搜索API