拼写校正和模糊搜索:通过对称删除拼写校正算法快100万倍
对称删除拼写校正算法降低了给定的Damerau-Levenshtein距离的编辑候选生成和字典查找的复杂性。它是六个数量级的速度(比删除 + transposes +替换 +插入的标准方法和语言无关。
与其他算法相反,仅需要删除,不需要转移 +替换 +插入。输入项的转置 +替换 +插入件被转换为字典项的删除。替换和插入量昂贵,语言依赖性:例如,中文具有70,000个Unicode Han字符!
速度来自廉价的删除仅编辑候选人的生成和预计。
一个平均5个字母单词在最大编辑距离为3,大约有300万个可能的拼写错误,
但是Symspell只需生成25个删除即可覆盖所有删除,无论是在预计还是在查找时间时。魔法!
如果您喜欢Symspell,请尝试SeekStorm- Rust(开源)中的子毫秒全文搜索库和多租户服务器。
Copyright (c) 2022 Wolf Garbe
Version: 6.7.2
Author: Wolf Garbe <[email protected]>
Maintainer: Wolf Garbe <[email protected]>
URL: https://github.com/wolfgarbe/symspell
Description: https://seekstorm.com/blog/1000x-spelling-correction/
MIT License
Copyright (c) 2022 Wolf Garbe
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated
documentation files (the "Software"), to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software,
and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
https://opensource.org/licenses/MIT
查找提供了非常快速的单词拼写校正。
0.033毫秒/单词(编辑距离2)和0.180毫秒/单词(编辑距离3)(2012年MacBook Pro上的单核)
1,870倍的速度比BK-Tree快(请参见基准1:字典大小= 500,000,最大编辑距离= 3,与随机编辑距离的查询术语= 0 ...最大编辑距离,冗长= 0)
比Norvig的算法快100万倍(请参阅基准2:字典大小= 29,157,最大编辑距离= 3,固定编辑距离的查询项=最大编辑距离,冗长= 0)
1000x拼写校正算法更快
在大数据中与大距离的快速近似字符串匹配
产品名称,公司名称和街道名称的非常快的数据清洁
亚毫米复合自动拼写校正
Symspell vs. BK-Tree:100倍更快的模糊字符串搜索和拼写检查
噪音文本的快速单词细分
修剪radix trie-类固醇上的radix trie
LookupCompound支持复合意识到的自动拼写校正多字输入字符串。
1。复合分裂和分解
Lookup()将每个输入字符串作为单个项。 LookupCompound还支持三种情况下的复合分裂 /分解:
分裂错误,串联错误,替换错误,换位错误,删除错误和插入错误可以通过在同一单词中混合。
2。自动拼写校正
示例:
- whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixthgrade and ins pired him
+ where is the love he had dated for much of the past who couldn't read in sixth grade and inspired him (9 edits)
- in te dhird qarter oflast jear he hadlearned ofca sekretplan
+ in the third quarter of last year he had learned of a secret plan (9 edits)
- the bigjest playrs in te strogsommer film slatew ith plety of funn
+ the biggest players in the strong summer film slate with plenty of fun (9 edits)
- Can yu readthis messa ge despite thehorible sppelingmsitakes
+ can you read this message despite the horrible spelling mistakes (9 edits)
0.2毫秒 /单词(编辑距离2)5000个单词 /秒(2012年MacBook Pro上的单核)
词分割通过在适当的位置插入缺失空间来将字符串划分为单词。
示例:
- thequickbrownfoxjumpsoverthelazydog
+ the quick brown fox jumps over the lazy dog
- itwasabrightcolddayinaprilandtheclockswerestrikingthirteen
+ it was a bright cold day in april and the clocks were striking thirteen
- itwasthebestoftimesitwastheworstoftimesitwastheageofwisdomitwastheageoffoolishness
+ it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
应用程序:
表现:
4毫秒,用于将185个char字符串分为53个单词(2012年MacBook Pro上的单核)
单词 + enter:显示拼写建议
输入没有输入:终止程序
多个单词 + enter:显示拼写建议
输入没有输入:终止程序
没有空格的字符串 + enter:显示单词段文本
输入没有输入:终止程序
可以使用免费的Visual Studio Code构建演示,DemoCompound和SemengeationDemo项目,该项目可以在Windows,MacOS和Linux上运行。
//create object
int initialCapacity = 82765 ;
int maxEditDistanceDictionary = 2 ; //maximum edit distance per dictionary precalculation
var symSpell = new SymSpell ( initialCapacity , maxEditDistanceDictionary ) ;
//load dictionary
string baseDirectory = AppDomain . CurrentDomain . BaseDirectory ;
string dictionaryPath = baseDirectory + " ../../../../SymSpell/frequency_dictionary_en_82_765.txt " ;
int termIndex = 0 ; //column of the term in the dictionary text file
int countIndex = 1 ; //column of the term frequency in the dictionary text file
if ( ! symSpell . LoadDictionary ( dictionaryPath , termIndex , countIndex ) )
{
Console . WriteLine ( " File not found! " ) ;
//press any key to exit program
Console . ReadKey ( ) ;
return ;
}
//lookup suggestions for single-word input strings
string inputTerm = " house " ;
int maxEditDistanceLookup = 1 ; //max edit distance per lookup (maxEditDistanceLookup<=maxEditDistanceDictionary)
var suggestionVerbosity = SymSpell . Verbosity . Closest ; //Top, Closest, All
var suggestions = symSpell . Lookup ( inputTerm , suggestionVerbosity , maxEditDistanceLookup ) ;
//display suggestions, edit distance and term frequency
foreach ( var suggestion in suggestions )
{
Console . WriteLine ( suggestion . term + " " + suggestion . distance . ToString ( ) + " " + suggestion . count . ToString ( " N0 " ) ) ;
}
//load bigram dictionary
string dictionaryPath = baseDirectory + " ../../../../SymSpell/frequency_bigramdictionary_en_243_342.txt " ;
int termIndex = 0 ; //column of the term in the dictionary text file
int countIndex = 2 ; //column of the term frequency in the dictionary text file
if ( ! symSpell . LoadBigramDictionary ( dictionaryPath , termIndex , countIndex ) )
{
Console . WriteLine ( " File not found! " ) ;
//press any key to exit program
Console . ReadKey ( ) ;
return ;
}
//lookup suggestions for multi-word input strings (supports compound splitting & merging)
inputTerm = " whereis th elove hehad dated forImuch of thepast who couqdn'tread in sixtgrade and ins pired him " ;
maxEditDistanceLookup = 2 ; //max edit distance per lookup (per single word, not per whole input string)
suggestions = symSpell . LookupCompound ( inputTerm , maxEditDistanceLookup ) ;
//display suggestions, edit distance and term frequency
foreach ( var suggestion in suggestions )
{
Console . WriteLine ( suggestion . term + " " + suggestion . distance . ToString ( ) + " " + suggestion . count . ToString ( " N0 " ) ) ;
}
//word segmentation and correction for multi-word input strings with/without spaces
inputTerm = " thequickbrownfoxjumpsoverthelazydog " ;
maxEditDistance = 0 ;
suggestion = symSpell . WordSegmentation ( input ) ;
//display term and edit distance
Console . WriteLine ( suggestion . correctedString + " " + suggestion . distanceSum . ToString ( " N0 " ) ) ;
//press any key to exit program
Console . ReadKey ( ) ;
Symspell Targets .NET标准v2.0,可以使用:
可以使用免费的Visual Studio代码构建Symspell,Demo,Demopompound和Benchmark项目,该项目可以在Windows,MacOS和Linux上运行。
字典质量对于校正质量至关重要。为了实现这两个数据源,通过交集结合了:Google Books Ngram数据,提供了代表性的单词频率(但包含许多带有拼写错误的条目)和皱眉 - 拼写面向检查器的单词列表可确保真正的英语词汇(但不包含Word频率在相同的编辑距离内的建议排名所必需的)。
通过与下面提到的两个列表相交而创建的频率_dictionary_en_82_765.txt。通过相互滤波仅使用出现在两个列表中的那些单词。应用了其他过滤器,并将结果列表截断为≈80,000个最常见的单词。
您可以为您的语言或专门的技术领域构建自己的频率词典。 Symspell拼写校正算法支持具有非拉丁字符的语言,例如西里尔,中文或乔治亚语。
Symspell包括英文频率词典
中文,英语,法语,德语,希伯来语,意大利语,俄语和西班牙语的字典位于这里:
symspell.frequencyDictionary
许多其他语言中的频率字典可以在此处找到:
频率存储库
频率词典
频率词典
C# (原始源代码)
https://github.com/wolfgarbe/symspell
.NET (Nuget软件包)
https://www.nuget.org/packages/symspell
以下第三方端口或对其他编程语言的重新实现尚未由我自己测试,无论它们是确切的端口,没有错误,提供相同的结果还是与原始算法一样快。
大多数端口目标Symspell版本3.0 。但是版本6.1。提供更高的速度和较低的内存消耗!
WebAssembly
https://github.com/justinwilaby/spellchecker-wasm
Web API(Docker)
https://github.com/leonerath/symspellapi(版本6.3)
C ++
https://github.com/athes21/symspellcpp(版本6.5)
https://github.com/erhanbaris/symspellplusplus(版本6.1)
水晶
https://github.com/chenkovsky/aha/blob/master/src/src/aha/sym_spell.cr
去
https://github.com/sajari/fuzzy
https://github.com/eskriett/spell
哈斯克尔
https://github.com/cbeav/symspell
爪哇
https://github.com/mightguy/customized-symspell(版本6.6)
https://github.com/rxp90/jsymspell(版本6.6)
https://github.com/lundez/javasymspell(版本6.4)
https://github.com/rxp90/jsymspell
https://github.com/gpranav88/symspell
https://github.com/searchhub/predict
https://github.com/jpsingarayar/spellblaze
JavaScript
https://github.com/mathieuloutre/node-symspell(版本6.6,需要node.js)
https://github.com/itslenny/symspell.js
https://github.com/dongyuwei/symspell
https://github.com/icecreamyou/symspell
https://github.com/yomguithereal/mnemonist/blob/master/symspell.js
朱莉娅
https://github.com/arkoniak/symspell.jl
科特林
https://github.com/wavesonics/symspellkt
Objective-C
https://github.com/amitbhavsariphone/symspell(版本6.3)
Python
https://github.com/mammothb/symspellpy(版本6.7)
https://github.com/viig99/symspellcpppy(版本6.5)
https://github.com/zoho-labs/symspell(Rust版本的Python绑定)
https://github.com/ne3x7/pysymspell/(版本6.1)
https://github.com/ayyuriss/symspell
https://github.com/ppgmg/github_public/blob/master/spell/symspell_python.py
https://github.com/rcourivaud/symspellcompound
https://github.com/esukhia/sympound-python
https://www.kaggle.com/yk1598/symspell-spell-corrator
红宝石
https://github.com/philt/symspell
锈
https://github.com/reneklacan/symspell(版本6.6,编译到WebAssembly)
https://github.com/luketpeterson/fuzzy_rocks(由rocksdb支持的持久数据存储)
Scala
https://github.com/semkath/symspell
迅速
https://github.com/gdetari/symspellswift
用于用户查询的上下文多语言拼写检查器
Sanat Sharma,Josep Valls-Vargas,Tracy Holloway King,Francois Guerin,Chirag Arora(Adobe)
https://arxiv.org/abs/2305.01082
上下文敏感的实时拼写检查器,具有语言适应性
Prabhakar Gupta(亚马逊)
https://arxiv.org/abs/1910.11242
演讲者:德国州和联邦议会的元数据丰富的言论
Kai-Robin Lange和Carsten Jentsch
https://arxiv.org/pdf/2410.17886
语法误差校正的扩展序列标记词汇
Stuart Mesham,Christopher Bryant,Marek Rei,Zheng Yuan
https://arxiv.org/abs/2302.05913
德国议会语料库(Gerparcor)
Giuseppe Abrami,MevlütBagci,Leon Hammerla,Alexander Mehler
https://arxiv.org/abs/2204.10422
IOCR:选举选票的知情光学特征识别
肯尼斯·U·奥伊波(Kenneth U. Oyibo),让·路易斯(Jean D. Louis),胡安·E·吉尔伯特
https://arxiv.org/abs/2208.00865
Amazigh Spell Checker使用Damerau-Levenshtein算法和N-gram
youness chaabi,fadoua ataa allah
https://www.sciendirect.com/science/article/pii/s1319157821001828
泰国面向业务信息检索的查询校正调查
Phongsathorn Kittiworapanya,Nuttapong Saelek,Anuruth Lertpiya,Tawunrat Chalothorn
https://ieeexplore.ieee.org/document/9376809
泰米尔语的Symspell和LSTM的拼写
selvakumar murugantamil arasan bakthavatchalamtamil arasan bakthavatchalammalaikannan sankarasubbu
https://www.researchgate.net/publication/349924975_symspell_and_lstm_based_based_spell-_checkers_for_tamil
Symspell4burmese:对称删除拼写校正算法(Symspell)用于缅甸拼写检查
EI Phyu Phyu Mon; ye kyaw thu;比yu; Aye Wai Oo
https://ieeexplore.ieee.org/document/9678171
拼写检查印度尼西亚Menggunakan Norvig Dan Symspell
Yasir Abdur Rohman
https://medium.com/@yasirabd/spell-check-indonesia-menggunakan-norvig-dan-symspell-4fa583d62c24
Analisis Perbandingan Metode Burkhard Keller树Dan Symspell Dalam Spell校正Bahasa印度尼西亚
Muhammad Hafizh Ferdiansyah,我Kadek Dwi Nuryana
https://ejournal.unesa.ac.id/index.php/jinacs/article/download/50989/41739
通过对弱和制造的印度尼西亚人翻译的拼写校正来改善文件检索
穆罕默德Zaky Ramadhankemas M Lhaksmanakemas M Lhaksmana
https://www.researchgate.net/publication/342390145_improving_document_retrieval_with_with_with_with_with_correction_correction_for_weak_for_weak_and_fabricated_indonesian-translated_hadith
symspell을을한글맞춤법교정교정
https://heegyukim.medium.com/symspell%EC%9D%84-%EC%9D%B4%EC%9A%A9%95%95%9C-%95%9C%9C%B8%80-80- %EB%A7%9E%EC%B6%A4%EB%B2%95-%EA B5%90%EC%A0%95-3DEF9CA00805
修补断裂文本。纠正OCR数据的启发式程序
Jens Bjerring-Hansen,Ross Deans Kristensen-McLachla2,Philip Diderichsen和Dorte Haltrup Hansen
https://ceur-ws.org/vol-3232/paper14.pdf
迈向自然语言处理作为离线手写文本识别系统的拼写校正
Arthur Flor de Sousa Neto;拜伦·莱特·丹塔斯·贝塞拉(Byron Leite Dantas Bezerra);和亚历杭德罗·霍克特·托塞利
https://www.mdpi.com/2076-3417/10/21/7711
何时将OCR后纠正用于指定实体识别?
Vinh-nam Huynh,艾哈迈德·汉迪(Ahmed Hamdi),安托万·杜塞特(Antoine Doucet)
https://hal.science/hal-03034484v1/
自动错误校正:评估咒语检查器工具的性能
A. Tolegenova
https://journals.sdu.edu.kz/index.php/nts/article/view/690
Zhaw-cai:瑞士德语演讲的合奏方法对标准德语文本
Malgorzata Anna Ulasik,Manuela Hurlimann,Bogumila Dubel,Yves Kaufmann,
Silas Rudolf,Jan Deriu,Katsiaryna Mlynchyk,Hans-Peter Hutter和Mark Cieliebak
https://ceur-ws.org/vol-2957/sg_paper3.pdf
基于机器学习的西里尔单词错误程序
Battumur,K.,U.Dulamragchaa,U.,Enkhbat,S.
https://mongoliajol.info/index.php/jimdt/article/view/2661
快速近似字符串搜索Wikification
szymon olewniczak,朱利安·西曼斯基(Julian Szymanski)
https://www.iccs-meeting.org/archive/iccs2021/papers/1274440334.pdf
RumedSpellchecker:使用机器学习技术在电子健康记录中纠正自然俄语的拼写错误
Dmitrii Pogrebnoi,Anastasia Funkner,Sergey Kovalchuk
https://link.springer.com/chapter/10.1007/978-3-031-36024-4_16
语法误差校正的扩展序列标记词汇
Stuart Mesham,Christopher Bryant,Marek Rei,Zheng Yuan
https://aclanthology.org/2023.findings-eacl.119.pdf
通过对称删除查找,闪电自适应免疫受体相似性搜索
Touchchai Chotisorayuth,Andreas Tiffeau-Mayer
https://arxiv.org/html/2403.09010v1
揭示伪装的毒性:一种新颖的预处理模块,用于增强内容的适度
约翰尼·陈(Johnny Chan),li
https://www.sciendirect.com/science/article/pii/s2215016124001225
bycycle
> bicycle
(而不是by cycle
)inconvient
- > inconvenient
(而不是i convent
)Symspell由Seekstorm贡献 - 高性能搜索作为服务和搜索API