JamSpell download - JamSpell source code download

JamSpell

C/C++

v0.0.12

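JamSpell is a spell checking library with the following features:

Accurate - it considers the words surrounding each word (the context) for better correction
Fast - nearly 5K words per second
Multi-language - it is written in C++ and available for many languages through swig bindings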

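Colab example

JamSpellPro

jamspell.com - check out the new JamSpell version with the following features:

Improved accuracy (CatBoost gradient boosted decision trees candidate ranking model)
Splitting of merged words
Pre-trained models (small, medium, large) for many languages:
en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
Ability to add words / sentences at runtime
Fine-tuning / additional training
Memory optimization for training large models
Static dictionary support
Built-in Java, C#, Ruby support
Windows support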

Content

Benchmarks
Usage
- Python
- C++
- Other languages
- HTTP API
Train

Benchmarks

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words/second)
JamSpell	3.25%	1.27%	79.53%	84.10%	0.64%	4854
Norvig	7.62%	5.00%	46.58%	66.51%	0.69%	395
Hunspell	13.10%	10.33%	47.52%	68.56%	7.14%	163
Dummy	13.14%	13.14%	0.00%	0.00%	0.00%	-

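The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% was used for training and 5% for evaluation. An error model was used to generate errored text from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell and a dummy corrector (no corrections).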
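To make the idea of an error model concrete, here is a minimal sketch of one way to turn clean words into errored ones. It is not JamSpell's actual error model: the function names, the four edit types, the lowercase English alphabet and the error_rate parameter are all assumptions chosen for illustration.

```python
import random

# Assumption: lowercase English letters only; a real error model would
# use the alphabet of the target language.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def corrupt_word(word, rng):
    """Apply one random single-character edit to a word.

    Words shorter than two characters are returned unchanged. A replace
    or swap can occasionally reproduce the original word.
    """
    if len(word) < 2:
        return word
    pos = rng.randrange(len(word))
    kind = rng.choice(["delete", "insert", "replace", "swap"])
    if kind == "delete":
        return word[:pos] + word[pos + 1:]
    if kind == "insert":
        return word[:pos] + rng.choice(ALPHABET) + word[pos:]
    if kind == "replace":
        return word[:pos] + rng.choice(ALPHABET) + word[pos + 1:]
    # swap: exchange two adjacent characters
    pos = min(pos, len(word) - 2)
    return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]


def corrupt_sentence(words, error_rate, rng):
    """Corrupt each word independently with probability error_rate."""
    return [
        corrupt_word(w, rng) if rng.random() < error_rate else w
        for w in words
    ]
```

Calling corrupt_sentence(words, 0.1, random.Random(42)) would corrupt roughly one word in ten; passing a seeded random.Random keeps the generated test set reproducible.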

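We used the following metrics:

Errors - percentage of words that still contain errors after the spell checker has processed the text
Top 7 Errors - percentage of words for which the correct word is missing from the top 7 candidates
Fix Rate - percentage of errored words fixed by the spell checker
Top 7 Fix Rate - percentage of errored words fixed by one of the top 7 candidates
Broken - percentage of non-errored words broken by the spell checker
Speed - number of words per second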
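The three non-candidate metrics can be sketched in a few lines. This is an illustrative reading of the definitions above, not the code in evaluate/evaluate.py: the function name score and the assumption that the three word lists are aligned one-to-one are mine.

```python
def score(original, errored, corrected):
    """Compute Errors, Fix Rate and Broken (in percent) for aligned word lists.

    original  - the clean words
    errored   - the same words after the error model was applied
    corrected - the spell checker's output for the errored words
    """
    if not (len(original) == len(errored) == len(corrected)):
        raise ValueError("word lists must have the same length")
    total = len(original)
    if total == 0:
        return {"errors": 0.0, "fix_rate": 0.0, "broken": 0.0}

    had_error = fixed = broken = still_wrong = 0
    for orig, err, corr in zip(original, errored, corrected):
        if corr != orig:
            still_wrong += 1      # wrong after correction, for any reason
        if err != orig:
            had_error += 1
            if corr == orig:
                fixed += 1        # an errored word was repaired
        elif corr != orig:
            broken += 1           # a clean word was damaged

    clean = total - had_error
    return {
        "errors": 100.0 * still_wrong / total,
        "fix_rate": 100.0 * fixed / had_error if had_error else 0.0,
        "broken": 100.0 * broken / clean if clean else 0.0,
    }
```

With these denominators the first row of the first benchmark table is self-consistent: 13.14% × (1 − 0.7953) + 86.86% × 0.64% ≈ 2.69% + 0.56% ≈ 3.25%, which matches the reported Errors value.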

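To make sure that our model is not overfitted to Wikipedia + news, we also checked it on the text of "The Adventures of Sherlock Holmes":

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words/second)
JamSpell	3.56%	1.27%	72.03%	79.73%	0.50%	5524
Norvig	7.60%	5.30%	35.43%	56.06%	0.45%	647
Hunspell	9.36%	6.44%	39.61%	65.77%	2.95%	284
Dummy	11.16%	11.16%	0.00%	0.00%	0.00%	-

For more details on reproducing these results, see the "Train" section.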

Usage

Python

Install swig3 (it is usually available in your distribution's package manager)
Install jamspell:

pip install jamspell

Download or train a language model
Use it:

import jamspell

# Load a language model (downloaded or trained yourself)
corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

# Correct a whole text fragment
corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

# Candidates for the word at index 3 ('begt')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

# Candidates for the word at index 5 ('cherken')
corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)

C++

Add the jamspell and contrib directories to your project
Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // Candidates for the word at index 3 ("begt")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    // Candidates for the word at index 5 ("cherken")
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

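Other languages

You can generate extensions for other languages by following the swig tutorial. The swig interface file is jamspell.i. Pull requests with build scripts are welcome.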

HTTP API

Install cmake
Clone and build jamspell (it includes the HTTP server):

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Download or train a language model
Run the HTTP server:

./web_server/web_server en.bin localhost 8080

GET request example:

$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker

POST request example:

$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
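The same two requests can be issued from Python using only the standard library. This is a sketch under assumptions: the server is running on localhost:8080 as started above, it accepts percent-encoded query strings, and the helper names build_fix_url, fix_via_get and fix_via_post are hypothetical.

```python
import urllib.parse
import urllib.request

# Assumption: the web_server binary was started on this host and port.
BASE_URL = "http://localhost:8080"


def build_fix_url(text, base_url=BASE_URL):
    """Return a GET URL for /fix with the text percent-encoded."""
    query = urllib.parse.urlencode({"text": text}, quote_via=urllib.parse.quote)
    return base_url + "/fix?" + query


def fix_via_get(text, base_url=BASE_URL):
    """Send the text in the query string and return the corrected text."""
    with urllib.request.urlopen(build_fix_url(text, base_url), timeout=5) as resp:
        return resp.read().decode("utf-8")


def fix_via_post(text, base_url=BASE_URL):
    """Send the text as the raw request body, like `curl -d`."""
    request = urllib.request.Request(base_url + "/fix", data=text.encode("utf-8"))
    with urllib.request.urlopen(request, timeout=5) as resp:
        return resp.read().decode("utf-8")
```

With the server running, fix_via_post("I am the begt spell cherken") should return the same corrected sentence as the curl example. Encoding spaces as %20 avoids sending a URL with literal spaces, which some HTTP clients reject.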

Candidates example:

curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates

{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}

Here pos_from is the position of the first letter of the misspelled word, and len is the length of the misspelled word.
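As an illustration of how pos_from and len map back onto the input, the sketch below slices each misspelled word out of the original text. The function name misspelled_words is hypothetical, and the code assumes the offsets are zero-based character positions, which is consistent with the ASCII example above but is not confirmed for non-ASCII text.

```python
import json


def misspelled_words(text, response_body):
    """Pair each misspelled word in `text` with its candidate list.

    response_body is the JSON string returned by the /candidates endpoint.
    Assumption: pos_from and len are zero-based character offsets.
    """
    data = json.loads(response_body)
    found = []
    for item in data["results"]:
        start = item["pos_from"]
        end = start + item["len"]
        found.append((text[start:end], item["candidates"]))
    return found


# Response trimmed to two candidates per word for brevity.
sample = """
{"results": [
    {"candidates": ["best", "beat"], "len": 4, "pos_from": 9},
    {"candidates": ["checker", "chicken"], "len": 7, "pos_from": 20}
]}
"""
pairs = misspelled_words("I am the begt spell cherken", sample)
# pairs[0] is ("begt", ["best", "beat"])
# pairs[1] is ("cherken", ["checker", "chicken"])
```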

Train

To train a custom model you need to:

Install cmake
Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Prepare a UTF-8 text file with the sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt)
Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin

To evaluate the spell checker, you can use the evaluate/evaluate.py script:

python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt

You can use evaluate/generate_dataset.py to generate train/test data. It supports txt files, the Leipzig Corpora Collection format and fb2 books.

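Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. We strongly recommend that you train your own model on at least a few million sentences to achieve better quality. See the "Train" section above.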

en.tar.gz (35Mb)
fr.tar.gz (31Mb)
ru.tar.gz (38Mb)


Additional information

Version: v0.0.12
Type: C/C++
Updated: 2024-12-23
Size: 529.4KB
Source: GitHub
