JamSpell

C/C++

v0.0.12

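JamSpell is a spell checking library with the following features:

Accurate - it considers the surrounding words (context) for better correction.
Fast - nearly 5K words per second.
Multi-language - it is written in C++ and available for many languages through swig bindings.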

Colab example

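JamSpell Pro

jamspell.com - check out the new jamspell version with the following features:

Improved accuracy (catboost gradient boosted decision trees candidate ranking model)
Splits merged words
Pre-trained models (small, medium, large) for many languages:
en, ru, de, fr, it, es, tr, uk, pl, nl, pt, hi, no
Ability to add words / sentences at runtime
Fine-tuning / additional training
Memory optimization for training large models
Static dictionary support
Built-in Java, C#, Ruby support
Windows support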

Content

Benchmarks
Usage
- Python
- C++
- Other languages
- HTTP API
Train

Benchmarks

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words/second)
JamSpell	3.25%	1.27%	79.53%	84.10%	0.64%	4854
Norvig	7.62%	5.00%	46.58%	66.51%	0.69%	395
Hunspell	13.10%	10.33%	47.52%	68.56%	7.14%	163
Dummy	13.14%	13.14%	0.00%	0.00%	0.00%	-

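The model was trained on 300K Wikipedia sentences + 300K news sentences (English). 95% was used for training and 5% for evaluation. An error model was used to generate errored text from the original text. The JamSpell corrector was compared with Norvig's corrector, Hunspell, and a dummy corrector (no corrections).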
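The error model itself is not described here. As a rough illustration of how errored text can be generated from clean text, the sketch below applies one random character edit to each word. The helper name `add_typo`, the four edit types, and the uniform choice between them are assumptions made for illustration; this is not JamSpell's actual error model.

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def add_typo(word, rng, alphabet=ALPHABET):
    """Return `word` with one random character edit applied (hypothetical helper)."""
    if not word:
        return word
    pos = rng.randrange(len(word))
    edit = rng.choice(["replace", "insert", "delete", "transpose"])
    if edit == "replace":
        # Pick a letter different from the current one so the word really changes.
        letter = rng.choice([c for c in alphabet if c != word[pos]])
        return word[:pos] + letter + word[pos + 1:]
    if edit == "insert":
        return word[:pos] + rng.choice(alphabet) + word[pos:]
    if edit == "delete" and len(word) > 1:
        return word[:pos] + word[pos + 1:]
    if edit == "transpose" and pos + 1 < len(word):
        return word[:pos] + word[pos + 1] + word[pos] + word[pos + 2:]
    return word  # edit not applicable at this position; leave the word unchanged


rng = random.Random(0)  # fixed seed so repeated runs give the same errored text
clean = "i am the best spell checker".split()
errored = [add_typo(w, rng) for w in clean]
```

A real error model would normally weight edits by how likely they are (for example, neighbouring keys) and would leave most words untouched, so that only a fraction of the words in the evaluation text carry errors.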

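We used the following metrics:

Errors - percentage of words with errors after the spell checker has processed the text
Top 7 Errors - percentage of words whose correct form is missing from the top 7 candidates
Fix Rate - percentage of errored words fixed by the spell checker
Top 7 Fix Rate - percentage of errored words fixed by one of the top 7 candidates
Broken - percentage of non-errored words broken by the spell checker
Speed - number of words per second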
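To make the definitions concrete, here is a minimal sketch that computes Errors, Fix Rate and Broken from three aligned word lists: the original text, the errored text, and the spell checker's output. The function name `evaluate` and the word-by-word alignment are assumptions for illustration; the top 7 metrics are left out because they need candidate lists rather than a single corrected word.

```python
def evaluate(original, errored, corrected):
    """Return (errors, fix_rate, broken) as percentages for aligned word lists."""
    assert len(original) == len(errored) == len(corrected)
    total = len(original)
    had_error = [o != e for o, e in zip(original, errored)]
    wrong_after = [o != c for o, c in zip(original, corrected)]

    n_errored = sum(had_error)
    n_clean = total - n_errored
    # Fixed: the word was errored and the checker restored the original.
    fixed = sum(h and not w for h, w in zip(had_error, wrong_after))
    # Broken: the word was fine and the checker changed it to something wrong.
    broken = sum(w and not h for h, w in zip(had_error, wrong_after))

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return pct(sum(wrong_after), total), pct(fixed, n_errored), pct(broken, n_clean)


original = ["i", "am", "the", "best", "spell", "checker"]
errored = ["i", "am", "the", "begt", "spell", "cherken"]
corrected = ["i", "an", "the", "best", "spell", "chicken"]
errors, fix_rate, broken = evaluate(original, errored, corrected)
# errors is 2/6 of the words, fix_rate is 50.0, broken is 25.0
```

Read this way, the metrics are linked: Errors is roughly the unfixed share of errored words plus the broken share of clean words. For the JamSpell row of the first benchmark table, with 13.14% of words errored (the dummy corrector's error rate), 13.14% × (1 − 0.7953) + 86.86% × 0.0064 ≈ 3.25%, which matches the reported Errors value.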

To make sure that our model is not overfitted to Wikipedia + news, we checked it on the text of "The Adventures of Sherlock Holmes":

	Errors	Top 7 Errors	Fix Rate	Top 7 Fix Rate	Broken	Speed (words/second)
JamSpell	3.56%	1.27%	72.03%	79.73%	0.50%	5524
Norvig	7.60%	5.30%	35.43%	56.06%	0.45%	647
Hunspell	9.36%	6.44%	39.61%	65.77%	2.95%	284
Dummy	11.16%	11.16%	0.00%	0.00%	0.00%	-

More details about reproducing these results are available in the "Train" section.

Usage

Python

Install swig3 (usually available from your distribution's package manager).
Install jamspell:

pip install jamspell

Download or train a language model.
Use it:

import jamspell

corrector = jamspell.TSpellCorrector()
corrector.LoadLangModel('en.bin')

corrector.FixFragment('I am the begt spell cherken!')
# u'I am the best spell checker!'

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 3)
# (u'best', u'beat', u'belt', u'bet', u'bent', ... )

corrector.GetCandidates(['i', 'am', 'the', 'begt', 'spell', 'cherken'], 5)
# (u'checker', u'chicken', u'checked', u'wherein', u'coherent', ...)
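In the calls above, the second argument to GetCandidates selects the word to get candidates for: index 3 is 'begt' and index 5 is 'cherken'. The sketch below loops over every position to collect candidates for a whole sentence. `candidates_for_each_word` and `FakeCorrector` are hypothetical names; the fake class stands in for `jamspell.TSpellCorrector` so the snippet runs without the library or a model file, and its behaviour for correctly spelled words (returning the word itself) is made up.

```python
def candidates_for_each_word(corrector, words):
    """Map each word position to the candidates returned for that position."""
    return {i: list(corrector.GetCandidates(words, i)) for i in range(len(words))}


class FakeCorrector:
    """Hypothetical stand-in so the sketch runs without jamspell or a model file."""

    def __init__(self, table):
        self._table = table

    def GetCandidates(self, words, position):
        # Fall back to the word itself when the table has no entry for it.
        return tuple(self._table.get(words[position], [words[position]]))


words = ["i", "am", "the", "begt", "spell", "cherken"]
fake = FakeCorrector({"begt": ["best", "beat"], "cherken": ["checker", "chicken"]})
per_word = candidates_for_each_word(fake, words)
```

With the real library, pass a `jamspell.TSpellCorrector` with a loaded model instead of `fake`.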

C++

Add the jamspell and contrib directories to your project.
Use it:

#include <jamspell/spell_corrector.hpp>

int main(int argc, const char** argv) {

    NJamSpell::TSpellCorrector corrector;
    corrector.LoadLangModel("model.bin");

    corrector.FixFragment(L"I am the begt spell cherken!");
    // "I am the best spell checker!"

    // The second argument is the index of the word to get candidates for.
    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 3);
    // "best", "beat", "belt", "bet", "bent", ...

    corrector.GetCandidates({L"i", L"am", L"the", L"begt", L"spell", L"cherken"}, 5);
    // "checker", "chicken", "checked", "wherein", "coherent", ...
    return 0;
}

Other languages

You can generate extensions for other languages by following the swig tutorial. The swig interface file is jamspell.i. Pull requests with build scripts are welcome.

HTTP API

Install cmake.
Clone and build jamspell (the build includes an http server):

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Download or train a language model.
Run the http server:

./web_server/web_server en.bin localhost 8080

Example of a GET request:

$ curl "http://localhost:8080/fix?text=I am the begt spell cherken"
I am the best spell checker

Example of a POST request:

$ curl -d "I am the begt spell cherken" http://localhost:8080/fix
I am the best spell checker
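The same two requests can be sent from Python with the standard library. This is a minimal sketch: the function names are hypothetical, the host and port match the web_server command above, and it assumes the server accepts a percent-encoded query string (spaces sent as %20).

```python
import urllib.parse
import urllib.request


def fix_url(text, host="localhost", port=8080):
    """Build the GET URL for /fix with the text percent-encoded."""
    query = urllib.parse.urlencode({"text": text}, quote_via=urllib.parse.quote)
    return f"http://{host}:{port}/fix?{query}"


def fix_via_get(text, host="localhost", port=8080):
    """Send the text to a running web_server with GET and return the reply."""
    with urllib.request.urlopen(fix_url(text, host, port)) as response:
        return response.read().decode("utf-8")


def fix_via_post(text, host="localhost", port=8080):
    """Send the text as the raw POST body, like `curl -d` in the example."""
    request = urllib.request.Request(
        f"http://{host}:{port}/fix", data=text.encode("utf-8"), method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Building the URL needs no server; the two fix_via_* calls do.
url = fix_url("I am the begt spell cherken")
```

Calling `fix_via_get("I am the begt spell cherken")` or `fix_via_post(...)` requires the http server to be running.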

Example of a candidates request:

curl "http://localhost:8080/candidates?text=I am the begt spell cherken"
# or
curl -d "I am the begt spell cherken" http://localhost:8080/candidates

{
    "results": [
        {
            "candidates": [
                "best",
                "beat",
                "belt",
                "bet",
                "bent",
                "beet",
                "beit"
            ],
            "len": 4,
            "pos_from": 9
        },
        {
            "candidates": [
                "checker",
                "chicken",
                "checked",
                "wherein",
                "coherent",
                "cheered",
                "cherokee"
            ],
            "len": 7,
            "pos_from": 20
        }
    ]
}

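Here, pos_from is the position of the first letter of the misspelled word (zero-based in the example: "begt" starts at position 9) and len is the length of the misspelled word.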
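As an illustration of how pos_from and len can be used, the sketch below replaces each reported span with its first candidate. The helper name `apply_top_candidates` is hypothetical, the response is a shortened copy of the JSON above, and the code assumes the offsets are zero-based character positions, which matches this ASCII example.

```python
import json


def apply_top_candidates(text, response_json):
    """Replace each reported span in `text` with its first candidate."""
    results = json.loads(response_json)["results"]
    # Work from right to left so earlier offsets stay valid after each replacement.
    for item in sorted(results, key=lambda r: r["pos_from"], reverse=True):
        if not item["candidates"]:
            continue
        start = item["pos_from"]
        end = start + item["len"]
        text = text[:start] + item["candidates"][0] + text[end:]
    return text


text = "I am the begt spell cherken"
response_json = """
{"results": [
    {"candidates": ["best", "beat", "belt"], "len": 4, "pos_from": 9},
    {"candidates": ["checker", "chicken", "checked"], "len": 7, "pos_from": 20}
]}
"""
fixed = apply_top_candidates(text, response_json)
# fixed == "I am the best spell checker"
```

Replacing from the end of the string backwards matters when a candidate is longer or shorter than the misspelled word, since that would shift every later offset.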

Train

To train a custom model you need to:

Install cmake.
Clone and build jamspell:

git clone https://github.com/bakwc/JamSpell.git
cd JamSpell
mkdir build
cd build
cmake ..
make

Prepare a utf-8 text file with sentences to train on (e.g. sherlockholmes.txt) and another file with the language alphabet (e.g. alphabet_en.txt).
Train the model:

./main/jamspell train ../test_data/alphabet_en.txt ../test_data/sherlockholmes.txt model_sherlock.bin

To evaluate the spell checker you can use the evaluate/evaluate.py script:

python evaluate/evaluate.py -a alphabet_file.txt -jsp your_model.bin -mx 50000 your_test_data.txt
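When training and evaluating several models, the two commands above can be wrapped in a small script. This is a sketch only: the function names are hypothetical, the flags are copied from the commands above without interpreting them, and the `cwd` parameter is there because the train command appears to be run from the build directory while evaluate/evaluate.py is addressed from the repository root.

```python
import subprocess


def train_command(alphabet, corpus, model_out, binary="./main/jamspell"):
    """Argument list for the training command shown above."""
    return [binary, "train", alphabet, corpus, model_out]


def evaluate_command(alphabet, model, test_data, mx=50000,
                     script="evaluate/evaluate.py"):
    """Argument list for the evaluate.py call shown above."""
    return ["python", script, "-a", alphabet, "-jsp", model,
            "-mx", str(mx), test_data]


def run(command, cwd=None):
    """Run a command and raise CalledProcessError if it exits with an error."""
    subprocess.run(command, cwd=cwd, check=True)


train = train_command("../test_data/alphabet_en.txt",
                      "../test_data/sherlockholmes.txt",
                      "model_sherlock.bin")
evaluate = evaluate_command("alphabet_file.txt", "your_model.bin",
                            "your_test_data.txt")
```

For example, `run(train, cwd="build")` would start training from the build directory, assuming the project has already been built there.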

You can use evaluate/generate_dataset.py to generate train/test data. It supports txt files, the Leipzig Corpora Collection format, and fb2 books.

Download models

Here are a few simple models. They were trained on 300K news + 300K Wikipedia sentences. To achieve better quality, we strongly recommend training your own model on at least a few million sentences. See the "Train" section above.

en.tar.gz (35Mb)
fr.tar.gz (31Mb)
ru.tar.gz (38Mb)

Additional information

Version: v0.0.12
Type: C/C++
Updated: 2024-12-23
Size: 529.4KB
Source: GitHub
