persian spell checker kenlm

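Download the Persian Wiki-Dump, Train KenLM and a Spell Checker

In this project, I download the Persian wiki-dump dataset from Wikipedia, preprocess it, and finally train a spell checker and a KenLM language model.

Download and Preprocess the Persian Wiki-Dump

Download the Persian Wiki Dump

Use the following bash script to download the Persian wiki dump. The dataset is about 1 GB, so please be patient!

Note: If you live in Iran (quite likely, since this repository targets Persian), turn on your VPN!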

 language=fa
bash download_wiki_dump.sh $language
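The contents of download_wiki_dump.sh are not shown here, so the following is only a rough Python sketch of the same step. It assumes the dump comes from dumps.wikimedia.org under the usual `<lang>wiki/latest/` layout; the helper names `dump_filename`, `dump_url`, and `download_dump` are made up for illustration.

```python
import shutil
import urllib.request

# Assumption: the public Wikimedia dump host; the bash script may use a mirror.
DUMP_HOST = "https://dumps.wikimedia.org"


def dump_filename(language: str) -> str:
    """File name of the latest articles dump for a language code such as 'fa'."""
    return f"{language}wiki-latest-pages-articles.xml.bz2"


def dump_url(language: str) -> str:
    """Build the download URL of the latest articles dump."""
    return f"{DUMP_HOST}/{language}wiki/latest/{dump_filename(language)}"


def download_dump(language: str) -> str:
    """Stream the dump into the current directory and return the local file name."""
    target = dump_filename(language)
    with urllib.request.urlopen(dump_url(language)) as response, open(target, "wb") as out:
        # copyfileobj copies in chunks, so the ~1 GB archive is never held in memory
        shutil.copyfileobj(response, out)
    return target


if __name__ == "__main__":
    print(dump_url("fa"))
```

Calling `download_dump("fa")` would then produce the same file name that the extraction step expects.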

Extract Text

Extract the .bz2 archive and convert it to .txt. wikiextractor is used to clean the dump and write it out as .txt files. This may also take some time!

 n_processors=16
bash extract_and_clean_wiki_dump.sh ${language}wiki-latest-pages-articles.xml.bz2 $n_processors

Note: If you hit a pdb error, change expand_templates=True to expand_templates=False. It is an input argument of the clean_text function, located around line 948 of wikiextractor/wikiextractor/extract.py.
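Before starting a long extraction run, it can help to confirm that the downloaded archive is readable. The sketch below is not part of the repository; it only uses Python's standard bz2 module to read the first few lines without decompressing the whole file. The function name `peek_bz2` is hypothetical.

```python
import bz2


def peek_bz2(path: str, n_lines: int = 5) -> list[str]:
    """Return the first n_lines of a .bz2 text file, decompressing lazily."""
    lines = []
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            lines.append(line.rstrip("\n"))
            if len(lines) >= n_lines:
                break  # stop early so the rest of the archive is never read
    return lines


if __name__ == "__main__":
    for line in peek_bz2("fawiki-latest-pages-articles.xml.bz2"):
        print(line)
```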

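Preprocess and Normalize

The extracted text should be preprocessed to remove unnecessary text such as "[doc]" and then normalized with the hazm and nltk libraries!

Install Python Requirements

Install the requirements: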

 pip install -r requirements.txt

Run Preprocessing and Normalization

This is the main processing step. It may take some time!

 python preprocess_wiki_dump.py fawiki-latest-pages-articles.txt
python cleaner.py
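The two scripts above are not reproduced here. As a rough illustration of what this kind of cleaning involves, the following standard-library sketch strips document markers, maps two common Arabic letter variants to their Persian forms, and drops characters outside the Arabic Unicode block. The marker pattern, the character set, and the name `clean_line` are all assumptions; the real scripts rely on hazm and nltk and may behave differently.

```python
import re

# Assumption: markers look like wikiextractor's <doc ...> / </doc>, or a literal [doc]
DOC_TAG = re.compile(r"</?doc[^>]*>|\[/?doc\]")
# Keep the Arabic Unicode block (letters, Arabic-script digits and punctuation),
# the zero-width non-joiner used in Persian, and whitespace; drop everything else
NOT_KEPT = re.compile(r"[^\u0600-\u06FF\u200c\s]")
# Arabic yeh (U+064A) -> Farsi yeh (U+06CC), Arabic kaf (U+0643) -> keheh (U+06A9)
ARABIC_TO_PERSIAN = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def clean_line(line: str) -> str:
    """Remove markers, unify letter variants, and collapse whitespace."""
    line = DOC_TAG.sub(" ", line)
    line = line.translate(ARABIC_TO_PERSIAN)
    line = NOT_KEPT.sub(" ", line)
    return " ".join(line.split())


if __name__ == "__main__":
    print(clean_line('<doc id="1" url="x" title="t">'))
```

Lines that come back empty, such as a bare marker line, can simply be skipped when writing the cleaned corpus.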

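Get the Word Count of the Corpus

This script counts the words in the corpus. Before counting, it applies some additional normalization and cleaning to the words.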

 sudo apt-get install pv
bash get_counts.sh
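get_counts.sh itself is not shown, so the following is only a minimal Python equivalent of the counting idea: split each line on whitespace and tally the tokens. The output layout of one "word count" pair per line is an assumption about fa_wiki.counts, and the function names are hypothetical.

```python
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Tally whitespace-separated tokens over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def write_counts(counts: Counter, path: str) -> None:
    """Write one 'word count' pair per line, most frequent first (assumed layout)."""
    with open(path, "w", encoding="utf-8") as out:
        for word, count in counts.most_common():
            out.write(f"{word} {count}\n")


if __name__ == "__main__":
    sample = ["a b a", "b a"]
    print(count_words(sample).most_common())
```

Passing an open file object as `lines` keeps memory use proportional to the vocabulary size rather than the corpus size.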

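Get Top Frequent Vocabs for SymSpell [Spell Checker]

SymSpell needs a text file that lists each vocabulary term with its occurrence count. The fa_wiki.counts file created in the Get the Word Count of the Corpus section should be trimmed to the 80k most frequent words, excluding words whose frequency falls below a minimum threshold (25 in the command below).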

 python get_spellchecker_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_80k.txt
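The trimming logic can be sketched in a few lines. This is not the repository's script; it assumes that `--ignore-less 25` means "drop words seen fewer than 25 times" and that `--top-vocabs` caps the list after that filter. The function name `top_vocabs` is made up for illustration.

```python
def top_vocabs(counts: dict[str, int], top: int, ignore_less: int) -> list[tuple[str, int]]:
    """Keep words with count >= ignore_less, then return the `top` most frequent."""
    frequent = [(word, count) for word, count in counts.items() if count >= ignore_less]
    # Sort by descending count; break ties alphabetically so the result is stable
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return frequent[:top]


if __name__ == "__main__":
    demo = {"a": 100, "b": 30, "c": 24, "d": 25}
    for word, count in top_vocabs(demo, top=80000, ignore_less=25):
        # symspellpy's load_dictionary splits on a single space by default
        print(f"{word} {count}")
```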

SymSpell

SymSpell is a simple spell checker. First, install it from PyPI with the following command:

 pip install symspellpy

To use it, simply instantiate it with the vocabulary dictionary created in the Get Top Frequent Vocabs for SymSpell section:

# import symspell
from symspellpy import SymSpell, Verbosity

# instantiate it
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = "wiki_fa_80k.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# input sample:
input_term = "اهوار"  # misspelling of "اهواز", which is a city name

# look up the dictionary
suggestions = sym_spell.lookup(input_term, Verbosity.ALL, max_edit_distance=2)
# display suggestion term, edit distance, and term frequency
for suggestion in suggestions[:5]:
    print(suggestion)

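The output is shown below; each line lists the suggested term, its edit distance, and its frequency. As you can see, اهواز is ranked first, which is the correct choice!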

 اهواز, 1, 4692
ادوار, 1, 1350
الوار, 1, 651
انوار, 1, 305
اهورا, 1, 225
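To see why all five suggestions share an edit distance of 1, the distance can be computed by hand. The sketch below implements the optimal string alignment variant of Damerau-Levenshtein distance (insert, delete, substitute, and swap of adjacent characters). Whether symspellpy uses exactly this variant internally is an assumption here; the function `osa_distance` is illustrative and not part of the library.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance between two strings."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "ab" -> "ba"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    for candidate in ["اهواز", "ادوار", "الوار", "انوار", "اهورا"]:
        print(candidate, osa_distance("اهوار", candidate))
```

Four of the candidates differ from the input by a single substituted letter, while اهورا differs by a swap of the last two letters. Since the distances tie, the ordering in the output comes from the frequency column.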

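Get Top Frequent Vocabs for KenLM

With the following command, the 80K most frequent words are written to wiki_fa_kenlm_vocabs.txt. To speed things up, words that occur fewer than 25 times are discarded!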

 python get_kenlm_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_kenlm_vocabs.txt

Train the KenLM Model

First, install the KenLM requirements with the following commands:

 sudo apt-get update
sudo apt-get install cmake build-essential libssl-dev libeigen3-dev libboost-all-dev zlib1g-dev libbz2-dev liblzma-dev -y

Then clone the repository and build the C++ modules:

 git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

If everything goes well, you will find lmplz and build_binary in the ./kenlm/build/bin directory. Finally, train the KenLM language model with the following bash script.

 bash train_kenlm.sh -o 4 -l fa

Note: A binary model is also created, because it is much faster than the non-binarized one.
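train_kenlm.sh is not reproduced here. A script of this kind typically wraps two KenLM tools: lmplz to estimate an ARPA model and build_binary to convert it. The sketch below assembles those two commands in Python, assuming that `-o 4` is passed through as the n-gram order. The corpus file name in the demo and the helper names are hypothetical.

```python
import subprocess
from pathlib import Path

# Location of the tools built in the previous step
KENLM_BIN = Path("kenlm/build/bin")


def kenlm_commands(corpus: str, order: int, prefix: str) -> list[list[str]]:
    """Return the lmplz and build_binary command lines for one training run."""
    arpa, binary = f"{prefix}.arpa", f"{prefix}.binary"
    return [
        # estimate an n-gram model of the given order and write it in ARPA format
        [str(KENLM_BIN / "lmplz"), "-o", str(order), "--text", corpus, "--arpa", arpa],
        # convert the ARPA file into KenLM's binary format
        [str(KENLM_BIN / "build_binary"), arpa, binary],
    ]


def train(corpus: str, order: int, prefix: str) -> None:
    """Run both commands in sequence, stopping at the first failure."""
    for command in kenlm_commands(corpus, order, prefix):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # "fa_wiki_corpus.txt" is a placeholder name, not a file from the repository
    for command in kenlm_commands("fa_wiki_corpus.txt", 4, "fa_wiki"):
        print(" ".join(command))
```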

KenLM Inference in Python

Install KenLM:

 pip install https://github.com/kpu/kenlm/archive/master.zip

Usage:

 import kenlm

model = kenlm.Model('fa_wiki.binary')
print("score: ", model.score('کشور ایران شهر تهران', bos=True, eos=True))
print("score: ", model.score('کشور تهران شهر ایران', bos=True, eos=True))
# score:  -11.683658599853516
# score:  -15.572178840637207
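The scores above are log10 probabilities, so a value closer to zero means a more plausible sentence. To compare sentences of different lengths, the score can be turned into a per-token perplexity. The helper below is a small sketch that needs no model file; it assumes the end-of-sentence token counts as one extra scored token, and the name `perplexity` is illustrative.

```python
def perplexity(log10_score: float, n_words: int) -> float:
    """Per-token perplexity from a log10 sentence score.

    n_words + 1 accounts for the end-of-sentence token, which is scored
    when eos=True (assumption: the start token is context only).
    """
    return 10.0 ** (-log10_score / (n_words + 1))


if __name__ == "__main__":
    # Scores copied from the example output above; both sentences have 4 words
    print(perplexity(-11.683658599853516, 4))
    print(perplexity(-15.572178840637207, 4))
```

The first word order yields the lower perplexity, matching the intuition that it is the more natural sentence.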

For more examples, see the following link: https://github.com/kpu/kenlm/blob/master/python/example.py

References

https://github.com/tiefenauer/wiki-lm
https://towardsdatascience.com/pre-processing-a-wikipedia-dump-for-nlp-model-training-a-write-up-3b9176fdf67
https://github.com/kpu/kenlm


Additional Information

Version: 1.0.0
Type: AI source code
Updated: 2024-12-30
Size: 50MB
Source: GitHub
