persian spell checker kenlm

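Download the Persian Wiki-Dump, Train KenLM and a Spell Checker

In this project, I download the Persian wiki-dump dataset from Wikipedia, preprocess it, and finally train a spell checker and a KenLM language model.

Download and Preprocess the Persian Wiki-Dump

Download the Persian wiki dump

Use the following bash script to download the Persian wiki dump. The dataset is about 1 GB, so please be patient!

Note: If you live in Iran (which is likely, since this repository is for the Persian language), turn on your VPN!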

language=fa
bash download_wiki_dump.sh "$language"
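For readers who prefer to stay in Python, the download step can be sketched roughly as follows. The contents of download_wiki_dump.sh are not shown here, so the URL pattern (the public dumps.wikimedia.org "latest" layout) and the helper names build_dump_url and download_dump are assumptions made for illustration, not the script's actual implementation.

```python
import shutil
import urllib.request

# Assumption: the standard Wikimedia "latest" dump layout.
DUMP_URL = (
    "https://dumps.wikimedia.org/{lang}wiki/latest/"
    "{lang}wiki-latest-pages-articles.xml.bz2"
)


def build_dump_url(language: str) -> str:
    """Return the dump URL for a language code such as "fa"."""
    return DUMP_URL.format(lang=language)


def download_dump(language: str, chunk_size: int = 1 << 20) -> str:
    """Stream the dump to the current directory and return the file name."""
    url = build_dump_url(language)
    filename = url.rsplit("/", 1)[-1]
    # Copy in chunks so the (roughly 1 GB) archive never sits in memory.
    with urllib.request.urlopen(url) as response, open(filename, "wb") as out:
        shutil.copyfileobj(response, out, chunk_size)
    return filename

# Usage (starts a large download):
# download_dump("fa")
```

The resulting file name matches the one passed to the extraction script in the next step.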

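Extract the text

Extract the .bz2 archive and convert it to .txt. wikiextractor is used to clean the dump and write it out as .txt files. This may also take some time!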

n_processors=16
bash extract_and_clean_wiki_dump.sh "${language}wiki-latest-pages-articles.xml.bz2" "$n_processors"

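Note: If you hit a pdb error, change expand_templates=True to expand_templates=False. It is an input argument of the clean_text function, located around line 948 of wikiextractor/wikiextractor/extract.py.

Preprocess and normalize

The extracted text should be preprocessed to remove unnecessary markup such as "[doc]", and then normalized with the hazm and nltk libraries!

Install Python requirements

Install the requirements:

pip install -r requirements.txt

Preprocess and normalize

This is the main processing step. It may take some time!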

python preprocess_wiki_dump.py fawiki-latest-pages-articles.txt
python cleaner.py
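To give a feel for what this kind of cleaning involves, here is a minimal standard-library sketch. It is not the code in preprocess_wiki_dump.py or cleaner.py, which rely on hazm and nltk. It assumes the extractor wraps each article in `<doc ...>` and `</doc>` lines, and it stands in for hazm's normalization with two common Arabic-to-Persian character mappings. The name clean_lines is hypothetical.

```python
import re

# Assumption: each article is wrapped in <doc ...> ... </doc> marker lines.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>$")

# Arabic Yeh and Kaf mapped to their Persian counterparts
# (a small stand-in for a full normalizer such as hazm's).
ARABIC_TO_PERSIAN = str.maketrans({"\u064A": "\u06CC", "\u0643": "\u06A9"})


def clean_lines(lines):
    """Yield normalized text lines, skipping blanks and doc markers."""
    for line in lines:
        line = line.strip()
        if not line or DOC_TAG.match(line):
            continue
        line = line.translate(ARABIC_TO_PERSIAN)
        # Collapse runs of whitespace into a single space.
        yield re.sub(r"\s+", " ", line)
```

Because clean_lines is a generator, it can be fed an open file object directly and will process a large corpus line by line.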

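Get the word-count of the corpus

This script counts the words in the corpus. Before counting, it applies some extra normalization and cleaning to the words.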

sudo apt-get install pv
bash get_counts.sh
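The counting itself can be sketched in a few lines of Python. This is only an illustration of the idea, not what get_counts.sh does internally. The output layout of one "word count" pair per line is an assumption, and count_words and format_counts are hypothetical names.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens across an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_counts(counts):
    """Return "word count" strings, most frequent word first."""
    return [f"{word} {count}" for word, count in counts.most_common()]

# Usage with a corpus file (the path is a placeholder):
# with open("corpus.txt", encoding="utf-8") as f:
#     lines_out = format_counts(count_words(f))
```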

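Get top frequent vocabs for SymSpell [Spell-Checker]

SymSpell needs a text file that lists vocabulary words together with their occurrence counts. The fa_wiki.counts file created in the Get the word-count of the corpus section should be trimmed to the 80k most frequent words, dropping words whose frequency is below the --ignore-less threshold (25 in the command below).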

python get_spellchecker_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_80k.txt
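The trimming rule can be sketched as below. This is an illustration rather than the contents of get_spellchecker_top_vocabs.py. In particular, whether the real --ignore-less threshold is inclusive or exclusive is not stated, so the `>=` comparison here is an assumption, and top_vocabs is a hypothetical name.

```python
def top_vocabs(counts, top_n, min_count):
    """Keep words seen at least min_count times, then the top_n most frequent."""
    kept = [(word, count) for word, count in counts.items() if count >= min_count]
    # Sort by descending frequency; ties keep their original order.
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_n]


# Tiny example with made-up counts:
example = {"a": 100, "b": 30, "c": 25, "d": 24, "e": 60}
print(top_vocabs(example, top_n=3, min_count=25))
```

Filtering by frequency before cutting to the top N keeps rare, often misspelled, tokens out of the dictionary even when the corpus has fewer than N distinct words.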

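SymSpell

SymSpell is a simple spell checker. First, install it from PyPI with the following command:

pip install symspellpy

To use it, simply instantiate it with the vocabulary dictionary we created in the Get top frequent vocabs for SymSpell section: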

# Import SymSpell
from symspellpy import SymSpell, Verbosity

# Instantiate it and load the frequency dictionary
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = "wiki_fa_80k.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# Input sample:
input_term = "اهوار"  # misspelling of "اهواز", which is a city name

# Look up the dictionary
suggestions = sym_spell.lookup(input_term, Verbosity.ALL, max_edit_distance=2)
# Display each suggestion as: term, edit distance, term frequency
for suggestion in suggestions[:5]:
    print(suggestion)

The output is shown below. As you can see, اهواز is correctly ranked first!

اهواز, 1, 4692
ادوار, 1, 1350
الوار, 1, 651
انوار, 1, 305
اهورا, 1, 225
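All five suggestions sit at edit distance 1 from the input, so they end up ordered by frequency. The last one differs from the input only by two swapped adjacent letters, which counts as a single edit under the Damerau-Levenshtein (optimal string alignment) distance that, as far as I know, SymSpell uses by default. The sketch below is a plain-Python version of that distance for illustration. It is not symspellpy's own implementation, and osa_distance is a hypothetical name.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance: insert, delete, substitute,
    or swap two adjacent characters, each at cost 1."""
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


# Compare the misspelled input with the first and last suggestions above.
for candidate in ["اهواز", "اهورا"]:
    print(candidate, osa_distance("اهوار", candidate))
```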

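Get top frequent vocabs for KenLM

The following command writes the 80K most frequent vocabulary words to wiki_fa_kenlm_vocabs.txt. To make things faster, words with fewer than 25 occurrences are discarded!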

python get_kenlm_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_kenlm_vocabs.txt

Train the KenLM model

First, install the KenLM requirements with the following commands:

sudo apt-get update
sudo apt-get install cmake build-essential libssl-dev libeigen3-dev libboost-all-dev zlib1g-dev libbz2-dev liblzma-dev -y

Then clone the repository and build the C++ modules:

git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

If everything goes well, you will find lmplz and build_binary under the ./kenlm/build/bin directory. Finally, train the KenLM language model with the following bash script.

bash train_kenlm.sh -o 4 -l fa
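The contents of train_kenlm.sh are not shown here, but a typical KenLM training run has two steps: lmplz estimates an n-gram model and writes it in ARPA format, and build_binary converts that file to the binary format. The sketch below only builds the two command lines. The corpus path, the fa_wiki prefix for the intermediate files, and the helper name kenlm_commands are assumptions for illustration.

```python
def kenlm_commands(corpus, order=4, bin_dir="kenlm/build/bin", prefix="fa_wiki"):
    """Return the (train, convert) argument lists for a two-step KenLM build."""
    arpa = f"{prefix}.arpa"
    binary = f"{prefix}.binary"
    # Step 1: estimate an n-gram model of the given order from the text corpus.
    train = [f"{bin_dir}/lmplz", "-o", str(order), "--text", str(corpus), "--arpa", arpa]
    # Step 2: convert the ARPA file to KenLM's binary format.
    convert = [f"{bin_dir}/build_binary", arpa, binary]
    return train, convert


# "corpus.txt" is a placeholder for the cleaned corpus file.
train_cmd, convert_cmd = kenlm_commands("corpus.txt", order=4)
print(" ".join(train_cmd))
print(" ".join(convert_cmd))
```

To actually run them, each list can be passed to subprocess.run(cmd, check=True) from the directory that contains the kenlm checkout.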

Note: A binary model is also created, because it is much faster than the non-binary one.

KenLM inference in Python

Install KenLM:

pip install https://github.com/kpu/kenlm/archive/master.zip

Usage:

import kenlm

# Load the binary language model
model = kenlm.Model('fa_wiki.binary')
# score() returns a log10 probability: the closer to 0, the more likely the sentence
print("score: ", model.score('کشور ایران شهر تهران', bos=True, eos=True))
print("score: ", model.score('کشور تهران شهر ایران', bos=True, eos=True))
# score:  -11.683658599853516
# score:  -15.572178840637207
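The scores above are log10 probabilities, so they get more negative as sentences get longer. A length-normalized view is perplexity, which can be derived from the score alone. The helper below is a small sketch under the assumption that the token count is the number of whitespace-separated words plus one end-of-sentence token. The name perplexity_from_score is hypothetical.

```python
def perplexity_from_score(log10_score: float, sentence: str) -> float:
    """Convert a log10 sentence score into per-token perplexity."""
    # Assumption: words plus one end-of-sentence token are scored.
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_score / n_tokens)


# Scores taken from the example above; lower perplexity means a more fluent sentence.
print(perplexity_from_score(-11.683658599853516, "کشور ایران شهر تهران"))
print(perplexity_from_score(-15.572178840637207, "کشور تهران شهر ایران"))
```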

For more examples, see the following link: https://github.com/kpu/kenlm/blob/master/python/example.py

References

https://github.com/tiefenauer/wiki-lm
https://towardsdatascience.com/pre-processing-a-wikipedia-dump-for-nlp-model-training-a-write-up-3b9176fdf67
https://github.com/kpu/kenlm
