persian spell checker kenlm

Download Persian Wiki-Dump, Train KenLM & Spell Checker

In this project, I download the Persian wiki-dump dataset from Wikipedia, preprocess it, and finally train a spell checker and a KenLM language model.

Download and preprocess the Persian Wiki-Dump

Download the Persian wiki-dump

Download the Persian wiki dump using the following bash script. The dataset is about 1 GB, so be patient!

Note: If you live in Iran (most likely you do, since this repo is for the Persian language), turn on your VPN!

 language=fa
bash download_wiki_dump.sh $language
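
The contents of download_wiki_dump.sh are not shown here, so the following Python sketch is only an assumption about what the step amounts to: fetching the latest pages-articles dump from dumps.wikimedia.org. The helper names dump_filename, dump_url, and download_dump and the User-Agent string are hypothetical; the file name matches the one used in the extraction step.

```python
import os
import shutil
import urllib.request


def dump_filename(language: str) -> str:
    """File name of the latest pages-articles dump for a language code."""
    return f"{language}wiki-latest-pages-articles.xml.bz2"


def dump_url(language: str) -> str:
    """URL of the latest dump on the public Wikimedia dumps server."""
    return f"https://dumps.wikimedia.org/{language}wiki/latest/{dump_filename(language)}"


def download_dump(language: str, target_dir: str = ".") -> str:
    """Stream the dump to disk without loading it into memory."""
    path = os.path.join(target_dir, dump_filename(language))
    request = urllib.request.Request(
        dump_url(language),
        # Hypothetical User-Agent; pick one that identifies your own tool.
        headers={"User-Agent": "persian-spell-checker-example/0.1"},
    )
    with urllib.request.urlopen(request) as response, open(path, "wb") as out:
        shutil.copyfileobj(response, out)
    return path
```

Calling download_dump("fa") would start the roughly 1 GB download, so the sketch does not run it on import.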

Extract TXT

Extract the .bz2 dump and convert it to .txt. Using wikiextractor, the dump is cleaned and converted to a .txt file. This may also take some time!

 n_processors=16
bash extract_and_clean_wiki_dump.sh ${language}wiki-latest-pages-articles.xml.bz2 $n_processors

Note: In case of a pdb error, change expand_templates=True to expand_templates=False. It is an input argument to the clean_text function located at line 948 of wikiextractor/wikiextractor/extract.py.

Preprocessing and normalization

The output text should be preprocessed to remove unnecessary text such as "[doc]" and normalized using the hazm and nltk libraries!

Install Python requirements

Install the requirements:

 pip install -r requirements.txt

Preprocess and normalize

This is the main processing step. It may take some time!

 python preprocess_wiki_dump.py fawiki-latest-pages-articles.txt
python cleaner.py
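
To give a feel for what this kind of cleaning involves, here is a minimal standard-library sketch. It is not the code in preprocess_wiki_dump.py or cleaner.py, which rely on hazm and nltk. It assumes the unnecessary markers are the `<doc ...>` and `</doc>` lines that wikiextractor writes around each article, and it only unifies two common Arabic/Persian character variants. The name clean_lines is hypothetical.

```python
import re

# Assumption: article boundaries look like <doc id="..." ...> and </doc>.
DOC_TAG = re.compile(r"^</?doc\b[^>]*>$")

# Map Arabic yeh and kaf to their Persian counterparts.
CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def clean_lines(lines):
    """Yield normalized text lines, skipping blank lines and doc markers."""
    for line in lines:
        line = line.strip()
        if not line or DOC_TAG.match(line):
            continue
        line = line.translate(CHAR_MAP)
        # Collapse runs of whitespace into a single space.
        yield re.sub(r"\s+", " ", line)
```

A full normalizer such as the one in hazm handles many more cases (spacing around affixes, digits, diacritics), so treat this only as an illustration of the idea.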

Get the word-count of the corpus

This script counts the words in the corpus. Before counting, some extra normalization and cleaning are applied to the words as well.

 sudo apt-get install pv
bash get_counts.sh
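
The counting itself can be sketched in a few lines of Python. This is an illustration under assumptions, not the contents of get_counts.sh: it splits on whitespace and emits one "word count" pair per line, most frequent first. The names count_words and format_counts are hypothetical, and the real layout of fa_wiki.counts may differ.

```python
from collections import Counter


def count_words(lines):
    """Count whitespace-separated tokens over an iterable of text lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def format_counts(counts):
    """Render counts as 'word count' strings, most frequent first."""
    return [f"{word} {count}" for word, count in counts.most_common()]
```

Because count_words accepts any iterable of lines, it can be fed an open file object and will process the corpus one line at a time.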

Get top frequent vocabs for SymSpell [Spell-Checker]

SymSpell needs a text file that contains vocabs and their occurrence counts. The fa_wiki.counts file created in the Get the word-count of the corpus section should be trimmed to the 80,000 most frequent words, dropping words whose frequency is lower than 25.

 python get_spellchecker_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_80k.txt
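
As a rough picture of the trimming step, the sketch below keeps the most frequent words above a minimum count and writes them in the space-separated "term count" layout that the load_dictionary call later in this document reads (term_index=0, count_index=1). It is an assumption about what get_spellchecker_top_vocabs.py does, including the choice to drop counts strictly below the threshold. The names top_vocabs and write_dictionary are hypothetical.

```python
def top_vocabs(counts, top_n, min_count):
    """Return up to top_n (word, count) pairs with count >= min_count."""
    kept = [(word, count) for word, count in counts.items() if count >= min_count]
    # Sort by descending count; break ties alphabetically for stable output.
    kept.sort(key=lambda item: (-item[1], item[0]))
    return kept[:top_n]


def write_dictionary(pairs, path):
    """Write one 'term count' pair per line, UTF-8 encoded."""
    with open(path, "w", encoding="utf-8") as handle:
        for word, count in pairs:
            handle.write(f"{word} {count}\n")
```

With the values from the command above, the call would look like top_vocabs(counts, top_n=80000, min_count=25).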

SymSpell

SymSpell is a simple spell checker. First, install it from PyPI using the following command:

 pip install symspellpy

To use it, just instantiate it with the vocab dictionary we created in the Get top frequent vocabs for SymSpell section.

# import symspell
from symspellpy import SymSpell, Verbosity

# instantiate it
sym_spell = SymSpell(max_dictionary_edit_distance=2, prefix_length=7)
dictionary_path = "wiki_fa_80k.txt"
sym_spell.load_dictionary(dictionary_path, term_index=0, count_index=1)

# input sample:
input_term = "اهوار"  # misspelling of "اهواز", a city name

# look up the dictionary
suggestions = sym_spell.lookup(input_term, Verbosity.ALL, max_edit_distance=2)
# display suggestion term, edit distance, and term frequency
for suggestion in suggestions[:5]:
    print(suggestion)

The output is as follows. As you can see, اهواز is correctly chosen!

 اهواز, 1, 4692
ادوار, 1, 1350
الوار, 1, 651
انوار, 1, 305
اهورا, 1, 225
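
Each output line above lists the suggested term, its edit distance from the input, and its count in the dictionary. To show where the distance of 1 comes from, here is a small standalone sketch of the optimal string alignment variant of Damerau-Levenshtein distance, which to my understanding is the kind of distance SymSpell uses by default. The function osa_distance is written for illustration and is not part of symspellpy.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions,
    and transpositions of two adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]
```

Under this measure, اهواز differs from the input by one substitution and اهورا by one transposition, so both have distance 1. Among candidates with equal distance, the ordering in the output follows the dictionary count, which puts اهواز first.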

Get top frequent vocabs for KenLM

Using the following code, the 80K most frequent vocabs are written to wiki_fa_kenlm_vocabs.txt. To make it faster, vocabs with fewer than 25 occurrences are discarded!

 python get_kenlm_top_vocabs.py --top-vocabs 80000 --ignore-less 25 --output wiki_fa_kenlm_vocabs.txt

Train the KenLM model

First, install the KenLM requirements using the following commands:

 sudo apt-get update
sudo apt-get install cmake build-essential libssl-dev libeigen3-dev libboost-all-dev zlib1g-dev libbz2-dev liblzma-dev -y

Then clone and build the C++ module:

 git clone https://github.com/kpu/kenlm.git
cd kenlm
mkdir -p build
cd build
cmake ..
make -j 4

If everything goes well, you can find lmplz and build_binary under the ./kenlm/build/bin directory. Finally, train the KenLM language model using the following bash script.

 bash train_kenlm.sh -o 4 -l fa
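
The script train_kenlm.sh is not reproduced here, so the sketch below is only a guess at the two KenLM calls such a script typically wraps: lmplz to estimate an ARPA model of order 4, restricted to the vocab file, and build_binary to convert it to the binary format. The corpus and vocab file names, the output names, and the function kenlm_commands are assumptions for illustration.

```python
import subprocess


def kenlm_commands(corpus, vocab, order=4, language="fa", bin_dir="kenlm/build/bin"):
    """Build the lmplz and build_binary command lines without running them."""
    arpa = f"{language}_wiki.arpa"      # assumed intermediate file name
    binary = f"{language}_wiki.binary"  # matches the model loaded in the inference example
    train = [
        f"{bin_dir}/lmplz",
        "-o", str(order),             # n-gram order
        "--text", corpus,             # tokenized training text
        "--arpa", arpa,               # ARPA output
        "--limit_vocab_file", vocab,  # keep only n-grams over these words
    ]
    convert = [f"{bin_dir}/build_binary", arpa, binary]
    return [train, convert]


def train(corpus, vocab, **kwargs):
    """Run both steps, stopping at the first failure."""
    for command in kenlm_commands(corpus, vocab, **kwargs):
        subprocess.run(command, check=True)
```

Keeping command construction separate from execution makes it easy to print and inspect the commands before starting a long training run.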

Note: The binary model is also created because it is much faster than the non-binary one.

KenLM inference in Python

Install KenLM:

 pip install https://github.com/kpu/kenlm/archive/master.zip

How to use:

import kenlm

model = kenlm.Model('fa_wiki.binary')
print("score: ", model.score('کشور ایران شهر تهران', bos=True, eos=True))
print("score: ", model.score('کشور تهران شهر ایران', bos=True, eos=True))
# score:  -11.683658599853516
# score:  -15.572178840637207
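
The scores above are log10 probabilities, so a value closer to zero means the model finds the sentence more likely. A length-normalized view is perplexity. The helper below is a small sketch, not part of the kenlm package, that converts a score into perplexity by counting the words plus the end-of-sentence token. The name perplexity_from_score is hypothetical.

```python
def perplexity_from_score(log10_score: float, sentence: str) -> float:
    """Convert a log10 sentence score into per-token perplexity.

    The token count is the number of whitespace-separated words plus one
    for the end-of-sentence marker, which is scored when eos=True.
    """
    n_tokens = len(sentence.split()) + 1
    return 10.0 ** (-log10_score / n_tokens)
```

Applied to the two scores above, the natural word order yields the lower perplexity, which is the property a spell checker can use to rank alternative corrections of a whole sentence.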

For more examples, see the following link: https://github.com/kpu/kenlm/blob/master/python/example.py

References

https://github.com/tiefenauer/wiki-lm
https://towardsdatascience.com/pre-processing-a-wikipedia-dump-for-nlp-model-training-a-write-up-3b9176fdf67
https://github.com/kpu/kenlm

More information

Version 1.0.0
Category AI source code
Update time 2024-12-30
Size 50MB
From GitHub