ctc-forced-aligner

v0.2

Forced Alignment with Hugging Face CTC Models

Please star the project on GitHub if you appreciate my contribution to the community!

This Python package provides an efficient way to perform forced alignment between text and audio using pretrained Hugging Face models. It relies on Wav2Vec2, HuBERT, and MMS models for accurate alignment, which makes it a useful tool for building speech corpora.

Features

At least 5x lower memory usage: an improved implementation that uses far less memory than the TorchAudio forced alignment API.
Support for a wide range of languages: works with many languages, including English, Arabic, Russian, German, and 1126 more.
Flexible alignment granularity: choose between sentence-, word-, or character-level alignment.
Customizable alignment parameters: control how often the <star> token is inserted, the merge threshold used when merging segments, and more.
Integration with Hugging Face models: uses pretrained Wav2Vec2, HuBERT, and MMS models for accurate alignment.
GPU acceleration: use your GPU for faster inference.
Output in JSON format: alignment results are clear and structured, which makes them easy to analyze and integrate.

Installation

pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git

Usage

Terminal Usage

ctc-forced-aligner --audio_path "path/to/audio.wav" --text_path "path/to/text.txt" --language "eng" --romanize

Arguments

Argument	Description	Default
`--audio_path`	Path to the audio file	Required
`--text_path`	Path to the text file	Required
`--language`	Language as an ISO 639-3 code	Required
`--romanize`	Enable romanization for non-Latin scripts, or for multilingual models regardless of the language; required when using the default model	False
`--split_size`	Alignment granularity: "sentence", "word", or "char"	"word"
`--star_frequency`	Frequency of the <star> token: "segment" or "edges"	"edges"
`--merge_threshold`	Merge threshold for segment merging	0.00
`--alignment_model`	Name of the alignment model	MahmoudAshraf/mms-300m-1130-forced-aligner
`--compute_dtype`	Compute dtype for inference	"float32"
`--batch_size`	Batch size for inference	4
`--window_size`	Window size in seconds for chunking the audio	30
`--context_size`	Overlap between chunks in seconds	2
`--attn_implementation`	Attention implementation	"eager"
`--device`	Device used for inference: "cuda" or "cpu"	"cuda" if available, otherwise "cpu"
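
To make `--window_size` and `--context_size` more concrete, here is a minimal sketch of how a long recording could be split into overlapping chunks. It reads the table literally and assumes consecutive chunks share `context_size` seconds of audio; the package's actual chunking may differ (for example, it might pad context on both sides of each window). `chunk_bounds` is a hypothetical helper and not part of the package.

```python
def chunk_bounds(duration, window_size=30.0, context_size=2.0):
    """Return (start, end) times in seconds for overlapping chunks.

    Assumption: consecutive chunks overlap by exactly context_size seconds.
    """
    if window_size <= context_size:
        raise ValueError("window_size must be larger than context_size")
    duration = float(duration)
    step = window_size - context_size  # distance between chunk starts
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + window_size, duration)
        bounds.append((start, end))
        if end >= duration:
            break  # the last chunk reached the end of the audio
        start += step
    return bounds


# A 70-second file with the default 30 s window and 2 s overlap
bounds = chunk_bounds(70)
for start, end in bounds:
    print(f"{start:6.1f} s to {end:6.1f} s")
```

Under this assumption, a larger window means fewer chunks but more memory per batch, while a larger overlap gives the model more surrounding audio at chunk boundaries at the cost of redundant computation.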

Examples

# Align an English audio file with the text file
ctc-forced-aligner --audio_path "english_audio.wav" --text_path "english_text.txt" --language "eng" --romanize

# Align a Russian audio file with romanized text
ctc-forced-aligner --audio_path "russian_audio.wav" --text_path "russian_text.txt" --language "rus" --romanize

# Align on a sentence level
ctc-forced-aligner --audio_path "audio.wav" --text_path "text.txt" --language "eng" --split_size "sentence" --romanize

# Align using a model with native vocabulary
ctc-forced-aligner --audio_path "audio.wav" --text_path "text.txt" --language "ara" --alignment_model "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
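
When aligning many files from a script, it can help to assemble the command line programmatically. The sketch below uses only the flags listed in the arguments table; `build_command` is a hypothetical helper, and actually running the command assumes the package is installed and on your PATH.

```python
import shlex


def build_command(audio_path, text_path, language, romanize=False, **options):
    """Assemble the ctc-forced-aligner command line as a list of arguments."""
    cmd = [
        "ctc-forced-aligner",
        "--audio_path", str(audio_path),
        "--text_path", str(text_path),
        "--language", language,
    ]
    if romanize:
        cmd.append("--romanize")  # boolean flag, takes no value
    for name, value in options.items():
        # Extra options map directly to flags, e.g. split_size -> --split_size
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_command("audio.wav", "text.txt", "eng", romanize=True, split_size="sentence")
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.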

Python Usage

import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

audio_path = "your/audio/path"
text_path = "your/text/path"
language = "iso"  # ISO 639-3 language code
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16

# Load the alignment model and tokenizer; use half precision only on GPU
alignment_model, alignment_tokenizer = load_alignment_model(
    device,
    dtype=torch.float16 if device == "cuda" else torch.float32,
)

# Load the audio with the same dtype and device as the model
audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

# Read the transcript and collapse it into a single line
with open(text_path, "r") as f:
    lines = f.readlines()
text = "".join(line for line in lines).replace("\n", " ").strip()

emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size
)

tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
)

segments, scores, blank_token = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)

spans = get_spans(tokens_starred, segments, blank_token)

word_timestamps = postprocess_results(text_starred, spans, stride, scores)
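
The Python API returns the timestamps in memory rather than writing a file. A minimal sketch of serializing them in the `text` / `segments` layout the command-line tool produces, assuming `word_timestamps` is a list of dictionaries with `start`, `end`, and `text` keys (the exact structure is not documented here, so treat this as an assumption). `build_output` is a hypothetical helper, and the sample data stands in for a real alignment result.

```python
import json


def build_output(text, word_timestamps):
    """Arrange aligned segments in a {"text": ..., "segments": [...]} layout."""
    segments = [
        {
            "start": round(float(seg["start"]), 3),
            "end": round(float(seg["end"]), 3),
            "text": seg["text"],
        }
        for seg in word_timestamps
    ]
    return {"text": text, "segments": segments}


# Stand-in for the word_timestamps produced by postprocess_results
sample = [
    {"start": 0.0, "end": 1.234, "text": "This"},
    {"start": 1.234, "end": 2.567, "text": "is"},
]
result = build_output("This is", sample)

# ensure_ascii=False keeps non-Latin transcripts readable in the file
print(json.dumps(result, indent=2, ensure_ascii=False))
```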

Output

The alignment results are saved to a file containing the following information in JSON format:

text: the aligned text.
segments: a list of segments, each containing the start and end time of the corresponding text segment.

{
  "text": "This is a sample text to be aligned with the audio.",
  "segments": [
    {
      "start": 0.000,
      "end": 1.234,
      "text": "This"
    },
    {
      "start": 1.234,
      "end": 2.567,
      "text": "is"
    },
    {
      "start": 2.567,
      "end": 3.890,
      "text": "a"
    },
    {
      "start": 3.890,
      "end": 5.213,
      "text": "sample"
    },
    {
      "start": 5.213,
      "end": 6.536,
      "text": "text"
    },
    {
      "start": 6.536,
      "end": 7.859,
      "text": "to"
    },
    {
      "start": 7.859,
      "end": 9.182,
      "text": "be"
    },
    {
      "start": 9.182,
      "end": 10.405,
      "text": "aligned"
    },
    {
      "start": 10.405,
      "end": 11.728,
      "text": "with"
    },
    {
      "start": 11.728,
      "end": 13.051,
      "text": "the"
    },
    {
      "start": 13.051,
      "end": 14.374,
      "text": "audio."
    }
  ]
}
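
Because the output is plain JSON, it is straightforward to convert into other timing formats. As one illustration, this sketch turns the `segments` list into SubRip (SRT) subtitle entries. The conversion is not a feature of the package; `format_srt_time` and `segments_to_srt` are hypothetical helpers, and the inline JSON is a shortened copy of the sample output.

```python
import json


def format_srt_time(seconds):
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(alignment):
    """Build SRT text with one numbered entry per aligned segment."""
    blocks = []
    for index, seg in enumerate(alignment["segments"], start=1):
        blocks.append(
            f"{index}\n"
            f"{format_srt_time(seg['start'])} --> {format_srt_time(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)  # blank line between entries


alignment = json.loads(
    '{"text": "This is", "segments": ['
    '{"start": 0.0, "end": 1.234, "text": "This"},'
    '{"start": 1.234, "end": 2.567, "text": "is"}]}'
)
srt = segments_to_srt(alignment)
print(srt)
```

For subtitles, sentence-level alignment (`--split_size "sentence"`) is likely a better fit than word-level output, since each segment becomes one subtitle entry.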