
ctc forced aligner


v0.2


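Forced Alignment with Hugging Face CTC Models

If you appreciate my contribution to the community, please star the project on GitHub (see the top right corner)!

This Python package provides an efficient way to perform forced alignment between text and audio using Hugging Face's pretrained models. It relies on Wav2Vec2, HuBERT, and MMS models to produce precise alignments, which makes it well suited to building speech corpora.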

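Features

At least 5x less memory usage: an improved implementation that uses much less memory than the TorchAudio forced alignment API.
Wide language support: works with multiple languages, including English, Arabic, Russian, German, and 1126 more languages.
Flexible alignment granularity: align at the sentence, word, or character level.
Customizable alignment parameters: control how often the <star> token is inserted, the merge threshold for segment merging, and more.
Integration with Hugging Face models: use pretrained Wav2Vec2, HuBERT, and MMS models for precise alignment.
GPU acceleration: use a GPU for faster inference.
JSON output: alignment results are written in a clear, structured format that is easy to analyze and integrate.

Installation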

安裝

pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git

Usage

Terminal Usage

ctc-forced-aligner --audio_path "path/to/audio.wav" --text_path "path/to/text.txt" --language "eng" --romanize

Arguments

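Argument	Description	Default
`--audio_path`	Path to the audio file	Required
`--text_path`	Path to the text file	Required
`--language`	Language as an ISO 639-3 code	Required
`--romanize`	Enable romanization for non-Latin scripts, or for multilingual models regardless of the language; required when using the default model	False
`--split_size`	Alignment granularity: "sentence", "word", or "char"	"word"
`--star_frequency`	Frequency of the `<star>` token: "segment" or "edges"	"edges"
`--merge_threshold`	Merge threshold for segment merging	0.00
`--alignment_model`	Name of the alignment model	MahmoudAshraf/mms-300m-1130-forced-aligner
`--compute_dtype`	Compute dtype used for inference	"float32"
`--batch_size`	Batch size used for inference	4
`--window_size`	Window size for audio chunking, in seconds	30
`--context_size`	Overlap between chunks, in seconds	2
`--attn_implementation`	Attention implementation	"eager"
`--device`	Device used for inference: "cuda" or "cpu"	"cuda" if available, otherwise "cpu"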
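
To make the interaction between `--window_size` and `--context_size` concrete, here is a minimal sketch of one literal reading of the table: fixed-length windows where each window starts `context_size` seconds before the previous one ends. The helper name `chunk_bounds` is hypothetical, and the package's internal chunking may be implemented differently (for example, by padding context on both sides of each window), so treat this purely as an illustration of the two parameters.

```python
def chunk_bounds(duration_s, window_s=30.0, context_s=2.0):
    """Return (start, end) times in seconds for overlapping chunks.

    Hypothetical helper: assumes consecutive windows of `window_s` seconds
    that overlap by `context_s` seconds. Not taken from the package itself.
    """
    if context_s >= window_s:
        raise ValueError("context_s must be smaller than window_s")
    step = window_s - context_s  # distance between consecutive chunk starts
    bounds = []
    start = 0.0
    while True:
        end = min(start + window_s, duration_s)  # clip the last chunk to the audio length
        bounds.append((start, end))
        if end >= duration_s:
            break
        start += step
    return bounds


# A 70-second file with the default 30 s window and 2 s overlap
for start, end in chunk_bounds(70.0):
    print(f"{start:5.1f}s -> {end:5.1f}s")
```

Under these assumptions a 70-second file yields three chunks, and each pair of neighbors shares two seconds of audio.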

Examples

# Align an English audio file with the text file
ctc-forced-aligner --audio_path "english_audio.wav" --text_path "english_text.txt" --language "eng" --romanize

# Align a Russian audio file with romanized text
ctc-forced-aligner --audio_path "russian_audio.wav" --text_path "russian_text.txt" --language "rus" --romanize

# Align on a sentence level
ctc-forced-aligner --audio_path "audio.wav" --text_path "text.txt" --language "eng" --split_size "sentence" --romanize

# Align using a model with native vocabulary
ctc-forced-aligner --audio_path "audio.wav" --text_path "text.txt" --language "ara" --alignment_model "jonatasgrosman/wav2vec2-large-xlsr-53-arabic"
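
When many files need aligning, the same commands can be assembled from Python. The sketch below only builds the argument list from flags documented in the table above; `build_command` is a hypothetical helper, not part of the package, and actually running the command requires the `ctc-forced-aligner` executable to be installed.

```python
import shlex


def build_command(audio_path, text_path, language, romanize=False, **options):
    """Build a ctc-forced-aligner argument list (hypothetical helper).

    Extra keyword arguments are passed through as `--name value` pairs,
    e.g. split_size="sentence" becomes `--split_size sentence`.
    """
    cmd = [
        "ctc-forced-aligner",
        "--audio_path", audio_path,
        "--text_path", text_path,
        "--language", language,
    ]
    if romanize:
        cmd.append("--romanize")  # boolean flag, takes no value
    for name, value in options.items():
        cmd += [f"--{name}", str(value)]
    return cmd


cmd = build_command("audio.wav", "text.txt", "eng", romanize=True, split_size="sentence")
print(shlex.join(cmd))
# To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.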

Python Usage

import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

audio_path = "your/audio/path"
text_path = "your/text/path"
language = "iso"  # ISO 639-3 language code
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16

# Load the alignment model; half precision is used only on GPU
alignment_model, alignment_tokenizer = load_alignment_model(
    device,
    dtype=torch.float16 if device == "cuda" else torch.float32,
)

audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

# Read the transcript and collapse line breaks into spaces
with open(text_path, "r") as f:
    lines = f.readlines()
text = "".join(line for line in lines).replace("\n", " ").strip()

emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size
)

tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
)

segments, scores, blank_token = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)

spans = get_spans(tokens_starred, segments, blank_token)

word_timestamps = postprocess_results(text_starred, spans, stride, scores)
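
The Python snippet stops at `word_timestamps` and does not write a file. As a minimal sketch, the result could be serialized into a layout with a `text` field and a `segments` list. This assumes `word_timestamps` is a list of dictionaries with `start`, `end`, and `text` keys, which the source does not state explicitly; `to_alignment_json` is a hypothetical helper, and sample data stands in for real model output.

```python
import json


def to_alignment_json(text, segments):
    """Serialize an alignment as {"text": ..., "segments": [...]} (hypothetical helper).

    Assumes each segment is a dict with "start", "end", and "text" keys.
    """
    payload = {"text": text, "segments": list(segments)}
    # ensure_ascii=False keeps non-Latin transcripts readable in the file
    return json.dumps(payload, ensure_ascii=False, indent=2)


# Sample data in place of real `text` and `word_timestamps`
sample_text = "This is"
sample_segments = [
    {"start": 0.0, "end": 1.234, "text": "This"},
    {"start": 1.234, "end": 2.567, "text": "is"},
]

result = to_alignment_json(sample_text, sample_segments)
print(result)
# To save it: open("alignment.json", "w", encoding="utf-8").write(result)
```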

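Output

The alignment results are saved to a file containing the following information in JSON format:

text: the aligned text.
segments: a list of segments, each containing the start and end time of the corresponding text segment.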

JSON

{
  "text": "This is a sample text to be aligned with the audio.",
  "segments": [
    {
      "start": 0.000,
      "end": 1.234,
      "text": "This"
    },
    {
      "start": 1.234,
      "end": 2.567,
      "text": "is"
    },
    {
      "start": 2.567,
      "end": 3.890,
      "text": "a"
    },
    {
      "start": 3.890,
      "end": 5.213,
      "text": "sample"
    },
    {
      "start": 5.213,
      "end": 6.536,
      "text": "text"
    },
    {
      "start": 6.536,
      "end": 7.859,
      "text": "to"
    },
    {
      "start": 7.859,
      "end": 9.182,
      "text": "be"
    },
    {
      "start": 9.182,
      "end": 10.405,
      "text": "aligned"
    },
    {
      "start": 10.405,
      "end": 11.728,
      "text": "with"
    },
    {
      "start": 11.728,
      "end": 13.051,
      "text": "the"
    },
    {
      "start": 13.051,
      "end": 14.374,
      "text": "audio."
    }
  ]
}
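
A common next step is turning these segments into subtitles. The following sketch converts a `segments` list in the format shown above into SRT text. The helper names `format_srt_time` and `segments_to_srt` are hypothetical and are not provided by the package; the input is a short hand-written sample rather than real alignment output.

```python
import json


def format_srt_time(seconds):
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments):
    """Render one numbered SRT cue per segment (hypothetical helper)."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_srt_time(seg["start"])
        end = format_srt_time(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)


# Sample input in the documented output layout
alignment = json.loads(
    '{"text": "This is", "segments": ['
    '{"start": 0.0, "end": 1.234, "text": "This"},'
    '{"start": 1.234, "end": 2.567, "text": "is"}]}'
)
srt = segments_to_srt(alignment["segments"])
print(srt)
```

Word-level cues are usually too short for comfortable reading, so for subtitles it may be preferable to align with `--split_size "sentence"` before converting.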