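FunASR aims to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers carry out research on, and production of, speech recognition models more conveniently, and promotes the development of the speech recognition ecosystem. ASR for Fun!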
python>=3.8
torch>=1.13
torchaudio
pip3 install -U funasr
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
pip3 install -U modelscope huggingface_hub
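FunASR has open-sourced a large number of models pre-trained on industrial data. Under the model license agreement, you are free to use, copy, modify, and share FunASR models. Some representative models are listed below; for more, refer to the Model Zoo.
(Note: ⭐ represents the ModelScope model zoo, 🤗 represents the Huggingface model zoo, 🍀 represents the OpenAI model zoo)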
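Model Name | Task Details | Training Data | Parameters |
---|---|---|---|
SenseVoiceSmall (⭐ 🤗) | multiple speech understanding capabilities, including ASR, ITN, LID, SER, and AED; supports zh, yue, en, ja, ko | 300000 hours | 234M |
paraformer-zh (⭐ 🤗) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
paraformer-zh-streaming (⭐ 🤗) | speech recognition, streaming | 60000 hours, Mandarin | 220M |
paraformer-en (⭐ 🤗) | speech recognition, without timestamps, non-streaming | 50000 hours, English | 220M |
conformer-en (⭐ 🤗) | speech recognition, non-streaming | 50000 hours, English | 220M |
ct-punc (⭐ 🤗) | punctuation restoration | 100M, Mandarin and English | 290M |
fsmn-vad (⭐ 🤗) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
fsmn-kws (⭐) | keyword spotting, streaming | 5000 hours, Mandarin | 0.7M |
fa-zh (⭐ 🤗) | timestamp prediction | 5000 hours, Mandarin | 38M |
cam++ (⭐ 🤗) | speaker verification/diarization | 5000 hours | 7.2M |
Whisper-large-v3 (⭐ 🍀) | speech recognition, with timestamps, non-streaming | multilingual | 1550M |
Whisper-large-v3-turbo (⭐ 🍀) | speech recognition, with timestamps, non-streaming | multilingual | 809M |
Qwen-Audio (⭐ 🤗) | audio-text multimodal model (pretraining) | multilingual | 8B |
Qwen-Audio-Chat (⭐ 🤗) | audio-text multimodal model (chat) | multilingual | 8B |
emotion2vec+large (⭐ 🤗) | speech emotion recognition | 40000 hours | 300M |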
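Below is a quick start tutorial. Test audio files are available (Mandarin, English).
funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav
Note: the input can be a single audio file or a file list in Kaldi-style wav.scp format, with one `wav_id wav_path` entry per line.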
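As an illustration of the file-list format mentioned above, here is a minimal sketch that reads a Kaldi-style wav.scp file into `(wav_id, wav_path)` pairs. The helper name `read_wav_scp` and the IDs and paths are hypothetical, and the sketch only handles plain file paths, not Kaldi's piped-command entries.

```python
from pathlib import Path
import tempfile


def read_wav_scp(scp_path):
    """Parse a Kaldi-style wav.scp file into a list of (wav_id, wav_path) pairs."""
    entries = []
    for line in Path(scp_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first run of whitespace only, so paths containing spaces survive.
        wav_id, wav_path = line.split(maxsplit=1)
        entries.append((wav_id, wav_path))
    return entries


# Hypothetical IDs and paths, written to a temporary file for the demo.
with tempfile.TemporaryDirectory() as tmp:
    scp = Path(tmp) / "wav.scp"
    scp.write_text(
        "utt1 /data/audio/utt1.wav\nutt2 /data/audio/utt2.wav\n",
        encoding="utf-8",
    )
    entries = read_wav_scp(scp)

print(entries)
```

A check like this can be useful for validating a list before handing the wav.scp path to `++input`.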
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

# Load SenseVoiceSmall together with a VAD model that splits long audio
model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
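Parameter description:
- `model_dir`: the name of the model, or the path to the model on the local disk.
- `vad_model`: enables VAD (voice activity detection), which splits long audio into shorter clips. In this case the inference time covers both VAD and SenseVoice and represents the end-to-end latency. To measure the inference time of the SenseVoice model alone, disable the VAD model.
- `vad_kwargs`: the configuration of the VAD model. `max_single_segment_time` is the maximum duration of an audio segment produced by `vad_model`, in milliseconds (ms).
- `use_itn`: whether the output includes punctuation and inverse text normalization.
- `batch_size_s`: enables dynamic batching, where the total duration of the audio in a batch is measured in seconds (s).
- `merge_vad`: whether to merge the short audio fragments produced by the VAD model; the merged length is `merge_length_s`, in seconds (s).
- `ban_emo_unk`: whether to ban the output of the `emo_unk` token.

from funasr import AutoModel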
# paraformer-zh is a multi-functional ASR model
# use vad, punc, spk or not as you need
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    # spk_model="cam++",
)
res = model.generate(
    input=f"{model.model_path}/example/asr_example.wav",
    batch_size_s=300,
    hotword="魔搭",
)
print(res)
Note: `hub` selects the model hub: `ms` downloads from ModelScope, and `hf` downloads from Huggingface.
import os

import soundfile
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600ms

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1  # flush the last word on the final chunk
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)
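Note: `chunk_size` configures the streaming latency. `[0, 10, 5]` means that the real-time display granularity is 10*60=600ms and the lookahead is 5*60=300ms. Each inference call takes 600ms of audio as input (16000*0.6=9600 sample points) and outputs the corresponding text. For the last speech chunk, set `is_final=True` to force the output of the last word.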
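The chunk arithmetic in this note can be checked with a small standalone sketch. It assumes 16 kHz audio and 60 ms per `chunk_size` unit, as the note states; the helper name `streaming_layout` is hypothetical and not part of FunASR. At 16 kHz, 600 ms corresponds to 9600 samples, which matches `chunk_size[1] * 960` in the code.

```python
SAMPLE_RATE = 16000  # Hz, the sampling rate assumed in the note
FRAME_MS = 60        # each chunk_size unit corresponds to 60 ms


def streaming_layout(num_samples, chunk_size):
    """Return (chunk_ms, lookahead_ms, chunk_stride, total_chunks) for a waveform."""
    chunk_ms = chunk_size[1] * FRAME_MS
    lookahead_ms = chunk_size[2] * FRAME_MS
    chunk_stride = chunk_ms * SAMPLE_RATE // 1000  # samples fed per inference call
    total_chunks = (num_samples + chunk_stride - 1) // chunk_stride  # ceiling division
    return chunk_ms, lookahead_ms, chunk_stride, total_chunks


# A hypothetical 5-second recording at 16 kHz.
layout = streaming_layout(num_samples=5 * SAMPLE_RATE, chunk_size=[0, 10, 5])
print(layout)  # (600, 300, 9600, 9)
```

With `chunk_size=[0, 8, 4]` the same helper gives 480 ms chunks, 240 ms of lookahead, and 7680 samples per call.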
from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/vad_example.wav"
res = model.generate(input=wav_file)
print(res)
Note: the output format of the VAD model is `[[beg1, end1], [beg2, end2], ..., [begN, endN]]`, where `begN`/`endN` is the start/end point of the N-th valid audio segment, in milliseconds.
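To make the segment format concrete, the following sketch converts a list of `[beg, end]` pairs in milliseconds into durations and sample index ranges. The segment values are hypothetical, the helper names are not part of FunASR, and the list is assumed to have already been extracted from the model result.

```python
def segment_durations_ms(segments):
    """Duration of each [beg, end] segment, in milliseconds."""
    return [end - beg for beg, end in segments]


def to_sample_ranges(segments, sample_rate=16000):
    """Convert millisecond boundaries into (start, stop) sample indices."""
    return [
        (beg * sample_rate // 1000, end * sample_rate // 1000)
        for beg, end in segments
    ]


# Hypothetical VAD output with two speech segments.
segments = [[70, 2340], [2620, 6200]]

durations = segment_durations_ms(segments)
ranges = to_sample_ranges(segments)
print(durations)  # [2270, 3580]
print(ranges)     # [(1120, 37440), (41920, 99200)]
```

Each `(start, stop)` pair can then be used to slice the waveform array, for example `speech[start:stop]`.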
import soundfile
from funasr import AutoModel

chunk_size = 200  # ms
model = AutoModel(model="fsmn-vad")

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)  # samples per chunk

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    # print only when this chunk produced a start or end point
    if len(res[0]["value"]):
        print(res)
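Note: the output of the streaming VAD model takes one of the following four forms:
- `[[beg1, end1], [beg2, end2], .., [begN, endN]]`: the same as the offline VAD output described above.
- `[[beg, -1]]`: only a start point has been detected.
- `[[-1, end]]`: only an end point has been detected.
- `[]`: neither a start point nor an end point has been detected.

The output is measured in milliseconds and represents absolute time from the start of the audio.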
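Because a start point and its matching end point can arrive in different chunks, a caller usually has to stitch them together. The sketch below shows one way to do that under the four output forms listed above; the function name `merge_streaming_vad` and the sample chunk outputs are hypothetical, and it operates on plain lists rather than on the FunASR result object.

```python
def merge_streaming_vad(chunk_outputs):
    """Combine per-chunk streaming VAD outputs into complete [beg, end] segments.

    Returns the finished segments and the start time of a segment that is
    still open (None if every detected start has been closed).
    """
    segments = []
    open_beg = None  # start of a segment whose end has not arrived yet
    for output in chunk_outputs:
        for beg, end in output:
            if beg != -1 and end != -1:
                segments.append([beg, end])        # complete segment in one chunk
            elif beg != -1:
                open_beg = beg                     # start only: wait for the end
            elif end != -1 and open_beg is not None:
                segments.append([open_beg, end])   # end only: close the open segment
                open_beg = None
    return segments, open_beg


# Hypothetical outputs from seven consecutive chunks.
chunk_outputs = [[], [[70, -1]], [], [[-1, 2340]], [[2620, -1]], [[-1, 6200]], []]
segments, pending = merge_streaming_vad(chunk_outputs)
print(segments)  # [[70, 2340], [2620, 6200]]
print(pending)   # None
```

Since the times are absolute, the merged segments can be compared directly with offline VAD output for the same audio.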
# Punctuation restoration
from funasr import AutoModel

model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)

# Timestamp prediction
from funasr import AutoModel

model = AutoModel(model="fa-zh")
wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)

# Speech emotion recognition
from funasr import AutoModel

model = AutoModel(model="emotion2vec_plus_large")
wav_file = f"{model.model_path}/example/test.wav"
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(res)
For more usage, refer to the docs; for more examples, refer to the demo.
funasr-export ++model=paraformer ++quantize=false ++device=cpu
from funasr import AutoModel

model = AutoModel(model="paraformer", device="cpu")
res = model.export(quantize=False)

# pip3 install -U funasr-onnx
from funasr_onnx import Paraformer

model_dir = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1, quantize=True)

wav_path = ['~/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav']

result = model(wav_path)
print(result)
For more examples, refer to the demo.
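FunASR supports deploying pre-trained or further fine-tuned models as services.
For more details, including the supported types of service deployment, refer to the service deployment documentation.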
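If you run into problems while using FunASR, you can open an issue directly on the GitHub page.
You can also scan the DingTalk group QR code to join the community group for communication and discussion.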
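The contributors can be found in the contributors list.
This project is licensed under the MIT License. FunASR also contains various third-party components and some code modified from other repositories under other open-source licenses. Use of the pre-trained models is subject to the model license.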
@inproceedings{gao2023funasr,
  author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{An2023bat,
  author={Keyu An and Xian Shi and Shiliang Zhang},
  title={BAT: Boundary aware transducer for memory-efficient and low-latency ASR},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{gao22b_interspeech,
  author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},
  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
  year={2022},
  booktitle={Proc. Interspeech 2022},
  pages={2063--2067},
  doi={10.21437/Interspeech.2022-9996}
}
@inproceedings{shi2023seaco,
  author={Xian Shi and Yexin Yang and Zerui Li and Yanni Chen and Zhifu Gao and Shiliang Zhang},
  title={SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability},
  year={2023},
  booktitle={ICASSP2024}
}