(简体中文|English)
FunASR aims to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers conduct research on and production of speech recognition models more conveniently and promotes the development of the speech recognition ecosystem. ASR for Fun!
Highlights | News | Installation | Quick Start | Tutorial | Runtime | Model Zoo | Contact
python>=3.8
torch>=1.13
torchaudio
# Install from pip
pip3 install -U funasr
# Or install from source
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
# Optional: required for downloading pre-trained models
pip3 install -U modelscope huggingface_hub
FunASR has open-sourced a large number of pre-trained models trained on industrial data. You are free to use, copy, modify, and share FunASR models under the Model License Agreement. Some representative models are listed below; for more models, please refer to the Model Zoo.
(Note: ⭐ indicates the ModelScope model zoo, 🤗 indicates the Hugging Face model zoo, 🍀 indicates the OpenAI model zoo.)
Model Name | Task Details | Training Data | Parameters |
---|---|---|---|
SenseVoiceSmall (⭐ 🤗) | multiple speech understanding capabilities, including ASR, ITN, LID, SER, and AED; supports languages such as zh, yue, en, ja, ko | 300000 hours | 234M |
paraformer-zh (⭐ 🤗) | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
paraformer-zh-streaming (⭐ 🤗) | speech recognition, streaming | 60000 hours, Mandarin | 220M |
paraformer-en (⭐ 🤗) | speech recognition, without timestamps, non-streaming | 50000 hours, English | 220M |
conformer-en (⭐ 🤗) | speech recognition, non-streaming | 50000 hours, English | 220M |
ct-punc (⭐ 🤗) | punctuation restoration | 100M, Mandarin and English | 290M |
fsmn-vad (⭐ 🤗) | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
fsmn-kws (⭐) | keyword spotting, streaming | 5000 hours, Mandarin | 0.7M |
fa-zh (⭐ 🤗) | timestamp prediction | 5000 hours, Mandarin | 38M |
cam++ (⭐ 🤗) | speaker verification/diarization | 5000 hours | 7.2M |
Whisper-large-v3 (⭐ 🍀) | speech recognition, with timestamps, non-streaming | multilingual | 1550M |
Whisper-large-v3-turbo (⭐ 🍀) | speech recognition, with timestamps, non-streaming | multilingual | 809M |
Qwen-Audio (⭐ 🤗) | audio-text multimodal model (pre-training) | multilingual | 8B |
Qwen-Audio-Chat (⭐ 🤗) | audio-text multimodal model (chat) | multilingual | 8B |
emotion2vec+large (⭐ 🤗) | speech emotion recognition | 40000 hours | 300M |
Below is a quick start tutorial. Test audio files (Mandarin, English).
funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav
Note: Recognition of a single audio file is supported, as well as a Kaldi-style wav.scp file list in the format: wav_id wav_path
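For example, a Kaldi-style list file (the file name `wav.scp` and the paths below are illustrative placeholders) contains one `wav_id wav_path` pair per line and can be passed to the same command through `++input`:

# wav.scp (illustrative content)
utt_001 /path/to/audio_001.wav
utt_002 /path/to/audio_002.wav

funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=wav.scp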
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
Parameter descriptions:
- `model_dir`: the name of the model, or the path to the model on the local disk.
- `vad_model`: enables VAD (Voice Activity Detection). The purpose of VAD is to split long audio into shorter clips. In this case, the inference time includes the total consumption of VAD and SenseVoice, and represents the end-to-end latency. If you want to test the inference time of the SenseVoice model alone, the VAD model can be disabled (see the sketch after this list).
- `vad_kwargs`: configuration of the VAD model; `max_single_segment_time` is the maximum duration of an audio segment produced by `vad_model`, in milliseconds (ms).
- `use_itn`: whether the output includes punctuation and inverse text normalization.
- `batch_size_s`: enables dynamic batching, where the total duration of audio in a batch is measured in seconds (s).
- `merge_vad`: whether to merge the short audio fragments produced by the VAD model, with the merged length given by `merge_length_s`, in seconds (s).
- `ban_emo_unk`: whether to ban the output of the `emo_unk` token.
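As mentioned above, the VAD front-end can be disabled to measure the inference time of the SenseVoice model alone. A minimal sketch (an illustration reusing the same AutoModel API shown above, not an official recipe), suitable for short clips:

from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# Sketch: SenseVoiceSmall without the fsmn-vad front-end
model = AutoModel(model="iic/SenseVoiceSmall", device="cuda:0")
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    language="auto",
    use_itn=True,
)
print(rich_transcription_postprocess(res[0]["text"]))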
from funasr import AutoModel
# paraformer-zh is a multi-functional asr model
# use vad, punc, spk or not as you need
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    # spk_model="cam++",
)
res = model.generate(
    input=f"{model.model_path}/example/asr_example.wav",
    batch_size_s=300,
    hotword='魔搭',
)
print(res)
Note: `hub` selects the model repository; `ms` downloads the model from ModelScope, and `hf` downloads it from Hugging Face.
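For example, a minimal sketch of choosing the download source with `hub` (based on the note above):

from funasr import AutoModel

# hub="ms" downloads the model from ModelScope; hub="hf" downloads it from Hugging Face
model = AutoModel(model="paraformer-zh", hub="ms")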
from funasr import AutoModel
import soundfile
import os

chunk_size = [0, 10, 5]  # [0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600ms

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)
Note: `chunk_size` is the configuration for streaming latency. `[0, 10, 5]` means that the real-time display granularity is 10 * 60 = 600 ms, and the look-ahead information is 5 * 60 = 300 ms. Each inference input is 600 ms (16000 * 0.6 = 9600 sample points), and the output is the corresponding text. For the last speech segment, `is_final=True` needs to be set to force the output of the last word.
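To make the arithmetic above explicit, here is a small helper (an illustration only, not part of the FunASR API) that derives the display granularity, look-ahead, and chunk stride from a `chunk_size` configuration at 16 kHz:

def streaming_latency(chunk_size, sample_rate=16000):
    # Each chunk unit corresponds to 60 ms of audio (960 samples at 16 kHz).
    frame_ms = 60
    granularity_ms = chunk_size[1] * frame_ms  # e.g. 10 * 60 = 600 ms
    lookahead_ms = chunk_size[2] * frame_ms    # e.g. 5 * 60 = 300 ms
    chunk_stride = chunk_size[1] * sample_rate * frame_ms // 1000  # samples per chunk
    return granularity_ms, lookahead_ms, chunk_stride

print(streaming_latency([0, 10, 5]))  # (600, 300, 9600)
print(streaming_latency([0, 8, 4]))   # (480, 240, 7680)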
from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/vad_example.wav"
res = model.generate(input=wav_file)
print(res)
Note: The output format of the VAD model is `[[beg1, end1], [beg2, end2], ..., [begN, endN]]`, where `begN`/`endN` indicate the start/end points of the N-th valid audio segment, measured in milliseconds.
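As an illustration (not part of the original example), the millisecond segments returned above can be used to cut the waveform back into the valid clips:

import soundfile

# Assumes wav_file and res from the offline VAD example above
speech, sample_rate = soundfile.read(wav_file)
for beg_ms, end_ms in res[0]["value"]:
    beg = int(beg_ms * sample_rate / 1000)  # convert ms to a sample index
    end = int(end_ms * sample_rate / 1000)
    segment = speech[beg:end]
    print(f"segment {beg_ms}-{end_ms} ms: {len(segment)} samples")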
from funasr import AutoModel
import soundfile

chunk_size = 200  # ms
model = AutoModel(model="fsmn-vad")

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
total_chunk_num = int((len(speech) - 1) / chunk_stride + 1)
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride:(i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
Note: The output format of the streaming VAD model can be one of four scenarios:
- `[[beg1, end1], [beg2, end2], ..., [begN, endN]]`: the same as the offline VAD output described above.
- `[[beg, -1]]`: only a starting point has been detected.
- `[[-1, end]]`: only an ending point has been detected.
- `[]`: neither a starting point nor an ending point has been detected.

The output is measured in milliseconds and represents absolute time from the starting point.
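A minimal sketch (an assumption for illustration, not part of the FunASR API) of how these four result shapes can be folded into complete segments while streaming:

segments = []    # finished [beg, end] segments, absolute time in ms
open_beg = None  # start of a segment whose end has not been seen yet

def update_segments(vad_value):
    """vad_value is res[0]["value"] from one streaming generate() call."""
    global open_beg
    for beg, end in vad_value:
        if beg != -1 and end != -1:  # complete segment within this chunk
            segments.append([beg, end])
        elif beg != -1:              # [[beg, -1]]: only a start point detected
            open_beg = beg
        elif end != -1:              # [[-1, end]]: only an end point detected
            segments.append([open_beg, end])
            open_beg = None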
from funasr import AutoModel

model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
from funasr import AutoModel

model = AutoModel(model="fa-zh")
wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)
from funasr import AutoModel

model = AutoModel(model="emotion2vec_plus_large")
wav_file = f"{model.model_path}/example/test.wav"
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(res)
For more usage, refer to the docs; for more examples, refer to the demo.
funasr-export ++model=paraformer ++quantize=false ++device=cpu
from funasr import AutoModel

model = AutoModel(model="paraformer", device="cpu")
res = model.export(quantize=False)
# pip3 install -U funasr-onnx
from funasr_onnx import Paraformer

model_dir = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1, quantize=True)

wav_path = ['~/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav']

result = model(wav_path)
print(result)
For more examples, refer to the demo.
FunASR supports deploying pre-trained or further fine-tuned models as services. Several types of service deployment are currently supported; for more detailed information, please refer to the service deployment documentation.
If you encounter problems during use, you can raise an issue directly on the GitHub page.
You can also scan the DingTalk group QR code below to join the community group for communication and discussion.
DingTalk Group |
---|
[DingTalk group QR code] |
Contributors can be found in the contributors list.
This project is licensed under the MIT License. FunASR also contains various third-party components and some code modified from other repositories under other open-source licenses. The use of pre-trained models requires compliance with the Model License.
@inproceedings{gao2023funasr,
  author={Zhifu Gao and Zerui Li and Jiaming Wang and Haoneng Luo and Xian Shi and Mengzhe Chen and Yabin Li and Lingyun Zuo and Zhihao Du and Zhangyu Xiao and Shiliang Zhang},
  title={FunASR: A Fundamental End-to-End Speech Recognition Toolkit},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{An2023bat,
  author={Keyu An and Xian Shi and Shiliang Zhang},
  title={BAT: Boundary aware transducer for memory-efficient and low-latency ASR},
  year={2023},
  booktitle={INTERSPEECH},
}
@inproceedings{gao22b_interspeech,
  author={Zhifu Gao and ShiLiang Zhang and Ian McLoughlin and Zhijie Yan},
  title={Paraformer: Fast and Accurate Parallel Transformer for Non-autoregressive End-to-End Speech Recognition},
  year={2022},
  booktitle={Proc. Interspeech 2022},
  pages={2063--2067},
  doi={10.21437/Interspeech.2022-9996}
}
@inproceedings{shi2023seaco,
  author={Xian Shi and Yexin Yang and Zerui Li and Yanni Chen and Zhifu Gao and Shiliang Zhang},
  title={SeACo-Paraformer: A Non-Autoregressive ASR System with Flexible and Effective Hotword Customization Ability},
  year={2023},
  booktitle={ICASSP2024}
}