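FunASR aims to build a bridge between academic research and industrial applications of speech recognition. By supporting the training and fine-tuning of industrial-grade speech recognition models, it lets researchers and developers study and productionize these models more easily and helps the speech recognition ecosystem grow. ASR for Fun!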
pip3 install -U funasr
git clone https://github.com/alibaba/FunASR.git && cd FunASR
pip3 install -e ./
pip3 install -U modelscope huggingface_hub
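FunASR has open-sourced a large number of models pre-trained on industrial data. Under the Model License Agreement, you are free to use, copy, modify, and share FunASR models. Some representative models are listed below; for more, see the Model Zoo.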
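Model Name | Task Details | Training Data | Parameters |
SenseVoiceSmall | multiple speech understanding capabilities, including ASR, ITN, LID, SER, and AED; supports zh, yue, en, ja, ko, and other languages | 300000 hours | 234M |
paraformer-zh | speech recognition, with timestamps, non-streaming | 60000 hours, Mandarin | 220M |
paraformer-zh-streaming | speech recognition, streaming | 60000 hours, Mandarin | 220M |
paraformer-en | speech recognition, without timestamps, non-streaming | 50000 hours, English | 220M |
conformer-en | speech recognition, non-streaming | 50000 hours, English | 220M |
ct-punc | punctuation restoration | 100M, Mandarin and English | 290M |
fsmn-vad | voice activity detection | 5000 hours, Mandarin and English | 0.4M |
fsmn-kws | keyword spotting, streaming | 5000 hours, Mandarin | 0.7M |
fa-zh | timestamp prediction | 5000 hours, Mandarin | 38M |
cam++ | speaker verification/diarization | 5000 hours | 7.2M |
Whisper-large-v3 | speech recognition, with timestamps, non-streaming | multilingual | 1550M |
Whisper-large-v3-turbo | speech recognition, with timestamps, non-streaming | multilingual | 809M |
Qwen-Audio | audio-text multimodal model (pre-training) | multilingual | 8B |
Qwen-Audio-Chat | audio-text multimodal model (chat) | multilingual | 8B |
emotion2vec+large | speech emotion recognition | 40000 hours | 300M |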
funasr ++model=paraformer-zh ++vad_model="fsmn-vad" ++punc_model="ct-punc" ++input=asr_example_zh.wav
Note: The input can be a single audio file or a file list in Kaldi-style wav.scp format, with one `wav_id wav_path` pair per line.
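To make the wav.scp format concrete, here is a minimal sketch that writes and re-reads such a file list. The helper names `write_wav_scp` and `read_wav_scp`, the utterance IDs, and the audio paths are hypothetical and only serve as illustration; the sketch assumes each line holds an ID and a path separated by whitespace.

```python
import tempfile
from pathlib import Path


def write_wav_scp(entries, scp_path):
    """Write (wav_id, wav_path) pairs, one 'wav_id wav_path' line each."""
    lines = [f"{wav_id} {wav_path}" for wav_id, wav_path in entries]
    Path(scp_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def read_wav_scp(scp_path):
    """Parse a wav.scp file back into a {wav_id: wav_path} dict."""
    table = {}
    for line in Path(scp_path).read_text(encoding="utf-8").splitlines():
        if line.strip():
            # Split on the first run of whitespace only, so paths may contain spaces.
            wav_id, wav_path = line.split(maxsplit=1)
            table[wav_id] = wav_path
    return table


# Hypothetical IDs and paths, for illustration only.
entries = [("utt1", "/data/audio/utt1.wav"), ("utt2", "/data/audio/utt2.wav")]

with tempfile.TemporaryDirectory() as tmp_dir:
    scp_file = Path(tmp_dir) / "wav.scp"
    write_wav_scp(entries, scp_file)
    table = read_wav_scp(scp_file)

print(table)
```

A file produced this way would then be passed to the command above through `++input=` in place of the single wav file.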
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

model_dir = "iic/SenseVoiceSmall"

model = AutoModel(
    model=model_dir,
    vad_model="fsmn-vad",
    vad_kwargs={"max_single_segment_time": 30000},
    device="cuda:0",
)

# en
res = model.generate(
    input=f"{model.model_path}/example/en.mp3",
    cache={},
    language="auto",  # "zh", "en", "yue", "ja", "ko", "nospeech"
    use_itn=True,
    batch_size_s=60,
    merge_vad=True,
    merge_length_s=15,
)
text = rich_transcription_postprocess(res[0]["text"])
print(text)
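vad_model: enables VAD (voice activity detection). VAD splits long audio into shorter clips. With VAD enabled, the measured inference time covers both VAD and SenseVoice and represents the end-to-end latency. To measure the inference time of the SenseVoice model alone, disable the VAD model.
vad_kwargs: the configuration of the VAD model. max_single_segment_time is the maximum duration of a single segment produced by the VAD model, in milliseconds (ms).
batch_size_s: enables dynamic batching, where the total duration of the audio in a batch is measured in seconds (s).
merge_vad: whether to merge the short audio segments produced by the VAD model; the merged length is merge_length_s, in seconds (s).
ban_emo_unk: whether to suppress the output of the emo_unk token.
from funasr import AutoModel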
# paraformer-zh is a multi-functional ASR model
# use vad, punc, spk or not as you need
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    # spk_model="cam++",
)
res = model.generate(
    input=f"{model.model_path}/example/asr_example.wav",
    batch_size_s=300,
    hotword="魔搭",
)
print(res)
Note: hub selects the model hub. ms downloads the model from ModelScope, and hf downloads it from Hugging Face.
import os

import soundfile
from funasr import AutoModel

chunk_size = [0, 10, 5]  # [0, 10, 5] 600ms, [0, 8, 4] 480ms
encoder_chunk_look_back = 4  # number of chunks to look back for encoder self-attention
decoder_chunk_look_back = 1  # number of encoder chunks to look back for decoder cross-attention

model = AutoModel(model="paraformer-zh-streaming")

wav_file = os.path.join(model.model_path, "example/asr_example.wav")
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = chunk_size[1] * 960  # 600ms

cache = {}
# Ceiling division, so that a trailing partial chunk is still processed.
total_chunk_num = (len(speech) - 1) // chunk_stride + 1
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride : (i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(
        input=speech_chunk,
        cache=cache,
        is_final=is_final,
        chunk_size=chunk_size,
        encoder_chunk_look_back=encoder_chunk_look_back,
        decoder_chunk_look_back=decoder_chunk_look_back,
    )
    print(res)
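Note: chunk_size configures the streaming latency. [0, 10, 5] means the text is refreshed every 10 × 60 = 600 ms with 5 × 60 = 300 ms of lookahead, so each inference step takes 600 ms of audio as input (16000 × 0.6 = 9600 samples) and returns the corresponding text. For the last chunk, set is_final=True to force the final words to be output.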
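The arithmetic behind `chunk_stride` and the chunk count in the streaming example can be checked in isolation. The sketch below assumes 16 kHz audio and 60 ms per `chunk_size` unit, both inferred from the `960` factor and the `600ms` comment in the example; the helper names are hypothetical and not part of FunASR.

```python
SAMPLE_RATE = 16000  # assumed: 16 kHz input audio
FRAME_MS = 60        # assumed: one chunk_size unit covers 60 ms


def chunk_stride_samples(chunk_size, sample_rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Number of audio samples fed to the model per streaming step."""
    return chunk_size[1] * frame_ms * sample_rate // 1000


def num_chunks(num_samples, stride):
    """Ceiling division: how many chunks are needed to cover num_samples."""
    return (num_samples - 1) // stride + 1


stride = chunk_stride_samples([0, 10, 5])
print(stride)                     # 9600 samples, i.e. 600 ms at 16 kHz
print(num_chunks(16000, stride))  # 2: one full chunk plus a partial one
print(num_chunks(19200, stride))  # 2: exactly two full chunks, no empty third
```

Counting chunks with `(n - 1) // stride + 1` avoids emitting an extra empty chunk when the audio length is an exact multiple of the stride.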
# Voice activity detection (non-streaming)
from funasr import AutoModel

model = AutoModel(model="fsmn-vad")
wav_file = f"{model.model_path}/example/vad_example.wav"
res = model.generate(input=wav_file)
print(res)
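Note: The output of the VAD model has the form [[beg1, end1], [beg2, end2], ..., [begN, endN]], where begN and endN are the start and end of the N-th valid audio segment, in milliseconds.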
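A segment list in this form is straightforward to post-process. The sketch below assumes the boundaries are in milliseconds and converts them to sample index ranges that can be used to slice the waveform; the helper names and the example segments are hypothetical, not actual model output.

```python
def segments_to_sample_ranges(segments_ms, sample_rate):
    """Convert [[beg_ms, end_ms], ...] into [(start_sample, end_sample), ...]."""
    return [
        (beg * sample_rate // 1000, end * sample_rate // 1000)
        for beg, end in segments_ms
    ]


def total_speech_seconds(segments_ms):
    """Total duration of all detected speech segments, in seconds."""
    return sum(end - beg for beg, end in segments_ms) / 1000.0


# Hypothetical VAD output, for illustration only.
segments = [[70, 2340], [2620, 6200]]

ranges = segments_to_sample_ranges(segments, sample_rate=16000)
print(ranges)                          # [(1120, 37440), (41920, 99200)]
print(total_speech_seconds(segments))  # 5.85
```

Each `(start, end)` pair can then be applied as `speech[start:end]` to an array loaded with `soundfile.read`.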
# Voice activity detection (streaming)
import soundfile
from funasr import AutoModel

chunk_size = 200  # ms
model = AutoModel(model="fsmn-vad")

wav_file = f"{model.model_path}/example/vad_example.wav"
speech, sample_rate = soundfile.read(wav_file)
chunk_stride = int(chunk_size * sample_rate / 1000)

cache = {}
# Ceiling division, so that a trailing partial chunk is still processed.
total_chunk_num = (len(speech) - 1) // chunk_stride + 1
for i in range(total_chunk_num):
    speech_chunk = speech[i * chunk_stride : (i + 1) * chunk_stride]
    is_final = i == total_chunk_num - 1
    res = model.generate(input=speech_chunk, cache=cache, is_final=is_final, chunk_size=chunk_size)
    if len(res[0]["value"]):
        print(res)
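Note: The output of the streaming VAD model takes one of four forms:
[[beg1, end1], [beg2, end2], .., [begN, endN]]: the same as the offline VAD output described above.
[[beg, -1]]: only a start point has been detected.
[[-1, end]]: only an end point has been detected.
[]: neither a start point nor an end point has been detected.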
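Because a segment may start in one chunk and end in a later one, the per-chunk outputs usually have to be stitched together. The following is a minimal sketch of one way to do that; it assumes `-1` marks a boundary that has not been seen yet and that a chunk with no detected boundary yields an empty list. The function name and the sample outputs are hypothetical and not part of FunASR.

```python
def stitch_streaming_segments(chunk_outputs):
    """Combine per-chunk streaming VAD outputs into complete [beg, end] pairs."""
    segments = []
    open_beg = None  # start of a segment whose end has not been seen yet
    for output in chunk_outputs:
        for beg, end in output:
            if beg != -1 and end != -1:
                # Complete segment reported within a single chunk.
                segments.append([beg, end])
            elif end == -1:
                # [[beg, -1]]: only the start point is known so far.
                open_beg = beg
            elif open_beg is not None:
                # [[-1, end]]: closes the segment opened earlier.
                segments.append([open_beg, end])
                open_beg = None
    return segments


# Hypothetical per-chunk outputs, e.g. the successive res[0]["value"] lists.
outputs = [[], [[70, -1]], [], [[-1, 2340]], [[2620, 6200]]]
print(stitch_streaming_segments(outputs))  # [[70, 2340], [2620, 6200]]
```

An end point that arrives without a preceding start point is ignored in this sketch; a real application may prefer to log or raise in that case.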
# Punctuation restoration
from funasr import AutoModel

model = AutoModel(model="ct-punc")
res = model.generate(input="那今天的会就到这里吧 happy new year 明年见")
print(res)
# Timestamp prediction
from funasr import AutoModel

model = AutoModel(model="fa-zh")
wav_file = f"{model.model_path}/example/asr_example.wav"
text_file = f"{model.model_path}/example/text.txt"
res = model.generate(input=(wav_file, text_file), data_type=("sound", "text"))
print(res)
# Speech emotion recognition
from funasr import AutoModel

model = AutoModel(model="emotion2vec_plus_large")
wav_file = f"{model.model_path}/example/test.wav"
res = model.generate(wav_file, output_dir="./outputs", granularity="utterance", extract_embedding=False)
print(res)
funasr-export ++model=paraformer ++quantize=false ++device=cpu
# Export to ONNX from Python
from funasr import AutoModel

model = AutoModel(model="paraformer", device="cpu")
res = model.export(quantize=False)
# pip3 install -U funasr-onnx
import os

from funasr_onnx import Paraformer

model_dir = "damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"
model = Paraformer(model_dir, batch_size=1, quantize=True)

# Expand "~" explicitly; Python does not expand it in plain path strings.
wav_path = [os.path.expanduser("~/.cache/modelscope/hub/damo/speech_paraformer-large_asr_nat-zh-cn-16k-common-vocab8404-pytorch/example/asr_example.wav")]

result = model(wav_path)
print(result)
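FunASR supports deploying pre-trained or further fine-tuned models as services, and several types of service deployment are currently supported.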
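This project is licensed under the MIT License. FunASR also contains various third-party components and some code modified from other repositories under other open-source licenses. Use of the pre-trained models is subject to the model license.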
