xcodec
1.0.0
音频语言模型的统一语义和声学编解码器。
标题:编解码器确实很重要:探索音频语言模型编解码器的语义缺点
作者:叶震、孙培文、雷嘉禾、林红战、谭旭、戴哲琪、孔秋强、陈建一、潘家豪、刘奇峰、郭一克*、薛伟*
您可以轻松应用我们的方法来增强任何现有的声学编解码器:
例如
class Codec ():
def __init__ ( self ):
# Acoustic codec components
self . encoder = Encoder (...) # Acoustic encoder
self . decoder = Decoder (...) # Acoustic decoder
self . quantizer = RVQ (...) # Residual Vector Quantizer (RVQ)
# Adding the semantic module
self . semantic_model = AutoModel . from_pretrained (...) # e.g., Hubert, WavLM
# Adding Projector
self . fc_prior = nn . Linear (...)
self . fc_post1 = nn . Linear (...)
self . fc_post2 = nn . Linear (...)
def forward ( self , x , bw ):
# Encode the input acoustically and semantically
e_acoustic = self . encoder ( x )
e_semantic = self . semantic_model ( x )
# Combine acoustic and semantic features
combined_features = torch . cat ([ e_acoustic , e_semantic ])
# Apply prior transformation
transformed_features = self . fc_prior ( combined_features )
# Quantize the unified semantic and acoustic features
quantized , codes , bandwidth , commit_loss = self . quantizer ( transformed_features , bw )
# Post-process the quantized features
quantized_semantic = self . fc_post1 ( quantized )
quantized_acoustic = self . fc_post2 ( quantized )
# Decode the quantized acoustic features
output = self . decoder ( quantized_acoustic )
def semantic_loss ( self , semantic , quantized_semantic ):
return F . mse_loss ( semantic , quantized_semantic )
欲了解更多详情,请参阅我们的代码。
?链接到 Huggingface 模型中心。
型号名称 | 抱脸 | 配置 | 语义模型 | 领域 | 训练数据 |
---|---|---|---|---|---|
xcodec_hubert_librispeech | ? | ? | ?休伯特基 | 演讲 | 书本演讲 |
xcodec_wavlm_mls(论文中未提及) | ? | ? | ? Wavlm-base-plus | 演讲 | 木林森英语 |
xcodec_wavlm_more_data(论文中未提及) | ? | ? | ? Wavlm-base-plus | 演讲 | MLS 英文+内部数据 |
xcodec_hubert_general_audio | ? | ? | ?Hubert-base-通用音频 | 通用音频 | 20万小时内部数据 |
xcodec_hubert_general_audio_more_data(论文中未提及) | ? | ? | ?Hubert-base-通用音频 | 通用音频 | 数据更均衡 |
要运行推理,首先从 Hugging Face 下载模型和配置。
python inference.py
在config中准备training_file和validation_file。该文件应列出音频文件的路径:
/path/to/your/xxx.wav
/path/to/your/yyy.wav
...
然后:
torchrun --nnodes=1 --nproc-per-node=8 main_launch_vqdp.py
我要特别感谢 Uniaudio 和 DAC 的作者,因为我们的代码库主要借鉴了 Uniaudio 和 DAC。
如果您发现此存储库有帮助,请考虑按以下格式引用:
@article { ye2024codecdoesmatterexploring ,
title = { Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model } ,
author = { Zhen Ye and Peiwen Sun and Jiahe Lei and Hongzhan Lin and Xu Tan and Zheqi Dai and Qiuqiang Kong and Jianyi Chen and Jiahao Pan and Qifeng Liu and Yike Guo and Wei Xue } ,
journal = { arXiv preprint arXiv:2408.17175 } ,
year = { 2024 } ,
}