PCPM Download - PCPM Source Code Download

PCPM

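A curated corpus of pretrained models: links to pretrained models in NLP and speech, together with their training scripts.

With the rapid progress in NLP, it is becoming easier to bootstrap a machine learning project involving text. Instead of starting from base code, one can now start from a base pretrained model and reach SOTA performance within a few iterations. This repository was built on the view that pretrained models minimize collective human effort and resource cost, and so accelerate development in the field.

The models listed are curated for either PyTorch or TensorFlow because of their wide usage.

Note: pytorch-transformers is an excellent library for quick inference and fine-tuning with many pretrained NLP models. The pretrained models available through it are not included here.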

Contents

Text ML models
Speech-to-text models
Datasets
Hall of Shame
Non-English
Other collections

Text ML

Language models

Name	Link	Trained on	Training script
Transformer-XL	https://github.com/kimiyoung/transformer-xl/tree/master/tf#obtain-and-evaluate-pretrained-sota-models	`enwik8`, `lm1b`, `wt103`, `text8`	https://github.com/kimiyoung/transformer-xl
GPT-2	https://github.com/openai/gpt-2/blob/master/download_model.py	`webtext`	https://github.com/nshepperd/gpt-2/
Adaptive Inputs (fairseq)	https://github.com/pytorch/fairseq/blob/master/examples/language_model/README.md#pre-trained-models	`lm1b`	https://github.com/pytorch/fairseq/blob/master/examples/language_model/README.md

Permutation-based language modeling - XLNet

Name	Link	Trained on	Training script
XLNet	https://github.com/zihangdai/xlnet/#released-models	`booksCorpus` + `English Wikipedia` + `Giga5` + `ClueWeb 2012-B` + `Common Crawl`	https://github.com/zihangdai/xlnet/

Masked language modeling - BERT

Name	Link	Trained on	Training script
RoBERTa	https://github.com/pytorch/fairseq/tree/master/examples/roberta#pre-trained-models	BooksCorpus + CC-News + OpenWebText + CommonCrawl-Stories	https://github.com/huggingface/transformers
BERT	https://github.com/google-research/bert/	BooksCorpus + English Wikipedia	https://github.com/huggingface/transformers
MT-DNN	https://mrc.blob.core.windows.net/mt-dnn-model/mt_dnn_base.pt (https://github.com/namisan/mt-dnn/blob/master/download.sh)	GLUE	https://github.com/namisan/mt-dnn
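
Several entries in this list link directly to a checkpoint file, such as the MT-DNN base model in the row above. As a minimal sketch, assuming plain HTTP(S) access and using only the Python standard library, such a file could be fetched as shown here. The helper names `filename_from_url` and `download_checkpoint` and the `models` target directory are illustrative and do not come from any of the listed projects, and any given URL may no longer be live.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, e.g. 'mt_dnn_base.pt'."""
    name = PurePosixPath(urlparse(url).path).name
    if not name:
        raise ValueError(f"URL has no file name component: {url}")
    return name


def download_checkpoint(url: str, target_dir: str = "models") -> Path:
    """Stream the file at `url` into `target_dir` and return the local path.

    An existing local copy is reused rather than downloaded again.
    """
    target = Path(target_dir) / filename_from_url(url)
    target.parent.mkdir(parents=True, exist_ok=True)
    if target.exists():
        return target
    # Copy in chunks so a large checkpoint is never held fully in memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Example call (hypothetical usage, not executed here):
# download_checkpoint("https://mrc.blob.core.windows.net/mt-dnn-model/mt_dnn_base.pt")
```

Reusing an existing local copy keeps repeated runs cheap, at the cost of not detecting a changed or partially downloaded file; a checksum comparison would be needed for that.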

Machine translation

Name	Link	Trained on	Training script
OpenNMT	http://opennmt.net/Models-py/ (pytorch) http://opennmt.net/Models-tf/ (tensorflow)	English-German	https://github.com/OpenNMT/OpenNMT-py
Fairseq (multiple models)	https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md#pre-trained-models	WMT14 English-French, WMT16 English-German	https://github.com/pytorch/fairseq/blob/master/examples/translation/README.md

Sentiment

Name	Link	Trained on	Training script
Nvidia sentiment-discovery	https://github.com/NVIDIA/sentiment-discovery#pretrained-models	SST, IMDB, Semeval-2018-tweet-emotion	https://github.com/NVIDIA/sentiment-discovery
MT-DNN Sentiment	https://drive.google.com/open?id=1-ld8_WpdQVDjPeYhb3AK8XYLGlZEbs-l	SST	https://github.com/namisan/mt-dnn

Reading comprehension

SQuAD 1.1

Rank	Name	Link	Training script
49	BiDAF	https://s3-us-west-2.amazonaws.com/allennlp/models/bidaf-model-2017.09.15-charpad.tar.gz	https://github.com/allenai/allennlp

Summarization

English summarization models

Name	Link	Trained on	Training script
OpenNMT	http://opennmt.net/Models-py/	Gigaword standard	https://github.com/OpenNMT/OpenNMT-py

Speech to text

Name	Link	Trained on	Training script
NeMo QuartzNet	https://ngc.nvidia.com/catalog/models/nvidia:quartznet15x5	librispeech, mozilla-common-voice	https://github.com/NVIDIA/NeMo
OpenSeq2Seq-Jasper	https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition.html#models	librispeech	https://github.com/NVIDIA/OpenSeq2Seq
ESPnet	https://github.com/espnet/espnet#asr-results	librispeech, Aishell, HKUST, TEDLIUM2	https://github.com/espnet/espnet
wav2letter++	https://talonvoice.com/research/	librispeech	https://github.com/facebookresearch/wav2letter
Deepspeech2 pytorch	SeanNaren/deepspeech.pytorch#299 (comment)	librispeech	https://github.com/SeanNaren/deepspeech.pytorch
DeepSpeech	https://github.com/mozilla/DeepSpeech#getting-the-pre-trained-model	mozilla-common-voice, librispeech, fisher, switchboard	https://github.com/mozilla/DeepSpeech
speech-to-text-wavenet	https://github.com/buriburisuri/speech-to-text-wavenet#pre-trained-models	VCTK	https://github.com/buriburisuri/speech-to-text-wavenet
at16k	https://github.com/at16k/at16k#download-models	NA	NA

Datasets

Datasets referenced in this document

Language model data

Common Crawl

http://commoncrawl.org/

enwik8

Wikipedia data dump (Large Text Compression Benchmark) http://mattmahoney.net/dc/textdata.html

text8

Cleaned Wikipedia text (Large Text Compression Benchmark) http://mattmahoney.net/dc/textdata.html

LM1b

1 Billion Word Language Model Benchmark https://www.statmt.org/lm-benchmark/

wt103

WikiText-103 https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/

WebText

The original dataset was not released by the authors. An open-source collection is available at https://skylion007.github.io/OpenWebTextCorpus/

English Wikipedia

https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

BooksCorpus

https://yknzhu.wixsite.com/mbweb https://github.com/soskek/bookcorpus

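Sentiment

SST

Stanford Sentiment Treebank: https://nlp.stanford.edu/sentiment/index.html. One of the GLUE tasks.

IMDB

IMDB movie review dataset used for sentiment classification http://ai.stanford.edu/~amaas/data/sentiment

Semeval2018te

Semeval 2018 tweet emotion dataset https://competitions.codalab.org/competitions/17751

GLUE

GLUE is a collection of resources for benchmarking natural language systems: https://gluebenchmark.com/. It contains datasets for natural language inference, sentiment classification, paraphrase detection, similarity matching, and linguistic acceptability.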

Speech-to-text data

Fisher

https://pdfs.semanticscholar.org/a723/97679079439b075de815553c7b687ccfa886.pdf

librispeech

www.danielpovey.com/files/2015_icassp_librispeech.pdf

Switchboard

https://ieeexplore.ieee.org/document/225858/

Mozilla Common Voice

https://github.com/mozilla/voice-web

VCTK

https://datashare.is.ed.ac.uk/handle/10283/2651

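Hall of Shame

High-quality research that does not include pretrained models and/or code for public use.

KERMIT https://arxiv.org/abs/1906.01604: generative insertion-based modeling for sequences. No code.

Non-English

Other collections

AllenNLP

Built on PyTorch, AllenNLP has produced SOTA models and open-sourced them. https://github.com/allenai/allennlp/blob/master/MODELS.md

They provide neat interactive demos of various tasks at https://demo.allennlp.org/

GluonNLP

Based on MXNet, this library has an extensive list of pretrained models for various NLP tasks. http://gluon-nlp.mxnet.io/master/index.html#model-zoo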

Additional information

Version: 1.0.0
Type: AI source code
Updated: 2024-12-31
Size: 50MB
Source: GitHub