DeBERTa: Decoding-enhanced BERT with Disentangled Attention

This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Disentangled Attention and DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing.

News

3/18/2023

The DeBERTa V3 paper was accepted at ICLR 2023.
Added code for DeBERTa V3 pre-training and continual training. See Language Model for details.

12/8/2021

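Added DeBERTa-V3-XSmall. With only 22M backbone parameters, a quarter of RoBERTa-Base and XLNet-Base, DeBERTa-V3-XSmall significantly outperforms both on MNLI and SQuAD v2.0 (by 1.2% on MNLI-m and by 1.5% EM on SQuAD v2.0). This further demonstrates the efficiency of the DeBERTa V3 models.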

11/16/2021

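The models from our new work, DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, are now publicly available on the Hugging Face model hub. The new models are based on the DeBERTa-V2 models, replace MLM with an ELECTRA-style objective, and add gradient-disentangled embedding sharing, which further improves model efficiency.
Added scripts for fine-tuning DeBERTa V3 models
Added code for the RTD task head
Added documentation for language model pre-training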

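3/31/2021

Added the masked language model task
Added SuperGLUE tasks
Added SiFT code

2/3/2021

The DeBERTa v2 code and the 900M and 1.5B models are now available. This includes the 1.5B model used for our SuperGLUE single-model submission, which scored 89.9 against the human baseline of 89.8. You can find more details about this submission in our blog.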

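What's new in v2

Vocabulary: In v2, we use a new vocabulary of size 128K built from the training data. Instead of the GPT2 tokenizer, we use the sentencepiece tokenizer.
nGiE (nGram Induced Input Encoding): In v2, we use an additional convolution layer alongside the first transformer layer to better learn the local dependencies of input tokens. We will add more ablation studies on this feature.
Sharing the position projection matrix with the content projection matrix in the attention layer: Based on our previous experiments, this saves parameters without affecting performance.
Applying buckets to encode relative positions: In v2, we use log buckets to encode relative positions, similar to T5.
900M and 1.5B models: In v2, we scale the model size to 900M and 1.5B, which significantly improves performance on downstream tasks.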
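
The log-bucket idea from the list above can be sketched roughly as follows. This is an illustrative sketch and not the repository's code: the function name `log_bucket`, the default `bucket_size` and `max_position` values, and the clamping of very long distances are all assumptions made for the example. Small distances keep their exact value, while larger distances are grouped into buckets whose width grows logarithmically.

```python
import math


def log_bucket(relative_pos: int, bucket_size: int = 256, max_position: int = 512) -> int:
    """Map a signed relative position to a signed bucket index (illustrative only)."""
    mid = bucket_size // 2
    # Distances smaller than half the bucket count keep one bucket each.
    if -mid < relative_pos < mid:
        return relative_pos
    sign = 1 if relative_pos > 0 else -1
    # Assumption: distances beyond max_position - 1 share the last bucket.
    abs_pos = min(abs(relative_pos), max_position - 1)
    # Spread the range [mid, max_position - 1] over the remaining buckets on a log scale.
    scaled = math.log(abs_pos / mid) / math.log((max_position - 1) / mid)
    return sign * (math.ceil(scaled * (mid - 1)) + mid)


for distance in (3, 127, 128, 200, 511, 5000):
    print(distance, log_bucket(distance))
```

With these assumed defaults, every distance maps into the range -255..255, so the relative-position embedding table stays small even for long sequences.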

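12/29/2020

With the DeBERTa 1.5B model, we surpass the T5 11B model and human performance on the SuperGLUE leaderboard. Code and models will be released soon. Please see our paper for more details.

6/13/2020

We released the pre-trained models, source code, and fine-tuning scripts to reproduce some of the experimental results in the paper. You can follow similar scripts to apply DeBERTa to your own experiments or applications. Pre-training scripts will be released in the next step.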

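Introduction to DeBERTa

DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented by two vectors that encode its content and its position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. The second is an enhanced mask decoder, which replaces the output softmax layer to predict the masked tokens during model pre-training. We show that these two techniques significantly improve the efficiency of model pre-training and the performance of downstream tasks.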
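
To make the disentangled attention idea more concrete, here is a minimal pure-Python sketch of how an unnormalized attention score could be assembled from content and relative-position vectors. It follows the three-term decomposition described in the DeBERTa paper (content-to-content, content-to-position, position-to-content), but the function names `delta` and `disentangled_scores`, the list-of-lists layout, and the omission of scaling and softmax are simplifications assumed for this example. It is not the repository's implementation.

```python
def delta(i: int, j: int, k: int) -> int:
    """Clamp the relative distance i - j into the index range [0, 2k - 1]."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def disentangled_scores(qc, kc, qr, kr, k):
    """Unnormalized attention scores.

    qc, kc: content query/key vectors, one per token (n x d).
    qr, kr: relative-position query/key vectors (2k x d).
    """
    n = len(qc)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            content_to_content = dot(qc[i], kc[j])
            content_to_position = dot(qc[i], kr[delta(i, j, k)])
            position_to_content = dot(kc[j], qr[delta(j, i, k)])
            scores[i][j] = content_to_content + content_to_position + position_to_content
    return scores


# Toy inputs: two tokens, two dimensions, maximum relative distance k = 2.
qc = kc = [[1.0, 0.0], [0.0, 1.0]]
qr = [[0.0, 0.0]] * 4
kr = [[1.0, 1.0]] * 4
print(disentangled_scores(qc, kc, qr, kr, k=2))
```

In a full model these scores would be scaled and passed through a softmax before weighting the value vectors; that step is left out here to keep the decomposition visible.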

Pre-trained models

Our pre-trained models are packaged into zipped files. You can download them from our releases, or download an individual model via the links below:

Model	Vocabulary (K)	Backbone Parameters (M)	Hidden Size	Layers	Note
V2-XXLarge¹	128	1320	1536	48	128K new SPM vocab
V2-XLarge	128	710	1536	24	128K new SPM vocab
XLarge	50	700	1024	48	Same vocab as RoBERTa
Large	50	350	1024	24	Same vocab as RoBERTa
Base	50	100	768	12	Same vocab as RoBERTa
V2-XXLarge-MNLI	128	1320	1536	48	Fine-tuned with MNLI
V2-XLarge-MNLI	128	710	1536	24	Fine-tuned with MNLI
XLarge-MNLI	50	700	1024	48	Fine-tuned with MNLI
Large-MNLI	50	350	1024	24	Fine-tuned with MNLI
Base-MNLI	50	86	768	12	Fine-tuned with MNLI
DeBERTa-V3-Large²	128	304	1024	24	128K new SPM vocab
DeBERTa-V3-Base²	128	86	768	12	128K new SPM vocab
DeBERTa-V3-Small²	128	44	768	6	128K new SPM vocab
DeBERTa-V3-XSmall²	128	22	384	12	128K new SPM vocab
mDeBERTa-V3-Base²	250	86	768	12	250K new SPM vocab, multilingual model covering 102 languages

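Notes

1 This is the model (89.9) that surpassed T5 11B (89.3) and human performance (89.8) on SuperGLUE for the first time. 128K new SPM vocab.
2 These V3 DeBERTa models are DeBERTa models pre-trained with an ELECTRA-style objective plus gradient-disentangled embedding sharing, which significantly improves model efficiency.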

Try the model

Read our documentation

Requirements

Linux system, e.g. Ubuntu 18.04LTS
CUDA 10.0
pytorch 1.3.0
python 3.6
bash shell 4.0
curl
docker (optional)
nvidia-docker2 (optional)
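
Part of the list above can be checked from Python before installing anything. The following is a small sketch using only the standard library; the helper name `check_requirements` is hypothetical, and it only covers the Python version and the presence of `bash`, `curl`, and `docker` on the PATH. It does not verify tool versions, CUDA, or PyTorch.

```python
import shutil
import sys


def check_requirements(min_python=(3, 6), tools=("bash", "curl", "docker")):
    """Return a dict mapping each requirement name to True (satisfied) or False."""
    report = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        report[tool] = shutil.which(tool) is not None
    return report


for name, ok in check_requirements().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```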

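There are several ways to try our code:

Use docker

Docker is the recommended way to run the code, since we have already built every dependency into our docker image bagai/deberta. You can follow the official docker site to install docker on your machine.

To run with docker, make sure your system meets the requirements listed above. Here are the steps to try the GLUE experiments: pull the code, run ./run_docker.sh, and then run the bash commands under /DeBERTa/experiments/glue/.

Use pip

Pull the code and run pip3 install -r requirements.txt in the root directory of the code, then enter the experiments/glue/ folder and try the bash commands there for the GLUE experiments.

Install as a pip package

pip install deberta

Use DeBERTa in existing code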

# To apply DeBERTa to your existing code, you need to make two changes to your code,
# 1. change your model to consume DeBERTa as the encoder
from DeBERTa import deberta
import torch
class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # Your existing model code
    self.deberta = deberta.DeBERTa(pre_trained='base') # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
    # Your existing model code
    # do initialization as before
    #
    self.deberta.apply_state() # Apply the pre-trained model of DeBERTa at the end of the constructor
    #
  def forward(self, input_ids):
    # The inputs to DeBERTa forward are
    # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
    # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token type indices selected in [0, 1].
    #    Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
    # `attention_mask`: an optional parameter for input mask or attention mask.
    #   - If it's an input mask, then it will be torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1].
    #      It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch.
    #      It's the mask that we typically use for attention when a batch has varying length sentences.
    #   - If it's an attention mask then it will be torch.LongTensor of shape [batch_size, sequence_length, sequence_length].
    #      In this case, it's a mask indicating which tokens in the sequence should be attended by other tokens in the sequence.
    # `output_all_encoded_layers`: whether to output results of all encoder layers, default, True
    # Call the encoder through the attribute created in __init__ and keep the last
    # element of its output (assumed here to be the final encoder layer).
    encoding = self.deberta(input_ids)[-1]
    return encoding

# 2. Change your tokenizer with the tokenizer built into DeBERTa
from DeBERTa import deberta
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
# We apply the same schema of special tokens as BERT, e.g. [CLS], [SEP], [MASK]
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence
tokens = tokens[:max_seq_len - 2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
# padding
paddings = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * paddings
input_mask = input_mask + [0] * paddings
features = {
  'input_ids': torch.tensor(input_ids, dtype=torch.int),
  'input_mask': torch.tensor(input_mask, dtype=torch.int)
}
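
The truncate, add-special-tokens, and pad steps in the tokenizer example above can be wrapped in one helper so they are easy to reuse and test. The sketch below works on plain lists of token ids and needs neither the tokenizer nor torch. The function name `build_features` and the id values in the usage line are hypothetical; in real code the `[CLS]` and `[SEP]` ids would come from the tokenizer.

```python
def build_features(token_ids, cls_id, sep_id, max_seq_len=512, pad_id=0):
    """Truncate, add special tokens, and pad a list of token ids."""
    # Leave room for the [CLS] and [SEP] tokens.
    body = token_ids[: max_seq_len - 2]
    input_ids = [cls_id] + body + [sep_id]
    # 1 marks real tokens, 0 marks padding.
    input_mask = [1] * len(input_ids)
    paddings = max_seq_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * paddings,
        "input_mask": input_mask + [0] * paddings,
    }


# Hypothetical ids: 1 stands in for [CLS], 2 for [SEP].
features = build_features([5, 6, 7], cls_id=1, sep_id=2, max_seq_len=8)
print(features)
```

The returned lists always have length `max_seq_len`, so they can be passed to `torch.tensor` in the same way as in the example above.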

Run DeBERTa experiments from the command line

For GLUE tasks,

Get the data

cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks

Run the task

task=STS-B
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128
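
If the run is launched from a Python script rather than a shell, the same command can be assembled as an argument list. This is a sketch only: the helper name `build_glue_command` is hypothetical, the flags and values are copied from the command above, and whether these hyperparameters suit tasks other than STS-B is not something the command above establishes.

```python
import os
import shlex


def build_glue_command(task, cache_dir="/tmp/DeBERTa/"):
    """Build the argv list for one GLUE fine-tuning run (flags taken from the command above)."""
    options = {
        "--task_name": task,
        "--data_dir": os.path.join(cache_dir, "glue_tasks", task),
        "--eval_batch_size": 128,
        "--predict_batch_size": 128,
        "--output_dir": os.path.join(cache_dir, "exps", task),
        "--scale_steps": 250,
        "--loss_scale": 16384,
        "--accumulative_update": 1,
        "--num_train_epochs": 6,
        "--warmup": 100,
        "--learning_rate": "2e-5",
        "--train_batch_size": 32,
        "--max_seq_len": 128,
    }
    argv = ["python3", "-m", "DeBERTa.apps.run", "--do_train"]
    for flag, value in options.items():
        argv += [flag, str(value)]
    return argv


# Print the command; pass the list to subprocess.run(...) to actually launch it.
print(shlex.join(build_glue_command("STS-B")))
```

Keeping the arguments in a list avoids shell quoting problems; remember to set `OMP_NUM_THREADS=1` in the environment as the shell version does.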

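Notes

1. By default, we cache the pre-trained model and tokenizer at $HOME/.~DeBERTa; you may need to clean it up if a download fails unexpectedly.
2. You can also try our models with HF Transformers. When you try the XXLarge model, however, you need to specify the --sharded_ddp argument. Please see our XXLarge model card for more details.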

Experiments

Our fine-tuning experiments are carried out on half a DGX-2 node with 8x32G V100 GPU cards. The results may vary with the GPU model, drivers, CUDA SDK version, use of FP16 or FP32, and random seeds. The numbers reported here are based on multiple runs with different random seeds. Here are the results from the Large model:

Task	Command	Results	Running Time (8x32G V100 GPUs)
MNLI xxlarge v2	`experiments/glue/mnli.sh xxlarge-v2`	91.7/91.9+/-0.1	4h
MNLI xlarge v2	`experiments/glue/mnli.sh xlarge-v2`	91.7/91.6+/-0.1	2.5h
MNLI xlarge	`experiments/glue/mnli.sh xlarge`	91.5/91.2+/-0.1	2.5h
MNLI large	`experiments/glue/mnli.sh large`	91.3/91.1+/-0.1	2.5h
QQP large	`experiments/glue/qqp.sh large`	92.3+/-0.1	6h
QNLI large	`experiments/glue/qnli.sh large`	95.3+/-0.2	2h
MRPC large	`experiments/glue/mrpc.sh large`	91.9+/-0.5	0.5h
RTE large	`experiments/glue/rte.sh large`	86.6+/-1.0	0.5h
SST-2 large	`experiments/glue/sst2.sh large`	96.7+/-0.3	1h
STS-b large	`experiments/glue/Stsb.sh large`	92.5+/-0.3	0.5h
CoLA large	`experiments/glue/cola.sh`	70.5+/-1.0	0.5h

And here are the results from the Base model:

Task	Command	Results	Running Time (8x32G V100 GPUs)
MNLI base	`experiments/glue/mnli.sh base`	88.8/88.5+/-0.2	1.5h
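
When reproducing these numbers over several seeds, a result in the same `mean+/-spread` form can be computed with the standard library. The sketch below assumes the spread is the sample standard deviation, which the tables above do not state. The helper name `summarize_runs` and the three scores are hypothetical and are not results from this repository.

```python
import statistics


def summarize_runs(scores):
    """Format scores from several seeds as 'mean+/-stdev' with one decimal place."""
    mean = statistics.mean(scores)
    # A single run has no spread; report 0.0 in that case.
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return f"{mean:.1f}+/-{spread:.1f}"


# Hypothetical accuracies from three random seeds, for illustration only.
print(summarize_runs([91.2, 91.3, 91.4]))
```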

Fine-tuning on NLU tasks

We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

Model	SQuAD 1.1	SQuAD 2.0	MNLI-m/mm	SST-2	QNLI	CoLA	RTE	MRPC	QQP	STS-B
	F1/EM	F1/EM	Acc	Acc	Acc	MCC	Acc	Acc/F1	Acc/F1	P/S
BERT-Large	90.9/84.1	81.8/79.0	86.6/-	93.2	92.3	60.6	70.4	88.0/-	91.3/-	90.0/-
RoBERTa-Large	94.6/88.9	89.4/86.5	90.2/-	96.4	93.9	68.0	86.6	90.9/-	92.2/-	92.4/-
XLNet-Large	95.1/89.7	90.6/87.9	90.8/-	97.0	94.9	69.0	85.9	90.8/-	92.3/-	92.5/-
DeBERTa-Large¹	95.5/90.1	90.7/88.0	91.3/91.1	96.5	95.3	69.5	91.0	92.6/94.6	92.3/-	92.8/92.5
DeBERTa-XLarge¹	-/-	-/-	91.5/91.2	97.0	-	-	93.1	92.1/94.3	-	92.9/92.7
DeBERTa-V2-XLarge¹	95.8/90.8	91.4/88.9	91.7/91.6	97.5	95.8	71.1	93.9	92.0/94.2	92.3/89.8	92.9/92.9
DeBERTa-V2-XXLarge¹,²	96.1/91.4	92.2/89.7	91.7/91.9	97.2	96.0	72.0	93.5	93.1/94.9	92.7/90.3	93.2/93.1
DeBERTa-V3-Large	-/-	91.5/89.0	91.8/91.9	96.9	96.0	75.3	92.7	92.2/-	93.0/-	93.0/-
DeBERTa-V3-Base	-/-	88.4/85.4	90.6/90.7	-	-	-	-	-	-	-
DeBERTa-V3-Small	-/-	82.9/80.4	88.3/87.7	-	-	-	-	-	-	-
DeBERTa-V3-XSmall	-/-	84.8/82.0	88.1/88.3	-	-	-	-	-	-	-

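Fine-tuning on XNLI

We present the dev results on XNLI in the zero-shot cross-lingual transfer setting, i.e. training with English data only and testing on other languages.

Model	avg	en	fr	es	de	el	bg	ru	tr	ar	vi	th	zh	hi	sw	ur
XLM-R-base	76.2	85.8	79.7	80.7	78.7	77.5	79.6	78.1	74.2	73.8	76.5	74.6	76.7	72.4	66.5	68.3
mDeBERTa-V3-Base	79.8+/-0.2	88.2	82.6	84.4	82.7	82.3	82.4	80.8	79.5	78.5	78.1	76.4	79.5	75.9	73.9	72.4

Notes.

¹ Following RoBERTa, for RTE, MRPC, and STS-B we fine-tune the tasks starting from DeBERTa-Large-MNLI, DeBERTa-XLarge-MNLI, DeBERTa-V2-XLarge-MNLI, and DeBERTa-V2-XXLarge-MNLI. The results for SST-2/QQP/QNLI/SQuADv2 also improve slightly when starting from MNLI fine-tuned models; however, for those 4 tasks we only report the numbers fine-tuned from the pre-trained base models.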
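
The avg column is the unweighted mean of the fifteen per-language scores, which can be checked directly from the rows of the table. The sketch below copies the numbers from the table and assumes the standard XNLI language order (en, fr, es, de, el, bg, ru, tr, ar, vi, th, zh, hi, sw, ur); the variable names are hypothetical.

```python
LANGS = ["en", "fr", "es", "de", "el", "bg", "ru", "tr", "ar", "vi", "th", "zh", "hi", "sw", "ur"]

# Per-language dev accuracies copied from the XNLI table, in LANGS order.
XNLI_DEV = {
    "XLM-R-base": [85.8, 79.7, 80.7, 78.7, 77.5, 79.6, 78.1, 74.2,
                   73.8, 76.5, 74.6, 76.7, 72.4, 66.5, 68.3],
    "mDeBERTa-V3-Base": [88.2, 82.6, 84.4, 82.7, 82.3, 82.4, 80.8, 79.5,
                         78.5, 78.1, 76.4, 79.5, 75.9, 73.9, 72.4],
}


def average(scores):
    """Unweighted mean over languages."""
    return sum(scores) / len(scores)


for model, scores in XNLI_DEV.items():
    assert len(scores) == len(LANGS)
    print(f"{model}: {average(scores):.1f}")
```

Rounded to one decimal place, the computed means match the 76.2 and 79.8 reported in the avg column.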

Pre-training with MLM and RTD objectives

To pre-train DeBERTa with the MLM and RTD objectives, please check experiments/language_models
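
As a rough illustration of what the two objectives ask the model to predict, the sketch below builds training targets for a toy sentence. It is deliberately simplified and is not the code in experiments/language_models: the function names are hypothetical, masking always substitutes `[MASK]` (real MLM recipes often mix in random or unchanged tokens), and the "corrupted" sequence is written by hand rather than sampled from a generator model.

```python
import random


def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """MLM: hide some tokens; the target is the original token at each masked position."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            labels.append(tok)    # the model must recover this token
        else:
            masked.append(tok)
            labels.append(None)   # position is not scored
    return masked, labels


def make_rtd_labels(original, corrupted):
    """RTD: label every position 1 if the token was replaced, else 0."""
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["the", "cat", "sat", "on", "the", "mat"]
# Hand-written stand-in for a generator's output: "cat" was replaced by "dog".
corrupted = ["the", "dog", "sat", "on", "the", "mat"]

print(make_mlm_example(original, mask_prob=0.3))
print(make_rtd_labels(original, corrupted))
```

The contrast to note is that MLM only scores the masked positions, while RTD produces a label for every token in the sequence.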

Contacts

Pengcheng He([email protected]), Xiaodong Liu([email protected]), Jianfeng Gao([email protected]), Weizhu Chen([email protected])

Citation

@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{
he2021deberta,
title={DeBERTa: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}

Additional information

Version: 1.0.0
Type: AI source code
Updated: 2024-12-31
Size: 50MB
Source: GitHub