DeBERTa: Decoding-enhanced BERT with Disentangled Attention

This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Disentangled Attention and DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing.

News

March 18, 2023

The DeBERTa V3 paper was accepted at ICLR 2023.
Added the code for DeBERTa V3 pre-training and continual training. Please check Language Model for details.

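December 8, 2021

Added DeBERTa-V3-XSmall. With only 22M backbone parameters, about 1/4 of RoBERTa-Base and XLNet-Base, DeBERTa-V3-XSmall significantly outperforms both on the MNLI and SQuAD v2.0 tasks (by 1.2% on MNLI-m and by 1.5% EM on SQuAD v2.0). This further demonstrates the efficiency of the DeBERTa V3 models.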

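November 16, 2021

The models of our new work DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing are now publicly available on the Hugging Face model hub. The new models are based on the DeBERTa-V2 models, replacing MLM with an ELECTRA-style objective and adding gradient-disentangled embedding sharing, which further improves model efficiency.
Added scripts for fine-tuning DeBERTa V3 models
Added code for the RTD task head
Added documentation for language model pre-training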

March 31, 2021

Added the masked language model task
Added SuperGLUE tasks
Added SiFT code

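February 3, 2021

The DeBERTa v2 code and the 900M and 1.5B models are now available. This includes the 1.5B model used for our SuperGLUE single-model submission, which scored 89.9 versus the human baseline of 89.8. You can find more details about this submission in our blog.

What's new in v2

Vocabulary: In v2 we use a new vocabulary of size 128K built from the training data. Instead of the GPT2 tokenizer, we use the sentencepiece tokenizer.
nGiE (nGram Induced Input Encoding): In v2 we use an additional convolution layer alongside the first transformer layer to better learn the local dependency of input tokens. We will add more ablation studies on this feature.
Sharing the position projection matrix with the content projection matrix in the attention layer: Based on our previous experiments, this saves parameters without affecting performance.
Applying buckets to encode relative positions: In v2 we use log buckets to encode relative positions, similar to T5.
900M model and 1.5B model: In v2 we scale the model size to 900M and 1.5B, which significantly improves performance on downstream tasks.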
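
The item on log buckets above does not spell out the bucketing formula. As a rough illustration only, the sketch below keeps nearby offsets exact and compresses distant ones logarithmically. The function name `log_bucket`, the parameters `bucket_size` and `max_position`, their default values, and the formula itself are assumptions made for this example; they are not taken from this README, and the repository's code may differ.

```python
import math

def log_bucket(relative_pos: int, bucket_size: int = 256, max_position: int = 512) -> int:
    """Map a signed token offset to a signed bucket id (illustrative assumption)."""
    mid = bucket_size // 2
    distance = abs(relative_pos)
    if distance <= mid:
        # Nearby offsets keep their exact value.
        return relative_pos
    # Distant offsets are squeezed logarithmically into the remaining buckets.
    scale = math.log(distance / mid) / math.log((max_position - 1) / mid)
    bucket = math.ceil(scale * (mid - 1)) + mid
    return bucket if relative_pos > 0 else -bucket

for offset in (1, 100, 128, 200, 511, -300):
    print(offset, log_bucket(offset))
```

With these assumed defaults, offsets up to 128 tokens away each get their own bucket, while the range from 129 to 511 shares the remaining bucket ids, so the number of relative-position embeddings stays fixed as sequences grow.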

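December 29, 2020

With the DeBERTa 1.5B model, we surpass the T5 11B model and human performance on the SuperGLUE leaderboard. Code and model will be released soon. Please check out our paper for more details.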

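June 13, 2020

We released the pre-trained models, source code, and fine-tuning scripts to reproduce some of the experimental results in the paper. You can follow similar scripts to apply DeBERTa to your own experiments or applications. Pre-training scripts will be released in the next step.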

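Introduction to DeBERTa

DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder replaces the output softmax layer to predict the masked tokens during model pre-training. We show that these two techniques significantly improve the efficiency of model pre-training and the performance on downstream tasks.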
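
To make the disentangled attention idea above concrete, here is a minimal pure-Python sketch of the unnormalized attention score as formulated in the DeBERTa paper: a content-to-content term, a content-to-position term, and a position-to-content term, scaled by 1/sqrt(3d). It is an illustration, not the repository's implementation; the helper names (`relative_index`, `disentangled_scores`) and the toy inputs are made up for this example, and the inputs are assumed to be already-projected vectors.

```python
import math

def relative_index(i: int, j: int, k: int) -> int:
    """Map the offset i - j to a bucket in [0, 2k), clipping distant offsets."""
    offset = i - j
    if offset <= -k:
        return 0
    if offset >= k:
        return 2 * k - 1
    return offset + k

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def disentangled_scores(q_content, k_content, q_relative, k_relative, k):
    """Return an n x n matrix of unnormalized attention scores.

    q_content, k_content: one projected vector per token (n x d).
    q_relative, k_relative: one projected vector per relative-position bucket (2k x d).
    """
    n, d = len(q_content), len(q_content[0])
    scale = 1.0 / math.sqrt(3 * d)  # three score terms are summed
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            c2c = dot(q_content[i], k_content[j])
            c2p = dot(q_content[i], k_relative[relative_index(i, j, k)])
            p2c = dot(k_content[j], q_relative[relative_index(j, i, k)])
            scores[i][j] = (c2c + c2p + p2c) * scale
    return scores

# Toy example: 2 tokens, hidden size 1, maximum relative distance k = 1.
toy = disentangled_scores([[1.0], [2.0]], [[3.0], [4.0]],
                          [[100.0], [200.0]], [[10.0], [20.0]], k=1)
print(toy)
```

A softmax over each row of the returned matrix would then give the attention weights. Note that the position-to-content term looks up the relative position from the key's point of view, which is why it uses `relative_index(j, i, k)` rather than `relative_index(i, j, k)`.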

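Pre-trained Models

Our pre-trained models are packaged into zipped files. You can download them from our releases, or download an individual model via the links below: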

Model	Vocabulary (K)	Backbone Parameters (M)	Hidden Size	Layers	Note
V2-XXLarge¹	128	1320	1536	48	128K new SPM vocab
V2-XLarge	128	710	1536	24	128K new SPM vocab
XLarge	50	700	1024	48	Same vocab as RoBERTa
Large	50	350	1024	24	Same vocab as RoBERTa
Base	50	100	768	12	Same vocab as RoBERTa
V2-XXLarge-MNLI	128	1320	1536	48	Fine-tuned with MNLI
V2-XLarge-MNLI	128	710	1536	24	Fine-tuned with MNLI
XLarge-MNLI	50	700	1024	48	Fine-tuned with MNLI
Large-MNLI	50	350	1024	24	Fine-tuned with MNLI
Base-MNLI	50	86	768	12	Fine-tuned with MNLI
DeBERTa-V3-Large²	128	304	1024	24	128K new SPM vocab
DeBERTa-V3-Base²	128	86	768	12	128K new SPM vocab
DeBERTa-V3-Small²	128	44	768	6	128K new SPM vocab
DeBERTa-V3-XSmall²	128	22	384	12	128K new SPM vocab
mDeBERTa-V3-Base²	250	86	768	12	250K new SPM vocab, multilingual model with 102 languages

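Notes

¹ This is the model (89.9) that surpassed T5 11B (89.3) and human performance (89.8) on SuperGLUE for the first time. 128K new SPM vocab.
² These V3 DeBERTa models are DeBERTa models pre-trained with an ELECTRA-style objective plus gradient-disentangled embedding sharing, which significantly improves model efficiency.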

Try the model

Read our documentation

Requirements

Linux system, e.g. Ubuntu 18.04LTS
CUDA 10.0
pytorch 1.3.0
python 3.6
bash shell 4.0
curl
docker (optional)
nvidia-docker2 (optional)

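There are several ways to try our code:

Use docker

Docker is the recommended way to run the code, as we have already built every dependency into our docker image bagai/deberta. You can follow the official docker site to install docker on your machine.

To run with docker, make sure your system fulfills the requirements in the list above. Here are the steps to try the GLUE experiments: pull the code, run ./run_docker.sh, and then run the bash commands under /DeBERTa/experiments/glue/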

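Use pip

Pull the code and run pip3 install -r requirements.txt in the root directory of the code, then enter the experiments/glue/ folder and try the bash commands under that folder for GLUE experiments.

Install as a pip package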

pip install DeBERTa

Use DeBERTa in existing code

# To apply DeBERTa to your existing code, you need to make two changes to your code,
# 1. change your model to consume DeBERTa as the encoder
from DeBERTa import deberta
import torch
class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # Your existing model code
    self.deberta = deberta.DeBERTa(pre_trained='base') # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
    # Your existing model code
    # do initialization as before
    #
    self.deberta.apply_state() # Apply the pre-trained model of DeBERTa at the end of the constructor
    #
  def forward(self, input_ids):
    # The inputs to DeBERTa forward are
    # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
    # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token type indices selected in [0, 1].
    #    Type 0 corresponds to a `sentence A` token and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
    # `attention_mask`: an optional parameter for input mask or attention mask.
    #   - If it's an input mask, then it will be a torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1].
    #      It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch.
    #      It's the mask that we typically use for attention when a batch has varying length sentences.
    #   - If it's an attention mask, then it will be a torch.LongTensor of shape [batch_size, sequence_length, sequence_length].
    #      In this case, it's a mask indicating which tokens in the sequence should be attended by other tokens in the sequence.
    # `output_all_encoded_layers`: whether to output results of all encoder layers, default, True
    encoding = deberta.bert(input_ids)[-1]
    return encoding

# 2. Change your tokenizer with the tokenizer built-in DeBERTa
import torch
from DeBERTa import deberta
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
# We apply the same schema of special tokens as BERT, e.g. [CLS], [SEP], [MASK]
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence, leaving room for [CLS] and [SEP]
tokens = tokens[:max_seq_len - 2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
# Pad ids and mask with zeros up to max_seq_len
paddings = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * paddings
input_mask = input_mask + [0] * paddings
# Use torch.long to match the LongTensor inputs described above
features = {
  'input_ids': torch.tensor(input_ids, dtype=torch.long),
  'input_mask': torch.tensor(input_mask, dtype=torch.long)
}
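
The comments on `attention_mask` above distinguish a per-token input mask of shape [batch_size, sequence_length] from a pairwise attention mask of shape [batch_size, sequence_length, sequence_length]. One plausible way to relate the two, shown here in plain Python for a single sequence, is to let position i attend to position j only when both are real (non-padding) tokens. `expand_input_mask` is a hypothetical helper written for this example, and whether DeBERTa builds its pairwise mask in exactly this way is an assumption.

```python
def expand_input_mask(input_mask):
    """Turn a per-token 0/1 mask of length n into an n x n pairwise mask."""
    return [[query * key for key in input_mask] for query in input_mask]

# Two real tokens followed by one padding position.
pairwise = expand_input_mask([1, 1, 0])
for row in pairwise:
    print(row)
```

Rows and columns that belong to padding come out as all zeros, so padded positions neither attend to other tokens nor receive attention from them.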

Run DeBERTa experiments from the command line

For GLUE tasks,

Get the data

cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks

Run task

task=STS-B
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128

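Notes

1. By default we cache the pre-trained model and tokenizer at $HOME/.~DeBERTa; you may need to clean it up if the download fails unexpectedly.
2. You can also try our models with HF Transformers. When you try the XXLarge model, however, you need to specify the --sharded_ddp argument. Please check our XXLarge model card for more details.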

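Experiments

Our fine-tuning experiments are carried out on half a DGX-2 node with 8 V100 GPUs (32 GB each). Results may vary with the GPU model, driver, CUDA SDK version, use of FP16 or FP32, and random seed. The numbers we report here are based on multiple runs with different random seeds. Here are the results from the Large model: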

Task	Command	Results	Running Time (8x32G V100 GPUs)
MNLI xxlarge v2	`experiments/glue/mnli.sh xxlarge-v2`	91.7/91.9+/-0.1	4h
MNLI xlarge v2	`experiments/glue/mnli.sh xlarge-v2`	91.7/91.6+/-0.1	2.5h
MNLI xlarge	`experiments/glue/mnli.sh xlarge`	91.5/91.2+/-0.1	2.5h
MNLI large	`experiments/glue/mnli.sh large`	91.3/91.1+/-0.1	2.5h
QQP large	`experiments/glue/qqp.sh large`	92.3+/-0.1	6h
QNLI large	`experiments/glue/qnli.sh large`	95.3+/-0.2	2h
MRPC large	`experiments/glue/mrpc.sh large`	91.9+/-0.5	0.5h
RTE large	`experiments/glue/rte.sh large`	86.6+/-1.0	0.5h
SST-2 large	`experiments/glue/sst2.sh large`	96.7+/-0.3	1h
STS-b large	`experiments/glue/Stsb.sh large`	92.5+/-0.3	0.5h
CoLA large	`experiments/glue/cola.sh`	70.5+/-1.0	0.5h

And here are the results from the Base model:

Task	Command	Results	Running Time (8x32G V100 GPUs)
MNLI base	`experiments/glue/mnli.sh base`	88.8/88.5+/-0.2	1.5h
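
Because the results above are aggregated over several runs with different random seeds, it can help to summarize your own runs in the same "score+/-spread" form. The sketch below assumes the spread is the sample standard deviation; the README does not say which statistic the "+/-" values denote. The helper name `summarize_runs` and the scores are hypothetical and are not results from the repository.

```python
import statistics

def summarize_runs(scores):
    """Format per-seed scores as 'mean+/-spread', using the sample standard deviation."""
    mean = statistics.mean(scores)
    spread = statistics.stdev(scores)  # needs at least two runs
    return f"{mean:.1f}+/-{spread:.1f}"

# Hypothetical dev accuracies from three seeds (illustrative numbers only).
print(summarize_runs([91.2, 91.3, 91.4]))
```

Reporting the spread alongside the mean makes it easier to tell a real improvement from seed-to-seed noise, which is noticeably larger on small tasks such as RTE and CoLA in the table above.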

Fine-tuning on NLU tasks

We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

Model	SQuAD 1.1	SQuAD 2.0	MNLI-m/mm	SST-2	QNLI	CoLA	RTE	MRPC	QQP	STS-B
	F1/EM	F1/EM	Acc	Acc	Acc	MCC	Acc	Acc/F1	Acc/F1	P/S
BERT-Large	90.9/84.1	81.8/79.0	86.6/-	93.2	92.3	60.6	70.4	88.0/-	91.3/-	90.0/-
RoBERTa-Large	94.6/88.9	89.4/86.5	90.2/-	96.4	93.9	68.0	86.6	90.9/-	92.2/-	92.4/-
XLNet-Large	95.1/89.7	90.6/87.9	90.8/-	97.0	94.9	69.0	85.9	90.8/-	92.3/-	92.5/-
DeBERTa-Large¹	95.5/90.1	90.7/88.0	91.3/91.1	96.5	95.3	69.5	91.0	92.6/94.6	92.3/-	92.8/92.5
DeBERTa-XLarge¹	-/-	-/-	91.5/91.2	97.0	-	-	93.1	92.1/94.3	-	92.9/92.7
DeBERTa-V2-XLarge¹	95.8/90.8	91.4/88.9	91.7/91.6	97.5	95.8	71.1	93.9	92.0/94.2	92.3/89.8	92.9/92.9
DeBERTa-V2-XXLarge¹,²	96.1/91.4	92.2/89.7	91.7/91.9	97.2	96.0	72.0	93.5	93.1/94.9	92.7/90.3	93.2/93.1
DeBERTa-V3-Large	-/-	91.5/89.0	91.8/91.9	96.9	96.0	75.3	92.7	92.2/-	93.0/-	93.0/-
DeBERTa-V3-Base	-/-	88.4/85.4	90.6/90.7	-	-	-	-	-	-	-
DeBERTa-V3-Small	-/-	82.9/80.4	88.3/87.7	-	-	-	-	-	-	-
DeBERTa-V3-XSmall	-/-	84.8/82.0	88.1/88.3	-	-	-	-	-	-	-

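Fine-tuning on XNLI

We present the dev results on XNLI in the zero-shot cross-lingual transfer setting, i.e. training with English data only and testing on other languages.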

Model	avg	en	fr	es	de	el	bg	ru	tr	ar	vi	th	zh	hi	sw	ur
XLM-R-base	76.2	85.8	79.7	80.7	78.7	77.5	79.6	78.1	74.2	73.8	76.5	74.6	76.7	72.4	66.5	68.3
mDeBERTa-V3-base	79.8+/-0.2	88.2	82.6	84.4	82.7	82.3	82.4	80.8	79.5	78.5	78.1	76.4	79.5	75.9	73.9	72.4

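Notes.

¹ Following RoBERTa, for RTE, MRPC, and STS-B, we fine-tune the tasks based on DeBERTa-Large-MNLI, DeBERTa-XLarge-MNLI, DeBERTa-V2-XLarge-MNLI, and DeBERTa-V2-XXLarge-MNLI. The results of SST-2/QQP/QNLI/SQuADv2 also improve slightly when starting from MNLI fine-tuned models; however, for those 4 tasks we only report the numbers fine-tuned from the pre-trained base models.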

Pre-training with MLM and RTD objectives

To pre-train DeBERTa with MLM and RTD objectives, please check experiments/language_models
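
For readers new to the RTD (replaced token detection) objective used in ELECTRA-style pre-training, the sketch below shows how a training example and its labels could be constructed: some tokens are swapped for sampled ones, and the discriminator learns to predict, for every position, whether the token was replaced. This is a simplified illustration rather than the code in experiments/language_models. A uniform random draw stands in for the generator model, and the helper name `make_rtd_example`, the `replace_prob` value of 0.15, and the toy vocabulary are assumptions made for this example.

```python
import random

def make_rtd_example(tokens, vocab, replace_prob=0.15, seed=0):
    """Return (corrupted_tokens, labels) where label 1 marks a replaced token."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < replace_prob:
            sampled = rng.choice(vocab)  # stand-in for a generator's prediction
        else:
            sampled = token
        corrupted.append(sampled)
        # A sampled token that happens to equal the original counts as "not replaced".
        labels.append(int(sampled != token))
    return corrupted, labels

toy_vocab = ["the", "cat", "dog", "sat", "ran", "on", "mat"]
corrupted, labels = make_rtd_example(["the", "cat", "sat", "on", "the", "mat"], toy_vocab, replace_prob=0.5)
print(corrupted)
print(labels)
```

Unlike MLM, which only produces a training signal at the masked positions, this objective gives a label for every token in the sequence.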

Contacts

Pengcheng He([email protected]), Xiaodong Liu([email protected]), Jianfeng Gao([email protected]), Weizhu Chen([email protected])

Citation

@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{
he2021deberta,
title={DeBERTa: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}
