DeBERTa: Decoding-enhanced BERT with Disentangled Attention

This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Disentangled Attention and DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing.

News

2023/03/18

The DeBERTa V3 paper was accepted at ICLR 2023.
Code for DeBERTa V3 pre-training and continued training has been added. See the language model documentation for details.

2021/12/08

DeBERTa-V3-XSmall has been added. With only 22M backbone parameters, one quarter of RoBERTa-Base and XLNet-Base, DeBERTa-V3-XSmall significantly outperforms both on the MNLI and SQuAD v2.0 tasks (by 1.2% on MNLI-m and by 1.5% EM score on SQuAD v2.0). This further demonstrates the efficiency of the DeBERTa V3 models.

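2021/11/16

The models from our new work, DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, are now publicly available on the Hugging Face model hub. The new models are based on the DeBERTa-V2 model, replace MLM with an ELECTRA-style objective, and add gradient-disentangled embedding sharing, which further improves model efficiency.
Scripts for fine-tuning DeBERTa V3 models have been added.
Code for the RTD task head has been added.
Documentation for language model pre-training has been added.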

2021/03/31

The masked language model task has been added.
SuperGLUE tasks have been added.
SiFT code has been added.

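2021/02/03

The DeBERTa v2 code and the 900M and 1.5B models are now available. This includes the 1.5B model used for our SuperGLUE single-model submission, which achieved 89.9 against the human baseline of 89.8. See our blog post for more details about this submission.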

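What's new in v2

Vocabulary: v2 uses a new vocabulary of size 128K built from the training data. Instead of the GPT2 tokenizer, it uses a sentencepiece tokenizer.
nGiE (nGram Induced Input Encoding): v2 uses an additional convolution layer alongside the first transformer layer to better learn the local dependencies of input tokens. We plan to add more ablation studies on this feature.
Sharing the position projection matrix with the content projection matrix in the attention layer: based on our previous experiments, this saves parameters without affecting performance.
Bucketing to encode relative positions: v2 uses log buckets to encode relative positions, similar to T5.
900M and 1.5B models: v2 scales the model size to 900M and 1.5B, which significantly improves performance on downstream tasks.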
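
The log-bucket idea mentioned in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's code: the function name `log_bucket_position`, the parameters `bucket_size` and `max_position`, and the clamping of very long distances are all choices made for this example.

```python
import math


def log_bucket_position(relative_pos: int, bucket_size: int, max_position: int) -> int:
    """Map a signed relative distance to a signed bucket index.

    Distances strictly inside +/- bucket_size // 2 keep their exact value.
    Larger distances are compressed logarithmically, so far-apart tokens
    share buckets. (Hypothetical helper written for illustration.)
    """
    mid = bucket_size // 2
    if -mid < relative_pos < mid:
        return relative_pos  # short range: one bucket per distance

    sign = 1 if relative_pos > 0 else -1
    # Assumption: distances beyond max_position - 1 are clamped.
    abs_pos = min(abs(relative_pos), max_position - 1)
    # Spread [mid, max_position - 1] over the remaining mid - 1 buckets
    # on a logarithmic scale.
    log_pos = math.ceil(
        math.log(abs_pos / mid) / math.log((max_position - 1) / mid) * (mid - 1)
    ) + mid
    return sign * log_pos
```

With `bucket_size=256` and `max_position=512`, a distance of 5 stays 5, while a distance of 511 falls into bucket 255, so the bucket index never exceeds half the bucket size on either side.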

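2020/12/29

With the DeBERTa 1.5B model, we surpassed the T5 11B model and human performance on the SuperGLUE leaderboard. Code and model will be released soon. See our paper for more details.

2020/06/13

We released the pre-trained models, source code, and fine-tuning scripts to reproduce some of the experimental results in the paper. You can follow similar scripts to apply DeBERTa to your own experiments or applications. Pre-training scripts will be released in the next step.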

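Introduction to DeBERTa

DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. The second is an enhanced mask decoder, which replaces the output softmax layer to predict the masked tokens during model pre-training. We show that these two techniques significantly improve the efficiency of model pre-training and the performance of downstream tasks.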
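
To make the disentangled attention idea concrete, here is a minimal pure-Python sketch of the attention score before softmax, following the three-term form described in the DeBERTa paper (content-to-content, content-to-position, position-to-content). All function and parameter names are hypothetical, the inputs are assumed to be already-projected query/key vectors, and this is not the repository's implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def relative_index(i, j, k):
    """Map the relative distance i - j into [0, 2k), clamping at both ends."""
    d = i - j
    if d <= -k:
        return 0
    if d >= k:
        return 2 * k - 1
    return d + k


def disentangled_scores(q_c, k_c, q_r, k_r, k):
    """Unnormalized attention scores for a sequence of n tokens.

    q_c, k_c: n content query/key vectors of size d.
    q_r, k_r: 2k relative-position query/key vectors of size d.
    """
    n, d = len(q_c), len(q_c[0])
    scale = math.sqrt(3 * d)  # three score terms are summed, hence 3 * d
    scores = []
    for i in range(n):
        row = []
        for j in range(n):
            c2c = dot(q_c[i], k_c[j])                        # content-to-content
            c2p = dot(q_c[i], k_r[relative_index(i, j, k)])  # content-to-position
            p2c = dot(k_c[j], q_r[relative_index(j, i, k)])  # position-to-content
            row.append((c2c + c2p + p2c) / scale)
        scores.append(row)
    return scores
```

A softmax over each row of the returned matrix would then give the attention weights. When the relative-position vectors are all zero, the score reduces to scaled content-only attention.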

Pre-trained Models

Our pre-trained models are packaged into zip files. You can download them from our releases, or download an individual model via the links below.

Model	Vocabulary (K)	Backbone Parameters (M)	Hidden Size	Layers	Note
V2-XXLarge¹	128	1320	1536	48	128K new SPM vocab
V2-XLarge	128	710	1536	24	128K new SPM vocab
XLarge	50	700	1024	48	Same vocab as RoBERTa
Large	50	350	1024	24	Same vocab as RoBERTa
Base	50	100	768	12	Same vocab as RoBERTa
V2-XXLarge-MNLI	128	1320	1536	48	Fine-tuned with MNLI
V2-XLarge-MNLI	128	710	1536	24	Fine-tuned with MNLI
XLarge-MNLI	50	700	1024	48	Fine-tuned with MNLI
Large-MNLI	50	350	1024	24	Fine-tuned with MNLI
Base-MNLI	50	86	768	12	Fine-tuned with MNLI
DeBERTa-V3-Large²	128	304	1024	24	128K new SPM vocab
DeBERTa-V3-Base²	128	86	768	12	128K new SPM vocab
DeBERTa-V3-Small²	128	44	768	6	128K new SPM vocab
DeBERTa-V3-XSmall²	128	22	384	12	128K new SPM vocab
mDeBERTa-V3-Base²	250	86	768	12	250K new SPM vocab, multilingual model with 102 languages

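Notes

¹ This is the model (89.9) that surpassed T5 11B (89.3) and human performance (89.8) on SuperGLUE for the first time. 128K new SPM vocab.
² These V3 DeBERTa models are DeBERTa models pre-trained with an ELECTRA-style objective plus gradient-disentangled embedding sharing, which significantly improves model efficiency.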

Try the models

Read our documentation

Requirements

Linux system, e.g. Ubuntu 18.04 LTS
CUDA 10.0
PyTorch 1.3.0
Python 3.6
bash shell 4.0
curl
docker (optional)
nvidia-docker2 (optional)

There are several ways to try our code:

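Use Docker

Docker is the recommended way to run the code, because all dependencies are already built into our Docker image bagai/deberta. Follow the official Docker site to install Docker on your machine.

To run with Docker, make sure your system fulfills the requirements in the list above. To try the GLUE experiments, pull the code and run ./run_docker.sh; you can then run the bash commands under /DeBERTa/experiments/glue/.

Use pip

Pull the code and run pip3 install -r requirements.txt in the root directory of the code. Then enter the experiments/glue/ folder and try the bash commands in that folder for the GLUE experiments.

Install as a pip package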

pip install DeBERTa

Use DeBERTa in existing code

# To apply DeBERTa to your existing code, you need to make two changes to your code,
# 1. Change your model to consume DeBERTa as the encoder
from DeBERTa import deberta
import torch

class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # Your existing model code
    self.deberta = deberta.DeBERTa(pre_trained='base') # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
    # Your existing model code
    # Do initialization as before
    #
    self.deberta.apply_state() # Apply the pre-trained model of DeBERTa at the end of the constructor
    #
  def forward(self, input_ids):
    # The inputs to DeBERTa forward are
    # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
    # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token type indices selected in [0, 1].
    #    Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
    # `attention_mask`: an optional parameter for input mask or attention mask.
    #   - If it's an input mask, then it will be a torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1].
    #      It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch.
    #      It's the mask that we typically use for attention when a batch has varying length sentences.
    #   - If it's an attention mask, then it will be a torch.LongTensor of shape [batch_size, sequence_length, sequence_length].
    #      In this case, it's a mask indicating which tokens in the sequence should be attended by other tokens in the sequence.
    # `output_all_encoded_layers`: whether to output results of all encoder layers, default, True
    # Take the last encoder layer (assumes the encoder returns per-layer outputs, as this example implies)
    encoding = self.deberta(input_ids)[-1]
    return encoding

# 2. Change your tokenizer to the tokenizer built into DeBERTa
from DeBERTa import deberta
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
# We apply the same schema of special tokens as BERT, e.g. [CLS], [SEP], [MASK]
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence, leaving room for [CLS] and [SEP]
tokens = tokens[:max_seq_len - 2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
# Pad ids and mask up to max_seq_len
paddings = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * paddings
input_mask = input_mask + [0] * paddings
features = {
  'input_ids': torch.tensor(input_ids, dtype=torch.int),
  'input_mask': torch.tensor(input_mask, dtype=torch.int)
}
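
The truncate, wrap, and pad steps in the tokenizer snippet above can be collected into one reusable helper. The sketch below is illustrative only: `build_features`, `token_to_id`, and the toy vocabulary are hypothetical stand-ins, and real token ids would come from the DeBERTa tokenizer rather than a hand-written dictionary.

```python
def build_features(tokens, token_to_id, max_seq_len=512, pad_id=0):
    """Truncate, add [CLS]/[SEP], map tokens to ids, and pad to max_seq_len."""
    tokens = tokens[: max_seq_len - 2]  # leave room for the two special tokens
    tokens = ["[CLS]"] + tokens + ["[SEP]"]
    input_ids = [token_to_id[t] for t in tokens]
    input_mask = [1] * len(input_ids)  # 1 marks real tokens, 0 marks padding
    padding = max_seq_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * padding,
        "input_mask": input_mask + [0] * padding,
    }


# Toy vocabulary, purely hypothetical.
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
features = build_features(["hello", "world"], toy_vocab, max_seq_len=6)
print(features["input_ids"])   # [1, 3, 4, 2, 0, 0]
print(features["input_mask"])  # [1, 1, 1, 1, 0, 0]
```

Keeping the mask alongside the ids lets the model ignore padded positions when a batch contains sentences of different lengths.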

Run DeBERTa experiments from the command line

For GLUE tasks,

Get the data

cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks

Run the task

task=STS-B
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128
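
If you prefer to drive the same run from Python, for example to loop over several tasks, the command can be assembled as an argument list. This is a sketch only: `build_run_command` is a hypothetical helper, it reuses exactly the flags and values shown in the shell command above, and it only builds and prints the command rather than launching it.

```python
import shlex


def build_run_command(task, cache_dir, output_dir):
    """Return the argv list for a fine-tuning run (hypothetical helper)."""
    options = {
        "--data_dir": f"{cache_dir.rstrip('/')}/glue_tasks/{task}",
        "--eval_batch_size": 128,
        "--predict_batch_size": 128,
        "--output_dir": output_dir,
        "--scale_steps": 250,
        "--loss_scale": 16384,
        "--accumulative_update": 1,
        "--num_train_epochs": 6,
        "--warmup": 100,
        "--learning_rate": "2e-5",
        "--train_batch_size": 32,
        "--max_seq_len": 128,
    }
    cmd = ["python3", "-m", "DeBERTa.apps.run", "--task_name", task, "--do_train"]
    for flag, value in options.items():
        cmd += [flag, str(value)]
    return cmd


cmd = build_run_command("STS-B", "/tmp/DeBERTa/", "/tmp/DeBERTa/exps/STS-B")
print(shlex.join(cmd))  # shell-quoted form, handy for logging
```

To actually launch the run, the list could be passed to `subprocess.run(cmd, check=True)` with `OMP_NUM_THREADS=1` set in the environment, mirroring the `export` line in the shell version.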

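Notes

1. By default, the pre-trained model and tokenizer are cached at $HOME/.~DeBERTa. You may need to clean this directory if a download fails unexpectedly.
2. You can also try our models with Hugging Face Transformers. When you try the XXLarge model, however, you need to specify the --sharded_ddp argument. See our XXLarge model card for more details.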

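Experiments

Our fine-tuning experiments are carried out on half of a DGX-2 node with 8x32G V100 GPU cards. Results may vary with GPU model, driver, CUDA SDK version, use of FP16 or FP32, and random seed. We report numbers based on multiple runs with different random seeds. Here are the results for the Large model: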

Task	Command	Results	Running Time (8x32G V100 GPUs)
MNLI xxlarge v2	`experiments/glue/mnli.sh xxlarge-v2`	91.7/91.9 +/-0.1	4h
MNLI xlarge v2	`experiments/glue/mnli.sh xlarge-v2`	91.7/91.6 +/-0.1	2.5h
MNLI xlarge	`experiments/glue/mnli.sh xlarge`	91.5/91.2 +/-0.1	2.5h
MNLI large	`experiments/glue/mnli.sh large`	91.3/91.1 +/-0.1	2.5h
QQP large	`experiments/glue/qqp.sh large`	92.3 +/-0.1	6h
QNLI large	`experiments/glue/qnli.sh large`	95.3 +/-0.2	2h
MRPC large	`experiments/glue/mrpc.sh large`	91.9 +/-0.5	0.5h
RTE large	`experiments/glue/rte.sh large`	86.6 +/-1.0	0.5h
SST-2 large	`experiments/glue/sst2.sh large`	96.7 +/-0.3	1h
STS-b large	`experiments/glue/Stsb.sh large`	92.5 +/-0.3	0.5h
CoLA large	`experiments/glue/cola.sh`	70.5 +/-1.0	0.5h

And here are the results for the Base model:

Task	Command	Results	Running Time (8x32G V100 GPUs)
MNLI base	`experiments/glue/mnli.sh base`	88.8/88.5 +/-0.2	1.5h

Fine-tuning on NLU tasks

We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

Model	SQuAD 1.1	SQuAD 2.0	MNLI-m/mm	SST-2	QNLI	CoLA	RTE	MRPC	QQP	STS-B
	F1/EM	F1/EM	Acc	Acc	Acc	MCC	Acc	Acc/F1	Acc/F1	P/S
BERT-Large	90.9/84.1	81.8/79.0	86.6/-	93.2	92.3	60.6	70.4	88.0/-	91.3/-	90.0/-
RoBERTa-Large	94.6/88.9	89.4/86.5	90.2/-	96.4	93.9	68.0	86.6	90.9/-	92.2/-	92.4/-
XLNet-Large	95.1/89.7	90.6/87.9	90.8/-	97.0	94.9	69.0	85.9	90.8/-	92.3/-	92.5/-
DeBERTa-Large¹	95.5/90.1	90.7/88.0	91.3/91.1	96.5	95.3	69.5	91.0	92.6/94.6	92.3/-	92.8/92.5
DeBERTa-XLarge¹	-/-	-/-	91.5/91.2	97.0	-	-	93.1	92.1/94.3	-	92.9/92.7
DeBERTa-V2-XLarge¹	95.8/90.8	91.4/88.9	91.7/91.6	97.5	95.8	71.1	93.9	92.0/94.2	92.3/89.8	92.9/92.9
DeBERTa-V2-XXLarge¹,²	96.1/91.4	92.2/89.7	91.7/91.9	97.2	96.0	72.0	93.5	93.1/94.9	92.7/90.3	93.2/93.1
DeBERTa-V3-Large	-/-	91.5/89.0	91.8/91.9	96.9	96.0	75.3	92.7	92.2/-	93.0/-	93.0/-
DeBERTa-V3-Base	-/-	88.4/85.4	90.6/90.7	-	-	-	-	-	-	-
DeBERTa-V3-Small	-/-	82.9/80.4	88.3/87.7	-	-	-	-	-	-	-
DeBERTa-V3-XSmall	-/-	84.8/82.0	88.1/88.3	-	-	-	-	-	-	-

Fine-tuning on XNLI

We present the dev results on XNLI in the zero-shot cross-lingual transfer setting, i.e. training with English data only and testing on the other languages.

Model	avg	en	fr	es	de	el	bg	ru	tr	ar	vi	th	zh	hi	sw	ur
XLM-R-base	76.2	85.8	79.7	80.7	78.7	77.5	79.6	78.1	74.2	73.8	76.5	74.6	76.7	72.4	66.5	68.3
mDeBERTa-V3-Base	79.8 +/-0.2	88.2	82.6	84.4	82.7	82.3	82.4	80.8	79.5	78.5	78.1	76.4	79.5	75.9	73.9	72.4
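
As a quick sanity check, the average column can be recomputed from the per-language scores in the table. The snippet below copies the numbers from the two rows above; the language codes are the standard XNLI set in the table's column order, and the variable names are chosen for this example only.

```python
from statistics import mean

LANGS = ["en", "fr", "es", "de", "el", "bg", "ru", "tr",
         "ar", "vi", "th", "zh", "hi", "sw", "ur"]

# Per-language dev accuracy, copied from the table.
XNLI_DEV = {
    "XLM-R-base": [85.8, 79.7, 80.7, 78.7, 77.5, 79.6, 78.1, 74.2,
                   73.8, 76.5, 74.6, 76.7, 72.4, 66.5, 68.3],
    "mDeBERTa-V3-Base": [88.2, 82.6, 84.4, 82.7, 82.3, 82.4, 80.8, 79.5,
                         78.5, 78.1, 76.4, 79.5, 75.9, 73.9, 72.4],
}

for model, scores in XNLI_DEV.items():
    print(f"{model}: avg = {mean(scores):.1f}")
# XLM-R-base: avg = 76.2
# mDeBERTa-V3-Base: avg = 79.8

# Language with the largest gain of mDeBERTa-V3-Base over XLM-R-base.
gains = {
    lang: new - old
    for lang, old, new in zip(LANGS, XNLI_DEV["XLM-R-base"], XNLI_DEV["mDeBERTa-V3-Base"])
}
best_lang = max(gains, key=gains.get)
print(best_lang)  # sw
```

The recomputed averages match the avg column, and the largest per-language improvement is on Swahili (sw), one of the lower-scoring languages for the baseline.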

Notes.

¹ Following RoBERTa, for RTE, MRPC, and STS-B we fine-tune the tasks starting from DeBERTa-Large-MNLI, DeBERTa-XLarge-MNLI, DeBERTa-V2-XLarge-MNLI, and DeBERTa-V2-XXLarge-MNLI. The results for SST-2/QQP/QNLI/SQuADv2 also improve slightly when starting from the MNLI fine-tuned models, but for those four tasks we report only the numbers fine-tuned from the pre-trained base models.

Pre-training with MLM and RTD objectives

To pre-train DeBERTa with MLM and RTD objectives, see experiments/language_models.

Contacts

Pengcheng He ([email protected]), Xiaodong Liu ([email protected]), Jianfeng Gao ([email protected]), Weizhu Chen ([email protected])

Citation

@misc{he2021debertav3,
      title={DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{
he2021deberta,
title={DeBERTa: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}
