DeBERTa: Decoding-enhanced BERT with Disentangled Attention

This repository is the official implementation of DeBERTa: Decoding-enhanced BERT with Disentangled Attention and DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing.

News

March 18, 2023

The DeBERTa V3 paper was accepted by ICLR 2023.
Code for DeBERTa V3 pre-training and continual training has been added. See Language Model for details.

December 8, 2021

DeBERTa-V3-XSmall has been added. With only 22M backbone parameters, about 1/4 of RoBERTa-Base and XLNet-Base, DeBERTa-V3-XSmall significantly outperforms those two models on the MNLI and SQuAD v2.0 tasks (by 1.2% on MNLI-m and by 1.5% EM on SQuAD v2.0). This further demonstrates the efficiency of the DeBERTa V3 models.

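November 16, 2021

The models from our new work, DeBERTa V3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing, are now publicly available on the Hugging Face model hub. The new models are based on the DeBERTa-V2 models and replace MLM with an ELECTRA-style objective plus gradient-disentangled embedding sharing, which further improves model efficiency.
Scripts for fine-tuning DeBERTa V3 models have been added.
Code for the RTD task head has been added.
Documentation for language model pre-training has been added.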

March 31, 2021

The masked language model task has been added.
SuperGLUE tasks have been added.
SiFT code has been added.

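February 3, 2021

DeBERTa v2 code and the 900M and 1.5B models have been released. This includes the 1.5B model used for our SuperGLUE single-model submission, which achieved 89.9 versus the human baseline of 89.8. You can find more details about this submission in our blog.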

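What's new in v2

Vocabulary: In v2 we use a new vocabulary of size 128K built from the training data. Instead of the GPT2 tokenizer, we use a SentencePiece tokenizer.
nGiE (nGram Induced Input Encoding): In v2 we use an additional convolution layer alongside the first transformer layer to better learn the local dependencies of input tokens. We will add more ablation studies on this feature.
Sharing the position projection matrix with the content projection matrix in the attention layer: Based on our previous experiments, we found that this saves parameters without affecting performance.
Applying buckets to encode relative positions: In v2 we use log buckets to encode relative positions, similar to T5.
900M model and 1.5B model: In v2 we scale the model size to 900M and 1.5B, which significantly improves performance on downstream tasks.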
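
To illustrate the log-bucket idea mentioned in the list above, here is a minimal sketch. It is not the repository's implementation: it simply assumes that small offsets keep their own bucket while larger offsets are compressed logarithmically. The function name `log_bucket` and the defaults `bucket_size=256` and `max_position=512` are hypothetical values chosen for illustration.

```python
import math


def log_bucket(relative_pos: int, bucket_size: int = 256, max_position: int = 512) -> int:
    """Map a signed relative position to a signed bucket index.

    Offsets with |relative_pos| <= bucket_size // 2 are kept exactly;
    larger offsets are compressed on a log scale, so distant tokens
    share buckets. All parameter defaults are illustrative assumptions.
    """
    mid = bucket_size // 2
    distance = abs(relative_pos)
    if distance <= mid:
        return relative_pos
    # Scale log(distance) so that distance == max_position - 1 lands
    # on the outermost bucket, mid + (mid - 1).
    scale = math.log(distance / mid) / math.log((max_position - 1) / mid)
    bucket = mid + math.ceil(scale * (mid - 1))
    return bucket if relative_pos > 0 else -bucket


if __name__ == "__main__":
    for offset in (-300, -3, 0, 3, 128, 200, 511):
        print(offset, "->", log_bucket(offset))
```

With these assumed defaults, every offset in the range -511..511 maps to a bucket in -255..255, so the relative-position embedding table stays small even for long sequences.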

December 29, 2020

With the DeBERTa 1.5B model, we surpass the T5 11B model and human performance on the SuperGLUE leaderboard. Code and model will be released soon. Please check out our paper for more details.

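June 13, 2020

We released the pre-trained models, source code, and fine-tuning scripts to reproduce some of the experimental results in the paper. You can follow similar scripts to apply DeBERTa to your own experiments or applications. Pre-training scripts will be released in the next step.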

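Introduction to DeBERTa

DeBERTa (Decoding-enhanced BERT with disentangled attention) improves the BERT and RoBERTa models using two novel techniques. The first is the disentangled attention mechanism, where each word is represented using two vectors that encode its content and position, respectively, and the attention weights among words are computed using disentangled matrices on their contents and relative positions. Second, an enhanced mask decoder is used to replace the output softmax layer to predict the masked tokens for model pre-training. We show that these two techniques significantly improve the efficiency of model pre-training and the performance of downstream tasks.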
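
The disentangled attention described above can be written compactly. The block below is a sketch of the formulation as given in the DeBERTa paper, not something stated in this README; the symbols are assumptions taken from the paper: $Q^c, K^c, V^c$ are content projections, $Q^r, K^r$ are relative-position projections, $d$ is the hidden size per head, and $k$ is the maximum relative distance.

```latex
\[
\tilde{A}_{i,j} =
  \underbrace{Q^c_i {K^c_j}^{\top}}_{\text{content-to-content}}
  + \underbrace{Q^c_i {K^r_{\delta(i,j)}}^{\top}}_{\text{content-to-position}}
  + \underbrace{K^c_j {Q^r_{\delta(j,i)}}^{\top}}_{\text{position-to-content}},
\qquad
H_o = \operatorname{softmax}\!\left(\frac{\tilde{A}}{\sqrt{3d}}\right) V^c
\]

\[
\delta(i,j) =
\begin{cases}
0 & \text{if } i - j \le -k \\
2k - 1 & \text{if } i - j \ge k \\
i - j + k & \text{otherwise}
\end{cases}
\]
```

The attention score is thus a sum of three terms, and the scaling factor uses $3d$ rather than $d$ because three terms contribute to each score.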

Pre-trained Models

Our pre-trained models are packaged into zipped files. You can download them from our releases, or download an individual model via the links below:

Model	Vocabulary (K)	Backbone Parameters (M)	Hidden Size	Layers	Note
V2-XXLarge¹	128	1320	1536	48	128K new SPM vocab
V2-XLarge	128	710	1536	24	128K new SPM vocab
XLarge	50	700	1024	48	Same vocab as RoBERTa
Large	50	350	1024	24	Same vocab as RoBERTa
Base	50	100	768	12	Same vocab as RoBERTa
V2-XXLarge-MNLI	128	1320	1536	48	Fine-tuned with MNLI
V2-XLarge-MNLI	128	710	1536	24	Fine-tuned with MNLI
XLarge-MNLI	50	700	1024	48	Fine-tuned with MNLI
Large-MNLI	50	350	1024	24	Fine-tuned with MNLI
Base-MNLI	50	86	768	12	Fine-tuned with MNLI
DeBERTa-V3-Large²	128	304	1024	24	128K new SPM vocab
DeBERTa-V3-Base²	128	86	768	12	128K new SPM vocab
DeBERTa-V3-Small²	128	44	768	6	128K new SPM vocab
DeBERTa-V3-XSmall²	128	22	384	12	128K new SPM vocab
mDeBERTa-V3-Base²	250	86	768	12	250K new SPM vocab, multilingual model supporting 102 languages

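Notes

¹ This is the model (89.9) that surpassed T5 11B (89.3) and human performance (89.8) on SuperGLUE for the first time. 128K new SPM vocab.
² These V3 DeBERTa models are DeBERTa models pre-trained with an ELECTRA-style objective plus gradient-disentangled embedding sharing, which significantly improves model efficiency.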

Try the model

Read our documentation

Requirements

Linux system, e.g. Ubuntu 18.04 LTS
CUDA 10.0
PyTorch 1.3.0
Python 3.6
bash shell 4.0
curl
docker (optional)
nvidia-docker2 (optional)

There are several ways to try our code,

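Use docker

Docker is the recommended way to run the code, as we have already built every dependency into our docker image bagai/deberta. You can follow the docker official site to install docker on your machine.

To run with docker, make sure your system fulfills the requirements in the list above. Here are the steps to try the GLUE experiments: pull the code, run ./run_docker.sh, and then run the bash commands under /DeBERTa/experiments/glue/.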

Use pip

Pull the code and run pip3 install -r requirements.txt in the root directory of the code, then enter the experiments/glue/ folder and try the bash commands under that folder for the GLUE experiments.

Install as a pip package

pip install DeBERTa

Use DeBERTa in existing code

# To apply DeBERTa to your existing code, you need to make two changes to your code.
# 1. Change your model to consume DeBERTa as the encoder
from DeBERTa import deberta
import torch

class MyModel(torch.nn.Module):
  def __init__(self):
    super().__init__()
    # Your existing model code
    self.deberta = deberta.DeBERTa(pre_trained='base') # Or 'large' 'base-mnli' 'large-mnli' 'xlarge' 'xlarge-mnli' 'xlarge-v2' 'xxlarge-v2'
    # Your existing model code
    # Do initialization as before
    #
    self.deberta.apply_state() # Apply the pre-trained model of DeBERTa at the end of the constructor
    #
  def forward(self, input_ids):
    # The inputs to DeBERTa forward are
    # `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length] with the word token indices in the vocabulary
    # `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token type indices selected in [0, 1].
    #    Type 0 corresponds to a `sentence A` and type 1 corresponds to a `sentence B` token (see BERT paper for more details).
    # `attention_mask`: an optional parameter for input mask or attention mask.
    #   - If it's an input mask, then it will be a torch.LongTensor of shape [batch_size, sequence_length] with indices selected in [0, 1].
    #      It's a mask to be used if the input sequence length is smaller than the max input sequence length in the current batch.
    #      It's the mask that we typically use for attention when a batch has varying length sentences.
    #   - If it's an attention mask, then it will be a torch.LongTensor of shape [batch_size, sequence_length, sequence_length].
    #      In this case, it's a mask indicating which tokens in the sequence should be attended by other tokens in the sequence.
    # `output_all_encoded_layers`: whether to output results of all encoder layers, default, True
    # Call the encoder built in the constructor; [-1] picks the last encoder layer
    # (this assumes the list-of-layers output described above).
    encoding = self.deberta(input_ids)[-1]
    return encoding

# 2. Replace your tokenizer with the tokenizer built into DeBERTa
from DeBERTa import deberta
vocab_path, vocab_type = deberta.load_vocab(pretrained_id='base')
tokenizer = deberta.tokenizers[vocab_type](vocab_path)
# We apply the same schema of special tokens as BERT, e.g. [CLS], [SEP], [MASK]
max_seq_len = 512
tokens = tokenizer.tokenize('Examples input text of DeBERTa')
# Truncate long sequence, leaving room for [CLS] and [SEP]
tokens = tokens[:max_seq_len - 2]
# Add special tokens to the `tokens`
tokens = ['[CLS]'] + tokens + ['[SEP]']
input_ids = tokenizer.convert_tokens_to_ids(tokens)
input_mask = [1] * len(input_ids)
# Pad both lists up to max_seq_len
paddings = max_seq_len - len(input_ids)
input_ids = input_ids + [0] * paddings
input_mask = input_mask + [0] * paddings
# torch.long matches the LongTensor inputs documented in forward()
features = {
  'input_ids': torch.tensor(input_ids, dtype=torch.long),
  'input_mask': torch.tensor(input_mask, dtype=torch.long)
}
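
The single-sentence padding above extends naturally to a batch. The sketch below is a plain-Python illustration that pads already-converted id lists to the longest sequence in the batch and builds the matching 0/1 input mask. The helper `pad_batch` and the example ids are hypothetical and not part of the DeBERTa package; sequences are assumed to be truncated and wrapped in special tokens beforehand.

```python
from typing import List, Tuple


def pad_batch(batch_ids: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad id sequences to a common length and return (input_ids, input_mask)."""
    longest = max((len(ids) for ids in batch_ids), default=0)
    input_ids, input_mask = [], []
    for ids in batch_ids:
        padding = longest - len(ids)
        input_ids.append(ids + [pad_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        input_mask.append([1] * len(ids) + [0] * padding)
    return input_ids, input_mask


if __name__ == "__main__":
    ids, mask = pad_batch([[1, 7, 8, 2], [1, 9, 2]])
    print(ids)   # [[1, 7, 8, 2], [1, 9, 2, 0]]
    print(mask)  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding to the longest sequence in the batch, rather than always to 512, keeps short batches cheap; the resulting lists can be passed to `torch.tensor` in the same way as in the example above.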

Run DeBERTa experiments from the command line

For GLUE tasks,

Get the data

cache_dir=/tmp/DeBERTa/
cd experiments/glue
./download_data.sh $cache_dir/glue_tasks

Run the task

task=STS-B
OUTPUT=/tmp/DeBERTa/exps/$task
export OMP_NUM_THREADS=1
python3 -m DeBERTa.apps.run --task_name $task --do_train \
  --data_dir $cache_dir/glue_tasks/$task \
  --eval_batch_size 128 \
  --predict_batch_size 128 \
  --output_dir $OUTPUT \
  --scale_steps 250 \
  --loss_scale 16384 \
  --accumulative_update 1 \
  --num_train_epochs 6 \
  --warmup 100 \
  --learning_rate 2e-5 \
  --train_batch_size 32 \
  --max_seq_len 128
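
If you prefer to launch several GLUE tasks from Python, the command above can be assembled programmatically. This is only a sketch that reuses the flags shown above; the helper name `build_command` and its path arguments are illustrative assumptions, and it builds the argument list without running anything.

```python
import shlex
from typing import List


def build_command(task: str, cache_dir: str = "/tmp/DeBERTa",
                  output_root: str = "/tmp/DeBERTa/exps") -> List[str]:
    """Return the argv list for one fine-tuning run (flags copied from the command above)."""
    return [
        "python3", "-m", "DeBERTa.apps.run",
        "--task_name", task,
        "--do_train",
        "--data_dir", f"{cache_dir}/glue_tasks/{task}",
        "--eval_batch_size", "128",
        "--predict_batch_size", "128",
        "--output_dir", f"{output_root}/{task}",
        "--scale_steps", "250",
        "--loss_scale", "16384",
        "--accumulative_update", "1",
        "--num_train_epochs", "6",
        "--warmup", "100",
        "--learning_rate", "2e-5",
        "--train_batch_size", "32",
        "--max_seq_len", "128",
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run(...) to execute it.
    print(shlex.join(build_command("STS-B")))
```

Passing the arguments as a list avoids shell quoting problems, and `OMP_NUM_THREADS=1` can be supplied through the `env` argument of `subprocess.run` if you execute it.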

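Notes

1. By default, the pre-trained model and tokenizer are cached at $HOME/.~DeBERTa. You may need to clean this directory if a download fails unexpectedly.
2. You can also try our models with HF Transformers. However, when you use the XXLarge model you need to specify the --sharded_ddp argument. Please check the XXLarge model card for more details.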

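Experiments

Our fine-tuning experiments were carried out on half of a DGX-2 node, i.e. 8 V100 GPUs with 32 GB of memory each. Results may vary with the GPU model, driver, CUDA SDK version, use of FP16 or FP32, and random seed. We report numbers based on multiple runs with different random seeds. Here are the results from the Large models: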

Task	Command	Results	Running Time (8x32G V100 GPUs)
MNLI xxlarge v2	`experiments/glue/mnli.sh xxlarge-v2`	91.7/91.9 +/-0.1	4h
MNLI xlarge v2	`experiments/glue/mnli.sh xlarge-v2`	91.7/91.6 +/-0.1	2.5h
MNLI xlarge	`experiments/glue/mnli.sh xlarge`	91.5/91.2 +/-0.1	2.5h
MNLI large	`experiments/glue/mnli.sh large`	91.3/91.1 +/-0.1	2.5h
QQP large	`experiments/glue/qqp.sh large`	92.3 +/-0.1	6h
QNLI large	`experiments/glue/qnli.sh large`	95.3 +/-0.2	2h
MRPC large	`experiments/glue/mrpc.sh large`	91.9 +/-0.5	0.5h
RTE large	`experiments/glue/rte.sh large`	86.6 +/-1.0	0.5h
SST-2 large	`experiments/glue/sst2.sh large`	96.7 +/-0.3	1h
STS-b large	`experiments/glue/Stsb.sh large`	92.5 +/-0.3	0.5h
CoLA large	`experiments/glue/cola.sh`	70.5 +/-1.0	0.5h

Here are the results from the Base model:

Task	Command	Results	Running Time (8x32G V100 GPUs)
MNLI base	`experiments/glue/mnli.sh base`	88.8/88.5 +/-0.2	1.5h
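
The `+/-` values in the tables summarize several runs with different seeds. The exact statistic behind them is not stated here, so the following is only a sketch that assumes a mean and a sample standard deviation. The helper `summarize_runs` is hypothetical, and the scores in the example are made-up placeholders, not results from this repository.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize_runs(scores: Sequence[float]) -> str:
    """Format per-seed scores as 'mean +/-spread' with one decimal place."""
    # A single run has no spread to report, so fall back to 0.0.
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return f"{mean(scores):.1f} +/-{spread:.1f}"


if __name__ == "__main__":
    # Hypothetical dev accuracies from five seeds (placeholders only).
    print(summarize_runs([90.0, 91.0, 92.0, 91.0, 91.0]))  # 91.0 +/-0.7
```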

Fine-tuning on NLU tasks

We present the dev results on SQuAD 1.1/2.0 and several GLUE benchmark tasks.

Model	SQuAD 1.1	SQuAD 2.0	MNLI-m/mm	SST-2	QNLI	CoLA	RTE	MRPC	QQP	STS-B
	F1/EM	F1/EM	Acc	Acc	Acc	MCC	Acc	Acc/F1	Acc/F1	P/S
BERT-Large	90.9/84.1	81.8/79.0	86.6/-	93.2	92.3	60.6	70.4	88.0/-	91.3/-	90.0/-
RoBERTa-Large	94.6/88.9	89.4/86.5	90.2/-	96.4	93.9	68.0	86.6	90.9/-	92.2/-	92.4/-
XLNet-Large	95.1/89.7	90.6/87.9	90.8/-	97.0	94.9	69.0	85.9	90.8/-	92.3/-	92.5/-
DeBERTa-Large¹	95.5/90.1	90.7/88.0	91.3/91.1	96.5	95.3	69.5	91.0	92.6/94.6	92.3/-	92.8/92.5
DeBERTa-XLarge¹	-/-	-/-	91.5/91.2	97.0	-	-	93.1	92.1/94.3	-	92.9/92.7
DeBERTa-V2-XLarge¹	95.8/90.8	91.4/88.9	91.7/91.6	97.5	95.8	71.1	93.9	92.0/94.2	92.3/89.8	92.9/92.9
DeBERTa-V2-XXLarge¹,²	96.1/91.4	92.2/89.7	91.7/91.9	97.2	96.0	72.0	93.5	93.1/94.9	92.7/90.3	93.2/93.1
DeBERTa-V3-Large	-/-	91.5/89.0	91.8/91.9	96.9	96.0	75.3	92.7	92.2/-	93.0/-	93.0/-
DeBERTa-V3-Base	-/-	88.4/85.4	90.6/90.7	-	-	-	-	-	-	-
DeBERTa-V3-Small	-/-	82.9/80.4	88.3/87.7	-	-	-	-	-	-	-
DeBERTa-V3-XSmall	-/-	84.8/82.0	88.1/88.3	-	-	-	-	-	-	-
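
CoLA is scored with the Matthews correlation coefficient (MCC) rather than accuracy. As a reminder of what that column measures, here is a small self-contained sketch for binary labels. It illustrates the standard formula and is not the evaluation code used by the repository; the table reports the value multiplied by 100.

```python
import math
from typing import Sequence


def matthews_corrcoef(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Conventional value when a row or column of the confusion matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


if __name__ == "__main__":
    print(matthews_corrcoef([1, 0, 1, 0], [1, 0, 1, 0]))  # 1.0 for perfect predictions
    print(matthews_corrcoef([1, 1, 0, 0], [1, 0, 0, 0]))
```

Unlike accuracy, MCC stays near 0 for a classifier that always predicts the majority class, which is why it suits imbalanced tasks.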

Fine-tuning on XNLI

We present the dev results on XNLI in the zero-shot cross-lingual transfer setting, i.e. training with English data only and testing on other languages.

Model	avg	en	fr	es	de	el	bg	ru	tr	ar	vi	th	zh	hi	sw	ur
XLM-R-base	76.2	85.8	79.7	80.7	78.7	77.5	79.6	78.1	74.2	73.8	76.5	74.6	76.7	72.4	66.5	68.3
mDeBERTa-V3-Base	79.8 +/-0.2	88.2	82.6	84.4	82.7	82.3	82.4	80.8	79.5	78.5	78.1	76.4	79.5	75.9	73.9	72.4
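
The `avg` column appears to be the unweighted mean over the 15 languages. A quick check with the numbers from the table, under that assumption (the language codes listed are the usual XNLI ones and the helper name is illustrative):

```python
from statistics import mean
from typing import Sequence

LANGUAGES = ["en", "fr", "es", "de", "el", "bg", "ru", "tr",
             "ar", "vi", "th", "zh", "hi", "sw", "ur"]

# Per-language dev accuracy copied from the table above.
XNLI_DEV = {
    "XLM-R-base": [85.8, 79.7, 80.7, 78.7, 77.5, 79.6, 78.1, 74.2,
                   73.8, 76.5, 74.6, 76.7, 72.4, 66.5, 68.3],
    "mDeBERTa-V3-Base": [88.2, 82.6, 84.4, 82.7, 82.3, 82.4, 80.8, 79.5,
                         78.5, 78.1, 76.4, 79.5, 75.9, 73.9, 72.4],
}


def average_accuracy(scores: Sequence[float]) -> float:
    """Unweighted mean over languages, rounded to one decimal place."""
    return round(mean(scores), 1)


if __name__ == "__main__":
    for model, scores in XNLI_DEV.items():
        assert len(scores) == len(LANGUAGES)
        print(model, average_accuracy(scores))
```

Both rows reproduce the reported averages (76.2 and 79.8), which supports reading `avg` as a plain mean across languages.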

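Notes.

¹ Following RoBERTa, for RTE, MRPC, and STS-B we fine-tune the tasks starting from DeBERTa-Large-MNLI, DeBERTa-XLarge-MNLI, DeBERTa-V2-XLarge-MNLI, and DeBERTa-V2-XXLarge-MNLI. The results on SST-2/QQP/QNLI/SQuADv2 also improve slightly when starting from MNLI fine-tuned models; however, for those four tasks we only report the numbers fine-tuned from the pre-trained base models.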

Pre-training with MLM and RTD objectives

To pre-train DeBERTa with MLM and RTD objectives, please check experiments/language_models.
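
For orientation, replaced token detection (RTD) in the ELECTRA style trains a discriminator to decide, for each position, whether the token was replaced by a generator. The sketch below shows only how such per-token labels can be derived. It is a generic illustration with hypothetical names and toy tokens, not the code under experiments/language_models.

```python
from typing import List, Sequence


def rtd_labels(original: Sequence[str], corrupted: Sequence[str]) -> List[int]:
    """Return 1 where the corrupted token differs from the original, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    # A generator guess that happens to match the original counts as "not replaced".
    return [int(o != c) for o, c in zip(original, corrupted)]


if __name__ == "__main__":
    original = ["the", "chef", "cooked", "the", "meal"]
    # Pretend a generator filled two masked positions; one guess is wrong.
    corrupted = ["the", "chef", "ate", "the", "meal"]
    print(rtd_labels(original, corrupted))  # [0, 0, 1, 0, 0]
```

Because every position receives a label, the discriminator gets a training signal from all tokens rather than only the masked ones, which is the usual argument for the efficiency of this objective.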

Contacts

Pengcheng He, Xiaodong Liu, Jianfeng Gao, Weizhu Chen

Citation

@misc{he2021debertav3,
      title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
      author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
      year={2021},
      eprint={2111.09543},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@inproceedings{
he2021deberta,
title={DeBERTa: Decoding-enhanced BERT with Disentangled Attention},
author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
booktitle={International Conference on Learning Representations},
year={2021},
url={https://openreview.net/forum?id=XPZIaotutsD}
}

Additional information

Version: 1.0.0
Type: AI source code
Updated: 2024-12-31
Size: 50MB
Source: GitHub