BlueBERT

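***** New November 1st, 2020: BlueBERT can be found at Huggingface *****

***** New December 5th, 2019: NCBI_BERT is renamed to BlueBERT *****

***** New July 11th, 2019: preprocessed PubMed texts *****

We uploaded the preprocessed PubMed texts that were used to pre-train the BlueBERT models.

This repository provides the code and models of BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III). Please refer to our paper "Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets" for more details.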

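Pre-trained models and benchmark datasets

The pre-trained BlueBERT weights, vocabulary, and config files can be downloaded from:

BlueBERT-Base, Uncased, PubMed: This model was pre-trained on PubMed abstracts.
BlueBERT-Base, Uncased, PubMed+MIMIC-III: This model was pre-trained on PubMed abstracts and MIMIC-III.
BlueBERT-Large, Uncased, PubMed: This model was pre-trained on PubMed abstracts.
BlueBERT-Large, Uncased, PubMed+MIMIC-III: This model was pre-trained on PubMed abstracts and MIMIC-III.

The pre-trained weights can also be found at Huggingface: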

https://huggingface.co/bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
https://huggingface.co/bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16
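As a rough illustration of how the four Huggingface checkpoints above might be loaded, here is a minimal sketch. It assumes the `transformers` package is installed and that the checkpoints load through its `AutoTokenizer` and `AutoModel` classes; the helper names `bluebert_model_id` and `load_bluebert` are hypothetical and are not part of this repository.

```python
HF_ORG = "bionlp"

# Layer count -> (hidden size, attention heads), read off the URLs listed above.
DIMS = {12: (768, 12), 24: (1024, 16)}


def bluebert_model_id(corpus="pubmed", layers=12):
    """Build a model id; corpus is "pubmed" or "pubmed_mimic" (hypothetical helper)."""
    hidden, heads = DIMS[layers]
    return f"{HF_ORG}/bluebert_{corpus}_uncased_L-{layers}_H-{hidden}_A-{heads}"


def load_bluebert(model_id):
    """Download tokenizer and encoder weights (assumes `transformers` is installed)."""
    # Imported lazily so that bluebert_model_id works without transformers.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model


print(bluebert_model_id("pubmed_mimic", 24))
```

Calling `load_bluebert(bluebert_model_id())` would then fetch the Base PubMed checkpoint on first use; this needs network access.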

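The benchmark datasets can be downloaded from https://github.com/ncbi-nlp/BLUE_Benchmark

Fine-tuning BlueBERT

We assume the BlueBERT model has been downloaded at $BlueBERT_DIR, and the dataset has been downloaded at $DATASET_DIR.

Add the local directory to $PYTHONPATH if needed.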

export PYTHONPATH=.:$PYTHONPATH

Sentence similarity

python bluebert/run_bluebert_sts.py \
  --task_name='sts' \
  --do_train=true \
  --do_eval=false \
  --do_test=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --num_train_epochs=30.0 \
  --do_lower_case=true \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

Named Entity Recognition

python bluebert/run_bluebert_ner.py \
  --do_prepare=true \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --task_name="bc5cdr" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=30.0 \
  --do_lower_case=true \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

The task name can be

bc5cdr: BC5CDR chemical or disease task
clefe: ShARe/CLEFE task

Relation Extraction

python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="chemprot" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true

The task name can be

chemprot: BC6 ChemProt task
ddi: DDI 2013 task
i2b2_2010: I2B2 2010 task

Document multilabel classification

python bluebert/run_bluebert_multi_labels.py \
  --task_name="hoc" \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --num_classes=20 \
  --num_aspects=10 \
  --aspect_value_list="0,1" \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

Inference task

python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="mednli" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true
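The fine-tuning commands in this section differ mainly in the script they call and the task name they pass. The sketch below shows one way that mapping could be captured when launching several tasks from Python. The task-to-script table is read off the commands above; the helper `build_command` and the example paths are hypothetical and are not part of this repository.

```python
import shlex

# Task name -> fine-tuning script, as used in the commands of this section.
TASK_SCRIPTS = {
    "sts": "bluebert/run_bluebert_sts.py",
    "bc5cdr": "bluebert/run_bluebert_ner.py",
    "clefe": "bluebert/run_bluebert_ner.py",
    "chemprot": "bluebert/run_bluebert.py",
    "ddi": "bluebert/run_bluebert.py",
    "i2b2_2010": "bluebert/run_bluebert.py",
    "mednli": "bluebert/run_bluebert.py",
    "hoc": "bluebert/run_bluebert_multi_labels.py",
}


def build_command(task_name, model_dir, data_dir, output_dir, **extra_flags):
    """Return the argument list for one fine-tuning run (hypothetical helper)."""
    script = TASK_SCRIPTS[task_name]  # raises KeyError for an unknown task
    flags = {
        "task_name": task_name,
        "vocab_file": f"{model_dir}/vocab.txt",
        "bert_config_file": f"{model_dir}/bert_config.json",
        "init_checkpoint": f"{model_dir}/bert_model.ckpt",
        "data_dir": data_dir,
        "output_dir": output_dir,
    }
    flags.update(extra_flags)  # task-specific flags such as num_train_epochs
    return ["python", script] + [f"--{key}={value}" for key, value in flags.items()]


# Example paths are placeholders for $BlueBERT_DIR, $DATASET_DIR and $OUTPUT_DIR.
cmd = build_command(
    "mednli", "/models/bluebert", "/data/mednli", "/out/mednli",
    do_train="true", do_eval="false", do_predict="true",
    num_train_epochs=10.0, do_lower_case="true",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; building a list rather than a shell string avoids quoting problems when paths contain spaces.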

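Preprocessed PubMed texts

We provide the preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains approximately 4,000M words extracted from the PubMed ASCII code version. Other operations include

lowercasing the text
replacing characters outside the ASCII range \x00-\x7F with a space
tokenizing the text using the NLTK Treebank tokenizer

Below is a code snippet with more details.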

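import re
from nltk.tokenize import TreebankWordTokenizer

# `value` is the input text as a string.
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)        # collapse line breaks into one space
value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # replace runs of non-ASCII characters

tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)  # re-attach the possessive 's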
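To try these steps on a single string, they can be wrapped in a function. This is a minimal sketch rather than the script that produced the released corpus: the function name `preprocess` and its `tokenize` parameter are hypothetical, added so the example can run with a plain whitespace split when NLTK is not installed.

```python
import re


def preprocess(value, tokenize=None):
    """Lowercase, strip line breaks and non-ASCII, tokenize, re-attach 's."""
    if tokenize is None:
        # Default to the NLTK Treebank tokenizer named in the text (needs nltk).
        from nltk.tokenize import TreebankWordTokenizer
        tokenize = TreebankWordTokenizer().tokenize

    value = value.lower()
    value = re.sub(r"[\r\n]+", " ", value)        # line breaks -> single space
    value = re.sub(r"[^\x00-\x7F]+", " ", value)  # non-ASCII runs -> single space

    sentence = " ".join(tokenize(value))
    return re.sub(r"\s's\b", "'s", sentence)      # "patient 's" -> "patient's"


# str.split stands in for the Treebank tokenizer in this demonstration only.
print(preprocess("Aspirin\r\nreduces Pain\u00b0", tokenize=str.split))
# prints: aspirin reduces pain
```

Omitting the `tokenize` argument uses the Treebank tokenizer, which also separates punctuation and contractions, so its output will differ from the whitespace split for most real text.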

Pre-training with BERT

Afterwards, we used the following command to generate the pre-training data. Please see https://github.com/google-research/bert for more details.

python bert/create_pretraining_data.py \
  --input_file=pubmed_uncased_sentence_nltk.txt \
  --output_file=pubmed_uncased_sentence_nltk.tfrecord \
  --vocab_file=bert_uncased_L-12_H-768_A-12_vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
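The values of `max_seq_length`, `masked_lm_prob`, and `max_predictions_per_seq` are related: the BERT documentation suggests setting `max_predictions_per_seq` to roughly `max_seq_length * masked_lm_prob`. The sketch below works through that arithmetic for the flags above. The function name is hypothetical, and the clamping rule is our reading of the google-research/bert data-generation script, so treat it as an assumption rather than a specification.

```python
def num_masked_tokens(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Estimate how many tokens get masked in a sequence of `num_tokens` tokens.

    Assumed rule: mask about masked_lm_prob of the tokens, at least one,
    and never more than max_predictions_per_seq.
    """
    wanted = int(round(num_tokens * masked_lm_prob))
    return min(max_predictions_per_seq, max(1, wanted))


# A full-length sequence: 128 * 0.15 = 19.2, which fits under the cap of 20.
print(num_masked_tokens(128))
# A very short sequence still gets one masked token.
print(num_masked_tokens(3))
```

Under this reading, a cap of 20 leaves a small margin above the 19 masked positions expected for a 128-token sequence.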

We used the following command to train the BERT model. Do not include init_checkpoint if you are pre-training from scratch. Please see https://github.com/google-research/bert for more details.

python bert/run_pretraining.py \
  --input_file=pubmed_uncased_sentence_nltk.tfrecord \
  --output_dir=$BlueBERT_DIR \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20000 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
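To make the interaction of `learning_rate`, `num_warmup_steps`, and `num_train_steps` concrete, here is a small sketch of the schedule these flags would produce. It assumes the linear warm-up followed by linear decay to zero that, to our understanding, the google-research/bert optimizer applies; the function name is hypothetical and the schedule should be checked against that repository before relying on it.

```python
def learning_rate_at(step, init_lr=2e-5, num_train_steps=20000, num_warmup_steps=10):
    """Learning rate at a given global step under the assumed schedule."""
    if step < num_warmup_steps:
        # Warm-up: ramp linearly from 0 up to init_lr.
        return init_lr * step / num_warmup_steps
    # Afterwards: decay linearly so the rate reaches 0 at num_train_steps.
    return init_lr * (1.0 - step / num_train_steps)


for step in (0, 5, 10, 10000, 20000):
    print(step, learning_rate_at(step))
```

With only 10 warm-up steps out of 20,000, the rate is close to its peak of 2e-5 almost immediately and then falls steadily for the rest of training.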

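Citing BlueBERT

Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP). 2019.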

 @InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}

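Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center. This work was supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.

We are also grateful to the authors of BERT and ELMo for making their data and code publicly available.

We would like to thank Dr. Sun Kim for processing the PubMed texts.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.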