BlueBERT

***** New November 1st, 2020: BlueBERT can be found at Hugging Face *****

***** New December 5th, 2019: NCBI_BERT is renamed to BlueBERT *****

***** New July 11th, 2019: preprocessed PubMed texts *****

We uploaded the preprocessed PubMed texts that were used to pre-train the BlueBERT models.

This repository provides the code and models of BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III). For details, please refer to our paper "Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets".

Pre-trained models and benchmark datasets

The pre-trained BlueBERT weights, vocabulary, and configuration files are available for download:

BlueBERT-Base, Uncased, PubMed: This model was pre-trained on PubMed abstracts.
BlueBERT-Base, Uncased, PubMed+MIMIC-III: This model was pre-trained on PubMed abstracts and MIMIC-III.
BlueBERT-Large, Uncased, PubMed: This model was pre-trained on PubMed abstracts.
BlueBERT-Large, Uncased, PubMed+MIMIC-III: This model was pre-trained on PubMed abstracts and MIMIC-III.

The pre-trained weights can also be found at Hugging Face:

https://huggingface.co/bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
https://huggingface.co/bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16
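
As a rough sketch of how the Hugging Face checkpoints above might be used from Python: the helper below only assembles a model identifier from the four URLs listed, and `load_bluebert` assumes that the `transformers` package (with a PyTorch or TensorFlow backend) is installed. Both function names are illustrative and are not part of the BlueBERT repository.

```python
# Hypothetical helpers; only the four model IDs come from the list above.
ARCHITECTURES = {
    "base": "L-12_H-768_A-12",
    "large": "L-24_H-1024_A-16",
}


def bluebert_model_id(size: str = "base", with_mimic: bool = False) -> str:
    """Return the Hugging Face model ID of one of the four BlueBERT checkpoints."""
    corpus = "pubmed_mimic" if with_mimic else "pubmed"
    return f"bionlp/bluebert_{corpus}_uncased_{ARCHITECTURES[size]}"


def load_bluebert(size: str = "base", with_mimic: bool = False):
    """Download tokenizer and encoder (assumes `transformers` is installed)."""
    from transformers import AutoModel, AutoTokenizer  # imported lazily

    model_id = bluebert_model_id(size, with_mimic)
    return AutoTokenizer.from_pretrained(model_id), AutoModel.from_pretrained(model_id)


if __name__ == "__main__":
    print(bluebert_model_id("large", with_mimic=True))
```

Calling `load_bluebert()` needs network access on first use, because the weights are fetched from the model hub.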

The benchmark datasets can be downloaded from https://github.com/ncbi-nlp/BLUE_Benchmark

Fine-tuning BlueBERT

We assume that the BlueBERT model has been downloaded to $BlueBERT_DIR and the dataset has been downloaded to $DATASET_DIR.

Add the local directory to $PYTHONPATH if needed.

export PYTHONPATH=.:$PYTHONPATH
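
Before launching a fine-tuning run, it can help to confirm that the model directory really contains the three files the commands in this section read. The check below is a minimal sketch: the function name is hypothetical, the environment variable name `BlueBERT_DIR` is an assumption, and the prefix match relies on TensorFlow storing a checkpoint such as `bert_model.ckpt` as several files that share that prefix.

```python
import os
from pathlib import Path
from typing import List


def missing_model_files(model_dir: str) -> List[str]:
    """List the expected BlueBERT files that are absent from model_dir."""
    root = Path(model_dir)
    missing = [
        name
        for name in ("vocab.txt", "bert_config.json")
        if not (root / name).is_file()
    ]
    # A checkpoint prefix such as bert_model.ckpt is stored as several files
    # (for example bert_model.ckpt.index), so match by prefix.
    if not list(root.glob("bert_model.ckpt*")):
        missing.append("bert_model.ckpt*")
    return missing


if __name__ == "__main__":
    # Falls back to the current directory when BlueBERT_DIR is unset.
    print(missing_model_files(os.environ.get("BlueBERT_DIR", ".")))
```

An empty list means all three inputs were found.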

Sentence similarity

python bluebert/run_bluebert_sts.py \
  --task_name='sts' \
  --do_train=true \
  --do_eval=false \
  --do_test=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --num_train_epochs=30.0 \
  --do_lower_case=true \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

Named Entity Recognition

python bluebert/run_bluebert_ner.py \
  --do_prepare=true \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --task_name="bc5cdr" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=30.0 \
  --do_lower_case=true \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

The task name can be one of:

bc5cdr: BC5CDR chemical or disease task
clefe: ShARe/CLEFE task

Relation Extraction

python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="chemprot" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true

The task name can be one of:

chemprot: BC6 ChemProt task
ddi: DDI 2013 task
i2b2_2010: I2B2 2010 task

Document multilabel classification

python bluebert/run_bluebert_multi_labels.py \
  --task_name="hoc" \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --num_classes=20 \
  --num_aspects=10 \
  --aspect_value_list="0,1" \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

Inference task

python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="mednli" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true
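
The task names and scripts listed in the fine-tuning sections above can be collected in one lookup table. The sketch below does that and assembles only the flags that every command shares; `finetune_command` is a hypothetical helper, and task-specific flags such as `--do_train` or `--num_train_epochs` are deliberately left out and would still need to be appended.

```python
import shlex
from typing import Dict, List

# Task names and scripts as listed in the fine-tuning sections above.
TASK_TO_SCRIPT: Dict[str, str] = {
    "sts": "bluebert/run_bluebert_sts.py",
    "bc5cdr": "bluebert/run_bluebert_ner.py",
    "clefe": "bluebert/run_bluebert_ner.py",
    "chemprot": "bluebert/run_bluebert.py",
    "ddi": "bluebert/run_bluebert.py",
    "i2b2_2010": "bluebert/run_bluebert.py",
    "mednli": "bluebert/run_bluebert.py",
    "hoc": "bluebert/run_bluebert_multi_labels.py",
}


def finetune_command(task: str, model_dir: str, data_dir: str, output_dir: str) -> List[str]:
    """Build the flags shared by every fine-tuning command (hypothetical helper)."""
    return [
        "python",
        TASK_TO_SCRIPT[task],
        f"--task_name={task}",
        f"--vocab_file={model_dir}/vocab.txt",
        f"--bert_config_file={model_dir}/bert_config.json",
        f"--init_checkpoint={model_dir}/bert_model.ckpt",
        f"--data_dir={data_dir}",
        f"--output_dir={output_dir}",
    ]


if __name__ == "__main__":
    # Print a shell-quoted command line for inspection; nothing is executed.
    print(shlex.join(finetune_command("mednli", "models", "data", "out")))
```

Returning a list rather than a string keeps the result safe to pass to `subprocess.run` without shell quoting issues.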

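Preprocessed PubMed texts

We provide the preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains ~4000M words extracted from the PubMed ASCII code version. Other operations include:

Lowercasing the text
Removing characters outside the ASCII range \x00-\x7F
Tokenizing the text with the NLTK Treebank tokenizer

The code snippet below gives more detail.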

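import re
from nltk.tokenize import TreebankWordTokenizer

# value holds the raw text to preprocess
value = value.lower()                          # lowercase
value = re.sub(r'[\r\n]+', ' ', value)         # collapse line breaks into a space
value = re.sub(r'[^\x00-\x7F]+', ' ', value)   # replace non-ASCII runs with a space

tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)   # reattach the possessive 's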
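
To try the steps above on a single string, they can be wrapped in a function. This is only a sketch: the function name and the `tokenize` parameter are assumptions made so that the example runs without NLTK, and plain whitespace splitting stands in for the Treebank tokenizer in the demonstration.

```python
import re
from typing import Callable, List


def preprocess(value: str, tokenize: Callable[[str], List[str]]) -> str:
    """Apply the lowercasing, ASCII filtering and tokenization steps to one text."""
    value = value.lower()
    value = re.sub(r'[\r\n]+', ' ', value)        # line breaks become one space
    value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # non-ASCII runs become one space
    sentence = ' '.join(tokenize(value))
    return re.sub(r"\s's\b", "'s", sentence)      # reattach the possessive 's


if __name__ == "__main__":
    # str.split is a stand-in tokenizer used only for this demonstration.
    print(preprocess("Crohn 's\r\nDisease café", str.split))
```

With NLTK installed, passing `TreebankWordTokenizer().tokenize` as the second argument should follow the original snippet more closely.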

Pre-training with BERT

Afterwards, we used the following code to generate the pre-training data. Please see https://github.com/google-research/bert for more details.

python bert/create_pretraining_data.py \
  --input_file=pubmed_uncased_sentence_nltk.txt \
  --output_file=pubmed_uncased_sentence_nltk.tfrecord \
  --vocab_file=bert_uncased_L-12_H-768_A-12_vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
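
The values of `max_seq_length`, `masked_lm_prob` and `max_predictions_per_seq` in this command are related: 128 × 0.15 = 19.2, which is just under the cap of 20. The sketch below reproduces the rule that the reference script is understood to apply when choosing how many tokens to mask in a sequence; treat the formula as an assumption and check `create_pretraining_data.py` itself before relying on it. The function name is illustrative.

```python
def masked_positions(num_tokens: int, masked_lm_prob: float = 0.15,
                     max_predictions_per_seq: int = 20) -> int:
    """Number of tokens selected for masked-LM prediction in one sequence."""
    # At least one token is masked, and never more than the configured cap.
    return min(max_predictions_per_seq,
               max(1, int(round(num_tokens * masked_lm_prob))))


if __name__ == "__main__":
    for n in (5, 128, 512):
        print(n, masked_positions(n))
```

For a full 128-token sequence this gives 19 masked positions, so a cap of 20 leaves the masking probability effectively unclipped.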

We used the following code to train the BERT model. Do not include init_checkpoint if you are pre-training from scratch. Please see https://github.com/google-research/bert for more details.

python bert/run_pretraining.py \
  --input_file=pubmed_uncased_sentence_nltk.tfrecord \
  --output_dir=$BlueBERT_DIR \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20000 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
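
To get a feel for what `num_train_steps`, `num_warmup_steps` and `learning_rate` imply, the sketch below computes the learning rate at a given step. It assumes the schedule used by the reference BERT optimizer is a linear warmup followed by a linear decay to zero; that is an assumption about google-research/bert rather than something stated here, and the function name is hypothetical.

```python
def learning_rate_at(step: int, init_lr: float = 2e-5,
                     num_train_steps: int = 20000,
                     num_warmup_steps: int = 10) -> float:
    """Linear warmup followed by linear decay to zero (assumed schedule)."""
    if step < num_warmup_steps:
        # Warmup: scale the initial rate by the fraction of warmup completed.
        return init_lr * step / num_warmup_steps
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)


if __name__ == "__main__":
    for s in (0, 5, 10, 10000, 20000):
        print(s, learning_rate_at(s))
    # Sequences processed over the whole run: batch size times steps.
    print(32 * 20000)
```

With these settings the run processes 32 × 20000 = 640,000 training sequences, and the warmup covers only the first 10 steps.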

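Citing BlueBERT

Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP). 2019.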

 @InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}

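Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center. This work was supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.

We are also grateful to the authors of BERT and ELMo for making their data and code publicly available.

We would like to thank Dr. Sun Kim for processing the PubMed texts.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.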
