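BlueBERT

***** New November 1st, 2020: BlueBERT can be found at Hugging Face *****

***** New December 5th, 2019: NCBI_BERT is renamed to BlueBERT *****

***** New July 11th, 2019: preprocessed PubMed texts *****

We uploaded the preprocessed PubMed texts that were used to pre-train the BlueBERT models.

This repository provides code and models for BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III). Please refer to our paper "Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets" for more details.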

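Pre-trained models and benchmark datasets

The pre-trained BlueBERT weights, vocabulary, and config files can be downloaded from:

BlueBERT-Base, Uncased, PubMed: This model was pre-trained on PubMed abstracts.
BlueBERT-Base, Uncased, PubMed+MIMIC-III: This model was pre-trained on PubMed abstracts and MIMIC-III.
BlueBERT-Large, Uncased, PubMed: This model was pre-trained on PubMed abstracts.
BlueBERT-Large, Uncased, PubMed+MIMIC-III: This model was pre-trained on PubMed abstracts and MIMIC-III.

The pre-trained weights can also be found at Hugging Face: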

https://huggingface.co/bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
https://huggingface.co/bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16
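
The four repository names above follow a regular pattern, so they can be built from the model size and the pre-training corpus. The sketch below is illustrative only: `bluebert_model_id` and `load_bluebert` are hypothetical helper names that are not part of this repository, and loading assumes the `transformers` package is installed and the Hugging Face Hub is reachable.

```python
def bluebert_model_id(size="base", with_mimic=False):
    """Return the Hugging Face repository id for one of the four listed models."""
    # Layer / hidden-size / attention-head suffixes as they appear in the URLs above.
    dims = {"base": "L-12_H-768_A-12", "large": "L-24_H-1024_A-16"}
    corpus = "pubmed_mimic" if with_mimic else "pubmed"
    return f"bionlp/bluebert_{corpus}_uncased_{dims[size]}"


def load_bluebert(size="base", with_mimic=False):
    """Download the tokenizer and encoder (hypothetical convenience wrapper)."""
    # Imported lazily so that building ids does not require transformers.
    from transformers import AutoModel, AutoTokenizer

    model_id = bluebert_model_id(size, with_mimic)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    return tokenizer, model
```

Calling `load_bluebert("large", with_mimic=True)` would fetch the PubMed+MIMIC-III large model; the first call downloads the weights and later calls reuse the local cache.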

The benchmark datasets can be downloaded from https://github.com/ncbi-nlp/BLUE_Benchmark

Fine-tuning BlueBERT

We assume the BlueBERT model has been downloaded to $BlueBERT_DIR, and the dataset has been downloaded to $DATASET_DIR.

Add the local directory to $PYTHONPATH if needed.

export PYTHONPATH=.:$PYTHONPATH

Sentence similarity

python bluebert/run_bluebert_sts.py \
  --task_name='sts' \
  --do_train=true \
  --do_eval=false \
  --do_test=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --num_train_epochs=30.0 \
  --do_lower_case=true \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR
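
The fine-tuning commands in this README share the same model and directory flags and differ mainly in the script and task-specific options. When launching several runs, it can be convenient to assemble the argument list in Python. This is a minimal sketch: `build_finetune_command` is a hypothetical helper, not something shipped with the repository, and it only formats flags in the `--name=value` style used above.

```python
def build_finetune_command(script, bluebert_dir, data_dir, output_dir, **flags):
    """Assemble the argv list for one of the fine-tuning scripts."""
    cmd = [
        "python",
        script,
        f"--vocab_file={bluebert_dir}/vocab.txt",
        f"--bert_config_file={bluebert_dir}/bert_config.json",
        f"--init_checkpoint={bluebert_dir}/bert_model.ckpt",
        f"--data_dir={data_dir}",
        f"--output_dir={output_dir}",
    ]
    for name, value in flags.items():
        # The commands in this README spell booleans in lowercase.
        if isinstance(value, bool):
            value = "true" if value else "false"
        cmd.append(f"--{name}={value}")
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace them with real directories.
    sts_cmd = build_finetune_command(
        "bluebert/run_bluebert_sts.py",
        "/path/to/bluebert",
        "/path/to/dataset",
        "/path/to/output",
        task_name="sts",
        do_train=True,
        do_eval=False,
        do_test=True,
        max_seq_length=128,
        num_train_epochs=30.0,
        do_lower_case=True,
    )
    print(" ".join(sts_cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.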

Named Entity Recognition

python bluebert/run_bluebert_ner.py \
  --do_prepare=true \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --task_name="bc5cdr" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=30.0 \
  --do_lower_case=true \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

The task name can be

bc5cdr: BC5CDR chemical or disease task
clefe: ShARe/CLEFE task

Relation Extraction

python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="chemprot" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true

The task name can be

chemprot: BC6 ChemProt task
ddi: DDI 2013 task
i2b2_2010: I2B2 2010 task

Document multilabel classification

python bluebert/run_bluebert_multi_labels.py \
  --task_name="hoc" \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --num_classes=20 \
  --num_aspects=10 \
  --aspect_value_list="0,1" \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

Inference task

python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="mednli" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true

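Preprocessed PubMed texts

We provide the preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains about 4000M words extracted from the PubMed ASCII code version. Other operations include:

lowercasing the text
replacing characters outside the ASCII range \x00-\x7F with spaces
tokenizing the text using the NLTK Treebank tokenizer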

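Below is a code snippet with more details.

import re
from nltk.tokenize import TreebankWordTokenizer

# `value` is assumed to hold the raw input text
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)        # collapse line breaks into spaces
value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # replace non-ASCII runs with a space

tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)  # reattach the possessive 's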
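
To see what the three regular expressions do on a concrete input, the steps can be wrapped in a function. This is a sketch under one stated assumption: the `tokenize` argument defaults to plain whitespace splitting so the example runs without NLTK. To reproduce the corpus preprocessing, pass `TreebankWordTokenizer().tokenize` instead. The function name `preprocess` is hypothetical.

```python
import re


def preprocess(value, tokenize=None):
    """Apply the lowercasing, cleanup, and tokenization steps to one text."""
    if tokenize is None:
        # Stand-in tokenizer (assumption); the corpus used the NLTK Treebank tokenizer.
        tokenize = str.split
    value = value.lower()
    value = re.sub(r"[\r\n]+", " ", value)        # line breaks become spaces
    value = re.sub(r"[^\x00-\x7F]+", " ", value)  # non-ASCII runs become spaces
    sentence = " ".join(tokenize(value))
    # Glue a detached possessive back onto the preceding word.
    return re.sub(r"\s's\b", "'s", sentence)


if __name__ == "__main__":
    print(preprocess("Crohn 's disease\nαβ BLOCKER"))
```

With the whitespace stand-in, the sample input becomes `crohn's disease blocker`: the text is lowercased, the line break and the Greek letters turn into spaces, and the detached `'s` is reattached.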

Pre-training with BERT

Afterwards, we used the following code to generate the pre-training data. Please see https://github.com/google-research/bert for more details.

python bert/create_pretraining_data.py \
  --input_file=pubmed_uncased_sentence_nltk.txt \
  --output_file=pubmed_uncased_sentence_nltk.tfrecord \
  --vocab_file=bert_uncased_L-12_H-768_A-12_vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
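
The values of max_seq_length, masked_lm_prob, and max_predictions_per_seq are related: the BERT README suggests setting max_predictions_per_seq to roughly max_seq_length * masked_lm_prob (128 * 0.15 = 19.2, hence 20). The sketch below shows how many positions would be masked for a sequence of a given length. It assumes the capping rule used in create_pretraining_data.py of the google-research/bert repository; consult that script for the authoritative logic. `num_masked_positions` is a hypothetical name.

```python
def num_masked_positions(num_tokens, masked_lm_prob=0.15, max_predictions_per_seq=20):
    """Number of tokens selected for masked-LM prediction in one sequence."""
    # At least one position is masked, and never more than the configured cap.
    wanted = max(1, int(round(num_tokens * masked_lm_prob)))
    return min(max_predictions_per_seq, wanted)


if __name__ == "__main__":
    for length in (3, 64, 128, 512):
        print(length, num_masked_positions(length))
```

With the flags above, a full 128-token sequence gets 19 masked positions, so the cap of 20 is never the limiting factor. It would become limiting only with longer sequences.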

We used the following code to train the BERT model. Please do not include init_checkpoint if you are pre-training from scratch. Please see https://github.com/google-research/bert for more details.

python bert/run_pretraining.py \
  --input_file=pubmed_uncased_sentence_nltk.tfrecord \
  --output_dir=$BlueBERT_DIR \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20000 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
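
The learning_rate, num_warmup_steps, and num_train_steps flags together define the learning-rate schedule. The sketch below assumes the schedule implemented in optimization.py of the google-research/bert repository: a linear warmup from zero, then a linear decay to zero at num_train_steps. It is an illustration in plain Python rather than the TensorFlow code itself, and `bert_learning_rate` is a hypothetical name.

```python
def bert_learning_rate(step, init_lr=2e-5, num_train_steps=20000, num_warmup_steps=10):
    """Learning rate at a given global step under warmup plus linear decay."""
    if num_warmup_steps and step < num_warmup_steps:
        # Warmup phase: scale the peak rate by the fraction of warmup completed.
        return init_lr * step / num_warmup_steps
    # Decay phase: linear decay that reaches zero at num_train_steps.
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)


if __name__ == "__main__":
    for step in (0, 5, 10, 10000, 20000):
        print(step, bert_learning_rate(step))
```

With only 10 warmup steps out of 20000, the rate reaches almost its peak value of 2e-5 right away and then decays for the rest of training.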

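Citing BlueBERT

Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP). 2019.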

@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}

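Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center. This work was supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.

We are also grateful to the authors of BERT and ELMo for making their data and code publicly available.

We would like to thank Dr. Sun Kim for processing the PubMed texts.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.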
