BlueBERT

***** New November 1st, 2020: BlueBERT can be found at Huggingface *****

***** New December 5th, 2019: NCBI_BERT is renamed to BlueBERT *****

***** New July 11th, 2019: preprocessed PubMed texts *****

We uploaded the preprocessed PubMed texts that were used to pre-train the BlueBERT models.

This repository provides the code and models of BlueBERT, pre-trained on PubMed abstracts and clinical notes (MIMIC-III). Please refer to our paper "Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets" for more details.

Pre-trained models and benchmark datasets

The pre-trained BlueBERT weights, vocabulary, and config files can be downloaded from:

BlueBERT-Base, Uncased, PubMed: This model was pre-trained on PubMed abstracts.
BlueBERT-Base, Uncased, PubMed+MIMIC-III: This model was pre-trained on PubMed abstracts and MIMIC-III.
BlueBERT-Large, Uncased, PubMed: This model was pre-trained on PubMed abstracts.
BlueBERT-Large, Uncased, PubMed+MIMIC-III: This model was pre-trained on PubMed abstracts and MIMIC-III.

The pre-trained weights can also be found at Huggingface:

https://huggingface.co/bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12
https://huggingface.co/bionlp/bluebert_pubmed_uncased_L-24_H-1024_A-16
https://huggingface.co/bionlp/bluebert_pubmed_mimic_uncased_L-24_H-1024_A-16
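
As a rough sketch of how these checkpoints might be loaded, the snippet below maps a model size and training corpus to the Hub IDs listed above and loads them with the `transformers` library. The helper names `hub_id` and `load_bluebert` are hypothetical (they are not part of this repository), and the sketch assumes `transformers` and a backend such as PyTorch are installed.

```python
# Hub IDs are assembled from the four URLs listed above.
# The helper names are illustrative and not part of the BlueBERT repository.
_ARCH = {"base": "L-12_H-768_A-12", "large": "L-24_H-1024_A-16"}
_CORPUS = {"pubmed": "pubmed", "pubmed+mimic": "pubmed_mimic"}


def hub_id(size: str = "base", corpus: str = "pubmed") -> str:
    """Return the Hugging Face Hub ID for one of the four BlueBERT checkpoints."""
    return f"bionlp/bluebert_{_CORPUS[corpus]}_uncased_{_ARCH[size]}"


def load_bluebert(size: str = "base", corpus: str = "pubmed"):
    """Download (or reuse the cached copy of) the tokenizer and encoder."""
    # Imported lazily so that hub_id() works without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    name = hub_id(size, corpus)
    return AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)


if __name__ == "__main__":
    print(hub_id("large", "pubmed+mimic"))
```

Calling `load_bluebert("base", "pubmed+mimic")` would fetch the Base model trained on PubMed and MIMIC-III; the first call needs network access.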

The benchmark datasets can be downloaded from https://github.com/ncbi-nlp/BLUE_Benchmark

Fine-tuning BlueBERT

We assume the BlueBERT model has been downloaded to $BlueBERT_DIR and the dataset has been downloaded to $DATASET_DIR.
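
Before launching a run, it can help to confirm that the model directory holds the three files the commands in this section refer to. The following is a minimal sketch under that assumption; the function name `missing_model_files` is hypothetical, and the checkpoint is matched by prefix because a TensorFlow checkpoint is stored as several files sharing one prefix.

```python
import os


def missing_model_files(model_dir: str) -> list:
    """List the expected BlueBERT files that are absent from model_dir."""
    present = set(os.listdir(model_dir))
    missing = [name for name in ("vocab.txt", "bert_config.json")
               if name not in present]
    # bert_model.ckpt is a checkpoint prefix (e.g. bert_model.ckpt.index),
    # so look for any file that starts with it.
    if not any(name.startswith("bert_model.ckpt") for name in present):
        missing.append("bert_model.ckpt*")
    return missing


if __name__ == "__main__":
    # Falls back to the current directory when BlueBERT_DIR is not exported.
    model_dir = os.environ.get("BlueBERT_DIR", ".")
    print(missing_model_files(model_dir))
```

An empty list means all three pieces were found.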

Add the local directory to $PYTHONPATH if needed.

export PYTHONPATH=.:$PYTHONPATH

Sentence similarity

python bluebert/run_bluebert_sts.py \
  --task_name='sts' \
  --do_train=true \
  --do_eval=false \
  --do_test=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --num_train_epochs=30.0 \
  --do_lower_case=true \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

Named entity recognition

python bluebert/run_bluebert_ner.py \
  --do_prepare=true \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --task_name="bc5cdr" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=30.0 \
  --do_lower_case=true \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

The task name can be one of:

bc5cdr: BC5CDR chemical or disease task
clefe: ShARe/CLEFE task

Relation extraction

python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="chemprot" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true

The task name can be one of:

chemprot: BC6 ChemProt task
ddi: DDI 2013 task
i2b2_2010: I2B2 2010 task

Document multilabel classification

python bluebert/run_bluebert_multi_labels.py \
  --task_name="hoc" \
  --do_train=true \
  --do_eval=true \
  --do_predict=true \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --max_seq_length=128 \
  --train_batch_size=4 \
  --learning_rate=2e-5 \
  --num_train_epochs=3 \
  --num_classes=20 \
  --num_aspects=10 \
  --aspect_value_list="0,1" \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR

Inference task

python bluebert/run_bluebert.py \
  --do_train=true \
  --do_eval=false \
  --do_predict=true \
  --task_name="mednli" \
  --vocab_file=$BlueBERT_DIR/vocab.txt \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --num_train_epochs=10.0 \
  --data_dir=$DATASET_DIR \
  --output_dir=$OUTPUT_DIR \
  --do_lower_case=true
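
The fine-tuning commands in this section share most of their flags and differ mainly in the script and task name. As an illustration only, the sketch below collects the task-to-script pairing shown above and assembles an argument list. `TASK_SCRIPTS` and `build_command` are hypothetical names, the example paths are placeholders, and task-specific flags such as `--do_train` are passed through unchanged rather than validated.

```python
import shlex

# Script used for each task name, as shown in the commands above.
TASK_SCRIPTS = {
    "sts": "bluebert/run_bluebert_sts.py",
    "bc5cdr": "bluebert/run_bluebert_ner.py",
    "clefe": "bluebert/run_bluebert_ner.py",
    "chemprot": "bluebert/run_bluebert.py",
    "ddi": "bluebert/run_bluebert.py",
    "i2b2_2010": "bluebert/run_bluebert.py",
    "mednli": "bluebert/run_bluebert.py",
    "hoc": "bluebert/run_bluebert_multi_labels.py",
}


def build_command(task, model_dir, data_dir, output_dir, **extra_flags):
    """Build the argv list for one run; extra_flags become --name=value."""
    flags = {
        "task_name": task,
        "vocab_file": f"{model_dir}/vocab.txt",
        "bert_config_file": f"{model_dir}/bert_config.json",
        "init_checkpoint": f"{model_dir}/bert_model.ckpt",
        "data_dir": data_dir,
        "output_dir": output_dir,
    }
    flags.update(extra_flags)
    return ["python", TASK_SCRIPTS[task]] + [f"--{k}={v}" for k, v in flags.items()]


if __name__ == "__main__":
    # Placeholder paths; substitute the real model, data, and output directories.
    cmd = build_command(
        "chemprot", "models/bluebert", "data/chemprot", "out",
        do_train="true", do_eval="false", do_predict="true",
        num_train_epochs=10.0, do_lower_case="true",
    )
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from the repository root.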

Preprocessed PubMed texts

We provide the preprocessed PubMed texts that were used to pre-train the BlueBERT models. The corpus contains approximately 4,000 million words extracted from the PubMed ASCII code version. Other operations include:

lowercasing the text
removing characters outside the ASCII range \x00-\x7F
tokenizing the text using the NLTK Treebank tokenizer

Below is a code snippet with more details.

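import re
from nltk.tokenize import TreebankWordTokenizer

# `value` holds the raw text of one PubMed record.
value = value.lower()
value = re.sub(r'[\r\n]+', ' ', value)        # collapse line breaks
value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # replace non-ASCII runs

tokenized = TreebankWordTokenizer().tokenize(value)
sentence = ' '.join(tokenized)
sentence = re.sub(r"\s's\b", "'s", sentence)  # reattach possessive 's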
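
To make the effect of the regular expressions concrete, here is a small standard-library sketch that applies the same substitutions to sample strings. It deliberately leaves out the NLTK tokenization step so that it runs without extra dependencies; the function names `normalize` and `reattach_possessive` are hypothetical.

```python
import re


def normalize(value: str) -> str:
    """Lowercase, collapse line breaks, and blank out non-ASCII runs."""
    value = value.lower()
    value = re.sub(r'[\r\n]+', ' ', value)
    value = re.sub(r'[^\x00-\x7F]+', ' ', value)
    return value


def reattach_possessive(sentence: str) -> str:
    """Undo the tokenizer's split of possessives: "x 's" becomes "x's"."""
    return re.sub(r"\s's\b", "'s", sentence)


if __name__ == "__main__":
    # The accented character and the line break each become a single space.
    print(repr(normalize("Caf\u00e9\r\nABC")))
    print(reattach_possessive("parkinson 's disease"))
```

Note that each run of non-ASCII characters is replaced by a space rather than deleted, so consecutive spaces can appear in the normalized text.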

Pre-training with BERT

Afterwards, we used the following code to generate the pre-training data. Please see https://github.com/google-research/bert for more details.

python bert/create_pretraining_data.py \
  --input_file=pubmed_uncased_sentence_nltk.txt \
  --output_file=pubmed_uncased_sentence_nltk.tfrecord \
  --vocab_file=bert_uncased_L-12_H-768_A-12_vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
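
The value of `max_predictions_per_seq` in the command above is consistent with the guidance in the BERT repository to set it to roughly `max_seq_length * masked_lm_prob`. The short sketch below works through that arithmetic; rounding up is an assumption made here for illustration, and the function name is hypothetical.

```python
import math


def suggested_max_predictions(max_seq_length: int, masked_lm_prob: float) -> int:
    """Round the expected number of masked tokens per sequence up to an integer."""
    return math.ceil(max_seq_length * masked_lm_prob)


if __name__ == "__main__":
    # With the values used above: 128 * 0.15 = 19.2, rounded up to 20.
    print(suggested_max_predictions(128, 0.15))
```

Whatever value is chosen, the same `max_seq_length` and `max_predictions_per_seq` need to be passed to both the data-generation and the training script.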

We used the following code to train the BERT model. Please do not include init_checkpoint if you are pre-training from scratch. Please see https://github.com/google-research/bert for more details.

python bert/run_pretraining.py \
  --input_file=pubmed_uncased_sentence_nltk.tfrecord \
  --output_dir=$BlueBERT_DIR \
  --do_train=True \
  --do_eval=True \
  --bert_config_file=$BlueBERT_DIR/bert_config.json \
  --init_checkpoint=$BlueBERT_DIR/bert_model.ckpt \
  --train_batch_size=32 \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --num_train_steps=20000 \
  --num_warmup_steps=10 \
  --learning_rate=2e-5
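
For a sense of scale, the flags in the command above imply a fixed training budget. The sketch below only multiplies the values shown; it says nothing about the settings behind the released checkpoints, and the token count is an upper bound because shorter sequences are padded to `max_seq_length`. The function names are hypothetical.

```python
def sequences_seen(num_train_steps: int, train_batch_size: int) -> int:
    """Number of training sequences processed over the whole run."""
    return num_train_steps * train_batch_size


def max_tokens_seen(num_train_steps: int, train_batch_size: int,
                    max_seq_length: int) -> int:
    """Upper bound on token positions processed (padding included)."""
    return sequences_seen(num_train_steps, train_batch_size) * max_seq_length


if __name__ == "__main__":
    print(sequences_seen(20000, 32))        # 640000
    print(max_tokens_seen(20000, 32, 128))  # 81920000
```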

Citing BlueBERT

Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019). 2019.

 @InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
  pages     = {58--65},
}

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center. This work was also supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.

We are also grateful to the authors of BERT and ELMo for making their data and code publicly available.

We would like to thank Dr. Sun Kim for processing the PubMed texts.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.
