CAMEL: Clinically Adapted Model Enhanced from LLaMA

CAMEL, generated with Bing Image Creator

UPDATE: New Model Announcement

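We are proud to introduce Asclepius, a more advanced clinical large language model. Because this model was trained on synthetic clinical notes, it is publicly accessible via Huggingface. If you are considering using CAMEL, we strongly recommend switching to Asclepius instead. For more information, please visit this link.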

Our Blog Post

Our Demo

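We present CAMEL, Clinically Adapted Model Enhanced from LLaMA. With LLaMA as its foundation, CAMEL is further pretrained on MIMIC-III and MIMIC-IV clinical notes and finetuned on clinical instructions (Figure 2). Our preliminary evaluation, with GPT-4 as the assessor, shows that CAMEL achieves over 96% of the quality of OpenAI's GPT-3.5 (Figure 1). In accordance with the data usage policies of our source data, both our instruction dataset and model will be published on PhysioNet with credentialed access. To facilitate replication, we will also release all code, allowing individual healthcare institutions to reproduce our model using their own clinical notes. For further details, please refer to our blog post.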

Figure 1. Performance Comparison

Figure 2. Model Pipeline

Reproducing Guide

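Due to the license terms of the MIMIC and i2b2 datasets, we cannot publish the instruction dataset and checkpoints. We will release our model and data through PhysioNet within a few weeks.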

Environment Setup

conda create -n camel python=3.9 -y
conda activate camel
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia -y
pip install pandarallel pandas jupyter numpy datasets sentencepiece openai fire
pip install git+https://github.com/huggingface/transformers.git@871598be552c38537bc047a409b4a6840ba1c1e4
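After installation, a quick sanity check can confirm that the main packages are importable. The snippet below is a minimal sketch and is not part of the repository; the helper name `missing_packages` is hypothetical, and it assumes each package's import name matches its pip name (with `torch` for PyTorch).

```python
import importlib.util

# Import names of the packages installed above (assumed to match their pip names).
REQUIRED = ["torch", "transformers", "pandas", "numpy", "datasets",
            "sentencepiece", "openai", "fire", "pandarallel"]


def missing_packages(names):
    """Return the subset of `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

Run it inside the activated `camel` environment; any name it reports can be reinstalled with the commands above.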

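Pretraining

Note Preprocessing
- For each note, we prepend the category to the text.
- To prevent test set leakage, we removed 404 notes from MIMIC-III that overlap with the RadQA, CLIP, and n2c2 2018 datasets, which are reserved for further evaluation.
- We concatenate all notes, joined by separator tokens.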
- $ python pretraining_preprocess/mimiciii_preproc.py --mimiciii_note_path {MIMICIII_NOTE_PATH} --output_path {OUTPUT_PATH}
- $ python pretraining_preprocess/mimiciv_preproc.py --discharge_note_path {DISCHARGE_NOTE_PATH} --radiology_note_path {RADIOLOGY_NOTE_PATH} --output_path {OUTPUT_PATH}
- $ python pretraining_preprocess/tokenize_data.py --data_path {DATA_PATH} --save_path {SAVE_PATH}
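The three preprocessing steps listed above can be illustrated roughly as follows. This is a sketch under stated assumptions, not the code in `pretraining_preprocess/`: the function name, the dictionary keys (`id`, `category`, `text`), the newline between category and text, and the `<sep>` separator string are all hypothetical placeholders.

```python
def build_pretraining_text(notes, excluded_ids, separator):
    """Prefix each note with its category, drop excluded notes, and join the rest.

    notes        -- iterable of dicts with hypothetical keys "id", "category", "text"
    excluded_ids -- set of note ids that overlap with the evaluation datasets
    separator    -- string placed between consecutive notes (an assumption here)
    """
    kept = []
    for note in notes:
        if note["id"] in excluded_ids:
            continue  # skip notes that would leak into the evaluation sets
        # Prepend the category to the note text.
        kept.append(f'{note["category"]}\n{note["text"].strip()}')
    return separator.join(kept)


# Placeholder notes for illustration only (not real clinical text).
notes = [
    {"id": 1, "category": "Radiology", "text": "note A"},
    {"id": 2, "category": "Nursing", "text": "note B"},
    {"id": 3, "category": "Radiology", "text": "note C"},
]
corpus = build_pretraining_text(notes, excluded_ids={2}, separator="\n<sep>\n")
print(corpus)
```

In this toy run, note 2 is dropped and the two remaining notes are joined with the placeholder separator.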

Run Pretraining

$ torchrun --nproc_per_node=8 --master_port={YOUR_PORT} \
    src/train.py \
    --model_name_or_path "decapoda-research/llama-7b-hf" \
    --data_path {DATA_FILE} \
    --bf16 True \
    --output_dir ./checkpoints \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps 8 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1000 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True

Instruction Finetuning

NOTE: To generate instructions, you should use the certified Azure OpenAI API.
Instruction Generation
- Set environment variables
  - OPENAI_API_KEY
  - OPENAI_API_BASE
  - OPENAI_DEPLOYMENT_NAME
- Preprocess notes
  - $ python instruction/preprocess_note.py
- De-identification instruction generation
  - $ python instruction/de_id_gen.py --input {PREPROCESSED_NOTES} --output {OUTPUT_FILE_1} --mode inst
  - $ python instruction/de_id_postprocess.py --input {OUTPUT_FILE_1} --output {OUTPUT_FILE_2}
  - $ python instruction/de_id_gen.py --input {OUTPUT_FILE_2} --output {inst_output/OUTPUT_FILE_deid} --mode ans
- Instruction generation for other tasks
  - You can generate instructions for each dataset selectively.
  - $ python instruction/instructtion_gen.py --input {PREPROCESSED_NOTES} --output {inst_output/OUTPUT_FILE} --source {mimiciii, mimiciv, i2b2}
- Merge and format files
  - $ python instruction/merge_data.py --data_path {inst_output} --output {OUTPUT_FILE_FINAL}
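Before running the generation scripts above, it may help to confirm that the three environment variables are set. The following is a minimal sketch; the helper `unset_variables` is hypothetical and not part of the repository, and the example dictionary contains placeholder values only.

```python
import os

# Variable names taken from the list above.
REQUIRED_VARS = ("OPENAI_API_KEY", "OPENAI_API_BASE", "OPENAI_DEPLOYMENT_NAME")


def unset_variables(env=None):
    """Return the required variable names that are missing or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Example with a placeholder environment: one value set, one empty, one absent.
example_env = {"OPENAI_API_KEY": "placeholder", "OPENAI_API_BASE": ""}
print(unset_variables(example_env))
# prints ['OPENAI_API_BASE', 'OPENAI_DEPLOYMENT_NAME']
```

Calling `unset_variables()` with no argument checks the current process environment.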
Run Instruction Finetuning
- All of our experiments were conducted on 8x A6000 GPUs.
- Adjust nproc_per_node and gradient_accumulation_steps to fit your hardware (global batch size = 128).

    $ torchrun --nproc_per_node=8 --master_port={YOUR_PORT} \
        src/instruction_ft.py \
        --model_name_or_path "decapoda-research/llama-7b-hf" \
        --data_path {OUTPUT_FILE_FINAL} \
        --bf16 True \
        --output_dir ./checkpoints \
        --num_train_epochs 3 \
        --per_device_train_batch_size 2 \
        --per_device_eval_batch_size 2 \
        --gradient_accumulation_steps 8 \
        --evaluation_strategy "no" \
        --save_strategy "epoch" \
        --learning_rate 2e-5 \
        --weight_decay 0. \
        --warmup_ratio 0.03 \
        --lr_scheduler_type "cosine" \
        --logging_steps 1 \
        --fsdp "full_shard auto_wrap" \
        --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
        --tf32 True \
        --model_max_length 2048 \
        --gradient_checkpointing True \
        --ddp_timeout 18000
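The global batch size of 128 follows from the settings above: 8 GPUs x 2 samples per device x 8 accumulation steps. When running on different hardware, the accumulation steps can be recomputed to keep the same global batch size. The helpers below are a hypothetical sketch (not part of the repository) and assume plain data-parallel training, where every GPU processes its own micro-batch.

```python
def global_batch_size(num_gpus, per_device_batch_size, grad_accum_steps):
    """Number of samples contributing to one optimizer step."""
    return num_gpus * per_device_batch_size * grad_accum_steps


def grad_accum_for(target_global_batch, num_gpus, per_device_batch_size):
    """Accumulation steps needed to reach `target_global_batch` exactly."""
    per_step = num_gpus * per_device_batch_size
    if target_global_batch % per_step != 0:
        raise ValueError(
            f"{target_global_batch} is not divisible by {per_step} samples per step"
        )
    return target_global_batch // per_step


print(global_batch_size(8, 2, 8))  # settings from the command above -> 128
print(grad_accum_for(128, 4, 2))   # 4 GPUs, same per-device batch -> 16
```

For example, on 4 GPUs with `--per_device_train_batch_size 2`, setting `--gradient_accumulation_steps 16` keeps the global batch size at 128.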

Evaluation

Run model on MTSamples

CUDA_VISIBLE_DEVICES=0 python src/evaluate.py \
  --model_name {MODEL_PATH} \
  --data_path eval/mtsamples_instructions.json \
  --output_path {OUTPUT_PATH}

We provide the outputs generated by GPT-3.5, Alpaca, and CAMEL as mtsamples_results.json in the eval folder.

Run GPT-4 for evaluation

 python eval/gpt4_evaluate.py --input {INPUT_PATH} --output {OUTPUT_PATH}
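One common way to summarize such an evaluation is to express a model's total score as a percentage of a reference model's total score. The sketch below illustrates that calculation only; it is an assumption, not confirmed as the procedure used by `eval/gpt4_evaluate.py`, and the function name and the scores are made up for illustration (they are not results from this project).

```python
def relative_quality(model_scores, reference_scores):
    """Model's total score as a percentage of the reference model's total score."""
    if len(model_scores) != len(reference_scores):
        raise ValueError("Both models must be scored on the same set of samples")
    reference_total = sum(reference_scores)
    if reference_total == 0:
        raise ValueError("Reference total score must be non-zero")
    return 100.0 * sum(model_scores) / reference_total


# Made-up per-sample scores, for illustration only.
model_scores = [8, 9, 7]
reference_scores = [8, 10, 8]
print(round(relative_quality(model_scores, reference_scores), 1))  # 92.3
```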

Citation

@misc{CAMEL,
    title = {CAMEL : Clinically Adapted Model Enhanced from LLaMA},
    author = {Sunjun Kweon and Junu Kim and Seongsu Bae and Eunbyeol Cho and Sujeong Im and Jiyoun Kim and Gyubok Lee and JongHak Moon and JeongWoo Oh and Edward Choi},
    month = {May},
    year = {2023},
    publisher = {GitHub},
    journal = {GitHub repository},
    howpublished = {\url{https://github.com/starmpcc/CAMEL}},
}