Official implementation of "Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic ICD Coding" [ACL 2022]
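All code is tested under Python 3.7 and PyTorch 1.7.0. opt_einsum must be installed for the einsum computation. Training on the MIMIC-III full setting requires a GPU with at least 32GB of memory.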
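Before training, the requirements above can be checked with a small script. This is only a sketch: the helper name `check_environment` and the keys of the returned dictionary are illustrative and not part of this repository. It reports the GPU memory rather than enforcing a threshold, since a card sold as 32GB typically reports slightly less usable memory.

```python
import importlib.util
import sys


def check_environment():
    """Collect basic facts about the interpreter, required packages, and GPU 0."""
    report = {
        "python": "{}.{}".format(*sys.version_info[:2]),
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "opt_einsum_installed": importlib.util.find_spec("opt_einsum") is not None,
        "gpu_memory_gib": None,  # None means no CUDA device could be queried
    }
    if report["torch_installed"]:
        import torch

        report["torch"] = torch.__version__
        if torch.cuda.is_available():
            # total_memory is reported in bytes for the given device index
            total_bytes = torch.cuda.get_device_properties(0).total_memory
            report["gpu_memory_gib"] = round(total_bytes / 1024 ** 3, 1)
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

If `opt_einsum_installed` is `False`, installing the package with `pip install opt_einsum` should resolve it.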
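We only include a few samples for each dataset. Downloading the MIMIC-III dataset requires permission. After obtaining the MIMIC-III dataset, please follow caml-mimic to preprocess it. After preprocessing, you should have train_full.csv, test_full.csv, dev_full.csv, train_50.csv, test_50.csv, and dev_50.csv. Please put them under sample_data/mimic3. Then use preprocess/generate_data_new.ipynb to generate the datasets in JSON format.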
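To confirm that preprocessing produced everything listed above before running the notebook, a check along these lines may help. It is a minimal sketch: the function name `missing_mimic3_files` is hypothetical, and it assumes the script is run from the repository root so that the relative path `sample_data/mimic3` resolves.

```python
from pathlib import Path

# File names taken from the preprocessing step described above.
EXPECTED_FILES = [
    "train_full.csv", "test_full.csv", "dev_full.csv",
    "train_50.csv", "test_50.csv", "dev_50.csv",
]


def missing_mimic3_files(data_dir="sample_data/mimic3"):
    """Return the expected CSV file names that are not present in data_dir."""
    root = Path(data_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_mimic3_files()
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All preprocessed CSV files found.")
```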
Please download word2vec_sg0_100.model from LAAT. You need to change the path to the word embedding.
MIMIC-III Full (1 GPU):
CUDA_VISIBLE_DEVICES=0 python main.py --n_gpu 1 --version mimic3 --combiner lstm --rnn_dim 256 --num_layers 2 --decoder MultiLabelMultiHeadLAATV2 --attention_head 4 --attention_dim 512 --learning_rate 5e-4 --train_epoch 20 --batch_size 2 --gradient_accumulation_steps 8 --xavier --main_code_loss_weight 0.0 --rdrop_alpha 5.0 --est_cls 1 --term_count 4 --sort_method random --word_embedding_path word_embedding_path
MIMIC-III Full (8 GPUs):
NCCL_IB_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node 8 --master_port=1212 --use_env main.py --n_gpu 8 --version mimic3 --combiner lstm --rnn_dim 256 --num_layers 2 --decoder MultiLabelMultiHeadLAATV2 --attention_head 4 --attention_dim 512 --learning_rate 5e-4 --train_epoch 20 --batch_size 2 --gradient_accumulation_steps 1 --xavier --main_code_loss_weight 0.0 --rdrop_alpha 5.0 --est_cls 1 --term_count 4 --sort_method random --word_embedding_path word_embedding_path
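The two MIMIC-III Full commands differ in `--n_gpu` and `--gradient_accumulation_steps` but appear designed to reach the same number of samples per optimizer step. The arithmetic can be sketched as follows, under the assumption (not confirmed by this README) that `--batch_size` is the per-GPU batch size; the helper `effective_batch_size` is illustrative and not part of the repository.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, n_gpu=1):
    """Samples contributing to one optimizer step, assuming a per-GPU batch size."""
    return batch_size * gradient_accumulation_steps * n_gpu


# Values copied from the two MIMIC-III Full commands above.
configs = {
    "MIMIC-III Full (1 GPU)": dict(batch_size=2, gradient_accumulation_steps=8, n_gpu=1),
    "MIMIC-III Full (8 GPUs)": dict(batch_size=2, gradient_accumulation_steps=1, n_gpu=8),
}

for name, cfg in configs.items():
    print(f"{name}: {effective_batch_size(**cfg)} samples per optimizer step")
```

Under that assumption both configurations give 16 samples per step, so when training on a different number of GPUs it seems reasonable to adjust `--gradient_accumulation_steps` to keep the product unchanged.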
MIMIC-III 50:
CUDA_VISIBLE_DEVICES=0 python main.py --version mimic3-50 --combiner lstm --rnn_dim 512 --num_layers 1 --decoder MultiLabelMultiHeadLAATV2 --attention_head 8 --attention_dim 512 --learning_rate 5e-4 --train_epoch 20 --batch_size 16 --gradient_accumulation_steps 1 --xavier --main_code_loss_weight 0.0 --rdrop_alpha 5.0 --est_cls 1 --term_count 8 --word_embedding_path word_embedding_path
python eval_model.py MODEL_CHECKPOINT
mimic3 checkpoint
mimic3-50 checkpoint
@inproceedings{yuan-etal-2022-code,
title = "Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic {ICD} Coding",
author = "Yuan, Zheng and
Tan, Chuanqi and
Huang, Songfang",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-short.91",
pages = "808--814",
abstract = "Automatic ICD coding is defined as assigning disease codes to electronic medical records (EMRs). Existing methods usually apply label attention with code representations to match related text snippets. Unlike these works that model the label with the code hierarchy or description, we argue that the code synonyms can provide more comprehensive knowledge based on the observation that the code expressions in EMRs vary from their descriptions in ICD. By aligning codes to concepts in UMLS, we collect synonyms of every code. Then, we propose a multiple synonyms matching network to leverage synonyms for better code representation learning, and finally help the code classification. Experiments on the MIMIC-III dataset show that our proposed method outperforms previous state-of-the-art methods.",
}