Official implementation of "Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic ICD Coding" [ACL 2022]
All code is tested with Python 3.7 and PyTorch 1.7.0. Install opt_einsum for the einsum computations. Training the MIMIC-III full setting requires a GPU with at least 32GB of memory.
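Before launching a long training run, it can help to confirm that the environment matches the one described above. The snippet below is a minimal sketch using only the standard library. The helper names `check_environment` and `python_matches` are hypothetical and not part of this repository, and the import names `torch` and `opt_einsum` are assumed to match the installed packages.

```python
import importlib.util
import sys

# Python version this README reports as tested.
TESTED_PYTHON = (3, 7)
# Import names assumed for PyTorch and opt_einsum.
REQUIRED_MODULES = ("torch", "opt_einsum")


def check_environment(modules=REQUIRED_MODULES):
    """Map each top-level module name to True if it can be found for import."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


def python_matches(tested=TESTED_PYTHON, current=None):
    """Return True if the (major, minor) interpreter version equals the tested one."""
    current = sys.version_info[:2] if current is None else current
    return tuple(current) == tuple(tested)


if __name__ == "__main__":
    for name, found in check_environment().items():
        print(f"{name}: {'found' if found else 'MISSING'}")
    if not python_matches():
        print("Note: interpreter differs from the tested Python 3.7")
```

A different Python version is not necessarily a problem; the check only flags that the setup is outside what was tested.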
We include only a few samples of each dataset in this repository. You need to obtain a license to download the MIMIC-III dataset. Once you have MIMIC-III, follow caml-mimic to preprocess it. After preprocessing you should have train_full.csv, test_full.csv, dev_full.csv, train_50.csv, test_50.csv, and dev_50.csv. Put them under sample_data/mimic3. Then use preprocess/generate_data_new.ipynb to generate the JSON-format dataset.
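A quick way to confirm that preprocessing produced all six CSV files is to list the ones that are missing before opening the notebook. This is a minimal sketch, not a script from this repository: the function name `missing_files` is hypothetical, and the directory passed in is assumed to be the data folder named above.

```python
from pathlib import Path

# File names listed in this README as the output of caml-mimic preprocessing.
EXPECTED_FILES = [
    "train_full.csv", "test_full.csv", "dev_full.csv",
    "train_50.csv", "test_50.csv", "dev_50.csv",
]


def missing_files(data_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in data_dir."""
    data_dir = Path(data_dir)
    return [name for name in expected if not (data_dir / name).is_file()]


if __name__ == "__main__":
    # Assumed location; adjust to wherever the CSV files were placed.
    missing = missing_files("sample_data/mimic3")
    print("All files present" if not missing else f"Missing: {missing}")
```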
Download word2vec_sg0_100.model from LAAT. In the commands below, replace word_embedding_path with the path to this file.
MIMIC-III Full (1 GPU):
CUDA_VISIBLE_DEVICES=0 python main.py --n_gpu 1 --version mimic3 --combiner lstm --rnn_dim 256 --num_layers 2 --decoder MultiLabelMultiHeadLAATV2 --attention_head 4 --attention_dim 512 --learning_rate 5e-4 --train_epoch 20 --batch_size 2 --gradient_accumulation_steps 8 --xavier --main_code_loss_weight 0.0 --rdrop_alpha 5.0 --est_cls 1 --term_count 4 --sort_method random --word_embedding_path word_embedding_path
MIMIC-III Full (8 GPUs):
NCCL_IB_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m torch.distributed.launch --nproc_per_node 8 --master_port=1212 --use_env main.py --n_gpu 8 --version mimic3 --combiner lstm --rnn_dim 256 --num_layers 2 --decoder MultiLabelMultiHeadLAATV2 --attention_head 4 --attention_dim 512 --learning_rate 5e-4 --train_epoch 20 --batch_size 2 --gradient_accumulation_steps 1 --xavier --main_code_loss_weight 0.0 --rdrop_alpha 5.0 --est_cls 1 --term_count 4 --sort_method random --word_embedding_path word_embedding_path
MIMIC-III 50:
CUDA_VISIBLE_DEVICES=0 python main.py --version mimic3-50 --combiner lstm --rnn_dim 512 --num_layers 1 --decoder MultiLabelMultiHeadLAATV2 --attention_head 8 --attention_dim 512 --learning_rate 5e-4 --train_epoch 20 --batch_size 16 --gradient_accumulation_steps 1 --xavier --main_code_loss_weight 0.0 --rdrop_alpha 5.0 --est_cls 1 --term_count 8 --word_embedding_path word_embedding_path
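The three commands above use different combinations of --batch_size, --gradient_accumulation_steps, and GPU count. The arithmetic below is a small sketch of how they compare, under two assumptions that are not stated in this README: that --batch_size is the per-GPU batch size, and that the MIMIC-III 50 command runs on a single GPU (it sets CUDA_VISIBLE_DEVICES=0 and passes no --n_gpu). The function name `effective_batch_size` is hypothetical.

```python
def effective_batch_size(batch_size, gradient_accumulation_steps, n_gpu=1):
    """Samples contributing to one optimizer step, assuming a per-GPU batch size."""
    return batch_size * gradient_accumulation_steps * n_gpu


# Values copied from the three training commands in this README.
configs = {
    "MIMIC-III full (1 GPU)": dict(batch_size=2, gradient_accumulation_steps=8, n_gpu=1),
    "MIMIC-III full (8 GPUs)": dict(batch_size=2, gradient_accumulation_steps=1, n_gpu=8),
    "MIMIC-III 50": dict(batch_size=16, gradient_accumulation_steps=1, n_gpu=1),
}

for name, cfg in configs.items():
    print(f"{name}: {effective_batch_size(**cfg)}")
```

Under these assumptions every configuration works out to 16 samples per optimizer step, which would explain why the single-GPU full setting raises the accumulation steps to 8.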
python eval_model.py MODEL_CHECKPOINT
mimic3 checkpoint
mimic3-50 checkpoint
@inproceedings{yuan-etal-2022-code,
title = "Code Synonyms Do Matter: Multiple Synonyms Matching Network for Automatic {ICD} Coding",
author = "Yuan, Zheng and
Tan, Chuanqi and
Huang, Songfang",
booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
month = may,
year = "2022",
address = "Dublin, Ireland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.acl-short.91",
pages = "808--814",
abstract = "Automatic ICD coding is defined as assigning disease codes to electronic medical records (EMRs). Existing methods usually apply label attention with code representations to match related text snippets. Unlike these works that model the label with the code hierarchy or description, we argue that the code synonyms can provide more comprehensive knowledge based on the observation that the code expressions in EMRs vary from their descriptions in ICD. By aligning codes to concepts in UMLS, we collect synonyms of every code. Then, we propose a multiple synonyms matching network to leverage synonyms for better code representation learning, and finally help the code classification. Experiments on the MIMIC-III dataset show that our proposed method outperforms previous state-of-the-art methods.",
}