
Conditional Generators of Words Definitions

This repository contains the code for Conditional Generators of Words Definitions.

Abstract

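We explore the recently introduced definition modeling technique, which provides a tool for evaluating different distributed vector representations of words by modeling their dictionary definitions. In this work, we study the problem of word ambiguity in definition modeling and propose a possible solution that employs latent variable modeling and soft attention mechanisms. Our quantitative and qualitative evaluation and analysis of the model show that taking word ambiguity and polysemy into account leads to improved performance.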

Citing

 @InProceedings{P18-2043,
  author = "Gadetsky, Artyom and Yakubovskiy, Ilya and Vetrov, Dmitry",
  title = "Conditional Generators of Words Definitions",
  booktitle = "Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "266--271",
  location = "Melbourne, Australia",
  url = "http://aclweb.org/anthology/P18-2043"
}

Environment requirements and Data Preparation

Install a conda environment with the following packages:

 Python 3.6
Pytorch 0.4
Numpy 1.14
Tqdm 4.23
Gensim 3.4

To install the AdaGram software needed for Adaptive conditioning:

Download the Julia 0.6 binaries from the official site and add an alias to ~/.bashrc:

 alias julia='JULIA_BINARY_PATH/bin/julia'

Reload ~/.bashrc with source ~/.bashrc.

Then start the Julia interpreter with julia and install the following packages:

 Pkg.clone("https://github.com/mirestrepo/AdaGram.jl")
Pkg.build("AdaGram")
Pkg.add("ArgParse")
Pkg.add("JSON")
Pkg.add("NPZ")
exit()

Then add the following lines to ~/.bashrc:

 export PATH="JULIA_BINARY_PATH/bin:$PATH"
export LD_LIBRARY_PATH="JULIA_INSTALL_PATH/v0.6/AdaGram/lib:$LD_LIBRARY_PATH"

Finally, apply the exports:

 source ~/.bashrc

To install Mosesdecoder (for BLEU), follow the instructions on the official site.
To get the data for language model (LM) pretraining:

 cd pytorch-definitions
mkdir data
cd data
wget https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-v1.zip
unzip wikitext-103-v1.zip

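To get the Google word vectors, use the official site. You need the .bin.gz file. Do not forget to gunzip the downloaded file to extract the binary.
Adaptive Skip-gram vectors are available upon request. You can also train your own using the instructions in the official repository.
The definition modeling data is available upon request because of the Oxford Dictionaries distribution license. You can also collect your own. To do so, prepare three data splits: train, test, and val. Each data split is a Python list of the following format, saved as a JSON file: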

 data = [
  [
    ["word"],
    ["word1", "word2", ...],
    ["word1", "word2", ...]
  ],
  ...
]
So i-th element of the data:
data[i][0][0] - word being defined (string)
data[i][1] - definition (list of strings)
data[i][2] - context to understand word meaning (list of strings)
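
As an illustration of the format above, the following is a minimal sketch of how a custom split could be checked and written to disk. The helper names (validate_split, save_split), the toy entry, and the output file name are all hypothetical and not part of the repository.

```python
import json
import os
import tempfile


def validate_split(data):
    """Check that every entry follows the [[word], definition, context] layout."""
    for i, entry in enumerate(data):
        if len(entry) != 3:
            raise ValueError(f"entry {i}: expected 3 fields, got {len(entry)}")
        word, definition, context = entry
        if len(word) != 1 or not isinstance(word[0], str):
            raise ValueError(f"entry {i}: data[i][0] must be a one-element list of str")
        for name, tokens in (("definition", definition), ("context", context)):
            if not all(isinstance(tok, str) for tok in tokens):
                raise ValueError(f"entry {i}: {name} must be a list of strings")
    return data


def save_split(data, path):
    """Validate a split and store it as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(validate_split(data), f, ensure_ascii=False)


# Toy entry invented for illustration only.
example_split = [
    [
        ["bank"],
        ["the", "land", "alongside", "a", "river"],
        ["they", "walked", "along", "the", "bank", "of", "the", "stream"],
    ],
]

out_path = os.path.join(tempfile.mkdtemp(), "train.json")  # hypothetical file name
save_split(example_split, out_path)

# Read the file back to confirm the round trip.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)
```

The same check can be run on each of the three splits before passing them to the preparation scripts.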

Usage

First, prepare the vocabularies, vectors, and other inputs the model needs.

To prepare vocabularies, use python prep_vocab.py.

 usage: prep_vocab.py [-h] --defs DEFS [DEFS ...] [--lm LM [LM ...]] [--same]
                     --save SAVE [--save_context SAVE_CONTEXT] --save_chars
                     SAVE_CHARS

Prepare vocabularies for model

optional arguments:
  -h, --help            show this help message and exit
  --defs DEFS [DEFS ...]
                        location of json file with definitions.
  --lm LM [LM ...]      location of txt file with text for LM pre-training
  --same                use same vocab for definitions and contexts
  --save SAVE           where to save prepaired vocabulary (for words from
                        definitions)
  --save_context SAVE_CONTEXT
                        where to save vocabulary (for words from contexts)
  --save_chars SAVE_CHARS
                        where to save char vocabulary (for chars from all
                        words)

To prepare w2v vectors, use python prep_w2v.py.

 usage: prep_w2v.py [-h] --defs DEFS [DEFS ...] --save SAVE [SAVE ...] --w2v
                   W2V

Prepare word vectors for Input conditioning

optional arguments:
  -h, --help            show this help message and exit
  --defs DEFS [DEFS ...]
                        location of json file with definitions.
  --save SAVE [SAVE ...]
                        where to save files
  --w2v W2V             location of binary w2v file

To prepare AdaGram vectors, use julia prep_ada.jl.

 usage: prep_ada.jl --defs DEFS [DEFS...] --save SAVE [SAVE...]
                   --ada ADA [-h]

Prepare word vectors for Input-Adaptive conditioning

optional arguments:
  --defs DEFS [DEFS...]
                        location of json file with definitions.
  --save SAVE [SAVE...]
                        where to save files
  --ada ADA             location of AdaGram file
  -h, --help            show this help message and exit

To initialize the model's embedding matrix with Google word vectors, prepare it using
python prep_embedding_matrix.py and then pass the path to the saved weights as --w2v_weights in train.py.

 usage: prep_embedding_matrix.py [-h] --voc VOC --w2v W2V --save SAVE

Prepare word vectors for embedding layer in the model

optional arguments:
  -h, --help   show this help message and exit
  --voc VOC    location of model vocabulary file
  --w2v W2V    location of binary w2v file
  --save SAVE  where to save prepaired matrix

Now everything is ready for using the model!

To train a model, use python train.py.

 usage: train.py [-h] [--pretrain] --voc VOC [--train_defs TRAIN_DEFS]
                [--eval_defs EVAL_DEFS] [--test_defs TEST_DEFS]
                [--input_train INPUT_TRAIN] [--input_eval INPUT_EVAL]
                [--input_test INPUT_TEST]
                [--input_adaptive_train INPUT_ADAPTIVE_TRAIN]
                [--input_adaptive_eval INPUT_ADAPTIVE_EVAL]
                [--input_adaptive_test INPUT_ADAPTIVE_TEST]
                [--context_voc CONTEXT_VOC] [--ch_voc CH_VOC]
                [--train_lm TRAIN_LM] [--eval_lm EVAL_LM] [--test_lm TEST_LM]
                [--bptt BPTT] --nx NX --nlayers NLAYERS --nhid NHID
                --rnn_dropout RNN_DROPOUT [--use_seed] [--use_input]
                [--use_input_adaptive] [--use_input_attention]
                [--n_attn_embsize N_ATTN_EMBSIZE] [--n_attn_hid N_ATTN_HID]
                [--attn_dropout ATTN_DROPOUT] [--attn_sparse] [--use_ch]
                [--ch_emb_size CH_EMB_SIZE]
                [--ch_feature_maps CH_FEATURE_MAPS [CH_FEATURE_MAPS ...]]
                [--ch_kernel_sizes CH_KERNEL_SIZES [CH_KERNEL_SIZES ...]]
                [--use_hidden] [--use_hidden_adaptive]
                [--use_hidden_attention] [--use_gated] [--use_gated_adaptive]
                [--use_gated_attention] --lr LR --decay_factor DECAY_FACTOR
                --decay_patience DECAY_PATIENCE --num_epochs NUM_EPOCHS
                --batch_size BATCH_SIZE --clip CLIP --random_seed RANDOM_SEED
                --exp_dir EXP_DIR [--w2v_weights W2V_WEIGHTS]
                [--fix_embeddings] [--fix_attn_embeddings] [--lm_ckpt LM_CKPT]
                [--attn_ckpt ATTN_CKPT]

Script to train a model

optional arguments:
  -h, --help            show this help message and exit
  --pretrain            whether to pretrain model on LM dataset or train on
                        definitions
  --voc VOC             location of vocabulary file
  --train_defs TRAIN_DEFS
                        location of json file with train definitions.
  --eval_defs EVAL_DEFS
                        location of json file with eval definitions.
  --test_defs TEST_DEFS
                        location of json file with test definitions
  --input_train INPUT_TRAIN
                        location of train vectors for Input conditioning
  --input_eval INPUT_EVAL
                        location of eval vectors for Input conditioning
  --input_test INPUT_TEST
                        location of test vectors for Input conditioning
  --input_adaptive_train INPUT_ADAPTIVE_TRAIN
                        location of train vectors for InputAdaptive
                        conditioning
  --input_adaptive_eval INPUT_ADAPTIVE_EVAL
                        location of eval vectors for InputAdaptive
                        conditioning
  --input_adaptive_test INPUT_ADAPTIVE_TEST
                        location test vectors for InputAdaptive conditioning
  --context_voc CONTEXT_VOC
                        location of context vocabulary file
  --ch_voc CH_VOC       location of CH vocabulary file
  --train_lm TRAIN_LM   location of txt file train LM data
  --eval_lm EVAL_LM     location of txt file eval LM data
  --test_lm TEST_LM     location of txt file test LM data
  --bptt BPTT           sequence length for BackPropThroughTime in LM
                        pretraining
  --nx NX               size of embeddings
  --nlayers NLAYERS     number of LSTM layers
  --nhid NHID           size of hidden states
  --rnn_dropout RNN_DROPOUT
                        probability of RNN dropout
  --use_seed            whether to use Seed conditioning or not
  --use_input           whether to use Input conditioning or not
  --use_input_adaptive  whether to use InputAdaptive conditioning or not
  --use_input_attention
                        whether to use InputAttention conditioning or not
  --n_attn_embsize N_ATTN_EMBSIZE
                        size of InputAttention embeddings
  --n_attn_hid N_ATTN_HID
                        size of InputAttention linear layer
  --attn_dropout ATTN_DROPOUT
                        probability of InputAttention dropout
  --attn_sparse         whether to use sparse embeddings in InputAttention or
                        not
  --use_ch              whether to use CH conditioning or not
  --ch_emb_size CH_EMB_SIZE
                        size of embeddings in CH conditioning
  --ch_feature_maps CH_FEATURE_MAPS [CH_FEATURE_MAPS ...]
                        list of feature map sizes in CH conditioning
  --ch_kernel_sizes CH_KERNEL_SIZES [CH_KERNEL_SIZES ...]
                        list of kernel sizes in CH conditioning
  --use_hidden          whether to use Hidden conditioning or not
  --use_hidden_adaptive
                        whether to use HiddenAdaptive conditioning or not
  --use_hidden_attention
                        whether to use HiddenAttention conditioning or not
  --use_gated           whether to use Gated conditioning or not
  --use_gated_adaptive  whether to use GatedAdaptive conditioning or not
  --use_gated_attention
                        whether to use GatedAttention conditioning or not
  --lr LR               initial lr
  --decay_factor DECAY_FACTOR
                        factor to decay lr
  --decay_patience DECAY_PATIENCE
                        after number of patience epochs - decay lr
  --num_epochs NUM_EPOCHS
                        number of epochs to train
  --batch_size BATCH_SIZE
                        batch size
  --clip CLIP           value to clip norm of gradients to
  --random_seed RANDOM_SEED
                        random seed
  --exp_dir EXP_DIR     where to save all stuff about training
  --w2v_weights W2V_WEIGHTS
                        path to pretrained embeddings to init
  --fix_embeddings      whether to update embedding matrix or not
  --fix_attn_embeddings
                        whether to update attention embedding matrix or not
  --lm_ckpt LM_CKPT     path to pretrained language model weights
  --attn_ckpt ATTN_CKPT
                        path to pretrained Attention module
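
The --decay_factor and --decay_patience options describe a learning-rate decay driven by patience. The sketch below shows one common reading of such a schedule: the rate is multiplied by the factor once the eval loss has failed to improve for more than the given number of epochs. The class name and the exact trigger rule are assumptions for illustration; the repository's own schedule may differ.

```python
class PlateauDecay:
    """Multiply lr by `factor` once the eval loss has not improved
    for more than `patience` consecutive epochs (assumed rule)."""

    def __init__(self, lr, factor, patience):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, eval_loss):
        """Record one epoch's eval loss and return the lr for the next epoch."""
        if eval_loss < self.best:
            self.best = eval_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr


# Toy eval losses, invented for illustration. With patience 0 the rate
# drops as soon as an epoch fails to improve on the best loss so far.
schedule = PlateauDecay(lr=0.001, factor=0.1, patience=0)
history = [schedule.step(loss) for loss in [5.0, 4.5, 4.6, 4.4]]
```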

For example, to train a simple language model, use:

python train.py --voc VOC_PATH --nx 300 --nhid 300 --rnn_dropout 0.5 --lr 0.001 --decay_factor 0.1 --decay_patience 0 \
--num_epochs 1 --batch_size 16 --clip 5 --random_seed 42 --exp_dir DIR_PATH --bptt 30 \
--pretrain --train_lm PATH_TO_WIKI_103_TRAIN --eval_lm PATH_TO_WIKI_103_EVAL --test_lm PATH_TO_WIKI_103_TEST

For example, to train a Seed + Input model, use:

python train.py --voc VOC_PATH --nx 300 --nhid 300 --rnn_dropout 0.5 --lr 0.001 --decay_factor 0.1 --decay_patience 0 \
--num_epochs 1 --batch_size 16 --clip 5 --random_seed 42 --exp_dir DIR_PATH \
--train_defs TRAIN_SPLIT_PATH --eval_defs EVAL_DEFS_PATH --test_defs TEST_DEFS_PATH --use_seed \
--use_input --input_train PREPARED_W2V_TRAIN_VECS --input_eval PREPARED_W2V_EVAL_VECS --input_test PREPARED_W2V_TEST_VECS
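
When training runs are launched from a script, the Seed + Input command above can be assembled as an argument list. This is a minimal sketch: build_train_command is a hypothetical helper, the paths are the same placeholders used above, and only flags that appear in the train.py usage text are emitted. The usage text lists --nlayers without brackets, so it may need to be supplied through the extra parameter.

```python
import shlex


def build_train_command(voc, exp_dir, defs, input_vecs, extra=()):
    """Return the argument list for a Seed + Input training run.

    `defs` and `input_vecs` map the split names "train", "eval" and "test"
    to file paths. Every path is a placeholder supplied by the caller.
    """
    cmd = [
        "python", "train.py",
        "--voc", voc,
        "--nx", "300", "--nhid", "300", "--rnn_dropout", "0.5",
        "--lr", "0.001", "--decay_factor", "0.1", "--decay_patience", "0",
        "--num_epochs", "1", "--batch_size", "16", "--clip", "5",
        "--random_seed", "42", "--exp_dir", exp_dir,
        "--use_seed", "--use_input",
    ]
    for split in ("train", "eval", "test"):
        cmd += [f"--{split}_defs", defs[split]]
        cmd += [f"--input_{split}", input_vecs[split]]
    cmd += list(extra)  # e.g. further flags taken from the usage text
    return cmd


cmd = build_train_command(
    voc="VOC_PATH",
    exp_dir="DIR_PATH",
    defs={
        "train": "TRAIN_SPLIT_PATH",
        "eval": "EVAL_DEFS_PATH",
        "test": "TEST_DEFS_PATH",
    },
    input_vecs={
        "train": "PREPARED_W2V_TRAIN_VECS",
        "eval": "PREPARED_W2V_EVAL_VECS",
        "test": "PREPARED_W2V_TEST_VECS",
    },
)

# Shell-safe string for logging; subprocess.run(cmd, check=True) would
# launch the run when executed from the repository root.
printable = " ".join(shlex.quote(arg) for arg in cmd)
```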

To train a Seed + Input model that was first pretrained as an unconditional LM, provide the path to the pretrained LM weights
as the --lm_ckpt argument in train.py.

To generate using a model, use python generate.py.

 usage: generate.py [-h] --params PARAMS --ckpt CKPT --tau TAU --n N --length
                   LENGTH [--prefix PREFIX] [--wordlist WORDLIST]
                   [--w2v_binary_path W2V_BINARY_PATH]
                   [--ada_binary_path ADA_BINARY_PATH]
                   [--prep_ada_path PREP_ADA_PATH]

Script to generate using model

optional arguments:
  -h, --help            show this help message and exit
  --params PARAMS       path to saved model params
  --ckpt CKPT           path to saved model weights
  --tau TAU             temperature to use in sampling
  --n N                 number of samples to generate
  --length LENGTH       maximum length of generated samples
  --prefix PREFIX       prefix to read until generation starts
  --wordlist WORDLIST   path to word list with words and contexts
  --w2v_binary_path W2V_BINARY_PATH
                        path to binary w2v file
  --ada_binary_path ADA_BINARY_PATH
                        path to binary ada file
  --prep_ada_path PREP_ADA_PATH
                        path to prep_ada.jl script
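
The --tau option is described as the sampling temperature. As a rough illustration of what a temperature does, the sketch below rescales a vector of scores before applying softmax and then samples an index. The function names and the toy scores are invented here; this is the generic technique, not necessarily how generate.py implements it.

```python
import math
import random


def temperature_probs(logits, tau):
    """Softmax of logits / tau; tau must be positive."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    scaled = [x / tau for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, tau, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = temperature_probs(logits, tau)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # toy scores over a 3-token vocabulary
sharp = temperature_probs(logits, tau=0.1)   # mass concentrates on the top token
flat = temperature_probs(logits, tau=10.0)   # close to uniform
token = sample_token(logits, tau=1.0, rng=random.Random(0))
```

A low tau makes samples nearly deterministic, while a high tau makes them more diverse.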

To evaluate a model, use python eval.py.

 usage: eval.py [-h] --params PARAMS --ckpt CKPT --datasplit DATASPLIT --type
               TYPE [--wordlist WORDLIST] [--tau TAU] [--n N]
               [--length LENGTH]

Script to evaluate model

optional arguments:
  -h, --help            show this help message and exit
  --params PARAMS       path to saved model params
  --ckpt CKPT           path to saved model weights
  --datasplit DATASPLIT
                        train, val or test set to evaluate on
  --type TYPE           compute ppl or bleu
  --wordlist WORDLIST   word list to evaluate on (by default all data will be
                        used)
  --tau TAU             temperature to use in sampling
  --n N                 number of samples to generate
  --length LENGTH       maximum length of generated samples
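
For reference, perplexity (the ppl option of --type) is conventionally the exponential of the negative mean log-probability per token. The following sketch computes it from a list of natural-log token probabilities. The function name and the toy values are assumptions for illustration, and eval.py may aggregate tokens differently.

```python
import math


def perplexity(token_log_probs):
    """exp of the negative mean natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy example: a model that assigns probability 0.25 to each of 4 tokens
# is as uncertain as a uniform choice among 4 options.
log_probs = [math.log(0.25)] * 4
ppl = perplexity(log_probs)
```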

To measure BLEU for a trained model, first evaluate it with eval.py using --type bleu,
then compute BLEU using python bleu.py.

 usage: bleu.py [-h] --ref REF --hyp HYP --n N [--with_contexts] --bleu_path
               BLEU_PATH --mode MODE

Script to compute BLEU

optional arguments:
  -h, --help            show this help message and exit
  --ref REF             path to file with references
  --hyp HYP             path to file with hypotheses
  --n N                 --n argument used to generate --ref file using eval.py
  --with_contexts       whether to consider contexts or not when compute BLEU
  --bleu_path BLEU_PATH
                        path to mosesdecoder sentence-bleu binary
  --mode MODE           whether to average or take random example per word
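
The --mode option chooses between averaging and taking a random example per word. One plausible reading is sketched below: sentence-level BLEU scores are grouped by the word being defined, reduced to one score per word, and then averaged over words. The mode names "average" and "random", the function name, and the toy scores are assumptions, not values confirmed by bleu.py.

```python
import random
from statistics import mean


def aggregate_bleu(scores_by_word, mode, rng=None):
    """Reduce per-sample BLEU scores to one corpus-level number.

    scores_by_word maps each defined word to the sentence-level BLEU
    scores of its generated definitions.
    """
    rng = rng or random.Random()
    if mode == "average":
        per_word = [mean(scores) for scores in scores_by_word.values()]
    elif mode == "random":
        per_word = [rng.choice(scores) for scores in scores_by_word.values()]
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return mean(per_word)


# Toy scores invented for illustration.
scores = {"bank": [10.0, 30.0], "bass": [50.0]}
avg_bleu = aggregate_bleu(scores, "average")
rand_bleu = aggregate_bleu(scores, "random", rng=random.Random(0))
```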

You can also pretrain the Attention module using python train_attention_skipgram.py
and then pass the path to the saved weights as the --attn_ckpt argument in train.py.

 usage: train_attention_skipgram.py [-h] [--data DATA] --context_voc
                                   CONTEXT_VOC [--prepared] --window WINDOW
                                   --random_seed RANDOM_SEED [--sparse]
                                   --vec_dim VEC_DIM --attn_hid ATTN_HID
                                   --attn_dropout ATTN_DROPOUT --lr LR
                                   --batch_size BATCH_SIZE --num_epochs
                                   NUM_EPOCHS --exp_dir EXP_DIR

Script to train a AttentionSkipGram model

optional arguments:
  -h, --help            show this help message and exit
  --data DATA           path to data
  --context_voc CONTEXT_VOC
                        path to context voc for DefinitionModelingModel is
                        necessary to save pretrained attention module,
                        particulary embedding matrix
  --prepared            whether to prepare data or use already prepared
  --window WINDOW       window for AttentionSkipGram model
  --random_seed RANDOM_SEED
                        random seed for training
  --sparse              whether to use sparse embeddings or not
  --vec_dim VEC_DIM     vector dim to train
  --attn_hid ATTN_HID   hidden size in attention module
  --attn_dropout ATTN_DROPOUT
                        dropout prob in attention module
  --lr LR               initial lr to use
  --batch_size BATCH_SIZE
                        batch size to use
  --num_epochs NUM_EPOCHS
                        number of epochs to train
  --exp_dir EXP_DIR     where to save weights, prepared data and logs
