CoLAKE

Source code for the paper "CoLAKE: Contextualized Language and Knowledge Embedding". If you have any problems reproducing the experiments, please feel free to contact us or open an issue.

Prepare your environment

We recommend creating a new environment.

conda create --name CoLAKE python=3.7
source activate CoLAKE

CoLAKE is implemented on top of fastNLP and huggingface's transformers, and uses fitlog to record the experiments.

git clone https://github.com/fastnlp/fastNLP.git
(cd fastNLP/ && python setup.py install)
git clone https://github.com/fastnlp/fitlog.git
(cd fitlog/ && python setup.py install)
pip install transformers==2.11
pip install scikit-learn

To retrain CoLAKE, you may need mixed CPU-GPU training to handle the large number of entities. Our implementation is based on the KVStore provided by DGL. In addition, to reproduce the link prediction experiments, you may also need DGL-KE.

pip install dgl==0.4.3
pip install dglke
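
Before running anything, it can help to confirm that the pinned versions above were actually installed. The following is a minimal sketch, not part of the repository: the helper names (`installed_version`, `matches_pin`, `check_pins`) are hypothetical, and it only covers the two packages this README pins explicitly.

```python
# Hypothetical helper, not shipped with CoLAKE: compare installed package
# versions against the pins given in this README.

PINS = {"transformers": "2.11", "dgl": "0.4.3"}


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        from importlib import metadata  # standard library from Python 3.8
    except ImportError:
        import pkg_resources  # shipped with setuptools, for Python 3.7
        try:
            return pkg_resources.get_distribution(package).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """True if `installed` equals the pin or is a sub-release of it."""
    return installed == pinned or installed.startswith(pinned + ".")


def check_pins(pins, lookup=installed_version):
    """Map each package name to True/False depending on whether it matches."""
    report = {}
    for package, pinned in pins.items():
        found = lookup(package)
        report[package] = found is not None and matches_pin(found, pinned)
    return report


if __name__ == "__main__":
    for package, ok in check_pins(PINS).items():
        print(package, "OK" if ok else "missing or different version")
```

The `lookup` argument exists only so the check can be exercised without the packages being installed.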

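Reproduce the experiments

1. Download the model and entity embeddings

Download the pre-trained CoLAKE model and the embeddings for more than 3M entities. To reproduce the experiments on LAMA and LAMA-UHN, you only need to download the model. You can use download_gdrive.py in this repo to download the files directly from Google Drive to your server.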

mkdir model
python download_gdrive.py 1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b ./model/model.bin
python download_gdrive.py 1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI ./model/entities.npy

Alternatively, you can use gdown.

pip install gdown
gdown "https://drive.google.com/uc?id=1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b"
gdown "https://drive.google.com/uc?id=1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI"
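
The two download methods use the same Google Drive file IDs; the gdown form simply places the ID in the `id` query parameter of the `uc` endpoint. As a small illustration (the `gdrive_url` helper is hypothetical and not part of this repo), the URLs can be built like this:

```python
from urllib.parse import urlencode

# Endpoint and file IDs are the ones listed in this README.
GDRIVE_ENDPOINT = "https://drive.google.com/uc"
FILES = {
    "model.bin": "1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b",
    "entities.npy": "1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI",
}


def gdrive_url(file_id):
    """Build the download URL for a Google Drive file ID (hypothetical helper)."""
    return GDRIVE_ENDPOINT + "?" + urlencode({"id": file_id})


if __name__ == "__main__":
    for name, file_id in FILES.items():
        print(name, gdrive_url(file_id))
```

Note that the URL contains no spaces around `?`, and that it should be quoted on the command line so the shell does not treat `?` as a wildcard.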

2. Run the experiments

Download the datasets for the experiments in the paper from Google Drive.

python download_gdrive.py 1UNXICdkB5JbRyS5WTq6QNX4ndpMlNob6 ./data.tar.gz
tar -xzvf data.tar.gz
cd finetune/

FewRel

python run_re.py --debug --gpu 0

Open Entity

python run_typing.py --debug --gpu 0

LAMA and LAMA-UHN

cd ../lama/
python eval_lama.py

Retrain CoLAKE

1. Download the data

Download the latest wiki dump (XML format):

wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Download the knowledge graph (Wikidata5M):

wget -c -O wikidata5m_transductive.tar.gz "https://www.dropbox.com/s/6sbhm0rwo4l73jq/wikidata5m_transductive.tar.gz?dl=1"
tar -xzvf wikidata5m_transductive.tar.gz

Download the Wikidata5M entity and relation aliases:

wget -c -O wikidata5m_alias.tar.gz "https://www.dropbox.com/s/lnbhc8yuhit4wm5/wikidata5m_alias.tar.gz?dl=1"
tar -xzvf wikidata5m_alias.tar.gz

2. Preprocess the data

Preprocess the wiki dump:

mkdir pretrain_data
# process xml-format wiki dump
python preprocess/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
# Modify anchors
python preprocess/extract.py 4
python preprocess/gen_data.py 4
# Count entity & relation frequency and generate vocabs
python statistic.py
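
The last step counts how often each entity and relation occurs and builds vocabularies from those counts. This README does not describe what statistic.py actually writes, so the following is only a generic sketch of frequency counting and vocabulary construction. The `build_vocab` function, the `min_freq` threshold, the special tokens, and the sample IDs are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(symbols, min_freq=1, specials=("<pad>", "<unk>")):
    """Count symbol frequencies and assign indices to the frequent ones.

    Hypothetical helper: special tokens come first, then symbols in order
    of decreasing frequency; symbols rarer than `min_freq` are left out.
    """
    counts = Counter(symbols)
    vocab = {token: index for index, token in enumerate(specials)}
    for symbol, freq in counts.most_common():
        if freq >= min_freq and symbol not in vocab:
            vocab[symbol] = len(vocab)
    return counts, vocab


if __name__ == "__main__":
    # Made-up entity mentions, standing in for anchors found in the wiki dump.
    mentions = ["Q42", "Q5", "Q42", "Q64", "Q42", "Q5"]
    counts, vocab = build_vocab(mentions, min_freq=2)
    print(counts)
    print(vocab)
```

Dropping rare symbols is one common way to keep the vocabulary manageable; whether and where CoLAKE applies such a threshold is not stated here.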

3. Train CoLAKE

Initialize the entity and relation embeddings with the average of the RoBERTa BPE embeddings of the entity and relation aliases:

cd pretrain/
python init_ent_rel.py
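
The initialization described above amounts to mean pooling: look up the embedding of each BPE token in an alias and average the vectors. The sketch below shows the idea with a toy embedding table in plain Python. It is not the code in init_ent_rel.py; the function names, the toy token IDs, and the choice to average over the tokens of all aliases at once are assumptions.

```python
def mean_pool(vectors):
    """Element-wise average of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector to average")
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


def init_embedding(alias_token_ids, token_embeddings):
    """Initialize one entity/relation embedding from its aliases.

    alias_token_ids: list of token-id lists, one list per alias.
    token_embeddings: mapping from token id to embedding vector
                      (stands in for the RoBERTa BPE embedding matrix).
    """
    token_vectors = [
        token_embeddings[token_id]
        for alias in alias_token_ids
        for token_id in alias
    ]
    return mean_pool(token_vectors)


if __name__ == "__main__":
    # Toy 2-dimensional table with three tokens (made-up values).
    table = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}
    print(init_embedding([[0, 1]], table))
    print(init_embedding([[0, 1], [2]], table))
```

Starting from averaged token embeddings places new entity and relation vectors in the same space as the pre-trained word embeddings, rather than at random points.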

Train CoLAKE with mixed CPU-GPU:

./run_pretrain.sh
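
The idea behind mixed CPU-GPU training is that the full entity embedding table stays in host memory, and each training step pulls only the rows it needs and pushes updates for those rows back. The following plain-Python sketch illustrates that pull/push pattern only. It is not DGL's KVStore API and not the code run by run_pretrain.sh; the class name, method names, and the plain SGD update are assumptions.

```python
class CpuEmbeddingStore:
    """Toy key-value store: entity id -> embedding row kept in host memory."""

    def __init__(self, num_entities, dim, init=0.0):
        self.rows = [[init] * dim for _ in range(num_entities)]

    def pull(self, ids):
        """Copy out only the rows needed for the current batch."""
        return [list(self.rows[i]) for i in ids]

    def push(self, ids, grads, lr=0.1):
        """Apply an SGD step to the touched rows; other rows are untouched."""
        for i, grad in zip(ids, grads):
            row = self.rows[i]
            for k in range(len(row)):
                row[k] -= lr * grad[k]


if __name__ == "__main__":
    store = CpuEmbeddingStore(num_entities=5, dim=2)
    batch_ids = [1, 3]
    batch = store.pull(batch_ids)        # would be moved to the GPU here
    grads = [[1.0, 1.0], [2.0, 0.0]]     # made-up gradients from the GPU
    store.push(batch_ids, grads, lr=0.5)
    print(store.rows)
```

Only the rows touched by a batch move between devices, which is what makes a table with millions of entities workable on limited GPU memory.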

Citation

If you use the code and model, please cite this paper:

@inproceedings{sun2020colake,
  author = {Tianxiang Sun and Yunfan Shao and Xipeng Qiu and Qipeng Guo and Yaru Hu and Xuanjing Huang and Zheng Zhang},
  title = {CoLAKE: Contextualized Language and Knowledge Embedding},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics, {COLING}},
  year = {2020}
}