CoLAKE

Source code for the paper "CoLAKE: Contextualized Language and Knowledge Embedding". If you have any problems reproducing the experiments, please feel free to contact us or open an issue.

Prepare your environment

We recommend creating a new environment.

conda create --name CoLAKE python=3.7
source activate CoLAKE

CoLAKE is implemented on top of fastNLP and huggingface's transformers, and uses fitlog to record the experiments.

git clone https://github.com/fastnlp/fastNLP.git
cd fastNLP/ && python setup.py install && cd ..
git clone https://github.com/fastnlp/fitlog.git
cd fitlog/ && python setup.py install && cd ..
pip install transformers==2.11
pip install scikit-learn
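
After installing, it can be worth confirming that the pinned version is the one actually active in the environment. The following is a minimal sketch, not part of the repository: the helper name `check_pins` is hypothetical, and on Python 3.7 it assumes the `importlib_metadata` backport is available.

```python
try:
    from importlib import metadata  # Python 3.8+
except ImportError:  # Python 3.7, as used in the conda environment above
    import importlib_metadata as metadata  # backport; assumed to be installed


def check_pins(pins, get_version=metadata.version):
    """Return {package: (wanted, found)} for every pin that is not satisfied."""
    problems = {}
    for package, wanted in pins.items():
        try:
            found = get_version(package)
        except Exception:  # package not installed (or lookup failed)
            found = None
        # A pin such as "2.11" is treated as matching "2.11" and "2.11.0".
        if found is None or not (found == wanted or found.startswith(wanted + ".")):
            problems[package] = (wanted, found)
    return problems


if __name__ == "__main__":
    pins = {"transformers": "2.11"}
    for package, (wanted, found) in check_pins(pins).items():
        print(f"{package}: wanted {wanted}, found {found}")
```

An empty result means every listed pin is satisfied; other pinned packages can be added to `pins` in the same way.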

To re-train CoLAKE, you may need mixed CPU-GPU training to handle the large number of entities. Our implementation is based on the KVStore provided by DGL. In addition, to reproduce the link prediction experiments, you may also need DGL-KE.

pip install dgl==0.4.3
pip install dglke
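
The idea behind mixed CPU-GPU training can be sketched roughly as follows: the large entity embedding table stays in host memory, and each batch pulls only the rows it needs and pushes updates back for those rows. This is an illustration of the concept only. The class and method names (`ToyEmbeddingStore`, `pull`, `push`) are hypothetical and are not DGL's KVStore API, and plain Python lists stand in for tensors.

```python
import random


class ToyEmbeddingStore:
    """Hypothetical stand-in for a CPU-side embedding table (not DGL's KVStore API)."""

    def __init__(self, num_entities, dim, seed=0):
        rng = random.Random(seed)
        # The full table lives in host memory; only requested rows leave it.
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(num_entities)]

    def pull(self, entity_ids):
        """Copy out the rows needed by the current batch."""
        return [list(self.table[i]) for i in entity_ids]

    def push(self, entity_ids, grads, lr=0.1):
        """Apply a plain SGD step to the rows touched by the batch."""
        for i, grad in zip(entity_ids, grads):
            row = self.table[i]
            for k, g in enumerate(grad):
                row[k] -= lr * g


store = ToyEmbeddingStore(num_entities=1000, dim=4)
batch_ids = [3, 42, 7]
rows = store.pull(batch_ids)            # in real training these would move to the GPU
grads = [[1.0] * 4 for _ in batch_ids]  # placeholder gradients for illustration
store.push(batch_ids, grads, lr=0.1)    # only three of the 1000 rows are updated
```

The point of the design is that GPU memory only ever holds the rows for one batch, so the table size is bounded by host memory rather than device memory.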

Reproduce the experiments

1. Download the model and entity embeddings

Download the pre-trained CoLAKE model and the embeddings for more than 3M entities. To reproduce the experiments on LAMA and LAMA-UHN, you only need to download the model. You can use download_gdrive.py in this repo to download files directly from Google Drive to your server:

mkdir model
python download_gdrive.py 1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b ./model/model.bin
python download_gdrive.py 1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI ./model/entities.npy

Alternatively, you can use gdown:

pip install gdown
gdown https://drive.google.com/uc?id=1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b
gdown https://drive.google.com/uc?id=1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI
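
Before running any experiment, a quick check that the downloads landed where later steps expect them can save a failed run. The sketch below is not part of the repository: `missing_files` is a hypothetical helper, and the paths assume the `./model/` layout used in the `download_gdrive.py` commands above (files fetched with gdown may need to be moved there first).

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths that do not exist or are empty."""
    return [str(p) for p in map(Path, paths)
            if not p.is_file() or p.stat().st_size == 0]


if __name__ == "__main__":
    expected = ["model/model.bin", "model/entities.npy"]
    for path in missing_files(expected):
        print(f"missing or empty: {path}")
```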

2. Run the experiments

Download the datasets for the experiments in the paper: Google Drive.

python download_gdrive.py 1UNXICdkB5JbRyS5WTq6QNX4ndpMlNob6 ./data.tar.gz
tar -xzvf data.tar.gz
cd finetune/

FewRel

python run_re.py --debug --gpu 0

Open Entity

python run_typing.py --debug --gpu 0

LAMA and LAMA-UHN

cd ../lama/
python eval_lama.py

Re-train CoLAKE

1. Download the data

Download the latest wiki dump (XML format):

wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Download the knowledge graph (Wikidata5M):

wget -c "https://www.dropbox.com/s/6sbhm0rwo4l73jq/wikidata5m_transductive.tar.gz?dl=1" -O wikidata5m_transductive.tar.gz
tar -xzvf wikidata5m_transductive.tar.gz

Download the Wikidata5M entity and relation aliases:

wget -c "https://www.dropbox.com/s/lnbhc8yuhit4wm5/wikidata5m_alias.tar.gz?dl=1" -O wikidata5m_alias.tar.gz
tar -xzvf wikidata5m_alias.tar.gz

2. Preprocess the data

Preprocess the wiki dump:

mkdir pretrain_data
# process xml-format wiki dump
python preprocess/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
# Modify anchors
python preprocess/extract.py 4
python preprocess/gen_data.py 4
# Count entity & relation frequency and generate vocabs
python statistic.py
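
The last step counts how often each entity and relation occurs and turns the counts into vocabularies. A minimal sketch of that kind of frequency-based vocabulary is shown below. It does not reproduce `statistic.py`: the function name `build_vocab`, the `min_freq` threshold, and the special tokens are all assumptions made for illustration.

```python
from collections import Counter


def build_vocab(ids, min_freq=1, specials=("<pad>", "<unk>")):
    """Map each id seen at least `min_freq` times to an index, most frequent first."""
    counts = Counter(ids)
    vocab = {tok: i for i, tok in enumerate(specials)}
    # Sort by descending frequency, then by id, so the result is deterministic.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            vocab[tok] = len(vocab)
    return vocab, counts


# Toy entity mentions; the Wikidata-style ids are purely illustrative strings.
mentions = ["Q1", "Q2", "Q1", "Q3", "Q1", "Q2"]
vocab, counts = build_vocab(mentions, min_freq=2)
print(vocab)
```

Dropping rare ids keeps the embedding table smaller; anything filtered out would fall back to the unknown token in this sketch.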

3. Train CoLAKE

Initialize the entity and relation embeddings with the average of the RoBERTa BPE embeddings of the entity and relation aliases:

cd pretrain/
python init_ent_rel.py
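
The averaging idea behind this initialization can be sketched on toy data. This is not the code in `init_ent_rel.py`: the function names are hypothetical, whitespace splitting stands in for RoBERTa's BPE tokenizer, 2-dimensional vectors stand in for RoBERTa's token embeddings, and averaging within each alias before averaging across aliases is an assumption about the order of operations.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def init_embedding(aliases, tokenize, token_embeddings):
    """Average token embeddings within each alias, then average across aliases."""
    alias_vectors = []
    for alias in aliases:
        tokens = tokenize(alias)
        if tokens:
            alias_vectors.append(mean_vector([token_embeddings[t] for t in tokens]))
    if not alias_vectors:
        raise ValueError("no non-empty alias to average")
    return mean_vector(alias_vectors)


# Toy stand-ins: whitespace splitting instead of BPE, 2-d vectors instead of RoBERTa's.
toy_embeddings = {"new": [1.0, 0.0], "york": [0.0, 1.0], "nyc": [1.0, 1.0]}
vector = init_embedding(["new york", "nyc"], str.split, toy_embeddings)
print(vector)
```

Starting entity and relation vectors in the same space as the token embeddings gives the model a sensible starting point compared with random initialization.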

Train CoLAKE with mixed CPU-GPU:

./run_pretrain.sh

Cite

If you use the code and model, please cite this paper:

@inproceedings{sun2020colake,
  author = {Tianxiang Sun and Yunfan Shao and Xipeng Qiu and Qipeng Guo and Yaru Hu and Xuanjing Huang and Zheng Zhang},
  title = {CoLAKE: Contextualized Language and Knowledge Embedding},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics, {COLING}},
  year = {2020}
}