CoLAKE

Source code for the paper "CoLAKE: Contextualized Language and Knowledge Embedding". If you have any problems reproducing the experiments, please feel free to contact us or open an issue.

Prepare your environment

We recommend creating a new environment.

conda create --name CoLAKE python=3.7
source activate CoLAKE

CoLAKE is implemented on top of fastNLP and huggingface's transformers, and uses fitlog to record the experiments.

git clone https://github.com/fastnlp/fastNLP.git
cd fastNLP/ && python setup.py install && cd ..
git clone https://github.com/fastnlp/fitlog.git
cd fitlog/ && python setup.py install && cd ..
pip install transformers==2.11
pip install scikit-learn
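
Before running anything, it can help to confirm that the packages installed above are importable from the active environment. The following is a minimal sketch and not part of the CoLAKE repository: the helper name `missing_packages` is hypothetical, and the module names are the import names assumed for the packages listed above.

```python
import importlib.util

# Import names assumed for the packages installed above
# (scikit-learn is imported as "sklearn").
REQUIRED_MODULES = ["fastNLP", "fitlog", "transformers", "sklearn"]


def missing_packages(modules=REQUIRED_MODULES):
    """Return the subset of `modules` that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages() or "none")
```

An empty result suggests the environment is ready; any name that is printed would need to be installed again.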

To re-train CoLAKE, you may need mixed CPU-GPU training to handle the large number of entities. Our implementation is based on the KVStore provided by DGL. In addition, to reproduce the experiments on link prediction, you may also need DGL-KE.

pip install dgl==0.4.3
pip install dglke
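
The general idea behind keeping a large embedding table in a key-value store is that each training step pulls only the rows needed for the current batch and pushes updates back for those rows alone. The sketch below illustrates that pull/push pattern with a toy in-memory class. It is not DGL's KVStore API and not the CoLAKE implementation; every name in it is hypothetical.

```python
class ToyEmbeddingStore:
    """Toy key-value store: entity ID -> embedding vector (a list of floats)."""

    def __init__(self, num_entities, dim):
        self.table = {i: [0.0] * dim for i in range(num_entities)}

    def pull(self, ids):
        """Return copies of the rows needed for the current batch."""
        return {i: list(self.table[i]) for i in ids}

    def push(self, grads, lr=0.1):
        """Apply a plain SGD update to only the rows that received gradients."""
        for i, grad in grads.items():
            self.table[i] = [w - lr * g for w, g in zip(self.table[i], grad)]


store = ToyEmbeddingStore(num_entities=5, dim=2)
batch_rows = store.pull([1, 3])        # only two rows leave the store
store.push({1: [1.0, -1.0]}, lr=0.5)   # sparse update for entity 1
print(store.table[1])
```

In a real setup the table would hold millions of rows in CPU memory while the pulled rows are moved to the GPU for the forward and backward pass.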

Reproduce the experiments

1. Download the model and entity embeddings

Download the pre-trained CoLAKE model and its embeddings for more than 3 million entities. To reproduce the experiments on LAMA and LAMA-UHN, you only need to download the model. You can use download_gdrive.py in this repository to download the files directly from Google Drive to your server:

mkdir model
python download_gdrive.py 1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b ./model/model.bin
python download_gdrive.py 1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI ./model/entities.npy

Alternatively, you can use gdown:

pip install gdown
gdown "https://drive.google.com/uc?id=1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b"
gdown "https://drive.google.com/uc?id=1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI"
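
The two Google Drive file IDs above can also be kept in one place and turned into download URLs programmatically. The sketch below only builds and prints the URLs. The `FILES` mapping and the `gdrive_url` helper are illustrative names, and the commented line shows where `gdown.download` could be called if gdown is installed.

```python
from urllib.parse import urlencode

# File IDs copied from the commands above; the output paths mirror ./model/.
FILES = {
    "1MEGcmJUBXOyxKaK6K88fZFyj_IbH9U5b": "./model/model.bin",
    "1_FG9mpTrOnxV2NolXlu1n2ihgSZFXHnI": "./model/entities.npy",
}


def gdrive_url(file_id):
    """Build the direct-download URL for a Google Drive file ID."""
    return "https://drive.google.com/uc?" + urlencode({"id": file_id})


for file_id, output in FILES.items():
    url = gdrive_url(file_id)
    print(url, "->", output)
    # With gdown installed, the download itself would be:
    # gdown.download(url, output, quiet=False)
```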

2. Run the experiments

Download the datasets used for the experiments in the paper from Google Drive:

python download_gdrive.py 1UNXICdkB5JbRyS5WTq6QNX4ndpMlNob6 ./data.tar.gz
tar -xzvf data.tar.gz
cd finetune/

FewRel

python run_re.py --debug --gpu 0

Open Entity

python run_typing.py --debug --gpu 0

LAMA and LAMA-UHN

cd ../lama/
python eval_lama.py

Re-train CoLAKE

1. Download the data

Download the latest wiki dump (XML format):

wget -c https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

Download the knowledge graph (Wikidata5M):

wget -c "https://www.dropbox.com/s/6sbhm0rwo4l73jq/wikidata5m_transductive.tar.gz?dl=1" -O wikidata5m_transductive.tar.gz
tar -xzvf wikidata5m_transductive.tar.gz

Download the entity and relation aliases of Wikidata5M:

wget -c "https://www.dropbox.com/s/lnbhc8yuhit4wm5/wikidata5m_alias.tar.gz?dl=1" -O wikidata5m_alias.tar.gz
tar -xzvf wikidata5m_alias.tar.gz
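
Once the alias archive is extracted, the alias files can be loaded into a dictionary for inspection. The sketch below assumes each line holds an entity or relation identifier followed by its aliases, all separated by tabs. That layout is an assumption to check against the extracted files, and the helper names are hypothetical.

```python
def parse_alias_line(line):
    """Split one tab-separated line into (identifier, [aliases])."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], [alias for alias in fields[1:] if alias]


def load_aliases(lines):
    """Map each identifier to its alias list, from any iterable of lines."""
    return dict(parse_alias_line(line) for line in lines if line.strip())


# Made-up sample lines in the assumed format (not real Wikidata5M content).
sample = ["Q1\tfirst alias\tsecond alias\n", "P1\trelation alias\n"]
print(load_aliases(sample))
```

Passing an open file object instead of `sample` would read a whole alias file line by line.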

2. Preprocess the data

Preprocess the wiki dump:

mkdir pretrain_data
# process xml-format wiki dump
python preprocess/WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o pretrain_data/output -l --min_text_length 100 --filter_disambig_pages -it abbr,b,big --processes 4
# Modify anchors
python preprocess/extract.py 4
python preprocess/gen_data.py 4
# Count entity & relation frequency and generate vocabs
python statistic.py
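
The last step above counts how often each entity and relation occurs and derives vocabularies from those counts. The repository's statistic.py is not reproduced here. The following is a generic sketch of frequency-based vocabulary building, in which the input list, the `build_vocab` helper, and the `min_count` threshold are all hypothetical.

```python
from collections import Counter


def build_vocab(items, min_count=1):
    """Count occurrences and map each kept item to an integer index.

    Items are ordered by descending frequency; ties keep first-seen order.
    """
    counts = Counter(items)
    kept = [item for item, n in counts.most_common() if n >= min_count]
    return {item: index for index, item in enumerate(kept)}, counts


# Hypothetical entity mentions collected from the preprocessed text.
mentions = ["Q1", "Q2", "Q1", "Q3", "Q1", "Q2"]
vocab, counts = build_vocab(mentions, min_count=2)
print(vocab)  # {'Q1': 0, 'Q2': 1}
```

A frequency threshold of this kind is one common way to keep the vocabulary, and therefore the embedding table, to a manageable size.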

3. Train CoLAKE

Initialize the entity and relation embeddings with the average of the RoBERTa BPE embeddings of the entity and relation aliases:

cd pretrain/
python init_ent_rel.py
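
The initialization described above can be pictured as averaging token embeddings over an item's aliases. The sketch below is a toy illustration in plain Python, not the code in init_ent_rel.py. It assumes the average is taken over all tokens of all aliases, and it substitutes a whitespace tokenizer and a hand-written 2-dimensional table for RoBERTa's BPE tokenizer and embedding matrix.

```python
def average_embedding(aliases, tokenize, embedding_table):
    """Average the embeddings of every known token across all aliases."""
    vectors = [embedding_table[token]
               for alias in aliases
               for token in tokenize(alias)
               if token in embedding_table]
    if not vectors:
        raise ValueError("no known tokens for aliases: %r" % (aliases,))
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy 2-dimensional table and whitespace tokenizer; both are stand-ins for
# RoBERTa's BPE vocabulary and word embedding matrix.
toy_table = {"new": [1.0, 0.0], "york": [0.0, 1.0], "nyc": [1.0, 1.0]}
vector = average_embedding(["new york", "nyc"], str.split, toy_table)
print(vector)
```

Starting from averaged token embeddings places each entity or relation vector in the same space as the language model's word embeddings, rather than at a random point.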

Train CoLAKE with mixed CPU-GPU training:

./run_pretrain.sh

Citation

If you use the code and model, please cite this paper:

@inproceedings{sun2020colake,
  author = {Tianxiang Sun and Yunfan Shao and Xipeng Qiu and Qipeng Guo and Yaru Hu and Xuanjing Huang and Zheng Zhang},
  title = {CoLAKE: Contextualized Language and Knowledge Embedding},
  booktitle = {Proceedings of the 28th International Conference on Computational Linguistics, {COLING}},
  year = {2020}
}