A test interface is now available: search for the WeChat official account OpenDialog to try it.

OpenDialog is built on PyTorch and the transformers library. It provides a series of transformer-based Chinese open-domain dialogue models (chitchat), collects existing data resources, and continuously adds corresponding Chinese dialogue datasets, with the aim of building an open-source Chinese chitchat dialogue platform.
Latest developments:
2020.8.20, completed the interface for the LCCC-GPT-Large generative open-domain pre-trained model; run the following command to start the corresponding service:

./run_flask.sh lccc <gpu_id>
2020.10.26, completed a batch of bi-encoder retrieval dialogue models (bert-bi-encoder, polyencoder, etc.)
...
OpenDialog core files and directories:

- data: datasets, configuration files, vocabularies, word vectors, and dataset processing scripts
- models: dialogue models
- metrics: evaluation metrics
- multiview: multi-view re-ranking model, which re-ranks the candidate dialogue responses
- ckpt: stores the trained models
- rest: stores tensorboard logs and the result files generated during the test phase
- utils: utility functions
- dataloader.py: dataset loading script
- main.py: main entry script
- header.py: the packages that need to be imported
- eval.py: calls the evaluation metrics in metrics to evaluate the result files generated in rest
- run.sh: batch run script
- run_flask.sh: loads the model and starts the service

Basic system environment: Linux/Ubuntu-16.04+, Python 3.6+, GPU (default 1080 Ti)
Install the Python dependencies:
pip install -r requirements.txt
Install ElasticSearch

The retrieval-based dialogue systems first use elasticsearch for coarse-grained candidate retrieval (rough screening). To support Chinese word segmentation in this coarse retrieval stage, a Chinese tokenizer also needs to be downloaded and installed.
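No specific version or tokenizer is pinned here; the sketch below assumes ElasticSearch 7.6.1 with the widely used elasticsearch-analysis-ik plugin serving as the Chinese tokenizer (both version numbers are assumptions and must match each other):

```bash
# Sketch: ElasticSearch 7.6.1 + IK Chinese analyzer (versions are assumptions;
# the IK plugin version must match the ElasticSearch version exactly)
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.6.1-linux-x86_64.tar.gz
cd elasticsearch-7.6.1

# Install the IK plugin for Chinese word segmentation
./bin/elasticsearch-plugin install \
    https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.1/elasticsearch-analysis-ik-7.6.1.zip

# Start ElasticSearch as a daemon and verify it responds
./bin/elasticsearch -d
curl http://localhost:9200
```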
Install MongoDB

After the service is started, mongodb is used to store the conversation history and other necessary data.
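A minimal installation sketch for Ubuntu 16.04 (the package and service names below are distribution details, not from this README; newer Ubuntu releases typically use the mongodb-org packages instead):

```bash
# Sketch for Ubuntu 16.04: install the distribution's MongoDB package
sudo apt-get update
sudo apt-get install -y mongodb

# Start the daemon and confirm it answers on the default port 27017
sudo systemctl start mongodb
mongo --eval 'db.runCommand({ ping: 1 })'
```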
Prepare the data: the datasets and their processing scripts go under the data directory, and the word vector files chinese_w2v.txt and english_w2v.bin are also stored under data. See data/README.md for data details and the preprocessed data.
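A quick sanity check that the word vectors are in place (dataset subdirectories depend on which corpora you prepare):

```bash
# Verify the word-vector files were placed under data/
ls data/chinese_w2v.txt data/english_w2v.bin
```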
Train a model:

./run.sh train <dataset> <model> <gpu_ids>

<gpu_ids> is a comma-separated list of GPU IDs, such as 0,1,2,3; <dataset> is consistent with the dataset name in the data directory.

Model | CMD | Type | Details | Refer | Pre-train Model |
---|---|---|---|---|---|
bertretrieval | ./run.sh train <dataset> bertretrieval <gpu_ids> | retrieval | BERT-based fine-tuning model | Paper | |
gpt2 | ./run.sh train <dataset> gpt2 <gpu_ids> | generative | GPT-2 generative dialogue model | Code | |
gpt2gan | ./run.sh train <dataset> gpt2gan <gpu_ids> | generative | GAN-based dialogue model: the generator is GPT-2 and the discriminator is a BERT binary classifier | Paper | |
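For example, a sketch of fine-tuning the BERT-based retrieval model on GPUs 0 and 1 (<dataset> stands for whichever corpus directory was prepared under data):

```bash
./run.sh train <dataset> bertretrieval 0,1
```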
Start the flask service:
./run_flask.sh <model_name> <gpu_id>
Call the interface
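The request format is not documented in this section, so the following is a purely hypothetical illustration: the route, port, and JSON fields are assumptions, not the project's actual API.

```bash
# Hypothetical request -- endpoint path, port, and payload fields are
# illustrative assumptions; check the flask app for the real API
curl -X POST http://localhost:8080/chat \
    -H "Content-Type: application/json" \
    -d '{"msg": "你好", "user_id": "test_user"}'
```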