A test interface is now available: search for the WeChat official account OpenDialog to try it.

OpenDialog is built on PyTorch and the transformers library. It provides a series of transformer-based Chinese open-domain dialogue models (chitchat), collects existing data resources, and continuously adds corresponding Chinese dialogue datasets, with the aim of building an open-source Chinese chitchat dialogue platform.
Latest developments:
2020.8.20, completed the interface for the LCCC-GPT-Large generative open-domain pre-trained model; run the following command to start the corresponding service:

./run_flask.sh lccc <gpu_id>
2020.10.26, completed a batch of bi-encoder retrieval dialogue models (bert-bi-encoder, polyencoder, etc.)
...
OpenDialog core files and directories:

- data: datasets, configuration files, vocabularies, word vectors, and dataset processing scripts
- models: dialogue models
- metrics: evaluation metrics
- multiview: multi-view re-ranking model, which re-ranks the candidate dialogue responses
- ckpt: stores the trained models
- rest: stores tensorboard logs and the result files generated during the test phase
- utils: utility functions
- dataloader.py: dataset loading script
- main.py: main entry script
- header.py: the packages that need to be imported
- eval.py: calls the evaluation metrics in metrics to evaluate the result files generated in rest
- run.sh: batch run script
- run_flask.sh: loads the model and starts the service

Basic system environment: Linux/Ubuntu-16.04+, Python 3.6+, GPU (default 1080 Ti)
Install the Python dependencies:
pip install -r requirements.txt
Install ElasticSearch

The retrieval-based dialogue systems first use elasticsearch for coarse-grained candidate retrieval (rough screening). To support Chinese word segmentation in this coarse retrieval stage, a Chinese tokenizer also needs to be downloaded and installed.
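No specific version or tokenizer is pinned here; the sketch below assumes ElasticSearch 7.6.1 with the widely used elasticsearch-analysis-ik plugin serving as the Chinese tokenizer (both version numbers are assumptions and must match each other):

```bash
# Sketch: ElasticSearch 7.6.1 + IK Chinese analyzer (versions are assumptions;
# the IK plugin version must match the ElasticSearch version exactly)
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.6.1-linux-x86_64.tar.gz
cd elasticsearch-7.6.1

# Install the IK plugin for Chinese word segmentation
./bin/elasticsearch-plugin install \
    https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.6.1/elasticsearch-analysis-ik-7.6.1.zip

# Start ElasticSearch as a daemon and verify it responds
./bin/elasticsearch -d
curl http://localhost:9200
```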
Install MongoDB

After the service is started, mongodb is used to store the conversation history and other necessary data.
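A minimal installation sketch for Ubuntu 16.04 (the package and service names below are distribution details, not from this README; newer Ubuntu releases typically use the mongodb-org packages instead):

```bash
# Sketch for Ubuntu 16.04: install the distribution's MongoDB package
sudo apt-get update
sudo apt-get install -y mongodb

# Start the daemon and confirm it answers on the default port 27017
sudo systemctl start mongodb
mongo --eval 'db.runCommand({ ping: 1 })'
```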
Prepare the data: the datasets and their processing scripts go under the data directory, and the word vector files chinese_w2v.txt and english_w2v.bin are also stored under data. See data/README.md for data details and the preprocessed data.
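A quick sanity check that the word vectors are in place (dataset subdirectories depend on which corpora you prepare):

```bash
# Verify the word-vector files were placed under data/
ls data/chinese_w2v.txt data/english_w2v.bin
```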
Train a model:

./run.sh train <dataset> <model> <gpu_ids>

<gpu_ids> is a comma-separated list of GPU IDs, such as 0,1,2,3; <dataset> is consistent with the dataset name in the data directory.

Model | CMD | Type | Details | Refer | Pre-train Model |
---|---|---|---|---|---|
bertretrieval | ./run.sh train <dataset> bertretrieval <gpu_ids> | retrieval | BERT-based fine-tuning model | Paper | |
gpt2 | ./run.sh train <dataset> gpt2 <gpu_ids> | generative | GPT-2 generative dialogue model | Code | |
gpt2gan | ./run.sh train <dataset> gpt2gan <gpu_ids> | generative | GAN-based dialogue model: the generator is GPT-2 and the discriminator is a BERT binary classifier | Paper | |
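For example, a sketch of fine-tuning the BERT-based retrieval model on GPUs 0 and 1 (<dataset> stands for whichever corpus directory was prepared under data):

```bash
./run.sh train <dataset> bertretrieval 0,1
```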
Start the flask service:
./run_flask.sh <model_name> <gpu_id>
Call the interface
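The request format is not documented in this section, so the following is a purely hypothetical illustration: the route, port, and JSON fields are assumptions, not the project's actual API.

```bash
# Hypothetical request -- endpoint path, port, and payload fields are
# illustrative assumptions; check the flask app for the real API
curl -X POST http://localhost:8080/chat \
    -H "Content-Type: application/json" \
    -d '{"msg": "你好", "user_id": "test_user"}'
```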