
Chinese Chatbot PyTorch Implementation


Xiaozhi, another Chinese chatbot :yum:

A Chinese chatbot written by @Doragd, trained on the fun Chinese corpus qingyun :snowman:

She may not be perfect :muscle: or all that great :paw_prints:,
but I coded her myself :sparkling_heart:, so
I hope everyone will give a few more stars to support this NLP beginner and his friend Xiaozhi.

Background

This project is a sub-module of a software engineering course project. The overall goal is to develop an intelligent customer service ticket processing system.

The ticket system works as follows: when a user asks a question, the system first searches the knowledge base for a related question. If one exists, the system returns its answer, and a user who is not satisfied with that answer can submit a ticket directly. If no related question exists in the knowledge base, this chatbot is called to reply automatically.
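The routing described above can be sketched as follows. This is a minimal illustration and not the project's code: `answer`, `toy_lookup`, and `toy_chatbot` are hypothetical names, and the knowledge base is stood in for by a plain dictionary with exact-match lookup.

```python
from typing import Callable, Optional, Tuple


def answer(question: str,
           kb_lookup: Callable[[str], Optional[str]],
           generate_reply: Callable[[str], str],
           use_qa_first: bool = True) -> Tuple[str, str]:
    """Return (source, reply): try the knowledge base first, else the chatbot."""
    if use_qa_first:
        hit = kb_lookup(question)
        if hit is not None:
            return "knowledge_base", hit
    # No related question found (or the knowledge base is disabled).
    return "chatbot", generate_reply(question)


# Toy stand-ins for the real knowledge base and the trained model.
toy_kb = {"how do i renew a domain name": "Open the console and choose Renew."}


def toy_lookup(question: str) -> Optional[str]:
    return toy_kb.get(question.strip().lower())


def toy_chatbot(question: str) -> str:
    return "(generated reply)"
```

Keeping the lookup and the generator as injected callables means the same routing can be exercised without loading a model or a database.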

The service scenario is similar to Tencent Cloud's customer service system. Customers mostly ask about related products (cloud servers, domain names, and so on), so the knowledge base is a collection of (question, answer) pairs covering consultation and troubleshooting for cloud servers, domain names, and the like.

The system's front-end interface and the message exchange between front end and back end were built by another classmate, @adjlyadv, mainly with React and Django.

@Doragd was responsible for collecting the knowledge base and for writing, training, and testing the chatbot. This repo covers that part.

Test results

Answering without the knowledge base

Answering with the knowledge base

Overall system:

Project structure

│  .gitignore
│  config.py               # model configuration parameters
│  corpus.pth              # preprocessed dataset
│  dataload.py             # dataloader
│  datapreprocess.py       # data preprocessing
│  LICENSE
│  main.py
│  model.py
│  README.md
│  requirements.txt
│  train_eval.py           # training, validation, and testing
│
├─checkpoints
│      chatbot_0509_1437   # trained model
│
├─clean_chat_corpus
│      qingyun.tsv         # corpus
│
├─QA_data
│      QA.db               # knowledge base
│      QA_test.py          # called when the knowledge base is used
│      stop_words.txt      # stop words
│      __init__.py
│
└─utils
        beamsearch.py      # TODO, unfinished
        greedysearch.py    # greedy search, used for testing
        __init__.py

Dependencies

Install dependencies

$ pip install -r requirements.txt

Getting started

Data preprocessing (optional)

$ python datapreprocess.py

This preprocesses the corpus and generates corpus.pth. (corpus.pth is already included in the repository, so this step can be skipped.)

Adjustable parameters:

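# datapreprocess.py
corpus_file = 'clean_chat_corpus/qingyun.tsv'  # raw dialogue dataset
max_voc_length = 10000  # maximum vocabulary size
min_word_appear = 10  # minimum frequency for a word to enter the vocabulary
max_sentence_length = 50  # maximum sentence length
save_path = 'corpus.pth'  # where the processed dialogue dataset is saved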
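To make the effect of these parameters concrete, here is a rough sketch of how they could shape a vocabulary. It is not the code in datapreprocess.py: `build_vocab`, the special tokens, and the choice to count the special tokens toward `max_voc_length` are all assumptions made for illustration.

```python
from collections import Counter

SPECIALS = ["<PAD>", "<SOS>", "<EOS>", "<UNK>"]  # assumed special tokens


def build_vocab(pairs, max_voc_length=10000, min_word_appear=10,
                max_sentence_length=50):
    """pairs: list of (question_tokens, answer_tokens), each a list of strings."""
    # Drop pairs in which either side is longer than the limit.
    kept = [(q, a) for q, a in pairs
            if len(q) <= max_sentence_length and len(a) <= max_sentence_length]
    counts = Counter(tok for q, a in kept for tok in q + a)
    # Most frequent first; keep only tokens that appear often enough.
    frequent = [tok for tok, n in counts.most_common() if n >= min_word_appear]
    # Cap the total size, special tokens included (an assumption).
    words = SPECIALS + frequent[: max_voc_length - len(SPECIALS)]
    word2ix = {w: i for i, w in enumerate(words)}
    return word2ix, kept
```

Tokens that fall below the frequency threshold or past the size cap would be mapped to `<UNK>` when sentences are converted to indices.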

Usage

Using the knowledge base

To use the knowledge base, pass use_QA_first=True. The input string is then first matched against the knowledge base, and the answer to the best-matching question is returned. If no match is found, the chatbot is called to generate a reply.

The knowledge base here is a collection of 100 frequently asked questions and answers compiled from the official Tencent Cloud documentation. It is for testing only!

$ python main.py chat --use_QA_first=True

Without the knowledge base

The Tencent Cloud question-answer pairs were added to meet the course project's requirements and have nothing to do with the chatbot itself, so for general use set use_QA_first=False. This parameter defaults to True.

$ python main.py chat --use_QA_first=False

Using the default parameters

$ python main.py chat

To exit the chat, enter exit, quit, or q.

Other configurable parameters

These are documented in the config.py file.

To override a parameter, pass it on the command line, for example:

$ python main.py chat --model_ckpt='checkpoints/chatbot_0509_1437' --use_QA_first=False

The command above sets the path of the trained model to load and whether to use the knowledge base.
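One way such `--key=value` overrides can be applied to a configuration object is sketched below. This is an assumption about the general pattern, not the project's actual parsing code: `Config` and `apply_overrides` are hypothetical, and the `model_ckpt` default shown is a placeholder.

```python
import ast


class Config:
    # Field names come from the commands above.
    model_ckpt = None        # placeholder default (assumption)
    use_QA_first = True      # the text above states this defaults to True


def apply_overrides(config, argv):
    """Apply ['--key=value', ...] onto config, rejecting unknown keys."""
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            raise ValueError(f"expected --key=value, got {arg!r}")
        key, raw = arg[2:].split("=", 1)
        if not hasattr(config, key):
            raise AttributeError(f"unknown option: {key}")
        try:
            value = ast.literal_eval(raw)   # parses True, 64, 0.5, 'text'
        except (ValueError, SyntaxError):
            value = raw                     # otherwise keep the raw string
        setattr(config, key, value)
    return config
```

Rejecting unknown keys catches typos such as `--use_qa_first`, which would otherwise be silently ignored.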

Technical implementation

Corpus

Corpus name	Size	Source	Characteristics	Sample	Word-segmented
qingyun (Qingyun corpus)	100,000 (10W)	A chatbot discussion group	Relatively good quality, everyday-life oriented	Q: It seems you love money very much. A: Oh, really? Then you're almost there	No

Source: https://github.com/codemayq/chinese_chatbot_corpus

Seq2Seq

Encoder: two-layer bidirectional GRU
Decoder: two-layer unidirectional GRU
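A minimal PyTorch sketch of this encoder/decoder layout is given below to show the tensor shapes involved. It is not the code in model.py: the class names, the embedding size, and the way the two encoder directions are summed to seed the decoder are assumptions, and attention is left out.

```python
import torch
import torch.nn as nn


class Encoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size,
                          num_layers=num_layers, bidirectional=True)

    def forward(self, tokens):                      # tokens: (seq_len, batch)
        outputs, hidden = self.gru(self.embedding(tokens))
        h = self.gru.hidden_size
        # Sum the forward and backward halves -> (seq_len, batch, hidden).
        outputs = outputs[:, :, :h] + outputs[:, :, h:]
        # hidden is (num_layers * 2, batch, hidden); sum the two directions
        # per layer so it matches the unidirectional decoder (an assumption).
        batch = hidden.size(1)
        hidden = hidden.reshape(self.gru.num_layers, 2, batch, h).sum(dim=1)
        return outputs, hidden


class Decoder(nn.Module):
    def __init__(self, vocab_size, hidden_size, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, hidden_size)
        self.gru = nn.GRU(hidden_size, hidden_size, num_layers=num_layers)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, token, hidden):               # token: (1, batch)
        output, hidden = self.gru(self.embedding(token), hidden)
        return self.out(output.squeeze(0)), hidden  # logits: (batch, vocab)
```

The decoder processes one time step per call, which is the form needed for greedy decoding at inference time.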

Attention

Global attention, using the dot-product score function
Ref. https://arxiv.org/abs/1508.04025
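For reference, global attention with the dot score computes score(h_t, h_s) = h_t · h_s for every source position, normalises the scores with a softmax, and takes the weighted sum of encoder outputs as the context vector. The sketch below illustrates this under assumed tensor layouts; `dot_attention` is a hypothetical helper, and masking of padded positions is omitted for brevity.

```python
import torch
import torch.nn.functional as F


def dot_attention(decoder_output, encoder_outputs):
    """decoder_output: (batch, hidden); encoder_outputs: (src_len, batch, hidden)."""
    # score(h_t, h_s) = h_t . h_s for each source position s -> (src_len, batch)
    scores = torch.sum(decoder_output.unsqueeze(0) * encoder_outputs, dim=2)
    # Normalise over source positions so the weights sum to 1 per batch item.
    weights = F.softmax(scores, dim=0)
    # Weighted sum of encoder outputs -> context vector of shape (batch, hidden).
    context = torch.sum(weights.unsqueeze(2) * encoder_outputs, dim=0)
    return context, weights
```

In the referenced paper, the context vector is then concatenated with the decoder state and passed through a tanh layer before predicting the next token.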

Model training and evaluation

$ python train_eval.py train [--options]

Quantitative evaluation has not been implemented yet; it should be measured with perplexity. For now, the eval mode only generates sentences, whose quality has to be judged manually.
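Perplexity is the exponential of the average negative log-likelihood per token. Since the project does not implement it yet, the following is only a sketch of what such a measure could look like; `perplexity` is a hypothetical helper that expects the natural-log probabilities the model assigned to each reference token, with padding already excluded.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) over the given tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

A model that assigns probability 0.25 to every reference token has a perplexity of 4, which can be read as being as uncertain as a uniform choice among four tokens.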

$ python train_eval.py eval [--options]

Pitfalls and lessons learned

The biggest lesson is that "between understanding deep learning in theory and mastering it lie N programming implementations." However clear the theory seems, all sorts of problems appear once you start coding: processing the dataset, implementing the formulas, tuning parameters, configuring the GPU, and so on.
In practice, I first worked through the Chatbot part of the PyTorch tutorial. Once it ran, I swapped in the new corpus, preprocessed it, and refactored the code into classes. Then came a long debugging process with many pitfalls, especially when moving tensors to the GPU, mainly because I did not know exactly what to(device) moves and when.
- Testing showed that model.to(device) moves only the model's registered parameters (and buffers) to the GPU. It does not move ordinary tensors stored as attributes of the class, and if you create a new tensor in the forward method, you must remember to move it yourself.
- Order matters too: move the model to the GPU first, and only then define the optimizer. As for how to move it, write model = model.to(device) and don't forget the assignment; for a plain tensor, to(device) returns a new tensor, so nothing changes without it.
- The GPU runs out of memory easily. Watch memory usage when writing code and avoid unnecessary copies of tensors.
- After switching to the Chinese corpus, training at first would not converge. It turned out that batch_size was set too small. My impression is that batch_size should be as large as GPU memory allows. I had read this before but forgot it completely when writing the code, which shows I had not really understood mini-batches when I first saw them. Only writing real code makes it stick; at the very least, the bugs stick.
- I also misunderstood torch.long, taking it for a high-precision floating-point type when it is actually int64. This caused a bug that took a long time to track down. The lesson: read the documentation carefully.
- The final gain is hands-on familiarity with how to actually implement a model, which matters a lot.
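The device-related points in the list above can be illustrated with a small example. `Scaler` is a made-up module used only for demonstration; the sketch falls back to the CPU when no GPU is available.

```python
import torch
import torch.nn as nn


class Scaler(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))        # moved by .to(device)
        self.register_buffer("offset", torch.zeros(3))   # moved by .to(device)
        self.plain = torch.zeros(3)                      # NOT moved: plain attribute

    def forward(self, x):
        # Tensors created here must be placed on the right device explicitly.
        extra = torch.zeros(3, device=x.device)
        return x * self.weight + self.offset + extra


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = Scaler().to(device)                               # move the model first ...
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # ... then build the optimizer
indices = torch.tensor([0, 2], dtype=torch.long)          # torch.long is int64
```

Registering a tensor with `register_buffer` is the usual way to have a non-trainable tensor follow the module across devices.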
In fact, the model's results are not very good. Apart from the limitations of the model itself, I found that the quality of word segmentation strongly affects the quality of the generated sentences. I did not even set stop words during segmentation, so some strange results occur.
Another problem concerns variable-length sequences: a hand-written loss function easily becomes unstable. I am still studying the official API.
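One built-in option for padded, variable-length targets is `torch.nn.functional.cross_entropy` with `ignore_index`, which averages the loss over non-padding tokens only. The sketch below shows that option; it is not the loss used in this project, and `masked_nll` and the padding index `PAD = 0` are assumptions.

```python
import torch
import torch.nn.functional as F

PAD = 0  # assumed padding index


def masked_nll(logits, targets):
    """logits: (seq_len, batch, vocab); targets: (seq_len, batch), padded with PAD."""
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),  # flatten to (seq_len * batch, vocab)
        targets.reshape(-1),                  # flatten to (seq_len * batch,)
        ignore_index=PAD,                     # padded positions contribute nothing
    )
```

Because it works on raw logits and applies log-softmax internally, this avoids taking the logarithm of very small probabilities by hand.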
During this project I also found that my understanding of some parameters was not deep enough and that I did not know how to tune them, so I need to strengthen the theory.
Evaluation of the model still remains to be done.

Acknowledgments

Official Chatbot Tutorial
- https://pytorch.org/tutorials/beginner/chatbot_tutorial.html
Chinese corpus provider
- https://github.com/codemayq/chinese_chatbot_corpus
Same content as the official Chatbot Tutorial, with detailed code comments
- http://fancyerii.github.io/2019/02/14/chatbot/
Reference for model code style and conventions
- https://github.com/chenyuntc/pytorch-book


Additional Information

Version 1.0.0
Type AI Source Code
Update Time 2024-12-19
size 83.54MB
From Github
