Datasets for Training Chatbot System
This project collects dialogue corpora found on the Internet for training Chinese (and English) chatbots.
Some of the collected datasets are listed below; click a link to visit the original source.
dgk_shooter_min.conv.zip
A Chinese movie dialogue corpus. It is relatively noisy, and many of the dialogues do not form clean question-answer pairs.
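The .conv files are plain line-oriented text. Below is a minimal parsing sketch, assuming the common convention in which a line containing only `E` marks a conversation boundary, lines starting with `M ` carry one utterance each, and tokens within an utterance are joined by `/`; the exact file name and separators are assumptions and may need adjusting for the actual archive.

```python
# Minimal sketch for reading a dgk-style .conv file.
# Assumed layout: "E" starts a new conversation, "M <text>" holds one
# utterance, and characters/words inside an utterance are joined by "/".
def load_conversations(path):
    conversations = []
    current = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line == "E":                      # conversation boundary
                if current:
                    conversations.append(current)
                current = []
            elif line.startswith("M "):          # one utterance
                utterance = line[2:].replace("/", "")
                if utterance:
                    current.append(utterance)
    if current:
        conversations.append(current)
    return conversations

# Hypothetical usage after unzipping dgk_shooter_min.conv.zip:
# dialogs = load_conversations("dgk_shooter_min.conv")
# pairs = [(d[i], d[i + 1]) for d in dialogs for i in range(len(d) - 1)]
```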
The NUS SMS Corpus
Contains Chinese and English SMS messages and is said to be the largest publicly available SMS corpus in the world.
ChatterBot Chinese basic chat corpus
Basic Chinese chat corpora provided by the ChatterBot chat engine. The quantity is small, but the quality is relatively high.
Datasets for Natural Language Processing
A collection of natural language processing datasets compiled by others. It mainly covers three areas: Question Answering, Dialogue Systems, and Goal-Oriented Dialogue Systems, all in English. These can be machine-translated into Chinese for use in Chinese dialogue systems, as in the sketch below.
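As a rough illustration of that translation step, the sketch below runs each English question-answer pair through a translation function and writes the result back out. The `translate_to_chinese` helper is hypothetical and stands in for whatever MT service or model is available; the JSON-lines layout and the "question"/"answer" field names are also assumptions about the English datasets, not their actual format.

```python
import json

def translate_to_chinese(text):
    # Hypothetical placeholder: plug in any machine translation
    # service or local model here; it only needs to map str -> str.
    raise NotImplementedError("wire up an MT backend")

def translate_qa_file(src_path, dst_path):
    # Assumes one JSON object per line with "question"/"answer" fields.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = json.loads(line)
            record["question"] = translate_to_chinese(record["question"])
            record["answer"] = translate_to_chinese(record["answer"])
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
```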
Xiaohuangji: said to be the corpus of the Xiaohuangji chatbot, distributed as Xiaohuangji50w_fenciA.conv.zip (word-segmented) and xiaohuangji50w_nofenci.conv.zip (unsegmented).
The Egret Times Chinese Q&A corpus was compiled from the 10,000+ questions in the Q&A section of the official Egret Times forum; only records marked with a "best answer" were selected. The raw data were manually reviewed, and each question was given an acceptable answer. The corpus currently contains only 2,907 question-answer pairs. (backup)
Chat corpus repository
A collection of chat corpora from various open sources.
Includes: open subtitles, English movie subtitles, Chinese lyrics, and English tweets.
The insurance industry QA corpus is a dataset generated by translating insuranceQA. train_data contains 12,889 questions and 141,779 records; test_data contains 2,000 questions and 22,000 records; valid_data contains 2,000 questions and 22,000 records. All three splits use a positive:negative ratio of 1:10.
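To make the 1:10 positive-to-negative split concrete, the sketch below shows one way such a file might be turned into (question, answer, label) examples for answer-ranking models. The tab-separated layout and the field order are assumptions for illustration, not the actual file format.

```python
# Sketch: read a QA ranking file where each record is assumed to be
# "<label>\t<question>\t<answer>", with label 1 for the correct answer
# and 0 for a distractor. With a 1:10 ratio, each question appears on
# one positive line and ten negative lines.
def load_qa_examples(path):
    examples = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue  # skip malformed lines
            label, question, answer = parts
            examples.append((question, answer, int(label)))
    return examples
```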
The corpora in this section are reported to circulate on the Internet, but we have not yet obtained them, either because of our limited resources or because the original authors have not released them publicly. They are listed here for future reference.
All original corpora belong to their original authors.
He Yunchao
weibo: @Yunchao_He