Chatbot research needs more than a good model: it also needs a large training corpus to strengthen the bot. The cleaner the corpus, the closer the trained chatbot's replies can come to natural human language.
dgk_shooter_min.conv movie dialogue corpus (A Chinese movie dialogue corpus. It is noisy: the dialogue does not mark the speaker, so it is hard to match each question to its answer.)
ChatBot multi-language chat corpus (A multi-language dialogue corpus provided by the ChatterBot chat engine. It covers many languages; the volume is small but the quality is high, which makes it suitable for model testing.)
DataSets for Natural Language Processing (A hand-curated collection of natural language processing research papers and their corresponding datasets. The main areas covered include question answering, dialogue systems, and goal-oriented dialogue systems. The text is in English and can be used for machine translation and conversational models.)
"xiaohuangji" dialogue corpus (A well-known dialogue corpus published online, unsegmented. The two parts are separated by "/", and there is no semantic segmentation. The corpus contains many emoticons, the dialogues are short overall, and there is considerable noise.)
A Chinese QA pairs dataset (Composed of questions and responses from the Q&A section of the official Egret Times forum. Records marked as "best answer" were selected as the target responses, and the data was manually reviewed so that each question has an acceptable answer. The dataset is small and mostly in question-and-answer form.)
Cornell_Movie-Dialogs_Corpus (Cornell University's movie dialogue dataset. The corpus is in English, includes speaker names, and consists mainly of multi-turn dialogues.)
Chinese Quatrains Corpus (Classical Chinese quatrains with five characters per line.)
Obama Political Speeches Corpus (Excerpts from President Obama's political speeches.)
Chinese news corpus (News headlines and summaries crawled from major news websites.)
PTT gossip board tweets (Crawled from the gossip board of the social platform PTT. The raw data is PTT gossip board tweets.txt, which contains symbol and whitespace noise. The noise is filtered first: a statistical method replaces symbols with fixed symbols in proportion, which reduces the complexity of the data. The question-and-answer corpus and the dictionary are then built with different tokenization methods, such as single characters or phrases (jieba word segmentation).)
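
The cleaning and dictionary-building steps described for the PTT data can be sketched roughly as follows. This is a minimal illustration, not the original pipeline: the source does not spell out its statistical replacement rule, so the placeholder tokens (`<url>`, `<num>`), the punctuation set, and the function names `clean_line`, `char_tokens`, and `build_vocab` are all assumptions made for the example. It uses single-character tokens only.

```python
import re
from collections import Counter

# Hypothetical placeholder tokens; the source does not name the fixed symbols it uses.
URL_TOKEN = "<url>"
NUM_TOKEN = "<num>"

# Placeholders stay whole, Latin words stay whole, every other
# non-space character (e.g. a Chinese character) is its own token.
TOKEN_RE = re.compile(r"<url>|<num>|[A-Za-z]+|\S")


def clean_line(text: str) -> str:
    """Reduce symbol and whitespace noise in one raw line (assumed rules)."""
    text = re.sub(r"https?://\S+", URL_TOKEN, text)      # links -> one fixed token
    text = re.sub(r"\d+", NUM_TOKEN, text)               # digit runs -> one fixed token
    text = re.sub(r"([!?.~,！？。～，])\1+", r"\1", text)  # collapse repeated punctuation
    text = re.sub(r"\s+", " ", text)                     # collapse whitespace runs
    return text.strip()


def char_tokens(text: str) -> list[str]:
    """Split cleaned text into single-character tokens."""
    return TOKEN_RE.findall(text)


def build_vocab(lines, min_count: int = 1) -> dict[str, int]:
    """Map each token to an integer id, most frequent tokens first."""
    counts = Counter(
        tok for line in lines for tok in char_tokens(clean_line(line))
    )
    vocab = {"<pad>": 0, "<unk>": 1}  # reserved ids, also an assumption
    for tok, n in counts.most_common():
        if n >= min_count and tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


if __name__ == "__main__":
    raw = ["好  棒!!!   http://a.b/1 123", "好"]
    print([clean_line(r) for r in raw])
    print(build_vocab(raw))
```

For phrase-level tokens, `char_tokens` could be swapped for a word segmenter such as jieba (for example `jieba.lcut`), leaving the rest of the sketch unchanged.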
The copyright of each public corpus belongs to its original author. No one may use the corpora for profit-making activities without the author's permission. Thank you for your cooperation.
Tags: Corpus, Chatbot