An NLP corpus preparation tool. A friendly reminder: this project is for academic research only; the author takes no responsibility for any consequences caused by using it for other purposes. About two years have passed, and I want to update this project once more, purely out of a sense of responsibility and conviction. This update is as follows:
It seems the previous image links are no longer available, so they have simply been removed. I can only say that domestic cloud providers are too stingy: they cut off your links the moment you stop paying. Or maybe the overall market is just bad and everyone who was going to go under has gone under.
In fact, I stopped working on NLP two years ago: after finishing graduate school I went into the autonomous driving field. But I have never given up on NLP, and nothing stops me from keeping at it out of interest. To that end, here is some of the latest interesting stuff:
The largest AI algorithm marketplace in China: http://manaai.cn
A forum for discussing question answering systems: http://t.manaai.cn
This project will continue to be updated and maintained; thank you for your attention.
More than twenty days have passed since this project was started, and it feels like five hundred years, but things are finally moving!! This project will keep being updated. To make contributing easier, I have launched a companion project: https://github.com/jinfagang/weibo_terminator_workflow.git. If you want to help crawl the corpus together, please star the workflow project as well; if you just want to play with the Weibo crawler, keep following this project.
weibo_terminator, the Weibo Terminator crawler, is basically ready.
This update includes the following features:
If you think that is all, you are too young, too simple. The more important updates are:

Building on the huge Weibo network, we have launched the Terminator Project: crawling a Chinese Weibo corpus together. This updated repo contains a weibo_id.list file with the IDs of nearly 8 million users, classified by category (don't ask me where it came from). Each contributor is assigned a range of IDs, crawls all of those users' Weibo (for example, `realangelababy`), and uploads the results to our internal Baidu cloud disk. The full data set can be obtained only by contributors and the weibo_terminator authors.

One final note: this project drew on some similar projects, but the features it implements and the complexity of the problems it handles are not comparable to theirs. We build on the latest web API and Python 3, while many other projects are based on scrapy; this project does not use any such crawler libraries at all, for no other reason than that projects built with those libraries lack flexibility, which we do not like. We hope everyone understands.
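To make the range-assignment idea concrete, here is a minimal sketch of how weibo_id.list could be split into per-contributor id_file chunks. This is not the project's actual tooling, and it assumes the list simply holds one user id per line; the real file format and the way the administrator assigns ranges may differ.

```python
# Sketch only: slice weibo_id.list into per-contributor id_file chunks.
# Assumes one user id per line, which may not match the real file format.
from pathlib import Path


def split_id_list(src="weibo_id.list", chunk_size=50000, out_dir="id_chunks"):
    ids = [line.strip() for line in Path(src).read_text(encoding="utf-8").splitlines() if line.strip()]
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i in range(0, len(ids), chunk_size):
        chunk = out / f"id_file_{i // chunk_size:04d}"
        chunk.write_text("\n".join(ids[i:i + chunk_size]), encoding="utf-8")


if __name__ == "__main__":
    split_id_list()
```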
Finally, everyone is welcome to submit issues. We will keep this project open source, maintained, and updated!!
Contribution tips:
- `git clone https://github.com/jinfagang/weibo_terminater.git`;
- configure `settings/config.py`, following the instructions there;
- set up your accounts in `settings/accounts.py`; multiple accounts are supported now, and the terminator will dispatch them automatically (a dispatch sketch follows this list);
- run `python3 main.py -i realangelababy` to scrape a single user, or set `settings/id_file` for multi-user scraping;
- contact the administrator `jintianiloveu` if you want to contribute; the administrator will hand you an id_file that is unique within our project;
- scraped data is saved under `./weibo_detail`, with each id stored separately.

WT & TIANEYE COPYRIGHT
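The automatic multi-account dispatch mentioned in the list can be pictured as a simple rotation over account cookies. The snippet below only illustrates that idea with a made-up ACCOUNTS structure; the real format of settings/accounts.py and the dispatcher inside weibo_terminator may look quite different.

```python
# Illustration of round-robin cookie dispatch across several accounts.
# ACCOUNTS is a made-up structure, not the real settings/accounts.py format.
import itertools

ACCOUNTS = [
    {"name": "account_1", "cookie": "SUB=xxx; SUBP=yyy"},
    {"name": "account_2", "cookie": "SUB=aaa; SUBP=bbb"},
]

_account_cycle = itertools.cycle(ACCOUNTS)


def next_cookie():
    """Return the cookie of the next account in rotation."""
    return next(_account_cycle)["cookie"]


if __name__ == "__main__":
    for _ in range(4):
        print(next_cookie())
```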
We have set up several groups for this project:
QQ:

- AI智能自然语言处理: 476464663
- Tensorflow智能聊天Bot: 621970965
- GitHub深度学习开源交流: 263018023

WeChat:

- add the administrator `jintianiloveu` to be invited into the group.
This part was missing from the first commit; see the usage help:
```
# -h: show help
python3 main.py -h

# -i: specify a single id, or the path to an id_file (one id per line)
python3 main.py -i 167385960
python3 main.py -i ./id_file

# -f: filter mode; 0 keeps only original weibo, 1 also includes reposts; default is 0
python3 main.py -i 16758795 -f 0

# -d: debug mode for testing; note that debug mode only supports a single id
python3 main.py -i 178600077 -d 1
```
That's all, simple and easy.
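For reference, a command-line interface with these flags could be wired up with argparse roughly as below. This is only a sketch of the documented -i, -f and -d options, not the actual contents of main.py.

```python
# Sketch of an argument parser matching the documented flags; not the real main.py.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="weibo_terminator scraper")
    parser.add_argument("-i", "--id", required=True,
                        help="a single weibo user id, or a path to an id_file (one id per line)")
    parser.add_argument("-f", "--filter", type=int, default=0, choices=[0, 1],
                        help="0: original weibo only, 1: include reposts")
    parser.add_argument("-d", "--debug", type=int, default=0, choices=[0, 1],
                        help="debug mode for testing; only supports a single id")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.id, args.filter, args.debug)
```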
Cookies may still get banned if our scraper keeps pulling data from Weibo; that is exactly why this job has to be done with everyone's combined strength, since no one can build such a big corpus alone. If your cookie expires or gets banned, we strongly recommend switching to another Weibo account (a friend's, or anyone else's) and continuing to scrape. One thing to keep in mind: weibo_terminator remembers its scraping progress and will resume from where it stopped last time. :)
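The resume behaviour described above comes down to persisting scrape progress between runs. Below is a minimal sketch of that idea using a JSON checkpoint file; the file name and layout are assumptions, and weibo_terminator's real bookkeeping may be different.

```python
# Minimal checkpoint sketch: remember the last finished page per user id so a
# restarted run can resume instead of starting over. The JSON file name and
# layout are assumptions, not weibo_terminator's actual storage.
import json
from pathlib import Path

CHECKPOINT = Path("scrape_progress.json")


def load_progress():
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text(encoding="utf-8"))
    return {}


def save_progress(progress):
    CHECKPOINT.write_text(json.dumps(progress, ensure_ascii=False, indent=2), encoding="utf-8")


def resume_page(user_id):
    """Page to start from for this user (1 if never scraped before)."""
    return load_progress().get(user_id, 0) + 1


def mark_page_done(user_id, page):
    progress = load_progress()
    progress[user_id] = page
    save_progress(progress)
```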
As for the chat/dialogue system itself, I will open source a separate project later. The purpose of this repo is to build high-quality dialogue material from Weibo. This project will continue to be developed, so please star it!! Always open source!
This project is dedicated to defeating Weibo's anti-crawler mechanisms, pooling everyone's effort to crawl huge amounts of Weibo comments and produce an open-source, high-quality Chinese dialogue corpus that advances the research and development of Chinese dialogue systems. The system already implements the features described above.
I hope more people will contribute; there is still a lot of work to be done. PRs are welcome!
Chinese corpora have long been a sore point: there is no institution or organization maintaining good public data sets. Abroad, by contrast, English corpora are abundant and carefully curated.
The author believes the Weibo corpus offers the widest coverage and is the most active and freshest. Whether a dialogue system built on it ends up accurate is another matter, but it definitely supplies fresh, up-to-date vocabulary.
The Weibo posts and comments scraped for a specified user are formatted as follows (a small parser sketch follows the sample):
```
E
4月15日#傲娇与偏见# 超前点映,跟我一起去抢光它 [太开心] 傲娇与偏见 8.8元超前点映 顺便预告一下,本周四(13号)下
午我会微博直播送福利,不见不散哦[坏笑] 电影傲娇与偏见的秒拍视频 <200b><200b><200b>
E
F
<哈哈哈哈哈哈狗->: 还唱吗[doge]
<緑麓>: 绿麓!
<哈哈哈哈哈哈狗->: [doge][doge]
<至诚dliraba>: 哈哈哈哈哈哈哈
<五只热巴肩上扛>: 大哥已经唱完了[哆啦A梦吃惊]
<哈哈哈哈哈哈狗->: 大哥[哆啦A梦吃惊]
<独爱Dear>: 10:49坐等我迪的直播[喵喵][喵喵][喵喵]
<四只热巴肩上扛>: 对不起[可怜]我不赶
<四只热巴肩上扛>: 哈狗[哆啦A梦花心][哆啦A梦花心]
<至诚dliraba>: 哈狗来了 哈哈哈
<四只热巴肩上扛>: [摊手]绿林鹿去哪里了!!!!
<哈哈哈哈哈哈狗->: 阿健[哆啦A梦花心]
<至诚dliraba>: 然而你还要赶我出去[喵喵]
<四只热巴肩上扛>: 我也很绝望
<至诚dliraba>: 只剩翻墙而来的我了
<四只热巴肩上扛>: [摊手]我能怎么办
<四只热巴肩上扛>: [摊手]一首歌唱到一半被掐断是一个歌手的耻辱[摊手]
<至诚dliraba>: 下一首
<四只热巴肩上扛>: 最害怕就是黑屋[摊手]
<至诚dliraba>: 我脑海一直是 跨过傲娇与偏见 永恒的信念
F
```
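If you want to consume these dump files programmatically, a small parser like the one below can split them into posts with their comments. It is a sketch written only against the sample above (E...E delimits a post, F...F delimits its comments, and each comment line looks like `<user>: text`); real dump files may contain edge cases it does not handle.

```python
# Sketch parser for the dump format shown above:
#   E ... E  -> one weibo post
#   F ... F  -> the comments for that post, one "<user>: text" per line
# Written against the sample only; real files may differ in details.
import re

COMMENT_RE = re.compile(r"^<(?P<user>[^>]+)>:\s*(?P<text>.*)$")


def parse_dump(path):
    posts, block, mode = [], [], None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line == "E":
                if mode == "post":
                    posts.append({"post": "\n".join(block), "comments": []})
                    block, mode = [], None
                else:
                    block, mode = [], "post"
            elif line == "F":
                mode = None if mode == "comments" else "comments"
            elif mode == "post":
                block.append(line)
            elif mode == "comments" and posts:
                m = COMMENT_RE.match(line)
                if m:
                    posts[-1]["comments"].append((m.group("user"), m.group("text")))
    return posts
```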
Notes:
The corpus crawled so far is the raw version; how you use it is up to you. For example, it can be used to build topic-comment bots. The author will continue developing post-processing programs that turn the raw Weibo data into conversational form, and will open source them (a minimal pairing sketch follows). Of course, interested readers are welcome to submit PRs; the best solutions will be adopted to push this project forward.
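As one possible starting point for that post-processing, consecutive comments under the same post can be paired into (query, reply) turns. The sketch below shows this naive pairing on the parsed structure from the previous snippet; it is just one idea, not the author's planned pipeline.

```python
# Naive dialogue pairing: treat consecutive comments under the same post as
# (query, reply) turns. One possible strategy only, not the official pipeline.
def comments_to_pairs(posts):
    """posts: list of {"post": str, "comments": [(user, text), ...]} dicts."""
    pairs = []
    for item in posts:
        texts = [text for _user, text in item["comments"] if text.strip()]
        pairs.extend(zip(texts, texts[1:]))
    return pairs


if __name__ == "__main__":
    sample = [{"post": "demo post",
               "comments": [("a", "还唱吗"), ("b", "大哥已经唱完了"), ("a", "下一首")]}]
    for query, reply in comments_to_pairs(sample):
        print(query, "->", reply)
```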
If you have any questions about the project, you can contact me on WeChat: jintianiloveu. Issues are also welcome.
(c) 2017 Jin Fagang & Tianmu Inc. & weibo_terminator authors LICENSE Apache 2.0