An NLP corpus preparation tool. A friendly reminder: this project is for academic research only; the author takes no responsibility for any consequences caused by using it for other purposes. About two years have passed, and I want to update this project once more, purely out of a sense of responsibility and conviction. This update is as follows:
It seems the previous image links are no longer available, so they have simply been removed. I can only say that domestic cloud providers are too stingy: they cut off your links the moment you stop paying. Or maybe the overall market is just bad and everyone who was going to go under has gone under.
In fact, I stopped working on NLP two years ago: after finishing graduate school I went into the autonomous driving field. But I have never given up on NLP, and nothing stops me from keeping at it out of interest. To that end, here is some of the latest interesting stuff:
The largest AI algorithm marketplace in China: http://manaai.cn
A forum for discussing question answering systems: http://t.manaai.cn
This project will continue to be updated and maintained; thank you for your attention.
More than twenty days have passed since this project was started, and it feels like five hundred years, but things are finally moving!! This project will keep being updated. To make contributing easier, I have launched a companion project: https://github.com/jinfagang/weibo_terminator_workflow.git. If you want to help crawl the corpus together, please star the workflow project as well; if you just want to play with the Weibo crawler, keep following this project.
weibo_terminator, the Weibo Terminator crawler, is basically ready.
This update includes the following features:
If you think that is all, you are too young, too simple. The more important updates are:

Building on the huge Weibo network, we have launched the Terminator Project: crawling a Chinese Weibo corpus together. This updated repo contains a weibo_id.list file with the IDs of nearly 8 million users, classified by category (don't ask me where it came from). Each contributor is assigned a range of IDs, crawls all of those users' Weibo (for example, `realangelababy`), and uploads the results to our internal Baidu cloud disk. The full data set can be obtained only by contributors and the weibo_terminator authors.

One final note: this project drew on some similar projects, but the features it implements and the complexity of the problems it handles are not comparable to theirs. We build on the latest web API and Python 3, while many other projects are based on scrapy; this project does not use any such crawler libraries at all, for no other reason than that projects built with those libraries lack flexibility, which we do not like. We hope everyone understands.
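To make the range-assignment idea concrete, here is a minimal sketch of how weibo_id.list could be split into per-contributor id_file chunks. This is not the project's actual tooling, and it assumes the list simply holds one user id per line; the real file format and the way the administrator assigns ranges may differ.

```python
# Sketch only: slice weibo_id.list into per-contributor id_file chunks.
# Assumes one user id per line, which may not match the real file format.
from pathlib import Path


def split_id_list(src="weibo_id.list", chunk_size=50000, out_dir="id_chunks"):
    ids = [line.strip() for line in Path(src).read_text(encoding="utf-8").splitlines() if line.strip()]
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for i in range(0, len(ids), chunk_size):
        chunk = out / f"id_file_{i // chunk_size:04d}"
        chunk.write_text("\n".join(ids[i:i + chunk_size]), encoding="utf-8")


if __name__ == "__main__":
    split_id_list()
```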
Finally, everyone is welcome to submit issues. We will keep this project open source, maintained, and updated!!
Contribution tips:
- `git clone https://github.com/jinfagang/weibo_terminater.git`;
- configure `settings/config.py`, following the instructions there;
- set up your accounts in `settings/accounts.py`; multiple accounts are supported now, and the terminator will dispatch them automatically (a dispatch sketch follows this list);
- run `python3 main.py -i realangelababy` to scrape a single user, or set `settings/id_file` for multi-user scraping;
- contact the administrator `jintianiloveu` if you want to contribute; the administrator will hand you an id_file that is unique within our project;
- scraped data is saved under `./weibo_detail`, with each id stored separately.

WT & TIANEYE COPYRIGHT
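The automatic multi-account dispatch mentioned in the list can be pictured as a simple rotation over account cookies. The snippet below only illustrates that idea with a made-up ACCOUNTS structure; the real format of settings/accounts.py and the dispatcher inside weibo_terminator may look quite different.

```python
# Illustration of round-robin cookie dispatch across several accounts.
# ACCOUNTS is a made-up structure, not the real settings/accounts.py format.
import itertools

ACCOUNTS = [
    {"name": "account_1", "cookie": "SUB=xxx; SUBP=yyy"},
    {"name": "account_2", "cookie": "SUB=aaa; SUBP=bbb"},
]

_account_cycle = itertools.cycle(ACCOUNTS)


def next_cookie():
    """Return the cookie of the next account in rotation."""
    return next(_account_cycle)["cookie"]


if __name__ == "__main__":
    for _ in range(4):
        print(next_cookie())
```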
We have set up several groups for this project:
QQ:

- AI智能自然语言处理: 476464663
- Tensorflow智能聊天Bot: 621970965
- GitHub深度学习开源交流: 263018023

WeChat:

- add the administrator `jintianiloveu` to be invited into the group.
This part was missing from the first commit; see the usage help:
```
# -h: show help
python3 main.py -h

# -i: specify a single id, or the path to an id_file (one id per line)
python3 main.py -i 167385960
python3 main.py -i ./id_file

# -f: filter mode; 0 keeps only original weibo, 1 also includes reposts; default is 0
python3 main.py -i 16758795 -f 0

# -d: debug mode for testing; note that debug mode only supports a single id
python3 main.py -i 178600077 -d 1
```
That's all, simple and easy.
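For reference, a command-line interface with these flags could be wired up with argparse roughly as below. This is only a sketch of the documented -i, -f and -d options, not the actual contents of main.py.

```python
# Sketch of an argument parser matching the documented flags; not the real main.py.
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="weibo_terminator scraper")
    parser.add_argument("-i", "--id", required=True,
                        help="a single weibo user id, or a path to an id_file (one id per line)")
    parser.add_argument("-f", "--filter", type=int, default=0, choices=[0, 1],
                        help="0: original weibo only, 1: include reposts")
    parser.add_argument("-d", "--debug", type=int, default=0, choices=[0, 1],
                        help="debug mode for testing; only supports a single id")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.id, args.filter, args.debug)
```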
Cookies may still get banned if our scraper keeps pulling data from Weibo; that is exactly why this job has to be done with everyone's combined strength, since no one can build such a big corpus alone. If your cookie expires or gets banned, we strongly recommend switching to another Weibo account (a friend's, or anyone else's) and continuing to scrape. One thing to keep in mind: weibo_terminator remembers its scraping progress and will resume from where it stopped last time. :)
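The resume behaviour described above comes down to persisting scrape progress between runs. Below is a minimal sketch of that idea using a JSON checkpoint file; the file name and layout are assumptions, and weibo_terminator's real bookkeeping may be different.

```python
# Minimal checkpoint sketch: remember the last finished page per user id so a
# restarted run can resume instead of starting over. The JSON file name and
# layout are assumptions, not weibo_terminator's actual storage.
import json
from pathlib import Path

CHECKPOINT = Path("scrape_progress.json")


def load_progress():
    if CHECKPOINT.exists():
        return json.loads(CHECKPOINT.read_text(encoding="utf-8"))
    return {}


def save_progress(progress):
    CHECKPOINT.write_text(json.dumps(progress, ensure_ascii=False, indent=2), encoding="utf-8")


def resume_page(user_id):
    """Page to start from for this user (1 if never scraped before)."""
    return load_progress().get(user_id, 0) + 1


def mark_page_done(user_id, page):
    progress = load_progress()
    progress[user_id] = page
    save_progress(progress)
```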
As for the chat/dialogue system itself, I will open source a separate project later. The purpose of this repo is to build high-quality dialogue material from Weibo. This project will continue to be developed, so please star it!! Always open source!
This project is dedicated to defeating Weibo's anti-crawler mechanisms, pooling everyone's effort to crawl huge amounts of Weibo comments and produce an open-source, high-quality Chinese dialogue corpus that advances the research and development of Chinese dialogue systems. The system already implements the features described above.
I hope more people will contribute; there is still a lot of work to be done. PRs are welcome!
Chinese corpora have long been a sore point: there is no institution or organization maintaining good public data sets. Abroad, by contrast, English corpora are abundant and carefully curated.
The author believes the Weibo corpus offers the widest coverage and is the most active and freshest. Whether a dialogue system built on it ends up accurate is another matter, but it definitely supplies fresh, up-to-date vocabulary.
The Weibo posts and comments scraped for a specified user are formatted as follows (a small parser sketch follows the sample):
```
E
4月15日#傲娇与偏见# 超前点映,跟我一起去抢光它 [太开心] 傲娇与偏见 8.8元超前点映 顺便预告一下,本周四(13号)下
午我会微博直播送福利,不见不散哦[坏笑] 电影傲娇与偏见的秒拍视频 <200b><200b><200b>
E
F
<哈哈哈哈哈哈狗->: 还唱吗[doge]
<緑麓>: 绿麓!
<哈哈哈哈哈哈狗->: [doge][doge]
<至诚dliraba>: 哈哈哈哈哈哈哈
<五只热巴肩上扛>: 大哥已经唱完了[哆啦A梦吃惊]
<哈哈哈哈哈哈狗->: 大哥[哆啦A梦吃惊]
<独爱Dear>: 10:49坐等我迪的直播[喵喵][喵喵][喵喵]
<四只热巴肩上扛>: 对不起[可怜]我不赶
<四只热巴肩上扛>: 哈狗[哆啦A梦花心][哆啦A梦花心]
<至诚dliraba>: 哈狗来了 哈哈哈
<四只热巴肩上扛>: [摊手]绿林鹿去哪里了!!!!
<哈哈哈哈哈哈狗->: 阿健[哆啦A梦花心]
<至诚dliraba>: 然而你还要赶我出去[喵喵]
<四只热巴肩上扛>: 我也很绝望
<至诚dliraba>: 只剩翻墙而来的我了
<四只热巴肩上扛>: [摊手]我能怎么办
<四只热巴肩上扛>: [摊手]一首歌唱到一半被掐断是一个歌手的耻辱[摊手]
<至诚dliraba>: 下一首
<四只热巴肩上扛>: 最害怕就是黑屋[摊手]
<至诚dliraba>: 我脑海一直是 跨过傲娇与偏见 永恒的信念
F
```
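If you want to consume these dump files programmatically, a small parser like the one below can split them into posts with their comments. It is a sketch written only against the sample above (E...E delimits a post, F...F delimits its comments, and each comment line looks like `<user>: text`); real dump files may contain edge cases it does not handle.

```python
# Sketch parser for the dump format shown above:
#   E ... E  -> one weibo post
#   F ... F  -> the comments for that post, one "<user>: text" per line
# Written against the sample only; real files may differ in details.
import re

COMMENT_RE = re.compile(r"^<(?P<user>[^>]+)>:\s*(?P<text>.*)$")


def parse_dump(path):
    posts, block, mode = [], [], None
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.rstrip("\n")
            if line == "E":
                if mode == "post":
                    posts.append({"post": "\n".join(block), "comments": []})
                    block, mode = [], None
                else:
                    block, mode = [], "post"
            elif line == "F":
                mode = None if mode == "comments" else "comments"
            elif mode == "post":
                block.append(line)
            elif mode == "comments" and posts:
                m = COMMENT_RE.match(line)
                if m:
                    posts[-1]["comments"].append((m.group("user"), m.group("text")))
    return posts
```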
Notes:
The corpus crawled so far is the raw version; how you use it is up to you. For example, it can be used to build topic-comment bots. The author will continue developing post-processing programs that turn the raw Weibo data into conversational form, and will open source them (a minimal pairing sketch follows). Of course, interested readers are welcome to submit PRs; the best solutions will be adopted to push this project forward.
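As one possible starting point for that post-processing, consecutive comments under the same post can be paired into (query, reply) turns. The sketch below shows this naive pairing on the parsed structure from the previous snippet; it is just one idea, not the author's planned pipeline.

```python
# Naive dialogue pairing: treat consecutive comments under the same post as
# (query, reply) turns. One possible strategy only, not the official pipeline.
def comments_to_pairs(posts):
    """posts: list of {"post": str, "comments": [(user, text), ...]} dicts."""
    pairs = []
    for item in posts:
        texts = [text for _user, text in item["comments"] if text.strip()]
        pairs.extend(zip(texts, texts[1:]))
    return pairs


if __name__ == "__main__":
    sample = [{"post": "demo post",
               "comments": [("a", "还唱吗"), ("b", "大哥已经唱完了"), ("a", "下一首")]}]
    for query, reply in comments_to_pairs(sample):
        print(query, "->", reply)
```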
If you have any questions about the project, you can contact me on WeChat: jintianiloveu. Issues are also welcome.
(c) 2017 Jin Fagang & Tianmu Inc. & weibo_terminator authors LICENSE Apache 2.0