Because the mobile web version of Weibo places relatively few restrictions on crawlers, some Weibo search data can be crawled from it directly. The search API is:
https://m.weibo.cn/api/container/getIndex?type=wb&queryVal={}&containerid=100103type=2%26q%3D{}&page={}
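In this URL, the first two {} placeholders take the search keyword and the third takes the page number. A certain amount of JSON data can be fetched through this API (see sample.json for the raw data). After processing, each record has the following format: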
{
  "mid": "4199434918992223",
  "text": "【深度学习的终极形态】近期,院友袁进辉博士回到微软亚洲研究院做了题为《打造最强深度学习引擎》的报告,分享了深度学习框架方面的技术进展。他在报告中启发大家思考如何才能“鱼和熊掌兼得”,让软件发挥灵活性,硬件发挥高效率。我们整理了本次报告的重点,希望能对大家有所帮助! ...全文",
  "userid": "1286528122",
  "username": "微软亚洲研究院",
  "reposts_count": 21,
  "comments_count": 1,
  "attitudes_count": 9
}
For the full crawler, see weibo_search.py.
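As a rough illustration of how the search URL and the processed record fit together, here is a minimal sketch using only the Python standard library. It is not the code in weibo_search.py. The helper names (build_search_url, extract_record, fetch_page), the User-Agent header, and the assumed response layout (posts under data.cards[].mblog, with the author under mblog.user) are assumptions for illustration and should be checked against sample.json.

```python
import json
import urllib.request
from urllib.parse import quote

# URL template copied from the README; the {} slots are keyword, keyword, page.
SEARCH_URL = (
    "https://m.weibo.cn/api/container/getIndex"
    "?type=wb&queryVal={}&containerid=100103type=2%26q%3D{}&page={}"
)


def build_search_url(keyword, page):
    """Fill the keyword (twice) and the page number into the search URL."""
    # Assumption: a single round of percent-encoding is enough for the keyword.
    q = quote(keyword)
    return SEARCH_URL.format(q, q, page)


def extract_record(mblog):
    """Reduce one raw post dict to the fields shown in the README sample."""
    # Assumption: the author is nested under "user" with "id" and "screen_name".
    user = mblog.get("user") or {}
    return {
        "mid": mblog.get("mid"),
        "text": mblog.get("text"),
        "userid": str(user.get("id", "")),
        "username": user.get("screen_name"),
        "reposts_count": mblog.get("reposts_count", 0),
        "comments_count": mblog.get("comments_count", 0),
        "attitudes_count": mblog.get("attitudes_count", 0),
    }


def fetch_page(keyword, page, timeout=10):
    """Download one result page and return a list of processed records."""
    request = urllib.request.Request(
        build_search_url(keyword, page),
        headers={"User-Agent": "Mozilla/5.0"},  # hypothetical header value
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    # Assumption: posts live under data.cards[].mblog; other cards are skipped.
    cards = (payload.get("data") or {}).get("cards", [])
    return [extract_record(card["mblog"]) for card in cards if "mblog" in card]
```

Calling fetch_page("iPhone", 1) would then return the records for the first result page, provided the response really has the assumed layout.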
The word cloud can be generated with the wordcloud library. The basic steps are:
1. Word segmentation and keyword extraction: Chinese text must be segmented into words, and the many stop words (such as "you", "me", "him", "this") must be removed so that the generated word cloud is more meaningful. This step can be done directly with the TF-IDF keyword extraction of the jieba word segmenter.
2. Word cloud generation: wordcloud takes a string and a base image as input. Join the keywords obtained in the first step with spaces. For the base image, prefer one with a white background, so that the generated image stays closer to the original picture.
See weibo_cloud.py for code details.
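The two steps above can be sketched roughly as follows. This is a minimal sketch, not the code in weibo_cloud.py. It assumes the third-party packages jieba, wordcloud, numpy, and Pillow are installed; the function names, the top_k default, and the file path parameters are hypothetical. Recoloring the words from the base image is also an assumption about how the output is kept close to the original picture.

```python
def extract_keywords(text, top_k=100):
    """Step 1: segment the text and return its top TF-IDF keywords."""
    import jieba.analyse  # third-party: pip install jieba

    # extract_tags segments the text and ranks the terms by TF-IDF weight.
    return jieba.analyse.extract_tags(text, topK=top_k)


def join_keywords(keywords):
    """Join keywords with spaces, which is the string form wordcloud expects."""
    return " ".join(word.strip() for word in keywords if word.strip())


def render_cloud(keywords, mask_path, font_path, out_path):
    """Step 2: draw the keywords inside the non-white area of the base image."""
    import numpy as np
    from PIL import Image
    from wordcloud import ImageColorGenerator, WordCloud

    # Pure white pixels of the base image are treated as "do not draw here".
    mask = np.array(Image.open(mask_path).convert("RGB"))
    cloud = WordCloud(
        font_path=font_path,       # a font with Chinese glyphs is required
        background_color="white",
        mask=mask,
    )
    cloud.generate(join_keywords(keywords))
    # Assumption: take word colors from the base image to resemble the original.
    cloud.recolor(color_func=ImageColorGenerator(mask))
    cloud.to_file(out_path)
```

A typical call chain would be render_cloud(extract_keywords(all_text), "base.png", "font.ttf", "cloud.png"), where all_text is the concatenated "text" field of the crawled records and the three file names are placeholders.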
Keywords: iPhone
Keywords: Microsoft
Keywords: Google