A powerful tool for crawling Baidu
Search engines are powerful tools, and other tools become more powerful still when they can build on a search engine's capabilities. However, I have not found an open source crawler that accurately extracts search engine results. So I wrote BaiduSpider, a crawler for the Baidu search engine.
BaiduSpider’s unique features:
Saves time on data extraction, which helps when building and training models for deep learning and similar projects.
Extracts data accurately and removes ads.
Returns large, comprehensive result sets and supports multiple search types and return types.
Of course, no project is perfect, and every project depends on help from its community. You can help BaiduSpider improve by opening an issue or submitting a PR! :smile:
Some helpful documents or tools are listed in the Acknowledgments section at the end.
Some of the main open source dependency libraries used by BaiduSpider.
To install BaiduSpider, follow the steps below.
Before installing BaiduSpider, please make sure you have Python 3.6+ installed:
$ python --version
If the version is lower than 3.6.0, please download and install a newer Python from the official Python website.
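The same check can also be done from inside Python, which can be handy in a setup script. The following is a minimal sketch; the helper name `meets_minimum` is illustrative and not part of BaiduSpider.

```python
import sys

# Minimum version taken from the requirement stated above (Python 3.6+).
MINIMUM = (3, 6)


def meets_minimum(version_info=None, minimum=MINIMUM):
    """Return True if the given (or running) interpreter is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor); tuple comparison is lexicographic.
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    if meets_minimum():
        print("Python version is sufficient for BaiduSpider.")
    else:
        print("Python 3.6 or newer is required.")
```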
pip
Please type at the command line:
$ pip install baiduspider
$ git clone git@github.com:BaiduSpider/BaiduSpider.git
# ...
$ python setup.py install
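After either installation route, one way to confirm that the package is visible to your interpreter is to look it up without importing it. This is a small sketch using only the standard library; `is_installed` is a hypothetical helper, and `baiduspider` is assumed to be the import name, as in the quick-start code in this README.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "baiduspider" is the assumed import name of the package.
    if is_installed("baiduspider"):
        print("baiduspider is importable.")
    else:
        print("baiduspider was not found; re-run the installation step.")
```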
You can use the following code to obtain Baidu's web search results through BaiduSpider:
# Import BaiduSpider
from baiduspider import BaiduSpider
from pprint import pprint
# Create a BaiduSpider instance
spider = BaiduSpider()
# Run a web search and pretty-print the results
pprint(spider.search_web(query='Python'))
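If you need results for several queries, it is polite to pause between requests rather than send them back to back. The sketch below is one possible way to do that. `search_many` and its parameters are hypothetical names, not part of BaiduSpider, and the two-second default delay is an arbitrary assumption. It works with any callable that accepts a `query` keyword argument, such as `spider.search_web` above.

```python
import time


def search_many(search, queries, delay_seconds=2.0, sleep=time.sleep):
    """Call `search(query=...)` for each query, pausing between requests.

    Returns a list of (query, result) pairs in the order given.
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results = []
    for index, query in enumerate(queries):
        if index > 0:
            # Pause before every request except the first.
            sleep(delay_seconds)
        results.append((query, search(query=query)))
    return results
```

With a `spider` created as shown earlier, a call would look like `search_many(spider.search_web, ['Python', 'BaiduSpider'])`.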
For more examples and configuration options, please refer to the documentation.
Please refer to the open issues for the latest project plans and known issues.
Community contributions are the soul of open source projects, and they are how the whole open source community learns, communicates, and finds inspiration. We warmly welcome anyone to take part in developing and maintaining this project.
Specific steps to participate are as follows:
Create a new branch (git checkout -b NewFeatures)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push the branch to the remote (git push origin username/BaiduSpider)
This project is open source under the GPL-V3 license; please see LICENSE for details.
samzhangjy - @samzhangjy
Project link: https://github.com/BaiduSpider/BaiduSpider
This project is for learning purposes only and must not be used commercially or to crawl large amounts of Baidu data. It is released under the GPL-V3 license, which means that any project that incorporates it must also be open source and must credit the source. The author of this project bears no legal responsibility for misuse; violators bear the consequences themselves.