Baidu cloud sharing crawler project
There are several open source projects like this on GitHub, but they only provide the crawler part. This project builds on the crawler by adding modules for saving data and building an elasticsearch index. It can be used in a real production environment, but you still need to develop the web module yourself.
Install
Install node.js and pm2. Node is used to run the crawler and indexing programs, and pm2 is used to manage the node tasks.
Install mysql and mongodb. Mysql is used to save the crawler data, and mongodb is used to save the final Baidu Cloud share data. This data is in JSON format, so it is more convenient to store it in mongodb.
git clone https://github.com/callmelanmao/yunshare
cnpm i
It is recommended to use the cnpm command to install the npm dependencies. The simplest way to install cnpm is:
$ npm install -g cnpm --registry=https://registry.npm.taobao.org
More commands for installing cnpm can be found at npm.taobao.org.
Initialization
The crawler data (mainly the URL list) is saved in the mysql database. Yunshare uses sequelizejs for ORM mapping, and the source file is src/models/index.js. The default mysql username and password are both root, and the database name is yun. You need to create the yun database manually:
create database yun default charset utf8
Change the password according to your needs. After completing the mysql configuration, run the following commands:
gulp babel
node dist/script/init.js
Note that you must first run gulp babel to compile the ES6 code into ES5, and then run the initialization script to import the initial data. The data file is data/hot.json, which was saved from this page: http://yun.baidu.com/pcloud/friend/gethotuserlist?type=1&from=feed&start=0&limit=24&bdstoken=ac95ef31d3979f6ee707ef75cee9f5c5&clienttype=0&web=1
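If the seed data ever needs to be refreshed, the query string of that page can be rebuilt programmatically. The following is only a minimal sketch and is not part of yunshare: the function name hotUserListUrl is hypothetical, the parameter names are copied from the URL above, and the bdstoken value is presumably tied to a browser session, so it may need to be replaced with your own.

```javascript
// Hypothetical helper (not part of yunshare): rebuilds the hot user list URL
// quoted in this README, with paging controlled by start and limit.
const HOT_USER_LIST = 'http://yun.baidu.com/pcloud/friend/gethotuserlist';

function hotUserListUrl(start, limit, bdstoken) {
  const url = new URL(HOT_USER_LIST);
  // Parameter names and fixed values are copied from the URL in the README.
  url.searchParams.set('type', '1');
  url.searchParams.set('from', 'feed');
  url.searchParams.set('start', String(start));
  url.searchParams.set('limit', String(limit));
  // Assumption: bdstoken is session specific and supplied by the caller.
  url.searchParams.set('bdstoken', bdstoken);
  url.searchParams.set('clienttype', '0');
  url.searchParams.set('web', '1');
  return url.toString();
}

// Prints the same URL that data/hot.json was saved from.
console.log(hotUserListUrl(0, 24, 'ac95ef31d3979f6ee707ef75cee9f5c5'));
```

Changing start to 24 would request the next page of 24 entries, assuming the endpoint pages the way its parameter names suggest.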
Start the project
Yunshare uses pm2 for nodejs process management. Run pm2 start process.json to start all background tasks. To check whether the tasks are running normally, use the command pm2 list. There should be 4 tasks running normally.
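The same check can be scripted. The sketch below is not part of yunshare. It relies on the pm2 jlist command, which prints the process list as JSON, and assumes each entry carries its state in pm2_env.status. The function names countOnline and checkTasks are hypothetical.

```javascript
const { execSync } = require('child_process');

// Count the entries of a parsed `pm2 jlist` result whose status is "online".
function countOnline(processes) {
  return processes.filter((p) => p.pm2_env && p.pm2_env.status === 'online').length;
}

// Runs `pm2 jlist` (pm2 must be on the PATH) and reports whether the
// expected number of tasks is online. Not called automatically.
function checkTasks(expected) {
  const list = JSON.parse(execSync('pm2 jlist', { encoding: 'utf8' }));
  const online = countOnline(list);
  return { online, ok: online === expected };
}

// Usage, from a shell where pm2 is installed:
//   console.log(checkTasks(4));
```

Keeping the counting logic in countOnline, separate from the shell call, makes it easy to try against a saved copy of the pm2 jlist output.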
Start the elasticsearch index
The elasticsearch indexing program is also included. The mapping file is data/mapping.json. Make sure you have installed elasticsearch 5.0 before running the indexing program with the command pm2 start dist/elastic.js.
The default elasticsearch address is http://localhost:9200. If you need to change this address, edit it in src/ElasticWorker.js. After modifying any js source code, remember to run gulp babel and restart the pm2 task, otherwise the change will not take effect.
After completing the elasticsearch configuration, you can also add an elastic task in process.json so that you do not need to start the indexing program separately.
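As a rough illustration of that last step, the sketch below appends an elastic entry to a pm2 process file. It is not part of yunshare and makes two assumptions: that process.json uses pm2's apps array layout, and that elastic is an acceptable task name. The function names addElasticTask and updateProcessFile are hypothetical. Only the script path dist/elastic.js comes from this README.

```javascript
const fs = require('fs');

// Returns a copy of a pm2 process config with an elastic task appended.
// Assumption: the config has the shape { "apps": [ { name, script }, ... ] }.
function addElasticTask(config) {
  const apps = config.apps || [];
  // Leave the config untouched if the indexing script is already listed.
  if (apps.some((app) => app.script === 'dist/elastic.js')) {
    return config;
  }
  return {
    ...config,
    apps: [...apps, { name: 'elastic', script: 'dist/elastic.js' }],
  };
}

// Reads the process file at the given path, adds the task, and writes it back.
function updateProcessFile(path) {
  const config = JSON.parse(fs.readFileSync(path, 'utf8'));
  fs.writeFileSync(path, JSON.stringify(addElasticTask(config), null, 2) + '\n');
}

// Usage, from the project root (back up process.json first):
//   updateProcessFile('process.json');
```

Editing process.json by hand works just as well. The point is only that the new entry sits alongside the existing tasks, so a single pm2 start process.json covers the indexing program too.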
Related documents
Simple and efficient nodejs crawler model
DEMO