YayCrawler is a distributed crawler system: simple to use, yet deeply configurable. It is extensible, reduces development workload, can be dockerized, and adapts to all kinds of urgent crawling needs. Core stack: WebMagic, Spring Boot, MongoDB, ActiveMQ, Spring + Quartz, Spring JPA, Druid, Redis, Ehcache, SLF4J, Log4j2, Bootstrap + jQuery, etc.
Its aim is to maximize the productivity of crawler developers, a breath of fresh air among crawler frameworks.
YayCrawler is a complete distributed crawler framework built on WebMagic. Its main characteristics are:
1. Fully distributed: the system consists of a management console (Admin), a dispatcher (Master), and multiple Workers; the components communicate with each other over HTTP.
2. Fully configurable: data from any website can be crawled by defining page extraction rules in the Admin console. Naturally, some sites are harder than others, and dedicated components handle issues such as login, CAPTCHAs, and IP blocking.
3. Scalable task queue: the task queue is implemented on Redis. There are four queues corresponding to task status: initial, executing, successful, and failed (see the first sketch after this list). You can also plug in a different task scheduling algorithm; the default is fair scheduling.
4. Pluggable persistence: by default, the attribute data in crawl results is persisted to MongoDB and images are downloaded to the file server (see the second sketch after this list). You can extend this with further storage types.
5. Stability and fault tolerance: every crawler task is retried and recorded, and a task is moved to the success queue only once it has genuinely succeeded; failed tasks carry a description of the failure reason.
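To make the queue mechanics concrete, here is a minimal sketch using the Jedis client. The queue key names and the task encoding are illustrative assumptions, not the framework's actual identifiers:

```java
import redis.clients.jedis.Jedis;

// Sketch of the four status queues. Key names and task encoding are
// hypothetical; the real ones are defined in the YayCrawler source.
public class TaskQueueSketch {
    private static final String INIT = "queue:init";           // waiting to run
    private static final String EXECUTING = "queue:executing"; // claimed by a Worker
    private static final String SUCCESS = "queue:success";     // genuinely succeeded
    private static final String FAIL = "queue:fail";           // failed, with reason

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    // A Worker atomically claims the oldest waiting task (fair, FIFO order).
    public String claim() {
        return jedis.rpoplpush(INIT, EXECUTING);
    }

    // Move a finished task out of the executing queue into success or fail.
    public void complete(String task, boolean ok, String reason) {
        jedis.lrem(EXECUTING, 1, task);
        jedis.lpush(ok ? SUCCESS : FAIL, ok ? task : task + "|" + reason);
    }
}
```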
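Extending persistence can be pictured the same way. Below is a minimal sketch with the official MongoDB Java driver; the database and collection names are assumptions for illustration:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Sketch: persist one crawl result's attribute data to MongoDB.
// Database/collection names here are hypothetical.
public class MongoResultStore {
    private final MongoCollection<Document> results;

    public MongoResultStore(String uri) {
        MongoClient client = MongoClients.create(uri); // e.g. mongodb://127.0.0.1:27017
        results = client.getDatabase("yayCrawler").getCollection("crawlResult");
    }

    // Store the extracted attributes keyed by the page URL.
    public void save(String url, Document attributes) {
        results.insertOne(new Document("url", url).append("data", attributes));
    }
}
```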
Core framework: WebMagic, Spring Boot
Task scheduling: Spring + Quartz
Persistence layer: Spring JPA
Database & connection pool: MySQL, MongoDB, Alibaba Druid
Caching: Redis, Ehcache
Logging: SLF4J, Log4j2
Front end: Bootstrap + jQuery
1. Install JDK 8.
2. Install MySQL to store data such as parsing rules. Create a database instance named "yayCrawler" and run the Quartz database script quartz.sql (found in the release package or the source tree); example commands follow this list.
3. Install Redis.
4. Install MongoDB to store result data.
5. Install FTP server software such as FtpServer (optional, used to store downloaded images).
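For step 2, the commands might look like this with the stock mysql command-line client (credentials are placeholders; adjust to your environment):
mysql -u root -p -e "CREATE DATABASE yayCrawler"
mysql -u root -p yayCrawler < quartz.sql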
Import the project and run mvn install to build the Admin, Worker, and Master modules. Copy the generated JARs to the crawler.worker/deploy directory, update the Redis, MySQL, and MongoDB addresses in the configuration file, and run start.bat to start.
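For reference, a minimal worker_local.properties might look like the sketch below. The keys are standard Spring Boot property names and the addresses are placeholders; the sample file shipped with the release package is authoritative:

```properties
# Illustrative sketch only -- consult the shipped worker_local.properties.
server.port=8086
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/yayCrawler
spring.datasource.username=root
spring.datasource.password=secret
spring.redis.host=127.0.0.1
spring.redis.port=6379
spring.data.mongodb.uri=mongodb://127.0.0.1:27017/yayCrawler
```

Start command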
(Linux & Windows) java -jar worker.war --spring.config.location=worker_local.properties
Stop command
(Windows) for /f "tokens=5" %%a in ('netstat -ano ^| findstr ":8086"') do taskkill /f /pid %%a
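A Linux equivalent, assuming the Worker listens on port 8086 as above and lsof is installed:
(Linux) kill -9 $(lsof -t -i:8086)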
1. Admin: the Admin console is responsible for page extraction rule configuration, page Site configuration, resource management, and task publishing.
2. Master: the control center of the distributed crawler. It accepts tasks published by Admin and assigns them to Workers for execution.
2.1. Receives published tasks
2.2. Accepts Worker registrations
3. Worker: the hard-working young man who actually gets things done. It accepts and executes the tasks assigned by the Master and reports its heartbeat to the Master at regular intervals.
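The heartbeat can be pictured as a periodic HTTP POST from the Worker to the Master. Here is a minimal sketch using Spring's RestTemplate and @Scheduled (requires @EnableScheduling on the application class); the endpoint path and payload fields are hypothetical, since the real contract is defined by the Master's API:

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

// Sketch of the Worker heartbeat. The /worker/heartbeat path and the
// payload fields are assumptions, not YayCrawler's actual API.
@Component
public class HeartbeatReporter {
    private final RestTemplate rest = new RestTemplate();
    private final String masterUrl = "http://master-host:8080";

    @Scheduled(fixedRate = 30_000) // report every 30 seconds
    public void report() {
        Map<String, Object> payload = new HashMap<>();
        payload.put("workerId", "worker-1");
        payload.put("port", 8086);
        payload.put("timestamp", System.currentTimeMillis());
        rest.postForObject(masterUrl + "/worker/heartbeat", payload, String.class);
    }
}
```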