YayCrawler is a distributed crawler system: simple to use, yet deeply configurable. It is extensible, reduces development workload, can be dockerized, and adapts to all kinds of urgent crawling needs. Core stack: WebMagic, Spring Boot, MongoDB, ActiveMQ, Spring + Quartz, Spring JPA, Druid, Redis, Ehcache, SLF4J, Log4j2, Bootstrap + jQuery, etc.
Its aim is to maximize the productivity of crawler developers, a breath of fresh air among crawler frameworks.
YayCrawler is a complete distributed crawler framework built on WebMagic. Its main characteristics are:
1. Fully distributed: the system consists of a management console (Admin), a dispatcher (Master), and multiple Workers; the components communicate with each other over HTTP.
2. Fully configurable: data from any website can be crawled by defining page extraction rules in the Admin console. Naturally, some sites are harder than others, and dedicated components handle issues such as login, CAPTCHAs, and IP blocking.
3. Scalable task queue: the task queue is implemented on Redis. There are four queues corresponding to task status: initial, executing, successful, and failed (see the first sketch after this list). You can also plug in a different task scheduling algorithm; the default is fair scheduling.
4. Pluggable persistence: by default, the attribute data in crawl results is persisted to MongoDB and images are downloaded to the file server (see the second sketch after this list). You can extend this with further storage types.
5. Stability and fault tolerance: every crawler task is retried and recorded, and a task is moved to the success queue only once it has genuinely succeeded; failed tasks carry a description of the failure reason.
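To make the queue mechanics concrete, here is a minimal sketch using the Jedis client. The queue key names and the task encoding are illustrative assumptions, not the framework's actual identifiers:

```java
import redis.clients.jedis.Jedis;

// Sketch of the four status queues. Key names and task encoding are
// hypothetical; the real ones are defined in the YayCrawler source.
public class TaskQueueSketch {
    private static final String INIT = "queue:init";           // waiting to run
    private static final String EXECUTING = "queue:executing"; // claimed by a Worker
    private static final String SUCCESS = "queue:success";     // genuinely succeeded
    private static final String FAIL = "queue:fail";           // failed, with reason

    private final Jedis jedis = new Jedis("127.0.0.1", 6379);

    // A Worker atomically claims the oldest waiting task (fair, FIFO order).
    public String claim() {
        return jedis.rpoplpush(INIT, EXECUTING);
    }

    // Move a finished task out of the executing queue into success or fail.
    public void complete(String task, boolean ok, String reason) {
        jedis.lrem(EXECUTING, 1, task);
        jedis.lpush(ok ? SUCCESS : FAIL, ok ? task : task + "|" + reason);
    }
}
```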
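Extending persistence can be pictured the same way. Below is a minimal sketch with the official MongoDB Java driver; the database and collection names are assumptions for illustration:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

// Sketch: persist one crawl result's attribute data to MongoDB.
// Database/collection names here are hypothetical.
public class MongoResultStore {
    private final MongoCollection<Document> results;

    public MongoResultStore(String uri) {
        MongoClient client = MongoClients.create(uri); // e.g. mongodb://127.0.0.1:27017
        results = client.getDatabase("yayCrawler").getCollection("crawlResult");
    }

    // Store the extracted attributes keyed by the page URL.
    public void save(String url, Document attributes) {
        results.insertOne(new Document("url", url).append("data", attributes));
    }
}
```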
Core framework: WebMagic, Spring Boot
Task scheduling: Spring + Quartz
Persistence layer: Spring JPA
Database & connection pool: MySQL, MongoDB, Alibaba Druid
Caching: Redis, Ehcache
Logging: SLF4J, Log4j2
Front end: Bootstrap + jQuery
1. Install JDK 8.
2. Install MySQL to store data such as parsing rules. Create a database instance named "yayCrawler" and run the Quartz database script quartz.sql (found in the release package or the source tree); example commands follow this list.
3. Install Redis.
4. Install MongoDB to store result data.
5. Install FTP server software such as FtpServer (optional, used to store downloaded images).
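For step 2, the commands might look like this with the stock mysql command-line client (credentials are placeholders; adjust to your environment):
mysql -u root -p -e "CREATE DATABASE yayCrawler"
mysql -u root -p yayCrawler < quartz.sql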
Import the project and run mvn install to build the Admin, Worker, and Master modules. Copy the generated JARs to the crawler.worker/deploy directory, update the Redis, MySQL, and MongoDB addresses in the configuration file, and run start.bat to start.
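For reference, a minimal worker_local.properties might look like the sketch below. The keys are standard Spring Boot property names and the addresses are placeholders; the sample file shipped with the release package is authoritative:

```properties
# Illustrative sketch only -- consult the shipped worker_local.properties.
server.port=8086
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/yayCrawler
spring.datasource.username=root
spring.datasource.password=secret
spring.redis.host=127.0.0.1
spring.redis.port=6379
spring.data.mongodb.uri=mongodb://127.0.0.1:27017/yayCrawler
```

Start command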
(Linux & Windows) java -jar worker.war --spring.config.location=worker_local.properties
Stop command
(Windows) for /f "tokens=5" %%a in ('netstat -ano ^| findstr ":8086"') do taskkill /f /pid %%a
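A Linux equivalent, assuming the Worker listens on port 8086 as above and lsof is installed:
(Linux) kill -9 $(lsof -t -i:8086)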
1. Admin: the Admin console is responsible for page extraction rule configuration, page Site configuration, resource management, and task publishing.
2. Master: the control center of the distributed crawler. It accepts tasks published by Admin and assigns them to Workers for execution.
2.1. Receives published tasks
2.2. Accepts Worker registrations
3. Worker: the hard-working young man who actually gets things done. It accepts and executes the tasks assigned by the Master and reports its heartbeat to the Master at regular intervals.
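The heartbeat can be pictured as a periodic HTTP POST from the Worker to the Master. Here is a minimal sketch using Spring's RestTemplate and @Scheduled (requires @EnableScheduling on the application class); the endpoint path and payload fields are hypothetical, since the real contract is defined by the Master's API:

```java
import java.util.HashMap;
import java.util.Map;

import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.web.client.RestTemplate;

// Sketch of the Worker heartbeat. The /worker/heartbeat path and the
// payload fields are assumptions, not YayCrawler's actual API.
@Component
public class HeartbeatReporter {
    private final RestTemplate rest = new RestTemplate();
    private final String masterUrl = "http://master-host:8080";

    @Scheduled(fixedRate = 30_000) // report every 30 seconds
    public void report() {
        Map<String, Object> payload = new HashMap<>();
        payload.put("workerId", "worker-1");
        payload.put("port", 8086);
        payload.put("timestamp", System.currentTimeMillis());
        rest.postForObject(masterUrl + "/worker/heartbeat", payload, String.class);
    }
}
```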