As usual, let's start by talking through the ideas behind building a crawler and the background knowledge we need to prepare. Experts can safely skip this part.
First, let's think about what we actually want to accomplish and list some simple requirements:
1. Simulate access to the Zhihu site (http://www.zhihu.com/)
2. Download the content of specified pages, including: today's hottest, this month's hottest, and editors' picks
3. Download all questions and answers under specified topics, such as: investing, programming, and failing courses
4. Download all answers from a specified answerer
5. Ideally, an outrageous one-click upvote feature (so I can upvote every one of Laylen's answers in one go; aren't I clever!)
The technical problems we need to solve, briefly, are:
1. Simulate browser access to web pages
2. Capture the key data and save it locally
3. Handle content that pages load dynamically
4. Use a tree structure to crawl all of Zhihu's content at scale
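To make problems 1 and 2 concrete, here is a minimal Java sketch of fetching a page while sending a browser-like User-Agent header and then saving the result locally. This is my own illustrative code, not the project's actual implementation: the class name `PageFetcher`, the Firefox User-Agent string, and the output file name are placeholder choices. It uses only the standard library (`java.net.HttpURLConnection` and `java.nio.file`).

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Scanner;

// Sketch: fetch a page while pretending to be a browser, then save it locally.
public class PageFetcher {

    // Download the page body, sending a browser-like User-Agent so the
    // server treats us like Firefox rather than a bare Java client.
    public static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        conn.setRequestProperty("User-Agent",
                "Mozilla/5.0 (Windows NT 10.0; rv:109.0) Gecko/20100101 Firefox/115.0");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try (InputStream in = conn.getInputStream();
             Scanner s = new Scanner(in, StandardCharsets.UTF_8.name()).useDelimiter("\\A")) {
            // "\\A" makes the Scanner return the whole stream as one token.
            return s.hasNext() ? s.next() : "";
        }
    }

    // Save the captured content to a local file (requirement 2).
    public static void save(String content, Path target) throws IOException {
        Files.write(target, content.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        try {
            String html = fetch("http://www.zhihu.com/");
            save(html, Paths.get("zhihu.html"));
            System.out.println("Saved " + html.length() + " characters.");
        } catch (IOException e) {
            // Network access may be unavailable, or the site may refuse us.
            System.out.println("Fetch failed: " + e.getMessage());
        }
    }
}
```

Note that a plain `HttpURLConnection` will not handle the dynamically loaded content from problem 3; that needs a separate approach, which we will get to later in the series.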
Okay, that's all I can think of for now.
The next step is preparation.
1. Choose the crawler language: I have already written a series of crawler tutorials (click here), and Baidu Tieba, Qiushibaike, the Shandong University grade-point lookup, and so on were all written in Python, so this time I decided to write this one in Java (which has not half a dime's connection to the previous point, I know; if you find one, come tell me).
2. Crawler basics: "Web crawler", or "web spider", is a very vivid name. If you picture the Internet as a spider web, then the crawler is a spider walking around on that web, discovering new pages by following their link addresses. For a detailed introduction, click here.
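The spider metaphor above can be sketched as a tiny breadth-first crawl: start from a seed page, pull the link addresses out of its HTML, and visit each discovered page exactly once. To keep this sketch self-contained and offline, the "web" here is just a `Map` from URL to HTML, and links are extracted with a naive regex; a real crawler would use an HTML parser and real HTTP requests. All names here (`SpiderDemo`, the toy page URLs `a`, `b`, `c`) are my own inventions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Toy illustration of how a spider walks the web: start from a seed page,
// extract its links, and visit each reachable page exactly once.
public class SpiderDemo {

    // Extract href targets from a chunk of HTML with a simple regex.
    // (Real crawlers should use an HTML parser; a regex suffices for a sketch.)
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"([^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Breadth-first crawl; the "web" is an in-memory Map so the demo runs
    // offline. Returns pages in the order they were visited.
    static List<String> crawl(String seed, Map<String, String> web) {
        List<String> visited = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(seed);
        seen.add(seed);
        while (!queue.isEmpty()) {
            String url = queue.poll();
            visited.add(url);
            String html = web.getOrDefault(url, "");
            for (String link : extractLinks(html)) {
                // seen.add returns false for already-queued pages,
                // so each page is enqueued at most once.
                if (seen.add(link)) {
                    queue.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> web = new HashMap<>();
        web.put("a", "<a href=\"b\">B</a> <a href=\"c\">C</a>");
        web.put("b", "<a href=\"c\">C</a>");
        web.put("c", "");
        System.out.println(crawl("a", web)); // prints [a, b, c]
    }
}
```

The same loop structure carries over unchanged once the `web.getOrDefault(url, "")` lookup is replaced with a real HTTP fetch.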
3. Prepare the environment: I won't go into installing and configuring the JDK and Eclipse. One thing that genuinely matters for crawler work, though, is a good browser: you first have to browse a page yourself and find where the data you need lives before you can tell your crawler where to go and what to grab. I personally recommend Firefox or Google Chrome; their right-click "Inspect Element" and "View Source" features are very powerful.
Now the official crawler journey begins! What exactly should I cover first? Hmm, good question; let me think about it. Don't worry ^_^