Zhihu is a genuine online Q&A community with a friendly, rational, and serious atmosphere, one that connects elites from all walks of life. Users share their professional knowledge, experience, and insights with one another, continuously providing high-quality content for the Chinese internet.
First, I spent three to five minutes designing a logo =. = As a programmer, I've always had the heart of an artist!
Okay, it's a bit makeshift, but it will do for now.
Next, let's start building the Zhihu crawler.
First, settle on the initial target: editor recommendations.
Web link: http://www.zhihu.com/explore/recommendations
We slightly modify the code from last time so that it can fetch the content of this page:
import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Main {
    static String SendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily stores each captured line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each captured line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        }
        // Use finally to close the input stream
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static String RegexString(String targetStr, String patternStr) {
        // Define a pattern template using a regular expression; the content to capture is in parentheses
        // It's like setting a trap: whatever matches falls in
        Pattern pattern = Pattern.compile(patternStr);
        // Define a matcher for matching
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found
        if (matcher.find()) {
            // Return the captured group
            return matcher.group(1);
        }
        return "Nothing";
    }

    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Access the link and get the page content
        String result = SendGet(url);
        // Use a regular expression to match the src content of images
        // String imgSrc = RegexString(result, "src=\"(.+?)\"");
        // Print the result
        System.out.println(result);
    }
}
It runs without problems. The next step is the regex matching.
First, let's grab all the questions on this page.
Right-click a title and inspect the element:
Aha, you can see that the title is actually an a tag, i.e. a hyperlink, and what distinguishes it from other hyperlinks should be its class, i.e. the class selector.
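For reference, the markup presumably looks something like this (a reconstruction for illustration; the href and title below are made up, not Zhihu's exact source):

<a class="question_link" href="/question/12345678">How do you write a crawler in Java?</a>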
So our regular expression comes out: question_link.+?href=\"(.+?)\"
Call the RegexString function and pass it parameters:
public static void main(String[] args) {
    // Define the link to visit
    String url = "http://www.zhihu.com/explore/recommendations";
    // Access the link and get the page content
    String result = SendGet(url);
    // Use a regular expression to match the question titles
    String imgSrc = RegexString(result, "question_link.+?>(.+?)<");
    // Print the result
    System.out.println(imgSrc);
}
Aha, you can see that we successfully captured a title (note: just one):
Wait a minute, what is all this mess?!
Don't panic =. = It's just a character-encoding problem.
For encoding issues, please see: HTML character set
Generally speaking, the mainstream encodings with good support for Chinese are UTF-8, GB2312, and GBK.
A web page can set its encoding via the charset attribute of a meta tag, for example:
<meta charset="utf-8" />
Let's right-click to view the page source code:
As you can see, Zhihu uses UTF-8 encoding.
Here, let me explain the difference between viewing the page source and inspecting an element.
Viewing the page source displays the entire page's raw HTML exactly as served, without being organized by the rendered tag structure. It is more useful for looking at page-wide information, such as the meta tags.
Inspect element (some browsers call it view element) shows just the element you right-clicked, such as a div or img. It is better suited for examining the attributes and tags of a single object.
Okay, now we know the problem lies in the encoding; the next step is to convert the encoding of the captured content.
The implementation in Java is very simple: just specify the encoding in the InputStreamReader:
// Initialize a BufferedReader to read the response from the URL
in = new BufferedReader(new InputStreamReader(
        connection.getInputStream(), "UTF-8"));
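Incidentally, if you would rather not hard-code UTF-8, you can read the charset from the HTTP Content-Type header instead. A minimal sketch, assuming we add the helper below (guessCharset is my own name, not part of any library) to our class and fall back to UTF-8 when the server sends no charset parameter:

// Sketch: extract the charset from a header like "text/html; charset=UTF-8".
// Assumption: fall back to UTF-8 when no charset parameter is present.
static String guessCharset(URLConnection connection) {
    String contentType = connection.getContentType();
    if (contentType != null) {
        for (String part : contentType.split(";")) {
            part = part.trim();
            if (part.toLowerCase().startsWith("charset=")) {
                return part.substring("charset=".length());
            }
        }
    }
    return "UTF-8";
}

The reader then becomes new InputStreamReader(connection.getInputStream(), guessCharset(connection)).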
If you run the program again now, you will see that the title displays normally:
OK! Very good!
But now there is only one title, and we want all of them.
We tweak the regular expression slightly and store the matched results in an ArrayList:
import java.io.*;
import java.net.*;
import java.util.ArrayList;
import java.util.regex.*;

public class Main {
    static String SendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"));
            // Temporarily stores each captured line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each captured line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        }
        // Use finally to close the input stream
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static ArrayList<String> RegexString(String targetStr, String patternStr) {
        // Predefine an ArrayList to store the results
        ArrayList<String> results = new ArrayList<String>();
        // Define a pattern template using a regular expression; the content to capture is in parentheses
        Pattern pattern = Pattern.compile(patternStr);
        // Define a matcher for matching
        Matcher matcher = pattern.matcher(targetStr);
        // Whether a match was found
        boolean isFind = matcher.find();
        // Loop over all matches and collect each captured group
        while (isFind) {
            // Add the successfully matched result
            results.add(matcher.group(1));
            // Continue to look for the next match
            isFind = matcher.find();
        }
        return results;
    }

    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Access the link and get the page content
        String result = SendGet(url);
        // Use a regular expression to match the question titles
        ArrayList<String> imgSrc = RegexString(result, "question_link.+?>(.+?)<");
        // Print the results
        System.out.println(imgSrc);
    }
}
This matches all the results (since the ArrayList is printed directly, you will see some square brackets and commas):
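If the brackets and commas bother you, a quick alternative (using the same imgSrc list from above) is to print each title on its own line:

// Print one title per line instead of the ArrayList's default format
for (String title : imgSrc) {
    System.out.println(title);
}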
OK, this is the first step of the Zhihu crawler.
But we can see that this alone can't capture the questions together with their links and answers.
We need to design a Zhihu wrapper class to store all the captured objects.
Zhihu.java source code:
import java.util.ArrayList;

public class Zhihu {
    public String question; // the question
    public String zhihuUrl; // the web page link
    public ArrayList<String> answers; // array storing all the answers

    // The constructor initializes the data
    public Zhihu() {
        question = "";
        zhihuUrl = "";
        answers = new ArrayList<String>();
    }

    @Override
    public String toString() {
        return "Question:" + question + "\nLink:" + zhihuUrl + "\nAnswer:" + answers + "\n";
    }
}
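As a quick sanity check, here is a minimal sketch of how the class is used (the question text and link below are made-up values, purely for illustration):

Zhihu demo = new Zhihu();
demo.question = "How do you write a crawler in Java?"; // made-up value
demo.zhihuUrl = "http://www.zhihu.com/question/12345678"; // made-up link
demo.answers.add("A made-up answer");
// toString() prints the question, link, and answers on separate lines
System.out.println(demo);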
Create a new Spider class to hold the crawler's commonly used functions.
Spider.java source code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Spider {
    static String SendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"));
            // Temporarily stores each captured line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each captured line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        }
        // Use finally to close the input stream
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static ArrayList<Zhihu> GetZhihu(String content) {
        // Predefine an ArrayList to store the results
        ArrayList<Zhihu> results = new ArrayList<Zhihu>();
        // Used to match the titles
        Pattern questionPattern = Pattern.compile("question_link.+?>(.+?)<");
        Matcher questionMatcher = questionPattern.matcher(content);
        // Used to match the URL, i.e. the link to the question
        Pattern urlPattern = Pattern.compile("question_link.+?href=\"(.+?)\"");
        Matcher urlMatcher = urlPattern.matcher(content);
        // Both the question and the link must match
        boolean isFind = questionMatcher.find() && urlMatcher.find();
        while (isFind) {
            // Define a Zhihu object to store the captured information
            Zhihu zhihuTemp = new Zhihu();
            zhihuTemp.question = questionMatcher.group(1);
            zhihuTemp.zhihuUrl = "http://www.zhihu.com" + urlMatcher.group(1);
            // Add the successfully matched result
            results.add(zhihuTemp);
            // Continue to look for the next match
            isFind = questionMatcher.find() && urlMatcher.find();
        }
        return results;
    }
}
Finally, the main method is responsible for calling everything. Note that GetZhihu advances the two matchers in lockstep, which works because each question_link anchor contains both the href and the title.
import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Access the link and get the page content
        String content = Spider.SendGet(url);
        // Get all the Zhihu objects on this page
        ArrayList<Zhihu> myZhihu = Spider.GetZhihu(content);
        // Print the results
        System.out.println(myZhihu);
    }
}
Ok, that’s it. Run it and see the results:
Good results.
The next step is to access the link and get all the answers.
We will cover that in detail next time.
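As a rough preview of the idea, we can reuse Spider.SendGet on each question's zhihuUrl. The answer-extraction step is left as a placeholder, since we haven't inspected Zhihu's answer markup yet:

// Sketch: visit each question page; the answer regex is still to be worked out
for (Zhihu zhihu : myZhihu) {
    String questionPage = Spider.SendGet(zhihu.zhihuUrl);
    // Hypothetical next step (matchAnswers is not written yet; pattern TBD):
    // zhihu.answers.addAll(matchAnswers(questionPage));
}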
Okay, that's a walkthrough of the whole process of using Java to grab the content recommended by Zhihu's editors. It should be detailed enough to follow easily; friends who need it are welcome to refer to it and extend it freely. No problem at all, haha.