In the earlier stage, we crawled question titles from this link:
http://www.zhihu.com/explore/recommendations
But obviously that page does not give us the answers. A complete question page looks like this:
http://www.zhihu.com/question/22355264
Taking a closer look, aha, our encapsulation class needs further work. At the very least it needs a questionDescription field to store the question description:
import java.util.ArrayList;

public class Zhihu {
    public String question; // Question title
    public String questionDescription; // Question description
    public String zhihuUrl; // Web page link
    public ArrayList<String> answers; // List that stores all the answers

    // Constructor: initialize the fields
    public Zhihu() {
        question = "";
        questionDescription = "";
        zhihuUrl = "";
        answers = new ArrayList<String>();
    }

    @Override
    public String toString() {
        return "Question:" + question + "\n" + "Description:" + questionDescription
                + "\n" + "Link:" + zhihuUrl + "\nAnswers:" + answers + "\n";
    }
}
We add a URL parameter to Zhihu's constructor, because once the URL is known, the question description and the answers can be fetched from it.
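A minimal sketch of the new constructor signature (the full version, which also kicks off the crawl, appears in the complete listing at the end of this post):

// Constructor now takes the question URL
public Zhihu(String url) {
    question = "";
    questionDescription = "";
    zhihuUrl = "";
    answers = new ArrayList<String>();
    // Normalize the URL, then fetch and parse the page (shown later)
}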
Next, change Spider's method that collects Zhihu objects so that it only extracts the URL:
static ArrayList<Zhihu> GetZhihu(String content) {
    // Predefine an ArrayList to store the results
    ArrayList<Zhihu> results = new ArrayList<Zhihu>();
    // Pattern that matches the URL, i.e. the link to the question
    Pattern urlPattern = Pattern.compile("<h2>.+?question_link.+?href=\"(.+?)\".+?</h2>");
    Matcher urlMatcher = urlPattern.matcher(content);
    // Whether a match was found
    boolean isFind = urlMatcher.find();
    while (isFind) {
        // Create a Zhihu object to store the captured information
        Zhihu zhihuTemp = new Zhihu(urlMatcher.group(1));
        // Add the successful match to the results
        results.add(zhihuTemp);
        // Continue with the next match
        isFind = urlMatcher.find();
    }
    return results;
}
Next, in Zhihu's constructor, we fetch all the detailed data through the URL.
First we need to normalize the URL, because a link to a specific answer looks like:
http://www.zhihu.com/question/22355264/answer/21102139
while a link to the question itself looks like:
http://www.zhihu.com/question/22355264
What we need is clearly the second form, so we use a regular expression to trim links of the first form down to the second. This can be done with a small function in Zhihu:
// Normalize the URL
boolean getRealUrl(String url) {
    // Turn http://www.zhihu.com/question/22355264/answer/21102139
    // into http://www.zhihu.com/question/22355264
    // "question/(\\d+)" grabs the numeric id whether or not an
    // /answer/... part follows, so both URL forms are accepted
    Pattern pattern = Pattern.compile("question/(\\d+)");
    Matcher matcher = pattern.matcher(url);
    if (matcher.find()) {
        zhihuUrl = "http://www.zhihu.com/question/" + matcher.group(1);
    } else {
        return false;
    }
    return true;
}
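For example, both URL forms should normalize to the same question link. A quick hypothetical check, assuming the earlier no-argument constructor is still available:

Zhihu z = new Zhihu(); // the no-argument constructor from the first version
z.getRealUrl("http://www.zhihu.com/question/22355264/answer/21102139");
System.out.println(z.zhihuUrl); // http://www.zhihu.com/question/22355264
z.getRealUrl("http://www.zhihu.com/question/22355264");
System.out.println(z.zhihuUrl); // same: http://www.zhihu.com/question/22355264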
The next step is to extract the individual parts.
Let's look at the title first:
Just capture that class with a regular expression. The pattern can be written as: zm-editable-content">(.+?)<
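To try the pattern quickly, we can run it against the fetched page source (a minimal sketch reusing Spider.SendGet from later in this post):

String content = Spider.SendGet("http://www.zhihu.com/question/22355264");
Pattern titlePattern = Pattern.compile("zm-editable-content\">(.+?)<");
Matcher titleMatcher = titlePattern.matcher(content);
if (titleMatcher.find()) {
    System.out.println("Title: " + titleMatcher.group(1));
}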
Run it to see the results. Ouch, not bad.
Next, grab the question description:
Aha, same principle: match on the class, since it should uniquely identify the element.
To verify: right-click to view the page source, then Ctrl+F to check whether the string appears anywhere else on the page.
Verification turned up a real problem: the title and the description content share the same class.
So we have to modify the regular expressions to capture them separately:
// Match the title
pattern = Pattern.compile("zh-question-title.+?<h2.+?>(.+?)</h2>");
matcher = pattern.matcher(content);
if (matcher.find()) {
    question = matcher.group(1);
}
// Match the description
pattern = Pattern.compile("zh-question-detail.+?<div.+?>(.*?)</div>");
matcher = pattern.matcher(content);
if (matcher.find()) {
    questionDescription = matcher.group(1);
}
The last step is to loop over the answers and grab them all.
A preliminary, tentative pattern: /answer/content.+?<div.+?>(.*?)</div>
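The matching loop then looks much like the title matching, except it keeps collecting until no match is left (this mirrors the code in the full listing below):

// Match all answers, adding each captured body to the list
pattern = Pattern.compile("/answer/content.+?<div.+?>(.*?)</div>");
matcher = pattern.matcher(content);
boolean isFind = matcher.find();
while (isFind) {
    answers.add(matcher.group(1));
    isFind = matcher.find();
}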
After this change, the program runs noticeably slower, because it now has to visit every question page and capture its content.
For example, if the editors recommend 20 questions, we have to make 20 additional page requests.
Try it out, and it looks good.
OK, let's leave it here for now~ Next time we'll make some finer adjustments, such as multi-threading and writing the results to local files with IO streams.
Attached is the project source code:
Zhihu.java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Zhihu {
    public String question; // Question title
    public String questionDescription; // Question description
    public String zhihuUrl; // Web page link
    public ArrayList<String> answers; // List that stores all the answers

    // Constructor: initialize the fields and crawl the page
    public Zhihu(String url) {
        // Initialize the properties
        question = "";
        questionDescription = "";
        zhihuUrl = "";
        answers = new ArrayList<String>();
        // Check whether the URL is valid
        if (getRealUrl(url)) {
            System.out.println("Crawling " + zhihuUrl);
            // Fetch the question details and answers from the URL
            String content = Spider.SendGet(zhihuUrl);
            Pattern pattern;
            Matcher matcher;
            // Match the title
            pattern = Pattern.compile("zh-question-title.+?<h2.+?>(.+?)</h2>");
            matcher = pattern.matcher(content);
            if (matcher.find()) {
                question = matcher.group(1);
            }
            // Match the description
            pattern = Pattern.compile("zh-question-detail.+?<div.+?>(.*?)</div>");
            matcher = pattern.matcher(content);
            if (matcher.find()) {
                questionDescription = matcher.group(1);
            }
            // Match the answers
            pattern = Pattern.compile("/answer/content.+?<div.+?>(.*?)</div>");
            matcher = pattern.matcher(content);
            boolean isFind = matcher.find();
            while (isFind) {
                answers.add(matcher.group(1));
                isFind = matcher.find();
            }
        }
    }

    // Fetch this object's question, description and answers from its own URL
    public boolean getAll() {
        return true;
    }

    // Normalize the URL
    boolean getRealUrl(String url) {
        // Turn http://www.zhihu.com/question/22355264/answer/21102139
        // into http://www.zhihu.com/question/22355264
        // "question/(\\d+)" matches the numeric id in both URL forms
        Pattern pattern = Pattern.compile("question/(\\d+)");
        Matcher matcher = pattern.matcher(url);
        if (matcher.find()) {
            zhihuUrl = "http://www.zhihu.com/question/" + matcher.group(1);
        } else {
            return false;
        }
        return true;
    }

    @Override
    public String toString() {
        return "Question:" + question + "\n" + "Description:" + questionDescription
                + "\n" + "Link:" + zhihuUrl + "\nAnswers:" + answers.size() + "\n";
    }
}
Spider.java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Spider {
    static String SendGet(String url) {
        // Builder that accumulates the web page content
        StringBuilder result = new StringBuilder();
        // Buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string to a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize the BufferedReader to read the response of the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"));
            // Temporarily holds each line as it is read
            String line;
            while ((line = in.readLine()) != null) {
                // Append each line; newlines are dropped, which lets the
                // regexes match across what were line breaks in the page
                result.append(line);
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        }
        // Use finally to close the input stream
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result.toString();
    }

    // Get all Zhihu content recommended by the editors
    static ArrayList<Zhihu> GetRecommendations(String content) {
        // Predefine an ArrayList to store the results
        ArrayList<Zhihu> results = new ArrayList<Zhihu>();
        // Pattern that matches the URL, i.e. the link to the question
        Pattern pattern = Pattern
                .compile("<h2>.+?question_link.+?href=\"(.+?)\".+?</h2>");
        Matcher matcher = pattern.matcher(content);
        // Whether a match was found
        boolean isFind = matcher.find();
        while (isFind) {
            // Create a Zhihu object to store the captured information
            Zhihu zhihuTemp = new Zhihu(matcher.group(1));
            // Add the successful match to the results
            results.add(zhihuTemp);
            // Continue with the next match
            isFind = matcher.find();
        }
        return results;
    }
}
Main.java
import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        // Define the link to be visited
        String url = "http://www.zhihu.com/explore/recommendations";
        // Fetch the link and get the page content
        String content = Spider.SendGet(url);
        // Get the editors' recommendations
        ArrayList<Zhihu> myZhihu = Spider.GetRecommendations(content);
        // Print the results
        System.out.println(myZhihu);
    }
}
That's the complete record of grabbing Zhihu answers. It's quite detailed; friends who need it can use it as a reference.