In the earlier stage, we crawled question titles from this link:
http://www.zhihu.com/explore/recommendations
But obviously that page does not give us the answers. A complete question page looks like this:
http://www.zhihu.com/question/22355264
Taking a closer look, aha, our encapsulation class needs further work. At the very least it needs a questionDescription field to store the question description:
import java.util.ArrayList;

public class Zhihu {
    public String question; // Question title
    public String questionDescription; // Question description
    public String zhihuUrl; // Web page link
    public ArrayList<String> answers; // List that stores all the answers

    // Constructor: initialize the fields
    public Zhihu() {
        question = "";
        questionDescription = "";
        zhihuUrl = "";
        answers = new ArrayList<String>();
    }

    @Override
    public String toString() {
        return "Question:" + question + "\n" + "Description:" + questionDescription
                + "\n" + "Link:" + zhihuUrl + "\nAnswers:" + answers + "\n";
    }
}
We add a URL parameter to Zhihu's constructor, because once the URL is known, the question description and the answers can be fetched from it.
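A minimal sketch of the new constructor signature (the full version, which also kicks off the crawl, appears in the complete listing at the end of this post):

// Constructor now takes the question URL
public Zhihu(String url) {
    question = "";
    questionDescription = "";
    zhihuUrl = "";
    answers = new ArrayList<String>();
    // Normalize the URL, then fetch and parse the page (shown later)
}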
Next, change Spider's method that collects Zhihu objects so that it only extracts the URL:
static ArrayList<Zhihu> GetZhihu(String content) {
    // Predefine an ArrayList to store the results
    ArrayList<Zhihu> results = new ArrayList<Zhihu>();
    // Pattern that matches the URL, i.e. the link to the question
    Pattern urlPattern = Pattern.compile("<h2>.+?question_link.+?href=\"(.+?)\".+?</h2>");
    Matcher urlMatcher = urlPattern.matcher(content);
    // Whether a match was found
    boolean isFind = urlMatcher.find();
    while (isFind) {
        // Create a Zhihu object to store the captured information
        Zhihu zhihuTemp = new Zhihu(urlMatcher.group(1));
        // Add the successful match to the results
        results.add(zhihuTemp);
        // Continue with the next match
        isFind = urlMatcher.find();
    }
    return results;
}
Next, in Zhihu's constructor, we fetch all the detailed data through the URL.
First we need to normalize the URL, because a link to a specific answer looks like:
http://www.zhihu.com/question/22355264/answer/21102139
while a link to the question itself looks like:
http://www.zhihu.com/question/22355264
What we need is clearly the second form, so we use a regular expression to trim links of the first form down to the second. This can be done with a small function in Zhihu:
// Normalize the URL
boolean getRealUrl(String url) {
    // Turn http://www.zhihu.com/question/22355264/answer/21102139
    // into http://www.zhihu.com/question/22355264
    // "question/(\\d+)" grabs the numeric id whether or not an
    // /answer/... part follows, so both URL forms are accepted
    Pattern pattern = Pattern.compile("question/(\\d+)");
    Matcher matcher = pattern.matcher(url);
    if (matcher.find()) {
        zhihuUrl = "http://www.zhihu.com/question/" + matcher.group(1);
    } else {
        return false;
    }
    return true;
}
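For example, both URL forms should normalize to the same question link. A quick hypothetical check, assuming the earlier no-argument constructor is still available:

Zhihu z = new Zhihu(); // the no-argument constructor from the first version
z.getRealUrl("http://www.zhihu.com/question/22355264/answer/21102139");
System.out.println(z.zhihuUrl); // http://www.zhihu.com/question/22355264
z.getRealUrl("http://www.zhihu.com/question/22355264");
System.out.println(z.zhihuUrl); // same: http://www.zhihu.com/question/22355264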
The next step is to extract the individual parts.
Let's look at the title first:
Just capture that class with a regular expression. The pattern can be written as: zm-editable-content">(.+?)<
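To try the pattern quickly, we can run it against the fetched page source (a minimal sketch reusing Spider.SendGet from later in this post):

String content = Spider.SendGet("http://www.zhihu.com/question/22355264");
Pattern titlePattern = Pattern.compile("zm-editable-content\">(.+?)<");
Matcher titleMatcher = titlePattern.matcher(content);
if (titleMatcher.find()) {
    System.out.println("Title: " + titleMatcher.group(1));
}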
Run it to see the results. Ouch, not bad.
Next, grab the question description:
Aha, same principle: match on the class, since it should uniquely identify the element.
To verify: right-click to view the page source, then Ctrl+F to check whether the string appears anywhere else on the page.
Verification turned up a real problem: the title and the description content share the same class.
So we have to modify the regular expressions to capture them separately:
// Match the title
pattern = Pattern.compile("zh-question-title.+?<h2.+?>(.+?)</h2>");
matcher = pattern.matcher(content);
if (matcher.find()) {
    question = matcher.group(1);
}
// Match the description
pattern = Pattern.compile("zh-question-detail.+?<div.+?>(.*?)</div>");
matcher = pattern.matcher(content);
if (matcher.find()) {
    questionDescription = matcher.group(1);
}
The last step is to loop over the answers and grab them all.
A preliminary, tentative pattern: /answer/content.+?<div.+?>(.*?)</div>
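The matching loop then looks much like the title matching, except it keeps collecting until no match is left (this mirrors the code in the full listing below):

// Match all answers, adding each captured body to the list
pattern = Pattern.compile("/answer/content.+?<div.+?>(.*?)</div>");
matcher = pattern.matcher(content);
boolean isFind = matcher.find();
while (isFind) {
    answers.add(matcher.group(1));
    isFind = matcher.find();
}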
After this change, the program runs noticeably slower, because it now has to visit every question page and capture its content.
For example, if the editors recommend 20 questions, we have to make 20 additional page requests.
Try it out, and it looks good.
OK, let's leave it here for now~ Next time we'll make some finer adjustments, such as multi-threading and writing the results to local files with IO streams.
Attached is the project source code:
Zhihu.java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Zhihu {
    public String question; // Question title
    public String questionDescription; // Question description
    public String zhihuUrl; // Web page link
    public ArrayList<String> answers; // List that stores all the answers

    // Constructor: initialize the fields and crawl the page
    public Zhihu(String url) {
        // Initialize the properties
        question = "";
        questionDescription = "";
        zhihuUrl = "";
        answers = new ArrayList<String>();
        // Check whether the URL is valid
        if (getRealUrl(url)) {
            System.out.println("Crawling " + zhihuUrl);
            // Fetch the question details and answers from the URL
            String content = Spider.SendGet(zhihuUrl);
            Pattern pattern;
            Matcher matcher;
            // Match the title
            pattern = Pattern.compile("zh-question-title.+?<h2.+?>(.+?)</h2>");
            matcher = pattern.matcher(content);
            if (matcher.find()) {
                question = matcher.group(1);
            }
            // Match the description
            pattern = Pattern.compile("zh-question-detail.+?<div.+?>(.*?)</div>");
            matcher = pattern.matcher(content);
            if (matcher.find()) {
                questionDescription = matcher.group(1);
            }
            // Match the answers
            pattern = Pattern.compile("/answer/content.+?<div.+?>(.*?)</div>");
            matcher = pattern.matcher(content);
            boolean isFind = matcher.find();
            while (isFind) {
                answers.add(matcher.group(1));
                isFind = matcher.find();
            }
        }
    }

    // Fetch this object's question, description and answers from its own URL
    public boolean getAll() {
        return true;
    }

    // Normalize the URL
    boolean getRealUrl(String url) {
        // Turn http://www.zhihu.com/question/22355264/answer/21102139
        // into http://www.zhihu.com/question/22355264
        // "question/(\\d+)" matches the numeric id in both URL forms
        Pattern pattern = Pattern.compile("question/(\\d+)");
        Matcher matcher = pattern.matcher(url);
        if (matcher.find()) {
            zhihuUrl = "http://www.zhihu.com/question/" + matcher.group(1);
        } else {
            return false;
        }
        return true;
    }

    @Override
    public String toString() {
        return "Question:" + question + "\n" + "Description:" + questionDescription
                + "\n" + "Link:" + zhihuUrl + "\nAnswers:" + answers.size() + "\n";
    }
}
Spider.java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Spider {
    static String SendGet(String url) {
        // Builder that accumulates the web page content
        StringBuilder result = new StringBuilder();
        // Buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string to a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize the BufferedReader to read the response of the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"));
            // Temporarily holds each line as it is read
            String line;
            while ((line = in.readLine()) != null) {
                // Append each line; newlines are dropped, which lets the
                // regexes match across what were line breaks in the page
                result.append(line);
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        }
        // Use finally to close the input stream
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result.toString();
    }

    // Get all Zhihu content recommended by the editors
    static ArrayList<Zhihu> GetRecommendations(String content) {
        // Predefine an ArrayList to store the results
        ArrayList<Zhihu> results = new ArrayList<Zhihu>();
        // Pattern that matches the URL, i.e. the link to the question
        Pattern pattern = Pattern
                .compile("<h2>.+?question_link.+?href=\"(.+?)\".+?</h2>");
        Matcher matcher = pattern.matcher(content);
        // Whether a match was found
        boolean isFind = matcher.find();
        while (isFind) {
            // Create a Zhihu object to store the captured information
            Zhihu zhihuTemp = new Zhihu(matcher.group(1));
            // Add the successful match to the results
            results.add(zhihuTemp);
            // Continue with the next match
            isFind = matcher.find();
        }
        return results;
    }
}
Main.java
import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        // Define the link to be visited
        String url = "http://www.zhihu.com/explore/recommendations";
        // Fetch the link and get the page content
        String content = Spider.SendGet(url);
        // Get the editors' recommendations
        ArrayList<Zhihu> myZhihu = Spider.GetRecommendations(content);
        // Print the results
        System.out.println(myZhihu);
    }
}
That's the complete record of grabbing Zhihu answers. It's quite detailed; friends who need it can use it as a reference.