Zhihu is a genuine online Q&A community with a friendly, rational, and serious atmosphere, one that connects elites from all walks of life. Users share their professional knowledge, experience, and insights with one another, continuously providing high-quality content for the Chinese internet.
First, I spent three to five minutes designing a logo =. = As a programmer, I've always had the heart of an artist!
Okay, it's a bit makeshift, but it will do for now.
Next, let's start building the Zhihu crawler.
First, settle on the initial target: editor recommendations.
Web link: http://www.zhihu.com/explore/recommendations
We slightly modify the code from last time so that it can fetch the content of this page:
import java.io.*;
import java.net.*;
import java.util.regex.*;

public class Main {
    static String SendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream()));
            // Temporarily stores each captured line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each captured line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        }
        // Use finally to close the input stream
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static String RegexString(String targetStr, String patternStr) {
        // Define a pattern template using a regular expression; the content to capture is in parentheses
        // It's like setting a trap: whatever matches falls in
        Pattern pattern = Pattern.compile(patternStr);
        // Define a matcher for matching
        Matcher matcher = pattern.matcher(targetStr);
        // If a match is found
        if (matcher.find()) {
            // Return the captured group
            return matcher.group(1);
        }
        return "Nothing";
    }

    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Access the link and get the page content
        String result = SendGet(url);
        // Use a regular expression to match the src content of images
        // String imgSrc = RegexString(result, "src=\"(.+?)\"");
        // Print the result
        System.out.println(result);
    }
}
It runs without problems. The next step is the regex matching.
First, let's grab all the questions on this page.
Right-click a title and inspect the element:
Aha, you can see that the title is actually an a tag, i.e. a hyperlink, and what distinguishes it from other hyperlinks should be its class, i.e. the class selector.
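For reference, the markup presumably looks something like this (a reconstruction for illustration; the href and title below are made up, not Zhihu's exact source):

<a class="question_link" href="/question/12345678">How do you write a crawler in Java?</a>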
So our regular expression comes out: question_link.+?href=\"(.+?)\"
Call the RegexString function and pass it parameters:
public static void main(String[] args) {
    // Define the link to visit
    String url = "http://www.zhihu.com/explore/recommendations";
    // Access the link and get the page content
    String result = SendGet(url);
    // Use a regular expression to match the question titles
    String imgSrc = RegexString(result, "question_link.+?>(.+?)<");
    // Print the result
    System.out.println(imgSrc);
}
Aha, you can see that we successfully captured a title (note: just one):
Wait a minute, what is all this mess?!
Don't panic =. = It's just a character-encoding problem.
For encoding issues, please see: HTML character set
Generally speaking, the mainstream encodings with good support for Chinese are UTF-8, GB2312, and GBK.
A web page can set its encoding via the charset attribute of a meta tag, for example:
<meta charset="utf-8" />
Let's right-click to view the page source code:
As you can see, Zhihu uses UTF-8 encoding.
Here, let me explain the difference between viewing the page source and inspecting an element.
Viewing the page source displays the entire page's raw HTML exactly as served, without being organized by the rendered tag structure. It is more useful for looking at page-wide information, such as the meta tags.
Inspect element (some browsers call it view element) shows just the element you right-clicked, such as a div or img. It is better suited for examining the attributes and tags of a single object.
Okay, now we know the problem lies in the encoding; the next step is to convert the encoding of the captured content.
The implementation in Java is very simple: just specify the encoding in the InputStreamReader:
// Initialize a BufferedReader to read the response from the URL
in = new BufferedReader(new InputStreamReader(
        connection.getInputStream(), "UTF-8"));
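Incidentally, if you would rather not hard-code UTF-8, you can read the charset from the HTTP Content-Type header instead. A minimal sketch, assuming we add the helper below (guessCharset is my own name, not part of any library) to our class and fall back to UTF-8 when the server sends no charset parameter:

// Sketch: extract the charset from a header like "text/html; charset=UTF-8".
// Assumption: fall back to UTF-8 when no charset parameter is present.
static String guessCharset(URLConnection connection) {
    String contentType = connection.getContentType();
    if (contentType != null) {
        for (String part : contentType.split(";")) {
            part = part.trim();
            if (part.toLowerCase().startsWith("charset=")) {
                return part.substring("charset=".length());
            }
        }
    }
    return "UTF-8";
}

The reader then becomes new InputStreamReader(connection.getInputStream(), guessCharset(connection)).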
If you run the program again now, you will see that the title displays normally:
OK! Very good!
But now there is only one title, and we want all of them.
We tweak the regular expression slightly and store the matched results in an ArrayList:
import java.io.*;
import java.net.*;
import java.util.ArrayList;
import java.util.regex.*;

public class Main {
    static String SendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"));
            // Temporarily stores each captured line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each captured line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        }
        // Use finally to close the input stream
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static ArrayList<String> RegexString(String targetStr, String patternStr) {
        // Predefine an ArrayList to store the results
        ArrayList<String> results = new ArrayList<String>();
        // Define a pattern template using a regular expression; the content to capture is in parentheses
        Pattern pattern = Pattern.compile(patternStr);
        // Define a matcher for matching
        Matcher matcher = pattern.matcher(targetStr);
        // Whether a match was found
        boolean isFind = matcher.find();
        // Loop over all matches and collect each captured group
        while (isFind) {
            // Add the successfully matched result
            results.add(matcher.group(1));
            // Continue to look for the next match
            isFind = matcher.find();
        }
        return results;
    }

    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Access the link and get the page content
        String result = SendGet(url);
        // Use a regular expression to match the question titles
        ArrayList<String> imgSrc = RegexString(result, "question_link.+?>(.+?)<");
        // Print the results
        System.out.println(imgSrc);
    }
}
This matches all the results (since the ArrayList is printed directly, you will see some square brackets and commas):
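If the brackets and commas bother you, a quick alternative (using the same imgSrc list from above) is to print each title on its own line:

// Print one title per line instead of the ArrayList's default format
for (String title : imgSrc) {
    System.out.println(title);
}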
OK, this is the first step of the Zhihu crawler.
But we can see that this alone can't capture the questions together with their links and answers.
We need to design a Zhihu wrapper class to store all the captured objects.
Zhihu.java source code:
import java.util.ArrayList;

public class Zhihu {
    public String question; // the question
    public String zhihuUrl; // the web page link
    public ArrayList<String> answers; // array storing all the answers

    // The constructor initializes the data
    public Zhihu() {
        question = "";
        zhihuUrl = "";
        answers = new ArrayList<String>();
    }

    @Override
    public String toString() {
        return "Question:" + question + "\nLink:" + zhihuUrl + "\nAnswer:" + answers + "\n";
    }
}
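As a quick sanity check, here is a minimal sketch of how the class is used (the question text and link below are made-up values, purely for illustration):

Zhihu demo = new Zhihu();
demo.question = "How do you write a crawler in Java?"; // made-up value
demo.zhihuUrl = "http://www.zhihu.com/question/12345678"; // made-up link
demo.answers.add("A made-up answer");
// toString() prints the question, link, and answers on separate lines
System.out.println(demo);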
Create a new Spider class to hold the crawler's commonly used functions.
Spider.java source code:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Spider {
    static String SendGet(String url) {
        // Define a string to store the web page content
        String result = "";
        // Define a buffered character input stream
        BufferedReader in = null;
        try {
            // Convert the string into a URL object
            URL realUrl = new URL(url);
            // Initialize a connection to that URL
            URLConnection connection = realUrl.openConnection();
            // Start the actual connection
            connection.connect();
            // Initialize a BufferedReader to read the response from the URL
            in = new BufferedReader(new InputStreamReader(
                    connection.getInputStream(), "UTF-8"));
            // Temporarily stores each captured line
            String line;
            while ((line = in.readLine()) != null) {
                // Append each captured line to result
                result += line;
            }
        } catch (Exception e) {
            System.out.println("Exception occurred when sending GET request!" + e);
            e.printStackTrace();
        }
        // Use finally to close the input stream
        finally {
            try {
                if (in != null) {
                    in.close();
                }
            } catch (Exception e2) {
                e2.printStackTrace();
            }
        }
        return result;
    }

    static ArrayList<Zhihu> GetZhihu(String content) {
        // Predefine an ArrayList to store the results
        ArrayList<Zhihu> results = new ArrayList<Zhihu>();
        // Used to match the titles
        Pattern questionPattern = Pattern.compile("question_link.+?>(.+?)<");
        Matcher questionMatcher = questionPattern.matcher(content);
        // Used to match the URL, i.e. the link to the question
        Pattern urlPattern = Pattern.compile("question_link.+?href=\"(.+?)\"");
        Matcher urlMatcher = urlPattern.matcher(content);
        // Both the question and the link must match
        boolean isFind = questionMatcher.find() && urlMatcher.find();
        while (isFind) {
            // Define a Zhihu object to store the captured information
            Zhihu zhihuTemp = new Zhihu();
            zhihuTemp.question = questionMatcher.group(1);
            zhihuTemp.zhihuUrl = "http://www.zhihu.com" + urlMatcher.group(1);
            // Add the successfully matched result
            results.add(zhihuTemp);
            // Continue to look for the next match
            isFind = questionMatcher.find() && urlMatcher.find();
        }
        return results;
    }
}
Finally, the main method is responsible for calling everything. Note that GetZhihu advances the two matchers in lockstep, which works because each question_link anchor contains both the href and the title.
import java.util.ArrayList;

public class Main {
    public static void main(String[] args) {
        // Define the link to visit
        String url = "http://www.zhihu.com/explore/recommendations";
        // Access the link and get the page content
        String content = Spider.SendGet(url);
        // Get all the Zhihu objects on this page
        ArrayList<Zhihu> myZhihu = Spider.GetZhihu(content);
        // Print the results
        System.out.println(myZhihu);
    }
}
Ok, that’s it. Run it and see the results:
Good results.
The next step is to access the link and get all the answers.
We will cover that in detail next time.
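As a rough preview of the idea, we can reuse Spider.SendGet on each question's zhihuUrl. The answer-extraction step is left as a placeholder, since we haven't inspected Zhihu's answer markup yet:

// Sketch: visit each question page; the answer regex is still to be worked out
for (Zhihu zhihu : myZhihu) {
    String questionPage = Spider.SendGet(zhihu.zhihuUrl);
    // Hypothetical next step (matchAnswers is not written yet; pattern TBD):
    // zhihu.answers.addAll(matchAnswers(questionPage));
}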
Okay, that's a walkthrough of the whole process of using Java to grab the content recommended by Zhihu's editors. It should be detailed enough to follow easily; friends who need it are welcome to refer to it and extend it freely. No problem at all, haha.