Speaking of crawlers, the URLConnection class that ships with Java can handle basic page fetching, but for more advanced functions, such as handling redirects and stripping HTML tags, URLConnection alone is not enough.
Here we can use the third-party HttpClient library (the Apache Commons HttpClient jar).
Next, let's use HttpClient to write a simple demo that crawls the Baidu homepage:
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpStatus;
import org.apache.commons.httpclient.methods.GetMethod;

/**
 * @author CallMeWhy
 */
public class Spider {
    private static HttpClient httpClient = new HttpClient();

    /**
     * @param path
     *            Link to the target page
     * @return a boolean indicating whether the target page was downloaded successfully
     * @throws Exception
     *             IO exception while reading the page stream or writing the local file stream
     */
    public static boolean downloadPage(String path) throws Exception {
        // Define the input and output streams
        InputStream input = null;
        OutputStream output = null;
        // Create the GET method
        GetMethod getMethod = new GetMethod(path);
        // Execute the request and get the status code
        int statusCode = httpClient.executeMethod(getMethod);
        // Process the status code
        // For simplicity, only a status code of 200 is handled
        if (statusCode == HttpStatus.SC_OK) {
            input = getMethod.getResponseBodyAsStream();
            // Derive the file name from the URL
            String filename = path.substring(path.lastIndexOf('/') + 1) + ".html";
            // Open the file output stream
            output = new FileOutputStream(filename);
            // Copy the response to the file byte by byte
            int tempByte = -1;
            while ((tempByte = input.read()) != -1) {
                output.write(tempByte);
            }
            // Close the input stream
            if (input != null) {
                input.close();
            }
            // Close the output stream
            if (output != null) {
                output.close();
            }
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        try {
            // Fetch the Baidu homepage and save it locally
            Spider.downloadPage("http://www.baidu.com");
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
However, such a basic crawler cannot satisfy the needs of most crawling tasks.
Let's first introduce the breadth-first crawler.
Breadth-first traversal should be familiar to everyone; in simple terms, a breadth-first crawler can be understood like this.
Think of the Internet as a huge directed graph: every link on a web page is a directed edge, and every file or page without outgoing links is an end point of the graph.
A breadth-first crawler walks this directed graph, starting from the root node and fetching the data of new nodes layer by layer.
The breadth-first traversal algorithm works as follows (a minimal Java sketch is given below):
(1) Put the start vertex V into the queue.
(2) Continue while the queue is not empty; otherwise the algorithm terminates.
(3) Dequeue the head vertex V, visit it, and mark V as visited.
(4) Find the first adjacent vertex col of V.
(5) If the adjacent vertex col has not been visited, put col into the queue.
(6) Continue searching for the other adjacent vertices col of V and go to step (5). If all adjacent vertices of V have been visited, go to step (2).
Following this breadth-first traversal, the graph in the figure is traversed in the order A -> B -> C -> D -> E -> F -> H -> G -> I, i.e. layer by layer.
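To make the steps concrete, here is a minimal standalone sketch of breadth-first traversal in Java. The small adjacency-list graph below is made-up illustrative data, not the graph from the figure.

import java.util.*;

public class BfsDemo {
    public static void main(String[] args) {
        // A tiny directed graph stored as an adjacency list (illustrative data only)
        Map<String, List<String>> graph = new HashMap<String, List<String>>();
        graph.put("A", Arrays.asList("B", "C", "D"));
        graph.put("B", Arrays.asList("E"));
        graph.put("C", Arrays.asList("F"));

        Queue<String> queue = new LinkedList<String>(); // vertices waiting to be visited
        Set<String> visited = new HashSet<String>();    // vertices already visited

        queue.offer("A");                  // (1) put the start vertex into the queue
        while (!queue.isEmpty()) {         // (2) keep going while the queue is not empty
            String v = queue.poll();       // (3) dequeue the head vertex...
            if (visited.contains(v))
                continue;
            visited.add(v);                // ...visit it and mark it as visited
            System.out.println("visit " + v);
            List<String> neighbours = graph.get(v);
            if (neighbours == null)
                continue;
            for (String w : neighbours) {  // (4)(6) walk through the adjacent vertices
                if (!visited.contains(w))  // (5) enqueue vertices that have not been visited yet
                    queue.offer(w);
            }
        }
    }
}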
A breadth-first crawler does essentially the same thing as this graph traversal, except that it starts from a set of seed nodes.
We can keep the URLs of the pages that still need to be crawled in a TODO table, and the pages that have already been visited in a Visited table:
The basic process of the breadth-first crawler is as follows (a condensed sketch is given after the list):
(1) Compare each parsed link with the links in the Visited table; if the link is not in the Visited table, it has not been visited yet.
(2) Put that link into the TODO table.
(3) After processing, take a link from the TODO table and move it directly into the Visited table.
(4) Repeat the above process for the web page this link points to, and so on.
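To make these steps concrete before writing the real classes, here is a condensed sketch of that loop. Note that downloadPage and parseLinks are hypothetical placeholders standing in for the DownTool and HtmlParserTool classes built later in this article.

import java.util.*;

public class CrawlLoopSketch {
    // Hypothetical stand-ins for the DownTool and HtmlParserTool classes built below
    static String downloadPage(String url) { return ""; }
    static Set<String> parseLinks(String page) { return Collections.emptySet(); }

    public static void main(String[] args) {
        Queue<String> todo = new LinkedList<String>(); // TODO table: URLs waiting to be crawled
        Set<String> visited = new HashSet<String>();   // Visited table: URLs already crawled
        todo.offer("http://www.baidu.com");            // seed URL
        while (!todo.isEmpty()) {
            String url = todo.poll();                  // take a link from the TODO table
            visited.add(url);                          // and move it into the Visited table
            String page = downloadPage(url);           // download the page
            for (String link : parseLinks(page)) {     // parse its links
                if (!visited.contains(link) && !todo.contains(link)) {
                    todo.offer(link);                  // only new, unvisited links enter the TODO table
                }
            }
        }
    }
}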
Next we will build this breadth-first crawler step by step.
First, we design a data structure to store the TODO table. Since first-in-first-out behaviour is required, we use a queue and write a custom Queue class:
package model;

import java.util.LinkedList;

/**
 * Custom queue class used to store the TODO table
 */
public class Queue {
    /**
     * The underlying queue, implemented with a LinkedList
     */
    private LinkedList<Object> queue = new LinkedList<Object>();

    /**
     * Add t to the tail of the queue
     */
    public void enQueue(Object t) {
        queue.addLast(t);
    }

    /**
     * Remove the first item from the queue and return it
     */
    public Object deQueue() {
        return queue.removeFirst();
    }

    /**
     * Return whether the queue is empty
     */
    public boolean isQueueEmpty() {
        return queue.isEmpty();
    }

    /**
     * Return whether the queue contains t
     */
    public boolean contains(Object t) {
        return queue.contains(t);
    }

    /**
     * Return whether the queue is empty
     */
    public boolean empty() {
        return queue.isEmpty();
    }
}
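As a quick sanity check (a hypothetical snippet, not part of the crawler itself), the class behaves as a first-in-first-out queue, which is exactly what the TODO table needs:

package model;

public class QueueTest {
    public static void main(String[] args) {
        Queue todo = new Queue();
        todo.enQueue("http://a.example/");
        todo.enQueue("http://b.example/");
        System.out.println(todo.deQueue());      // prints http://a.example/ (first in, first out)
        System.out.println(todo.deQueue());      // prints http://b.example/
        System.out.println(todo.isQueueEmpty()); // prints true
    }
}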
We also need a data structure to record the URLs that have already been visited, namely the Visited table.
Whenever a URL is about to be accessed, it is first looked up in this data structure; if the URL is already there, the task is discarded.
This data structure must reject duplicates and support fast lookups, so a HashSet is chosen to store it.
Putting this together, we create a SpiderQueue class that holds both the Visited table and the TODO table:
package model;

import java.util.HashSet;
import java.util.Set;

/**
 * Custom class holding the Visited table and the unVisited (TODO) table
 */
public class SpiderQueue {
    /**
     * The collection of visited URLs, i.e. the Visited table
     */
    private static Set<Object> visitedUrl = new HashSet<>();

    /**
     * Add a URL to the visited set
     */
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    /**
     * Remove a URL from the visited set
     */
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    /**
     * Get the number of visited URLs
     */
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    /**
     * The collection of URLs still to be visited, i.e. the unVisited (TODO) table
     */
    private static Queue unVisitedUrl = new Queue();

    /**
     * Get the unVisited queue
     */
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    /**
     * Dequeue the next unvisited URL
     */
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    /**
     * Add a URL to unVisitedUrl, making sure every URL is only visited once
     */
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url)
                && !unVisitedUrl.contains(url))
            unVisitedUrl.enQueue(url);
    }

    /**
     * Return whether the unvisited URL queue is empty
     */
    public static boolean unVisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}
That completes the custom helper classes. The next step is a tool class for downloading web pages, which we define as the DownTool class:
package controller;

import java.io.*;

import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.*;

public class DownTool {
    /**
     * Generate the file name used to save a web page, based on its URL and content
     * type, removing characters that are not valid in file names
     */
    private String getFileNameByUrl(String url, String contentType) {
        // Remove the seven characters "http://"
        url = url.substring(7);
        // Confirm that the fetched page is of type text/html
        if (contentType.indexOf("html") != -1) {
            // Replace all special characters in the URL with underscores
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        } else {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + "."
                    + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
        return url;
    }

    /**
     * Save the page's byte array to a local file; filePath is the relative path of
     * the file to be saved
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(
                    new File(filePath)));
            for (int i = 0; i < data.length; i++)
                out.write(data[i]);
            out.flush();
            out.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    // Download the web page the URL points to and return the path it was saved to
    public String downloadFile(String url) {
        String filePath = null;
        // 1. Create an HttpClient object and set its parameters
        HttpClient httpClient = new HttpClient();
        // Set the HTTP connection timeout to 5s
        httpClient.getHttpConnectionManager().getParams()
                .setConnectionTimeout(5000);
        // 2. Create a GetMethod object and set its parameters
        GetMethod getMethod = new GetMethod(url);
        // Set the GET request timeout to 5s
        getMethod.getParams().setParameter(HttpMethodParams.SO_TIMEOUT, 5000);
        // Set the request retry handler
        getMethod.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler());
        // 3. Execute the GET request
        try {
            int statusCode = httpClient.executeMethod(getMethod);
            // Check the status code
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed: "
                        + getMethod.getStatusLine());
                return null;
            }
            // 4. Process the HTTP response body
            byte[] responseBody = getMethod.getResponseBody(); // Read as a byte array
            // Generate the file name to save under, based on the page URL
            filePath = "temp/"
                    + getFileNameByUrl(url,
                            getMethod.getResponseHeader("Content-Type")
                                    .getValue());
            saveToLocal(responseBody, filePath);
        } catch (HttpException e) {
            // A fatal exception: either the protocol is wrong or the response is broken
            System.out.println("Please check whether your http address is correct");
            e.printStackTrace();
        } catch (IOException e) {
            // A network exception occurred
            e.printStackTrace();
        } finally {
            // Release the connection
            getMethod.releaseConnection();
        }
        return filePath;
    }
}
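One practical note: downloadFile saves everything under a relative temp directory, and the article never shows that directory being created; if it is missing, saveToLocal simply prints a FileNotFoundException. A small hypothetical driver that creates the directory first might look like this:

package controller;

import java.io.File;

public class DownToolDemo {
    public static void main(String[] args) {
        // Create the relative "temp" output directory if it does not exist yet;
        // DownTool.downloadFile assumes it is already there.
        File tempDir = new File("temp");
        if (!tempDir.exists() && !tempDir.mkdirs()) {
            System.err.println("Could not create the temp output directory");
            return;
        }
        String savedPath = new DownTool().downloadFile("http://www.baidu.com");
        System.out.println("Saved to: " + savedPath);
    }
}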
Next we need an HtmlParserTool class to parse the HTML tags and extract the links:
package controller;

import java.util.HashSet;
import java.util.Set;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import model.LinkFilter;

public class HtmlParserTool {
    // Get the links on a page; filter decides which links are kept
    public static Set<String> extracLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("gb2312");
            // Filter for <frame> tags, used to extract the src attribute of the frame tag
            NodeFilter frameFilter = new NodeFilter() {
                private static final long serialVersionUID = 1L;

                @Override
                public boolean accept(Node node) {
                    return node.getText().startsWith("frame src=");
                }
            };
            // OrFilter combining the <a> tag filter and the <frame> tag filter
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(
                    LinkTag.class), frameFilter);
            // Get all tags that match the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) { // <a> tag
                    LinkTag link = (LinkTag) tag;
                    String linkUrl = link.getLink(); // URL
                    if (filter.accept(linkUrl))
                        links.add(linkUrl);
                } else { // <frame> tag
                    // Extract the link in the src attribute of the frame, e.g. <frame src="test.html"/>
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1)
                        end = frame.indexOf(">");
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl))
                        links.add(frameUrl);
                }
            }
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return links;
    }
}
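Both HtmlParserTool above and the crawler class below import a model.LinkFilter interface that the article never lists. Judging from how it is used (filter.accept(url) returning a boolean), a minimal reconstruction would be:

package model;

/**
 * Callback used to decide whether an extracted link should be kept.
 * (Reconstructed from its usage; the original article does not show this file.)
 */
public interface LinkFilter {
    boolean accept(String url);
}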
Finally, let's write the crawler class itself, which ties together the classes and methods above:
package controller;

import java.util.Set;

import model.LinkFilter;
import model.SpiderQueue;

public class BfsSpider {
    /**
     * Initialize the URL queue with the seed URLs
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++)
            SpiderQueue.addUnvisitedUrl(seeds[i]);
    }

    // Crawl using a filter that only keeps links starting with http://www.baidu.com
    public void crawling(String[] seeds) {
        LinkFilter filter = new LinkFilter() {
            public boolean accept(String url) {
                return url.startsWith("http://www.baidu.com");
            }
        };
        // Initialize the URL queue with the seeds
        initCrawlerWithSeeds(seeds);
        // Loop condition: there are still links to crawl and no more than 1000 pages have been crawled
        while (!SpiderQueue.unVisitedUrlsEmpty()
                && SpiderQueue.getVisitedUrlNum() <= 1000) {
            // Dequeue the URL at the head of the queue
            String visitUrl = (String) SpiderQueue.unVisitedUrlDeQueue();
            if (visitUrl == null)
                continue;
            DownTool downLoader = new DownTool();
            // Download the page
            downLoader.downloadFile(visitUrl);
            // Put this URL into the visited set
            SpiderQueue.addVisitedUrl(visitUrl);
            // Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extracLinks(visitUrl, filter);
            // Enqueue the new, unvisited URLs
            for (String link : links) {
                SpiderQueue.addUnvisitedUrl(link);
            }
        }
    }

    // main method entry point
    public static void main(String[] args) {
        BfsSpider crawler = new BfsSpider();
        crawler.crawling(new String[] { "http://www.baidu.com" });
    }
}
After running it, you can see that the crawler has downloaded the pages linked from the Baidu homepage.
That is the whole process of crawling content in Java with the HttpClient toolkit and a breadth-first crawler. It is a little more involved, so it is worth thinking through carefully; I hope it is helpful.