Let’s talk about how to capture data using Node.js + Cheerio

Author：Eve Cole Update Time：2022-08-04 14:50:56

When a website does not offer its data in a more convenient form, you may have to resort to web scraping to obtain it. This article introduces how to use Node.js and Cheerio to scrape website data. I hope it will be helpful to everyone!

Before we start, remember that you need to abide by local laws and regulations, and you should not scrape data that you do not have permission to collect.

Prerequisites

Here are some things you'll need for this tutorial:

You need Node.js installed. If you don't have Node, download it for your system from the Node.js download page (https://nodejs.dev/download/).
You will need to have a text editor installed on your machine, such as VSCode or Atom.
You should have at least a basic understanding of JavaScript, Node.js, and the Document Object Model (DOM).

What is Cheerio?

Cheerio is a tool for parsing HTML and XML in Node.js. It is very popular on GitHub and has more than 23k stars.

It's fast, flexible and easy to use. Since it implements a subset of jQuery, it's easy to get started with Cheerio if you're already familiar with jQuery.

The main difference between Cheerio and a web browser is that cheerio does not generate visual rendering, load CSS, load external resources, or execute JavaScript. It just parses the markup and provides an API for manipulating the resulting data structures. This explains why it's also very fast - cheerio documentation.

If you want to use cheerio to scrape web pages, you first need to use a package like axios or node-fetch to fetch the markup.

How to crawl web pages in Node using Cheerio

In this example, we will crawl the ISO 3166-1 alpha-3 codes for all countries and other jurisdictions listed on this Wikipedia page. It's located under the current code section of the ISO 3166-1 alpha-3 page.

This is what the list of countries/jurisdictions and their corresponding codes look like:

Step 1 - Create a working directory

In this step, you will create a directory for your project by running the following command on the terminal. This command will create a directory named learn-cheerio. You can give it a different name if you wish.

mkdir learn-cheerio

After successfully running the above command, you should be able to see a folder named learn-cheerio.

In the next step, you will open the directory you just created in your favorite text editor and initialize the project.

Step 2 - Initialize the Project

In this step, you will navigate to the project directory and initialize the project. Open the directory you created in the previous step in your favorite text editor and initialize the project by running the following command.

npm init -y

A successful run of the above command will create a file package.json in the root of the project directory.

In the next step, you will install the project dependencies.

Step 3 - Install Dependencies

In this step, you will install the project dependencies by running the following command. This will take a few minutes, so please be patient.

npm i axios cheerio pretty

Successfully running the above command will register three dependencies in the package.json file under the dependencies field. The first dependency is axios, the second is cheerio, and the third is pretty.
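If you want to confirm that the dependencies were registered without opening package.json by hand, a small script can check it. The following is only a sketch: the file name check-deps.js and the helper missingDependencies are hypothetical names introduced here, and the script assumes it is run from the project root.

```javascript
// check-deps.js (hypothetical file name, not part of the original tutorial)
const fs = require("fs");
const path = require("path");

// Hypothetical helper: returns the names in `expected` that are not listed
// under the "dependencies" field of the given package.json text
function missingDependencies(packageJsonText, expected) {
  const { dependencies = {} } = JSON.parse(packageJsonText);
  return expected.filter((name) => !(name in dependencies));
}

// Assumes the script is run from the project root, next to package.json
const packagePath = path.join(process.cwd(), "package.json");
if (fs.existsSync(packagePath)) {
  const missing = missingDependencies(fs.readFileSync(packagePath, "utf8"), [
    "axios",
    "cheerio",
    "pretty",
  ]);
  console.log(
    missing.length === 0
      ? "All three dependencies are registered"
      : `Missing dependencies: ${missing.join(", ")}`
  );
}
```

Run it with node check-deps.js from the learn-cheerio directory.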

axios is a very popular HTTP client that can run in Node and in browsers. We need it because cheerio is only a markup parser and does not fetch pages itself.

In order for Cheerio to parse the markup and scrape the data you need, we use axios to fetch the markup from the website. If you prefer, you can use another HTTP client to fetch the markup. It doesn't have to be axios.
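As an illustration of swapping the HTTP client, here is a minimal sketch that uses the fetch function built into recent Node.js releases instead of axios. It assumes Node.js 18 or later, where fetch is available as a global. The helper name fetchHtml and its second parameter are hypothetical and exist only to make the function easy to try without network access.

```javascript
// Hypothetical helper: fetch a page's HTML with any fetch-compatible client.
// Assumes Node.js 18+, where `fetch` is a global; pass your own
// implementation as `fetchImpl` if you use a different client.
async function fetchHtml(url, fetchImpl = fetch) {
  const response = await fetchImpl(url);
  // Treat non-2xx responses as errors instead of parsing an error page
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  // The body is returned as a string, ready to be passed to cheerio.load
  return response.text();
}
```

You would then call fetchHtml(url) inside an async function and pass the returned string to cheerio.load, in the same way the axios response data is used later in this article.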

pretty is an npm package for beautifying markup so it's readable when printed on the terminal.

In the next section, you'll examine the markup from which data will be scraped.

Step 4 - Check the web page you want to scrape

Before scraping data from a web page, it is important to understand the HTML structure of the page.

In this step, you examine the HTML structure of the web page from which you want to scrape data.

Navigate to the ISO 3166-1 alpha-3 codes page on Wikipedia. Under the "Current Codes" section, there is a list of countries and their corresponding codes. You can open DevTools by pressing the key combination CTRL + SHIFT + I in Chrome or by right-clicking and selecting the "Inspect" option.

This is my list in chrome DevTools:

In the next section, you'll write the code to scrape the page.

Step 5 - Write the code to scrape the data

In this section, you will write the code to scrape the data that we are interested in. First run the following command which will create the app.js file.

touch app.js

Successfully running the above command will create a file named app.js in the root of the project directory.

Like any other Node package, you must first require axios, cheerio, and pretty before you can start using them. You can do this by adding the following code at the top of the app.js file you just created.

const axios = require("axios");
const cheerio = require("cheerio");
const pretty = require("pretty");

Before we write the code for scraping the data, we need to learn cheerio . We will parse the markup below and try to manipulate the resulting data structure. This will help us learn Cheerio syntax and its most commonly used methods.

The markup below is a ul element that contains our li elements.

const markup = `
<ul class="fruits">
  <li class="fruits__mango"> Mango </li>
  <li class="fruits__apple"> Apple </li>
</ul>
`;

Add the above variable declaration to the app.js file.

How to load markup in Cheerio

cheerio can load markup using the cheerio.load method. This method takes the markup as a parameter. It also accepts two additional optional parameters. If you're interested, you can read more about them in the documentation.

Below, we pass the first and only required parameter and store the return value in the $ variable. We use this variable name because of cheerio's similarity to jQuery, which uses $. You can use a different variable name if you wish.

Add the following code to your app.js file:

const $ = cheerio.load(markup);
console.log(pretty($.html()));

If you now execute the code in the app.js file by running the command node app.js on the terminal, you should be able to see the markup on the terminal. This is what I see on the terminal:

How to select elements in Cheerio

Cheerio supports most common CSS selectors, such as class , id and element selectors. In the code below, we select an element with class fruits__mango and then log the selected element to the console. Add the following code to your app.js file.

const mango = $(".fruits__mango");
console.log(mango.html()); // Mango

If you run app.js with the command node app.js, the above lines of code will log the text Mango on the terminal.

How to get the attributes of an element in Cheerio

You can also select an element and get specific attributes such as class , id or all attributes and their corresponding values.

Add the following code to your app.js file:

const apple = $(".fruits__apple");
console.log(apple.attr("class")); // fruits__apple

The code above will log fruits__apple on the terminal. fruits__apple is the class of the selected element.

How to loop through a list of elements in Cheerio

Cheerio provides the .each method to loop through multiple selected elements.

Below, we select all li elements and loop through them using the .each method. We log the text content of each list item on the terminal.

Add the following code to your app.js file.

const listItems = $("li");
console.log(listItems.length); // 2
listItems.each(function (idx, el) {
  console.log($(el).text());
});
// Mango
// Apple

The code above will log 2, which is the length of the list items. After executing the code in app.js, the text Mango and Apple will be displayed on the terminal.

How to append or prepend elements to markup in Cheerio

Cheerio provides methods to append or prepend elements to markup.

The append method will append the element passed as parameter after the last child element of the selected element. On the other hand, prepend will add the passed element before the first child of the selected element.

Add the following code to your app.js file:

const ul = $("ul");
ul.append("<li>Banana</li>");
ul.prepend("<li>Pineapple</li>");
console.log(pretty($.html()));

After appending and prepending elements to the markup, this is what I see when I log $.html() on the terminal:

These are the basics of Cheerio to get you started with web scraping. To scrape the data from Wikipedia that we described at the beginning of this article, replace the contents of the app.js file with the following code:

// Loading the dependencies. We don't need pretty
// because we shall not log html to the terminal
const axios = require("axios");
const cheerio = require("cheerio");
const fs = require("fs");

// URL of the page we want to scrape
const url = "https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3";

// Async function which scrapes the data
async function scrapeData() {
  try {
    // Fetch HTML of the page we want to scrape
    const { data } = await axios.get(url);
    // Load HTML we fetched in the previous line
    const $ = cheerio.load(data);
    // Select all the list items in plainlist class
    const listItems = $(".plainlist ul li");
    // Stores data for all countries
    const countries = [];
    // Use .each method to loop through the li we selected
    listItems.each((idx, el) => {
      // Object holding data for each country/jurisdiction
      const country = { name: "", iso3: "" };
      // Select the text content of a and span elements
      // Store the textcontent in the above object
      country.name = $(el).children("a").text();
      country.iso3 = $(el).children("span").text();
      // Populate countries array with country data
      countries.push(country);
    });
    // Logs countries array to the console
    console.dir(countries);
    // Write countries array in countries.json file
    fs.writeFile("countries.json", JSON.stringify(countries, null, 2), (err) => {
      if (err) {
        console.error(err);
        return;
      }
      console.log("Successfully written data to file");
    });
  } catch (err) {
    console.error(err);
  }
}
// Invoke the above function
scrapeData();

By reading the code, do you understand what is happening? If not, I'll go into detail now. I've also commented each line of code to help you understand.

In the above code, we require all the dependencies at the top of the app.js file and then declare the scrapeData function. Inside the function, the HTML of the page we want to scrape is fetched using axios and then loaded into cheerio.

The list of countries and their corresponding iso3 codes is nested in a div element with the class plainlist . The li elements are selected and we then loop through them using the .each method. Data for each country is scraped and stored in an array.

After running the above code using the node app.js command, the captured data is written to the countries.json file and printed on the terminal. This is part of what I see on the terminal:
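Once the countries.json file exists, the data can be loaded back and queried. Below is a minimal sketch; the file name lookup.js and the helper findByIso3 are hypothetical, and the script assumes it is run from the directory that contains countries.json.

```javascript
// lookup.js (hypothetical file name, not part of the original tutorial)
const fs = require("fs");

// Hypothetical helper: finds a country by its ISO 3166-1 alpha-3 code,
// ignoring letter case and surrounding whitespace in the query
function findByIso3(countries, code) {
  const wanted = code.trim().toUpperCase();
  return countries.find((country) => country.iso3.toUpperCase() === wanted);
}

// Assumes countries.json was written by the scraper into the current directory
const file = "countries.json";
if (fs.existsSync(file)) {
  const countries = JSON.parse(fs.readFileSync(file, "utf8"));
  console.log(findByIso3(countries, "afg"));
} else {
  console.log(`${file} not found; run node app.js first`);
}
```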

Conclusion

Thank you for reading this article! We've covered the basics of using cheerio. If you want to go deeper and fully understand how it works, you can head to the Cheerio documentation.