How to Scrape Web Pages With Cheerio in Node.js
Here, I’ll show you how to scrape data from websites using Cheerio in Node.js. We’ll go step-by-step through the process: setting up your project, making HTTP requests, and extracting the necessary data. By the end of this guide, you’ll know how to collect and save data in a clean format, like JSON, making it easy to use for your projects. Ready to dive in? Let’s get started!
What is Cheerio?
Cheerio is a fast and lightweight JavaScript library built on top of htmlparser2. It allows you to work with HTML in a very similar way to jQuery, making it easy to select, manipulate, and extract data from HTML documents. Cheerio is designed for use in server-side environments like Node.js, which makes it a perfect choice for web scraping tasks.
Since Cheerio is built for Node.js, it doesn’t render pages or simulate a browser the way Puppeteer or Playwright do. Instead, it operates directly on a page’s HTML markup, allowing you to parse and interact with it efficiently.
Cheerio provides a jQuery-like syntax that’s simple to understand, making it a great tool for beginners and experienced developers alike. Not sure if you should choose Cheerio or BeautifulSoup? Check our article on Cheerio vs. BeautifulSoup.
Skip Manual Scraping
If you want to skip manual web scraping, I’ve created detailed lists of the best dataset websites and the best web scraping tools. Now, let’s continue with our guide!
Prerequisites
Before we begin, you need to have a few things set up on your computer:
- Node.js: The JavaScript runtime environment that will execute our scraping code.
- npm (Node Package Manager): This will be used to install the libraries required for the project.
- A code editor: Use Visual Studio Code or Sublime Text to write and run your code.
If you’re not sure whether you have Node.js and npm installed, you can check by opening your terminal (or command prompt) and typing:
node -v
npm -v
If these commands return version numbers, you’re good to go! Otherwise, download and install Node.js from nodejs.org.
Setting Up Your Project
Let’s begin by setting up a simple Node.js project where we will install Cheerio and Axios (a popular HTTP client for making requests). Follow these steps to set up your project:
Step 1: Create a Project Directory
Start by creating a folder for your project. Open the terminal and run the following command to create a new directory:
mkdir cheerio-web-scraping
cd cheerio-web-scraping
This will create a new folder called cheerio-web-scraping and move you into it.
Step 2: Initialize a Node.js Project
Next, initialize a new Node.js project. This will create a package.json file to manage your project’s dependencies:
npm init -y
The -y flag automatically answers “yes” to all the prompts, creating the package.json file with default values.
Step 3: Install Dependencies
Now, let’s install Cheerio and Axios. In the terminal, run:
npm install cheerio axios
- Cheerio is the library we’ll use to parse and manipulate HTML.
- Axios is an HTTP client that makes it easy to send requests to websites and get back their HTML.
Step 4: Create Your First Script
Create a new file in the project folder named index.js. This will be the main file where you write your scraping code.
Now that we’ve set up our project, it’s time to write the script to scrape web pages!
Scraping a Web Page with Cheerio
Let’s start by scraping a simple webpage. For this example, we’ll scrape an e-commerce website and collect product details like the image, name, and price. But before we begin scraping, let’s inspect the page’s structure.
Inspecting the Page
In this example, we’ll scrape products from a demo e-commerce store. First, open the website you want to scrape in a browser and use your browser’s Developer Tools (right-click the page and choose “Inspect” in Chrome or Firefox) to explore the HTML structure.
Look for patterns such as class names or HTML elements that contain the data you want to extract. In this case, let’s say we want to collect the following data for each product:
- Product image URL
- Product name
- Product price
Writing the Scraper
With the page inspected and the structure identified, it’s time to write the scraping code. Open your index.js file and add the following code:
const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

// URL of the website we want to scrape
const targetURL = 'https://www.example.com/products';

// Function to scrape product data
const getProducts = ($) => {
  // Select all product items on the page
  const products = $('.product-item');
  const productData = [];

  // Loop through each product item
  products.each((index, element) => {
    const product = {};

    // Extract the product image URL
    product.img = $(element).find('img').attr('src');

    // Extract the product name
    product.name = $(element).find('.product-name').text();

    // Extract the product price
    product.price = $(element).find('.product-price').text();

    // Push the product data into the array
    productData.push(product);
  });

  // Save the scraped data to a JSON file
  fs.writeFile('products.json', JSON.stringify(productData, null, 2), (err) => {
    if (err) {
      console.error('Error writing data to file:', err);
      return;
    }
    console.log('Data written to products.json');
  });
};

// Fetch the HTML of the page using Axios
axios.get(targetURL)
  .then(response => {
    const $ = cheerio.load(response.data);
    getProducts($); // Call the function to scrape the products
  })
  .catch(error => {
    console.error('Error fetching the page:', error);
  });
How This Code Works
Making the HTTP Request:
- We use Axios to make a GET request to the webpage we want to scrape (targetURL).
- The response data is the HTML content of the page.
Parsing HTML with Cheerio:
We load the HTML response using Cheerio (cheerio.load(response.data)), which gives us a jQuery-like interface to interact with the HTML.
Extracting Data:
- We use Cheerio selectors ($('.product-item'), $(element).find('img'), etc.) to find specific elements on the page.
- For each product, we extract the image URL, product name, and price.
Saving the Data:
The data is stored in an array (productData), and once all products are scraped, we write the data to a JSON file using fs.writeFile.
Running the Scraper
To run the scraper, open your terminal, navigate to your project folder, and run:
node index.js
After the script completes, a file called products.json will be created in your project folder, containing the scraped data.
Handling Errors and Edge Cases
When scraping, it’s essential to handle potential errors. For example, the webpage you’re scraping might not load, or the HTML structure might change. Here are some strategies for handling these issues:
Check for HTTP Errors: Always handle HTTP errors (e.g., 404 or 500 errors) gracefully. By default, Axios rejects the promise for non-2xx responses, so you can handle them in the .catch() block.
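To make the .catch() block more informative, you can distinguish between an HTTP error response, a request that never got an answer, and a mistake made before the request was sent. The helper below is a sketch (not part of Axios itself); the field names error.response and error.request mirror the shape of Axios’s error object:

```javascript
// Turn an Axios-style error into a readable message (sketch; not part of Axios)
const describeAxiosError = (error) => {
  if (error.response) {
    // The server replied, but with a non-2xx status (e.g. 404 or 500)
    return `HTTP error: server responded with status ${error.response.status}`;
  }
  if (error.request) {
    // The request went out, but no response came back (timeout, DNS failure, etc.)
    return `Network error: no response received (${error.message})`;
  }
  // Something went wrong before the request was even sent
  return `Request setup failed: ${error.message}`;
};

// Example with a fake 404 error object
console.log(describeAxiosError({ response: { status: 404 } }));
// HTTP error: server responded with status 404
```

In the scraper, you could call this inside the existing .catch(error => ...) block instead of logging the raw error.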
Missing Data: Not all elements might be present in every product. To handle missing data, you can check whether an element exists before trying to access its value:
product.img = $(element).find('img').attr('src') || 'default-image.jpg';
Rate Limiting and Blocking: Some websites may block scraping bots. To avoid this, you can:
- Add delays between requests.
- Use rotating proxy services.
- Respect the website’s robots.txt file, which indicates which parts of the site crawlers are allowed to access.
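For example, a simple way to add delays is a promise-based sleep helper that pauses between requests. This is a minimal sketch; the URL list is made up for illustration, and the console.log stands in for the actual fetch-and-parse step:

```javascript
// Promise-based delay helper
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Visit pages one at a time, pausing between requests (URLs are hypothetical)
const scrapePolitely = async (urls) => {
  for (const url of urls) {
    console.log('Fetching', url); // replace with axios.get(url) + parsing
    await sleep(1000);            // wait 1 second before the next request
  }
};

scrapePolitely([
  'https://www.example.com/products?page=1',
  'https://www.example.com/products?page=2',
]);
```

A fixed one-second pause is a reasonable starting point; you can also randomize the delay to look less bot-like.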
Conclusion
And there you have it! Now you know how to scrape web pages using Cheerio in Node.js. We covered everything from setting up your project and making HTTP requests to parsing the HTML and saving the data in a clean JSON file.
Web scraping is an incredibly useful tool, but always remember to follow the website’s terms of service. Some sites don’t allow scraping, and others may have anti-bot measures, so make sure your scraper is prepared for those challenges. With these basics, you can start scraping websites for the data you need. Happy scraping, and good luck with your projects!