Web Scraping With Cheerio: How Is Cheerio Web Scraping Different than Puppeteer?

Scraping Robot
June 1, 2023
Community

You’ve come across a website that has your desired data. But the website does not expose an API you can use to access this data. What’s the fix? Web scraping, particularly Cheerio web scraping.

Before you scrape a website, know whether it’s legally and ethically allowed. Most websites have a ToS (terms of service) page that states what you can and cannot do with the website. Comply with the ToS to stay ethically sound and avoid legal repercussions.

With that established, let’s dive into web scraping with Cheerio. We also provide a Cheerio vs. Puppeteer comparison to help you decide which of the two is the better option for your project.

What Is Cheerio Web Scraping?

Cheerio is an HTML and XML parsing library for Node.js that is widely used for web scraping. It lets you extract data from XML or HTML documents with a jQuery-like syntax.

Since Cheerio offers a simple API to traverse and manipulate a document’s structure, it allows you to extract specific data points and elements from any document it has parsed.

Cheerio also lets you do the following:

  • Select elements using CSS selectors
  • Access attributes of data elements
  • Manipulate the DOM (Document Object Model)
  • Create custom functions to filter data
  • Extract HTML or text content

How To Scrape a Web Page With Node.js and Cheerio

You’ll need a few prerequisites for Cheerio web scraping. They include:

  • Node.js (if your system does not have Node, download it first)
  • A text editor such as Atom or VSCode

Besides having these tools, you should also have a basic understanding of Node.js, JavaScript, and the DOM. But even if you’re a beginner, you can follow the steps below for scraping a static web page using Cheerio.

For this example, we’ll scrape this Wikipedia page containing a list of movies based on Marvel comics. Here’s how to get started.

Create a working directory

The first step is to create a working directory for your project. Enter this command:

mkdir learn-cheerio

It will create a working directory called learn-cheerio. You can use a different name if you want. After you run the command above, you’ll see a folder called learn-cheerio in your current directory.

Open the directory you’ve created in the text editor of your choice. For example, you can use VSCode.

Start the project

To initialize the project, open a terminal in the directory you created and run the following command:

npm init -y

When you run this command, a package.json will be created at the root of your directory.

Install dependencies

A dependency is a piece of code shared by other developers to help you get up and running quickly. Install the dependencies for your Cheerio web scraping project using the following command:

npm i axios cheerio pretty

The command may take a moment to run. After it runs, you will see three dependencies in the dependencies field of the package.json file. These three dependencies are:

  • Axios
  • Cheerio
  • Pretty

Axios fetches markup from the target website. Cheerio then parses this markup. However, if you want to use another HTTP client instead of Axios, you can do that too.

Pretty is a package that beautifies the markup so it is readable when printed to the terminal.

Inspect the web page

You must inspect a web page’s HTML structure before scraping it. Go to the page mentioned above on Wikipedia. Under the Feature Films section, you will see the names of all films made based on Marvel publications.

Press the CTRL + SHIFT + I keys together in Google Chrome to see the page in DevTools. Or, you can right-click on the section and click “Inspect.”

Write code

Now, you have to write the code for Cheerio web scraping. To do this, run this command to create the app.js file:

touch app.js

Before you can scrape anything, you must load Pretty, Cheerio, and Axios. You can load them by adding these lines at the top of the app.js file.

const axios = require("axios");

const cheerio = require("cheerio");

const pretty = require("pretty");

After adding these lines, you’re all set to write the code for web scraping. freeCodeCamp has sample code you can use for your projects. When using the code, make changes according to the website you need to scrape.

Navigating Cheerio for Web Scraping

Now that you know the basics, let’s learn how to perform some basic actions in Cheerio.

Selecting an element

Cheerio supports common CSS selectors, like element, id, and class. 

First, you need to load the document where the element is present. After loading the document, use the $ function to find your desired element. Suppose you want to select the ‘s’ elements in a document. The code to select them will be:

const $s = $("s");

If the element has a specific class name, use this selector:

const $selected = $(".selected");

If it has a specific attribute value, use this selector:

const $selected = $("[data-selected=true]");

Getting the attribute of an element

You can also read a particular attribute of an element, such as its id or class. For this example, let’s say the element you want has the class B__F. Select it first, then pass the attribute name to attr():

const F = $(".B__F");

console.log(F.attr("class"));

Web Scraping With Puppeteer and Cheerio: How Do They Differ?

Web scraping with Puppeteer differs from web scraping with Cheerio in many regards. Here are some of them.

Nature

Cheerio web scraping is based on the DOM parsing function of the library. Simply put, the library can parse HTML and XML files to help you find desired data.

Puppeteer is a Node.js library the Chrome team has developed. It gives you a high-level API that you can use to control and automate Chromium and Chrome browsers programmatically. You can use Puppeteer to generate screenshots, fill out forms, click buttons, scrape websites, and perform other such tasks.

Website rendering

Cheerio only parses markup. It does not execute JavaScript or apply CSS, so it won’t render websites like a browser would. It also does not load external resources like images, videos, and iframes.

Due to this, Cheerio web scraping is not ideal for single-page applications developed with React or other front-end technologies.

On the other hand, Puppeteer can execute JavaScript. So, it lets you scrape single-page applications and other dynamic pages.

DOM manipulation

Both Cheerio and Puppeteer can let you scrape a website, but only the latter lets you interact with it. Puppeteer works on the live DOM in the browser. For example, you can query to locate HTML elements, click buttons, fill out forms, and navigate between pages.

Cheerio web scraping is mainly focused on parsing HTML structure. Any changes you make with Cheerio apply only to its in-memory copy of the document, so you cannot use it to interact with the website, such as clicking a button or scrolling the page. Instead, you can use it to transform data and extract desired information.

Dependencies

With Cheerio, you install everything you need from the npm package manager. No additional configuration is required.

Puppeteer depends on a Chrome or Chromium browser. By default, installing Puppeteer also downloads a compatible browser build, so it has a much larger installation footprint.

Learning curve

Cheerio web scraping is much easier to learn compared to Puppeteer web scraping. The former will be even easier for people with a previous understanding of jQuery.

Meanwhile, Puppeteer has more functionalities, making it harder to grasp. Plus, its API is asynchronous, so you need to be comfortable with promises and async/await, which can be challenging to learn.

Speed

Puppeteer web scraping can be slower due to the need to render pages in a headless browser. Since Cheerio web scraping doesn’t require this, it is faster.

When To Choose Cheerio Web Scraping

Cheerio web scraping is the optimal choice if you need to scrape a static website without the need for interactions, like form submission and clicks. In contrast, Puppeteer is a good choice if you want to interact with the page elements.

If you’re struggling to choose between the two, do some Puppeteer/Cheerio example web scraping with different websites to see which works best for your needs.

But if both are too complex for you, invest in a scraping API, such as Scraping Robot, instead. With Scraping Robot, you can scrape websites into JSON without worrying about browser scaling, proxy management, CAPTCHAs, or blocks.

You won’t have to worry about JavaScript implementations when using this web scraping API. Scraping Robot ensures that a page’s JavaScript has loaded before presenting you with the content. It also parses the metadata, allowing you to find your desired information with ease. Register today to get started with free credits.

Takeaway

Web scraping is a need for many businesses today. From price comparisons and consumer analysis to SEO and social media management, you need web scraping to get the most out of business tasks.

Cheerio web scraping is an ideal route to gathering the data you need for informed decision-making. Since Cheerio is quick and versatile, it can help you scrape static websites with ease. And when things get too complex, simplify them with a scraping API like Scraping Robot.
