What Is Puppeteer: A Guide To Scraping With Automation

Saheed Opeyemi
May 10, 2021

Table of Contents

1. What is Puppeteer?

2. How to Use Puppeteer for Scraping

3. Puppeteer Automation and APIs

4. Puppeteer Scraping With Scraping Robot

Over the past decade, the internet has evolved from bare-bones websites built with plain HTML and CSS to complex web apps with interactive user interfaces, built using JavaScript frameworks like Angular or React. This is good news for the average internet user, but for someone looking to perform tasks like automated web scraping, it’s a tad inconvenient.

When your browser makes a request, the server often responds with a bare HTML document plus JavaScript files that build the page content in the browser. Essentially, JavaScript has become the language of modern websites. However, most web scraping tools are designed to fetch the HTML and extract data from that, so you run into the problem of how to extract data that only appears after JavaScript has run. This is where headless browser automation and Puppeteer come in. So…what is Puppeteer?

What is Puppeteer?

Headless browser automation is a way of leveraging your browser’s ability to render JavaScript in order to automate use cases like web scraping. It is called headless because the browser runs without a graphical user interface. You don’t interact with visual elements on a screen; instead, you control the browser from code or the command line. There are several headless browser automation tools, such as Selenium (which can drive several browsers, including Firefox and Chrome), Zombie.js, and Intoli’s Remote Browser. For this article, however, we’ll focus on Google’s Puppeteer for Chrome. We’ll answer the question “what is Puppeteer?” and show how you can use it for web scraping.

So, what’s Puppeteer?

Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools protocol. It’s maintained by the Chrome DevTools team and an awesome open-source community. As we said above, headless means the browser has no GUI and you drive it from code. Puppeteer’s API lets you take remote control of headless Chromium instances and use them to render the JavaScript elements of a webpage just as a regular browser like Chrome would. Puppeteer runs headless by default but can also be configured to run full (non-headless) Chrome or Chromium. The tool has become immensely popular since its launch because of the wide range of features it offers with very lightweight code. According to the project’s official documentation, two packages are maintained for Puppeteer on GitHub:

  • puppeteer-core: This is the core library, a lightweight package that can drive any browser that supports the DevTools protocol. It is used through its programmatic interface and does not download Chromium when installed, so you point it at a browser you already have.
  • puppeteer: This is the main package, a complete product for browser automation. When installed, it downloads a version of Chromium, which it then drives using the puppeteer-core library.

Essentially, puppeteer-core is the engine of this automation tool, while puppeteer is the batteries-included package that most end users install. Now let’s look at the steps to get Puppeteer working with a headless version of Chromium.
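To make the difference between the two packages concrete, here is a minimal sketch of how launching might look with each. The helper names (coreLaunchOptions, launchWithCore, launchWithFull) are made up for this illustration, and the decision to throw when no browser path is given is this sketch's own convention, not something the library prescribes. The browser path you pass in is whatever Chrome or Chromium binary exists on your machine.

```javascript
// Sketch only: helper names are hypothetical, not part of Puppeteer.

// puppeteer-core does not download a browser, so the launch options
// must tell it where an existing Chrome/Chromium binary lives.
function coreLaunchOptions(executablePath) {
  if (!executablePath) {
    throw new Error('puppeteer-core needs the path to an installed browser');
  }
  return { executablePath, headless: true };
}

// Launch with puppeteer-core (assumes `npm install puppeteer-core`).
async function launchWithCore(executablePath) {
  const puppeteer = require('puppeteer-core');
  return puppeteer.launch(coreLaunchOptions(executablePath));
}

// Launch with the full package (assumes `npm install puppeteer`),
// which uses the Chromium it downloaded at install time.
async function launchWithFull() {
  const puppeteer = require('puppeteer');
  return puppeteer.launch();
}
```

With the full package you would simply call launchWithFull(); with puppeteer-core you would call launchWithCore with a path such as the one your operating system uses for Chrome.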

To get started:

  • If you don’t have Node.js 8+ installed on your system, download and install it from the official Node.js website.
  • You also need npm, the Node package manager. It is normally installed together with Node.js.
  • Create a project directory from the command line and install the puppeteer package in it with npm (npm install puppeteer). The installation might take some time because Puppeteer downloads Chromium in the background.

Once you do this, you are ready to use Puppeteer to control Chromium instances on your system through its API.
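One quick way to confirm that the setup steps above worked is to launch the browser and ask it for its version. The following is a minimal sketch, assuming the puppeteer package has been installed in the current directory. The function name checkInstall is made up for this example; the library is passed in as an argument so the helper can also be exercised without a real browser.

```javascript
// Launches a browser, reads its version string, and always closes it.
// `puppeteer` is the object returned by require('puppeteer').
async function checkInstall(puppeteer) {
  const browser = await puppeteer.launch(); // headless by default
  try {
    return await browser.version();
  } finally {
    await browser.close(); // runs even if version() throws
  }
}

async function main() {
  const puppeteer = require('puppeteer'); // assumes `npm install puppeteer` finished
  console.log('Browser version:', await checkInstall(puppeteer));
}

// Uncomment to run against the real browser:
// main().catch(console.error);
```

If the script prints a version string, Node.js, npm, and the downloaded Chromium are all in place.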

How to Use Puppeteer for Scraping

Know those websites that are always asking you to verify that you are human before you can perform any actions? CAPTCHA? Very annoying, right? I mean, a bunch of code asking me if I’m human. That’s just hilarious. However, while this is usually just a minor inconvenience for you and me, a web scraping automation tool might feel otherwise. Being unable to click buttons and select boxes with trees, it might find itself unable to carry out its web scraping functions.

Or perhaps your web scraping software sends a request for data from a particular web page and gets some values back empty because the page loaded too slowly. Or, as we mentioned earlier, your bot can’t scrape data rendered by JavaScript because it was designed to scrape only static HTML. Puppeteer helps with all of these problems.

Puppeteer can perform almost any action that an actual human would perform on a website. This includes filling in forms (CAPTCHA forms among them, although solving the challenge itself may need extra tooling), waiting for page elements to load, navigating a page, etc. This means you have an automation tool that can stand in for most standard data extraction tools. For example, Puppeteer has functions that wait for a selector to appear (waitForSelector), for a navigation to finish (waitForNavigation), or for a custom function to return a truthy value (waitForFunction). This allows it to deal with the asynchronous flow of data from the server.
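As a rough illustration of those waiting functions, the sketch below wraps them in two small helpers. The helper names and the '.item' selector are assumptions made up for this example; the page argument is a Puppeteer Page object such as the one returned by browser.newPage().

```javascript
// Clicks a link and waits for the navigation it triggers to finish.
async function clickAndWait(page, linkSelector) {
  // Wait until the link exists in the DOM before touching it.
  await page.waitForSelector(linkSelector);
  // Start waiting for the navigation before clicking, so it is not missed.
  await Promise.all([
    page.waitForNavigation({ waitUntil: 'networkidle0' }),
    page.click(linkSelector),
  ]);
}

// Waits until at least `minCount` elements matching '.item'
// (a hypothetical selector) have been rendered by the page's JavaScript.
async function waitForItems(page, minCount) {
  await page.waitForFunction(
    (n) => document.querySelectorAll('.item').length >= n, // runs in the browser
    { timeout: 10000 }, // give up after 10 seconds
    minCount // passed to the function above as `n`
  );
}
```

The function given to waitForFunction runs inside the browser page, which is why it can use document even though the rest of the script runs in Node.js.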

Say you want to scrape data from the Scraping Robot homepage. First, you call the newPage method on the browser. This creates a Page object, which represents a single tab in your web browser. Then you use the goto method to direct the newly created page to the Scraping Robot homepage (the URL is the argument here, and you can replace it with any website of your choice). Next, let’s say you request the title of the page’s main frame. Because the Scraping Robot homepage shows an entry page before redirecting to the homepage (an entry page is a page that loads first before redirecting you to the URL you requested), the result comes back as an empty string: the request ran before the actual homepage loaded, so it returned the title of the entry page, which is empty. There is an easy solution to this. By calling the waitForSelector function with ‘title’ as the argument, you make Puppeteer wait until a title element is present on the page before it returns a result. This is a practical example of using Puppeteer to carry out a web scraping task that would stump traditional scraping software.
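The walkthrough above might look something like the following in code. This is a sketch rather than the article's original example: the function name scrapeTitle is made up, and the homepage URL is an assumption that you can swap for any site.

```javascript
// Opens `url` in the given page, waits for a <title> element,
// and returns the title of the page's main frame.
async function scrapeTitle(page, url) {
  await page.goto(url);
  await page.waitForSelector('title'); // without this, the title may still be empty
  return page.title(); // shortcut for page.mainFrame().title()
}

async function main() {
  const puppeteer = require('puppeteer'); // assumes `npm install puppeteer`
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage(); // one Page object = one browser tab
    // Assumed URL for the Scraping Robot homepage; replace as needed.
    console.log(await scrapeTitle(page, 'https://scrapingrobot.com/'));
  } finally {
    await browser.close();
  }
}

// Uncomment to run against the real browser:
// main().catch(console.error);
```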

While scraping is one of the major uses of Puppeteer, it is not the only one. As we said above, almost any action a human can perform on a webpage can be simulated by Puppeteer. You can simulate mouse actions, simulate keyboard input, emulate different devices including mobile devices, take screenshots, test website performance, and much more. For device emulation, the Puppeteer library even comes with a built-in list of device descriptors and an emulate method that serves as a shortcut for calling the setUserAgent and setViewport functions, which set the user agent string and viewport of a page. It’s the automation tool of the century! (Right after electric toasters, lol.)
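Here is a small sketch of a few of these capabilities: device emulation, screenshots, and keyboard input. To avoid depending on the name of any particular built-in device, it defines its own device descriptor; the descriptor values, the user agent string, and the helper names are all made up for this illustration.

```javascript
// Hypothetical device descriptor in the shape page.emulate expects:
// a user agent string plus a viewport definition.
const examplePhone = {
  userAgent:
    'Mozilla/5.0 (iPhone; CPU iPhone OS 13_0 like Mac OS X) ExampleBrowser/1.0',
  viewport: {
    width: 375,
    height: 667,
    deviceScaleFactor: 2,
    isMobile: true,
    hasTouch: true,
  },
};

// Loads `url` as the emulated phone and saves a full-page screenshot.
async function captureMobile(page, url, outPath) {
  await page.emulate(examplePhone); // sets user agent and viewport together
  await page.goto(url);
  await page.screenshot({ path: outPath, fullPage: true });
}

// Types text into an input field and presses Enter, like a user would.
async function typeAndSubmit(page, inputSelector, text) {
  await page.type(inputSelector, text); // sends one key press per character
  await page.keyboard.press('Enter');
}
```

Calling emulate before goto matters here, because the page should be requested and laid out with the mobile settings already in place.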

Puppeteer Automation and APIs

If you remember, we said Puppeteer provides an API for automating use cases on headless Chrome. API stands for Application Programming Interface, a piece of software that serves as a go-between for disparate software systems or web applications. Now, if there’s one thing I love about APIs, it’s that they play nicely with each other. Because Puppeteer is driven through an API, it opens up a lot of opportunities to set up a data collection funnel with Puppeteer as the starting point and stretch it all the way to wherever you want. You can set up the tool to interface with another API, for example to display the extracted data directly in the UI of another web application. Or you can plug in another piece of software (your database, for example) and have Puppeteer feed data directly into it.
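One way to picture such a funnel is a function that extracts data from a page and hands it to a pluggable "sink". The sketch below is only an illustration under assumptions: the names extractLinks, scrapeInto, and jsonFileSink are made up, links are used as stand-in data, and a JSON file stands in for the database or downstream API you would use in practice.

```javascript
const fs = require('fs');

// Collects the text and target of every link on the current page.
async function extractLinks(page) {
  return page.$$eval('a', (anchors) =>
    // This callback runs inside the browser page.
    anchors.map((a) => ({ text: a.textContent.trim(), href: a.href }))
  );
}

// Visits `url`, extracts the rows, and passes them to `sink`.
// `sink` is any async function that accepts an array of rows.
async function scrapeInto(page, url, sink) {
  await page.goto(url);
  const rows = await extractLinks(page);
  await sink(rows);
  return rows.length;
}

// Example sink: writes the rows to a JSON file. A database insert or an
// HTTP call to another service could be plugged in the same way.
function jsonFileSink(filePath) {
  return (rows) =>
    fs.promises.writeFile(filePath, JSON.stringify(rows, null, 2));
}
```

With a real page object, a call might look like scrapeInto(page, someUrl, jsonFileSink('links.json')). Keeping the sink separate means the scraping code does not change when the destination does.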

APIs make it extremely easy to collect and transfer data. With Puppeteer running on Chrome, you can extract eCommerce data and update it directly on your eCommerce websites, or scrape SEO data, social media data, and business information from sites like Yellow Pages, and send the data directly to where you need it. Whichever is relevant to you, the Puppeteer tool lets you take your data extraction to the next level. Of course, you must remember to follow proper scraping ethics and make sure that you do not use this tool to violate the privacy of individuals or corporations on the internet.

Puppeteer Scraping With Scraping Robot

Always striving to be at the forefront of innovation in the data extraction industry, we at Scraping Robot have integrated this wonderful piece of automation into our scraping service and are proud to tell you that you don’t have to set up your own software. We will take care of everything for you. With our web scraping modules, you can extract data from virtually any website on the internet, at high speed and with little to no restrictions, thanks to our HTML scraper and the Puppeteer tool. We have gone through the stress so you don’t have to. Just sign up now and you are good to go. And if you are trying to scrape a website that we do not have a prebuilt module for, our developers are on standby to help you build your own custom scraping solution.

So now that we have helped you define “what is Puppeteer?” and we are ready to put it to work for you, what are you waiting for?

Final Thoughts

Answering the question of “what is Puppeteer?” is relatively straightforward. The question of what is possible with this interesting piece of software, however, is much broader and more difficult to answer properly. One thing we do know, though, is that when it comes to web scraping, Puppeteer offers an added advantage and solves a problem that many web scraping services have been dealing with for a while. And for that alone, you should get in touch with us at Scraping Robot today and take advantage of this wonderful piece of software to solve your web scraping and data needs.
