Web Scraping with RSelenium and Rvest
Dynamic web pages often cause web scrapers to stumble. They are more demanding of the user, often requesting logins or CAPTCHA verifications before letting anyone get past them. With the RSelenium package, along with the rvest package, you can successfully scrape dynamic web pages.
Rvest works with magrittr to simplify common web scraping tasks, and it is similar in spirit to Beautiful Soup and RoboBrowser. This guide will help you achieve better scraping results by using rvest in R. We’ll show you how to scrape the web using R and RSelenium with ease.
The Demands of Dynamic Web Pages When Web Scraping
Some websites require users to input information before they can get past that step to the content they actually want. Static web pages contain the same information for anyone who visits. They are updated over time, but on any given visit, no matter who you are, the content is the same.
Dynamic web pages are the opposite. They produce content that’s specific to your request. It’s much like using the filters on an e-commerce page to find the very specific size of a type of clothing. These sites require interaction of some type.
We have provided a variety of tutorials on how to scrape dynamic web pages. But there’s another tool you need to know about before going further. With rvest, you can scrape data from static web pages. However, if you want to scrape data using rvest that’s on a dynamic page, you’ll need to help it. That’s where RSelenium comes into play.
If you haven’t yet, take some time to learn a bit more about Selenium for web scraping. You will need a modern version of Selenium to access the RSelenium package. You will also need to have Java installed.
How to Use RSelenium and Rvest for Web Scraping
Before going further, you’ll need to learn how to install rvest in R and a few other tools. We’ll provide insight here on how to use RSelenium along the way. To get started, you will need to download the following packages.
Load the required R packages with the following code:
library(tidyverse)
library(rvest)
library(RSelenium)
Next up, you need to start Selenium. Use rsDriver() to do so, and keep a handle on the client object it returns:
rD <- RSelenium::rsDriver(browser = "chrome")
remDr <- rD$client
When you do this, a new Chrome window should open. Now that you have the first steps in place, let’s explore the overall use of these tools.
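One housekeeping note before moving on: when you are finished scraping, it is good practice to shut everything down cleanly. A minimal sketch, using the rD and remDr objects created above:
remDr$close() # close the browser session
rD$server$stop() # stop the Selenium server process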
Basic Usage Steps and Definitions
There are dozens of methods you could use as a component of this process. The following are a few of the methods you are most likely to need for the remDr client object created previously.
- Navigate to a given URL: Use navigate(url), for example remDr$navigate("https://www.google.com/")
- Go back: Use goBack() to go back one page in the browser history
- Go forward: Use goForward() to move forward one page in the browser history
- Reload the page: To reload the current page, use refresh()
- Retrieve the URL: To capture the URL of the current page, use getCurrentUrl()
- Page source: To obtain the page source for the page you are on, use getPageSource(); it returns a list, so take the first element with [[1]]. You will use this method with rvest to scrape dynamic web pages.
- Read the content: The source produced by the previous step can be read using rvest::read_html(). This produces an XML document object that rvest’s other functions can work with. A short sketch putting these methods together follows this list.
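Here is a minimal navigation sketch, assuming the remDr client created earlier (the URL is just an example):
remDr$navigate("https://www.google.com/") # open a page
remDr$goBack() # back one page in the history
remDr$goForward() # forward again
remDr$refresh() # reload the current page
remDr$getCurrentUrl() # the current URL, returned as a list
page <- rvest::read_html(remDr$getPageSource()[[1]]) # hand the page source to rvest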
These details give you the actual code to use for various applications. If you haven’t figured it out yet, learning how to use rvest will take some knowledge of coding and some skill.
Now, let’s outline the elements you need to know to take this process one step further.
- Search for an element: To search for the element you need on a page, starting from the document root, use findElement(using, value). To help you with this process, you will need some insight into HTML and CSS, or into XPath. Our guide is robust and teaches you the details.
- Highlight the current element: To check that you selected the desired element, use highlightElement(), a utility function that briefly highlights the current element in the browser.
- Send a sequence of keystrokes: You’ll likely need to send a sequence of keystrokes to an element, which is done with a list passed to sendKeysToElement(). Unnamed elements of the list are sent as plain text, while special keystrokes come from the selKeys list and are passed as entries named "key".
- Clear the element’s value: To clear the value of a TEXTAREA or text INPUT element, use clearElement().
- Click the element: If you want the task to click on the element, use clickElement(). That will allow you to check a box, click on a link, or even use drop-down lists. A short sketch combining these methods follows this list.
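Here is a minimal sketch combining those methods, assuming the remDr client from earlier and Google’s search box (the input[name='q'] selector is an assumption about that page’s markup):
search_box <- remDr$findElement(using = "css selector", value = "input[name='q']")
search_box$highlightElement() # flash the element to confirm the match
search_box$clearElement() # clear any existing text
search_box$sendKeysToElement(list("web scraping", key = "enter")) # type a query, then press Enter
# To click a link or button instead, locate it and call clickElement(), e.g.:
# remDr$findElement(using = "css selector", value = "button[type='submit']")$clickElement()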
You have most of the tools you need at this point. There are a few more steps to learning the details of this process.
How to Select Nodes, Parse, and Navigate
In some projects, you may benefit from using CSS selectors to locate nodes. To achieve that, you’ll need to know the following (a short sketch showing where the page object comes from follows the list):
- html_nodes(page, "div") # all div elements
- html_nodes(page, "div#intro") # div with id intro
- html_nodes(page, "div > p") # p inside div
- html_nodes(page, "div#intro.featured") # div with both id and class
- html_nodes(page, "ul > li:nth-child(2)") # second li in ul
- html_nodes(page, "div.results") # div with class results
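The page object in those calls is just a parsed document. It can come from read_html() on a URL for a static page or, for a dynamic page, from the Selenium page source; a minimal sketch, assuming the remDr client from earlier:
page <- read_html("http://example.com") # a static page
page <- read_html(remDr$getPageSource()[[1]]) # or a dynamic page rendered by the browser
intro_paras <- html_nodes(page, "div#intro > p") # then select nodes with CSS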
Once you have selected nodes, you need to extract their contents, as that’s what allows you to scrape the content in a useful manner. Here’s a cheat list for pulling out text, HTML, and attributes (the node sets these lines operate on are shown in the sketch after the list):
- text <- html_text(nodes) # text content
- html <- xml2::xml_contents(nodes) # inner contents (elements and text) of each node
- imgs <- html_attr(img_nodes, "src") # an attribute, here the image source
- hrefs <- html_attr(links, "href") # an attribute, here the link target
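The nodes, img_nodes, and links objects in that list are simply earlier selections, for example:
nodes <- html_nodes(page, "div.results") # the nodes whose text you want
img_nodes <- html_nodes(page, "img") # all images on the page
links <- html_nodes(page, "a") # all links on the page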
Let’s say you want to extract information from a website. That website presents the information in a table format. To instruct your code to capture that information, use:
tables <- html_table(html_nodes(page, "table"))
df <- tables[[1]] # extract the first table as a data frame
In some situations (and increasingly so), you need to navigate more complex page structures. When you have more challenging queries that CSS selectors can’t express cleanly, you’ll often want to use XPath. Here’s a sample for you:
html_nodes(page, xpath = '//*[@id="intro"]/p') # XPath selector
html_text(html_nodes(page, xpath = '//p[@class="summary"]')) # XPath select and extract text
On to parsing! You’re getting there with your project, but the next step is to parse the document structure to pinpoint the specific information you need. The code you use is dependent on what you’re capturing. Here are some bits of code that may apply in your situation:
url <- "http://example.com"
page <- read_html(url)
title <- html_text(html_nodes(page, "title"))
h1 <- html_text(html_nodes(page, "h1"))
links <- html_nodes(page, "a") # all links
Here, we capture the page title, the H1 headings, and all of the links on the page. But we also need to move from the first page to the next one. To follow one of those links, pull out its href attribute and read that page:
other_page <- read_html(html_attr(links[[12]], "href")) # follow the twelfth link by its href
# Submitting a login form requires a session and a parsed form; the next example shows how.
You’ll need to enter the necessary login information in this situation; that’s how you get past this particular limitation of dynamic web pages. To get through a login page, the following code, customized with the site’s URL, form field names, and your credentials, will work in most situations (in rvest 1.0 and later these helpers are named session(), session_jump_to(), html_form_set(), and session_submit()):
session <- html_session("http://example.com") %>%
  jump_to("login")
login_form <- html_form(session)[[1]] %>%
  set_values(username = "user123", password = "secret")
session <- submit_form(session, login_form) %>%
  jump_to("account")
Now, we need to scrape that data using rvest. Use the following code filled in with the website you’re after:
page <- rvest::html_session("http://example.com")
page <- rvest::jump_to(page, "dynamicContent")
html <- rvest::html_text(rvest::read_html(page))
To extract element names, consider the following code:
names <- html_name(nodes)
Extract child nodes using the following:
children <- html_children(node)
Extract sibling codes using the following:
siblings <- html_siblings(node)
Now, consider a few strategies for successfully using rvest and RSelenium. To interact with dynamic pages, you’ll want to use the following RSelenium code examples.
library(RSelenium)
driver <- rsDriver(browser = "chrome")
remDr <- driver$client # rsDriver() returns both a server and a client
remDr$navigate("http://example.com")
page <- read_html(remDr$getPageSource()[[1]])
html_text(html_nodes(page, "#dynamic-content"))
remDr$close()
driver$server$stop() # shut down the Selenium server
If your data is in XML rather than HTML, you can parse it with xml2, the package rvest builds on. To do that, use the following code:
library(xml2)
xml <- read_xml("data.xml")
nodes <- xml %>% xml_find_all("//item")
Putting Scraping Robot to Work on Your Project
Did you know that Scraping Robot offers an API tool that can help you with all of your web scraping tasks? When you need quality resources to build a successful web scraper, using rvest and RSelenium are solid choices. There are a variety of tools available to help you with web scraping.
When you are ready for a straightforward solution that gets the job done, turn to Scraping Robot for more information and the support you need to complete your project with ease. Contact us now.