Pandas Read HTML: How To Use Pandas To Read In The HTML File
Web scraping has become a popular technique for obtaining data from websites that do not provide an API. While web scraping can be performed manually by a user, the term usually refers to automated processes using a bot or web crawler.
Python enthusiasts have come up with various web scraping tools. The pandas read_html() function is one of them, allowing users to read tables from a string, URL, or file. It builds on parsers such as lxml and Beautiful Soup, so you don't have to extract tables by hand, and it returns the tables on an HTML page as a list of DataFrame objects.
Below, you’ll learn how to read HTML in pandas and apply it in practical settings.
What Is Pandas read_html()?
Pandas read_html() is a function that reads HTML tables into a list of DataFrame objects. The function is part of pandas itself. To parse the HTML, it tries lxml by default and, if that fails, falls back on beautifulsoup4 with html5lib.
To see what the function saves you from writing, consider doing the same job by hand with Beautiful Soup. You first parse the HTML page into a BeautifulSoup object. Once the object is created, you can use its .find_all() method to search for any tag you want.
The most common search is to look for all tables, which is done by searching for the “table” tag.
Once you have a list of all the table tags, you can loop through each one and use .find_all() again to collect its tr (table row) and td (table data) tags.
The .find_all() method returns a list of all the matching tags, which you can then loop through and extract the text from. The .find() method, by contrast, returns only the first matching tag.
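The manual approach described above can be sketched roughly as follows. This is a minimal illustration rather than what pandas does internally: the sample markup and the variable names are made up for the example, and it assumes the beautifulsoup4 package is installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only for illustration.
html = """
<table>
  <tr><th>Team</th><th>Points</th></tr>
  <tr><td>Alpha</td><td>10</td></tr>
  <tr><td>Beta</td><td>7</td></tr>
</table>
"""

# "html.parser" ships with Python, so no extra parser backend is required.
soup = BeautifulSoup(html, "html.parser")

rows = []
for table in soup.find_all("table"):      # find_all() returns a list of tags
    for tr in table.find_all("tr"):
        # Header cells are <th>, data cells are <td>.
        cells = tr.find_all(["th", "td"])
        rows.append([cell.get_text(strip=True) for cell in cells])

first_row = soup.find("tr")               # find() returns only the first match

print(rows)
```

With read_html(), all of this looping is replaced by a single call that returns DataFrames.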
How to use Pandas to read in the HTML file
Before using the read_html() function, you need to install lxml. It is an external module that is not included with the standard Python distribution. You can install it using pip:
pip install lxml
Once you have installed lxml, you can use read_html() to read in an HTML file as a list of DataFrame objects. Each DataFrame is a table element from the HTML file.
Suppose you want to read a table of FIFA teams ranking from this page. You can use the following commands to read all tables on this page:
import pandas as pd
#read all HTML tables from specific URL
tabs = pd.read_html('https://int.soccerway.com/teams/rankings/fifa/')
If you want to determine the total number of tables on this page, you can use the following commands:
#display total number of tables read
len(tabs)
44
The number printed is the count of tables that read_html() found on the page. If the page contains multiple tables and you are only interested in the one with the word “Scotland” in it, you can pass that word through the match argument.
#read HTML tables from specific URL with the word "Scotland" in them
tabs = pd.read_html('https://int.soccerway.com/teams/rankings/fifa/',
                    match='Scotland')
#display total number of tables read
len(tabs)
1
You can modify the command to get the specific table you want. For example, you may want to get the first table on the page, or you may only want to see the first four columns of the table.
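As a rough sketch of those two selections, the example below runs on a small inline HTML snippet instead of the live page, so it works offline. The table contents and variable names are invented for illustration. It assumes pandas and a parser backend such as lxml are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical two-table page, used only for illustration.
html = """
<table>
  <tr><th>Rank</th><th>Team</th><th>Points</th><th>Change</th><th>Region</th></tr>
  <tr><td>1</td><td>Alpha</td><td>1800</td><td>0</td><td>North</td></tr>
  <tr><td>2</td><td>Scotland</td><td>1500</td><td>1</td><td>North</td></tr>
</table>
<table>
  <tr><th>Note</th></tr>
  <tr><td>Updated monthly</td></tr>
</table>
"""

# Every table on the "page".
all_tabs = pd.read_html(StringIO(html))

# Only the tables whose text matches "Scotland".
matched = pd.read_html(StringIO(html), match="Scotland")

# The first table in the returned list.
first_table = matched[0]

# All rows, but only the first four columns.
first_four = first_table.iloc[:, :4]

print(len(all_tabs), len(matched))
print(first_four)
```

Because read_html() always returns a list, indexing with [0] is needed even when only one table matches.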
How to use Pandas to read HTML from a string
Before you use Pandas to read HTML from a string, you need to install Pandas using conda or pip commands.
pip3 install pandas
conda install pandas
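Once you’ve done that, you can create a Python file and assign the HTML to a variable. According to the official pandas documentation, the string passed to read_html() can represent the HTML itself or a URL. Since pandas 2.1, passing literal HTML as a plain string is deprecated, so wrap it in io.StringIO.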
If you’re using lxml, it will only accept the following protocols:
- ftp
- file URL
- http
Therefore, if you’re using a URL that begins with “https,” you might try removing the ‘s’ from the scheme. After assigning the HTML to a variable, you can run the read_html function.
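import pandas as pd
from io import StringIO
#html is a string holding the HTML; StringIO avoids the pandas 2.1+ deprecation warning
df_list = pd.read_html(StringIO(html))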
The function will extract the data from HTML tables, showing you the list of tables. If you know the number of tables in the string, you can confirm that Pandas has read all of the DataFrames by using the following command:
print(len(df_list))
# OUTPUT: 1
If your string only has one table, the output will be 1. Finally, if you want to see the contents of the table in your string, you can use this command:
print(df_list[0])
It will print the data extracted from the first HTML table.
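Put together, the steps in this section might look like the following self-contained sketch. The HTML content is a made-up sample, and the string is wrapped in io.StringIO because recent pandas versions deprecate passing literal HTML directly.

```python
from io import StringIO

import pandas as pd

# Hypothetical HTML held in a string variable.
html = """
<table>
  <thead><tr><th>Name</th><th>Score</th></tr></thead>
  <tbody>
    <tr><td>Ana</td><td>91</td></tr>
    <tr><td>Ben</td><td>84</td></tr>
  </tbody>
</table>
"""

# Parse every <table> in the string into a DataFrame.
df_list = pd.read_html(StringIO(html))

print(len(df_list))   # number of tables found
print(df_list[0])     # contents of the first table
```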
How to read HTML in Pandas through a URL
Pandas read_html() can also accept a URL. You can read HTML tables from websites directly into a pandas DataFrame by passing the URL to the read_html() function.
The function will return one DataFrame for each table on the page. In this Pandas read HTML example, the following URL is used: https://int.soccerway.com/teams/rankings/fifa/
To list the DataFrames, paste the following command:
URL = 'https://int.soccerway.com/teams/rankings/fifa/'
dfs = pd.read_html(URL)
The result is a list of DataFrames. You can now type len(dfs) to count the tables found at the URL.
How to read HTML in Pandas through a file
You can also read HTML tables from a local file by passing the file path to the read_html() function. Suppose you have saved an HTML file called “table.html” in your working directory. The file path would be:
file_path = 'table.html'
Now, to read this table into a pandas DataFrame, run the following code:
file_path = 'table.html'
#pass the open file object so pandas reads the HTML from it
with open(file_path, 'r') as f:
    dfs = pd.read_html(f)
dfs[0]
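For a version you can run without preparing a file first, the sketch below writes a small made-up table to a temporary “table.html” and then reads it back by path. The file contents and the temporary directory are assumptions for the example. read_html() also accepts a path string directly, so opening the file yourself is optional.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical file contents, used only for illustration.
html = (
    "<table>"
    "<tr><th>City</th><th>Population</th></tr>"
    "<tr><td>Oslo</td><td>700000</td></tr>"
    "</table>"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = Path(tmp_dir) / "table.html"
    file_path.write_text(html, encoding="utf-8")

    # Pass the path as a string; pandas opens and parses the file.
    dfs = pd.read_html(str(file_path))

df = dfs[0]
print(df)
```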
How To Use Pandas Read HTML for URLs Requiring Authentication
Sometimes, the website you’re trying to read data from will require authentication. In these cases, you can use Pandas to read HTML from the website by providing your credentials.
You will see the following exception when you try to run the normal code:
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 401: UNAUTHORIZED
To access these websites, you need to install the requests library.
$ pip install requests
Now, you can use the get() function in the requests module to read the HTML. First, you must pass your credentials (username and password) as a tuple in the auth parameter.
For example, suppose you needed authentication to get data from the website mentioned earlier, https://int.soccerway.com/teams/rankings/fifa/. Use the following command:
import requests
r = requests.get('https://int.soccerway.com/teams/rankings/fifa/', auth=('john', 'johnspassword'))
print(r.status_code)
print(r.text)
You accessed the URL’s content successfully if you see the following:
200
{
  "authenticated": true,
  "user": "john"
}
However, this sample response contains only JSON data, and you need HTML table elements. When the authenticated page returns HTML, pass the response body, r.text, to read_html() to parse its tables.
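One way to combine the two libraries is sketched below. The helper name read_tables_with_auth and its optional session parameter are hypothetical, introduced only for this example. It assumes the site uses HTTP Basic authentication and returns a page containing table elements.

```python
from io import StringIO

import pandas as pd
import requests


def read_tables_with_auth(url, username, password, session=None):
    """Fetch url with HTTP Basic auth and parse its HTML tables.

    Hypothetical helper: session may be a requests.Session (or any object
    with a compatible get() method); by default requests.get() is used.
    """
    http = session or requests
    response = http.get(url, auth=(username, password), timeout=30)
    response.raise_for_status()   # raise on 401, 404, and other error codes
    # Wrap the body in StringIO so pandas treats it as HTML content.
    return pd.read_html(StringIO(response.text))
```

You would call it as read_tables_with_auth(url, 'john', 'johnspassword') with the URL of the protected page. The function returns the same kind of list of DataFrames as a direct read_html() call.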
Benefits of Reading HTML in Pandas
Pandas read HTML has plenty of benefits to offer. For instance, it can help you save time by automatically scraping data from the web. Moreover, it is an efficient tool that can help you gather data from multiple sources and cleanse it for further analysis.
Here are some benefits of using read HTML in Pandas:
- Data representation: Pandas offers streamlined data representation features. As a result, it is an excellent data analysis and manipulation tool.
- Web scraping: One of the most significant advantages of using read HTML in Pandas is that it can help you automate web scraping tasks.
- Efficient data handling: Pandas can handle large amounts of data efficiently.
- Data cleansing: Once the tables are in DataFrames, you can clean the data with standard pandas methods before analysis.
Limitations of Pandas Read HTML
Hopefully, this guide will help you understand how to use pandas read_html() to read HTML tables, but there are still some limitations to this method. Someone with no coding background would have trouble understanding how to get it working.
Another problem with this approach is that the HTML involved can be extensive and slow to load. By contrast, using a pre-built and automated web scraper is a more reliable and straightforward way to gather data from the web for price monitoring, competitor analysis, or any other business objective.
One approach would be to get proxies from a reliable provider, such as Rayobyte, and use them to make requests to the target website. However, this could be time-consuming.
The second and more straightforward alternative is Scraping Robot, a web scraper that makes web scraping a breeze by taking the hassle and time out of the process. You only have to provide the URLs to be scraped, and the web scraper will do the rest.
Use Pandas Read HTML To Scrape the Web
Pandas read HTML can be an effective way to scrape the web for data. With just a few lines of code, you can read HTML tables into a pandas DataFrame, making it simple to work with the data in Python.
Moreover, once a table is in a DataFrame, you can customize it by changing its index, column names, and so on, and adjust presentation details such as borders when you export it back to HTML. However, if working with code is too confusing or overwhelming, you can opt for a ready-made web scraper, such as Scraping Robot, to collect all the information you need.