One of the biggest hurdles in web scraping is how literal computer programs can be. When writing a web scraper, you need to think like a computer. That means thinking in “yes” and “no,” with no shades of gray, or you’ll tell the program to collect information that turns out to be total junk.
The solution is to use CSS selectors. With the right CSS selectors, you can tell your scraper to collect precisely the page elements you care about, nothing more and nothing less. This CSS selectors cheat sheet puts the most useful selectors in one place. Keep reading to learn which selectors are most important for web scraping, why they’re helpful, and how to use them appropriately.
What Is a CSS Selector?
CSS, which stands for Cascading Style Sheets, is a style sheet language that describes how HTML documents should look. It controls the colors, fonts, and layout of most modern websites.
A CSS selector is a pattern used to specify which HTML elements a set of styles applies to. CSS selectors are why different headers have different appearances, and why links you’ve clicked look different from unclicked links. They help sites present information to visitors, and they can also help you scrape sites more effectively.
Why Scrape Using CSS Selectors?
Typically, web scraping involves having your scraper bot read the HTML and extract information from it. However, HTML on its own doesn’t always provide the amount of information you need to narrow down the data you want to collect. CSS selectors can help you collect only the data that matters to you.
When web scraping with CSS selectors, you instruct the scraper to collect only the elements that match a given selector. Instead of collecting every bit of bolded text, for instance, you can collect bolded headers and nothing else. A well-written CSS selector scraping program will help you avoid junk data and make your scrapes more effective overall.
The Best CSS Selectors for Scraping Websites
There are dozens of CSS selector types, and they can be combined in countless ways. However, they can be broken down into a few key categories. This CSS selectors cheat sheet covers the most important patterns to look for in your scrapes and what they may look like in action.
* (All Element)
The asterisk (*) is the broadest CSS selector available. For scraping, that also makes it one of the least valuable. The asterisk is known as the universal, or “all,” selector because it matches every HTML element on the webpage, so using it instructs the scraper to pick up the entire page. This broad sweep can be useful if you want to collect entire pages, but it’s not the right choice for narrower scrapes.
[attribute]
Attribute selectors match elements by the attributes they carry. If an HTML element doesn’t have a clear class, its attributes can still set it apart. You can use an attribute selector to find every element that has a given attribute, such as a link target or a title.
Examples of attribute selectors include:
- [href]: Matches every element that has an href attribute, which in practice means most links.
- [exampleattribute]: Matches every element that has the attribute named in the square brackets, whatever its value.
[attribute*=value]
You can make your attribute search more specific by matching on the attribute’s value. The *= operator matches attributes whose value contains a given string anywhere inside it, so you collect only those elements. The similar-looking ~= operator is stricter: it matches only when the string appears as a whole, space-separated word in the value.
- [href*=".org"]: Looks for links within the page and gathers the ones with “.org” in the URL.
- [exampleattribute*="best"]: Finds all elements with the “exampleattribute” attribute and picks up the ones whose value includes the string “best.”
[attribute=value]
You can also look for attributes whose value is exactly a specific string. This selector finds all the elements with a specific attribute, then filters out those whose value is anything other than that exact string. Note that the quotes in the selector must be straight quotes, not curly ones. Examples include:
- [exampleattribute="Target"]: Collects only the elements whose exampleattribute value is exactly “Target.”
- [href="http://www.examplesite.com"]: Finds only links that point to exactly http://www.examplesite.com.
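The three attribute-matching rules are easy to mix up, so here is a minimal sketch that emulates them by hand. Python's standard library has no CSS selector engine, so each rule is written as a plain predicate rather than as a selector string; a third-party library such as Beautiful Soup accepts the selector string directly through its select() method. The sample HTML and the helper name `values` are made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical sample page; the URLs and attribute values are invented.
SAMPLE_HTML = """
<a href="http://www.examplesite.com">Home</a>
<a href="https://charity.org/donate">Donate</a>
<a name="top">Anchor without a link target</a>
<p exampleattribute="best value">Deal</p>
<p exampleattribute="bestseller">Book</p>
<p exampleattribute="Target">Exact</p>
"""


class AttributeCollector(HTMLParser):
    """Records the attributes of every start tag as a dict."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        self.elements.append(dict(attrs))


collector = AttributeCollector()
collector.feed(SAMPLE_HTML)


def values(name, matches=lambda value: True):
    """Values of attribute `name` on elements where `matches(value)` holds."""
    return [
        attrs[name] or ""
        for attrs in collector.elements
        if name in attrs and matches(attrs[name] or "")
    ]


# [href]: the attribute is present, whatever its value.
has_href = values("href")
# [href*=".org"]: the value contains the substring.
org_links = values("href", lambda v: ".org" in v)
# [exampleattribute*="best"]: substring match, so "bestseller" counts too.
contains_best = values("exampleattribute", lambda v: "best" in v)
# [exampleattribute~="best"]: whole-word match only.
word_best = values("exampleattribute", lambda v: "best" in v.split())
# [exampleattribute="Target"]: the value is exactly this string.
exact_target = values("exampleattribute", lambda v: v == "Target")

print(contains_best)  # ['best value', 'bestseller']
print(word_best)      # ['best value']
```

The last two printed lists show why the operator matters: the substring match picks up “bestseller,” while the whole-word match does not.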
.class
One of the simplest and most useful CSS selectors is “.class.” Not every element has a class, but most sites use classes heavily to group elements that share a style. If you’re targeting information that falls under a specific class, you can use this selector to collect everything from that group. Just make sure that your information is the only thing in that class, or you’ll collect junk data too. Samples of classes might be:
- .orange-text: This class could be defined to turn all text within it bright orange, and targeting it would collect the orange strings.
- .card-header: This class might be used to make information card titles stand out against colorful backgrounds, so you could use it to collect all of those titles.
.class1.class2
An element can have multiple classes. If you only want to collect elements with two (or more) specific classes, you can chain the class selectors together with no space between them.
- .orange-text.card-header: This selector would help you gather only the strings that are both orange text and a card title.
- class="orange-text card-header": In the page’s HTML, the same two classes appear in a single class attribute, separated by a space and with no dots. Your selector still needs a dot before each class name and no space between them, because a space in a selector means “somewhere inside” rather than “and.”
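As a rough illustration of how class matching works, the sketch below splits each element's class attribute on whitespace and checks that every requested class is present. It emulates the rule with the standard library instead of running a real selector engine, and the sample HTML and the function name `select_by_classes` are assumptions made for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup: one element carries both classes, two carry one each.
SAMPLE_HTML = """
<h3 class="orange-text card-header">Summer sale</h3>
<h3 class="card-header">Shipping info</h3>
<span class="orange-text">Limited time</span>
"""


class ClassCollector(HTMLParser):
    """Records each element's set of classes and the text directly inside it."""

    def __init__(self):
        super().__init__()
        self.elements = []  # one dict per element, in document order
        self._open = []     # stack of elements that are still open

    def handle_starttag(self, tag, attrs):
        class_value = dict(attrs).get("class") or ""
        element = {"classes": set(class_value.split()), "text": ""}
        self.elements.append(element)
        self._open.append(element)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            self._open[-1]["text"] += data


def select_by_classes(html, *class_names):
    """Text of every element that has ALL of the given classes."""
    collector = ClassCollector()
    collector.feed(html)
    wanted = set(class_names)
    return [e["text"] for e in collector.elements if wanted <= e["classes"]]


# Equivalent of .orange-text
orange = select_by_classes(SAMPLE_HTML, "orange-text")
# Equivalent of .orange-text.card-header
orange_headers = select_by_classes(SAMPLE_HTML, "orange-text", "card-header")

print(orange)          # ['Summer sale', 'Limited time']
print(orange_headers)  # ['Summer sale']
```

Because the classes are compared as a set, the order in which they appear in the HTML does not matter, which is also true of real CSS class selectors.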
element
An elemental (ba dum-tsh) CSS selector is the element selector, which is just the tag name with no dot or other prefix. These selectors match every element of a given type, such as headers, images, or links. On their own they are broad, so on a complex page you will usually combine them with a class, attribute, or ID.
- h2: This element designates all second-level headers. Other header elements include h1 and h3 through h6.
- a: The <a> element typically designates a link, so you can use the selector to collect all links.
element.class
Now we’re starting to get more specific. An element.class selector finds only elements of a given type that are also part of a particular class. This selector refines both the element and class selectors to filter out noise like headers found on other parts of the page.
- h2.orange-text: Identifies all h2s that are also in the orange-text class, so it should collect all orange headers at the h2 level.
- h4.card-header: Only collects card headers that are also h4 headers.
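To make the combination concrete, here is a small sketch that splits a compound selector such as h2.orange-text into a tag name and a set of classes, then requires an element to satisfy both. It handles only this one simple selector shape and is a standard-library emulation, not a full selector engine; the sample HTML and the names `parse_compound` and `select` are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical page with the same classes used on different element types.
SAMPLE_HTML = """
<h2 class="orange-text">Featured</h2>
<h2>Archive</h2>
<p class="orange-text">Promo paragraph</p>
<h4 class="card-header">Card one</h4>
<p class="card-header">Card body</p>
"""


class ElementCollector(HTMLParser):
    """Records tag name, class set, and direct text for every element."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = []

    def handle_starttag(self, tag, attrs):
        classes = set((dict(attrs).get("class") or "").split())
        element = {"tag": tag, "classes": classes, "text": ""}
        self.elements.append(element)
        self._open.append(element)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            self._open[-1]["text"] += data


def parse_compound(selector):
    """Split 'h2.orange-text' into ('h2', {'orange-text'}); tag may be ''."""
    tag, *classes = selector.split(".")
    return tag, set(classes)


def select(html, selector):
    """Text of elements matching a tag, a class list, or both."""
    tag, classes = parse_compound(selector)
    collector = ElementCollector()
    collector.feed(html)
    return [
        e["text"]
        for e in collector.elements
        if (not tag or e["tag"] == tag) and classes <= e["classes"]
    ]


print(select(SAMPLE_HTML, "h2"))              # ['Featured', 'Archive']
print(select(SAMPLE_HTML, ".orange-text"))    # ['Featured', 'Promo paragraph']
print(select(SAMPLE_HTML, "h2.orange-text"))  # ['Featured']
```

Each of the two broad selectors returns an unwanted item, while the combined selector returns only the orange second-level header.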
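element[attribute=value]
Getting even more detailed, this selector looks for elements of a given type with an attribute of a specific value. You can also substitute the [attribute*=value] format here to find elements whose attribute includes a particular value but may also include other text. Headers themselves don’t carry an href attribute, so the examples below target the <a> element inside the header; the space between the two parts means “somewhere inside.”
- h2 a[href="https://www.examplesite.com"]: Finds all links inside second-level headers that point to exactly that URL. The quotes are required because the value contains characters such as colons and slashes.
- h4 a[href*=".org"]: Collects all links inside level-four headers whose URL contains “.org.”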
#id
If you want to be hyperspecific and gather one piece of information per page, you can use #id to collect the element with a specific ID. An ID is supposed to be unique within a page, so this selector should match at most one element.
- #email-signup: Finds the page element whose id is email-signup.
- #offer: Collects information under #offer if the site uses the ID to identify coupon codes.
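The sketch below shows the idea behind an ID lookup: scan the elements and return the first one whose id attribute equals the requested value. It is a hand-written standard-library emulation, and the sample HTML (including the coupon code) and the function name `find_by_id` are invented for illustration. Note that a class named offer is not matched by #offer.

```python
from html.parser import HTMLParser

# Hypothetical page: one id="offer" element and one class="offer" decoy.
SAMPLE_HTML = """
<form id="email-signup">Join our list</form>
<div id="offer">SAVE20</div>
<div class="offer">Not the coupon</div>
"""


class IdCollector(HTMLParser):
    """Records each element's id (or None) and the text directly inside it."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = []

    def handle_starttag(self, tag, attrs):
        element = {"id": dict(attrs).get("id"), "text": ""}
        self.elements.append(element)
        self._open.append(element)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            self._open[-1]["text"] += data


def find_by_id(html, element_id):
    """Text of the first element with this id, or None if there is none."""
    collector = IdCollector()
    collector.feed(html)
    for element in collector.elements:
        if element["id"] == element_id:
            return element["text"]
    return None


print(find_by_id(SAMPLE_HTML, "offer"))         # SAVE20
print(find_by_id(SAMPLE_HTML, "email-signup"))  # Join our list
print(find_by_id(SAMPLE_HTML, "missing"))       # None
```

Returning None for a missing ID lets the caller tell “element not on this page” apart from “element present but empty.”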
parentElement > childElement
Another way you can refine your search is to look for elements within other elements, which will rule out junk information you don’t want. You can also narrow down the selector by making it parentElement.class > childElement, so it only finds child elements under parent elements with specific classes.
- div > h4: Gathers all level-four headers whose direct parent is a div. Headers nested more deeply inside the div are not matched, because > only looks one level down.
- div.orange-text > a: Spots all links whose parent element is a div that’s assigned the orange-text class.
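The difference between a direct child and a deeper descendant is easiest to see on a nested example. The sketch below tracks each element's chain of ancestors while parsing, then compares “parent is an orange-text div” with “any ancestor is an orange-text div.” It is a standard-library emulation that assumes well-formed HTML without void elements such as br or img; the sample HTML and helper names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical markup: one link is a direct child, one is nested in a <p>.
SAMPLE_HTML = """
<div class="orange-text">
  <a href="/direct">Direct child</a>
  <p><a href="/nested">Nested deeper</a></p>
</div>
<div>
  <a href="/plain">Child of a plain div</a>
</div>
"""


class TreeCollector(HTMLParser):
    """Records every element together with its list of ancestors.

    Assumes each start tag has a matching end tag (no void elements).
    """

    def __init__(self):
        super().__init__()
        self.elements = []
        self._stack = []  # currently open elements, outermost first

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element = {
            "tag": tag,
            "classes": set((attrs.get("class") or "").split()),
            "href": attrs.get("href"),
            "ancestors": list(self._stack),
        }
        self.elements.append(element)
        self._stack.append(element)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()


def is_orange_div(element):
    return element["tag"] == "div" and "orange-text" in element["classes"]


collector = TreeCollector()
collector.feed(SAMPLE_HTML)
links = [e for e in collector.elements if e["tag"] == "a"]

# div.orange-text > a : the immediate parent must be the orange div.
direct_children = [
    e["href"]
    for e in links
    if e["ancestors"] and is_orange_div(e["ancestors"][-1])
]

# div.orange-text a : any ancestor may be the orange div.
descendants = [
    e["href"]
    for e in links
    if any(is_orange_div(a) for a in e["ancestors"])
]

print(direct_children)  # ['/direct']
print(descendants)      # ['/direct', '/nested']
```

The child form is the tighter of the two, which is why it is useful for ruling out links buried in nested content.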
How to Get CSS Selectors for Web Scraping
Each of the above CSS selectors can come in a variety of forms. Finding CSS selectors that get you the information you want takes a little work. The best way to find the specific selectors you should use is to open the pages you’re targeting in your browser’s developer tools and look at the HTML structure, class names, and IDs yourself.
Ideally, the website will be well-designed, and the markup will be consistent across different pages. If so, you can check two or three pages to find the CSS selectors that match the information you want.
In some cases, unfortunately, websites aren’t well-designed. These sites may not use the same class names or IDs on every page. You can still scrape these sites, but you’ll need to build a more in-depth scraper that targets a broader range of selectors. This takes more work, but it’s essential to collect valuable data.
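One way to cope with inconsistent pages is to keep an ordered list of candidate selectors and use the first one that matches. The sketch below assumes a hypothetical site where the price appears as span.price on some pages and as #cost on others. The page snippets, selector names, and the function `extract_price` are all invented for illustration, and the matching is emulated with standard-library predicates rather than a real selector engine.

```python
from html.parser import HTMLParser

# Hypothetical pages from the same site with inconsistent markup.
PAGE_A = '<span class="price">$10</span>'
PAGE_B = '<div id="cost">$12</div>'
PAGE_C = "<p>No price here</p>"


class ElementCollector(HTMLParser):
    """Records tag, id, class set, and direct text for every element."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._open = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        element = {
            "tag": tag,
            "id": attrs.get("id"),
            "classes": set((attrs.get("class") or "").split()),
            "text": "",
        }
        self.elements.append(element)
        self._open.append(element)

    def handle_endtag(self, tag):
        if self._open:
            self._open.pop()

    def handle_data(self, data):
        if self._open:
            self._open[-1]["text"] += data


# Candidate rules in priority order: (selector it stands for, predicate).
CANDIDATES = [
    ("span.price", lambda e: e["tag"] == "span" and "price" in e["classes"]),
    ("#cost", lambda e: e["id"] == "cost"),
]


def extract_price(html):
    """Return (selector_used, text) for the first rule that matches."""
    collector = ElementCollector()
    collector.feed(html)
    for selector, matches in CANDIDATES:
        hits = [e["text"] for e in collector.elements if matches(e)]
        if hits:
            return selector, hits[0]
    return None, None


for page in (PAGE_A, PAGE_B, PAGE_C):
    print(extract_price(page))
```

Returning the selector that matched alongside the value makes it easier to spot pages that fall through every rule, so you know when the candidate list needs another entry.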
The Best Way to Scrape CSS Selectors
You don’t need to develop a web scraper yourself to use CSS selectors. You can use prebuilt, free scraping solutions that already have CSS capabilities built in. Scraping Robot offers all that and more in a simple API that you can implement in your scrapes.
With Scraping Robot, you can begin targeting CSS selectors with nothing more than a basic review of the sites you want to scrape. There’s no need to go through a complicated CSS selectors tutorial or build a web scraper from scratch. When using Scraping Robot, you can use this CSS selectors cheat sheet to target the information you want. You can learn more about how to start working with Scraping Robot or sign up to access your free scrapes today.