A Guide To Web Crawling With Python

Scraping Robot
June 15, 2023

Also known as web spiders, web crawlers are bots that continually search and index internet content. They are responsible for understanding web page content so that it can be retrieved when users run search queries. Although the main users of web crawlers are search engines, website owners and marketers can also use them for site optimization.


One of the best ways to learn web crawling with Python is through Scrapy, an open-source and collaborative Python web scraping framework.

Data Crawling Python Use Cases

Web crawlers and web crawling with Python are primarily utilized by search engines such as Google and Bing. These search engines employ algorithms to tell web crawlers where to find pertinent information when addressing search queries. Rather than crawling the entire internet, web crawlers determine each web page’s significance based on factors such as brand authority, page views, and the number of web pages that link to it.

Marketers and website owners can also use web crawling with Python to optimize their sites. By performing web crawling with Python, users can identify issues that reduce search engine results page (SERP) rankings, including:

  • Page title problems: Missing, duplicate, too short, and too long title tags can affect your page ranking.
  • Repeated content: When you repeat content across different URLs, search engines have difficulty choosing which version is the most relevant to a user’s search query. You can fix this problem by combining the repeated content through a 301 redirect.
  • Broken links: Broken links ruin the user experience and lower your SERP ranking.

Marketers and website owners can also use web crawling with Python for competitor research. For instance, suppose you run an online laptop store. To better understand your industry and competitors, you can use a Python web crawler to index several competitor sites. You can then regularly extract important information from these pages, such as product names and prices.

Crawling the Web With Python and Scrapy

Now that you understand web crawling, follow these steps to perform data crawling with Python. For this example, we will perform web crawling with Python and Scrapy on a section of Encyclopedia Britannica.

1. Download Python and Scrapy

Download Python and install it on your computer. You can then install Scrapy by running the following pip command in your terminal:

pip install scrapy
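If the installation succeeded, the scrapy command-line tool should be available on your PATH; a quick way to confirm this is to print the installed version:

scrapy version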

2. Create a new Scrapy project

Create a new Scrapy project by typing the following:

scrapy startproject encyclo

This will create a Scrapy project with various folders, including a spiders folder. Create and open a new file called encyclo.py in the spiders folder. Import all required Scrapy modules.
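For the crawler built in the following steps, the imports below at the top of encyclo.py are enough (a minimal sketch assuming the default project layout Scrapy generates):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor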

3. Create an Encyclopedia Britannica crawler

Next, create an Encyclopedia Britannica crawler by defining a new class called EncycloSpider. This crawler is based on Scrapy’s built-in CrawlSpider class:

class EncycloSpider(CrawlSpider):

4. Tell the crawler what to do

Before web crawling with Python, you must tell the crawler what to do. Define four attributes on the crawler class: name, start_urls, rules, and the DEPTH_LIMIT setting. Set "name" to "encyclo" and list the Encyclopedia Britannica articles of your choice in "start_urls":

name = "encyclo"
start_urls = ["https://www.britannica.com/topic/clambake"]

Use the “rules” attribute to tell the crawler which links to parse and what to do with the links. Here’s what you should write to crawl Encyclopedia Britannica:

rules = (
    Rule(
        LinkExtractor(
            allow=r"https://www\.britannica\.com/topic/.*",
            deny=["Article Contributors:", "Article History:"],
        ),
        callback="parse_item",
        follow=True,
    ),
)

The first part of each rule is a link extractor, which pulls the links you want. It has two parts: allow and deny.

  • Allow consists of a regex pattern that matches the URLs that the link extractor should parse.
  • Deny has regex patterns that match URLs that the link extractor should ignore. In this example, we only want the article itself. This is why we are telling the link extractor to avoid “article contributors” and “article history” subpages.

"callback" names the function that runs on each response the crawler downloads from an extracted link. "follow" is set to True so the spider keeps following the links it extracts; set it to False if you don’t want the spider to go any deeper from those pages.

The last attribute is DEPTH_LIMIT, a Scrapy setting that caps how deep the crawler will go. Depth is measured in link hops from the initial URL. Setting a DEPTH_LIMIT controls the crawling scope and helps the crawler focus on relevant information.

For our example, we will set DEPTH_LIMIT to 3 through the spider’s custom_settings, which means the spider will stop following links once a page is three hops away from the initial URL.

custom_settings = {
    "DEPTH_LIMIT": 3,
}

To finish the crawler, write a “parse_item” callback function to return output.

def parse_item(self, response):
    # Strip the site suffix from the page <title> and return it with the URL
    title = response.css("title::text").get().split(" – Encyclopedia Britannica")[0]
    url = response.url
    yield {"title": title, "url": url}
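Putting the pieces from steps 2 through 4 together, encyclo.py ends up looking roughly like this:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class EncycloSpider(CrawlSpider):
    name = "encyclo"
    start_urls = ["https://www.britannica.com/topic/clambake"]
    custom_settings = {"DEPTH_LIMIT": 3}

    rules = (
        Rule(
            LinkExtractor(
                allow=r"https://www\.britannica\.com/topic/.*",
                deny=["Article Contributors:", "Article History:"],
            ),
            callback="parse_item",
            follow=True,
        ),
    )

    def parse_item(self, response):
        # Strip the site suffix from the <title> tag and record the page URL
        title = response.css("title::text").get().split(" – Encyclopedia Britannica")[0]
        yield {"title": title, "url": response.url}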

5. Prevent the crawler from getting banned

Sometimes, a crawler written in Python will be blocked from accessing websites because it sends too many requests in a short period. Sites don’t want bots and crawlers slowing down traffic for real users.

To avoid getting banned, you can make your crawler appear more “human” by:

  • Add a delay between requests. Waiting two seconds between pages makes your crawler seem more “human” to the Encyclopedia Britannica server. You can do this by opening the project’s settings.py file and adding DOWNLOAD_DELAY = 2.
  • Use a pool of proxies to hide the fact that all requests are coming from you. Rotating through multiple proxies makes it look like several people are visiting Encyclopedia Britannica at the same time from different IP addresses, which also lets you make more requests per second. A sketch of both settings follows this list.
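Here is a minimal sketch of those two ideas. DOWNLOAD_DELAY is a standard Scrapy setting; the proxy rotation part is a hypothetical example (the PROXY_POOL list, the RandomProxyMiddleware class, and the proxy URLs are placeholder names invented for illustration):

# settings.py
DOWNLOAD_DELAY = 2  # wait two seconds between requests

# Hypothetical proxy pool; replace with proxies you actually have access to.
PROXY_POOL = [
    "http://proxy1.example.com:8000",
    "http://proxy2.example.com:8000",
]
DOWNLOADER_MIDDLEWARES = {
    "encyclo.middlewares.RandomProxyMiddleware": 350,
}

# middlewares.py
import random

class RandomProxyMiddleware:
    """Pick a random proxy for each request; Scrapy's built-in
    HttpProxyMiddleware honors request.meta["proxy"]."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.settings.getlist("PROXY_POOL"))

    def process_request(self, request, spider):
        request.meta["proxy"] = random.choice(self.proxies)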

6. Put your Scrapy output into a CSV or XML file

After building your crawler, you will want to store its output in an XML or CSV file. You can do this by passing an output file name with the -o flag when running the crawl command; Scrapy infers the output format from the file extension.

The command for XML output is:

scrapy crawl encyclo -o pages.xml

Here is the command for CSV output:

scrapy crawl encyclo -o pages.csv
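If you are running Scrapy 2.0 or newer, note that -o appends to an existing file; use the capital -O flag instead if you want each run to overwrite the previous output:

scrapy crawl encyclo -O pages.csv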

Web Crawling With Python in Scrapy vs. Beautiful Soup

BeautifulSoup is another popular tool for web crawling with Python. It is a Python package for parsing XML and HTML documents and extracting data from them.

BeautifulSoup has a gentler learning curve than Scrapy, making it a great fit for newcomers to programming and web crawling with Python. However, using Scrapy to crawl websites with Python provides several advantages over BeautifulSoup:

  • Scrapy is a complete package for web crawling and scraping. BeautifulSoup, by contrast, is just an XML and HTML parser and needs additional libraries such as requests or urllib to open URLs and store output (see the short comparison after this list).
  • Scrapy is generally faster than BeautifulSoup, especially for bigger jobs.
  • Scrapy lets you customize your jobs with cookies, data pipelines, and proxies. In contrast, BeautifulSoup does not offer many customization options.
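To make the first point concrete, here is a minimal Beautiful Soup sketch that extracts the same page title as the Scrapy example above. It assumes the requests and beautifulsoup4 packages are installed; everything beyond parsing (fetching pages, following links, storing output) is left to you:

import requests
from bs4 import BeautifulSoup

# Beautiful Soup only parses the HTML; fetching the page requires another library.
response = requests.get("https://www.britannica.com/topic/clambake")
soup = BeautifulSoup(response.text, "html.parser")
title = soup.title.get_text().split(" – Encyclopedia Britannica")[0]
print(title)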

The Difference Between Web Scraping and Using Python To Crawl Websites

Although many people use “web crawling” and “web scraping” synonymously, they refer to two distinct activities.

Web crawling with Python or another language or tool involves finding and indexing new URLs. The goal of web crawling is to understand the content of a website. Users can then decide whether to extract information from one or more pages.

In contrast, web scraping involves extracting raw data from a website. Web scraping can be done manually by copying and pasting information from websites or automatically with the help of web scraping application programming interfaces (APIs) or web scraping bots.

Web scraping APIs are software intermediaries that enable automated data extraction from certain sections of a target site. They only provide access to data that website owners allow you to access. You can use web scraping APIs to:

  • Gather specific types of data
  • Schedule the data aggregation process
  • Automate the data aggregation process

On the other hand, web scraping bots or web scrapers are software or pieces of code that extract data from websites. Unlike web scraping APIs, they can extract data from any part of a target site. You should use web scrapers to:

  • Access real-time data such as stock market prices
  • Avoid web scraping API restrictions or limitations
  • Fetch data from sites that don’t provide automated access or APIs

Scraping Robot API and Web Scrapers for Web Crawling and Scraping

Web crawling with Python can be time-consuming and daunting, especially if you aren’t a programmer. If you want a high-quality codeless Python web crawler with authentication, consider using Scraping Robot’s API and web scraper. Our API provides structured JSON output of a target website’s metadata, and our web scrapers boast a broad range of features, including:

  • Parsed metadata: Our scrapers have built-in parsing logic to give you the data you need. This means you don’t have to create a separate parser to handle metadata.
  • Guaranteed successful results: Scraping Robot’s scrapers will retry your requests if they encounter any bans.
  • JavaScript rendering: Our scraping bots will ensure that all JavaScript has loaded the HTML content before fetching that content for you.
  • Statistics and usage: We offer beautiful graphs showing how many scrapes you’ve performed in the last month, week, or day. You will also receive records of your most recent projects and modules.
  • No proxies required: You don’t need to get separate proxies to bypass anti-scraping technology. Send in your keywords or URLs, and we’ll handle the rest. We rotate through multiple IP pools to ensure you get the needed data.

Interested in trying Scraping Robot’s Python web crawler with JavaScript support? Register for a free account today to start web crawling with Python. You will be able to access all of our features, including new monthly modules, frequent improvement updates, seven-day data storage, and 5,000 free scrapes per month.
