Web Crawler Detected: How To Scrape Under The Radar

Scraping Robot
January 25, 2022
Community

Web crawling is growing increasingly common due to its use in competitor price analysis, search engine optimization (SEO), competitive intelligence, and data mining.

While web crawling has significant benefits for users, it can also significantly increase loading on websites, leading to bandwidth or server overloads. Because of this, many websites can now identify crawlers — and block them.

Traditional computer security techniques aren’t much help for detecting web scraping, because the problem isn’t malicious code execution, as with viruses or worms; it’s the sheer number of requests a crawling bot sends. Websites therefore rely on other mechanisms to detect crawler bots.

This guide discusses why your crawler may have been detected and how to avoid detection during web scraping.

How Is a Crawler Detected?

Web crawlers typically use the User-Agent header in an HTTP request to identify themselves to a web server. This header is what identifies the browser used to access a site. It can be any text but commonly includes the browser type and version number. It can also be more generic, such as “bot” or “page-downloader.”

Website administrators examine the web server log and check the User-Agent field to find out which crawlers have previously visited the website and how often. In some instances, the User-Agent field also contains a URL. Using this information, the website administrator can find out more about the crawling bot.
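
For illustration, a short script can tally requests per User-Agent from a standard combined-format access log; the log path and format here are assumptions, not a specific tool any administrator necessarily uses:

```python
from collections import Counter
import re

# Matches the User-Agent field (the last quoted string) in a combined-format access log line.
UA_PATTERN = re.compile(r'"([^"]*)"\s*$')

def count_user_agents(log_path="access.log"):
    """Tally how many requests each User-Agent made, so unusually busy clients stand out."""
    counts = Counter()
    with open(log_path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = UA_PATTERN.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for agent, total in count_user_agents().most_common(10):
        print(f"{total:6d}  {agent}")
```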

Because checking the web server log for every request is tedious, many site administrators use automated tools to track, verify, and identify web crawlers. Crawler traps are one such tool: web pages that trick a crawler into requesting an endless series of irrelevant URLs. If your web crawler stumbles onto such a page, it will either crash or need to be manually terminated.

When your scraper gets stuck in one of these traps, the site administrator can then identify your trapped crawler through the User-Agent identifier.

Such tools are used by website administrators for several reasons. For one, if a crawler bot is sending too many requests to a website, it may overload the server. In this case, knowing the crawler’s identity can allow the website administrator to contact the owner and troubleshoot with them.

Website administrators can also perform crawler detection by embedding JavaScript in HTML pages (or adding server-side code such as PHP) to “tag” web crawlers. The JavaScript is executed in the browser when it renders the page, something most basic crawlers never do. The main purpose is to identify the User-Agent of the web crawler so the site can prevent it from accessing future pages, or at least limit its access as much as possible.

Using such code snippets, site administrators restrict the number of requests web crawlers can make. By doing this, they can prevent web crawlers from overloading the server with a large number of requests.

Why Was Your Crawler Detected?

If you’re getting errors such as “Request Blocked: Crawler Detected” or “Access Denied: Crawler Detected” when you’re trying to scrape a website, the website administrator likely detected your web crawler.

Most website administrators use the User-Agent field to identify web crawlers, but there are other common ways to get caught. Your crawler may be detected if it’s:

  • Sending too many requests: If a crawler sends too many requests to a server, it may be detected and blocked, since the website administrator will worry that it could overload their server. Your crawler is easy to spot if it sends far more requests in a short period than human users are likely to send.
  • Using a single IP: If you’re sending too many requests from a single IP address, you’re bound to be discovered quickly. Making many requests from the same IP looks suspicious, and website administrators will quickly suspect a bot rather than a human visitor.
  • Not spacing the requests: If you don’t space your crawler’s requests properly, the server may notice that you’re firing them in rapid succession or at perfectly regular intervals. Some crawlers handle spacing automatically; if yours doesn’t, adding sensible gaps between requests helps you avoid detection.
  • Following similar patterns: If the website notices that your crawler behaves like other known bots, it can put you in the “bots” category. For instance, if your web crawler only requests links or images, the website administrator may be able to tell that your goal is to scrape their site.

How To Avoid Web Crawler Detection

To keep your future scraping efforts under the radar, it’s important to familiarize yourself with how crawler detection works and how to avoid it. Here are some ways to prevent web crawler detection.

Understand the robots.txt file

The robots.txt file can be found in the root directory of a website. Its purpose is to tell web crawlers how they should interact with the site. Web developers put instructions or rules in this file stating which parts of the site crawlers may and may not access.

If a website has User-agent: * and Disallow: / in the robots.txt file, it means the site administrator does not want you to scrape their website. Make sure you understand the restrictions mentioned in the robots.txt file to avoid being blocked for violating them.
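
As a quick check before crawling, Python’s built-in robotparser module can tell you whether a given path is disallowed. The site and path below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Placeholder target site used purely for illustration.
SITE = "https://example.com"

parser = RobotFileParser()
parser.set_url(f"{SITE}/robots.txt")
parser.read()  # fetches and parses the robots.txt file

# Check whether a generic crawler is allowed to fetch a specific path.
if parser.can_fetch("my-crawler", f"{SITE}/products/page-1"):
    print("Allowed to crawl this URL")
else:
    print("Disallowed by robots.txt, so skip it")
```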

Rotate your IP

Your IP address is your identity on the internet. Web servers usually record your IP address when you request a web page. If several rapid requests are made from a single IP, most sites will think that they’re coming from a bot trying to scrape their data. They will then block your access.

One way to avoid getting blocked by the server is to rotate your IP address using proxies, such as those provided by Rayobyte.

A proxy server acts as a middleman between you and the web server. The requests you send will then have to go through that proxy before they reach the server, thus hiding your true IP address from the server.
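
A minimal sketch of this idea with the requests library might look like the following; the proxy addresses are placeholders you would swap for the ones your provider gives you:

```python
import random
import requests

# Placeholder proxy endpoints; substitute the addresses supplied by your proxy provider.
PROXIES = [
    "http://user:pass@proxy1.example.com:8000",
    "http://user:pass@proxy2.example.com:8000",
    "http://user:pass@proxy3.example.com:8000",
]

def fetch_via_random_proxy(url):
    """Send the request through a randomly chosen proxy so requests don't all share one IP."""
    proxy = random.choice(PROXIES)
    return requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        timeout=10,
    )

response = fetch_via_random_proxy("https://example.com/products")
print(response.status_code)
```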

Use a real User-Agent

As mentioned above, websites use your User-Agent header to identify whether you’re a human user or a web crawler.

If your User-Agent header matches that of a real browser, the server will treat your spider as a browser and respond accordingly. To avoid being detected as a web crawler, set this header to a realistic browser value and vary it when sending multiple requests to the same website.

You can either create a list of User-Agents or use a fake User-Agent library for this.
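
For example, a simple approach is to keep a small pool of real browser User-Agent strings and pick one at random per request. The strings and URL below are illustrative only:

```python
import random
import requests

# A small pool of real browser User-Agent strings (examples only; keep yours up to date).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:96.0) Gecko/20100101 Firefox/96.0",
]

def fetch_with_random_user_agent(url):
    """Pick a different User-Agent per request so the traffic looks like ordinary browsers."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers, timeout=10)

response = fetch_with_random_user_agent("https://example.com")
print(response.request.headers["User-Agent"])
```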

Scrape at random intervals

If the web crawler is too systematic or follows a set interval, the target website will eventually identify it as a bot because humans don’t follow regular intervals while browsing the web. To avoid having your crawler detected, you should scrape at random intervals.
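
One straightforward way to do this is to sleep for a random amount of time between requests, as in this sketch (the URLs and delay range are arbitrary choices):

```python
import random
import time

import requests

URLS = [f"https://example.com/page/{n}" for n in range(1, 6)]  # placeholder URLs

for url in URLS:
    response = requests.get(url, timeout=10)
    print(url, response.status_code)
    # Pause for a random 2-8 seconds so the request timing doesn't form a regular pattern.
    time.sleep(random.uniform(2, 8))
```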

Use a headless browser

A headless browser is a web browser without a graphical user interface. You can run one from the command line and scrape websites just as you would with a regular browser.

The advantage of using headless browsers is that they execute JavaScript and render pages like a regular browser, which makes their requests much harder to distinguish from ordinary browser traffic.
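
As an illustration, here is a minimal headless Chrome session with Selenium; it assumes you have the selenium package and a matching Chrome/chromedriver installed, and the URL is a placeholder:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window.
options = Options()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # placeholder URL
    print(driver.title)                # the page is fully rendered, JavaScript included
    html = driver.page_source          # scrape the rendered HTML as needed
finally:
    driver.quit()
```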

Use CAPTCHA-solving services

CAPTCHAs can be a hurdle for web scrapers. Many websites use image-based CAPTCHAs that basic web crawlers can’t solve. If your script can’t get past a CAPTCHA on its own, you can use third-party services that specialize in solving these kinds of tests for web crawlers.

Lower your scraping speed

Web crawlers browse the web much faster than humans can, which makes them easier to spot. To avoid detection, slow down your web scraper: for example, program breaks between subsequent requests or set a delay period.
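
If you prefer a hard cap on request rate rather than purely random pauses, a small throttle that enforces a minimum gap between requests is one option. The two-second floor below is an arbitrary choice:

```python
import time

import requests

MIN_INTERVAL = 2.0  # minimum number of seconds between requests (assumed value)
_last_request = 0.0

def polite_get(url):
    """Wait until at least MIN_INTERVAL seconds have passed since the previous request."""
    global _last_request
    wait = MIN_INTERVAL - (time.monotonic() - _last_request)
    if wait > 0:
        time.sleep(wait)
    _last_request = time.monotonic()
    return requests.get(url, timeout=10)

for page in range(1, 4):
    response = polite_get(f"https://example.com/page/{page}")  # placeholder URLs
    print(response.status_code)
```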

Use a scraping tool

If you’re having a hard time getting the hang of web scraping and keep getting blocked, consider using Scraping Robot. Our tools take care of all your scraping needs so you don’t have to worry about things like proxy management and solving CAPTCHAs.

Scraping Robot manages everything from anti-scraping updates to server management and browser scalability, giving you ample time and peace of mind to work with your scraped data.

Final Thoughts

If your crawler was blocked in your recent web scraping efforts, you should now have a better idea about why it happened. It’s important to know how websites identify web crawlers so you can take active steps to avoid crawler detection.

By making certain changes in your web scraping practices, such as slowing the speed of requests or using a proxy or automated scraping tool, you can lower the chance of your web crawler being detected and blocked.
