As businesses flock to web scraping to gather massive amounts of information, website owners have become progressively suspicious of the practice — and as a result, an increasing number have started to install anti-scraping IP blocks on their websites.
We know how frustrating it can be to get blocked by your target website while using a scraper. We’ve written this guide with strategies and mitigation techniques you can use to scrape while avoiding IP blocks and bans. Read on to learn how to get past an IP ban when scraping so you can gather all the data you need for your project or personal use.
Be Aware of Anti-Scraping Policies and Traps
1. Check your target site’s robots.txt file
Before you start scraping, make sure the target website allows web scraping. Take a look at the site’s robots exclusion protocol (robots.txt) file, a standard used by websites to communicate with web scrapers and other robots.
Respect the website’s rules and be an ethical web scraper. Not only is this the right thing to do, but it will also make the scraping process a lot easier for you. Even if your target site allows scraping, don’t go overboard. Read robots.txt, look at your bot’s Acceptable Use Policy, crawl during less busy times of the day, and limit requests coming from one IP address. Otherwise, you’re likely to get a “request blocked, crawler detected” message.
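As a rough illustration of the robots.txt check described above, here is a minimal sketch using Python’s standard `urllib.robotparser` module. The robots.txt content, the `my-scraper` user-agent name, and the example.com URLs are all made up for this example; in practice you would download the real file from your target site first.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user-agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "my-scraper", "https://example.com/blog/"))      # True
print(is_allowed(parser, "my-scraper", "https://example.com/private/x"))  # False
```

Running a check like this before each request keeps your bot out of paths the site has asked crawlers to avoid.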
2. Be on the alert for honeypot traps
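Before you start scraping, you need to know what honeypot traps are. Honeypots are links in the website’s HTML code that are hidden from human visitors, typically with CSS, so only bots end up following them. When a site sees a request for one of these links, it can identify the client as a bot and block it.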
Fortunately, you’re probably not going to encounter honeypots regularly since they require a lot of work to set up. However, if you suddenly get blocked while web scraping, your target website may be using honeypot traps — so it’s important to be on the lookout for them.
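One way to stay on the lookout is to skip links that a human visitor could not see. The sketch below is only a heuristic, and it assumes the trap is hidden with an inline `style` attribute or the HTML `hidden` attribute; links hidden through external stylesheets would not be caught. The class and function names are our own, chosen for illustration.

```python
from html.parser import HTMLParser

# Inline-style fragments that commonly hide an element from human visitors.
HIDDEN_STYLE_MARKERS = ("display:none", "visibility:hidden")


class VisibleLinkParser(HTMLParser):
    """Collect href values from <a> tags that are not obviously hidden."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # Normalize the inline style so "display: none" matches "display:none".
        style = (attr_map.get("style") or "").replace(" ", "").lower()
        hidden = "hidden" in attr_map or any(
            marker in style for marker in HIDDEN_STYLE_MARKERS
        )
        if not hidden:
            self.links.append(href)


def visible_links(html: str) -> list:
    """Return the links a human visitor could plausibly click."""
    parser = VisibleLinkParser()
    parser.feed(html)
    return parser.links
```

For example, given a page with one normal link and one link styled `display: none`, `visible_links` returns only the normal link, so your bot never requests the trap URL.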
How to Scrape Without Getting IP Blocked by Websites
Now that you know what to look for before you start scraping, let’s talk about how you can scrape websites without getting IP blocked.
1. Slow down your scraping and scrape during off-peak hours
Since most web scraping tools work as quickly as possible to get data, anti-scraping systems can easily detect them from their request rate alone.
To avoid this, you should slow down your scraping so it feels more “human.” Put some random stops between requests and try to access only one to two pages at a time. You can also check out your target website’s robots.txt and see if there’s a line about what crawl delay you should implement so you won’t cause problems with heavy server traffic.
As a courtesy, you should also save scraping for off-peak hours — like after midnight in the target server’s time zone — to prevent the target website from overloading.
Because your bot goes through pages significantly faster than a human user, it may affect the website’s load times if you scrape during peak hours. This will make it easier for the website managers to spot and block your bot.
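The random pauses described above could be sketched like this. The 2 to 6 second range is an arbitrary assumption, not a recommended value, and `fetch` stands in for whatever function your scraper uses to download a page. If robots.txt specifies a crawl delay, the sketch never waits less than that.

```python
import random
import time


def next_delay(crawl_delay=None, min_pause=2.0, max_pause=6.0, rng=random):
    """Pick a random pause in seconds, never shorter than the crawl delay."""
    pause = rng.uniform(min_pause, max_pause)
    if crawl_delay is not None:
        pause = max(pause, float(crawl_delay))
    return pause


def crawl_politely(urls, fetch, crawl_delay=None, sleep=time.sleep):
    """Fetch URLs one at a time, sleeping a random interval between requests.

    `fetch` is a caller-supplied function (hypothetical here) that downloads
    one URL. `sleep` is injectable so the loop can be tested without waiting.
    """
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(next_delay(crawl_delay))
        results.append(fetch(url))
    return results
```

Fetching pages sequentially with an uneven rhythm is closer to how a person browses than a burst of evenly spaced requests.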
2. Use proxy servers
A proxy server is a server that allows you to send requests to websites using a custom IP to mask your real IP address. Websites flag multiple requests from a single IP address as “suspicious” and “bot-like.” This is why you need to use proxy servers to make your scraping bot appear more human.
To avoid getting a blocked proxy, you need to also rotate your pool of IP addresses. If you send a lot of requests from the same proxy address, the website will block that proxy IP address. By rotating your proxy IPs, you will fool the website into thinking you are many different internet users.
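Here is a minimal sketch of proxy rotation using only Python’s standard library. The proxy addresses are placeholders from a range reserved for documentation, so you would replace them with proxies you actually have access to. Simple round-robin rotation is just one possible strategy.

```python
import itertools
import urllib.request

# Placeholder proxies (203.0.113.0/24 is reserved for documentation).
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]


def proxy_rotation(pool):
    """Return an endless iterator that cycles through the proxy pool."""
    return itertools.cycle(pool)


def opener_for(proxy_url):
    """Build a urllib opener that routes HTTP and HTTPS through one proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage sketch: take the next proxy before each request.
# rotation = proxy_rotation(PROXY_POOL)
# opener = opener_for(next(rotation))
# html = opener.open("https://example.com/", timeout=10).read()
```

Each call to `next(rotation)` hands back a different address, so consecutive requests leave from different IPs.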
3. Set and switch user-agents (UAs)
User-agents (UAs) are strings in the header of a request that identify the browser and operating system to the web server. Since every request contains a UA, sending the same UA with every request makes your scraper easy to spot and can lead to an IP ban or block.
To get past this, you need to switch UAs instead of just using one.
You also need to set up-to-date, popular UAs that look like organic users. For example, if you use a 10-year-old UA from a version of Opera that’s no longer supported, you’re likely to get booted from the website since it doesn’t look like an organic user. To create convincing UAs, research which browsers and UAs are currently the most popular. As of August 2021, UAs that use Safari 14 on iPhone and Macintosh are some of the most common.
If you’re an advanced user, try setting your UA to the Googlebot User Agent. Most websites want to be listed on Google, so they have no problem letting Googlebot in.
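Switching UAs can be as simple as picking one at random for each request. In the sketch below, the two Safari 14 strings are illustrative examples of the format only; you should build your own list from UAs that are popular when you run your scraper.

```python
import random
import urllib.request

# Illustrative Safari 14 user-agent strings; refresh this list regularly.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.1.2 Safari/605.1.15",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 14_7_1 like Mac OS X) "
    "AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.2 "
    "Mobile/15E148 Safari/604.1",
]


def request_with_random_ua(url, rng=random):
    """Build a request whose User-Agent is drawn at random from the list."""
    user_agent = rng.choice(USER_AGENTS)
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Usage sketch:
# req = request_with_random_ua("https://example.com/")
# html = urllib.request.urlopen(req, timeout=10).read()
```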
4. Use different scraping patterns
Bots typically use the same scraping pattern, so you should try using different scraping patterns to avoid getting blocked.
You can do so by adding random scrolls, mouse movements, and clicks to make your crawling less predictable. Think of how an ordinary visitor would use or browse the website, and apply these patterns to the bot itself. For example, you can program your bot so it visits the target website’s home page first and then makes some requests to inner pages such as the website’s shop, blog, or contact us page.
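The home-page-first pattern from the example above might look something like this. It only covers the order of page visits (scrolls, mouse movements, and clicks need a real browser), and the function name and `max_pages` limit are our own assumptions.

```python
import random


def plan_visit(home_url, inner_urls, max_pages=3, rng=random):
    """Return a visit order: the home page first, then a random selection of
    inner pages in random order, so no two sessions look the same."""
    count = min(max_pages, len(inner_urls))
    picked = rng.sample(list(inner_urls), count)
    return [home_url] + picked


# Usage sketch with made-up URLs:
# plan_visit(
#     "https://example.com/",
#     ["https://example.com/shop", "https://example.com/blog",
#      "https://example.com/contact-us"],
# )
```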
Images are not only data-heavy but are also often protected by copyright. To save on storage space and protect yourself against potential lawsuits, it’s better to avoid image scraping altogether.
How to Get Past an IP Ban For Advanced Users
1. Set a Referrer
A Referrer is an HTTP request header that lets your target site know where you’re coming from. Note that the header name itself is spelled “Referer” in HTTP. You can make it look like you’re coming from Google by setting this header:
"Referer": "https://www.google.com/"
Feel free to localize the Referrer depending on what country your target site is located in. For example, if you’re scraping a site in Canada, you should use “https://www.google.ca/” instead of “https://www.google.com/”.
You can also add in variety by including other referral sites such as YouTube, Instagram, and other common sites. Check out tools like SimilarWeb.com to discover the most common referrers on the Internet.
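To show how localization and variety could fit together, here is a small sketch that usually sends a country-specific Google Referer and occasionally another common site. The country table, the list of other sites, and the 70% weighting are all assumptions made for illustration.

```python
import random
import urllib.request

# Hypothetical lookup tables; extend them for the countries you target.
GOOGLE_BY_COUNTRY = {
    "us": "https://www.google.com/",
    "ca": "https://www.google.ca/",
}
OTHER_REFERERS = [
    "https://www.youtube.com/",
    "https://www.instagram.com/",
]


def pick_referer(country="us", google_weight=0.7, rng=random):
    """Mostly return the localized Google URL, sometimes another common site."""
    google = GOOGLE_BY_COUNTRY.get(country, GOOGLE_BY_COUNTRY["us"])
    if rng.random() < google_weight:
        return google
    return rng.choice(OTHER_REFERERS)


def request_with_referer(url, country="us"):
    """Build a request carrying the chosen Referer header."""
    headers = {"Referer": pick_referer(country)}
    return urllib.request.Request(url, headers=headers)
```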
2. Set your IP fingerprint properly
As anti-scraping tools become more powerful, some websites have adopted Transmission Control Protocol (TCP) or IP fingerprinting to block and ban bots.
When you scrape a website that uses TCP/IP fingerprinting, low-level parameters of your bot’s connection, such as the initial TTL and the TCP window size, may out it as a bot. You need to carefully configure these parameters so the fingerprinting won’t pick your bot up.
3. Use headless browsing
Another way of getting around TCP/IP fingerprinting is using a headless browser. It works like a regular browser, only it doesn’t have a graphical user interface (GUI), so you control it programmatically.
If you want to start using headless browsing, we recommend using the headless versions of popular web browsers such as Firefox and Chrome.
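If you just want to see what headless Chrome returns for a page, one simple approach is to call it from the command line with its `--headless` and `--dump-dom` flags. The sketch below wraps that in Python. The binary name `google-chrome` is an assumption; it differs between operating systems and installs, so adjust it to match yours.

```python
import subprocess


def headless_chrome_command(url, binary="google-chrome"):
    """Build the command line that asks headless Chrome to print a page's
    rendered DOM. The binary name is an assumption and varies by platform."""
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]


def render_page(url, binary="google-chrome", timeout=30):
    """Run headless Chrome and return the rendered HTML as text.

    Raises an exception if Chrome is missing, exits with an error,
    or takes longer than `timeout` seconds.
    """
    result = subprocess.run(
        headless_chrome_command(url, binary),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout
```

For anything beyond a one-off page dump, such as clicking or scrolling, a browser automation library is usually a better fit than the command line.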
4. Set other request headers
If you don’t use headless browsing, you can still make your bot feel more human by setting other request headers. To make your scraper feel like a real browser, go to this site and copy the headers that your current browser is using. Parameters like “Accept-Language,” “Accept-Encoding,” and “Upgrade-Insecure-Requests” will make your bot’s requests look like they are from a real browser so they won’t get blocked.
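As a rough sketch, a browser-like header set could be assembled as follows. The header values are typical examples rather than ones copied from a specific browser, so treat them as assumptions and swap in the values your own browser sends. Because the sketch advertises gzip in “Accept-Encoding,” it also includes a helper to decompress the response, which Python’s `urllib` does not do automatically.

```python
import gzip
import urllib.request


def browser_like_headers(user_agent, accept_language="en-US,en;q=0.9"):
    """Return request headers resembling what a desktop browser sends.
    The values are illustrative; copy real ones from your own browser."""
    return {
        "User-Agent": user_agent,
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": accept_language,
        "Accept-Encoding": "gzip",
        "Upgrade-Insecure-Requests": "1",
    }


def decode_body(raw, content_encoding):
    """Decompress the body if the server answered with gzip."""
    if (content_encoding or "").lower() == "gzip":
        return gzip.decompress(raw)
    return raw


# Usage sketch:
# req = urllib.request.Request(url, headers=browser_like_headers(my_user_agent))
# with urllib.request.urlopen(req, timeout=10) as resp:
#     body = decode_body(resp.read(), resp.headers.get("Content-Encoding"))
```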
5. Use CAPTCHA solving services or crawling tools
Like TCP/IP fingerprinting, CAPTCHAs are another major challenge your web scraper will face. Because CAPTCHAs rely on images that bots usually can’t read, they typically succeed in identifying and blocking your bots.
As such, you need to use dedicated CAPTCHA solving services or crawling tools to bypass CAPTCHAs when scraping. Many of these are expensive and slow, but affordable scraping bots like Scraping Robot can solve the CAPTCHA problem for you.
6. Scrape Google cache
Finally, if you’re scraping data that doesn’t change often, you may want to scrape information directly out of Google’s caches rather than the live website itself. However, if you need real-time or time-sensitive data, this isn’t a good solution.
Avoid IP Blocks With Scraping Robot
With most scraping robots out on the market, you have to do much of the work yourself — including proxy rotation, CAPTCHA solving, and server management.
However, with Scraping Robot, you won’t have to worry about many of these issues anymore. We offer the following — and more:
- Server management
- Proxy rotation and management: you won’t have to worry about rotating proxies anymore, because we will do it for you!
- CAPTCHA solving
What’s more, the first 5000 scrapes are totally free. We also have a free Scraping Robot API that you can try. This will allow you to scrape any webpage with only one command. To get the full suite of Scraping Robot features, sign up for Scraping Robot today.