How To Avoid CAPTCHA When Web Scraping
If you spend any time online, you will run into Captchas. You might be asked to solve a Captcha before commenting on a blog post, logging into your bank, or even accessing a webpage.
These tests are annoying enough when you’re just browsing the web. But when you’re trying to do research with web scraping, Captchas become more than just an annoyance – they present a major obstacle. Keep reading to learn what Captchas are, why they exist, and how to avoid Captchas without a major headache.
To Learn How to Avoid Captchas, You Need to Know What They Are
Captcha stands for “Completely Automated Public Turing test to tell Computers and Humans Apart.” A Captcha is a simple test that websites use to judge whether a visitor is an actual human or a programmed bot. Captchas (often written CAPTCHAs; reCAPTCHA is Google’s widely used version) are designed to be easy for the average person to solve but very hard for a program.
But why do websites want to tell humans and robots apart? It’s because the two types of traffic mean very different things for a site. Human visitors can click ads and make purchases, making the site owner money. Bots, on the other hand, almost never create legitimate revenue, and some of them harm the site with malware or DDoS attacks. If a site can differentiate bots from humans, it can block the bots and cut down on the risk of attacks.
There are a variety of different Captcha types that all pose unique challenges for bots. The most common styles include:
- Character Recognition: A picture with a string of letters and numbers that are distorted in some way. Humans can usually read the series easily, but simple bots typically lack the image recognition to manage it.
- Audio Recognition: For visually impaired people, audio recognition is sometimes used. These Captchas play a distorted recording, and the user types in what they heard. These are becoming less common as audio recognition software improves.
- Math Problems: These Captchas pose a very simple math problem in words, like “Two plus 3 equals?” Most bots aren’t built to parse the question and answer it.
- Image Recognition: A more secure and reliable type of Captcha, used by Google’s reCAPTCHA. This program presents pictures from books or Google Street View and asks the user to identify what’s in them.
- Social Media Logins: A secure but unpopular type of Captcha requires the user to log in with a social media profile. Bots can’t accomplish that, but users are hesitant to connect their accounts to too many sites.
- Invisible Captchas: These are any Captchas that the user doesn’t see. They can be time-based, where the page records how long it takes the visitor to fill out a form. Bots can fill out forms in seconds, but humans take longer, so any visitor who fills out the form too quickly is blocked. Similarly, honeypots are fields or links hidden in invisible parts of the page. A human never sees them, so a bot that fills in or follows one gives itself away and gets blocked.
Why It’s So Hard to Spot and Bypass Captchas
If you’ve already started using web scrapers, you’ve probably realized it’s pretty hard to spot when your bot is running into Captchas. When you can’t figure out that Captchas are the problem, you can’t take steps to avoid them. But why are Captchas so hard to spot in the first place?
It’s because most web scrapers don’t have a good way to let you know that they’ve run into a Captcha. Instead, the scraper at most lets you know that an error occurred.
With many bots, you need to visit the page yourself to see what’s going on. Make sure you visit the page with the same IP address as your bot. If you’re using proxies, for example, use the same proxies to go to the site. Some sites only present Captchas to users they believe are using a proxy. Checking the site under your own IP address won’t let you see a Captcha.
If you don’t see anything, your bot may still be triggering an invisible Captcha. This is when you need to look for errors. These often show up as repeated timeouts, 503 Service Unavailable errors, or 504 Gateway Timeout errors.
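One way to make these failures visible is to have the scraper label suspicious responses itself instead of reporting a generic error. The sketch below is only a heuristic: the `FetchResult` type, the `looks_like_captcha` function, and the marker strings are hypothetical names chosen for illustration, and the markers would need tuning for each site.

```python
from dataclasses import dataclass
from typing import Optional

# Status codes named above as common signs of an invisible Captcha.
SUSPECT_STATUSES = {503, 504}

# Assumed marker strings; real Captcha pages vary, so adjust these per site.
CAPTCHA_MARKERS = ("captcha", "g-recaptcha")


@dataclass
class FetchResult:
    status: Optional[int]  # None means the request timed out
    body: str = ""


def looks_like_captcha(result: FetchResult) -> bool:
    """Heuristic: True if the response looks like a Captcha or a block."""
    if result.status is None:  # a timeout is treated as suspicious
        return True
    if result.status in SUSPECT_STATUSES:
        return True
    lowered = result.body.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)
```

Logging the URL and proxy whenever this returns True gives you a list of pages to check by hand from the same IP address.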
No matter what, if Captchas are making it harder for you to use web scraping tools, you need to learn how to beat reCAPTCHA and Captchas alike.
Can You Solve Captchas Automatically?
The first solution most people consider for beating Captchas is solving them. After all, a Captcha isn’t that hard, right? Wrong! Captchas are effective because they are tricky for computers to solve. Writing a program that reliably solves Captchas typically takes far more work than writing a great web scraper.
Some services claim to solve Captchas for you. These offerings tend to have one of two problems.
The first problem is that Captcha solving services are expensive. Some of these systems rely on human employees for solutions, and those people have to be paid. Others solve the Captchas with software, but that requires extensive programming and server support. That also cranks up the price.
The second problem is that some simply don’t do what they claim. The program claims to solve Captchas automatically but actually bypasses them in some way. That still gets you past the Captcha, but these services charge the high prices that genuine solving services command, so their customers end up significantly overpaying.
Either way, solving Captchas isn’t the way to go. Instead, you want to learn how to avoid captchas entirely.
How to Beat ReCAPTCHAS and Captchas
So, how do you go about avoiding Captchas? If you’re simply web scraping, it’s easier than you’d think. Most sites don’t bother with Captchas before every page unless a visitor triggers security measures. You can avoid those security protocols and, in turn, avoid Captchas by following a few best practices.
1. Take your time
One of the biggest giveaways that a visitor is a bot is how fast they access and move through a site. Don’t let your web scraper access sites at the speed of light. Instead, slow things down a little so it looks a bit more human, and many sites won’t throw up a Captcha at all.
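As a minimal sketch of slowing a scraper down, the loop below pauses for a random interval between requests so they are not evenly spaced. The `human_delay` and `crawl` names and the 2 to 6 second range are assumptions for illustration; `fetch` stands in for whatever function your scraper uses to download a page.

```python
import random
import time


def human_delay(min_seconds: float = 2.0, max_seconds: float = 6.0) -> float:
    """Pick a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def crawl(urls, fetch, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing a random interval in between."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(human_delay())
        results.append(fetch(url))
    return results
```

The `sleep` parameter exists only so the pause can be swapped out in tests. In normal use you would call `crawl(urls, fetch)` and let it use `time.sleep`.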
2. Use good tools
Your web scraping program should be top-notch, and you should support it with excellent tools. That means using high-quality proxies so your web scraper can keep working over the long term. An insecure or faulty scraper will trigger Captchas, as will public proxies. Instead, pair a browser automation tool like Puppeteer with a reputable proxy provider like Rayobyte, and you’ll get fewer interruptions.
3. Implement cooldown logic
Cooldown logic is a critical protocol in scraper bots. If you’re using proxies for your web scraping (and you should be), you don’t want to use any proxy too often or for too long. You should cycle through your proxies to keep them from triggering Captchas. The right scraping solution will handle this for you with cooldown logic. It will judge how long each proxy should be used, how long they should stay offline, and when it’s safe to put them in rotation again. The bot will handle all of it, and you can just relax.
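To make the idea concrete, here is a rough sketch of cooldown logic, not how any particular scraping product implements it. The `ProxyPool` class, the `max_uses` and `cooldown_seconds` parameters, and their default values are all hypothetical. Each proxy is handed out a fixed number of times, then rested until its cooldown expires.

```python
import time
from typing import Callable, Dict, Iterable, Optional


class ProxyPool:
    """Hand out proxies, resting each one after max_uses requests."""

    def __init__(
        self,
        proxies: Iterable[str],
        max_uses: int = 5,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._clock = clock
        self._max_uses = max_uses
        self._cooldown = cooldown_seconds
        self._uses: Dict[str, int] = {}
        self._ready_at: Dict[str, float] = {}
        for proxy in proxies:
            self._uses[proxy] = 0
            self._ready_at[proxy] = float("-inf")  # usable immediately

    def acquire(self) -> Optional[str]:
        """Return a proxy that is not cooling down, or None if all are resting."""
        now = self._clock()
        for proxy, ready_at in self._ready_at.items():
            if now < ready_at:
                continue  # still cooling down
            self._uses[proxy] += 1
            if self._uses[proxy] >= self._max_uses:
                # Used enough for now: reset the counter and start the cooldown.
                self._uses[proxy] = 0
                self._ready_at[proxy] = now + self._cooldown
            return proxy
        return None
```

When `acquire()` returns None, the scraper should wait rather than reuse a resting proxy. A fuller version might also rest a proxy early whenever it hits a Captcha.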
4. Check for honeypots
Honeypots are supposed to be invisible, but you can spot them if you try. You can have your bot look for elements of a webpage that are set to be invisible to users, which is where honeypots hide. If your bot spots elements whose CSS sets display to none or visibility to hidden, it should avoid interacting with that part of the page.
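A simple version of this check can be sketched with Python’s standard `html.parser` module. The `HoneypotFinder` class below is a hypothetical helper, and it only looks at inline `style` attributes and the HTML `hidden` attribute. It will not catch elements hidden by an external stylesheet or by a hidden parent element, so treat it as a starting point.

```python
import re
from html.parser import HTMLParser

# Matches inline styles such as "display: none" or "visibility:hidden".
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden", re.IGNORECASE
)


class HoneypotFinder(HTMLParser):
    """Collect form fields and links that are hidden from human visitors."""

    def __init__(self):
        super().__init__()
        self.hidden_fields = []  # names of hidden inputs and textareas
        self.hidden_links = []   # hrefs of hidden links

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        style = attr.get("style") or ""
        hidden = bool(HIDDEN_STYLE.search(style)) or "hidden" in attr
        if not hidden:
            return
        if tag == "a" and attr.get("href"):
            self.hidden_links.append(attr["href"])
        elif tag in ("input", "textarea") and attr.get("name"):
            self.hidden_fields.append(attr["name"])
```

After calling `finder.feed(page_html)`, the scraper can skip every name in `finder.hidden_fields` and every URL in `finder.hidden_links`. Ordinary `<input type="hidden">` fields are deliberately not flagged, since sites use them for legitimate values that forms are expected to send back.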
5. Avoid using direct links
Most of the time, people don’t reach a page deep inside a site by typing its address into the address bar. If you’re scraping many pages from a site, get to those pages through the links in the site’s navigation, or by using whatever referring links are available. This will get your scraper bot to the right page while looking more natural to the website.
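One small part of this is sending a Referer header that names the page where the link was found, as a browser does when a user clicks through. The sketch below uses the standard library’s `urllib`. The `request_via_link` helper is a hypothetical name, and a Referer header alone is an assumption about what helps, not a guarantee that a site will treat the visit as natural.

```python
from urllib.parse import urljoin
from urllib.request import Request


def request_via_link(current_page: str, href: str) -> Request:
    """Build a request for a link found on current_page.

    The link is resolved relative to the current page, and that page is
    sent as the Referer, mirroring what a browser does on a click.
    """
    target = urljoin(current_page, href)
    return Request(target, headers={"Referer": current_page})
```

The returned object can be passed to `urllib.request.urlopen`. The same idea applies to other HTTP clients, which generally let you set request headers.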
How to Solve Captcha Issues With a Scraping Service
All of this is a lot to take in, but learning how to avoid reCAPTCHAs and other Captchas is your best route to uninterrupted web scraping. But what if you only want to scrape something once, or you only need one scraping program? In that case, learning everything there is to know about avoiding Captchas is overkill.
That’s where services like Scraping Robot come in. Scraping Robot provides modules that you can use to scrape whatever data you want. You can either use pre-built modules and let Scraping Robot handle the scrape for you or request a custom solution.
Either way, Scraping Robot handles all the problems of Captcha avoidance for you. Scraping Robot performs the scrape, solves any issues, and manages proxies for you. It even delivers the results you need through its simple API. All you need to do is choose or commission the module you need and enter the URL you want scraped.
Final Thoughts
Captchas are a thorn in the side of anyone who wants to perform web scraping. It’s not efficient to worry about how to solve Captchas. Instead, focus on learning best practices for how to avoid Captchas, and you’ll get much better results.
You can neatly sidestep the entire Captcha problem by using a scraping service like Scraping Robot. If you’re ready to stop worrying about Captchas ruining your scrapes, reach out to Scraping Robot to discuss your custom solution today.