How to Bypass CAPTCHAs with Scrapy

In this guide, I’m going to show you how I deal with CAPTCHAs when using Scrapy. I’ll walk you through the methods I use to bypass them, so you can keep your scraper running smoothly without constant interruptions.

What is CAPTCHA?

CAPTCHA stands for “Completely Automated Public Turing test to tell Computers and Humans Apart.” In simple words, it’s a test used to tell whether a visitor is a human or a bot.

CAPTCHA types include:

  • Image selection tests (choose all squares with a car).
  • Text-based puzzles (type the letters you see).
  • reCAPTCHA from Google (the “I’m not a robot” checkbox).
  • Invisible CAPTCHAs (run in the background and block bots silently).

Why Scrapy Scrapers Get Blocked

Scrapy is fast and efficient, but that speed is exactly what gives it away. A spider fires off many requests in a short window, the server spots the pattern, and a CAPTCHA appears.

Here are some reasons Scrapy scrapers face CAPTCHA:

  • Repeated visits from the same IP address.
  • No proper browser headers.
  • Requests made too quickly.
  • Missing JavaScript execution.

Now let’s explore how to beat CAPTCHA when using Scrapy.

Method 1: Use a Web Scraping API

The easiest way to bypass CAPTCHA is by using a scraping API. These APIs are made to handle everything for you — proxies, CAPTCHAs, and JavaScript rendering.

What is a Web Scraping API?

It’s a service that fetches data for you. You send your target URL, and the API gives you the page data after solving CAPTCHA and avoiding blocks.

How It Works

  • You send a request with the target URL.
  • The API handles all bot protection.
  • You receive the clean HTML content.

Best Tools

Here are some top scraping APIs:

  • Bright Data
  • ScraperAPI
  • Smartproxy
  • Oxylabs
  • Apify

Example with Bright Data

Let’s see how to use Bright Data with Scrapy.

Step 1: Sign up for Bright Data

Go to their website and create an account. Get your API credentials.

Step 2: Send a Request

import scrapy
from base64 import b64encode
from urllib.parse import urlencode


class APISpider(scrapy.Spider):
    name = 'api_spider'

    def start_requests(self):
        target_url = "https://target-website.com"
        # Ask the scraping API to fetch and render the page for us;
        # it handles proxies, CAPTCHAs, and JavaScript along the way.
        api_url = "https://example-api.brightdata.com?" + urlencode(
            {"url": target_url, "render": "true"}
        )
        credentials = b64encode(b"username:password").decode()  # Replace with your API credentials
        yield scrapy.Request(
            api_url,
            headers={"Authorization": f"Basic {credentials}"},
            callback=self.parse,
        )

    def parse(self, response):
        # response.text is the clean HTML of the target page
        self.logger.info("Received %d characters of HTML", len(response.text))

This approach is quick and avoids most CAPTCHAs.

Method 2: Use CAPTCHA-Solving Services

Another way to bypass CAPTCHA is by using solving services. These services use humans or AI to solve CAPTCHAs for you.

Popular Solving Services

  • Bright Data’s CAPTCHA Solver
  • 2Captcha
  • Anti-Captcha
  • DeathByCaptcha
  • CapMonster

These platforms give you an API key. You send them the CAPTCHA, and they return the answer.

Example with 2Captcha

Let’s solve a reCAPTCHA with 2Captcha using Scrapy.

Step 1: Install Library

pip install 2captcha-python

Step 2: Get API Key

Sign up at 2Captcha and copy your API key from the dashboard.

Step 3: Scrapy Spider with CAPTCHA Solver

import scrapy
from twocaptcha import TwoCaptcha


class CaptchaSpider(scrapy.Spider):
    name = 'captcha_spider'
    start_urls = ["https://site-with-captcha.com"]

    def solve_captcha(self, site_key, url):
        solver = TwoCaptcha('YOUR_API_KEY')
        try:
            result = solver.recaptcha(sitekey=site_key, url=url)
            return result.get('code')
        except Exception as e:
            self.logger.error(f"Error solving CAPTCHA: {e}")
            return None

    def parse(self, response):
        # The site key lives in the page HTML, usually in a data-sitekey attribute.
        site_key = "SITE_KEY_FROM_HTML"
        captcha_code = self.solve_captcha(site_key, response.url)
        if captcha_code:
            self.logger.info("CAPTCHA Solved!")
            # Continue scraping using captcha_code

This approach is useful for forms and login pages that use reCAPTCHA.
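
What “continue scraping” looks like depends on the site, but here is a minimal sketch of submitting the solved token with a login form. The spider name, URLs, and form fields are placeholders; I’m assuming the form accepts the token in the standard g-recaptcha-response field, which is how most reCAPTCHA-protected forms work.

import scrapy


class CaptchaLoginSpider(scrapy.Spider):
    # Hypothetical spider: the name, URLs, and form fields are placeholders.
    name = 'captcha_login'
    start_urls = ["https://site-with-captcha.com/login"]

    def parse(self, response):
        captcha_code = "TOKEN_FROM_YOUR_SOLVER"  # e.g. the value returned by solve_captcha()
        # reCAPTCHA-protected forms usually expect the solved token in a
        # field called "g-recaptcha-response", sent with the other fields.
        yield scrapy.FormRequest(
            url="https://site-with-captcha.com/login",
            formdata={
                "username": "my_user",
                "password": "my_pass",
                "g-recaptcha-response": captcha_code,
            },
            callback=self.after_login,
        )

    def after_login(self, response):
        self.logger.info("Form submitted, status %s", response.status)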

Method 3: Rotate Premium Proxies

A smart way to avoid CAPTCHAs is to use different IP addresses. This hides your bot’s identity and reduces the chance of hitting a CAPTCHA.

What Are Rotating Proxies?

Rotating proxies give you a new IP address for every request. They prevent websites from detecting scraping patterns.

Types of Proxies

  • Datacenter proxies (cheap and fast, but the easiest for sites to detect).
  • Residential proxies (real household IPs, much harder to block).
  • Mobile proxies (carrier-assigned IPs, the least likely to be flagged).

For CAPTCHA-heavy targets, residential or mobile proxies are usually worth the extra cost.

Using Proxies in Scrapy

Edit the Scrapy settings file (settings.py) and add proxy settings.

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
    'myproject.middlewares.RandomProxy': 100,
}

PROXY_LIST = [
    "http://user:pass@ip1:port",
    "http://user:pass@ip2:port",
    # Add more proxies
]

Then create a middleware that selects a random proxy.

import random


class RandomProxy:
    def process_request(self, request, spider):
        # Assign a random proxy from PROXY_LIST to every outgoing request.
        proxy = random.choice(spider.settings.get('PROXY_LIST'))
        request.meta['proxy'] = proxy

This rotates proxies and reduces your chances of facing CAPTCHA.
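
One refinement I like to pair with proxy rotation is Scrapy’s built-in retry machinery, so a request that lands on a blocked IP simply goes out again and the middleware above picks a fresh proxy. A sketch of the extra settings.py entries; the exact status codes worth retrying depend on the target site:

# Retry requests that hit a blocked or rate-limited proxy so the
# RandomProxy middleware can assign a different IP on the next attempt.
RETRY_ENABLED = True
RETRY_TIMES = 3                          # extra attempts per request
RETRY_HTTP_CODES = [403, 407, 429, 503]  # typical "blocked" / rate-limit responses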

Method 4: Use Headless Browsers

Sometimes, Scrapy alone can’t deal with JavaScript-heavy websites. Some CAPTCHAs load with JavaScript and cannot be handled directly.

To fix this, you can use headless browsers like:

  • Selenium
  • Playwright
  • Splash (for Scrapy)

Using Splash with Scrapy

Splash is a JavaScript rendering service built for Scrapy.

Step 1: Install Splash

You can run it with Docker:

docker run -p 8050:8050 scrapinghub/splash

Step 2: Add Middleware

In settings.py:

SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Step 3: Use SplashRequest

import scrapy
from scrapy_splash import SplashRequest


class JSPageSpider(scrapy.Spider):
    name = 'js_page'
    start_urls = ['https://example-js-page.com']

    def start_requests(self):
        for url in self.start_urls:
            # Wait 2 seconds so Splash has time to render the JavaScript.
            yield SplashRequest(url, self.parse, args={'wait': 2})

    def parse(self, response):
        self.logger.info("Rendered page: %s", response.url)
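
If a fixed wait isn’t enough, Splash can also run a short Lua script that controls the page load more precisely. Here is a minimal sketch using the execute endpoint; the spider name and the 3-second wait are just example values:

import scrapy
from scrapy_splash import SplashRequest

# Lua script run inside Splash: load the page, let JavaScript finish, return the HTML.
wait_script = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(3)
    return splash:html()
end
"""


class JSScriptSpider(scrapy.Spider):
    name = 'js_script'
    start_urls = ['https://example-js-page.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',
                args={'lua_source': wait_script},
            )

    def parse(self, response):
        self.logger.info("Rendered HTML length: %d", len(response.text))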

Best Practices to Avoid CAPTCHA

Even without fancy tools, there are simple things you can do to avoid CAPTCHA:

1. Use User-Agent Headers

Always use real browser headers to look like a normal user.

USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
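
A single static string still looks robotic at scale, so I usually rotate through a few real browser strings with a small downloader middleware. A minimal sketch; the user-agent strings below are only examples, so swap in current ones from real browsers:

import random

# Example desktop browser strings; replace with up-to-date ones.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]


class RandomUserAgent:
    def process_request(self, request, spider):
        # Pick a fresh User-Agent for every outgoing request.
        request.headers['User-Agent'] = random.choice(USER_AGENTS)

Register it in DOWNLOADER_MIDDLEWARES the same way as the proxy middleware from Method 3.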

2. Slow Down Your Requests

Add delays between requests to avoid suspicion.

DOWNLOAD_DELAY = 2
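
A fixed delay on its own is still a very regular rhythm. I usually pair it with Scrapy’s built-in randomization and the AutoThrottle extension so the pace varies and adapts to how the server responds. A sketch of the extra settings.py options; the numbers are starting points, not magic values:

RANDOMIZE_DOWNLOAD_DELAY = True       # vary each wait between 0.5x and 1.5x of DOWNLOAD_DELAY
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep parallel hits on one site low

AUTOTHROTTLE_ENABLED = True           # adjust delays based on server latency
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_MAX_DELAY = 10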

3. Use Cookies

Save and reuse session cookies when scraping. This makes your bot look consistent.
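
Scrapy already keeps cookies within a single run through its cookiejar. The sketch below shows one way to start from cookies captured in a real browser session; the cookie names and values are placeholders:

import scrapy


class SessionSpider(scrapy.Spider):
    name = 'session_spider'

    def start_requests(self):
        # Placeholder cookies copied from a logged-in browser session.
        session_cookies = {'sessionid': 'YOUR_SESSION_ID', 'csrftoken': 'YOUR_TOKEN'}
        yield scrapy.Request(
            'https://target-website.com/account',
            cookies=session_cookies,   # Scrapy stores these in its cookiejar
            callback=self.parse,
        )

    def parse(self, response):
        # Follow-up requests reuse the same cookiejar automatically,
        # so the session looks continuous to the site.
        self.logger.info("Fetched with session cookies: %s", response.url)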

4. Use Referer and Accept Headers

Add headers to mimic real browser behavior.
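
The simplest way to do this globally is DEFAULT_REQUEST_HEADERS in settings.py, which Scrapy merges into every request. A quick sketch; the Referer value is just a placeholder, so pick one a real visitor would plausibly arrive from:

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.9',
    'Referer': 'https://www.google.com/',   # placeholder referrer
}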

Which Method Should You Choose?

It depends on your project. A web scraping API is the easiest option when you want blocks handled for you end to end. CAPTCHA-solving services make sense when you hit a specific challenge, such as reCAPTCHA on a form or login page. Rotating proxies are the right call for large crawls where sheer volume is what triggers the blocks, and a headless browser like Splash is needed when the site, or the CAPTCHA itself, depends on JavaScript. In practice, I often combine two or more of these.

Final Thoughts

CAPTCHAs are everywhere, and they are getting smarter. But you don’t need to give up on your Scrapy project. With the right tools and methods, you can bypass almost any CAPTCHA.

Use APIs for convenience, solvers when necessary, rotate proxies to stay anonymous, and always mimic human behavior with headers and delays.

With these tools in your toolkit, you can keep scraping safely and efficiently.
