Speed Up Web Scraping with Concurrency in Python

In this guide, I’ll show you how to use asyncio and aiohttp in Python to make your web scraping faster and more efficient. I’ll break it down step-by-step, so you can boost the performance of your scraping scripts and get the data you need in no time. Let’s dive in!

What is Concurrency?

Concurrency is the ability to handle multiple tasks at once, or out of order. In web scraping, this means sending several requests at the same time instead of waiting for each one to finish before starting the next. In a traditional scraping setup, every request involves waiting for the server to respond, and that waiting dominates the runtime when the server is slow or you have many pages to fetch. With concurrent scraping, while one request is waiting for a response, others can already be in flight. This cuts down idle time and speeds up your scraper.
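To make the idea concrete, here is a minimal, self-contained sketch (names and timings are purely illustrative) that uses asyncio.sleep() as a stand-in for waiting on a network response:

import asyncio
import time

async def fake_request(page):
    await asyncio.sleep(2)  # Pretend the server takes 2 seconds to respond
    return f"page {page} done"

async def main():
    start = time.perf_counter()
    # All three "requests" wait at the same time instead of one after another
    results = await asyncio.gather(*(fake_request(p) for p in range(1, 4)))
    print(results)
    print(f"Elapsed: {time.perf_counter() - start:.1f}s")  # ~2s instead of ~6s

asyncio.run(main())

Three tasks that each wait 2 seconds finish in about 2 seconds overall, because the waits overlap instead of stacking up.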

Best Alternative to Python Web Scraping

Bright Data’s Web Scrapers provide a complete, enterprise-grade solution for your data extraction needs. With access to an expansive proxy network and advanced anti-bot measures, Bright Data makes scaling your web scraping projects easier and more reliable. Whether you’re collecting data for market research or automating large-scale scraping tasks, their powerful infrastructure lets you focus on extracting high-quality data without the hassle.

If you are interested in learning about other tools too, visit my list of the best scraping tools.

Sequential Scraping: The Slow Way

Let’s first look at how a typical scraping script works without concurrency. Imagine you need to scrape a website with 12 pages, each taking about 2 seconds to load. A sequential approach means:

  1. You make a request to the first page.
  2. Wait for 2 seconds for the server to respond.
  3. Process the data.
  4. Move on to the next page, and repeat the process.

If each page takes 2 seconds, the total time to scrape all 12 pages sequentially would be 24 seconds.

Here’s an example of how you might implement this in Python using the requests library:

import requests
from bs4 import BeautifulSoup
import csv

base_url = "https://example.com/products/page"
pages = range(1, 13)  # Scrape 12 pages

def extract_data(page):
    url = f"{base_url}/{page}/"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    # Extract data here (e.g., product name, price)
    products = []
    for product in soup.select(".product"):
        products.append({
            "name": product.find("h2").text.strip(),
            "price": product.find(class_="price").text.strip(),
            "url": product.find("a").get("href")
        })
    return products

def store_results(products):
    # newline="" prevents blank rows in the CSV on Windows
    with open("products.csv", "w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=products[0].keys())
        writer.writeheader()
        writer.writerows(products)

all_products = []
for page in pages:
    all_products.extend(extract_data(page))
store_results(all_products)

Running the Above Code

When you run the script, the program will take around 24 seconds (2 seconds for each of the 12 pages). While this is fine for small websites, it becomes a problem when you need to scrape large numbers of pages.
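If you want to measure this yourself, here is a small sketch that wraps the scraping loop from the script above with time.perf_counter() (it replaces the last four lines; extract_data, pages, and store_results are the ones defined above):

import time

start = time.perf_counter()

all_products = []
for page in pages:
    all_products.extend(extract_data(page))
store_results(all_products)

print(f"Sequential run took {time.perf_counter() - start:.1f} seconds")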

The Problem with Sequential Scraping

While the above approach works, it's not efficient. Most of the run is spent idle, waiting for the server's responses, and if you're scraping many pages, that idle time adds up quickly. A more efficient approach is to request multiple pages at the same time.

Concurrency: Speeding Up Scraping with Asyncio and Aiohttp

To tackle the inefficiencies of sequential scraping, we can use concurrency. This allows you to send multiple requests at once and handle the responses as they come in. We’ll use asyncio and aiohttp, which are designed for asynchronous programming in Python.

What is Aiohttp?

Aiohttp is an asynchronous HTTP client/server library for Python. It is designed to work with asyncio, letting you send HTTP requests without blocking the event loop while waiting for responses. Unlike the requests library, which is synchronous, aiohttp is built for asynchronous operation.
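As a minimal illustration (the URL is just a placeholder), fetching a single page with aiohttp looks like this:

import asyncio
import aiohttp

async def fetch(url):
    # A session can be reused for many requests
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

html = asyncio.run(fetch("https://example.com"))
print(len(html), "characters fetched")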

Scraping with Asyncio and Aiohttp

Now, let’s modify the above script to use asyncio and aiohttp for concurrency.

Step 1: Install Required Libraries

Before we begin, we need to install aiohttp and beautifulsoup4:

pip install aiohttp beautifulsoup4

Step 2: Implementing Concurrency

Here’s how you can rewrite the script to use concurrency:

import aiohttp
import asyncio
from bs4 import BeautifulSoup
import csv

base_url = "https://example.com/products/page"
pages = range(1, 13)  # Scrape 12 pages

async def extract_data(page, session):
    url = f"{base_url}/{page}/"
    async with session.get(url) as response:
        soup = BeautifulSoup(await response.text(), "html.parser")
        products = []
        for product in soup.select(".product"):
            products.append({
                "name": product.find("h2").text.strip(),
                "price": product.find(class_="price").text.strip(),
                "url": product.find("a").get("href")
            })
        return products

async def main():
    async with aiohttp.ClientSession() as session:
        # One task per page, all run concurrently
        tasks = [extract_data(page, session) for page in pages]
        results = await asyncio.gather(*tasks)
        all_products = [item for sublist in results for item in sublist]  # Flatten the list
        store_results(all_products)

def store_results(products):
    # newline="" prevents blank rows in the CSV on Windows
    with open("products.csv", "w", newline="") as file:
        writer = csv.DictWriter(file, fieldnames=products[0].keys())
        writer.writeheader()
        writer.writerows(products)

# Run the asyncio event loop
asyncio.run(main())

Key Changes in the Code

  1. Asynchronous Requests: We’ve replaced requests.get() with session.get() from aiohttp. The async with statement ensures that the request is handled asynchronously.
  2. Async Function: The extract_data function is now asynchronous, denoted by async def. The await keyword is used to wait for the response from the server without blocking the entire program.
  3. Concurrent Tasks: We use asyncio.gather() to execute multiple tasks concurrently. Each task is responsible for scraping one page.
  4. Flattening Results: After all pages have been scraped concurrently, asyncio.gather() returns a list of lists (one list of products per page), so we flatten it into a single list of all products; see the short sketch after this list.
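If the nested list comprehension in main() looks dense, here is the flattening step in isolation, with made-up sample data; itertools.chain.from_iterable() is an equivalent alternative:

from itertools import chain

# asyncio.gather() returns one list of products per page, e.g.:
results = [["product 1", "product 2"], ["product 3"], ["product 4", "product 5"]]

# Nested list comprehension, as used in main()
flat = [item for sublist in results for item in sublist]

# Equivalent version using itertools
flat_alt = list(chain.from_iterable(results))

print(flat == flat_alt)  # True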

Step 3: Running the Code

Now, when you run the updated script, it will scrape all 12 pages concurrently and finish much faster than the sequential version. The exact speed-up depends on your network and the server's response times, but you will typically see a significant improvement.
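To see the difference on your own machine, you can time the concurrent version the same way as the sequential one (a small sketch; it replaces the plain asyncio.run(main()) call at the bottom of the script):

import time

start = time.perf_counter()
asyncio.run(main())
print(f"Concurrent run took {time.perf_counter() - start:.1f} seconds")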

Performance Comparison

  • Sequential Scraping: As mentioned, scraping 12 pages sequentially would take around 24 seconds.
  • Concurrent Scraping with Asyncio: With the concurrent version, the script can complete the same task in around 5 to 10 seconds, depending on the network latency.

Limiting Concurrency to Prevent Overloading the Server

While concurrency speeds up scraping, sending too many requests at once can overload the server or block your IP. To avoid this, we can limit the number of concurrent requests using a semaphore in asyncio.

Example with Semaphore

max_concurrency = 5  # Limit concurrent requests
sem = asyncio.Semaphore(max_concurrency)

async def extract_data(page, session):
    async with sem:  # This limits the number of concurrent requests
        url = f"{base_url}/{page}/"
        async with session.get(url) as response:
            soup = BeautifulSoup(await response.text(), "html.parser")
            products = []
            for product in soup.select(".product"):
                products.append({
                    "name": product.find("h2").text.strip(),
                    "price": product.find(class_="price").text.strip(),
                    "url": product.find("a").get("href")
                })
            return products

By limiting the concurrency, we ensure that only a specific number of requests are sent at once. This reduces the risk of overwhelming the server or getting blocked.
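As an alternative to the semaphore, aiohttp can also cap concurrency at the connection level. The sketch below passes a TCPConnector with a connection limit when creating the session; it assumes the same pages, extract_data, and store_results as the earlier script:

async def main():
    # Allow at most 5 simultaneous connections instead of using a semaphore
    connector = aiohttp.TCPConnector(limit=5)
    async with aiohttp.ClientSession(connector=connector) as session:
        tasks = [extract_data(page, session) for page in pages]
        results = await asyncio.gather(*tasks)
        all_products = [item for sublist in results for item in sublist]
        store_results(all_products)

Either approach keeps the number of in-flight requests bounded; the semaphore limits tasks, while the connector limits open connections.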

Conclusion

So, here you have it: we've learned how to speed up web scraping using concurrency with asyncio and aiohttp in Python. By sending multiple requests at the same time, we can drastically reduce the time required to scrape a website, especially when dealing with a large number of pages. Remember to respect each website's terms of service and avoid overloading servers with too many requests. Happy scraping!
