Python Requests Pagination to Scrape Multiple Pages

In this article, I’ll show you how to handle pagination and scrape data from multiple pages, step by step, so you can get the complete dataset you need.

What is Web Scraping?

Web scraping is the process of extracting data from websites. Python offers libraries such as requests for making HTTP requests to access web pages and BeautifulSoup for parsing HTML to extract the necessary content. While scraping simple websites is easy, many modern websites use pagination to distribute content across different pages. Without knowing how to handle pagination, you may only scrape data from a single page and miss out on the rest.

Why is Pagination Important?

When scraping data, websites often split information across multiple pages to avoid loading all content at once, which can slow down page performance. For example, an e-commerce website might list thousands of products spread across many numbered pages. Pagination ensures that users can browse through different pages to view more products. For web scraping, this means that to collect all the products, you’ll need to crawl through each page individually and scrape the data.

A Smarter Way to Handle Pagination 🚀🧠

While Python’s requests library is great for scraping simple paginated websites, it struggles when pages load content dynamically using JavaScript, infinite scroll, or “Load More” buttons. That’s where Scraping Browser comes in! 🖥️✨

The Scraping Browser acts like a real browser — it loads JavaScript, scrolls, clicks, and navigates through pages just like a human would. This makes it perfect for handling complex pagination scenarios that requests alone can't manage. Plus, it’s optimized for large-scale scraping, with built-in proxy management and session control. 🌍🔄

If you’re scraping modern websites and need a more reliable way to deal with pagination, using a scraping browser could save you a lot of time and headaches.

Different Types of Pagination

Before you start scraping, it’s important to understand the types of pagination commonly found on websites:

  1. Classic Pagination with Page Numbers: This is the most common form of pagination. The URL usually includes the page number (e.g., page=2 or /page/2/). The website will display a set number of items per page, and the user can click on the page number to navigate to the next set of results.
  2. Next Page Button: Some websites use a “Next” button to move to the next page. Rather than constructing page-numbered URLs yourself, you follow the link behind the “Next” button to reach each subsequent set of results.
  3. Infinite Scroll: This technique is used on websites where the content loads as you scroll down. It dynamically loads new content without changing the page or URL.
  4. Load More Button: Websites using a load more button allow users to click a button to load additional content, rather than going to a new page.

Setting Up Python for Web Scraping

Before diving into scraping, let’s ensure that we have the required tools set up:

Install Required Libraries: First, we must install two libraries: requests and beautifulsoup4. You can install them using the following commands:

pip install requests
pip install beautifulsoup4

Import Libraries: Once installed, we can import them into our script. requests will handle making HTTP requests to the website, and BeautifulSoup will parse the HTML and allow us to extract specific elements.

import requests
from bs4 import BeautifulSoup

Scraping Data from Multiple Pages

Now, let’s walk through scraping a website with multiple paginated pages. For this example, we’ll assume we’re scraping a fictional e-commerce website that lists products across multiple pages. We’ll retrieve product names and prices.

Step 1: Inspecting the Website

Before writing any code, you need to understand the website’s structure. Open the website in your browser, right-click on the page, and choose “Inspect” to view the HTML structure. Look for the section that contains the product listings and the pagination controls (such as page numbers or a next page button).

You’ll typically find pagination links inside an HTML <nav> or <ul> element, for example:

<nav class="pagination">
  <a href="page/1">1</a>
  <a href="page/2">2</a>
  <a href="page/3">3</a>
  <a href="next" class="next">Next</a>
</nav>

In this case, we can use the “Next” link to move to the next page, and the page numbers can be used to determine the total number of pages.
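
As a minimal sketch, here’s how you might read the total page count out of a pagination block like the one above, reusing the requests and BeautifulSoup imports from earlier. The nav.pagination selector matches this sample markup; adjust it to whatever your target site actually uses:

def get_total_pages(url):
    # Fetch the listing page and parse its pagination <nav>
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    nav = soup.find('nav', class_='pagination')
    if nav is None:
        return 1  # No pagination block found; assume a single page
    # Keep only the numeric page links and take the largest number
    numbers = [int(a.get_text(strip=True)) for a in nav.find_all('a')
               if a.get_text(strip=True).isdigit()]
    return max(numbers) if numbers else 1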

Step 2: Scraping the First Page

Let’s start by scraping the first page. We will write a function that requests the website, parses the HTML, and extracts the product names and prices.

def scrape_page(url):
    response = requests.get(url)
    if response.status_code != 200:
        print(f"Failed to retrieve page: {response.status_code}")
        return
    soup = BeautifulSoup(response.text, 'html.parser')
    product_cards = soup.find_all('div', class_='product-card')
    for product in product_cards:
        name = product.find('h2', class_='product-title').text
        price = product.find('span', class_='price').text
        print(f"Name: {name}, Price: {price}")

In this code, we send a request to the page, check if the request was successful, and parse the content using BeautifulSoup. We then find all product cards and extract the product name and price.
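
If you’d rather collect the results than print them, a small variation on the same function (using the same assumed class names) can return a list of dictionaries instead:

def scrape_page_data(url):
    # Same as scrape_page(), but returns the data instead of printing it
    products = []
    response = requests.get(url)
    if response.status_code != 200:
        return products
    soup = BeautifulSoup(response.text, 'html.parser')
    for product in soup.find_all('div', class_='product-card'):
        products.append({
            'name': product.find('h2', class_='product-title').text.strip(),
            'price': product.find('span', class_='price').text.strip(),
        })
    return products

Returning data rather than printing it makes it easier to save everything to a CSV file or database once all pages have been scraped.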

Step 3: Handling Pagination

Now that we can scrape the first page, we need to handle pagination. Websites with a “Next” page link typically include a URL for the next page. We can modify our function to follow the “Next” link and scrape the next page. Let’s add some logic to handle this.

from urllib.parse import urljoin

def scrape_all_pages(start_url):
    url = start_url
    while url:
        print(f"Scraping page: {url}")
        scrape_page(url)
        # Get the next page URL
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        next_page = soup.find('a', class_='next')
        if next_page:
            # Resolve relative links such as "page/2" against the current URL
            url = urljoin(url, next_page.get('href'))
        else:
            print("No more pages to scrape.")
            break

In this function, we start scraping from the given URL. After scraping the content, we check for a “Next” link. If one exists, we resolve its href against the current URL (it may be a relative link like page/2) and follow it to scrape the next page. The loop continues until there are no more pages to scrape.
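
One thing to note: this version downloads every page twice — once inside scrape_page() and once more to look for the “Next” link. If that matters for your use case, a leaner sketch (same assumed selectors as before) fetches each page once and reuses the parsed HTML:

def scrape_all_pages_single_fetch(start_url):
    url = start_url
    while url:
        print(f"Scraping page: {url}")
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract products from the HTML we already fetched
        for product in soup.find_all('div', class_='product-card'):
            name = product.find('h2', class_='product-title').text
            price = product.find('span', class_='price').text
            print(f"Name: {name}, Price: {price}")
        # Follow the "Next" link, resolving relative hrefs against the current URL
        next_page = soup.find('a', class_='next')
        url = urljoin(url, next_page.get('href')) if next_page else None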

Step 4: Scraping All Pages

Now that we have a function to scrape all the pages, we can call it with the starting URL of the first page. Here’s how to run the scraper:

start_url = 'https://example.com/products'
scrape_all_pages(start_url)

This will start scraping from the first page and continue until all pages have been scraped.
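
One practical note: real sites often throttle or block rapid-fire requests. A simple tweak is to pause between pages and identify your client with a User-Agent header — the header value below is just a placeholder, so use your own:

import time

HEADERS = {'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)'}  # Placeholder UA string

def polite_get(url, delay=1.0):
    # Sleep before each request so we don't overload the server
    time.sleep(delay)
    return requests.get(url, headers=HEADERS, timeout=10)

You can then swap requests.get(url) for polite_get(url) in the functions above.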

Advanced Pagination Techniques

Changing Page Numbers in the URL

Some websites use page numbers directly in the URL (e.g., https://example.com/products?page=2). In such cases, we can simply iterate through the pages by changing the page number in the URL. Here’s how to do it:

def scrape_by_page_number(start_url, total_pages):
    for page in range(1, total_pages + 1):
        url = f"{start_url}?page={page}"
        print(f"Scraping page: {url}")
        scrape_page(url)

In this function, we iterate through a specified number of pages and construct the URL for each page by appending the page number. We then scrape the page and move to the next.
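
If you don’t know the total number of pages up front, a common fallback is to keep incrementing the page number until a page comes back empty. This sketch reuses the div.product-card markup we assumed earlier:

def scrape_until_empty(start_url, max_pages=500):
    # Safety cap so a misbehaving site can't keep us looping forever
    for page in range(1, max_pages + 1):
        url = f"{start_url}?page={page}"
        response = requests.get(url)
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('div', class_='product-card')
        if not cards:
            print(f"Page {page} is empty; stopping.")
            break
        print(f"Scraped {len(cards)} products from page {page}")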

Infinite Scroll and AJAX Requests

Some websites use infinite scrolling, loading more content as you scroll down the page. This content is often loaded dynamically via AJAX requests. To scrape these pages, we need to observe the network requests made by the browser when scrolling.

You can monitor these requests in your browser’s Developer Tools (Network tab). Once you identify the request URL for fetching more content, you can simulate these requests in Python using requests to load additional data.

def scrape_infinite_scroll(url):
    page = 1
    while True:
        # Construct the URL for the next batch of content
        request_url = f"{url}?page={page}"
        response = requests.get(request_url)
        if response.status_code != 200:
            print("Failed to retrieve more data.")
            break
        data = response.json()  # Assuming the data is returned as JSON
        for product in data['items']:
            print(f"Name: {product['name']}, Price: {product['price']}")
        if not data['has_more']:
            print("No more items to scrape.")
            break
        page += 1

In this example, we send an HTTP request to the dynamic URL for each page of data. The response is assumed to be in JSON format, containing the product information and a flag (has_more) indicating whether there is more data to load.
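
For reference, after response.json() the function above expects data to look roughly like this — the items and has_more field names are assumptions, so check your target’s actual response in the Network tab:

data = {
    "items": [
        {"name": "Blue Widget", "price": "$19.99"},
        {"name": "Red Widget", "price": "$24.99"},
    ],
    "has_more": True,  # False on the final batch
}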

Handling the “Load More” Button

Some websites use a “Load More” button instead of pagination links. You can simulate the button click by sending requests to the appropriate URL or API endpoint, just like with infinite scroll. Monitor the network requests to understand how the website loads more content when the button is clicked.
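
Many “Load More” buttons simply call a JSON endpoint with an offset or page parameter, so once you’ve found that endpoint in the Network tab, the loop looks almost identical to the infinite-scroll case. Here’s a sketch where the offset and limit parameter names are assumptions you’d confirm in DevTools:

def scrape_load_more(api_url, batch_size=20):
    offset = 0
    while True:
        # Each iteration mimics one click of the "Load More" button
        response = requests.get(api_url, params={'offset': offset, 'limit': batch_size})
        if response.status_code != 200:
            break
        items = response.json().get('items', [])
        if not items:
            print("No more items to load.")
            break
        for product in items:
            print(f"Name: {product['name']}, Price: {product['price']}")
        offset += batch_size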

Handling pagination is an essential skill for web scraping. Whether you’re dealing with simple page-number-based pagination or a more complex “Load More” button or infinite scroll, the key is to carefully observe the website’s structure and understand how new content is loaded. Using Python’s requests and BeautifulSoup libraries, you can automate the process of scraping multiple pages and collect the data you need.

By applying the techniques discussed in this article, you can scrape content from paginated websites efficiently. Just remember to respect the website’s terms of service and avoid overloading their servers with too many requests in a short period.
