How to Scrape Real Estate Data: A Complete Guide

Scraping real estate data from websites presents unique challenges. Property sites use JavaScript rendering, dynamic content loading, and anti-bot measures that block traditional scrapers. This guide covers practical solutions using Python and Bright Data’s tools.

Note: I am not affiliated with Bright Data. It is the platform I am most familiar with, so I chose to use it here.

Why Real Estate Scraping Is Challenging

Real estate websites present three main obstacles:

1. Summary vs. Detail Pages
Search results show limited data (price, address, thumbnail). Full property details — square footage, property history, agent information — require visiting individual listing pages, multiplying the number of requests needed.

2. JavaScript Rendering
Most real estate platforms render content client-side. A basic HTTP request returns incomplete HTML because data loads via JavaScript after the initial page load.

3. Anti-Bot Protection
Sites implement IP filtering, CAPTCHAs, rate limiting, and browser fingerprinting to detect and block scrapers.
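
A quick way to check for obstacle 2 is to count the listing elements in the raw HTML before building a full parser. The sketch below uses only the standard library and assumes the listing cards carry a data-testid="property-card" attribute, as in the example scraper later in this guide. The CardCounter and count_cards names and the two HTML snippets are illustrative, not taken from any real site.

```python
from html.parser import HTMLParser


class CardCounter(HTMLParser):
    """Counts opening tags whose data-testid attribute equals a target value."""

    def __init__(self, test_id: str):
        super().__init__()
        self.test_id = test_id
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        if dict(attrs).get("data-testid") == self.test_id:
            self.count += 1


def count_cards(html: str, test_id: str = "property-card") -> int:
    parser = CardCounter(test_id)
    parser.feed(html)
    return parser.count


# A server-rendered page contains the cards; a JavaScript shell does not.
rendered = '<ul><li data-testid="property-card">A</li><li data-testid="property-card">B</li></ul>'
js_shell = '<div id="root"></div><script src="/app.js"></script>'

print(count_cards(rendered))  # 2
print(count_cards(js_shell))  # 0
```

If the count is zero for a page that shows listings in the browser, the content is most likely loaded client-side and a plain HTTP request will not be enough.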

Basic Scraper with Requests and BeautifulSoup

This approach works for simple sites without heavy anti-bot measures.

Inspecting the Target Page

Before writing code, inspect the page structure:

  1. Open the search results in your browser
  2. Right-click on a property listing and select “Inspect”
  3. Identify the HTML elements containing the data you need (price, address, links)

Installation

pip install requests beautifulsoup4

Basic Scraper Code

import requests
from bs4 import BeautifulSoup
import csv
url = "https://www.example.com/homes/for_sale/San-Francisco/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    results = []
    property_items = soup.find_all("li", {"data-testid": "property-card"})
    for item in property_items:
        try:
            address = item.find("address").get_text(strip=True)
            price = item.find("span", {"data-test": "property-card-price"}).get_text(strip=True)
            url_link = item.find("a").get("href")
            full_url = f"https://www.example.com{url_link}" if url_link and not url_link.startswith("http") else url_link
            if address or price:
                results.append({
                    "address": address,
                    "price": price,
                    "link": full_url
                })
        except Exception as e:
            print(f"Error parsing card: {e}")
    if results:
        with open("real_estate_data.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["address", "price", "link"])
            writer.writeheader()
            writer.writerows(results)
        print(f"Scraped {len(results)} listings")
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")

Limitation: This approach fails on sites with JavaScript rendering or anti-bot protection.
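
A side note on the link handling in the scraper above: prefixing the domain by hand works for root-relative paths, but urllib.parse.urljoin from the standard library also covers absolute and protocol-relative hrefs. This is a minimal sketch; the absolute_link helper and the sample hrefs are hypothetical.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.example.com/homes/for_sale/San-Francisco/"


def absolute_link(href):
    """Resolve an href against the search page URL; return None if it is missing."""
    if not href:
        return None
    return urljoin(BASE_URL, href)


# Root-relative path: resolved against the site root
print(absolute_link("/homedetails/123-main-st/"))
# https://www.example.com/homedetails/123-main-st/

# Already absolute: returned unchanged
print(absolute_link("https://www.example.com/homedetails/456/"))
# https://www.example.com/homedetails/456/

# Protocol-relative: inherits the scheme of the base URL
print(absolute_link("//cdn.example.com/photo.jpg"))
# https://cdn.example.com/photo.jpg
```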

Handling Anti-Bot Measures with Bright Data

When basic scraping fails, Bright Data offers several solutions depending on your needs.

Option 1: Unlocker APIs (Recommended for Static Pages)

Unlocker APIs handle proxy rotation, CAPTCHA solving, and anti-bot bypassing automatically. You send one request; it returns clean HTML or JSON.

Best for:

  • Pages that don’t require browser interaction
  • High-volume scraping with predictable costs
  • Teams without proxy infrastructure

Direct API Access (Recommended Method)

import requests
from bs4 import BeautifulSoup
API_KEY = "YOUR_API_KEY"
ZONE_NAME = "YOUR_ZONE_NAME"
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}
payload = {
    "zone": ZONE_NAME,
    "url": "https://www.example.com/homes/for_sale/San-Francisco/",
    "format": "raw"
}
response = requests.post(
    "https://api.brightdata.com/request",
    headers=headers,
    json=payload
)
if response.status_code == 200:
    html_content = response.text
    # Parse with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract data as shown above

Native Proxy-Based Access

For workflows already using proxy routing:

import requests
host = 'brd.superproxy.io'
port = 33335
username = 'brd-customer--zone-'
password = ''
proxy_url = f'http://{username}:{password}@{host}:{port}'
proxies = {
    'http': proxy_url,
    'https': proxy_url
}
url = "https://www.example.com/homes/for_sale/San-Francisco/"
response = requests.get(url, proxies=proxies)

Note: For native proxy access, install the Bright Data SSL certificate to avoid SSL errors, or set verify=False in your requests (not recommended for production).

Handling JavaScript-Rendered Content

If pages return incomplete data, use the x-unblock-expect header to wait for specific elements:

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}
payload = {
    "zone": ZONE_NAME,
    "url": "https://www.example.com/property/12345",
    "format": "raw",
    "headers": {
        "x-unblock-expect": '{"element": ".property-details"}'
    }
}
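
The header value above is a JSON string written by hand. If the selector varies per page type, building it with json.dumps avoids quoting mistakes. This sketch assumes the same payload shape as the example above; the build_payload helper is hypothetical and not part of any Bright Data SDK.

```python
import json


def build_payload(zone: str, url: str, wait_selector: str) -> dict:
    """Build a request payload whose x-unblock-expect header names a CSS selector."""
    return {
        "zone": zone,
        "url": url,
        "format": "raw",
        "headers": {
            # json.dumps takes care of quoting and escaping the selector
            "x-unblock-expect": json.dumps({"element": wait_selector}),
        },
    }


payload = build_payload(
    "YOUR_ZONE_NAME",
    "https://www.example.com/property/12345",
    ".property-details",
)
print(payload["headers"]["x-unblock-expect"])
# {"element": ".property-details"}
```

Selectors that contain double quotes, such as [data-testid="property-card"], are escaped correctly this way.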

Option 2: Browser API (For Interactive Pages)

When you need full browser interaction — clicking buttons, scrolling, handling login flows — use the Browser API. For other tools, read my article about the best scraping browsers.

Best for:

  • JavaScript-heavy sites requiring interaction
  • Multi-step navigation flows
  • Sites with complex anti-bot detection

Puppeteer Example

const puppeteer = require('puppeteer-core');
const AUTH = 'YOUR_USERNAME:YOUR_PASSWORD';
const TARGET_URL = 'https://www.example.com/homes/for_sale/San-Francisco/';
async function scrapeRealEstate() {
    const browserWSEndpoint = `wss://${AUTH}@brd.superproxy.io:9222`;
    const browser = await puppeteer.connect({ browserWSEndpoint });
    
    try {
        const page = await browser.newPage();
        await page.goto(TARGET_URL, { timeout: 120000 });
        
        // Wait for listings to load
        await page.waitForSelector('[data-testid="property-card"]');
        
        // Extract data
        const listings = await page.evaluate(() => {
            const cards = document.querySelectorAll('[data-testid="property-card"]');
            return Array.from(cards).map(card => ({
                address: card.querySelector('address')?.textContent?.trim(),
                price: card.querySelector('[data-test="property-card-price"]')?.textContent?.trim(),
                link: card.querySelector('a')?.href
            }));
        });
        
        console.log(listings);
    } finally {
        await browser.close();
    }
}
scrapeRealEstate();

Playwright Example

from playwright.sync_api import sync_playwright
AUTH = 'YOUR_USERNAME:YOUR_PASSWORD'
def scrape_real_estate():
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(f'wss://{AUTH}@brd.superproxy.io:9222')
        page = browser.new_page()
        
        page.goto('https://www.example.com/homes/for_sale/San-Francisco/', timeout=120000)
        page.wait_for_selector('[data-testid="property-card"]')
        
        listings = page.evaluate('''() => {
            const cards = document.querySelectorAll('[data-testid="property-card"]');
            return Array.from(cards).map(card => ({
                address: card.querySelector('address')?.textContent?.trim(),
                price: card.querySelector('[data-test="property-card-price"]')?.textContent?.trim(),
                link: card.querySelector('a')?.href
            }));
        }''')
        
        browser.close()
        return listings

Option 3: Web Scraper API (Pre-Built Scrapers)

For popular real estate sites like Zillow, Realtor.com, or Redfin, Bright Data and other providers offer pre-built scrapers that return structured data directly.

import requests
API_KEY = "YOUR_API_KEY"
DATASET_ID = "gd_xxxxx"  # Real estate scraper ID
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}
# Synchronous scraping (up to 20 URLs)
response = requests.post(
    f"https://api.brightdata.com/datasets/v3/scrape?dataset_id={DATASET_ID}&format=json",
    headers=headers,
    json=[{"url": "https://www.zillow.com/homedetails/123-main-st/12345_zpid/"}]
)
data = response.json()

Pagination and Multiple Pages

Real estate listings span multiple pages. Handle pagination by iterating through page numbers:

import requests
from bs4 import BeautifulSoup
import time
API_KEY = "YOUR_API_KEY"
ZONE_NAME = "YOUR_ZONE_NAME"
BASE_URL = "https://www.example.com/homes/for_sale/San-Francisco/"
all_results = []
for page_num in range(1, 6):
    url = f"{BASE_URL}?page={page_num}"
    
    payload = {
        "zone": ZONE_NAME,
        "url": url,
        "format": "raw"
    }
    
    response = requests.post(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json=payload
    )
    
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Parse and append results
        # ...
    
    time.sleep(1)  # Basic rate limiting
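
The loop above uses a fixed page range. When the number of pages is unknown, one option is to stop as soon as a page contributes no new listings. The sketch below separates that logic from the network code: fetch_page is a hypothetical callable that returns the parsed listings for a page number, and FAKE_SITE stands in for a real site so the example runs offline.

```python
from typing import Callable, Dict, List


def collect_pages(fetch_page: Callable[[int], List[Dict]], max_pages: int = 50) -> List[Dict]:
    """Fetch pages 1..max_pages, stopping early when a page adds no new listings."""
    seen_links = set()
    collected = []
    for page_num in range(1, max_pages + 1):
        added = 0
        for item in fetch_page(page_num):
            if item["link"] in seen_links:
                continue  # skip listings already collected from an earlier page
            seen_links.add(item["link"])
            collected.append(item)
            added += 1
        if added == 0:
            break  # empty page, or the site repeated earlier results
    return collected


# Stand-in for a real fetch-and-parse step: two pages of data, then nothing.
FAKE_SITE = {
    1: [{"link": "/a"}, {"link": "/b"}],
    2: [{"link": "/b"}, {"link": "/c"}],
}

listings = collect_pages(lambda n: FAKE_SITE.get(n, []))
print([item["link"] for item in listings])  # ['/a', '/b', '/c']
```

In a real crawl, fetch_page would wrap the request and parsing steps shown earlier, including the delay between requests.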

Extracting Full Property Details

Listing pages contain limited data. Scrape individual property pages for complete information:

import requests
from bs4 import BeautifulSoup

def text_or_none(soup, tag, test_id):
    # Return the stripped text of the first matching element, or None if it is absent
    element = soup.find(tag, {"data-testid": test_id})
    return element.get_text(strip=True) if element else None

def scrape_property_details(property_url, api_key, zone_name):
    payload = {
        "zone": zone_name,
        "url": property_url,
        "format": "raw"
    }

    response = requests.post(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json=payload
    )

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        return {
            "description": text_or_none(soup, "div", "description"),
            "bedrooms": text_or_none(soup, "span", "beds"),
            "bathrooms": text_or_none(soup, "span", "baths"),
            "sqft": text_or_none(soup, "span", "sqft"),
        }
    return None

# Enrich listing data (results, API_KEY, and ZONE_NAME come from the earlier examples)
for listing in results:
    details = scrape_property_details(listing["link"], API_KEY, ZONE_NAME)
    if details:
        listing.update(details)
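
Detail scraping issues one request per listing, so occasional failures are more likely than on a handful of search pages. Since scrape_property_details returns None on failure, a small retry wrapper with exponential backoff can sit around it. This is a minimal sketch: fetch_with_retries is a hypothetical helper, the delays are arbitrary, and the stand-in fetcher replaces the real network call so the example runs offline.

```python
import time
from typing import Callable, Optional


def fetch_with_retries(
    fetch: Callable[[], Optional[dict]],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Call fetch() until it returns a non-None result or the attempts run out."""
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            # Double the wait after each failed attempt
            sleep(base_delay * 2 ** attempt)
    return None


# Stand-in fetcher that fails twice, then succeeds.
responses = iter([None, None, {"bedrooms": "3"}])
delays = []  # records requested delays instead of actually sleeping
details = fetch_with_retries(lambda: next(responses), sleep=delays.append)

print(details)  # {'bedrooms': '3'}
print(delays)   # [1.0, 2.0]
```

In the enrichment loop, the call would look like fetch_with_retries(lambda: scrape_property_details(listing["link"], API_KEY, ZONE_NAME)).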

Choosing the Right Product

Summary

  1. Start simple with requests and BeautifulSoup for basic sites
  2. Use Unlocker API when you encounter anti-bot protection on static pages
  3. Switch to Browser API when pages require JavaScript interaction
  4. Consider Web Scraper API for popular sites with pre-built scrapers
  5. Handle pagination to collect data across multiple pages
  6. Scrape detail pages for complete property information
