How to Scrape Real Estate Data: A Complete Guide
Scraping real estate data from websites presents unique challenges. Property sites use JavaScript rendering, dynamic content loading, and anti-bot measures that block traditional scrapers. This guide covers practical solutions using Python and Bright Data’s tools.
Note: I am not affiliated with Bright Data. It’s the platform I am most familiar with, so I chose to use it here.
Why Real Estate Scraping Is Challenging
Real estate websites present three main obstacles:
1. Summary vs. Detail Pages
Search results show limited data (price, address, thumbnail). Full property details — square footage, property history, agent information — require visiting individual listing pages, multiplying the number of requests needed.
2. JavaScript Rendering
Most real estate platforms render content client-side. A basic HTTP request returns incomplete HTML because data loads via JavaScript after the initial page load.
3. Anti-Bot Protection
Sites implement IP filtering, CAPTCHAs, rate limiting, and browser fingerprinting to detect and block scrapers.
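To make the first obstacle concrete, here is a rough request-budget sketch. It assumes one request per search page plus one per listing detail page; the page and listing counts are illustrative numbers, not figures from any particular site.

```python
def estimate_requests(result_pages: int, listings_per_page: int) -> int:
    """One request per search page plus one per listing detail page."""
    return result_pages + result_pages * listings_per_page


def estimate_minutes(total_requests: int, delay_seconds: float) -> float:
    """Lower bound on run time when requests are sent one at a time."""
    return total_requests * delay_seconds / 60


# Illustrative numbers: 5 result pages with 40 listings each
total = estimate_requests(result_pages=5, listings_per_page=40)
print(total)  # 205
print(round(estimate_minutes(total, delay_seconds=1), 1))  # 3.4
```

Even a small search turns into a few hundred requests once detail pages are included, which is why rate limiting and blocking matter so quickly.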
Basic Scraper with Requests and BeautifulSoup
This approach works for simple sites without heavy anti-bot measures.
Inspecting the Target Page
Before writing code, inspect the page structure:
- Open the search results in your browser
- Right-click on a property listing and select “Inspect”
- Identify the HTML elements containing the data you need (price, address, links)
Installation
pip install requests beautifulsoup4
Basic Scraper Code
import requests
from bs4 import BeautifulSoup
import csv

url = "https://www.example.com/homes/for_sale/San-Francisco/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

response = requests.get(url, headers=headers)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, 'html.parser')
    results = []
    property_items = soup.find_all("li", {"data-testid": "property-card"})

    for item in property_items:
        try:
            address = item.find("address").get_text(strip=True)
            price = item.find("span", {"data-test": "property-card-price"}).get_text(strip=True)
            url_link = item.find("a").get("href")
            full_url = f"https://www.example.com{url_link}" if url_link and not url_link.startswith("http") else url_link

            if address or price:
                results.append({
                    "address": address,
                    "price": price,
                    "link": full_url
                })
        except Exception as e:
            print(f"Error parsing card: {e}")

    if results:
        with open("real_estate_data.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=["address", "price", "link"])
            writer.writeheader()
            writer.writerows(results)
        print(f"Scraped {len(results)} listings")
else:
    print(f"Failed to fetch page. Status code: {response.status_code}")
Limitation: This approach fails on sites with JavaScript rendering or anti-bot protection.
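One quick way to tell whether a page falls under this limitation is to count the listing cards in the raw HTML before parsing anything else. The sketch below uses only the standard library; the data-testid value matches the scraper above, and both HTML snippets are made-up samples.

```python
from html.parser import HTMLParser


class CardCounter(HTMLParser):
    """Counts start tags that carry a given attribute/value pair."""

    def __init__(self, attr: str, value: str):
        super().__init__()
        self.attr = attr
        self.value = value
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples
        if (self.attr, self.value) in attrs:
            self.count += 1


def count_cards(html: str, attr: str = "data-testid", value: str = "property-card") -> int:
    parser = CardCounter(attr, value)
    parser.feed(html)
    return parser.count


# Made-up samples: a server-rendered page and an empty JavaScript shell
static_html = '<ul><li data-testid="property-card"><address>1 Main St</address></li></ul>'
shell_html = '<div id="root"></div><script src="/app.js"></script>'

print(count_cards(static_html))  # 1
print(count_cards(shell_html))   # 0
```

A 200 response with zero cards usually means the listings are injected by JavaScript or the request was served a block page, so it is time to move to one of the options below.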
Handling Anti-Bot Measures with Bright Data
When basic scraping fails, Bright Data offers several solutions depending on your needs.
Option 1: Unlocker APIs (Recommended for Static Pages)
Unlocker APIs handle proxy rotation, CAPTCHA solving, and anti-bot bypassing automatically. You send one request; it returns clean HTML or JSON.
Best for:
- Pages that don’t require browser interaction
- High-volume scraping with predictable costs
- Teams without proxy infrastructure
Direct API Access (Recommended Method)
import requests
from bs4 import BeautifulSoup

API_KEY = "YOUR_API_KEY"
ZONE_NAME = "YOUR_ZONE_NAME"

headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}
payload = {
    "zone": ZONE_NAME,
    "url": "https://www.example.com/homes/for_sale/San-Francisco/",
    "format": "raw"
}

response = requests.post(
    "https://api.brightdata.com/request",
    headers=headers,
    json=payload
)

if response.status_code == 200:
    html_content = response.text
    # Parse with BeautifulSoup
    soup = BeautifulSoup(html_content, 'html.parser')
    # Extract data as shown above
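The example above does nothing when the status code is not 200. A minimal retry sketch with exponential backoff is shown here; with_retries is a hypothetical helper, not part of any Bright Data SDK, and the attempt count and delays are arbitrary starting values.

```python
import random
import time


def with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-None value or attempts run out."""
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < attempts - 1:
            # Exponential backoff plus jitter: roughly 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
    return None


# Demo with a fake fetcher that fails twice and then succeeds
outcomes = iter([None, None, "<html>ok</html>"])
print(with_retries(lambda: next(outcomes), base_delay=0.01))  # <html>ok</html>
```

In practice, fetch would wrap the requests.post call and return response.text on a 200 status and None otherwise.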
Native Proxy-Based Access
For workflows already using proxy routing:
import requests
host = 'brd.superproxy.io'
port = 33335
username = 'brd-customer--zone-'
password = ''
proxy_url = f'http://{username}:{password}@{host}:{port}'
proxies = {
    'http': proxy_url,
    'https': proxy_url
}
url = "https://www.example.com/homes/for_sale/San-Francisco/"
response = requests.get(url, proxies=proxies)
Note: For native proxy access, install the Bright Data SSL certificate to avoid SSL errors, or set verify=False in your requests (not recommended for production).
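Credentials that contain characters such as "@" or ":" will break a proxy URL built with a plain f-string. The sketch below percent-encodes them first; build_proxies is a hypothetical helper, the host and port are taken from the example above, and the credentials are placeholders.

```python
from urllib.parse import quote


def build_proxies(username: str, password: str,
                  host: str = "brd.superproxy.io", port: int = 33335) -> dict:
    # Percent-encode credentials so reserved characters don't break the URL
    user = quote(username, safe="")
    pwd = quote(password, safe="")
    proxy_url = f"http://{user}:{pwd}@{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Placeholder credentials for illustration only
proxies = build_proxies("example-user", "p@ss:word")
print(proxies["https"])  # http://example-user:p%40ss%3Aword@brd.superproxy.io:33335
```

The resulting dictionary can be passed as proxies= to requests.get. The verify argument of requests also accepts a path to a CA bundle file, which is the safer alternative to verify=False once the certificate is downloaded.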
Handling JavaScript-Rendered Content
If pages return incomplete data, use the x-unblock-expect header to wait for specific elements:
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}
payload = {
    "zone": ZONE_NAME,
    "url": "https://www.example.com/property/12345",
    "format": "raw",
    "headers": {
        "x-unblock-expect": '{"element": ".property-details"}'
    }
}
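Writing the x-unblock-expect value by hand is easy to get wrong because it is JSON nested inside a string. A small helper can build it with json.dumps instead; build_unlocker_payload is a hypothetical name, and the payload shape simply mirrors the snippet above.

```python
import json


def build_unlocker_payload(zone, url, expect_selector=None):
    """Build the request body; add the wait-for-element header only when asked."""
    payload = {"zone": zone, "url": url, "format": "raw"}
    if expect_selector:
        payload["headers"] = {
            # json.dumps handles quoting and escaping of the selector
            "x-unblock-expect": json.dumps({"element": expect_selector})
        }
    return payload


payload = build_unlocker_payload(
    "YOUR_ZONE_NAME",
    "https://www.example.com/property/12345",
    expect_selector=".property-details",
)
print(payload["headers"]["x-unblock-expect"])  # {"element": ".property-details"}
```

The payload is then posted to the same https://api.brightdata.com/request endpoint used in the Direct API Access example.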
Option 2: Browser API (For Interactive Pages)
When you need full browser interaction — clicking buttons, scrolling, handling login flows — use the Browser API. For other tools, read my article about the best scraping browsers.
Best for:
- JavaScript-heavy sites requiring interaction
- Multi-step navigation flows
- Sites with complex anti-bot detection
Puppeteer Example
const puppeteer = require('puppeteer-core');
const AUTH = 'YOUR_USERNAME:YOUR_PASSWORD';
const TARGET_URL = 'https://www.example.com/homes/for_sale/San-Francisco/';
async function scrapeRealEstate() {
  const browserWSEndpoint = `wss://${AUTH}@brd.superproxy.io:9222`;
  const browser = await puppeteer.connect({ browserWSEndpoint });

  try {
    const page = await browser.newPage();
    await page.goto(TARGET_URL, { timeout: 120000 });

    // Wait for listings to load
    await page.waitForSelector('[data-testid="property-card"]');

    // Extract data
    const listings = await page.evaluate(() => {
      const cards = document.querySelectorAll('[data-testid="property-card"]');
      return Array.from(cards).map(card => ({
        address: card.querySelector('address')?.textContent?.trim(),
        price: card.querySelector('[data-test="property-card-price"]')?.textContent?.trim(),
        link: card.querySelector('a')?.href
      }));
    });

    console.log(listings);
  } finally {
    await browser.close();
  }
}

scrapeRealEstate();
Playwright Example
from playwright.sync_api import sync_playwright
AUTH = 'YOUR_USERNAME:YOUR_PASSWORD'
def scrape_real_estate():
    with sync_playwright() as p:
        browser = p.chromium.connect_over_cdp(f'wss://{AUTH}@brd.superproxy.io:9222')
        page = browser.new_page()
        page.goto('https://www.example.com/homes/for_sale/San-Francisco/', timeout=120000)
        page.wait_for_selector('[data-testid="property-card"]')

        listings = page.evaluate('''() => {
            const cards = document.querySelectorAll('[data-testid="property-card"]');
            return Array.from(cards).map(card => ({
                address: card.querySelector('address')?.textContent?.trim(),
                price: card.querySelector('[data-test="property-card-price"]')?.textContent?.trim(),
                link: card.querySelector('a')?.href
            }));
        }''')

        browser.close()
        return listings
Option 3: Web Scraper API (Pre-Built Scrapers)
For popular real estate sites like Zillow, Realtor.com, or Redfin, Bright Data and other providers offer pre-built scrapers that return structured data directly.
import requests
API_KEY = "YOUR_API_KEY"
DATASET_ID = "gd_xxxxx" # Real estate scraper ID
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json"
}

# Synchronous scraping (up to 20 URLs)
response = requests.post(
    f"https://api.brightdata.com/datasets/v3/scrape?dataset_id={DATASET_ID}&format=json",
    headers=headers,
    json=[{"url": "https://www.zillow.com/homedetails/123-main-st/12345_zpid/"}]
)
data = response.json()
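Since the comment above mentions a limit of 20 URLs per synchronous call, a longer URL list has to be split into batches. The sketch below assumes that limit and the [{"url": ...}] request body shown above; chunked is a hypothetical helper and the URLs are placeholders.

```python
def chunked(urls, size=20):
    """Yield request bodies of at most `size` URLs each."""
    for start in range(0, len(urls), size):
        yield [{"url": u} for u in urls[start:start + size]]


# Placeholder URLs for illustration only
urls = [f"https://www.example.com/property/{i}" for i in range(45)]
batches = list(chunked(urls))
print([len(batch) for batch in batches])  # [20, 20, 5]
```

Each batch can then be sent as the json= argument of its own requests.post call.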
Pagination and Multiple Pages
Real estate listings span multiple pages. Handle pagination by iterating through page numbers:
import requests
from bs4 import BeautifulSoup
import time
API_KEY = "YOUR_API_KEY"
ZONE_NAME = "YOUR_ZONE_NAME"
BASE_URL = "https://www.example.com/homes/for_sale/San-Francisco/"
all_results = []
for page_num in range(1, 6):
    url = f"{BASE_URL}?page={page_num}"

    payload = {
        "zone": ZONE_NAME,
        "url": url,
        "format": "raw"
    }

    response = requests.post(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        json=payload
    )

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Parse and append results
        # ...

    time.sleep(1)  # Basic rate limiting
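The loop above hard-codes five pages. When the page count is unknown, one option is to keep requesting pages until one comes back empty. This sketch assumes the ?page= query parameter used above and takes the fetch-and-parse step as a callable; collect_pages and fake_fetch are hypothetical names.

```python
from urllib.parse import urlencode


def page_url(base_url, page_num):
    return f"{base_url}?{urlencode({'page': page_num})}"


def collect_pages(fetch_listings, base_url, max_pages=50):
    """Request pages in order and stop at the first one with no listings."""
    all_results = []
    for page_num in range(1, max_pages + 1):
        listings = fetch_listings(page_url(base_url, page_num))
        if not listings:
            break
        all_results.extend(listings)
    return all_results


# Fake fetcher standing in for the request + parse step: 3 pages of 2 listings
def fake_fetch(url):
    page = int(url.rsplit("=", 1)[1])
    return [{"link": f"{url}#{i}"} for i in range(2)] if page <= 3 else []


results = collect_pages(fake_fetch, "https://www.example.com/homes/for_sale/San-Francisco/")
print(len(results))  # 6
```

The max_pages cap guards against a site that never returns an empty page, and any delay between requests can live inside the fetch callable.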
Extracting Full Property Details
Listing pages contain limited data. Scrape individual property pages for complete information:
def text_or_none(soup, tag, test_id):
    # Return the element's text, or None when the element is missing
    element = soup.find(tag, {"data-testid": test_id})
    return element.get_text(strip=True) if element else None

def scrape_property_details(property_url, api_key, zone_name):
    payload = {
        "zone": zone_name,
        "url": property_url,
        "format": "raw"
    }

    response = requests.post(
        "https://api.brightdata.com/request",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
        json=payload
    )

    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        return {
            "description": text_or_none(soup, "div", "description"),
            "bedrooms": text_or_none(soup, "span", "beds"),
            "bathrooms": text_or_none(soup, "span", "baths"),
            "sqft": text_or_none(soup, "span", "sqft"),
        }
    return None

# Enrich listing data
for listing in results:
    details = scrape_property_details(listing["link"], API_KEY, ZONE_NAME)
    if details:
        listing.update(details)
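Fields such as price, bedrooms, and sqft arrive as display text, so they usually need cleaning before analysis. The helpers below are a minimal sketch: the sample strings are assumed formats, and parse_number keeps only the integer part, so a value like "2.5 ba" becomes 2.

```python
import re
from urllib.parse import urljoin


def parse_number(text):
    """Return the first integer found in text, or None if there is none."""
    if not text:
        return None
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def absolute_link(base_url, href):
    """Resolve relative hrefs against the page URL; absolute ones pass through."""
    return urljoin(base_url, href) if href else None


print(parse_number("$1,250,000"))     # 1250000
print(parse_number("1,850 sqft"))     # 1850
print(parse_number("Contact agent"))  # None
print(absolute_link("https://www.example.com/homes/for_sale/", "/homedetails/123"))
# https://www.example.com/homedetails/123
```

urljoin also covers relative paths and protocol-relative links, which the manual prefix check in the basic scraper does not.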
Choosing the Right Product

Summary
- Start simple with requests and BeautifulSoup for basic sites
- Use Unlocker API when you encounter anti-bot protection on static pages
- Switch to Browser API when pages require JavaScript interaction
- Consider Web Scraper API for popular sites with pre-built scrapers
- Handle pagination to collect data across multiple pages
- Scrape detail pages for complete property information

