April 6, 2026
Zillow is the largest real estate marketplace in the United States, hosting data on over 135 million properties nationwide. For investors, analysts, and developers, extracting this data programmatically opens the door to powerful market insights that would take weeks to gather manually. Whether you need to track housing prices across neighborhoods, monitor listing trends, or build datasets for machine learning models, scraping Zillow can provide the raw data you need.
In this guide, we will walk through everything you need to know about building a Zillow scraper with Python in 2026. We will cover basic scraping with requests and BeautifulSoup, advanced rendering with Playwright, proxy rotation with residential proxies, and strategies for handling anti-bot detection. By the end, you will have production-ready code that can reliably extract property data at scale.
Zillow aggregates an enormous amount of real estate data that is valuable across many industries and use cases. Here are the most common reasons people scrape Zillow:
Real estate investors and analysts rely on Zillow data to understand market dynamics. By scraping property prices, days on market, and listing volumes across different zip codes, you can identify emerging markets before they become mainstream. This kind of market research is invaluable for making data-driven investment decisions. Tracking metrics like median price changes, inventory levels, and price-to-rent ratios over time gives you a comprehensive picture of where a market is heading.
For real estate investors, having access to granular property data helps evaluate potential deals quickly. Scraping Zillow allows you to compare asking prices against Zestimate values, identify underpriced properties, and calculate potential returns based on comparable sales in the area. You can also track foreclosure listings, auction properties, and price reductions to find opportunities that fit your investment criteria.
Real estate agents and mortgage brokers use Zillow data to generate leads. By tracking new listings, price changes, and recently sold properties, you can identify homeowners who may be interested in selling or buyers actively searching in specific neighborhoods. This data can feed directly into your CRM and outreach workflows.
Whether you are a homeowner tracking your property value or a company monitoring competitor pricing, price monitoring on Zillow provides real-time market intelligence. Automated scrapers can alert you to price drops, new listings, or market shifts in your target areas, giving you an edge in negotiations and decision-making.
Machine learning teams building property valuation models, recommendation engines, or market prediction tools need large, structured datasets. Scraping Zillow provides the training data required for these models. This type of AI data collection is becoming increasingly common as more companies apply machine learning to real estate.
The legality of web scraping sits in a nuanced area that depends on several factors. Here is what you need to know before scraping Zillow:
Zillow maintains a robots.txt file that specifies which parts of the site automated bots can and cannot access. While robots.txt is technically advisory rather than legally binding, respecting it demonstrates good faith. Always review the current robots.txt at zillow.com/robots.txt before building your scraper to understand which paths are disallowed.
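Python's standard library includes a robots.txt parser you can use for this check. Here is a minimal sketch with urllib.robotparser; the rules below are a simplified illustration, not Zillow's actual file:

```python
from urllib.robotparser import RobotFileParser

# Simplified example rules -- always check the live zillow.com/robots.txt
robots_txt = """\
User-agent: *
Disallow: /captcha/
Allow: /homes/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch() reports whether a given user agent may request a URL
print(rp.can_fetch("*", "https://www.zillow.com/homes/Austin-TX_rb/"))  # True
print(rp.can_fetch("*", "https://www.zillow.com/captcha/"))             # False
```

Against the live file, point set_url() at https://www.zillow.com/robots.txt and call read() instead of parse().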
Zillow's Terms of Service explicitly restrict automated data collection. Violating these terms could result in your IP being blocked or, in extreme cases, legal action. It is important to understand and consider these terms before proceeding with any scraping project.
Courts have generally drawn a distinction between publicly accessible data and data that requires authentication. Property listings displayed on public search result pages are generally considered less protected than data behind login walls. However, the legal landscape continues to evolve, and you should consult with a legal professional for your specific use case.
Zillow exposes a wealth of property data on its public pages. Here is a breakdown of the data points you can typically extract:
Location: street address, city, state, zip code, latitude, and longitude
Pricing: asking price, Zestimate, tax-assessed value, and monthly HOA fee
Property attributes: bedrooms, bathrooms, square footage, lot size, year built, and home type
Listing details: listing status, days on Zillow, broker name, description, and features such as heating, cooling, and parking
The exact data available depends on the listing type and the specific page you are scraping. Search result pages provide summary data for multiple properties, while individual listing pages contain the full detail set.
Before writing any code, let us set up a clean Python environment with all the dependencies we need. We recommend using Python 3.10 or later for the best compatibility.
pip install requests beautifulsoup4 lxml pandas playwright
playwright install chromium
Here is what each package does:
requests: sends HTTP requests and manages sessions, headers, and cookies
beautifulsoup4: parses HTML so you can locate elements such as the __NEXT_DATA__ script tag
lxml: a fast parser backend used by BeautifulSoup
pandas: structures scraped data for cleaning, analysis, and CSV export
playwright: drives a real Chromium browser for JavaScript-rendered pages
With the dependencies installed, organize your project like this:
zillow-scraper/
    main.py            # Entry point for the scraper
    scraper.py         # Core scraping logic
    proxy_manager.py   # Proxy rotation and management
    parser.py          # HTML and JSON parsing functions
    storage.py         # Data storage and export
    config.py          # Configuration and constants
    requirements.txt   # Python dependencies
    output/            # Directory for scraped data files
This modular structure keeps your code organized and makes it easy to maintain and extend as your scraping needs grow.
Let us start with a straightforward scraper using requests and BeautifulSoup. Zillow is a Next.js application, which means most of the property data is embedded in a __NEXT_DATA__ JSON object within a script tag on the page. This is actually convenient for scraping because we can parse structured JSON instead of navigating complex HTML.
import requests
from bs4 import BeautifulSoup
import json
import time
import random


class ZillowScraper:
    """Basic Zillow scraper using requests and BeautifulSoup."""

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/124.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;"
                      "q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://www.google.com/",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        })
        self.base_url = "https://www.zillow.com"

    def search_properties(self, location, page=1):
        """
        Search for properties in a given location.
        Returns a list of property dictionaries.
        """
        search_url = f"{self.base_url}/homes/{location}_rb/"
        if page > 1:
            search_url += f"{page}_p/"
        print(f"Fetching: {search_url}")
        response = self.session.get(search_url, timeout=30)
        if response.status_code != 200:
            print(f"Error: Received status code {response.status_code}")
            return []
        return self._parse_search_results(response.text)

    def _parse_search_results(self, html):
        """
        Parse property data from a Zillow search results page.
        Extracts the __NEXT_DATA__ JSON embedded in the page.
        """
        soup = BeautifulSoup(html, "lxml")
        properties = []

        # Find the __NEXT_DATA__ script tag containing all page data
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
        if not script_tag:
            print("Warning: Could not find __NEXT_DATA__ script tag.")
            return properties

        try:
            next_data = json.loads(script_tag.string)
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")
            return properties

        # Navigate the JSON structure to find search results
        try:
            query_state = next_data["props"]["pageProps"]["searchPageState"]
            cat1 = query_state["cat1"]
            search_results = cat1["searchResults"]["listResults"]
        except (KeyError, TypeError) as e:
            print(f"Error navigating JSON structure: {e}")
            return properties

        for result in search_results:
            property_data = {
                "zpid": result.get("zpid"),
                "address": result.get("address"),
                "city": result.get("addressCity"),
                "state": result.get("addressState"),
                "zipcode": result.get("addressZipcode"),
                "price": result.get("unformattedPrice"),
                "bedrooms": result.get("beds"),
                "bathrooms": result.get("baths"),
                "sqft": result.get("area"),
                "zestimate": result.get("zestimate"),
                "listing_type": result.get("statusType"),
                "days_on_zillow": result.get("timeOnZillow"),
                "url": result.get("detailUrl"),
                "latitude": result.get("latLong", {}).get("latitude"),
                "longitude": result.get("latLong", {}).get("longitude"),
                "broker": result.get("brokerName"),
            }
            properties.append(property_data)

        print(f"Found {len(properties)} properties.")
        return properties

    def get_property_details(self, property_url):
        """Fetch detailed information for a single property listing."""
        if not property_url.startswith("http"):
            property_url = f"{self.base_url}{property_url}"

        # Add a random delay to avoid rate limiting
        time.sleep(random.uniform(2, 5))

        response = self.session.get(property_url, timeout=30)
        if response.status_code != 200:
            print(f"Error fetching details: {response.status_code}")
            return None

        soup = BeautifulSoup(response.text, "lxml")
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
        if not script_tag:
            return None

        try:
            next_data = json.loads(script_tag.string)
            property_info = next_data["props"]["pageProps"]["initialReduxState"]
            gdp = property_info["gdp"]["building"]
            details = {
                "description": gdp.get("description"),
                "year_built": gdp.get("yearBuilt"),
                "lot_size": gdp.get("lotSize"),
                "property_type": gdp.get("homeType"),
                "heating": gdp.get("heatingSystem"),
                "cooling": gdp.get("coolingSystem"),
                "parking": gdp.get("parkingFeatures"),
                "hoa_fee": gdp.get("monthlyHoaFee"),
                "tax_assessed_value": gdp.get("taxAssessedValue"),
                "property_tax_rate": gdp.get("propertyTaxRate"),
            }
            return details
        except (KeyError, TypeError, json.JSONDecodeError) as e:
            print(f"Error parsing property details: {e}")
            return None


# Usage example
if __name__ == "__main__":
    scraper = ZillowScraper()

    # Search for properties in Austin, TX
    results = scraper.search_properties("Austin-TX")

    for prop in results[:5]:
        price = prop["price"]
        price_str = f"${price:,}" if price else "N/A"
        print(f"{prop['address']} - {price_str} - "
              f"{prop['bedrooms']}bd/{prop['bathrooms']}ba - "
              f"{prop['sqft']} sqft")

        # Optionally fetch detailed info
        if prop.get("url"):
            details = scraper.get_property_details(prop["url"])
            if details:
                print(f"  Year Built: {details['year_built']}, "
                      f"Type: {details['property_type']}")
This basic scraper works well for small-scale data collection, but you will quickly run into issues if you try to send too many requests from a single IP address. Zillow actively monitors for automated traffic and will block IPs that exhibit scraping behavior. This is where proxy rotation becomes essential.
When scraping Zillow at any meaningful scale, you need to rotate your IP address to avoid detection and blocking. Without proxies, you will likely encounter 403 Forbidden errors, CAPTCHAs, or complete IP bans after just a few dozen requests. Using residential proxies from SpyderProxy gives you access to a pool of real residential IP addresses that rotate automatically with each request.
Residential proxies are the best choice for Zillow scraping because they use IP addresses assigned by Internet Service Providers to real households. This makes your requests appear as normal user traffic rather than automated bots. For those on a tighter budget, budget residential proxies offer a cost-effective alternative that still provides solid performance for most scraping tasks. You can also explore rotating datacenter proxies for higher-speed operations where residential IPs are not strictly required.
For a deeper dive into configuring proxies with Python, check out our guide on using rotating proxies with Python requests.
import requests
from bs4 import BeautifulSoup
import json
import time
import random


class ZillowProxyScraper:
    """
    Zillow scraper with SpyderProxy residential proxy rotation.
    Uses rotating proxies to avoid IP bans and rate limits.
    """

    def __init__(self, proxy_user, proxy_pass):
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass
        self.base_url = "https://www.zillow.com"

        # SpyderProxy rotating residential proxy configuration
        # Each request automatically gets a new IP address
        self.proxy_url_http = (
            f"http://{proxy_user}:{proxy_pass}"
            f"@geo.spyderproxy.com:10000"
        )
        self.proxy_url_socks5 = (
            f"socks5://{proxy_user}:{proxy_pass}"
            f"@geo.spyderproxy.com:10000"
        )

        # Use the HTTP proxy by default
        self.proxies = {
            "http": self.proxy_url_http,
            "https": self.proxy_url_http,
        }

        # Rotate user agents to further reduce detection
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
            "Gecko/20100101 Firefox/125.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/605.1.15 (KHTML, like Gecko) "
            "Version/17.4 Safari/605.1.15",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        ]

    def _get_headers(self):
        """Generate randomized headers for each request."""
        return {
            "User-Agent": random.choice(self.user_agents),
            "Accept": "text/html,application/xhtml+xml,application/xml;"
                      "q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": random.choice([
                "https://www.google.com/",
                "https://www.google.com/search?q=homes+for+sale",
                "https://www.bing.com/",
            ]),
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "cross-site",
            "Sec-Fetch-User": "?1",
            "Cache-Control": "max-age=0",
        }

    def _make_request(self, url, max_retries=3):
        """
        Make a request with proxy rotation and retry logic.
        SpyderProxy automatically assigns a new IP per request.
        """
        for attempt in range(max_retries):
            try:
                response = requests.get(
                    url,
                    headers=self._get_headers(),
                    proxies=self.proxies,
                    timeout=30,
                )
                if response.status_code == 200:
                    return response
                if response.status_code == 403:
                    print(f"Attempt {attempt + 1}: 403 Forbidden. "
                          f"Rotating IP and retrying...")
                    time.sleep(random.uniform(3, 7))
                    continue
                if response.status_code == 429:
                    print(f"Attempt {attempt + 1}: Rate limited. "
                          f"Waiting before retry...")
                    time.sleep(random.uniform(10, 20))
                    continue
                print(f"Attempt {attempt + 1}: Status {response.status_code}")
            except requests.exceptions.Timeout:
                print(f"Attempt {attempt + 1}: Request timed out.")
                time.sleep(random.uniform(2, 5))
            except requests.exceptions.ProxyError:
                print(f"Attempt {attempt + 1}: Proxy error. Retrying...")
                time.sleep(random.uniform(1, 3))
            except requests.exceptions.RequestException as e:
                print(f"Attempt {attempt + 1}: Request error: {e}")
                time.sleep(random.uniform(2, 5))
        print(f"Failed to fetch {url} after {max_retries} attempts.")
        return None

    def search_properties(self, location, max_pages=5):
        """
        Search for properties across multiple pages.
        Uses proxy rotation for each request.
        """
        all_properties = []
        for page in range(1, max_pages + 1):
            search_url = f"{self.base_url}/homes/{location}_rb/"
            if page > 1:
                search_url += f"{page}_p/"
            print(f"Scraping page {page}: {search_url}")
            response = self._make_request(search_url)
            if not response:
                print(f"Skipping page {page} due to request failure.")
                continue
            properties = self._parse_search_results(response.text)
            all_properties.extend(properties)
            print(f"Page {page}: Found {len(properties)} properties "
                  f"(Total: {len(all_properties)})")
            # Respectful delay between pages
            time.sleep(random.uniform(3, 8))
        return all_properties

    def _parse_search_results(self, html):
        """Parse property data from the __NEXT_DATA__ JSON."""
        soup = BeautifulSoup(html, "lxml")
        properties = []
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
        if not script_tag:
            return properties
        try:
            next_data = json.loads(script_tag.string)
            query_state = next_data["props"]["pageProps"]["searchPageState"]
            results = query_state["cat1"]["searchResults"]["listResults"]
            for result in results:
                property_data = {
                    "zpid": result.get("zpid"),
                    "address": result.get("address"),
                    "city": result.get("addressCity"),
                    "state": result.get("addressState"),
                    "zipcode": result.get("addressZipcode"),
                    "price": result.get("unformattedPrice"),
                    "bedrooms": result.get("beds"),
                    "bathrooms": result.get("baths"),
                    "sqft": result.get("area"),
                    "zestimate": result.get("zestimate"),
                    "listing_type": result.get("statusType"),
                    "url": result.get("detailUrl"),
                    "latitude": result.get("latLong", {}).get("latitude"),
                    "longitude": result.get("latLong", {}).get("longitude"),
                    "broker": result.get("brokerName"),
                }
                properties.append(property_data)
        except (KeyError, TypeError, json.JSONDecodeError) as e:
            print(f"Parse error: {e}")
        return properties

    def use_socks5_proxy(self):
        """
        Switch to the SOCKS5 proxy protocol.
        Useful when HTTP proxies are being detected.
        Requires: pip install requests[socks]
        """
        self.proxies = {
            "http": self.proxy_url_socks5,
            "https": self.proxy_url_socks5,
        }
        print("Switched to SOCKS5 proxy protocol.")


# Usage example
if __name__ == "__main__":
    scraper = ZillowProxyScraper(
        proxy_user="your_spyderproxy_username",
        proxy_pass="your_spyderproxy_password",
    )

    # Scrape multiple pages of results for Miami, FL
    properties = scraper.search_properties("Miami-FL", max_pages=3)

    print(f"\nTotal properties scraped: {len(properties)}")
    for prop in properties[:10]:
        price = prop["price"]
        price_str = f"${price:,}" if price else "N/A"
        print(f"  {prop['address']} - {price_str}")
The key advantage of using SpyderProxy's rotating residential proxies is that each request is automatically routed through a different US-based residential IP address. This means Zillow sees each request as coming from a different household, making it extremely difficult to detect and block your scraper. You can verify your proxy setup is working correctly using our proxy checker tool before running your scraper at scale.
Some Zillow pages rely heavily on JavaScript rendering, meaning the property data is loaded dynamically after the initial page load. In these cases, a simple HTTP request will not capture all the data. Playwright is a browser automation library that runs a full Chromium browser, allowing you to interact with pages exactly as a real user would.
This approach is slower than direct HTTP requests, but it captures data that only appears after JavaScript execution, including dynamically loaded map results, infinite scroll listings, and interactive property details.
import asyncio
import random

from playwright.async_api import async_playwright


class ZillowPlaywrightScraper:
    """
    Advanced Zillow scraper using Playwright for JS-rendered pages.
    Supports proxy rotation via SpyderProxy.
    """

    def __init__(self, proxy_user=None, proxy_pass=None):
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass
        self.base_url = "https://www.zillow.com"

    async def _create_browser(self):
        """Create a Playwright browser instance with proxy configuration."""
        playwright = await async_playwright().start()
        launch_options = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ],
        }
        # Configure SpyderProxy if credentials are provided
        if self.proxy_user and self.proxy_pass:
            launch_options["proxy"] = {
                "server": "http://geo.spyderproxy.com:10000",
                "username": self.proxy_user,
                "password": self.proxy_pass,
            }
        browser = await playwright.chromium.launch(**launch_options)
        return playwright, browser

    async def _create_context(self, browser):
        """Create a browser context with realistic settings."""
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
            geolocation={"longitude": -73.935242, "latitude": 40.730610},
            permissions=["geolocation"],
        )
        # Remove webdriver detection signals
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
            Object.defineProperty(navigator, 'languages', {
                get: () => ['en-US', 'en']
            });
            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5]
            });
        """)
        return context

    async def search_properties(self, location, max_pages=3):
        """
        Search for properties using a headless browser.
        Handles JavaScript rendering and dynamic content.
        """
        playwright, browser = await self._create_browser()
        context = await self._create_context(browser)
        page = await context.new_page()
        all_properties = []
        try:
            for page_num in range(1, max_pages + 1):
                search_url = f"{self.base_url}/homes/{location}_rb/"
                if page_num > 1:
                    search_url += f"{page_num}_p/"
                print(f"Loading page {page_num}: {search_url}")
                await page.goto(search_url, wait_until="networkidle")
                await page.wait_for_timeout(random.randint(2000, 4000))

                # Scroll down to trigger lazy-loaded content
                await self._scroll_page(page)

                # Extract __NEXT_DATA__ from the page
                next_data = await page.evaluate("""
                    () => {
                        const el = document.getElementById('__NEXT_DATA__');
                        return el ? JSON.parse(el.textContent) : null;
                    }
                """)
                if next_data:
                    properties = self._parse_next_data(next_data)
                    all_properties.extend(properties)
                    print(f"Page {page_num}: {len(properties)} properties")
                else:
                    # Fallback: scrape from the rendered DOM
                    properties = await self._parse_dom(page)
                    all_properties.extend(properties)
                    print(f"Page {page_num}: {len(properties)} properties "
                          f"(from DOM)")

                # Random delay between pages
                await page.wait_for_timeout(random.randint(3000, 7000))
        finally:
            await context.close()
            await browser.close()
            await playwright.stop()
        return all_properties

    async def _scroll_page(self, page):
        """Simulate natural scrolling to load lazy content."""
        total_height = await page.evaluate("document.body.scrollHeight")
        current_position = 0
        scroll_step = random.randint(300, 600)
        while current_position < total_height:
            current_position += scroll_step
            await page.evaluate(f"window.scrollTo(0, {current_position})")
            await page.wait_for_timeout(random.randint(200, 500))
        # Scroll back to top
        await page.evaluate("window.scrollTo(0, 0)")
        await page.wait_for_timeout(1000)

    def _parse_next_data(self, next_data):
        """Parse properties from the __NEXT_DATA__ JSON object."""
        properties = []
        try:
            query_state = next_data["props"]["pageProps"]["searchPageState"]
            results = query_state["cat1"]["searchResults"]["listResults"]
            for result in results:
                properties.append({
                    "zpid": result.get("zpid"),
                    "address": result.get("address"),
                    "city": result.get("addressCity"),
                    "state": result.get("addressState"),
                    "zipcode": result.get("addressZipcode"),
                    "price": result.get("unformattedPrice"),
                    "bedrooms": result.get("beds"),
                    "bathrooms": result.get("baths"),
                    "sqft": result.get("area"),
                    "zestimate": result.get("zestimate"),
                    "url": result.get("detailUrl"),
                })
        except (KeyError, TypeError):
            pass
        return properties

    async def _parse_dom(self, page):
        """Fallback: parse property cards directly from the DOM."""
        properties = await page.evaluate("""
            () => {
                const cards = document.querySelectorAll(
                    'article[data-test="property-card"]'
                );
                return Array.from(cards).map(card => {
                    const priceEl = card.querySelector(
                        '[data-test="property-card-price"]'
                    );
                    const addressEl = card.querySelector('address');
                    const linkEl = card.querySelector(
                        'a[data-test="property-card-link"]'
                    );
                    const detailsEl = card.querySelector(
                        '[data-test="property-card-details"]'
                    );
                    return {
                        price: priceEl ? priceEl.textContent.trim() : null,
                        address: addressEl ? addressEl.textContent.trim() : null,
                        url: linkEl ? linkEl.href : null,
                        details: detailsEl ? detailsEl.textContent.trim() : null,
                    };
                });
            }
        """)
        return properties

    async def get_property_details(self, property_url):
        """Scrape full details from a single property listing page."""
        playwright, browser = await self._create_browser()
        context = await self._create_context(browser)
        page = await context.new_page()
        details = None
        try:
            if not property_url.startswith("http"):
                property_url = f"{self.base_url}{property_url}"
            await page.goto(property_url, wait_until="networkidle")
            await page.wait_for_timeout(random.randint(2000, 4000))
            await self._scroll_page(page)
            details = await page.evaluate("""
                () => {
                    const getData = (selector) => {
                        const el = document.querySelector(selector);
                        return el ? el.textContent.trim() : null;
                    };
                    return {
                        price: getData('[data-testid="price"]'),
                        address: getData('[data-testid="bdp-property-address"]'),
                        beds: getData('[data-testid="bed-bath-item"]:nth-child(1)'),
                        baths: getData('[data-testid="bed-bath-item"]:nth-child(2)'),
                        sqft: getData('[data-testid="bed-bath-item"]:nth-child(3)'),
                        description: getData('[data-testid="description-text"]'),
                        zestimate: getData('[data-testid="zestimate-text"]'),
                    };
                }
            """)
        finally:
            await context.close()
            await browser.close()
            await playwright.stop()
        return details


# Usage example
async def main():
    scraper = ZillowPlaywrightScraper(
        proxy_user="your_spyderproxy_username",
        proxy_pass="your_spyderproxy_password",
    )
    properties = await scraper.search_properties("Denver-CO", max_pages=2)
    print(f"\nTotal properties: {len(properties)}")
    for prop in properties[:5]:
        print(f"  {prop.get('address')} - {prop.get('price')}")


if __name__ == "__main__":
    asyncio.run(main())
The Playwright approach is particularly useful for scraping property detail pages where data is loaded progressively as you scroll. It also handles situations where Zillow returns a JavaScript challenge page instead of the actual content, since the full browser can execute the challenge script just like a regular user's browser would.
Zillow employs several layers of anti-bot protection. Understanding these mechanisms and how to work around them is critical for building a reliable scraper. Here are the main techniques you should implement:
Sending the same User-Agent header with every request is a clear signal of automated traffic. Maintain a list of current, realistic user agent strings and rotate them randomly. Make sure your user agents match actual browser versions that are currently in use, as outdated user agents are a red flag.
Real users do not load pages at machine speed. Add random delays between requests to simulate natural browsing behavior. A delay of 3 to 8 seconds between requests is a good baseline. For detail pages, wait even longer since users typically spend time reading listing information before moving to the next page.
Maintain cookies across requests within a session to appear as a consistent user. However, periodically rotate sessions to avoid building a suspicious cookie profile. Creating a new session every 20 to 30 requests is a reasonable strategy.
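That rotation strategy can be sketched as a small wrapper. The class and helper names below are illustrative, not part of the scrapers above:

```python
import random

import requests

# A couple of example desktop user agents (illustrative; keep this list current)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
]


def new_session():
    """Create a fresh session with a randomized User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": random.choice(USER_AGENTS)})
    return session


class SessionRotator:
    """Hand out a requests.Session, replacing it every max_requests uses."""

    def __init__(self, max_requests=25):
        self.max_requests = max_requests
        self.count = 0
        self.session = new_session()

    def get(self):
        if self.count >= self.max_requests:
            self.session.close()  # discard the old cookie profile
            self.session = new_session()
            self.count = 0
        self.count += 1
        return self.session
```

Each call to get() returns the current session; once it has been used max_requests times, its cookies are discarded and a fresh session takes over.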
Modern anti-bot systems analyze the full set of HTTP headers, not just the User-Agent. Make sure your headers are consistent and realistic. The Accept, Accept-Language, Accept-Encoding, and Sec-Fetch headers should all match what a real browser would send. Inconsistent headers are a common reason scrapers get detected.
When using Playwright, Zillow may check for browser automation signals like the navigator.webdriver property. The initialization script in our Playwright example above removes these signals, but you should stay updated on new detection techniques as they evolve.
Arriving at a Zillow search page without a referrer or with a suspicious one can trigger anti-bot detection. Set your Referer header to Google search or another natural source. Vary the referrer across requests to appear more organic.
For a comprehensive overview of proxy selection for scraping projects, see our guide on the best proxies for web scraping.
Once you have scraped property data from Zillow, you need to store it in a structured format and perform analysis. Pandas makes this straightforward. Here is a complete example of saving scraped data to CSV and running basic analysis:
import pandas as pd
import json
import os
from datetime import datetime


class ZillowDataManager:
    """Manage storage and analysis of scraped Zillow data."""

    def __init__(self, output_dir="output"):
        self.output_dir = output_dir
        os.makedirs(output_dir, exist_ok=True)  # ensure the output dir exists

    def save_to_csv(self, properties, filename=None):
        """Save scraped properties to a CSV file."""
        if not properties:
            print("No properties to save.")
            return None
        df = pd.DataFrame(properties)

        # Add metadata columns
        df["scraped_at"] = datetime.now().isoformat()
        df["source"] = "zillow"

        # Coerce numeric columns
        if "price" in df.columns:
            df["price"] = pd.to_numeric(df["price"], errors="coerce")
        if "sqft" in df.columns:
            df["sqft"] = pd.to_numeric(df["sqft"], errors="coerce")
        if "zestimate" in df.columns:
            df["zestimate"] = pd.to_numeric(df["zestimate"], errors="coerce")

        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"zillow_properties_{timestamp}.csv"
        filepath = f"{self.output_dir}/{filename}"
        df.to_csv(filepath, index=False)
        print(f"Saved {len(df)} properties to {filepath}")
        return filepath

    def save_to_json(self, properties, filename=None):
        """Save scraped properties to a JSON file."""
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"zillow_properties_{timestamp}.json"
        filepath = f"{self.output_dir}/{filename}"
        with open(filepath, "w") as f:
            json.dump(properties, f, indent=2)
        print(f"Saved {len(properties)} properties to {filepath}")
        return filepath

    def analyze_market(self, csv_path):
        """Run basic market analysis on scraped property data."""
        df = pd.read_csv(csv_path)
        print("=" * 60)
        print("ZILLOW MARKET ANALYSIS REPORT")
        print("=" * 60)
        print(f"\nTotal Properties: {len(df)}")
        print(f"Date Range: {df['scraped_at'].min()} to "
              f"{df['scraped_at'].max()}")

        # Price analysis
        if "price" in df.columns:
            price_data = df["price"].dropna()
            print("\n--- Price Analysis ---")
            print(f"  Median Price: ${price_data.median():,.0f}")
            print(f"  Mean Price: ${price_data.mean():,.0f}")
            print(f"  Min Price: ${price_data.min():,.0f}")
            print(f"  Max Price: ${price_data.max():,.0f}")
            print(f"  Std Dev: ${price_data.std():,.0f}")

        # Price per square foot
        if "price" in df.columns and "sqft" in df.columns:
            df["price_per_sqft"] = df["price"] / df["sqft"]
            ppsf = df["price_per_sqft"].dropna()
            print("\n--- Price per Sq Ft ---")
            print(f"  Median: ${ppsf.median():,.0f}/sqft")
            print(f"  Mean: ${ppsf.mean():,.0f}/sqft")

        # Zestimate comparison
        if "price" in df.columns and "zestimate" in df.columns:
            df["price_vs_zestimate"] = df["price"] - df["zestimate"]
            diff = df["price_vs_zestimate"].dropna()
            underpriced = len(diff[diff < 0])
            overpriced = len(diff[diff > 0])
            print("\n--- Zestimate Comparison ---")
            print(f"  Below Zestimate: {underpriced} properties")
            print(f"  Above Zestimate: {overpriced} properties")
            print(f"  Avg Difference: ${diff.mean():,.0f}")

        # Bedroom distribution
        if "bedrooms" in df.columns:
            print("\n--- Bedroom Distribution ---")
            bed_counts = df["bedrooms"].value_counts().sort_index()
            for beds, count in bed_counts.items():
                print(f"  {beds} bed: {count} properties")

        # City breakdown
        if "city" in df.columns:
            print("\n--- Top Cities ---")
            city_counts = df["city"].value_counts().head(10)
            for city, count in city_counts.items():
                median = df[df["city"] == city]["price"].median()
                median_str = f"${median:,.0f}" if pd.notna(median) else "N/A"
                print(f"  {city}: {count} listings (median: {median_str})")

        print("\n" + "=" * 60)
        return df


# Usage example
if __name__ == "__main__":
    manager = ZillowDataManager(output_dir="output")

    # Example: save properties from a scraping session
    sample_properties = [
        {
            "address": "123 Main St, Austin, TX 78701",
            "city": "Austin",
            "state": "TX",
            "price": 450000,
            "bedrooms": 3,
            "bathrooms": 2,
            "sqft": 1800,
            "zestimate": 465000,
        },
        {
            "address": "456 Oak Ave, Austin, TX 78704",
            "city": "Austin",
            "state": "TX",
            "price": 625000,
            "bedrooms": 4,
            "bathrooms": 3,
            "sqft": 2400,
            "zestimate": 610000,
        },
    ]

    csv_path = manager.save_to_csv(sample_properties)
    if csv_path:
        manager.analyze_market(csv_path)
This data management class gives you a clean workflow for saving scraped results and generating quick market reports. You can extend the analysis methods to calculate more sophisticated metrics like absorption rates, inventory turnover, or price trend regressions.
Once your basic scraper is working reliably, you will likely want to scale it to cover more locations and properties. Here are key strategies for scaling effectively:
Python's asyncio and aiohttp libraries allow you to make multiple requests concurrently without blocking. Instead of scraping cities sequentially, you can scrape several at once. Be careful not to push concurrency too high: five to ten concurrent requests is usually a safe upper limit that avoids overwhelming the target server.
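As a rough sketch of that pattern, the example below caps concurrency with an asyncio.Semaphore. Here fetch_city is a placeholder that simulates network latency; in a real scraper it would wrap your aiohttp request and parsing logic.

```python
import asyncio

# Cap concurrency at a safe level (five to ten is a reasonable ceiling).
MAX_CONCURRENT = 5

async def fetch_city(semaphore: asyncio.Semaphore, city: str) -> str:
    """Placeholder for a real aiohttp request; the semaphore limits
    how many fetches run at the same time."""
    async with semaphore:
        await asyncio.sleep(0.01)  # simulate network latency
        return f"results for {city}"

async def scrape_cities(cities: list[str]) -> list[str]:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT)
    tasks = [fetch_city(semaphore, city) for city in cities]
    # gather preserves input order even though fetches run concurrently
    return await asyncio.gather(*tasks)

results = asyncio.run(scrape_cities(["austin-tx", "denver-co", "miami-fl"]))
```

The semaphore is what keeps the scraper polite: all tasks are created up front, but only MAX_CONCURRENT of them hold a network slot at any moment.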
At scale, you will encounter more errors. Implement exponential backoff for retries, where each subsequent retry waits longer than the previous one. Log all errors with timestamps and URLs so you can identify patterns and adjust your strategy. Keep track of which pages failed so you can retry them in a separate pass rather than losing that data entirely.
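A minimal sketch of that retry logic follows. The names fetch_with_backoff and flaky_fetch are illustrative, and the base delay is shortened for the demo; in production you would start around 10 seconds and pass in your real request function.

```python
import time
import logging

logging.basicConfig(level=logging.INFO)
failed_urls: list[str] = []  # retry these in a separate pass later

def fetch_with_backoff(fetch, url: str, max_retries: int = 3,
                       base_delay: float = 1.0):
    """Call fetch(url), doubling the wait after each failure.
    fetch is any callable that raises an exception on error."""
    delay = base_delay
    for attempt in range(max_retries):
        try:
            return fetch(url)
        except Exception as exc:
            logging.warning("attempt %d failed for %s: %s",
                            attempt + 1, url, exc)
            time.sleep(delay)
            delay *= 2  # exponential backoff
    failed_urls.append(url)  # record the miss instead of losing it
    return None

# Demo: a fetch that fails twice, then succeeds on the third attempt
calls = {"n": 0}
def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "<html>ok</html>"

result = fetch_with_backoff(flaky_fetch, "https://example.com",
                            base_delay=0.01)
```

Anything still failing after max_retries lands in failed_urls, so a second pass can retry just those pages rather than rerunning the whole job.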
For large-scale scraping jobs, consider using a task queue like Redis Queue or Celery. This allows you to distribute scraping work across multiple machines, each with its own set of proxies. A queue also provides natural retry logic and progress tracking, making your scraper more resilient and easier to monitor.
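Redis Queue and Celery implement this pattern across machines; the stdlib sketch below shows the core idea (a shared job queue drained by worker threads) in a single process, with the scraping call simulated.

```python
import queue
import threading

jobs: "queue.Queue[str]" = queue.Queue()
results: list[str] = []
results_lock = threading.Lock()

def worker() -> None:
    """Pull locations off the shared queue until it is drained."""
    while True:
        try:
            location = jobs.get_nowait()
        except queue.Empty:
            return
        # scrape_location(location) would be your real scraper; simulated here
        with results_lock:
            results.append(f"scraped {location}")
        jobs.task_done()

for loc in ["austin-tx", "denver-co", "miami-fl", "boise-id"]:
    jobs.put(loc)

# Two workers drain the queue concurrently, like two RQ/Celery consumers
threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With RQ or Celery the queue lives in Redis (or another broker) instead of process memory, so workers on different machines, each with its own proxy pool, can consume from the same backlog.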
When scraping the same locations over time, you will encounter duplicate listings. Use the Zillow property ID (zpid) as a unique key to identify and handle duplicates. Store the zpid in a set or database index so you can quickly skip properties you have already scraped or update existing records with new price data.
Set up cron jobs or scheduled tasks to run your scraper at regular intervals. Daily or weekly scrapes allow you to build historical datasets that are valuable for trend analysis. Make sure your scheduling accounts for reasonable hours and avoids peak traffic times on Zillow.
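On Linux or macOS, a single crontab entry is enough for a daily off-peak run. The paths below are placeholders for your own environment.

```shell
# Edit with `crontab -e`. Runs the scraper every day at 4:30 AM local
# time (off-peak) and appends stdout/stderr to a log file.
# Field order: minute hour day-of-month month day-of-week command
30 4 * * * /usr/bin/python3 /home/user/zillow_scraper/main.py >> /home/user/zillow_scraper/scraper.log 2>&1
```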
For a broader perspective on web scraping best practices, including architecture patterns for large-scale projects, see our dedicated use case guide.
Here are the most frequently encountered issues when scraping Zillow and how to resolve them:
This is the most common error and means Zillow has detected your request as automated traffic. Causes include using the same IP address for too many requests, sending requests without proper headers, or using a flagged proxy IP. Fix this by rotating your proxies, randomizing your headers, and adding delays between requests. SpyderProxy's residential proxies significantly reduce the frequency of 403 errors because they use genuine residential IP addresses.
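Header randomization is straightforward to add. The sketch below keeps a small pool of user agents (keep yours current with recent browser releases) and varies both headers and inter-request delays; the function names are illustrative.

```python
import random
import time

# A small pool of realistic user agents; extend and refresh this list
# as new browser versions ship.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

def random_headers() -> dict:
    """Build a header set that varies between requests."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept": "text/html,application/xhtml+xml,application/xml;"
                  "q=0.9,*/*;q=0.8",
    }

def polite_delay(min_s: float = 2.0, max_s: float = 6.0) -> float:
    """Sleep a random interval so request timing looks less robotic."""
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay

headers = random_headers()
```

Pass the result of random_headers() to each request, and call polite_delay() between requests so neither the headers nor the timing form a detectable pattern.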
Zillow may present a CAPTCHA page instead of the actual content. If you are using requests, you will receive an HTML page containing the CAPTCHA markup instead of property data. With Playwright, you can detect CAPTCHAs by checking for specific page elements. The best mitigation is to reduce your request rate, use high-quality residential proxies, and avoid patterns that trigger CAPTCHAs. If you encounter CAPTCHAs frequently, your proxy quality or request patterns likely need adjustment.
Sometimes Zillow returns a valid 200 response, but the __NEXT_DATA__ script tag is missing or contains incomplete data. This can happen when Zillow serves a different page variant, when the location is invalid, or when JavaScript rendering is required. Check that your location string matches Zillow's URL format exactly, and consider switching to the Playwright scraper for pages that return incomplete data with simple HTTP requests.
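One defensive way to handle this is to treat a missing or unparseable __NEXT_DATA__ tag as a signal to fall back to Playwright. The minimal stdlib sketch below uses a regex instead of a full HTML parser; the sample strings are fabricated for illustration.

```python
import json
import re

def extract_next_data(html: str):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is
    missing or malformed -- the cue to retry with a real browser."""
    match = re.search(
        r'<script id="__NEXT_DATA__"[^>]*>(.*?)</script>',
        html,
        re.DOTALL,
    )
    if not match:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None

sample = ('<html><script id="__NEXT_DATA__" type="application/json">'
          '{"props": {"pageProps": {}}}</script></html>')
data = extract_next_data(sample)          # parsed payload
blocked = extract_next_data("<html>captcha page</html>")  # None
```

Whenever extract_next_data returns None on a 200 response, log the URL and route it to your Playwright-based scraper instead of treating it as an empty result.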
A 429 response means you are sending requests too quickly. Implement exponential backoff when you receive this status code. Start with a 10-second wait and double it with each consecutive 429 response. Once you can successfully complete a request again, gradually return to your normal request rate.
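That schedule (start at 10 seconds, double on each consecutive 429) can be captured in a small helper like this:

```python
def backoff_delays(attempts: int, base: float = 10.0) -> list[float]:
    """Wait times for consecutive 429 responses: start at `base`
    seconds and double after each one."""
    return [base * (2 ** i) for i in range(attempts)]

# Four consecutive 429s -> wait 10s, then 20s, then 40s, then 80s
delays = backoff_delays(4)
```

In the request loop, index into this schedule with a consecutive-429 counter, and reset the counter to zero after the first successful response.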
Timeouts can occur due to proxy issues, network instability, or Zillow's server being slow. Set a reasonable timeout of 30 seconds for each request and implement retry logic for timed-out requests. If you experience frequent timeouts with a specific proxy, try switching to a different proxy protocol or checking your proxy connection with the proxy checker tool.
Zillow periodically updates its website structure, which can break your JSON parsing logic. Build your parser to handle missing keys gracefully using .get() with default values. Set up monitoring that alerts you when your scraper returns zero results for previously working locations, which is a strong indicator that the page structure has changed.
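In practice that defensive parsing looks like the sketch below. The field names here are illustrative, not a guaranteed mapping of Zillow's JSON: the point is that a layout change degrades to None values instead of crashing the run.

```python
def parse_property(raw: dict) -> dict:
    """Pull the fields we need, tolerating missing or reshaped keys."""
    address = raw.get("address")
    return {
        "zpid": raw.get("zpid"),
        "price": raw.get("price"),
        # nested lookups get a type check so a reshaped payload
        # cannot raise AttributeError
        "street": address.get("streetAddress")
                  if isinstance(address, dict) else None,
        "bedrooms": raw.get("bedrooms"),
        "zestimate": raw.get("zestimate"),
    }

# A record missing most fields still parses cleanly
parsed = parse_property({"zpid": "1001", "price": 450000})
```

Pair this with a post-run check that counts non-None values per field; a field that suddenly drops to all-None across a run is your structure-change alarm.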
If you are also interested in scraping other e-commerce platforms, check out our guide on scraping Amazon without getting blocked, which covers many of the same anti-detection techniques.
You can build a basic scraper using free tools like Python, requests, and BeautifulSoup. However, without proxies, your IP address will be quickly blocked after a relatively small number of requests. For any serious data collection, you will need residential proxies. SpyderProxy offers affordable plans starting with pay-as-you-go pricing that works well for small to medium scraping projects.
There is no official public limit. However, based on community experience, making more than 100 to 200 requests per hour from a single IP address is likely to trigger blocking. With rotating residential proxies, you can safely scale to thousands of requests per day since each request comes from a different IP address.
Zillow has deprecated most of its public APIs in recent years. The Zillow API (formerly known as the Zestimate API) was shut down for new users. Some real estate data is available through third-party APIs and data providers like Bridge Interactive or RESO, but these typically require commercial agreements and may not include the same breadth of data visible on Zillow's website.
Residential proxies are the gold standard for Zillow scraping because they use IP addresses from real ISPs, making them nearly indistinguishable from regular user traffic. SpyderProxy residential proxies are ideal for this purpose. Datacenter proxies are faster but more likely to be detected and blocked by Zillow's anti-bot systems.
Most of Zillow's property data is embedded in the __NEXT_DATA__ JSON object on the page, which is available in the initial HTML response. For content that loads dynamically via JavaScript, use Playwright or a similar browser automation tool. The Playwright approach is slower but captures everything a real browser would see.
Zestimate values are displayed on individual property pages and in search results. You can extract them from the __NEXT_DATA__ JSON just like other property attributes. Keep in mind that Zestimate is Zillow's proprietary estimate and its accuracy varies by market. Always check Zillow's terms regarding the use and redistribution of Zestimate data.
For price monitoring and market tracking, daily scrapes of your target locations provide a good balance between data freshness and resource usage. Weekly scrapes are sufficient for broader market trend analysis. Avoid scraping the same pages more frequently than once per day, as property data does not change that often and excessive requests waste resources.
First, check if Zillow has updated their website structure by manually visiting the page in a browser and inspecting the __NEXT_DATA__ JSON. If the structure has changed, update your parsing logic. If the page loads fine in a browser but your scraper gets blocked, review your proxy configuration, headers, and request patterns. Sometimes a simple change like updating your user agent strings to the latest browser versions resolves the issue.
Scraping Zillow in 2026 requires a combination of the right tools, proper proxy infrastructure, and smart request management. The Python-based approaches covered in this guide give you everything you need to get started, from basic HTTP scraping with requests and BeautifulSoup to advanced browser automation with Playwright.
The most critical factor for reliable Zillow scraping is proxy quality. Without rotating residential proxies, your scraper will be blocked almost immediately. SpyderProxy's residential proxy network provides the IP diversity and rotation capabilities needed to scrape Zillow at scale without interruptions.
Remember to always scrape responsibly. Respect rate limits, add reasonable delays between requests, and do not overload Zillow's servers. Use the data you collect ethically and in compliance with applicable laws and terms of service.
Ready to start building your Zillow scraper? Set up your proxy infrastructure first, then implement the code examples in this guide step by step. Start with the basic scraper to validate your approach, add proxy rotation once you need to scale, and upgrade to Playwright when you encounter pages that require JavaScript rendering.
Get access to millions of residential IPs with automatic rotation. SpyderProxy makes Zillow scraping reliable, fast, and undetectable.
Start Your Free Trial