
How to Scrape Zillow in 2026: Complete Python Guide with Proxies

April 6, 2026

Zillow is the largest real estate marketplace in the United States, hosting data on over 135 million properties nationwide. For investors, analysts, and developers, extracting this data programmatically opens the door to powerful market insights that would take weeks to gather manually. Whether you need to track housing prices across neighborhoods, monitor listing trends, or build datasets for machine learning models, scraping Zillow can provide the raw data you need.

In this guide, we will walk through everything you need to know about building a Zillow scraper with Python in 2026. We will cover basic scraping with requests and BeautifulSoup, advanced rendering with Playwright, proxy rotation with residential proxies, and strategies for handling anti-bot detection. By the end, you will have production-ready code that can reliably extract property data at scale.

Why Scrape Zillow?

Zillow aggregates an enormous amount of real estate data that is valuable across many industries and use cases. Here are the most common reasons people scrape Zillow:

Market Analysis and Research

Real estate investors and analysts rely on Zillow data to understand market dynamics. By scraping property prices, days on market, and listing volumes across different zip codes, you can identify emerging markets before they become mainstream. This kind of market research is invaluable for making data-driven investment decisions. Tracking metrics like median price changes, inventory levels, and price-to-rent ratios over time gives you a comprehensive picture of where a market is heading.

Investment Research

For real estate investors, having access to granular property data helps evaluate potential deals quickly. Scraping Zillow allows you to compare asking prices against Zestimate values, identify underpriced properties, and calculate potential returns based on comparable sales in the area. You can also track foreclosure listings, auction properties, and price reductions to find opportunities that fit your investment criteria.

Lead Generation

Real estate agents and mortgage brokers use Zillow data to generate leads. By tracking new listings, price changes, and recently sold properties, you can identify homeowners who may be interested in selling or buyers actively searching in specific neighborhoods. This data can feed directly into your CRM and outreach workflows.

Price Monitoring

Whether you are a homeowner tracking your property value or a company monitoring competitor pricing, price monitoring on Zillow provides real-time market intelligence. Automated scrapers can alert you to price drops, new listings, or market shifts in your target areas, giving you an edge in negotiations and decision-making.
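
The alerting logic can be as simple as diffing two scraped snapshots. Here is a minimal sketch; the function name, the `{zpid: price}` snapshot format, and the 5% threshold are illustrative assumptions, not anything Zillow provides:

```python
def detect_price_drops(previous, current, threshold=0.05):
    """Return zpids whose listing price fell by at least `threshold` (a
    fraction) between two scraped snapshots, each a {zpid: price} mapping."""
    drops = []
    for zpid, old_price in previous.items():
        new_price = current.get(zpid)
        # Skip listings that are missing or unpriced in either snapshot
        if old_price and new_price:
            if (old_price - new_price) / old_price >= threshold:
                drops.append(zpid)
    return drops

# Example: one listing cut by 6%, another by only 0.5%
previous = {"2060": 500_000, "2061": 200_000}
current = {"2060": 470_000, "2061": 199_000}
print(detect_price_drops(previous, current))  # ['2060']
```

Run this on a schedule against freshly scraped data and feed the result into whatever notification channel you use.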

AI and Data Collection

Machine learning teams building property valuation models, recommendation engines, or market prediction tools need large, structured datasets. Scraping Zillow provides the training data required for these models. This type of AI data collection is becoming increasingly common as more companies apply machine learning to real estate.

Is It Legal to Scrape Zillow?

The legality of web scraping is nuanced and depends on several factors. Here is what you need to know before scraping Zillow:

Robots.txt and Technical Restrictions

Zillow maintains a robots.txt file that specifies which parts of the site automated bots can and cannot access. While robots.txt is technically advisory rather than legally binding, respecting it demonstrates good faith. Always review the current robots.txt at zillow.com/robots.txt before building your scraper to understand which paths are disallowed.
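
You can automate that check with Python's built-in urllib.robotparser. The rules below are illustrative placeholders, not Zillow's actual policy; always parse the live file from zillow.com/robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only -- fetch the real file from zillow.com/robots.txt
sample_rules = [
    "User-agent: *",
    "Disallow: /search/",
    "Allow: /homes/",
]

parser = RobotFileParser()
parser.parse(sample_rules)

def is_allowed(path, agent="*"):
    """Return True if the parsed robots.txt rules permit fetching `path`."""
    return parser.can_fetch(agent, f"https://www.zillow.com{path}")

print(is_allowed("/homes/Austin-TX_rb/"))  # True under the sample rules
print(is_allowed("/search/"))              # False under the sample rules
```

For the live file, call `parser.set_url("https://www.zillow.com/robots.txt")` followed by `parser.read()` instead of `parse()`.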

Terms of Service

Zillow's Terms of Service explicitly restrict automated data collection. Violating these terms could result in your IP being blocked or, in extreme cases, legal action. It is important to understand and consider these terms before proceeding with any scraping project.

Public vs. Private Data

Courts have generally drawn a distinction between publicly accessible data and data that requires authentication. Property listings displayed on public search result pages are generally considered less protected than data behind login walls. However, the legal landscape continues to evolve, and you should consult with a legal professional for your specific use case.

Disclaimer: This guide is provided for educational purposes only. The techniques described here are intended to teach web scraping concepts. You are responsible for ensuring your use of these techniques complies with applicable laws, terms of service, and ethical guidelines. Always respect website rate limits, do not overload servers, and use collected data responsibly. SpyderProxy does not encourage or endorse any activity that violates the terms of service of any website.

What Data Can You Scrape from Zillow?

Zillow exposes a wealth of property data on its public pages. Here is a breakdown of the data points you can typically extract:

  • Property Prices: Current listing price, price history, price cuts, and original list price
  • Addresses: Full street address, city, state, zip code, and neighborhood name
  • Zestimate: Zillow's proprietary home value estimate and its historical trajectory
  • Listing Details: Number of bedrooms, bathrooms, square footage, lot size, year built, property type, and listing status
  • Agent Information: Listing agent name, brokerage, and contact details
  • Historical Data: Tax history, price history, and previous sale dates and prices
  • Property Features: Heating and cooling type, parking, appliances, interior features, and exterior features
  • Photos and Descriptions: Listing photo URLs and the full property description text
  • Estimated Payments: Monthly mortgage estimate, property taxes, HOA fees, and insurance estimates
  • Neighborhood Data: Walk score, transit score, school ratings, and nearby amenities

The exact data available depends on the listing type and the specific page you are scraping. Search result pages provide summary data for multiple properties, while individual listing pages contain the full detail set.
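
If you plan to post-process this data, it helps to pin the core fields to a typed record up front. Below is a minimal sketch using a dataclass; the class name is an illustrative choice and the fields are a subset of the list above (the scrapers later in this guide use plain dicts):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class PropertyRecord:
    """Core fields extracted from a Zillow listing. All fields are optional
    because availability varies by listing type and page."""
    zpid: Optional[str] = None
    address: Optional[str] = None
    price: Optional[int] = None
    bedrooms: Optional[float] = None
    bathrooms: Optional[float] = None
    sqft: Optional[int] = None
    zestimate: Optional[int] = None
    url: Optional[str] = None

record = PropertyRecord(zpid="2060", address="123 Main St", price=450_000)
print(asdict(record)["price"])  # 450000
```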

Setting Up Your Python Environment

Before writing any code, let us set up a clean Python environment with all the dependencies we need. We recommend using Python 3.10 or later for the best compatibility.

Installing Dependencies

pip install requests beautifulsoup4 lxml pandas playwright
playwright install chromium

Here is what each package does:

  • requests: HTTP library for making web requests
  • beautifulsoup4: HTML parsing library for extracting data from page source
  • lxml: Fast XML and HTML parser used as a backend for BeautifulSoup
  • pandas: Data manipulation library for organizing and exporting scraped data
  • playwright: Browser automation library for rendering JavaScript-heavy pages

Project Structure

zillow-scraper/
    main.py              # Entry point for the scraper
    scraper.py           # Core scraping logic
    proxy_manager.py     # Proxy rotation and management
    parser.py            # HTML and JSON parsing functions
    storage.py           # Data storage and export
    config.py            # Configuration and constants
    requirements.txt     # Python dependencies
    output/              # Directory for scraped data files

This modular structure keeps your code organized and makes it easy to maintain and extend as your scraping needs grow.
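
As one example of the storage layer, storage.py can be a thin wrapper around pandas (already in our dependencies). The function name, CSV output, and zpid-based deduplication here are illustrative choices:

```python
import os
import pandas as pd

def save_properties(properties, path="output/properties.csv"):
    """Write a list of property dicts to CSV, deduplicating on zpid.
    Returns the number of rows written."""
    # Ensure the output directory exists before writing
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    df = pd.DataFrame(properties)
    if "zpid" in df.columns:
        df = df.drop_duplicates(subset="zpid")
    df.to_csv(path, index=False)
    return len(df)
```

Swapping in `df.to_json()` or a database insert later only requires touching this one module.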

Basic Zillow Scraper with Python

Let us start with a straightforward scraper using requests and BeautifulSoup. Zillow is a Next.js application, which means most of the property data is embedded in a __NEXT_DATA__ JSON object within a script tag on the page. This is actually convenient for scraping because we can parse structured JSON instead of navigating complex HTML.

import requests
from bs4 import BeautifulSoup
import json
import time
import random

class ZillowScraper:
    """Basic Zillow scraper using requests and BeautifulSoup."""

    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/124.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;"
                      "q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": "https://www.google.com/",
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        })
        self.base_url = "https://www.zillow.com"

    def search_properties(self, location, page=1):
        """
        Search for properties in a given location.
        Returns a list of property dictionaries.
        """
        search_url = f"{self.base_url}/homes/{location}_rb/"
        if page > 1:
            search_url += f"{page}_p/"

        print(f"Fetching: {search_url}")
        response = self.session.get(search_url, timeout=30)

        if response.status_code != 200:
            print(f"Error: Received status code {response.status_code}")
            return []

        return self._parse_search_results(response.text)

    def _parse_search_results(self, html):
        """
        Parse property data from Zillow search results page.
        Extracts __NEXT_DATA__ JSON embedded in the page.
        """
        soup = BeautifulSoup(html, "lxml")
        properties = []

        # Find the __NEXT_DATA__ script tag containing all page data
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

        if not script_tag:
            print("Warning: Could not find __NEXT_DATA__ script tag.")
            return properties

        try:
            # script_tag.string can be None; coerce so json.loads raises
            # JSONDecodeError instead of TypeError
            next_data = json.loads(script_tag.string or "")
        except json.JSONDecodeError as e:
            print(f"Error parsing JSON: {e}")
            return properties

        # Navigate the JSON structure to find search results
        try:
            query_state = next_data["props"]["pageProps"]["searchPageState"]
            cat1 = query_state["cat1"]
            search_results = cat1["searchResults"]["listResults"]
        except (KeyError, TypeError) as e:
            print(f"Error navigating JSON structure: {e}")
            return properties

        for result in search_results:
            property_data = {
                "zpid": result.get("zpid"),
                "address": result.get("address"),
                "city": result.get("addressCity"),
                "state": result.get("addressState"),
                "zipcode": result.get("addressZipcode"),
                "price": result.get("unformattedPrice"),
                "bedrooms": result.get("beds"),
                "bathrooms": result.get("baths"),
                "sqft": result.get("area"),
                "zestimate": result.get("zestimate"),
                "listing_type": result.get("statusType"),
                "days_on_zillow": result.get("timeOnZillow"),
                "url": result.get("detailUrl"),
                "latitude": result.get("latLong", {}).get("latitude"),
                "longitude": result.get("latLong", {}).get("longitude"),
                "broker": result.get("brokerName"),
            }
            properties.append(property_data)

        print(f"Found {len(properties)} properties.")
        return properties

    def get_property_details(self, property_url):
        """
        Fetch detailed information for a single property listing.
        """
        if not property_url.startswith("http"):
            property_url = f"{self.base_url}{property_url}"

        # Add a random delay to avoid rate limiting
        time.sleep(random.uniform(2, 5))

        response = self.session.get(property_url, timeout=30)

        if response.status_code != 200:
            print(f"Error fetching details: {response.status_code}")
            return None

        soup = BeautifulSoup(response.text, "lxml")
        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})

        if not script_tag:
            return None

        try:
            next_data = json.loads(script_tag.string)
            property_info = next_data["props"]["pageProps"]["initialReduxState"]
            gdp = property_info["gdp"]["building"]

            details = {
                "description": gdp.get("description"),
                "year_built": gdp.get("yearBuilt"),
                "lot_size": gdp.get("lotSize"),
                "property_type": gdp.get("homeType"),
                "heating": gdp.get("heatingSystem"),
                "cooling": gdp.get("coolingSystem"),
                "parking": gdp.get("parkingFeatures"),
                "hoa_fee": gdp.get("monthlyHoaFee"),
                "tax_assessed_value": gdp.get("taxAssessedValue"),
                "annual_tax": gdp.get("propertyTaxRate"),
            }
            return details
        except (KeyError, TypeError, json.JSONDecodeError) as e:
            print(f"Error parsing property details: {e}")
            return None


# Usage example
if __name__ == "__main__":
    scraper = ZillowScraper()

    # Search for properties in Austin, TX
    results = scraper.search_properties("Austin-TX")

    for prop in results[:5]:
        price = prop["price"]
        price_str = f"${price:,}" if price else "N/A"
        print(f"{prop['address']} - {price_str} - "
              f"{prop['bedrooms']}bd/{prop['bathrooms']}ba - "
              f"{prop['sqft']} sqft")

        # Optionally fetch detailed info
        if prop.get("url"):
            details = scraper.get_property_details(prop["url"])
            if details:
                print(f"  Year Built: {details['year_built']}, "
                      f"Type: {details['property_type']}")

This basic scraper works well for small-scale data collection, but you will quickly run into issues if you try to send too many requests from a single IP address. Zillow actively monitors for automated traffic and will block IPs that exhibit scraping behavior. This is where proxy rotation becomes essential.

Adding Proxy Rotation for Reliable Scraping

When scraping Zillow at any meaningful scale, you need to rotate your IP address to avoid detection and blocking. Without proxies, you will likely encounter 403 Forbidden errors, CAPTCHAs, or complete IP bans after just a few dozen requests. Using residential proxies from SpyderProxy gives you access to a pool of real residential IP addresses that rotate automatically with each request.

Residential proxies are the best choice for Zillow scraping because they use IP addresses assigned by Internet Service Providers to real households. This makes your requests appear as normal user traffic rather than automated bots. For those on a tighter budget, budget residential proxies offer a cost-effective alternative that still provides solid performance for most scraping tasks. You can also explore rotating datacenter proxies for higher-speed operations where residential IPs are not strictly required.

For a deeper dive into configuring proxies with Python, check out our guide on using rotating proxies with Python requests.

import requests
from bs4 import BeautifulSoup
import json
import time
import random

class ZillowProxyScraper:
    """
    Zillow scraper with SpyderProxy residential proxy rotation.
    Uses rotating proxies to avoid IP bans and rate limits.
    """

    def __init__(self, proxy_user, proxy_pass):
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass
        self.base_url = "https://www.zillow.com"

        # SpyderProxy rotating residential proxy configuration
        # Each request automatically gets a new IP address
        self.proxy_url_http = (
            f"http://{proxy_user}:{proxy_pass}"
            f"@geo.spyderproxy.com:10000"
        )
        self.proxy_url_socks5 = (
            f"socks5://{proxy_user}:{proxy_pass}"
            f"@geo.spyderproxy.com:10000"
        )

        # Use HTTP proxy by default
        self.proxies = {
            "http": self.proxy_url_http,
            "https": self.proxy_url_http,
        }

        # Rotate user agents to further reduce detection
        self.user_agents = [
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/124.0.0.0 Safari/537.36",
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:125.0) "
            "Gecko/20100101 Firefox/125.0",
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/605.1.15 (KHTML, like Gecko) "
            "Version/17.4 Safari/605.1.15",
            "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
            "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
        ]

    def _get_headers(self):
        """Generate randomized headers for each request."""
        return {
            "User-Agent": random.choice(self.user_agents),
            "Accept": "text/html,application/xhtml+xml,application/xml;"
                      "q=0.9,image/webp,*/*;q=0.8",
            "Accept-Language": "en-US,en;q=0.9",
            "Accept-Encoding": "gzip, deflate, br",
            "Referer": random.choice([
                "https://www.google.com/",
                "https://www.google.com/search?q=homes+for+sale",
                "https://www.bing.com/",
            ]),
            "DNT": "1",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "cross-site",
            "Sec-Fetch-User": "?1",
            "Cache-Control": "max-age=0",
        }

    def _make_request(self, url, max_retries=3):
        """
        Make a request with proxy rotation and retry logic.
        SpyderProxy automatically assigns a new IP per request.
        """
        for attempt in range(max_retries):
            try:
                response = requests.get(
                    url,
                    headers=self._get_headers(),
                    proxies=self.proxies,
                    timeout=30,
                )

                if response.status_code == 200:
                    return response

                if response.status_code == 403:
                    print(f"Attempt {attempt + 1}: 403 Forbidden. "
                          f"Rotating IP and retrying...")
                    time.sleep(random.uniform(3, 7))
                    continue

                if response.status_code == 429:
                    print(f"Attempt {attempt + 1}: Rate limited. "
                          f"Waiting before retry...")
                    time.sleep(random.uniform(10, 20))
                    continue

                print(f"Attempt {attempt + 1}: Status {response.status_code}")

            except requests.exceptions.Timeout:
                print(f"Attempt {attempt + 1}: Request timed out.")
                time.sleep(random.uniform(2, 5))
            except requests.exceptions.ProxyError:
                print(f"Attempt {attempt + 1}: Proxy error. Retrying...")
                time.sleep(random.uniform(1, 3))
            except requests.exceptions.RequestException as e:
                print(f"Attempt {attempt + 1}: Request error: {e}")
                time.sleep(random.uniform(2, 5))

        print(f"Failed to fetch {url} after {max_retries} attempts.")
        return None

    def search_properties(self, location, max_pages=5):
        """
        Search for properties across multiple pages.
        Uses proxy rotation for each request.
        """
        all_properties = []

        for page in range(1, max_pages + 1):
            search_url = f"{self.base_url}/homes/{location}_rb/"
            if page > 1:
                search_url += f"{page}_p/"

            print(f"Scraping page {page}: {search_url}")
            response = self._make_request(search_url)

            if not response:
                print(f"Skipping page {page} due to request failure.")
                continue

            properties = self._parse_search_results(response.text)
            all_properties.extend(properties)

            print(f"Page {page}: Found {len(properties)} properties "
                  f"(Total: {len(all_properties)})")

            # Respectful delay between pages
            time.sleep(random.uniform(3, 8))

        return all_properties

    def _parse_search_results(self, html):
        """Parse property data from the __NEXT_DATA__ JSON."""
        soup = BeautifulSoup(html, "lxml")
        properties = []

        script_tag = soup.find("script", {"id": "__NEXT_DATA__"})
        if not script_tag:
            return properties

        try:
            next_data = json.loads(script_tag.string)
            query_state = next_data["props"]["pageProps"]["searchPageState"]
            results = query_state["cat1"]["searchResults"]["listResults"]

            for result in results:
                property_data = {
                    "zpid": result.get("zpid"),
                    "address": result.get("address"),
                    "city": result.get("addressCity"),
                    "state": result.get("addressState"),
                    "zipcode": result.get("addressZipcode"),
                    "price": result.get("unformattedPrice"),
                    "bedrooms": result.get("beds"),
                    "bathrooms": result.get("baths"),
                    "sqft": result.get("area"),
                    "zestimate": result.get("zestimate"),
                    "listing_type": result.get("statusType"),
                    "url": result.get("detailUrl"),
                    "latitude": result.get("latLong", {}).get("latitude"),
                    "longitude": result.get("latLong", {}).get("longitude"),
                    "broker": result.get("brokerName"),
                }
                properties.append(property_data)

        except (KeyError, TypeError, json.JSONDecodeError) as e:
            print(f"Parse error: {e}")

        return properties

    def use_socks5_proxy(self):
        """
        Switch to SOCKS5 proxy protocol.
        Useful when HTTP proxies are being detected.
        Requires: pip install requests[socks]
        """
        self.proxies = {
            "http": self.proxy_url_socks5,
            "https": self.proxy_url_socks5,
        }
        print("Switched to SOCKS5 proxy protocol.")


# Usage example
if __name__ == "__main__":
    scraper = ZillowProxyScraper(
        proxy_user="your_spyderproxy_username",
        proxy_pass="your_spyderproxy_password",
    )

    # Scrape multiple pages of results for Miami, FL
    properties = scraper.search_properties("Miami-FL", max_pages=3)

    print(f"\nTotal properties scraped: {len(properties)}")
    for prop in properties[:10]:
        price = prop['price']
        price_str = f"${price:,}" if price else "N/A"
        print(f"  {prop['address']} - {price_str}")

The key advantage of using SpyderProxy's rotating residential proxies is that each request is automatically routed through a different US-based residential IP address. This means Zillow sees each request as coming from a different household, making it extremely difficult to detect and block your scraper. You can verify your proxy setup is working correctly using our proxy checker tool before running your scraper at scale.
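
A quick way to sanity-check rotation from code is to hit an IP-echo endpoint several times through the proxy and confirm that more than one exit IP appears. The helper below is a sketch: httpbin.org/ip is a third-party echo service used for illustration, and the `fetch` parameter exists so the logic can be exercised without a network call:

```python
def check_rotation(proxies, n=3, fetch=None):
    """Request an IP-echo endpoint n times and return the set of exit IPs
    seen. With rotating proxies, the set should usually contain more than
    one address."""
    if fetch is None:
        def fetch():
            import requests  # only needed for the real network call
            resp = requests.get("https://httpbin.org/ip",
                                proxies=proxies, timeout=15)
            return resp.json()["origin"]
    return {fetch() for _ in range(n)}

# Usage with your SpyderProxy credentials:
# proxies = {"http": proxy_url, "https": proxy_url}
# print(check_rotation(proxies, n=5))
```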

Advanced: Scraping Zillow with Playwright

Some Zillow pages rely heavily on JavaScript rendering, meaning the property data is loaded dynamically after the initial page load. In these cases, a simple HTTP request will not capture all the data. Playwright is a browser automation library that runs a full Chromium browser, allowing you to interact with pages exactly as a real user would.

This approach is slower than direct HTTP requests, but it captures data that only appears after JavaScript execution, including dynamically loaded map results, infinite scroll listings, and interactive property details.

import asyncio
import json
import random
from playwright.async_api import async_playwright

class ZillowPlaywrightScraper:
    """
    Advanced Zillow scraper using Playwright for JS-rendered pages.
    Supports proxy rotation via SpyderProxy.
    """

    def __init__(self, proxy_user=None, proxy_pass=None):
        self.proxy_user = proxy_user
        self.proxy_pass = proxy_pass
        self.base_url = "https://www.zillow.com"

    async def _create_browser(self):
        """Create a Playwright browser instance with proxy configuration."""
        playwright = await async_playwright().start()

        launch_options = {
            "headless": True,
            "args": [
                "--disable-blink-features=AutomationControlled",
                "--no-sandbox",
                "--disable-dev-shm-usage",
            ],
        }

        # Configure SpyderProxy if credentials are provided
        if self.proxy_user and self.proxy_pass:
            launch_options["proxy"] = {
                "server": "http://geo.spyderproxy.com:10000",
                "username": self.proxy_user,
                "password": self.proxy_pass,
            }

        browser = await playwright.chromium.launch(**launch_options)
        return playwright, browser

    async def _create_context(self, browser):
        """Create a browser context with realistic settings."""
        context = await browser.new_context(
            viewport={"width": 1920, "height": 1080},
            user_agent=(
                "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                "AppleWebKit/537.36 (KHTML, like Gecko) "
                "Chrome/124.0.0.0 Safari/537.36"
            ),
            locale="en-US",
            timezone_id="America/New_York",
            geolocation={"longitude": -73.935242, "latitude": 40.730610},
            permissions=["geolocation"],
        )

        # Remove webdriver detection signals
        await context.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            });
            Object.defineProperty(navigator, 'languages', {
                get: () => ['en-US', 'en']
            });
            Object.defineProperty(navigator, 'plugins', {
                get: () => [1, 2, 3, 4, 5]
            });
        """)

        return context

    async def search_properties(self, location, max_pages=3):
        """
        Search for properties using a headless browser.
        Handles JavaScript rendering and dynamic content.
        """
        playwright, browser = await self._create_browser()
        context = await self._create_context(browser)
        page = await context.new_page()

        all_properties = []

        try:
            for page_num in range(1, max_pages + 1):
                search_url = f"{self.base_url}/homes/{location}_rb/"
                if page_num > 1:
                    search_url += f"{page_num}_p/"

                print(f"Loading page {page_num}: {search_url}")

                await page.goto(search_url, wait_until="networkidle")
                await page.wait_for_timeout(random.randint(2000, 4000))

                # Scroll down to trigger lazy-loaded content
                await self._scroll_page(page)

                # Extract __NEXT_DATA__ from the page
                next_data = await page.evaluate("""
                    () => {
                        const el = document.getElementById('__NEXT_DATA__');
                        return el ? JSON.parse(el.textContent) : null;
                    }
                """)

                if next_data:
                    properties = self._parse_next_data(next_data)
                    all_properties.extend(properties)
                    print(f"Page {page_num}: {len(properties)} properties")
                else:
                    # Fallback: scrape from rendered DOM
                    properties = await self._parse_dom(page)
                    all_properties.extend(properties)
                    print(f"Page {page_num}: {len(properties)} properties "
                          f"(from DOM)")

                # Random delay between pages
                await page.wait_for_timeout(random.randint(3000, 7000))

        finally:
            await context.close()
            await browser.close()
            await playwright.stop()

        return all_properties

    async def _scroll_page(self, page):
        """Simulate natural scrolling to load lazy content."""
        total_height = await page.evaluate("document.body.scrollHeight")
        current_position = 0
        scroll_step = random.randint(300, 600)

        while current_position < total_height:
            current_position += scroll_step
            await page.evaluate(f"window.scrollTo(0, {current_position})")
            await page.wait_for_timeout(random.randint(200, 500))

        # Scroll back to top
        await page.evaluate("window.scrollTo(0, 0)")
        await page.wait_for_timeout(1000)

    def _parse_next_data(self, next_data):
        """Parse properties from the __NEXT_DATA__ JSON object."""
        properties = []
        try:
            query_state = next_data["props"]["pageProps"]["searchPageState"]
            results = query_state["cat1"]["searchResults"]["listResults"]

            for result in results:
                properties.append({
                    "zpid": result.get("zpid"),
                    "address": result.get("address"),
                    "city": result.get("addressCity"),
                    "state": result.get("addressState"),
                    "zipcode": result.get("addressZipcode"),
                    "price": result.get("unformattedPrice"),
                    "bedrooms": result.get("beds"),
                    "bathrooms": result.get("baths"),
                    "sqft": result.get("area"),
                    "zestimate": result.get("zestimate"),
                    "url": result.get("detailUrl"),
                })
        except (KeyError, TypeError):
            pass
        return properties

    async def _parse_dom(self, page):
        """Fallback: parse property cards directly from the DOM."""
        properties = await page.evaluate("""
            () => {
                const cards = document.querySelectorAll(
                    'article[data-test="property-card"]'
                );
                return Array.from(cards).map(card => {
                    const priceEl = card.querySelector(
                        '[data-test="property-card-price"]'
                    );
                    const addressEl = card.querySelector('address');
                    const linkEl = card.querySelector('a[data-test="property-card-link"]');
                    const detailsEl = card.querySelector(
                        '[data-test="property-card-details"]'
                    );
                    return {
                        price: priceEl ? priceEl.textContent.trim() : null,
                        address: addressEl
                            ? addressEl.textContent.trim() : null,
                        url: linkEl ? linkEl.href : null,
                        details: detailsEl
                            ? detailsEl.textContent.trim() : null,
                    };
                });
            }
        """)
        return properties

    async def get_property_details(self, property_url):
        """Scrape full details from a single property listing page."""
        playwright, browser = await self._create_browser()
        context = await self._create_context(browser)
        page = await context.new_page()

        details = None

        try:
            if not property_url.startswith("http"):
                property_url = f"{self.base_url}{property_url}"

            await page.goto(property_url, wait_until="networkidle")
            await page.wait_for_timeout(random.randint(2000, 4000))
            await self._scroll_page(page)

            details = await page.evaluate("""
                () => {
                    const getData = (selector) => {
                        const el = document.querySelector(selector);
                        return el ? el.textContent.trim() : null;
                    };
                    return {
                        price: getData('[data-testid="price"]'),
                        address: getData(
                            '[data-testid="bdp-property-address"]'
                        ),
                        beds: getData('[data-testid="bed-bath-item"]:nth-child(1)'),
                        baths: getData('[data-testid="bed-bath-item"]:nth-child(2)'),
                        sqft: getData('[data-testid="bed-bath-item"]:nth-child(3)'),
                        description: getData(
                            '[data-testid="description-text"]'
                        ),
                        zestimate: getData('[data-testid="zestimate-text"]'),
                    };
                }
            """)

        finally:
            await context.close()
            await browser.close()
            await playwright.stop()

        return details


# Usage example
async def main():
    scraper = ZillowPlaywrightScraper(
        proxy_user="your_spyderproxy_username",
        proxy_pass="your_spyderproxy_password",
    )

    properties = await scraper.search_properties("Denver-CO", max_pages=2)

    print(f"\nTotal properties: {len(properties)}")
    for prop in properties[:5]:
        print(f"  {prop.get('address')} - {prop.get('price')}")


if __name__ == "__main__":
    asyncio.run(main())

The Playwright approach is particularly useful for scraping property detail pages where data is loaded progressively as you scroll. It also handles situations where Zillow returns a JavaScript challenge page instead of the actual content, since the full browser can execute the challenge script just like a regular user's browser would.

Handling Anti-Bot Detection

Zillow employs several layers of anti-bot protection. Understanding these mechanisms and how to work around them is critical for building a reliable scraper. Here are the main techniques you should implement:

User Agent Rotation

Sending the same User-Agent header with every request is a clear signal of automated traffic. Maintain a list of current, realistic user agent strings and rotate them randomly. Make sure your user agents match actual browser versions that are currently in use, as outdated user agents are a red flag.
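A minimal rotation helper might look like the following sketch. The user agent strings in the pool are illustrative; replace them with versions you have verified against browsers currently in circulation.

```python
import random

# Illustrative pool of realistic desktop user agents. Keep this list
# refreshed so the browser versions match ones actually in use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:133.0) Gecko/20100101 Firefox/133.0",
]

def random_user_agent():
    """Pick a user agent at random for the next request."""
    return random.choice(USER_AGENTS)
```

Call random_user_agent() when building the headers for each request so no two consecutive requests are guaranteed to share the same string.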

Request Delays and Throttling

Real users do not load pages at machine speed. Add random delays between requests to simulate natural browsing behavior. A delay of 3 to 8 seconds between requests is a good baseline. For detail pages, wait even longer since users typically spend time reading listing information before moving to the next page.
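A simple way to implement this is a jittered sleep between requests, with a longer window for detail pages. The function below is a sketch; the 3-to-8-second defaults follow the baseline above.

```python
import random
import time

def polite_delay(min_s=3.0, max_s=8.0, detail_page=False):
    """Sleep a random interval to mimic human browsing pace.

    Returns the delay actually used, which is handy for logging.
    """
    if detail_page:
        # Users linger on listing pages, so wait roughly twice as long
        # before moving on.
        min_s, max_s = min_s * 2, max_s * 2
    delay = random.uniform(min_s, max_s)
    time.sleep(delay)
    return delay
```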

Session Management

Maintain cookies across requests within a session to appear as a consistent user. However, periodically rotate sessions to avoid building a suspicious cookie profile. Creating a new session every 20 to 30 requests is a reasonable strategy.
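One way to sketch this with requests is a thin wrapper that swaps in a fresh Session after a randomized number of requests in the 20-to-30 range mentioned above. The class name and structure here are illustrative, not a standard API.

```python
import random
import requests

class RotatingSession:
    """Keep cookies within a session, but rotate to a fresh session
    every 20-30 requests so no long-lived cookie profile builds up."""

    def __init__(self, rotate_every=(20, 30)):
        self.rotate_every = rotate_every
        self._new_session()

    def _new_session(self):
        self.session = requests.Session()
        self.request_count = 0
        # Randomize the rotation point so it is not a fixed signature.
        self.limit = random.randint(*self.rotate_every)

    def get(self, url, **kwargs):
        if self.request_count >= self.limit:
            self.session.close()
            self._new_session()
        self.request_count += 1
        return self.session.get(url, **kwargs)
```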

Header Fingerprinting

Modern anti-bot systems analyze the full set of HTTP headers, not just the User-Agent. Make sure your headers are consistent and realistic. The Accept, Accept-Language, Accept-Encoding, and Sec-Fetch headers should all match what a real browser would send. Inconsistent headers are a common reason scrapers get detected.
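A consistent Chrome-like header set might be assembled as below. The values are illustrative; before relying on them, capture a real browser's request in DevTools and copy the headers it actually sends.

```python
def build_headers(user_agent):
    """Build a full, internally consistent browser-like header set.

    The Accept/Accept-Language/Sec-Fetch values below mirror a typical
    Chrome navigation request, but verify them against a live browser.
    """
    return {
        "User-Agent": user_agent,
        "Accept": ("text/html,application/xhtml+xml,application/xml;"
                   "q=0.9,image/avif,image/webp,*/*;q=0.8"),
        "Accept-Language": "en-US,en;q=0.9",
        "Accept-Encoding": "gzip, deflate, br",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Upgrade-Insecure-Requests": "1",
    }
```

The key point is consistency: a Firefox User-Agent paired with Chrome-only Sec-Fetch values is exactly the kind of mismatch these systems look for.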

JavaScript Fingerprinting

When using Playwright, Zillow may check for browser automation signals like the navigator.webdriver property. The initialization script in our Playwright example above removes these signals, but you should stay updated on new detection techniques as they evolve.

Referrer Headers

Arriving at a Zillow search page without a referrer or with a suspicious one can trigger anti-bot detection. Set your Referer header to Google search or another natural source. Vary the referrer across requests to appear more organic.
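A small pool of plausible entry points can be sampled per request, as in this sketch (the URLs listed are illustrative choices, not a canonical set):

```python
import random

# Plausible organic entry points; vary them across requests so the
# Referer header does not become a fixed signature.
REFERERS = [
    "https://www.google.com/",
    "https://www.bing.com/",
    "https://duckduckgo.com/",
]

def random_referer():
    """Pick a referrer for the next request's Referer header."""
    return random.choice(REFERERS)
```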

For a comprehensive overview of proxy selection for scraping projects, see our guide on the best proxies for web scraping.

Storing and Analyzing Scraped Data

Once you have scraped property data from Zillow, you need to store it in a structured format and perform analysis. Pandas makes this straightforward. Here is a complete example of saving scraped data to CSV and running basic analysis:

import json
import os
from datetime import datetime

import pandas as pd

class ZillowDataManager:
    """Manage storage and analysis of scraped Zillow data."""

    def __init__(self, output_dir="output"):
        self.output_dir = output_dir
        # Create the output directory up front so saves never fail
        # on a missing path.
        os.makedirs(output_dir, exist_ok=True)

    def save_to_csv(self, properties, filename=None):
        """Save scraped properties to a CSV file."""
        if not properties:
            print("No properties to save.")
            return None

        df = pd.DataFrame(properties)

        # Add metadata columns
        df["scraped_at"] = datetime.now().isoformat()
        df["source"] = "zillow"

        # Clean price data
        if "price" in df.columns:
            df["price"] = pd.to_numeric(df["price"], errors="coerce")

        if "sqft" in df.columns:
            df["sqft"] = pd.to_numeric(df["sqft"], errors="coerce")

        if "zestimate" in df.columns:
            df["zestimate"] = pd.to_numeric(
                df["zestimate"], errors="coerce"
            )

        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"zillow_properties_{timestamp}.csv"

        filepath = f"{self.output_dir}/{filename}"
        df.to_csv(filepath, index=False)
        print(f"Saved {len(df)} properties to {filepath}")
        return filepath

    def save_to_json(self, properties, filename=None):
        """Save scraped properties to a JSON file."""
        if filename is None:
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            filename = f"zillow_properties_{timestamp}.json"

        filepath = f"{self.output_dir}/{filename}"
        with open(filepath, "w") as f:
            json.dump(properties, f, indent=2)
        print(f"Saved {len(properties)} properties to {filepath}")
        return filepath

    def analyze_market(self, csv_path):
        """Run basic market analysis on scraped property data."""
        df = pd.read_csv(csv_path)
        print("=" * 60)
        print("ZILLOW MARKET ANALYSIS REPORT")
        print("=" * 60)

        print(f"\nTotal Properties: {len(df)}")
        print(f"Date Range: {df['scraped_at'].min()} to "
              f"{df['scraped_at'].max()}")

        # Price analysis
        if "price" in df.columns:
            price_data = df["price"].dropna()
            print(f"\n--- Price Analysis ---")
            print(f"  Median Price: ${price_data.median():,.0f}")
            print(f"  Mean Price:   ${price_data.mean():,.0f}")
            print(f"  Min Price:    ${price_data.min():,.0f}")
            print(f"  Max Price:    ${price_data.max():,.0f}")
            print(f"  Std Dev:      ${price_data.std():,.0f}")

        # Price per square foot
        if "price" in df.columns and "sqft" in df.columns:
            df["price_per_sqft"] = df["price"] / df["sqft"]
            ppsf = df["price_per_sqft"].dropna()
            print(f"\n--- Price per Sq Ft ---")
            print(f"  Median: ${ppsf.median():,.0f}/sqft")
            print(f"  Mean:   ${ppsf.mean():,.0f}/sqft")

        # Zestimate comparison
        if "price" in df.columns and "zestimate" in df.columns:
            df["price_vs_zestimate"] = df["price"] - df["zestimate"]
            diff = df["price_vs_zestimate"].dropna()
            underpriced = len(diff[diff < 0])
            overpriced = len(diff[diff > 0])
            print(f"\n--- Zestimate Comparison ---")
            print(f"  Below Zestimate: {underpriced} properties")
            print(f"  Above Zestimate: {overpriced} properties")
            print(f"  Avg Difference:  ${diff.mean():,.0f}")

        # Bedroom distribution
        if "bedrooms" in df.columns:
            print(f"\n--- Bedroom Distribution ---")
            bed_counts = df["bedrooms"].value_counts().sort_index()
            for beds, count in bed_counts.items():
                print(f"  {beds} bed: {count} properties")

        # City breakdown
        if "city" in df.columns:
            print(f"\n--- Top Cities ---")
            city_counts = df["city"].value_counts().head(10)
            for city, count in city_counts.items():
                median = df[df["city"] == city]["price"].median()
                median_str = f"${median:,.0f}" if pd.notna(median) else "N/A"
                print(f"  {city}: {count} listings "
                      f"(median: {median_str})")

        print("\n" + "=" * 60)
        return df


# Usage example
if __name__ == "__main__":
    manager = ZillowDataManager(output_dir="output")

    # Example: save properties from a scraping session
    sample_properties = [
        {
            "address": "123 Main St, Austin, TX 78701",
            "city": "Austin",
            "state": "TX",
            "price": 450000,
            "bedrooms": 3,
            "bathrooms": 2,
            "sqft": 1800,
            "zestimate": 465000,
        },
        {
            "address": "456 Oak Ave, Austin, TX 78704",
            "city": "Austin",
            "state": "TX",
            "price": 625000,
            "bedrooms": 4,
            "bathrooms": 3,
            "sqft": 2400,
            "zestimate": 610000,
        },
    ]

    csv_path = manager.save_to_csv(sample_properties)
    if csv_path:
        manager.analyze_market(csv_path)

This data management class gives you a clean workflow for saving scraped results and generating quick market reports. You can extend the analysis methods to calculate more sophisticated metrics like absorption rates, inventory turnover, or price trend regressions.

Scaling Your Zillow Scraper

Once your basic scraper is working reliably, you will likely want to scale it to cover more locations and properties. Here are key strategies for scaling effectively:

Concurrent Requests

Python's asyncio and aiohttp libraries allow you to make multiple requests concurrently without blocking. Instead of scraping cities sequentially, you can scrape several at once. Be careful not to push concurrency too high: five to ten concurrent requests is usually a safe upper limit that avoids overwhelming the target server.
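The concurrency cap is easy to enforce with an asyncio.Semaphore. In this sketch, `fetcher` stands in for your actual scraping coroutine (for example an aiohttp call wrapped in your proxy and header logic); the function names are illustrative.

```python
import asyncio

async def fetch_city(city, sem, fetcher):
    """Run one city's scrape, bounded by the shared semaphore."""
    async with sem:
        return await fetcher(city)

async def scrape_cities(cities, fetcher, max_concurrency=5):
    """Scrape many cities concurrently, never more than
    max_concurrency at once (five is the conservative cap above)."""
    sem = asyncio.Semaphore(max_concurrency)
    tasks = [fetch_city(c, sem, fetcher) for c in cities]
    return await asyncio.gather(*tasks)
```

Because gather preserves input order, results line up with the cities list even though the underlying requests complete out of order.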

Error Handling and Retry Logic

At scale, you will encounter more errors. Implement exponential backoff for retries, where each subsequent retry waits longer than the previous one. Log all errors with timestamps and URLs so you can identify patterns and adjust your strategy. Keep track of which pages failed so you can retry them in a separate pass rather than losing that data entirely.
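A generic backoff wrapper covering these points might look like this sketch, where `fetch` is whatever function performs the actual request (the names and defaults are illustrative):

```python
import logging
import random
import time

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

def fetch_with_retries(fetch, url, max_retries=4, base_delay=2.0):
    """Call fetch(url), retrying on failure with exponential backoff.

    Each retry waits base_delay * 2**attempt plus a little jitter, and
    every failure is logged with a timestamp and the URL involved.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception as exc:
            if attempt == max_retries:
                logging.error("giving up on %s: %s", url, exc)
                raise
            wait = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            logging.warning("%s failed (%s); retrying in %.1fs", url, exc, wait)
            time.sleep(wait)
```

Failed URLs that exhaust their retries can be appended to a list for a separate retry pass later, rather than being lost.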

Queue-Based Architecture

For large-scale scraping jobs, consider using a task queue like Redis Queue or Celery. This allows you to distribute scraping work across multiple machines, each with its own set of proxies. A queue also provides natural retry logic and progress tracking, making your scraper more resilient and easier to monitor.

Data Deduplication

When scraping the same locations over time, you will encounter duplicate listings. Use the Zillow property ID (zpid) as a unique key to identify and handle duplicates. Store the zpid in a set or database index so you can quickly skip properties you have already scraped or update existing records with new price data.
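A minimal in-memory version of this keyed on zpid might look like the sketch below; for persistent scrapes you would back the `seen` mapping with a database index instead.

```python
class PropertyDeduplicator:
    """Track seen zpids so repeated scrapes can skip duplicates
    or flag listings whose price has changed."""

    def __init__(self):
        self.seen = {}  # zpid -> last known price

    def process(self, prop):
        """Classify a scraped record as 'new', 'updated', or 'duplicate'."""
        zpid = prop.get("zpid")
        if zpid is None:
            # No stable id available; keep the record and dedupe downstream.
            return "new"
        if zpid not in self.seen:
            self.seen[zpid] = prop.get("price")
            return "new"
        if prop.get("price") != self.seen[zpid]:
            # Same listing, new price: update the stored record.
            self.seen[zpid] = prop.get("price")
            return "updated"
        return "duplicate"
```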

Scheduling Regular Scrapes

Set up cron jobs or scheduled tasks to run your scraper at regular intervals. Daily or weekly scrapes allow you to build historical datasets that are valuable for trend analysis. Make sure your scheduling accounts for reasonable hours and avoids peak traffic times on Zillow.
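On Linux, a crontab entry is the simplest way to do this. The entry below is a sketch with illustrative paths; it runs a nightly scrape at 3:30 AM, an off-peak hour, and appends output to a log file.

```shell
# Edit the crontab with `crontab -e`, then add (paths are illustrative):
# min hour day month weekday  command
30 3 * * * /usr/bin/python3 /home/user/zillow_scraper/run.py >> /home/user/zillow_scraper/cron.log 2>&1
```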

For a broader perspective on web scraping best practices, including architecture patterns for large-scale projects, see our dedicated use case guide.

Common Errors and How to Fix Them

Here are the most frequently encountered issues when scraping Zillow and how to resolve them:

403 Forbidden

This is the most common error and means Zillow has detected your request as automated traffic. Causes include using the same IP address for too many requests, sending requests without proper headers, or using a flagged proxy IP. Fix this by rotating your proxies, randomizing your headers, and adding delays between requests. SpyderProxy's residential proxies significantly reduce the frequency of 403 errors because they use genuine residential IP addresses.

CAPTCHA Challenges

Zillow may present a CAPTCHA page instead of the actual content. If you are using requests, you will receive an HTML page containing the CAPTCHA markup instead of property data. With Playwright, you can detect CAPTCHAs by checking for specific page elements. The best mitigation is to reduce your request rate, use high-quality residential proxies, and avoid patterns that trigger CAPTCHAs. If you encounter CAPTCHAs frequently, your proxy quality or request patterns likely need adjustment.

Empty Responses and Missing Data

Sometimes Zillow returns a valid 200 response, but the __NEXT_DATA__ script tag is missing or contains incomplete data. This can happen when Zillow serves a different page variant, when the location is invalid, or when JavaScript rendering is required. Check that your location string matches Zillow's URL format exactly, and consider switching to the Playwright scraper for pages that return incomplete data with simple HTTP requests.

Rate Limiting (429 Too Many Requests)

A 429 response means you are sending requests too quickly. Implement exponential backoff when you receive this status code. Start with a 10-second wait and double it with each consecutive 429 response. Once you can successfully complete a request again, gradually return to your normal request rate.
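That doubling strategy can be sketched as follows. Here `request_fn(url)` stands in for your own request function and is assumed to return an object with a `status_code` attribute, as requests responses do; the names and limits are illustrative.

```python
import time

def backoff_on_429(request_fn, url, start_wait=10.0, max_wait=300.0):
    """Retry on HTTP 429, starting at start_wait seconds and doubling
    the wait after each consecutive 429 until max_wait is exceeded."""
    wait = start_wait
    while True:
        resp = request_fn(url)
        if resp.status_code != 429:
            return resp
        if wait > max_wait:
            raise RuntimeError(f"persistent rate limiting on {url}")
        time.sleep(wait)
        wait *= 2
```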

Connection Timeouts

Timeouts can occur due to proxy issues, network instability, or Zillow's server being slow. Set a reasonable timeout of 30 seconds for each request and implement retry logic for timed-out requests. If you experience frequent timeouts with a specific proxy, try switching to a different proxy protocol or checking your proxy connection with the proxy checker tool.

JSON Structure Changes

Zillow periodically updates their website structure, which can break your JSON parsing logic. Build your parser to handle missing keys gracefully using .get() with default values. Set up monitoring that alerts you when your scraper returns zero results for previously working locations, which is a strong indicator that the page structure has changed.
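Defensive parsing plus a zero-result check might be sketched like this, reusing the search-result field names from the parser earlier in the guide (the helper names themselves are illustrative):

```python
def parse_result(result):
    """Extract fields defensively; a missing key yields None
    instead of raising KeyError when the JSON structure shifts."""
    return {
        "zpid": result.get("zpid"),
        "price": result.get("unformattedPrice"),
        "beds": result.get("beds"),
        "zestimate": result.get("zestimate"),
    }

def check_health(parsed_batch, location):
    """Flag a previously working location that suddenly yields nothing,
    a strong hint that the page structure has changed."""
    if not parsed_batch:
        print(f"ALERT: zero results for {location}; "
              "page structure may have changed")
        return False
    return True
```

Wire check_health into your scraping loop so structure changes surface as alerts rather than as silently shrinking datasets.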

If you are also interested in scraping other e-commerce platforms, check out our guide on scraping Amazon without getting blocked, which covers many of the same anti-detection techniques.

Frequently Asked Questions

Can I scrape Zillow for free?

You can build a basic scraper using free tools like Python, requests, and BeautifulSoup. However, without proxies, your IP address will be quickly blocked after a relatively small number of requests. For any serious data collection, you will need residential proxies. SpyderProxy offers affordable plans starting with pay-as-you-go pricing that works well for small to medium scraping projects.

How many requests can I make to Zillow per day?

There is no official public limit. However, based on community experience, making more than 100 to 200 requests per hour from a single IP address is likely to trigger blocking. With rotating residential proxies, you can safely scale to thousands of requests per day since each request comes from a different IP address.

Is the Zillow API an alternative to scraping?

Zillow has deprecated most of its public APIs in recent years. The Zillow API (formerly known as the Zestimate API) was shut down for new users. Some real estate data is available through third-party APIs and data providers like Bridge Interactive or RESO, but these typically require commercial agreements and may not include the same breadth of data visible on Zillow's website.

Which proxy type is best for scraping Zillow?

Residential proxies are the gold standard for Zillow scraping because they use IP addresses from real ISPs, making them nearly indistinguishable from regular user traffic. SpyderProxy residential proxies are ideal for this purpose. Datacenter proxies are faster but more likely to be detected and blocked by Zillow's anti-bot systems.

How do I handle Zillow's dynamic content?

Most of Zillow's property data is embedded in the __NEXT_DATA__ JSON object on the page, which is available in the initial HTML response. For content that loads dynamically via JavaScript, use Playwright or a similar browser automation tool. The Playwright approach is slower but captures everything a real browser would see.

Can I scrape Zillow's Zestimate data?

Zestimate values are displayed on individual property pages and in search results. You can extract them from the __NEXT_DATA__ JSON just like other property attributes. Keep in mind that Zestimate is Zillow's proprietary estimate and its accuracy varies by market. Always check Zillow's terms regarding the use and redistribution of Zestimate data.

How often should I scrape Zillow for market tracking?

For price monitoring and market tracking, daily scrapes of your target locations provide a good balance between data freshness and resource usage. Weekly scrapes are sufficient for broader market trend analysis. Avoid scraping the same pages more frequently than once per day, as property data does not change that often and excessive requests waste resources.

What should I do if my scraper suddenly stops working?

First, check if Zillow has updated their website structure by manually visiting the page in a browser and inspecting the __NEXT_DATA__ JSON. If the structure has changed, update your parsing logic. If the page loads fine in a browser but your scraper gets blocked, review your proxy configuration, headers, and request patterns. Sometimes a simple change like updating your user agent strings to the latest browser versions resolves the issue.

Conclusion

Scraping Zillow in 2026 requires a combination of the right tools, proper proxy infrastructure, and smart request management. The Python-based approaches covered in this guide give you everything you need to get started, from basic HTTP scraping with requests and BeautifulSoup to advanced browser automation with Playwright.

The most critical factor for reliable Zillow scraping is proxy quality. Without rotating residential proxies, your scraper will be blocked almost immediately. SpyderProxy's residential proxy network provides the IP diversity and rotation capabilities needed to scrape Zillow at scale without interruptions.

Remember to always scrape responsibly. Respect rate limits, add reasonable delays between requests, and do not overload Zillow's servers. Use the data you collect ethically and in compliance with applicable laws and terms of service.

Ready to start building your Zillow scraper? Set up your proxy infrastructure first, then implement the code examples in this guide step by step. Start with the basic scraper to validate your approach, add proxy rotation once you need to scale, and upgrade to Playwright when you encounter pages that require JavaScript rendering.

Ready to Scrape Zillow at Scale?

Get access to millions of residential IPs with automatic rotation. SpyderProxy makes Zillow scraping reliable, fast, and undetectable.

Start Your Free Trial