How to Scrape Yelp Data with Python (2026): Complete Tutorial

SpyderProxy Team | Published 2026-04-19

Yelp sits on one of the largest collections of small-business data on the open web — names, addresses, phone numbers, hours, categories, ratings, and tens of millions of reviews. For local SEO research, lead generation, market analysis, and competitive intelligence, that data is gold. The catch: Yelp knows it's gold, and protects it aggressively with 403 responses, CAPTCHA walls, and HTML class names that change just often enough to break naïve scrapers.

This guide walks through the full pipeline for scraping Yelp with Python in 2026: what data is worth pulling, what's legal versus risky, the requests/BeautifulSoup baseline, the Selenium fallback for when Yelp gets aggressive, and the residential proxy rotation that keeps you out of Yelp's block lists. Every code block is copy-paste-ready.

What Yelp Data Is Actually Worth Scraping?

Yelp pages roughly break down into four categories of useful data:

  • Business listings — name, address, phone, website, hours, business categories, neighborhood. The bread-and-butter of local-business datasets.
  • Aggregated ratings — overall star rating (1–5) and total review count. Critical for local SEO benchmarking and competitor analysis.
  • Reviews — individual customer reviews with star rating, date, reviewer name, and review text. Highest legal/ethical sensitivity (user-generated content).
  • Metadata — amenities (Wi-Fi, parking, outdoor seating), photo counts, price range ($–$$$$), claimed/unclaimed status, "popular with" tags.

Most legitimate use cases need the business listings, aggregated ratings, and metadata — the structured business data Yelp surfaces in search results. Review text is where you should pause and check the legal section below.
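If you're planning a dataset around these categories, it helps to fix the record shape up front. A minimal sketch — the field names here are our own choices for illustration, not anything Yelp ships:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class YelpBusiness:
    # Business listing basics
    name: str
    address: Optional[str] = None
    phone: Optional[str] = None
    # Aggregated ratings
    rating: Optional[float] = None       # 1.0-5.0 stars
    review_count: Optional[int] = None
    # Metadata
    price_range: Optional[str] = None    # "$" through "$$$$"
    categories: list = field(default_factory=list)

biz = YelpBusiness(name="Example Coffee", rating=4.5, review_count=312)
```

Everything except `name` defaults to `None`/empty, because real scrapes come back patchy — a schema that tolerates missing fields saves you cleanup later.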

Is Scraping Yelp Legal?

Short answer: publicly visible business data is generally fair game; scraping reviews, photos, or anything personally identifiable is much riskier. The factors that matter:

  • Yelp's Terms of Service explicitly prohibit automated scraping. Violating ToS isn't automatically illegal, but it can expose you to civil claims — hiQ Labs v. LinkedIn established that scraping publicly visible data likely doesn't violate the CFAA, yet hiQ ultimately lost on breach-of-contract grounds, and Yelp will still send cease-and-desist letters.
  • Public business listings (name, address, phone, hours) are factual data not protected by copyright. Scraping these for research, lead generation, or directory enrichment is on relatively safe ground.
  • Reviews and photos are user-generated content that Yelp licenses (with restrictions) from reviewers. Scraping and republishing them risks copyright issues plus DMCA exposure.
  • Personal data in reviews (reviewer names, locations, profile pics) is GDPR/CCPA personal data. Document a lawful basis and honor deletion requests.
  • Yelp Fusion API exists. If your use case fits the API's terms (limited to 5,000 requests/day on the free tier), it's the cleaner legal path.

The safest strategy: scrape search results pages (high-volume, low-personal-data) for business listings; use the Fusion API for review counts and ratings; and avoid scraping individual review text unless you've cleared it with a lawyer.

Tools You'll Need

  • requests — HTTP requests for the static-HTML pages.
  • BeautifulSoup4 — HTML parsing.
  • Selenium (or Playwright) — fallback for pages where Yelp serves JavaScript-rendered content or returns 403 to plain requests.
  • pandas — clean CSV/Excel export.
  • Residential proxies — non-negotiable. We'll use SpyderProxy's rotating endpoint.

Install everything:

pip install requests beautifulsoup4 selenium pandas

Step 1: Fetch a Yelp Search Page

Start with the basic search URL pattern: https://www.yelp.com/search?find_desc={query}&find_loc={location}. Wire up a fetch with a realistic User-Agent — Yelp serves 403 Forbidden to anything that smells like the default Python UA:

import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

def fetch(url, proxies=None):
    r = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
    if r.status_code == 200:
        return r.text
    print(f"[WARN] {url} -> HTTP {r.status_code}")
    return None

html = fetch("https://www.yelp.com/search?find_desc=coffee&find_loc=Brooklyn")

Three details that materially change your success rate: (1) the User-Agent must look like a current Chrome on Windows or macOS; (2) Accept-Language matches what real browsers send; (3) timeout prevents hung connections from blocking the whole crawl.
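One refinement worth making before you scale up: reuse a single requests.Session so TCP connections persist between requests, and let urllib3's Retry absorb transient server errors. A sketch (the retry settings are illustrative defaults, not tuned values):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}

def make_session():
    s = requests.Session()
    s.headers.update(HEADERS)
    # Retry transient 5xx responses automatically; leave 403/429 handling
    # to the proxy-rotation and delay logic covered later.
    retry = Retry(total=3, backoff_factor=1.5,
                  status_forcelist=[500, 502, 503, 504])
    s.mount("https://", HTTPAdapter(max_retries=retry))
    return s
```

Swap `session.get(...)` in for `requests.get(...)` in the fetch function and you get connection reuse and 5xx retries for free.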

Step 2: Parse Business Listings with BeautifulSoup

Yelp's search-result HTML uses CSS classes that change every few months. Don't hardcode raw class names — anchor instead on stable structural cues: business-card containers, the data-testid hooks Yelp ships for its own test tooling, and ARIA attributes:

from bs4 import BeautifulSoup

def parse_search_results(html):
    soup = BeautifulSoup(html, "html.parser")
    results = []

    # Business cards typically live in a search-list container.
    # Anchor on the link to the business page (always /biz/...)
    cards = soup.select('div[data-testid*="serp"] a[href^="/biz/"]')

    seen = set()
    for a in cards:
        href = a.get("href", "").split("?")[0]
        if href in seen:
            continue
        seen.add(href)

        # Business name is the link text (or an aria-label fallback)
        name = a.get_text(strip=True) or a.get("aria-label", "")

        # Walk up to the card container to find rating / review count
        card = a.find_parent(["div", "li"])
        rating = None
        review_count = None
        if card:
            rating_el = card.find(attrs={"aria-label": lambda v: v and "star rating" in v.lower()})
            if rating_el:
                rating = rating_el["aria-label"].split()[0]
            count_el = card.find(string=lambda s: s and "review" in s.lower())
            if count_el:
                review_count = "".join(c for c in count_el if c.isdigit())

        results.append({
            "name": name,
            "yelp_url": "https://www.yelp.com" + href,
            "rating": rating,
            "review_count": review_count,
        })

    return results

This selector strategy survives Yelp's class-name churn because it anchors on structural patterns (links to /biz/, data-testid attributes, ARIA labels) rather than randomized class hashes. When Yelp ships a redesign, expect to spend 30 minutes re-tuning — never assume your selectors are permanent.
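To see why the /biz/ anchor plus dedupe works, here's the same idea run against a throwaway HTML fragment — the markup below is invented for illustration, and real Yelp pages are far noisier:

```python
from bs4 import BeautifulSoup

SAMPLE = """
<div data-testid="serp-ia-card">
  <a href="/biz/example-coffee-brooklyn?osq=coffee">Example Coffee</a>
  <span aria-label="4.5 star rating"></span>
</div>
<div data-testid="serp-ia-card">
  <a href="/biz/example-coffee-brooklyn">Example Coffee</a>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")
seen, names = set(), []
for a in soup.select('a[href^="/biz/"]'):
    href = a["href"].split("?")[0]   # strip tracking params before deduping
    if href not in seen:
        seen.add(href)
        names.append(a.get_text(strip=True))

print(names)  # ['Example Coffee'] -- duplicate card collapsed by canonical href
```

The same business often appears in multiple cards (ads, map pins, organic results); splitting off the query string gives you a canonical href to dedupe on.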

Step 3: Drill Into the Business Detail Page

The detail page (https://www.yelp.com/biz/{slug}) carries the structured data you want — address, phone, hours, categories. The most reliable way to extract this is to look for the JSON-LD <script type="application/ld+json"> block Yelp embeds for SEO. It's much more stable than the visible HTML:

import json

def parse_business_detail(html):
    soup = BeautifulSoup(html, "html.parser")
    data = {}

    # Yelp's JSON-LD blob carries name, address, phone, geo, rating
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blob = json.loads(script.string or "{}")
        except json.JSONDecodeError:
            continue
        if isinstance(blob, dict) and blob.get("@type") in {"LocalBusiness", "Restaurant"}:
            data["name"] = blob.get("name")
            data["phone"] = blob.get("telephone")
            addr = blob.get("address", {}) or {}
            data["street"] = addr.get("streetAddress")
            data["city"] = addr.get("addressLocality")
            data["region"] = addr.get("addressRegion")
            data["postal_code"] = addr.get("postalCode")
            data["country"] = addr.get("addressCountry")
            agg = blob.get("aggregateRating", {}) or {}
            data["rating"] = agg.get("ratingValue")
            data["review_count"] = agg.get("reviewCount")
            break

    # Category links point at Yelp's /c/ category pages
    cats = [a.get_text(strip=True) for a in soup.select('a[href*="/c/"]')]
    if cats:
        data["categories"] = ", ".join(cats[:5])

    return data

JSON-LD is gold for any site that cares about SEO — the schema is standardized, the field names don't change with redesigns, and it's the same data Google reads for rich results.
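To make the extraction concrete, here's the pattern against a minimal JSON-LD blob (all values invented):

```python
import json
from bs4 import BeautifulSoup

PAGE = """
<script type="application/ld+json">
{"@type": "Restaurant", "name": "Example Coffee",
 "telephone": "+17185550123",
 "address": {"streetAddress": "123 Example St", "addressLocality": "Brooklyn"},
 "aggregateRating": {"ratingValue": 4.5, "reviewCount": 312}}
</script>
"""

soup = BeautifulSoup(PAGE, "html.parser")
# .string gives the raw text inside the script tag; json.loads does the rest
blob = json.loads(soup.find("script", type="application/ld+json").string)
print(blob["name"], blob["aggregateRating"]["ratingValue"])
```

Because schema.org field names are standardized, this snippet keeps working across redesigns as long as Yelp keeps shipping JSON-LD for SEO.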

Step 4: Handle 403 and CAPTCHA with Selenium

Yelp escalates fast. After 20–50 requests from the same IP — even with a clean User-Agent — you'll start seeing HTTP 403 or a CAPTCHA wall. The two-pronged fix is: (1) rotate IPs (next section), and (2) when 403s persist, fall back to a real browser via Selenium so the request looks fully human:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

def fetch_with_selenium(url):
    opts = Options()
    opts.add_argument("--headless=new")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--window-size=1920,1080")
    opts.add_argument("--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/124.0.0.0 Safari/537.36")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        time.sleep(8)  # let JS render and any anti-bot timer pass
        return driver.page_source
    finally:
        driver.quit()

For repeated runs, use undetected-chromedriver (drop-in replacement for the standard Selenium driver) or Playwright with stealth patches — both significantly reduce the headless-browser fingerprint signals that Yelp uses to detect automation.

Step 5: Rotate IPs with Residential Proxies (Required)

This is the step that makes Yelp scraping actually work at any meaningful scale. From a single IP, you'll get blocked within the first hundred requests. From a rotating residential pool, each request appears to come from a different real consumer in a real city — exactly what Yelp's user base looks like.

SpyderProxy's residential proxies expose a single rotating endpoint. Drop one config in:

PROXY_USER = "your-spyder-username"
PROXY_PASS = "your-spyder-password"
PROXY_HOST = "gate.spyderproxy.com"
PROXY_PORT = 7777

def proxies_dict():
    auth = f"{PROXY_USER}:{PROXY_PASS}"
    return {
        "http":  f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}",
        "https": f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}",
    }

# Then in your fetch:
html = fetch(url, proxies=proxies_dict())

For Yelp specifically, choose residential over datacenter — Yelp aggressively blocks known datacenter IP ranges. SpyderProxy's Premium Residential at $2.75/GB draws from a 130M+ IP pool with sub-0.3s latency. For most Yelp scraping, the Budget Residential tier at $1.75/GB is enough — Yelp HTML pages are 50–200 KB each, so a few dollars buys you tens of thousands of pages.
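The bandwidth math is worth sanity-checking against your own budget. A quick back-of-the-envelope, using the page sizes and prices above as assumptions:

```python
def pages_per_dollar(page_kb, price_per_gb):
    """How many pages one dollar of metered residential bandwidth buys."""
    gb_per_page = page_kb / (1024 * 1024)   # KB -> GB
    return int(1 / (gb_per_page * price_per_gb))

# Mid-range assumption: 125 KB per page at the $1.75/GB Budget tier
print(pages_per_dollar(125, 1.75))  # -> 4793 pages per dollar
```

At roughly 4,800 pages per dollar, a few dollars does indeed cover tens of thousands of pages — though compressed responses, retries, and Selenium fallbacks (which load images and scripts) will move the real number in both directions.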

If you want to hold the same IP across multiple page loads on a single business (e.g., scraping detail page → reviews tab → photos), use a sticky session by appending a session token to your username:

# Same IP for the duration of one business's pages
PROXY_USER = "your-username-session-yelp001"

Step 6: Tie It All Together with Politeness

Putting the pieces together: paginate search results, dedupe by business URL, hit each detail page with a delay, and write to CSV after every successful business so you don't lose progress on a crash:

import time
import urllib.parse

import pandas as pd

QUERY = "coffee"
LOCATION = "Brooklyn, NY"
PAGES = 5  # Yelp returns 10 results per search page

def scrape_yelp(query, location, pages):
    all_businesses = []
    for page in range(pages):
        start = page * 10
        # URL-encode query and location ("Brooklyn, NY" has a comma and space)
        url = (f"https://www.yelp.com/search"
               f"?find_desc={urllib.parse.quote_plus(query)}"
               f"&find_loc={urllib.parse.quote_plus(location)}"
               f"&start={start}")
        html = fetch(url, proxies=proxies_dict())
        if not html:
            html = fetch_with_selenium(url)  # fallback
        if not html:
            continue
        results = parse_search_results(html)
        for biz in results:
            time.sleep(2)  # polite delay between detail-page hits
            detail_html = fetch(biz["yelp_url"], proxies=proxies_dict())
            if detail_html:
                detail = parse_business_detail(detail_html)
                biz.update(detail)
            all_businesses.append(biz)
            # Write incrementally so a crash doesn't lose data
            pd.DataFrame(all_businesses).to_csv("yelp_results.csv", index=False)
        time.sleep(3)  # delay between search pages
    return all_businesses

scrape_yelp(QUERY, LOCATION, PAGES)
print("done.")

Anti-Bot Challenges Yelp Throws (and How to Handle Them)

  • HTTP 403 Forbidden — most common failure. Caused by suspicious User-Agent, datacenter IP, or aggressive request rate. Fix: real-browser User-Agent, residential proxies, 2–5 second delays.
  • CAPTCHA challenge pages — Yelp serves a "Press & hold" puzzle when behavior looks scripted. Fix: switch to Selenium with undetected-chromedriver, slow your rate, rotate IPs more aggressively. As a last resort, add a CAPTCHA-solving service like Capsolver or 2Captcha.
  • Dynamic CSS class hashes — Yelp randomizes class names every release. Fix: anchor selectors on data-testid, ARIA labels, structural relationships, and the JSON-LD blob (which doesn't churn).
  • Rate limiting (HTTP 429) — too-fast requests from one IP. Fix: rotate IPs every request, add jitter to delays (random 1–5 seconds, not a fixed value).
  • Honeypot links — invisible links designed to trap automated crawlers that follow every <a> tag. Fix: only follow visible links and check for display:none / visibility:hidden on parents.
  • Geo-restriction — Yelp routes some content based on visitor country. Fix: use SpyderProxy country-targeting (session-X-country-US or similar) to control your exit geo.
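The jitter advice above is only a couple of lines of code, and it's worth making a habit — fixed delays form a timing fingerprint of their own:

```python
import random
import time

def polite_sleep(lo=1.0, hi=5.0):
    """Sleep a random 1-5 s so request timing doesn't look scripted."""
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay

# Drop polite_sleep() between every fetch instead of time.sleep(2)
```

Returning the delay makes it easy to log how much of your wall-clock time goes to politeness versus actual fetching.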

Best Practices for Production Yelp Scraping

  • Rate limit aggressively. 1–2 seconds between requests minimum, more on detail pages. Yelp would rather you scrape slowly than not at all.
  • Use residential proxies, not datacenter. Yelp aggressively flags datacenter ASNs. Save the datacenter IPs for less-protected targets.
  • Persist results incrementally. Write to CSV/database after every business, not at the end. You will hit failures.
  • Structure-anchor your selectors. Anchor on data-testid, ARIA, and JSON-LD. Avoid raw CSS class names.
  • Build a robust retry layer. Wrap fetches in try/except with exponential backoff for 5xx errors and rate-limit 429s.
  • Respect robots.txt spirit even when scraping anyway. Don't hammer Yelp; don't try to crawl every page; only fetch what you actually need.
  • Cache aggressively. Don't re-scrape the same business twice in a week unless you're actively monitoring changes.
  • Use the Fusion API where it covers your use case. Free tier is 5,000 calls/day. Cleaner legal posture, no anti-bot fights.
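The retry layer from the list above can be a small wrapper rather than a framework. This sketch retries any failing callable with exponential backoff plus proportional jitter:

```python
import random
import time

def with_backoff(fn, attempts=4, base=2.0):
    """Call fn(); on failure wait base * 2**n (+ jitter) seconds, then retry."""
    for n in range(attempts):
        try:
            return fn()
        except Exception:
            if n == attempts - 1:
                raise  # out of attempts: surface the real error
            time.sleep(base * 2 ** n + random.uniform(0, base))

# Usage: html = with_backoff(lambda: fetch(url, proxies=proxies_dict()))
```

In production you'd narrow the `except` to network errors and specific HTTP statuses (retry 429/5xx, fail fast on 404), but the backoff shape is the important part.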

Common Mistakes That Will Get Your Scraper Blocked

  • Default Python User-Agent. Instant 403.
  • No delay between requests. 429s within minutes.
  • Datacenter proxies. Yelp blocks AWS/DigitalOcean/Hetzner IP ranges hard.
  • Hardcoded CSS class names. Selectors break on every Yelp redeploy.
  • One IP for everything. Reputation tanks fast; switch to a rotating residential pool.
  • Headless Chrome without stealth patches. Yelp detects standard headless fingerprints. Use undetected-chromedriver or Playwright stealth.
  • Scraping reviews and republishing them. Copyright + DMCA risk. Stick to factual business data unless you've cleared review usage with a lawyer.

The Bottom Line

Yelp scraping is doable in 2026, but it's not the "20 lines of Python" tutorial other guides promise. Yelp invests heavily in anti-bot tech because its data is its moat. The realistic stack: requests + BeautifulSoup as your fast path, Selenium with undetected-chromedriver as the fallback, residential proxies as the foundation, and disciplined rate limiting throughout.

The proxy choice is the biggest single lever. SpyderProxy Residential Proxies at $1.75/GB for the Budget tier or $2.75/GB for the full 130M+ Premium pool give you the IP rotation Yelp scraping requires, with sub-0.3s latency and sticky sessions up to 24 hours.

Frequently Asked Questions

Is scraping Yelp data legal?

Publicly visible business listings (name, address, phone, hours, categories, ratings) are factual data not protected by copyright and are generally safe to scrape for research, lead generation, and directory enrichment. Reviews and photos are user-generated content that Yelp licenses with restrictions — scraping and republishing them carries copyright and DMCA risk. Yelp's Terms of Service prohibit automated scraping; violating ToS isn't automatically illegal but can expose you to civil claims. Consult a lawyer for your specific use case.

Why does Yelp return 403 Forbidden?

Yelp serves HTTP 403 when a request looks suspicious. The three most common triggers are: (1) default Python requests User-Agent string, (2) request originating from a known datacenter IP, and (3) too-fast request rate from a single IP. Fix all three: set a current Chrome User-Agent, route through residential proxies, and add 2–5 second delays between requests.

What's the best proxy type for scraping Yelp?

Residential proxies, full stop. Yelp aggressively blocks known datacenter ASNs (AWS, DigitalOcean, Hetzner, OVH), so datacenter proxies will fail almost immediately on Yelp pages. SpyderProxy Residential at $1.75/GB delivers 120M+ real consumer IPs across 195+ countries with sub-0.3s latency, which is the right tool for Yelp scraping at any meaningful scale.

Does Yelp have an official API?

Yes. The Yelp Fusion API provides business search, business details, reviews (limited to 3 per business), and autocomplete on a free tier capped at 5,000 calls/day. For low-volume use cases that fit within those limits and don't need full review text, the Fusion API is the cleaner legal path. For higher volume, deeper data, or use cases the API doesn't cover, scraping is the alternative.

How do I extract business data without breaking on Yelp redesigns?

Anchor your selectors on structural patterns rather than CSS class names. Yelp randomizes class hashes on every release, but their data-testid attributes, ARIA labels, link patterns (/biz/, /c/), and embedded JSON-LD <script type="application/ld+json"> blocks remain stable. Parse the JSON-LD blob first — it gives you name, address, phone, geo coordinates, rating, and review count in standardized fields.

Can I scrape Yelp reviews?

Technically yes — they're publicly visible. Legally and ethically, it's the riskiest part of Yelp scraping. Reviews are user-generated content with copyright held by Yelp/the reviewer, contain personal data (reviewer names, sometimes locations), and Yelp will issue DMCA takedowns to anyone republishing review text. If you must scrape reviews, scrape aggregate data only (rating distributions, review counts, average length) and don't store or republish individual review text without a clear lawful basis and licensing review.

How fast can I scrape Yelp?

Without proxies, you'll get blocked within 50–100 requests from a single IP. With rotating residential proxies and 2-second delays, you can comfortably scrape 1,000–5,000 pages per hour without triggering CAPTCHAs. Beyond that you'll need more aggressive IP rotation, longer delays, or Selenium with stealth patches. There's no "official" rate limit for scrapers, but Yelp's anti-bot threshold scales with how human your traffic looks.

Should I use requests or Selenium for Yelp?

Start with requests + BeautifulSoup — it's 10–100× faster and uses a fraction of the proxy bandwidth. Fall back to Selenium (preferably with undetected-chromedriver) or Playwright when Yelp returns 403 or CAPTCHA pages despite a good User-Agent and proxies. The hybrid approach — requests as the fast path, Selenium as the fallback — is the standard production pattern for any non-trivial scraping target.

Ready to Scrape Yelp at Scale?

The fastest way to make a Yelp scraper actually work in production is residential IP rotation. SpyderProxy Residential from $1.75/GB gives you 120M+ rotating consumer IPs across 195+ countries — exactly what Yelp scraping requires.

Start at SpyderProxy.com — or join us on Discord and Telegram if you want help configuring your scraper.
