Yelp sits on one of the largest collections of small-business data on the open web — names, addresses, phone numbers, hours, categories, ratings, and tens of millions of reviews. For local SEO research, lead generation, market analysis, and competitive intelligence, that data is gold. The catch: Yelp knows it's gold, and protects it aggressively with 403 responses, CAPTCHA walls, and HTML class names that change just often enough to break naïve scrapers.
This guide walks through the full pipeline for scraping Yelp with Python in 2026: what data is worth pulling, what's legal versus risky, the requests/BeautifulSoup baseline, the Selenium fallback for when Yelp gets aggressive, and the residential proxy rotation that keeps you out of Yelp's block lists. Every code block is copy-paste-ready.
Yelp pages roughly break down into four categories of useful data:
Most legitimate use cases need #1, #2, and #4 — the structured business data Yelp surfaces in search results. Review text (#3) is where you should pause and check the legal section below.
Short answer: publicly visible business data is generally fair game; scraping reviews, photos, or anything personally identifiable is much riskier. The factors that matter:
The safest strategy: scrape search results pages (high-volume, low-personal-data) for business listings; use the Fusion API for review counts and ratings; and avoid scraping individual review text unless you've cleared it with a lawyer.
Install everything:
pip install requests beautifulsoup4 selenium pandas
Start with the basic search URL pattern: https://www.yelp.com/search?find_desc={query}&find_loc={location}. Wire up a fetch with a realistic User-Agent — Yelp serves 403 Forbidden to anything that smells like the default Python UA:
import requests

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}

def fetch(url, proxies=None):
    r = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
    if r.status_code == 200:
        return r.text
    print(f"[WARN] {url} -> HTTP {r.status_code}")
    return None

html = fetch("https://www.yelp.com/search?find_desc=coffee&find_loc=Brooklyn")
Three details that materially change your success rate: (1) the User-Agent must look like a current Chrome on Windows or macOS; (2) Accept-Language matches what real browsers send; (3) timeout prevents hung connections from blocking the whole crawl.
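One more detail worth a helper: `find_desc` and `find_loc` need URL-encoding before interpolation — a location like "Brooklyn, NY" contains a space and a comma. A small stdlib sketch (`build_search_url` is our own name):

```python
from urllib.parse import quote_plus

def build_search_url(query, location, start=0):
    """Build a Yelp search URL with properly encoded parameters."""
    return (
        "https://www.yelp.com/search"
        f"?find_desc={quote_plus(query)}"
        f"&find_loc={quote_plus(location)}"
        f"&start={start}"
    )

# build_search_url("coffee", "Brooklyn, NY")
# -> 'https://www.yelp.com/search?find_desc=coffee&find_loc=Brooklyn%2C+NY&start=0'
```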
Yelp's search-result HTML uses CSS classes that change every few months. Don't hardcode raw class names — instead, anchor on stable structural cues: business-card containers, /biz/ link patterns, ARIA labels, and the data-testid attributes Yelp ships for its own test tooling:
from bs4 import BeautifulSoup

def parse_search_results(html):
    soup = BeautifulSoup(html, "html.parser")
    results = []
    # Business cards typically live in a search-list container.
    # Anchor on the link to the business page (always /biz/...)
    cards = soup.select('div[data-testid*="serp"] a[href^="/biz/"]')
    seen = set()
    for a in cards:
        href = a.get("href", "").split("?")[0]
        if href in seen:
            continue
        seen.add(href)
        # Business name is the link text (or an aria-label fallback)
        name = a.get_text(strip=True) or a.get("aria-label", "")
        # Walk up to the card container to find rating / review count
        card = a.find_parent(["div", "li"])
        rating = None
        review_count = None
        if card:
            rating_el = card.find(attrs={"aria-label": lambda v: v and "star rating" in v.lower()})
            if rating_el:
                rating = rating_el["aria-label"].split()[0]
            count_el = card.find(string=lambda s: s and "review" in s.lower())
            if count_el:
                review_count = "".join(c for c in count_el if c.isdigit())
        results.append({
            "name": name,
            "yelp_url": "https://www.yelp.com" + href,
            "rating": rating,
            "review_count": review_count,
        })
    return results
This selector strategy survives Yelp's class-name churn because it anchors on structural patterns (links to /biz/, data-testid attributes, ARIA labels) rather than randomized class hashes. When Yelp ships a redesign, expect to spend 30 minutes re-tuning — never assume your selectors are permanent.
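If even the data-testid container disappears, the /biz/ link pattern alone still recovers business slugs. A last-resort, stdlib-only sketch — `extract_biz_links` is our own helper, and the sample HTML is fabricated for illustration:

```python
import re

BIZ_LINK = re.compile(r'href="(/biz/[^"?]+)')

def extract_biz_links(html):
    """Pull unique /biz/ paths straight from raw HTML, in page order."""
    seen, links = set(), []
    for path in BIZ_LINK.findall(html):
        if path not in seen:
            seen.add(path)
            links.append(path)
    return links

sample = ('<a href="/biz/joes-coffee-brooklyn?osq=coffee">Joe\'s</a>'
          '<a href="/biz/joes-coffee-brooklyn">again</a>')
# extract_biz_links(sample) -> ['/biz/joes-coffee-brooklyn']
```

It loses names and ratings, but a list of detail-page URLs is enough to keep a crawl alive while you re-tune the real parser.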
The detail page (https://www.yelp.com/biz/{slug}) carries the structured data you want — address, phone, hours, categories. The most reliable way to extract this is to look for the JSON-LD <script type="application/ld+json"> block Yelp embeds for SEO. It's much more stable than the visible HTML:
import json

def parse_business_detail(html):
    soup = BeautifulSoup(html, "html.parser")
    data = {}
    # Yelp's JSON-LD blob carries name, address, phone, geo, rating
    for script in soup.find_all("script", type="application/ld+json"):
        try:
            blob = json.loads(script.string or "{}")
        except json.JSONDecodeError:
            continue
        if isinstance(blob, dict) and blob.get("@type") in {"LocalBusiness", "Restaurant"}:
            data["name"] = blob.get("name")
            data["phone"] = blob.get("telephone")
            addr = blob.get("address", {}) or {}
            data["street"] = addr.get("streetAddress")
            data["city"] = addr.get("addressLocality")
            data["region"] = addr.get("addressRegion")
            data["postal_code"] = addr.get("postalCode")
            data["country"] = addr.get("addressCountry")
            agg = blob.get("aggregateRating", {}) or {}
            data["rating"] = agg.get("ratingValue")
            data["review_count"] = agg.get("reviewCount")
            break
    # Categories are typically links in a header span
    cats = [a.get_text(strip=True) for a in soup.select('a[href*="/c/"]')]
    if cats:
        data["categories"] = ", ".join(cats[:5])
    return data
JSON-LD is gold for any site that cares about SEO — the schema is standardized, the field names don't change with redesigns, and it's the same data Google reads for rich results.
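To see the approach in isolation, here's a stdlib-only sketch that pulls JSON-LD blocks with a regex and parses them. The sample HTML is fabricated, and real pages may put extra attributes on the script tag, which this simple pattern wouldn't match — BeautifulSoup's attribute matching (as above) is more robust:

```python
import json
import re

LDJSON = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)

def extract_ldjson(html):
    """Return every parseable JSON-LD object embedded in the page."""
    blobs = []
    for raw in LDJSON.findall(html):
        try:
            blobs.append(json.loads(raw))
        except json.JSONDecodeError:
            continue
    return blobs

sample = (
    '<script type="application/ld+json">'
    '{"@type": "Restaurant", "name": "Joe\'s Coffee", '
    '"telephone": "+17185550100", '
    '"aggregateRating": {"ratingValue": 4.5, "reviewCount": 312}}'
    '</script>'
)
biz = extract_ldjson(sample)[0]
# biz["name"] -> "Joe's Coffee"; biz["aggregateRating"]["reviewCount"] -> 312
```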
Yelp escalates fast. After 20–50 requests from the same IP — even with a clean User-Agent — you'll start seeing HTTP 403 or a CAPTCHA wall. The two-pronged fix is: (1) rotate IPs (next section), and (2) when 403s persist, fall back to a real browser via Selenium so the request looks fully human:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

def fetch_with_selenium(url):
    opts = Options()
    opts.add_argument("--headless=new")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--window-size=1920,1080")
    opts.add_argument(
        "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    )
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        time.sleep(8)  # let JS render and any anti-bot timer pass
        return driver.page_source
    finally:
        driver.quit()
For repeated runs, use undetected-chromedriver (drop-in replacement for the standard Selenium driver) or Playwright with stealth patches — both significantly reduce the headless-browser fingerprint signals that Yelp uses to detect automation.
This is the step that makes Yelp scraping actually work at any meaningful scale. From a single IP, you'll get blocked within the first hundred requests. From a rotating residential pool, each request appears to come from a different real consumer in a real city — exactly what Yelp's user base looks like.
SpyderProxy's residential proxies expose a single rotating endpoint. Drop one config in:
PROXY_USER = "your-spyder-username"
PROXY_PASS = "your-spyder-password"
PROXY_HOST = "gate.spyderproxy.com"
PROXY_PORT = 7777

def proxies_dict():
    auth = f"{PROXY_USER}:{PROXY_PASS}"
    return {
        "http": f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}",
        "https": f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}",
    }

# Then in your fetch:
html = fetch(url, proxies=proxies_dict())
For Yelp specifically, choose residential over datacenter — Yelp aggressively blocks known datacenter IP ranges. SpyderProxy's Premium Residential at $2.75/GB draws from a 130M+ IP pool with sub-0.3s latency. For most Yelp scraping, the Budget Residential tier at $1.75/GB is enough — Yelp HTML pages are 50–200 KB each, so a few dollars buys you tens of thousands of pages.
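The back-of-envelope math behind that claim, assuming ~150 KB per page:

```python
def pages_per_gb(avg_page_kb=150):
    """How many HTML pages one GB of proxy bandwidth covers."""
    return (1024 * 1024) // avg_page_kb

# ~6,990 pages per GB at 150 KB each; $10 at $1.75/GB is roughly
# 5.7 GB, or about 40,000 search/detail pages.
pages = pages_per_gb()
```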
If you want to hold the same IP across multiple page loads on a single business (e.g., scraping detail page → reviews tab → photos), use a sticky session by appending a session token to your username:
# Same IP for the duration of one business's pages
PROXY_USER = "your-username-session-yelp001"
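A small helper that builds the per-session proxy dict — the `-session-` username convention mirrors the pattern above, but check your provider's docs for the exact format:

```python
def sticky_proxies(base_user, password, session_id,
                   host="gate.spyderproxy.com", port=7777):
    """Proxy dict pinned to one exit IP via a session token in the username."""
    auth = f"{base_user}-session-{session_id}:{password}"
    endpoint = f"http://{auth}@{host}:{port}"
    return {"http": endpoint, "https": endpoint}

# One session per business keeps detail page, reviews tab, and photos
# on the same exit IP:
# proxies = sticky_proxies("your-username", "your-password", "yelp001")
```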
Putting the pieces together: paginate search results, dedupe by business URL, hit each detail page with a delay, and write to CSV after every successful business so you don't lose progress on a crash:
import time
from urllib.parse import quote_plus

import pandas as pd

QUERY = "coffee"
LOCATION = "Brooklyn, NY"
PAGES = 5  # 10 results per page

def scrape_yelp(query, location, pages):
    all_businesses = []
    for page in range(pages):
        start = page * 10
        # quote_plus encodes spaces/commas in the query and location
        url = (f"https://www.yelp.com/search?find_desc={quote_plus(query)}"
               f"&find_loc={quote_plus(location)}&start={start}")
        html = fetch(url, proxies=proxies_dict())
        if not html:
            html = fetch_with_selenium(url)  # fallback
        if not html:
            continue
        results = parse_search_results(html)
        for biz in results:
            time.sleep(2)  # polite delay between detail-page hits
            detail_html = fetch(biz["yelp_url"], proxies=proxies_dict())
            if detail_html:
                detail = parse_business_detail(detail_html)
                biz.update(detail)
            all_businesses.append(biz)
            # Write incrementally so a crash doesn't lose data
            pd.DataFrame(all_businesses).to_csv("yelp_results.csv", index=False)
        time.sleep(3)  # delay between search pages
    return all_businesses

scrape_yelp(QUERY, LOCATION, PAGES)
print("done.")
A few failure modes and fixes worth knowing before you scale up:

- Selectors break after a redesign: re-anchor on data-testid, ARIA labels, structural relationships, and the JSON-LD blob (which doesn't churn).
- Honeypot links: anti-bot systems sometimes plant an invisible <a> tag that only a crawler would follow. Fix: only follow visible links and check for display:none / visibility:hidden on parents.
- Wrong exit geography: use a country-targeted sticky session (session-X-country-US or similar) to control your exit geo.
- Class-name churn: anchor on data-testid, ARIA, and JSON-LD. Avoid raw CSS class names.
- Ethics: respect the robots.txt spirit even when scraping anyway. Don't hammer Yelp; don't try to crawl every page; only fetch what you actually need.

Yelp scraping is doable in 2026, but it's not the "20 lines of Python" tutorial that other guides promise. Yelp invests heavily in anti-bot tech because their data is their moat. The realistic stack is: requests + BeautifulSoup as your fast path, Selenium with undetected-chromedriver as the fallback, residential proxies as the foundation, and disciplined rate limiting throughout.
The proxy choice is the biggest single lever. SpyderProxy Residential Proxies at $1.75/GB for the Budget tier or $2.75/GB for the full 130M+ Premium pool give you the IP rotation Yelp scraping requires, with sub-0.3s latency and sticky sessions up to 24 hours.
Publicly visible business listings (name, address, phone, hours, categories, ratings) are factual data not protected by copyright and are generally safe to scrape for research, lead generation, and directory enrichment. Reviews and photos are user-generated content that Yelp licenses with restrictions — scraping and republishing them carries copyright and DMCA risk. Yelp's Terms of Service prohibit automated scraping; violating ToS isn't automatically illegal but can expose you to civil claims. Consult a lawyer for your specific use case.
Yelp serves HTTP 403 when a request looks suspicious. The three most common triggers are: (1) default Python requests User-Agent string, (2) request originating from a known datacenter IP, and (3) too-fast request rate from a single IP. Fix all three: set a current Chrome User-Agent, route through residential proxies, and add 2–5 second delays between requests.
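Those fixes combine naturally with a retry wrapper, so a transient 403 doesn't kill the run. A sketch with exponential backoff plus jitter — `fetch_with_retries` is our own name; it takes the fetch function as a parameter and expects it to return None on failure, as the fetch() helper earlier does:

```python
import random
import time

def fetch_with_retries(fetch_fn, url, max_attempts=4, base_delay=2.0):
    """Retry a fetch with exponential backoff plus jitter.

    fetch_fn(url) should return the HTML on success and None on a
    403/CAPTCHA response.
    """
    for attempt in range(max_attempts):
        html = fetch_fn(url)
        if html is not None:
            return html
        # 2s, 4s, 8s... plus jitter so retries don't land on a
        # fixed, detectable cadence
        delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
        time.sleep(delay)
    return None
```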
Residential proxies, full stop. Yelp aggressively blocks known datacenter ASNs (AWS, DigitalOcean, Hetzner, OVH), so datacenter proxies will fail almost immediately on Yelp pages. SpyderProxy Residential at $1.75/GB delivers 120M+ real consumer IPs across 195+ countries with sub-0.3s latency, which is the right tool for Yelp scraping at any meaningful scale.
Yes. The Yelp Fusion API provides business search, business details, reviews (limited to 3 per business), and autocomplete on a free tier capped at 5,000 calls/day. For low-volume use cases that fit within those limits and don't need full review text, the Fusion API is the cleaner legal path. For higher volume, deeper data, or use cases the API doesn't cover, scraping is the alternative.
Anchor your selectors on structural patterns rather than CSS class names. Yelp randomizes class hashes on every release, but their data-testid attributes, ARIA labels, link patterns (/biz/, /c/), and embedded JSON-LD <script type="application/ld+json"> blocks remain stable. Parse the JSON-LD blob first — it gives you name, address, phone, geo coordinates, rating, and review count in standardized fields.
Technically yes — they're publicly visible. Legally and ethically, it's the riskiest part of Yelp scraping. Reviews are user-generated content with copyright held by Yelp/the reviewer, contain personal data (reviewer names, sometimes locations), and Yelp will issue DMCA takedowns to anyone republishing review text. If you must scrape reviews, scrape aggregate data only (rating distributions, review counts, average length) and don't store or republish individual review text without a clear lawful basis and licensing review.
Without proxies, you'll get blocked within 50–100 requests from a single IP. With rotating residential proxies and 2-second delays, you can comfortably scrape 1,000–5,000 pages per hour without triggering CAPTCHAs. Beyond that you'll need more aggressive IP rotation, longer delays, or Selenium with stealth patches. There's no "official" rate limit for scrapers, but Yelp's anti-bot threshold scales with how human your traffic looks.
Start with requests + BeautifulSoup — it's 10–100× faster and uses a fraction of the proxy bandwidth. Fall back to Selenium (preferably with undetected-chromedriver) or Playwright when Yelp returns 403 or CAPTCHA pages despite a good User-Agent and proxies. The hybrid approach — requests as the fast path, Selenium as the fallback — is the standard production pattern for any non-trivial scraping target.
The fastest way to make a Yelp scraper actually work in production is residential IP rotation. SpyderProxy Residential from $1.75/GB gives you 120M+ rotating consumer IPs across 195+ countries — exactly what Yelp scraping requires.
Start at SpyderProxy.com — or join us on Discord and Telegram if you want help configuring your scraper.