spyderproxy

How to Scrape Yellow Pages in 2026

Alex R. | Published May 1, 2026

Quick verdict: Yellow Pages can be scraped with Python (requests + BeautifulSoup) through a rotating residential proxy at roughly 5,000 listings per hour. The site uses Akamai bot detection that flags datacenter IPs within minutes — residential is the only viable proxy type. Each listing yields name, phone, address, ratings, hours, and category. Cost is $1–$3 per 10,000 listings on a $1.75/GB residential plan.

This guide covers the full stack: search pagination, listing extraction, anti-bot tactics, residential proxy setup, CSV export, and the legal considerations that matter when collecting business contact data for outreach or research.

What's Available on Yellow Pages

Field                  Where                                       Reliability
Business name          Search results + detail page                High
Phone                  Search results (link href + visible text)   High
Street address         Search + detail                             High (~95% of listings)
Category tags          Detail page                                 Medium (multi-category common)
Rating + review count  Search results                              Medium (~60% of listings)
Business hours         Detail page                                 Medium (~70%)
Website URL            Detail page                                 Medium (~50%)
Years in business      Detail page                                 Low (~30%)

Why Yellow Pages Blocks Scrapers

Yellow Pages uses Akamai Bot Manager — a commercial anti-bot product that scores every request on six signals:

  1. IP reputation. Datacenter ranges are pre-flagged. Residential ranges pass.
  2. TLS fingerprint. Python requests has a distinctive JA3 fingerprint that doesn't match real browsers — Akamai recognizes it instantly.
  3. Request rate. More than ~3 requests per second from the same IP triggers throttling.
  4. Header consistency. Missing or inconsistent User-Agent, Accept-Language, Accept-Encoding triggers bot scoring.
  5. Cookie behavior. Real browsers accept and replay cookies; bots that don't are flagged.
  6. Session continuity. A long run of HTML requests without typical browser patterns (image fetches, JavaScript execution) is suspicious.

Residential proxies handle signal 1, curl_cffi (which mimics real browser TLS fingerprints) handles signal 2, and per-request IP rotation keeps any single IP under the rate threshold in signal 3. Realistic headers and a cookie jar handle 4 and 5.

Step-by-Step Python Tutorial

1. Install dependencies

pip install requests beautifulsoup4 lxml pandas

2. Configure a residential proxy

import requests, time, random
from bs4 import BeautifulSoup
import pandas as pd

PROXY = 'http://USER:[email protected]:8080'
proxies = {'http': PROXY, 'https': PROXY}

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}
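Signal 5 (cookie behavior) is easiest to satisfy with a `requests.Session`, which stores and replays cookies automatically. A minimal sketch reusing the placeholder proxy and headers above:

```python
import requests

PROXY = 'http://USER:[email protected]:8080'  # placeholder credentials

session = requests.Session()
session.proxies.update({'http': PROXY, 'https': PROXY})
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
})

# The first response sets Akamai cookies (e.g. _abck); the session replays
# them on every later request, the way a real browser would.
```

Using `session.get(...)` instead of `requests.get(...)` in the functions below is a drop-in change.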

3. Search-page scraper

from urllib.parse import quote_plus

def scrape_search_page(category, location, page=1):
    # URL-encode the terms so multi-word searches ('pest control') don't break the query string
    url = (f'https://www.yellowpages.com/search?search_terms={quote_plus(category)}'
           f'&geo_location_terms={quote_plus(location)}&page={page}')
    r = requests.get(url, proxies=proxies, headers=HEADERS, timeout=30)
    if r.status_code != 200:
        return []

    soup = BeautifulSoup(r.text, 'lxml')
    def text_of(card, css):
        # First match's stripped text, or None when the element is missing
        el = card.select_one(css)
        return el.get_text(strip=True) if el else None

    listings = []
    for card in soup.select('.result'):
        rating_el = card.select_one('.result-rating')
        classes = rating_el.get('class', []) if rating_el else []
        link = card.select_one('a.business-name')
        listings.append({
            'name': text_of(card, '.business-name'),
            'phone': text_of(card, '.phones'),
            'address': text_of(card, '.street-address'),
            # the rating is encoded as the second CSS class, e.g. 'result-rating four'
            'rating': classes[1] if len(classes) > 1 else None,
            'url': 'https://www.yellowpages.com' + link['href'] if link else None,
        })
    return listings

4. Pagination loop with rate limiting

all_listings = []
for page in range(1, 21):  # 20 pages = ~600 listings
    rows = scrape_search_page('roofing', 'austin-tx', page=page)
    if not rows:
        break
    all_listings.extend(rows)
    time.sleep(random.uniform(0.5, 1.2))  # 0.5–1.2 sec jitter

# Dedup by URL
seen = set()
unique = []
for row in all_listings:
    if row['url'] and row['url'] not in seen:
        seen.add(row['url'])
        unique.append(row)

pd.DataFrame(unique).to_csv('yellow_pages_roofing_austin.csv', index=False)
print(f'Saved {len(unique)} unique listings')
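The search cards don't carry every field from the table above; hours, website, and categories live on the detail page at each listing's URL. A sketch of a detail-page parser follows, but note the CSS selectors here are illustrative guesses, not verified Yellow Pages markup, so inspect a live detail page and adjust:

```python
from bs4 import BeautifulSoup

def parse_detail_page(html: str) -> dict:
    # Selectors below are assumptions for illustration -- verify against real markup
    soup = BeautifulSoup(html, 'html.parser')

    def text_of(css):
        el = soup.select_one(css)
        return el.get_text(strip=True) if el else None

    website = soup.select_one('a.website-link')
    return {
        'categories': [a.get_text(strip=True) for a in soup.select('.categories a')],
        'hours': text_of('.open-hours'),
        'website': website.get('href') if website else None,
        'years_in_business': text_of('.years-in-business .count'),
    }

# Synthetic HTML using the same (assumed) selectors, to exercise the parser:
sample = '''
<div class="categories"><a>Roofing</a><a>Gutters</a></div>
<div class="open-hours">Mon-Fri 8am-5pm</div>
<a class="website-link" href="https://example.com"></a>
'''
```

Fetch each `row['url']` from the deduplicated list through the same proxy and headers, with the same random delays, and merge the extra fields into the CSV.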

Why Rotating Residential Beats Static

For Yellow Pages specifically, IP rotation is more valuable than IP stability because the site doesn't require login. Each new IP gets a fresh rate-limit budget. Static residential ($3.90/day flat) is wasted here — you'd hit the per-IP limit in 50 requests and have to wait for it to reset.

Premium rotating residential at $2.75/GB gives you a fresh IP per request automatically. At ~25 KB per Yellow Pages listing scraped, that's 40,000 listings per GB — about $0.07 per 1,000 listings.
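The bandwidth math above can be sanity-checked in a few lines (the ~25 KB per listing and $2.75/GB figures are the article's assumptions, not measured values):

```python
# Back-of-envelope cost model for rotating residential bandwidth
KB_PER_LISTING = 25      # assumed average page weight per listing
PRICE_PER_GB = 2.75      # assumed plan price in dollars

def listings_per_gb(kb_per_listing: int = KB_PER_LISTING) -> int:
    """How many listings fit in one GB of proxy bandwidth."""
    return (1024 * 1024) // kb_per_listing

def cost_per_1000(price_per_gb: float = PRICE_PER_GB) -> float:
    """Proxy cost in dollars per 1,000 listings."""
    return round(price_per_gb / listings_per_gb() * 1000, 3)

print(listings_per_gb())   # 41943 -- roughly the 40,000/GB quoted above
print(cost_per_1000())     # 0.066 -- about $0.07 per 1,000 listings
```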

Verify the proxy is rotating with our IP lookup tool: hit it through the proxy twice and confirm you see different exit IPs.
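A quick rotation check, sketched against httpbin.org/ip as the echo endpoint (any service that reports your exit IP works, including the lookup tool mentioned above):

```python
import requests

PROXY = 'http://USER:[email protected]:8080'  # placeholder credentials
proxies = {'http': PROXY, 'https': PROXY}

def exit_ip() -> str:
    """Ask an IP-echo service which address the request arrived from."""
    return requests.get('https://httpbin.org/ip',
                        proxies=proxies, timeout=15).json()['origin']

def is_rotating(first: str, second: str) -> bool:
    """Per-request rotation means consecutive calls exit from different IPs."""
    return first != second

# Uncomment once real credentials are in place:
# print(is_rotating(exit_ip(), exit_ip()))  # expect True on a rotating plan
```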

Anti-Bot Tactics That Actually Work

  • Use curl_cffi instead of requests for TLS fingerprint impersonation: pip install curl-cffi, then from curl_cffi import requests and requests.get(url, impersonate='chrome120'). Real Chrome JA3, no Akamai flag.
  • Random User-Agent rotation from a list of 20+ real browser UAs. Don't use the same UA for 1,000 requests.
  • Persist cookies across requests in a session: session = requests.Session().
  • Random delays. 500–1,200 ms between requests with jitter — not a fixed sleep.
  • Respect 429 / 503. Stop and back off for 60+ seconds. Hammering through a rate limit makes the IP burn faster. See our HTTP 429 guide.
  • Detect Akamai challenges. If response contains _abck cookie or HTML title "Access Denied", switch IP and retry. Don't follow with the same identity.
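The last two bullets combine naturally into a small fetch wrapper. A sketch, assuming curl_cffi is installed and a rotating proxy is in use (so each retry naturally exits from a fresh IP); the block-detection heuristics are deliberate simplifications:

```python
import random
import time

PROXY = 'http://USER:[email protected]:8080'  # placeholder credentials
proxies = {'http': PROXY, 'https': PROXY}

def looks_blocked(status: int, body: str) -> bool:
    """Rate-limit statuses or an Akamai challenge page: back off, switch identity."""
    return status in (403, 429, 503) or 'Access Denied' in body

def backoff_delay(attempt: int, base: float = 60.0) -> float:
    """Exponential backoff with jitter, starting at the 60-second floor above."""
    return base * (2 ** attempt) * random.uniform(0.8, 1.2)

def fetch_with_retry(url: str, max_attempts: int = 3):
    from curl_cffi import requests as creq  # pip install curl-cffi
    for attempt in range(max_attempts):
        r = creq.get(url, impersonate='chrome120', proxies=proxies, timeout=30)
        if not looks_blocked(r.status_code, r.text):
            return r
        time.sleep(backoff_delay(attempt))  # next attempt exits from a new IP
    return None
```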

Is Scraping Yellow Pages Legal?

The leading US case is hiQ Labs v. LinkedIn: the Ninth Circuit held that scraping publicly accessible data without authentication is generally not a CFAA violation (the parties later settled, but the CFAA holding stands as circuit precedent). Yellow Pages business listings (name, phone, address, hours) are public-facing business information and broadly subject to the same logic.

Where things change:

  • EU GDPR. Even if a phone number is on a public website, a sole-proprietor business owner is a "natural person" under GDPR. If you're emailing or calling EU contacts, you need a lawful basis. GDPR Article 6 covers the legal grounds.
  • US TCPA. Auto-dialing scraped phone numbers without consent is a federal violation regardless of how you got the numbers.
  • Yellow Pages ToS. Forbids automated access. Legally enforceable as a breach of contract for users who agreed (very few scrapers click "I agree"), but creates IP-blocking grounds and lawsuit risk for high-volume commercial scrapers.

The safe pattern: scrape public data, comply with relevant privacy laws when contacting people, don't resell the raw scrape as your own database.