Quick verdict: Yellow Pages can be scraped with Python (requests + BeautifulSoup) through a rotating residential proxy at roughly 5,000 listings per hour. The site uses Akamai bot detection that flags datacenter IPs within minutes, so residential is the only viable proxy type. Each listing yields name, phone, address, ratings, hours, and category. On a $2.75/GB rotating residential plan, bandwidth works out to roughly $0.70 per 10,000 listings.
This guide covers the full stack: search pagination, listing extraction, anti-bot tactics, residential proxy setup, CSV export, and the legal considerations that matter when collecting business contact data for outreach or research.
| Field | Where | Reliability |
|---|---|---|
| Business name | Search results + detail page | High |
| Phone | Search results (link href + visible text) | High |
| Street address | Search + detail | High (~95% of listings) |
| Category tags | Detail page | Medium (multi-category common) |
| Rating + review count | Search results | Medium (~60% of listings) |
| Business hours | Detail page | Medium (~70%) |
| Website URL | Detail page | Medium (~50%) |
| Years in business | Detail page | Low (~30%) |
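Because reliability varies by field, it pays to normalize what you extract before loading it into a CRM or spreadsheet. A minimal sketch for the phone field (the function name and the 10-digit output format are my choices, not anything Yellow Pages specifies):

```python
import re

def normalize_phone(raw):
    """Collapse a scraped US phone string to bare digits,
    e.g. '(512) 555-0147' -> '5125550147'. Returns None for junk."""
    if not raw:
        return None
    digits = re.sub(r'\D', '', raw)   # drop everything that isn't a digit
    # Strip a leading US country code so every number is 10 digits
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    return digits if len(digits) == 10 else None

print(normalize_phone('(512) 555-0147'))   # -> 5125550147
print(normalize_phone('1-512-555-0147'))   # -> 5125550147
print(normalize_phone('Call us!'))         # -> None
```

Deduplicating on normalized phone (in addition to URL) also catches the same business listed under multiple categories.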
Yellow Pages uses Akamai Bot Manager, a commercial anti-bot product that scores every request on several signals: IP reputation, TLS fingerprint, header realism, and cookie continuity among them.

`requests` has a distinctive JA3 fingerprint that doesn't match real browsers; Akamai recognizes it instantly. Residential proxies plus curl_cffi (which mimics real browser TLS fingerprints) cover the IP and TLS signals. Realistic headers and a cookie jar cover the header and cookie checks.
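The header-and-cookie side can be sketched with a persistent `requests.Session` (the header values mirror the ones used later in this guide; treat the exact set as an assumption, not Akamai's checklist):

```python
import requests

# A Session persists cookies (including Akamai's) across requests,
# so consecutive hits look like one browser rather than a cookieless bot.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
})

# Every session.get(url) now sends these headers and replays stored cookies.
```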
```shell
pip install requests beautifulsoup4 lxml pandas
```
```python
import requests, time, random
from bs4 import BeautifulSoup
import pandas as pd

# Rotating residential gateway (replace USER/PASS with your credentials)
PROXY = 'http://USER:[email protected]:8080'
proxies = {'http': PROXY, 'https': PROXY}

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}

def scrape_search_page(category, location, page=1):
    url = (f'https://www.yellowpages.com/search'
           f'?search_terms={category}&geo_location_terms={location}&page={page}')
    r = requests.get(url, proxies=proxies, headers=HEADERS, timeout=30)
    if r.status_code != 200:
        return []
    soup = BeautifulSoup(r.text, 'lxml')
    listings = []
    for card in soup.select('.result'):
        name = card.select_one('.business-name')
        phone = card.select_one('.phones')
        addr = card.select_one('.street-address')
        # Rating is encoded as a CSS class, e.g. <div class="result-rating four half">
        rating_el = card.select_one('.result-rating')
        classes = rating_el.get('class', []) if rating_el else []
        rating = classes[1] if len(classes) > 1 else None
        link = card.select_one('a.business-name')
        listings.append({
            'name': name.get_text(strip=True) if name else None,
            'phone': phone.get_text(strip=True) if phone else None,
            'address': addr.get_text(strip=True) if addr else None,
            'rating': rating,
            'url': 'https://www.yellowpages.com' + link['href'] if link else None,
        })
    return listings

all_listings = []
for page in range(1, 21):   # 20 pages = ~600 listings (~30 per page)
    rows = scrape_search_page('roofing', 'austin-tx', page=page)
    if not rows:            # empty page or non-200 means results ran out
        break
    all_listings.extend(rows)
    time.sleep(random.uniform(0.5, 1.2))  # 0.5–1.2 sec jitter between pages

# Dedup by URL while preserving order
seen = set()
unique = [r for r in all_listings
          if r['url'] and not (r['url'] in seen or seen.add(r['url']))]

pd.DataFrame(unique).to_csv('yellow_pages_roofing_austin.csv', index=False)
print(f'Saved {len(unique)} unique listings')
```
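The script above only touches search results. Business hours and website URLs live on the detail page, so a second pass over each listing's `url` is needed. A hedged sketch of a detail-page parser; the selectors here are illustrative guesses and must be verified against live Yellow Pages markup before use:

```python
from bs4 import BeautifulSoup

def parse_detail(html):
    """Pull website URL and opening hours from detail-page HTML.
    Selector names ('a.website-link', '.open-hours') are assumptions."""
    soup = BeautifulSoup(html, 'html.parser')
    site = soup.select_one('a.website-link')
    hours = [row.get_text(' ', strip=True) for row in soup.select('.open-hours tr')]
    return {
        'website': site['href'] if site and site.has_attr('href') else None,
        'hours': hours or None,
    }

# Synthetic HTML shaped like the assumed markup, just to show the output format
sample = '''
<a class="website-link" href="https://example-roofer.com">Website</a>
<table class="open-hours"><tr><td>Mon</td><td>8am-5pm</td></tr></table>
'''
print(parse_detail(sample))
```

Taking the parser as a separate function that accepts raw HTML (rather than fetching inside it) makes the selectors testable offline against saved pages.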
For Yellow Pages specifically, IP rotation is more valuable than IP stability because the site doesn't require login. Each new IP gets a fresh rate-limit budget. Static residential ($3.90/day flat) is wasted here — you'd hit the per-IP limit in 50 requests and have to wait for it to reset.
Premium rotating residential at $2.75/GB gives you a fresh IP per request automatically. At ~25 KB per Yellow Pages listing scraped, that's 40,000 listings per GB — about $0.07 per 1,000 listings.
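The bandwidth math is easy to sanity-check (plan price and per-listing payload from above; your actual payload size will vary):

```python
# Cost model: $2.75/GB plan, ~25 KB transferred per listing
GB = 1_000_000_000            # decimal GB, as bandwidth billing typically uses
price_per_gb = 2.75
bytes_per_listing = 25_000

listings_per_gb = GB // bytes_per_listing
cost_per_1000 = price_per_gb / listings_per_gb * 1000

print(listings_per_gb)          # -> 40000
print(round(cost_per_1000, 3))  # -> 0.069 (about $0.07 per 1,000 listings)
```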
Verify the proxy is rotating with our IP lookup tool: hit it through the proxy twice and confirm you see different exit IPs.
Three tactical upgrades when Akamai starts blocking:

- **Impersonate a real browser TLS stack.** `pip install curl-cffi`, then `response = curl_cffi.requests.get(url, impersonate='chrome120')`. Real Chrome JA3, no Akamai flag.
- **Keep a cookie jar.** Use `session = requests.Session()` so cookies persist across requests instead of every hit looking like a fresh, cookieless client.
- **Detect blocks and rotate.** Watch for Akamai's `_abck` cookie or an HTML title of "Access Denied"; on a block, switch IP and retry. Don't follow up with the same identity.

The leading US case is hiQ Labs v. LinkedIn: scraping publicly accessible data without authentication is generally not a CFAA violation. Yellow Pages business listings (name, phone, address, hours) are public-facing business information and broadly subject to the same logic.
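The switch-IP-and-retry step can be sketched as a small wrapper. Here `fetch` is any callable returning `(status, html)`, and with a rotating gateway each new attempt already exits from a fresh IP, so a retry is an identity switch. Function names and the block heuristic are mine, not Akamai's documented behavior:

```python
def looks_blocked(status, html):
    """Heuristic Akamai block check: 403s and 'Access Denied' interstitials."""
    return status == 403 or '<title>Access Denied</title>' in html

def fetch_with_retries(fetch, url, max_attempts=3):
    """Call fetch(url) until a page comes back unblocked, or give up.
    With a rotating residential gateway each attempt uses a fresh exit IP;
    don't reuse cookies between attempts."""
    for _ in range(max_attempts):
        status, html = fetch(url)
        if not looks_blocked(status, html):
            return html
    return None

# Stub fetcher: blocked twice, then clean (simulates rotation finding a good IP)
responses = iter([
    (403, ''),
    (200, '<title>Access Denied</title>'),
    (200, '<html>ok</html>'),
])
print(fetch_with_retries(lambda url: next(responses), 'https://www.yellowpages.com/search'))
# -> <html>ok</html>
```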
Where things change: contacting the people behind the data (privacy and anti-spam laws apply) and republishing the scrape as a competing database.
The safe pattern: scrape public data, comply with relevant privacy laws when contacting people, don't resell the raw scrape as your own database.