Quick verdict: Yellow Pages can be scraped with Python (requests + BeautifulSoup) through a rotating residential proxy at roughly 5,000 listings per hour. The site uses Akamai bot detection that flags datacenter IPs within minutes, so residential is the only viable proxy type. Each listing yields name, phone, address, ratings, hours, and category. On a $2.75/GB rotating residential plan, bandwidth works out to roughly $0.70 per 10,000 listings.
This guide covers the full stack: search pagination, listing extraction, anti-bot tactics, residential proxy setup, CSV export, and the legal considerations that matter when collecting business contact data for outreach or research.
| Field | Where | Reliability |
|---|---|---|
| Business name | Search results + detail page | High |
| Phone | Search results (link href + visible text) | High |
| Street address | Search + detail | High (~95% of listings) |
| Category tags | Detail page | Medium (multi-category common) |
| Rating + review count | Search results | Medium (~60% of listings) |
| Business hours | Detail page | Medium (~70%) |
| Website URL | Detail page | Medium (~50%) |
| Years in business | Detail page | Low (~30%) |
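Because reliability varies by field, it pays to normalize what you extract before loading it into a CRM or spreadsheet. A minimal sketch for the phone field (the function name and the 10-digit output format are my choices, not anything Yellow Pages specifies):

```python
import re

def normalize_phone(raw):
    """Collapse a scraped US phone string to bare digits,
    e.g. '(512) 555-0147' -> '5125550147'. Returns None for junk."""
    if not raw:
        return None
    digits = re.sub(r'\D', '', raw)   # drop everything that isn't a digit
    # Strip a leading US country code so every number is 10 digits
    if len(digits) == 11 and digits.startswith('1'):
        digits = digits[1:]
    return digits if len(digits) == 10 else None

print(normalize_phone('(512) 555-0147'))   # -> 5125550147
print(normalize_phone('1-512-555-0147'))   # -> 5125550147
print(normalize_phone('Call us!'))         # -> None
```

Deduplicating on normalized phone (in addition to URL) also catches the same business listed under multiple categories.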
Yellow Pages uses Akamai Bot Manager, a commercial anti-bot product that scores every request on several signals: IP reputation, TLS fingerprint, header realism, and cookie continuity among them.

`requests` has a distinctive JA3 fingerprint that doesn't match real browsers; Akamai recognizes it instantly. Residential proxies plus curl_cffi (which mimics real browser TLS fingerprints) cover the IP and TLS signals. Realistic headers and a cookie jar cover the header and cookie checks.
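The header-and-cookie side can be sketched with a persistent `requests.Session` (the header values mirror the ones used later in this guide; treat the exact set as an assumption, not Akamai's checklist):

```python
import requests

# A Session persists cookies (including Akamai's) across requests,
# so consecutive hits look like one browser rather than a cookieless bot.
session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
})

# Every session.get(url) now sends these headers and replays stored cookies.
```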
```shell
pip install requests beautifulsoup4 lxml pandas
```
```python
import requests, time, random
from bs4 import BeautifulSoup
import pandas as pd

# Rotating residential gateway (replace USER/PASS with your credentials)
PROXY = 'http://USER:[email protected]:8080'
proxies = {'http': PROXY, 'https': PROXY}

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/130.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
    'Accept-Encoding': 'gzip, deflate, br',
}

def scrape_search_page(category, location, page=1):
    url = (f'https://www.yellowpages.com/search'
           f'?search_terms={category}&geo_location_terms={location}&page={page}')
    r = requests.get(url, proxies=proxies, headers=HEADERS, timeout=30)
    if r.status_code != 200:
        return []
    soup = BeautifulSoup(r.text, 'lxml')
    listings = []
    for card in soup.select('.result'):
        name = card.select_one('.business-name')
        phone = card.select_one('.phones')
        addr = card.select_one('.street-address')
        # Rating is encoded as a CSS class, e.g. <div class="result-rating four half">
        rating_el = card.select_one('.result-rating')
        classes = rating_el.get('class', []) if rating_el else []
        rating = classes[1] if len(classes) > 1 else None
        link = card.select_one('a.business-name')
        listings.append({
            'name': name.get_text(strip=True) if name else None,
            'phone': phone.get_text(strip=True) if phone else None,
            'address': addr.get_text(strip=True) if addr else None,
            'rating': rating,
            'url': 'https://www.yellowpages.com' + link['href'] if link else None,
        })
    return listings

all_listings = []
for page in range(1, 21):   # 20 pages = ~600 listings (~30 per page)
    rows = scrape_search_page('roofing', 'austin-tx', page=page)
    if not rows:            # empty page or non-200 means results ran out
        break
    all_listings.extend(rows)
    time.sleep(random.uniform(0.5, 1.2))  # 0.5–1.2 sec jitter between pages

# Dedup by URL while preserving order
seen = set()
unique = [r for r in all_listings
          if r['url'] and not (r['url'] in seen or seen.add(r['url']))]

pd.DataFrame(unique).to_csv('yellow_pages_roofing_austin.csv', index=False)
print(f'Saved {len(unique)} unique listings')
```
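The script above only touches search results. Business hours and website URLs live on the detail page, so a second pass over each listing's `url` is needed. A hedged sketch of a detail-page parser; the selectors here are illustrative guesses and must be verified against live Yellow Pages markup before use:

```python
from bs4 import BeautifulSoup

def parse_detail(html):
    """Pull website URL and opening hours from detail-page HTML.
    Selector names ('a.website-link', '.open-hours') are assumptions."""
    soup = BeautifulSoup(html, 'html.parser')
    site = soup.select_one('a.website-link')
    hours = [row.get_text(' ', strip=True) for row in soup.select('.open-hours tr')]
    return {
        'website': site['href'] if site and site.has_attr('href') else None,
        'hours': hours or None,
    }

# Synthetic HTML shaped like the assumed markup, just to show the output format
sample = '''
<a class="website-link" href="https://example-roofer.com">Website</a>
<table class="open-hours"><tr><td>Mon</td><td>8am-5pm</td></tr></table>
'''
print(parse_detail(sample))
```

Taking the parser as a separate function that accepts raw HTML (rather than fetching inside it) makes the selectors testable offline against saved pages.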
For Yellow Pages specifically, IP rotation is more valuable than IP stability because the site doesn't require login. Each new IP gets a fresh rate-limit budget. Static residential ($3.90/day flat) is wasted here — you'd hit the per-IP limit in 50 requests and have to wait for it to reset.
Premium rotating residential at $2.75/GB gives you a fresh IP per request automatically. At ~25 KB per Yellow Pages listing scraped, that's 40,000 listings per GB — about $0.07 per 1,000 listings.
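The bandwidth math is easy to sanity-check (plan price and per-listing payload from above; your actual payload size will vary):

```python
# Cost model: $2.75/GB plan, ~25 KB transferred per listing
GB = 1_000_000_000            # decimal GB, as bandwidth billing typically uses
price_per_gb = 2.75
bytes_per_listing = 25_000

listings_per_gb = GB // bytes_per_listing
cost_per_1000 = price_per_gb / listings_per_gb * 1000

print(listings_per_gb)          # -> 40000
print(round(cost_per_1000, 3))  # -> 0.069 (about $0.07 per 1,000 listings)
```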
Verify the proxy is rotating with our IP lookup tool: hit it through the proxy twice and confirm you see different exit IPs.
Three tactical upgrades when Akamai starts blocking:

- **Impersonate a real browser TLS stack.** `pip install curl-cffi`, then `response = curl_cffi.requests.get(url, impersonate='chrome120')`. Real Chrome JA3, no Akamai flag.
- **Keep a cookie jar.** Use `session = requests.Session()` so cookies persist across requests instead of every hit looking like a fresh, cookieless client.
- **Detect blocks and rotate.** Watch for Akamai's `_abck` cookie or an HTML title of "Access Denied"; on a block, switch IP and retry. Don't follow up with the same identity.

The leading US case is hiQ Labs v. LinkedIn: scraping publicly accessible data without authentication is generally not a CFAA violation. Yellow Pages business listings (name, phone, address, hours) are public-facing business information and broadly subject to the same logic.
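The switch-IP-and-retry step can be sketched as a small wrapper. Here `fetch` is any callable returning `(status, html)`, and with a rotating gateway each new attempt already exits from a fresh IP, so a retry is an identity switch. Function names and the block heuristic are mine, not Akamai's documented behavior:

```python
def looks_blocked(status, html):
    """Heuristic Akamai block check: 403s and 'Access Denied' interstitials."""
    return status == 403 or '<title>Access Denied</title>' in html

def fetch_with_retries(fetch, url, max_attempts=3):
    """Call fetch(url) until a page comes back unblocked, or give up.
    With a rotating residential gateway each attempt uses a fresh exit IP;
    don't reuse cookies between attempts."""
    for _ in range(max_attempts):
        status, html = fetch(url)
        if not looks_blocked(status, html):
            return html
    return None

# Stub fetcher: blocked twice, then clean (simulates rotation finding a good IP)
responses = iter([
    (403, ''),
    (200, '<title>Access Denied</title>'),
    (200, '<html>ok</html>'),
])
print(fetch_with_retries(lambda url: next(responses), 'https://www.yellowpages.com/search'))
# -> <html>ok</html>
```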
Where things change: contacting the people behind the data (privacy and anti-spam laws apply) and republishing the scrape as a competing database.
The safe pattern: scrape public data, comply with relevant privacy laws when contacting people, don't resell the raw scrape as your own database.