Email scraping with Python is one of the simplest "first scraping projects" you can build, but doing it properly — at scale, without getting blocked, and within the law — requires more than a regex and a for loop. This guide walks through the full pipeline: pulling email addresses from any HTML page, handling JavaScript-rendered content, deduplicating, exporting to CSV, rotating IPs through residential proxies, and staying on the right side of GDPR, CCPA, and CAN-SPAM.
By the end you'll have a complete, copy-pasteable Python email scraper with the production-grade additions (proxies, retries, error handling) that most tutorials skip.
Short answer: scraping publicly listed business emails is generally allowed; using them to send unsolicited bulk email is generally not. The key laws to know are GDPR (the EU's personal-data rules, which can cover business emails that identify a person), CCPA (California's privacy law), and CAN-SPAM (the US law governing commercial email).
Build your scraper assuming the addresses you collect are personal data, document why you're collecting them, store them securely, honor deletion requests, and never send unsolicited bulk email. When in doubt, talk to a lawyer — particularly if you're operating across jurisdictions.
The minimal stack: requests + BeautifulSoup4 + re for static pages, pandas for clean CSV/Excel output, and Selenium or Playwright for JavaScript-heavy sites.
Install everything in one go:
pip install requests beautifulsoup4 pandas selenium playwright
The first piece is the regex that recognizes email addresses inside arbitrary text. The classic pattern that handles 99% of real-world emails:
import re
EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
)
text = "Contact us at sales@example.com or support@example.com for help."
emails = EMAIL_RE.findall(text)
print(emails)
# ['sales@example.com', 'support@example.com']
This pattern is intentionally permissive — it catches real emails but won't catch every edge case in the official RFC 5322 spec (which is dozens of times longer). If you want stricter validation, layer the email-validator package on top to filter out malformed entries after extraction.
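A quick demonstration of where the permissive pattern's boundaries sit (the test strings here are illustrative, not from any real site): it handles plus-tags and subdomains fine, but it also fires on non-email strings like retina image filenames, and it skips exotic-but-valid RFC 5322 forms entirely.

```python
import re

# Same permissive pattern as above.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Normal addresses, including plus-tags and subdomains, match fine:
print(EMAIL_RE.findall("write to user.name+tag@mail.example.co.uk today"))

# But the pattern also fires on non-emails such as retina image
# filenames, which is one reason to strip markup before matching:
print(EMAIL_RE.findall('<img src="logo@2x.png">'))

# And it silently skips valid-but-exotic RFC 5322 forms such as a
# quoted local part, which it cannot represent:
print(EMAIL_RE.findall('"john doe"@example.com'))
```

This is the trade-off the paragraph above describes: a short pattern that over- and under-matches slightly, cleaned up afterward by stripping scripts and validating extracted candidates.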
Now wire the regex into a real fetch + parse pipeline:
import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36"
}

def scrape_emails(url):
    r = requests.get(url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")
    # Strip script/style tags so we don't pick up junk
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Pull from visible text
    visible = soup.get_text(separator=" ")
    emails = set(EMAIL_RE.findall(visible))
    # Also catch mailto: hrefs (very common on contact pages)
    for a in soup.select("a[href^=mailto]"):
        href = a.get("href", "")
        addr = href.replace("mailto:", "").split("?")[0]
        if addr:
            emails.add(addr)
    return emails

if __name__ == "__main__":
    for email in scrape_emails("https://example.com/contact"):
        print(email)
Key details:
- Send a real browser User-Agent. The default python-requests/X.X string is the fastest way to get blocked.
- Strip <script> and <style> tags first. Otherwise you'll match analytics IDs, CSS class names, and other junk that contains @ symbols.
- Check mailto: hrefs. Many contact pages obfuscate the visible email but still expose it in the link.
- Collect into a set(), not a list. Pages frequently repeat the same address in multiple places.

One contact page is rarely enough. Most companies scatter emails across /contact, /about, /team, and footer links. Crawl a list of URLs, dedupe globally, and persist after each fetch so a crash doesn't lose your progress:
import csv
import time

URLS = [
    "https://example.com/contact",
    "https://example.com/about",
    "https://example.com/team",
]

def crawl(urls, out_path="emails.csv"):
    seen = set()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["email", "source_url"])
        for url in urls:
            try:
                emails = scrape_emails(url)
            except Exception as e:
                print(f"[WARN] {url}: {e}")
                continue
            for addr in emails:
                if addr not in seen:
                    seen.add(addr)
                    writer.writerow([addr, url])
            f.flush()  # push rows to disk so a crash doesn't lose them
            time.sleep(2)  # polite delay
    print(f"saved {len(seen)} unique emails to {out_path}")

crawl(URLS)
Two practical notes: (1) the time.sleep(2) isn't optional — without a delay you'll hammer the target server and trigger rate limits within seconds; (2) writing to disk after each URL means if the script crashes at URL 47 of 50, you keep the first 47 results.
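One small extension worth sketching (my own addition, not part of the script above): reload any existing CSV into the seen set before crawling, so a rerun resumes where it left off instead of re-writing duplicate rows.

```python
import csv
import os

def load_seen(out_path="emails.csv"):
    # Rebuild the dedupe set from a previous run's CSV, if one
    # exists, so a rerun skips addresses it has already written.
    seen = set()
    if os.path.exists(out_path):
        with open(out_path, newline="", encoding="utf-8") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row
            for row in reader:
                if row:
                    seen.add(row[0])
    return seen
```

To use it, call seen = load_seen(out_path) at the top of crawl() and open the file in append mode ("a") instead of "w" when it already exists.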
Plenty of modern sites render contact info client-side with React or Vue, which means requests won't see anything useful — you need a real browser. Selenium (or Playwright) is the answer:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import re, time

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scrape_emails_selenium(url):
    opts = Options()
    opts.add_argument("--headless=new")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--user-agent=Mozilla/5.0 ...")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        time.sleep(5)  # give the page time to render
        body = driver.find_element("tag name", "body").text
        emails = set(EMAIL_RE.findall(body))
        # Also grab mailto links
        for a in driver.find_elements("css selector", "a[href^=mailto]"):
            href = a.get_attribute("href") or ""
            addr = href.replace("mailto:", "").split("?")[0]
            if addr:
                emails.add(addr)
        return emails
    finally:
        driver.quit()
For most production work, prefer Playwright over Selenium — it's faster, has cleaner async APIs, and ships with auto-wait built in. Either way, headless browsers are 10–100× slower than requests, so use them only when the email truly isn't in the initial HTML.
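A simple way to keep browser use to a minimum is a fallback policy: run the cheap static scraper first and only pay the headless-browser cost when it comes up empty. A sketch (the policy and the injected-callable shape are my own suggestion, not from any particular library):

```python
def scrape_with_fallback(url, fetch_static, fetch_rendered):
    # Try the requests-based scraper first; fall back to the headless
    # browser only when the static pass finds nothing. The two
    # fetchers are injected (e.g. scrape_emails and
    # scrape_emails_selenium from above) so the policy is testable
    # without a network or a browser.
    emails = fetch_static(url)
    if emails:
        return emails, "static"
    return fetch_rendered(url), "rendered"
```

Because most sites still ship emails in the initial HTML, this keeps the slow browser path to a small fraction of requests.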
Email scraping at any scale lights up rate limiters. After 50–100 requests from the same IP, expect to see HTTP 429 (Too Many Requests), CAPTCHAs, or full IP bans. The fix is to rotate through residential IPs so each request looks like it's coming from a different real consumer.
SpyderProxy's residential proxies expose a single rotating endpoint — you don't need to manage IP lists yourself. Drop one config block into your scraper:
PROXY_USER = "your-spyder-username"
PROXY_PASS = "your-spyder-password"
PROXY_HOST = "gate.spyderproxy.com"
PROXY_PORT = 7777

def proxies_dict():
    auth = f"{PROXY_USER}:{PROXY_PASS}"
    return {
        "http": f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}",
        "https": f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}",
    }

# Then in your fetch:
r = requests.get(url, headers=HEADERS, proxies=proxies_dict(), timeout=15)
Each request through that endpoint goes out from a different IP in the 120M+ SpyderProxy residential pool. For most email-scraping workloads, the Budget Residential tier at $1.75/GB is more than enough — a requests-based scraper downloads only the HTML (no images, no JS), so a typical page costs maybe 10–50 KB of bandwidth. A few dollars buys you tens of thousands of pages.
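The bandwidth arithmetic is easy to sanity-check (the 30 KB average below is an assumption; real pages vary):

```python
GB = 1024 ** 3  # residential proxy bandwidth is billed per gigabyte

def pages_per_gb(avg_page_kb):
    # How many HTML-only page fetches fit in one GB of bandwidth.
    return GB // (avg_page_kb * 1024)

# At ~30 KB of HTML per page, one GB covers roughly 35,000 fetches,
# so a single $1.75 GB goes a long way.
print(pages_per_gb(30))
```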
If you need to keep the same IP across multiple page loads (e.g., logging in once, then scraping internal pages), use SpyderProxy's sticky sessions by adding a session token to the username:
PROXY_USER = "your-spyder-username-session-AB12CD34" # any string keeps the IP sticky
# IP stays the same up to 24 hours on Premium
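A tiny helper for minting sticky-session usernames (the -session-<token> suffix follows the format shown above; generating the token with secrets is my own convenience):

```python
import secrets

def sticky_username(base_user, token=None):
    # Reusing the same token keeps the same exit IP across requests;
    # minting a fresh token rotates to a new IP.
    token = token or secrets.token_hex(4).upper()
    return f"{base_user}-session-{token}"
```

Build the proxies dict with sticky_username(PROXY_USER, token) in place of the bare username, holding one token per login session.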
The bare-bones script above works. To run it across thousands of domains without hand-holding, add:
- Retries: wrap requests.get in a retry loop that handles 429, 5xx, and connection errors. The tenacity library makes this two lines.
- Concurrency: use concurrent.futures.ThreadPoolExecutor with 5–20 workers — much faster than serial fetches, gentler than asyncio for beginners.
- Politeness: respect robots.txt and don't hit one domain more than ~1 req/sec without permission.
- Validation: the email-validator package catches typos and obviously invalid addresses (foo@bar, @example.com) before they pollute your dataset.
- Filtering: drop generic and placeholder addresses like noreply@, test@example.com (when example.com is the placeholder), and email@yourdomain.com (form placeholders).

from email_validator import validate_email, EmailNotValidError
GENERIC_LOCAL_PARTS = {"noreply", "no-reply", "do-not-reply", "test", "example"}
PLACEHOLDER_DOMAINS = {"example.com", "domain.com", "yourdomain.com"}

def is_useful(addr):
    try:
        v = validate_email(addr, check_deliverability=False)
    except EmailNotValidError:
        return False
    local, domain = v.normalized.split("@")
    if local.lower() in GENERIC_LOCAL_PARTS:
        return False
    if domain.lower() in PLACEHOLDER_DOMAINS:
        return False
    return True

clean = {e for e in raw_emails if is_useful(e)}
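The concurrency point above can be sketched with the stdlib ThreadPoolExecutor. The fetch callable is injected so the sketch runs without a network (in practice you'd pass scrape_emails from earlier):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def crawl_concurrent(urls, fetch, max_workers=10):
    # Fetch many URLs in parallel and merge the resulting email sets.
    # `fetch` maps url -> set of emails; per-URL failures are
    # collected instead of aborting the whole run.
    found, errors = set(), {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch, u): u for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                found |= fut.result()
            except Exception as exc:
                errors[url] = repr(exc)
    return found, errors
```

Keep max_workers modest (5–20): the bottleneck is the politeness budget per target domain, not your machine.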
Once you have a clean CSV, the legal/ethical baseline is simple: don't use these addresses for unsolicited bulk email. Personalized one-to-one outreach with a documented lawful basis is the kind of use that stays within the rules.
One mistake worth repeating: strip script tags before running the regex, or you'll match Google Analytics IDs and similar junk.

A useful email scraper is ~80 lines of Python: regex + requests + BeautifulSoup for the basics, Selenium/Playwright for JS-heavy targets, a residential proxy endpoint for rotation, and validation/filtering on the back end. The technical work is straightforward — the discipline is in following the law, respecting site owners, and using the data responsibly.
If you're going to scrape at any volume, route through SpyderProxy Residential Proxies from $1.75/GB. Email payloads are tiny, the rotation handles rate limits and IP bans automatically, and you'll spend pennies for tens of thousands of pages instead of fighting blocks for hours.
Scraping publicly listed business emails is generally allowed in most jurisdictions. The legal exposure starts when you (1) ignore site Terms of Service that explicitly prohibit scraping, (2) collect personal-data emails without a lawful basis under GDPR/CCPA, or (3) use the addresses for unsolicited bulk email in violation of CAN-SPAM and similar laws. Always document why you're collecting addresses, store them securely, and honor deletion requests.
For static HTML, requests + BeautifulSoup4 + re is the canonical stack — minimal dependencies, fast, easy to debug. For JavaScript-rendered pages, Playwright (preferred over Selenium for new projects) handles real browser rendering. Add email-validator for post-extraction validation and pandas for clean CSV/Excel output.
The pragmatic pattern is [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}. This catches the vast majority of real-world emails without the complexity of the full RFC 5322 spec. For stricter validation, run extracted matches through the email-validator package to filter out malformed entries.
Most websites rate-limit per IP. After 50–100 requests from the same address, expect HTTP 429 errors, CAPTCHAs, or full IP bans. Rotating residential proxies make each request appear to come from a different real consumer, so rate limiters don't trigger. SpyderProxy residential proxies at $1.75/GB handle this automatically through a single rotating endpoint.
Yes — but you need a real browser, not requests. Use Playwright or Selenium to load the page, wait for the JavaScript to execute, then extract from the rendered DOM. Headless browsers are 10–100× slower than requests, so use them only when the email truly isn't in the initial HTML.
Store extracted emails in a Python set() rather than a list. Sets reject duplicates automatically. Persist the set to disk after each URL so a crash doesn't lose your progress, and key your results CSV by email address so reruns don't double-write the same row.
Indirectly, yes. If you scrape addresses and then send unsolicited bulk email to them, your sender reputation will tank within days — major mailbox providers (Gmail, Outlook, Yahoo) will mark your domain as a spam source and your deliverability will collapse. Scrape if you want, but don't use the results for cold-spam campaigns. Personalized 1:1 outreach with documented legitimate interest is a different (and legal) use case.
From a single IP without proxies: roughly 1 request per second per domain, with random jitter. With rotating residential proxies: effectively unlimited per-domain throughput as long as you respect each target's rate limits. The bottleneck shifts from "your IP is flagged" to "the target's rate limiter has a per-account or per-fingerprint cap." Add stealth headers, user-agent rotation, and reasonable delays even with proxies.
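The per-domain pacing described above reduces to a tiny helper (the defaults are assumptions matching the ~1 req/sec-with-jitter guidance):

```python
import random

def polite_delay(base=1.0, jitter=0.5):
    # A delay of base..base+jitter seconds, so request timing doesn't
    # look machine-regular to rate limiters.
    return base + random.uniform(0.0, jitter)

# Call time.sleep(polite_delay()) between requests to the same domain.
```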
If your email-scraping pipeline is hitting block walls, the fastest fix is rotating residential IPs. SpyderProxy Residential starts at $1.75/GB with 120M+ IPs across 195+ countries, sticky sessions up to 24 hours, and HTTP + SOCKS5 support.
Get started at SpyderProxy.com — or join us on Discord and Telegram if you want help wiring proxies into your scraper.