
Email Scraping with Python (2026): Complete Guide with Code Examples

SpyderProxy Team

Published 2026-04-19

Email scraping with Python is one of the simplest "first scraping projects" you can build, but doing it properly — at scale, without getting blocked, and within the law — requires more than a regex and a for loop. This guide walks through the full pipeline: pulling email addresses from any HTML page, handling JavaScript-rendered content, deduplicating, exporting to CSV, rotating IPs through residential proxies, and staying on the right side of GDPR, CCPA, and CAN-SPAM.

By the end you'll have a complete, copy-pasteable Python email scraper with the production-grade additions (proxies, retries, error handling) that most tutorials skip.

Is Email Scraping Legal?

Short answer: scraping publicly listed business emails is generally allowed; using them to send unsolicited bulk email is generally not. The key laws to know:

  • GDPR (EU) — email addresses tied to identifiable individuals are personal data. You need a lawful basis (typically legitimate interest with proper notice, or consent) to process them, even if they were publicly available.
  • CCPA / CPRA (California) — similar personal-data protections, with disclosure and opt-out obligations if you're going to use the addresses commercially.
  • CAN-SPAM (US) — governs how you can send commercial email. You can collect addresses freely; sending unsolicited commercial mail without proper headers, opt-out, and physical address is what triggers fines.
  • Site Terms of Service — many sites explicitly prohibit automated scraping in their ToS. Violating that isn't always illegal but can expose you to civil claims.

Build your scraper assuming the addresses you collect are personal data, document why you're collecting them, store them securely, honor deletion requests, and never send unsolicited bulk email. When in doubt, talk to a lawyer — particularly if you're operating across jurisdictions.

Tools You'll Need

The minimal stack:

  • requests — fetches HTML over HTTP(S).
  • BeautifulSoup4 — parses HTML, lets you target specific tags/classes.
  • re — Python's built-in regex module for matching email patterns.
  • pandas (optional) — clean CSV/Excel export.
  • Selenium or Playwright — for JavaScript-rendered pages where the email isn't in the initial HTML.

Install everything in one go (Playwright also needs a one-time browser download after the pip install):

pip install requests beautifulsoup4 pandas selenium playwright
playwright install

Step 1: The Email Regex

The first piece is the regex that recognizes email addresses inside arbitrary text. The classic pattern that handles 99% of real-world emails:

import re

EMAIL_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
)

text = "Contact us at support@example.com or sales@example.com for help."
emails = EMAIL_RE.findall(text)
print(emails)
# ['support@example.com', 'sales@example.com']

This pattern is intentionally permissive — it catches real emails but won't catch every edge case in the official RFC 5322 spec (which is dozens of times longer). If you want stricter validation, layer the email-validator package on top to filter out malformed entries after extraction.
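A few concrete cases show what "intentionally permissive" means in practice. The pattern matches real addresses with plus-tags and deep subdomains in full, but it also fires on retina asset names like image@2x.png, which is exactly the junk that script/style stripping and post-validation are there to catch:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Real-world addresses with plus-tags and nested subdomains match in full:
print(EMAIL_RE.findall("write to user.name+tag@mail.example.co.uk today"))
# ['user.name+tag@mail.example.co.uk']

# ...but so do asset filenames lurking in raw CSS, which is why the full
# scraper strips <script>/<style> tags before matching:
print(EMAIL_RE.findall("background: url(image@2x.png)"))
# ['image@2x.png']
```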

Step 2: Scrape One Page with Requests + BeautifulSoup

Now wire the regex into a real fetch + parse pipeline:

import re
import requests
from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/124.0.0.0 Safari/537.36"
}

def scrape_emails(url):
    r = requests.get(url, headers=HEADERS, timeout=15)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "html.parser")

    # Strip script/style tags so we don't pick up junk
    for tag in soup(["script", "style"]):
        tag.decompose()

    # Pull from visible text
    visible = soup.get_text(separator=" ")
    emails = set(EMAIL_RE.findall(visible))

    # Also catch mailto: hrefs (very common on contact pages)
    for a in soup.select("a[href^=mailto]"):
        href = a.get("href", "")
        addr = href.replace("mailto:", "").split("?")[0]
        if addr:
            emails.add(addr)

    return emails

if __name__ == "__main__":
    for email in scrape_emails("https://example.com/contact"):
        print(email)

Key details:

  • Set a real User-Agent. The default python-requests/X.X string is the fastest way to get blocked.
  • Strip <script> and <style> tags first. Otherwise you'll match analytics IDs, CSS class names, and other junk that contains @ symbols.
  • Always check mailto: hrefs. Many contact pages obfuscate the visible email but still expose it in the link.
  • Use a set(), not a list. Pages frequently repeat the same address in multiple places.

Step 3: Crawl Multiple Pages with Deduplication

One contact page is rarely enough. Most companies scatter emails across /contact, /about, /team, and footer links. Crawl a list of URLs, dedupe globally, and persist after each fetch so a crash doesn't lose your progress:

import csv
import time

URLS = [
    "https://example.com/contact",
    "https://example.com/about",
    "https://example.com/team",
]

def crawl(urls, out_path="emails.csv"):
    seen = set()
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["email", "source_url"])
        for url in urls:
            try:
                emails = scrape_emails(url)
            except Exception as e:
                print(f"[WARN] {url}: {e}")
                continue
            for addr in emails:
                if addr not in seen:
                    seen.add(addr)
                    writer.writerow([addr, url])
            time.sleep(2)  # polite delay
    print(f"saved {len(seen)} unique emails to {out_path}")

crawl(URLS)

Two practical notes: (1) the time.sleep(2) isn't optional — without a delay you'll hammer the target server and trigger rate limits within seconds; (2) writing to disk after each URL means if the script crashes at URL 47 of 50, you keep the first 47 results.
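To strengthen the crash-recovery story further, you can keep a small URL-to-status log next to the CSV so reruns skip URLs that already succeeded. A minimal stdlib sketch (crawl_log.json is a hypothetical filename, not something the code above creates):

```python
import json
import os

LOG_PATH = "crawl_log.json"  # hypothetical sidecar file next to emails.csv

def load_done(log_path=LOG_PATH):
    """Return the set of URLs already crawled successfully."""
    if not os.path.exists(log_path):
        return set()
    with open(log_path, encoding="utf-8") as f:
        return {url for url, status in json.load(f).items() if status == "ok"}

def record_status(url, status, log_path=LOG_PATH):
    """Mark one URL as 'ok' or 'failed', rewriting the log in place."""
    log = {}
    if os.path.exists(log_path):
        with open(log_path, encoding="utf-8") as f:
            log = json.load(f)
    log[url] = status
    with open(log_path, "w", encoding="utf-8") as f:
        json.dump(log, f, indent=2)
```

Wired into the crawl loop, you'd call record_status(url, "ok") after each successful scrape and skip any URL already in load_done() at startup.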

Step 4: JavaScript-Rendered Pages with Selenium

Plenty of modern sites render contact info client-side with React or Vue, which means requests won't see anything useful — you need a real browser. Selenium (or Playwright) is the answer:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import re, time

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def scrape_emails_selenium(url):
    opts = Options()
    opts.add_argument("--headless=new")
    opts.add_argument("--disable-gpu")
    opts.add_argument("--no-sandbox")
    opts.add_argument("--user-agent=Mozilla/5.0 ...")
    driver = webdriver.Chrome(options=opts)
    try:
        driver.get(url)
        time.sleep(5)  # give the page time to render
        body = driver.find_element("tag name", "body").text
        emails = set(EMAIL_RE.findall(body))
        # Also grab mailto links
        for a in driver.find_elements("css selector", "a[href^=mailto]"):
            href = a.get_attribute("href") or ""
            addr = href.replace("mailto:", "").split("?")[0]
            if addr:
                emails.add(addr)
        return emails
    finally:
        driver.quit()

For most production work, prefer Playwright over Selenium — it's faster, has cleaner async APIs, and ships with auto-wait built in. Either way, headless browsers are 10–100× slower than requests, so use them only when the email truly isn't in the initial HTML.
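For reference, here is roughly what the same logic looks like in Playwright's sync API. Treat it as a sketch rather than a drop-in replacement; the Playwright import lives inside the function so the regex helper still works without Playwright installed:

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def extract_emails(text):
    """Pull all regex-matching addresses out of a blob of text."""
    return set(EMAIL_RE.findall(text))

def scrape_emails_playwright(url):
    # Imported here so the module loads even without Playwright installed
    from playwright.sync_api import sync_playwright
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        # networkidle waits for the JS-rendered content instead of a
        # hard-coded sleep
        page.goto(url, wait_until="networkidle")
        emails = extract_emails(page.inner_text("body"))
        # Also grab mailto: hrefs from the rendered DOM
        for href in page.eval_on_selector_all(
            "a[href^='mailto:']", "els => els.map(e => e.getAttribute('href'))"
        ):
            addr = (href or "").replace("mailto:", "").split("?")[0]
            if addr:
                emails.add(addr)
        browser.close()
        return emails
```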

Step 5: Stop Getting Blocked — Add Proxy Rotation

Email scraping at any scale lights up rate limiters. After 50–100 requests from the same IP, expect to see HTTP 429 (Too Many Requests), CAPTCHAs, or full IP bans. The fix is to rotate through residential IPs so each request looks like it's coming from a different real consumer.

SpyderProxy's residential proxies expose a single rotating endpoint — you don't need to manage IP lists yourself. Drop one config block into your scraper:

PROXY_USER = "your-spyder-username"
PROXY_PASS = "your-spyder-password"
PROXY_HOST = "gate.spyderproxy.com"
PROXY_PORT = 7777

def proxies_dict():
    auth = f"{PROXY_USER}:{PROXY_PASS}"
    return {
        "http":  f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}",
        "https": f"http://{auth}@{PROXY_HOST}:{PROXY_PORT}",
    }

# Then in your fetch:
r = requests.get(url, headers=HEADERS, proxies=proxies_dict(), timeout=15)

Each request through that endpoint goes out from a different IP in the 120M+ SpyderProxy residential pool. For most email-scraping workloads, the Budget Residential tier at $1.75/GB is more than enough: a requests-based scraper pulls only the HTML, typically 10–50 KB per page, since it never downloads images or JS. A few dollars buys you tens of thousands of pages.

If you need to keep the same IP across multiple page loads (e.g., logging in once, then scraping internal pages), use SpyderProxy's sticky sessions by adding a session token to the username:

PROXY_USER = "your-spyder-username-session-AB12CD34"  # any string keeps the IP sticky
# IP stays the same up to 24 hours on Premium
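If you want one sticky IP per target site rather than one global session, a small helper can derive the token from the domain. This assumes the -session-&lt;token&gt; username format shown above; the helper name and token scheme are illustrative, not part of any SpyderProxy API:

```python
import uuid
from urllib.parse import urlparse

def sticky_user(base_user, url):
    """Build a proxy username whose session token is stable per domain."""
    domain = urlparse(url).netloc
    # uuid5 is deterministic: the same domain always yields the same token,
    # so every request to one site reuses one IP without hand-managed state.
    token = uuid.uuid5(uuid.NAMESPACE_DNS, domain).hex[:8].upper()
    return f"{base_user}-session-{token}"
```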

Step 6: Production-Grade Hardening

The bare-bones script above works. To run it across thousands of domains without hand-holding, add:

  • Retries with backoff. Wrap your requests.get in a retry loop that handles 429, 5xx, and connection errors. The tenacity library makes this two lines.
  • Concurrent fetches. Use concurrent.futures.ThreadPoolExecutor with 5–20 workers — much faster than serial fetches, gentler than asyncio for beginners.
  • Polite rate limiting. Even with proxies, respect robots.txt and don't hit one domain more than ~1 req/sec without permission.
  • Validate emails after extraction. The email-validator package catches typos and obviously invalid addresses (foo@bar, @example.com) before they pollute your dataset.
  • Filter against a blocklist. Drop generic addresses you don't actually want: noreply@ and similar local parts, anything at placeholder domains like example.com or yourdomain.com, and form-placeholder entries such as test@ or example@.
  • Persist a "URL → status" log alongside your emails CSV so you can rerun only the failed URLs without redoing successful ones.
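The first two bullets, retries with backoff and a thread pool, can be sketched with the standard library alone (tenacity packages the same retry pattern more tersely). Here fetch stands in for a scrape_emails-style callable:

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def with_retries(fn, attempts=4, base_delay=1.0,
                 retryable=(ConnectionError, TimeoutError)):
    """Call fn(), retrying on retryable errors with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            time.sleep(base_delay * (2 ** attempt) + random.random() * base_delay)

def crawl_concurrent(urls, fetch, workers=10):
    """Fetch many URLs in parallel; returns {url: result-or-exception}."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(with_retries, lambda u=u: fetch(u)): u
                   for u in urls}
        for fut in as_completed(futures):
            url = futures[fut]
            try:
                results[url] = fut.result()
            except Exception as e:
                results[url] = e  # keep the failure so a rerun can retry it
    return results
```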

Quick Snippet: Validate and Filter

from email_validator import validate_email, EmailNotValidError

GENERIC_LOCAL_PARTS = {"noreply", "no-reply", "do-not-reply", "test", "example"}
PLACEHOLDER_DOMAINS = {"example.com", "domain.com", "yourdomain.com"}

def is_useful(addr):
    try:
        v = validate_email(addr, check_deliverability=False)
    except EmailNotValidError:
        return False
    local, domain = v.normalized.split("@")
    if local.lower() in GENERIC_LOCAL_PARTS:
        return False
    if domain.lower() in PLACEHOLDER_DOMAINS:
        return False
    return True

clean = {e for e in raw_emails if is_useful(e)}

What to Do With the Results

Once you have a clean CSV, the legal/ethical baseline is: don't use these for unsolicited bulk email. Realistic uses that stay within the rules:

  • Sales prospecting with documented legitimate-interest basis under GDPR, personalized 1:1 outreach, and a clear opt-out in every message.
  • Lead enrichment — appending publicly listed business emails to your existing CRM contacts.
  • Recruiting / talent sourcing — finding hiring managers' contact details for one-off direct outreach.
  • Academic or journalism research — building datasets for analysis, not for marketing campaigns.
  • Customer support / partnership outreach — finding the right department contact for legitimate inbound use cases.

Common Mistakes to Avoid

  • Forgetting to set a User-Agent. The default Python UA gets blocked instantly on any modern site.
  • Not stripping script tags before regex. You'll match Google Analytics IDs and similar junk.
  • Scraping without rate limits. A polite 1–2 second delay per request keeps you off block lists.
  • Using one IP for thousands of requests. Use rotating residential proxies — the cost is negligible compared to the time you'd spend swapping IPs manually.
  • Sending unsolicited bulk email to scraped addresses. This is the fastest way to land in legal trouble and kill your sender reputation.
  • Storing emails in plaintext CSVs forever. Personal data has retention obligations under GDPR/CCPA — set a deletion policy.
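The rate-limit mistake above is easy to avoid with a small per-domain throttle; a stdlib-only sketch:

```python
import time
from collections import defaultdict
from urllib.parse import urlparse

class DomainRateLimiter:
    """Enforce a minimum interval between requests to the same domain."""

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self.last_hit = defaultdict(float)  # domain -> monotonic timestamp

    def wait(self, url):
        """Sleep just long enough to honor min_interval for this URL's domain."""
        domain = urlparse(url).netloc
        elapsed = time.monotonic() - self.last_hit[domain]
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_hit[domain] = time.monotonic()
```

Call limiter.wait(url) before each fetch; requests to different domains never block each other, so it composes cleanly with a thread pool.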

The Bottom Line

A useful email scraper is ~80 lines of Python: regex + requests + BeautifulSoup for the basics, Selenium/Playwright for JS-heavy targets, a residential proxy endpoint for rotation, and validation/filtering on the back end. The technical work is straightforward — the discipline is in following the law, respecting site owners, and using the data responsibly.

If you're going to scrape at any volume, route through SpyderProxy Residential Proxies from $1.75/GB. Email payloads are tiny, the rotation handles rate limits and IP bans automatically, and you'll spend pennies for tens of thousands of pages instead of fighting blocks for hours.

Frequently Asked Questions

Is email scraping legal?

Scraping publicly listed business emails is generally allowed in most jurisdictions. The legal exposure starts when you (1) ignore site Terms of Service that explicitly prohibit scraping, (2) collect personal-data emails without a lawful basis under GDPR/CCPA, or (3) use the addresses for unsolicited bulk email in violation of CAN-SPAM and similar laws. Always document why you're collecting addresses, store them securely, and honor deletion requests.

What's the best Python library for email scraping?

For static HTML, requests + BeautifulSoup4 + re is the canonical stack — minimal dependencies, fast, easy to debug. For JavaScript-rendered pages, Playwright (preferred over Selenium for new projects) handles real browser rendering. Add email-validator for post-extraction validation and pandas for clean CSV/Excel output.

What email regex should I use in Python?

The pragmatic pattern is [A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}. This catches the vast majority of real-world emails without the complexity of the full RFC 5322 spec. For stricter validation, run extracted matches through the email-validator package to filter out malformed entries.

Why do I need proxies for email scraping?

Most websites rate-limit per IP. After 50–100 requests from the same address, expect HTTP 429 errors, CAPTCHAs, or full IP bans. Rotating residential proxies make each request appear to come from a different real consumer, so rate limiters don't trigger. SpyderProxy residential proxies at $1.75/GB handle this automatically through a single rotating endpoint.

Can I scrape emails from JavaScript-rendered sites?

Yes — but you need a real browser, not requests. Use Playwright or Selenium to load the page, wait for the JavaScript to execute, then extract from the rendered DOM. Headless browsers are 10–100× slower than requests, so use them only when the email truly isn't in the initial HTML.

How do I avoid duplicates when scraping multiple pages?

Store extracted emails in a Python set() rather than a list. Sets reject duplicates automatically. Persist the set to disk after each URL so a crash doesn't lose your progress, and key your results CSV by email address so reruns don't double-write the same row.

Can email scraping get me blocked from sending email?

Indirectly, yes. If you scrape addresses and then send unsolicited bulk email to them, your sender reputation will tank within days — major mailbox providers (Gmail, Outlook, Yahoo) will mark your domain as a spam source and your deliverability will collapse. Scrape if you want, but don't use the results for cold-spam campaigns. Personalized 1:1 outreach with documented legitimate interest is a different (and legal) use case.

How fast can I scrape emails before getting blocked?

From a single IP without proxies: roughly 1 request per second per domain, with random jitter. With rotating residential proxies: effectively unlimited per-domain throughput as long as you respect each target's rate limits. The bottleneck shifts from "your IP is flagged" to "the target's rate limiter has a per-account or per-fingerprint cap." Add stealth headers, user-agent rotation, and reasonable delays even with proxies.

Ready to Scrape at Scale?

If your email-scraping pipeline is hitting block walls, the fastest fix is rotating residential IPs. SpyderProxy Residential starts at $1.75/GB with 120M+ IPs across 195+ countries, sticky sessions up to 24 hours, and HTTP + SOCKS5 support.

Get started at SpyderProxy.com — or join us on Discord and Telegram if you want help wiring proxies into your scraper.
