How do anti-bot systems detect web scrapers in 2026?

Five layers, in order of cheapness for the defender: (1) IP reputation — ASN lookup catches datacenter IPs in milliseconds; (2) TLS fingerprint JA4 — Python requests has a recognizable hash; (3) HTTP headers — wrong order or default User-Agent flags you; (4) JS browser fingerprint — bare Playwright leaks navigator.webdriver; (5) behavior — uniform request timing, missing mouse movements.

What's the most important thing for stealth web scraping?

IP reputation. A datacenter IP triggers most anti-bot challenges in the first 50 milliseconds, before any other check runs. Use residential (consumer ISP) or mobile (carrier) proxies — Comcast, BT, Vodafone broadband ASNs start at low risk, AWS or Hetzner start at high risk. The other four layers matter, but a bad IP makes them moot.

How do I impersonate Chrome's TLS fingerprint in Python?

Use curl_cffi: from curl_cffi import requests; r = requests.get(url, impersonate='chrome131'). It uses BoringSSL with Chrome's cipher order, ALPN, and TLS extensions, producing a JA4 hash matching real Chrome. Alternatives: curl-impersonate (CLI binary), tls-client (Go, with Python and Node bindings), azuretls-client (Go). With Playwright you get Chrome's real TLS automatically.

What headers should I send to look like a real browser?

Real Chrome sends about a dozen headers in a specific order: User-Agent, Accept, Accept-Language, Accept-Encoding, Sec-Ch-Ua, Sec-Ch-Ua-Mobile, Sec-Ch-Ua-Platform, Sec-Fetch-Dest, Sec-Fetch-Mode, Sec-Fetch-Site, Sec-Fetch-User, Upgrade-Insecure-Requests. Python requests merges your headers into its own session defaults (User-Agent, Accept-Encoding, Accept, Connection), so the order on the wire does not match Chrome's. Use curl_cffi or httpx with explicit ordering to preserve the natural order.

How fast should I scrape to avoid detection?

Aim for 500ms-3,000ms between requests with random jitter. Uniform timing (every 1s exactly) screams bot. For sites with strong behavioral detection (sneaker, ticket, banking), throttle to 5-15 seconds between requests and add synthetic mouse moves and scrolls. Cap total throughput at roughly 5-10 requests/sec across your whole pool, not per IP.

Will rotating proxies alone hide my scraper?

No. Rotating residential IPs solve IP reputation but leave TLS, headers, JS fingerprint, and behavior signals exposed. Modern anti-bot systems (Cloudflare, DataDome, PerimeterX) score all five layers. A datacenter Playwright with stealth + good behavior beats a residential proxy with bare Python requests. Build the stack, don't rely on one layer.

How do I test if my scraper looks like a bot?

Visit bot.sannysoft.com, which runs ~30 detection checks (webdriver flag, WebGL, plugins, fonts, etc.) and flags failures in red. Then run CreepJS for deep fingerprint analysis. Finally, load nopecha.com/demo/cloudflare to test against an actual Cloudflare-protected page. Fix every red flag before scraping production targets.

How to Avoid Detection When Scraping (2026)

Daniel K.

Mon May 18 2026

Quick playbook: Detection in 2026 stacks across five layers. (1) IP reputation — use residential or mobile proxies, not datacenter. (2) TLS fingerprint (JA3/JA4) — impersonate a real Chrome via curl_cffi or browser. (3) HTTP headers — full browser header set in correct order, real User-Agent, real Accept-Language. (4) JS runtime / browser fingerprint — use Patchright / undetected-playwright; never bare Playwright with defaults. (5) Behavior — human pacing, organic click paths, cookie persistence. Skip any one of these and a modern anti-bot system (Cloudflare, DataDome, PerimeterX, Akamai, Kasada) will catch you.

Who's Looking

System	Primary signal	Catches / typical deployments
Cloudflare WAF / Bot Fight Mode	IP rep + JA4 + JS challenge	~80% of automation
DataDome	Fingerprint + behavior	Aggressive on e-commerce, news
PerimeterX (HUMAN)	JS challenge + telemetry	Sneakers, ticketing
Akamai Bot Manager	Multi-layer scoring	Enterprise sites, airlines
Kasada	JS challenge + Web Crypto	Sneakers, ticketing
Imperva (formerly Incapsula)	Behavioral + signature	Financial, e-commerce
F5 Distributed Cloud Bot Defense	Behavior + fingerprint	Banking, retail

Layer 1: IP Reputation

Anti-bot systems classify IPs by ASN, geolocation, and historical bot activity. The ASN lookup runs first, within the first 50ms of a detection check, and datacenter ASNs (AWS, Hetzner, OVH, Vultr, Linode, DigitalOcean) are pre-flagged. The most aggressive systems challenge every request from these ASNs.

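Residential proxies: consumer ISP ASNs (Comcast, BT, Vodafone broadband). Low bot score by default.
Mobile proxies: mobile carrier ASNs. Highest trust, because carrier-grade NAT (CGNAT) puts many real subscribers behind one IP, so blocking that IP blocks real customers.
ISP / static residential: IPs registered to a residential ASN but hosted in a datacenter, so you get residential trust at datacenter speed.
Datacenter proxies: use only on unprotected targets.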
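
To see why the ASN check is so cheap for the defender, here is a minimal sketch of the lookup. The prefix table is hypothetical: it uses reserved documentation ranges as stand-ins for real allocations, and `classify_ip`, `PREFIX_TO_ASN`, and `DATACENTER_ASNS` are illustrative names, not any vendor's API. A real system would query a full BGP-derived ASN database.

```python
import ipaddress

# Hypothetical prefix-to-ASN table. The prefixes are reserved documentation
# ranges (RFC 5737), not real allocations.
PREFIX_TO_ASN = {
    ipaddress.ip_network("192.0.2.0/24"): 16509,     # stand-in for an AWS range
    ipaddress.ip_network("198.51.100.0/24"): 24940,  # stand-in for a Hetzner range
    ipaddress.ip_network("203.0.113.0/24"): 7922,    # stand-in for a Comcast range
}

# ASNs treated as datacenter in this sketch (AS16509 = Amazon, AS24940 = Hetzner).
DATACENTER_ASNS = {16509, 24940}


def classify_ip(ip: str) -> str:
    """Return 'datacenter', 'residential_or_other', or 'unknown' for an IP."""
    addr = ipaddress.ip_address(ip)
    for network, asn in PREFIX_TO_ASN.items():
        if addr in network:
            return "datacenter" if asn in DATACENTER_ASNS else "residential_or_other"
    return "unknown"


print(classify_ip("192.0.2.10"))
print(classify_ip("203.0.113.5"))
```

The decision needs nothing but the source address, which is why a datacenter IP can be flagged before any other layer is evaluated.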

Pricing: Budget Residential $1.75/GB · Premium Residential $2.75/GB · LTE Mobile $2/IP/month.

Layer 2: TLS Fingerprint (JA3 / JA4)

Every TLS client (Chrome, Firefox, curl, Python requests, Go's net/http) sends a ClientHello message whose contents are characteristic of its TLS library. The ordered list of cipher suites, extensions, and ALPN protocols becomes a fingerprint (JA3 historically; JA4, released in 2023, is the current standard). Anti-bot systems hash this fingerprint and check it against a list of known clients. Python requests has a JA4 hash that screams "Python script."
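
To make the fingerprint idea concrete, the sketch below builds a JA3-style string and hash from ClientHello fields, following the published JA3 recipe: decimal values joined by dashes, fields joined by commas, MD5 of the result, GREASE values dropped. The field values are illustrative, not captured from a real client. JA4 uses a different, more structured format, so treat this only as an illustration of the principle.

```python
import hashlib

# GREASE values (RFC 8701): 0x0A0A, 0x1A1A, ..., 0xFAFA. JA3 ignores them.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3(version, ciphers, extensions, curves, point_formats):
    """Return (ja3_string, md5_hex) for the given ClientHello fields."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    ja3_string = ",".join([
        str(version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])
    return ja3_string, hashlib.md5(ja3_string.encode("ascii")).hexdigest()


# Two clients offering the same cipher suites in a different order.
client_a = ja3(771, [0x0A0A, 4865, 4866, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])
client_b = ja3(771, [4866, 4865, 4867], [0, 23, 65281, 10, 11], [29, 23, 24], [0])

print(client_a[0])
print(client_a[1] != client_b[1])
```

Because JA3 hashes the fields in the order sent, two TLS stacks that support the same ciphers still produce different hashes. That is why changing the User-Agent alone does nothing at this layer.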

Fix with TLS impersonation:

# curl_cffi — drop-in requests replacement with Chrome TLS
from curl_cffi import requests

r = requests.get("https://target.com", impersonate="chrome131")
print(r.status_code, r.text[:200])

# Other impersonation targets depend on your curl_cffi version, e.g. firefox133, edge101, safari17_0

For non-Python: curl-impersonate (CLI binary), azuretls-client (Go), tls-client (Go, with Node bindings). When using a real browser (Playwright) you get Chrome's real TLS for free.

Layer 3: HTTP Headers

A real Chrome 131 sends about a dozen headers in a specific order with specific values. Python requests sends 4 by default (User-Agent, Accept-Encoding, Accept, Connection), with values that scream "bot." Match Chrome exactly:

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Sec-Ch-Ua": '"Chromium";v="131", "Google Chrome";v="131", "Not.A/Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
}

For subsequent requests within a session, change Sec-Fetch-Site to same-origin and add the actual Referer. Header order matters: Python's requests merges your headers into its session defaults instead of keeping your order, so use curl_cffi or httpx with explicit ordering.
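
The per-request adjustment can be sketched as a small helper. `follow_up_headers` is a hypothetical name, and the logic is simplified: it only distinguishes same-origin from cross-site (no same-site case), appends Referer at the end rather than in Chrome's exact position, and trims cross-origin referrers to the origin on the assumption that Chrome's default referrer policy applies.

```python
from urllib.parse import urlsplit


def _origin(url: str) -> tuple:
    """Scheme, host, and port identify an origin."""
    parts = urlsplit(url)
    return (parts.scheme, parts.hostname, parts.port)


def follow_up_headers(base: dict, previous_url: str, next_url: str) -> dict:
    """Copy `base`, then set Sec-Fetch-Site and Referer for an in-session navigation."""
    headers = dict(base)  # dicts keep insertion order, so the base order survives
    if _origin(previous_url) == _origin(next_url):
        headers["Sec-Fetch-Site"] = "same-origin"
        headers["Referer"] = previous_url
    else:
        prev = urlsplit(previous_url)
        headers["Sec-Fetch-Site"] = "cross-site"
        # Cross-origin navigations send only the origin as the referrer.
        headers["Referer"] = f"{prev.scheme}://{prev.netloc}/"
    return headers


first_request = {"User-Agent": "Mozilla/5.0", "Sec-Fetch-Site": "none"}
second_request = follow_up_headers(
    first_request, "https://target.com/", "https://target.com/category/shoes"
)
print(second_request)
```

Updating an existing key keeps its position in the dict, so only the new Referer entry is appended.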

Layer 4: JS Runtime / Browser Fingerprint

When the site runs JavaScript, the browser environment exposes ~30 signals: navigator.webdriver, plugin list, language, screen resolution, WebGL renderer, AudioContext fingerprint, canvas fingerprint, font list, timezone offset, hardware concurrency, deviceMemory. Bare Puppeteer/Playwright leaks "I'm a headless Chrome" in seconds:

navigator.webdriver === true
No plugin array
WebGL renderer is "SwiftShader"
Permissions API returns "denied" for notifications
Default viewport (1280x720 in Playwright, 800x600 in Puppeteer; real desktop users are typically 1920x1080+)

Fix with stealth libraries:

# Patchright: a patched, drop-in replacement for Playwright that removes common automation leaks
from patchright.sync_api import sync_playwright

with sync_playwright() as p:
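    browser = p.chromium.launch(
        headless=True,
        # Placeholder proxy: substitute your provider's host and credentials.
        proxy={
            "server": "http://PROXY_HOST:8000",
            "username": "USER",
            "password": "PASS",
        },
    )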
    page = browser.new_page(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
    )
    page.goto("https://target.com")
    html = page.content()
    browser.close()

Also consider: undetected-chromedriver (Selenium), nodriver (Python, the successor to undetected-chromedriver), Camoufox (Firefox-based, harder to fingerprint).
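
One way to confirm a stealth setup actually removed the tells listed above is to dump a few signals from the page (for example with `page.evaluate`) and run them through a checker. The sketch below is only the checking half: the `signals` dict and its keys (`webdriver`, `plugins`, `webgl_renderer`, `viewport`) are a hypothetical format, and the rules cover just a handful of the ~30 signals real systems score.

```python
# Default viewports that bare automation tools ship with (Playwright, Puppeteer).
DEFAULT_VIEWPORTS = {(1280, 720), (800, 600)}


def headless_tells(signals: dict) -> list:
    """Return the headless tells found in a dict of collected browser signals."""
    tells = []
    if signals.get("webdriver") is True:
        tells.append("navigator.webdriver is true")
    if not signals.get("plugins"):
        tells.append("empty plugin list")
    if "swiftshader" in str(signals.get("webgl_renderer", "")).lower():
        tells.append("software WebGL renderer (SwiftShader)")
    if tuple(signals.get("viewport", ())) in DEFAULT_VIEWPORTS:
        tells.append("automation-default viewport")
    return tells


bare = {
    "webdriver": True,
    "plugins": [],
    "webgl_renderer": "Google SwiftShader",
    "viewport": (1280, 720),
}
for tell in headless_tells(bare):
    print("FLAG:", tell)
```

An empty list does not mean you are undetectable. It only means these specific tells are gone.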

Layer 5: Behavior

Modern systems profile your behavior over a session: mouse movements, scroll patterns, key timings, click sequences, page-load reactions. Specifically:

Pacing. 500ms–3,000ms between requests. Random jitter is essential; uniform timing = bot.
Mouse / scroll. Real users move the mouse before clicking, scroll a bit, sometimes pause. Headless skips all of this.
Organic click paths. Real users don't deep-link to product pages with no referrer; they navigate home → category → product. Match this.
Cookie persistence. Real browsers have cookies, history, localStorage. Use Playwright's storage_state to persist across runs.
Time-of-day. Real shoppers don't buy mechanical keyboards at 4 AM; spread your crawl into business hours of the IP's geography.
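
The time-of-day rule can be enforced with a small gate in front of the crawl loop. This is a minimal sketch: `in_local_business_hours` is a hypothetical helper, the 9-to-18 window is an assumed definition of "business hours," and the fixed UTC-4 offset stands in for the proxy's real timezone.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def in_local_business_hours(now: datetime, local_tz: tzinfo,
                            start_hour: int = 9, end_hour: int = 18) -> bool:
    """True if timezone-aware `now` falls in [start_hour, end_hour) in `local_tz`."""
    local = now.astimezone(local_tz)
    return start_hour <= local.hour < end_hour


# Fixed UTC-4 offset as a stand-in for US Eastern daylight time.
eastern = timezone(timedelta(hours=-4))
morning = datetime(2026, 5, 18, 14, 0, tzinfo=timezone.utc)  # 10:00 local
night = datetime(2026, 5, 18, 8, 0, tzinfo=timezone.utc)     # 04:00 local

print(in_local_business_hours(morning, eastern))
print(in_local_business_hours(night, eastern))
```

In practice you would pass `zoneinfo.ZoneInfo("America/New_York")` (or whichever zone matches the proxy's exit location) instead of a fixed offset, so daylight saving time is handled for you.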

# Human-like pacing
import random
import time

time.sleep(random.uniform(1.5, 4.0))

# Synthetic mouse + scroll with Playwright; `page` is the Page object from the Layer 4 example
page.mouse.move(random.randint(100, 1800), random.randint(100, 900))
page.mouse.wheel(0, random.randint(200, 600))
time.sleep(random.uniform(0.5, 1.5))
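
A bare `time.sleep` paces one worker. If several workers share a proxy pool, the jitter has to be applied to the pool as a whole, in line with the FAQ's advice to cap throughput across the pool rather than per IP. Below is a minimal sketch of a shared pacer. `JitteredPacer` is a hypothetical class, and the injectable `clock` and `sleep` parameters exist only to make it testable.

```python
import random
import threading
import time


class JitteredPacer:
    """Hands out request slots spaced by a random interval, shared by all workers."""

    def __init__(self, min_delay=0.5, max_delay=3.0,
                 clock=time.monotonic, sleep=time.sleep):
        self._min = min_delay
        self._max = max_delay
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_slot = 0.0

    def wait(self) -> float:
        """Block until this caller's slot arrives; return the seconds slept."""
        with self._lock:
            now = self._clock()
            slot = max(now, self._next_slot)
            # Reserve the following slot a random interval after this one.
            self._next_slot = slot + random.uniform(self._min, self._max)
        delay = slot - now
        if delay > 0:
            self._sleep(delay)
        return delay
```

Each worker calls `pacer.wait()` on one shared instance before sending a request. The lock is held only while reserving a slot, so workers sleep in parallel but their requests still go out one jittered interval apart.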

Detection-Evasion Checklist

Layer	Minimum	Recommended
IP	Residential proxy	Mobile or residential with sticky session
TLS	curl_cffi impersonate	Real Playwright TLS
Headers	Full Chrome header set, correct order	Same + correct Referer + cookies
JS env	Patchright / nodriver	Same + storage_state + locale match
Behavior	Random sleeps	Mouse moves + scrolls + organic paths

Testing Your Stealth

bot.sannysoft.com — comprehensive detection test.
CreepJS — fingerprint deep-dive.
Cloudflare-protected dummy: https://nopecha.com/demo/cloudflare
Run your scraper against your own honeypot site that logs all 30 fingerprint signals; iterate until none flag.

When You Get Blocked Anyway

Identify which layer caught you. Cloudflare returns specific error codes (1006: IP banned, 1015: rate limited, 1020: blocked by a firewall rule); DataDome returns 403 with a specific HTML page; Akamai returns 400 with a reference ID.
Test the layer in isolation. Same proxy + curl_cffi without browser? Same browser without proxy? Same browser+proxy with delays?
Wait out the IP ban. Usually 15 min–24 hours. Don't retry from the same IP in a tight loop.
Switch IP type. Datacenter blocked → residential; residential challenged → mobile.
Add a CAPTCHA solver if the challenges persist.
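
The first step, working out who blocked you, can be partly automated. The sketch below is a rough heuristic, not a definitive classifier. `diagnose_block` is a hypothetical helper, and the body markers it looks for (an "error code: NNNN" string for Cloudflare, a `captcha-delivery.com` reference for DataDome, a "Reference #" string for Akamai) are assumptions about what those block pages typically contain, so verify them against the responses you actually receive.

```python
import re

# Meanings of the Cloudflare codes mentioned above.
CLOUDFLARE_CODES = {
    "1006": "IP banned",
    "1015": "rate limited",
    "1020": "blocked by a firewall rule",
}

_CF_PATTERN = re.compile(r"\b(?:error code:?|error)\s*(1006|1015|1020)\b", re.IGNORECASE)


def diagnose_block(status: int, body: str) -> str:
    """Guess which system produced a block response from its status and body."""
    match = _CF_PATTERN.search(body)
    if match:
        code = match.group(1)
        return f"cloudflare {code}: {CLOUDFLARE_CODES[code]}"
    if status == 403 and "captcha-delivery.com" in body:
        return "datadome challenge page"
    if "Reference #" in body:
        return "akamai denial with reference ID"
    return "unknown"


print(diagnose_block(403, "error code: 1020"))
```

Log the result next to the proxy and client you used. It tells you which of the isolation tests above to run first.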