spyderproxy

How to Avoid Detection When Scraping (2026)

Daniel K. | Published Mon May 18 2026

Quick playbook: Detection in 2026 stacks across five layers, and a modern anti-bot system (Cloudflare, DataDome, PerimeterX, Akamai, Kasada) is likely to catch you if you skip any one of them.

  1. IP reputation: use residential or mobile proxies, not datacenter.
  2. TLS fingerprint (JA3/JA4): impersonate a real Chrome with curl_cffi, or drive a real browser.
  3. HTTP headers: send the full browser header set in the correct order, with a real User-Agent and a real Accept-Language.
  4. JS runtime / browser fingerprint: use Patchright or undetected-playwright; never bare Playwright with defaults.
  5. Behavior: human pacing, organic click paths, cookie persistence.

Who's Looking

| System | Primary signal | Catches |
|---|---|---|
| Cloudflare WAF / Bot Fight Mode | IP rep + JA4 + JS challenge | ~80% of automation |
| DataDome | Fingerprint + behavior | Aggressive on e-commerce, news |
| PerimeterX (HUMAN) | JS challenge + telemetry | Sneakers, ticketing |
| Akamai Bot Manager | Multi-layer scoring | Enterprise sites, airlines |
| Kasada | JS challenge + Web Crypto | Sneakers, ticketing |
| Imperva (formerly Incapsula) | Behavioral + signature | Financial, e-commerce |
| F5 Distributed Cloud Bot Defense | Behavior + fingerprint | Banking, retail |

Layer 1: IP Reputation

Anti-bot systems classify IPs by ASN, geolocation, and historical bot activity. An ASN lookup is one of the first and cheapest checks in the pipeline, and datacenter ASNs (AWS, Hetzner, OVH, Vultr, Linode, DigitalOcean) are pre-flagged. The most aggressive systems challenge any request from these ASNs.

  • Residential proxies — consumer ISP ASNs (Comcast, BT, Vodafone broadband). Low bot score by default.
  • Mobile proxies — mobile carrier ASNs. Highest trust due to CGNAT.
  • ISP / Static Residential — residential ASN at datacenter speed.
  • Datacenter proxies — only on unprotected targets.
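To make the ASN check concrete, here is a minimal sketch of the kind of lookup a detector might run on the ASN organization name. The keyword list and the function name `looks_like_datacenter` are hypothetical; real systems rely on commercial IP-intelligence databases rather than substring matching.

```python
# Hypothetical keyword list built from the hosting providers named above.
# Real anti-bot systems use full ASN/IP-intelligence databases instead.
DATACENTER_KEYWORDS = ("amazon", "hetzner", "ovh", "vultr", "linode", "digitalocean")


def looks_like_datacenter(asn_org: str) -> bool:
    """Return True if the ASN organization name matches a known hosting provider."""
    name = asn_org.lower()
    return any(keyword in name for keyword in DATACENTER_KEYWORDS)


def initial_verdict(asn_org: str) -> str:
    """Very rough first-pass decision: challenge hosting ASNs, let others through."""
    return "challenge" if looks_like_datacenter(asn_org) else "allow"
```

Running your own exit IPs through a check like this before a crawl is a cheap way to confirm that a pool sold as "residential" does not resolve to a hosting provider.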

Pricing: Budget Residential $1.75/GB · Premium Residential $2.75/GB · LTE Mobile $2/IP/month.

Layer 2: TLS Fingerprint (JA3 / JA4)

Every TLS client (Chrome, Firefox, curl, Python requests, Go's net/http) sends a ClientHello message whose contents are characteristic of that client. The ordered list of cipher suites, extensions, and ALPN protocols becomes a fingerprint (JA3 historically; JA4, introduced in 2023, is the newer standard). Anti-bot systems hash this fingerprint and check it against a list of known clients. The hash produced by Python requests is well known and immediately marks the traffic as a Python script.
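As an illustration of how a ClientHello turns into a fingerprint, the sketch below computes a JA3-style hash: five comma-separated fields (TLS version, cipher suites, extensions, elliptic curves, point formats), each a dash-joined list of decimal values, hashed with MD5. JA3 is shown because its format is simple; JA4 uses a different layout and sorts ciphers and extensions before hashing. The numeric values in the usage note are illustrative, not captured from a real client.

```python
import hashlib

# GREASE values (RFC 8701): randomized placeholders that JA3 leaves out.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the raw JA3 string from the decimal fields of a ClientHello."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join(
        [str(version), join(ciphers), join(extensions), join(curves), join(point_formats)]
    )


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """MD5 of the JA3 string; this is the value compared against known clients."""
    raw = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(raw.encode("ascii")).hexdigest()
```

For example, `ja3_string(771, [4865, 4866], [0, 23], [29, 23], [0])` yields `771,4865-4866,0-23,29-23,0`. Because the lists are order-sensitive, two clients offering the same ciphers in a different order hash differently, which is why swapping the User-Agent alone does not help.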

Fix with TLS impersonation:

# curl_cffi — drop-in requests replacement with Chrome TLS
from curl_cffi import requests

r = requests.get("https://target.com", impersonate="chrome131")
print(r.status_code, r.text[:200])

# Other impersonation targets include firefox133, safari17_0, and edge101;
# the supported list depends on your installed curl_cffi version.

For non-Python: curl-impersonate, azuretls-client (Go), tls-client (Node). When using a real browser (Playwright) you get Chrome's real TLS for free.

Layer 3: HTTP Headers

A real Chrome 131 sends a dozen or so headers on a navigation request, in a specific order and with specific values. Python requests sends 4 with default values that immediately identify it as a bot. Match Chrome exactly:

HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,"
              "image/avif,image/webp,image/apng,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br, zstd",
    "Sec-Ch-Ua": '"Chromium";v="131", "Google Chrome";v="131", "Not.A/Brand";v="24"',
    "Sec-Ch-Ua-Mobile": "?0",
    "Sec-Ch-Ua-Platform": '"Windows"',
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
    "Upgrade-Insecure-Requests": "1",
}

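For subsequent requests within a session, change Sec-Fetch-Site to same-origin and add the actual Referer. Header order matters: Python's requests merges your headers into its own session defaults (User-Agent, Accept-Encoding, Accept, Connection), so the order on the wire does not match Chrome's. Use curl_cffi or httpx with explicit ordering.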
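The follow-up adjustment can be sketched as a small helper that derives in-session headers from a first-request header set. The name `follow_up_headers` and the trimmed `BASE_HEADERS` dict are hypothetical and only here to keep the example self-contained. Referer is simply appended at the end, which is not claimed to match Chrome's real header position.

```python
# Trimmed stand-in for the full first-request header set shown above.
BASE_HEADERS = {
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Dest": "document",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-User": "?1",
}


def follow_up_headers(base: dict, referer: str) -> dict:
    """Return a copy of `base` adjusted for a same-site navigation."""
    headers = dict(base)  # copy, so the first-request headers stay untouched
    headers["Sec-Fetch-Site"] = "same-origin"  # we arrived from the same site
    headers["Referer"] = referer  # the page the "click" came from
    return headers
```

A typical call would be `follow_up_headers(BASE_HEADERS, "https://target.com/")` before requesting a category or product page.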

Layer 4: JS Runtime / Browser Fingerprint

When the site runs JavaScript, the browser environment exposes ~30 signals, including navigator.webdriver, plugin list, language, screen resolution, WebGL renderer, AudioContext fingerprint, canvas fingerprint, font list, timezone offset, hardware concurrency, and deviceMemory. Bare Puppeteer/Playwright reveals that it is headless Chrome within seconds:

  • navigator.webdriver === true
  • No plugin array
  • WebGL renderer is "SwiftShader"
  • Permissions API returns "denied" for notifications
  • Small default viewport (1280x720 in Playwright, 800x600 in Puppeteer; real users are 1920x1080+)
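One way to check your own setup against the tells listed above is to collect them in the page and audit the result in Python. The sketch below is an assumption-laden example: `PROBE_JS`, `headless_tells`, and the `DEFAULT_VIEWPORTS` set are hypothetical names and values, and real detectors inspect far more than these five signals.

```python
# JavaScript probe intended to be run in the page via an evaluate call.
PROBE_JS = """
() => {
  const gl = document.createElement('canvas').getContext('webgl');
  const ext = gl && gl.getExtension('WEBGL_debug_renderer_info');
  return {
    webdriver: navigator.webdriver === true,
    pluginCount: navigator.plugins.length,
    webglRenderer: ext ? gl.getParameter(ext.UNMASKED_RENDERER_WEBGL) : null,
    notificationPermission:
      typeof Notification !== 'undefined' ? Notification.permission : null,
    viewport: [window.innerWidth, window.innerHeight],
  };
}
"""

# Assumed set of viewport sizes that automation setups commonly default to.
DEFAULT_VIEWPORTS = {(800, 600), (1024, 768), (1280, 720)}


def headless_tells(signals: dict) -> list:
    """Return a human-readable list of headless giveaways found in `signals`."""
    tells = []
    if signals.get("webdriver"):
        tells.append("navigator.webdriver is true")
    if signals.get("pluginCount", 0) == 0:
        tells.append("empty plugin array")
    if "swiftshader" in (signals.get("webglRenderer") or "").lower():
        tells.append("SwiftShader WebGL renderer")
    if signals.get("notificationPermission") == "denied":
        tells.append("notifications permission is denied")
    if tuple(signals.get("viewport") or ()) in DEFAULT_VIEWPORTS:
        tells.append("default automation viewport")
    return tells
```

With Playwright, the probe would be run as `signals = page.evaluate(PROBE_JS)` and the result passed to `headless_tells(signals)`; an empty list means none of these particular checks fired, not that the browser is undetectable.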

Fix with stealth libraries:

# Patchright: a patched, drop-in replacement for Playwright
from patchright.sync_api import sync_playwright

with sync_playwright() as p:
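    browser = p.chromium.launch(
        headless=True,
        # PROXY_HOST, USER, and PASS are placeholders for your proxy endpoint and credentials
        proxy={"server": "http://PROXY_HOST:8000", "username": "USER", "password": "PASS"},
    )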
    page = browser.new_page(
        viewport={"width": 1920, "height": 1080},
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
        locale="en-US",
        timezone_id="America/New_York",
    )
    page.goto("https://target.com")
    html = page.content()
    browser.close()

Also consider: undetected-chromedriver (Selenium), nodriver (Python, the modern UC successor), Camoufox (Firefox-based, harder to fingerprint).

Layer 5: Behavior

Modern systems profile your behavior over a session: mouse movements, scroll patterns, key timings, click sequences, page-load reactions. Specifically:

  • Pacing. 500ms–3,000ms between requests. Random jitter is essential; uniform timing = bot.
  • Mouse / scroll. Real users move the mouse before clicking, scroll a bit, sometimes pause. Headless skips all of this.
  • Organic click paths. Real users don't deep-link to product pages with no referrer; they navigate home → category → product. Match this.
  • Cookie persistence. Real browsers have cookies, history, localStorage. Use Playwright's storage_state to persist across runs.
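  • Time-of-day. Real shoppers don't buy mechanical keyboards at 4 AM; spread your crawl across business hours in the IP's geography.

# Human-like pacing: jittered delay within the 500ms–3,000ms range above
import random, time
time.sleep(random.uniform(0.5, 3.0))

# Synthetic mouse + scroll with Playwright.
# `page` is an open Playwright page, as created in the Layer 4 example.
# `steps` makes the pointer travel along intermediate points instead of teleporting.
page.mouse.move(random.randint(100, 1800), random.randint(100, 900), steps=random.randint(10, 25))
page.mouse.wheel(0, random.randint(200, 600))
time.sleep(random.uniform(0.5, 1.5))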
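The time-of-day point can be sketched as a gate that checks the local hour in the proxy's timezone before crawling. The function names and the 9:00 to 22:00 window are assumptions chosen for illustration; tune the window to the audience of the target site.

```python
from datetime import datetime, timedelta, timezone


def in_active_hours(now_utc, tz, start_hour=9, end_hour=22):
    """True if the local time in `tz` falls inside [start_hour, end_hour)."""
    local = now_utc.astimezone(tz)
    return start_hour <= local.hour < end_hour


def seconds_until_active(now_utc, tz, start_hour=9, end_hour=22):
    """Seconds to wait until the next active window opens (0.0 if already open)."""
    if in_active_hours(now_utc, tz, start_hour, end_hour):
        return 0.0
    local = now_utc.astimezone(tz)
    next_start = local.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    if local.hour >= end_hour:
        next_start += timedelta(days=1)  # window already closed today
    # Compare in UTC so the result stays correct across DST changes.
    return (next_start.astimezone(timezone.utc) - now_utc).total_seconds()
```

`now_utc` must be timezone-aware, for example `datetime.now(timezone.utc)`, and `tz` can be any tzinfo object, such as `zoneinfo.ZoneInfo("America/New_York")` to match the `timezone_id` used in the browser context.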

Detection-Evasion Checklist

| Layer | Minimum | Recommended |
|---|---|---|
| IP | Residential proxy | Mobile or residential with sticky session |
| TLS | curl_cffi impersonate | Real Playwright TLS |
| Headers | Full Chrome header set, correct order | Same + correct Referer + cookies |
| JS env | Patchright / nodriver | Same + storage_state + locale match |
| Behavior | Random sleeps | Mouse moves + scrolls + organic paths |

Testing Your Stealth

  • bot.sannysoft.com — comprehensive detection test.
  • CreepJS — fingerprint deep-dive.
  • Cloudflare-protected dummy: https://nopecha.com/demo/cloudflare
  • Run your scraper against your own honeypot site that logs all 30 fingerprint signals; iterate until none flag.

When You Get Blocked Anyway

  1. Identify which layer caught you. Cloudflare returns specific error codes (1006, 1015, 1020); DataDome returns 403 with a specific HTML page; Akamai returns 400 with a reference ID.
  2. Test the layer in isolation. Same proxy + curl_cffi without browser? Same browser without proxy? Same browser+proxy with delays?
  3. Wait out the IP ban. Usually 15 min–24 hours. Don't retry from the same IP in a tight loop.
  4. Switch IP type. Datacenter blocked → residential; residential challenged → mobile.
  5. Add a CAPTCHA solver if blocks persist. See the CAPTCHA bypass guide.
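Step 3 (waiting out a ban instead of retrying in a tight loop) can be sketched as a small cooldown tracker for a proxy pool. `ProxyCooldown` is a hypothetical helper, not part of any library, and the 15-minute default is the lower bound of the ban range mentioned above.

```python
import time


class ProxyCooldown:
    """Tracks which proxy IPs are resting after a block."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the logic can be tested without sleeping
        self._banned_until = {}

    def ban(self, ip, seconds=15 * 60):
        """Mark `ip` as unusable for `seconds` from now."""
        self._banned_until[ip] = self._clock() + seconds

    def is_available(self, ip):
        """True if `ip` was never banned or its cooldown has expired."""
        return self._clock() >= self._banned_until.get(ip, float("-inf"))

    def pick(self, pool):
        """Return the first available IP in `pool`, or None if all are cooling down."""
        for ip in pool:
            if self.is_available(ip):
                return ip
        return None
```

In a crawl loop, call `ban(ip)` whenever a response looks like a block, and use `pick(pool)` to choose the next exit IP; if it returns None, pause the job rather than cycling through banned addresses.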

Related: How to bypass CAPTCHAs · FlareSolverr guide · Cloudscraper tutorial · Browser fingerprinting.