Headless Browser in Python: Playwright Tutorial (2026)

Alex R. | Published Sun May 10 2026

Quick verdict: For headless browsers in Python in 2026, default to Playwright — cleanest API, best maintained, supports Chromium + Firefox + WebKit. Selenium remains the legacy choice for large existing codebases. Pyppeteer (a Python port of Node's Puppeteer) is largely abandoned — skip it. This tutorial covers installing Playwright, a first scrape, proxies, stealth mode, the async pattern, and the gotchas you will hit.

Install Playwright

pip install playwright
python -m playwright install chromium

The second command downloads the actual browser binary (~150 MB for Chromium). For Firefox or WebKit: python -m playwright install firefox or python -m playwright install webkit.

On Linux servers, also install system dependencies:

python -m playwright install-deps chromium

First Scrape (Sync API)

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    print(page.content()[:500])
    browser.close()

That is a working headless scrape in eight lines. page.content() returns the full HTML after JavaScript execution completes; page.title() returns the document title.

Async API (Higher Performance)

import asyncio
from playwright.async_api import async_playwright

async def scrape(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        html = await page.content()
        await browser.close()
        return title, html

async def main():
    urls = ["https://example.com", "https://playwright.dev"]
    results = await asyncio.gather(*(scrape(u) for u in urls))
    for t, h in results:
        print(t)

asyncio.run(main())

This pattern spawns a fresh browser per URL — simple but heavy. For higher concurrency with one browser and multiple pages, see Concurrent Pages, One Browser below.

Waiting for Content

The most common bug: scraping before JavaScript finishes loading content. Three wait strategies:

# 1. Wait for a specific selector
page.goto("https://target.com")
page.wait_for_selector("div.products-loaded")
products = page.query_selector_all("div.product")

# 2. Wait for network idle (no requests for 500ms)
page.goto("https://target.com", wait_until="networkidle")

# 3. Wait for a fixed time (last resort)
page.wait_for_timeout(3000)

Prefer (1) over (2) over (3). Selector waits are most reliable; network idle works on JS-heavy sites; fixed sleeps are fragile.
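A fourth option, when no single selector signals readiness, is page.wait_for_function, which polls a JavaScript predicate inside the page until it returns truthy. A sketch — the URL, selector, and threshold are placeholders, and the deferred import keeps the helper importable without Playwright installed:

```python
def min_count_predicate(selector: str, n: int) -> str:
    """Build a page-side JS predicate: true once at least n elements match."""
    return f"() => document.querySelectorAll('{selector}').length >= {n}"

def scrape_when_ready(url: str, selector: str, n: int) -> str:
    from playwright.sync_api import sync_playwright  # deferred import
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Polls the predicate until it returns truthy (or times out)
        page.wait_for_function(min_count_predicate(selector, n))
        html = page.content()
        browser.close()
        return html

# Usage (hypothetical target):
# scrape_when_ready("https://target.com/products", "div.product", 10)
```

This is handy for infinite-scroll pages, where "loaded" means "enough items rendered" rather than "this one element exists".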

Extracting Data

page.goto("https://target.com/products")
page.wait_for_selector("li.product")

# Multiple elements
cards = page.query_selector_all("li.product")
products = []
for c in cards:
    title = c.query_selector("h3").inner_text()
    price = c.query_selector("span.price").inner_text()
    url = c.query_selector("a").get_attribute("href")
    products.append({"title": title, "price": price, "url": url})

# Single element
h1 = page.query_selector("h1.page-title").inner_text()

# Multiple text contents (faster)
all_titles = page.locator("h3.title").all_text_contents()

For CSS selectors, see the CSS selector cheat sheet. Playwright locators can also filter by text: page.locator("button", has_text="Submit").
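The newer locator API is an auto-waiting alternative to query_selector_all: a locator resolves lazily, and .all() splits it into one sub-locator per match. A sketch of the same product loop, plus a small price parser for strings like "$1,299.99" — the selectors and the price format are assumptions:

```python
import re

def parse_price(text: str) -> float:
    """Pull a numeric price out of strings like '$1,299.99' (assumed format)."""
    cleaned = re.sub(r"[^\d.]", "", text)
    return float(cleaned) if cleaned else 0.0

def extract_products(page) -> list:
    # Locator equivalent of the query_selector_all loop above
    products = []
    for card in page.locator("li.product").all():
        products.append({
            "title": card.locator("h3").inner_text(),
            "price": parse_price(card.locator("span.price").inner_text()),
            "url": card.locator("a").first.get_attribute("href"),
        })
    return products
```

Unlike query_selector, a locator retries until its element appears, which removes a whole class of "element not found yet" races.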

Proxy Configuration

browser = p.chromium.launch(
    headless=True,
    proxy={
        "server": "http://gw.spyderproxy.com:8000",
        "username": "YOUR_USER",
        "password": "YOUR_PASS",
    },
)

Proxy is set at browser launch — all requests from that browser route through it. For rotating IPs, launch a fresh browser per scrape or use sticky-session syntax:

import random

def fresh_browser(p):
    sid = random.randint(0, 100000)
    return p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://gw.spyderproxy.com:8000",
            "username": f"YOUR_USER-session-{sid}",
            "password": "YOUR_PASS",
        },
    )

Each session-{sid} value gets a sticky-session IP that stays consistent for up to 8 hours. Premium Residential ($2.75/GB) or LTE Mobile ($2/IP) is recommended for sites hardened enough to require a headless browser in the first place.
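Relaunching a whole browser per IP is heavy. Playwright also accepts a proxy per context (browser.new_context(proxy=...)), so one browser can hold several sticky sessions at once. A sketch using the same gateway syntax as above — the credentials are placeholders:

```python
import random

def session_proxy(user: str, password: str, sid=None) -> dict:
    """Proxy dict with a sticky-session username (gateway syntax from above)."""
    sid = random.randint(0, 100000) if sid is None else sid
    return {
        "server": "http://gw.spyderproxy.com:8000",
        "username": f"{user}-session-{sid}",
        "password": password,
    }

def fresh_context(browser, user: str, password: str):
    # One browser, many contexts, each on its own sticky-session IP.
    # Note: some Chromium setups require launching the browser with a
    # global proxy (e.g. {"server": "http://per-context"}) before
    # per-context proxies take effect.
    return browser.new_context(proxy=session_proxy(user, password))
```

Close each context when its scrape finishes; contexts are cheap compared to browser launches.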

Stealth Mode (Avoid Detection)

Default Playwright leaves several headless tells: navigator.webdriver is true, some browser features (such as the permissions API) behave differently from a real browser, and so on. Sites that fingerprint visitors will flag you. Install the community stealth plugin:

pip install playwright-stealth

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    stealth_sync(page)            # patches detection signals
    page.goto("https://target.com")
    print(page.title())
    browser.close()

Stealth handles roughly 80% of common detection checks. For Cloudflare Turnstile and DataDome you also need fresh residential IPs and slowed-down request pacing. See the FlareSolverr guide for the heaviest cases.
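If you prefer not to add a dependency, the single loudest tell can be patched by hand with page.add_init_script, which injects a script before any page code runs, on every navigation. A minimal sketch — this hides only navigator.webdriver, not the rest of the fingerprint:

```python
# Hide the navigator.webdriver flag; real stealth plugins patch far more.
WEBDRIVER_PATCH = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

def patched_page(context):
    page = context.new_page()
    # Runs before any page script, and again after each navigation
    page.add_init_script(WEBDRIVER_PATCH)
    return page
```

Treat this as a stopgap: fingerprinting scripts check dozens of signals, and a lone patch fails against anything serious.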

Screenshots and PDFs

# Full page screenshot
page.goto("https://target.com")
page.screenshot(path="screen.png", full_page=True)

# Just a specific element
el = page.query_selector("div.main-content")
el.screenshot(path="content.png")

# PDF (Chromium only)
page.pdf(path="page.pdf", format="A4")

Filling Forms

page.goto("https://example.com/login")
page.fill("input[name=\"username\"]", "alice")
page.fill("input[name=\"password\"]", "secret123")
page.click("button[type=\"submit\"]")
page.wait_for_url("**/dashboard")

The ** in wait_for_url is a glob: it matches any URL ending in /dashboard. Useful for asynchronous form submissions where the exact redirect URL varies.
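Logging in on every run is wasteful and raises flags. Playwright can persist a context's cookies and localStorage with context.storage_state and restore them later via new_context(storage_state=...). A sketch reusing the login flow above — the URLs, selectors, and the auth.json path are placeholders:

```python
def login_and_save(playwright, auth_path: str = "auth.json"):
    """Log in once and persist the session to disk."""
    browser = playwright.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")
    page.fill('input[name="username"]', "alice")
    page.fill('input[name="password"]', "secret123")
    page.click('button[type="submit"]')
    page.wait_for_url("**/dashboard")
    context.storage_state(path=auth_path)  # writes cookies + localStorage
    browser.close()

def logged_in_context(playwright, auth_path: str = "auth.json"):
    # Later runs skip the form entirely: restore the saved session
    browser = playwright.chromium.launch(headless=True)
    return browser.new_context(storage_state=auth_path)
```

Saved sessions eventually expire, so wrap the restore path in a check that re-runs the login when the dashboard fails to load.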

Concurrent Pages, One Browser

import asyncio
from playwright.async_api import async_playwright

async def scrape_url(browser, url):
    page = await browser.new_page()
    await page.goto(url)
    title = await page.title()
    await page.close()
    return title

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        urls = [f"https://example.com/p/{i}" for i in range(20)]
        sem = asyncio.Semaphore(5)        # max 5 concurrent
        async def bounded(u):
            async with sem:
                return await scrape_url(browser, u)
        results = await asyncio.gather(*(bounded(u) for u in urls))
        await browser.close()
    print(results)

asyncio.run(main())

One browser, multiple pages, capped concurrency. More memory-efficient than launching N browsers, but pages share cookies/storage — use browser.new_context() per scrape if you need full isolation.
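The full-isolation variant mentioned above looks like this: swap new_page for a throwaway context per task, leaving the semaphore scaffolding unchanged:

```python
import asyncio

async def scrape_isolated(browser, url: str) -> str:
    # Fresh context per task: no shared cookies, cache, or localStorage
    context = await browser.new_context()
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await context.close()  # also closes every page it owns
```

Contexts cost a little more than bare pages but far less than browsers, so this is usually the right middle ground for logged-out scraping at scale.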

Common Pitfalls

  • Forgetting to close the browser — leaks 200+ MB per orphaned process. Always wrap in with or try/finally.
  • Default timeouts too short — the default is 30s. For slow sites: page.goto(url, timeout=60000).
  • Headless detection — navigator.webdriver is true by default. Install playwright-stealth.
  • Mixing sync and async APIs — do not. Pick one per project.
  • Server has no display — pass headless=True (the default in 1.x). Avoid headless=False on CI.
  • Memory accumulation — restart browser every 100-200 pages to avoid leaks.
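The last pitfall, periodic restarts, can be sketched as a batching loop. The 150-page batch size is an assumption to tune, and the deferred import keeps the helper importable without Playwright installed:

```python
def batches(seq, n):
    """Split seq into chunks of at most n items."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

def scrape_all(urls, pages_per_browser: int = 150) -> list:
    from playwright.sync_api import sync_playwright  # deferred import
    titles = []
    with sync_playwright() as p:
        for batch in batches(urls, pages_per_browser):
            # Fresh browser per batch caps memory growth from leaks
            browser = p.chromium.launch(headless=True)
            try:
                for url in batch:
                    page = browser.new_page()
                    page.goto(url)
                    titles.append(page.title())
                    page.close()
            finally:
                browser.close()  # frees the browser process's memory
    return titles
```

The try/finally also covers the first pitfall: a crash mid-batch still closes that batch's browser.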

When to Use Selenium Instead

  • Existing Selenium test suite you do not want to rewrite
  • Need browser support Playwright lacks (very old Safari, Internet Explorer)
  • Specific Selenium features (Grid, Selenium IDE recording)
  • Your team already knows Selenium and the API differences are not worth retraining

For greenfield Python projects in 2026, Playwright wins. Selenium's only remaining advantage is the install base.

Related: What is a headless browser, Puppeteer vs Playwright vs Selenium, Cheerio vs Puppeteer.