Headless Browser in Python: Playwright Tutorial (2026)

Alex R. | Published Sun May 10 2026

Quick verdict: For headless browsers in Python in 2026, default to Playwright — cleanest API, best maintained, supports Chromium + Firefox + WebKit. Selenium remains the legacy choice for large existing codebases. Pyppeteer (a Python port of Node's Puppeteer) is largely abandoned — skip it. This tutorial covers installing Playwright, a first scrape, proxies, stealth mode, the async pattern, and the gotchas you will hit.

Install Playwright

pip install playwright
python -m playwright install chromium

The second command downloads the actual browser binary (~150 MB for Chromium). For Firefox or WebKit: python -m playwright install firefox or python -m playwright install webkit.

On Linux servers, also install system dependencies:

python -m playwright install-deps chromium

First Scrape (Sync API)

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    print(page.content()[:500])
    browser.close()

That is a working headless scrape in eight lines. page.content() returns the full HTML after JavaScript execution completes; page.title() returns the document title.

Async API (Higher Performance)

import asyncio
from playwright.async_api import async_playwright

async def scrape(url):
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        html = await page.content()
        await browser.close()
        return title, html

async def main():
    urls = ["https://example.com", "https://playwright.dev"]
    results = await asyncio.gather(*(scrape(u) for u in urls))
    for t, h in results:
        print(t)

asyncio.run(main())

This pattern spawns a fresh browser per URL — simple but heavy. For higher concurrency with one browser and multiple pages, see Concurrent Pages, One Browser below.

Waiting for Content

The most common bug: scraping before JavaScript finishes loading content. Three wait strategies:

# 1. Wait for a specific selector
page.goto("https://target.com")
page.wait_for_selector("div.products-loaded")
products = page.query_selector_all("div.product")

# 2. Wait for network idle (no requests for 500ms)
page.goto("https://target.com", wait_until="networkidle")

# 3. Wait for a fixed time (last resort)
page.wait_for_timeout(3000)

Prefer (1) over (2) over (3). Selector waits are most reliable; network idle works on JS-heavy sites; fixed sleeps are fragile.
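A fourth option, when no single selector signals readiness, is page.wait_for_function, which polls a JavaScript predicate inside the page until it returns truthy. A sketch — the URL, selector, and threshold are placeholders, and the deferred import keeps the helper importable without Playwright installed:

```python
def min_count_predicate(selector: str, n: int) -> str:
    """Build a page-side JS predicate: true once at least n elements match."""
    return f"() => document.querySelectorAll('{selector}').length >= {n}"

def scrape_when_ready(url: str, selector: str, n: int) -> str:
    from playwright.sync_api import sync_playwright  # deferred import
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url)
        # Polls the predicate until it returns truthy (or times out)
        page.wait_for_function(min_count_predicate(selector, n))
        html = page.content()
        browser.close()
        return html

# Usage (hypothetical target):
# scrape_when_ready("https://target.com/products", "div.product", 10)
```

This is handy for infinite-scroll pages, where "loaded" means "enough items rendered" rather than "this one element exists".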

Extracting Data

page.goto("https://target.com/products")
page.wait_for_selector("li.product")

# Multiple elements
cards = page.query_selector_all("li.product")
products = []
for c in cards:
    title = c.query_selector("h3").inner_text()
    price = c.query_selector("span.price").inner_text()
    url = c.query_selector("a").get_attribute("href")
    products.append({"title": title, "price": price, "url": url})

# Single element
h1 = page.query_selector("h1.page-title").inner_text()

# Multiple text contents (faster)
all_titles = page.locator("h3.title").all_text_contents()

For CSS selectors, see the CSS selector cheat sheet. Playwright locators can also filter by text: page.locator("button", has_text="Submit").
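The newer locator API is an auto-waiting alternative to query_selector_all: a locator resolves lazily, and .all() splits it into one sub-locator per match. A sketch of the same product loop, plus a small price parser for strings like "$1,299.99" — the selectors and the price format are assumptions:

```python
import re

def parse_price(text: str) -> float:
    """Pull a numeric price out of strings like '$1,299.99' (assumed format)."""
    cleaned = re.sub(r"[^\d.]", "", text)
    return float(cleaned) if cleaned else 0.0

def extract_products(page) -> list:
    # Locator equivalent of the query_selector_all loop above
    products = []
    for card in page.locator("li.product").all():
        products.append({
            "title": card.locator("h3").inner_text(),
            "price": parse_price(card.locator("span.price").inner_text()),
            "url": card.locator("a").first.get_attribute("href"),
        })
    return products
```

Unlike query_selector, a locator retries until its element appears, which removes a whole class of "element not found yet" races.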

Proxy Configuration

browser = p.chromium.launch(
    headless=True,
    proxy={
        "server": "http://gw.spyderproxy.com:8000",
        "username": "YOUR_USER",
        "password": "YOUR_PASS",
    },
)

Proxy is set at browser launch — all requests from that browser route through it. For rotating IPs, launch a fresh browser per scrape or use sticky-session syntax:

import random

def fresh_browser(p):
    sid = random.randint(0, 100000)
    return p.chromium.launch(
        headless=True,
        proxy={
            "server": "http://gw.spyderproxy.com:8000",
            "username": f"YOUR_USER-session-{sid}",
            "password": "YOUR_PASS",
        },
    )

Each session-{sid} value gets a sticky-session IP that stays consistent for up to 8 hours. Premium Residential ($2.75/GB) or LTE Mobile ($2/IP) is recommended for sites hardened enough to require a headless browser in the first place.
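Relaunching a whole browser per IP is heavy. Playwright also accepts a proxy per context (browser.new_context(proxy=...)), so one browser can hold several sticky sessions at once. A sketch using the same gateway syntax as above — the credentials are placeholders:

```python
import random

def session_proxy(user: str, password: str, sid=None) -> dict:
    """Proxy dict with a sticky-session username (gateway syntax from above)."""
    sid = random.randint(0, 100000) if sid is None else sid
    return {
        "server": "http://gw.spyderproxy.com:8000",
        "username": f"{user}-session-{sid}",
        "password": password,
    }

def fresh_context(browser, user: str, password: str):
    # One browser, many contexts, each on its own sticky-session IP.
    # Note: some Chromium setups require launching the browser with a
    # global proxy (e.g. {"server": "http://per-context"}) before
    # per-context proxies take effect.
    return browser.new_context(proxy=session_proxy(user, password))
```

Close each context when its scrape finishes; contexts are cheap compared to browser launches.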

Stealth Mode (Avoid Detection)

Default Playwright leaves several headless tells: navigator.webdriver is true, some browser features (such as the permissions API) behave differently from a real browser, and so on. Sites that fingerprint visitors will flag you. Install the community stealth plugin:

pip install playwright-stealth

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    stealth_sync(page)            # patches detection signals
    page.goto("https://target.com")
    print(page.title())
    browser.close()

Stealth handles roughly 80% of common detection checks. For Cloudflare Turnstile and DataDome you also need fresh residential IPs and slowed-down request pacing. See the FlareSolverr guide for the heaviest cases.
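If you prefer not to add a dependency, the single loudest tell can be patched by hand with page.add_init_script, which injects a script before any page code runs, on every navigation. A minimal sketch — this hides only navigator.webdriver, not the rest of the fingerprint:

```python
# Hide the navigator.webdriver flag; real stealth plugins patch far more.
WEBDRIVER_PATCH = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)

def patched_page(context):
    page = context.new_page()
    # Runs before any page script, and again after each navigation
    page.add_init_script(WEBDRIVER_PATCH)
    return page
```

Treat this as a stopgap: fingerprinting scripts check dozens of signals, and a lone patch fails against anything serious.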

Screenshots and PDFs

# Full page screenshot
page.goto("https://target.com")
page.screenshot(path="screen.png", full_page=True)

# Just a specific element
el = page.query_selector("div.main-content")
el.screenshot(path="content.png")

# PDF (Chromium only)
page.pdf(path="page.pdf", format="A4")

Filling Forms

page.goto("https://example.com/login")
page.fill("input[name=\"username\"]", "alice")
page.fill("input[name=\"password\"]", "secret123")
page.click("button[type=\"submit\"]")
page.wait_for_url("**/dashboard")

The ** in wait_for_url is a glob: it matches any URL ending in /dashboard. Useful for asynchronous form submissions where the exact redirect URL varies.
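Logging in on every run is wasteful and raises flags. Playwright can persist a context's cookies and localStorage with context.storage_state and restore them later via new_context(storage_state=...). A sketch reusing the login flow above — the URLs, selectors, and the auth.json path are placeholders:

```python
def login_and_save(playwright, auth_path: str = "auth.json"):
    """Log in once and persist the session to disk."""
    browser = playwright.chromium.launch(headless=True)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com/login")
    page.fill('input[name="username"]', "alice")
    page.fill('input[name="password"]', "secret123")
    page.click('button[type="submit"]')
    page.wait_for_url("**/dashboard")
    context.storage_state(path=auth_path)  # writes cookies + localStorage
    browser.close()

def logged_in_context(playwright, auth_path: str = "auth.json"):
    # Later runs skip the form entirely: restore the saved session
    browser = playwright.chromium.launch(headless=True)
    return browser.new_context(storage_state=auth_path)
```

Saved sessions eventually expire, so wrap the restore path in a check that re-runs the login when the dashboard fails to load.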

Concurrent Pages, One Browser

import asyncio
from playwright.async_api import async_playwright

async def scrape_url(browser, url):
    page = await browser.new_page()
    await page.goto(url)
    title = await page.title()
    await page.close()
    return title

async def main():
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        urls = [f"https://example.com/p/{i}" for i in range(20)]
        sem = asyncio.Semaphore(5)        # max 5 concurrent
        async def bounded(u):
            async with sem:
                return await scrape_url(browser, u)
        results = await asyncio.gather(*(bounded(u) for u in urls))
        await browser.close()
    print(results)

asyncio.run(main())

One browser, multiple pages, capped concurrency. More memory-efficient than launching N browsers, but pages share cookies/storage — use browser.new_context() per scrape if you need full isolation.
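The full-isolation variant mentioned above looks like this: swap new_page for a throwaway context per task, leaving the semaphore scaffolding unchanged:

```python
import asyncio

async def scrape_isolated(browser, url: str) -> str:
    # Fresh context per task: no shared cookies, cache, or localStorage
    context = await browser.new_context()
    page = await context.new_page()
    try:
        await page.goto(url)
        return await page.title()
    finally:
        await context.close()  # also closes every page it owns
```

Contexts cost a little more than bare pages but far less than browsers, so this is usually the right middle ground for logged-out scraping at scale.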

Common Pitfalls

  • Forgetting to close the browser — leaks 200+ MB per orphaned process. Always wrap in with or try/finally.
  • Default timeouts too short — the default is 30s. For slow sites: page.goto(url, timeout=60000).
  • Headless detection — navigator.webdriver is true by default. Install playwright-stealth.
  • Mixing sync and async APIs — do not. Pick one per project.
  • Server has no display — pass headless=True (the default in 1.x). Avoid headless=False on CI.
  • Memory accumulation — restart browser every 100-200 pages to avoid leaks.
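The last pitfall, periodic restarts, can be sketched as a batching loop. The 150-page batch size is an assumption to tune, and the deferred import keeps the helper importable without Playwright installed:

```python
def batches(seq, n):
    """Split seq into chunks of at most n items."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]

def scrape_all(urls, pages_per_browser: int = 150) -> list:
    from playwright.sync_api import sync_playwright  # deferred import
    titles = []
    with sync_playwright() as p:
        for batch in batches(urls, pages_per_browser):
            # Fresh browser per batch caps memory growth from leaks
            browser = p.chromium.launch(headless=True)
            try:
                for url in batch:
                    page = browser.new_page()
                    page.goto(url)
                    titles.append(page.title())
                    page.close()
            finally:
                browser.close()  # frees the browser process's memory
    return titles
```

The try/finally also covers the first pitfall: a crash mid-batch still closes that batch's browser.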

When to Use Selenium Instead

  • Existing Selenium test suite you do not want to rewrite
  • Need browser support Playwright lacks (very old Safari, Internet Explorer)
  • Specific Selenium features (Grid, Selenium IDE recording)
  • Your team already knows Selenium and the API differences are not worth retraining

For greenfield Python projects in 2026, Playwright wins. Selenium's only remaining advantage is the install base.

Related: What is a headless browser, Puppeteer vs Playwright vs Selenium, Cheerio vs Puppeteer.