
Python asyncio Tutorial: Async Web Scraping (2026)

Alex R. | Published Sun May 10 2026

Quick verdict: Python asyncio is built for I/O-bound concurrency — the perfect tool for web scraping, where most time is spent waiting for the network. Replacing 100 sequential requests.get calls with asyncio.gather + aiohttp typically cuts total time from ~30 seconds to ~1 second. The core primitives: async def to define coroutines, await to wait for one, asyncio.gather() to run many concurrently, asyncio.Semaphore to cap concurrency for rate limiting.

Why asyncio for Scraping

A web request spends 99% of its time waiting:

  • DNS lookup: ~20ms
  • TCP handshake: ~30ms
  • TLS handshake: ~50ms
  • Server processing: ~100-500ms
  • Body transfer: ~100ms
  • Total: ~300-700ms of which your CPU does ~2ms of work

Synchronous code wastes the wait time. asyncio uses it to start other requests. For I/O-bound work, expect 50-100x speedup over sequential. (For CPU-bound work, asyncio does nothing — use multiprocessing.)
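
For reference, here is the sequential baseline those numbers assume. A minimal sketch, with example.com standing in for a real target:

import time, requests

urls = [f"https://example.com/page/{i}" for i in range(100)]

start = time.perf_counter()
pages = [requests.get(u).text for u in urls]              # one request at a time, each ~300-700ms
print(f"sequential: {time.perf_counter() - start:.1f}s")  # roughly 30-70 seconds total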

The Basics: async/await

import asyncio

async def hello(name):
    print(f"hi {name}, waiting...")
    await asyncio.sleep(1)        # simulates I/O
    print(f"done with {name}")
    return f"result-{name}"

async def main():
    # Run three coroutines concurrently
    results = await asyncio.gather(
        hello("alice"),
        hello("bob"),
        hello("carol"),
    )
    print(results)

asyncio.run(main())

Output:

hi alice, waiting...
hi bob, waiting...
hi carol, waiting...
# (1 second passes — all three run concurrently)
done with alice
done with bob
done with carol
['result-alice', 'result-bob', 'result-carol']

Total wall time: ~1 second (not 3). The three coroutines wait concurrently because await asyncio.sleep(1) yields control to the event loop, which picks up the next ready coroutine.

aiohttp: Async HTTP Client

pip install aiohttp

import asyncio, aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    async with aiohttp.ClientSession() as session:
        # All 100 requests in flight at once
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    print(f"Got {len(results)} pages")

asyncio.run(main())

100 requests in ~1 second on a decent connection. Compare to sequential requests: ~30 seconds. aiohttp.ClientSession() must be created and used inside a coroutine (hence the async with block); reusing one session pools connections, which is critical for performance.
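
Session-level defaults are another reason to reuse one session. A sketch setting a default timeout and shared headers that every request through the session inherits (the User-Agent string is illustrative):

import asyncio, aiohttp

async def main():
    timeout = aiohttp.ClientTimeout(total=30)        # default timeout for every request
    headers = {"User-Agent": "my-scraper/1.0"}       # shared default headers
    async with aiohttp.ClientSession(timeout=timeout, headers=headers) as session:
        async with session.get("https://example.com") as r:
            print(r.status)

asyncio.run(main())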

httpx: Sync + Async in One Library

httpx is a drop-in replacement for requests with async support:

import asyncio, httpx

async def fetch(client, url):
    r = await client.get(url)
    return r.text

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch(client, u) for u in urls))
    return results

asyncio.run(main())

API choice: aiohttp is older and more battle-tested. httpx is newer with a cleaner API and matching sync/async. Both work; pick the one that matches the rest of your stack.
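
The sync/async symmetry is the main draw: the same call shape works in both worlds (a quick sketch against example.com):

import asyncio, httpx

# Sync: reads just like requests
with httpx.Client() as client:
    print(client.get("https://example.com").status_code)

# Async: identical API, just awaited
async def main():
    async with httpx.AsyncClient() as client:
        r = await client.get("https://example.com")
        print(r.status_code)

asyncio.run(main())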

Rate Limiting With Semaphore

Launching 1,000 requests at once will get your IP banned. Cap concurrency with asyncio.Semaphore:

import asyncio, aiohttp

semaphore = asyncio.Semaphore(10)   # max 10 concurrent requests

async def fetch(session, url):
    async with semaphore:           # acquire slot
        async with session.get(url) as response:
            await asyncio.sleep(0.5)  # polite delay
            return await response.text()

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(1000)]
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    return results

asyncio.run(main())

The semaphore lets only 10 coroutines past at once. Others wait at async with semaphore: until a slot frees up. 1,000 URLs at 10 concurrency + 0.5s delay completes in ~50 seconds vs ~500 seconds sequential.
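
If you only need to cap simultaneous connections, without the polite per-request delay, aiohttp's connector can do it at the transport level. A minimal sketch:

import asyncio, aiohttp

async def main():
    connector = aiohttp.TCPConnector(limit=10)   # at most 10 open connections for this session
    async with aiohttp.ClientSession(connector=connector) as session:
        ...   # same gather(...) fan-out as above; extra requests wait for a free connection

asyncio.run(main())

Unlike the semaphore, this throttles connections rather than coroutines, so it will not add the 0.5s delay between requests.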

Error Handling: Some Will Fail

Use return_exceptions=True with gather so one failure does not kill the batch:

results = await asyncio.gather(
    *(fetch(session, u) for u in urls),
    return_exceptions=True,         # exceptions become results, not raises
)

for url, r in zip(urls, results):
    if isinstance(r, Exception):
        print(f"FAIL {url}: {r}")
    else:
        print(f"OK   {url}: {len(r)} bytes")

Or wrap each fetch in try/except for inline retry logic:

async def fetch_with_retry(session, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            async with session.get(url, timeout=10) as r:
                if r.status == 200:
                    return await r.text()
                if r.status in (429, 503):
                    await asyncio.sleep(2 ** attempt)
                    continue
                r.raise_for_status()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"exhausted retries for {url}")   # e.g. persistent 429/503
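
Wiring the retry wrapper into a batch is the same gather call as before. A sketch assuming fetch_with_retry from above:

import asyncio, aiohttp

async def main(urls):
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(
            *(fetch_with_retry(session, u) for u in urls),
            return_exceptions=True,   # anything that survives all retries shows up as an exception object
        )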

Rotating Proxies in asyncio

aiohttp accepts a proxy per request:

import asyncio, aiohttp, random

def fresh_proxy():
    """SpyderProxy sticky session: fresh IP per session_id."""
    sid = random.randint(0, 100000)
    return f"http://USER-session-{sid}:[email protected]:8000"

async def fetch(session, url):
    async with session.get(url, proxy=fresh_proxy(), timeout=15) as r:
        return await r.text()

async def main():
    urls = [f"https://target.com/item/{i}" for i in range(100)]
    sem = asyncio.Semaphore(10)
    async with aiohttp.ClientSession() as session:
        async def bounded(u):
            async with sem:
                return await fetch(session, u)
        results = await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
    return results

asyncio.run(main())

Each request gets a fresh sticky-session IP from Premium Residential. 100 URLs through 100 different IPs, all in flight at once (capped at 10 concurrent by the semaphore).

Common Patterns

Producer-consumer (queue-based)

import asyncio, aiohttp

NUM_WORKERS = 20

async def producer(queue, urls):
    for url in urls:
        await queue.put(url)
    for _ in range(NUM_WORKERS):
        await queue.put(None)   # sentinel to stop workers

async def worker(queue, session, results):
    while True:
        url = await queue.get()
        if url is None:
            break
        async with session.get(url) as r:
            results.append(await r.text())

async def main(urls):
    queue = asyncio.Queue(maxsize=100)
    results = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            producer(queue, urls),
            *(worker(queue, session, results) for _ in range(NUM_WORKERS)),
        )
    return results

Better for unknown-size streams or when each request might enqueue more URLs (crawling).
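
A minimal crawling sketch along those lines, where workers enqueue newly discovered links. The regex-based link extraction and example.com scoping are assumptions; use a real HTML parser in practice:

import asyncio, aiohttp, re

async def crawl(start_urls, max_pages=200, num_workers=10):
    queue = asyncio.Queue()
    seen, pages = set(), []
    for u in start_urls:
        seen.add(u)
        queue.put_nowait(u)

    async def worker(session):
        while True:
            url = await queue.get()
            try:
                async with session.get(url, timeout=15) as r:
                    html = await r.text()
                pages.append((url, html))
                if len(pages) < max_pages:
                    # naive absolute-link extraction; swap in a real HTML parser
                    for link in re.findall(r'href="(https://example\.com[^"]+)"', html):
                        if link not in seen:
                            seen.add(link)
                            queue.put_nowait(link)
            except (aiohttp.ClientError, asyncio.TimeoutError):
                pass                      # drop failed URLs; log or retry in real code
            finally:
                queue.task_done()

    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session)) for _ in range(num_workers)]
        await queue.join()               # done when every enqueued URL is processed
        for w in workers:
            w.cancel()
        await asyncio.gather(*workers, return_exceptions=True)
    return pages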

Common Gotchas

  • Calling sync code blocks the event loop. Never use requests.get, time.sleep, or any blocking I/O inside an async function. Use aiohttp/httpx and asyncio.sleep.
  • "RuntimeWarning: coroutine was never awaited" — you called an async function without await. Fix: await coro() or asyncio.gather(coro()).
  • One slow URL hangs the batch. Always set timeouts. aiohttp.ClientTimeout(total=30) per session, or timeout= per request.
  • SSL errors with aiohttp. Pass connector=aiohttp.TCPConnector(ssl=False) to skip cert verification (only for testing — see why ignoring SSL is dangerous).
  • Cannot mix asyncio with multiprocessing easily. If you need both CPU and I/O concurrency, use loop.run_in_executor(pool, cpu_bound_fn) to offload CPU work to a thread or process pool; see the sketch after this list.
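
A sketch of the executor offload (parse_page is a hypothetical CPU-bound function):

import asyncio
from concurrent.futures import ProcessPoolExecutor

def parse_page(html):
    # placeholder for heavy parsing / extraction work
    return len(html.split())

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor() as pool:
        # the event loop stays free to serve I/O while the pool does CPU work
        count = await loop.run_in_executor(pool, parse_page, "<html>a b c</html>")
    print(count)

if __name__ == "__main__":          # guard required for multiprocessing on spawn-based platforms
    asyncio.run(main())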

When NOT to Use asyncio

  • Pure CPU-bound work — use multiprocessing instead.
  • Few requests at low concurrency — the async complexity is not worth it for <20 URLs.
  • You need a sync interface elsewhere — using async-only libraries leaks asyncio.run() calls everywhere.
  • Selenium / Playwright is needed — Playwright has its own async API; do not try to wrap Selenium's sync API in asyncio.

Related: Mastering httpx, Python requests retry, Concurrency vs parallelism.