Quick verdict: Python asyncio is built for I/O-bound concurrency, which makes it the perfect tool for web scraping, where most time is spent waiting on the network. Replacing 100 sequential requests.get calls with asyncio.gather + aiohttp typically cuts runtime from ~30 seconds to ~1 second. The core primitives: async def to define coroutines, await to wait for one, asyncio.gather() to run many concurrently, asyncio.Semaphore to cap concurrency for rate limiting.
A web request spends almost all of its time waiting on the network. Synchronous code wastes that wait time; asyncio uses it to start other requests. For I/O-bound work the speedup scales with how many requests you keep in flight, typically 50-100x over sequential. (For CPU-bound work, asyncio does nothing; use multiprocessing.)
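For reference, the sequential baseline that the ~30-second figure refers to looks like this (plain requests, one blocking call at a time; the URLs are placeholders):

import requests

def fetch_all_sequential(urls):
    # Each call blocks until the response arrives; at ~0.3s per request,
    # 100 URLs add up to roughly 30 seconds of mostly idle waiting.
    return [requests.get(url, timeout=10).text for url in urls]

pages = fetch_all_sequential([f"https://example.com/page/{i}" for i in range(100)])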
import asyncio

async def hello(name):
    print(f"hi {name}, waiting...")
    await asyncio.sleep(1)  # simulates I/O
    print(f"done with {name}")
    return f"result-{name}"

async def main():
    # Run three coroutines concurrently
    results = await asyncio.gather(
        hello("alice"),
        hello("bob"),
        hello("carol"),
    )
    print(results)

asyncio.run(main())

Output:
hi alice, waiting...
hi bob, waiting...
hi carol, waiting...
# (1 second passes — all three run concurrently)
done with alice
done with bob
done with carol
['result-alice', 'result-bob', 'result-carol']

Total wall time: ~1 second (not 3). The three coroutines wait concurrently because await asyncio.sleep(1) yields control to the event loop, which picks up the next ready coroutine.
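For contrast, awaiting the same coroutines one at a time serializes them; a minimal sketch reusing hello() from above:

async def main_sequential():
    results = []
    for name in ("alice", "bob", "carol"):
        # Each await blocks here until that coroutine finishes,
        # so total wall time is ~3 seconds instead of ~1.
        results.append(await hello(name))
    print(results)

asyncio.run(main_sequential())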
pip install aiohttp

import asyncio, aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    async with aiohttp.ClientSession() as session:
        # All 100 requests in flight at once
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    print(f"Got {len(results)} pages")

asyncio.run(main())

100 requests in ~1 second on a decent connection. Compare to sequential requests: ~30 seconds. aiohttp.ClientSession() must be used inside an async context; reusing one session pools connections, which is critical for performance.
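To sanity-check the timing on your own URLs, a small harness around the main() just shown (a sketch, assuming that snippet is already defined):

import asyncio, time

async def timed():
    start = time.perf_counter()
    await main()  # the aiohttp main() defined above
    print(f"elapsed: {time.perf_counter() - start:.2f}s")

asyncio.run(timed())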
httpx is a near drop-in replacement for requests that adds async support:
import asyncio, httpx

async def fetch(client, url):
    r = await client.get(url)
    return r.text

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(100)]
    async with httpx.AsyncClient() as client:
        results = await asyncio.gather(*(fetch(client, u) for u in urls))
    return results

asyncio.run(main())

API choice: aiohttp is older and more battle-tested. httpx is newer, with a cleaner API and matching sync/async interfaces. Both work; pick the one that matches the rest of your stack.
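If you go with httpx, connection-level concurrency can also be capped on the client itself via httpx.Limits; a sketch, with the limit and timeout values as arbitrary examples:

import asyncio, httpx

async def main():
    limits = httpx.Limits(max_connections=10, max_keepalive_connections=10)
    async with httpx.AsyncClient(limits=limits, timeout=15.0) as client:
        urls = [f"https://example.com/page/{i}" for i in range(100)]
        # Requests queue internally until a connection slot frees up
        responses = await asyncio.gather(*(client.get(u) for u in urls))
        return [r.text for r in responses]

asyncio.run(main())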
Launching 1,000 requests at once will get your IP banned. Cap concurrency with asyncio.Semaphore:
import asyncio, aiohttp

# Created at import time: fine on Python 3.10+, but on older versions
# create the semaphore inside main() so it binds to the running loop.
semaphore = asyncio.Semaphore(10)  # max 10 concurrent requests

async def fetch(session, url):
    async with semaphore:  # acquire slot
        async with session.get(url) as response:
            await asyncio.sleep(0.5)  # polite delay
            return await response.text()

async def main():
    urls = [f"https://example.com/page/{i}" for i in range(1000)]
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, u) for u in urls))
    return results

asyncio.run(main())

The semaphore lets only 10 coroutines past at once. The rest wait at async with semaphore: until a slot frees up. 1,000 URLs at concurrency 10 with a 0.5s delay complete in roughly 50 seconds, versus roughly 500 seconds sequentially.
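An alternative (or complement) is to cap concurrency at the connection-pool level with aiohttp's TCPConnector; a sketch, with limit=10 as an arbitrary value:

import asyncio, aiohttp

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    # At most 10 TCP connections open at once; excess requests wait
    # for a free connection instead of an explicit semaphore slot.
    connector = aiohttp.TCPConnector(limit=10)
    async with aiohttp.ClientSession(connector=connector) as session:
        urls = [f"https://example.com/page/{i}" for i in range(1000)]
        return await asyncio.gather(*(fetch(session, u) for u in urls))

asyncio.run(main())

Unlike the semaphore version, this caps open connections but adds no per-request delay, so keep the asyncio.sleep() pacing if the target needs it.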
Use return_exceptions=True with gather so one failure does not kill the batch:
results = await asyncio.gather(
    *(fetch(session, u) for u in urls),
    return_exceptions=True,  # exceptions become results, not raises
)

for url, r in zip(urls, results):
    if isinstance(r, Exception):
        print(f"FAIL {url}: {r}")
    else:
        print(f"OK {url}: {len(r)} bytes")

Or wrap each fetch in try/except for inline retry logic:
async def fetch_with_retry(session, url, max_retries=3):
    for attempt in range(max_retries):
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as r:
                if r.status == 200:
                    return await r.text()
                if r.status in (429, 503):
                    await asyncio.sleep(2 ** attempt)  # exponential backoff
                    continue
                r.raise_for_status()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            if attempt == max_retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)
    raise RuntimeError(f"gave up on {url} after {max_retries} attempts")  # exhausted 429/503 retries

aiohttp accepts a proxy per request:
import asyncio, aiohttp, random

def fresh_proxy():
    """SpyderProxy sticky session: fresh IP per session_id."""
    sid = random.randint(0, 100000)
    return f"http://USER-session-{sid}:[email protected]:8000"

async def fetch(session, url):
    async with session.get(url, proxy=fresh_proxy(), timeout=aiohttp.ClientTimeout(total=15)) as r:
        return await r.text()

async def main():
    urls = [f"https://target.com/item/{i}" for i in range(100)]
    sem = asyncio.Semaphore(10)

    async def bounded(u):
        async with sem:
            return await fetch(session, u)

    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(bounded(u) for u in urls), return_exceptions=True)
    return results

asyncio.run(main())

Each request gets a fresh sticky-session IP from Premium Residential: 100 URLs through 100 different IPs, all in flight at once (capped at 10 concurrent by the semaphore).

For larger or open-ended crawls, a producer-consumer setup with asyncio.Queue scales better than a single gather:
import asyncio, aiohttp

NUM_WORKERS = 20

async def producer(queue, urls):
    for url in urls:
        await queue.put(url)
    for _ in range(NUM_WORKERS):
        await queue.put(None)  # sentinel to stop workers

async def worker(queue, session, results):
    while True:
        url = await queue.get()
        if url is None:
            break
        async with session.get(url) as r:
            results.append(await r.text())

async def main(urls):
    queue = asyncio.Queue(maxsize=100)
    results = []
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            producer(queue, urls),
            *(worker(queue, session, results) for _ in range(NUM_WORKERS)),
        )
    return results

Better for unknown-size streams or when each request might enqueue more URLs (crawling).
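A sketch of that crawling variant, where workers enqueue the links they discover. extract_links() is a hypothetical placeholder for your own parser; queue.join() plus task_done() replaces the sentinel shutdown because the total number of URLs is not known up front:

import asyncio, aiohttp

NUM_WORKERS = 20

async def crawl(start_urls, max_pages=500):
    queue = asyncio.Queue()
    seen = set(start_urls)
    results = {}

    for url in start_urls:
        queue.put_nowait(url)

    async def worker(session):
        while True:
            url = await queue.get()
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as r:
                    html = await r.text()
                results[url] = html
                for link in extract_links(html, url):  # hypothetical link parser
                    if link not in seen and len(seen) < max_pages:
                        seen.add(link)
                        queue.put_nowait(link)
            except (aiohttp.ClientError, asyncio.TimeoutError):
                pass  # log and move on
            finally:
                queue.task_done()

    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(session)) for _ in range(NUM_WORKERS)]
        await queue.join()  # done once every enqueued URL has been processed
        for w in workers:
            w.cancel()  # workers loop forever; stop them when the queue drains
        await asyncio.gather(*workers, return_exceptions=True)
    return results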
Common pitfalls:

- Blocking calls: requests.get, time.sleep, or any other blocking I/O inside an async function stalls the whole event loop. Use aiohttp/httpx and asyncio.sleep.
- Calling a coroutine without await: coro() alone does nothing. Fix: await coro() or asyncio.gather(coro()).
- No timeouts: set aiohttp.ClientTimeout(total=30) per session, or timeout= per request.
- SSL errors: connector=aiohttp.TCPConnector(ssl=False) skips cert verification (only for testing; see why ignoring SSL is dangerous).
- CPU-bound work in a coroutine: use loop.run_in_executor(None, cpu_bound_fn) to offload it to a thread or process pool.
- Multiple event loops: keep a single top-level asyncio.run(main()) rather than scattering asyncio.run() calls everywhere.

Related: Mastering httpx, Python requests retry, Concurrency vs parallelism.