Building a web scraper in Python comes down to four steps: fetch a page with the requests library, parse its HTML with BeautifulSoup, extract the fields you want, and save them. This tutorial walks through a complete, working scraper from scratch — then adds the two things that separate a toy script from one that survives real websites: a proxy so you do not get blocked, and proper headers so you do not look like a bot.
This is the beginner-friendly requests-plus-BeautifulSoup path. When you outgrow it and need concurrency and structure, step up to our Scrapy tutorial. For LLM-based extraction, see web scraping with Claude.
pip install requests beautifulsoup4
requests fetches pages over HTTP; beautifulsoup4 parses the returned HTML into something you can query. These two cover the vast majority of static-site scraping.
Never fetch with bare defaults: out of the box, requests identifies itself with a User-Agent of the form python-requests/<version>, which is a dead giveaway. Send realistic headers and route through a proxy from the start:
import requests
URL = "https://books.toscrape.com/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
# Placeholder proxy URL: substitute your provider's username, password, and hostname
PROXY = "http://USER:PASS@PROXY_HOST:7777"
resp = requests.get(URL, headers=HEADERS,
                    proxies={"http": PROXY, "https": PROXY}, timeout=30)
resp.raise_for_status()  # raise on 4xx/5xx responses
html = resp.text
Routing through a rotating residential proxy means the site sees an ordinary household IP, and sending a current user agent helps the request blend in with normal browser traffic.
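When a script makes more than one request, repeating the headers and proxy arguments on every call gets error-prone. The following is a minimal sketch, not part of the original scraper, that bundles them into a requests.Session together with a retry policy. The helper name make_session, the retry count, the list of retried status codes, and the PROXY_HOST placeholder are all assumptions for illustration.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(headers, proxy_url, retries=3):
    """Return a Session that applies the same headers, proxy, and retries to every request."""
    session = requests.Session()
    session.headers.update(headers)
    session.proxies.update({"http": proxy_url, "https": proxy_url})

    # Retry transient failures with exponential backoff.
    # The status codes listed here are an assumption; tune them per site.
    retry = Retry(
        total=retries,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504],
    )
    adapter = HTTPAdapter(max_retries=retry)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


# Placeholder values for illustration only.
session = make_session(
    {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        "Accept-Language": "en-US,en;q=0.9",
    },
    "http://USER:PASS@PROXY_HOST:7777",
)
# resp = session.get("https://books.toscrape.com/", timeout=30)
```

A session also reuses the underlying connection between requests where it can, which tends to make multi-page runs a little faster.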
Load the HTML and select elements with CSS selectors. Suppose each product sits in an <article class="product_pod"> with a title and a price inside:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for card in soup.select("article.product_pod"):
    title = card.select_one("h3 a")["title"]
    price = card.select_one("p.price_color").get_text(strip=True)
    print(title, price)
select() returns all matches; select_one() returns the first. Use get_text(strip=True) for clean text, and bracket access for attributes. CSS selectors are the easiest way in — see our CSS selector cheat sheet.
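One detail worth knowing: select_one() returns None when nothing matches, so chaining ["title"] or .get_text() onto it raises an error if a card is missing that element. The sketch below shows one way to guard against that. The sample HTML and the helper names text_or_none and parse_cards are hypothetical, made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: the second card has no price element.
SAMPLE = """
<article class="product_pod">
  <h3><a href="a.html" title="Book A">Book A</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="b.html" title="Book B">Book B</a></h3>
</article>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None when nothing matches."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node is not None else None


def parse_cards(html):
    soup = BeautifulSoup(html, "html.parser")
    cards = []
    for card in soup.select("article.product_pod"):
        link = card.select_one("h3 a")
        cards.append({
            # .get() returns None for a missing attribute instead of raising KeyError
            "title": link.get("title") if link is not None else None,
            "price": text_or_none(card, "p.price_color"),
        })
    return cards


cards = parse_cards(SAMPLE)
```

Here the second card yields a price of None rather than stopping the whole run, which leaves the decision about incomplete records to the saving step.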
Gather each row into a list of dictionaries so it is ready to save:
rows = []
for card in soup.select("article.product_pod"):
    rows.append({
        "title": card.select_one("h3 a")["title"],
        "price": card.select_one("p.price_color").get_text(strip=True),
        "in_stock": "In stock" in card.select_one("p.instock").get_text(),
    })
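The price extracted above is still a string that includes a currency symbol. If you want to sort or sum prices later, a small cleaning step helps. This is a minimal sketch that assumes a comma is a thousands separator and a dot is the decimal point; parse_price is a hypothetical helper name. It also tolerates stray characters in front of the number, which can appear when a page's encoding is detected incorrectly.

```python
import re
from decimal import Decimal

# First run of digits, allowing thousands commas and an optional decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Return the first number in text as a Decimal, or None if there is none."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    # Assumption: commas are thousands separators, so they can be dropped.
    return Decimal(match.group().replace(",", ""))


price = parse_price("£51.77")
```

Decimal is used rather than float so that prices keep their exact two-digit values when they are added together.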
Most listings span many pages. Find the "next" link and loop until it disappears:
from urllib.parse import urljoin
def scrape_all(start_url):
    rows, url = [], start_url
    while url:
        resp = requests.get(url, headers=HEADERS,
                            proxies={"http": PROXY, "https": PROXY}, timeout=30)
        resp.raise_for_status()  # stop on an error page instead of parsing it
        soup = BeautifulSoup(resp.text, "html.parser")
        for card in soup.select("article.product_pod"):
            rows.append({
                "title": card.select_one("h3 a")["title"],
                "price": card.select_one("p.price_color").get_text(strip=True),
            })
        # Follow the "next" link; relative hrefs are resolved against the current URL
        nxt = soup.select_one("li.next a")
        url = urljoin(url, nxt["href"]) if nxt else None
    return rows
Because the proxy endpoint rotates IPs automatically, each page request can come from a different residential address — exactly what keeps a multi-page run from being blocked. See rotating proxies in Python.
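A "loop until the next link disappears" crawl can run forever if a site's pagination links ever point back to a page already visited. The sketch below shows one way to bound the loop, assuming a page cap and a set of visited URLs are acceptable safeguards. The function follow_pages, its fetch_page callback, and the fake site used to exercise it are hypothetical; in a real scraper fetch_page would perform the request and parsing and return the rows plus the next link's href.

```python
from urllib.parse import urljoin


def follow_pages(start_url, fetch_page, max_pages=50):
    """Collect rows page by page, stopping on a repeated URL or after max_pages."""
    rows, seen, url = [], set(), start_url
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        # fetch_page returns (rows_on_page, next_href_or_None)
        page_rows, next_href = fetch_page(url)
        rows.extend(page_rows)
        url = urljoin(url, next_href) if next_href else None
    return rows


# Fake two-page site whose last page links back to the first (a cycle).
FAKE_SITE = {
    "https://example.test/page-1.html": (["a"], "page-2.html"),
    "https://example.test/page-2.html": (["b"], "page-1.html"),
}
demo_rows = follow_pages("https://example.test/page-1.html", FAKE_SITE.__getitem__)
```

Passing the fetch step in as a function also makes the loop easy to test without touching the network.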
import csv
rows = scrape_all(URL)
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
print("saved", len(rows), "rows")
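The same list of dictionaries can be written as JSON with the standard library's json module. This is a minimal sketch; save_json is a hypothetical helper name, and the file name in the commented usage line is an assumption.

```python
import json


def save_json(rows, path):
    """Write a list of dictionaries to path as UTF-8 JSON and return the row count."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps characters such as the pound sign readable in the file
        json.dump(rows, f, ensure_ascii=False, indent=2)
    return len(rows)


# save_json(rows, "books.json")
```

Unlike the CSV version, this works unchanged on an empty list, since it does not need the first row to work out column names.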
requests and BeautifulSoup only see the HTML the server sends — they do not run JavaScript. If a site renders its content client-side (you fetch the page and the data is not in the HTML), you need a headless browser like Playwright or Selenium to render it first, then parse. For heavily protected sites, read how to avoid detection while scraping and how to bypass Cloudflare.
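A quick way to tell which situation you are in is to check whether your selector matches anything in the raw HTML the server returned. The sketch below is a rough heuristic, not a definitive test; has_static_data and both sample pages are hypothetical.

```python
from bs4 import BeautifulSoup


def has_static_data(html, selector):
    """Return True if the selector matches at least one element in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return len(soup.select(selector)) > 0


# Hypothetical responses: one server-rendered, one an empty shell filled in by JavaScript.
STATIC_PAGE = '<article class="product_pod"><h3><a title="Book A">Book A</a></h3></article>'
JS_SHELL = '<div id="root"></div><script src="app.js"></script>'

static_ok = has_static_data(STATIC_PAGE, "article.product_pod")
shell_ok = has_static_data(JS_SHELL, "article.product_pod")
```

If the check fails on a page where the data is clearly visible in a browser, that suggests client-side rendering, although a wrong selector or a block page would produce the same result.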
Scraping publicly available data is broadly permissible in many jurisdictions, but is bounded by site terms and privacy laws. Collect responsibly and seek legal advice for high-stakes use.
Install requests and beautifulsoup4, fetch the page with requests (sending realistic headers and a proxy), parse the HTML with BeautifulSoup, select the elements you want with CSS selectors, collect the fields into dictionaries, follow pagination, and save the results to CSV or JSON. That four-step fetch-parse-extract-save loop is the core of any Python scraper.
For static sites, requests (to fetch pages) and beautifulsoup4 (to parse HTML) are enough. For large structured crawls, use the Scrapy framework. For JavaScript-rendered sites, add a headless browser like Playwright or Selenium to render the page before parsing.
Scrapers usually get blocked for two reasons: all the requests come from one IP, and the requests look like a bot. Fix both by routing through rotating residential proxies, sending a current user agent with realistic headers, and throttling your request rate. IP diversity is the biggest single factor.
For anything beyond a few requests, a proxy is necessary. Sites rate-limit and block repeated requests from one address. Rotating residential proxies spread your requests across many real IPs so the activity looks like ordinary users, which is what keeps a scraper running at scale.
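Throttling can be as simple as enforcing a minimum gap between consecutive requests. The following is a minimal sketch of that idea; the Throttle class and the two-second interval are assumptions chosen for illustration, not values recommended by any particular site.

```python
import time


class Throttle:
    """Enforce a minimum number of seconds between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Too soon since the last request: pause for the rest of the interval
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(min_interval=2.0)
# Call throttle.wait() immediately before each requests.get(...)
```

The clock and sleep functions are injectable so the behaviour can be checked without real waiting.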
requests and BeautifulSoup only see server-sent HTML and cannot run JavaScript. If the data is loaded client-side, use a headless browser such as Playwright or Selenium to render the page, then extract from the rendered HTML — ideally still routed through proxies.
Collect each record as a dictionary, then write the list to a file. Python's built-in csv module exports to CSV with DictWriter, and the json module writes JSON. Both take a list of dictionaries directly, so no extra tooling is needed.
A Python web scraper is just four steps — fetch, parse, extract, save — wrapped in a pagination loop. requests and BeautifulSoup get you a working scraper in a few dozen lines. What turns it into something that survives real websites is sending realistic headers and routing through rotating residential proxies so you are not blocked on the second page.
To keep your scraper running without bans, SpyderProxy residential proxies start at $1.75/GB with 10M+ IPs across 195+ countries, automatic rotation, and city-level targeting — drop the endpoint into your requests call and scale up.