How to Build a Web Scraper in Python (2026): Full Tutorial

Building a web scraper in Python comes down to four steps: fetch a page with the requests library, parse its HTML with BeautifulSoup, extract the fields you want, and save them. This tutorial walks through a complete, working scraper from scratch — then adds the two things that separate a toy script from one that survives real websites: a proxy so you do not get blocked, and proper headers so you do not look like a bot.

This is the beginner-friendly requests-plus-BeautifulSoup path. When you outgrow it and need concurrency and structure, step up to our Scrapy tutorial. For LLM-based extraction, see web scraping with Claude.

1. Install the Libraries

pip install requests beautifulsoup4

requests fetches pages over HTTP; beautifulsoup4 parses the returned HTML into something you can query. These two cover the vast majority of static-site scraping.

2. Fetch a Page (With Headers and a Proxy)

Never fetch with bare defaults: out of the box, requests identifies itself with a User-Agent of the form python-requests/<version>, which is trivial for a site to filter. Send realistic headers and route through a proxy from the start:

import requests

URL = "https://books.toscrape.com/"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
}
PROXY = "http://USER:PASS@PROXY_HOST:7777"  # placeholders: substitute your provider's credentials and host

resp = requests.get(URL, headers=HEADERS,
                    proxies={"http": PROXY, "https": PROXY}, timeout=30)
resp.raise_for_status()
html = resp.text

Routing through a rotating residential proxy means the site sees an ordinary household IP, and a current user agent helps the request blend in with normal browser traffic.
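Proxy passwords often contain characters such as @ or : that break the URL if you paste them in raw. Here is a minimal sketch that percent-encodes the credentials before building the mapping requests expects. The environment variable names (PROXY_USER, PROXY_PASS, PROXY_HOST, PROXY_PORT), the helper names, and the proxy.example.com host are all assumptions for illustration, not values from any provider:

```python
import os
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Return a proxy URL with the credentials percent-encoded."""
    # safe="" also encodes "/", "@" and ":" so they cannot be mistaken
    # for URL delimiters.
    user_enc = quote(user, safe="")
    pass_enc = quote(password, safe="")
    return f"{scheme}://{user_enc}:{pass_enc}@{host}:{port}"


def proxies_from_env():
    """Build the dict passed to requests' `proxies=` argument.

    The environment variable names and defaults are assumptions
    made for this sketch.
    """
    url = build_proxy_url(
        os.environ.get("PROXY_USER", "USER"),
        os.environ.get("PROXY_PASS", "PASS"),
        os.environ.get("PROXY_HOST", "proxy.example.com"),  # placeholder host
        int(os.environ.get("PROXY_PORT", "7777")),
    )
    return {"http": url, "https": url}
```

You would then call requests.get(URL, headers=HEADERS, proxies=proxies_from_env(), timeout=30). Keeping credentials in the environment also keeps them out of version control.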

3. Parse the HTML With BeautifulSoup

Load the HTML and select elements with CSS selectors. Suppose each product sits in an <article class="product_pod"> with a title and a price inside:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for card in soup.select("article.product_pod"):
    title = card.select_one("h3 a")["title"]
    price = card.select_one("p.price_color").get_text(strip=True)
    print(title, price)

select() returns all matches; select_one() returns the first. Use get_text(strip=True) for clean text, and bracket access for attributes. CSS selectors are the easiest way in — see our CSS selector cheat sheet.
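One thing to watch: select_one() returns None when nothing matches, so chaining ["title"] or .get_text() onto it raises an error the first time a card is missing a field. Below is a minimal sketch of a more forgiving extraction. It runs against a small inline HTML sample invented for illustration rather than a live page, and the helper names text_or_none and attr_or_none are hypothetical:

```python
from bs4 import BeautifulSoup

# Invented sample: the second card is deliberately missing fields.
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="a.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a href="b.html">No title attribute here</a></h3>
</article>
"""


def text_or_none(node, selector):
    """Return the stripped text of the first match, or None if no match."""
    found = node.select_one(selector)
    return found.get_text(strip=True) if found is not None else None


def attr_or_none(node, selector, attr):
    """Return an attribute of the first match, or None if either is missing."""
    found = node.select_one(selector)
    # Tag.get() returns None for a missing attribute instead of raising.
    return found.get(attr) if found is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
records = [
    {
        "title": attr_or_none(card, "h3 a", "title"),
        "price": text_or_none(card, "p.price_color"),
    }
    for card in soup.select("article.product_pod")
]
```

Whether a missing field should become None or cause the row to be skipped depends on your data; the point is to decide explicitly rather than let one odd card crash the run.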

4. Collect Structured Records

Gather each row into a list of dictionaries so it is ready to save:

rows = []
for card in soup.select("article.product_pod"):
    rows.append({
        "title": card.select_one("h3 a")["title"],
        "price": card.select_one("p.price_color").get_text(strip=True),
        "in_stock": "In stock" in card.select_one("p.instock").get_text(),
    })
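At this point the price is still a string such as "£51.77". If you plan to sort or sum it later, converting it to a number at collection time helps. Here is a minimal sketch, assuming prices use a dot as the decimal separator and at most commas as thousands separators (the parse_price name is hypothetical). It keeps only the numeric part, so a currency symbol or a mis-decoded character in front of it does not matter:

```python
import re
from decimal import Decimal

# First run of digits, optionally with comma separators and a decimal part.
_NUMBER = re.compile(r"\d[\d,]*(?:\.\d+)?")


def parse_price(text):
    """Extract the first number in `text` as a Decimal, or None if absent."""
    match = _NUMBER.search(text)
    if match is None:
        return None
    # Decimal avoids the rounding surprises of float for money values.
    return Decimal(match.group().replace(",", ""))
```

In the loop above you could then store parse_price(card.select_one("p.price_color").get_text()) instead of the raw text.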

5. Handle Pagination

Most listings span many pages. Find the "next" link and loop until it disappears:

from urllib.parse import urljoin

def scrape_all(start_url):
    rows, url = [], start_url
    while url:
        resp = requests.get(url, headers=HEADERS,
                            proxies={"http": PROXY, "https": PROXY}, timeout=30)
        resp.raise_for_status()  # stop on 4xx/5xx instead of parsing an error page
        soup = BeautifulSoup(resp.text, "html.parser")
        for card in soup.select("article.product_pod"):
            rows.append({
                "title": card.select_one("h3 a")["title"],
                "price": card.select_one("p.price_color").get_text(strip=True),
            })
        nxt = soup.select_one("li.next a")
        url = urljoin(url, nxt["href"]) if nxt else None
    return rows

Because the proxy endpoint rotates IPs automatically, each page request can come from a different residential address, which makes a multi-page run far less likely to be blocked. See rotating proxies in Python.
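The urljoin call in the loop matters because "next" links are usually relative, and they resolve against the page they appear on rather than the site root. The short sketch below shows how that plays out. The URL shapes are assumed to match the books.toscrape.com demo site, and next_page_url is a hypothetical helper that also refuses to revisit a page, so a malformed link cannot loop forever:

```python
from urllib.parse import urljoin


def next_page_url(current_url, href, seen):
    """Resolve a relative "next" href; return None if missing or already seen."""
    if not href:
        return None
    candidate = urljoin(current_url, href)
    if candidate in seen:
        return None  # guard against pagination cycles
    seen.add(candidate)
    return candidate


seen = {"https://books.toscrape.com/"}
# From the home page, the href includes the "catalogue/" directory.
page2 = next_page_url("https://books.toscrape.com/", "catalogue/page-2.html", seen)
# From inside /catalogue/, the href is just the file name.
page3 = next_page_url(page2, "page-3.html", seen)
# A link back to a page we already visited is rejected.
loop = next_page_url(page3, "page-2.html", seen)
```

Joining against the site root instead of the current page would turn "page-3.html" into a URL that does not exist, which is a common reason a pagination loop stops after two pages.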

6. Save the Data

import csv

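rows = scrape_all(URL)
if not rows:
    # rows[0] below would raise IndexError on an empty result
    raise SystemExit("no rows scraped; check the selectors and the response")
with open("books.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0]))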
    writer.writeheader()
    writer.writerows(rows)
print("saved", len(rows), "rows")
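The same list of dictionaries can be written as JSON with the standard library if you prefer that format. This is a minimal sketch; the save_json and load_json names are hypothetical:

```python
import json
from pathlib import Path


def save_json(rows, path):
    """Write a list of dictionaries to `path` as a JSON array."""
    # ensure_ascii=False keeps characters such as "£" readable in the file.
    Path(path).write_text(
        json.dumps(rows, ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
    return len(rows)


def load_json(path):
    """Read the rows back, for example to check the file after a run."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

Unlike the CSV version, this handles an empty list without special-casing, because JSON does not need a header row. Note that values must be JSON-serialisable, so convert anything like Decimal to str or float first.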

What About JavaScript-Heavy Sites?

requests and BeautifulSoup only see the HTML the server sends — they do not run JavaScript. If a site renders its content client-side (you fetch the page and the data is not in the HTML), you need a headless browser like Playwright or Selenium to render it first, then parse. For heavily protected sites, read how to avoid detection while scraping and how to bypass Cloudflare.
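A quick way to tell which case you are in is to take a value you can see in the browser and look for it in the raw HTML that requests returned. Here is a minimal sketch; the function name and the sample strings in the usage note are invented for illustration:

```python
def appears_in_static_html(html, visible_text):
    """Return True if text seen in the browser is present in the raw HTML."""

    def normalise(value):
        # Collapse whitespace and ignore case so formatting differences
        # between the rendered page and the source do not cause a false "no".
        return " ".join(value.split()).casefold()

    return normalise(visible_text) in normalise(html)
```

For example, appears_in_static_html(resp.text, "A Light in the Attic") returning False for a title you can plainly see in your browser suggests the content is rendered client-side. It is only a heuristic: text split across nested tags can also produce a False.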

Best Practices and Legality

Send realistic headers and a current user agent. Defaults flag you instantly.
Route through rotating residential proxies. The single biggest factor in not getting blocked.
Throttle. Add delays between requests; do not hammer a server.
Respect robots.txt and terms. Read how to read a robots.txt file; scrape public data and avoid login-gated content you are not authorized for.
Handle errors. Retry transient failures and check status codes.
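Two of these practices, throttling and respecting robots.txt, need nothing beyond the standard library. Below is a minimal sketch. The user-agent token "my-scraper", the delay range, and the sample robots.txt content are all assumptions for illustration, and the rules are parsed from a string here so the example runs without a network call:

```python
import random
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-scraper"  # hypothetical token; use whatever your scraper sends

# Invented robots.txt content for the demonstration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def load_robots(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_pause(min_seconds=1.0, max_seconds=3.0, sleep=time.sleep):
    """Sleep for a random interval so requests are not evenly spaced."""
    delay = random.uniform(min_seconds, max_seconds)
    sleep(delay)
    return delay


robots = load_robots(SAMPLE_ROBOTS)
```

In a real run you would call robots.can_fetch(USER_AGENT, url) before each request and polite_pause() after it. To load a site's actual rules, RobotFileParser also offers set_url() followed by read(), which fetches the file for you.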

Scraping publicly available data is broadly permissible in many jurisdictions, but is bounded by site terms and privacy laws. Collect responsibly and seek legal advice for high-stakes use.

Frequently Asked Questions

How do I build a web scraper in Python?

Install requests and beautifulsoup4, fetch the page with requests (sending realistic headers and a proxy), parse the HTML with BeautifulSoup, select the elements you want with CSS selectors, collect the fields into dictionaries, follow pagination, and save the results to CSV or JSON. That four-step fetch-parse-extract-save loop is the core of any Python scraper.

What libraries do I need to scrape with Python?

For static sites, requests (to fetch pages) and beautifulsoup4 (to parse HTML) are enough. For large structured crawls, use the Scrapy framework. For JavaScript-rendered sites, add a headless browser like Playwright or Selenium to render the page before parsing.

Why does my Python scraper get blocked?

Usually two reasons: all your requests come from one IP, and your request looks like a bot. Fix both by routing through rotating residential proxies and sending a current user agent with realistic headers, plus throttling your request rate. IP diversity is the biggest single factor.
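Rate limiting typically surfaces as an HTTP 429 response, and overloaded servers as 5xx errors, so it also helps to retry those with a growing delay instead of failing or hammering the site. Here is a minimal sketch using the retry support that requests inherits from urllib3. The make_session name, the retry count, and the backoff factor are illustrative assumptions, not recommended values:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(total_retries=3, backoff_factor=1.0):
    """Return a Session that retries transient failures with backoff."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,  # wait grows between attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Apply the same retry policy to both schemes.
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


session = make_session()
```

You would then use session.get(url, headers=HEADERS, proxies=..., timeout=30) in place of requests.get. A Session also reuses connections, which is slightly kinder to the server on multi-page runs.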

Do I need proxies to scrape with Python?

For anything beyond a few requests, yes. Sites rate-limit and block repeated requests from one address. Rotating residential proxies spread your requests across many real IPs so the activity looks like ordinary users, which is what keeps a scraper running at scale.

How do I scrape a site that uses JavaScript?

requests and BeautifulSoup only see server-sent HTML and cannot run JavaScript. If the data is loaded client-side, use a headless browser such as Playwright or Selenium to render the page, then extract from the rendered HTML — ideally still routed through proxies.

How do I save scraped data?

Collect each record as a dictionary, then write the list to a file. Python's built-in csv module exports to CSV with DictWriter, and the json module writes JSON. Both take a list of dictionaries directly, so no extra tooling is needed.

Conclusion

A Python web scraper is just four steps — fetch, parse, extract, save — wrapped in a pagination loop. requests and BeautifulSoup get you a working scraper in a few dozen lines. What turns it into something that survives real websites is sending realistic headers and routing through rotating residential proxies so you are not blocked on the second page.

To keep your scraper running without bans, SpyderProxy residential proxies start at $1.75/GB with 10M+ IPs across 195+ countries, automatic rotation, and city-level targeting — drop the endpoint into your requests call and scale up.