Scrapy Web Scraping Tutorial (2026): Build a Spider + Proxies

Daniel K. | Published Sun May 24 2026 | 10 min read

Scrapy is one of the most capable open-source web scraping frameworks for Python: an asynchronous engine that handles requests, parsing, pagination, retries, and data export in one structured project, in place of the hand-rolled scripts you would otherwise build with requests and BeautifulSoup. This tutorial takes you from install to a working spider, then adds what most real Scrapy projects need to run at scale: rotating residential proxies so you do not get blocked.

If you want the simpler requests-plus-BeautifulSoup approach first, see how to build a web scraper in Python; Scrapy is the step up when you need concurrency, structure, and scale.

1. Install Scrapy

pip install scrapy

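Scrapy runs on Python 3.9+ (check the release notes for the minimum version of the release you install). pip also pulls in its networking and parsing dependencies (Twisted, parsel, lxml), so you do not need requests or a separate HTML parser.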

2. Create a Project and Spider

Scrapy organizes work into a project with one or more spiders. Create the project:

scrapy startproject bookscraper
cd bookscraper

Then create a spider file at bookscraper/spiders/books.py:

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
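        for book in response.css("article.product_pod"):
            # The availability <p> holds several text nodes (whitespace first,
            # then the label), so join them all instead of taking only the first.
            availability = "".join(book.css("p.instock.availability::text").getall())
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "in_stock": "In stock" in availability,
            }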
        # follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

This spider visits the start URL, extracts each book's fields, and follows the "next" link until there are no more pages.
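The pagination step works because response.follow resolves a relative href against the URL of the page it was found on. A minimal standard-library sketch of that resolution, assuming the "next" links are relative hrefs like the illustrative values below (check the real ones in your browser):

```python
from urllib.parse import urljoin

# Hypothetical (page URL, href) pairs of the kind a "next" link can carry.
pages = [
    ("https://books.toscrape.com/", "catalogue/page-2.html"),
    ("https://books.toscrape.com/catalogue/page-2.html", "page-3.html"),
]

# Resolve each relative href against the page it appeared on.
resolved = [urljoin(page_url, href) for page_url, href in pages]
for url in resolved:
    print(url)
```

Because response.follow does this joining for you, the spider can pass the raw href straight through without building absolute URLs by hand.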

3. Selectors: CSS and XPath

Scrapy supports both CSS and XPath selectors on the response object:

  • CSS: response.css("h2.title::text").get() — concise and familiar from the browser. See our CSS selector cheat sheet.
  • XPath: response.xpath("//h2[@class='title']/text()").get() — more powerful for complex traversals. See the XPath cheat sheet.

Use .get() for the first match and .getall() for a list. Both selector types can be mixed freely in the same spider.

4. Run It and Export Data

Run the spider and write results straight to a file — Scrapy has built-in exporters:

scrapy crawl books -o books.json
# or CSV
scrapy crawl books -o books.csv

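No extra code is needed: the dicts you yield become JSON objects or CSV rows automatically. Note that -o appends to an existing file, which leaves an invalid JSON document on a second run; use -O (capital O) to overwrite instead.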
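If you would rather not retype export flags on every run, the same output can be declared once in settings.py through Scrapy's FEEDS setting. A sketch, with the file names and option values below chosen purely for illustration:

```python
# settings.py (fragment): declare exports once instead of passing flags each run.
FEEDS = {
    "books.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,  # replace the file on each run instead of appending
    },
    "books.csv": {
        "format": "csv",
        "fields": ["title", "price", "in_stock"],  # column order for the CSV
    },
}
```

With this in place, a plain scrapy crawl books should write both files.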

5. Add Rotating Proxies (the part that matters at scale)

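On a real crawl, many targets will rate-limit or block a single IP quickly. The fix is to route requests through rotating residential proxies. (Scrapy 2.13 and later prefer the async start() method for initial requests; start_requests() still works there but is deprecated.) The simplest reliable method is to set the proxy in start_requests so every request carries it: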

import scrapy

class BooksSpider(scrapy.Spider):
    name = "books"
    # Placeholder credentials and host: substitute your provider's endpoint.
    PROXY = "http://USER:PASS@PROXY_HOST:7777"

    def start_requests(self):
        for url in ["https://books.toscrape.com/"]:
            yield scrapy.Request(url, callback=self.parse,
                                 meta={"proxy": self.PROXY})

    def parse(self, response):
        # ...extract as above, and pass the proxy on followed requests
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse,
                                  meta={"proxy": self.PROXY})

Because SpyderProxy rotates the exit IP at the endpoint, every request can leave from a different residential address automatically — no proxy-list management in your spider. For the concept, see rotating proxies in Python.
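Passing meta on every request gets repetitive once a spider has several callbacks. One alternative is a small downloader middleware that attaches the proxy for you. The sketch below is a plain class with no Scrapy imports; the class name and the ROTATING_PROXY_URL setting are hypothetical names invented for this example, not part of Scrapy.

```python
class FixedProxyMiddleware:
    """Downloader middleware sketch: attach one proxy URL to every request."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # ROTATING_PROXY_URL is a hypothetical custom setting you would
        # define yourself in settings.py.
        return cls(crawler.settings.get("ROTATING_PROXY_URL"))

    def process_request(self, request, spider=None):
        # Only fill in the proxy when the request has not set its own.
        if self.proxy_url:
            request.meta.setdefault("proxy", self.proxy_url)
        return None  # let the request continue through the remaining middlewares


# settings.py (assumed project layout):
# DOWNLOADER_MIDDLEWARES = {
#     "bookscraper.middlewares.FixedProxyMiddleware": 350,
# }
```

The priority of 350 is an assumption chosen so the class runs before Scrapy's built-in HttpProxyMiddleware (750 by default), which is the component that reads the proxy meta key.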

6. Be a Good (and Unblocked) Citizen

In settings.py, a few values keep you both polite and harder to block:

# settings.py
DOWNLOAD_DELAY = 1.0            # pause between requests
CONCURRENT_REQUESTS = 8         # cap parallelism
AUTOTHROTTLE_ENABLED = True     # adapt speed to the server
ROBOTSTXT_OBEY = True           # respect robots.txt
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
RETRY_ENABLED = True

Set a current user agent, respect robots.txt, and throttle. Combined with residential proxies, this is what lets a Scrapy project run for hours instead of getting blocked in minutes. For tougher targets, read how to avoid detection while scraping.
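The values above can be tuned further with a few related settings. The names below are real Scrapy settings, but the numbers are illustrative starting points rather than recommendations for any particular site:

```python
# settings.py (fragment): optional tuning on top of the values above.
AUTOTHROTTLE_START_DELAY = 1.0          # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0           # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0   # average parallel requests per remote server
CONCURRENT_REQUESTS_PER_DOMAIN = 4      # per-domain cap, below the global limit of 8
RETRY_TIMES = 3                         # retries per request after the first attempt
RETRY_HTTP_CODES = [429, 500, 502, 503, 504]  # statuses treated as retryable
```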

When to Use Scrapy

Scrapy shines for large, structured crawls: thousands of pages, concurrency, pipelines that clean and store data, and built-in retries. For a quick one-page pull, requests plus BeautifulSoup is lighter. For JavaScript-heavy sites, pair Scrapy with a headless browser (via scrapy-playwright) since Scrapy alone does not execute JavaScript.
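As an illustration of the "pipelines that clean data" point, here is a minimal item pipeline sketch. It assumes items are plain dicts with a price string such as "£51.77", as in the books spider; the class name is hypothetical.

```python
import re


class PriceCleanerPipeline:
    """Item pipeline sketch: turn a price string like '£51.77' into a float."""

    # One or more digits, optionally followed by a decimal part.
    _number = re.compile(r"\d+(?:\.\d+)?")

    def process_item(self, item, spider=None):
        raw = item.get("price") or ""
        match = self._number.search(raw)
        # Keep the item either way; store None when no number is found.
        item["price"] = float(match.group()) if match else None
        return item


# settings.py (assumed project layout):
# ITEM_PIPELINES = {"bookscraper.pipelines.PriceCleanerPipeline": 300}
```

Every item the spider yields would then pass through process_item before it reaches the exporter, so the JSON or CSV output carries numeric prices.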

Frequently Asked Questions

What is Scrapy used for?

Scrapy is a Python framework for building web crawlers and scrapers at scale. It handles requests, parsing with CSS and XPath selectors, following links, retries, and exporting data to JSON or CSV — all in one structured, asynchronous project. It is the go-to tool for large, recurring scraping jobs.

How do I use proxies with Scrapy?

The simplest reliable method is to set meta={"proxy": "http://USER:PASS@host:port"} on each Scrapy Request, including followed links. With a rotating residential endpoint, the exit IP changes automatically per request, so you do not manage a proxy list in code. You can also use a downloader middleware for more control.
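One practical detail when building that URL: credentials containing characters such as @ or : need percent-encoding, or the URL will be parsed incorrectly. A small standard-library sketch; the helper name and the example host are hypothetical, and you should confirm with your provider how credentials are expected to be passed.

```python
from urllib.parse import quote


def build_proxy_url(user, password, host, port, scheme="http"):
    """Return a proxy URL with percent-encoded credentials."""
    # safe="" forces encoding of every reserved character, including @ and :
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"


# Example values are placeholders, not a real endpoint.
proxy_url = build_proxy_url("user", "p@ss:word", "proxy.example.com", 7777)
print(proxy_url)
```

The resulting string is what you would pass as meta={"proxy": proxy_url}.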

Is Scrapy better than BeautifulSoup?

They solve different scopes. BeautifulSoup is a parser you combine with requests for small, simple scrapes. Scrapy is a full framework with concurrency, pipelines, retries, and exporters for large structured crawls. Use BeautifulSoup for quick jobs and Scrapy when you need scale and structure.

Does Scrapy run JavaScript?

Not on its own — Scrapy fetches raw HTML and does not execute JavaScript. For sites that render content client-side, integrate a headless browser through scrapy-playwright so the page is rendered before Scrapy parses it.
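A sketch of the settings that integration typically needs, assuming scrapy-playwright is installed (pip install scrapy-playwright, then playwright install for the browser binaries). Check the plugin's README for the options that apply to your version.

```python
# settings.py (fragment): route downloads through scrapy-playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright needs the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt in per request with the "playwright" meta key, e.g.:
#   yield scrapy.Request(url, callback=self.parse, meta={"playwright": True})
```

Only requests that carry the playwright meta key are rendered in a browser; the rest go through Scrapy's normal downloader, which keeps the slower browser path limited to pages that need it.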

How do I avoid getting blocked with Scrapy?

Route requests through rotating residential proxies, set a current user agent, enable AUTOTHROTTLE and a DOWNLOAD_DELAY, respect robots.txt, and keep concurrency reasonable. The single biggest factor is IP diversity from residential proxies; the settings make your crawler behave politely on top of that.

Can Scrapy export to JSON and CSV?

Yes, natively. Run scrapy crawl <spider name> -o output.json (or output.csv) and the items you yield are written in the format implied by the file extension. No extra export code is required.

Conclusion

Scrapy turns scraping into a real engineering project: install it, create a project, write a spider with CSS or XPath selectors, follow pagination, and export to JSON or CSV in a few lines. The piece that makes it production-ready is rotating residential proxies plus polite settings — without IP diversity, even a perfect spider gets blocked at scale.

To keep your Scrapy crawls running, SpyderProxy residential proxies start at $1.75/GB with 10M+ IPs across 195+ countries, automatic rotation, and city-level targeting — drop the endpoint into your spider's request meta and go.
