Scrapy is one of the most capable open-source web scraping frameworks for Python: an asynchronous engine that handles requests, parsing, pagination, retries, and data export in one structured project, instead of the hand-rolled scripts you build with requests and BeautifulSoup. This tutorial takes you from install to a working spider, then adds the piece most real Scrapy projects need to run at scale: rotating residential proxies so you do not get blocked.
If you want the simpler requests-plus-BeautifulSoup approach first, see how to build a web scraper in Python; Scrapy is the step up when you need concurrency, structure, and scale.
pip install scrapy
Scrapy runs on Python 3.9+ and installs its own networking and parsing dependencies (Twisted, parsel, lxml) alongside it, so you do not need requests or a separate HTML parser.
Scrapy organizes work into a project with one or more spiders. Create the project:
scrapy startproject bookscraper
cd bookscraper
Then create a spider file at bookscraper/spiders/books.py:
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    start_urls = ["https://books.toscrape.com/"]

    def parse(self, response):
        for book in response.css("article.product_pod"):
            # The availability text sits after an icon element, so join every
            # text node instead of reading only the first (whitespace) one.
            availability = "".join(book.css("p.instock.availability::text").getall())
            yield {
                "title": book.css("h3 a::attr(title)").get(),
                "price": book.css("p.price_color::text").get(),
                "in_stock": "In stock" in availability,
            }
        # follow pagination
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
This spider visits the start URL, extracts each book's fields, and follows the "next" link until there are no more pages.
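Scrapy supports both CSS and XPath selectors on the response object, through response.css() and response.xpath().
Use .get() for the first match (it returns None when nothing matches, unless you pass default=) and .getall() for a list of every match. Both selector types can be mixed freely in the same spider, and one can be chained onto the result of the other.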
Run the spider and write results straight to a file — Scrapy has built-in exporters:
scrapy crawl books -o books.json
# or CSV
scrapy crawl books -o books.csv
No extra code needed: the dicts you yield become JSON objects or CSV rows automatically.
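If you would rather not repeat the output flag on every run, the same exports can be declared once in settings.py through Scrapy's FEEDS setting. The sketch below is one possible configuration; the output/ paths and the field order are assumptions for this example.

```python
# settings.py
# Each key is an output path; each value configures that feed.
FEEDS = {
    "output/books.json": {
        "format": "json",
        "encoding": "utf8",
        "overwrite": True,  # replace the file on each run instead of appending
    },
    "output/books.csv": {
        "format": "csv",
        "fields": ["title", "price", "in_stock"],  # column order in the CSV
    },
}
```

On the command line, note that -o appends to an existing file, which produces invalid JSON on a second run. Recent Scrapy versions also accept -O (capital) to overwrite instead.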
Run a real crawl from a single IP and many targets will rate-limit or block it quickly. A common fix is to route requests through rotating residential proxies. The simplest reliable method is to set the proxy in start_requests, and again on every followed request, so each request carries it:
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    # Placeholder endpoint: substitute your provider's USER, PASS, and HOST
    PROXY = "http://USER:PASS@HOST:7777"

    def start_requests(self):
        for url in ["https://books.toscrape.com/"]:
            yield scrapy.Request(url, callback=self.parse,
                                 meta={"proxy": self.PROXY})

    def parse(self, response):
        # ...extract as above, and pass the proxy on followed requests
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse,
                                  meta={"proxy": self.PROXY})
Because SpyderProxy rotates the exit IP at the endpoint, every request can leave from a different residential address automatically — no proxy-list management in your spider. For the concept, see rotating proxies in Python.
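Passing meta on every request gets repetitive as a spider grows. One alternative, sketched here under assumptions, is a small downloader middleware that fills in the proxy for any request that does not already have one. The class name RotatingProxyMiddleware, the PROXY_URL setting, and the bookscraper.middlewares module path are hypothetical names chosen for this example. Only the from_crawler and process_request hooks and the meta["proxy"] key come from Scrapy itself.

```python
# bookscraper/middlewares.py (assumed location)


class RotatingProxyMiddleware:
    """Hypothetical downloader middleware: adds a proxy to every request."""

    def __init__(self, proxy_url):
        self.proxy_url = proxy_url

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_URL is a custom setting name assumed for this sketch
        return cls(crawler.settings.get("PROXY_URL"))

    def process_request(self, request, spider):
        # Respect a proxy that a spider set explicitly; otherwise use ours
        if self.proxy_url:
            request.meta.setdefault("proxy", self.proxy_url)
        return None  # let Scrapy continue processing the request


# settings.py additions (names and priority are assumptions):
# PROXY_URL = "http://USER:PASS@HOST:7777"
# DOWNLOADER_MIDDLEWARES = {
#     "bookscraper.middlewares.RotatingProxyMiddleware": 350,
# }
```

The priority of 350 is chosen so this runs before Scrapy's built-in HttpProxyMiddleware (750 by default), which is the component that reads meta["proxy"] and handles the credentials in the URL.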
In settings.py, a few values keep you both polite and harder to block:
# settings.py
DOWNLOAD_DELAY = 1.0 # pause between requests
CONCURRENT_REQUESTS = 8 # cap parallelism
AUTOTHROTTLE_ENABLED = True # adapt speed to the server
ROBOTSTXT_OBEY = True # respect robots.txt
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36"
RETRY_ENABLED = True
Set a current user agent, respect robots.txt, and throttle. Combined with residential proxies, this is what lets a Scrapy project run for hours instead of getting blocked in minutes. For tougher targets, read how to avoid detection while scraping.
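The settings above can be tuned further. The fragment below is a sketch of related options that Scrapy provides for AutoThrottle, per-domain concurrency, and retries. The specific values are illustrative assumptions, not recommendations for any particular site.

```python
# settings.py (illustrative values)
AUTOTHROTTLE_START_DELAY = 1.0         # initial delay in seconds
AUTOTHROTTLE_MAX_DELAY = 30.0          # upper bound when the server is slow
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0  # average parallel requests per remote site
CONCURRENT_REQUESTS_PER_DOMAIN = 4     # hard cap per domain
RETRY_TIMES = 3                        # retries after the first failed attempt
RETRY_HTTP_CODES = [408, 429, 500, 502, 503, 504]  # statuses worth retrying
DOWNLOAD_TIMEOUT = 30                  # seconds before a request is abandoned
```

With AutoThrottle enabled, DOWNLOAD_DELAY acts as the minimum delay and AUTOTHROTTLE_MAX_DELAY as the maximum, so the crawler slows down when responses get slower.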
Scrapy shines for large, structured crawls: thousands of pages, concurrency, pipelines that clean and store data, and built-in retries. For a quick one-page pull, requests plus BeautifulSoup is lighter. For JavaScript-heavy sites, pair Scrapy with a headless browser (via scrapy-playwright) since Scrapy alone does not execute JavaScript.
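To make the "pipelines that clean and store data" point concrete, here is a minimal sketch of an item pipeline that converts the scraped price string into a number. PriceCleanerPipeline and the bookscraper.pipelines path are hypothetical names for this example. The process_item(item, spider) hook is the long-standing item pipeline interface, though the exact signature is worth checking against the documentation for your Scrapy version.

```python
# bookscraper/pipelines.py (assumed location)
import re


class PriceCleanerPipeline:
    """Hypothetical pipeline: turns a price like '£51.77' into the float 51.77."""

    PRICE_RE = re.compile(r"\d+(?:\.\d+)?")

    def process_item(self, item, spider):
        raw = item.get("price") or ""
        match = self.PRICE_RE.search(raw)
        # Leave None when no number is found so bad rows are easy to spot later
        item["price"] = float(match.group()) if match else None
        return item


# settings.py addition (path and priority are assumptions):
# ITEM_PIPELINES = {"bookscraper.pipelines.PriceCleanerPipeline": 300}
```

Once enabled, every dict the spider yields passes through process_item before it reaches the exporter, so the JSON or CSV output contains numeric prices.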
Scrapy is a Python framework for building web crawlers and scrapers at scale. It handles requests, parsing with CSS and XPath selectors, following links, retries, and exporting data to JSON or CSV — all in one structured, asynchronous project. It is the go-to tool for large, recurring scraping jobs.
To use proxies with Scrapy, the simplest reliable method is to set meta={"proxy": "http://USER:PASS@host:port"} on each Scrapy Request, including followed links. With a rotating residential endpoint, the exit IP changes automatically per request, so you do not manage a proxy list in code. You can also use a downloader middleware for more control.
Scrapy and BeautifulSoup cover different scopes. BeautifulSoup is a parser you combine with requests for small, simple scrapes. Scrapy is a full framework with concurrency, pipelines, retries, and exporters for large structured crawls. Use BeautifulSoup for quick jobs and Scrapy when you need scale and structure.
Scrapy cannot scrape JavaScript-rendered pages on its own: it fetches raw HTML and does not execute JavaScript. For sites that render content client-side, integrate a headless browser through scrapy-playwright so the page is rendered before Scrapy parses it.
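As a rough sketch of that integration, scrapy-playwright is enabled through two settings and then switched on per request. The fragment assumes the package is installed (pip install scrapy-playwright, followed by playwright install chromium). Check the project's README for options that apply to your version.

```python
# settings.py additions for scrapy-playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright needs the asyncio-based Twisted reactor
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# In the spider, opt individual requests into browser rendering:
#   yield scrapy.Request(url, meta={"playwright": True}, callback=self.parse)
```

Requests without the playwright meta key still go through Scrapy's normal downloader, so only the pages that need rendering pay the cost of a browser.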
To avoid getting blocked, route requests through rotating residential proxies, set a current user agent, enable AutoThrottle (AUTOTHROTTLE_ENABLED) and a DOWNLOAD_DELAY, respect robots.txt, and keep concurrency reasonable. The single biggest factor is IP diversity from residential proxies; the settings make your crawler behave politely on top of that.
Scrapy exports to JSON and CSV natively. Run scrapy crawl spider -o output.json (or output.csv) and the items you yield are written automatically in that format. No extra export code is required.
Scrapy turns scraping into a real engineering project: install it, create a project, write a spider with CSS or XPath selectors, follow pagination, and export to JSON or CSV in a few lines. The piece that makes it production-ready is rotating residential proxies plus polite settings — without IP diversity, even a perfect spider gets blocked at scale.
To keep your Scrapy crawls running, SpyderProxy residential proxies start at $1.75/GB with 10M+ IPs across 195+ countries, automatic rotation, and city-level targeting — drop the endpoint into your spider's request meta and go.