Web scraping with Claude means using Anthropic's Claude models to do the part of scraping that traditionally breaks the most: turning messy, ever-changing HTML into clean, structured data. Claude does not fetch pages for you, rotate IPs, or solve CAPTCHAs — it replaces the brittle CSS-selector and XPath parsing layer with a model that reads a page the way a person would and returns exactly the JSON you asked for. You still bring the proxies and the HTTP client; Claude brings the extraction.
This guide covers the 2026 architecture for scraping with Claude, why proxies remain non-negotiable, four working patterns (structured extraction, cost-optimized batch extraction, vision for obfuscated pages, and agentic scraping), and the best practices that keep token bills sane. If you came from the ChatGPT side of this question, the same principles apply — see our guide to using ChatGPT for web scraping for the comparison.
There is a common misconception that you can paste a URL into Claude and get data back. That is not how it works, and understanding why is the whole game. A large language model is not a web crawler — it has no IP address of its own to send requests from, no browser to render JavaScript, and no mechanism to retry against an anti-bot wall. What Claude is exceptionally good at is reading: give it the HTML (or a screenshot) of a page and a description of the fields you want, and it returns structured data with near-human accuracy, no selectors required.
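So the modern pipeline splits into three jobs: access (proxies and an HTTP client fetch the page), rendering (a headless browser runs JavaScript when the page needs it), and extraction (Claude turns the resulting HTML or screenshot into structured data).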
The payoff is durability. Traditional scrapers break the moment a site changes a class name or reorders its DOM. A Claude-based extractor keeps working through redesigns because it understands meaning, not markup. That resilience is exactly what AI scraping brings to the table.
This is the single most important point in the guide. Claude solves parsing; it does nothing for access. The moment you scrape at any real volume, the target site sees a burst of requests from one IP and does what every site does: rate-limits you, serves a CAPTCHA, or returns an HTTP 403. No model fixes that — only IP diversity does.
That is why a Claude scraper still needs a proxy layer underneath it.
A clean mental model: proxies get you the HTML, Claude makes sense of it. Skip the proxy layer and you will have a brilliant parser with nothing to parse, because every request returns a block page. If anti-bot systems are your bottleneck, read how to avoid detection while scraping and how to bypass Cloudflare first.
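To make the "nothing to parse" failure concrete, here is a minimal sketch of a check that runs before any HTML is sent to Claude. The function name, the extra status codes (429, 503), and the marker strings are illustrative assumptions rather than an authoritative list. Real block pages vary by site and by anti-bot vendor.

```python
# Heuristic block-page detector. The status codes and marker strings are
# illustrative assumptions, not an exhaustive or authoritative list.
BLOCK_STATUS_CODES = {403, 429, 503}
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic")


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True when a response looks like a block page rather than content."""
    if status_code in BLOCK_STATUS_CODES:
        return True
    # Some sites answer 200 with a challenge page, so inspect the start of the body too.
    head = body[:5000].lower()
    return any(marker in head for marker in BLOCK_MARKERS)
```

Skipping the model call for responses that fail this check avoids paying for tokens on pages that contain no data.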
The cleanest way to get reliable, schema-conformant output from Claude is tool use (function calling). Instead of hoping the model returns valid JSON in prose, you define a tool whose input schema is your data shape and force Claude to call it. The result arrives as a structured object that follows your schema rather than as free text, though it is still worth validating before you save it.
Install the SDK and an HTTP client:
pip install anthropic requests
Then fetch through a proxy and hand the HTML to Claude:
import requests
from anthropic import Anthropic
client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
# Fetch the page through a SpyderProxy residential endpoint
PROXY = "http://USER:PASS@PROXY_HOST:7777"  # placeholders: substitute your own credentials and endpoint host
html = requests.get(
    "https://example.com/product/123",
    proxies={"http": PROXY, "https": PROXY},
    timeout=30,
).text
# Define the output shape as a tool. Claude is forced to fill it in.
tools = [{
    "name": "save_product",
    "description": "Save the structured product data extracted from the page.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
            "in_stock": {"type": "boolean"},
            # Allow null explicitly so "absent" is a valid value for this field
            "rating": {"type": ["number", "null"], "description": "0-5, null if absent"},
        },
        "required": ["name", "price", "currency", "in_stock"],
    },
}]
resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "save_product"},
    messages=[{
        "role": "user",
        # Truncate very large pages to keep the request size bounded
        "content": "Extract the product fields from this HTML:\n\n" + html[:60000],
    }],
)
# The forced tool call arrives as a tool_use block; its input is the extracted record
product = next(block.input for block in resp.content if block.type == "tool_use")
print(product)
# {'name': 'Acme Wireless Mouse', 'price': 24.99, 'currency': 'USD', 'in_stock': True, 'rating': 4.6}
Three things make this robust. First, tool_choice forces the tool, so you never parse free text. Second, the schema doubles as documentation — field descriptions guide the model. Third, because Claude reasons over the whole page, it survives the class-name changes that would silently break a BeautifulSoup selector.
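One detail worth guarding against: a bare `next(...)` over the response content raises `StopIteration` if a response ever comes back without a `tool_use` block. The sketch below is one defensive way to read the block. The helper name `get_tool_input` is hypothetical, and the `SimpleNamespace` objects are stand-ins shaped like response content blocks so the example runs without an API call.

```python
from types import SimpleNamespace


def get_tool_input(content_blocks, tool_name):
    """Return the input dict of the first tool_use block named tool_name, or None."""
    for block in content_blocks:
        is_tool_use = getattr(block, "type", None) == "tool_use"
        if is_tool_use and getattr(block, "name", None) == tool_name:
            return block.input
    return None


# Stand-in objects shaped like response content blocks, so the sketch runs offline.
fake_content = [
    SimpleNamespace(type="text", text="Calling the tool."),
    SimpleNamespace(
        type="tool_use",
        name="save_product",
        input={"name": "Acme Wireless Mouse", "price": 24.99},
    ),
]
product = get_tool_input(fake_content, "save_product")
```

Returning `None` lets the caller treat a missing tool call as an ordinary extraction failure and log or retry the page.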
For high-volume, well-structured pages (product listings, directories, search results), Claude Haiku is fast and inexpensive and usually all you need. Reserve Claude Sonnet for pages with ambiguous layout, nested tables, or extraction that requires light reasoning. A practical rule: start on Haiku, and only escalate the pages that fail validation.
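The "start on Haiku, escalate on failure" rule can be sketched as a small loop over model tiers. Everything here is an assumption for illustration: the function name, the idea of passing the extractor and validator in as callables, and the fake extractor that stands in for the real API call so the example runs offline.

```python
from typing import Callable, Optional


def extract_with_escalation(
    html: str,
    extract: Callable[[str, str], dict],
    is_valid: Callable[[dict], bool],
    models: tuple = ("claude-haiku-4-5", "claude-sonnet-4-6"),
) -> Optional[dict]:
    """Try each model in order and return the first record that passes validation."""
    for model in models:
        record = extract(html, model)
        if is_valid(record):
            return record
    return None  # every tier failed validation; the caller can log the page


# Offline stand-in for the API call: the cheap tier "misses" the price.
def fake_extract(html: str, model: str) -> dict:
    if model == "claude-haiku-4-5":
        return {"name": "Acme Wireless Mouse", "price": 0}
    return {"name": "Acme Wireless Mouse", "price": 24.99}


result = extract_with_escalation(
    "<html>...</html>", fake_extract, lambda r: r.get("price", 0) > 0
)
```

Because only failed pages reach the larger model, the extra cost scales with the failure rate and not with the size of the crawl.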
Token cost is the thing that surprises teams new to LLM extraction. Two built-in features bring it down dramatically.
Prompt caching. If you extract the same shape from thousands of pages, your instructions and schema are identical every call — only the HTML changes. Mark the static portion with a cache breakpoint and Anthropic stores it server-side, charging a fraction of the price on cache hits:
resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_EXTRACTION_INSTRUCTIONS,  # your static instructions, reused on every page
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": html[:60000]}],
)
On a large reused system prompt, cache reads cost roughly a tenth of normal input tokens — a meaningful saving across a crawl of any size.
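A rough way to see what caching does and does not save is to compare input-token totals with and without it. This is a back-of-envelope sketch, not a billing calculator. The multipliers (1.25 for a cache write, 0.10 for a cache read) and the token counts are assumptions to replace with current pricing and your own measurements.

```python
def relative_input_cost(pages, static_tokens, html_tokens,
                        write_multiplier=1.25, read_multiplier=0.10):
    """Compare input cost with and without caching, in units of uncached tokens.

    Assumes one cache write followed by a cache hit on every later page, which
    only holds if calls keep arriving before the cache entry expires.
    """
    uncached = pages * (static_tokens + html_tokens)
    cached = (
        static_tokens * write_multiplier                  # first call writes the cache
        + (pages - 1) * static_tokens * read_multiplier   # later calls read it
        + pages * html_tokens                             # page HTML is never cached
    )
    return uncached, cached


uncached, cached = relative_input_cost(pages=1000, static_tokens=5000, html_tokens=10000)
print(f"cached run costs {cached / uncached:.0%} of the uncached run")
```

With these made-up numbers the cached run comes out at roughly 70% of the uncached one, because the per-page HTML dominates the total. Caching removes most of the cost of the static prompt, and only that part.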
Batch API. When extraction does not need to be real time (overnight crawls, backfills), submit requests as a batch. It is processed asynchronously and is about half the price of synchronous calls:
batch = client.messages.batches.create(requests=[
    {
        "custom_id": "page-123",  # your own ID, used to match results back to pages
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": html_123[:60000]}],  # html_123: HTML you fetched earlier
        },
    },
    # ... thousands more
])
Caching plus batching together can cut a large extraction job's cost by more than half versus naive per-page synchronous calls. Trim the HTML before sending it, too — strip <script>, <style>, and <svg> blocks so you are not paying to tokenize code the model will ignore.
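The trimming step can be sketched with the standard library alone. The function name `trim_html` is hypothetical, and regular expressions are a blunt instrument for HTML, so treat this as a starting point and not a robust sanitizer. One caveat: it also removes `<script type="application/ld+json">` blocks, which sometimes carry useful structured data, so consider exempting those if your target pages have them.

```python
import re

# Matches <script>...</script>, <style>...</style>, and <svg>...</svg>, contents included.
_NOISE_BLOCKS = re.compile(
    r"<(script|style|svg)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_BLANK_RUNS = re.compile(r"\n\s*\n+")


def trim_html(html: str) -> str:
    """Drop script/style/svg blocks, then collapse the blank lines they leave behind."""
    html = _NOISE_BLOCKS.sub("", html)
    return _BLANK_RUNS.sub("\n", html).strip()


sample = """<html><head><style>.a{color:red}</style>
<script type="text/javascript">var x = "<b>";</script></head>
<body><h1>Acme Wireless Mouse</h1><svg viewBox="0 0 1 1"><path d="M0 0"/></svg>
<span class="price">$24.99</span></body></html>"""
trimmed = trim_html(sample)
print(trimmed)
```

Running the trimmer before the `html[:60000]` truncation also means the character budget is spent on content instead of inline JavaScript.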
Some sites deliberately make their HTML useless — prices rendered as images, data drawn to a <canvas>, or class names randomized on every load. When the markup fights you, skip it: screenshot the rendered page with a headless browser and send the image to Claude, which reads it visually.
import base64
from anthropic import Anthropic
client = Anthropic()
# page.png is a screenshot of the rendered page taken with your headless browser
with open("page.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode()
resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png", "data": img_b64,
            }},
            {"type": "text", "text": "Return the visible price and product title as JSON."},
        ],
    }],
)
print(resp.content[0].text)
Vision extraction costs more per page than text, so use it as a targeted fallback for the pages that defeat HTML parsing, not as your default path.
The patterns above treat Claude as a parser. You can also let it drive. By giving Claude tools — a "fetch this URL through a proxy" function, a "click this element" function — it can decide which links to follow and when to paginate, handling sites whose structure it discovers as it goes. The Model Context Protocol (MCP) standardizes how you expose those tools, so the same proxy-backed fetch tool works across your agents.
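In an agentic loop, the model's response contains `tool_use` blocks, your code runs the requested tool, and the outcome goes back to the model as a `tool_result` block in the next user message. The sketch below covers only the dispatch step. The tool name `fetch_url`, its parameters, and the `run_tool_call` helper are hypothetical, and the lambda is an offline stand-in for a real proxy-backed fetch.

```python
# Tool definition in the same shape as an extraction tool; the name and
# parameters are hypothetical.
FETCH_TOOL = {
    "name": "fetch_url",
    "description": "Fetch a URL through the proxy and return the page HTML.",
    "input_schema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}


def run_tool_call(tool_use_id, name, tool_input, handlers):
    """Execute one requested tool and wrap the outcome as a tool_result block."""
    handler = handlers.get(name)
    if handler is None:
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": f"unknown tool: {name}", "is_error": True}
    try:
        output = handler(**tool_input)
    except Exception as exc:  # report failures to the model instead of crashing the loop
        return {"type": "tool_result", "tool_use_id": tool_use_id,
                "content": f"{type(exc).__name__}: {exc}", "is_error": True}
    return {"type": "tool_result", "tool_use_id": tool_use_id, "content": str(output)}


# Offline stand-in for the proxy-backed fetch.
handlers = {"fetch_url": lambda url: f"<html><title>{url}</title></html>"}
result = run_tool_call(
    "toolu_demo", "fetch_url", {"url": "https://example.com/page/2"}, handlers
)
```

Reporting errors back as `tool_result` content gives the model a chance to try another link or stop, which suits the exploratory jobs this pattern is meant for.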
Agentic scraping is powerful for exploratory or multi-step jobs (log in, navigate, extract across pages) but it is slower and more expensive than a fixed pipeline because the model is in the loop on every step. For predictable, high-volume scraping of known page types, the Method 1 pipeline wins on cost and speed every time. Match the approach to the job. For a broader look at the tooling landscape, see our roundup of the best AI web scraping tools.
Here is the shape of a production extractor — proxy fetch, optional retry, Claude extraction, validation:
import time
import requests
from anthropic import Anthropic
client = Anthropic()
PROXY = "http://USER:PASS@PROXY_HOST:7777"  # placeholders: substitute your own credentials and endpoint host
def fetch(url, retries=3):
    for attempt in range(retries):
        try:
            r = requests.get(url, proxies={"http": PROXY, "https": PROXY}, timeout=30)
            if r.status_code == 200:
                return r.text
        except requests.RequestException:
            pass  # network or proxy error: fall through and retry
        if attempt < retries - 1:
            time.sleep(2 ** attempt)  # back off before the next attempt
    raise RuntimeError("blocked after retries: " + url)
def extract(html):
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        tools=TOOLS,  # the tool list holding the save_product definition
        tool_choice={"type": "tool", "name": "save_product"},
        messages=[{"role": "user", "content": html[:60000]}],
    )
    return next(b.input for b in resp.content if b.type == "tool_use")
# urls, save(), and log_failure() are supplied by your application
for url in urls:
    try:
        data = extract(fetch(url))
        if data["price"] > 0:  # cheap validation gate
            save(data)
    except Exception as e:
        log_failure(url, e)
Because rotation happens at the proxy endpoint, every call to fetch() can exit from a different residential IP automatically — no IP management in your code. That separation is what lets the same script scale from 100 to 1,000,000 pages.
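The `price > 0` gate in the pipeline is deliberately cheap. If you want a slightly stricter check before saving, something like the following sketch could replace it. The function name and the specific rules are assumptions based on the product fields used in this guide, so adjust them to your own schema.

```python
def validate_product(record: dict) -> list:
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in ("name", "price", "currency", "in_stock"):
        if field not in record:
            problems.append(f"missing field: {field}")

    price = record.get("price")
    if "price" in record and (
        isinstance(price, bool) or not isinstance(price, (int, float)) or price <= 0
    ):
        problems.append("price must be a positive number")

    currency = record.get("currency")
    if "currency" in record and not (
        isinstance(currency, str) and len(currency) == 3
        and currency.isalpha() and currency.isupper()
    ):
        problems.append("currency must be a three-letter uppercase code")

    rating = record.get("rating")
    if rating is not None and not (
        isinstance(rating, (int, float)) and 0 <= rating <= 5
    ):
        problems.append("rating must be between 0 and 5")
    return problems
```

Returning the list of problems, not just a boolean, makes failure logs more useful when deciding which pages to retry on a larger model.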
Force tool_choice for structured jobs. Never regex JSON out of prose.
Using Claude to parse pages does not change the legal picture: the same rules that govern any web scraping apply. Scraping publicly available data is broadly permissible in many jurisdictions, but the details matter. Respect a site's Terms of Service, do not collect personal data in ways that violate GDPR or similar laws, honor robots.txt where it applies, and never scrape content behind a login you are not authorized to access. The model is just the parser; responsibility for what you collect and how you use it rests with you. When in doubt, consult a lawyer for your specific use case.
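For the robots.txt part, Python's standard library already includes a parser. The sketch below assumes you have fetched the site's robots.txt text yourself, for example through the same proxy-backed client. The helper name `build_robots_checker` and the `example-scraper` user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str):
    """Parse robots.txt text and return a function that says whether a URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


sample_rules = """User-agent: *
Disallow: /private/
"""
allowed = build_robots_checker(sample_rules, "example-scraper")
print(allowed("https://example.com/product/123"))
print(allowed("https://example.com/private/data"))
```

Calling the checker inside the fetch step keeps disallowed URLs out of the crawl before any request is made or any tokens are spent.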
No. Claude is a language model, not a crawler — it has no IP address, browser, or fetch capability of its own. You fetch the page with your own code (through a proxy), then send the HTML or a screenshot to Claude for extraction. Claude replaces the parsing layer, not the request layer.
Yes, and this is the most common misunderstanding. Claude handles parsing, not access. At any real volume the target site will rate-limit or block a single IP. Residential or mobile proxies provide the IP diversity that keeps requests flowing; without them you get block pages with nothing for Claude to read.
Both follow the same architecture: your code fetches through proxies, the model extracts. Claude's tool use is well suited to forcing schema-conformant JSON, and its long context window lets you pass large pages in one call. The right choice usually comes down to your existing stack and pricing. See our ChatGPT scraping guide for that side.
Cost scales with tokens, so it depends on page size and volume. Using Claude Haiku, trimming HTML before sending, caching your static instructions, and batching non-urgent jobs can together cut the bill by well over half compared to naive per-page calls on a larger model. For most structured pages, Haiku plus caching keeps per-page cost very low.
Start with Claude Haiku — it is fast and cheap and handles well-structured pages fine. Escalate to Claude Sonnet only for pages with ambiguous layout or extraction that needs reasoning. A good pattern is Haiku by default, with failed-validation pages retried on Sonnet.
Yes, via vision. Screenshot the rendered page with a headless browser and send the image to Claude, which reads the price or text visually. It costs more than text extraction, so use it as a fallback for pages that defeat HTML parsing rather than as the default.
Yes. SpyderProxy provides the fetch-layer IPs; Claude provides extraction. Point your HTTP client at a SpyderProxy residential endpoint, request the page, and pass the HTML to Claude. Rotation and geo-targeting happen at the proxy endpoint, so your extraction code stays unchanged as you scale.
Scraping with Claude is not about replacing your scraper — it is about replacing the part of it that always breaks. Let proxies own access and rotation, let a headless browser handle JavaScript when needed, and let Claude turn whatever HTML comes back into clean, schema-conformant data that survives site redesigns. Add tool use for reliable JSON, prompt caching and the Batch API for cost, and vision as a fallback, and you have a pipeline that scales without a wall of fragile selectors.
The one piece Claude can never provide is the IP. To feed it pages instead of block screens, SpyderProxy residential proxies start at $1.75/GB with 10M+ IPs across 195+ countries, automatic rotation, sticky sessions, and city-level targeting — the access layer your Claude extractor runs on.