
Web Scraping With Claude: A 2026 Guide (Code + Proxies)

Daniel K. | Published Fri May 29 2026 | 11 min read

Web scraping with Claude means using Anthropic's Claude models to do the part of scraping that traditionally breaks the most: turning messy, ever-changing HTML into clean, structured data. Claude does not fetch pages for you, rotate IPs, or solve CAPTCHAs — it replaces the brittle CSS-selector and XPath parsing layer with a model that reads a page the way a person would and returns exactly the JSON you asked for. You still bring the proxies and the HTTP client; Claude brings the extraction.

This guide covers the 2026 architecture for scraping with Claude, why proxies remain non-negotiable, four working patterns (structured extraction, cost-optimized batch extraction, vision for obfuscated pages, and agentic scraping), and the best practices that keep token bills sane. If you came from the ChatGPT side of this question, the same principles apply — see our guide to using ChatGPT for web scraping for the comparison.

What "Web Scraping With Claude" Actually Means

There is a common misconception that you can paste a URL into Claude and get data back. That is not how it works, and understanding why is the whole game. A large language model is not a web crawler — it has no IP address of its own to send requests from, no browser to render JavaScript, and no mechanism to retry against an anti-bot wall. What Claude is exceptionally good at is reading: give it the HTML (or a screenshot) of a page and a description of the fields you want, and it returns structured data with near-human accuracy, no selectors required.

So the modern pipeline splits into three jobs:

  1. Fetch — your code requests the page through a proxy so you are not blocked or rate-limited. This is where residential proxies live.
  2. Render (only if needed) — a headless browser executes JavaScript for single-page apps that ship an empty HTML shell.
  3. Extract — Claude turns the resulting HTML or screenshot into the exact schema you defined.
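
One way to decide whether step 2 is needed is to check how much visible text the raw HTML actually contains. The sketch below is a heuristic under stated assumptions, not a definitive test: the function name needs_rendering and the 200-character threshold are hypothetical and should be tuned per site.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def needs_rendering(html, min_chars=200):
    """Guess whether a page is an empty JavaScript shell (assumed threshold)."""
    parser = _VisibleText()
    parser.feed(html)
    visible = " ".join(parser.chunks)
    return len(visible) < min_chars
```

If the function returns True, route the URL to a headless browser before extraction; otherwise the plain HTTP response is probably enough and you skip the rendering cost.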

The payoff is durability. Traditional scrapers break the moment a site changes a class name or reorders its DOM. A Claude-based extractor keeps working through redesigns because it understands meaning, not markup. That resilience is exactly what AI scraping brings to the table.

Why You Still Need Proxies (Claude Will Not Save You Here)

This is the single most important point in the guide. Claude solves parsing; it does nothing for access. The moment you scrape at any real volume, the target site sees a burst of requests from one IP and does what every site does: rate-limits you, serves a CAPTCHA, or returns an HTTP 403. No model fixes that — only IP diversity does.

That is why a Claude scraper still needs:

  • Residential or mobile IPs so each request looks like an ordinary household or phone connection rather than a datacenter bot.
  • Rotation so thousands of requests are spread across thousands of IPs. See rotating proxies in Python for the pattern.
  • Geo-targeting when prices, search results, or availability differ by country.

A clean mental model: proxies get you the HTML, Claude makes sense of it. Skip the proxy layer and you will have a brilliant parser with nothing to parse, because every request returns a block page. If anti-bot systems are your bottleneck, read how to avoid detection while scraping and how to bypass Cloudflare first.
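
Because a block page wastes tokens if it reaches Claude, it can help to screen responses before extraction. The following is a rough sketch: the status codes and marker phrases are illustrative assumptions, not a complete list, and looks_blocked is a hypothetical helper name.

```python
# Status codes that commonly signal rate limiting or an anti-bot wall (assumed set).
BLOCK_STATUSES = {403, 429, 503}

# Phrases that often appear on challenge pages (illustrative, not exhaustive).
BLOCK_MARKERS = ("captcha", "access denied", "unusual traffic", "verify you are human")


def looks_blocked(status_code, body):
    """Return True when a response looks like a block page rather than content."""
    if status_code in BLOCK_STATUSES:
        return True
    sample = body[:5000].lower()  # challenge text usually appears near the top
    return any(marker in sample for marker in BLOCK_MARKERS)
```

A response flagged here is a signal to retry through a different IP, not to send the body to the model.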

Method 1: HTML to Structured JSON With Tool Use

The cleanest way to get reliable, schema-conformant output from Claude is tool use (function calling). Instead of hoping the model returns valid JSON in prose, you define a tool whose input schema is your data shape and force Claude to call it. The result arrives as a parsed object shaped by your schema rather than as text you have to parse, though it is still worth validating before you store it.

Install the SDK and an HTTP client:

pip install anthropic requests

Then fetch through a proxy and hand the HTML to Claude:

import requests
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Fetch the page through a SpyderProxy residential endpoint.
# Replace USER, PASS, and PROXY_HOST with your own credentials and endpoint host.
PROXY = "http://USER:PASS@PROXY_HOST:7777"
html = requests.get(
    "https://example.com/product/123",
    proxies={"http": PROXY, "https": PROXY},
    timeout=30,
).text

# Define the output shape as a tool. Claude is forced to fill it in.
tools = [{
    "name": "save_product",
    "description": "Save the structured product data extracted from the page.",
    "input_schema": {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "price": {"type": "number"},
            "currency": {"type": "string", "description": "ISO 4217 code, e.g. USD"},
            "in_stock": {"type": "boolean"},
            "rating": {"type": "number", "description": "0-5; omit if the page shows no rating"},
        },
        "required": ["name", "price", "currency", "in_stock"],
    },
}]

resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    tools=tools,
    tool_choice={"type": "tool", "name": "save_product"},
    messages=[{
        "role": "user",
        "content": "Extract the product fields from this HTML:\n\n" + html[:60000],
    }],
)

product = next(block.input for block in resp.content if block.type == "tool_use")
print(product)
# {'name': 'Acme Wireless Mouse', 'price': 24.99, 'currency': 'USD', 'in_stock': True, 'rating': 4.6}

Three things make this robust. First, tool_choice forces the tool, so you never parse free text. Second, the schema doubles as documentation — field descriptions guide the model. Third, because Claude reasons over the whole page, it survives the class-name changes that would silently break a BeautifulSoup selector.
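
Forced tool use shapes the output, but it does not check that the values make sense, so a small validator after extraction is a reasonable safeguard. This is a minimal sketch for the save_product fields defined above; validate_product and its specific rules are assumptions, not part of the SDK.

```python
def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_product(data):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    if not isinstance(data.get("name"), str) or not data["name"].strip():
        problems.append("name must be a non-empty string")
    if not _is_number(data.get("price")) or data["price"] <= 0:
        problems.append("price must be a positive number")
    currency = data.get("currency")
    if not (isinstance(currency, str) and len(currency) == 3
            and currency.isalpha() and currency.isupper()):
        problems.append("currency must be a 3-letter uppercase code")
    if not isinstance(data.get("in_stock"), bool):
        problems.append("in_stock must be a boolean")
    rating = data.get("rating")
    if rating is not None and not (_is_number(rating) and 0 <= rating <= 5):
        problems.append("rating must be between 0 and 5 when present")
    return problems
```

Returning a list of problems rather than a boolean makes failures easy to log and to group by cause.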

Choosing a model

For high-volume, well-structured pages (product listings, directories, search results), Claude Haiku is fast and inexpensive and usually all you need. Reserve Claude Sonnet for pages with ambiguous layout, nested tables, or extraction that requires light reasoning. A practical rule: start on Haiku, and only escalate the pages that fail validation.
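
The "start cheap, escalate failures" rule can be sketched as a small loop. The code below assumes each extractor is a callable that wraps a Method 1 style call for one model; extract_with_escalation, the labels, and the is_valid callback are hypothetical names for illustration.

```python
def extract_with_escalation(html, extractors, is_valid):
    """Try each (label, extractor) pair in order and return the first valid result.

    extractors: list of (label, callable) ordered from cheapest to strongest.
    is_valid:   callable that returns True when a record passes validation.
    """
    last_error = None
    for label, extractor in extractors:
        try:
            data = extractor(html)
        except Exception as exc:  # an API or parsing failure counts as a miss
            last_error = exc
            continue
        if is_valid(data):
            return label, data
    raise ValueError(f"no extractor produced a valid record (last error: {last_error})")
```

Returning the label alongside the data lets you track what share of pages needed the stronger model, which is the number that drives cost.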

Method 2: Cut the Bill With Prompt Caching and the Batch API

Token cost is the thing that surprises teams new to LLM extraction. Two built-in features bring it down dramatically.

Prompt caching. If you extract the same shape from thousands of pages, your instructions and schema are identical every call — only the HTML changes. Mark the static portion with a cache breakpoint and Anthropic stores it server-side, charging a fraction of the price on cache hits:

resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": LONG_EXTRACTION_INSTRUCTIONS,   # reused on every page
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": html[:60000]}],
)

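On a large reused system prompt, cache reads cost roughly a tenth of normal input tokens, while the first write costs about 25% more than an uncached call. The cache is short-lived (five minutes by default, refreshed on each hit) and only applies above a model-specific minimum prompt length, so it pays off when pages are processed back to back.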
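
To see what caching is worth on a given crawl, a back-of-the-envelope calculation helps. This sketch works in price-independent units (one unit equals one uncached input token) and assumes every call after the first hits the cache; the token counts in the usage note are made-up example values.

```python
CACHE_WRITE_MULTIPLIER = 1.25  # first call writes the cache at a premium
CACHE_READ_MULTIPLIER = 0.10   # later calls read it at about a tenth of the price


def relative_input_cost(pages, static_tokens, html_tokens, cached):
    """Input cost of a crawl, measured in uncached-input-token units.

    static_tokens: instructions and schema repeated on every call.
    html_tokens:   average page payload, which is never cached.
    """
    if not cached:
        return pages * (static_tokens + html_tokens)
    first_write = static_tokens * CACHE_WRITE_MULTIPLIER
    later_reads = (pages - 1) * static_tokens * CACHE_READ_MULTIPLIER
    return first_write + later_reads + pages * html_tokens
```

For 1,000 pages with a 5,000-token static prompt and 8,000 tokens of HTML per page, this gives 13,000,000 units uncached against about 8,505,750 cached, a saving of roughly 35% on input. The saving is capped by the share of each request that is static, which is why trimming the HTML matters as much as caching the prompt.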

Batch API. When extraction does not need to be real time (overnight crawls, backfills), submit requests as a batch. It is processed asynchronously and is about half the price of synchronous calls:

batch = client.messages.batches.create(requests=[
    {
        "custom_id": "page-123",
        "params": {
            "model": "claude-haiku-4-5",
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": html_123[:60000]}],
        },
    },
    # ... thousands more
])
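
Submitting the batch is only half the job; the results are collected later, keyed by custom_id. The helper below is a sketch of that step using the SDK's batch retrieve and results calls; collect_batch_results and the 60-second polling interval are assumptions, and a production job would likely add a timeout.

```python
import time


def collect_batch_results(client, batch_id, poll_seconds=60):
    """Wait for a Message Batch to finish, then split results by outcome."""
    # processing_status moves to "ended" once every request has finished
    while client.messages.batches.retrieve(batch_id).processing_status != "ended":
        time.sleep(poll_seconds)

    succeeded, failed = {}, []
    for entry in client.messages.batches.results(batch_id):
        if entry.result.type == "succeeded":
            succeeded[entry.custom_id] = entry.result.message
        else:  # "errored", "canceled", or "expired"
            failed.append(entry.custom_id)
    return succeeded, failed
```

Calling collect_batch_results(client, batch.id) returns the successful messages by page ID plus a list of IDs to resubmit, which keeps retries limited to the pages that actually failed.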

Caching plus batching together can cut a large extraction job's cost by more than half versus naive per-page synchronous calls. Trim the HTML before sending it, too — strip <script>, <style>, and <svg> blocks so you are not paying to tokenize code the model will ignore.
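
The trimming step can be as small as a few regular expressions. This is a rough sketch, assuming the markup is well-formed enough for regex matching; trim_html is a hypothetical helper, and a real HTML parser is the safer choice for messy pages.

```python
import re

# Whole <script>, <style>, and <svg> elements, including their contents.
_DROP_BLOCKS = re.compile(r"<(script|style|svg)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
_BLANK_RUNS = re.compile(r"\s{2,}")


def trim_html(html):
    """Remove blocks the model does not need and collapse whitespace."""
    html = _DROP_BLOCKS.sub("", html)
    html = _COMMENTS.sub("", html)
    return _BLANK_RUNS.sub(" ", html).strip()
```

Run it on the fetched HTML before truncating, so the character budget is spent on content rather than on inline scripts at the top of the page.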

Method 3: Use Vision for Obfuscated or Canvas-Rendered Pages

Some sites deliberately make their HTML useless — prices rendered as images, data drawn to a <canvas>, or class names randomized on every load. When the markup fights you, skip it: screenshot the rendered page with a headless browser and send the image to Claude, which reads it visually.

import base64
from anthropic import Anthropic

client = Anthropic()
with open("page.png", "rb") as f:
    img_b64 = base64.standard_b64encode(f.read()).decode()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {"type": "image", "source": {
                "type": "base64", "media_type": "image/png", "data": img_b64,
            }},
            {"type": "text", "text": "Return the visible price and product title as JSON."},
        ],
    }],
)
print(resp.content[0].text)

Vision extraction costs more per page than text, so use it as a targeted fallback for the pages that defeat HTML parsing, not as your default path.
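
Screenshots do not always arrive as PNG, and a media type that does not match the bytes can cause the request to be rejected. The sketch below builds the image content block and infers the media type from the file's leading bytes; image_block is a hypothetical helper, and it covers only the four formats the API accepts (PNG, JPEG, GIF, WebP).

```python
import base64

# Leading "magic bytes" for the simple cases; WebP is checked separately below.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def image_block(data):
    """Build an image content block, inferring the media type from the bytes."""
    media_type = None
    for signature, candidate in _SIGNATURES:
        if data.startswith(signature):
            media_type = candidate
            break
    # WebP files start with "RIFF", four size bytes, then "WEBP"
    if media_type is None and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        media_type = "image/webp"
    if media_type is None:
        raise ValueError("unsupported or unrecognized image format")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.standard_b64encode(data).decode("ascii"),
        },
    }
```

The returned dict can be placed directly in the content list of a user message, in the same position as the hand-built image block in the example above.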

Method 4: Agentic Scraping With Tool Use and MCP

The patterns above treat Claude as a parser. You can also let it drive. By giving Claude tools — a "fetch this URL through a proxy" function, a "click this element" function — it can decide which links to follow and when to paginate, handling sites whose structure it discovers as it goes. The Model Context Protocol (MCP) standardizes how you expose those tools, so the same proxy-backed fetch tool works across your agents.

Agentic scraping is powerful for exploratory or multi-step jobs (log in, navigate, extract across pages) but it is slower and more expensive than a fixed pipeline because the model is in the loop on every step. For predictable, high-volume scraping of known page types, the Method 1 pipeline wins on cost and speed every time. Match the approach to the job. For a broader look at the tooling landscape, see our roundup of the best AI web scraping tools.
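
At the core of an agentic loop is a dispatcher that runs whatever tools the model requested and reports the results back. The sketch below shows that one step under stated assumptions: the fetch_url tool definition, run_tool_calls, and the handlers mapping are hypothetical, while the tool_use and tool_result block shapes follow the Messages API.

```python
# Hypothetical tool definition the model can call; your code does the fetching.
FETCH_TOOL = {
    "name": "fetch_url",
    "description": "Fetch a URL through the proxy and return the page HTML.",
    "input_schema": {
        "type": "object",
        "properties": {"url": {"type": "string"}},
        "required": ["url"],
    },
}


def run_tool_calls(content_blocks, handlers):
    """Execute every tool_use block and build the matching tool_result blocks.

    handlers maps a tool name to a Python callable that takes the tool's
    input fields as keyword arguments.
    """
    results = []
    for block in content_blocks:
        if block.type != "tool_use":
            continue  # skip text blocks the model emitted alongside tool calls
        try:
            handler = handlers.get(block.name)
            if handler is None:
                raise ValueError(f"unknown tool: {block.name}")
            output = handler(**block.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(output),
            })
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop
            results.append({
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": str(exc),
                "is_error": True,
            })
    return results
```

In a full loop you would append the assistant's response and then a user message containing these results, and call the API again until the response's stop_reason is no longer "tool_use". Capping the number of iterations keeps a confused agent from running up the bill.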

Putting It Together: A Proxy-Backed Claude Pipeline

Here is the shape of a production extractor — proxy fetch, optional retry, Claude extraction, validation:

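import time
import requests
from anthropic import Anthropic

client = Anthropic()
# Replace USER, PASS, and PROXY_HOST with your own credentials and endpoint host.
PROXY = "http://USER:PASS@PROXY_HOST:7777"
# TOOLS is the tool list from Method 1; urls, save(), and log_failure() are your own code.

def fetch(url, retries=3):
    for attempt in range(retries):
        try:
            r = requests.get(url, proxies={"http": PROXY, "https": PROXY}, timeout=30)
            if r.status_code == 200:
                return r.text
        except requests.RequestException:
            pass  # timeouts and connection errors count as a failed attempt
        if attempt < retries - 1:
            time.sleep(2 ** attempt)  # back off before the next try: 1s, then 2s
    raise RuntimeError("blocked after retries: " + url)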

def extract(html):
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        tools=TOOLS,
        tool_choice={"type": "tool", "name": "save_product"},
        messages=[{"role": "user", "content": html[:60000]}],
    )
    return next(b.input for b in resp.content if b.type == "tool_use")

for url in urls:
    try:
        data = extract(fetch(url))
        if data["price"] > 0:          # cheap validation gate
            save(data)
    except Exception as e:
        log_failure(url, e)

Because rotation happens at the proxy endpoint, every call to fetch() can exit from a different residential IP automatically — no IP management in your code. That separation is what lets the same script scale from 100 to 1,000,000 pages.

Best Practices for Scraping With Claude

  • Trim before you send. Strip scripts, styles, SVG, and base64 blobs. Smaller input means lower cost and better accuracy. Often you can isolate the main content container and send only that.
  • Force the schema. Always use tool use with tool_choice for structured jobs. Never regex JSON out of prose.
  • Validate cheaply, escalate selectively. Gate on simple rules (price > 0, required fields present). Re-run only the failures on a stronger model.
  • Cache the static parts. Schema and instructions are identical every call — cache them.
  • Batch what is not urgent. Overnight and backfill jobs belong on the Batch API.
  • Keep proxies and parsing separate. Let the proxy layer own access and rotation; let Claude own meaning. Debugging is far easier when the two concerns do not bleed together.

Is Scraping With Claude Legal?

Using Claude to parse pages does not change the legal picture — the same rules that govern any web scraping apply. Scraping publicly available data is broadly permissible in many jurisdictions, but the details matter: respect a site's Terms of Service, do not collect personal data in ways that violate GDPR or similar laws, honor robots.txt where it applies, and never scrape content behind a login you are not authorized to access. The model is just the parser; responsibility for what you collect and how you use it rests with you. When in doubt, consult a lawyer for your specific use case.
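
Honoring robots.txt is the one item on that list that is easy to automate. Here is a minimal sketch using Python's standard library; allowed_by_robots is a hypothetical helper, and it assumes you have already fetched the site's robots.txt body as text.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Check a URL against robots.txt rules that were fetched earlier."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Checking each URL before it enters the fetch queue keeps disallowed paths out of the crawl without touching the extraction code.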

Frequently Asked Questions

Can Claude scrape a website directly from a URL?

No. Claude is a language model, not a crawler — it has no IP address, browser, or fetch capability of its own. You fetch the page with your own code (through a proxy), then send the HTML or a screenshot to Claude for extraction. Claude replaces the parsing layer, not the request layer.

Do I still need proxies if I use Claude?

Yes, and this is the most common misunderstanding. Claude handles parsing, not access. At any real volume the target site will rate-limit or block a single IP. Residential or mobile proxies provide the IP diversity that keeps requests flowing; without them you get block pages with nothing for Claude to read.

Is Claude or ChatGPT better for web scraping?

Both follow the same architecture: your code fetches through proxies, the model extracts. Claude's tool use is well suited to forcing schema-conformant JSON, and its long context window lets you pass large pages in one call. The right choice usually comes down to your existing stack and pricing. See our ChatGPT scraping guide for that side.

How much does it cost to scrape with Claude?

Cost scales with tokens, so it depends on page size and volume. Using Claude Haiku, trimming HTML before sending, caching your static instructions, and batching non-urgent jobs can together cut the bill by well over half compared to naive per-page calls on a larger model. For most structured pages, Haiku plus caching keeps per-page cost very low.

Which Claude model should I use for scraping?

Start with Claude Haiku — it is fast and cheap and handles well-structured pages fine. Escalate to Claude Sonnet only for pages with ambiguous layout or extraction that needs reasoning. A good pattern is Haiku by default, with failed-validation pages retried on Sonnet.

Can Claude read pages that render data as images or canvas?

Yes, via vision. Screenshot the rendered page with a headless browser and send the image to Claude, which reads the price or text visually. It costs more than text extraction, so use it as a fallback for pages that defeat HTML parsing rather than as the default.

Does SpyderProxy work with the Claude API?

Yes. SpyderProxy provides the fetch-layer IPs; Claude provides extraction. Point your HTTP client at a SpyderProxy residential endpoint, request the page, and pass the HTML to Claude. Rotation and geo-targeting happen at the proxy endpoint, so your extraction code stays unchanged as you scale.

Conclusion

Scraping with Claude is not about replacing your scraper — it is about replacing the part of it that always breaks. Let proxies own access and rotation, let a headless browser handle JavaScript when needed, and let Claude turn whatever HTML comes back into clean, schema-conformant data that survives site redesigns. Add tool use for reliable JSON, prompt caching and the Batch API for cost, and vision as a fallback, and you have a pipeline that scales without a wall of fragile selectors.

The one piece Claude can never provide is the IP. To feed it pages instead of block screens, SpyderProxy residential proxies start at $1.75/GB with 10M+ IPs across 195+ countries, automatic rotation, sticky sessions, and city-level targeting — the access layer your Claude extractor runs on.
