spyderproxy

How to Use ChatGPT for Web Scraping (2026)

Alex R. | Published Mon May 18 2026

Three ways to use ChatGPT for web scraping in 2026: (1) as a code generator — paste HTML, ask for selectors or a full scraper script, copy-paste the output; (2) as a runtime parser — ChatGPT API takes the fetched HTML and a schema, returns JSON; (3) as an agentic driver — ChatGPT (or function-calling) drives a headless browser through clicks, scrolls, forms. The "vibe coding" approach works for one-off jobs. For anything you'll re-run, productionize with the API + Pydantic / Zod schema validation.

Method 1: Code Generation (Prototype Fast)

The simplest workflow. Open ChatGPT, paste a snippet of the page's HTML, ask for a scraper. Example prompt:

Here's the HTML for a product card on example.com. Write a Python script using requests + BeautifulSoup that scrapes /products/all (paginated), extracts every product's name, price, and SKU, and saves to products.json. Use a residential proxy at http://USER:[email protected]:8000 with a 1-second delay between requests. Include retry logic for 429 / 503.

[paste HTML]

You'll get back ~50 lines of working code. Run it. It probably works on page 1; pagination breaks at page 3 because ChatGPT guessed the URL pattern wrong. Iterate: paste the actual page-2 URL, ask it to fix the pagination. Total time to a working scraper: 10–15 minutes vs 1–2 hours from scratch.
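For reference, the kind of script that prompt produces looks roughly like this. The selectors (`.product-card`, `.name`, `.price`), the `data-sku` attribute, and the pagination pattern are illustrative guesses, not confirmed values — the guessed URL pattern is exactly the part that breaks at page 3:

```python
import json
import time

import requests
from bs4 import BeautifulSoup

PROXY = "http://USER:[email protected]:8000"
PROXIES = {"http": PROXY, "https": PROXY}
RETRY_STATUSES = {429, 503}

def page_url(page: int) -> str:
    # Assumed pagination pattern -- this is the part ChatGPT tends to
    # guess wrong, so verify it against the real page-2 URL.
    return f"https://example.com/products/all?page={page}"

def parse_products(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    items = []
    for card in soup.select(".product-card"):  # illustrative selector
        price_el = card.select_one(".price")
        items.append({
            "name": card.select_one(".name").get_text(strip=True),
            "price": float(price_el.get_text(strip=True).lstrip("$")) if price_el else None,
            "sku": card.get("data-sku"),  # assumed attribute
        })
    return items

def fetch(url: str, retries: int = 3) -> str:
    for attempt in range(retries):
        resp = requests.get(url, proxies=PROXIES, timeout=30)
        if resp.status_code in RETRY_STATUSES:
            time.sleep(2 ** attempt)  # exponential backoff on 429 / 503
            continue
        resp.raise_for_status()
        return resp.text
    raise RuntimeError(f"gave up on {url}")

def crawl() -> None:
    """Walk pages until an empty one, then dump products.json."""
    products, page = [], 1
    while True:
        batch = parse_products(fetch(page_url(page)))
        if not batch:
            break
        products.extend(batch)
        time.sleep(1)  # 1-second politeness delay between requests
        page += 1
    with open("products.json", "w") as f:
        json.dump(products, f, indent=2)
```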

Prompt Patterns That Work

  • Be specific about selectors. "Use the class .product-card for items" beats "extract products."
  • Specify edge cases. "Some products are out-of-stock and lack a price; output null for those."
  • Demand error handling. "Retry on 429 with exponential backoff."
  • Ask for typed output. "Define a Pydantic model and return validated instances."
  • Provide the proxy config. Otherwise you get hard-coded URLs and have to fix them.

Prompt Patterns That Don't

  • "Write me a web scraper" — too vague, you'll get boilerplate.
  • "Scrape this URL" without HTML — the model hallucinates selectors. Either paste HTML or use the API method below.
  • "Fix this code" without the error — paste the traceback.

Method 2: Runtime Parser (API + Schema)

For production, don't paste-and-run; use the API directly as your parser. Workflow: your code fetches the page (you control the proxy and retries), passes the HTML or Markdown to ChatGPT, gets back validated JSON.

from openai import OpenAI
from pydantic import BaseModel
import requests

client = OpenAI()

class Product(BaseModel):
    name: str
    price: float
    sku: str | None = None
    in_stock: bool

def scrape(url: str, proxy: str) -> Product:
    # 1. Fetch via your proxy stack
    html = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30).text

    # 2. Ask ChatGPT for structured output (uses OpenAI's structured output mode)
    resp = client.chat.completions.parse(  # .parse(), not .create(), accepts a Pydantic model
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract product details from the HTML."},
            {"role": "user", "content": html[:30_000]},
        ],
        response_format=Product,  # Pydantic-backed structured output
    )

    return resp.choices[0].message.parsed

product = scrape("https://example.com/p/abc",
                 "http://USER:[email protected]:8000")
print(product)

Structured Output mode (released August 2024) guarantees valid JSON matching the schema. No regex parsing, no "the model returned text instead of JSON" failures. Cost ~$0.0005–$0.001 per page with gpt-4o-mini.

Same Thing With Claude

from anthropic import Anthropic
import json

client = Anthropic()

resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=2000,
    system="Extract product details. Return JSON: {name, price, sku, in_stock}.",
    messages=[{"role": "user", "content": html[:30_000]}],
)
data = json.loads(resp.content[0].text)
product = Product(**data)

Claude doesn't have a one-line structured-output mode like OpenAI, but tool-use achieves the same effect. Claude tends to win on long-document extraction; GPT wins on speed-of-response.

Method 3: Agentic Browser (Multi-Step)

For tasks that need real browser interaction — log in, click a tab, fill a search box — pair ChatGPT with a headless browser. The model decides what to do; the framework executes it.

# Browser-use is the cleanest 2026 wrapper
import asyncio
from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(
    task=("Go to example.com/login. Log in with user/pass from env. "
          "Navigate to dashboard, find the 'export CSV' button, click it. "
          "Return the resulting CSV URL."),
    llm=ChatOpenAI(model="gpt-4o"),
)
result = asyncio.run(agent.run())
print(result)

This is what AI web scraping tools like Browser-use, Stagehand, and AgentQL package. ChatGPT is the brain; the browser is the body. Cost is high (30–90s per task, frontier model tokens) so reserve for tasks worth $0.05–$0.50 each.

Prompt Engineering for Scraping

The "Paste-HTML-Get-Selectors" Prompt

Here is HTML from example.com/products/laptop. Give me the most stable CSS selectors for: product title, price, sku, in-stock status. Prefer attributes over class names (classes change; attributes like data-test-id are usually intentional). Avoid :nth-child. Output as a Python dict.
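The reply is just a mapping you can drop into a scraper. The selectors below are hypothetical until checked against the real page — the point is the shape, not the values:

```python
from bs4 import BeautifulSoup

# Hypothetical selector dict, as returned by the prompt above
SELECTORS = {
    "title": "[data-test-id='product-title']",
    "price": "[data-test-id='price']",
    "sku": "[data-test-id='sku']",
    "in_stock": "[data-test-id='availability']",
}

def extract(html: str) -> dict:
    soup = BeautifulSoup(html, "html.parser")
    # Missing elements become None instead of raising
    return {field: (el.get_text(strip=True) if (el := soup.select_one(sel)) else None)
            for field, sel in SELECTORS.items()}
```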

The "Reverse-Engineer the API" Prompt

I'm scraping example.com. Here's a curl from Chrome DevTools showing the XHR request the page makes to load products. Decode this, identify any tokens/cookies/headers required, and write Python that calls the same endpoint without using a browser. [paste curl]

Often the page is calling an internal JSON API. Scraping the API directly skips all the rendering overhead and CAPTCHAs.
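When DevTools does reveal such an endpoint, the replacement code is usually just a headers-plus-params call. Everything below — the endpoint path, header names, token, and response shape — is a hypothetical placeholder reconstructed from the (elided) curl:

```python
import requests

API = "https://example.com/api/v2/products"  # hypothetical internal endpoint

def build_request(page: int) -> requests.PreparedRequest:
    return requests.Request(
        "GET",
        API,
        params={"page": page, "per_page": 48},
        headers={
            "Accept": "application/json",
            "X-Api-Token": "TOKEN_FROM_PAGE",  # often embedded in a <script> tag or cookie
        },
    ).prepare()

def fetch_products(page: int, proxy: str) -> list[dict]:
    with requests.Session() as s:
        resp = s.send(build_request(page),
                      proxies={"http": proxy, "https": proxy},
                      timeout=30)
    resp.raise_for_status()
    return resp.json()["products"]  # assumed response shape
```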

The "Why Is It Failing" Prompt

This scraper returns empty results despite 200 OK responses. Here's the code, here's a sample of the HTML I got back, here's the page in a real browser. What's different?

The usual answer: JS-rendered content. ChatGPT will tell you to switch to Playwright.
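That switch is usually small. A sketch — the wait selector is an assumption, and the proxy values come from the config used earlier in this article:

```python
def fetch_rendered(url: str) -> str:
    # Imported here so the rest of the scraper doesn't require Playwright
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(proxy={
            "server": "http://gate.spyderproxy.com:8000",
            "username": "USER",
            "password": "PASS",
        })
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # let the XHRs settle
        page.wait_for_selector(".product-card")   # assumed selector
        html = page.content()                     # post-render HTML
        browser.close()
    return html
```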

Where ChatGPT-Driven Scraping Breaks

  • Selectors hallucinate. Without real HTML in the prompt, the model invents class names. Always paste real HTML.
  • It doesn't test the code. Generated code looks right and works in 60–80% of cases; the rest you have to debug.
  • It can't see live JS state. If the data isn't in the source HTML, the model can't guess what JS will do.
  • Context window limits. A real e-commerce product page is 200KB+ of HTML; you have to truncate.
  • Cost adds up. Method 2 (runtime parser) costs $0.0005–$0.001/page. For 1M pages that's $500–$1,000 just in tokens, vs $50 in compute for a traditional scraper.
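The context-window problem is partly self-inflicted: most of a 200KB page is scripts, styles, and SVG that carry no product data. A crude stdlib-only pre-clean (a sketch, not a robust HTML sanitizer) often shrinks the payload substantially before you truncate:

```python
import re

def strip_noise(html: str, limit: int = 30_000) -> str:
    # Drop the blocks that never contain product data
    html = re.sub(r"(?is)<(script|style|svg|noscript)[^>]*>.*?</\1>", "", html)
    html = re.sub(r"(?s)<!--.*?-->", "", html)  # HTML comments
    html = re.sub(r"[ \t]+", " ", html)         # collapse runs of whitespace
    return html[:limit]
```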

When Each Method Wins

  • One-off scrape, 100 pages → Method 1 (code generation)
  • Production scraper for 200 different sites → Method 2 (runtime parser)
  • Production scraper for 1 known site, 1M pages → Method 1 (generate once, run forever)
  • Logged-in dashboard, multi-step → Method 3 (agentic)
  • Layout changes weekly → Method 2 or 3 (resilient)
  • Layout never changes, max cost-efficiency → Method 1 (traditional code, no per-request tokens)

Still Need Proxies

ChatGPT helps you parse; it doesn't help you reach the page. Anti-bot systems block your IP before any extraction starts. For any scraping you intend to run more than a few times, put rotating residential proxies in front of whichever method you pick.

Related: What is AI scraping? · AI web scraping tools · AI data collection · Avoid scraper detection.