Three ways to use ChatGPT for web scraping in 2026: (1) as a code generator — paste HTML, ask for selectors or a full scraper script, copy-paste the output; (2) as a runtime parser — ChatGPT API takes the fetched HTML and a schema, returns JSON; (3) as an agentic driver — ChatGPT (or function-calling) drives a headless browser through clicks, scrolls, forms. The "vibe coding" approach works for one-off jobs. For anything you'll re-run, productionize with the API + Pydantic / Zod schema validation.
The simplest workflow. Open ChatGPT, paste a snippet of the page's HTML, ask for a scraper. Example prompt:
Here's the HTML for a product card on example.com. Write a Python script using requests + BeautifulSoup that scrapes /products/all (paginated), extracts every product's name, price, and SKU, and saves to products.json. Use a residential proxy at
http://USER:[email protected]:8000 with a 1-second delay between requests. Include retry logic for 429 / 503. [paste HTML]
You'll get back ~50 lines of working code. Run it. It probably works on page 1; pagination breaks at page 3 because ChatGPT guessed the URL pattern wrong. Iterate: paste the actual page-2 URL, ask it to fix the pagination. Total time to a working scraper: 10–15 minutes vs 1–2 hours from scratch.
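For reference, the generated script usually looks something like the condensed sketch below. The selectors, the ?page= pattern, and the proxy host are assumptions for illustration; the real ones depend on the HTML you pasted.

```python
# Condensed sketch of the kind of script ChatGPT typically returns.
# Selectors, pagination pattern, and proxy host are placeholders.
import json
import time
import requests
from bs4 import BeautifulSoup

PROXY = "http://USER:PASS@proxy-host:8000"   # hypothetical proxy endpoint
PROXIES = {"http": PROXY, "https": PROXY}

def fetch(url: str, retries: int = 3) -> str:
    for attempt in range(retries):
        resp = requests.get(url, proxies=PROXIES, timeout=30)
        if resp.status_code in (429, 503):   # back off and retry on rate limiting
            time.sleep(2 ** attempt)
            continue
        resp.raise_for_status()
        return resp.text
    raise RuntimeError(f"giving up on {url}")

products = []
page = 1
while True:
    html = fetch(f"https://example.com/products/all?page={page}")
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.select(".product-card")      # assumed item selector
    if not cards:                             # no cards means we ran past the last page
        break
    for card in cards:
        products.append({
            "name": card.select_one(".product-name").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
            "sku": card.get("data-sku"),
        })
    page += 1
    time.sleep(1)                             # polite 1-second delay between requests

with open("products.json", "w") as f:
    json.dump(products, f, indent=2)
```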
Prompting tips: be specific ("use .product-card for items" beats "extract products"), and if some fields can be missing, say so and ask it to return null for those.

For production, don't paste-and-run; use the API directly as your parser. Workflow: your code fetches the page (you control the proxy and retries), passes the HTML or Markdown to ChatGPT, and gets back validated JSON.
from openai import OpenAI
from pydantic import BaseModel
import requests

client = OpenAI()

class Product(BaseModel):
    name: str
    price: float
    sku: str | None = None
    in_stock: bool

def scrape(url: str, proxy: str) -> Product:
    # 1. Fetch via your proxy stack
    html = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30).text
    # 2. Ask ChatGPT for structured output (OpenAI's structured-output mode)
    resp = client.beta.chat.completions.parse(  # newer SDKs also expose client.chat.completions.parse
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Extract product details from the HTML."},
            {"role": "user", "content": html[:30_000]},
        ],
        response_format=Product,  # Pydantic-backed structured output
    )
    return resp.choices[0].message.parsed

product = scrape("https://example.com/p/abc",
                 "http://USER:[email protected]:8000")
print(product)
Structured Output mode (released August 2024) guarantees valid JSON matching the schema. No regex parsing, no "the model returned text instead of JSON" failures. Cost ~$0.0005–$0.001 per page with gpt-4o-mini.
from anthropic import Anthropic
import json

client = Anthropic()
resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=2000,
    system="Extract product details. Return JSON: {name, price, sku, in_stock}.",
    messages=[{"role": "user", "content": html[:30_000]}],
)
data = json.loads(resp.content[0].text)  # brittle if the model wraps the JSON in a code fence; see tool-use below
product = Product(**data)
Claude doesn't have a one-line structured-output mode like OpenAI, but tool-use achieves the same effect. Claude tends to win on long-document extraction; GPT wins on speed-of-response.
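A minimal sketch of that tool-use pattern, reusing `client`, `html`, and the `Product` model from the snippets above. The tool name `record_product` and its schema are illustrative, not from the article; Anthropic's API just needs any JSON Schema and will return the arguments as structured data.

```python
# Force Claude to "call" a tool whose input schema is the product shape.
resp = client.messages.create(
    model="claude-haiku-4-5-20251001",
    max_tokens=2000,
    tools=[{
        "name": "record_product",              # hypothetical tool name
        "description": "Record the product extracted from the page.",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "price": {"type": "number"},
                "sku": {"type": ["string", "null"]},
                "in_stock": {"type": "boolean"},
            },
            "required": ["name", "price", "in_stock"],
        },
    }],
    tool_choice={"type": "tool", "name": "record_product"},  # force the tool call
    messages=[{"role": "user", "content": html[:30_000]}],
)
tool_call = next(b for b in resp.content if b.type == "tool_use")
product = Product(**tool_call.input)   # validate with the same Pydantic model
```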
For tasks that need real browser interaction — log in, click a tab, fill a search box — pair ChatGPT with a headless browser. The model decides what to do; the framework executes it.
# Browser-use is the cleanest 2026 wrapper
import asyncio
from browser_use import Agent
from langchain_openai import ChatOpenAI

agent = Agent(
    task=("Go to example.com/login. Log in with user/pass from env. "
          "Navigate to dashboard, find the 'export CSV' button, click it. "
          "Return the resulting CSV URL."),
    llm=ChatOpenAI(model="gpt-4o"),
)
result = asyncio.run(agent.run())
print(result)
This is what AI web scraping tools like Browser-use, Stagehand, and AgentQL package: ChatGPT is the brain, the browser is the body. Cost is high (30–90 s per task, frontier-model tokens), so reserve it for tasks worth $0.05–$0.50 each.
Here is HTML from example.com/products/laptop. Give me the most stable CSS selectors for: product title, price, sku, in-stock status. Prefer attributes over class names (classes change; attributes like data-test-id are usually intentional). Avoid :nth-child. Output as a Python dict.
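The reply is typically a dict you can drop straight into your parser. A hypothetical example, assuming an `html` string you've already fetched (the selectors here are illustrative, not taken from a real page):

```python
from bs4 import BeautifulSoup

# Hypothetical reply: attribute-based selectors survive redesigns better than class names.
SELECTORS = {
    "title":    "[data-test-id='product-title']",
    "price":    "[data-test-id='product-price']",
    "sku":      "[data-sku]",
    "in_stock": "[data-test-id='availability']",
}

soup = BeautifulSoup(html, "html.parser")
title = soup.select_one(SELECTORS["title"]).get_text(strip=True)
sku = soup.select_one(SELECTORS["sku"])["data-sku"]
```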
I'm scraping example.com. Here's a curl from Chrome DevTools showing the XHR request the page makes to load products. Decode this, identify any tokens/cookies/headers required, and write Python that calls the same endpoint without using a browser. [paste curl]
Often the page is calling an internal JSON API. Scraping the API directly skips all the rendering overhead and CAPTCHAs.
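What ChatGPT hands back usually boils down to a plain requests call against that internal endpoint. A sketch under assumptions: the endpoint path, query params, headers, and cookie names below are placeholders; the real values come from the curl you pasted.

```python
import requests

# Placeholders: copy the real endpoint, headers, and cookies from your DevTools capture.
resp = requests.get(
    "https://example.com/api/v2/products",   # hypothetical internal JSON endpoint
    params={"page": 1, "per_page": 48},
    headers={
        "Accept": "application/json",
        "X-Requested-With": "XMLHttpRequest",
        "User-Agent": "Mozilla/5.0 ...",     # match the browser that made the request
    },
    cookies={"session_id": "..."},           # only if the endpoint requires it
    timeout=30,
)
data = resp.json()                           # structured JSON, no HTML parsing needed
```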
This scraper returns empty results despite 200 OK responses. Here's the code, here's a sample of the HTML I got back, here's the page in a real browser. What's different?
The usual answer: JS-rendered content. ChatGPT will tell you to switch to Playwright.
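If that's the diagnosis, the fix is typically a few lines of Playwright: render the page, then hand the resulting HTML to the same parser as before. A minimal sketch (the wait-for selector is an assumption):

```python
# Minimal Playwright fetch for JS-rendered pages (sync API).
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/products/all", wait_until="networkidle")
    page.wait_for_selector(".product-card")   # assumed selector for rendered content
    html = page.content()                     # fully rendered HTML; parse it as before
    browser.close()
```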
| Use case | Pick |
|---|---|
| One-off scrape, 100 pages | Method 1 (code generation) |
| Production scraper for 200 different sites | Method 2 (runtime parser) |
| Production scraper for 1 known site, 1M pages | Method 1 (generate once, run forever) |
| Logged-in dashboard, multi-step | Method 3 (agentic) |
| Layout changes weekly | Method 2 or 3 (resilient) |
| Layout never changes, max cost-efficiency | Method 1 (traditional code, no per-request tokens) |
ChatGPT helps you parse; it doesn't help you reach the page. Anti-bot systems block your IP before any extraction starts. For any scraping you intend to run more than a few times, invest in the fetch layer (rotating residential proxies, retry logic, detection avoidance) before you invest in prompts.
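A minimal sketch of that fetch layer, assuming a hypothetical pool of proxy gateway endpoints (provider, hostnames, and ports are placeholders):

```python
import random
import requests

# Hypothetical proxy pool: swap in your provider's gateway endpoints.
PROXY_POOL = [
    "http://USER:PASS@proxy-1.example:8000",
    "http://USER:PASS@proxy-2.example:8000",
]

def fetch(url: str) -> str:
    proxy = random.choice(PROXY_POOL)         # rotate IPs across requests
    resp = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers={"User-Agent": "Mozilla/5.0 ..."},   # look like a real browser
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text
```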
Related: What is AI scraping? · AI web scraping tools · AI data collection · Avoid scraper detection.