Quick definition: AI scraping uses a large language model — or an LLM-driven agent that controls a browser — to read a web page and return structured data without you writing selectors, XPath, or per-site code. You describe what you want (a JSON schema or plain-English prompt), point it at a URL, and the model figures out where the title, price, or contact info lives. In 2026 the leading approaches are Firecrawl (managed), ScrapeGraphAI (open-source), Browser-use (agentic), and Stagehand (programmatic AI control). It's magical when sites change layout often or you only need 10–1,000 pages; it's the wrong tool for the 100M-page crawl.
The whole loop runs per page. There's no per-site code, no selectors to maintain, no breakage when a site redesigns its CSS classes. You pay for it in tokens (and latency).
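That per-page loop is simple enough to sketch end to end. Everything here is illustrative: `call_llm` is a stand-in (stubbed so the sketch runs offline) for whatever chat-completion API you use, and the HTML-to-text step is a placeholder.

```python
import json

def call_llm(prompt: str) -> str:
    # Stub: a real implementation would POST `prompt` to an LLM endpoint.
    return json.dumps({"title": "Example Widget", "price": 19.99})

def scrape_page(html: str, schema_fields: list[str]) -> dict:
    # 1. Strip the page down to text (real pipelines convert HTML to markdown).
    text = html  # placeholder for the cleaning step
    # 2. Ask the model to fill the schema from the page content.
    prompt = f"Extract the fields {schema_fields} from this page as JSON:\n{text}"
    raw = call_llm(prompt)
    # 3. Parse the model's JSON and sanity-check it against the schema.
    data = json.loads(raw)
    missing = [f for f in schema_fields if f not in data]
    if missing:
        raise ValueError(f"model omitted fields: {missing}")
    return data

page = "<html><h1>Example Widget</h1><span>$19.99</span></html>"
print(scrape_page(page, ["title", "price"]))
```

Note there is nothing site-specific in the loop: the schema fields travel in the prompt, not in selectors.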
| | Traditional (BeautifulSoup, Scrapy) | AI Scraping |
|---|---|---|
| Setup time | ~30 min per site | ~30 sec per site |
| Maintenance | Breaks on layout change | Adapts automatically |
| Cost per page | ~$0.0001 (compute only) | ~$0.001–$0.01 (tokens) |
| Throughput | 500–1,000 pages/sec/box | 5–50 pages/sec (rate-limited) |
| Best for | Known sites, large volume | New sites, varied layouts, low–mid volume |
| JS rendering | Need to add Playwright | Usually built-in |
| Schema mapping | You write per-field | LLM infers |
The pricing gap is the key tradeoff. A traditional scraper costs you fractions of a cent in cloud compute. An AI scraper using GPT-4o-mini costs about $0.001–$0.003 per page; using Claude 3.5 Sonnet or GPT-5 costs $0.005–$0.02. For 1M pages that's $1,000–$20,000 in tokens, vs $50–$200 in compute for a traditional scraper.
| Tool | Type | Strength |
|---|---|---|
| Firecrawl | Managed API | Best developer experience; one POST returns clean markdown + structured JSON |
| ScrapeGraphAI | Open-source Python | Graph-based pipelines; works with any OpenAI-compatible LLM (incl. local Llama) |
| Browser-use | Agentic open-source | LLM drives a real Chromium for forms, clicks, multi-step flows |
| Stagehand | Programmatic AI control (Browserbase) | Mix deterministic Playwright + AI fallback for selectors |
| AgentQL | Query language + cloud | Natural-language locators ("the price element") |
| Reworkd Tarsier / Banana | OSS extraction | Vision-language model reads page like a human |
| Crawl4AI | OSS crawl + extract | Async crawler that emits LLM-ready markdown chunks |
| LangChain WebBaseLoader | Pipeline component | Drop-in for RAG ingestion |
```python
from firecrawl import FirecrawlApp
from pydantic import BaseModel

class Product(BaseModel):
    name: str
    price: float
    in_stock: bool
    sku: str | None = None

app = FirecrawlApp(api_key="fc-...")
result = app.scrape_url(
    "https://example.com/products/abc",
    params={
        "formats": ["extract"],
        "extract": {"schema": Product.model_json_schema()},
    },
)
product = Product.model_validate(result["extract"])
print(product)
```
One call. The schema is your contract; the LLM fills it. Firecrawl handles fetching, JS rendering, Markdown conversion, and the LLM prompt.
```python
from scrapegraphai.graphs import SmartScraperGraph

graph_config = {
    "llm": {
        "model": "gpt-4o-mini",
        "api_key": "sk-...",
        "temperature": 0,
    },
    "headless": True,
    # Route the headless browser through a residential proxy
    "loader_kwargs": {
        "proxy": {"http": "http://USER:[email protected]:8000"},
    },
}

scraper = SmartScraperGraph(
    prompt="Extract product name, price (as float), and availability",
    source="https://example.com/products/abc",
    config=graph_config,
)
print(scraper.run())
```
ScrapeGraphAI itself runs entirely on your hardware; only the LLM call leaves the box. Pair it with a self-hosted Llama-4 / Qwen3 endpoint via Ollama for $0 marginal token cost.
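A local-model config might look like the sketch below. The exact field names (`base_url` in particular) vary between ScrapeGraphAI releases, so treat this as an assumption to check against the version you install; `ollama/llama3` stands in for whatever model you have pulled into Ollama.

```python
# Hypothetical local-model config for ScrapeGraphAI via Ollama.
local_config = {
    "llm": {
        "model": "ollama/llama3",              # any model pulled into Ollama
        "temperature": 0,
        "base_url": "http://localhost:11434",  # Ollama's default endpoint
    },
    "headless": True,
}

# Pass `local_config` as the `config=` argument to SmartScraperGraph,
# exactly as in the hosted-model example above.
print(local_config["llm"]["model"])
```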
For multi-step flows — log in, navigate menus, fill forms, then extract — you need an agent that can act, not just read.
```python
from browser_use import Agent
from langchain_openai import ChatOpenAI
import asyncio

agent = Agent(
    task=("Go to https://example.com/jobs, search for 'engineer' "
          "in San Francisco, return the top 5 results as JSON."),
    llm=ChatOpenAI(model="gpt-4o"),
)
result = asyncio.run(agent.run())
print(result)
```
The agent looks at the page screenshot + DOM, picks a tool (click, type, scroll), executes, observes the new state, and continues until the task is complete. Slow (30–90 seconds per task) but handles things you'd need 200 lines of selectors for.
Token cost depends on three factors: input tokens (the cleaned page content), output tokens (your JSON), and the model you pick. Rough 2026 numbers for a typical e-commerce product page (~3k input tokens, ~200 output tokens):
| Model | Cost per page | 1M pages |
|---|---|---|
| GPT-4o-mini | ~$0.0005 | ~$500 |
| Claude Haiku 4.5 | ~$0.0008 | ~$800 |
| Gemini 2.5 Flash | ~$0.0004 | ~$400 |
| GPT-5 | ~$0.012 | ~$12,000 |
| Claude Opus 4.7 | ~$0.018 | ~$18,000 |
| Self-hosted Llama-4 70B | ~$0.00005 (electricity) | ~$50 |
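The table's numbers fall out of simple arithmetic, which you can redo for your own pages and current prices. The per-million-token prices below are illustrative assumptions, not quotes:

```python
# Back-of-envelope token cost per page, using the assumptions above
# (~3k input tokens, ~200 output tokens). Prices are $/1M tokens.
def cost_per_page(in_price: float, out_price: float,
                  in_tok: int = 3000, out_tok: int = 200) -> float:
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# e.g. a mini-class model at an assumed $0.15 in / $0.60 out per 1M tokens:
per_page = cost_per_page(0.15, 0.60)
print(f"${per_page:.5f}/page, ${per_page * 1_000_000:,.0f} per 1M pages")
```

Input tokens dominate: at a 15:1 input-to-output ratio, trimming boilerplate from the page before it reaches the model is the cheapest optimization available.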
For most production scraping, mini-class models are accurate enough. Reserve frontier models for tricky pages, or use a two-tier setup: try mini first, and fall back to a bigger model when schema validation fails.
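The two-tier setup is a few lines around the schema. `extract_with` below is a hypothetical hook for whichever scraping call you use (both tiers are stubbed here so the control flow is runnable); the key idea is that the Pydantic schema doubles as the escalation trigger.

```python
from pydantic import BaseModel, ValidationError

class Product(BaseModel):
    name: str
    price: float

def extract_with(model: str, page: str) -> dict:
    # Stub: the cheap tier drops a field; the frontier tier succeeds.
    if model == "mini":
        return {"name": "Widget"}              # missing price -> invalid
    return {"name": "Widget", "price": 19.99}

def extract(page: str) -> Product:
    for model in ("mini", "frontier"):
        try:
            return Product.model_validate(extract_with(model, page))
        except ValidationError:
            continue  # escalate to the next (pricier) tier
    raise RuntimeError("all tiers failed schema validation")

print(extract("<html>...</html>"))
```

If mini succeeds 90% of the time, the blended cost per page stays close to the mini rate while accuracy approaches the frontier model's.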
AI scraping eliminates the parser, not the network. The headless Chrome inside Firecrawl / ScrapeGraphAI / Browser-use still presents an IP to the target. Run 1,000 AI scrapes from one datacenter IP and you'll hit the same 429s and Cloudflare challenges as a traditional scraper.
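So the network layer still needs the usual defenses: backoff on 429s plus IP rotation. A minimal retry wrapper might look like this; `fetch` is a hypothetical callable returning a status code and body, stubbed here so the logic runs offline, and the sleep is scaled down for the demo (use whole seconds in production).

```python
import random
import time

def fetch(url: str, attempt: int) -> tuple[int, str]:
    # Stub: rate-limited twice, then succeeds. A real version would issue
    # the request through a (rotating) proxy.
    return (429, "") if attempt < 2 else (200, "<html>ok</html>")

def fetch_with_backoff(url: str, max_attempts: int = 5) -> str:
    for attempt in range(max_attempts):
        status, body = fetch(url, attempt)
        if status == 200:
            return body
        # Exponential backoff with jitter, scaled down for the demo.
        time.sleep(min(2 ** attempt, 30) * random.uniform(0.5, 1.5) * 0.01)
    raise RuntimeError(f"gave up on {url} after {max_attempts} attempts")

print(fetch_with_backoff("https://example.com/products/abc"))
```

Rotating residential or datacenter IPs between attempts is the other half of the fix, exactly as with a traditional scraper.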
Respect `llms.txt` and `ai.txt`, and honor opt-outs.

Related: AI data collection process · How AI agents use proxies · Proxies for LLM training · Best AI web scraping tools.