
AI Data Collection: Process, Tools & Ethics (2026)

Daniel K. · Published May 16, 2026

Quick definition: AI data collection is the systematic gathering of text, images, and structured records used to train, fine-tune, or ground machine-learning models. In 2026, frontier LLM teams run three parallel pipelines — large-scale web crawl for pretraining, live retrieval (RAG) for grounding, and curated human-labeled sets for instruction tuning and evaluation. The legal frame tightened sharply after the EU AI Act's general-purpose provisions took effect in August 2025, the NYT v. OpenAI litigation, and the rapid adoption of llms.txt and ai.txt opt-out standards.

What Counts as AI Data Collection?

Anything you pull into a model is data collection — but the use-cases sort cleanly into four buckets, and the rules for each differ.

| Pipeline | Typical volume | Main source | Key constraint |
| --- | --- | --- | --- |
| Pretraining corpus | 10–30 trillion tokens | Common Crawl + custom web crawl | Copyright, opt-out signals, dedup |
| RAG / grounding | Live, per-query | Web search + first-party APIs | Freshness, citation, robots.txt |
| Fine-tune / SFT | 10k–10M examples | Human labels, synthetic data | Quality, contamination with eval |
| Evaluation | ~1k–500k items | Curated, often expert-written | Test-set leakage is fatal |

Where the Data Actually Comes From (2026)

  • Common Crawl — The 250B+ page free archive remains the foundation. Most labs filter it through their own quality pipeline (FineWeb-Edu, RefinedWeb, RedPajama-V2) before using a fraction.
  • Custom web crawls — Every frontier lab now runs its own crawler (GPTBot, ClaudeBot, Google-Extended, PerplexityBot). Custom crawls let labs hit recent content and target specific verticals Common Crawl under-samples.
  • Partner data deals — News (AP, Reuters, FT, NYT-licensed competitors), Stack Exchange dump, Reddit licensing deal (Google/OpenAI). After the NYT lawsuit and EU AI Act transparency rules, deals are the safer path for premium text.
  • Synthetic data — Used heavily for code, math, and reasoning chains. Self-instruct, Constitutional AI, and reject-and-retry pipelines now generate ~30–50% of fine-tuning corpora at top labs.
  • Human labeling — Scale AI, Surge, Invisible, plus in-house ops teams. RLHF and rubric-based eval still need real humans for the highest-quality slices.
  • First-party APIs — YouTube transcripts (when licensed), GitHub public repos, Wikipedia/Wikidata, government open data.

The 2026 Data Pipeline

For a team building an AI product (not at frontier-lab scale), the practical pipeline looks like this:

  1. Define what you need. Domain-specific RAG corpus? Fine-tune set? Eval suite? Volume and quality requirements differ by 100x.
  2. Source check. Robots.txt, llms.txt, ai.txt, ToS, copyright notice. EU AI Act Art. 53 requires you to publish a sufficiently-detailed training-data summary.
  3. Crawl / fetch. Use a rotating proxy pool to avoid hot-spotting any single IP, identify your bot with a real User-Agent and contact URL, respect crawl-delay.
  4. Extract. Trafilatura, Mozilla Readability, jusText, or an HTML-to-Markdown converter for clean main text. Discard nav, ads, and cookie banners (steps 4–5 are sketched after this list).
  5. Deduplicate. MinHash-LSH or SimHash at document level + line level. Pretraining corpora typically lose 60–80% of raw tokens to dedup.
  6. Filter for quality. Language-ID, perplexity, classifier-based filters (FastText, DataComp-LM), PII detection, NSFW + violence classifiers.
  7. Decontaminate. Remove anything overlapping with your eval sets. Test-set leakage inflates benchmarks and is the #1 reproducibility failure mode.
  8. Document. Datasheet for Datasets (Gebru et al.) + EU AI Act training-data summary. Version everything.
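
Steps 4–5 are where most of the engineering time goes, and a minimal version fits in a few dozen lines. The sketch below assumes Trafilatura and datasketch are installed and that raw HTML pages already sit on disk; the directory name, the 0.8 similarity threshold, and the 128-permutation MinHash are illustrative choices, not recommendations.

```python
# Minimal sketch of extract + dedup (steps 4-5), assuming `pip install trafilatura datasketch`.
# The "raw_html" directory, 0.8 threshold, and num_perm=128 are illustrative.
from pathlib import Path

import trafilatura
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128
lsh = MinHashLSH(threshold=0.8, num_perm=NUM_PERM)  # near-duplicate index
kept = []

for path in Path("raw_html").glob("*.html"):
    html = path.read_text(errors="ignore")
    # Trafilatura strips nav, ads, and cookie banners; returns None if no main content.
    text = trafilatura.extract(html)
    if not text or len(text.split()) < 100:
        continue

    # Document-level MinHash over word-level shingles.
    mh = MinHash(num_perm=NUM_PERM)
    for token in set(text.lower().split()):
        mh.update(token.encode("utf-8"))

    if lsh.query(mh):          # a near-duplicate is already indexed; drop this doc
        continue
    lsh.insert(str(path), mh)  # keys must be unique
    kept.append((str(path), text))

print(f"kept {len(kept)} deduplicated documents")
```
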

Compliance & Ethics in 2026

The compliance picture changed materially between 2023 and 2026. Treat this as the current floor, not exhaustive legal advice:

  • EU AI Act — General-purpose model rules (Art. 51–55) applied from 2 August 2025. Providers must publish a training-data summary, respect EU copyright opt-outs (including rights reservations under the TDM exception in Art. 4 of the 2019 CDSM Directive), and keep technical documentation. Systemic-risk models (10^25+ FLOP) carry additional obligations.
  • GDPR — Personal data scraped from public web is still personal data. You need a lawful basis (typically legitimate interest with a documented balancing test), and data subjects retain Art. 17 erasure rights against the trained model where technically feasible.
  • U.S. case law — NYT v. OpenAI (2024) is the headline copyright fight; its outcome will reshape what counts as fair use in model training. Until it settles, premium publishers should be licensed, not scraped.
  • Opt-out signals to respect — robots.txt, llms.txt, ai.txt, the Crawl-delay directive, meta noai/noimageai tags, and the newer Cloudflare AI bot block. Honoring these is both ethically right and a soft legal shield.
  • Personal data — Do not collect email + name + phone tuples as training data. Even on "public" pages, that crosses GDPR / CCPA lines.

Where Proxies Fit

You don't need proxies to collect AI data ethically — you need them to collect it at volume without breaking the source. A single IP hammering a site is rude, lands you on rate-limit blocklists, and can earn a permanent ban for the entire IP block your cloud provider assigned. A rotating residential pool spreads the load across thousands of real consumer IPs, mimics natural traffic, and makes it easy to respect the practical "no more than 1 request per IP per second" hygiene rule. A minimal fetch loop is sketched below.
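
The sketch shows what that looks like in practice: identify the bot, route requests through a rotating gateway, and back off when the site asks. The proxy endpoint, credentials, and contact URL are placeholders, and the gateway is assumed to assign a fresh exit IP per connection, as most rotating residential pools do.

```python
# A minimal polite-fetch sketch through a rotating proxy gateway.
# PROXY_URL, the credentials, and the contact URL are placeholders.
import time

import requests

PROXY_URL = "http://user:pass@gateway.example-pool.com:8000"  # hypothetical rotating endpoint
PROXIES = {"http": PROXY_URL, "https": PROXY_URL}
HEADERS = {
    # Identify the bot and give site owners a way to reach you.
    "User-Agent": "MyDatasetBot/1.0 (+https://example.com/bot-contact)",
}

def fetch(url: str, delay: float = 1.0) -> str | None:
    """Fetch one URL through the pool; back off when the site asks."""
    resp = requests.get(url, headers=HEADERS, proxies=PROXIES, timeout=30)
    if resp.status_code == 429:
        retry_after = resp.headers.get("Retry-After", "30")
        time.sleep(float(retry_after) if retry_after.isdigit() else 30.0)
        return None
    time.sleep(delay)  # stay near the ~1 request/second hygiene rule
    return resp.text if resp.ok else None
```
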

  • Budget Residential — $1.75/GB, 10M+ IPs, 195+ countries. Best fit for high-volume crawl where freshness matters more than premium pool quality.
  • Premium Residential — $2.75/GB, 130M+ IPs, sticky sessions up to 24h. For sites that fingerprint heavily or rotate captchas.
  • Static Datacenter — $1.50/proxy/month, unlimited bandwidth. Cheapest per byte; only works on lightly-protected targets.
  • LTE Mobile — $2/IP, real 4G/5G handsets. Lowest detection rate; reserve for the hardest targets and small volumes.

Quality Over Quantity

Through 2023 the consensus was "scale wins." By 2026 every well-known result — FineWeb-Edu, the DataComp-LM filtering competition, Phi-4's textbook-quality pretraining — points the other way. Modest, well-filtered corpora outperform raw web dumps 5–10x in compute efficiency. Practical filters:

  • Language ID — fasttext-176, drop documents under 0.95 confidence.
  • Length — remove documents under ~100 words or over ~50k words (likely junk, or PDFs that OCR'd badly).
  • Perplexity gating — score every doc with a small reference LM, keep the middle 80%. Low perplexity = boilerplate, high = garbage.
  • Quality classifier — train a binary classifier on Wikipedia / Stack Exchange (positive) vs random Common Crawl (negative). FineWeb-Edu's educational-value classifier is now the standard reference.
  • PII redaction — Microsoft Presidio, scrubadub, or a custom regex + NER pass.
  • Decontamination — 13-gram overlap against every public benchmark you may evaluate on; a minimal sketch follows this list.
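
Decontamination is the cheapest of these filters to implement. The sketch below is a bare-bones 13-gram overlap check, assuming benchmark items and candidate documents are already plain strings; production pipelines normalize more aggressively and hash the n-grams rather than keeping tuples in memory.

```python
# Bare-bones 13-gram decontamination: drop any training doc that shares
# a 13-gram with a benchmark item. Normalization here is deliberately minimal.
import re

N = 13

def ngrams(text: str, n: int = N) -> set[tuple[str, ...]]:
    """Lowercased word 13-grams of a document."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def decontaminate(train_docs: list[str], benchmark_items: list[str]) -> list[str]:
    # Union of all benchmark 13-grams; for large suites, hash each tuple instead.
    forbidden: set[tuple[str, ...]] = set()
    for item in benchmark_items:
        forbidden |= ngrams(item)
    return [doc for doc in train_docs if not (ngrams(doc) & forbidden)]
```
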

Tools & Infrastructure 2026

| Layer | Tools |
| --- | --- |
| Crawl orchestration | Crawlee, Scrapy, Apify, custom Go/Rust workers |
| JS rendering | Playwright, Browserbase, Steel.dev, Browser-use |
| Extraction | Trafilatura, Mozilla Readability, jusText, html2text, AI scraping (Firecrawl, ScrapeGraphAI) |
| Storage | Parquet on S3 / R2, Hugging Face Datasets, DuckDB for ad-hoc queries |
| Dedup | datasketch (MinHash), text-dedup, NeMo Curator |
| Filtering | NeMo Curator, DataTrove, Dolma, DataComp-LM |
| Labeling | Argilla, Label Studio, Scale Studio, Surge AI |
| Synthetic | distilabel, Magpie, self-instruct pipelines on Llama-4 / Claude |
| Tracking | Weights & Biases, MLflow, DVC, lakeFS for data versioning |

Best Practices Checklist

  • Publish a contact email in your User-Agent. Hostile crawls don't.
  • Cache aggressively — don't re-fetch the same URL twice in a quarter.
  • Set per-host rate limits, not just a global one. 1 RPS per host is a reasonable default; see the sketch after this checklist.
  • Pipe through a rotating proxy pool — for the source's sake, not just yours.
  • Snapshot the training-data summary alongside each model release. It's an EU AI Act obligation; treat it as a feature, not paperwork.
  • Run decontamination against every eval you plan to publish numbers on.
  • Version your data with the same rigor as your code.
  • Keep a kill-switch: a public form to honor erasure requests within 30 days.
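
Two of these items, quarterly caching and per-host rate limits, are easy to enforce in one small module. The sketch below is an in-memory illustration; a real crawler would persist the fetch log (SQLite, Redis) and share the limiter across workers.

```python
# Minimal per-host rate limiter plus a "don't re-fetch within a quarter" cache.
# In-memory only; persist these dicts for real crawls.
import time
from urllib.parse import urlparse

HOST_DELAY = 1.0                 # seconds between requests to the same host (~1 RPS)
REFETCH_AFTER = 90 * 24 * 3600   # one quarter, in seconds

last_hit: dict[str, float] = {}      # host -> last request time
last_fetched: dict[str, float] = {}  # url  -> last successful fetch time

def should_fetch(url: str) -> bool:
    """Skip URLs fetched within the last quarter."""
    return time.time() - last_fetched.get(url, 0.0) > REFETCH_AFTER

def wait_for_host(url: str) -> None:
    """Block until this host's per-host rate limit allows another request."""
    host = urlparse(url).netloc
    elapsed = time.time() - last_hit.get(host, 0.0)
    if elapsed < HOST_DELAY:
        time.sleep(HOST_DELAY - elapsed)
    last_hit[host] = time.time()
```
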

Related: What is AI scraping? · Proxies for LLM training · Best LLM training datasets · How AI agents use proxies.