Quick definition: AI data collection is the systematic gathering of text, images, and structured records used to train, fine-tune, or ground machine-learning models. In 2026, frontier LLM teams run three parallel pipelines — large-scale web crawl for pretraining, live retrieval (RAG) for grounding, and curated human-labeled sets for instruction tuning and evaluation. The legal frame tightened sharply after the EU AI Act's general-purpose provisions took effect in August 2025, the ongoing NYT v. OpenAI litigation, and the rapid adoption of llms.txt and ai.txt opt-out standards.
What Counts as AI Data Collection?
Anything you pull into a model is data collection — but the use-cases sort cleanly into four buckets, and the rules for each differ.
| Pipeline | Typical volume | Main source | Key constraint |
| --- | --- | --- | --- |
| Pretraining corpus | 10–30 trillion tokens | Common Crawl + custom web crawl | Copyright, opt-out signals, dedup |
| RAG / grounding | Live, per-query | Web search + first-party APIs | Freshness, citation, robots.txt |
| Fine-tune / SFT | 10k–10M examples | Human labels, synthetic data | Quality, contamination with eval |
| Evaluation | ~1k–500k items | Curated, often expert-written | Test-set leakage is fatal |
Where the Data Actually Comes From (2026)
- Common Crawl — The 250B+ page free archive remains the foundation. Most labs filter it through their own quality pipeline (FineWeb-Edu, RefinedWeb, RedPajama-V2) before using a fraction.
- Custom web crawls — Every frontier lab now runs its own crawler (GPTBot, ClaudeBot, Google-Extended, PerplexityBot). Custom crawls let labs hit recent content and target specific verticals Common Crawl under-samples.
- Partner data deals — News (AP, Reuters, FT, NYT-licensed competitors), Stack Exchange dump, Reddit licensing deal (Google/OpenAI). After the NYT lawsuit and EU AI Act transparency rules, deals are the safer path for premium text.
- Synthetic data — Used heavily for code, math, and reasoning chains. Self-instruct, Constitutional AI, and reject-and-retry pipelines now generate ~30–50% of fine-tuning corpora at top labs.
- Human labeling — Scale AI, Surge, Invisible, plus in-house ops teams. RLHF and rubric-based eval still need real humans for the highest-quality slices.
- First-party APIs — YouTube transcripts (when licensed), GitHub public repos, Wikipedia/Wikidata, government open data.
The 2026 Data Pipeline
For a team building an AI product (not operating at frontier-lab scale), the practical pipeline looks like this:
1. Define what you need. A domain-specific RAG corpus, a fine-tune set, or an eval suite? Volume and quality requirements differ by 100x.
2. Source check. Review robots.txt, llms.txt, ai.txt, the site's ToS, and copyright notices. EU AI Act Art. 53 requires providers to publish a sufficiently detailed training-data summary.
3. Crawl / fetch. Use a rotating proxy pool to avoid hot-spotting any single IP, identify your bot with a real User-Agent and contact URL, and respect crawl-delay.
4. Extract. Trafilatura, Mozilla Readability, jusText, or HTML-to-Markdown conversion for clean text. Discard nav, ads, and cookie banners.
5. Deduplicate. MinHash-LSH or SimHash at both document and line level. Pretraining corpora typically lose 60–80% of raw tokens to dedup.
6. Filter for quality. Language ID, perplexity, classifier-based filters (fastText, DataComp-LM), PII detection, NSFW and violence classifiers.
7. Decontaminate. Remove anything overlapping your eval sets. Test-set leakage inflates benchmarks and is the #1 reproducibility failure mode.
8. Document. Datasheets for Datasets (Gebru et al.) plus the EU AI Act training-data summary. Version everything.
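The dedup step above can be sketched with a toy MinHash in pure Python. This is illustrative only — production pipelines use datasketch or text-dedup, and the shingle size, permutation count, and hash construction here are arbitrary assumptions:

```python
import hashlib

def shingles(text, n=5):
    """Character n-gram shingles; word-level shingles are also common."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash(text, num_perm=64):
    """Toy MinHash signature: min of salted 64-bit hashes per 'permutation'."""
    sh = shingles(text)
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(s.encode(), digest_size=8,
                                salt=seed.to_bytes(8, "little")).digest(),
                "big",
            )
            for s in sh
        )
        for seed in range(num_perm)
    ]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = minhash("The quick brown fox jumps over the lazy dog.")
b = minhash("The quick brown fox jumped over the lazy dog!")
c = minhash("Completely unrelated text about proxy rotation.")
print(round(jaccard_estimate(a, b), 2))  # near-duplicates score high
print(round(jaccard_estimate(a, c), 2))  # unrelated docs score near zero
```

At corpus scale the signatures are bucketed with locality-sensitive hashing (MinHash-LSH) so only candidate pairs are compared, rather than all pairs.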
Legal & Ethical Boundaries (2026)
The compliance picture changed materially between 2023 and 2026. Treat this as the current floor, not exhaustive legal advice:
- EU AI Act — General-purpose model rules (Art. 51–55) applied from 2 August 2025. Providers must publish a training-data summary, respect EU copyright opt-outs (including the Art. 4 text-and-data-mining exception of the 2019 CDSM Directive), and keep technical documentation. Systemic-risk models (10^25+ FLOP) carry additional obligations.
- GDPR — Personal data scraped from public web is still personal data. You need a lawful basis (typically legitimate interest with a documented balancing test), and data subjects retain Art. 17 erasure rights against the trained model where technically feasible.
- U.S. case law — NYT v. OpenAI (2024) is the headline copyright fight. Outcomes will reshape what training counts as fair use. Until settled, premium publishers should be licensed, not scraped.
- Opt-out signals to respect — robots.txt, llms.txt, ai.txt, the Crawl-delay directive, meta noai/noimageai tags, and the newer Cloudflare AI bot block. Honoring these is both ethically right and a soft legal shield.
- Personal data — Do not collect email + name + phone tuples as training data. Even on "public" pages, that crosses GDPR / CCPA lines.
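Checking the first of those signals takes only the standard library. A minimal sketch, assuming a hypothetical crawler named ExampleAIBot and the made-up rules below (llms.txt and ai.txt have their own formats and need separate parsers):

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Evaluate a robots.txt body against our crawler's User-Agent."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

# Hypothetical robots.txt: blocks our bot from /premium/, allows everyone else.
robots = """\
User-agent: ExampleAIBot
Disallow: /premium/

User-agent: *
Disallow:
"""

print(may_fetch(robots, "ExampleAIBot", "https://example.com/premium/article"))  # False
print(may_fetch(robots, "ExampleAIBot", "https://example.com/blog/post"))        # True
```

In a real crawler you would fetch each host's /robots.txt once, cache it, and run this check before every request.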
Where Proxies Fit
You don't need proxies to collect AI data ethically — you need them to collect it at volume without breaking the source. A single IP hammering a site is rude, lands you on rate-limit blocklists, and can earn a permanent ban for the entire IP block your cloud provider assigned. A rotating residential pool spreads load across thousands of real consumer IPs, mimics natural traffic, and respects the practical hygiene rule of no more than one request per IP per second.
- Budget Residential — $1.75/GB, 10M+ IPs, 195+ countries. Best fit for high-volume crawl where freshness matters more than premium pool quality.
- Premium Residential — $2.75/GB, 130M+ IPs, sticky sessions up to 24h. For sites that fingerprint heavily or rotate captchas.
- Static Datacenter — $1.50/proxy/month, unlimited bandwidth. Cheapest per byte; only works on lightly-protected targets.
- LTE Mobile — $2/IP, real 4G/5G handsets. Lowest detection rate; reserve for the hardest targets and small volumes.
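Wiring a pool like the tiers above into a crawler can be as simple as cycling endpoints per request. The pool URLs and the ExampleAIBot User-Agent below are placeholders; most real providers instead expose a single gateway URL that rotates IPs server-side:

```python
import itertools

# Hypothetical pool endpoints (user:pass@host:port), for illustration only.
PROXY_POOL = [
    "http://user:pass@res-proxy-1.example:8000",
    "http://user:pass@res-proxy-2.example:8000",
    "http://user:pass@res-proxy-3.example:8000",
]

_rotation = itertools.cycle(PROXY_POOL)

def next_proxies() -> dict:
    """Per-request proxy mapping in the shape requests and most
    HTTP clients expect: {"http": ..., "https": ...}."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}

# Usage with requests (not executed here):
#   requests.get(url, proxies=next_proxies(), timeout=30,
#                headers={"User-Agent":
#                         "ExampleAIBot/1.0 (+https://example.com/bot)"})
```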
Quality Over Quantity
Through 2023 the consensus was "scale wins." By 2026 every well-known result — FineWeb-Edu, the DataComp-LM filtering competition, Phi-4's textbook-quality pretraining — points the other way. Modest, well-filtered corpora outperform raw web dumps by 5–10x in compute efficiency. Practical filters:
- Language ID — fastText lid.176; drop documents under 0.95 confidence.
- Length — remove documents under ~100 words or over ~50k (likely junk or PDFs that OCR'd badly).
- Perplexity gating — score every doc with a small reference LM, keep the middle 80%. Low perplexity = boilerplate, high = garbage.
- Quality classifier — train a binary classifier on Wikipedia / Stack Exchange (positive) vs random Common Crawl (negative). FineWeb-Edu's educational-value classifier is now the standard reference.
- PII redaction — Microsoft Presidio, scrubadub, or a custom regex + NER pass.
- Decontamination — 13-gram overlap against every public benchmark you may evaluate on.
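The 13-gram decontamination check is simple enough to sketch directly. Tokenization here is naive whitespace splitting, which is an assumption — production pipelines normalize punctuation and casing more carefully:

```python
def ngrams(text: str, n: int = 13) -> set:
    """All n-token windows of a document, lowercased."""
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def build_contamination_index(eval_examples, n: int = 13) -> set:
    """Union of all 13-gram windows across every eval set you publish on."""
    index = set()
    for ex in eval_examples:
        index |= ngrams(ex, n)
    return index

def is_contaminated(doc: str, index: set, n: int = 13) -> bool:
    """Flag any training doc sharing even one 13-gram with an eval item."""
    return bool(ngrams(doc, n) & index)

# Hypothetical eval question, for illustration.
eval_item = ("What is the capital of France and which river runs "
             "through its historic city center today")
index = build_contamination_index([eval_item])

print(is_contaminated("Quiz answer: " + eval_item, index))          # True
print(is_contaminated("A short unrelated paragraph about proxies.", index))  # False
```

At scale the index is usually a Bloom filter or hashed n-gram set rather than raw strings, but the logic is identical.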
| Layer | Tools |
| --- | --- |
| Crawl orchestration | Crawlee, Scrapy, Apify, custom Go/Rust workers |
| JS rendering | Playwright, Browserbase, Steel.dev, Browser-use |
| Extraction | Trafilatura, Mozilla Readability, jusText, html2text, AI scraping (Firecrawl, ScrapeGraphAI) |
| Storage | Parquet on S3 / R2, Hugging Face Datasets, DuckDB for ad-hoc |
| Dedup | datasketch (MinHash), text-dedup, NeMo Curator |
| Filtering | NeMo Curator, DataTrove, Dolma, DataComp-LM |
| Labeling | Argilla, Label Studio, Scale Studio, Surge AI |
| Synthetic | distilabel, Magpie, self-instruct pipelines on Llama-4 / Claude |
| Tracking | Weights & Biases, MLflow, DVC, lakeFS for data versioning |
Best Practices Checklist
- Publish a contact email in your User-Agent. Hostile crawls don't.
- Cache aggressively — don't re-fetch the same URL twice in a quarter.
- Set per-host rate limits, not just global. 1 RPS per host is a reasonable default.
- Pipe through a rotating proxy pool — for the source's sake, not just yours.
- Snapshot the training-data summary alongside each model release. It's an EU AI Act obligation; treat it as a feature, not paperwork.
- Run decontamination against every eval you plan to publish numbers on.
- Version your data with the same rigor as your code.
- Keep a kill-switch: a public form to honor erasure requests within 30 days.
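The per-host rate limit from the checklist can be sketched as a minimal throttle keyed on hostname; the one-second default matches the 1 RPS rule of thumb above, and class and method names are my own:

```python
import time
from collections import defaultdict
from urllib.parse import urlsplit

class PerHostThrottle:
    """Enforce a minimum interval between requests to the same host,
    independent of global throughput across the whole crawl."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self.last_hit = defaultdict(float)  # host -> monotonic timestamp

    def wait(self, url: str) -> float:
        """Sleep if this host was hit too recently; return the delay applied."""
        host = urlsplit(url).netloc
        now = time.monotonic()
        delay = max(0.0, self.last_hit[host] + self.min_interval - now)
        if delay:
            time.sleep(delay)
        self.last_hit[host] = time.monotonic()
        return delay

# Usage: call throttle.wait(url) immediately before each fetch.
#   throttle = PerHostThrottle()          # 1 request/second per host
#   throttle.wait("https://example.com/page")
```

Requests to different hosts never block each other, so a crawler can stay polite per source while keeping aggregate throughput high.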
Related: What is AI scraping? · Proxies for LLM training · Best LLM training datasets · How AI agents use proxies.