
Best LLM Training Datasets for 2026

Daniel K. | Published Mon May 18 2026

Quick guide: If you're pretraining or fine-tuning an LLM in 2026, you're probably using some mix of Common Crawl (raw web, the foundation), usually through a quality-filtered derivative such as FineWeb-Edu or Dolma, mixed with code (The Stack v2, StarCoderData), math (OpenWebMath, FineMath), and Wikipedia + Stack Exchange + books. For instruction tuning, mix human-labeled data (OpenAssistant, UltraFeedback) with synthetic data (Magpie, distilabel outputs). All-rights-reserved sources such as the NYT archive or the Stack Overflow dump have required licensing deals since 2024.

Pretraining Datasets

Common Crawl

The free monthly archive of the web: 250B+ pages, ~400 TB compressed. Every frontier lab starts here. Released under Common Crawl's permissive terms of use (research and commercial use allowed). Practical workflow: download a snapshot, run language ID + dedup + a quality classifier; expect to keep only 2–10% of the bytes (a minimal filtering sketch follows the list below).

  • Where: commoncrawl.org
  • Tokens after filtering: ~5–15 trillion English depending on cutoff
  • License: CC0-like / Open
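A minimal sketch of that filtering loop, assuming warcio and fasttext are installed, the fastText lid.176.bin language-ID model has been downloaded separately, and the WET filename is a placeholder for a real snapshot shard. Production pipelines add MinHash dedup and a trained quality classifier on top of this:

```python
import hashlib

import fasttext
from warcio.archiveiterator import ArchiveIterator

lang_model = fasttext.load_model("lid.176.bin")  # fastText language-ID model
seen_hashes = set()

def keep(text: str) -> bool:
    if len(text) < 500:                                # drop very short pages
        return False
    labels, probs = lang_model.predict(text.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < 0.65:  # non-English or low confidence
        return False
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    if digest in seen_hashes:                          # exact-duplicate removal
        return False
    seen_hashes.add(digest)
    return True  # a quality classifier (FineWeb-Edu-style) would score the text here

with open("CC-MAIN-placeholder.warc.wet.gz", "rb") as stream:  # placeholder shard name
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":            # WET plain-text records
            continue
        text = record.content_stream().read().decode("utf-8", errors="ignore")
        if keep(text):
            pass  # write the document to your filtered shard
```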

FineWeb (Hugging Face)

FineWeb is Common Crawl filtered through Hugging Face's 2024–25 pipeline: URL filtering, fastText language ID, heuristic quality filters, and MinHash dedup. The result is 15 trillion English tokens that outperform raw Common Crawl by roughly 10% on benchmarks per token spent (a streaming-load sketch follows the bullets below).

  • FineWeb-Edu (1.3T tokens) — the educational-quality subset, the new default reference corpus for academic models.
  • License: Open Data Commons Attribution License (ODC-By) — commercial use OK with attribution.
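If you want to inspect the data before committing to a full download, a streaming sketch like the following works. The sample-10BT config and the score field are taken from the public HuggingFaceFW/fineweb-edu dataset card and should be verified against the current version:

```python
from datasets import load_dataset

# Stream a 10B-token sample of FineWeb-Edu instead of downloading the full corpus
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc["url"], doc["score"])   # source URL and educational-quality score
    print(doc["text"][:200])          # first 200 characters of the document
    if i == 2:
        break
```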

Dolma (AI2)

The Allen Institute's open pretraining corpus: about 3T tokens (v1.7). A mix of Common Crawl, code from The Stack, peS2o academic papers, Project Gutenberg books, Wikipedia, and Reddit. The full data pipeline is open source.

  • Best for: reproducibility — full provenance per document.
  • License: AI2 ImpACT License (permissive but with use-policy guardrails).

RedPajama-V2

Together AI's open pretraining corpus, successor to the original RedPajama replication of the LLaMA mix. 30T tokens of Common Crawl with quality and language metadata at the document level, letting you re-filter on the fly (a re-filtering sketch follows the license bullet below). A common starting point for custom open pretraining mixes.

  • License: Apache-2.0
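A hedged sketch of that re-filtering idea. It assumes you have already joined each document with its quality-signal record (they may ship as separate files), and the field names (raw_content, quality_signals, rps_doc_word_count, ccnet_perplexity) and span layout are assumptions drawn from the dataset card, so check them against the files you actually download:

```python
import gzip
import json

def doc_value(signals: dict, key: str):
    # each quality signal is a list of [start, end, value] spans; span 0 covers the whole document
    spans = signals.get(key) or []
    return spans[0][2] if spans else None

def keep(doc: dict) -> bool:
    signals = json.loads(doc["quality_signals"])
    words = doc_value(signals, "rps_doc_word_count") or 0
    perplexity = doc_value(signals, "ccnet_perplexity") or float("inf")
    return words >= 200 and perplexity <= 500   # your own thresholds, applied on the fly

kept = []
with gzip.open("redpajama_v2_shard.json.gz", "rt") as f:   # placeholder shard path
    for line in f:
        doc = json.loads(line)
        if keep(doc):
            kept.append(doc["raw_content"])
```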

The Pile (EleutherAI)

The 2020 standard open pretraining corpus: 825 GB of curated text. Smaller than the others (~300B tokens) but still useful for smaller models. It was withdrawn and then restored in 2023 after copyright concerns; the current version has Books3 removed.

Code Datasets

| Dataset | Size | License | Notes |
| --- | --- | --- | --- |
| The Stack v2 (Hugging Face) | ~900B tokens, 600+ languages | License-filtered (OSI-approved only) | The new standard; filters by license per repo (see the sketch below). |
| StarCoderData | ~250B tokens | Permissive only | Same lineage as The Stack; used to train the original StarCoder. |
| CodeParrot | ~50B tokens (Python) | MIT/Apache filtered | Smaller, focused on Python. |
| BigCode CommitPackFT | ~2B tokens | License-filtered | Commit + message pairs for code-instruction tuning. |
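The per-repo license filtering that The Stack v2 and StarCoderData perform can be sketched as a simple allowlist check. The license and text field names here are illustrative, not the datasets' actual schema:

```python
# Keep a file only if its repository license is on a permissive allowlist
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense"}

def keep_file(doc: dict) -> bool:
    license_id = (doc.get("license") or "").lower()
    return license_id in PERMISSIVE

corpus = [
    {"license": "MIT", "text": "def add(a, b): return a + b"},
    {"license": "GPL-3.0", "text": "int main() { return 0; }"},
]
kept = [d for d in corpus if keep_file(d)]
print(len(kept))  # -> 1 (the GPL file is dropped)
```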

Math & Reasoning

  • OpenWebMath (Hugging Face) — ~15B tokens of math from the web, LaTeX-preserving.
  • FineMath (HF) — 2024/25 release, ~54B tokens of educational math text filtered from Common Crawl.
  • MATH + GSM8K — eval datasets, not training data; use carefully to avoid contamination (a simple n-gram decontamination sketch follows this list).
  • NuminaMath — 860k competition problems with solutions, widely used for open math-reasoning fine-tuning (it powered the AIMO-prize-winning NuminaMath models).
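A simple decontamination sketch: flag any training document that shares a 13-gram with an eval set such as GSM8K or MATH. The 13-gram window is an assumption; labs use different window sizes and fuzzier matching:

```python
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_texts: list, n: int = 13) -> set:
    # union of all n-grams that appear anywhere in the eval set
    index = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(doc: str, eval_index: set, n: int = 13) -> bool:
    # True if the training document shares at least one n-gram with the eval set
    return not ngrams(doc, n).isdisjoint(eval_index)
```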

Multilingual

  • HPLT 2.0 — 4.5T tokens across 75 languages, filtered.
  • FineWeb-2 (2026) — multilingual FineWeb, 1,000+ languages, quality-filtered.
  • MADLAD-400 — 3T tokens, 400 languages, from Google.
  • OSCAR 23.01 — multilingual Common Crawl extract, freely available.

Instruction Tuning Datasets

  • OpenAssistant Conversations — 161k human-written messages across multilingual conversation trees.
  • UltraFeedback — 64k prompts with 4 responses each, GPT-4-ranked, for DPO/PPO preference training (a pairing sketch follows this list).
  • UltraChat 200k — large-scale synthetic chat data.
  • Magpie — Llama-generated instructions, 1M+ examples, the basis for many 2025 open models.
  • Tulu 3 SFT Mix (AI2) — 940k examples curated for state-of-the-art post-training.
  • NVIDIA HelpSteer 3 — 38k preference pairs for instruction quality.
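For preference tuning, the binarized UltraFeedback release already pairs a higher-ranked and a lower-ranked response per prompt. A sketch of pulling (prompt, chosen, rejected) triples for DPO, assuming the HuggingFaceH4/ultrafeedback_binarized schema and train_prefs split; swap in the raw openbmb dump's fields if you use that instead:

```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized",
                  split="train_prefs", streaming=True)

pairs = []
for row in ds:
    pairs.append({
        "prompt": row["prompt"],
        "chosen": row["chosen"],      # higher-ranked response (chat-format messages)
        "rejected": row["rejected"],  # lower-ranked response
    })
    if len(pairs) >= 1000:            # small slice for a quick DPO smoke test
        break
```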

Vision-Language (Image-Caption) Datasets

  • LAION-5B — 5.85B image-text pairs, heavily used in the Stable Diffusion lineage. Note: subsets like LAION-2B-en carry CC-BY-style licensing on the metadata, but the full corpus has filtering issues; CSAM findings led to a takedown and a filtered 2024 re-release (Re-LAION-5B).
  • DataComp-CommonPool — 12.8B pairs, designed for ablation studies on filtering.
  • COYO-700M — alternative to LAION, Kakao Brain.
  • OBELICS — interleaved image-text web documents, 141M docs.

Speech & Audio

  • Common Voice 17 (Mozilla) — 30,000+ hours across 100+ languages, CC0.
  • LibriSpeech — 1,000 hours English audiobooks.
  • YODAS — 500k+ hours YouTube audio with transcripts (use-with-caution licensing).
  • Voxpopuli — 400k hours European parliament audio, public-domain transcripts.

Curated High-Quality Text

| Dataset | Tokens | License |
| --- | --- | --- |
| Wikipedia (all languages) | ~25B | CC-BY-SA |
| Stack Exchange dump | ~10B | CC-BY-SA |
| Project Gutenberg | ~6B | Public domain |
| arXiv full-text | ~30B | Mixed (per paper) |
| peS2o (S2ORC academic) | ~40B | Open access |
| FineFineWeb (subset of FineWeb-Edu) | ~370B | ODC-By |

Licensing & Legal Considerations

  • EU AI Act, Article 53. General-purpose model providers must publish a "sufficiently detailed" training-data summary and respect EU copyright opt-outs (incl. TDM exception under the 2019 CDSM Directive Art. 4). Effective from 2 August 2025 for new models.
  • U.S. case law. NYT v. OpenAI, Authors Guild v. OpenAI, Getty Images v. Stability AI remain unresolved. Until settled, premium publishers should be licensed, not scraped.
  • Opt-out signals. robots.txt, llms.txt, ai.txt, and Cloudflare's AI bot blocking. Respecting these is both ethical and a legal shield (a robots.txt check sketch follows this list).
  • Personal data. Even "public" personal data (names + emails) is GDPR-regulated. Strip PII before pretraining.
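A small standard-library sketch of honoring robots.txt before fetching a URL (llms.txt and ai.txt need their own parsers); the user-agent string is a placeholder:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url: str, user_agent: str = "MyTrainingDataBot") -> bool:
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

print(allowed("https://example.com/articles/some-page"))
```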

How to Pick

  1. Are you commercial? Use ODC-By / Apache / MIT licensed sources. Skip CC-NC and "research only" corpora.
  2. Compute-bound? Prefer FineWeb-Edu over raw Common Crawl; it delivers roughly 5x the benchmark return per training token.
  3. Multilingual? FineWeb-2, HPLT 2.0, MADLAD-400.
  4. Code-heavy model? The Stack v2 + StarCoderData.
  5. Instruction tuning? Tulu 3 SFT Mix + UltraFeedback for preference data.
  6. Vision-language? DataComp-CommonPool over LAION-5B in 2026.

Where Proxies Fit in Custom Crawling

If you're augmenting public datasets with a domain-specific crawl (medical literature, e-commerce, news), you'll need to scrape ethically and at volume. Spreading requests across a rotating residential pool is the practical way to crawl at scale without hot-spotting any single IP; a minimal sketch follows. Budget Residential at $1.75/GB is the standard pick for high-volume AI data collection.
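An illustrative sketch of routing crawl requests through a rotating residential gateway with the requests library; the gateway hostname, port, and credentials are placeholders, not real endpoints:

```python
import requests

PROXY = "http://USERNAME:PASSWORD@residential-gateway.example:8000"  # placeholder gateway

def fetch(url: str) -> str | None:
    # each request exits through a different residential IP behind the gateway
    resp = requests.get(url, proxies={"http": PROXY, "https": PROXY}, timeout=30)
    return resp.text if resp.ok else None
```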

Related: AI data collection process · Proxies for LLM training · What is AI scraping? · AI web scraping tools.