
Best LLM Training Datasets for 2026

Daniel K. | Published Mon May 18 2026

Quick guide: If you're pretraining or fine-tuning an LLM in 2026, you're probably using some mix of Common Crawl (raw web, the foundation), usually through a quality-filtered derivative such as FineWeb-Edu or Dolma, mixed with code (The Stack v2, StarCoderData), math (OpenWebMath, FineMath), and Wikipedia + Stack Exchange + books. For instruction tuning, mix human-labeled data (OpenAssistant, UltraFeedback) with synthetic data (Magpie, distilabel outputs). All-rights-reserved sources such as the NYT archive or the Stack Overflow dump have required licensing deals since 2024.

Pretraining Datasets

Common Crawl

The free monthly archive of the web: 250B+ pages, ~400 TB compressed. Every frontier lab starts here. Released under Common Crawl's permissive terms of use (research and commercial use allowed). Practical workflow: download a snapshot, run language ID + dedup + a quality classifier; expect to keep only 2–10% of the bytes (a minimal filtering sketch follows the list below).

  • Where: commoncrawl.org
  • Tokens after filtering: ~5–15 trillion English depending on cutoff
  • License: CC0-like / Open
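A minimal sketch of that filtering loop, assuming warcio and fasttext are installed, the fastText lid.176.bin language-ID model has been downloaded separately, and the WET filename is a placeholder for a real snapshot shard. Production pipelines add MinHash dedup and a trained quality classifier on top of this:

```python
import hashlib

import fasttext
from warcio.archiveiterator import ArchiveIterator

lang_model = fasttext.load_model("lid.176.bin")  # fastText language-ID model
seen_hashes = set()

def keep(text: str) -> bool:
    if len(text) < 500:                                # drop very short pages
        return False
    labels, probs = lang_model.predict(text.replace("\n", " "))
    if labels[0] != "__label__en" or probs[0] < 0.65:  # non-English or low confidence
        return False
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    if digest in seen_hashes:                          # exact-duplicate removal
        return False
    seen_hashes.add(digest)
    return True  # a quality classifier (FineWeb-Edu-style) would score the text here

with open("CC-MAIN-placeholder.warc.wet.gz", "rb") as stream:  # placeholder shard name
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":            # WET plain-text records
            continue
        text = record.content_stream().read().decode("utf-8", errors="ignore")
        if keep(text):
            pass  # write the document to your filtered shard
```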

FineWeb (Hugging Face)

FineWeb is Common Crawl filtered through Hugging Face's 2024–25 pipeline: URL filtering, fastText language ID, heuristic quality filters, and MinHash dedup. The result is 15 trillion English tokens that outperform raw Common Crawl by roughly 10% on benchmarks per token spent (a streaming-load sketch follows the bullets below).

  • FineWeb-Edu (1.3T tokens) — the educational-quality subset, the new default reference corpus for academic models.
  • License: Open Data Commons Attribution License (ODC-By) — commercial use OK with attribution.
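If you want to inspect the data before committing to a full download, a streaming sketch like the following works. The sample-10BT config and the score field are taken from the public HuggingFaceFW/fineweb-edu dataset card and should be verified against the current version:

```python
from datasets import load_dataset

# Stream a 10B-token sample of FineWeb-Edu instead of downloading the full corpus
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

for i, doc in enumerate(ds):
    print(doc["url"], doc["score"])   # source URL and educational-quality score
    print(doc["text"][:200])          # first 200 characters of the document
    if i == 2:
        break
```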

Dolma (AI2)

The Allen Institute's open pretraining corpus: about 3T tokens (v1.7). A mix of Common Crawl, code from The Stack, peS2o academic papers, Project Gutenberg books, Wikipedia, and Reddit. The full data pipeline is open source.

  • Best for: reproducibility — full provenance per document.
  • License: AI2 ImpACT License (permissive but with use-policy guardrails).

RedPajama-V2

Together AI's open pretraining corpus, successor to the original RedPajama replication of the LLaMA mix. 30T tokens of Common Crawl with quality and language metadata at the document level, letting you re-filter on the fly (a re-filtering sketch follows the license bullet below). A common starting point for custom open pretraining mixes.

  • License: Apache-2.0
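A hedged sketch of that re-filtering idea. It assumes you have already joined each document with its quality-signal record (they may ship as separate files), and the field names (raw_content, quality_signals, rps_doc_word_count, ccnet_perplexity) and span layout are assumptions drawn from the dataset card, so check them against the files you actually download:

```python
import gzip
import json

def doc_value(signals: dict, key: str):
    # each quality signal is a list of [start, end, value] spans; span 0 covers the whole document
    spans = signals.get(key) or []
    return spans[0][2] if spans else None

def keep(doc: dict) -> bool:
    signals = json.loads(doc["quality_signals"])
    words = doc_value(signals, "rps_doc_word_count") or 0
    perplexity = doc_value(signals, "ccnet_perplexity") or float("inf")
    return words >= 200 and perplexity <= 500   # your own thresholds, applied on the fly

kept = []
with gzip.open("redpajama_v2_shard.json.gz", "rt") as f:   # placeholder shard path
    for line in f:
        doc = json.loads(line)
        if keep(doc):
            kept.append(doc["raw_content"])
```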

The Pile (EleutherAI)

The 2020 standard open pretraining corpus: 825 GB of curated text. Smaller than the others (~300B tokens) but still useful for smaller models. It was withdrawn and then restored in 2023 after copyright concerns; the current version has Books3 removed.

Code Datasets

| Dataset | Size | License | Notes |
| --- | --- | --- | --- |
| The Stack v2 (Hugging Face) | ~900B tokens, 600+ languages | License-filtered (OSI-approved only) | The new standard; filters by license per repo (see the sketch below). |
| StarCoderData | ~250B tokens | Permissive only | Same lineage as The Stack; used to train the original StarCoder. |
| CodeParrot | ~50B tokens (Python) | MIT/Apache filtered | Smaller, focused on Python. |
| BigCode CommitPackFT | ~2B tokens | License-filtered | Commit + message pairs for code-instruction tuning. |
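The per-repo license filtering that The Stack v2 and StarCoderData perform can be sketched as a simple allowlist check. The license and text field names here are illustrative, not the datasets' actual schema:

```python
# Keep a file only if its repository license is on a permissive allowlist
PERMISSIVE = {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense"}

def keep_file(doc: dict) -> bool:
    license_id = (doc.get("license") or "").lower()
    return license_id in PERMISSIVE

corpus = [
    {"license": "MIT", "text": "def add(a, b): return a + b"},
    {"license": "GPL-3.0", "text": "int main() { return 0; }"},
]
kept = [d for d in corpus if keep_file(d)]
print(len(kept))  # -> 1 (the GPL file is dropped)
```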

Math & Reasoning

  • OpenWebMath (Hugging Face) — ~15B tokens of math from the web, LaTeX-preserving.
  • FineMath (HF) — 2024/25 release, ~54B tokens of educational math text filtered from Common Crawl.
  • MATH + GSM8K — eval datasets, not training data; use carefully to avoid contamination (a simple n-gram decontamination sketch follows this list).
  • NuminaMath — 860k competition problems with solutions, widely used for open math-reasoning fine-tuning (it powered the AIMO-prize-winning NuminaMath models).
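A simple decontamination sketch: flag any training document that shares a 13-gram with an eval set such as GSM8K or MATH. The 13-gram window is an assumption; labs use different window sizes and fuzzier matching:

```python
def ngrams(text: str, n: int = 13) -> set:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_eval_index(eval_texts: list, n: int = 13) -> set:
    # union of all n-grams that appear anywhere in the eval set
    index = set()
    for text in eval_texts:
        index |= ngrams(text, n)
    return index

def is_contaminated(doc: str, eval_index: set, n: int = 13) -> bool:
    # True if the training document shares at least one n-gram with the eval set
    return not ngrams(doc, n).isdisjoint(eval_index)
```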

Multilingual

  • HPLT 2.0 — 4.5T tokens across 75 languages, filtered.
  • FineWeb-2 (2026) — multilingual FineWeb, 1,000+ languages, quality-filtered.
  • MADLAD-400 — 3T tokens, 400 languages, from Google.
  • OSCAR 23.01 — multilingual Common Crawl extract, freely available.

Instruction Tuning Datasets

  • OpenAssistant Conversations — 161k human-written messages across multilingual conversation trees.
  • UltraFeedback — 64k prompts with 4 responses each, GPT-4-ranked, for DPO/PPO preference training (a pairing sketch follows this list).
  • UltraChat 200k — large-scale synthetic chat data.
  • Magpie — Llama-generated instructions, 1M+ examples, the basis for many 2025 open models.
  • Tulu 3 SFT Mix (AI2) — 940k examples curated for state-of-the-art post-training.
  • NVIDIA HelpSteer 3 — 38k preference pairs for instruction quality.
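For preference tuning, the binarized UltraFeedback release already pairs a higher-ranked and a lower-ranked response per prompt. A sketch of pulling (prompt, chosen, rejected) triples for DPO, assuming the HuggingFaceH4/ultrafeedback_binarized schema and train_prefs split; swap in the raw openbmb dump's fields if you use that instead:

```python
from datasets import load_dataset

ds = load_dataset("HuggingFaceH4/ultrafeedback_binarized",
                  split="train_prefs", streaming=True)

pairs = []
for row in ds:
    pairs.append({
        "prompt": row["prompt"],
        "chosen": row["chosen"],      # higher-ranked response (chat-format messages)
        "rejected": row["rejected"],  # lower-ranked response
    })
    if len(pairs) >= 1000:            # small slice for a quick DPO smoke test
        break
```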

Vision-Language (Image-Caption) Datasets

  • LAION-5B — 5.85B image-text pairs, heavily used in the Stable Diffusion lineage. Note: subsets like LAION-2B-en carry CC-BY-style licensing on the metadata, but the full corpus has filtering issues; CSAM findings led to a takedown and a filtered 2024 re-release (Re-LAION-5B).
  • DataComp-CommonPool — 12.8B pairs, designed for ablation studies on filtering.
  • COYO-700M — alternative to LAION, Kakao Brain.
  • OBELICS — interleaved image-text web documents, 141M docs.

Speech & Audio

  • Common Voice 17 (Mozilla) — 30,000+ hours across 100+ languages, CC0.
  • LibriSpeech — 1,000 hours English audiobooks.
  • YODAS — 500k+ hours YouTube audio with transcripts (use-with-caution licensing).
  • Voxpopuli — 400k hours European parliament audio, public-domain transcripts.

Curated High-Quality Text

| Dataset | Tokens | License |
| --- | --- | --- |
| Wikipedia (all languages) | ~25B | CC-BY-SA |
| Stack Exchange dump | ~10B | CC-BY-SA |
| Project Gutenberg | ~6B | Public domain |
| arXiv full-text | ~30B | Mixed (per paper) |
| peS2o (S2ORC academic) | ~40B | Open access |
| FineFineWeb (subset of FineWeb-Edu) | ~370B | ODC-By |

Licensing & Legal Considerations

  • EU AI Act, Article 53. General-purpose model providers must publish a "sufficiently detailed" training-data summary and respect EU copyright opt-outs (incl. TDM exception under the 2019 CDSM Directive Art. 4). Effective from 2 August 2025 for new models.
  • U.S. case law. NYT v. OpenAI, Authors Guild v. OpenAI, Getty Images v. Stability AI remain unresolved. Until settled, premium publishers should be licensed, not scraped.
  • Opt-out signals. robots.txt, llms.txt, ai.txt, and Cloudflare's AI bot blocking. Respecting these is both ethical and a legal shield (a robots.txt check sketch follows this list).
  • Personal data. Even "public" personal data (names + emails) is GDPR-regulated. Strip PII before pretraining.
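A small standard-library sketch of honoring robots.txt before fetching a URL (llms.txt and ai.txt need their own parsers); the user-agent string is a placeholder:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed(url: str, user_agent: str = "MyTrainingDataBot") -> bool:
    parts = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                  # fetch and parse the site's robots.txt
    return rp.can_fetch(user_agent, url)

print(allowed("https://example.com/articles/some-page"))
```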

How to Pick

  1. Are you commercial? Use ODC-By / Apache / MIT licensed sources. Skip CC-NC and "research only" corpora.
  2. Compute-bound? Prefer FineWeb-Edu over raw Common Crawl; it delivers roughly 5x the benchmark return per training token.
  3. Multilingual? FineWeb-2, HPLT 2.0, MADLAD-400.
  4. Code-heavy model? The Stack v2 + StarCoderData.
  5. Instruction tuning? Tulu 3 SFT Mix + UltraFeedback for preference data.
  6. Vision-language? DataComp-CommonPool over LAION-5B in 2026.

Where Proxies Fit in Custom Crawling

If you're augmenting public datasets with a domain-specific crawl (medical literature, e-commerce, news), you'll need to scrape ethically and at volume. Spreading requests across a rotating residential pool is the practical way to crawl at scale without hot-spotting any single IP; a minimal sketch follows. Budget Residential at $1.75/GB is the standard pick for high-volume AI data collection.
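An illustrative sketch of routing crawl requests through a rotating residential gateway with the requests library; the gateway hostname, port, and credentials are placeholders, not real endpoints:

```python
import requests

PROXY = "http://USERNAME:PASSWORD@residential-gateway.example:8000"  # placeholder gateway

def fetch(url: str) -> str | None:
    # each request exits through a different residential IP behind the gateway
    resp = requests.get(url, proxies={"http": PROXY, "https": PROXY}, timeout=30)
    return resp.text if resp.ok else None
```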

Related: AI data collection process · Proxies for LLM training · What is AI scraping? · AI web scraping tools.