Browse Data

73–96 of 227

Weather Geo ERA5 — 1.065B Global Weather Records (1940-2025)

1.065 billion ERA5 reanalysis weather records covering 85+ years (1940-2025) of global weather at 0.25° resolution. Geographically partitioned parquet for efficient regional queries.

1,064,793,600 rows·PARQUET·0 downloads

text

Free · fixed price

Falcon RefinedWeb

Massive English web corpus (~600M docs) from TII, built via stringent filtering and deduplication of CommonCrawl. Powers the Falcon LLM family. ODC-By 1.0 licensed, parquet format.

968,000,015 rows·PARQUET·0 downloads

text

Free · fixed price

GSM8K — Grade School Math Word Problems (OpenAI)

8.5K linguistically diverse grade-school math word problems with step-by-step solutions. Standard benchmark for multi-step arithmetic reasoning in LLMs.

17,584 rows·PARQUET·0 downloads

text

Free · fixed price

OpenWebMath: 14.7B Tokens of Mathematical Web Text

6.3M high-quality mathematical documents (14.7B tokens) filtered from 200B+ Common Crawl HTML files. Parquet format, ideal for LLM math pretraining and fine-tuning.

6,315,233 rows·PARQUET·0 downloads

text

Free · fixed price

Nemotron-Math-v2: Mathematical Reasoning Trajectories (NVIDIA)

NVIDIA's 347K math problems with 7M model-generated reasoning trajectories for distilling mathematical reasoning. Long-context, tool-use, multi-mode supervision. CC-BY-4.0/CC-BY-SA-4.0.

7,085,839 rows·PARQUET·0 downloads

text

Free · fixed price

OpenOrca — Augmented FLAN Instruction Dataset

~4M GPT-4/GPT-3.5 augmented FLAN instruction-response pairs aligned with the Orca paper distribution. Widely used for instruction tuning and fine-tuning open LLMs.

2,942,029 rows·PARQUET·0 downloads

text

Free · fixed price

SmolLM Corpus — Curated Pretraining Data for Small Language Models

High-quality educational and synthetic pretraining corpus from HuggingFaceTB. Includes Cosmopedia v2 (39M+ synthetic textbooks/stories), Python-Edu, and FineWeb-Edu deduplicated subsets. Parquet, 100M-1B rows, ODC-By licensed.

236,980,453 rows·PARQUET·0 downloads

text

Free · fixed price

Nemotron Safety Guard Dataset v3 (Multilingual LLM Safety, 12 Languages)

NVIDIA's 514K-sample multilingual safety dataset for training LLM safety guard models across 12 languages, generated via the CultureGuard pipeline. CC-BY-4.0.

514,617 rows·PARQUET·0 downloads

text

Free · fixed price

StackMathQA: 2M Math Q&A from Stack Exchange

Curated collection of 2 million mathematical questions and answers sourced from Stack Exchange sites (Math, MathOverflow, Stats, etc.). Designed for math reasoning fine-tuning, pretraining, and RAG.

6,195,432 rows·PARQUET·0 downloads

scientific

Free · fixed price

Sci-Base: AI-Ready Scientific Foundation Dataset

Large multi-domain scientific text corpus (chem, bio, climate, medical, materials, earth, physics) from OpenDataLab's Sciverse foundation, 1M-10M rows in Parquet, CC-BY-4.0.

3,631,260 rows·PARQUET·0 downloads

text

Free · fixed price

WildChat-4.8M: Real Human-ChatGPT Conversations (Non-Toxic Subset)

3.2M real user↔ChatGPT conversations collected in the wild by AI2. Non-toxic subset, parquet format, ODC-By licensed. Widely used for instruction tuning, RAG, and chat-model evaluation.

3,199,860 rows·PARQUET·0 downloads

text

Free · fixed price

TriviaQA — Reading Comprehension QA with Evidence Documents

650K+ question-answer-evidence triples for open-domain and reading comprehension QA. Standard benchmark for retrieval-augmented QA, RAG evals, and extractive/abstractive QA model training.

847,579 rows·PARQUET·0 downloads

text

Free · fixed price

OpenCodeInstruct: 5M Instruction Tuning Samples for Code LLMs (NVIDIA)

NVIDIA's 5M-sample open-access instruction tuning dataset for supervised fine-tuning of code LLMs. Synthetic, diverse coding instructions and responses in English, CC-BY-4.0 licensed.

1,400,000 rows·PARQUET·0 downloads

images

Free · fixed price

TreeOfLife-200M: Biodiversity Image Dataset (214M images, 952K taxa)

214M biology/species images across 952,257 taxa from GBIF, EOL, BIOSCAN-5M, and FathomNet. CC0 licensed, parquet format, ready for CV/CLIP training and zero-shot taxonomic classification.

213,937,319 images·PARQUET·0 downloads

text

Free · fixed price

MathVista — Visual Mathematical Reasoning Benchmark

Multimodal math reasoning benchmark with ~6K image+text QA examples across geometry, charts, figures, and scientific diagrams. Standard eval for vision-language models.

6,141 rows·PARQUET·0 downloads

text

Free · fixed price

Nemotron-PrismMath: 1M Synthetic Math Reasoning Problems

NVIDIA's 1M diverse math problem-solution pairs generated via Prismatic Synthesis. State-of-the-art math reasoning dataset for LLM fine-tuning and evaluation. CC-BY-4.0 licensed for commercial use.

1,002,595 rows·PARQUET·0 downloads

images

Free · fixed price

Nemotron VLM Dataset v2 (NVIDIA)

NVIDIA's large-scale vision-language training dataset (~9M samples) for VQA, image-text-to-text, video-text-to-text, and document understanding. CC-BY-4.0.

4,518,886 images·PARQUET·0 downloads

text

Free · fixed price

Eurus-2-RL-Data: Math & Coding RL Training Dataset with Outcome Verifiers

High-quality RL training dataset combining math problems (NuminaMath-CoT) and coding problems (APPS, CodeContests, TACO, Codeforces) with outcome verifiers — LaTeX answers for math and test cases for code. ~450K examples, parquet, MIT-licensed.

482,585 rows·PARQUET·0 downloads

text

Free · fixed price

FineFineWeb: Fine-Grained Domain Web Corpus

Large-scale (billions of tokens) English web corpus from m-a-p, organized by fine-grained domains (aerospace, agronomy, artistic, etc.) for domain-aware LLM pretraining, classification, and RAG.

1,646,140 rows·PARQUET·0 downloads

text

Free · fixed price

MMLU-Pro: Robust Multi-Task LLM Benchmark (12K Questions)

12K challenging multi-discipline multiple-choice questions for benchmarking LLM reasoning. MIT-licensed, parquet format, widely used for model evaluation and leaderboards.

12,102 rows·PARQUET·0 downloads

text

Free · fixed price

Aya Collection (Language-Split) — 513M Multilingual Instruction Instances

Cohere Labs' Aya Collection re-uploaded with per-language splits. 513M multilingual instruction-tuning instances across 115+ languages in parquet. Apache-2.0.

513,757,344 rows·PARQUET·0 downloads

text

Free · fixed price

FineInstructions Nemotron — 1B+ Synthetic Instruction-Answer Pairs

Approximately 1 billion synthetic instruction-answer pairs (~300B tokens) generated via the FineInstructions pipeline over the Nemotron-CC high-quality CommonCrawl pre-training corpus. Parquet format with per-shard judge scoring files.

1,228,476,202 rows·PARQUET·0 downloads

images

Free · fixed price

FineVision — 24M-Sample Vision-Language Training Corpus

Massive open VLM training set: 17.3M images, 24.3M samples, 88.9M turns, 9.5B answer tokens. Parquet, multi-subset, ready for fine-tuning state-of-the-art vision-language models.

24,209,105 images·PARQUET·0 downloads

text

Free · fixed price

DeepSearchQA — Google DeepMind 900-Prompt Agent Factuality Benchmark

900-prompt benchmark from Google DeepMind for evaluating agents on difficult multi-step information-seeking tasks across 17 fields. Apache-2.0, CSV format.

900 rows·PARQUET·0 downloads
