To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search
Full API docs: https://api.databazaar.io/llms.txt
Agent discovery: https://databazaar.io/.well-known/agent.json
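A minimal sketch of the search call above, using only the Python standard library. The endpoint and the `query` parameter come from the line above; the helper names `build_search_url` and `search_datasets` are hypothetical, and the assumption that the response body is JSON is not confirmed here (check the API docs at llms.txt before relying on it).

```python
import json
import urllib.request
from urllib.parse import urlencode

# Endpoint taken from the search hint above.
BASE_URL = "https://api.databazaar.io/datasets"


def build_search_url(query: str) -> str:
    """Return the search URL with the query string URL-encoded."""
    return f"{BASE_URL}?{urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0):
    """Send the GET request and decode the body.

    Assumption: the API responds with JSON. The response schema is not
    described on this page, so the decoded value is returned unchanged.
    """
    with urllib.request.urlopen(build_search_url(query), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; no network request is made here.
    print(build_search_url("math reasoning"))
```

Keeping URL construction separate from the network call makes the encoding step easy to check offline; call `search_datasets("math reasoning")` only when network access is available.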
Browse Data
73–96 of 227
Weather Geo ERA5 — 1.065B Global Weather Records (1940-2025)
1.065 billion ERA5 reanalysis weather records covering 85+ years (1940-2025) of global weather at 0.25° resolution. Stored as geographically partitioned parquet for efficient regional queries.
Falcon RefinedWeb
Massive English web corpus (~600M docs) from TII, built via stringent filtering and deduplication of CommonCrawl. Powers the Falcon LLM family. ODC-By 1.0 licensed, parquet format.
GSM8K — Grade School Math Word Problems (OpenAI)
8.5K linguistically diverse grade-school math word problems with step-by-step solutions. Standard benchmark for multi-step arithmetic reasoning in LLMs.
OpenWebMath: 14.7B Tokens of Mathematical Web Text
6.3M high-quality mathematical documents (14.7B tokens) filtered from 200B+ Common Crawl HTML files. Parquet format, ideal for LLM math pretraining and fine-tuning.
Nemotron-Math-v2: Mathematical Reasoning Trajectories (NVIDIA)
NVIDIA's 347K math problems with 7M model-generated reasoning trajectories for distilling mathematical reasoning. Long-context, tool-use, multi-mode supervision. CC-BY-4.0/CC-BY-SA-4.0.
OpenOrca — Augmented FLAN Instruction Dataset
~4M GPT-4/GPT-3.5 augmented FLAN instruction-response pairs aligned with the Orca paper distribution. Widely used for instruction tuning and fine-tuning open LLMs.
SmolLM Corpus — Curated Pretraining Data for Small Language Models
High-quality educational and synthetic pretraining corpus from HuggingFaceTB. Includes Cosmopedia v2 (39M+ synthetic textbooks/stories), Python-Edu, and FineWeb-Edu deduplicated subsets. Parquet, 100M-1B rows, ODC-By licensed.
Nemotron Safety Guard Dataset v3 (Multilingual LLM Safety, 12 Languages)
NVIDIA's 514K-sample multilingual safety dataset for training LLM safety guard models across 12 languages, generated via the CultureGuard pipeline. CC-BY-4.0.
StackMathQA: 2M Math Q&A from Stack Exchange
Curated collection of 2 million mathematical questions and answers sourced from Stack Exchange sites (Math, MathOverflow, Stats, etc.). Designed for math reasoning fine-tuning, pretraining, and RAG.
Sci-Base: AI-Ready Scientific Foundation Dataset
Large multi-domain scientific text corpus (chem, bio, climate, medical, materials, earth, physics) from OpenDataLab's Sciverse foundation, 1M-10M rows in Parquet, CC-BY-4.0.
WildChat-4.8M: Real Human-ChatGPT Conversations (Non-Toxic Subset)
3.2M real user↔ChatGPT conversations collected in the wild by AI2. Non-toxic subset, parquet format, ODC-By licensed. Widely used for instruction tuning, RAG, and chat-model evaluation.
TriviaQA — Reading Comprehension QA with Evidence Documents
650K+ question-answer-evidence triples for open-domain and reading comprehension QA. Standard benchmark for retrieval-augmented QA, RAG evals, and extractive/abstractive QA model training.
OpenCodeInstruct: 5M Instruction Tuning Samples for Code LLMs (NVIDIA)
NVIDIA's 5M-sample open-access instruction tuning dataset for supervised fine-tuning of code LLMs. Synthetic, diverse coding instructions and responses in English, CC-BY-4.0 licensed.
TreeOfLife-200M: Biodiversity Image Dataset (214M images, 952K taxa)
214M biology/species images across 952,257 taxa from GBIF, EOL, BIOSCAN-5M, and FathomNet. CC0 licensed, parquet format, ready for CV/CLIP training and zero-shot taxonomic classification.
MathVista — Visual Mathematical Reasoning Benchmark
Multimodal math reasoning benchmark with ~6K image+text QA examples across geometry, charts, figures, and scientific diagrams. Standard eval for vision-language models.
Nemotron-PrismMath: 1M Synthetic Math Reasoning Problems
NVIDIA's 1M diverse math problem-solution pairs generated via Prismatic Synthesis. State-of-the-art math reasoning dataset for LLM fine-tuning and evaluation. CC-BY-4.0 licensed for commercial use.
Nemotron VLM Dataset v2 (NVIDIA)
NVIDIA's large-scale vision-language training dataset (~9M samples) for VQA, image-text-to-text, video-text-to-text, and document understanding. CC-BY-4.0.
Eurus-2-RL-Data: Math & Coding RL Training Dataset with Outcome Verifiers
High-quality RL training dataset combining math problems (NuminaMath-CoT) and coding problems (APPS, CodeContests, TACO, Codeforces) with outcome verifiers — LaTeX answers for math and test cases for code. ~450K examples, parquet, MIT-licensed.
FineFineWeb: Fine-Grained Domain Web Corpus
Large-scale (billions of tokens) English web corpus from m-a-p, organized by fine-grained domains (aerospace, agronomy, artistic, etc.) for domain-aware LLM pretraining, classification, and RAG.
MMLU-Pro: Robust Multi-Task LLM Benchmark (12K Questions)
12K challenging multi-discipline multiple-choice questions for benchmarking LLM reasoning. MIT-licensed, parquet format, widely used for model evaluation and leaderboards.
Aya Collection (Language-Split) — 513M Multilingual Instruction Instances
Cohere Labs' Aya Collection re-uploaded with per-language splits. 513M multilingual instruction-tuning instances across 115+ languages in parquet. Apache-2.0.
FineInstructions Nemotron — 1B+ Synthetic Instruction-Answer Pairs
Approximately 1 billion synthetic instruction-answer pairs (~300B tokens) generated via the FineInstructions pipeline over the Nemotron-CC high-quality CommonCrawl pre-training corpus. Parquet format with per-shard judge scoring files.
FineVision — 24M-Sample Vision-Language Training Corpus
Massive open VLM training set: 17.3M images, 24.3M samples, 88.9M turns, 9.5B answer tokens. Parquet, multi-subset, ready for fine-tuning state-of-the-art vision-language models.
DeepSearchQA — Google DeepMind 900-Prompt Agent Factuality Benchmark
900-prompt benchmark from Google DeepMind for evaluating agents on difficult multi-step information-seeking tasks across 17 fields. Apache-2.0, CSV format.