To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search
Full API docs: https://api.databazaar.io/llms.txt
Agent discovery: https://databazaar.io/.well-known/agent.json
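A minimal Python sketch of the search call above, using only the standard library. The endpoint and the `query` parameter come from the line above; the helper names `build_search_url` and `search_datasets` are hypothetical, and the assumption that the response body is JSON is not confirmed by this page (check the API docs for the actual schema).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.databazaar.io/datasets"  # endpoint quoted in the text above


def build_search_url(query: str) -> str:
    """Return the search URL with the query string safely encoded."""
    return f"{BASE_URL}?{urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0):
    """Send the GET request and decode the response body.

    Assumption: the endpoint responds with JSON. The response schema is not
    documented on this page, so the parsed value is returned untouched.
    """
    with urlopen(build_search_url(query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the URL; call search_datasets(...) to hit the network.
    print(build_search_url("math reasoning"))
    # prints https://api.databazaar.io/datasets?query=math+reasoning
```

Keeping URL construction separate from the network call makes the encoding step easy to check offline before any request is sent.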
Browse Data
121–144 of 220
BigCodeBench: Code Generation Benchmark (Complete & Instruct)
1,140-task code generation benchmark from BigCode with both docstring-based completion and NL-instruction variants, 99% test coverage, Apache-2.0 licensed.
WildChat-1M: 1 Million Real Human-ChatGPT Conversations
1M real-world conversations between human users and ChatGPT (GPT-3.5/4), with demographics, timestamps, languages, and toxicity labels. Parquet format, ODC-BY licensed. Widely used for instruction tuning, eval, and alignment research.
HelpSteer3 — NVIDIA Human Preference & Feedback Dataset for RLHF/Reward Modeling
NVIDIA's open-source multilingual preference dataset for training reward models and aligning LLMs. CC-BY-4.0, 100K+ samples across 15 languages. Reward models trained on it are reported as state of the art, reaching 85.5% on RM-Bench and 78.6% on JudgeBench.
Alpaca Cleaned — Instruction Fine-Tuning Dataset
Cleaned version of Stanford's Alpaca instruction-following dataset (~52K examples). Fixes hallucinations, merged instructions, empty outputs, and other quality issues. CC-BY-4.0, ready for LLM fine-tuning.
OpenR1-Math-220k: Math Reasoning Traces from DeepSeek R1
220k math problems with verified DeepSeek R1 reasoning traces, sourced from NuminaMath 1.5. Apache-2.0 licensed dataset for training and evaluating mathematical reasoning models.
ScienceQA: Multimodal Science Question Answering with Chain-of-Thought
21K multimodal multiple-choice science questions with images, lectures, and chain-of-thought explanations spanning natural science, social science, and language science. Widely used for VLM evaluation and CoT fine-tuning.
Nemotron Content Safety Dataset V2 (Aegis 2.0)
33,416 annotated human-LLM interactions for content safety classification across 12+ harm categories. Used for training and evaluating LLM safety guardrails like NeMo Guard.
Hermes Agent Reasoning Traces (Kimi-K2.5 & GLM-5.1)
14,700+ multi-turn agent tool-calling trajectories with step-by-step reasoning traces and real tool execution results, generated via the Hermes Agent harness from Kimi-K2.5 and GLM-5.1 models. Apache-2.0.
UltraChat 200k (Zephyr SFT Dataset)
Filtered 200k-dialogue subset of UltraChat used to train Zephyr-7B-β. High-quality multi-turn ChatGPT-generated conversations for supervised fine-tuning of chat models. MIT licensed, Parquet format.
NuminaMath-TIR: Tool-Integrated Reasoning Math Problems
~70k math problems with tool-integrated reasoning (TIR) traces generated via GPT-4, derived from NuminaMath-CoT. Apache-2.0 licensed, ideal for training math-reasoning agents with code execution.
Full HH-RLHF (Prompt/Chosen/Rejected Format)
Anthropic's Helpful & Harmless RLHF dataset reformatted into prompt/chosen/rejected triples for preference modeling and DPO/RLHF training.
Wikimedia Wikipedia (All Languages, Cleaned)
Cleaned full-text Wikipedia articles across 300+ language subsets, built from official Wikimedia dumps. Parquet format, one row per article. Foundational corpus for LLM pretraining, RAG, and multilingual NLP.
SWE-bench: Real-World GitHub Issue Resolution Benchmark
2,294 Issue-PR pairs from 12 popular Python repos for evaluating LLM/agent ability to resolve real GitHub issues. Verified via post-PR unit tests. Canonical agent coding benchmark.
C4: Colossal Clean Crawled Corpus (en + multilingual mC4)
Cleaned Common Crawl web text corpus from AllenAI/Google. 305GB English + 9.7TB multilingual (108 languages). Foundational pretraining dataset behind T5 and many open LLMs. ODC-BY licensed.
FinePDFs — 3T Tokens from 475M PDFs in 1,733 Languages
Largest public PDF-sourced corpus: ~3 trillion tokens, 475M documents, 1,733 languages. Parquet format, ODC-By licensed. Built by HuggingFaceFW for pretraining and multilingual research.
FineWeb-Edu: 1.3T Tokens of Educational Web Content
1.3 trillion tokens of high-quality educational web pages filtered from FineWeb using a classifier trained on Llama3-70B-Instruct annotations. Parquet format, ODC-By licensed, ideal for LLM pretraining and RAG.
Common Corpus: 2.27T Tokens of Permissively Licensed Text
Largest open and permissively licensed text dataset: 2.27 trillion tokens across books, newspapers, scientific articles, legal docs, code, and more in 13+ languages. Built by PleIAs for LLM pretraining.
MMLU — Massive Multitask Language Understanding Benchmark
57-subject multiple-choice benchmark (humanities, STEM, social sciences, professional) for evaluating LLM knowledge and reasoning. ~116K questions in total, a count that includes the auxiliary training split alongside the dev/val/test splits. MIT licensed.
FineWeb2: Multilingual Web Pretraining Corpus (1000+ Languages)
Second iteration of HuggingFace's FineWeb dataset, covering 1000+ languages with high-quality filtered web text for LLM pretraining. ODC-By 1.0 licensed, validated through hundreds of ablation experiments.
Wikipedia 2024-06 Embeddings (BGE-M3, Multilingual)
Wikipedia paragraph embeddings from the June 2024 dump across 11 languages, generated with the multilingual BGE-M3 model. Ideal for multilingual RAG, semantic search, and retrieval evals.
SWE-bench Pro
Enterprise-level benchmark dataset from Scale AI for evaluating AI agents on long-horizon software engineering tasks. Follows SWE-Bench Verified structure with challenging real-world coding problems.
SQuAD 2.0 - Stanford Question Answering Dataset
Reading comprehension benchmark with 150K+ questions on Wikipedia articles, including 50K unanswerable questions. Standard for extractive QA model training and evaluation.
MegaMath: 300B+ Token Open Math Pretraining Corpus
Largest open math-focused pretraining dataset (300B+ tokens) from LLM360, curated from Common Crawl, code, and synthetic sources for training math-capable LLMs.
Wikipedia 2023-11 Multilingual Embeddings (Cohere Embed V3, 300+ Languages)
~250M Wikipedia paragraph embeddings across 300+ languages, generated with Cohere Embed V3. Ideal for multilingual semantic search and RAG.