To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search
Full API docs: https://api.databazaar.io/llms.txt
Agent discovery: https://databazaar.io/.well-known/agent.json
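As a rough illustration of the search call above, the following Python sketch builds the request URL and fetches it using only the standard library. The endpoint and the `query` parameter come from the line above; the helper names `build_search_url` and `search_datasets` are hypothetical, and the assumption that the response body is JSON is not confirmed here, so check the full API docs before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the listing header above.
BASE_URL = "https://api.databazaar.io/datasets"


def build_search_url(query: str) -> str:
    """Return the search URL with the query string percent-encoded."""
    return f"{BASE_URL}?{urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0):
    """Send the GET request and parse the body.

    Assumption: the API responds with JSON. The response schema is not
    described on this page, so the parsed value is returned as-is.
    """
    with urlopen(build_search_url(query), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only prints the URL; call search_datasets() to hit the network.
    print(build_search_url("image editing"))
```

Encoding the query with `urlencode` keeps spaces and characters such as `&` from breaking the request; a search for one of the titles below, for example `search_datasets("OpenWebText")`, would go through the same path.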
Browse Data
97–120 of 227
ScaleEdit-12M: Large-Scale Instruction-Based Image Editing Dataset
12.4M verified instruction–image pairs across 23 task families for instruction-based image editing. Largest open-source dataset of its kind, built via the ScaleEditor multi-agent framework. MIT licensed.
Llama-Nemotron VLM Dataset v1 (NVIDIA)
NVIDIA's 1M+ sample multimodal dataset for vision-language model training, covering VQA, OCR, captioning, and image-to-text tasks. CC-BY-4.0 licensed.
OpenWebText — Open Replication of GPT-2's WebText Corpus
Open-source replication of OpenAI's WebText, the corpus used to train GPT-2. ~8M English web documents (~13.5GB) in parquet format. CC0 licensed, widely used for LM pretraining and research.
HelpSteer: NVIDIA Helpfulness Preference Dataset
NVIDIA's open helpfulness dataset with multi-attribute human ratings (helpfulness, correctness, coherence, complexity, verbosity) for prompts and responses. Used to train SteerLM-aligned LLMs. CC-BY-4.0.
OpenThoughts-114k: Synthetic Reasoning Dataset
114k high-quality synthetic reasoning examples across math, science, code, and puzzles. Used to fine-tune OpenThinker-7B/32B models. Apache 2.0 licensed, parquet format.
BLIP3o Pretrain Long-Caption Dataset (27M Images)
27 million images paired with ~120-token long captions generated by Qwen2.5-VL-7B-Instruct. WebDataset format, Apache 2.0. Ideal for vision-language pretraining and multimodal model fine-tuning.
ClimbMix: 400B-Token Pre-training Corpus (NVIDIA)
A 400-billion-token English pre-training corpus from NVIDIA, filtered and topic-mixed for efficient LLM pre-training with superior performance per token.
MATH: Competition Mathematics Problems with Step-by-Step Solutions
12,500 competition math problems (AMC 10/12, AIME, etc.) with full worked solutions. Standard benchmark for math reasoning, chain-of-thought training, and LLM evaluation. MIT licensed, parquet format.
AgentTrove: 1.7M Agentic Interaction Traces
Largest open-source collection of agentic interaction traces (1.7M rows from 219 source datasets) covering code repair, shell scripting, math, competitive programming, and computer-use tasks. Apache 2.0.
Sangraha — 251B Token Indic Language Pretraining Corpus (22 Languages)
Largest cleaned Indic-language pretraining dataset: 251B tokens across 22 Indian languages, curated from web sources, multilingual corpora, and large-scale translations. CC-BY-4.0, parquet format.
OpenAssistant Conversations (OASST1)
161,443 human-generated assistant conversation messages across 35 languages with 461,292 quality ratings — a foundational dataset for alignment, RLHF, and instruction-tuning research.
BEIR: MS MARCO Passage Retrieval Benchmark
MS MARCO subset of the BEIR heterogeneous IR benchmark — 8.8M passages and ~500K queries in standard retrieval layout. Parquet format, CC-BY-SA-4.0, English.
DataComp-1B: Image-Text Pair Metadata (1.4B Samples)
Metadata (URLs, captions, CLIP features) for ~1.4B image-text pairs from DataComp-1B, the curated subset of CommonPool used to train state-of-the-art CLIP models. CC-BY-4.0, parquet format.
UltraFeedback — Large-Scale Fine-Grained Preference Dataset
64k prompts × 4 LLM responses (256k samples) with fine-grained GPT-4 preference annotations across instruction-following, truthfulness, honesty, and helpfulness. Canonical dataset for reward model and RLHF training.
Dolci Instruct SFT Mixture (Olmo 3 Training Data)
2.15M-sample multilingual instruction SFT mixture used to train AI2's Olmo 3 7B Instruct. Parquet format, ODC-BY licensed, covers 70+ languages and includes OpenThoughts 3 and other curated prompt sets.
MINT-1T ArXiv — Multimodal Interleaved ArXiv Papers (Text + Images)
ArXiv subset of MINT-1T: multimodal interleaved text-and-image documents extracted from ArXiv papers, designed for multimodal pretraining at scale. CC-BY-4.0.
MMFineReason 1.8M — Multimodal Reasoning Traces (Qwen3-VL-235B-Thinking)
1.8M multimodal reasoning samples with 5.1B solution tokens distilled from Qwen3-VL-235B-A22B-Thinking. Image+text inputs with detailed chain-of-thought annotations for math, science, and STEM visual reasoning. Apache-2.0.
WikiText (WikiText-2 & WikiText-103) Language Modeling Corpus
Canonical language modeling benchmark: 100M+ tokens from verified Good/Featured Wikipedia articles. Includes WikiText-2 and WikiText-103 in raw and tokenized variants. Parquet format, CC-BY-SA 3.0.
KILT: Knowledge-Intensive Language Tasks Benchmark
Facebook AI's KILT benchmark — 11 datasets across fact-checking, entity linking, slot filling, open-domain QA, and dialog generation, all grounded in a unified Wikipedia snapshot. MIT licensed, parquet format, 1M-10M examples.
LLaVA-OneVision-1.5 Instruction Data (22M Multimodal SFT)
22M curated instruction-tuning examples (image+text) used to train LLaVA-OneVision-1.5 LMMs. Apache-2.0, covers VQA, captioning, OCR, reasoning, and more.
FineWeb-Edu Translated (36 Languages, 960B+ Tokens)
Machine-translated FineWeb-Edu corpus covering 36.7M aligned documents across 36 European languages — 960B+ tokens for multilingual LLM pretraining and translation training.
People's Speech — 30,000+ Hours of Transcribed English Speech (MLCommons)
One of the world's largest open English ASR corpora: 30,000+ hours of transcribed speech under CC-BY/CC-BY-SA. Built by MLCommons for training and evaluating speech-to-text systems.
MS MARCO — Question Answering & Passage Ranking
Microsoft's MS MARCO dataset: 1M+ real Bing questions with human-generated answers and passage relevance judgments. Foundational benchmark for retrieval, RAG, and QA systems.
UltraChat — Large-Scale Multi-Round Dialogue Dataset
Open-source, large-scale multi-round dialogue dataset (1.5M+ conversations) generated by having two ChatGPT Turbo API instances role-play the user and the assistant. Widely used for instruction tuning and chat model fine-tuning. MIT licensed.