To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search
Full API docs: https://api.databazaar.io/llms.txt
Agent discovery: https://databazaar.io/.well-known/agent.json
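A minimal sketch of the programmatic search above, using only the Python standard library. The endpoint and the `query` parameter come from the hint above. The helper names `build_search_url` and `search_datasets` are hypothetical, and the JSON response is an assumption, since the response schema is not described on this page (see the full API docs for the actual format).

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base URL taken from the search hint above.
SEARCH_ENDPOINT = "https://api.databazaar.io/datasets"


def build_search_url(query: str) -> str:
    """Return the search URL with the query text URL-encoded."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0):
    """Send the GET request and decode the response body.

    Assumption: the endpoint answers with JSON. The schema is not
    documented on this page, so the decoded value is returned as-is.
    """
    request = Request(build_search_url(query), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; call search_datasets() to hit the network.
    print(build_search_url("music recommendation"))
```

Building the URL separately from sending the request keeps the encoding step easy to check offline; spaces and reserved characters in the search text are escaped by `urlencode`.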
Browse Data
49–72 of 220
TaskTrove: 750K+ Agentic Tasks for RL & SFT Training
Open-source collection of 750,000+ unique agentic tasks aggregated from 100+ sources including SWE-Smith, R2EGym, and SWE-Re-Bench. Apache-2.0 licensed, parquet format, designed for agent training and evaluation.
Bhasha SFT — 13M+ Multilingual Instruction-Response Pairs (Hindi, Bengali, Gujarati, English)
13M+ instruction-response pairs across Hindi, Bengali, Gujarati, and English for supervised fine-tuning of multilingual LLMs. Mix of human-annotated and synthetic data from open-source SFT collections, curated by Soket AI Labs.
SciCode Domain Code: 1.1B+ Lines of Scientific Code Across 178 Domains
Large-scale domain-specific code dataset (~115 GB, 1.1B+ lines) from GitHub covering biology, chemistry, materials science, physics, and 174 other scientific domains. Apache-2.0 licensed.
HPDv3 — Human Preference Dataset v3 (1.08M text-image pairs)
Wide-spectrum human preference dataset for text-to-image generation: 1.08M text-image pairs and 1.17M pairwise human preference annotations. MIT-licensed, used to train HPSv3 (ICCV 2025) reward models.
PubTabNet — 568K Table Images with HTML Annotations
568K+ scientific table images paired with HTML structure annotations, extracted from PubMed Central Open Access articles. Standard benchmark for image-based table recognition and document AI.
Yambda-5B: Large-Scale Music Recommendation Dataset (4.79B Interactions)
Industrial-scale music recommendation dataset from Yandex with 4.79B user-item interactions across 1M users and 9.39M tracks, including organic and recommendation interactions plus audio embeddings.
LongCoT: Long-Horizon Reasoning Benchmark
Benchmark for evaluating sustained long chain-of-thought reasoning across logic, computer science, chemistry, chess, and mathematics. Parquet format, MIT licensed.
EO-Data-1.5M: Interleaved Vision-Text-Action Dataset for Embodied AI
1.5M-sample interleaved vision-language-action dataset for embodied AI and robot learning, emphasizing temporal dynamics and causal dependencies across modalities. Apache-2.0 licensed, parquet format.
ChemBench — Chemistry & Materials LLM Evaluation Benchmark
Manually curated benchmark for evaluating chemistry and materials science capabilities of LLMs. Expert-generated QA and multiple-choice items. MIT licensed, evaluation-only.
FLUX-Reason-6M: 6M Reasoning-Focused Text-to-Image Dataset
6 million high-quality images synthesized by FLUX.1-dev with 20M bilingual (English/Chinese) descriptions, engineered to instill complex reasoning capabilities in text-to-image generative models.
Text-to-Image 2M: Curated Text-Image Pair Training Dataset
~2M curated high-quality text-image pairs in webdataset format, designed for fine-tuning text-to-image generative models. MIT licensed.
UGMathBench: Undergraduate Math Reasoning Benchmark
5,062 undergraduate-level math problems across 16 subjects and 111 topics, with 10 answer types and 3 randomized versions each. Designed for evaluating LLM mathematical reasoning.
Common Crawl Creative Commons Fine (C5f) — Multilingual High-Quality Web Corpus
Filtered Creative Commons web corpus from Common Crawl, intersected with FineWeb/FineWeb-2 for quality. Multilingual (EN, DE, FR, NL, ES, IT, AF, FY), 10M-100M rows, Parquet format. Ideal for LLM pretraining and fine-tuning.
TreeOfLife-10M: 10M Biological Organism Images with Taxonomic Labels
Largest ML-ready dataset of biological organism images (10M+ images, 454K taxa) paired with taxonomic labels. Aggregates iNat21, BIOSCAN-1M, and Encyclopedia of Life. Used to train BioCLIP and similar vision-language models.
MMFineReason-SFT-123K (Qwen3-VL-235B Thinking Traces)
123K multimodal reasoning samples with Qwen3-VL-235B chain-of-thought traces. Curated SFT subset of MMFineReason-1.8M: the hardest 7% of samples, on which smaller models consistently fail. Apache-2.0, parquet, image+text.
MASSIVE: Multilingual NLU Dataset (51 Languages, 1M+ Utterances)
Parallel multilingual NLU benchmark from Amazon Science with 1M+ utterances across 51 languages, annotated with 60 intents and 55 slot types. Built by localizing SLURP voice assistant interactions.
Salad-Data: LLM Safety & Jailbreak Evaluation Benchmark
21K+ safety questions for LLM red-teaming and jailbreak evaluation, aggregated from HH-RLHF, AdvBench, ToxicChat, GPTFuzzer, and GPT-3.5 self-instructed prompts. Apache-2.0 licensed.
US Stocks & ETFs 1-Minute Bars (2016–Present)
Minute-level OHLCV bars for US stocks and ETFs from 2016 onward, scraped from the Alpaca Markets Historical Bars API. Parquet format, 100M+ rows, MIT licensed.
CodeParrot Clean — Deduplicated Python Code from GitHub
Cleaned, deduplicated corpus of Python files scraped from GitHub. Filtered for line length, alphanumeric fraction, and auto-generated content. Widely used for code LM pretraining and fine-tuning.
MASC: Massive Arabic Speech Corpus (1,000 hours, multi-dialect)
1,000 hours of multi-dialect Arabic speech (16 kHz) crawled from 700+ YouTube channels with transcripts. Parquet format, CC-BY-4.0. For Arabic ASR, TTS, and speech LLM training/eval.
WorldCuisines VQA — Multilingual Multicultural Visual QA on Global Cuisines
Massive-scale multilingual, multicultural VQA benchmark covering global cuisines across 30 languages. NAACL 2025 Best Theme Paper. Images + text questions/answers for vision-language evaluation and fine-tuning.
LLM-jp-4 Thinking — Japanese Reasoning SFT Dataset
~3.2M supervised fine-tuning examples used to train the llm-jp-4 "thinking" models. Prompts drawn from diverse sources are paired with generated step-by-step reasoning traces and final responses for chain-of-thought training.
MAPS: Multilingual Agentic AI Benchmark (Performance & Security)
805-task multilingual benchmark for evaluating agentic AI across 11 languages, combining performance tasks (GAIA, SWE-bench, MATH) with Agent Security Benchmark tasks. CC-BY-4.0.
Weather Geo ERA5 — 1.065B Global Weather Records (1940-2025)
1.065 billion ERA5 reanalysis weather records covering 85+ years (1940-2025) of global weather at 0.25° resolution. Stored as geographically partitioned parquet for efficient regional queries.