To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search
Full API docs: https://api.databazaar.io/llms.txt
Agent discovery: https://databazaar.io/.well-known/agent.json
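As a rough illustration of the search request above, here is a minimal Python sketch using only the standard library. The helper names `build_search_url` and `search_datasets` are hypothetical, and this listing does not state the response format, so the sketch returns the raw body as text (assuming UTF-8) instead of parsing it. The full API docs linked above are the place to check the actual response schema and any authentication requirements.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint taken from the search hint in this listing.
BASE_URL = "https://api.databazaar.io/datasets"


def build_search_url(query: str) -> str:
    """Return the search URL with the query string percent-encoded."""
    return f"{BASE_URL}?{urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0) -> str:
    """Send the GET request and return the raw response body as text.

    Assumption: the body is UTF-8 text. The response schema is not
    described in this listing, so no parsing is attempted here.
    """
    with urlopen(build_search_url(query), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `build_search_url("heart disease")` produces `https://api.databazaar.io/datasets?query=heart+disease`; passing the same string to `search_datasets` would issue the request itself.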
Browse Data
1–24 of 227
Iris Flower Classification (Fisher, 1936)
Ronald Fisher's iconic Iris dataset — 150 samples across three species (setosa, versicolor, virginica) with sepal and petal length and width measurements. The canonical introductory classification benchmark.
Pen-Based Digit Recognition — 1M Synthetic (BNG)
A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Pen Digits dataset, with 16 pen-trajectory features per sample for handwritten-digit classification.
Multilingual Amazon Product Reviews — 2.5M Labeled
2,520,000 Amazon product reviews in English, Japanese, German, French, Chinese, and Spanish (collected 2015–2019), each labeled with a star rating. Built for multilingual sentiment analysis and text classification.
Glass Type Identification — 138K Synthetic (BNG)
A 137,781-row synthetic expansion (generated via a Bayesian Network) of the UCI Glass Identification dataset, classifying glass type from its refractive index and oxide composition.
Cleveland Heart Disease (UCI)
The classic Cleveland heart-disease dataset — 303 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a diagnosis label. One of the most widely used UCI classification benchmarks.
Hepatitis Prognosis — 1M Synthetic (BNG)
A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the UCI Hepatitis dataset, predicting patient survival from clinical and laboratory attributes.
German Credit Risk — 1M Synthetic (BNG)
A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the German Credit dataset, profiling loan applicants by financial and personal attributes with a good/bad credit-risk label.
Space Shuttle Autolanding Control (NASA / UCI)
A small NASA decision dataset specifying the conditions — stability, error, sign, wind, magnitude, and visibility — under which a Space Shuttle should land automatically versus manually. A classic UCI benchmark.
Page Blocks Classification — 295K Synthetic (BNG)
A 295,245-row synthetic expansion (generated via a Bayesian Network) of the classic Page Blocks dataset, classifying blocks in scanned document layouts by geometric and pixel-density features.
Heart Disease (Statlog) — 1M Synthetic (BNG)
A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Statlog Heart dataset, predicting heart-disease presence from 13 clinical attributes.
Automobile Price Prediction (UCI)
159 automobiles described by 15 engineering and design attributes — wheelbase, dimensions, engine size, horsepower, fuel economy, and more — with price as the prediction target. A classic UCI regression benchmark.
Heart Disease — Cholesterol Prediction (UCI)
A 303-patient variant of the UCI heart-disease data with serum cholesterol as the prediction target alongside 13 clinical attributes. Used as a numeric (regression) prediction benchmark.
E. coli Promoter Gene Sequences (UCI)
106 E. coli DNA sequences labeled as promoter or non-promoter, each described by 57 sequential nucleotide positions. A classic UCI molecular-biology classification benchmark for sequence analysis.
SEA Concept-Drift Stream — 1M Synthetic Samples
A 1,000,000-row synthetic data stream from the classic SEA generator, with three numeric attributes and a binary class label. A standard benchmark for evaluating concept-drift detection and streaming classifiers.
Statlog Heart Disease (UCI)
The Statlog Heart dataset — 270 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a heart-disease presence label. A classic UCI classification benchmark.
OpenPII — 1.4M Multilingual PII Masking Examples
1,428,143 synthetic text examples with fine-grained PII annotations spanning 23 European languages. Each example pairs source and masked text with privacy masks and token-level labels, designed for training privacy-redaction and anonymization models.
OPUS-100 — English-Centric Parallel Corpus (100 Languages)
A large English-centric multilingual translation corpus covering 100 languages, where every training pair includes English on the source or target side. A standard resource for machine-translation research and multilingual model training.
Conceptual Captions — 5.3M Image–Caption Pairs
Millions of web images paired with natural-language captions, harvested and filtered from alt-text descriptions. A large-scale resource for training and evaluating image-captioning and vision-language models.
Nemotron Cascade-2 — SFT Training Data (Math, Code & Chat)
~1.94M supervised fine-tuning examples used to train NVIDIA's Nemotron-Cascade-2 models, spanning math, code, and conversational prompts paired with model-generated responses and source/domain labels.
WideSearch — Agentic Broad Information-Seeking Benchmark
A 200-instance benchmark for evaluating LLM-driven agents on broad, wide-coverage information-seeking tasks. Each instance provides a query, structured evaluation criteria, and language metadata for measuring agent search performance.
Emotion — English Tweets Labeled by Emotion
English Twitter messages labeled with six basic emotions — anger, fear, joy, love, sadness, and surprise. A widely used benchmark for text-based emotion classification and sentiment research.
OpenMath GSM8K (Masked) — Grade-School Math Solutions
A masked version of the GSM8K grade-school math word problems (7,473 items), pairing each question with masked and full reference solutions plus the expected answer. Designed to support consistent synthetic generation of additional solutions.
MiniPile — 1M-Document Deduplicated Text Corpus
A compact 6 GB subset of the deduplicated Pile corpus (~1,010,500 documents), curated for data-efficient language-model pretraining and experimentation without the storage cost of the full Pile.
Vessel Detection — Labeled Satellite Image Patches
1,920 validated satellite image patches labeled for maritime vessel detection, exported from a boat-detection review workflow. Each patch ships with Hugging Face image-folder-compatible metadata for object-detection and classification training.