To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search

Full API docs: https://api.databazaar.io/llms.txt

Agent discovery: https://databazaar.io/.well-known/agent.json
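The search request above can be sketched in Python using only the standard library. This is a minimal illustration, not the official client: the endpoint and the `query` parameter come from this page, while the helper names (`build_search_url`, `search_datasets`), the timeout, and the UTF-8 decoding are assumptions. The response schema is not described here, so the body is returned as raw text; consult the full API docs for its structure.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint as given on this page; everything else below is illustrative.
SEARCH_ENDPOINT = "https://api.databazaar.io/datasets"


def build_search_url(query: str) -> str:
    """Return the search URL with the query text percent-encoded."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0) -> str:
    """Send the GET request and return the raw response body as text.

    Assumption: the body is UTF-8 text. Its schema is not documented on
    this page, so no parsing is attempted here.
    """
    request = Request(build_search_url(query), method="GET")
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds the URL; call search_datasets() to perform the request.
    print(build_search_url("heart disease"))
```

Building the URL separately from sending the request makes it possible to check the encoded query (spaces, punctuation) without touching the network.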

Browse Data

1–24 of 227
scientific
Free·fixed price

Iris Flower Classification (Fisher, 1936)

Ronald Fisher's iconic Iris dataset — 150 samples across three species (setosa, versicolor, virginica) with sepal and petal length and width measurements. The canonical introductory classification benchmark.

150 rows·PARQUET·0 downloads
scientific
Free·fixed price

Pen-Based Digit Recognition — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Pen Digits dataset, with 16 pen-trajectory features per sample for handwritten-digit classification.

1,000,000 rows·PARQUET·1 download
retail
Free·fixed price

Multilingual Amazon Product Reviews — 2.5M Labeled

2,520,000 Amazon product reviews in English, Japanese, German, French, Chinese, and Spanish (collected 2015–2019), each labeled with a star rating. Built for multilingual sentiment analysis and text classification.

2,520,000 rows·PARQUET·1 download
scientific
Free·fixed price

Glass Type Identification — 138K Synthetic (BNG)

A 137,781-row synthetic expansion (generated via a Bayesian Network) of the UCI Glass Identification dataset, classifying glass type from its refractive index and oxide composition.

137,781 rows·PARQUET·1 download
scientific
Free·fixed price

Cleveland Heart Disease (UCI)

The classic Cleveland heart-disease dataset — 303 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a diagnosis label. One of the most widely used UCI classification benchmarks.

303 rows·PARQUET·0 downloads
scientific
Free·fixed price

Hepatitis Prognosis — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the UCI Hepatitis dataset, predicting patient survival from clinical and laboratory attributes.

1,000,000 rows·PARQUET·1 download
financial
Free·fixed price

German Credit Risk — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the German Credit dataset, profiling loan applicants by financial and personal attributes with a good/bad credit-risk label.

1,000,000 rows·PARQUET·1 download
scientific
Free·fixed price

Space Shuttle Autolanding Control (NASA / UCI)

A small NASA decision dataset specifying the conditions — stability, error, sign, wind, magnitude, and visibility — under which a Space Shuttle should land automatically versus manually. A classic UCI benchmark.

15 rows·PARQUET·0 downloads
scientific
Free·fixed price

Page Blocks Classification — 295K Synthetic (BNG)

A 295,245-row synthetic expansion (generated via a Bayesian Network) of the classic Page Blocks dataset, classifying blocks in scanned document layouts by geometric and pixel-density features.

295,245 rows·PARQUET·1 download
scientific
Free·fixed price

Heart Disease (Statlog) — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Statlog Heart dataset, predicting heart-disease presence from 13 clinical attributes.

1,000,000 rows·PARQUET·0 downloads
pricing
Free·fixed price

Automobile Price Prediction (UCI)

159 automobiles described by 15 engineering and design attributes — wheelbase, dimensions, engine size, horsepower, fuel economy, and more — with price as the prediction target. A classic UCI regression benchmark.

159 rows·PARQUET·1 download
scientific
Free·fixed price

Heart Disease — Cholesterol Prediction (UCI)

A 303-patient variant of the UCI heart-disease data with serum cholesterol as the prediction target alongside 13 clinical attributes. Used as a numeric (regression) prediction benchmark.

303 rows·PARQUET·0 downloads
scientific
Free·fixed price

E. coli Promoter Gene Sequences (UCI)

106 E. coli DNA sequences labeled as promoter or non-promoter, each described by 57 sequential nucleotide positions. A classic UCI molecular-biology classification benchmark for sequence analysis.

106 rows·PARQUET·0 downloads
scientific
Free·fixed price

SEA Concept-Drift Stream — 1M Synthetic Samples

A 1,000,000-row synthetic data stream from the classic SEA generator, with three numeric attributes and a binary class label. A standard benchmark for evaluating concept-drift detection and streaming classifiers.

1,000,000 rows·PARQUET·1 download
scientific
Free·fixed price

Statlog Heart Disease (UCI)

The Statlog Heart dataset — 270 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a heart-disease presence label. A classic UCI classification benchmark.

270 rows·PARQUET·0 downloads
text
Free·fixed price

OpenPII — 1.4M Multilingual PII Masking Examples

1,428,143 synthetic text examples with fine-grained PII annotations spanning 23 European languages. Each example pairs source and masked text with privacy masks and token-level labels, designed for training privacy-redaction and anonymization models.

1,428,143 rows·PARQUET·0 downloads
text
Free·fixed price

OPUS-100 — English-Centric Parallel Corpus (100 Languages)

A large English-centric multilingual translation corpus covering 100 languages, where every training pair includes English on the source or target side. A standard resource for machine-translation research and multilingual model training.

55,057,504 rows·PARQUET·0 downloads
images
Free·fixed price

Conceptual Captions — 5.3M Image–Caption Pairs

Millions of web images paired with natural-language captions, harvested and filtered from alt-text descriptions. A large-scale resource for training and evaluating image-captioning and vision-language models.

5,341,263 images·PARQUET·0 downloads
text
Free·fixed price

Nemotron Cascade-2 — SFT Training Data (Math, Code & Chat)

~1.94M supervised fine-tuning examples used to train NVIDIA's Nemotron-Cascade-2 models, spanning math, code, and conversational prompts paired with model-generated responses and source/domain labels.

1,940,375 rows·PARQUET·0 downloads
text
Free·fixed price

WideSearch — Agentic Broad Information-Seeking Benchmark

A 200-instance benchmark for evaluating LLM-driven agents on broad, wide-coverage information-seeking tasks. Each instance provides a query, structured evaluation criteria, and language metadata for measuring agent search performance.

200 rows·PARQUET·0 downloads
text
Free·fixed price

Emotion — English Tweets Labeled by Emotion

English Twitter messages labeled with six basic emotions — anger, fear, joy, love, sadness, and surprise. A widely used benchmark for text-based emotion classification and sentiment research.

436,809 rows·PARQUET·0 downloads
text
Free·fixed price

OpenMath GSM8K (Masked) — Grade-School Math Solutions

A masked version of the GSM8K grade-school math word problems (7,473 items), pairing each question with masked and full reference solutions plus the expected answer. Designed to support consistent synthetic generation of additional solutions.

7,473 rows·PARQUET·0 downloads
text
Free·fixed price

MiniPile — 1M-Document Deduplicated Text Corpus

A compact 6 GB subset of the deduplicated Pile corpus (~1,010,500 documents), curated for data-efficient language-model pretraining and experimentation without the storage cost of the full Pile.

1,010,500 rows·PARQUET·0 downloads
images
Free·fixed price

Vessel Detection — Labeled Satellite Image Patches

1,920 validated satellite image patches labeled for maritime vessel detection, exported from a boat-detection review workflow. Each patch ships with Hugging Face image-folder-compatible metadata for object-detection and classification training.

1,920 images·PARQUET·0 downloads