Browse Data

Pen-Based Digit Recognition — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Pen Digits dataset, with 16 pen-trajectory features per sample for handwritten-digit classification.

retail

2,520,000 rows·PARQUET·1 download

Multilingual Amazon Product Reviews — 2.5M Labeled

2,520,000 Amazon product reviews in English, Japanese, German, French, Chinese, and Spanish (collected 2015–2019), each labeled with a star rating. Built for multilingual sentiment analysis and text classification.

137,781 rows·PARQUET·1 download

Glass Type Identification — 138K Synthetic (BNG)

A 137,781-row synthetic expansion (generated via a Bayesian Network) of the UCI Glass Identification dataset, classifying glass type from its refractive index and oxide composition.

303 rows·PARQUET·0 downloads

Cleveland Heart Disease (UCI)

The classic Cleveland heart-disease dataset — 303 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a diagnosis label. One of the most widely used UCI classification benchmarks.

Hepatitis Prognosis — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the UCI Hepatitis dataset, predicting patient survival from clinical and laboratory attributes.

financial

German Credit Risk — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the German Credit dataset, profiling loan applicants by financial and personal attributes with a good/bad credit-risk label.

15 rows·PARQUET·0 downloads

Space Shuttle Autolanding Control (NASA / UCI)

A small NASA decision dataset specifying the conditions — stability, error, sign, wind, magnitude, and visibility — under which a Space Shuttle should land automatically versus manually. A classic UCI benchmark.

295,245 rows·PARQUET·1 download

Page Blocks Classification — 295K Synthetic (BNG)

A 295,245-row synthetic expansion (generated via a Bayesian Network) of the classic Page Blocks dataset, classifying blocks in scanned document layouts by geometric and pixel-density features.

1,000,000 rows·PARQUET·0 downloads

Heart Disease (Statlog) — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Statlog Heart dataset, predicting heart-disease presence from 13 clinical attributes.

pricing

159 rows·PARQUET·1 download

Automobile Price Prediction (UCI)

159 automobiles described by 15 engineering and design attributes — wheelbase, dimensions, engine size, horsepower, fuel economy, and more — with price as the prediction target. A classic UCI regression benchmark.

303 rows·PARQUET·0 downloads

Heart Disease — Cholesterol Prediction (UCI)

A 303-patient variant of the UCI heart-disease data with serum cholesterol as the prediction target alongside 13 clinical attributes. Used as a numeric (regression) prediction benchmark.

106 rows·PARQUET·0 downloads

E. coli Promoter Gene Sequences (UCI)

106 E. coli DNA sequences labeled as promoter or non-promoter, each described by 57 sequential nucleotide positions. A classic UCI molecular-biology classification benchmark for sequence analysis.

SEA Concept-Drift Stream — 1M Synthetic Samples

A 1,000,000-row synthetic data stream from the classic SEA generator, with three numeric attributes and a binary class label. A standard benchmark for evaluating concept-drift detection and streaming classifiers.

270 rows·PARQUET·0 downloads

Statlog Heart Disease (UCI)

The Statlog Heart dataset — 270 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a heart-disease presence label. A classic UCI classification benchmark.

1,428,143 rows·PARQUET·0 downloads

OpenPII — 1.4M Multilingual PII Masking Examples

1,428,143 synthetic text examples with fine-grained PII annotations spanning 23 European languages. Each example pairs source and masked text with privacy masks and token-level labels, designed for training privacy-redaction and anonymization models.

55,057,504 rows·PARQUET·0 downloads

OPUS-100 — English-Centric Parallel Corpus (100 Languages)

A large English-centric multilingual translation corpus covering 100 languages, where every training pair includes English on the source or target side. A standard resource for machine-translation research and multilingual model training.

images

5,341,263 images·PARQUET·0 downloads

Conceptual Captions — 5.3M Image–Caption Pairs

About 5.3 million web images paired with natural-language captions, harvested and filtered from alt-text descriptions. A large-scale resource for training and evaluating image-captioning and vision-language models.

1,940,375 rows·PARQUET·0 downloads

Nemotron Cascade-2 — SFT Training Data (Math, Code & Chat)

~1.94M supervised fine-tuning examples used to train NVIDIA's Nemotron-Cascade-2 models, spanning math, code, and conversational prompts paired with model-generated responses and source/domain labels.

200 rows·PARQUET·0 downloads

WideSearch — Agentic Broad Information-Seeking Benchmark

A 200-instance benchmark for evaluating LLM-driven agents on broad, wide-coverage information-seeking tasks. Each instance provides a query, structured evaluation criteria, and language metadata for measuring agent search performance.

436,809 rows·PARQUET·0 downloads

Emotion — English Tweets Labeled by Emotion

English Twitter messages labeled with six basic emotions — anger, fear, joy, love, sadness, and surprise. A widely used benchmark for text-based emotion classification and sentiment research.

7,473 rows·PARQUET·0 downloads

OpenMath GSM8K (Masked) — Grade-School Math Solutions

A masked version of the GSM8K grade-school math word problems (7,473 items), pairing each question with masked and full reference solutions plus the expected answer. Designed to support consistent synthetic generation of additional solutions.

1,010,500 rows·PARQUET·0 downloads

MiniPile — 1M-Document Deduplicated Text Corpus

A compact 6 GB subset of the deduplicated Pile corpus (1,010,500 documents), curated for data-efficient language-model pretraining and experimentation without the storage cost of the full Pile.

images