Browse Data

1–24 of 40

Iris Flower Classification (Fisher, 1936)

Ronald Fisher's iconic Iris dataset: 150 samples across three species (setosa, versicolor, virginica) with sepal and petal length and width measurements. The canonical introductory classification benchmark.

150 rows·PARQUET·0 downloads

scientific

Free·fixed price

Pen-Based Digit Recognition — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Pen Digits dataset, with 16 pen-trajectory features per sample for handwritten-digit classification.

1,000,000 rows·PARQUET·1 download

scientific

Free·fixed price

Glass Type Identification — 138K Synthetic (BNG)

A 137,781-row synthetic expansion (generated via a Bayesian Network) of the UCI Glass Identification dataset, classifying glass type from its refractive index and oxide composition.

137,781 rows·PARQUET·1 download

scientific

Free·fixed price

Cleveland Heart Disease (UCI)

The classic Cleveland heart-disease dataset — 303 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a diagnosis label. One of the most widely used UCI classification benchmarks.

303 rows·PARQUET·0 downloads

scientific

Free·fixed price

Hepatitis Prognosis — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the UCI Hepatitis dataset, predicting patient survival from clinical and laboratory attributes.

1,000,000 rows·PARQUET·1 download

scientific

Free·fixed price

Space Shuttle Autolanding Control (NASA / UCI)

A small NASA decision dataset specifying the conditions — stability, error, sign, wind, magnitude, and visibility — under which a Space Shuttle should land automatically versus manually. A classic UCI benchmark.

15 rows·PARQUET·0 downloads

scientific

Free·fixed price

Page Blocks Classification — 295K Synthetic (BNG)

A 295,245-row synthetic expansion (generated via a Bayesian Network) of the classic Page Blocks dataset, classifying blocks in scanned document layouts by geometric and pixel-density features.

295,245 rows·PARQUET·1 download

scientific

Free·fixed price

Heart Disease (Statlog) — 1M Synthetic (BNG)

A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Statlog Heart dataset, predicting heart-disease presence from 13 clinical attributes.

1,000,000 rows·PARQUET·0 downloads

scientific

Free·fixed price

Heart Disease — Cholesterol Prediction (UCI)

A 303-patient variant of the UCI heart-disease data with serum cholesterol as the prediction target alongside 13 clinical attributes. Used as a numeric (regression) prediction benchmark.

303 rows·PARQUET·0 downloads

scientific

Free·fixed price

E. coli Promoter Gene Sequences (UCI)

106 E. coli DNA sequences labeled as promoter or non-promoter, each described by 57 sequential nucleotide positions. A classic UCI molecular-biology classification benchmark for sequence analysis.

106 rows·PARQUET·0 downloads

scientific

Free·fixed price

SEA Concept-Drift Stream — 1M Synthetic Samples

A 1,000,000-row synthetic data stream from the classic SEA generator, with three numeric attributes and a binary class label. A standard benchmark for evaluating concept-drift detection and streaming classifiers.

1,000,000 rows·PARQUET·1 download

scientific

Free·fixed price
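For readers unfamiliar with the SEA generator behind the stream listed above, the following is a minimal sketch of the commonly cited setup: three attributes drawn uniformly from [0, 10], a label that depends only on whether the first two sum to at most a threshold, four thresholds that change over time, and 10% label noise. The function name `sea_stream`, the `concept_length` parameter, and the label convention are assumptions for illustration; the listing does not state which parameters were used to produce this particular 1,000,000-row file.

```python
import random

# Thresholds of the four SEA concepts as commonly cited in the literature.
# Assumption: the listed dataset may use different values or ordering.
SEA_THRESHOLDS = (8.0, 9.0, 7.0, 9.5)


def sea_stream(n_samples, concept_length, noise=0.10, seed=0):
    """Yield (x1, x2, x3, label) tuples from a SEA-style stream.

    The active threshold switches every `concept_length` samples,
    which is what produces the concept drift.
    """
    rng = random.Random(seed)
    for i in range(n_samples):
        theta = SEA_THRESHOLDS[(i // concept_length) % len(SEA_THRESHOLDS)]
        x1, x2, x3 = (rng.uniform(0.0, 10.0) for _ in range(3))
        # Only the first two attributes matter; x3 is irrelevant by design.
        label = int(x1 + x2 <= theta)
        # Flip the label with probability `noise`.
        if rng.random() < noise:
            label = 1 - label
        yield x1, x2, x3, label


if __name__ == "__main__":
    for row in sea_stream(n_samples=5, concept_length=2, seed=42):
        print(row)
```

A drift detector can be exercised by feeding it the stream and checking whether it raises an alarm near multiples of `concept_length`.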

Statlog Heart Disease (UCI)

The Statlog Heart dataset — 270 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a heart-disease presence label. A classic UCI classification benchmark.

270 rows·PARQUET·0 downloads

scientific

Free·fixed price

Magneton: Substructure-Aware Protein Representation Learning Dataset

530,601 SwissProt proteins with DSSP secondary structure and InterPro 103.0 substructure annotations, sharded JSONL format. For training/evaluating protein representation learning models.

530,601 rows·PARQUET·0 downloads

scientific

Free·fixed price

Open Schematics: Electronic Circuit Designs Dataset

10K-100K electronic schematics from hardware projects with visual representations, component metadata, and KiCad source files. For training AI on circuit design, component recognition, and hardware engineering tasks.

84,470 rows·PARQUET·0 downloads

scientific

Free·fixed price

Physiotherapy Evidence QA (Bilingual TR/EN)

143,711 bilingual (Turkish/English) expert-curated Q&A pairs covering evidence-based physiotherapy, musculoskeletal rehabilitation, outcome measures, and clinical research methodology. CSV format, CC-BY-4.0.

143,711 rows·PARQUET·0 downloads

scientific

Free·fixed price

ChemBench — Chemistry & Materials LLM Evaluation Benchmark

Manually curated benchmark for evaluating chemistry and materials science capabilities of LLMs. Expert-generated QA and multiple-choice items. MIT licensed, evaluation-only.

2,785 rows·PARQUET·1 download

scientific

Free·fixed price

Sci-Base: AI-Ready Scientific Foundation Dataset

Large multi-domain scientific text corpus (chem, bio, climate, medical, materials, earth, physics) from OpenDataLab's Sciverse foundation, 1M-10M rows in Parquet, CC-BY-4.0.

3,631,260 rows·PARQUET·0 downloads

scientific

Free·fixed price

ScienceQA: Multimodal Science Question Answering with Chain-of-Thought

21K multimodal multiple-choice science questions with images, lectures, and chain-of-thought explanations spanning natural science, social science, and language science. Widely used for VLM evaluation and CoT fine-tuning.

21,208 rows·PARQUET·0 downloads

scientific

Free·fixed price

NOAA Global Temperature Anomalies (1850-2025)

Annual global land and ocean average temperature anomalies from 1850 to 2025, sourced directly from the NOAA National Centers for Environmental Information (NCEI). Base period: 1901-2000 average. 176 annual data points showing the long-term global warming trend, measured in degrees Celsius. Format: CSV, 2 columns (Year, Anomaly), 176 rows + header. Source: NOAA Climate at a Glance (https://www.ncei.noaa.gov). License: Public domain (US government data). Captured: 2026-04-13. Ideal for: climate trend analysis, time series modeling, regression tutorials, data visualization demos, and agent workflows that need authoritative temperature history.

179 rows·CSV·7 downloads

3.8/5

scientific

Free·fixed price
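The NOAA temperature entry above suggests regression tutorials as a use case. A minimal sketch of that idea, assuming the file really has a plain header row with `Year` and `Anomaly` columns as the description states, fits an ordinary least-squares line and reports the trend per decade. The helper names are hypothetical, and the sample values are placeholders for illustration only, not NOAA measurements.

```python
import csv
import io


def load_anomalies(csv_text):
    """Parse CSV text with Year and Anomaly columns into two lists."""
    years, anomalies = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        years.append(int(row["Year"]))
        anomalies.append(float(row["Anomaly"]))
    return years, anomalies


def linear_trend(years, anomalies):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(anomalies) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, anomalies))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Placeholder values for illustration only, not real NOAA data.
SAMPLE = "Year,Anomaly\n2000,0.0\n2001,0.1\n2002,0.2\n2003,0.3\n"

if __name__ == "__main__":
    years, anomalies = load_anomalies(SAMPLE)
    slope, _ = linear_trend(years, anomalies)
    print(f"trend: {slope * 10:.2f} degrees C per decade")
```

To run it on the downloaded file, read the file contents into a string and pass that to `load_anomalies` in place of `SAMPLE`.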

USGS Global Earthquakes — Past 24 Hours (M2.5+)

A fresh snapshot of all earthquakes magnitude 2.5 and greater recorded worldwide in the past 24 hours, sourced directly from the USGS Earthquake Hazards Program real-time feed. Contains 45 events with full seismic parameters: event time (UTC), latitude, longitude, depth (km), magnitude, magnitude type, number of stations, azimuthal gap, minimum distance, RMS error, network, event ID, place description, event type, horizontal/depth/magnitude errors, review status, and location/magnitude sources. Format: CSV (22 columns, 45 rows + header). Source: USGS real-time feed (https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.csv). License: Public domain (U.S. Government work, not subject to copyright). Captured: 2026-04-13. Ideal for: real-time geoscience demos, seismic monitoring prototypes, map visualization tutorials, anomaly detection notebooks, and agent workflows that need recent hazard data.

45 rows·CSV·3 downloads

3.6/5

scientific

Free·fixed price

NASA Exoplanet & Planetary Candidate Catalog — 20,933 Objects from NASA, Kepler & TESS (1992–2025)

Harmonized catalog of 20,933 exoplanetary objects combining three authoritative NASA sources: the NASA Exoplanet Archive (6,153 confirmed exoplanets), the Kepler Cumulative KOI Table (6,867 unique Kepler Objects of Interest), and the TESS Objects of Interest catalog (7,913 TOIs). Each record includes 28 normalized fields covering planetary properties (orbital period, radius, mass, equilibrium temperature, eccentricity, insolation flux), host star characteristics (effective temperature, radius, mass, metallicity, surface gravity, spectral type, luminosity), discovery metadata (method, year, facility), sky coordinates (RA/Dec), distance, and system multiplicity. Objects span the full disposition spectrum from confirmed planets through candidates to false positives, enabling classification model training, demographic analysis, and habitability studies. Data sourced from the NASA Exoplanet Science Institute (IPAC/Caltech), Kepler mission pipeline, and TESS Follow-up Observing Program. Deduplicated across catalogs to avoid double-counting confirmed Kepler planets. Suitable for exoplanet population statistics, machine learning classification of planetary candidates, stellar characterization, and habitability zone analysis.

20,933 rows·CSV·6 downloads

4.4/5

scientific

Free·fixed price

Energy Production, Consumption & CO₂ Emissions by Country (1965–2023)

Cross-national country-level dataset covering energy production, consumption, and greenhouse gas emissions for 229 countries and territories from 1965 to 2023. Contains 13,100+ rows with 60 curated indicators spanning electricity generation by source, primary energy consumption, fossil fuel and renewable energy breakdowns, CO2 emissions by fuel type, cumulative emissions, methane and nitrous oxide emissions, and estimated temperature contributions.

**Sources:**
- Our World in Data, Energy Dataset (electricity generation, consumption, production by fuel type, energy mix shares)
- Our World in Data, CO2 and Greenhouse Gas Emissions Dataset (annual CO2 by source, GHG totals, per-capita metrics, cumulative emissions, temperature change attribution)
- Underlying sources include: BP Statistical Review of World Energy, Ember Global Electricity Review, Energy Institute Statistical Review, IPCC, Global Carbon Project, Climate Watch/CAIT, UNFCCC

**Schema (60 columns):**
- `country`: Country or territory name
- `iso_code`: ISO 3166-1 alpha-3 country code
- `year`: Year of observation (1965–2023)
- `population`: Total population
- `gdp`: GDP in international-$ (PPP, 2017 prices)
- `electricity_generation`: Total electricity generation (TWh)
- `electricity_demand`: Electricity demand (TWh)
- `primary_energy_consumption`: Primary energy consumption (TWh)
- `energy_per_capita`: Energy consumption per capita (kWh)
- `energy_per_gdp`: Energy intensity (kWh per $)
- `fossil_fuel_consumption` / `renewables_consumption` / `nuclear_consumption`: Consumption by type (TWh)
- `coal_consumption` / `oil_consumption` / `gas_consumption`: Fossil fuel breakdown (TWh)
- `hydro_consumption` / `solar_consumption` / `wind_consumption` / `biofuel_consumption`: Renewable breakdown (TWh)
- `fossil_share_energy` / `renewables_share_energy` / `nuclear_share_energy`: Energy mix shares (%)
- `coal_share_energy` / `oil_share_energy` / `gas_share_energy`: Fossil fuel shares (%)
- `low_carbon_share_energy`: Low-carbon energy share (%)
- `carbon_intensity_elec`: Carbon intensity of electricity (gCO2/kWh)
- `co2`: Annual CO2 emissions (million tonnes)
- `co2_per_capita`: CO2 per capita (tonnes)
- `co2_per_gdp` / `co2_per_unit_energy`: CO2 efficiency metrics
- `coal_co2` / `oil_co2` / `gas_co2` / `cement_co2` / `flaring_co2`: CO2 by source (Mt)
- `total_ghg`: Total greenhouse gas emissions (MtCO2e)
- `methane` / `nitrous_oxide`: Non-CO2 GHG emissions (MtCO2e)
- `cumulative_co2` / `share_global_co2` / `share_global_cumulative_co2`: Global share metrics
- `temperature_change_from_co2` / `temperature_change_from_ghg`: Estimated warming contribution (°C)
- `data_sources`: Which source datasets contributed to each row

**Coverage:** 229 countries, 1965–2023, 13,100+ observations. Data density is highest for 1990–2023 with near-complete coverage; earlier decades have sparser coverage for smaller nations.

**Use cases:** Climate policy analysis, energy transition tracking, cross-country emissions benchmarking, renewable energy adoption studies, carbon intensity trends, ESG research, AI agent environmental analysis, academic research on global decarbonization pathways.

13,115 rows·CSV·4 downloads

3.0/5

scientific

Free·fixed price
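The units given in the energy and emissions schema above (`co2` in million tonnes, `co2_per_capita` in tonnes, `population` as a head count) imply a simple consistency check: per-capita emissions should be roughly `co2 * 1,000,000 / population`. The sketch below recomputes that value per row and reports rows that disagree. The column names come from the listing; the helper names, the 5% tolerance, and the handling of empty cells are assumptions, and "Exampleland" is a made-up placeholder rather than a row from the dataset.

```python
import csv
import io


def co2_per_capita_tonnes(co2_mt, population):
    """Convert annual CO2 in million tonnes to tonnes per person."""
    return co2_mt * 1_000_000 / population


def check_rows(csv_text, tolerance=0.05):
    """Return (country, year) pairs whose listed co2_per_capita differs
    from the recomputed value by more than `tolerance` (relative)."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumption: sparse early decades show up as empty cells; skip them.
        if not (row["co2"] and row["population"] and row["co2_per_capita"]):
            continue
        expected = co2_per_capita_tonnes(float(row["co2"]), float(row["population"]))
        listed = float(row["co2_per_capita"])
        if abs(listed - expected) > tolerance * abs(expected):
            mismatches.append((row["country"], int(row["year"])))
    return mismatches


# Made-up rows for illustration only.
SAMPLE = (
    "country,iso_code,year,population,co2,co2_per_capita\n"
    "Exampleland,EXL,2000,1000000,5.0,5.0\n"
    "Exampleland,EXL,2001,1000000,5.0,7.0\n"
    "Exampleland,EXL,1965,,,\n"
)

if __name__ == "__main__":
    print(check_rows(SAMPLE))
```

Small disagreements are expected if the source rounds values or uses a different population series, so the tolerance is deliberately loose.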

Public Health & Disease Burden — 222 Countries (1960–2023)

Dataset covering 20 key public health indicators across 222 countries and territories from 1960 to 2023. Sourced from the World Bank Open Data platform, it combines mortality metrics (life expectancy, infant mortality, maternal mortality, NCD mortality), healthcare infrastructure (physicians, hospital beds, health expenditure), disease burden (tuberculosis, HIV incidence), preventive care (immunization rates, skilled birth attendance), environmental health (clean water access, sanitation), and behavioral risk factors (smoking prevalence, obesity rates). Wide format with one row per country-year: 14,183 rows across 25 columns. Ideal for epidemiological analysis, global health comparisons, development economics research, and public health policy evaluation.

14,183 rows·CSV·2 downloads

2.3/5

scientific

Free·fixed price

Seismic Events Database — 94,767 Earthquakes (2020–2025)

Wide-coverage catalog of 94,767 seismic events (magnitude 4.0+) recorded globally from 2020 to 2025, sourced from the USGS Earthquake Hazards Program. Each record includes precise geolocation (latitude/longitude), depth, magnitude with type classification, timestamp, region identification, tsunami flags, felt reports, community and instrumental intensity measures, alert levels, and review status.

Key features:
- **94,767 unique events** spanning 6 years (2020–2025)
- **Global coverage**: 180+ countries and regions, with major representation from Indonesia, Japan, Philippines, Chile, and the Pacific Ring of Fire
- **Enriched classifications**: depth class (shallow/intermediate/deep), magnitude class (light/moderate/strong/major/great)
- **Multi-source validation**: Events cross-referenced across USGS contributing networks
- **28 columns** including technical parameters (RMS error, azimuthal gap, station count) for advanced seismological analysis

Ideal for: seismic risk modeling, geospatial analysis, climate/disaster research, machine learning (earthquake prediction), insurance risk assessment, and educational use. Data sourced from the USGS Earthquake Hazards Program (earthquake.usgs.gov), a globally authoritative public seismological data source.

94,767 rows·CSV·3 downloads

4.4/5
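The depth and magnitude classes named in the Seismic Events Database entry above can be approximated with a few threshold checks. This is a sketch under stated assumptions: it uses the commonly cited 70 km and 300 km depth cutoffs and whole-number magnitude bands (4, 5, 6, 7, 8). The listing does not say which boundaries the dataset actually uses, and the function names are hypothetical.

```python
def depth_class(depth_km: float) -> str:
    """Classify focal depth using assumed 70 km and 300 km cutoffs."""
    if depth_km < 70:
        return "shallow"
    if depth_km < 300:
        return "intermediate"
    return "deep"


# Assumed lower bound of each magnitude class, highest first.
_MAGNITUDE_CLASSES = (
    (8.0, "great"),
    (7.0, "major"),
    (6.0, "strong"),
    (5.0, "moderate"),
    (4.0, "light"),
)


def magnitude_class(magnitude: float) -> str:
    """Map a magnitude of 4.0 or more to one of the five listed classes."""
    for lower_bound, label in _MAGNITUDE_CLASSES:
        if magnitude >= lower_bound:
            return label
    raise ValueError("magnitude is below the catalog's 4.0 threshold")


if __name__ == "__main__":
    print(depth_class(35.0), magnitude_class(6.4))
```

Comparing these labels against the dataset's own class columns is a quick way to find out whether it follows the same boundaries.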
