To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search
Full API docs: https://api.databazaar.io/llms.txt
Agent discovery: https://databazaar.io/.well-known/agent.json
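A minimal sketch of calling the search endpoint above from Python, using only the standard library. The endpoint URL and the `query` parameter come from the line above; the function names are hypothetical, and because the response format is not documented here, the body is returned as raw text rather than parsed.

```python
import urllib.parse
import urllib.request

# Endpoint taken from the page header; everything else below is illustrative.
BASE_URL = "https://api.databazaar.io/datasets"


def build_search_url(query: str) -> str:
    """Return the search URL with the free-text query safely URL-encoded."""
    return f"{BASE_URL}?{urllib.parse.urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0) -> str:
    """Send the GET request and return the raw response body as text.

    The response schema is not assumed; consult the full API docs before
    parsing the result.
    """
    with urllib.request.urlopen(build_search_url(query), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

For example, `build_search_url("heart disease")` encodes the space so the query can be sent as a single parameter; `search_datasets("heart disease")` would then perform the actual request.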
Browse Data
1–24 of 41
Iris Flower Classification (Fisher, 1936)
Ronald Fisher's iconic Iris dataset — 150 samples across three species (setosa, versicolor, virginica) with sepal and petal length and width measurements. The canonical introductory classification benchmark.
Pen-Based Digit Recognition — 1M Synthetic (BNG)
A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Pen Digits dataset, with 16 pen-trajectory features per sample for handwritten-digit classification.
Glass Type Identification — 138K Synthetic (BNG)
A 137,781-row synthetic expansion (generated via a Bayesian Network) of the UCI Glass Identification dataset, classifying glass type from its refractive index and oxide composition.
Cleveland Heart Disease (UCI)
The classic Cleveland heart-disease dataset — 303 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a diagnosis label. One of the most widely used UCI classification benchmarks.
Hepatitis Prognosis — 1M Synthetic (BNG)
A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the UCI Hepatitis dataset, predicting patient survival from clinical and laboratory attributes.
Space Shuttle Autolanding Control (NASA / UCI)
A small NASA decision dataset specifying the conditions — stability, error, sign, wind, magnitude, and visibility — under which a Space Shuttle should land automatically versus manually. A classic UCI benchmark.
Page Blocks Classification — 295K Synthetic (BNG)
A 295,245-row synthetic expansion (generated via a Bayesian Network) of the classic Page Blocks dataset, classifying blocks in scanned document layouts by geometric and pixel-density features.
Heart Disease (Statlog) — 1M Synthetic (BNG)
A 1,000,000-row synthetic expansion (generated via a Bayesian Network) of the Statlog Heart dataset, predicting heart-disease presence from 13 clinical attributes.
Heart Disease — Cholesterol Prediction (UCI)
A 303-patient variant of the UCI heart-disease data with serum cholesterol as the prediction target alongside 13 clinical attributes. Used as a numeric (regression) prediction benchmark.
E. coli Promoter Gene Sequences (UCI)
106 E. coli DNA sequences labeled as promoter or non-promoter, each described by 57 sequential nucleotide positions. A classic UCI molecular-biology classification benchmark for sequence analysis.
SEA Concept-Drift Stream — 1M Synthetic Samples
A 1,000,000-row synthetic data stream from the classic SEA generator, with three numeric attributes and a binary class label. A standard benchmark for evaluating concept-drift detection and streaming classifiers.
Statlog Heart Disease (UCI)
The Statlog Heart dataset — 270 patients described by 13 clinical attributes (age, sex, chest-pain type, blood pressure, cholesterol, ECG results, and more) with a heart-disease presence label. A classic UCI classification benchmark.
Robotics and Humanoid FMEA Public Source Dataset
Public-source dataset for the Robotics Companies FMEAs bounty. Includes 18 robot, humanoid, HRC, personal-care robot, collaborative robot, autonomous robot, failure mode and effects analysis, and FMEA-methodology sources with URLs, source type, year, robot platform/domain, directness label, access status, relevance score, evidence locator, evidence note, and notes. This is AI-assisted public research with manual review labels. It is a compiled source index and review dataset: it does not redistribute papers, claim access to private proprietary company FMEAs, or include fabricated safety documents. Original source rights remain with their publishers; buyers get the compiled metadata, links, labels, and review notes.
Magneton: Substructure-Aware Protein Representation Learning Dataset
530,601 SwissProt proteins with DSSP secondary structure and InterPro 103.0 substructure annotations, sharded JSONL format. For training/evaluating protein representation learning models.
Open Schematics: Electronic Circuit Designs Dataset
10K-100K electronic schematics from hardware projects with visual representations, component metadata, and KiCad source files. For training AI on circuit design, component recognition, and hardware engineering tasks.
Physiotherapy Evidence QA (Bilingual TR/EN)
143,711 bilingual (Turkish/English) expert-curated Q&A pairs covering evidence-based physiotherapy, musculoskeletal rehabilitation, outcome measures, and clinical research methodology. CSV format, CC-BY-4.0.
ChemBench — Chemistry & Materials LLM Evaluation Benchmark
Manually curated benchmark for evaluating chemistry and materials science capabilities of LLMs. Expert-generated QA and multiple-choice items. MIT licensed, evaluation-only.
Sci-Base: AI-Ready Scientific Foundation Dataset
Large multi-domain scientific text corpus (chem, bio, climate, medical, materials, earth, physics) from OpenDataLab's Sciverse foundation, 1M-10M rows in Parquet, CC-BY-4.0.
ScienceQA: Multimodal Science Question Answering with Chain-of-Thought
21K multimodal multiple-choice science questions with images, lectures, and chain-of-thought explanations spanning natural science, social science, and language science. Widely used for VLM evaluation and CoT fine-tuning.
NOAA Global Temperature Anomalies (1850-2025)
Annual global land and ocean average temperature anomalies from 1850 to 2025, sourced directly from NOAA National Centers for Environmental Information (NCEI). Base period: 1901-2000 average. 176 annual data points showing the long-term global warming trend, measured in degrees Celsius. Format: CSV, 2 columns (Year, Anomaly), 176 rows + header. Source: NOAA Climate at a Glance (https://www.ncei.noaa.gov). License: Public domain (US government data). Captured: 2026-04-13. Ideal for: climate trend analysis, time series modeling, regression tutorials, data visualization demos, and agent workflows that need authoritative temperature history.
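As a rough illustration of the regression use case mentioned above, the sketch below fits an ordinary least-squares line to a two-column CSV and reports the slope in degrees Celsius per decade. It assumes the header names are exactly `Year` and `Anomaly`; the function names are hypothetical, and the sample rows are made-up placeholders, not NOAA values.

```python
import csv
import io


def load_anomalies(csv_text):
    """Parse CSV text with `Year` and `Anomaly` columns into (year, anomaly) pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(row["Year"]), float(row["Anomaly"])) for row in reader]


def trend_per_decade(points):
    """Return the least-squares slope of anomaly against year, scaled to one decade."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    return 10.0 * sxy / sxx


# Placeholder rows for illustration only; these are NOT real NOAA measurements.
SAMPLE_CSV = "Year,Anomaly\n2000,0.00\n2001,0.02\n2002,0.04\n2003,0.06\n"

if __name__ == "__main__":
    print(trend_per_decade(load_anomalies(SAMPLE_CSV)))
```

With the real file, replacing `SAMPLE_CSV` with the file contents would give the fitted warming rate over whatever year range is loaded.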
USGS Global Earthquakes — Past 24 Hours (M2.5+)
A fresh snapshot of all earthquakes of magnitude 2.5 and greater recorded worldwide in the past 24 hours, sourced directly from the USGS Earthquake Hazards Program real-time feed. Contains 45 events with full seismic parameters: event time (UTC), latitude, longitude, depth (km), magnitude, magnitude type, number of stations, azimuthal gap, minimum distance, RMS error, network, event ID, place description, event type, horizontal/depth/magnitude errors, review status, and location/magnitude sources. Format: CSV (22 columns, 45 rows + header). Source: USGS real-time feed (https://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/2.5_day.csv). License: USGS data is public domain (U.S. Government work, not subject to copyright). Captured: 2026-04-13. Ideal for real-time geoscience demos, seismic monitoring prototypes, map visualization tutorials, anomaly detection notebooks, and agent workflows that need recent hazard data.
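A small sketch of filtering such a snapshot by magnitude. It assumes the CSV keeps the USGS feed's own header names `time`, `mag`, and `place`, which this listing may have renamed; the function name is hypothetical, and the sample rows are invented placeholders, not recorded events.

```python
import csv
import io


def events_at_or_above(csv_text, min_magnitude):
    """Return (time, magnitude, place) tuples with mag >= min_magnitude, strongest first."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        raw_mag = (row.get("mag") or "").strip()
        if not raw_mag:
            # Skip rows where the magnitude field is empty.
            continue
        magnitude = float(raw_mag)
        if magnitude >= min_magnitude:
            selected.append((row["time"], magnitude, row["place"]))
    selected.sort(key=lambda event: event[1], reverse=True)
    return selected


# Invented placeholder rows with a subset of columns; NOT real seismic events.
SAMPLE_CSV = (
    "time,latitude,longitude,depth,mag,place\n"
    "2026-01-01T00:00:00.000Z,10.0,20.0,10.0,4.6,Placeholder region A\n"
    "2026-01-01T01:00:00.000Z,11.0,21.0,35.0,2.7,Placeholder region B\n"
    "2026-01-01T02:00:00.000Z,12.0,22.0,5.0,,Placeholder region C\n"
    "2026-01-01T03:00:00.000Z,13.0,23.0,60.0,5.1,Placeholder region D\n"
)
```

Calling `events_at_or_above(SAMPLE_CSV, 4.5)` keeps only the two placeholder rows at or above that threshold and ignores the row with a missing magnitude.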
NASA Exoplanet & Planetary Candidate Catalog — 20,933 Objects from NASA, Kepler & TESS (1992–2025)
Harmonized catalog of 20,933 exoplanetary objects combining three authoritative NASA sources: the NASA Exoplanet Archive (6,153 confirmed exoplanets), the Kepler Cumulative KOI Table (6,867 unique Kepler Objects of Interest), and the TESS Objects of Interest catalog (7,913 TOIs). Each record includes 28 normalized fields covering planetary properties (orbital period, radius, mass, equilibrium temperature, eccentricity, insolation flux), host star characteristics (effective temperature, radius, mass, metallicity, surface gravity, spectral type, luminosity), discovery metadata (method, year, facility), sky coordinates (RA/Dec), distance, and system multiplicity. Objects span the full disposition spectrum from confirmed planets through candidates to false positives, enabling classification model training, demographic analysis, and habitability studies. Data sourced from the NASA Exoplanet Science Institute (IPAC/Caltech), Kepler mission pipeline, and TESS Follow-up Observing Program. Deduplicated across catalogs to avoid double-counting confirmed Kepler planets. Suitable for exoplanet population statistics, machine learning classification of planetary candidates, stellar characterization, and habitability zone analysis.
Energy Production, Consumption & CO₂ Emissions by Country (1965–2023)
Cross-national country-level dataset covering energy production, consumption, and greenhouse gas emissions for 229 countries and territories from 1965 to 2023. Contains 13,100+ rows with 60 curated indicators spanning electricity generation by source, primary energy consumption, fossil fuel and renewable energy breakdowns, CO2 emissions by fuel type, cumulative emissions, methane and nitrous oxide emissions, and estimated temperature contributions.

**Sources:**
- Our World in Data: Energy Dataset (electricity generation, consumption, production by fuel type, energy mix shares)
- Our World in Data: CO2 and Greenhouse Gas Emissions Dataset (annual CO2 by source, GHG totals, per-capita metrics, cumulative emissions, temperature change attribution)
- Underlying sources include: BP Statistical Review of World Energy, Ember Global Electricity Review, Energy Institute Statistical Review, IPCC, Global Carbon Project, Climate Watch/CAIT, UNFCCC

**Schema (60 columns):**
- `country`: Country or territory name
- `iso_code`: ISO 3166-1 alpha-3 country code
- `year`: Year of observation (1965–2023)
- `population`: Total population
- `gdp`: GDP in international-$ (PPP, 2017 prices)
- `electricity_generation`: Total electricity generation (TWh)
- `electricity_demand`: Electricity demand (TWh)
- `primary_energy_consumption`: Primary energy consumption (TWh)
- `energy_per_capita`: Energy consumption per capita (kWh)
- `energy_per_gdp`: Energy intensity (kWh per $)
- `fossil_fuel_consumption` / `renewables_consumption` / `nuclear_consumption`: Consumption by type (TWh)
- `coal_consumption` / `oil_consumption` / `gas_consumption`: Fossil fuel breakdown (TWh)
- `hydro_consumption` / `solar_consumption` / `wind_consumption` / `biofuel_consumption`: Renewable breakdown (TWh)
- `fossil_share_energy` / `renewables_share_energy` / `nuclear_share_energy`: Energy mix shares (%)
- `coal_share_energy` / `oil_share_energy` / `gas_share_energy`: Fossil fuel shares (%)
- `low_carbon_share_energy`: Low-carbon energy share (%)
- `carbon_intensity_elec`: Carbon intensity of electricity (gCO2/kWh)
- `co2`: Annual CO2 emissions (million tonnes)
- `co2_per_capita`: CO2 per capita (tonnes)
- `co2_per_gdp` / `co2_per_unit_energy`: CO2 efficiency metrics
- `coal_co2` / `oil_co2` / `gas_co2` / `cement_co2` / `flaring_co2`: CO2 by source (Mt)
- `total_ghg`: Total greenhouse gas emissions (MtCO2e)
- `methane` / `nitrous_oxide`: Non-CO2 GHG emissions (MtCO2e)
- `cumulative_co2` / `share_global_co2` / `share_global_cumulative_co2`: Global share metrics
- `temperature_change_from_co2` / `temperature_change_from_ghg`: Estimated warming contribution (°C)
- `data_sources`: Which source datasets contributed to each row

**Coverage:** 229 countries, 1965–2023, 13,100+ observations. Data density is highest for 1990–2023 with near-complete coverage; earlier decades have sparser coverage for smaller nations.

**Use cases:** Climate policy analysis, energy transition tracking, cross-country emissions benchmarking, renewable energy adoption studies, carbon intensity trends, ESG research, AI agent environmental analysis, academic research on global decarbonization pathways.
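The units listed in the schema allow a simple consistency check: `co2` is in million tonnes and `co2_per_capita` is in tonnes, so the per-capita figure can be re-derived from `co2` and `population`. The sketch below assumes each row has already been loaded as a dict keyed by the schema's column names (the file format for this dataset is not stated here); the function name and the sample row are hypothetical, and the values are placeholders rather than real data.

```python
def derived_co2_per_capita(row):
    """Return tonnes of CO2 per person from `co2` (million tonnes) and `population`.

    Returns None when either field is missing, empty, or population is not positive,
    which can happen in sparsely covered country-years.
    """
    co2_mt = row.get("co2")
    population = row.get("population")
    if co2_mt in (None, "") or population in (None, ""):
        return None
    population = float(population)
    if population <= 0:
        return None
    # Convert million tonnes to tonnes before dividing by head count.
    return float(co2_mt) * 1_000_000 / population


# Placeholder row for illustration; not taken from the dataset.
sample_row = {"country": "Exampleland", "year": "2020", "co2": "500", "population": "50000000"}
```

For the placeholder row this gives 10 tonnes per person; comparing the derived value against the stored `co2_per_capita` column is one way to spot unit or merge problems after loading.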
Public Health & Disease Burden — 222 Countries (1960–2023)
Dataset covering 20 key public health indicators across 222 countries and territories from 1960 to 2023. Sourced from the World Bank Open Data platform, this dataset combines mortality metrics (life expectancy, infant mortality, maternal mortality, NCD mortality), healthcare infrastructure (physicians, hospital beds, health expenditure), disease burden (tuberculosis, HIV incidence), preventive care (immunization rates, skilled birth attendance), environmental health (clean water access, sanitation), and behavioral risk factors (smoking prevalence, obesity rates). Wide format with one row per country-year: 14,183 rows across 25 columns. Ideal for epidemiological analysis, global health comparisons, development economics research, and public health policy evaluation.