To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search
Full API docs: https://api.databazaar.io/llms.txt
Agent discovery: https://databazaar.io/.well-known/agent.json
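A minimal sketch of calling the search endpoint above from Python, using only the standard library. The endpoint URL and the `query` parameter come from the line above; everything else is an assumption for illustration, including the helper names `build_search_url` and `search_datasets`, the JSON response format, and the `Accept` header. Consult the full API docs for the actual response schema and any authentication requirements.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Endpoint taken from the listing; the rest of this sketch is illustrative.
BASE_URL = "https://api.databazaar.io/datasets"


def build_search_url(query: str) -> str:
    """Return the search URL with the query text safely URL-encoded."""
    return f"{BASE_URL}?{urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0):
    """Send the GET request and decode the response body.

    Assumption: the endpoint returns JSON. The response schema is not
    described in the listing, so the decoded value is returned as-is.
    """
    request = Request(build_search_url(query), headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL; call search_datasets() to hit the network.
    print(build_search_url("pii masking"))
```

Keeping URL construction separate from the network call makes the encoding easy to check offline; spaces and reserved characters such as `&` in the search text are escaped rather than breaking the query string.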
Browse Data
25–48 of 227
PII Masking Dataset — 300K Labeled Text Samples
225,000+ text samples annotated for personally identifiable information (PII) detection and masking, built to train and evaluate models that redact sensitive data from text. Each record pairs source and target text with privacy masks, span labels, and token-level annotations across multiple languages.
Stack Overflow Posts — 58M Questions & Answers (Markdown)
Every Stack Overflow post submitted before June 14, 2023 — roughly 58 million questions and answers (~35 GB) formatted as Markdown. Includes scores, tags, view counts, creation/edit timestamps, and full post metadata.
Robotics and Humanoid FMEA Public Source Dataset
Public-source dataset for the Robotics Companies FMEAs bounty. It indexes 18 sources on robots, humanoids, human-robot collaboration (HRC), personal-care robots, collaborative robots, autonomous robots, failure mode and effects analysis (FMEA), and FMEA methodology. Each entry records the URL, source type, year, robot platform/domain, directness label, access status, relevance score, evidence locator, evidence note, and notes. This is AI-assisted public research with manual review labels. It is a compiled source index and review dataset: it does not redistribute papers, claim access to private proprietary company FMEAs, or include fabricated safety documents. Original source rights remain with their publishers; buyers get the compiled metadata, links, labels, and review notes.
CPU Activity (cpu_act)
Classic regression benchmark predicting CPU user-mode utilization from 21 system activity measures collected on a Sun SPARCstation. 8,192 rows, widely used in ML evaluation.
Yeast Protein Localization (UCI/OpenML)
Classic multi-class classification benchmark predicting cellular localization sites of yeast proteins from 8 sequence-derived numeric features. 1,484 instances, 10 classes.
Waveform-5000 (UCI / OpenML)
Classic 3-class synthetic waveform classification benchmark with 5,000 instances and 40 numeric attributes (21 informative + 19 noise). Originally from Breiman et al. 1984, distributed via UCI and OpenML.
Adult (Census Income) — UCI/OpenML Benchmark
Classic UCI 'Adult' census income dataset (~48K rows, 14 features) for predicting whether income exceeds $50K/yr. Widely used for tabular ML benchmarking, fairness research, and AutoML evaluation.
Magneton: Substructure-Aware Protein Representation Learning Dataset
530,601 SwissProt proteins with DSSP secondary structure and InterPro 103.0 substructure annotations, sharded JSONL format. For training/evaluating protein representation learning models.
Global Wheat Full Semantic Organ Segmentation (GWFSS) v1.0
Labelled image dataset for semantic segmentation of wheat plant organs (canopy, leaves, stems, heads) across global field conditions. CC-BY-4.0, ~1K-10K images in Parquet format.
RxRx3-core — Phenomics Microscopy Image Challenge
Recursion's RxRx3-core phenomics dataset: labeled microscopy images of 735 genetic knockouts and 1,674 small-molecule perturbations drawn from RxRx3, released as a benchmark for cellular image representation learning.
Open Schematics: Electronic Circuit Designs Dataset
10K-100K electronic schematics from hardware projects with visual representations, component metadata, and KiCad source files. For training AI on circuit design, component recognition, and hardware engineering tasks.
Nemotron Content Safety Audio Dataset (Aegis 2.0 Multimodal)
1,928 English audio files of adversarial and safety-critical prompts across 23 violation categories, extending NVIDIA's Aegis 2.0 content-safety benchmark into the audio modality for multimodal guardrail evaluation.
Hindawi Arabic Books — Section-Level NLP Dataset
Cleaned, section-level Arabic text from Hindawi.org books spanning literature, philosophy, history, and science. 10K-100K rows in Parquet format, prepared for Arabic NLP training and research.
Pexels 568K Synthetic Captions (InternVL2-40B)
567,573 synthetic English captions for Pexels photos, generated with InternVL2-40B-AWQ and grounded with original tags. JSON format, ideal for text-to-image and image-to-text model training.
American Sign Language (ASL) Video Dataset — 108K Videos, 2,208 Words
108,618 ASL gesture videos covering 2,208 distinct words (≥30 videos per word). MIT-licensed, preprocessed for ML training and gesture recognition.
ShotBench: Cinematic Understanding Benchmark for VLMs
3,572 expert-level QA pairs over 3,049 images and 464 video clips from films Oscar-nominated for cinematography, for evaluating cinematic understanding in vision-language models.
Physiotherapy Evidence QA (Bilingual TR/EN)
143,711 bilingual (Turkish/English) expert-curated Q&A pairs covering evidence-based physiotherapy, musculoskeletal rehabilitation, outcome measures, and clinical research methodology. CSV format, CC-BY-4.0.
MInDS-14: Multilingual Spoken Intent Detection (14 Languages, e-Banking)
Spoken intent detection benchmark covering 14 e-banking intents across 14 language varieties. Audio and transcriptions in Parquet format, ideal for speech understanding evals and multilingual ASR/NLU fine-tuning.
RSRCC: Remote Sensing Regional Change Comprehension Benchmark
Google Research multimodal benchmark for semantic change understanding in remote sensing — multi-temporal satellite image pairs with natural language Q&A for VQA, change captioning, and multiple-choice tasks.
AG News - Topic Classification Benchmark
Classic 4-class news topic classification dataset (~127K articles across World, Sports, Business, Sci/Tech). Standard benchmark for text classification, fine-tuning, and NLP evals.
Go-Code-Large: 316K Go Source Code Samples
Large-scale corpus of 316,427 Go (Golang) source code samples in JSONL format. Curated for LLM pretraining, code generation fine-tuning, and static analysis research on cloud-native and backend systems.
UAVIT-1M: UAV Visual Instruction Tuning Dataset (1M+)
Largest instruction-tuning dataset for low-altitude UAV visual understanding, with 1M+ samples across 11 image- and region-level tasks. CC-BY-4.0.
TextCaps (lmms-eval formatted)
Image captioning benchmark requiring OCR/text reading in images. Formatted by lmms-lab for one-click multimodal model evaluation. ~28K images with captions.
Telco Customer Churn Prediction (IBM Sample)
Classic IBM telco customer churn dataset (~7K rows) with demographics, service subscriptions, account info, and churn label. Tabular CSV, ideal for ML classification tutorials, benchmarks, and agent-driven feature engineering.