To search datasets programmatically: GET https://api.databazaar.io/datasets?query=your-search

Full API docs: https://api.databazaar.io/llms.txt

Agent discovery: https://databazaar.io/.well-known/agent.json
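
The search endpoint listed above can be called from a script. The following is a minimal sketch, not an official client: it assumes only what this page states (a GET request to `/datasets` with a `query` parameter). The helper names `build_search_url` and `search_datasets` are hypothetical, and because the response schema is not shown here, the body is returned as raw text rather than parsed.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Base URL taken from the search hint on this page.
SEARCH_ENDPOINT = "https://api.databazaar.io/datasets"


def build_search_url(query: str) -> str:
    """Return the search URL with the query text URL-encoded."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'query': query})}"


def search_datasets(query: str, timeout: float = 10.0) -> str:
    """Send the GET request and return the response body as text.

    Assumption: the response format is not documented on this page,
    so nothing is parsed here; consult the API docs before decoding it.
    """
    request = Request(build_search_url(query), method="GET")
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only builds the URL; call search_datasets() to hit the network.
    print(build_search_url("protein localization"))
```

Building the URL separately from sending the request makes it possible to check the encoded query without network access. Any authentication, paging, or rate-limit rules would come from the full API docs, not from this sketch.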

Browse Data

25–48 of 227
text
Free·fixed price

PII Masking Dataset — 300K Labeled Text Samples

225,000+ text samples annotated for personally identifiable information (PII) detection and masking, built to train and evaluate models that redact sensitive data from text. Each record pairs source and target text with privacy masks, span labels, and token-level annotations across multiple languages.

225,405 rows·PARQUET·1 download
text
Free·fixed price

Stack Overflow Posts — 58M Questions & Answers (Markdown)

Every Stack Overflow post submitted before June 14, 2023 — roughly 58 million questions and answers (~35 GB) formatted as Markdown. Includes scores, tags, view counts, creation/edit timestamps, and full post metadata.

58,329,355 rows·PARQUET·0 downloads
scientific
$7·fixed price

Robotics and Humanoid FMEA Public Source Dataset

Public-source dataset for the Robotics Companies FMEAs bounty. It indexes 18 sources covering robots, humanoids, HRC, personal-care robots, collaborative robots, autonomous robots, failure mode and effects analysis (FMEA), and FMEA methodology. Each source is listed with its URL, source type, year, robot platform/domain, directness label, access status, relevance score, evidence locator, evidence note, and notes. This is AI-assisted public research with manual review labels. It is a compiled source index and review dataset: it does not redistribute papers, claim access to private proprietary company FMEAs, or include fabricated safety documents. Original source rights remain with their publishers; buyers get the compiled metadata, links, labels, and review notes.

18 rows·CSV·0 downloads
3.5/5
retail
Free·fixed price

CPU Activity (cpu_act)

Classic regression benchmark predicting CPU user-mode utilization from 21 system activity measures collected on a Sun SPARCstation. 8,192 rows, widely used in ML evaluation.

8,192 rows·PARQUET·0 downloads
other
Free·fixed price

Yeast Protein Localization (UCI/OpenML)

Classic multi-class classification benchmark predicting cellular localization sites of yeast proteins from 8 sequence-derived numeric features. 1,484 instances, 10 classes.

1,484 rows·PARQUET·3 downloads
text
Free·fixed price

Waveform-5000 (UCI / OpenML)

Classic 3-class synthetic waveform classification benchmark with 5,000 instances and 40 numeric attributes (21 informative + 19 noise). Originally from Breiman et al. 1984, distributed via UCI and OpenML.

5,000 rows·PARQUET·1 download
retail
Free·fixed price

Adult (Census Income) — UCI/OpenML Benchmark

Classic UCI 'Adult' census income dataset (~48K rows, 14 features) for predicting whether income exceeds $50K/yr. Widely used for tabular ML benchmarking, fairness research, and AutoML evaluation.

48,842 rows·PARQUET·0 downloads
scientific
Free·fixed price

Magneton: Substructure-Aware Protein Representation Learning Dataset

530,601 SwissProt proteins with DSSP secondary structure and InterPro 103.0 substructure annotations, sharded JSONL format. For training/evaluating protein representation learning models.

530,601 rows·PARQUET·0 downloads
images
Free·fixed price

Global Wheat Full Semantic Organ Segmentation (GWFSS) v1.0

Labelled image dataset for semantic segmentation of wheat plant organs (canopy, leaves, stems, heads) across global field conditions. CC-BY-4.0, ~1K-10K images in parquet format.

1,096 images·PARQUET·0 downloads
images
Free·fixed price

RxRx3-core — Phenomics Microscopy Image Challenge

Recursion's RxRx3-core phenomics dataset: labeled microscopy images of 735 genetic knockouts and 1,674 small-molecule perturbations drawn from RxRx3, released as a benchmark for cellular image representation learning.

1,335,606 images·PARQUET·0 downloads
scientific
Free·fixed price

Open Schematics: Electronic Circuit Designs Dataset

10K-100K electronic schematics from hardware projects with visual representations, component metadata, and KiCad source files. For training AI on circuit design, component recognition, and hardware engineering tasks.

84,470 rows·PARQUET·0 downloads
text
Free·fixed price

Nemotron Content Safety Audio Dataset (Aegis 2.0 Multimodal)

1,928 English audio files of adversarial and safety-critical prompts across 23 violation categories, extending Nvidia's Aegis 2.0 content-safety benchmark into the audio modality for multimodal guardrail evaluation.

1,928 rows·PARQUET·0 downloads
text
Free·fixed price

Hindawi Arabic Books — Section-Level NLP Dataset

Cleaned, section-level Arabic text from Hindawi.org books spanning literature, philosophy, history, and science. 10K-100K rows in Parquet format, prepared for Arabic NLP training and research.

52,830 rows·PARQUET·0 downloads
images
Free·fixed price

Pexels 568K Synthetic Captions (InternVL2-40B)

567,573 synthetic English captions for Pexels photos, generated with InternVL2-40B-AWQ and grounded with original tags. JSON format, ideal for text-to-image and image-to-text model training.

567,573 images·PARQUET·1 download
images
Free·fixed price

American Sign Language (ASL) Video Dataset — 108K Videos, 2,208 Words

108,618 ASL gesture videos covering 2,208 distinct words (≥30 videos per word). MIT-licensed, preprocessed for ML training and gesture recognition.

47 images·PARQUET·0 downloads
images
Free·fixed price

ShotBench: Cinematic Understanding Benchmark for VLMs

3,572 expert-level QA pairs over 3,049 images and 464 video clips from Oscar-nominated cinematography films, for evaluating cinematic understanding in vision-language models.

3,572 images·PARQUET·0 downloads
scientific
Free·fixed price

Physiotherapy Evidence QA (Bilingual TR/EN)

143,711 bilingual (Turkish/English) expert-curated Q&A pairs covering evidence-based physiotherapy, musculoskeletal rehabilitation, outcome measures, and clinical research methodology. CSV format, CC-BY-4.0.

143,711 rows·PARQUET·0 downloads
text
Free·fixed price

MInDS-14: Multilingual Spoken Intent Detection (14 Languages, e-Banking)

Spoken intent detection benchmark covering 14 e-banking intents across 14 language varieties. Audio + transcriptions in parquet format, ideal for speech understanding evals and multilingual ASR/NLU fine-tuning.

16,336 rows·PARQUET·0 downloads
geographic
Free·fixed price

RSRCC: Remote Sensing Regional Change Comprehension Benchmark

Google Research multimodal benchmark for semantic change understanding in remote sensing — multi-temporal satellite image pairs with natural language Q&A for VQA, change captioning, and multiple-choice tasks.

126,131 rows·PARQUET·0 downloads
text
Free·fixed price

AG News - Topic Classification Benchmark

Classic 4-class news topic classification dataset (~127K articles across World, Sports, Business, Sci/Tech). Standard benchmark for text classification, fine-tuning, and NLP evals.

127,600 rows·PARQUET·0 downloads
text
Free·fixed price

Go-Code-Large: 316K Go Source Code Samples

Large-scale corpus of 316,427 Go (Golang) source code samples in JSONL format. Curated for LLM pretraining, code generation fine-tuning, and static analysis research on cloud-native and backend systems.

316,427 rows·PARQUET·0 downloads
images
Free·fixed price

UAVIT-1M: UAV Visual Instruction Tuning Dataset (1M+)

Largest instruction-tuning dataset for low-altitude UAV visual understanding, with 1M+ samples across 11 image- and region-level tasks. CC-BY-4.0.

1,240,666 images·PARQUET·3 downloads
images
Free·fixed price

TextCaps (lmms-eval formatted)

Image captioning benchmark requiring OCR/text reading in images. Formatted by lmms-lab for one-click multimodal model evaluation. ~28K images with captions.

28,408 images·PARQUET·0 downloads
retail
Free·fixed price

Telco Customer Churn Prediction (IBM Sample)

Classic IBM telco customer churn dataset (~7K rows) with demographics, service subscriptions, account info, and churn label. Tabular CSV, ideal for ML classification tutorials, benchmarks, and agent-driven feature engineering.

7,043 rows·PARQUET·0 downloads