Browse Data

17,191 rows·PARQUET·0 downloads

TaskTrove — 750K+ Agentic Tasks for RL & SFT Training

Open-source collection of 750,000+ unique agentic tasks aggregated from 100+ sources, including SWE-smith, R2E-Gym, and SWE-rebench. Apache-2.0 licensed, Parquet format, designed for agent training and evaluation.

18,139,035 rows·PARQUET·0 downloads

Bhasha SFT — 13M+ Multilingual Instruction-Response Pairs (Hindi, Bengali, Gujarati, English)

13M+ instruction-response pairs across Hindi, Bengali, Gujarati, and English for supervised fine-tuning of multilingual LLMs. Mix of human-annotated and synthetic data from open-source SFT collections, curated by Soket AI Labs.

155,855 rows·PARQUET·0 downloads

SciCode Domain Code: 1.1B+ Lines of Scientific Code Across 178 Domains

Large-scale domain-specific code dataset (~115 GB, 1.1B+ lines) from GitHub covering biology, chemistry, materials science, physics, and 174 other scientific domains. Apache-2.0 licensed.

1,154,324 rows·PARQUET·0 downloads

HPDv3 — Human Preference Dataset v3 (1.08M text-image pairs)

Wide-spectrum human preference dataset for text-to-image generation: 1.08M text-image pairs and 1.17M pairwise human preference annotations. MIT-licensed, used to train HPSv3 (ICCV 2025) reward models.

223,100 images·PARQUET·0 downloads

PubTabNet — 568K Table Images with HTML Annotations

568K+ scientific table images paired with HTML structure annotations, extracted from PubMed Central Open Access articles. Standard benchmark for image-based table recognition and document AI.

5,004 rows·PARQUET·0 downloads

LongCoT: Long-Horizon Reasoning Benchmark

Benchmark for evaluating sustained long chain-of-thought reasoning across logic, computer science, chemistry, chess, and mathematics. Parquet format, MIT licensed.

1,422,808 images·PARQUET·0 downloads

EO-Data-1.5M: Interleaved Vision-Text-Action Dataset for Embodied AI

1.5M-sample interleaved vision-language-action dataset for embodied AI and robot learning, emphasizing temporal dynamics and causal dependencies across modalities. Apache-2.0 licensed, Parquet format.

scientific

2,785 rows·PARQUET·1 downloads

ChemBench — Chemistry & Materials LLM Evaluation Benchmark

Manually curated benchmark for evaluating chemistry and materials science capabilities of LLMs. Expert-generated QA and multiple-choice items. MIT licensed, evaluation-only.

5,890,279 images·PARQUET·0 downloads

FLUX-Reason-6M: 6M Reasoning-Focused Text-to-Image Dataset

6 million high-quality images synthesized by FLUX.1-dev with 20M bilingual (English/Chinese) descriptions, engineered to instill complex reasoning capabilities in text-to-image generative models.

10,000 images·PARQUET·0 downloads

Text-to-Image 2M: Curated Text-Image Pair Training Dataset

~2M curated high-quality text-image pairs in WebDataset format, designed for fine-tuning text-to-image generative models. MIT licensed.

5,061 rows·PARQUET·0 downloads

UGMathBench: Undergraduate Math Reasoning Benchmark

5,062 undergraduate-level math problems across 16 subjects and 111 topics, with 10 answer types and 3 randomized versions each. Designed for evaluating LLM mathematical reasoning.

75,055,472 rows·PARQUET·0 downloads

Common Crawl Creative Commons Fine (C5f) — Multilingual High-Quality Web Corpus

Filtered Creative Commons web corpus from Common Crawl, intersected with FineWeb/FineWeb-2 for quality. Multilingual (EN, DE, FR, NL, ES, IT, AF, FY), about 75M rows in Parquet format. Suited to LLM pretraining and fine-tuning.

15,100 images·PARQUET·0 downloads

TreeOfLife-10M: 10M Biological Organism Images with Taxonomic Labels

Largest ML-ready dataset of biological organism images (10M+ images, 454K taxa) paired with taxonomic labels. Aggregates iNat21, BIOSCAN-1M, and Encyclopedia of Life. Used to train BioCLIP and similar vision-language models.

122,603 images·PARQUET·0 downloads

MMFineReason-SFT-123K (Qwen3-VL-235B Thinking Traces)

The hardest 7% of MMFineReason-1.8M (123K multimodal reasoning samples on which smaller models consistently fail), each paired with a Qwen3-VL-235B chain-of-thought trace. Curated as an SFT subset. Apache-2.0 licensed, Parquet format, image+text.

2,560,755 rows·PARQUET·0 downloads

MASSIVE: Multilingual NLU Dataset (51 Languages, 1M+ Utterances)

Parallel multilingual NLU benchmark from Amazon Science with 1M+ utterances across 51 languages, annotated with 60 intents and 55 slot types. Built by localizing SLURP voice assistant interactions.

30,358 rows·PARQUET·0 downloads

Salad-Data: LLM Safety & Jailbreak Evaluation Benchmark

21K+ safety questions for LLM red-teaming and jailbreak evaluation, aggregated from HH-RLHF, AdvBench, ToxicChat, GPTFuzzer, and GPT-3.5 self-instructed prompts. Apache-2.0 licensed.

financial

584,347,549 rows·PARQUET·0 downloads

US Stocks & ETFs 1-Minute Bars (2016–Present)

Minute-level OHLCV bars for US stocks and ETFs from 2016 onward, scraped from the Alpaca Markets Historical Bars API. Parquet format, 580M+ rows, MIT licensed.

477,028 rows·PARQUET·0 downloads

CodeParrot Clean — Deduplicated Python Code from GitHub

Cleaned, deduplicated corpus of Python files scraped from GitHub. Filtered for line length, alphanumeric fraction, and auto-generated content. Widely used for code LM pretraining and fine-tuning.

913,400 rows·PARQUET·0 downloads

MASC: Massive Arabic Speech Corpus (1,000 hours, multi-dialect)

1,000 hours of multi-dialect Arabic speech (16 kHz) crawled from 700+ YouTube channels with transcripts. Parquet format, CC-BY-4.0. For Arabic ASR, TTS, and speech LLM training/eval.

1,152,000 images·PARQUET·0 downloads

WorldCuisines VQA — Multilingual Multicultural Visual QA on Global Cuisines

Massive-scale multilingual, multicultural VQA benchmark covering global cuisines across 30 languages. NAACL 2025 Best Theme Paper. Images + text questions/answers for vision-language evaluation and fine-tuning.

3,199,986 rows·PARQUET·0 downloads

LLM-jp-4 Thinking — Japanese Reasoning SFT Dataset

~3.2M supervised fine-tuning examples used to train the llm-jp-4 "thinking" models. Prompts drawn from diverse sources are paired with generated step-by-step reasoning traces and final responses for chain-of-thought training.