Browse Data

2,863,834 images·PARQUET·0 downloads

Llama-Nemotron VLM Dataset v1 (NVIDIA)

NVIDIA's 1M+ sample multimodal dataset for vision-language model training, covering VQA, OCR, captioning, and image-to-text tasks. CC-BY-4.0 licensed.

8,013,769 rows·PARQUET·0 downloads

OpenWebText — Open Replication of GPT-2's WebText Corpus

Open-source replication of OpenAI's WebText, the corpus used to train GPT-2. ~8M English web documents (~13.5GB) in parquet format. CC0 licensed, widely used for LM pretraining and research.

37,120 rows·PARQUET·0 downloads

HelpSteer: NVIDIA Helpfulness Preference Dataset

NVIDIA's open helpfulness dataset with multi-attribute human ratings (helpfulness, correctness, coherence, complexity, verbosity) for prompts and responses. Used to train SteerLM-aligned LLMs. CC-BY-4.0.

227,914 rows·PARQUET·1 download

OpenThoughts-114k: Synthetic Reasoning Dataset

114k high-quality synthetic reasoning examples across math, science, code, and puzzles. Used to fine-tune OpenThinker-7B/32B models. Apache 2.0 licensed, parquet format.

101,900 images·PARQUET·0 downloads

BLIP3o Pretrain Long-Caption Dataset (27M Images)

27 million images paired with ~120-token long captions generated by Qwen2.5-VL-7B-Instruct. WebDataset format, Apache 2.0. Ideal for vision-language pretraining and multimodal model fine-tuning.

1,794,054 rows·PARQUET·0 downloads

ClimbMix: 400B-Token Pre-training Corpus (NVIDIA)

A 400-billion-token English pre-training corpus from NVIDIA, filtered and topic-mixed to make LLM pre-training more efficient per token.

12,500 rows·PARQUET·0 downloads

MATH: Competition Mathematics Problems with Step-by-Step Solutions

12,500 competition math problems (AMC 10/12, AIME, etc.) with full worked solutions. Standard benchmark for math reasoning, chain-of-thought training, and LLM evaluation. MIT licensed, parquet format.

1,696,847 rows·PARQUET·0 downloads

AgentTrove: 1.7M Agentic Interaction Traces

Largest open-source collection of agentic interaction traces (1.7M rows from 219 source datasets) covering code repair, shell scripting, math, competitive programming, and computer-use tasks. Apache 2.0.

177,355,164 rows·PARQUET·0 downloads

Sangraha — 251B Token Indic Language Pretraining Corpus (22 Languages)

Largest cleaned Indic-language pretraining dataset: 251B tokens across 22 Indian languages, curated from web sources, multilingual corpora, and large-scale translations. CC-BY-4.0, parquet format.

88,838 rows·PARQUET·0 downloads

OpenAssistant Conversations (OASST1)

161,443 human-generated assistant conversation messages across 35 languages with 461,292 quality ratings — a foundational dataset for alignment, RLHF, and instruction-tuning research.

9,351,785 rows·PARQUET·0 downloads

BEIR: MS MARCO Passage Retrieval Benchmark

MS MARCO subset of the BEIR heterogeneous IR benchmark — 8.8M passages and ~500K queries in standard retrieval layout. Parquet format, CC-BY-SA-4.0, English.

1,387,173,656 images·PARQUET·0 downloads

DataComp-1B: Image-Text Pair Metadata (1.4B Samples)

Metadata (URLs, captions, CLIP features) for ~1.4B image-text pairs from DataComp-1B, the curated subset of CommonPool used to train state-of-the-art CLIP models. CC-BY-4.0, parquet format.

63,967 rows·PARQUET·0 downloads

UltraFeedback — Large-Scale Fine-Grained Preference Dataset

64k prompts × 4 LLM responses (256k samples) with fine-grained GPT-4 preference annotations across instruction-following, truthfulness, honesty, and helpfulness. Canonical dataset for reward model and RLHF training.

2,152,112 rows·PARQUET·0 downloads

Dolci Instruct SFT Mixture (Olmo 3 Training Data)

2.15M-sample multilingual instruction SFT mixture used to train AI2's Olmo 3 7B Instruct. Parquet format, ODC-BY licensed, covers 70+ languages and includes OpenThoughts 3 and other curated prompt sets.

7,300 images·PARQUET·0 downloads

MINT-1T ArXiv — Multimodal Interleaved ArXiv Papers (Text + Images)

ArXiv subset of MINT-1T: multimodal interleaved text-and-image documents extracted from ArXiv papers, designed for multimodal pretraining at scale. CC-BY-4.0.

1,810,926 images·PARQUET·0 downloads

MMFineReason 1.8M — Multimodal Reasoning Traces (Qwen3-VL-235B-Thinking)

1.8M multimodal reasoning samples with 5.1B solution tokens distilled from Qwen3-VL-235B-A22B-Thinking. Image+text inputs with detailed chain-of-thought annotations for math, science, and STEM visual reasoning. Apache-2.0.

3,708,608 rows·PARQUET·0 downloads

WikiText (WikiText-2 & WikiText-103) Language Modeling Corpus

Canonical language modeling benchmark: 100M+ tokens from verified Good/Featured Wikipedia articles. Includes WikiText-2 and WikiText-103 in raw and tokenized variants. Parquet format, CC-BY-SA 3.0.

3,231,786 rows·PARQUET·0 downloads

KILT: Knowledge-Intensive Language Tasks Benchmark

Facebook AI's KILT benchmark — 11 datasets across fact-checking, entity linking, slot filling, open-domain QA, and dialog generation, all grounded in a unified Wikipedia snapshot. MIT licensed, parquet format, 1M-10M examples.

21,239,576 images·PARQUET·0 downloads

LLaVA-OneVision-1.5 Instruction Data (22M Multimodal SFT)

22M curated instruction-tuning examples (image+text) used to train LLaVA-OneVision-1.5 LMMs. Apache-2.0, covers VQA, captioning, OCR, reasoning, and more.

1,999,563,091 rows·PARQUET·1 download

FineWeb-Edu Translated (36 Languages, 960B+ Tokens)

Machine-translated FineWeb-Edu corpus covering 36.7M aligned documents across 36 European languages — 960B+ tokens for multilingual LLM pretraining and translation training.

8,051,212 rows·PARQUET·0 downloads

People's Speech — 30,000+ Hours of Transcribed English Speech (MLCommons)

One of the world's largest open English ASR corpora: 30,000+ hours of transcribed speech under CC-BY/CC-BY-SA. Built by MLCommons for training and evaluating speech-to-text systems.

1,112,939 rows·PARQUET·0 downloads

MS MARCO — Question Answering & Passage Ranking

Microsoft's MS MARCO dataset: 1M+ real Bing questions with human-generated answers and passage relevance judgments. Foundational benchmark for retrieval, RAG, and QA systems.