Browse Data

1,140-task code generation benchmark from BigCode with both docstring-based completion and NL-instruction variants, 99% test coverage, Apache-2.0 licensed.

5,700 rows·PARQUET·0 downloads

text

Free·fixed price

WildChat-1M: 1 Million Real Human-ChatGPT Conversations

1M real-world conversations between human users and ChatGPT (GPT-3.5/4), with demographics, timestamps, languages, and toxicity labels. Parquet format, ODC-BY licensed. Widely used for instruction tuning, eval, and alignment research.

837,989 rows·PARQUET·0 downloads

text

Free·fixed price

HelpSteer3 — NVIDIA Human Preference & Feedback Dataset for RLHF/Reward Modeling

NVIDIA's open-source multilingual preference dataset for training reward models and aligning LLMs. CC-BY-4.0, 100K+ samples across 15 languages, used to train SOTA reward models on RM-Bench (85.5%) and JudgeBench (78.6%).

132,937 rows·PARQUET·0 downloads

text

Free·fixed price

Alpaca Cleaned — Instruction Fine-Tuning Dataset

Cleaned version of Stanford's Alpaca instruction-following dataset (~52K examples). Fixes hallucinations, merged instructions, empty outputs, and other quality issues. CC-BY-4.0, ready for LLM fine-tuning.

51,760 rows·PARQUET·0 downloads

text

Free·fixed price

OpenR1-Math-220k: Math Reasoning Traces from DeepSeek R1

220k math problems with verified DeepSeek R1 reasoning traces, sourced from NuminaMath 1.5. Apache-2.0 licensed dataset for training and evaluating mathematical reasoning models.

450,258 rows·PARQUET·0 downloads

scientific

Free·fixed price

ScienceQA: Multimodal Science Question Answering with Chain-of-Thought

21K multimodal multiple-choice science questions with images, lectures, and chain-of-thought explanations spanning natural science, social science, and language science. Widely used for VLM evaluation and CoT fine-tuning.

21,208 rows·PARQUET·0 downloads

text

Free·fixed price

Nemotron Content Safety Dataset V2 (Aegis 2.0)

33,416 annotated human-LLM interactions for content safety classification across 12+ harm categories. Used for training and evaluating LLM safety guardrails like NeMo Guard.

33,416 rows·PARQUET·0 downloads

text

Free·fixed price

Hermes Agent Reasoning Traces (Kimi-K2.5 & GLM-5.1)

14,700+ multi-turn agent tool-calling trajectories with step-by-step reasoning traces and real tool execution results, generated via the Hermes Agent harness from Kimi-K2.5 and GLM-5.1 models. Apache-2.0.

14,701 rows·PARQUET·0 downloads

text

Free·fixed price

UltraChat 200k (Zephyr SFT Dataset)

Filtered 200k-dialogue subset of UltraChat used to train Zephyr-7B-β. High-quality multi-turn ChatGPT-generated conversations for supervised fine-tuning of chat models. MIT licensed, Parquet format.

515,311 rows·PARQUET·0 downloads

text

Free·fixed price

NuminaMath-TIR: Tool-Integrated Reasoning Math Problems

~70k math problems with tool-integrated reasoning (TIR) traces generated via GPT-4, derived from NuminaMath-CoT. Apache-2.0 licensed, ideal for training math-reasoning agents with code execution.

72,540 rows·PARQUET·0 downloads

text

Free·fixed price

Full HH-RLHF (Prompt/Chosen/Rejected Format)

Anthropic's Helpful & Harmless RLHF dataset reformatted into prompt/chosen/rejected triples for preference modeling and DPO/RLHF training.

124,503 rows·PARQUET·0 downloads

text

Free·fixed price

Wikimedia Wikipedia (All Languages, Cleaned)

Cleaned full-text Wikipedia articles across 300+ language subsets, built from official Wikimedia dumps. Parquet format, one row per article. Foundational corpus for LLM pretraining, RAG, and multilingual NLP.

61,614,907 rows·PARQUET·0 downloads

text

Free·fixed price

SWE-bench: Real-World GitHub Issue Resolution Benchmark

2,294 Issue-PR pairs from 12 popular Python repos for evaluating LLM/agent ability to resolve real GitHub issues. Verified via post-PR unit tests. Canonical agent coding benchmark.

21,527 rows·PARQUET·1 download

text

Free·fixed price

C4: Colossal Clean Crawled Corpus (en + multilingual mC4)

Cleaned Common Crawl web text corpus from AllenAI/Google. 305GB English + 9.7TB multilingual (108 languages). Foundational pretraining dataset behind T5 and many open LLMs. ODC-BY licensed.

114,005,516 rows·PARQUET·0 downloads

text

Free·fixed price

FinePDFs — 3T Tokens from 475M PDFs in 1,733 Languages

Largest public PDF-sourced corpus: ~3 trillion tokens, 475M documents, 1,733 languages. Parquet format, ODC-By licensed. Built by HuggingFaceFW for pretraining and multilingual research.

476,178,356 rows·PARQUET·0 downloads

text

Free·fixed price

FineWeb-Edu: 1.3T Tokens of Educational Web Content

1.3 trillion tokens of high-quality educational web pages filtered from FineWeb using a classifier trained on Llama3-70B-Instruct annotations. Parquet format, ODC-By licensed, ideal for LLM pretraining and RAG.

3,496,736,741 rows·PARQUET·0 downloads

text

Free·fixed price

Common Corpus — 2.27T Tokens of Permissively Licensed Text

Largest open, permissively licensed text dataset: 2.27 trillion tokens across books, newspapers, scientific articles, legal docs, code, and more in 13+ languages. Built by PleIAs for LLM pretraining.

69,907 rows·PARQUET·0 downloads

text

Free·fixed price

MMLU — Massive Multitask Language Understanding Benchmark

57-subject multiple-choice benchmark (humanities, STEM, social sciences, professional) for evaluating LLM knowledge and reasoning. ~116K questions across dev/val/test splits. MIT licensed.

231,400 rows·PARQUET·0 downloads

text

Free·fixed price

FineWeb2: Multilingual Web Pretraining Corpus (1000+ Languages)

Second iteration of HuggingFace's FineWeb dataset, covering 1000+ languages with high-quality filtered web text for LLM pretraining. ODC-By 1.0 licensed, validated through hundreds of ablation experiments.

4,484,929,995 rows·PARQUET·0 downloads

text

Free·fixed price

Wikipedia 2024-06 Embeddings (BGE-M3, Multilingual)

Wikipedia paragraph embeddings from the June 2024 dump across 11 languages, generated with the multilingual BGE-M3 model. Ideal for multilingual RAG, semantic search, and retrieval evals.

11,800,000 rows·PARQUET·0 downloads

text

Free·fixed price

SWE-bench Pro

Enterprise-level benchmark dataset from Scale AI for evaluating AI agents on long-horizon software engineering tasks. Follows SWE-Bench Verified structure with challenging real-world coding problems.

731 rows·PARQUET·3 downloads

text

Free·fixed price

SQuAD 2.0 - Stanford Question Answering Dataset

Reading comprehension benchmark with 150K+ questions on Wikipedia articles, including 50K unanswerable questions. Standard for extractive QA model training and evaluation.

142,192 rows·PARQUET·0 downloads

text

Free·fixed price

MegaMath: 300B+ Token Open Math Pretraining Corpus

Largest open math-focused pretraining dataset (300B+ tokens) from LLM360, curated from Common Crawl, code, and synthetic sources for training math-capable LLMs.

217,499,877 rows·PARQUET·1 download

text

Free·fixed price