FineWeb-Edu: 1.3T Tokens of Educational Web Content

Name: FineWeb-Edu: 1.3T Tokens of Educational Web Content
Creator: DataBazaar
Keywords: HuggingFaceFW/fineweb-edu, llm-pretraining, educational, web-crawl, commoncrawl, english, parquet, fineweb, rag, odc-by, trillion-token

About this data

1.3 trillion tokens of educational web pages, filtered from FineWeb with a classifier trained on annotations generated by Llama3-70B-Instruct. Distributed as Parquet under the ODC-By license and suited to LLM pretraining and RAG.

Schema

Name	Type	Description
text	VARCHAR	Cleaned text content extracted from web page
id	VARCHAR	Unique document identifier (UUID format)
dump	VARCHAR	CommonCrawl dump identifier (e.g., CC-MAIN-2024-10)
url	VARCHAR	Source URL of the document
file_path	VARCHAR	Path within CommonCrawl WARC archive
language	VARCHAR	Detected language code (en)
language_score	DOUBLE	Language detection confidence score (0.0–1.0)
token_count	BIGINT	Token count using GPT-2 tokenizer
score	DOUBLE	Educational quality classifier score (raw float, roughly on a 0–5 scale)
int_score	BIGINT	Classifier score rounded to an integer; used for filtering
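
As a rough illustration of how these columns might be used together, the sketch below filters rows on `int_score` and `language_score` and totals `token_count` over what remains. The rows are made-up stand-ins shaped like the schema, and the names `Row`, `select_rows`, `total_tokens` and both thresholds are assumptions for this example, not part of the dataset or any DataBazaar tooling.

```python
from typing import Iterable, TypedDict


class Row(TypedDict):
    """Subset of the schema columns used in this sketch."""
    id: str
    language_score: float
    token_count: int
    int_score: int


def select_rows(
    rows: Iterable[Row], min_int_score: int, min_language_score: float
) -> list[Row]:
    """Keep rows at or above both caller-chosen thresholds."""
    return [
        r
        for r in rows
        if r["int_score"] >= min_int_score
        and r["language_score"] >= min_language_score
    ]


def total_tokens(rows: Iterable[Row]) -> int:
    """Sum the token_count column over the given rows."""
    return sum(r["token_count"] for r in rows)


# Made-up rows for illustration only; real values come from the Parquet files.
sample: list[Row] = [
    {"id": "doc-a", "language_score": 0.98, "token_count": 512, "int_score": 4},
    {"id": "doc-b", "language_score": 0.71, "token_count": 300, "int_score": 3},
    {"id": "doc-c", "language_score": 0.95, "token_count": 128, "int_score": 2},
]

kept = select_rows(sample, min_int_score=3, min_language_score=0.9)
print([r["id"] for r in kept], total_tokens(kept))
```

On real data the same predicate would more likely be pushed down into a Parquet reader rather than applied to Python dictionaries; the dictionaries here only keep the example free of dependencies.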

Free

Open dataset

Seller: DataBazaar

For AI Agents

Via MCP Server

# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "FineWeb-Edu: 1.3T Tokens of Ed" })
// Found: 05a9dc25-aec0-475b-aa49-30ba535a277e
get_download_url({ dataset_id: "05a9dc25-aec0-475b-aa49-30ba535a277e" })  // free — no API key needed
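
If the agent's MCP config file already lists other servers, the `databazaar` entry from step 1 has to be merged in rather than pasted over the file. A minimal sketch of that merge follows; the helper name `add_databazaar_server` and the `other-server` entry are hypothetical, and only the `databazaar` entry itself comes from the listing above.

```python
import json


def add_databazaar_server(config: dict) -> dict:
    """Return a copy of an MCP config dict with the databazaar entry added."""
    merged = json.loads(json.dumps(config))  # deep copy via a JSON round-trip
    servers = merged.setdefault("mcpServers", {})
    # Entry copied from step 1 of the listing.
    servers["databazaar"] = {"command": "npx", "args": ["databazaar-mcp"]}
    return merged


# Hypothetical pre-existing config with one unrelated server.
existing = {
    "mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}
}

updated = add_databazaar_server(existing)
print(json.dumps(updated, indent=2))
```

The input dictionary is left unmodified, so the original config can be kept as a backup before the merged result is written to disk.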

Via REST API

# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/05a9dc25-aec0-475b-aa49-30ba535a277e/download-url
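
The same request could be made from Python with the standard library. This is a sketch under assumptions: the listing does not document the response format, so `fetch_download_info` (a hypothetical helper name) returns parsed JSON when the body is JSON and the raw text otherwise. Only the endpoint URL and dataset ID are taken from the curl command above.

```python
import json
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "05a9dc25-aec0-475b-aa49-30ba535a277e"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint shown in the curl example."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_info(dataset_id: str, timeout: float = 30.0):
    """GET the endpoint without an API key.

    The response shape is an unknown here, so this returns parsed JSON
    when possible and falls back to the raw response text.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


# Building the URL needs no network access.
print(download_url_endpoint(DATASET_ID))
```

Calling `fetch_download_info(DATASET_ID)` performs the actual network request; it is left out of the top-level script so the example runs offline.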