Tags: text, HuggingFaceFW/fineweb-edu, llm-pretraining, educational, web-crawl, commoncrawl, english, parquet, fineweb, rag, odc-by, trillion-token
FineWeb-Edu: 1.3T Tokens of Educational Web Content
About this data
1.3 trillion tokens of educational web pages, filtered from FineWeb with a classifier trained on annotations generated by Llama3-70B-Instruct. Distributed in Parquet format under the ODC-By license, and suited to LLM pretraining and RAG.
Schema
| Name | Type | Description |
|---|---|---|
| text | VARCHAR | Cleaned text content extracted from web page |
| id | VARCHAR | Unique document identifier (UUID format) |
| dump | VARCHAR | CommonCrawl dump identifier (e.g., CC-MAIN-2024-10) |
| url | VARCHAR | Source URL of the document |
| file_path | VARCHAR | Path within CommonCrawl WARC archive |
| language | VARCHAR | Detected language code (en) |
| language_score | DOUBLE | Language detection confidence score (0.0–1.0) |
| token_count | BIGINT | Token count using GPT-2 tokenizer |
| score | DOUBLE | Educational quality classifier score (0.0–5.0) |
| int_score | BIGINT | Classifier score rounded to the nearest integer (0–5); FineWeb-Edu keeps documents with int_score of 3 or higher |
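Because Parquet is columnar, a reader only needs to load the columns it uses. The sketch below assumes rows have already been read into plain dictionaries keyed by the column names in the table above. It keeps rows whose `language_score` clears a threshold and stops once a token budget would be exceeded. The function name `take_token_budget`, the 0.9 threshold, and the sample rows are illustrative assumptions, not part of the dataset or of any library.

```python
from typing import Iterable, Iterator, Mapping

# Columns this sketch reads; the names come from the schema table.
NEEDED_COLUMNS = ("dump", "language_score", "token_count")


def take_token_budget(
    rows: Iterable[Mapping[str, object]],
    budget: int,
    min_language_score: float = 0.9,  # illustrative threshold, not a dataset default
) -> Iterator[Mapping[str, object]]:
    """Yield rows in order until the next kept row would exceed `budget` tokens."""
    used = 0
    for row in rows:
        # Skip rows whose language detection confidence is too low.
        if row["language_score"] < min_language_score:
            continue
        # Stop at the first kept row that would push the total past the budget.
        if used + row["token_count"] > budget:
            break
        used += row["token_count"]
        yield row


# Hand-written placeholder rows, not real records from the dataset.
sample_rows = [
    {"dump": "CC-MAIN-2024-10", "language_score": 0.98, "token_count": 400},
    {"dump": "CC-MAIN-2024-10", "language_score": 0.62, "token_count": 900},
    {"dump": "CC-MAIN-2024-10", "language_score": 0.95, "token_count": 500},
    {"dump": "CC-MAIN-2024-10", "language_score": 0.97, "token_count": 300},
]

picked = list(take_token_budget(sample_rows, budget=1000))
total_tokens = sum(row["token_count"] for row in picked)
print(len(picked), total_tokens)
```

The generator form lets the same function run over a streamed file without holding every row in memory.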
Free
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "FineWeb-Edu: 1.3T Tokens of Ed" })
// Found: 05a9dc25-aec0-475b-aa49-30ba535a277e
get_download_url({ dataset_id: "05a9dc25-aec0-475b-aa49-30ba535a277e" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/05a9dc25-aec0-475b-aa49-30ba535a277e/download-url
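The same REST request can be made from Python with only the standard library. This is a minimal sketch: the host, path, and dataset ID are copied from the curl command, while the helper names `download_url_endpoint` and `fetch_download_url_response` are hypothetical. The listing does not document the response format, so the sketch returns the raw body rather than parsing it.

```python
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.databazaar.io"  # host taken from the curl command
DATASET_ID = "05a9dc25-aec0-475b-aa49-30ba535a277e"  # ID shown in this listing


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint for a dataset ID (hypothetical helper)."""
    # Percent-encode the ID so unexpected characters cannot alter the path.
    return f"{base}/datasets/{quote(dataset_id, safe='')}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    The response format is an unknown here, so nothing is parsed.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access.
endpoint = download_url_endpoint(DATASET_ID)
print(endpoint)

# To perform the request itself:
# body = fetch_download_url_response(DATASET_ID)
```

No API key header is sent, matching the note above that free datasets do not require one.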