FineWeb2: Multilingual Web Pretraining Corpus (1000+ Languages)

Creator: DataBazaar
Keywords: HuggingFaceFW/fineweb-2, pretraining, multilingual, llm, web-crawl, commoncrawl, huggingface, odc-by, fineweb

About this data

FineWeb2 is the second iteration of Hugging Face's FineWeb dataset. It covers 1000+ languages and provides high-quality filtered web text for LLM pretraining. The dataset is licensed under ODC-By 1.0 and was validated through hundreds of ablation experiments.

Schema

Name	Type	Description
text	VARCHAR	Document text content in detected language
id	VARCHAR	Unique document identifier (UUID URN format)
dump	VARCHAR	CommonCrawl dump identifier (CC-MAIN-YYYY-XX format)
url	VARCHAR	Source URL of the document
date	VARCHAR	ISO 8601 crawl timestamp
file_path	VARCHAR	S3 path to document in CommonCrawl WARC archive
language	VARCHAR	ISO 639-3 language code
language_score	DOUBLE	Language detection confidence score (0.0–1.0)
language_script	VARCHAR	ISO 15924 script code for detected writing system
minhash_cluster_size	BIGINT	Number of documents in the MinHash deduplication cluster this document belongs to
top_langs	VARCHAR	JSON object of top language candidates with confidence scores
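
The schema above can be exercised with a short Python sketch. It assumes each row arrives as a plain dict keyed by the column names, that dump follows the CC-MAIN-YYYY-XX pattern described in the table, and that top_langs decodes to a flat JSON object mapping a language label to a score. The helper names, the 0.8 threshold, and the example row are hypothetical and chosen only for illustration; the exact key format inside top_langs is an assumption.

```python
import json
import re
from typing import Any

# Pattern for the dump column as described in the schema (CC-MAIN-YYYY-XX).
DUMP_PATTERN = re.compile(r"^CC-MAIN-\d{4}-\d{2}$")


def keep_record(record: dict[str, Any], min_score: float = 0.8) -> bool:
    """Illustrative filter: well-formed dump id and confident language detection."""
    if not DUMP_PATTERN.match(record["dump"]):
        return False
    return record["language_score"] >= min_score


def top_language(record: dict[str, Any]) -> str:
    """Decode top_langs (JSON stored in a VARCHAR) and return the best-scoring key."""
    candidates = json.loads(record["top_langs"])
    return max(candidates, key=candidates.get)


# Hypothetical row, invented for illustration; not taken from the dataset.
example = {
    "text": "Bonjour le monde",
    "id": "<urn:uuid:00000000-0000-0000-0000-000000000000>",
    "dump": "CC-MAIN-2024-10",
    "url": "https://example.com/",
    "date": "2024-03-01T00:00:00Z",
    "file_path": "s3://example-bucket/example.warc.gz",
    "language": "fra",
    "language_score": 0.99,
    "language_script": "Latn",
    "minhash_cluster_size": 3,
    "top_langs": '{"fra_Latn": 0.99, "ita_Latn": 0.01}',
}

print(keep_record(example), top_language(example))
```

A threshold on language_score is only one possible filter; minhash_cluster_size could be used in the same way to drop or down-weight heavily duplicated documents.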

Price: Free (open dataset)

For AI Agents

Via MCP Server

1. Add the server to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

2. Your agent can then call:
search_datasets({ query: "FineWeb2: Multilingual Web Pre" })
// Found: bb3745f0-b140-4136-a72f-e290a474dad7
get_download_url({ dataset_id: "bb3745f0-b140-4136-a72f-e290a474dad7" })  // free, no API key needed

Via REST API

# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/bb3745f0-b140-4136-a72f-e290a474dad7/download-url
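
The same request can be made from Python using only the standard library. This is a minimal sketch rather than an official client: the endpoint path is copied from the curl command above, the function names are hypothetical, and because the listing does not document the response format, the body is returned as raw text instead of being parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "bb3745f0-b140-4136-a72f-e290a474dad7"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint shown in the curl example."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    The response schema is not documented in the listing, so no parsing
    is attempted here.
    """
    endpoint = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (performs a network request):
# print(fetch_download_url_response(DATASET_ID))
```

Keeping the URL construction separate from the network call makes the endpoint easy to check without contacting the API.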