FinePDFs — 3T Tokens from 475M PDFs in 1,733 Languages

Creator: DataBazaar
Keywords: HuggingFaceFW/finepdfs, llm-pretraining, pdfs, multilingual, common-crawl, huggingfacefw, odc-by, text-corpus, rag, low-resource-languages, parquet

About this data

FinePDFs is presented as the largest public corpus sourced from PDFs: roughly 3 trillion tokens across 475M documents in 1,733 languages. It is distributed in Parquet format under the ODC-By license and was built by HuggingFaceFW for LLM pretraining and multilingual research.

Schema

Name	Type	Description
text	VARCHAR	Extracted PDF body text content
id	VARCHAR	Unique document identifier (UUID URN format)
dump	VARCHAR	Common Crawl snapshot identifier (e.g., CC-MAIN-2019-04)
url	VARCHAR	Source URL of the PDF document
date	VARCHAR	ISO 8601 timestamp of when the PDF was crawled
file_path	VARCHAR	S3 path to source WARC file in Common Crawl
offset	BIGINT	Byte offset position within the WARC file
token_count	BIGINT	Approximate token count of document text
language	VARCHAR	ISO 639-3 language code with script (e.g., iba_Latn)
page_average_lid	VARCHAR	Most common language per page (ISO 639-3 with script)
page_average_lid_score	DOUBLE	Language identification confidence score (0-1 range)
full_doc_lid	VARCHAR	Detected language for entire document (ISO 639-3 with script)
full_doc_lid_score	DOUBLE	Language identification confidence for full document (0-1)
per_page_languages	VARCHAR[]	Array of detected languages per page (ISO 639-3 with script)
is_truncated	BOOLEAN	Whether the document text was truncated during extraction
extractor	VARCHAR	PDF extraction pipeline used (e.g., rolmOCR, Docling)
page_ends	BIGINT[]	Array of token offsets marking end positions of each page
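
The schema fields can be combined to select documents for a given language. The following is a minimal sketch, assuming each row has already been loaded into a Python dict keyed by the column names above. The sample rows, their ids, the `eng_Latn` code, and the 0.8 score threshold are made-up placeholders for illustration, not real FinePDFs records or recommended settings. The helper names are hypothetical, and `page_token_lengths` takes the schema's description of `page_ends` as token offsets at face value.

```python
def split_language_code(code: str) -> tuple[str, str]:
    """Split a code such as 'iba_Latn' into (language, script)."""
    lang, _, script = code.partition("_")
    return lang, script


def page_token_lengths(page_ends: list[int]) -> list[int]:
    """Turn cumulative page end offsets into per-page lengths."""
    lengths = []
    previous_end = 0
    for end in page_ends:
        lengths.append(end - previous_end)
        previous_end = end
    return lengths


def select_documents(rows: list[dict], language: str, min_score: float = 0.8) -> list[dict]:
    """Keep untruncated rows in `language` whose full-document LID score is high enough."""
    return [
        row for row in rows
        if row["language"] == language
        and row["full_doc_lid_score"] >= min_score
        and not row["is_truncated"]
    ]


# Placeholder rows for illustration only (not real dataset records).
SAMPLE_ROWS = [
    {"id": "doc-a", "language": "iba_Latn", "full_doc_lid_score": 0.97,
     "is_truncated": False, "token_count": 1200, "page_ends": [400, 900, 1200]},
    {"id": "doc-b", "language": "iba_Latn", "full_doc_lid_score": 0.41,
     "is_truncated": False, "token_count": 300, "page_ends": [300]},
    {"id": "doc-c", "language": "eng_Latn", "full_doc_lid_score": 0.99,
     "is_truncated": True, "token_count": 5000, "page_ends": [2500, 5000]},
]

kept = select_documents(SAMPLE_ROWS, "iba_Latn")
total_tokens = sum(row["token_count"] for row in kept)
print([row["id"] for row in kept], total_tokens)
```

Filtering on `full_doc_lid_score` and `is_truncated` before summing `token_count` gives a rough token budget for a language subset without reading the `text` column.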

For AI Agents

Via MCP Server

# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "FinePDFs — 3T Tokens from 475M" })
// Found: 4ba1459c-4532-483d-bb8f-89220d8a625e
get_download_url({ dataset_id: "4ba1459c-4532-483d-bb8f-89220d8a625e" })  // free — no API key needed
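
If the agent already has an MCP config with other servers, the `databazaar` entry from step 1 can be merged in rather than pasted over the file. A minimal sketch, assuming the config has been parsed into a dict. The `other-server` entry and the helper name are hypothetical, and reading or writing the actual config file is left out because its location depends on the client.

```python
import copy
import json

# Server entry copied from step 1 above.
DATABAZAAR_ENTRY = {"command": "npx", "args": ["databazaar-mcp"]}


def add_databazaar_server(config: dict) -> dict:
    """Return a copy of `config` with the databazaar MCP server added."""
    merged = copy.deepcopy(config)  # leave the caller's dict untouched
    servers = merged.setdefault("mcpServers", {})
    servers["databazaar"] = copy.deepcopy(DATABAZAAR_ENTRY)
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other-server": {"command": "node", "args": ["server.js"]}}}
updated = add_databazaar_server(existing)
print(json.dumps(updated, indent=2))
```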

Via REST API

# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/4ba1459c-4532-483d-bb8f-89220d8a625e/download-url
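
The same request can be made from Python with only the standard library. This is a sketch mirroring the curl call above. The function names are hypothetical, and because the listing does not document the response format, the body is returned as raw text rather than parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "4ba1459c-4532-483d-bb8f-89220d8a625e"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint used in the curl example."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the response body as text.

    The response schema is not documented in the listing, so no parsing is attempted.
    """
    with urllib.request.urlopen(download_url_endpoint(dataset_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (performs a network request):
#   print(fetch_download_url_response(DATASET_ID))
```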