Common Corpus — 2.27T Tokens of Permissibly-Licensed Text

Name: Common Corpus — 2.27T Tokens of Permissibly-Licensed Text
Creator: DataBazaar
Keywords: PleIAs/common_corpus, llm-pretraining, multilingual, open-license, text-corpus, parquet, books, scientific, legal, code, rag

About this data

Common Corpus is described as the largest open, permissibly-licensed text dataset: 2.27 trillion tokens spanning books, newspapers, scientific articles, legal documents, code, and more, in 13+ languages. It was built by PleIAs for LLM pretraining.

Schema

Name	Type	Description
identifier	VARCHAR	Unique hash or ID for the document/passage
collection	VARCHAR	Sub-corpus category (e.g., French Open Data, scientific, legal, code)
open_type	VARCHAR	Classification of openness (e.g., Open Government, Creative Commons)
curator	VARCHAR	Organization responsible for curation (e.g., Pleias)
license	VARCHAR	Permissive license type of the source document
date	BIGINT	Publication or source date as a Unix timestamp, or null if unavailable
title	VARCHAR	Document title or filename
creator	VARCHAR	Author, institution, or upstream source
language	VARCHAR	Human language name
language_type	VARCHAR	Medium of text (Written or Spoken)
word_count	BIGINT	Number of words in the document
token_count	BIGINT	Approximate token count using standard tokenization
text	VARCHAR	Full document or passage content
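
As a rough illustration of how the columns above might be used, the sketch below mirrors the schema in a Python dataclass and sums token_count per language. The sample rows are hypothetical values made up for the example, not records from the dataset, and the helper name tokens_by_language is an assumption introduced here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class CorpusRecord:
    """One row, with fields named after the schema table."""
    identifier: str
    collection: str
    open_type: str
    curator: str
    license: str
    date: Optional[int]  # BIGINT Unix timestamp, or None when unavailable
    title: str
    creator: str
    language: str
    language_type: str
    word_count: int
    token_count: int
    text: str


def tokens_by_language(records: Iterable[CorpusRecord]) -> Counter:
    """Sum the token_count column for each value of the language column."""
    totals: Counter = Counter()
    for record in records:
        totals[record.language] += record.token_count
    return totals


# Hypothetical rows for illustration only; not taken from the dataset.
sample = [
    CorpusRecord("doc-1", "legal", "Open Government", "Pleias", "Public Domain",
                 None, "Example statute", "Example agency",
                 "French", "Written", 3, 5, "un deux trois"),
    CorpusRecord("doc-2", "scientific", "Creative Commons", "Pleias", "CC-BY",
                 0, "Example paper", "Example author",
                 "English", "Written", 4, 7, "one two three four"),
    CorpusRecord("doc-3", "legal", "Open Government", "Pleias", "Public Domain",
                 None, "Example decree", "Example agency",
                 "French", "Written", 2, 4, "quatre cinq"),
]

totals = tokens_by_language(sample)
print(dict(totals))
```

The same aggregation could be run over rows read from the Parquet files with whichever reader you prefer; the dataclass only fixes the column names and the nullable date.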

Free

Open dataset

Seller: DataBazaar

For AI Agents

Via MCP Server

# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "Common Corpus — 2.27T Tokens o" })
// Found: 4252ca2b-9cc3-4573-9616-e393bdaffab7
get_download_url({ dataset_id: "4252ca2b-9cc3-4573-9616-e393bdaffab7" })  // free — no API key needed
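
If the agent already has an MCP config with other servers, the entry from step 1 has to be merged in rather than pasted over the file. The following is a minimal sketch of that merge in Python. The function name add_databazaar_server and the pre-existing "other" server are hypothetical; only the "databazaar" entry itself comes from the snippet above.

```python
import json
from copy import deepcopy


def add_databazaar_server(config: dict) -> dict:
    """Return a copy of config with the databazaar MCP server entry added."""
    merged = deepcopy(config)  # leave the caller's dict untouched
    servers = merged.setdefault("mcpServers", {})
    servers["databazaar"] = {"command": "npx", "args": ["databazaar-mcp"]}
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "node", "args": ["other-server.js"]}}}
updated = add_databazaar_server(existing)
print(json.dumps(updated, indent=2))
```

Writing the result back to the config file (and where that file lives) depends on the agent and is left out here.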

Via REST API

# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/4252ca2b-9cc3-4573-9616-e393bdaffab7/download-url
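
The same request can be made from Python with the standard library. This is a sketch under assumptions: the endpoint path and dataset ID are taken from the curl line above, but the response format is not documented on this page, so the function returns the raw body as text instead of parsing it. The function names are introduced here for illustration.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "4252ca2b-9cc3-4573-9616-e393bdaffab7"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint for a dataset ID."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    The response format is an unknown here, so nothing is parsed.
    """
    endpoint = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(endpoint, timeout=timeout) as response:
        return response.read().decode("utf-8")


endpoint = download_url_endpoint(DATASET_ID)
print(endpoint)
# To perform the request (needs network access):
# print(fetch_download_url_response(DATASET_ID))
```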