Tags: text, PleIAs/common_corpus, llm-pretraining, multilingual, open-license, text-corpus, parquet, books, scientific, legal, code, rag
Common Corpus — 2.27T Tokens of Permissibly-Licensed Text
About this data
Common Corpus is the largest open and permissibly licensed text dataset, with 2.27 trillion tokens across books, newspapers, scientific articles, legal documents, code, and more in 13+ languages. It was built by PleIAs for LLM pretraining.
Schema
| Name | Type | Description |
|---|---|---|
| identifier | VARCHAR | Unique hash or ID for the document/passage |
| collection | VARCHAR | Sub-corpus category (e.g., French Open Data, scientific, legal, code) |
| open_type | VARCHAR | Classification of openness (e.g., Open Government, Creative Commons) |
| curator | VARCHAR | Organization responsible for curation (e.g., Pleias) |
| license | VARCHAR | Permissive license type of the source document |
| date | BIGINT | Publication or source date as a Unix timestamp; null if unavailable |
| title | VARCHAR | Document title or filename |
| creator | VARCHAR | Author, institution, or upstream source |
| language | VARCHAR | Human language name |
| language_type | VARCHAR | Medium of text (Written or Spoken) |
| word_count | BIGINT | Number of words in the document |
| token_count | BIGINT | Approximate token count using standard tokenization |
| text | VARCHAR | Full document or passage content |
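The schema above maps naturally onto a typed record. The following is a minimal sketch, assuming rows have already been loaded as plain dictionaries keyed by the column names in the table. `CorpusRecord`, `from_row`, and `tokens_by_language` are hypothetical names introduced for illustration, not part of any PleIAs or DataBazaar API. The table does not say whether `date` is in seconds or milliseconds, so the sketch keeps it as a raw integer rather than guessing.

```python
from collections import defaultdict
from dataclasses import dataclass, fields
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class CorpusRecord:
    """One row of the dataset, mirroring the schema table (hypothetical helper)."""
    identifier: str
    collection: str
    open_type: str
    curator: str
    license: str
    date: Optional[int]  # raw Unix timestamp; unit is not stated in the schema
    title: str
    creator: str
    language: str
    language_type: str
    word_count: int
    token_count: int
    text: str

    @classmethod
    def from_row(cls, row: dict) -> "CorpusRecord":
        # Columns absent from the row become None; the schema only
        # documents null for `date`, so this is a defensive assumption.
        return cls(**{f.name: row.get(f.name) for f in fields(cls)})


def tokens_by_language(records: Iterable[CorpusRecord]) -> Dict[str, int]:
    """Sum token_count per language, skipping rows without a count."""
    totals: Dict[str, int] = defaultdict(int)
    for rec in records:
        if rec.token_count is not None:
            totals[rec.language] += rec.token_count
    return dict(totals)
```

Given a list of row dictionaries, `tokens_by_language(CorpusRecord.from_row(r) for r in rows)` returns a per-language token total, which is one way to check the language mix of a sample before committing to a full download.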
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
"mcpServers": {
"databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
}
}
# 2. Your agent can then call:
search_datasets({ query: "Common Corpus — 2.27T Tokens o" })
// Found: 4252ca2b-9cc3-4573-9616-e393bdaffab7
get_download_url({ dataset_id: "4252ca2b-9cc3-4573-9616-e393bdaffab7" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/4252ca2b-9cc3-4573-9616-e393bdaffab7/download-url
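The REST call above can also be made from a script. Below is a minimal Python sketch using only the standard library, assuming the endpoint answers a plain GET with no authentication, as the curl example suggests. The response format is not documented here, so the sketch tries JSON and falls back to the raw text. `build_download_url_request` and `fetch_download_info` are hypothetical helper names.

```python
import json
import urllib.request

API_BASE = "https://api.databazaar.io"  # host taken from the curl example
DATASET_ID = "4252ca2b-9cc3-4573-9616-e393bdaffab7"


def build_download_url_request(dataset_id: str) -> urllib.request.Request:
    """Build the GET request for the download-url endpoint (no API key header)."""
    url = f"{API_BASE}/datasets/{dataset_id}/download-url"
    return urllib.request.Request(url, method="GET")


def fetch_download_info(dataset_id: str, timeout: float = 30.0):
    """Call the endpoint and return the parsed body.

    Assumption: the body is JSON. If it does not parse, the raw text
    is returned so the caller can inspect it.
    """
    req = build_download_url_request(dataset_id)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body
```

Calling `fetch_download_info(DATASET_ID)` performs the network request. Building the request separately keeps the URL construction easy to inspect without touching the network.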