Source: PleIAs/common_corpus
Tags: llm-pretraining, multilingual, open-license, text-corpus, parquet, books, scientific, legal, code, rag

Common Corpus — 2.27T Tokens of Permissibly-Licensed Text

Category: Text
Records: 69,907 rows
Format: Parquet
Update Frequency: One-time snapshot
Collection Method: Uploaded
PII: None detected
File Size: ~410.04 MB
Downloads: 0

About this data

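Common Corpus is presented as the largest open, permissibly licensed text dataset: 2.27 trillion tokens spanning books, newspapers, scientific articles, legal documents, code, and more, in 13+ languages. It was built by PleIAs for LLM pretraining. The file listed here holds 69,907 rows (~410.04 MB), so it appears to be a sample of the corpus rather than the full 2.27 trillion tokens.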

Schema

| Name | Type | Description |
|---|---|---|
| identifier | VARCHAR | Unique hash or ID for the document/passage |
| collection | VARCHAR | Sub-corpus category (e.g., French Open Data, scientific, legal, code) |
| open_type | VARCHAR | Classification of openness (e.g., Open Government, Creative Commons) |
| curator | VARCHAR | Organization responsible for curation (e.g., Pleias) |
| license | VARCHAR | Permissive license type of the source document |
| date | BIGINT | Publication or source date as a Unix timestamp, or null if unavailable |
| title | VARCHAR | Document title or filename |
| creator | VARCHAR | Author, institution, or upstream source |
| language | VARCHAR | Human language name |
| language_type | VARCHAR | Medium of text (Written or Spoken) |
| word_count | BIGINT | Number of words in the document |
| token_count | BIGINT | Approximate token count using standard tokenization |
| text | VARCHAR | Full document or passage content |
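
As a rough illustration of how these columns might be used, the Python sketch below converts the nullable `date` field and totals `token_count` per `language`. It uses only the standard library and a few invented rows shaped like the schema. The assumption that `date` is expressed in seconds is ours; the listing does not state the unit.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_date(value):
    """Convert the `date` column to an aware UTC datetime, or None when null.

    Assumption: the timestamp is in seconds (the unit is not documented).
    """
    if value is None:
        return None
    # Adding a timedelta also handles pre-1970 (negative) timestamps portably.
    return EPOCH + timedelta(seconds=value)


def tokens_by_language(rows):
    """Sum `token_count` per `language` over an iterable of row dicts."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["language"]] += row["token_count"]
    return dict(totals)


# Hypothetical rows shaped like the schema (values invented for illustration).
rows = [
    {"identifier": "a1", "language": "French", "date": 0, "token_count": 120},
    {"identifier": "b2", "language": "English", "date": None, "token_count": 300},
    {"identifier": "c3", "language": "French", "date": -86400, "token_count": 80},
]

print(tokens_by_language(rows))
print([parse_date(r["date"]) for r in rows])
```

With the real file, rows of this shape could be obtained from a Parquet reader, for example `pyarrow.parquet.read_table(path).to_pylist()`, where `path` is whatever local filename the download was saved under.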

Price: Free (open dataset)
Seller: DataBazaar

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "Common Corpus — 2.27T Tokens o" })
// Found: 4252ca2b-9cc3-4573-9616-e393bdaffab7
get_download_url({ dataset_id: "4252ca2b-9cc3-4573-9616-e393bdaffab7" })  // free — no API key needed
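
If the agent's config file already lists other MCP servers, the entry from step 1 has to be merged in rather than pasted over the file. A minimal Python sketch of that merge follows; the function name and the pre-existing `other-tool` server are hypothetical, and only the `databazaar` entry comes from the listing.

```python
import json

# Entry copied from step 1 of the listing.
DATABAZAAR_SERVER = {"command": "npx", "args": ["databazaar-mcp"]}


def add_databazaar_server(config):
    """Return a copy of `config` with the databazaar entry under mcpServers."""
    merged = dict(config)
    # Copy the nested mapping so the caller's config is left untouched.
    servers = dict(merged.get("mcpServers", {}))
    servers["databazaar"] = DATABAZAAR_SERVER
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server already registered.
existing = {"mcpServers": {"other-tool": {"command": "other-tool", "args": []}}}
updated = add_databazaar_server(existing)
print(json.dumps(updated, indent=2))
```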

Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/4252ca2b-9cc3-4573-9616-e393bdaffab7/download-url
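
The same request can be made from Python with the standard library. This is a sketch only: the helper names are ours, and because the listing does not document the response format, the body is returned as raw text rather than parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "4252ca2b-9cc3-4573-9616-e393bdaffab7"


def download_url_endpoint(dataset_id, base=API_BASE):
    """Build the download-url endpoint shown in the curl example."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id, timeout=30):
    """GET the endpoint and return the raw response body as text.

    Assumption: the body is UTF-8 text; its structure is not documented here.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


if __name__ == "__main__":
    print(download_url_endpoint(DATASET_ID))
    # Uncomment to perform the actual network request:
    # print(fetch_download_url_response(DATASET_ID))
```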