Wikimedia Wikipedia (All Languages, Cleaned)

Name: Wikimedia Wikipedia (All Languages, Cleaned)
Creator: DataBazaar
Keywords: wikimedia/wikipedia, wikipedia, multilingual, pretraining, rag, nlp, text, language-modeling, parquet, cc-by-sa

About this data

Cleaned full-text Wikipedia articles across 300+ language subsets, built from official Wikimedia dumps. The data is stored in Parquet format with one row per article. It is intended as a corpus for LLM pretraining, RAG, and multilingual NLP.

Schema

Name	Type	Description
id	VARCHAR	Numeric Wikipedia page identifier as a string.
url	VARCHAR	Canonical HTTPS URL of the article on its language edition of Wikipedia.
title	VARCHAR	Article title in the source language.
text	VARCHAR	Cleaned plain-text article body with markup, references, and non-prose sections removed.
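
As a minimal sketch of how a consumer might check rows against the schema above, the following uses only the Python standard library. The `Article` class, the `parse_row` helper, and the example row are hypothetical names and values introduced here for illustration; they are not part of the dataset or of any DataBazaar tooling. It assumes each row has already been loaded into a plain dict keyed by column name.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Column names taken from the schema table above.
COLUMNS = ("id", "url", "title", "text")


@dataclass(frozen=True)
class Article:
    """Hypothetical container for one row; not part of the dataset itself."""
    id: str     # numeric page identifier, kept as a string
    url: str    # canonical HTTPS URL on the language edition
    title: str  # article title in the source language
    text: str   # cleaned plain-text article body


def parse_row(row: dict) -> Article:
    """Validate one row (column name -> value) against the schema and wrap it."""
    missing = [c for c in COLUMNS if c not in row]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    if not all(isinstance(row[c], str) for c in COLUMNS):
        raise TypeError("every column is VARCHAR, so every value must be a str")
    # The schema describes id as a numeric identifier stored as a string.
    if not (row["id"].isascii() and row["id"].isdigit()):
        raise ValueError(f"id is not a numeric string: {row['id']!r}")
    # The schema describes url as an HTTPS URL.
    if urlparse(row["url"]).scheme != "https":
        raise ValueError(f"url is not HTTPS: {row['url']!r}")
    return Article(**{c: row[c] for c in COLUMNS})


# Hypothetical row, invented for illustration only (not taken from the dataset).
example = parse_row({
    "id": "12345",
    "url": "https://en.wikipedia.org/wiki/Example",
    "title": "Example",
    "text": "Example body text.",
})
print(int(example.id), example.title)
```

Keeping `id` as a string mirrors the VARCHAR column type; converting with `int()` is left to the caller for cases where numeric ordering matters.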

Price: Free (open dataset)
Quality: No ratings
Downloads: 1
Seller: DataBazaar

For AI Agents

Via MCP Server

# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "Wikimedia Wikipedia (All Langu" })
// Found: 5c650553-39b6-4bea-8f0b-d127ea5c8dd0
get_download_url({ dataset_id: "5c650553-39b6-4bea-8f0b-d127ea5c8dd0" })  // free — no API key needed

Via REST API

# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/5c650553-39b6-4bea-8f0b-d127ea5c8dd0/download-url
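
The same request can be sketched in Python with the standard library. The base URL, path, and dataset ID are copied from the curl command above; the function names `download_url_endpoint` and `fetch_download_response` are hypothetical. The response format is not documented on this page, so the sketch returns the raw body as text rather than assuming a particular structure.

```python
import urllib.request

# Base URL and dataset ID copied from the curl command above.
API_BASE = "https://api.databazaar.io"
DATASET_ID = "5c650553-39b6-4bea-8f0b-d127ea5c8dd0"


def download_url_endpoint(dataset_id: str) -> str:
    """Build the download-url endpoint for a dataset ID."""
    return f"{API_BASE}/datasets/{dataset_id}/download-url"


def fetch_download_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    Assumption: the body is text; undecodable bytes are replaced.
    The body is not parsed because its format is not stated on this page.
    """
    endpoint = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `print(fetch_download_response(DATASET_ID))` would perform the request; no API key header is sent, matching the note above that this free dataset does not need one.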