Tags: text, Skylion007/openwebtext, language-modeling, pretraining, web-text, english, gpt-2, nlp, cc0, parquet
OpenWebText — Open Replication of GPT-2's WebText Corpus
About this data
An open-source replication of OpenAI's WebText, the corpus used to train GPT-2: about 8M English web documents (~13.5 GB) in Parquet format. CC0 licensed and widely used for language-model pretraining and research.
Schema
| Name | Type | Description |
|---|---|---|
| text | VARCHAR | Full plain-text content of a web document scraped from Reddit-shared URLs |
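Since the schema is a single `text` column, one way to inspect a downloaded shard is to stream that column in batches rather than load the whole file. The following is a minimal sketch, assuming `pyarrow` is installed; the helper names `iter_texts` and `summarize` and the shard filename in the usage note are hypothetical and not part of the listing.

```python
from typing import Dict, Iterable, Iterator


def iter_texts(parquet_path: str, batch_size: int = 1024) -> Iterator[str]:
    """Stream the `text` column from one Parquet file, batch by batch."""
    # Assumption: pyarrow is available. It is imported lazily so that
    # `summarize` below can be used without it.
    import pyarrow.parquet as pq

    parquet_file = pq.ParquetFile(parquet_path)
    # Read only the `text` column, the sole column in the schema above.
    for batch in parquet_file.iter_batches(batch_size=batch_size, columns=["text"]):
        for text in batch.to_pydict()["text"]:
            yield text


def summarize(texts: Iterable[str]) -> Dict[str, float]:
    """Count documents and compute their mean length in characters."""
    count = 0
    total_chars = 0
    for text in texts:
        count += 1
        total_chars += len(text)
    mean_chars = total_chars / count if count else 0.0
    return {"documents": count, "mean_chars": mean_chars}
```

A call such as `summarize(iter_texts("openwebtext-shard.parquet"))` would then report the document count and mean length for that file, where the filename stands in for whatever the download actually provides.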
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "OpenWebText — Open Replication" })
// Found: d2f7287a-dfd1-460b-9330-be8efd5f85ba
get_download_url({ dataset_id: "d2f7287a-dfd1-460b-9330-be8efd5f85ba" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/d2f7287a-dfd1-460b-9330-be8efd5f85ba/download-url
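The same REST request can be made from a script. The sketch below uses only the Python standard library and the endpoint shown in the curl command. The function names are hypothetical, and because the listing does not document the response format, the body is returned as raw text rather than parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "d2f7287a-dfd1-460b-9330-be8efd5f85ba"


def download_url_endpoint(dataset_id: str, api_base: str = API_BASE) -> str:
    """Build the download-url endpoint used in the curl command."""
    return f"{api_base.rstrip('/')}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body.

    Assumption: the response shape is not documented in the listing,
    so the body is returned undecoded beyond UTF-8 text.
    """
    endpoint = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(endpoint, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_download_url_response(DATASET_ID)` performs the network request; inspect the returned text before relying on any particular field in it.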