Tags: text, codeparrot/codeparrot-clean, python, code, github, pretraining, code-generation, deduplicated, language-model, fine-tuning
CodeParrot Clean — Deduplicated Python Code from GitHub
About this data
A cleaned, deduplicated corpus of Python files scraped from GitHub. Files were filtered on mean and maximum line length, on the fraction of alphanumeric characters, and on a heuristic check for autogenerated content. Widely used for pretraining and fine-tuning code language models.
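The filtering criteria above can be illustrated with a minimal Python sketch. It computes per-file statistics analogous to the `line_mean`, `line_max`, and `alpha_frac` columns and applies cutoffs to them. The function names and the threshold values are assumptions chosen for illustration; this listing does not state the cutoffs or the exact formulas used to build the dataset.

```python
def file_stats(content: str) -> dict:
    """Compute statistics analogous to line_mean, line_max and alpha_frac."""
    lines = content.splitlines()
    lengths = [len(line) for line in lines] or [0]
    alnum = sum(ch.isalnum() for ch in content)
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "alpha_frac": alnum / len(content) if content else 0.0,
    }


# Hypothetical thresholds, for illustration only (not taken from this listing).
MAX_LINE_MEAN = 100
MAX_LINE_MAX = 1000
MIN_ALPHA_FRAC = 0.25


def keep_file(content: str, autogenerated: bool = False) -> bool:
    """Return True if a file passes all of the illustrative filters."""
    stats = file_stats(content)
    return (
        not autogenerated
        and stats["line_mean"] <= MAX_LINE_MEAN
        and stats["line_max"] <= MAX_LINE_MAX
        and stats["alpha_frac"] >= MIN_ALPHA_FRAC
    )


sample = "import os\nprint(os.getcwd())\n"
print(file_stats(sample))
print(keep_file(sample))
```

Keeping the statistics separate from the decision makes it easy to swap in stricter or looser cutoffs when re-filtering the corpus for a particular training run.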
Schema
| Name | Type | Description |
|---|---|---|
| repo_name | VARCHAR | GitHub repository identifier in owner/repo format |
| path | VARCHAR | File path within the repository |
| copies | VARCHAR | Number of near-duplicate copies detected before deduplication (stored as text) |
| size | VARCHAR | File size in bytes (stored as text) |
| content | VARCHAR | Raw Python source code |
| license | VARCHAR | Detected license of the source repository |
| hash | BIGINT | Content hash used for deduplication |
| line_mean | DOUBLE | Average line length in characters |
| line_max | BIGINT | Maximum line length in characters |
| alpha_frac | DOUBLE | Fraction of alphanumeric characters in the file (0.0–1.0) |
| autogenerated | BOOLEAN | True if heuristics flagged the file as autogenerated |
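Because `copies` and `size` are typed as VARCHAR in the schema, numeric comparisons on them need an explicit cast. The sketch below maps one row onto a typed record. The `CodeFile` class, the `parse_row` helper, and the example row are hypothetical; the row holds placeholder values, not real data from the dataset.

```python
from dataclasses import dataclass


@dataclass
class CodeFile:
    repo_name: str
    path: str
    copies: int  # VARCHAR in the schema, cast to int here
    size: int    # VARCHAR in the schema, cast to int here
    content: str
    license: str
    hash: int
    line_mean: float
    line_max: int
    alpha_frac: float
    autogenerated: bool


def parse_row(row: dict) -> CodeFile:
    """Convert a raw row (dict keyed by column name) into a typed record."""
    return CodeFile(
        repo_name=row["repo_name"],
        path=row["path"],
        copies=int(row["copies"]),
        size=int(row["size"]),
        content=row["content"],
        license=row["license"],
        hash=int(row["hash"]),
        line_mean=float(row["line_mean"]),
        line_max=int(row["line_max"]),
        alpha_frac=float(row["alpha_frac"]),
        autogenerated=bool(row["autogenerated"]),
    )


# Placeholder row for illustration only; these values are made up.
example_row = {
    "repo_name": "owner/repo",
    "path": "pkg/module.py",
    "copies": "1",
    "size": "12",
    "content": "print('hi')\n",
    "license": "mit",
    "hash": 0,
    "line_mean": 11.0,
    "line_max": 11,
    "alpha_frac": 0.58,
    "autogenerated": False,
}

record = parse_row(example_row)
print(record.repo_name, record.size, record.copies)
```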
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "CodeParrot Clean — Deduplicate" })
// Found: 4b6d4631-e372-4ac9-80ce-30489f8d8e00
get_download_url({ dataset_id: "4b6d4631-e372-4ac9-80ce-30489f8d8e00" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/4b6d4631-e372-4ac9-80ce-30489f8d8e00/download-url
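The same request can be made from Python using only the standard library. This is a minimal sketch: the endpoint path and dataset ID come from the curl command above, while the helper names are hypothetical and the response format is an assumption. The listing does not document the response, so the code returns parsed JSON when the body is JSON and the raw text otherwise.

```python
import json
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "4b6d4631-e372-4ac9-80ce-30489f8d8e00"


def download_url_endpoint(dataset_id: str) -> str:
    """Build the download-url endpoint shown in the curl example."""
    return f"{API_BASE}/datasets/{dataset_id}/download-url"


def fetch_download_info(dataset_id: str, timeout: float = 30.0):
    """GET the endpoint without an API key.

    The response shape is not documented in this listing, so this returns
    parsed JSON when possible and the raw response text otherwise.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    try:
        return json.loads(body)
    except json.JSONDecodeError:
        return body


# Usage (performs a network request):
#   info = fetch_download_info(DATASET_ID)
#   print(info)
```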