text, SciCodePile/SciCode-Domain-Code, code, scientific-computing, github, biology, chemistry, physics, materials-science, llm-pretraining, domain-specific, apache-2.0
SciCode Domain Code: 1.1B+ Lines of Scientific Code Across 178 Domains
About this data
A large-scale, domain-specific code dataset (~115 GB, 1.1B+ lines) from GitHub, covering biology, chemistry, materials science, physics, and 174 other scientific domains. Licensed under Apache 2.0.
Schema
| Name | Type | Description |
|---|---|---|
| keyword | VARCHAR | Scientific domain or topic label (e.g., '3D', 'biology', 'chemistry') |
| repo_name | VARCHAR | GitHub repository identifier in owner/name format |
| file_path | VARCHAR | Full path to source file within repository |
| file_extension | VARCHAR | File extension including dot (e.g., '.py', '.cpp', '.java') |
| file_size | BIGINT | File size in bytes |
| line_count | BIGINT | Number of lines of code in file |
| content | VARCHAR | Raw source code contents as text |
| language | VARCHAR | Programming language name (e.g., 'Python', 'C++', 'Java') |
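As a rough illustration of how the eight columns above might be used together, the sketch below models one row as a Python dataclass and filters rows by domain keyword, language, and file size. The on-disk file format is not stated on this page, so the sketch assumes rows have already been loaded as dictionaries keyed by column name. `CodeRecord`, `select_records`, the `max_bytes` threshold, and the sample rows are all hypothetical and are not taken from the dataset.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class CodeRecord:
    """One row of the dataset, mirroring the eight schema columns."""
    keyword: str
    repo_name: str
    file_path: str
    file_extension: str
    file_size: int   # bytes
    line_count: int
    content: str
    language: str


def select_records(rows: Iterable[dict], keyword: str, language: str,
                   max_bytes: int = 1_000_000) -> Iterator[CodeRecord]:
    """Yield rows matching a keyword and language, skipping files over max_bytes."""
    for row in rows:
        record = CodeRecord(**row)
        if (record.keyword == keyword
                and record.language == language
                and record.file_size <= max_bytes):
            yield record


# Hypothetical rows for illustration only; not real dataset contents.
sample_rows = [
    {"keyword": "biology", "repo_name": "example-owner/example-repo",
     "file_path": "src/align.py", "file_extension": ".py",
     "file_size": 35, "line_count": 2,
     "content": "def align(a, b):\n    return a == b\n",
     "language": "Python"},
    {"keyword": "chemistry", "repo_name": "example-owner/other-repo",
     "file_path": "src/bond.cpp", "file_extension": ".cpp",
     "file_size": 13, "line_count": 1,
     "content": "int bond = 1;",
     "language": "C++"},
]

selected = list(select_records(sample_rows, "biology", "Python"))
total_lines = sum(r.line_count for r in selected)
print(len(selected), total_lines)
```

Filtering on `keyword` and `language` together is one plausible way to carve a domain-specific subset out of the full corpus; the size cap is an arbitrary choice for the example.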
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "SciCode Domain Code: 1.1B+ Lin" })
// Found: be5165cb-678d-4abf-97d4-74626aa538d7
get_download_url({ dataset_id: "be5165cb-678d-4abf-97d4-74626aa538d7" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/be5165cb-678d-4abf-97d4-74626aa538d7/download-url
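The same request as the curl command above could be made from Python using only the standard library. This is a minimal sketch: the helper names `download_url_endpoint` and `fetch_download_url_response` are hypothetical. The response format is not documented on this page, so the body is returned as raw text rather than parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "be5165cb-678d-4abf-97d4-74626aa538d7"


def download_url_endpoint(dataset_id: str, api_base: str = API_BASE) -> str:
    """Build the same endpoint URL that the curl command requests."""
    return f"{api_base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    Assumption: the body is UTF-8 text. Its structure is not documented
    here, so no parsing is attempted.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_download_url_response(DATASET_ID)` performs the network request. Inspect the returned text before relying on any particular field in it.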