scientific | opendatalab/Sci-Base | science, ai4s, chemistry, biology, medical, physics, climate, materials, pretraining, parquet
Sci-Base: AI-Ready Scientific Foundation Dataset
About this data
A large multi-domain scientific text corpus (chemistry, biology, climate, medical, materials, earth science, physics) from OpenDataLab's Sciverse foundation. It contains 1M-10M rows in Parquet format and is licensed under CC-BY-4.0.
Schema
| Name | Type | Description |
|---|---|---|
| abstract | VARCHAR | Summary of the research study's main findings and methodology. |
| author | VARCHAR | Name(s) of the publication author(s). |
| content_list | STRUCT(bbox VARCHAR, code_body VARCHAR, code_caption VARCHAR, image_caption VARCHAR, image_footnote VARCHAR, img_path VARCHAR, list_items VARCHAR, page_idx VARCHAR, sub_type VARCHAR, table_body VARCHAR, table_caption VARCHAR, table_footnote VARCHAR, "text" VARCHAR, text_format VARCHAR, text_level VARCHAR, "type" VARCHAR)[] | Array of structured document elements including text, tables, images, code, and metadata (bbox, captions, page index, type, format). |
| doi | VARCHAR | Digital Object Identifier for the scientific publication. |
| is_oa | BOOLEAN | Boolean flag indicating open-access status of the publication. |
| language | VARCHAR | Language code or name of the document text. |
| sci_category | VARCHAR | Scientific domain classification (chemistry, biology, climate, medical, materials, earth, physics). |
| sha256 | VARCHAR | SHA-256 cryptographic hash of the document content. |
| title | VARCHAR | Title of the scientific publication or article. |
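Because `content_list` is an array of structs, most consumers will want to flatten it before training or indexing. The sketch below shows one way to do that in plain Python once a row has been loaded as a dictionary. The helper name `flatten_content_list`, the sample row, and the `"type"` values `"text"` and `"table"` are assumptions for illustration; the listing does not document which values the `type` field takes.

```python
from typing import Any, Dict, List, Optional


def flatten_content_list(
    content_list: Optional[List[Dict[str, Any]]],
    wanted_type: str = "text",
) -> str:
    """Join the "text" field of every element whose "type" equals wanted_type.

    Elements with a missing or empty "text" field are skipped.
    """
    parts = []
    for element in content_list or []:
        if element.get("type") == wanted_type and element.get("text"):
            parts.append(element["text"])
    return "\n".join(parts)


# Hypothetical row shaped like the schema above. Every struct field is a
# VARCHAR in the schema, so page_idx and text_level are strings here.
# The "type" values are assumed, not taken from the dataset.
row = {
    "title": "Example title",
    "sci_category": "chemistry",
    "content_list": [
        {"type": "text", "text": "Introduction", "text_level": "1", "page_idx": "0"},
        {"type": "table", "table_caption": "Table 1", "table_body": "...", "page_idx": "0"},
        {"type": "text", "text": "Body paragraph.", "page_idx": "1"},
    ],
}

body = flatten_content_list(row["content_list"])
```

Filtering on `type` keeps table and image elements out of the running text; they can be collected separately by passing a different `wanted_type` and reading the matching caption or body fields.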
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "Sci-Base: AI-Ready Scientific " })
// Found: 65c011c0-8dde-40f7-95d5-0b3f88249e27
get_download_url({ dataset_id: "65c011c0-8dde-40f7-95d5-0b3f88249e27" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/65c011c0-8dde-40f7-95d5-0b3f88249e27/download-url
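The same REST request can be made from a script. The following is a minimal sketch using only the Python standard library; the helper names `download_url_endpoint` and `fetch_download_url_response` are hypothetical. The response format is not documented in this listing, so the sketch returns the raw response body rather than assuming a particular JSON shape.

```python
import urllib.request

# Base URL and dataset ID are taken from the curl example above.
API_BASE = "https://api.databazaar.io"
DATASET_ID = "65c011c0-8dde-40f7-95d5-0b3f88249e27"


def download_url_endpoint(dataset_id: str, api_base: str = API_BASE) -> str:
    """Build the download-url endpoint for a dataset ID."""
    return f"{api_base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """Send a GET request and return the raw response body as text.

    The body is returned unparsed because its structure is an unknown here.
    """
    endpoint = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(endpoint, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Usage (performs a network request, so it is left commented out):
# print(fetch_download_url_response(DATASET_ID))
```

Inspect the returned body once before writing any parsing logic around it.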