Source: rcalef/magneton-data
Tags: scientific, biology, proteins, protein-representation, swissprot, interpro, dssp, bioinformatics, machine-learning, jsonl, mit-license

Magneton: Substructure-Aware Protein Representation Learning Dataset

Category: Scientific
Records: 530,601 rows
Format: PARQUET
Update Frequency: One-time snapshot
Collection Method: auto_imported_huggingface_federated
PII: None detected
File Size: ~1218.64 MB
Downloads: 0

About this data

This dataset contains 530,601 SwissProt proteins annotated with DSSP secondary structure and InterPro 103.0 substructure annotations, in sharded JSONL format. It is intended for training and evaluating protein representation learning models.

Schema

Name | Type | Description
uniprot_id | VARCHAR | SwissProt accession identifier (e.g., Q8CC14)
kb_id | VARCHAR | Knowledge base entry identifier in format sp|accession|name
name | VARCHAR | SwissProt entry name (protein ID code, e.g., F216B_MOUSE)
length | BIGINT | Protein sequence length in amino acids
parsed_entries | BIGINT | Number of successfully parsed InterPro annotations
total_entries | BIGINT | Total number of InterPro annotations in source data
entries | STRUCT(id VARCHAR, element_type VARCHAR, match_id VARCHAR, element_name VARCHAR, representative BOOLEAN, positions BIGINT[][])[] | InterPro 103.0 domain/family/motif annotations with ID, type, match ID, name, representative flag, and residue position ranges
secondary_structs | STRUCT(dssp_type BIGINT, "start" BIGINT, "end" BIGINT)[] | Per-residue DSSP secondary structure assignments (type code, start position, end position)
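To make the nested columns concrete, here is a minimal Python sketch that summarizes one record shaped like the schema above. The record is a toy placeholder, not a real row: the identifiers, the `element_type` strings, and the `dssp_type` codes are invented for illustration, and the listing does not say which values actually occur or whether positions are 0-based or 1-based. The helper name `summarize_record` is hypothetical, and treating `total_entries - parsed_entries` as the count of unparsed annotations is an assumption based on the column descriptions.

```python
from collections import Counter

# Toy record following the schema above. Every value is a placeholder
# for illustration only; none of it is taken from the dataset.
toy_record = {
    "uniprot_id": "TOY001",
    "kb_id": "sp|TOY001|TOY_EXAMPLE",
    "name": "TOY_EXAMPLE",
    "length": 12,
    "parsed_entries": 2,
    "total_entries": 3,
    "entries": [
        {"id": "TOY_ENTRY_1", "element_type": "domain", "match_id": "TOY_MATCH_1",
         "element_name": "toy domain", "representative": True,
         "positions": [[1, 6]]},
        {"id": "TOY_ENTRY_2", "element_type": "family", "match_id": "TOY_MATCH_2",
         "element_name": "toy family", "representative": False,
         "positions": [[1, 12]]},
    ],
    "secondary_structs": [
        {"dssp_type": 0, "start": 1, "end": 4},
        {"dssp_type": 1, "start": 5, "end": 9},
        {"dssp_type": 0, "start": 10, "end": 12},
    ],
}


def summarize_record(record):
    """Count annotations and secondary structure segments in one record."""
    entries_by_type = Counter(e["element_type"] for e in record["entries"])
    segments_by_type = Counter(s["dssp_type"] for s in record["secondary_structs"])
    representative_ids = [e["id"] for e in record["entries"] if e["representative"]]
    return {
        "uniprot_id": record["uniprot_id"],
        "entries_by_type": dict(entries_by_type),
        "segments_by_dssp_type": dict(segments_by_type),
        "representative_ids": representative_ids,
        # Assumption: the gap between the two counters is the number of
        # source annotations that were not parsed.
        "unparsed_entries": record["total_entries"] - record["parsed_entries"],
    }


summary = summarize_record(toy_record)
print(summary)
```

The summary deliberately avoids arithmetic on `start`, `end`, and `positions`, since the coordinate convention (indexing base, inclusive or exclusive ends) is not stated in the listing and should be checked against real rows first.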

Free

Open dataset

Seller: DataBazaar

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "Magneton: Substructure-Aware P" })
// Found: 2ad16e68-096b-495f-8c9f-c3f9a647fb43
get_download_url({ dataset_id: "2ad16e68-096b-495f-8c9f-c3f9a647fb43" })  // free — no API key needed

Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/2ad16e68-096b-495f-8c9f-c3f9a647fb43/download-url
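
The same request can be sketched in Python with only the standard library. This is a minimal sketch, not an official client: the endpoint path and dataset ID are copied from the curl command above, while the function names, the timeout, and the UTF-8 decoding are assumptions. The listing does not document the response format, so the body is returned as raw text rather than parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "2ad16e68-096b-495f-8c9f-c3f9a647fb43"


def download_url_endpoint(dataset_id, api_base=API_BASE):
    """Build the download-url endpoint shown in the curl example."""
    return f"{api_base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id, timeout=30):
    """GET the endpoint and return the raw response body as text.

    The response schema is not documented in the listing, so no parsing
    is attempted here; UTF-8 decoding is an assumption.
    """
    with urllib.request.urlopen(download_url_endpoint(dataset_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (performs a network request):
# print(fetch_download_url_response(DATASET_ID))
```

Inspect the returned text once before writing any code that extracts a URL from it.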