Source: mlfoundations/MINT-1T-ArXiv
Tags: images, multimodal, interleaved, arxiv, vision-language, pretraining, image-text, webdataset, scientific, mint-1t, cc-by-4.0
MINT-1T ArXiv — Multimodal Interleaved ArXiv Papers (Text + Images)
About this data
ArXiv subset of MINT-1T: multimodal interleaved text-and-image documents extracted from ArXiv papers, designed for multimodal pretraining at scale. CC-BY-4.0.
Schema
| Name | Type | Description |
|---|---|---|
| __key__ | VARCHAR | ArXiv paper identifier (e.g., astro-ph0106473) |
| __url__ | VARCHAR | HuggingFace dataset URL pointing to the WebDataset tar shard containing this record |
| json | STRUCT(captions VARCHAR[], images VARCHAR[], texts VARCHAR[]) | Interleaved document structure with parallel arrays of text segments, image paths, and figure captions in reading order |
| tiff | STRUCT(bytes BLOB, path VARCHAR) | Binary image file (TIFF format) with raw bytes and file path reference |
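To make the `json` column concrete, here is a minimal Python sketch of walking one record in reading order. It assumes the three arrays are aligned by index and that an empty or null entry means "nothing of this kind at this position"; the listing does not confirm either detail, so check a real record first. The `interleave` helper and the sample record are hypothetical, invented for illustration.

```python
from itertools import zip_longest


def interleave(doc):
    """Yield (kind, value) pairs in reading order from one `json` struct.

    Assumption: texts, images and captions are index-aligned, and a
    None/empty entry means that slot is unused at that position.
    """
    texts = doc.get("texts") or []
    images = doc.get("images") or []
    captions = doc.get("captions") or []
    # zip_longest tolerates arrays of unequal length by padding with None.
    for text, image, caption in zip_longest(texts, images, captions):
        if text:
            yield ("text", text)
        if image:
            yield ("image", image)
        if caption:
            yield ("caption", caption)


# Hypothetical record, invented for illustration (not real dataset content).
sample = {
    "texts": ["Introduction ...", None, "Results ..."],
    "images": [None, "fig1.tiff", None],
    "captions": [None, "Figure 1: ...", None],
}

items = list(interleave(sample))
for kind, value in items:
    print(f"{kind}: {value}")
```

If the arrays turn out not to be index-aligned, the same loop structure still applies, but the pairing logic would need to follow whatever ordering the real records use.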
Free
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "MINT-1T ArXiv — Multimodal Int" })
// Found: 4a4acbba-ac2e-4bd1-82e5-15f79c479266
get_download_url({ dataset_id: "4a4acbba-ac2e-4bd1-82e5-15f79c479266" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/4a4acbba-ac2e-4bd1-82e5-15f79c479266/download-url
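The same REST request can be sketched in Python using only the standard library. The endpoint path and dataset ID are taken from the curl command; the helper names are hypothetical, and the response format is not described on this page, so the sketch returns the raw body rather than assuming a JSON shape.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "4a4acbba-ac2e-4bd1-82e5-15f79c479266"


def download_url_endpoint(dataset_id, base=API_BASE):
    """Build the endpoint used by the curl command for a given dataset ID."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id, timeout=30):
    """GET the endpoint and return the raw response body as text.

    Assumption: the body is UTF-8 text. Its structure is not documented
    here, so no parsing is attempted.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


endpoint = download_url_endpoint(DATASET_ID)
print(endpoint)
# To perform the request (requires network access):
# print(fetch_download_url_response(DATASET_ID))
```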