Tags: images, mlfoundations/MINT-1T-ArXiv, multimodal, interleaved, arxiv, vision-language, pretraining, image-text, webdataset, scientific, mint-1t, cc-by-4.0

MINT-1T ArXiv — Multimodal Interleaved ArXiv Papers (Text + Images)

Category: Images
Records: 7,300 rows
Format: PARQUET
Update Frequency: One-time snapshot
Collection Method: auto_imported_huggingface_federated
PII: None detected
File Size: ~4041.44 MB
Downloads: 0

About this data

This is the ArXiv subset of MINT-1T: interleaved text-and-image documents extracted from ArXiv papers, intended for large-scale multimodal pretraining. License: CC-BY-4.0.

Schema

| Name | Type | Description |
| --- | --- | --- |
| `__key__` | `VARCHAR` | ArXiv paper identifier (e.g., astro-ph0106473) |
| `__url__` | `VARCHAR` | HuggingFace dataset URL pointing to the WebDataset tar shard containing this record |
| `json` | `STRUCT(captions VARCHAR[], images VARCHAR[], texts VARCHAR[])` | Interleaved document structure with parallel arrays of text segments, image paths, and figure captions in reading order |
| `tiff` | `STRUCT(bytes BLOB, path VARCHAR)` | Binary image file (TIFF format) with raw bytes and file path reference |
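
The `json` column is described as parallel arrays in reading order, so a consumer typically needs to merge them back into one sequence. The following is a minimal sketch of that step, not the dataset's official loader. It assumes (the listing does not confirm this) that `texts` and `images` have equal length and that a null entry in one array marks a slot filled by the other. The function name `iter_segments` and the sample record are hypothetical, and `captions` is left untouched because its alignment with the other arrays is not documented here.

```python
from typing import Iterator, Optional, Sequence, Tuple


def iter_segments(
    texts: Sequence[Optional[str]],
    images: Sequence[Optional[str]],
) -> Iterator[Tuple[str, str]]:
    """Yield ("text", segment) or ("image", path) pairs in reading order.

    Assumption (not confirmed by the listing): both arrays have the same
    length, and a None entry means the slot belongs to the other array.
    """
    if len(texts) != len(images):
        raise ValueError("texts and images must have the same length")
    for text, image in zip(texts, images):
        if text is not None:
            yield ("text", text)
        if image is not None:
            yield ("image", image)


# Hypothetical record shaped like the `json` column; the values are made up.
record = {
    "texts": ["Abstract ...", None, "Section 1 ..."],
    "images": [None, "fig1.tiff", None],
    "captions": ["Figure 1: ..."],
}

segments = list(iter_segments(record["texts"], record["images"]))
for kind, value in segments:
    print(f"{kind}: {value}")
```

If the real arrays turn out not to follow the null-slot convention, the length check fails loudly rather than silently misaligning text and figures.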

Free

Open dataset

Seller: DataBazaar

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "MINT-1T ArXiv — Multimodal Int" })
// Found: 4a4acbba-ac2e-4bd1-82e5-15f79c479266
get_download_url({ dataset_id: "4a4acbba-ac2e-4bd1-82e5-15f79c479266" })  // free — no API key needed
Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/4a4acbba-ac2e-4bd1-82e5-15f79c479266/download-url
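
For agents that prefer Python over `curl`, the same request can be sketched with the standard library. This is an illustrative sketch only: the endpoint path and dataset ID are taken from the command above, while the helper names `download_url_endpoint` and `fetch_download_url_response` are hypothetical. The listing does not document the response format, so the body is returned as raw text rather than parsed.

```python
import urllib.request

# Base URL taken from the curl example in this listing.
API_BASE = "https://api.databazaar.io"


def download_url_endpoint(dataset_id: str) -> str:
    """Build the download-url endpoint for a dataset ID."""
    return f"{API_BASE}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """Send the GET request and return the raw response body as text.

    The response schema is not documented in the listing, so no parsing
    is attempted here.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


DATASET_ID = "4a4acbba-ac2e-4bd1-82e5-15f79c479266"
endpoint = download_url_endpoint(DATASET_ID)
print(endpoint)
```

Calling `fetch_download_url_response(DATASET_ID)` performs the actual network request; it is kept out of the module-level code so the snippet can be imported without side effects.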