MINT-1T ArXiv — Multimodal Interleaved ArXiv Papers (Text + Images)

Creator: DataBazaar
Keywords: mlfoundations/MINT-1T-ArXiv, multimodal, interleaved, arxiv, vision-language, pretraining, image-text, webdataset, scientific, mint-1t, cc-by-4.0

About this data

This is the ArXiv subset of MINT-1T: interleaved text-and-image documents extracted from ArXiv papers and intended for large-scale multimodal pretraining. Licensed under CC-BY-4.0.

Schema

Name	Type	Description
__key__	VARCHAR	ArXiv paper identifier (e.g., astro-ph0106473)
__url__	VARCHAR	HuggingFace dataset URL pointing to the WebDataset tar shard containing this record
json	STRUCT(captions VARCHAR[], images VARCHAR[], texts VARCHAR[])	Interleaved document structure: parallel arrays of figure captions (captions), image paths (images), and text segments (texts), in reading order
tiff	STRUCT(bytes BLOB, path VARCHAR)	Image file in TIFF format: raw bytes (bytes) and a file path reference (path)
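
As a rough illustration of how the parallel arrays in the json column might be walked in reading order, the sketch below assumes the arrays are position-aligned and that a slot is None when a position carries no text, image, or caption. That null-slot convention is an assumption and is not stated in the schema above. The helper name iter_segments and the sample record are hypothetical.

```python
from typing import Iterator


def iter_segments(doc: dict) -> Iterator[tuple]:
    """Yield ("text", value) or ("image", path, caption) tuples in reading order.

    Assumption: doc["texts"], doc["images"] and doc["captions"] are
    position-aligned, with None marking an empty slot.
    """
    texts = doc.get("texts") or []
    images = doc.get("images") or []
    captions = doc.get("captions") or []
    for i in range(max(len(texts), len(images))):
        text = texts[i] if i < len(texts) else None
        image = images[i] if i < len(images) else None
        caption = captions[i] if i < len(captions) else None
        if text is not None:
            yield ("text", text)
        if image is not None:
            yield ("image", image, caption)


# Hypothetical record shaped like the json column (all values are made up).
sample = {
    "texts": ["Introduction ...", None, "Results ..."],
    "images": [None, "fig1.tiff", None],
    "captions": [None, "Figure 1: ...", None],
}
segments = list(iter_segments(sample))
```

If the real records use a different convention (for example, empty strings instead of None), only the two `is not None` checks would need to change.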

Free

Open dataset

Seller: DataBazaar

For AI Agents

Via MCP Server

# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "MINT-1T ArXiv — Multimodal Int" })
// Found: 4a4acbba-ac2e-4bd1-82e5-15f79c479266
get_download_url({ dataset_id: "4a4acbba-ac2e-4bd1-82e5-15f79c479266" })  // free — no API key needed
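
If the agent's MCP config already lists other servers, the databazaar entry from step 1 can be merged in rather than pasted over the file. The following is a minimal sketch of that merge on an in-memory dict; the function name add_databazaar_server and the "other" server entry are hypothetical, and reading or writing the actual config file is left out.

```python
import json


def add_databazaar_server(config: dict) -> dict:
    """Return a copy of an MCP config with the databazaar entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    # Entry copied from step 1 above.
    servers["databazaar"] = {"command": "npx", "args": ["databazaar-mcp"]}
    merged["mcpServers"] = servers
    return merged


# Hypothetical existing config with one unrelated server.
existing = {"mcpServers": {"other": {"command": "other-cmd", "args": []}}}
updated = add_databazaar_server(existing)
print(json.dumps(updated, indent=2))
```

The input dict is left unmodified, so the original config can be kept as a backup.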

Via REST API

# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/4a4acbba-ac2e-4bd1-82e5-15f79c479266/download-url
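
The same request can be made from Python with the standard library. This is a sketch only: the response format is not documented on this page, so the body is returned as raw text rather than parsed, and UTF-8 decoding is an assumption. The function names are hypothetical.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "4a4acbba-ac2e-4bd1-82e5-15f79c479266"


def download_url_endpoint(dataset_id: str, api_base: str = API_BASE) -> str:
    """Build the download-url endpoint used in the curl example."""
    return f"{api_base}/datasets/{dataset_id}/download-url"


def fetch_download_info(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    The body is not parsed because its format is not specified here;
    decoding as UTF-8 is an assumption.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


endpoint = download_url_endpoint(DATASET_ID)
# Network call, left commented out so the sketch runs offline:
# print(fetch_download_info(DATASET_ID))
```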