images, mlfoundations/datacomp_1b, clip, multimodal, image-text, vision-language, pretraining, datacomp, common-crawl, parquet, cc-by-4.0
DataComp-1B: Image-Text Pair Metadata (1.4B Samples)
About this data
Metadata (URLs, captions, image dimensions, CLIP similarity scores) for ~1.4B image-text pairs from DataComp-1B, the curated subset of CommonPool used to train state-of-the-art CLIP models. Licensed under CC-BY-4.0 and stored in parquet format.
Schema
| Name | Type | Description |
|---|---|---|
| uid | VARCHAR | |
| url | VARCHAR | |
| text | VARCHAR | |
| original_width | BIGINT | |
| original_height | BIGINT | |
| clip_b32_similarity_score | FLOAT | |
| clip_l14_similarity_score | FLOAT | |
| face_bboxes | DOUBLE[][] | |
| sha256 | VARCHAR | |
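As a rough illustration of how the schema columns might be used, the sketch below filters rows by CLIP similarity score and image size. It assumes each row has already been loaded into a Python dict keyed by the column names above (for example, through a parquet reader of your choice). The sample rows, the `filter_pairs` helper, and both thresholds are hypothetical and chosen only for illustration; they are not values recommended by the dataset.

```python
from typing import Any, Iterable, Iterator, Mapping


def filter_pairs(
    rows: Iterable[Mapping[str, Any]],
    min_l14_score: float = 0.3,  # illustrative threshold, not a dataset recommendation
    min_side: int = 200,         # illustrative minimum for the shorter image side, in pixels
) -> Iterator[Mapping[str, Any]]:
    """Yield rows whose L/14 similarity score and image size pass both thresholds."""
    for row in rows:
        score = row.get("clip_l14_similarity_score")
        width = row.get("original_width")
        height = row.get("original_height")
        # Skip rows with missing values instead of guessing defaults.
        if score is None or width is None or height is None:
            continue
        if score >= min_l14_score and min(width, height) >= min_side:
            yield row


# Hypothetical rows shaped like the schema; every value here is made up.
sample_rows = [
    {"uid": "a1", "url": "https://example.com/a.jpg", "text": "a red bicycle",
     "original_width": 640, "original_height": 480,
     "clip_b32_similarity_score": 0.29, "clip_l14_similarity_score": 0.31,
     "face_bboxes": [], "sha256": "placeholder-a"},
    {"uid": "b2", "url": "https://example.com/b.jpg", "text": "click here",
     "original_width": 800, "original_height": 600,
     "clip_b32_similarity_score": 0.10, "clip_l14_similarity_score": 0.12,
     "face_bboxes": [], "sha256": "placeholder-b"},
    {"uid": "c3", "url": "https://example.com/c.jpg", "text": "a small icon",
     "original_width": 120, "original_height": 90,
     "clip_b32_similarity_score": 0.33, "clip_l14_similarity_score": 0.35,
     "face_bboxes": [[0.1, 0.1, 0.4, 0.5]], "sha256": "placeholder-c"},
]

kept = list(filter_pairs(sample_rows))
print([row["uid"] for row in kept])
```

Because the helper is a generator, it can be applied to rows streamed in batches, which matters at this scale since the full metadata is unlikely to fit in memory.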
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "DataComp-1B: Image-Text Pair M" })
// Found: 68655b8e-f1aa-4a7c-becd-8e4d43e88e4e
get_download_url({ dataset_id: "68655b8e-f1aa-4a7c-becd-8e4d43e88e4e" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/68655b8e-f1aa-4a7c-becd-8e4d43e88e4e/download-url
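For agents or scripts that prefer Python over curl, the same request could be sketched with the standard library as follows. The endpoint path is taken from the curl command above; the helper names `download_url_endpoint` and `fetch_download_url_response` are hypothetical. The listing does not document the response format, so the sketch returns the raw response body without parsing it.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"


def download_url_endpoint(dataset_id: str) -> str:
    """Build the download-url endpoint for a dataset ID (path as in the curl example)."""
    return f"{API_BASE}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """Send a GET request with no API key and return the raw response body as text.

    The response format is not documented in the listing, so no parsing is attempted.
    """
    with urllib.request.urlopen(download_url_endpoint(dataset_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling `fetch_download_url_response("68655b8e-f1aa-4a7c-becd-8e4d43e88e4e")` would issue the same request as the curl command; inspect the returned text before assuming any particular structure.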