images, BLIP3o/BLIP3o-Pretrain-Long-Caption, multimodal, vision-language, image-captioning, pretraining, webdataset, long-caption, qwen2.5-vl, apache-2.0
BLIP3o Pretrain Long-Caption Dataset (27M Images)
About this data
27 million images, each paired with a long caption of roughly 120 tokens generated by Qwen2.5-VL-7B-Instruct. Distributed in WebDataset format under the Apache 2.0 license. Intended for vision-language pretraining and for fine-tuning multimodal models.
Schema
| Name | Type | Description |
|---|---|---|
| jpg | STRUCT(bytes BLOB, path VARCHAR) | JPEG/PNG image binary data with embedded file path reference |
| txt | VARCHAR | Long-form image caption (~120 tokens) generated by Qwen2.5-VL-7B-Instruct |
| __key__ | VARCHAR | WebDataset sample identifier for streaming and sharding |
| __url__ | VARCHAR | Source URL or archive reference for the sample record |
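To make the schema concrete, here is a minimal sketch of how one record with these four fields could be assembled from a WebDataset-style tar shard, using only the Python standard library. The shard is built in memory with placeholder bytes rather than a real image, and the file names, the `demo-shard.tar` label, and the value placed in `path` are illustrative assumptions, not details taken from the actual dataset.

```python
import io
import tarfile
from collections import defaultdict


def make_demo_shard() -> bytes:
    """Build a tiny in-memory tar shard holding one fake sample.

    The payloads are placeholders (hypothetical), not real dataset content.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in [
            ("000000001.jpg", b"\xff\xd8\xff\xe0 placeholder"),
            ("000000001.txt", "A long caption describing the image.".encode("utf-8")),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def iter_samples(shard_bytes: bytes, url: str = "demo-shard.tar"):
    """Group tar members by key and yield one dict per sample.

    Assumes flat member names of the form <key>.<ext>, so the key is
    everything before the first dot.
    """
    grouped = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, _, ext = member.name.partition(".")
            grouped[key][ext] = tar.extractfile(member).read()
    for key, parts in grouped.items():
        yield {
            "__key__": key,                      # sample identifier
            "__url__": url,                      # shard the sample came from
            "jpg": {"bytes": parts["jpg"], "path": f"{key}.jpg"},  # assumed path value
            "txt": parts["txt"].decode("utf-8"),  # caption text
        }


samples = list(iter_samples(make_demo_shard()))
print(samples[0]["__key__"], len(samples[0]["jpg"]["bytes"]), samples[0]["txt"])
```

For real shards, a dedicated loader that streams and shuffles across shards is likely a better fit than this hand-rolled grouping; the sketch only illustrates how the columns relate to files inside a shard.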
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "BLIP3o Pretrain Long-Caption D" })
// Found: c09f839e-ba05-4ca1-9891-12d1c8b1576c
get_download_url({ dataset_id: "c09f839e-ba05-4ca1-9891-12d1c8b1576c" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/c09f839e-ba05-4ca1-9891-12d1c8b1576c/download-url
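The same request can be sketched in Python with the standard library. The endpoint path and dataset ID are taken from the curl command; the helper names are hypothetical, and because the response format is not described on this page, the sketch returns the raw body as text instead of assuming particular fields.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "c09f839e-ba05-4ca1-9891-12d1c8b1576c"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint for a dataset ID (hypothetical helper)."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    The response schema is not documented here, so nothing is parsed;
    inspect the body before relying on any field in it.
    """
    with urllib.request.urlopen(download_url_endpoint(dataset_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Printing the endpoint needs no network access.
print(download_url_endpoint(DATASET_ID))

# Uncomment to perform the actual request:
# print(fetch_download_url_response(DATASET_ID))
```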