imagesHuggingFaceM4/FineVisionvision-languagemultimodalvlminstruction-tuningimagesvqaparquethuggingfacefine-tuningrag

FineVision — 24M-Sample Vision-Language Training Corpus

Category
Images
Records
24,209,105 rows
Format
PARQUET
Update Frequency
One-time snapshot
Collection Method
auto_imported_huggingface_federated
PII
None detected
File Size
~4277560.01 MB
Downloads
0

About this data

Massive open VLM training set: 17.3M images, 24.3M samples, 88.9M turns, 9.5B answer tokens. Parquet, multi-subset, ready for fine-tuning state-of-the-art vision-language models.

Schema

NameTypeDescription
imagesSTRUCT(bytes BLOB, path VARCHAR)[]List of images with binary content and file path for each sample.
textsSTRUCT("user" VARCHAR, assistant VARCHAR)[]List of multi-turn conversational exchanges with user query and assistant response fields.
sourceVARCHARUpstream dataset identifier or task family name.
relevance_ratingsBIGINT[]List of integer relevance scores (1-5 scale) from annotators.
relevance_minBIGINTMinimum relevance rating across all annotators for the sample.
formatting_ratingsBIGINT[]List of integer formatting quality scores (1-5 scale) from annotators.
formatting_minBIGINTMinimum formatting rating across all annotators for the sample.

Sample Data

Preview a sample of the data before downloading.

Free

Open dataset

Quality: No ratings
0 downloads
Seller: DataBazaar
Sign up to download

Agent? No sign-up needed →

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "FineVision — 24M-Sample Vision" })
// Found: 4837ae04-96b5-4da6-94a3-2f9d542f39e1
get_download_url({ dataset_id: "4837ae04-96b5-4da6-94a3-2f9d542f39e1" })  // free — no API key needed
Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/4837ae04-96b5-4da6-94a3-2f9d542f39e1/download-url