text · MohamedRashad/MASC-Arabic · arabic · speech · asr · tts · audio · multi-dialect · youtube · cc-by
MASC: Massive Arabic Speech Corpus (1,000 hours, multi-dialect)
About this data
1,000 hours of multi-dialect Arabic speech, sampled at 16 kHz, crawled from 700+ YouTube channels and paired with transcripts. Distributed in Parquet format under CC-BY-4.0. Intended for training and evaluating Arabic ASR, TTS, and speech LLMs.
Schema
| Name | Type | Description |
|---|---|---|
| video_id | VARCHAR | YouTube video identifier from source channel |
| start | DOUBLE | Start timestamp in seconds within the source video |
| end | DOUBLE | End timestamp in seconds within the source video |
| duration | DOUBLE | Audio segment length in seconds |
| text | VARCHAR | Arabic transcript of the speech segment |
| type | VARCHAR | Segment type or quality classification code |
| file_path | VARCHAR | Local file system path to the WAV audio file |
| audio | STRUCT(bytes BLOB, path VARCHAR) | Audio waveform data structure containing encoded bytes and file path reference |
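The schema above suggests a few consistency checks that can be run on each row before training. The following is a minimal sketch, not the dataset's own tooling. It assumes that `audio.bytes` holds a WAV-encoded file (suggested by the `file_path` description), that `duration` equals `end - start`, and that audio is mono 16-bit PCM at the stated 16 kHz. The helper names `make_silent_wav`, `wav_info`, and `check_row` are hypothetical, and the row is synthetic stand-in data rather than a real record.

```python
import io
import wave

SAMPLE_RATE_HZ = 16_000  # sample rate stated in the dataset description


def make_silent_wav(duration_s: float, sample_rate: int = SAMPLE_RATE_HZ) -> bytes:
    """Build a mono 16-bit PCM WAV of silence, used here as stand-in data."""
    n_frames = int(round(duration_s * sample_rate))
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * n_frames)
    return buf.getvalue()


def wav_info(audio: dict) -> tuple[int, float]:
    """Return (sample_rate, seconds) for an `audio` struct holding WAV bytes."""
    with wave.open(io.BytesIO(audio["bytes"]), "rb") as w:
        rate = w.getframerate()
        return rate, w.getnframes() / rate


def check_row(row: dict, tolerance_s: float = 0.05) -> list[str]:
    """Collect problems found in one row; an empty list means the row passed."""
    problems = []
    if row["end"] <= row["start"]:
        problems.append("end is not after start")
    # Assumption: duration is derived from the two timestamps.
    if abs((row["end"] - row["start"]) - row["duration"]) > tolerance_s:
        problems.append("duration != end - start")
    rate, seconds = wav_info(row["audio"])
    if rate != SAMPLE_RATE_HZ:
        problems.append(f"unexpected sample rate {rate}")
    if abs(seconds - row["duration"]) > tolerance_s:
        problems.append("decoded length differs from duration")
    return problems


# Synthetic row shaped like the schema; every value here is made up.
row = {
    "video_id": "example_id",
    "start": 12.0,
    "end": 14.5,
    "duration": 2.5,
    "text": "مرحبا",
    "type": "c",
    "file_path": "example.wav",
    "audio": {"bytes": make_silent_wav(2.5), "path": "example.wav"},
}
print(check_row(row))
```

With real data, the same `check_row` function could be applied to each record after reading the Parquet files with a reader of your choice, provided the `audio` struct is exposed as a mapping with `bytes` and `path` keys.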
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "MASC: Massive Arabic Speech Co" })
// Found: 10a0d619-3a20-4ee1-a21c-592f89d7570f
get_download_url({ dataset_id: "10a0d619-3a20-4ee1-a21c-592f89d7570f" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/10a0d619-3a20-4ee1-a21c-592f89d7570f/download-url
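The same REST request can be issued from Python using only the standard library. This is a rough sketch under stated assumptions: the endpoint path is taken from the curl command above, but the response format is not documented here, so the hypothetical helper `fetch_raw` returns the body as plain text instead of parsing it. The function names are illustrative and not part of any DataBazaar client.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "10a0d619-3a20-4ee1-a21c-592f89d7570f"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint shown in the curl example."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_raw(dataset_id: str, timeout_s: float = 30.0) -> str:
    """GET the endpoint and return the body as text.

    The response schema is an unknown here, so no parsing is attempted.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


# Only the URL is built at import time; no network request is made here.
print(download_url_endpoint(DATASET_ID))
```

Calling `fetch_raw(DATASET_ID)` performs the actual request. Inspect the returned text before relying on any particular field, since its structure is not specified on this page.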