text MohamedRashad/MASC-Arabic arabic speech asr tts audio multi-dialect youtube cc-by

MASC: Massive Arabic Speech Corpus (1,000 hours, multi-dialect)

Category: Text
Records: 913,400 rows
Format: PARQUET
Update Frequency: One-time snapshot
Collection Method: auto_imported_huggingface_federated
PII: None detected
File Size: ~176,310.29 MB
Downloads: 0

About this data

1,000 hours of multi-dialect Arabic speech, sampled at 16 kHz and crawled together with transcripts from 700+ YouTube channels. The data is distributed in Parquet format under CC-BY-4.0 and is intended for training and evaluating Arabic ASR, TTS, and speech LLM systems.
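As a rough sanity check, the two headline figures on this listing (1,000 hours and 913,400 rows) imply an average segment length of roughly 3.9 seconds. The sketch below only does that arithmetic; it assumes the "1,000 hours" figure is a rounded total and that every row is one audio segment, neither of which is confirmed here.

```python
# Figures quoted on this listing; "1,000 hours" is treated as a rounded, nominal total.
TOTAL_HOURS = 1_000
ROWS = 913_400


def mean_segment_seconds(total_hours: float, rows: int) -> float:
    """Average audio length per row in seconds, assuming one segment per row."""
    return total_hours * 3600 / rows


print(f"{mean_segment_seconds(TOTAL_HOURS, ROWS):.2f} s per segment on average")
```

The real distribution of segment lengths is in the duration column, so this average is only a quick plausibility figure.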

Schema

Name | Type | Description
video_id | VARCHAR | YouTube video identifier from source channel
start | DOUBLE | Start timestamp in seconds within the source video
end | DOUBLE | End timestamp in seconds within the source video
duration | DOUBLE | Audio segment length in seconds
text | VARCHAR | Arabic transcript of the speech segment
type | VARCHAR | Segment type or quality classification code
file_path | VARCHAR | Local file system path to the WAV audio file
audio | STRUCT(bytes BLOB, path VARCHAR) | Audio waveform data structure containing encoded bytes and file path reference
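To illustrate how a row with this schema might be handled once loaded, here is a minimal Python sketch that uses only the standard library. It rests on two assumptions not stated in the listing: that duration equals end minus start, and that audio.bytes holds WAV-encoded data. The helper names and the sample row are hypothetical, and the row's values are invented so the sketch runs without the dataset.

```python
import io
import wave


def audio_bytes(row: dict) -> bytes:
    """Return the encoded audio stored in the row's `audio` struct."""
    return row["audio"]["bytes"]


def wav_duration_seconds(data: bytes) -> float:
    """Decode WAV bytes (assumed encoding) and return their length in seconds."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_matches(row: dict, tolerance: float = 0.05) -> bool:
    """Check `duration` against `end - start` (assumed relationship)."""
    return abs((row["end"] - row["start"]) - row["duration"]) <= tolerance


def make_silent_wav(seconds: float, rate: int = 16_000) -> bytes:
    """Build a mono 16-bit silent WAV so the sketch runs without the dataset."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    return buf.getvalue()


# Hypothetical row shaped like the schema; every value here is invented.
row = {
    "video_id": "example_id",
    "start": 12.0,
    "end": 14.5,
    "duration": 2.5,
    "text": "نص تجريبي",
    "type": "c",
    "file_path": "example.wav",
    "audio": {"bytes": make_silent_wav(2.5), "path": "example.wav"},
}

print(duration_matches(row), wav_duration_seconds(audio_bytes(row)))
```

With real data, the same functions would be applied to rows read from the Parquet files by whichever reader you prefer. If the audio turns out not to be WAV-encoded, wav_duration_seconds would need a different decoder.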

Free

Open dataset

Seller: DataBazaar

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "MASC: Massive Arabic Speech Co" })
// Found: 10a0d619-3a20-4ee1-a21c-592f89d7570f
get_download_url({ dataset_id: "10a0d619-3a20-4ee1-a21c-592f89d7570f" })  // free — no API key needed

Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/10a0d619-3a20-4ee1-a21c-592f89d7570f/download-url
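
The same request can be sketched in Python with only the standard library. The endpoint and dataset ID are taken from the curl command above; the function names are hypothetical. The response format is not documented on this page, so the sketch returns the raw body and does not parse it.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "10a0d619-3a20-4ee1-a21c-592f89d7570f"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint for a dataset ID."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    The body is returned unparsed because its format is not documented here.
    """
    with urllib.request.urlopen(download_url_endpoint(dataset_id), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Only the URL is printed here; no network request is made at import time.
print(download_url_endpoint(DATASET_ID))
```

Calling fetch_download_url_response(DATASET_ID) performs the actual request. Inspect the returned text before relying on any particular field in it.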