MASC: Massive Arabic Speech Corpus (1,000 hours, multi-dialect)

Name: MASC: Massive Arabic Speech Corpus (1,000 hours, multi-dialect)
Creator: DataBazaar
Keywords: MohamedRashad/MASC-Arabic, arabic, speech, asr, tts, audio, multi-dialect, youtube, cc-by

About this data

1,000 hours of multi-dialect Arabic speech, sampled at 16 kHz and crawled from more than 700 YouTube channels, with a transcript for each segment. Distributed in Parquet format under the CC-BY-4.0 license. Intended for training and evaluating Arabic ASR, TTS, and speech LLMs.

Schema

Name	Type	Description
video_id	VARCHAR	YouTube video identifier from source channel
start	DOUBLE	Start timestamp in seconds within the source video
end	DOUBLE	End timestamp in seconds within the source video
duration	DOUBLE	Audio segment length in seconds
text	VARCHAR	Arabic transcript of the speech segment
type	VARCHAR	Segment type or quality classification code
file_path	VARCHAR	Local file system path to the WAV audio file
audio	STRUCT(bytes BLOB, path VARCHAR)	Audio waveform data structure containing encoded bytes and file path reference
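
To make the schema concrete, here is a minimal Python sketch that sanity-checks one row using only the standard library. It rests on assumptions the listing does not confirm: that `audio.bytes` holds WAV-encoded data (suggested by the `file_path` description), and that `duration` equals `end - start`. The row values, the `inspect_segment` and `make_wav_bytes` helpers, and the tolerance are hypothetical and serve only as illustration.

```python
import io
import wave


def make_wav_bytes(num_frames: int, sample_rate: int = 16000) -> bytes:
    """Build a silent mono 16-bit WAV in memory (stand-in for real audio)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        w.setframerate(sample_rate)
        w.writeframes(b"\x00\x00" * num_frames)
    return buf.getvalue()


def inspect_segment(row: dict, tolerance: float = 0.05) -> dict:
    """Decode the WAV header and compare it with the row's timing columns.

    Assumption: audio.bytes is a WAV file and duration == end - start.
    """
    with wave.open(io.BytesIO(row["audio"]["bytes"]), "rb") as w:
        sample_rate = w.getframerate()
        decoded_duration = w.getnframes() / sample_rate
    span = row["end"] - row["start"]
    return {
        "sample_rate": sample_rate,
        "decoded_duration": decoded_duration,
        "span_matches": abs(span - row["duration"]) <= tolerance,
        "audio_matches": abs(decoded_duration - row["duration"]) <= tolerance,
    }


# Hypothetical row shaped like the schema above; every value is made up.
row = {
    "video_id": "hypothetical_id",
    "start": 12.0,
    "end": 14.0,
    "duration": 2.0,
    "text": "مرحبا",
    "type": "hypothetical",
    "file_path": "hypothetical/segment.wav",
    "audio": {"bytes": make_wav_bytes(32000), "path": "hypothetical/segment.wav"},
}

report = inspect_segment(row)
print(report)
```

With real data, the same function could be applied to rows read from the Parquet files by whichever Parquet reader you prefer. If the audio turns out to be stored in another container, the `wave` call would need to be swapped for a suitable decoder.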

Free

Seller: DataBazaar

For AI Agents

Via MCP Server

# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "MASC: Massive Arabic Speech Co" })
// Found: 10a0d619-3a20-4ee1-a21c-592f89d7570f
get_download_url({ dataset_id: "10a0d619-3a20-4ee1-a21c-592f89d7570f" })  // free — no API key needed

Via REST API

# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/10a0d619-3a20-4ee1-a21c-592f89d7570f/download-url
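
The same request can be sketched in Python with the standard library. This is a minimal illustration, not an official client: the helper names `download_url_endpoint` and `fetch_download_url_response` are hypothetical, and because the listing does not document the response format, the body is returned as raw text rather than parsed.

```python
import urllib.request

# Base URL and dataset ID are taken from the curl example above.
API_BASE = "https://api.databazaar.io"
DATASET_ID = "10a0d619-3a20-4ee1-a21c-592f89d7570f"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the download-url endpoint for a dataset (hypothetical helper)."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the raw response body as text.

    Assumption: the response is UTF-8 text; its structure is not
    documented in the listing, so no parsing is attempted here.
    """
    endpoint = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


endpoint = download_url_endpoint(DATASET_ID)
print(endpoint)
# To perform the request (requires network access):
# print(fetch_download_url_response(DATASET_ID))
```

Inspect the returned body before relying on any particular field, since its shape is not specified here.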