MASSIVE: Multilingual NLU Dataset (51 Languages, 1M+ Utterances)

Creator: DataBazaar
Keywords: AmazonScience/massive, multilingual, nlu, intent-classification, slot-filling, voice-assistant, benchmark, low-resource-languages, amazon, cc-by-4.0, parallel-corpus

About this data

A parallel multilingual NLU benchmark from Amazon Science: more than 1M utterances across 51 languages, annotated with 60 intents and 55 slot types. It was built by localizing the utterances of the English SLURP voice-assistant dataset.

Schema

Name	Type	Description
id	VARCHAR	Unique utterance identifier.
locale	VARCHAR	BCP 47 language-region code (e.g., en-US, ja-JP).
partition	VARCHAR	Dataset split: train, dev, or test.
scenario	BIGINT	Numeric identifier of the high-level domain label (e.g., alarm, music, weather).
intent	BIGINT	Numeric identifier of one of the 60 intent class labels (e.g., alarm_set).
utt	VARCHAR	Localized natural language utterance text.
annot_utt	VARCHAR	Utterance with inline slot annotations in [slot_type : value] format.
worker_id	VARCHAR	Anonymized identifier of the translator/annotator.
slot_method	STRUCT(slot VARCHAR[], "method" VARCHAR[])	Per-slot localization method (translation, transcreation, etc.) paired with slot names.
judgments	STRUCT(worker_id VARCHAR[], intent_score TINYINT[], slots_score TINYINT[], grammar_score TINYINT[], spelling_score TINYINT[], language_identification VARCHAR[])	Quality review scores (intent, slots, grammar, spelling) and language ID from multiple reviewers.
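
The annot_utt column usually needs parsing before it can be used for slot filling. A minimal Python sketch, assuming every annotation follows the [slot_type : value] form described above and that annotations are never nested. The function name parse_annot_utt and the sample string are illustrative and not taken from the dataset:

```python
import re

# Matches one inline annotation such as "[time : nine am]".
# The slot type may not contain ":" or brackets; the value may contain ":".
SLOT_PATTERN = re.compile(r"\[\s*([^\[\]:]+?)\s*:\s*([^\[\]]*?)\s*\]")


def parse_annot_utt(annot_utt):
    """Return (plain_text, slots) for one annotated utterance.

    slots is a list of (slot_type, value) tuples in order of appearance.
    """
    slots = [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annot_utt)]
    # Replace each annotation with its bare value to recover the plain text.
    plain_text = SLOT_PATTERN.sub(lambda m: m.group(2), annot_utt)
    return plain_text, slots


# Illustrative input (hypothetical, written for this example).
example = "wake me up at [time : nine am] on [date : friday]"
plain, slots = parse_annot_utt(example)
print(plain)
print(slots)
```

If the assumption holds, the recovered plain text should match the utt column for the same row, which makes a cheap consistency check when loading the data.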

Free

Open dataset

For AI Agents

Via MCP Server

# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "MASSIVE: Multilingual NLU Data" })
// Found: 587b3f62-51c8-4ff7-8a79-4895cf3c00aa
get_download_url({ dataset_id: "587b3f62-51c8-4ff7-8a79-4895cf3c00aa" })  // free — no API key needed

Via REST API

# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/587b3f62-51c8-4ff7-8a79-4895cf3c00aa/download-url
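
The same request can be made from Python using only the standard library. This is a sketch under assumptions: the helper names download_url_endpoint and fetch_download_info are hypothetical, and because the response format is not documented here, the body is returned as raw text rather than parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "587b3f62-51c8-4ff7-8a79-4895cf3c00aa"


def download_url_endpoint(dataset_id, base=API_BASE):
    """Build the download-url endpoint shown in the curl command."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_info(dataset_id, timeout=30.0):
    """GET the endpoint and return the response body as text.

    No API key is sent, matching the free-dataset curl example.
    The response shape is an unknown here, so it is not parsed.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

Calling fetch_download_info(DATASET_ID) performs the network request. Inspect the returned text before relying on any particular field in it.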