Tags: text, AmazonScience/massive, multilingual, nlu, intent-classification, slot-filling, voice-assistant, benchmark, low-resource-languages, amazon, cc-by-4.0, parallel-corpus

MASSIVE: Multilingual NLU Dataset (51 Languages, 1M+ Utterances)

Category: Text
Records: 2,560,755 rows
Format: PARQUET
Update Frequency: One-time snapshot
Collection Method: auto_imported_huggingface_federated
PII: None detected
File Size: ~189.78 MB
Downloads: 0

About this data

Parallel multilingual NLU benchmark from Amazon Science with 1M+ utterances across 51 languages, annotated with 60 intents and 55 slot types. Built by localizing SLURP voice assistant interactions.
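Because the corpus is parallel, localized versions of an utterance can be lined up across languages. The sketch below assumes that rows for the same source utterance share one id value across locales, which this page does not state, so verify it on the real data first. The sample rows are hypothetical and only mimic the id, locale, and utt columns listed under Schema.

```python
from collections import defaultdict

# Hypothetical sample rows; the text is illustrative, not copied from the dataset.
# Assumption: the same id appears once per locale for a given source utterance.
rows = [
    {"id": "1", "locale": "en-US", "utt": "wake me up at nine am on friday"},
    {"id": "1", "locale": "de-DE", "utt": "weck mich am freitag um neun uhr"},
    {"id": "2", "locale": "en-US", "utt": "play some jazz"},
]


def align_by_id(rows):
    """Group utterances by id so each entry maps locale -> utterance text."""
    aligned = defaultdict(dict)
    for row in rows:
        aligned[row["id"]][row["locale"]] = row["utt"]
    return dict(aligned)


parallel = align_by_id(rows)
```

Here `parallel["1"]` holds the en-US and de-DE versions side by side, which is the shape most translation-style or cross-lingual evaluation code expects.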

Schema

Name | Type | Description
id | VARCHAR | Unique utterance identifier.
locale | VARCHAR | BCP 47 language-region code (e.g., en-US, ja-JP).
partition | VARCHAR | Dataset split: train, dev, or test.
scenario | BIGINT | Numeric identifier for high-level domain (e.g., alarm, music, weather).
intent | BIGINT | Numeric identifier for one of 60 intent classes (e.g., alarm_set).
utt | VARCHAR | Localized natural language utterance text.
annot_utt | VARCHAR | Utterance with inline slot annotations in [slot_type : value] format.
worker_id | VARCHAR | Anonymized identifier of the translator/annotator.
slot_method | STRUCT(slot VARCHAR[], "method" VARCHAR[]) | Per-slot localization method (translation, transcreation, etc.) paired with slot names.
judgments | STRUCT(worker_id VARCHAR[], intent_score TINYINT[], slots_score TINYINT[], grammar_score TINYINT[], spelling_score TINYINT[], language_identification VARCHAR[]) | Quality review scores (intent, slots, grammar, spelling) and language ID from multiple reviewers.
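The annot_utt column embeds slot labels directly in the text, so most slot-filling pipelines first split it into a plain utterance and a list of slots. The following is a minimal sketch that assumes annotations are never nested and always follow the [slot_type : value] form described above. The function name, the regular expression, and the example string are illustrative and not part of the dataset or any official tooling.

```python
import re

# Matches one inline annotation of the form "[slot_type : value]".
# Assumption: annotations are not nested and slot types contain no colon.
SLOT_PATTERN = re.compile(r"\[\s*([^\[\]:]+?)\s*:\s*([^\[\]]+?)\s*\]")


def parse_annot_utt(annot_utt):
    """Return (plain_text, slots), where slots is a list of (slot_type, value)."""
    slots = [(m.group(1), m.group(2)) for m in SLOT_PATTERN.finditer(annot_utt)]
    # Replace each annotation with its bare value to recover the plain text.
    plain_text = SLOT_PATTERN.sub(lambda m: m.group(2), annot_utt)
    return plain_text, slots


# Illustrative input written for this sketch.
example = "wake me up at [time : nine am] on [date : friday]"
plain, slots = parse_annot_utt(example)
```

For the example string, `plain` is the utterance with brackets removed and `slots` lists the time and date pairs in order of appearance. Comparing `plain` against the utt column is a cheap sanity check that the pattern fits the real data.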

Free

Open dataset

Seller: DataBazaar

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "MASSIVE: Multilingual NLU Data" })
// Found: 587b3f62-51c8-4ff7-8a79-4895cf3c00aa
get_download_url({ dataset_id: "587b3f62-51c8-4ff7-8a79-4895cf3c00aa" })  // free, no API key needed

Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/587b3f62-51c8-4ff7-8a79-4895cf3c00aa/download-url
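The same request can be issued from Python with only the standard library. This is a sketch under stated assumptions: the endpoint path is taken from the curl command above, but the response format is not documented on this page, so the helper returns the raw body instead of parsing specific fields. The function names are hypothetical.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "587b3f62-51c8-4ff7-8a79-4895cf3c00aa"


def download_url_endpoint(dataset_id, base=API_BASE):
    """Build the endpoint used by the curl command for a given dataset id."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_info(dataset_id, timeout=30):
    """Send the GET request and return the raw response body as text.

    Assumption: the body is UTF-8 text. Its structure is undocumented here,
    so inspect it before relying on any particular field.
    """
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Building the URL needs no network access; call fetch_download_info(DATASET_ID)
# only when you actually want to contact the API.
endpoint = download_url_endpoint(DATASET_ID)
```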