Tags: text, pii, privacy, multilingual, nlp, text-anonymization
OpenPII — 1.4M Multilingual PII Masking Examples
About this data
1,428,143 synthetic text examples with fine-grained PII annotations spanning 23 European languages. Each example pairs source and masked text with privacy masks and token-level labels, designed for training privacy-redaction and anonymization models.
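As a rough illustration of the token-level labels, the sketch below pairs the `mbert_tokens` and `mbert_token_classes` arrays of one record and keeps only the tokens tagged as PII. It assumes the two arrays are aligned one-to-one and that non-PII tokens carry the class `"O"` with BIO-style prefixes on the rest; the listing does not confirm either, and the sample record is invented for illustration.

```python
def pii_tokens(tokens: list[str], classes: list[str]) -> list[tuple[str, str]]:
    """Return (token, class) pairs for tokens whose class is not "O"."""
    if len(tokens) != len(classes):
        raise ValueError("tokens and classes must have the same length")
    # Assumption: "O" marks tokens outside any PII span.
    return [(tok, cls) for tok, cls in zip(tokens, classes) if cls != "O"]


# Hypothetical record, not taken from the dataset.
sample_tokens = ["Contact", "Anna", "at", "home"]
sample_classes = ["O", "B-GIVENNAME", "O", "O"]
tagged = pii_tokens(sample_tokens, sample_classes)
```

Checking the length match first is a cheap guard against rows where tokenization and labeling have drifted apart.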
Schema
| Name | Type | Description |
|---|---|---|
| source_text | VARCHAR | |
| masked_text | VARCHAR | |
| privacy_mask | STRUCT("label" VARCHAR, "start" BIGINT, "end" BIGINT, "value" VARCHAR, label_index BIGINT)[] | |
| split | VARCHAR | |
| uid | BIGINT | |
| language | VARCHAR | |
| region | VARCHAR | |
| script | VARCHAR | |
| mbert_tokens | VARCHAR[] | |
| mbert_token_classes | VARCHAR[] | |
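The `privacy_mask` column describes each PII span with a `label`, `start`, `end`, `value`, and `label_index`. A minimal sketch of how such spans might be applied to `source_text` follows. It assumes `start` and `end` are character offsets with `end` exclusive, and that placeholders look like `[LABEL_index]`; neither is stated in the listing, so check the output against `masked_text` on real rows. The sample row is invented for illustration.

```python
from typing import TypedDict


class MaskSpan(TypedDict):
    label: str
    start: int
    end: int
    value: str
    label_index: int


def apply_privacy_mask(source_text: str, spans: list[MaskSpan]) -> str:
    """Replace each annotated span with a placeholder."""
    masked = source_text
    # Work from the last span to the first so earlier offsets stay valid.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        placeholder = f"[{span['label']}_{span['label_index']}]"  # assumed format
        masked = masked[: span["start"]] + placeholder + masked[span["end"]:]
    return masked


# Hypothetical row, not taken from the dataset.
text = "Contact Anna at anna@example.com."
spans: list[MaskSpan] = [
    {"label": "GIVENNAME", "start": 8, "end": 12, "value": "Anna", "label_index": 1},
    {"label": "EMAIL", "start": 16, "end": 32, "value": "anna@example.com", "label_index": 1},
]
masked_example = apply_privacy_mask(text, spans)
```

Comparing `source_text[start:end]` with `value` for every span is a quick way to test the offset assumption before relying on it.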
Free
Open dataset
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
"mcpServers": {
"databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
}
}
# 2. Your agent can then call:
search_datasets({ query: "OpenPII — 1.4M Multilingual PI" })
// Found: 340322cb-2ff0-4668-b0e9-9c2d6ca5f666
get_download_url({ dataset_id: "340322cb-2ff0-4668-b0e9-9c2d6ca5f666" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/340322cb-2ff0-4668-b0e9-9c2d6ca5f666/download-url
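The same request could be issued from Python. The sketch below uses only the standard library and the endpoint shown in the curl command; the helper names are hypothetical, and because the response format is not documented in this listing, the body is returned as raw text rather than parsed.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "340322cb-2ff0-4668-b0e9-9c2d6ca5f666"


def download_url_endpoint(dataset_id: str) -> str:
    """Build the download-url endpoint for a dataset ID."""
    return f"{API_BASE}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """GET the endpoint and return the response body as text.

    The response schema is an unknown here, so nothing is parsed.
    """
    endpoint = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(endpoint, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (performs a network request):
# print(fetch_download_url_response(DATASET_ID))
```

Inspect the returned text once before writing any parsing code, since the field that holds the actual download link is not specified here.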