text, pii, privacy, nlp, text-anonymization, ner
PII Masking Dataset — 300K Labeled Text Samples
About this data
225,000+ text samples annotated for personally identifiable information (PII) detection and masking, built to train and evaluate models that redact sensitive data from text. Each record pairs source and target text with privacy masks, span labels, and token-level annotations across multiple languages.
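The token-level annotations are not described in detail here. As a minimal sketch, assuming the `mbert_bio_labels` column in the schema uses the common BIO convention (`B-X` opens an entity of type `X`, `I-X` continues it, `O` is outside any entity), parallel token and label lists could be grouped into entities as follows. The function name `bio_to_entities` and the sample tokens are hypothetical and not taken from the dataset.

```python
def bio_to_entities(tokens: list[str], labels: list[str]) -> list[tuple[str, list[str]]]:
    """Group parallel token/BIO-label lists into (entity_type, tokens) pairs."""
    entities: list[tuple[str, list[str]]] = []
    current_label = None
    current_tokens: list[str] = []
    for token, tag in zip(tokens, labels):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity, closing any open one.
            if current_label is not None:
                entities.append((current_label, current_tokens))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # An I- tag of the same type extends the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) closes any open entity.
            if current_label is not None:
                entities.append((current_label, current_tokens))
            current_label, current_tokens = None, []
    if current_label is not None:
        entities.append((current_label, current_tokens))
    return entities


# Made-up example for illustration only; label names are assumptions.
tokens = ["Call", "Jane", "Doe", "today"]
labels = ["O", "B-NAME", "I-NAME", "O"]
entities = bio_to_entities(tokens, labels)
print(entities)  # [('NAME', ['Jane', 'Doe'])]
```

Real mBERT tokens are likely word pieces, so joining them back into surface strings would need an extra detokenization step that this sketch leaves out.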
Schema
| Name | Type | Description |
|---|---|---|
| source_text | VARCHAR | |
| target_text | VARCHAR | |
| privacy_mask | STRUCT("value" VARCHAR, "start" BIGINT, "end" BIGINT, "label" VARCHAR)[] | |
| span_labels | VARCHAR | |
| mbert_text_tokens | VARCHAR[] | |
| mbert_bio_labels | VARCHAR[] | |
| id | VARCHAR | |
| language | VARCHAR | |
| set | VARCHAR | |
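To illustrate how the `privacy_mask` structs relate `source_text` to a masked version, here is a minimal sketch. It assumes that `start` and `end` are character offsets into `source_text` with `end` exclusive, and that masked text uses `[LABEL]` placeholders. Neither is confirmed by this listing, so check both against real records. The sample record below is made up.

```python
from typing import TypedDict


class MaskSpan(TypedDict):
    # Mirrors the STRUCT fields listed in the schema.
    value: str
    start: int
    end: int
    label: str


def apply_privacy_mask(source_text: str, spans: list[MaskSpan]) -> str:
    """Replace each annotated span with a [LABEL] placeholder (assumed format)."""
    masked = source_text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        found = masked[span["start"]:span["end"]]
        if found != span["value"]:
            # Signals that the offset assumption does not hold for this record.
            raise ValueError(f"offset mismatch: expected {span['value']!r}, got {found!r}")
        masked = masked[:span["start"]] + f"[{span['label']}]" + masked[span["end"]:]
    return masked


# Made-up record for illustration only; not taken from the dataset.
record = {
    "source_text": "Contact Jane Doe at jane@example.com.",
    "privacy_mask": [
        {"value": "Jane Doe", "start": 8, "end": 16, "label": "NAME"},
        {"value": "jane@example.com", "start": 20, "end": 36, "label": "EMAIL"},
    ],
}
masked_text = apply_privacy_mask(record["source_text"], record["privacy_mask"])
print(masked_text)  # Contact [NAME] at [EMAIL].
```

The `value` check doubles as a cheap validation pass: if it raises on real records, the offsets are probably not plain character indices with an exclusive end.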
Free
Open dataset
Quality: No ratings
1 downloads
Seller: DataBazaar
For AI Agents
Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}
# 2. Your agent can then call:
search_datasets({ query: "PII Masking Dataset — 300K Lab" })
// Found: 5cc8c8c7-1064-43ea-a0bb-789ecfc8a17b
get_download_url({ dataset_id: "5cc8c8c7-1064-43ea-a0bb-789ecfc8a17b" }) // free, no API key needed
Via REST API
# Free dataset, no API key required:
curl https://api.databazaar.io/datasets/5cc8c8c7-1064-43ea-a0bb-789ecfc8a17b/download-url
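The same REST request can be made from Python using only the standard library. This is a sketch under assumptions: the listing does not document the response format, so parsing the body as JSON is a guess, and the helper names `download_url_endpoint` and `fetch_download_info` are hypothetical.

```python
import json
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "5cc8c8c7-1064-43ea-a0bb-789ecfc8a17b"


def download_url_endpoint(dataset_id: str, base: str = API_BASE) -> str:
    """Build the endpoint used in the curl example."""
    return f"{base}/datasets/{dataset_id}/download-url"


def fetch_download_info(dataset_id: str, timeout: float = 10.0):
    """GET the endpoint; treating the body as JSON is an assumption."""
    url = download_url_endpoint(dataset_id)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Only the URL is built here; no network request is made at import time.
endpoint = download_url_endpoint(DATASET_ID)
print(endpoint)
```

Calling `fetch_download_info(DATASET_ID)` performs the actual request. Inspect the returned object before relying on any particular field, since its shape is not specified here.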