Tags: text, pii, privacy, nlp, text-anonymization, ner

PII Masking Dataset — 300K Labeled Text Samples

Category: Text
Records: 225,405 rows
Format: PARQUET
Update Frequency: One-time snapshot
Collection Method: auto_imported_huggingface_federated
PII: None detected
File Size: ~565.43 MB
Downloads: 1

About this data

225,000+ text samples annotated for personally identifiable information (PII) detection and masking, built to train and evaluate models that redact sensitive data from text. Each record pairs source and target text with privacy masks, span labels, and token-level annotations across multiple languages.
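To make the relationship between the source text and the privacy mask concrete, here is a minimal sketch of how a masked string could be derived from one record. It assumes that `start` and `end` are character offsets into `source_text` with an exclusive `end`, and that a `[LABEL]` placeholder is an acceptable stand-in for the masked value; neither detail is confirmed by this listing, and the record below is invented for illustration.

```python
from typing import TypedDict


class MaskSpan(TypedDict):
    """One privacy_mask entry, mirroring the STRUCT fields in the schema."""
    value: str
    start: int
    end: int
    label: str


def apply_privacy_mask(source_text: str, spans: list[MaskSpan]) -> str:
    """Replace each annotated span with a [LABEL] placeholder.

    Assumption: start/end are character offsets and end is exclusive.
    """
    masked = source_text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        masked = masked[: span["start"]] + f"[{span['label']}]" + masked[span["end"]:]
    return masked


# Hypothetical record, invented for illustration (not taken from the dataset).
source = "Contact Jane Doe at jane@example.com."
spans: list[MaskSpan] = [
    {"value": "Jane Doe", "start": 8, "end": 16, "label": "NAME"},
    {"value": "jane@example.com", "start": 20, "end": 36, "label": "EMAIL"},
]
masked = apply_privacy_mask(source, spans)
print(masked)
```

Before relying on this, it is worth checking on a few real rows that `source_text[start:end]` equals `value`; if it does not, the offset convention differs from the one assumed here.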

Schema

Name               Type                                                                        Description
source_text        VARCHAR
target_text        VARCHAR
privacy_mask       STRUCT("value" VARCHAR, "start" BIGINT, "end" BIGINT, "label" VARCHAR)[]
span_labels        VARCHAR
mbert_text_tokens  VARCHAR[]
mbert_bio_labels   VARCHAR[]
id                 VARCHAR
language           VARCHAR
set                VARCHAR
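The two array columns, `mbert_text_tokens` and `mbert_bio_labels`, appear to be parallel lists of tokens and tags. A minimal sketch of grouping them into entities follows. It assumes the conventional BIO scheme (`B-LABEL` opens an entity, `I-LABEL` continues it, `O` is outside any entity), which the listing does not document, and it uses a made-up, whole-word tokenization rather than real mBERT word pieces.

```python
def bio_to_entities(tokens: list[str], labels: list[str]) -> list[tuple[str, list[str]]]:
    """Group parallel token/tag lists into (label, tokens) entities.

    Assumption: tags follow the usual "B-X" / "I-X" / "O" convention.
    """
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")

    entities: list[tuple[str, list[str]]] = []
    current_label = None
    current_tokens: list[str] = []

    for token, tag in zip(tokens, labels):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts here; a stray I- tag is treated like B-.
            if current_label is not None:
                entities.append((current_label, current_tokens))
            current_label, current_tokens = label, [token]
        elif prefix == "I":
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (or anything unrecognized) closes any open entity.
            if current_label is not None:
                entities.append((current_label, current_tokens))
            current_label, current_tokens = None, []

    if current_label is not None:
        entities.append((current_label, current_tokens))
    return entities


# Hypothetical tokens and tags, invented for illustration.
example_tokens = ["Contact", "Jane", "Doe", "at", "jane", "@", "example", ".", "com"]
example_labels = ["O", "B-NAME", "I-NAME", "O",
                  "B-EMAIL", "I-EMAIL", "I-EMAIL", "I-EMAIL", "I-EMAIL"]
entities = bio_to_entities(example_tokens, example_labels)
print(entities)
```

Treating a stray `I-` tag as the start of a new entity is a lenient design choice; a stricter reader could raise an error instead.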

Free

Open dataset

Quality: No ratings
Seller: DataBazaar

For AI Agents

Via MCP Server
# 1. Add to your agent's MCP config (claude_desktop_config.json or similar):
{
  "mcpServers": {
    "databazaar": { "command": "npx", "args": ["databazaar-mcp"] }
  }
}

# 2. Your agent can then call:
search_datasets({ query: "PII Masking Dataset — 300K Lab" })
// Found: 5cc8c8c7-1064-43ea-a0bb-789ecfc8a17b
get_download_url({ dataset_id: "5cc8c8c7-1064-43ea-a0bb-789ecfc8a17b" })  // free — no API key needed

Via REST API
# Free dataset — no API key required:
curl https://api.databazaar.io/datasets/5cc8c8c7-1064-43ea-a0bb-789ecfc8a17b/download-url
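
The same request can be sketched in Python with only the standard library. The endpoint path and dataset ID come from the curl command above; the helper names are hypothetical, and because the response format is not described in this listing, the sketch returns the raw response body rather than parsing it.

```python
import urllib.request

API_BASE = "https://api.databazaar.io"
DATASET_ID = "5cc8c8c7-1064-43ea-a0bb-789ecfc8a17b"


def download_url_endpoint(dataset_id: str) -> str:
    """Build the download-url endpoint shown in the curl example."""
    return f"{API_BASE}/datasets/{dataset_id}/download-url"


def fetch_download_url_response(dataset_id: str, timeout: float = 30.0) -> str:
    """Call the endpoint and return the raw response body as text.

    Assumption: the body is UTF-8 text; its structure is not documented here,
    so no parsing is attempted.
    """
    with urllib.request.urlopen(download_url_endpoint(dataset_id), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; no network request is made until
# fetch_download_url_response(DATASET_ID) is called.
endpoint = download_url_endpoint(DATASET_ID)
print(endpoint)
```

Calling `fetch_download_url_response(DATASET_ID)` performs the actual request; inspect the returned text before assuming any particular field names.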