Data Anonymization & PII Protection

Overview

Voidon includes a built-in PII (Personally Identifiable Information) anonymization system that automatically detects and redacts sensitive information from your prompts before they are sent to LLM providers. This supports GDPR compliance and helps protect user privacy.

The system uses a state-of-the-art NER (Named Entity Recognition) model trained specifically for PII detection, supporting 39 different types of sensitive data.

Why Anonymization?

When using third-party LLM providers (OpenAI, Anthropic, Google, etc.), your prompts are sent to their servers. This creates potential risks:

  • Data Leakage: Sensitive customer information exposed to external providers
  • GDPR Violations: Personal data processed without proper safeguards
  • Compliance Issues: Industry regulations (HIPAA, FINRA, etc.) may prohibit sharing certain data
  • Privacy Concerns: User trust and data sovereignty

Voidon's anonymization ensures privacy-first AI by removing PII before it leaves your infrastructure.

How It Works

Text Only
User Prompt
  ↓
[PII Detection - NER Model]
  ↓
[Redaction - Replace with Placeholders]
  ↓
Anonymized Prompt → LLM Provider
  ↓
Response
  ↓
[De-anonymization]*
  ↓
Final Response to User

* Future feature

Example:

Text Only
Input:  "Our customer Sarah Johnson (sarah.j@company.com) reported an issue with her account.
         Her phone is +1-555-0123 and she's located at 742 Evergreen Terrace, Springfield."

Output: "Our customer [a person] ([an email address]) reported an issue with her account.
         Her phone is [a telephone number] and she's located at [a location]."
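
The redaction step in the example above can be sketched in a few lines. This is a minimal illustration, not Voidon's implementation: the entity spans are hard-coded stand-ins for what the NER model would return, `find_span` and `redact` are hypothetical helper names, and the placeholder strings are copied from the examples in this document.

```python
# Illustrative only: placeholder strings are taken from the examples in this
# document; the detected spans are hard-coded stand-ins for NER model output.
PLACEHOLDERS = {
    "PERSON": "[a person]",
    "EMAIL": "[an email address]",
    "TELEPHONENUM": "[a telephone number]",
    "LOCATION": "[a location]",
}


def find_span(text, value, label):
    """Hypothetical helper: locate a known value to fake a detection."""
    start = text.index(value)
    return (start, start + len(value), label)


def redact(text, entities):
    """Replace each (start, end, label) span with its placeholder."""
    # Work from right to left so earlier offsets stay valid after each edit.
    for start, end, label in sorted(entities, reverse=True):
        text = text[:start] + PLACEHOLDERS[label] + text[end:]
    return text


prompt = "Contact Sarah Johnson (sarah.j@company.com) at +1-555-0123."
detected = [
    find_span(prompt, "Sarah Johnson", "PERSON"),
    find_span(prompt, "sarah.j@company.com", "EMAIL"),
    find_span(prompt, "+1-555-0123", "TELEPHONENUM"),
]
print(redact(prompt, detected))
# Contact [a person] ([an email address]) at [a telephone number].
```

Replacing spans from the end of the string backwards avoids having to recompute offsets after each substitution.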

Supported PII Types

Voidon supports 39 different types of sensitive information aligned with GDPR requirements and international privacy standards:

1. APIKEY

API keys and authentication tokens Example: "sk-1234567890abcdef", "AIzaSyD9X2kF3pQrL8mN"

2. BANKACCOUNT

Bank account numbers Example: "123456789", "IT60X0542811101000000123456"

3. BIOMETRIC

Biometric identification data Example: "fingerprint hash: a3f5e9...", "retina scan data"

4. BLOODTYPE

Blood type information Example: "A+", "O-", "AB+"

5. CREDITCARDNUMBER

Credit card numbers Example: "4532 1234 5678 9010", "5425 2334 3010 9903"

6. DATE

Date references Example: "January 15, 2023", "03/22/1990", "2024-12-31"

7. DRIVINGLICENSE

Driver's license numbers Example: "D1234567", "CA DL A1234567"

8. EMAIL

Email addresses Example: "john.doe@example.com", "user123@gmail.com"

9. ETHNICITY

Ethnic origin (GDPR Special Category) Example: "Hispanic", "Asian", "Caucasian"

10. FISCALCODE

Fiscal codes (e.g., Italian Codice Fiscale) Example: "RSSMRA80A01H501U" (Italian), "123-45-6789" (US SSN-style format)

11. GENDER

Gender information (sensitive personal data, though not itself a GDPR Article 9 special category) Example: "Male", "Female", "Non-binary"

12. HEALTHINSURANCE

Health insurance numbers Example: "H123456789", "Blue Cross 987654321"

13. IBAN

International Bank Account Numbers Example: "GB29 NWBK 6016 1331 9268 19", "DE89 3704 0044 0532 0130 00"

14. IPADDRESS

IP addresses (IPv4 and IPv6) Example: "192.168.1.1", "2001:0db8:85a3::8a2e:0370:7334"

15. LANGUAGE

Language identifiers Example: "English", "Spanish", "Mandarin Chinese"

16. LICENSEPLATENUM

Vehicle license plate numbers Example: "ABC-1234", "CA 7ABC123", "XX-123-YY"

17. LOCATION

Physical locations and addresses Example: "123 Main Street, New York, NY", "London, UK", "GPS: 40.7128° N, 74.0060° W"

18. MACADDRESS

MAC addresses Example: "00:1B:44:11:3A:B7", "02-1F-33-45-67-89"

19. MEDICALLICENSE

Medical license numbers Example: "MD123456", "NPI 1234567890"

20. MEDICALRECORDNUMBER

Medical record numbers Example: "MRN-789456123", "Patient ID: 456789"

21. MONEY

Monetary values Example: "$1,234.56", "€500", "¥10,000"

22. NATIONALID

National ID numbers Example: "123-45-6789" (SSN), "9876 5432 1098" (Aadhaar, 12 digits), "A12345678" (passport style)

23. ORGANIZATION

Organization and company names Example: "Microsoft Corporation", "NHS", "United Nations"

24. PASSPORTNUMBER

Passport numbers Example: "A12345678", "US123456789"

25. PASSWORD

Passwords (always redact!) Example: "P@ssw0rd123!", "MyS3cr3tP@ss"

26. PERSON

Names of individuals Example: "John Smith", "Maria Garcia", "Dr. Sarah Johnson"

27. PERSONALDOCUMENT

Personal document references Example: "Birth Certificate #BC123456", "Marriage License ML-789"

28. POLITICALAFFILIATION

Political opinions (GDPR Special Category) Example: "Democratic Party", "Conservative", "Independent"

29. RACE

Racial origin (GDPR Special Category) Example: "Black", "White", "Asian", "Pacific Islander"

30. RELIGION

Religious beliefs (GDPR Special Category) Example: "Christianity", "Islam", "Buddhism", "Atheism"

31. SEXUAL_ORIENTATION

Sexual orientation (GDPR Special Category) Example: "Heterosexual", "Homosexual", "Bisexual"

32. SSN

US Social Security Numbers Example: "123-45-6789", "***-**-6789"

33. TAXID

Tax ID numbers Example: "12-3456789" (EIN), "123456789" (TIN)

34. TELEPHONENUM

Telephone numbers Example: "+1-555-123-4567", "(555) 987-6543", "+44 20 7946 0958"

35. TIME

Time references Example: "14:30:00", "3:45 PM", "09:15 EST"

36. UNION_MEMBERSHIP

Trade union membership (GDPR Special Category) Example: "UAW Local 600", "SEIU Member #123456"

37. URL

URLs Example: "https://example.com", "www.website.org/page"

38. USERNAME

Usernames Example: "john_doe123", "@userhandle", "player_one"

39. VEHICLEIDENTIFICATION

Vehicle Identification Numbers (VIN) Example: "1HGBH41JXMN109186" (VIN), "WBADT43452G12345"
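
Since the anonymization_types request parameter takes the numeric IDs from the list above, a small client-side lookup can make requests easier to read. The sketch below assumes the position in the list is the type ID (consistent with the API examples in this document, where EMAIL, PERSON, and TELEPHONENUM are 8, 26, and 34); `PII_TYPE_IDS` and `type_ids` are hypothetical names, not part of any Voidon SDK.

```python
# Type names in the order they are listed in this document (IDs 1..39).
PII_TYPES = [
    "APIKEY", "BANKACCOUNT", "BIOMETRIC", "BLOODTYPE", "CREDITCARDNUMBER",
    "DATE", "DRIVINGLICENSE", "EMAIL", "ETHNICITY", "FISCALCODE",
    "GENDER", "HEALTHINSURANCE", "IBAN", "IPADDRESS", "LANGUAGE",
    "LICENSEPLATENUM", "LOCATION", "MACADDRESS", "MEDICALLICENSE",
    "MEDICALRECORDNUMBER", "MONEY", "NATIONALID", "ORGANIZATION",
    "PASSPORTNUMBER", "PASSWORD", "PERSON", "PERSONALDOCUMENT",
    "POLITICALAFFILIATION", "RACE", "RELIGION", "SEXUAL_ORIENTATION",
    "SSN", "TAXID", "TELEPHONENUM", "TIME", "UNION_MEMBERSHIP", "URL",
    "USERNAME", "VEHICLEIDENTIFICATION",
]

# Assumption: the list position (starting at 1) is the ID the API expects.
PII_TYPE_IDS = {name: i for i, name in enumerate(PII_TYPES, start=1)}


def type_ids(*names):
    """Translate type names into a sorted list of numeric IDs."""
    unknown = [n for n in names if n not in PII_TYPE_IDS]
    if unknown:
        raise ValueError(f"Unknown PII type(s): {unknown}")
    return sorted(PII_TYPE_IDS[n] for n in names)


print(type_ids("EMAIL", "PERSON", "TELEPHONENUM"))  # [8, 26, 34]
```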

Configuration

Via API Request

You can enable anonymization on a per-request basis using the anonymization_types and enable_anonymization parameters:

Python
import requests

response = requests.post(
    "https://api.voidon.ai/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY"
    },
    json={
        "model": "gpt-4",
        "messages": [
            {"role": "user", "content": "Generate a summary email for customer Michael Chen (m.chen@techcorp.com, +1-415-555-0199) regarding his inquiry about our enterprise plan pricing."}
        ],
        "enable_anonymization": True,
        "anonymization_types": [8, 26, 34]  # EMAIL, PERSON, TELEPHONENUM
    }
)

Parameters:

  • enable_anonymization (bool): Enable or disable anonymization for this request
  • anonymization_types (list[int]): List of PII type IDs to redact (the numbers in the Supported PII Types list above)

If these parameters are not specified, the request uses your default settings from the dashboard.
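
One way to keep the dashboard fallback explicit in client code is to leave both fields out of the request body unless the caller sets them. The helper below is a sketch under that assumption; `build_request_body` is a hypothetical name, and the 1 to 39 range check is a client-side convenience rather than documented API behavior.

```python
def build_request_body(prompt, model="gpt-4", enable_anonymization=None,
                       anonymization_types=None):
    """Build a chat completions body, omitting unset anonymization fields."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    # Leaving a field out lets the dashboard default apply.
    if enable_anonymization is not None:
        body["enable_anonymization"] = enable_anonymization
    if anonymization_types is not None:
        invalid = [t for t in anonymization_types if not 1 <= t <= 39]
        if invalid:
            raise ValueError(f"PII type IDs must be between 1 and 39: {invalid}")
        body["anonymization_types"] = sorted(set(anonymization_types))
    return body


body = build_request_body(
    "Summarize the ticket from m.chen@techcorp.com",
    enable_anonymization=True,
    anonymization_types=[34, 8, 26],  # TELEPHONENUM, EMAIL, PERSON
)
print(body["anonymization_types"])  # [8, 26, 34]
```

The resulting dictionary can be passed as the json argument of the requests.post call shown above.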

Via Dashboard

  1. Navigate to Settings → Privacy & Security
  2. Enable Anonymization: Toggle ON
  3. Select PII Types: Choose which categories to redact:
       • Custom: Select individual types
  4. Save Settings

Your settings apply to all requests unless overridden per-request.

API Examples

Customer Support Scenario

Python
# Request
{
  "model": "gpt-4",
  "messages": [
    {"role": "user", "content": "Summarize this support ticket: Customer Emma Watson (emma.w@gmail.com) from London called at 14:30 reporting login issues. Her account ID is ACC-28471 and phone is +44-20-7946-0958."}
  ],
  "enable_anonymization": True,
  "anonymization_types": [8, 17, 26, 34, 35]  # EMAIL, LOCATION, PERSON, TELEPHONENUM, TIME
}

# Sent to LLM provider:
"Summarize this support ticket: Customer [a person] ([an email address]) from [a location] called at [a time] reporting login issues. Her account ID is ACC-28471 and phone is [a telephone number]."

HR/Payroll Processing

Python
{
  "model": "claude-3-sonnet",
  "messages": [
    {"role": "user", "content": "Process payroll for employee Robert Martinez (SSN: 456-78-9012, robert.m@company.com). Payment of $4,850 to account IT60X0542811101000000123456 on March 15, 2024."}
  ],
  "enable_anonymization": True,
  "anonymization_types": [2, 6, 8, 21, 32]  # BANKACCOUNT, DATE, EMAIL, MONEY, SSN
}

# Result:
"Process payroll for employee Robert Martinez (SSN: [a social security number], [an email address]). Payment of [a monetary value] to account [a bank account] on [a date]."

Healthcare/Medical Records

Python
{
  "model": "auto",  # Auto-routing also works
  "messages": [
    {"role": "user", "content": "Generate care plan summary for patient David Kumar (MRN: MRN-2024-5847, blood type A+) at Mayo Clinic, Boston. Insurance: Blue Cross #H987654321. Emergency contact: wife Priya at +1-617-555-8423."}
  ],
  "enable_anonymization": True,
  "anonymization_types": [
    4,   # BLOODTYPE
    12,  # HEALTHINSURANCE
    17,  # LOCATION
    20,  # MEDICALRECORDNUMBER
    23,  # ORGANIZATION
    26,  # PERSON
    34   # TELEPHONENUM
  ]
}

# Result:
"Generate care plan summary for patient [a person] (MRN: [a medical record number], blood type [a blood type]) at [an organization], [a location]. Insurance: [a health insurance number]. Emergency contact: wife [a person] at [a telephone number]."

Disable for Specific Request

Even if anonymization is enabled by default in the dashboard, you can turn it off for a single request:

Python
{
  "model": "gpt-4",
  "messages": [...],
  "enable_anonymization": False  # Override dashboard setting
}

Best Practices

What to Anonymize

Always redact:

  • SSN, Fiscal Codes, Tax IDs
  • Credit card numbers, bank accounts
  • Passwords, API keys
  • Medical records, health insurance numbers
  • Biometric data

Consider redacting:

  • Names (unless essential for context)
  • Email addresses
  • Phone numbers
  • Physical addresses
  • Dates of birth

Usually safe to keep:

  • Generic locations (city names without addresses)
  • Organizations (unless sensitive)
  • Dates (unless paired with individuals)

Performance Tips

  1. Select only needed types: Don't enable all 39 types if you only need 5
  2. Chunking: Long documents are automatically chunked (configured via ANONYMIZER_MODEL_CTX)
  3. Caching: Identical text is cached for 5 minutes (no re-processing)
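
To illustrate what a five-minute cache for identical text means in practice, here is a minimal time-to-live cache keyed by a hash of the input. It is a sketch of the general technique only: Voidon's caching happens on its side and its actual implementation is not described here, and `TTLCache` and `fake_anonymize` are hypothetical names.

```python
import hashlib
import time


class TTLCache:
    """Cache anonymized text for a fixed number of seconds."""

    def __init__(self, ttl_seconds=300, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock          # injectable so expiry can be tested
        self._entries = {}          # digest -> (expiry_time, anonymized_text)

    @staticmethod
    def _key(text):
        # Hash the text so the raw prompt is not used as the dictionary key.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def get_or_compute(self, text, anonymize):
        key = self._key(text)
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]           # identical text seen recently
        result = anonymize(text)
        self._entries[key] = (now + self.ttl, result)
        return result


calls = []


def fake_anonymize(text):
    """Stand-in for the real anonymizer; records how often it runs."""
    calls.append(text)
    return text.replace("Sarah", "[a person]")


now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
cache.get_or_compute("Hi Sarah", fake_anonymize)
cache.get_or_compute("Hi Sarah", fake_anonymize)   # served from the cache
now[0] = 301.0
cache.get_or_compute("Hi Sarah", fake_anonymize)   # expired, recomputed
print(len(calls))  # 2
```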

GDPR Compliance

As a minimum baseline for GDPR-oriented deployments, enable these types:

  • PERSON (26)
  • EMAIL (8)
  • TELEPHONENUM (34)
  • LOCATION (17)
  • FISCALCODE (10) / SSN (32)
  • GENDER (11)
  • RELIGION (30)
  • SEXUAL_ORIENTATION (31)
  • ETHNICITY (9)
  • RACE (29)

Or use the "GDPR Full" preset.
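
When types are selected by hand rather than through the preset, a quick check can flag which of the minimum types are still missing. This is a sketch: `missing_gdpr_types` is a hypothetical helper, and it reads "FISCALCODE (10) / SSN (32)" as "at least one of the two", which is an assumption about the intended meaning.

```python
# Minimum types named in this document, keyed by their numeric IDs.
GDPR_MINIMUM = {
    8: "EMAIL", 9: "ETHNICITY", 11: "GENDER", 17: "LOCATION",
    26: "PERSON", 29: "RACE", 30: "RELIGION",
    31: "SEXUAL_ORIENTATION", 34: "TELEPHONENUM",
}
# Assumption: the "/" means at least one of these two must be enabled.
NATIONAL_CODE = {10: "FISCALCODE", 32: "SSN"}


def missing_gdpr_types(enabled_ids):
    """Return the names of minimum types absent from enabled_ids."""
    enabled = set(enabled_ids)
    missing = [name for type_id, name in sorted(GDPR_MINIMUM.items())
               if type_id not in enabled]
    if enabled.isdisjoint(NATIONAL_CODE):
        missing.append("FISCALCODE or SSN")
    return missing


print(missing_gdpr_types([8, 26, 34]))
# ['ETHNICITY', 'GENDER', 'LOCATION', 'RACE', 'RELIGION',
#  'SEXUAL_ORIENTATION', 'FISCALCODE or SSN']
```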

Limitations

  1. Accuracy: The NER model achieves an F1 score of roughly 95-98%, so some PII may go undetected, especially in edge cases.
  2. Context loss: Heavy redaction may reduce LLM response quality.
  3. Language support: Optimized for European languages (EN/IT/ES/DE/FR). Other languages may have lower accuracy.
  4. De-anonymization: Placeholders in the LLM response are currently NOT replaced with the original values (future feature).

Troubleshooting

PII not detected:

  • Check the language: the model is optimized for EN/IT/ES/DE/FR
  • Verify that the PII type is enabled in your settings
  • Use the Test tool in the dashboard to debug

Too much redaction:

  • Reduce the number of enabled PII types