Data Anonymization & PII Protection¶

Overview¶

Voidon includes a built-in PII (Personally Identifiable Information) anonymization system that automatically detects and redacts sensitive information in your prompts before they are sent to LLM providers. This supports GDPR compliance and helps protect user privacy.

The system uses an NER (Named Entity Recognition) model trained specifically for PII detection, supporting 39 different types of sensitive data.

Why Anonymization?¶

When using third-party LLM providers (OpenAI, Anthropic, Google, etc.), your prompts are sent to their servers. This creates potential risks:

Data Leakage: Sensitive customer information exposed to external providers
GDPR Violations: Personal data processed without proper safeguards
Compliance Issues: Industry regulations (HIPAA, FINRA, etc.) may prohibit sharing certain data
Privacy Concerns: Erosion of user trust and loss of data sovereignty

Voidon's anonymization reduces these risks by redacting detected PII before the prompt leaves your infrastructure.

How It Works¶

User Prompt
    ↓
[PII Detection - NER Model]
    ↓
[Redaction - Replace with Placeholders]
    ↓
Anonymized Prompt → LLM Provider
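    ↓
Response*
    ↓
Final Response to User

* Placeholders in the response are not yet replaced with the original values; de-anonymization is a future feature.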

Example:

Input:  "Our customer Sarah Johnson (sarah.j@company.com) reported an issue with her account.
         Her phone is +1-555-0123 and she's located at 742 Evergreen Terrace, Springfield."

Output: "Our customer [a person] ([an email address]) reported an issue with her account.
         Her phone is [a telephone number] and she's located at [a location]."
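The redaction step in the pipeline can be sketched in a few lines. This is a minimal illustration, not Voidon's implementation: the `Entity` type, the `redact` and `locate` helpers, and the `PLACEHOLDERS` mapping are hypothetical names, and the NER model is replaced by an exact-match lookup so the example stays self-contained. Only the placeholder wording is taken from the example above.

```python
from typing import NamedTuple

# Placeholder wording copied from the example above; the mapping is an assumption.
PLACEHOLDERS = {
    "PERSON": "[a person]",
    "EMAIL": "[an email address]",
    "TELEPHONENUM": "[a telephone number]",
    "LOCATION": "[a location]",
}


class Entity(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str  # PII type name, e.g. "EMAIL"


def redact(text: str, entities: list[Entity]) -> str:
    """Replace each detected span with the placeholder for its PII type."""
    # Work right to left so earlier offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e.start, reverse=True):
        text = text[:ent.start] + PLACEHOLDERS[ent.label] + text[ent.end:]
    return text


def locate(text: str, value: str, label: str) -> Entity:
    """Stand-in for the NER model: find a known value by exact match."""
    start = text.index(value)
    return Entity(start, start + len(value), label)


prompt = "Our customer Sarah Johnson (sarah.j@company.com) reported an issue."
entities = [
    locate(prompt, "Sarah Johnson", "PERSON"),
    locate(prompt, "sarah.j@company.com", "EMAIL"),
]
print(redact(prompt, entities))
```

Replacing spans from right to left avoids recomputing offsets after each substitution, which matters because placeholders rarely have the same length as the text they replace.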

Supported PII Types¶

Voidon supports 39 different types of sensitive information aligned with GDPR requirements and international privacy standards:

1. APIKEY¶

API keys and authentication tokens Example: "sk-1234567890abcdef", "AIzaSyD9X2kF3pQrL8mN"

2. BANKACCOUNT¶

Bank account numbers Example: "123456789", "IT60X0542811101000000123456"

3. BIOMETRIC¶

Biometric identification data Example: "fingerprint hash: a3f5e9...", "retina scan data"

4. BLOODTYPE¶

Blood type information Example: "A+", "O-", "AB+"

5. CREDITCARDNUMBER¶

Credit card numbers Example: "4532 1234 5678 9010", "5425 2334 3010 9903"

6. DATE¶

Date references Example: "January 15, 2023", "03/22/1990", "2024-12-31"

7. DRIVINGLICENSE¶

Driver's license numbers Example: "D1234567", "CA DL A1234567"

8. EMAIL¶

Email addresses Example: "john.doe@example.com", "user123@gmail.com"

9. ETHNICITY¶

Ethnic origin (GDPR Special Category) Example: "Hispanic", "Asian", "Caucasian"

10. FISCALCODE¶

Fiscal codes (e.g., Italian Codice Fiscale) Example: "RSSMRA80A01H501U" (Italian), "123-45-6789" (US SSN-style format)

11. GENDER¶

Gender information Example: "Male", "Female", "Non-binary"

12. HEALTHINSURANCE¶

Health insurance numbers Example: "H123456789", "Blue Cross 987654321"

13. IBAN¶

International Bank Account Numbers Example: "GB29 NWBK 6016 1331 9268 19", "DE89 3704 0044 0532 0130 00"

14. IPADDRESS¶

IP addresses (IPv4 and IPv6) Example: "192.168.1.1", "2001:0db8:85a3::8a2e:0370:7334"

15. LANGUAGE¶

Language identifiers Example: "English", "Spanish", "Mandarin Chinese"

16. LICENSEPLATENUM¶

Vehicle license plate numbers Example: "ABC-1234", "CA 7ABC123", "XX-123-YY"

17. LOCATION¶

Physical locations and addresses Example: "123 Main Street, New York, NY", "London, UK", "GPS: 40.7128° N, 74.0060° W"

18. MACADDRESS¶

MAC addresses Example: "00:1B:44:11:3A:B7", "02-1F-33-45-67-89"

19. MEDICALLICENSE¶

Medical license numbers Example: "MD123456", "NPI 1234567890"

20. MEDICALRECORDNUMBER¶

Medical record numbers Example: "MRN-789456123", "Patient ID: 456789"

21. MONEY¶

Monetary values Example: "$1,234.56", "€500", "¥10,000"

22. NATIONALID¶

National ID numbers Example: "123-45-6789" (SSN), "9876543210" (Aadhaar), "A12345678" (passport style)

23. ORGANIZATION¶

Organization and company names Example: "Microsoft Corporation", "NHS", "United Nations"

24. PASSPORTNUMBER¶

Passport numbers Example: "A12345678", "US123456789"

25. PASSWORD¶

Passwords (always redact!) Example: "P@ssw0rd123!", "MyS3cr3tP@ss"

26. PERSON¶

Names of individuals Example: "John Smith", "Maria Garcia", "Dr. Sarah Johnson"

27. PERSONALDOCUMENT¶

Personal document references Example: "Birth Certificate #BC123456", "Marriage License ML-789"

28. POLITICALAFFILIATION¶

Political opinions (GDPR Special Category) Example: "Democratic Party", "Conservative", "Independent"

29. RACE¶

Racial origin (GDPR Special Category) Example: "Black", "White", "Asian", "Pacific Islander"

30. RELIGION¶

Religious beliefs (GDPR Special Category) Example: "Christianity", "Islam", "Buddhism", "Atheism"

31. SEXUAL_ORIENTATION¶

Sexual orientation (GDPR Special Category) Example: "Heterosexual", "Homosexual", "Bisexual"

32. SSN¶

US Social Security Numbers Example: "123-45-6789", "***-**-6789"

33. TAXID¶

Tax ID numbers Example: "12-3456789" (EIN), "123456789" (TIN)

34. TELEPHONENUM¶

Telephone numbers Example: "+1-555-123-4567", "(555) 987-6543", "+44 20 7946 0958"

35. TIME¶

Time references Example: "14:30:00", "3:45 PM", "09:15 EST"

36. UNION_MEMBERSHIP¶

Trade union membership (GDPR Special Category) Example: "UAW Local 600", "SEIU Member #123456"

37. URL¶

URLs Example: "https://example.com", "www.website.org/page"

38. USERNAME¶

Usernames Example: "john_doe123", "@userhandle", "player_one"

39. VEHICLEIDENTIFICATION¶

Vehicle Identification Numbers (VIN) Example: "1HGBH41JXMN109186" (VIN), "WBADT43452G12345"

Configuration¶

Via API Request¶

You can enable anonymization on a per-request basis using the anonymization_types and enable_anonymization parameters:

Python
import requests

response = requests.post(
    "https://api.voidon.ai/v1/chat/completions",
    headers={
        "Authorization": "Bearer YOUR_API_KEY"
    },
    json={
        "model": "gpt-4",
        "messages": [
            {"role": "user", "content": "Generate a summary email for customer Michael Chen (m.chen@techcorp.com, +1-415-555-0199) regarding his inquiry about our enterprise plan pricing."}
        ],
        "enable_anonymization": True,
        "anonymization_types": [8, 26, 34]  # EMAIL, PERSON, TELEPHONENUM
    }
)

Parameters:

- enable_anonymization (bool): Enable or disable anonymization for this request
- anonymization_types (list[int]): List of PII type IDs to redact (the numbers shown under Supported PII Types above)

If these parameters are omitted, your default settings from the dashboard are used.
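Because anonymization_types takes numeric IDs, a small client-side lookup can make requests easier to read. The sketch below assumes the IDs match the numbering under Supported PII Types; `PIIType` and `type_ids` are hypothetical helper names and are not part of the Voidon API.

```python
from enum import IntEnum

# Names in the order of the Supported PII Types list, so values run from 1 to 39.
PIIType = IntEnum("PIIType", [
    "APIKEY", "BANKACCOUNT", "BIOMETRIC", "BLOODTYPE", "CREDITCARDNUMBER",
    "DATE", "DRIVINGLICENSE", "EMAIL", "ETHNICITY", "FISCALCODE",
    "GENDER", "HEALTHINSURANCE", "IBAN", "IPADDRESS", "LANGUAGE",
    "LICENSEPLATENUM", "LOCATION", "MACADDRESS", "MEDICALLICENSE",
    "MEDICALRECORDNUMBER", "MONEY", "NATIONALID", "ORGANIZATION",
    "PASSPORTNUMBER", "PASSWORD", "PERSON", "PERSONALDOCUMENT",
    "POLITICALAFFILIATION", "RACE", "RELIGION", "SEXUAL_ORIENTATION",
    "SSN", "TAXID", "TELEPHONENUM", "TIME", "UNION_MEMBERSHIP",
    "URL", "USERNAME", "VEHICLEIDENTIFICATION",
])


def type_ids(*names: str) -> list[int]:
    """Translate PII type names into the sorted integer IDs used in requests."""
    return sorted(PIIType[name].value for name in names)


payload_fragment = {
    "enable_anonymization": True,
    "anonymization_types": type_ids("EMAIL", "PERSON", "TELEPHONENUM"),
}
print(payload_fragment["anonymization_types"])  # [8, 26, 34]
```

An unknown name raises a KeyError, which surfaces typos before the request is sent.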

Via Dashboard¶

1. Navigate to Settings → Privacy & Security
2. Enable Anonymization: Toggle ON
3. Select PII Types: Choose which categories to redact
   - Custom: Select individual types
4. Save Settings

Your settings apply to all requests unless overridden per-request.
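The precedence rule (per-request parameters override dashboard defaults) can be modeled as follows. This is only a sketch of the documented behavior, not Voidon's server code: `effective_settings` is a hypothetical function, and it assumes each parameter falls back to its default independently when omitted.

```python
from typing import Any


def effective_settings(request: dict[str, Any], defaults: dict[str, Any]) -> dict[str, Any]:
    """Per-request values win; omitted keys fall back to the dashboard defaults."""
    keys = ("enable_anonymization", "anonymization_types")
    return {key: request.get(key, defaults[key]) for key in keys}


dashboard_defaults = {"enable_anonymization": True, "anonymization_types": [8, 26, 34]}

# A request that says nothing about anonymization: the defaults apply.
print(effective_settings({"model": "gpt-4"}, dashboard_defaults))

# A request that opts out: the explicit False overrides the dashboard setting.
print(effective_settings({"enable_anonymization": False}, dashboard_defaults))
```

Using `dict.get` with a default, rather than `or`, keeps an explicit `False` from being mistaken for a missing value.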

API Examples¶

Customer Support Scenario¶

Python
# Request
{
  "model": "gpt-4",
  "messages": [
    {"role": "user", "content": "Summarize this support ticket: Customer Emma Watson (emma.w@gmail.com) from London called at 14:30 reporting login issues. Her account ID is ACC-28471 and phone is +44-20-7946-0958."}
  ],
  "enable_anonymization": True,
  "anonymization_types": [8, 17, 26, 34, 35]  # EMAIL, LOCATION, PERSON, TELEPHONENUM, TIME
}

# Sent to LLM provider:
"Summarize this support ticket: Customer [a person] ([an email address]) from [a location] called at [a time] reporting login issues. Her account ID is ACC-28471 and phone is [a telephone number]."

HR/Payroll Processing¶

Python
{
  "model": "claude-3-sonnet",
  "messages": [
    {"role": "user", "content": "Process payroll for employee Robert Martinez (SSN: 456-78-9012, robert.m@company.com). Payment of $4,850 to account IT60X0542811101000000123456 on March 15, 2024."}
  ],
  "enable_anonymization": True,
  "anonymization_types": [2, 6, 8, 21, 32]  # BANKACCOUNT, DATE, EMAIL, MONEY, SSN
}

# Sent to LLM provider (the name is kept because PERSON (26) is not in anonymization_types):
"Process payroll for employee Robert Martinez (SSN: [a social security number], [an email address]). Payment of [a monetary value] to account [a bank account] on [a date]."

Healthcare/Medical Records¶

Python
{
  "model": "auto",  # Auto-routing also works
  "messages": [
    {"role": "user", "content": "Generate care plan summary for patient David Kumar (MRN: MRN-2024-5847, blood type A+) at Mayo Clinic, Boston. Insurance: Blue Cross #H987654321. Emergency contact: wife Priya at +1-617-555-8423."}
  ],
  "enable_anonymization": True,
  "anonymization_types": [
    4,   # BLOODTYPE
    12,  # HEALTHINSURANCE
    17,  # LOCATION
    20,  # MEDICALRECORDNUMBER
    23,  # ORGANIZATION
    26,  # PERSON
    34   # TELEPHONENUM
  ]
}

# Sent to LLM provider:
"Generate care plan summary for patient [a person] (MRN: [a medical record number], blood type [a blood type]) at [an organization], [a location]. Insurance: [a health insurance number]. Emergency contact: wife [a person] at [a telephone number]."

Disable for Specific Request¶

Even if anonymization is enabled by default in the dashboard, you can turn it off for a single request:

Python
{
  "model": "gpt-4",
  "messages": [...],
  "enable_anonymization": False  # Override dashboard setting
}

Best Practices¶

What to Anonymize¶

Always redact:

- SSN, Fiscal Codes, Tax IDs
- Credit card numbers, bank accounts
- Passwords, API keys
- Medical records, health insurance numbers
- Biometric data

Consider redacting:

- Names (unless essential for context)
- Email addresses
- Phone numbers
- Physical addresses
- Dates of birth

Usually safe to keep:

- Generic locations (city names without addresses)
- Organizations (unless sensitive)
- Dates (unless paired with individuals)
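These tiers can be turned into reusable ID lists. The sketch below maps each recommendation to the closest entry under Supported PII Types; the tier names and the `types_for` helper are hypothetical, and mapping "dates of birth" to DATE (6) is an assumption, since there is no dedicated date-of-birth type.

```python
# IDs taken from the Supported PII Types list; the grouping is illustrative.
ALWAYS_REDACT = {
    "SSN": 32, "FISCALCODE": 10, "TAXID": 33,
    "CREDITCARDNUMBER": 5, "BANKACCOUNT": 2,
    "PASSWORD": 25, "APIKEY": 1,
    "MEDICALRECORDNUMBER": 20, "HEALTHINSURANCE": 12,
    "BIOMETRIC": 3,
}

# DATE stands in for "dates of birth" (assumption).
CONSIDER_REDACTING = {
    "PERSON": 26, "EMAIL": 8, "TELEPHONENUM": 34, "LOCATION": 17, "DATE": 6,
}


def types_for(strict: bool) -> list[int]:
    """Return the 'always' tier, plus the 'consider' tier when strict is True."""
    ids = set(ALWAYS_REDACT.values())
    if strict:
        ids |= set(CONSIDER_REDACTING.values())
    return sorted(ids)


print(types_for(strict=False))
print(types_for(strict=True))
```

Either list can be passed directly as anonymization_types in a request.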

Performance Tips¶

Select only needed types: Don't enable all 39 types if you only need 5
Chunking: Long documents are automatically chunked (configured via ANONYMIZER_MODEL_CTX)
Caching: Identical text is cached for 5 minutes (no re-processing)
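To illustrate the caching behavior, here is a minimal time-to-live cache keyed by a hash of the input text. It is a sketch under stated assumptions rather than Voidon's implementation: `TTLCache`, `get_or_compute`, and `fake_anonymize` are hypothetical names, and only the 5-minute window comes from the text above. A production cache would probably also need the enabled PII types in its key.

```python
import hashlib
import time
from typing import Callable


class TTLCache:
    """Cache anonymized text by content hash for a fixed time window."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, str]] = {}

    def get_or_compute(self, text: str, anonymize: Callable[[str], str]) -> str:
        # Key on a hash so the raw text is not stored as a dictionary key.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        now = self.clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]  # still fresh: skip re-processing
        result = anonymize(text)
        self._entries[key] = (now, result)
        return result


calls: list[str] = []


def fake_anonymize(text: str) -> str:
    """Stand-in for the NER pipeline that records each invocation."""
    calls.append(text)
    return text.replace("Sarah Johnson", "[a person]")


now = [0.0]  # controllable clock for the demonstration
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
cache.get_or_compute("Hi Sarah Johnson", fake_anonymize)  # computed
cache.get_or_compute("Hi Sarah Johnson", fake_anonymize)  # served from cache
now[0] = 301.0
cache.get_or_compute("Hi Sarah Johnson", fake_anonymize)  # expired, recomputed
print(len(calls))  # 2
```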

For GDPR-oriented deployments, enable at least these types:

- PERSON (26)
- EMAIL (8)
- TELEPHONENUM (34)
- LOCATION (17)
- FISCALCODE (10) / SSN (32)
- GENDER (11)
- RELIGION (30)
- SEXUAL_ORIENTATION (31)
- ETHNICITY (9)
- RACE (29)

Or use the "GDPR Full" preset.
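If you set types per request rather than through the preset, the minimum list above can be merged into a request body. In this sketch `GDPR_MINIMUM_TYPES` and `with_gdpr_minimum` are hypothetical names, and including both FISCALCODE and SSN (listed above as alternatives) is an assumption.

```python
# IDs copied from the minimum list above.
GDPR_MINIMUM_TYPES = {
    "PERSON": 26, "EMAIL": 8, "TELEPHONENUM": 34, "LOCATION": 17,
    "FISCALCODE": 10, "SSN": 32, "GENDER": 11, "RELIGION": 30,
    "SEXUAL_ORIENTATION": 31, "ETHNICITY": 9, "RACE": 29,
}


def with_gdpr_minimum(payload: dict) -> dict:
    """Return a copy of the payload with anonymization on and the minimum types merged in."""
    merged = set(payload.get("anonymization_types", [])) | set(GDPR_MINIMUM_TYPES.values())
    return {**payload, "enable_anonymization": True, "anonymization_types": sorted(merged)}


request_body = with_gdpr_minimum(
    {"model": "gpt-4", "messages": [], "anonymization_types": [21]}  # MONEY already requested
)
print(request_body["anonymization_types"])
```

Merging rather than overwriting keeps any extra types the caller already asked for.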

Limitations¶

Accuracy: The NER model has an F1 score of roughly 95-98%, so some PII may go undetected, especially in edge cases.
Context loss: Heavy redaction may reduce LLM response quality.
Language support: Optimized for European languages (EN/IT/ES/DE/FR). Other languages may have lower accuracy.
De-anonymization: Placeholders in the response are currently NOT replaced with the original values (future feature).
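Since placeholders are not restored, a response may still contain text such as "[a person]". A client can at least detect this before showing the response to a user. The sketch below assumes that every placeholder follows the lowercase "[a ...]" or "[an ...]" pattern seen in the examples in this document; `PLACEHOLDER_RE` and `find_placeholders` are hypothetical names.

```python
import re

# Matches bracketed placeholders such as "[a person]" or "[an email address]".
PLACEHOLDER_RE = re.compile(r"\[(?:a|an) [a-z ]+\]")


def find_placeholders(text: str) -> list[str]:
    """Return every placeholder found in the text, in order of appearance."""
    return PLACEHOLDER_RE.findall(text)


reply = "I have drafted a reply to [a person] and will send it to [an email address]."
print(find_placeholders(reply))  # ['[a person]', '[an email address]']
```

The application can then decide whether to fill the placeholders from its own records or present the text as it is.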

Troubleshooting¶

PII not detected:

- Check language: Model optimized for EN/IT/ES/DE/FR
- Verify PII type is enabled in settings
- Use Test tool in dashboard to debug

Too much redaction:

- Reduce enabled PII types