Data Anonymization & PII Protection¶
Overview¶
Voidon includes a built-in PII (Personally Identifiable Information) anonymization system that automatically detects and redacts sensitive information from your prompts before sending them to LLM providers. This ensures GDPR compliance and protects user privacy.
The system uses a state-of-the-art NER (Named Entity Recognition) model trained specifically for PII detection, supporting 39 different types of sensitive data.
Why Anonymization?¶
When using third-party LLM providers (OpenAI, Anthropic, Google, etc.), your prompts are sent to their servers. This creates potential risks:
- Data Leakage: Sensitive customer information exposed to external providers
- GDPR Violations: Personal data processed without proper safeguards
- Compliance Issues: Industry regulations (HIPAA, FINRA, etc.) may prohibit sharing certain data
- Privacy Concerns: User trust and data sovereignty
Voidon's anonymization ensures privacy-first AI by removing PII before it leaves your infrastructure.
How It Works¶
| Text Only | |
|---|---|
Example:
Supported PII Types¶
Voidon supports 39 different types of sensitive information aligned with GDPR requirements and international privacy standards:
1. APIKEY¶
API keys and authentication tokens Example: "sk-1234567890abcdef", "AIzaSyD9X2kF3pQrL8mN"
2. BANKACCOUNT¶
Bank account numbers Example: "123456789", "IT60X0542811101000000123456"
3. BIOMETRIC¶
Biometric identification data Example: "fingerprint hash: a3f5e9...", "retina scan data"
4. BLOODTYPE¶
Blood type information Example: "A+", "O-", "AB+"
5. CREDITCARDNUMBER¶
Credit card numbers Example: "4532 1234 5678 9010", "5425 2334 3010 9903"
6. DATE¶
Date references Example: "January 15, 2023", "03/22/1990", "2024-12-31"
7. DRIVINGLICENSE¶
Driver's license numbers Example: "D1234567", "CA DL A1234567"
8. EMAIL¶
Email addresses Example: "john.doe@example.com", "user123@gmail.com"
9. ETHNICITY¶
Ethnic origin (GDPR Special Category) Example: "Hispanic", "Asian", "Caucasian"
10. FISCALCODE¶
Fiscal codes (e.g., Italian Codice Fiscale) Example: "RSSMRA80A01H501U" (Italian), "123-45-6789" (US SSN format in some countries)
11. GENDER¶
Gender information (GDPR Special Category) Example: "Male", "Female", "Non-binary"
12. HEALTHINSURANCE¶
Health insurance numbers Example: "H123456789", "Blue Cross 987654321"
13. IBAN¶
International Bank Account Numbers Example: "GB29 NWBK 6016 1331 9268 19", "DE89 3704 0044 0532 0130 00"
14. IPADDRESS¶
IP addresses (IPv4 and IPv6) Example: "192.168.1.1", "2001:0db8:85a3::8a2e:0370:7334"
15. LANGUAGE¶
Language identifiers Example: "English", "Spanish", "Mandarin Chinese"
16. LICENSEPLATENUM¶
Vehicle license plate numbers Example: "ABC-1234", "CA 7ABC123", "XX-123-YY"
17. LOCATION¶
Physical locations and addresses Example: "123 Main Street, New York, NY", "London, UK", "GPS: 40.7128° N, 74.0060° W"
18. MACADDRESS¶
MAC addresses Example: "00:1B:44:11:3A:B7", "02-1F-33-45-67-89"
19. MEDICALLICENSE¶
Medical license numbers Example: "MD123456", "NPI 1234567890"
20. MEDICALRECORDNUMBER¶
Medical record numbers Example: "MRN-789456123", "Patient ID: 456789"
21. MONEY¶
Monetary values Example: "$1,234.56", "€500", "¥10,000"
22. NATIONALID¶
National ID numbers Example: "123-45-6789" (SSN), "9876543210" (Aadhaar), "A12345678" (passport style)
23. ORGANIZATION¶
Organization and company names Example: "Microsoft Corporation", "NHS", "United Nations"
24. PASSPORTNUMBER¶
Passport numbers Example: "A12345678", "US123456789"
25. PASSWORD¶
Passwords (always redact!) Example: "P@ssw0rd123!", "MyS3cr3tP@ss"
26. PERSON¶
Names of individuals Example: "John Smith", "Maria Garcia", "Dr. Sarah Johnson"
27. PERSONALDOCUMENT¶
Personal document references Example: "Birth Certificate #BC123456", "Marriage License ML-789"
28. POLITICALAFFILIATION¶
Political opinions (GDPR Special Category) Example: "Democratic Party", "Conservative", "Independent"
29. RACE¶
Racial origin (GDPR Special Category) Example: "Black", "White", "Asian", "Pacific Islander"
30. RELIGION¶
Religious beliefs (GDPR Special Category) Example: "Christianity", "Islam", "Buddhism", "Atheism"
31. SEXUAL_ORIENTATION¶
Sexual orientation (GDPR Special Category) Example: "Heterosexual", "Homosexual", "Bisexual"
32. SSN¶
US Social Security Numbers Example: "123-45-6789", "***-**-6789"
33. TAXID¶
Tax ID numbers Example: "12-3456789" (EIN), "123456789" (TIN)
34. TELEPHONENUM¶
Telephone numbers Example: "+1-555-123-4567", "(555) 987-6543", "+44 20 7946 0958"
35. TIME¶
Time references Example: "14:30:00", "3:45 PM", "09:15 EST"
36. UNION_MEMBERSHIP¶
Trade union membership (GDPR Special Category) Example: "UAW Local 600", "SEIU Member #123456"
37. URL¶
URLs Example: "https://example.com", "www.website.org/page"
38. USERNAME¶
Usernames Example: "john_doe123", "@userhandle", "player_one"
39. VEHICLEIDENTIFICATION¶
Vehicle Identification Numbers (VIN) Example: "1HGBH41JXMN109186" (VIN), "WBADT43452G12345"
Configuration¶
Via API Request¶
You can enable anonymization on a per-request basis using the anonymization_types and enable_anonymization parameters:
Parameters: - enable_anonymization (bool): Enable/disable anonymization for this request - anonymization_types (list[int]): List of PII type IDs to redact (see table above)
If not specified, uses your default settings from the dashboard.
Via Dashboard¶
- Navigate to Settings → Privacy & Security
- Enable Anonymization: Toggle ON
- Select PII Types: Choose which categories to redact:
- Custom: Select individual types
- Save Settings
Your settings apply to all requests unless overridden per-request.
API Examples¶
Customer Support Scenario¶
HR/Payroll Processing¶
Healthcare/Medical Records¶
Disable for Specific Request¶
Even if anonymization is enabled by default in dashboard:
| Python | |
|---|---|
Best Practices¶
What to Anonymize¶
Always redact: - SSN, Fiscal Codes, Tax IDs - Credit card numbers, bank accounts - Passwords, API keys - Medical records, health insurance numbers - Biometric data
Consider redacting: - Names (unless essential for context) - Email addresses - Phone numbers - Physical addresses - Dates of birth
Usually safe to keep: - Generic locations (city names without addresses) - Organizations (unless sensitive) - Dates (unless paired with individuals)
Performance Tips¶
- Select only needed types: Don't enable all 39 types if you only need 5
- Chunking: Long documents are automatically chunked (configured via
ANONYMIZER_MODEL_CTX) - Caching: Identical text is cached for 5 minutes (no re-processing)
GDPR Compliance¶
To be GDPR-compliant, enable these minimum types: - PERSON (26) - EMAIL (8) - TELEPHONENUM (34) - LOCATION (17) - FISCALCODE (10) / SSN (32) - GENDER (11) - RELIGION (30) - SEXUAL_ORIENTATION (31) - ETHNICITY (9) - RACE (29)
Or use the "GDPR Full" preset.
Limitations¶
- Accuracy: NER model has ~95-98% F1 score. Some edge cases may be missed.
- Context loss: Heavy redaction may reduce LLM response quality
- Language support: Optimized for European Languages. Other languages may have lower accuracy.
- De-anonymization: Currently placeholders are NOT replaced back (future feature)
Troubleshooting¶
PII not detected: - Check language: Model optimized for EN/IT/ES/DE/FR - Verify PII type is enabled in settings - Use Test tool in dashboard to debug
Too much redaction: - Reduce enabled PII types