AI technologies bring new opportunities to locate and understand personal data accurately and at scale. The recent wave of privacy legislation around the world has introduced new challenges for experts in litigation support, incident response and auditing. How can modern automation help with reliable PII discovery across emails, files, and databases?
3 Reasons Keywords Fail
- High cost. Keywords and regular expressions (regexps) are difficult to define and maintain, yet still produce a lot of errors. False positives (returned PII that’s not really PII) and false negatives (real PII that was missed) undermine user trust and require additional manual reviews – hard to say which is more expensive!
- Failure to capture so-called open-set categories, which include people’s full names, home addresses or profile photos. How would you even begin to construct keywords or regexps to catch all names? There’s no way, even in theory, to capture these critical PII types with a fixed ruleset.
- Conceptual PII. New regulations bring new requirements on what constitutes personal and protected information. What if people discuss their sexual orientation or medical problems “in prose”, without any technical jargon?
This is sensitive information of the highest severity, yet no set of keywords can help. The overall document context, including who is saying what about whom, is absolutely critical.
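The failure modes above can be seen in a minimal Python sketch. The `SSN_RULE` pattern, the `find_ssn_candidates` helper and the sample sentences are illustrative assumptions made up for this example, not rules taken from any real product.

```python
import re

# Hypothetical rule: a US Social Security number written as 3-2-4 digits.
SSN_RULE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_candidates(text: str) -> list[str]:
    """Return every substring that matches the naive SSN rule."""
    return SSN_RULE.findall(text)


# Made-up sample sentences, one per failure mode discussed above.
samples = {
    "true positive": "Her SSN is 078-05-1120.",
    "false positive": "Order ref 123-45-6789 ships Monday.",
    "false negative": "SSN: 078 05 1120 (spaces instead of dashes).",
    "open-set miss": "Please forward this to Maria Gonzalez at 12 Elm Street.",
}

for label, text in samples.items():
    print(f"{label:15} -> {find_ssn_candidates(text)}")
```

The rule flags the order reference because it only sees the digit layout, misses the same number written with spaces, and has nothing to say about the name and street address in the last sentence. Loosening the pattern to catch the spaced variant would add more false positives, which is the maintenance cost described in the first bullet.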
Contextual AI to the Rescue
|PII Ability||Rule-based systems||AI-based systems|
|can detect open-set PII types||❌||✅|
|can detect conceptual PII (political views, sexual preferences, health…)||❌||✅|
|can process non-textual data (profile photos, passport scans, audio…)||❌||✅|
|can learn from new data||❌||✅|
|allows custom validators and checksums||✅||✅|
|requires expertise to develop||medium||high|
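The "custom validators and checksums" row can be illustrated with a short Python sketch: a loose pattern proposes payment card candidates, and the Luhn checksum rejects digit runs that cannot be real card numbers. The `CARD_CANDIDATE` pattern and the function names are assumptions chosen for this example; a production detector would likely also check issuer prefixes and surrounding context.

```python
import re

# Hypothetical candidate pattern: 13 to 19 digits, optionally separated
# by single spaces or dashes.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return pattern matches that also pass the checksum validator."""
    return [
        match.group()
        for match in CARD_CANDIDATE.finditer(text)
        if luhn_valid(match.group())
    ]


text = "Paid with 4111 1111 1111 1111, invoice 1234 5678 9012 3456."
print(find_card_numbers(text))
```

Here `4111 1111 1111 1111` is a widely used test card number that passes the checksum, while the invoice number matches the pattern but fails validation. A validator of this kind reduces false positives for closed-set identifiers, and the same function could post-filter candidates produced by either kind of system in the table.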