How to evaluate PII discovery software

Radim Řehůřek
Deep Learning, Personal Data

So, you're considering buying software for discovery of PII / PCI / PHI. Or you're about to start your trial of PII Tools. How do you test discovery software properly?

Don'ts: Careful what you test for

Consider the following "passport":

[Image: donald_duck]

Why won't PII Tools detect the "PII" in this passport scan?

This is an actual file submitted to our support team during a trial evaluation, along with the question: "Why didn't PII Tools detect the passport? Is the software faulty?"

No. It simply determined that this image doesn't constitute personally identifiable information. Donald Duck doesn't count as a person, sorry.

This demonstrates a common problem with PII evaluations: You typically don't want to evaluate on real files, yet artificial files run the risk of being "too obviously fake". In the above case, PII Tools was clever enough to see through the level of obfuscation the user attempted.

[Image: specimen]

Another example of a file that does not contain PII. Note the word SPECIMEN printed over this passport.

We see this pattern over and over during trials, and not just with passports. Users may scan fake data, but smart discovery SW ought to automatically reject it as a false positive. Paradoxically, "dumb" software that will produce tons of false positives in real production scans could come out ahead in such naive "fake data" evaluations.

PII Tools is smart enough to call the bluff on artificial data. "Fake" and "Personally Identifiable" are mutually exclusive.

"Why didn't PII Tools detect credit card numbers in this text?"

John Smith owns the following airline credit cards:

1912126257871313
1912136294662913
1912126257870301

Well, the first two card numbers don't even pass the Luhn check. The last one does, but its IIN (BIN) code is invalid, so it's still not real PII.
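To see the Luhn part for yourself, here is a minimal Python sketch of the checksum applied to the three numbers above. The function name `luhn_valid` is ours, chosen for illustration. It is not how PII Tools implements the check, and it does not cover the IIN (BIN) lookup.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as adding the two digits of the product
        total += d
    return total % 10 == 0


for card in ("1912126257871313", "1912136294662913", "1912126257870301"):
    print(card, luhn_valid(card))
# 1912126257871313 False
# 1912136294662913 False
# 1912126257870301 True
```

A passing checksum only means the number is well-formed. As the third card shows, it still takes a further check on the issuer prefix to decide whether the number could belong to a real card.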

The same goes for an IP address of 127.0.0.1, the loopback address every machine uses to refer to itself. And so on, and so forth.
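The IP case can be sketched with Python's standard `ipaddress` module. The helper name `is_public_ip` and the rule "only globally routable addresses are worth flagging" are our assumptions for illustration, not a description of PII Tools' internal logic.

```python
import ipaddress


def is_public_ip(text: str) -> bool:
    """Return True only for syntactically valid, globally routable addresses."""
    try:
        ip = ipaddress.ip_address(text.strip())
    except ValueError:
        return False  # not an IP address at all
    # Loopback, private and other special-purpose ranges are not global.
    return ip.is_global


for candidate in ("127.0.0.1", "192.168.1.10", "8.8.8.8", "999.1.1.1"):
    print(candidate, is_public_ip(candidate))
# 127.0.0.1 False
# 192.168.1.10 False
# 8.8.8.8 True
# 999.1.1.1 False
```

A regexp that merely matches four dot-separated numbers would happily flag all four strings, which is exactly the kind of false positive a naive test set rewards.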

So how to test PII properly?

Obtaining good public PII data for evaluation is difficult by definition, and synthetic "generated" data is worse than useless. To make the evaluation process easier, we've published a small dataset that contains typical PII examples:

  • lyrics.txt: Text for Marvin Gaye's song "Sexual Healing", with words like "medicine" (PHI), "healing" (PHI) or "sexual" (protected PII). It will baffle naive SW that uses keywords and regexps to determine PII. But PII Tools will correctly determine there is no personally identifiable information (no PII risk).
  • passport.jpg: Passport scan with a profile photo (PII).
  • bank_form.pdf: Unstructured file (PDF) containing personal and banking information.
  • ip-creditcard-email.csv: Structured file with examples of IP addresses, credit cards and emails.
  • enron-sample.mbox: Email mailbox with PII from the notorious Enron case.
  • health.zip: ZIP archive with PHI. Scanning archives (ZIP, RAR, TAR, PST…) transparently and efficiently is critical for organizations that keep backups, because these will often hide high-risk privacy gems.
  • sexual_preferences.docx: Word document containing information about a person's sexual orientation (protected PII).

Download these files here: pii_tools_sample.zip. Feel free to use them to judge any software that claims to detect/remediate personal and sensitive information.
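If you want to compare several tools on these files, one simple option is to score each tool's file-level verdicts with precision and recall. The sketch below is ours: the `EXPECTED` labels are read off the descriptions above (only lyrics.txt is described as PII-free), and the `flagged` sets stand in for whatever a given tool actually reports.

```python
# Expected file-level verdicts, taken from the file descriptions above.
EXPECTED = {
    "lyrics.txt": False,
    "passport.jpg": True,
    "bank_form.pdf": True,
    "ip-creditcard-email.csv": True,
    "enron-sample.mbox": True,
    "health.zip": True,
    "sexual_preferences.docx": True,
}


def score(flagged, expected=EXPECTED):
    """Compute precision and recall for the set of files a tool flagged."""
    tp = sum(1 for name, has_pii in expected.items() if has_pii and name in flagged)
    fp = sum(1 for name, has_pii in expected.items() if not has_pii and name in flagged)
    fn = sum(1 for name, has_pii in expected.items() if has_pii and name not in flagged)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"precision": precision, "recall": recall}


# Hypothetical keyword-based tool: flags every file, including the lyrics.
naive_tool = set(EXPECTED)
# Hypothetical tool that correctly ignores the lyrics.
careful_tool = set(EXPECTED) - {"lyrics.txt"}

print("naive:  ", score(naive_tool))
print("careful:", score(careful_tool))
```

Both hypothetical tools reach full recall here. Only precision tells them apart, which is why a test set needs files like lyrics.txt that look sensitive but are not.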

You are of course encouraged to try on your own files too. Just keep in mind the "Don'ts" from this article.


Questions? Want to see a live demo? Contact us.
