PII De-Identification vs. Masking vs. Redaction

Cole Pruden
De-Identification, Personal Data Protection, Sensitive Data

Are you struggling to process the client PII you need to complete orders while keeping that same information away from wandering eyes? Data de-identification, masking, and redaction are just a few of the options open to you. What exactly are they, and how do they work?

De-Identification

To put it plainly, de-identification is the process of removing identifying information from data. Its purpose is to modify sensitive data so that it is of little or no value to unauthorized intruders while remaining usable by software and authorized personnel.

There are plenty of ways to achieve this goal. Within the wider sphere of Data Anonymization, there are seemingly endless methods to remove PII from data sets in order to keep everyone involved anonymous. Maybe you’ve heard of some of these methods before; they include pseudonymization, encryption, tokenization, and more.

To reverse or not to reverse?

For the sake of this article, let’s highlight two of the most popular methods of data de-identification: masking and redaction. These two methods are irreversible, in that information from the original document is permanently obscured. On the other hand, encryption and tokenization are examples of reversible methods, where a party with the correct key is able to reverse the transformation and view the original document, including all of the PII.

Should you prefer reversible or irreversible de-identification? That depends on the end goal: If you aim to pass documents to third parties, the “destructive” methods are safer and simpler, because they don’t rely on maintaining keys or token databases.

Conversely, if you need to come back to the original data to perform analytics and machine learning, or for legal and archival purposes, then reversible methods such as tokenization or encryption are your only choice.
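To make the reversible side concrete, here is a minimal sketch of tokenization in Python. The `TokenVault` class, its method names, and the `tok_` prefix are hypothetical and chosen only for illustration; a production system would keep the mapping in a secured, access-controlled store rather than in memory.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault (illustration only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        # The token is random, so it carries no information about the value.
        token = "tok_" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Only a party with access to the vault can recover the original.
        return self._token_to_value[token]


vault = TokenVault()
token = vault.tokenize("jane.doe@example.com")
original = vault.detokenize(token)
```

The token can be passed around freely, but reversing it depends on the vault staying available and protected. That is exactly the maintenance burden the irreversible methods avoid.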

Masking

Data masking is a non-reversible transformation that partially or fully replaces the characters of sensitive data with a symbol, such as an asterisk (*) or hash (#). See the example below:
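As a rough sketch of how this could look in code, the hypothetical `mask` helper below replaces letters and digits with a symbol while leaving separators in place, so the masked value keeps its original shape. The function name, its parameters, and the sample values are assumptions made for illustration.

```python
def mask(value: str, keep_last: int = 4, symbol: str = "*") -> str:
    """Replace letters and digits with `symbol`, except the last `keep_last` of them."""
    # Count how many alphanumeric characters should be masked.
    total = sum(ch.isalnum() for ch in value)
    to_mask = max(total - keep_last, 0)
    out = []
    for ch in value:
        if ch.isalnum() and to_mask > 0:
            out.append(symbol)
            to_mask -= 1
        else:
            # Separators and the kept tail pass through unchanged.
            out.append(ch)
    return "".join(out)


partial = mask("4111-1111-1111-1234")                  # "****-****-****-1234"
full = mask("555-867-5309", keep_last=0, symbol="#")   # "###-###-####"
```

Partial masking keeps a few characters visible for tasks like confirming a card with a customer, while full masking hides every character and preserves only the format.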

Redaction

When it comes to redaction, many of us immediately think of the spy world, with the likes of Jason Bourne or Ethan Hunt from Mission Impossible rummaging through files of documents where all of the useful information has been blacked out. And in this case, Hollywood is actually spot on.

Redaction is the irreversible process of blacking out or removing information that is personally identifiable, sensitive, confidential, or otherwise classified, typically in scanned documents, images, and PDFs. The example image below shows what a redacted document might look like.

Source: Logikcull
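For plain text, the same idea can be sketched in a few lines of Python. The `redact` function and the `[REDACTED]` marker below are hypothetical, and the sketch assumes the sensitive terms have already been identified. Redacting scanned documents, images, and PDFs additionally requires removing the underlying pixels or text layer, which this example does not cover.

```python
import re

# A fixed-width marker, so the replacement does not reveal the original length.
REDACTION_MARK = "[REDACTED]"


def redact(text: str, sensitive_terms: list[str]) -> str:
    """Replace every occurrence of each sensitive term with a fixed marker."""
    for term in sensitive_terms:
        # re.escape treats the term as literal text, not as a pattern.
        text = re.sub(re.escape(term), REDACTION_MARK, text, flags=re.IGNORECASE)
    return text


email = "Hello, my name is Jane Doe and my order was sent to 12 Elm Street."
redacted = redact(email, ["Jane Doe", "12 Elm Street"])
# "Hello, my name is [REDACTED] and my order was sent to [REDACTED]."
```

Unlike masking, nothing about the original value survives here: neither its length nor its format.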

Different Methods for Different Objectives

Where the typical goal of data masking is to remove any sensitive information but maintain the same data structure so it can be used in applications, redaction is meant to completely remove certain pieces of information so the remaining text can be released (perhaps to the public, journalists, unauthorized employees, etc.) without exposing anything personally identifiable or classified.

Examples of Proper Data Masking and Redaction

To further illustrate the differences between the terms de-identification, redaction, and masking, we’ve provided two comparative tables.

But first, remember that masking and redaction both fall under the de-identification “umbrella”, as two different methods of removing identifying information from data.

De-identification method → Masking:

Now, see how that same information would look as it’s redacted in this example customer complaint email.

De-identification method → Redaction:

Note */**: These two pieces of sensitive medical information are good examples of PII that is much harder to detect in free text. Addresses, emails, and credit card numbers follow recognizable patterns that discovery tools detect easily, but it takes highly trained software to recognize someone describing their medical symptoms.
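The gap between the two kinds of PII is easy to see in a small sketch. The patterns and the `find_structured_pii` function below are simplified assumptions for illustration and are not how any particular discovery tool works. They catch an email address and a card number, yet have no way of noticing the medical detail in the same sentence.

```python
import re

# Illustrative patterns only; real discovery tools use far more robust detection.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")  # 13 to 16 digits, optional separators


def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to discard digit runs that are not card numbers."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_structured_pii(text: str) -> dict:
    emails = EMAIL_RE.findall(text)
    cards = [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
    return {"emails": emails, "cards": cards}


text = ("Contact me at jane.doe@example.com. I paid with 4111 1111 1111 1111 "
        "and I have had chest pains ever since.")
found = find_structured_pii(text)
```

Here `found` contains the email address and the test card number, but nothing about the chest pains. Detecting that kind of statement requires understanding language, not matching patterns.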

Only the First Steps

Now that you’re armed with a better understanding of masking and redaction as de-identification methods, it’s important to remember that there is much more to data protection than blacking out words and adding placeholders where names should be. If only it were that easy, right?

I mentioned some of the other methods toward the start of the article, and you can also check out our other article How to Remediate PII? to learn how sensitive data discovery tools assist in this process. Furthermore, differential privacy is an exciting field concerned with how pieces of information that look harmless on their own can be linked and grouped together to identify people, and with how the statistical properties of data can be used to protect against that. There really is so much to learn here.

And there’s no time like the present to start! So what are you waiting for? Your identifiable information isn’t going to de-identify itself!
