Privacy & Ethics

#BeDataSmart: Understanding the difference between de-identification and anonymization

A common mistake in handling data is to equate de-identification with anonymization. The two processes are similar in many ways, but they are not the same thing. Anonymized data cannot be traced back to the original respondent, whereas de-identified data may still allow an individual to be identified. In other words, de-identification can be reversed, while proper anonymization cannot. This week, ESOMAR’s Governmental Affairs and Professional Standards team sheds some light on the difference between the two processes.

De-identification

Say you release a file with the answers to a questionnaire you have sent out, with each answer presented separately and each respondent de-identified through an assigned number: e.g. person 1 answered “yes” to the first question, person 2 answered “no” to the first question, etc.

Even if names and other personal data have been removed from the file and a random number has been assigned to each profile, it may still be possible to identify a person by correlating their answers across the questions. The data has therefore been de-identified, but not anonymized.
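To make the risk concrete, here is a minimal sketch in Python of how a numbered, name-free file can still single people out. The survey rows, the answer fields and the function name `unique_respondents` are all hypothetical and chosen only for illustration; the check simply flags every respondent whose combination of answers appears once in the file.

```python
from collections import Counter

# Hypothetical de-identified release: names are replaced by numbers,
# but each respondent's answers are still published row by row.
released = {
    1: ("yes", "no", "30-39", "Brussels"),
    2: ("no", "no", "30-39", "Brussels"),
    3: ("yes", "no", "30-39", "Brussels"),
    4: ("yes", "yes", "60-69", "Ghent"),
}

def unique_respondents(rows):
    """Return the IDs whose full answer combination appears only once."""
    counts = Counter(rows.values())
    return sorted(rid for rid, answers in rows.items() if counts[answers] == 1)

singled_out = unique_respondents(released)
print(singled_out)
```

In this made-up file, respondents 2 and 4 each have a combination of answers that nobody else shares. Anyone who already knows a few of those facts about a person could match them to a row and read off the rest of their answers.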

Anonymization

Now say that you release a file containing only aggregated data, with no individual answers; this would be considered properly anonymized. If done correctly, it should no longer be possible to single out an individual (singling out), link records relating to an individual (linkability), or infer information concerning an individual (inference), according to the Guideline on Anonymization released by the EU’s Data Protection Authorities.
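As a rough illustration of releasing aggregates instead of individual answers, the Python sketch below counts the answers to one question and withholds any count that falls below a minimum size. The sample answers, the threshold `MIN_CELL_SIZE` and the function name `aggregate` are assumptions made for this example; they are not taken from the guideline, and a real release would need a more careful review.

```python
from collections import Counter

# Hypothetical individual answers to one question (never released as-is).
answers = ["yes", "no", "yes", "yes", "no", "yes", "unsure"]

MIN_CELL_SIZE = 2  # assumed threshold, chosen only for illustration

def aggregate(values, min_cell=MIN_CELL_SIZE):
    """Count each answer; withhold (None) any count below min_cell."""
    counts = Counter(values)
    return {
        answer: (n if n >= min_cell else None)
        for answer, n in counts.items()
    }

summary = aggregate(answers)
print(summary)
```

Only the totals per answer leave the building. The single "unsure" response is withheld, because a count of one would point at exactly one respondent.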

It is important to point out that de-identification is not necessarily bad practice, as it can also be useful for data protection. As an article released by the International Association of Privacy Professionals points out, “deidentification is useful in long-term datasets. For example, we live in a world where each user may have multiple mobile devices. In some cases, a server needs to keep track of where (which user, which device) certain pieces of data came from, for example, so that data can be automatically deleted when that particular device is no longer in use.”

However, anyone conducting research that involves collecting personal data should clearly understand the difference between anonymization and de-identification, in order to avoid massive personal data breaches such as the identification of Netflix users based on publicly released movie ratings. In that case, researchers found that one person had strong, ostensibly private opinions about some liberal and gay-themed films, and had also rated some religious films. This kind of information can give a good indication of the views someone holds on certain topics, which they may not want to share publicly.

Therefore, be careful what you promise to your respondents. Behind each bit of personal data is a person. Be Data Smart.
