A Secret No Longer: Data Anonymization Needs a Rethink as the Industry Learns that Anonymization Is Not the Same as Erasure
A major privacy fault line has just been exposed: "anonymized" data turns out to be not so anonymous after all. In a July 23, 2019 article, Your Data Were 'Anonymized'? These Scientists Can Still Identify You, the New York Times reports that new technology can defeat current efforts to keep data private.
Recital 26 of the GDPR defines anonymized data as "data rendered anonymous in such a way that the data subject is no longer identifiable." This definition underscores that anonymized data must be stripped of any identifying information, making it impossible to derive insights about an individual, even for the company that anonymized the PI. The advantage is that, when done correctly, anonymization places the processing and storage of that data outside the scope of the GDPR.
The EU's advisory body, the Article 29 Working Party (since replaced by the European Data Protection Board (EDPB)), cautioned that true data anonymization is technically difficult and emphasized that many organizations fall short of it, potentially putting them out of compliance with the GDPR (and possibly the upcoming CCPA). While this discussion was largely confined to technical and academic circles, the shortcomings of data anonymization have now been revealed in a very public manner, as detailed in the New York Times article. Data scientists in England and Belgium developed software that exposes the identity of users from a limited number of data attributes in data sets previously believed to be anonymized. What's more, in a highly suspect move, the scientists published the code for anyone to use rather than alerting government and industry leaders so the issue could be addressed. The new program relies on an algorithm that can identify almost any data subject in databases that have been stripped of direct personal identifiers.
The scientists' findings turn on the fact that anonymized data sets can, and often do, include "attributes": personal indicators or characteristics about the data subject or their household. According to the article, scientists at Imperial College London and the Université Catholique de Louvain in Belgium reported that they had devised a computer algorithm that can identify 99.98 percent of Americans in almost any available data set from as few as 15 attributes, such as gender, ZIP code, or marital status. By contrast, data sets with several hundred attributes were once considered anonymized after certain data was stripped out. The troubling decision to publish the software code online means the bar for re-identifying individuals from anonymized data just got a lot lower. Perhaps the disclosure was meant by the lead author of the paper, Yves-Alexandre de Montjoye of Imperial College London, as an immediate call to action for the industry to build better anonymization techniques.
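The underlying idea can be illustrated with a minimal sketch. This is not the researchers' published algorithm; the column names and records below are hypothetical, and the example simply shows why a combination of a few attributes that is unique in an "anonymized" data set can be matched against an outside source that links the same attributes to a name.

```python
# Toy illustration of attribute-based re-identification (hypothetical data).
from collections import Counter

# An "anonymized" data set: direct identifiers removed, attributes retained.
anonymized = [
    {"zip": "10001", "gender": "F", "birth_year": 1985, "marital": "single"},
    {"zip": "10001", "gender": "M", "birth_year": 1985, "marital": "single"},
    {"zip": "10002", "gender": "F", "birth_year": 1990, "marital": "married"},
]

# An auxiliary source the attacker already holds (e.g., a marketing list)
# that ties the same attributes to real names.
auxiliary = [
    {"name": "Jane Doe", "zip": "10001", "gender": "F",
     "birth_year": 1985, "marital": "single"},
]

QUASI_IDENTIFIERS = ("zip", "gender", "birth_year", "marital")

def combo(record):
    """The combination of quasi-identifier values for a record."""
    return tuple(record[q] for q in QUASI_IDENTIFIERS)

# Count how often each attribute combination appears in the anonymized set.
combo_counts = Counter(combo(r) for r in anonymized)

# Any combination that appears exactly once is re-identifiable the moment an
# attacker holds an outside record with the same attributes plus a name.
for person in auxiliary:
    if combo_counts[combo(person)] == 1:
        print(f"{person['name']} is uniquely re-identifiable in the "
              "'anonymized' data set.")
```

The more attributes a record retains, the more likely its combination is unique, which is why as few as 15 attributes proved sufficient in the researchers' work.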
To be sure, data scientists have not just now learned of this privacy shortcoming. On July 10, 2019, Archive360 published an article discussing whether data anonymization produces the same result as data erasure. That article was prompted by a December 5, 2018 ruling of the Austrian Data Protection Authority (Austria being an EU member state subject to the GDPR), which concluded that anonymization of personal information (PI) could be used to satisfy the law's data erasure requirement, the right to erasure/right to be forgotten. The ruling in case DSB-D123.270/0009-DSB/2018 (original in German) did not reveal the specific technology and processes used to anonymize the data subject's PI, but the fact that the DPA ruled it sufficient was a major precedent. Still, months later, virtually no progress has been made on new anonymization techniques.
What does this mean for the GDPR and CCPA right to be forgotten?
This finding, and the release of the code, means the standard for effectively anonymizing PI just became a lot higher, and consequently the right to be forgotten just became a lot harder to deliver. A company that wants to anonymize its PI (mostly for data analytics projects) will need to be sure its data sets retain far fewer attributes and PI identifiers than before. In fact, anonymizing data sets to the point where this new algorithm cannot re-identify individual data subjects could prove too costly to even attempt, meaning that without a new approach, data analytics on PI could become a thing of the past. Artificial intelligence (A.I.) may eventually solve the problem, but until then, organizations will need to be extremely careful in how they handle PI erasure requests: should we delete or anonymize?
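One way to gauge whether a stripped-down data set still exposes individuals is a k-anonymity check: every combination of remaining attributes should be shared by at least k records. The sketch below uses hypothetical column names, and k-anonymity is offered here only as a common yardstick, not a requirement named in either regulation.

```python
# Minimal k-anonymity check over a chosen set of quasi-identifiers.
from collections import Counter

def smallest_group_size(records, quasi_identifiers):
    """Return k: the size of the smallest group of records sharing the same
    combination of quasi-identifier values. k == 1 means at least one
    record is unique and therefore trivially re-identifiable."""
    counts = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(counts.values())

records = [
    {"zip": "10001", "gender": "F", "age_band": "30-39"},
    {"zip": "10002", "gender": "F", "age_band": "30-39"},
    {"zip": "10001", "gender": "M", "age_band": "40-49"},
    {"zip": "10002", "gender": "M", "age_band": "40-49"},
]

# With all three attributes, every record is unique (k == 1).
print(smallest_group_size(records, ("zip", "gender", "age_band")))  # 1

# Dropping ZIP code raises k to 2: each remaining combination covers
# at least two people.
print(smallest_group_size(records, ("gender", "age_band")))         # 2
```

Raising k by dropping or generalizing attributes is exactly the trade-off that erodes the analytic value of the data set, which is why the cost may become prohibitive.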
Given the GDPR's monumentally steep fines and the fine structures of the CCPA and other state privacy laws, the safe bet is to delete a data subject's PI using an unrecoverable data deletion technique.
Is a computer deletion function synonymous with absolute data destruction?
The anonymization question brings us to a related issue we have written about in the past: does the standard computer delete function meet the intent of the GDPR's and CCPA's right to erasure?
Both the GDPR and the CCPA include a right to be forgotten. To many privacy experts, the erasure requirement strongly implies that erased data must be unrecoverable, meaning it cannot be programmatically restored; otherwise, you could not say that the data subject has been forgotten.
Data subjects who have requested that their PI be deleted, much like the data subject in the DPA ruling, can make the case that if the data collector/processor used the standard computer delete process, their data has not been erased at all; it has simply been altered in a way that makes it only slightly harder to restore.
Neither the GDPR nor the CCPA addresses this issue directly, but it will no doubt be addressed in short order. Two industry statements help make the point:
"…the regular 'delete' function of most operating systems and databases is generally not sufficient to meet the requirements of the GDPR." PayTechLaw
“…just deleting data or reformatting magnetic media (including hard disk drives and tapes) will not be enough to ensure that the wrong personal data does not reside somewhere in the business. If data gets deleted from any media type, it can be recovered in many cases, even when the hardware is damaged by flood or fire.” Kroll OnTrack
Until A.I. can fully address the unrecoverable anonymization requirement, there are just two methods to erase data to make it completely unrecoverable:
1. Data wipe/overwrite: Writing ones and zeros across specific files, or portions of a file, a predetermined number of times is considered an effective, secure deletion practice, though it can take an extended period of time.
2. Cryptographic erasure: Encrypting the target data/files and then deleting the encryption key (and, ideally, the encrypted file as well) is considered a secure deletion process in both regulatory and legal contexts. A minimal sketch of both methods follows below.
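Both methods can be sketched in a few lines of Python. This is illustrative only; a production implementation would also have to contend with filesystem journaling, SSD wear leveling, copies held in backups, and key-escrow policies, none of which this toy code addresses. The cryptographic-erasure example assumes the third-party cryptography package is installed.

```python
import os
import secrets

def overwrite_and_delete(path: str, passes: int = 3) -> None:
    """Method 1 (data wipe/overwrite): overwrite the file's contents with
    random bytes a fixed number of times, then remove the file."""
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            f.write(secrets.token_bytes(size))
            f.flush()
            os.fsync(f.fileno())   # push each pass to disk
    os.remove(path)

def cryptographic_erasure_demo(plaintext: bytes) -> bytes:
    """Method 2 (cryptographic erasure): store only ciphertext; 'erase' by
    destroying the key. Requires: pip install cryptography."""
    from cryptography.fernet import Fernet
    key = Fernet.generate_key()            # key is kept separately from the data
    ciphertext = Fernet(key).encrypt(plaintext)
    # To honor an erasure request, securely destroy the stored key (and,
    # ideally, the ciphertext). Without the key, the ciphertext cannot be
    # recovered. Discarding the Python reference here only models that step.
    del key
    return ciphertext
```

The design difference matters operationally: overwriting scales with the amount of data to destroy, while cryptographic erasure only requires destroying a small key, which is why it is often favored for large archives.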
Companies that collect, process, or use PI for marketing and sales activities, which is just about every business, should ask themselves: do my current enterprise content management (ECM) systems, email/file archives, CRM systems, and marketing/sales systems meet these unrecoverable anonymization/deletion requirements?
They had better, and soon; if not, those systems will quickly become a major liability when responding to privacy requests and the right to be forgotten.
Bill Tolson is VP of Global Compliance and James M. McCarthy is General Counsel at Archive360, a provider of data migration and information management solutions.