Companies collect data from our devices almost constantly. While privacy concerns are always in the picture, they assure us that our data is in safe hands, and that if it is shared with third parties, any information that could be used to identify people is redacted and de-identified.
Turns out the techniques used to anonymize data aren’t that foolproof, according to researchers at Imperial College London, who have published a paper on reverse-engineering incomplete datasets.
The researchers developed a machine learning model that can reverse-engineer an incomplete dataset. Using 15 demographic attributes such as age, gender, and marital status, they were able to correctly re-identify 99.98% of Americans in an anonymized dataset.
For that purpose, the researchers used 210 different datasets covering a “large range of uniqueness,” together containing information on around 11 million Americans.
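The underlying idea is that combinations of seemingly common attributes quickly become unique: almost nobody shares your exact age, gender, marital status, ZIP code, and so on all at once. The sketch below is not the authors’ actual model; it is a toy illustration, with entirely made-up records and a hypothetical `uniqueness` helper, of how the share of uniquely identifiable records grows as more attributes are combined.

```python
from itertools import combinations
from collections import Counter

# Toy records: each person is a tuple of demographic attributes
# (gender, age, marital status, ZIP). Made-up data for illustration;
# the real study drew on 210 datasets covering ~11 million Americans.
people = [
    ("F", 34, "married", "10001"),
    ("M", 34, "married", "10001"),
    ("F", 34, "single",  "10001"),
    ("F", 29, "married", "10002"),
    ("M", 29, "single",  "10002"),
    ("F", 34, "married", "10002"),
]

def uniqueness(records, cols):
    """Fraction of records whose combination of the chosen
    attribute columns appears exactly once in the dataset."""
    keys = [tuple(r[c] for c in cols) for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

# More attributes -> more records become unique (re-identifiable).
num_attrs = len(people[0])
for n in range(1, num_attrs + 1):
    best = max(uniqueness(people, cols)
               for cols in combinations(range(num_attrs), n))
    print(f"{n} attribute(s): up to {best:.0%} of records unique")
```

With all four attributes, every record in this toy dataset is unique; the paper’s contribution is a model that estimates, at national scale, how likely a given attribute combination is to single out one individual.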
However, the goal of the study isn’t simply to establish that so-called “anonymous” datasets can be de-anonymized. That has been demonstrated before: at DEFCON 2018, hackers legally obtained the browsing histories of 3 million Germans and de-anonymized them.
Instead, the researchers set out to show how easy it has become to defeat the techniques used to anonymize these datasets. Their findings are a call to action for governments and companies to adopt more robust techniques that can keep people’s identities secure.
They have also set up a website where you can check how easily you could be identified in an anonymized dataset.