Until recently, Microsoft owned the largest collection of celebrity images, a public dataset known as MS Celebs.
The purpose of the MS Celebs database was to train Microsoft’s facial recognition AI, and it was also used by other companies, including IBM and Nvidia. Microsoft created it back in 2016 by searching the web for images uploaded under a Creative Commons license.
It contained 10 million images of around 100,000 people with varying degrees of popularity across the globe. The name includes “Celebs” to give the impression that the images mostly belong to public figures. The goal of the database was to use the images of those 100,000 people to train systems that could recognize people from a list of 1 million names.
One issue that hangs over the database is that it includes images of many people who never consented to their use, even though, as mentioned, the images were published under Creative Commons licenses.
In fact, the database contains images of people who are not “celebrities” and merely maintain an online presence, such as journalists, activists, and musicians. According to MegaPixels, the list includes a former FTC technologist and The Intercept founders Glenn Greenwald and Jeremy Scahill, among others.
Why did Microsoft delete it?
According to the Financial Times (via Gizmodo), Microsoft has pulled the database over privacy concerns. It’s feared that some Chinese companies are using it to improve their surveillance systems. If you visit the website Mscelebs.org, it returns a “404 not found” error.
Well, that makes sense, because companies nowadays can easily come under scrutiny for a careless attitude towards user privacy. And with such a gigantic data set in the picture, the matter becomes even more sensitive. Just recently, we even saw two tech giants, Apple and Google, criticizing each other in the name of privacy.
Microsoft told the Financial Times that the database was meant for academic purposes, that it was maintained by a former employee, and that it has since been removed.
Although Microsoft has quietly pulled the database, some identifiers related to it are still available on GitHub. And what gets uploaded to the internet stays there forever. People are still sharing the database on platforms like Dropbox, GitHub, and Baidu, and it remains available to all the companies and researchers who downloaded it in the past.