SMRTR AI • Jul 19, 2025 • MIT Technology Review

A major AI training data set contains millions of examples of personal data

SMRTR summary

DataComp CommonPool, a massive AI training dataset, likely contains millions of images with personally identifiable information. Researchers found thousands of such images in just 0.1% of the dataset and estimate that the full dataset contains hundreds of millions. The dataset, which holds 12.8 billion samples gathered through web scraping, has been downloaded more than 2 million times, which raises serious privacy concerns about the potential misuse of sensitive information.

SMRTR provides this summary for quick context. The original article belongs to MIT Technology Review.

