SMRTR AI | Jul 19, 2025 | MIT Technology Review

A major AI training data set contains millions of examples of personal data

SMRTR summary

DataComp CommonPool, a massive AI training dataset, likely contains millions of images with personally identifiable information. Researchers found thousands of such images after auditing just 0.1% of the dataset, and they estimate that hundreds of millions more exist across the full collection. The dataset, which holds 12.8 billion samples and was built by web scraping, has been downloaded more than 2 million times, raising serious privacy concerns about the potential misuse of sensitive information.

SMRTR provides this summary for quick context. The original article belongs to MIT Technology Review.

