A new study reveals that DataComp CommonPool, one of the largest open-source AI training datasets with 12.8 billion samples, contains millions of images with personally identifiable information including passports, credit cards, birth certificates, and identifiable faces. The findings highlight a fundamental privacy crisis in AI development, as researchers estimate hundreds of millions of personal documents may be embedded in datasets used to train popular image generation models like Stable Diffusion and Midjourney.

What you should know: Researchers audited just 0.1% of CommonPool’s data and found thousands of validated identity documents and over 800 job application materials linked to real people (a rough scaling sketch follows the list below).

  • The study, published on arXiv, turned up credit cards, driver’s licenses, passports, birth certificates, résumés, and cover letters, which the researchers confirmed belonged to real people through LinkedIn searches.
  • Many résumés contained sensitive information including disability status, background check results, birth dates of dependents, race, and contact information for references.
  • Since CommonPool has been downloaded more than 2 million times, these privacy risks have likely been replicated across numerous downstream AI models.
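
To make the scale of a 0.1% audit concrete, here is a minimal back-of-the-envelope sketch of how a count found in a small uniform random sample extrapolates to the full pool. The function name and the uniform-sampling assumption are illustrative; the study’s own estimates may rest on a more careful statistical method.

```python
# Back-of-the-envelope scaling of audit findings to the full corpus.
# Assumes the audited 0.1% slice is a uniform random sample of CommonPool;
# the study's own estimates may use a more careful methodology.

def extrapolate_to_full_pool(found_in_sample: int, sample_fraction: float) -> int:
    """Naive point estimate of how many such items the full dataset holds."""
    return round(found_in_sample / sample_fraction)

SAMPLE_FRACTION = 0.001  # the 0.1% of CommonPool the researchers audited

# Example: ~800 validated job-application documents in the audited slice
# scale to roughly 800,000 across all 12.8 billion samples.
print(extrapolate_to_full_pool(800, SAMPLE_FRACTION))  # -> 800000
```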

The big picture: Web-scraped AI training data inherently contains private information that people never intended for machine learning use, challenging the industry’s assumption that publicly available content is fair game.

  • CommonPool draws from the same Common Crawl data source as LAION-5B, which was used to train major models including Stable Diffusion and Midjourney, suggesting similar privacy violations exist across multiple datasets.
  • The data was scraped between 2014 and 2022, meaning many images predate the existence of large AI models—making meaningful consent impossible.

Why privacy filters failed: Despite the curators’ attempts to protect privacy, CommonPool’s automated face-blurring algorithm missed over 800 validated faces in the small sample, leading the researchers to estimate that roughly 102 million faces went undetected across the entire dataset.

  • The curators didn’t apply filters for known personally identifiable information strings such as email addresses or Social Security numbers (a minimal illustration of that kind of string filter follows this list).
  • Even when faces are blurred, the blurring is optional and downstream users can remove it, while image captions and metadata often contain additional personal details such as names and locations.
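
As a concrete illustration of the kind of string-level screening the curators did not apply, here is a minimal sketch of a regex pass over captions and metadata. The patterns, field names, and function are illustrative assumptions, not part of DataComp’s actual tooling, and, as the researchers note, filters of this sort still miss plenty of real PII at web scale.

```python
import re

# Minimal sketch of a string-level PII screen for captions/metadata.
# The patterns below are illustrative; real email and SSN detection needs
# far more care, and no regex pass catches everything.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # US SSN-style pattern

def contains_pii_strings(text: str) -> bool:
    """Flag captions or metadata that contain email- or SSN-like strings."""
    return bool(EMAIL_RE.search(text) or SSN_RE.search(text))

# Usage: drop (or route for human review) any sample whose caption trips the check.
samples = [
    {"caption": "sunset over the bay"},
    {"caption": "resume of Jane Doe, jane.doe@example.com"},
]
kept = [s for s in samples if not contains_pii_strings(s["caption"])]
print(len(kept))  # -> 1
```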

What they’re saying: Experts emphasize that current web-scraping practices are fundamentally flawed and extractive.

  • “Anything you put online can [be] and probably has been scraped,” said William Agnew, a postdoctoral fellow in AI ethics at Carnegie Mellon University.
  • “It really illuminates the original sin of AI systems built off public data—it’s extractive, misleading, and dangerous to people who have been using the internet with one framework of risk,” explained Ben Winters, director of AI and privacy at the Consumer Federation of America.
  • “If you web-scrape, you’re going to have private data in there. Even if you filter, you’re still going to have private data in there, just because of the scale of this,” Agnew added.

Legal limitations: Current privacy laws offer insufficient protection against this type of data harvesting, particularly for research contexts.

  • Privacy regulations like GDPR and California’s CCPA don’t necessarily apply to the academic researchers who created CommonPool, and they often carve out “publicly available” information.
  • Even when people successfully request data removal, the law remains unclear about whether already-trained models must be retrained or deleted.

The children’s privacy concern: Researchers found numerous examples of children’s personal information, including birth certificates, passports, and health status, often shared in contexts suggesting limited, specific purposes rather than broad AI training use.
