A leaked internal document has exposed the data sources used to fine-tune Claude, Anthropic’s AI assistant, revealing which websites were trusted or banned during the model’s training process. The spreadsheet, created by third-party contractor Surge AI and accidentally left in a public Google Drive folder, raises serious questions about data governance and transparency in AI development at a time when companies face increasing scrutiny over copyright and licensing issues.

What the leak revealed: The document contained over 120 “whitelisted” websites that contractors could use as trusted sources, alongside more than 50 “blacklisted” sites they were instructed to avoid.

  • Approved sources included trusted institutions and outlets such as Harvard.edu, Bloomberg, the Mayo Clinic, and the National Institutes of Health (NIH).
  • Banned sites included major publishers, platforms, and universities, among them The New York Times, Reddit, The Wall Street Journal, Stanford University, and Wiley.com.
  • The restrictions likely stem from licensing or copyright concerns, particularly notable given Reddit’s recent lawsuit against Anthropic over alleged data misuse.

Why this matters: While the data was used for fine-tuning rather than pre-training, legal experts warn that courts may not distinguish between these processes when evaluating copyright violations.

  • The leak highlights growing vulnerabilities in the AI ecosystem as companies increasingly rely on third-party firms for human-supervised training.
  • With Anthropic valued at over $60 billion and Claude competing directly with ChatGPT, every misstep invites heightened scrutiny.
  • This incident follows similar data breaches at other AI vendors like Scale AI, suggesting systemic security issues across the industry.

The bigger picture: The revelation exposes how behind-the-scenes decisions by third-party vendors can influence the quality, accuracy, and ethical grounding of AI responses that millions of users rely on daily.

  • Surge AI quickly removed the document after Business Insider reported the leak, while Anthropic said it had no knowledge of the list.
  • The incident underscores the lack of transparency in AI training processes, even for top-tier models like Claude.
  • As AI becomes more embedded in everyday tools, trust increasingly depends on companies’ willingness to be transparent about their data sources and training methodologies.

What it means for users: AI chatbot responses are deeply tied to the data sources selected during training, and inconsistent standards or unclear sourcing can introduce bias and accountability issues into the AI systems people use every day.
