Meta secretly trained its AI models on a Russian ‘shadow library,’ court docs show

Meta’s use of the pirated-books database LibGen to train its AI language models has been revealed in court-ordered unredacted documents, a significant development in an ongoing copyright lawsuit filed by authors.

The core revelation: Meta accessed and utilized Library Genesis (LibGen), a controversial pirated content database, for AI model training, despite internal concerns about the legality and optics of this approach.

  • Internal company discussions about using LibGen data were escalated to CEO Mark Zuckerberg
  • Meta employees expressed hesitation about accessing LibGen data from corporate laptops
  • The company’s AI team ultimately received approval to use the pirated materials

Legal context and implications: The case, Kadrey et al. v. Meta Platforms, represents a pivotal moment in determining how tech companies can legally utilize creative works for AI training.

  • Authors Richard Kadrey, Christopher Golden, and comedian Sarah Silverman filed the lawsuit in July 2023
  • Meta previously acknowledged using the Books3 dataset but had not disclosed its direct use of LibGen
  • The company maintains its actions fall under “fair use” doctrine and disputes the plaintiffs’ claims
  • LibGen itself faces ongoing legal challenges, including a $30 million judgment in 2024

Court developments: Judge Vince Chhabria’s ruling against Meta’s redaction attempts highlights growing judicial scrutiny of AI companies’ transparency.

  • The judge described Meta’s redaction approach as “preposterous” and aimed at avoiding negative publicity
  • Meta was ordered to file unredacted versions of key documents
  • The court warned Meta against making further broad redaction requests
  • Plaintiffs argue Meta not only used copyrighted material without permission but also participated in its distribution through torrenting

Questions of precedent: The unfolding legal battle between content creators and tech companies raises fundamental questions about AI training practices and intellectual property rights.

  • The case could establish important precedents for how AI companies can legally access and use training data
  • The outcome may influence future AI development practices and relationships between tech companies and content creators
  • Meta’s use of pirated materials suggests potential challenges in legally sourcing comprehensive training data for AI models

Meta Secretly Trained Its AI on a Notorious Russian 'Shadow Library,' Newly Unredacted Court Docs Reveal
