Meta’s use of the pirated-books database LibGen to train its AI language models has been revealed through the court-ordered unredaction of documents, marking a significant development in an ongoing copyright lawsuit filed by authors.
The core revelation: Meta accessed and used Library Genesis (LibGen), a controversial database of pirated content, for AI model training, despite internal concerns about the legality and optics of this approach.
- Internal company discussions about using LibGen data were escalated to CEO Mark Zuckerberg
- Meta employees expressed hesitation about accessing LibGen data from corporate laptops
- The company’s AI team ultimately received approval to use the pirated materials
Legal context and implications: The case, Kadrey et al. v. Meta Platforms, represents a pivotal moment in determining how tech companies can legally utilize creative works for AI training.
- Authors Richard Kadrey, Christopher Golden, and comedian Sarah Silverman filed the lawsuit in July 2023
- Meta previously acknowledged using the Books3 dataset but had not disclosed its direct use of LibGen
- The company maintains its actions fall under “fair use” doctrine and disputes the plaintiffs’ claims
- LibGen itself faces ongoing legal challenges, including a $30 million judgment in 2024
Court developments: Judge Vince Chhabria’s ruling against Meta’s redaction attempts highlights growing judicial scrutiny of AI companies’ transparency.
- The judge described Meta’s redaction approach as “preposterous” and said it was aimed at avoiding negative publicity
- Meta was ordered to file unredacted versions of key documents
- The court warned Meta against making further broad redaction requests
- Plaintiffs argue Meta not only used copyrighted material without permission but also participated in its distribution through torrenting
Questions of precedent: The unfolding legal battle between content creators and tech companies raises fundamental questions about AI training practices and intellectual property rights.
- The case could establish important precedents for how AI companies can legally access and use training data
- The outcome may influence future AI development practices and relationships between tech companies and content creators
- Meta’s use of pirated materials suggests potential challenges in legally sourcing comprehensive training data for AI models
Meta Secretly Trained Its AI on a Notorious Russian 'Shadow Library,' Newly Unredacted Court Docs Reveal