Meta’s use of pirated content from torrent sites to train its Llama large language model has been revealed through court documents in an ongoing copyright lawsuit.
Key developments: Court documents in the “Kadrey et al. v. Meta Platforms” case have exposed internal communications suggesting Meta’s use of unauthorized content for AI training.
- Novelists Richard Kadrey and Christopher Golden filed the lawsuit in 2023, alleging Meta used their copyrighted works without permission
- Judge Vince Chhabria ordered the release of unredacted documents that were previously hidden from public view
- Internal communications show Meta employees expressing concerns about downloading torrented content on corporate laptops
- Evidence indicates CEO Mark Zuckerberg may have authorized the use of pirated materials
Sources of unauthorized content: Meta allegedly accessed pirated materials from multiple unauthorized digital libraries to train its AI systems.
- LibGen, a Russia-based digital library established in 2008, was identified as one primary source of pirated content
- The platform contains unauthorized copies of books, magazines, and academic articles
- Additional “shadow libraries” were also reportedly used in the training process
- LibGen has faced multiple copyright lawsuits since its creation, though its operators remain unknown
Meta’s legal defense: The company has presented arguments attempting to justify its use of copyrighted materials.
- Meta claims its use of public materials falls under the “fair use” doctrine
- The company argues it is using text solely for statistical language modeling and generating original content
- Fair use determinations are typically made on a case-by-case basis in U.S. copyright law
Industry context: Similar allegations have emerged against other tech companies, though some have taken different approaches to AI training.
- Apple faced scrutiny over the use of YouTube video subtitles in its OpenELM model
- Apple clarified that OpenELM was purely for research and not used in consumer products
- Apple Intelligence uses licensed data and public content collected through web crawlers
- Major publishers like The New York Times and The Atlantic have opted out of sharing content with Apple Intelligence
Looking ahead: This lawsuit could set important precedents for AI training practices and copyright law, potentially forcing tech companies to establish more transparent and legally compliant data sourcing methods for AI development.
Meta accused of training its AI using pirated content from torrents