Meta allegedly trained its AI models on pirated torrent content

Court documents unsealed in an ongoing copyright lawsuit indicate that Meta allegedly used pirated content from torrent sites to train its Llama large language model.

Key developments: Court documents in the “Kadrey et al. v. Meta Platforms” case have exposed internal communications suggesting Meta’s use of unauthorized content for AI training.

Novelists Richard Kadrey and Christopher Golden filed the lawsuit in 2023, alleging Meta used their copyrighted works without permission
Judge Vince Chhabria ordered the release of unredacted documents that were previously hidden from public view
Internal communications show Meta employees expressing concerns about downloading torrented content on corporate laptops
Evidence indicates CEO Mark Zuckerberg may have authorized the use of pirated materials

Sources of unauthorized content: Meta allegedly accessed pirated materials from multiple unauthorized digital libraries to train its AI systems.

LibGen, a Russia-based digital library established in 2008, was identified as one primary source of pirated content
The platform contains unauthorized copies of books, magazines, and academic articles
Additional “shadow libraries” were also reportedly used in the training process
LibGen has faced multiple copyright lawsuits since its creation, though its operators remain unknown

Meta’s legal defense: The company has presented arguments attempting to justify its use of copyrighted materials.

Meta claims its use of public materials falls under the “fair use” doctrine
The company argues it is using text solely for statistical language modeling and generating original content
Fair use determinations are typically made on a case-by-case basis in U.S. copyright law

Industry context: Similar allegations have emerged against other tech companies, though some have taken different approaches to AI training.

Apple faced scrutiny over its OpenELM model using YouTube video subtitles
Apple clarified that OpenELM was purely for research and not used in consumer products
Apple Intelligence uses licensed data and public content collected through web crawlers
Major publishers like The New York Times and The Atlantic have opted out of sharing content with Apple Intelligence

Looking ahead: This lawsuit could set important precedents for AI training practices and copyright law, potentially forcing tech companies to establish more transparent and legally compliant data sourcing methods for AI development.
