Meta allegedly trained its AI models on pirated torrent content

Court documents filed in an ongoing copyright lawsuit suggest that Meta trained its Llama large language models on pirated content sourced from torrent sites.

Key developments: Court documents in the “Kadrey et al. v. Meta Platforms” case have exposed internal communications suggesting Meta’s use of unauthorized content for AI training.

  • Novelists Richard Kadrey and Christopher Golden filed the lawsuit in 2023, alleging Meta used their copyrighted works without permission
  • Judge Vince Chhabria ordered the release of unredacted documents that had previously been withheld from public view
  • Internal communications show Meta employees expressing concerns about downloading torrented content on corporate laptops
  • Evidence indicates CEO Mark Zuckerberg may have authorized the use of pirated materials

Sources of unauthorized content: Meta allegedly accessed pirated materials from multiple unauthorized digital libraries to train its AI systems.

  • LibGen, a Russia-based digital library established in 2008, was identified as a primary source of the pirated content
  • The platform contains unauthorized copies of books, magazines, and academic articles
  • Additional “shadow libraries” were also reportedly used in the training process
  • LibGen has faced multiple copyright lawsuits since its creation, though its operators remain unknown

Meta’s legal defense: The company has presented arguments attempting to justify its use of copyrighted materials.

  • Meta claims its use of publicly available materials falls under the “fair use” doctrine
  • The company argues the text is used solely for statistical language modeling and to generate original content
  • Fair use determinations are typically made on a case-by-case basis in U.S. copyright law

Industry context: Similar allegations have emerged against other tech companies, though some have taken different approaches to AI training.

  • Apple faced scrutiny after its OpenELM model was found to have been trained on YouTube video subtitles
  • Apple clarified that OpenELM was purely for research and not used in consumer products
  • Apple Intelligence uses licensed data and public content collected through web crawlers
  • Major publishers like The New York Times and The Atlantic have opted out of sharing content with Apple Intelligence

Looking ahead: This lawsuit could set important precedents for AI training practices and copyright law, potentially forcing tech companies to establish more transparent and legally compliant data sourcing methods for AI development.

