New research from Stanford, Cornell, and West Virginia University reveals that Meta’s Llama 3.1 70B model can reproduce 42 percent of Harry Potter and the Sorcerer’s Stone verbatim, challenging claims that AI memorization is merely a “fringe behavior.” The findings could significantly impact ongoing copyright lawsuits against AI companies, providing ammunition for both plaintiffs and defendants in disputes over training models on copyrighted content.
What you should know: The study tested five popular open-weight AI models to see how easily they could reproduce 50-token excerpts from Books3, a collection widely used to train language models.
- Llama 3.1 70B dramatically outperformed the other models, memorizing far more of popular books like The Hobbit and George Orwell’s 1984 than of obscure titles.
- In contrast, Meta’s earlier Llama 1 65B model had memorized only 4.4 percent of the same Harry Potter book, suggesting the problem worsened significantly between model generations.
- The researchers used a strict definition of memorization, requiring models to reproduce exact 50-token sequences with greater than 50 percent probability.
In plain English: Researchers tested whether AI models could spit back exact chunks of books they were trained on. They found that Meta’s newer model could reproduce nearly half of the first Harry Potter book word-for-word, while an older version could only reproduce about 4 percent—showing the problem got much worse over time.
The big picture: These results complicate the narrative that AI companies have been presenting in court—that their models merely learn word patterns without storing actual content.
- “We’d expected to see some kind of low level of replicability on the order of 1 or 2 percent,” said Stanford law professor Mark Lemley, who previously worked for Meta but dropped them as a client in January. “The first thing that surprised me is how much variation there is.”
- The study found striking differences between models and books: some works were barely memorized, while others showed extensive verbatim reproduction.
Why this matters: The research provides concrete evidence that could reshape legal arguments in multiple ongoing copyright cases against AI companies.
- For critics, the findings show memorization isn’t always a fringe phenomenon, particularly for popular content.
- For defendants, the highly variable results across different books could complicate class-action lawsuits by showing plaintiffs aren’t in similar legal situations.
- Novelist Richard Kadrey, whose Sandman Slim was memorized at only 0.13 percent, is ironically the lead plaintiff in a class-action suit against Meta.
How they measured memorization: Researchers used probability calculations rather than generating thousands of outputs, making the study both cost-effective and precise.
- They calculated the likelihood that models would produce specific 50-token sequences by multiplying individual token probabilities.
- A 50-token sequence with greater than 50 percent reproduction probability requires average token probabilities of at least 98.5 percent—indicating strong memorization.
- This method allowed researchers to estimate probabilities so low they would require “more than 10 quadrillion samples” to observe through traditional generation.
In plain English: Instead of asking the AI to generate text thousands of times and counting exact matches, researchers looked at the AI’s internal confidence scores for each word. When an AI is very confident about predicting the next 50 words in sequence, it’s essentially memorized that passage. This approach let them measure memorization without the massive cost of running countless text generations.
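For intuition, here is a minimal sketch of that arithmetic in Python. The function names are illustrative, not the researchers’ actual code, and the only inputs assumed are the per-token probabilities the study reads from the model.

```python
import math

def sequence_probability(token_probs: list[float]) -> float:
    """Probability of emitting one exact token sequence: the product
    of the model's per-token probabilities along that sequence."""
    return math.prod(token_probs)

def counts_as_memorized(token_probs: list[float], threshold: float = 0.5) -> bool:
    """The study's strict criterion: a 50-token excerpt counts as
    memorized only if its reproduction probability exceeds 50 percent."""
    return sequence_probability(token_probs) > threshold

# Why near-total confidence is required: with a uniform per-token
# probability p, solving p**50 > 0.5 gives p > 0.5 ** (1 / 50) ≈ 0.9862,
# which is where the article's "at least 98.5 percent" average comes from.
print(f"{0.5 ** (1 / 50):.4f}")  # 0.9862
```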
Legal implications: The findings create potential problems for AI companies under three distinct copyright theories.
- Training on copyrighted works could be inherently infringing, though companies argue this falls under fair use similar to Google Books.
- The models themselves might constitute derivative works if they contain substantial portions of copyrighted content.
- Direct infringement occurs when models generate copyrighted material, with Llama’s extensive Harry Potter memorization providing clear evidence.
The mystery factor: Researchers couldn’t determine exactly why Llama 3.1 memorized so much Harry Potter content, though they offer several theories.
- Llama 3 was trained on 15 trillion tokens compared to Llama 1’s 1.4 trillion, potentially including multiple exposures to the same books.
- Secondary sources like fan forums, reviews, and student reports might have included extensive Harry Potter quotes.
- However, memorizing nearly half the book suggests the complete text appeared frequently in training data, not just popular excerpts.
What they’re saying: Legal experts emphasize how these findings complicate existing defense strategies.
- “It’s clear that you can in fact extract substantial parts of Harry Potter and various other books from the model,” Lemley explained. “That suggests to me that probably for some of those books there’s something the law would call a copy of part of the book in the model itself.”
- Cornell’s James Grimmelmann noted the research reveals “really striking differences among models in terms of how much verbatim text they have memorized.”
Broader consequences: The study highlights a potential legal disadvantage for open-weight models compared to closed systems.
- Researchers could only conduct this analysis because they had access to Llama’s underlying probability values, which closed models like OpenAI’s GPT-4 don’t provide.
- Companies with closed models can implement filters to prevent infringing content from reaching users, making violations harder to prove.
- “It’s kind of perverse,” Lemley said about the possibility that copyright law might discourage open model releases. “I don’t like that outcome.”
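To make that asymmetry concrete, here is a hedged sketch of the kind of probability read-out open weights permit, using the Hugging Face transformers library (an assumed toolchain; the paper’s actual code is not reproduced here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID; any open-weight causal LM exposes logits the same way.
MODEL_ID = "meta-llama/Llama-3.1-70B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` given `prefix`.
    Caveat: tokenizing prefix and full text separately can differ at the
    boundary; a careful implementation tokenizes once and slices by index."""
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Position t predicts token t + 1, so score only the continuation tokens.
    for t in range(prefix_len - 1, full_ids.shape[1] - 1):
        total += log_probs[0, t, full_ids[0, t + 1]].item()
    return total
```

Because the weights run locally, a researcher can teacher-force the model on any book passage and read off every token’s probability; a closed API that returns only generated text offers no comparable handle.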