New research from Stanford, Cornell, and West Virginia University reveals that Meta’s Llama 3.1 70B model can reproduce 42 percent of Harry Potter and the Sorcerer’s Stone verbatim, challenging claims that AI memorization is merely a “fringe behavior.” The findings could significantly impact ongoing copyright lawsuits against AI companies, providing ammunition for both plaintiffs and defendants in disputes over training models on copyrighted content.
What you should know: The study tested five popular open-weight AI models to see how easily they could reproduce 50-token excerpts from Books3, a collection widely used to train language models.
- Llama 3.1 70B dramatically outperformed the other models, memorizing far more of popular books like The Hobbit and George Orwell’s 1984 than of obscure titles.
- In contrast, Meta’s earlier Llama 1 65B model had memorized only 4.4 percent of the same Harry Potter book, suggesting the problem worsened significantly between model generations.
- The researchers used a strict definition of memorization, requiring models to reproduce exact 50-token sequences with greater than 50 percent probability.
In plain English: Researchers tested whether AI models could spit back exact chunks of books they were trained on. They found that Meta’s newer model could reproduce nearly half of the first Harry Potter book word-for-word, while an older version could only reproduce about 4 percent—showing the problem got much worse over time.
The big picture: These results complicate the narrative that AI companies have been presenting in court—that their models merely learn word patterns without storing actual content.
- “We’d expected to see some kind of low level of replicability on the order of 1 or 2 percent,” said Stanford law professor Mark Lemley, who previously worked for Meta but dropped them as a client in January. “The first thing that surprised me is how much variation there is.”
- The study found striking differences between models and books: some works were barely memorized, while others showed extensive verbatim reproduction.
Why this matters: The research provides concrete evidence that could reshape legal arguments in multiple ongoing copyright cases against AI companies.
- For critics, the findings show memorization isn’t always a fringe phenomenon, particularly for popular content.
- For defendants, the highly variable results across different books could complicate class-action lawsuits by showing plaintiffs aren’t in similar legal situations.
- Novelist Richard Kadrey, whose Sandman Slim was memorized at only 0.13 percent, is ironically the lead plaintiff in a class-action suit against Meta.
How they measured memorization: Researchers used probability calculations rather than generating thousands of outputs, making the study both cost-effective and precise.
- They calculated the likelihood that models would produce specific 50-token sequences by multiplying individual token probabilities.
- A 50-token sequence with greater than 50 percent reproduction probability requires average token probabilities of at least 98.5 percent—indicating strong memorization.
- This method allowed researchers to estimate probabilities so low they would require “more than 10 quadrillion samples” to observe through traditional generation.
In plain English: Instead of asking the AI to generate text thousands of times and counting exact matches, researchers looked at the AI’s internal confidence scores for each word. When an AI is very confident about predicting the next 50 words in sequence, it’s essentially memorized that passage. This approach let them measure memorization without the massive cost of running countless text generations.
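For intuition, here is a minimal sketch of that arithmetic in Python. The function names are illustrative, not the researchers’ actual code, and the only inputs assumed are the per-token probabilities the study reads from the model.

```python
import math

def sequence_probability(token_probs: list[float]) -> float:
    """Probability of emitting one exact token sequence: the product
    of the model's per-token probabilities along that sequence."""
    return math.prod(token_probs)

def counts_as_memorized(token_probs: list[float], threshold: float = 0.5) -> bool:
    """The study's strict criterion: a 50-token excerpt counts as
    memorized only if its reproduction probability exceeds 50 percent."""
    return sequence_probability(token_probs) > threshold

# Why near-total confidence is required: with a uniform per-token
# probability p, solving p**50 > 0.5 gives p > 0.5 ** (1 / 50) ≈ 0.9862,
# which is where the article's "at least 98.5 percent" average comes from.
print(f"{0.5 ** (1 / 50):.4f}")  # 0.9862
```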
Legal implications: The findings create potential problems for AI companies under three distinct copyright theories.
- Training on copyrighted works could be inherently infringing, though companies argue this falls under fair use similar to Google Books.
- The models themselves might constitute derivative works if they contain substantial portions of copyrighted content.
- Direct infringement occurs when models generate copyrighted material, with Llama’s extensive Harry Potter memorization providing clear evidence.
The mystery factor: Researchers couldn’t determine exactly why Llama 3.1 memorized so much Harry Potter content, though they offer several theories.
- Llama 3 was trained on 15 trillion tokens compared to Llama 1’s 1.4 trillion, potentially including multiple exposures to the same books.
- Secondary sources like fan forums, reviews, and student reports might have included extensive Harry Potter quotes.
- However, memorizing nearly half the book suggests the complete text appeared frequently in training data, not just popular excerpts.
What they’re saying: Legal experts emphasize how these findings complicate existing defense strategies.
- “It’s clear that you can in fact extract substantial parts of Harry Potter and various other books from the model,” Lemley explained. “That suggests to me that probably for some of those books there’s something the law would call a copy of part of the book in the model itself.”
- Cornell’s James Grimmelmann noted the research reveals “really striking differences among models in terms of how much verbatim text they have memorized.”
Broader consequences: The study highlights a potential legal disadvantage for open-weight models compared to closed systems.
- Researchers could only conduct this analysis because they had access to Llama’s underlying probability values, which closed models like OpenAI’s GPT-4 don’t provide.
- Companies with closed models can implement filters to prevent infringing content from reaching users, making violations harder to prove.
- “It’s kind of perverse,” Lemley said about the possibility that copyright law might discourage open model releases. “I don’t like that outcome.”
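To make that asymmetry concrete, here is a hedged sketch of the kind of probability read-out open weights permit, using the Hugging Face transformers library (an assumed toolchain; the paper’s actual code is not reproduced here):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model ID; any open-weight causal LM exposes logits the same way.
MODEL_ID = "meta-llama/Llama-3.1-70B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
model.eval()

def continuation_logprob(prefix: str, continuation: str) -> float:
    """Total log-probability the model assigns to `continuation` given `prefix`.
    Caveat: tokenizing prefix and full text separately can differ at the
    boundary; a careful implementation tokenizes once and slices by index."""
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits  # shape: (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Position t predicts token t + 1, so score only the continuation tokens.
    for t in range(prefix_len - 1, full_ids.shape[1] - 1):
        total += log_probs[0, t, full_ids[0, t + 1]].item()
    return total
```

Because the weights run locally, a researcher can teacher-force the model on any book passage and read off every token’s probability; a closed API that returns only generated text offers no comparable handle.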