Researchers suggest OpenAI trained AI models on paywalled O’Reilly books
OpenAI has faced numerous accusations of using copyrighted material without permission to train its AI models. A new report by the AI Disclosures Project, a watchdog organization, alleges that the company has increasingly relied on non-public books that it did not license to develop its more advanced models.
AI models function as complex prediction engines, learning patterns from vast amounts of data—such as books, movies, and television shows—to generate content. When an AI model produces an essay on Greek tragedy or an illustration in the style of Studio Ghibli, it is synthesizing information from its training data rather than creating something entirely new.
As real-world data sources, particularly from the public web, become exhausted, AI labs, including OpenAI, have begun incorporating AI-generated content into their training sets. However, relying solely on synthetic data carries risks, including potential declines in model performance.
The report suggests that OpenAI may have trained its GPT-4o model on paywalled books from O’Reilly Media. The AI Disclosures Project is a nonprofit co-founded in 2024 by media executive Tim O’Reilly and economist Ilan Strauss; notably, O’Reilly Media has no licensing agreement with OpenAI.
“GPT-4o, OpenAI’s latest and most advanced model, demonstrates a significantly stronger recognition of paywalled O’Reilly book content compared to GPT-3.5 Turbo,” the report states. “By contrast, GPT-3.5 Turbo shows a higher recognition of publicly accessible O’Reilly book excerpts.”
The study used DE-COP, a technique introduced in a 2024 academic paper, to detect copyrighted content in language models. A form of “membership inference attack,” the method tests whether a model can reliably distinguish original human-authored text from AI-generated paraphrases of it. If the model consistently picks out the original, the content was likely part of its training data.
To test this hypothesis, the report’s authors—O’Reilly, Strauss, and AI researcher Sruly Rosenblat—examined OpenAI’s models, including GPT-4o and GPT-3.5 Turbo. Using 13,962 paragraph excerpts from 34 O’Reilly books, they estimated the probability that each excerpt had been included in the models’ training data.
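The DE-COP setup described above can be sketched as a toy simulation. This is a hypothetical illustration, not the study’s code: the model is replaced by stub chooser functions, and a real test would instead score each option with the model under study.

```python
import random

def run_trial(choose_fn, original, paraphrases, rng):
    """One DE-COP-style multiple-choice trial: shuffle the original
    passage in with paraphrased versions and check whether the model
    picks out the original wording."""
    options = paraphrases + [original]
    rng.shuffle(options)
    return options[choose_fn(options)] == original

def membership_rate(choose_fn, original, paraphrases, trials=200, seed=42):
    """Fraction of trials in which the model identifies the original.
    Chance level is 1 / (len(paraphrases) + 1); a rate well above
    chance suggests the passage was in the training data."""
    rng = random.Random(seed)
    hits = sum(run_trial(choose_fn, original, paraphrases, rng)
               for _ in range(trials))
    return hits / trials

# Stand-ins for a real model call: a "memorizer" that recognizes the
# exact original wording (as a model that trained on the text might),
# and a chance-level guesser (as a model that never saw it might).
original = "Python's GIL serializes bytecode execution across threads."
paraphrases = [
    "In Python, a global lock means threads run bytecode one at a time.",
    "Thread-level parallelism in CPython is limited by a global lock.",
    "CPython has one interpreter lock, so threads do not run in parallel.",
]

memorizer = lambda options: options.index(original)
_guess_rng = random.Random(1)
guesser = lambda options: _guess_rng.randrange(len(options))
```

With three paraphrases per trial, chance is 25%: the memorizer scores 100%, while the guesser hovers near 25%. The report’s finding is essentially that GPT-4o behaved more like the memorizer on paywalled O’Reilly excerpts than GPT-3.5 Turbo did.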
The results indicate that GPT-4o exhibited a much greater familiarity with paywalled O’Reilly content than GPT-3.5 Turbo, even after controlling for general gains in model capability. According to the researchers, this suggests that GPT-4o likely had prior exposure to many non-public O’Reilly books published before its training cutoff date.
However, the authors acknowledge that their findings are not definitive. They concede that OpenAI could have obtained these book excerpts through users copying and pasting content into ChatGPT, rather than directly scraping the books. Furthermore, the study did not evaluate OpenAI’s latest models, such as GPT-4.5 and reasoning-focused models like o3-mini and o1, which may have different training data sources.
Despite the uncertainty, OpenAI’s approach to copyrighted material has been a point of contention. The company has pushed for more relaxed regulations regarding the use of copyrighted data in AI training and has sought high-quality data sources, even hiring journalists and domain experts to refine its models.
It is worth noting that OpenAI does pay for some of its training data. The company has licensing agreements with news publishers, social media platforms, and stock media providers. Additionally, OpenAI offers opt-out mechanisms that allow copyright holders to request the exclusion of their content, although these mechanisms have been criticized as imperfect.
As OpenAI continues to face multiple lawsuits in U.S. courts over its training-data and copyright practices, the allegations in the O’Reilly report add to concerns about the company’s approach to intellectual property.
OpenAI has not responded to requests for comment on the matter.