## Tim O'Reilly Says OpenAI Stole His Company's Paywalled Book Content

Echoing a legal complaint by The New York Times, publisher Tim O’Reilly says that OpenAI stole content from dozens of his books.

“Using a legally obtained dataset of 34 copyrighted O’Reilly Media books, we applied the DE-COP membership inference attack method to investigate whether OpenAI’s models were trained on book content that wasn’t publicly accessible,” O’Reilly writes. “In addition, we compared their models’ knowledge of public, free to use samples, from each O’Reilly book, with paywalled (non-public) chapters from the same book.”
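For the curious, DE-COP boils down to a multiple-choice quiz: show the model a verbatim passage alongside paraphrases of it and ask which one is the real quote. A model that picks the verbatim text well above the 25 percent chance rate has likely seen it during training. Below is a minimal sketch of that comparison in Python, assuming the official openai client; the `paraphrase` helper and the two excerpt lists are hypothetical placeholders, not code from the report.

```python
# Minimal DE-COP-style sketch (assumptions: the openai Python client;
# paraphrase(), public_excerpts, and paywalled_excerpts are hypothetical).
import random

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def verbatim_pick_rate(passages, paraphrase, model="gpt-4o"):
    """Ask the model to spot the verbatim passage among three paraphrases.

    Accuracy well above the 25% chance rate suggests the passages were
    part of the model's training data.
    """
    hits = 0
    for original in passages:
        options = [original] + [paraphrase(original) for _ in range(3)]
        random.shuffle(options)
        answer = "ABCD"[options.index(original)]
        prompt = (
            "One of these passages is quoted verbatim from a book; the "
            "others are paraphrases. Reply with only the letter.\n\n"
            + "\n\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
        )
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if reply.choices[0].message.content.strip().upper().startswith(answer):
            hits += 1
    return hits / len(passages)


# The report's comparison in miniature: a paywalled rate far above both
# the public rate and the 25% baseline is the incriminating signal.
# public_rate = verbatim_pick_rate(public_excerpts, paraphrase)
# paywalled_rate = verbatim_pick_rate(paywalled_excerpts, paraphrase)
```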

O’Reilly published a detailed accounting of the findings in a 33-page report, but the conclusion is plain enough: OpenAI’s GPT-4o model “shows a strong recognition of the paywalled content,” stronger even than its recognition of the public samples, despite those being freely accessible online. In short, it is mathematically “likely” that OpenAI trained its models on copyrighted book content without first seeking the licensing agreement the content owner requires.

“Our research provides empirical evidence that OpenAI may be training its models on non-public, copyrighted content without proper authorization,” he continues. “As AI companies increasingly rely on vast datasets to improve their models, questions about fair compensation for content creators become more urgent. If AI systems extract value from content creators’ work without fair compensation, they risk depleting the very resources upon which their systems depend, potentially creating what we describe as an ‘extractive dead end’ for the internet’s content ecosystem.”

O’Reilly also notes that OpenAI, like other AI model makers, does enter into licensing arrangements with some publishers, so it’s unclear why it never even attempted one for O’Reilly’s books. As with The New York Times case, one assumes that OpenAI simply paid a fee to access the content, as an individual reader would, and then stole that content to train its models. This is, he explains in the report, a “systemic” issue.

“Our findings aim to provoke changes in data collection and usage practices across AI model developers,” the full report reads. “Meta allegedly trained their models on LibGen, a massive corpus of pirated books. Anthropic allegedly used ‘The Pile’ dataset for its training, which also contains many pirated books. Given the need for high-quality paywalled data to ensure AI models are smart and kept up to date, training on such data will be a necessity for the foreseeable future. This means that structured markets for such data still have time, and a need, to arise.”
