## Tim O'Reilly Says OpenAI Stole His Company's Paywalled Book Content

Echoing a legal complaint by The New York Times, publisher Tim O’Reilly says that OpenAI stole content from dozens of his books.

“Using a legally obtained dataset of 34 copyrighted O’Reilly Media books, we applied the DE-COP membership inference attack method to investigate whether OpenAI’s models were trained on book content that wasn’t publicly accessible,” O’Reilly writes. “In addition, we compared their models’ knowledge of public, free to use samples, from each O’Reilly book, with paywalled (non-public) chapters from the same book.”
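For the curious, DE-COP boils down to a multiple-choice quiz: show the model a verbatim passage alongside paraphrases of it and ask which one is the real quote. A model that picks the verbatim text well above the 25 percent chance rate has likely seen it during training. Below is a minimal sketch of that comparison in Python, assuming the official openai client; the `paraphrase` helper and the two excerpt lists are hypothetical placeholders, not code from the report.

```python
# Minimal DE-COP-style sketch (assumptions: the openai Python client;
# paraphrase(), public_excerpts, and paywalled_excerpts are hypothetical).
import random

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment


def verbatim_pick_rate(passages, paraphrase, model="gpt-4o"):
    """Ask the model to spot the verbatim passage among three paraphrases.

    Accuracy well above the 25% chance rate suggests the passages were
    part of the model's training data.
    """
    hits = 0
    for original in passages:
        options = [original] + [paraphrase(original) for _ in range(3)]
        random.shuffle(options)
        answer = "ABCD"[options.index(original)]
        prompt = (
            "One of these passages is quoted verbatim from a book; the "
            "others are paraphrases. Reply with only the letter.\n\n"
            + "\n\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
        )
        reply = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        if reply.choices[0].message.content.strip().upper().startswith(answer):
            hits += 1
    return hits / len(passages)


# The report's comparison in miniature: a paywalled rate far above both
# the public rate and the 25% baseline is the incriminating signal.
# public_rate = verbatim_pick_rate(public_excerpts, paraphrase)
# paywalled_rate = verbatim_pick_rate(paywalled_excerpts, paraphrase)
```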

O’Reilly published a detailed accounting of the findings in a 33-page report, but the conclusion is plain enough: OpenAI’s GPT-4o model “shows a strong recognition of the paywalled content,” stronger even than its recognition of the public samples, despite those being freely accessible online. In short, it is mathematically “likely” that OpenAI trained its models on copyrighted book content without first seeking the licensing agreement the content owner requires.

“Our research provides empirical evidence that OpenAI may be training its models on non-public, copyrighted content without proper authorization,” he continues. “As AI companies increasingly rely on vast datasets to improve their models, questions about fair compensation for content creators become more urgent. If AI systems extract value from content creators’ work without fair compensation, they risk depleting the very resources upon which their systems depend, potentially creating what we describe as an ‘extractive dead end’ for the internet’s content ecosystem.”

O’Reilly also notes that OpenAI, like other AI model makers, does enter into licensing arrangements with some publishers, so it’s unclear why it never even attempted one for O’Reilly’s books. As with The New York Times case, one assumes that OpenAI simply paid a fee to access the content, as an individual reader would, and then stole that content to train its models. This is, he explains in the report, a “systemic” issue.

“Our findings aim to provoke changes in data collection and usage practices across AI model developers,” the full report reads. “Meta allegedly trained their models on LibGen, a massive corpus of pirated books. Anthropic allegedly used ‘The Pile’ dataset for its training, which also contains many pirated books. Given the need for high-quality paywalled data to ensure AI models are smart and kept up to date, training on such data will be a necessity for the foreseeable future. This means that structured markets for such data still have time, and a need, to arise.”
