Report: OpenAI, Microsoft, Google, and Meta Stole Content at Scale to Train AI

A stunning new report details how OpenAI/Microsoft, Google, and Meta quietly stole content wherever they could find it to help train their generative AI services.

OpenAI’s abuses—which are also Microsoft’s abuses, given its overt reliance on OpenAI—have long been a point of contention, with The New York Times (and others) suing both companies for copyright infringement. But thanks to a new report in, wait for it, The New York Times, we now know that their abuses include stealing video content too. A lot of video content.

According to the periodical, OpenAI in 2021 created a speech recognition tool called Whisper that could transcribe the audio in YouTube videos into text for training its ChatGPT AI. It did so despite objections from employees who argued that doing so violated YouTube’s rules. And it did so at scale, transcribing over one million hours of YouTube videos. The result was GPT-4, which is still widely regarded as the leading large language model (LLM).

The report claims that Google also transcribed YouTube videos to train its own AI, now called Gemini, and that in doing so it knowingly violated creators’ copyrights. (Google owns YouTube.)

The New York Times cites several sources for both accusations. And it adds that Google, nervous about GPT-4, quietly changed its terms of service in 2023 so that it could train Gemini on consumer Google Docs documents, restaurant reviews in Google Maps, and other online material in its ecosystem.

Meta was similarly alarmed by GPT-4, and it, too, needed massive quantities of data to train its own AI chatbot. Fresh off its Cambridge Analytica privacy scandal, one might think that Meta would be more careful this time. But it engaged in various dubious activities, including hiring contractors in Africa to aggregate summaries of fiction and nonfiction works, copyrighted or not. Internally, it cited OpenAI’s “market precedent” for this theft. And as with the other firms, some Meta employees pushed back against this clearly illegal behavior but were rebuffed.

What this is, ultimately, is a race: OpenAI, Microsoft, Google, Meta, and any other firms with generative AI ambitions know that the key to their success is getting as much data as possible, as quickly as possible, and doing so before regulators and lawmakers can catch up with their activities. The goal is to already have consumed the data by the time they’re held accountable, and to create synthetic data capabilities with that data so that further theft is not required.

“As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine,” OpenAI CEO Sam Altman has said.

Thurrott © 2024 Thurrott LLC