Report: OpenAI, Microsoft, Google, and Meta Stole Content at Scale to Train AI

A stunning new report details how OpenAI/Microsoft, Google, and Meta quietly stole content wherever they could find it to help train their generative AI services.

OpenAI’s abuses—which are also Microsoft’s abuses, given its overt reliance on OpenAI—have long been a point of contention, with The New York Times (and others) suing both companies for copyright infringement. But thanks to a new report in, wait for it, The New York Times, we now know that their abuses include stealing video content too. A lot of video content.

According to the periodical, OpenAI in 2021 created a speech recognition tool called Whisper that could transcribe the audio in YouTube videos into text for training its ChatGPT AI. It did so despite objections from employees who argued that doing so violated YouTube’s rules. And it did so at scale, transcribing over one million hours of YouTube videos. The result was GPT-4, which is still widely regarded as the leading large language model (LLM).

The report claims that Google also transcribed YouTube videos to train its own AI, now called Gemini, and that in doing so it knowingly violated creators’ copyrights. (Google owns YouTube.)

The New York Times cites several sources for both accusations. And it adds that Google, nervous about GPT-4, quietly changed its terms of service in 2023 so that it could train Gemini on consumer Google Docs documents, restaurant reviews in Google Maps, and other online material in its ecosystem.

Meta was similarly alarmed by GPT-4, and it, too, needed massive quantities of data to train its own AI chatbot. Fresh off its Cambridge Analytica privacy scandal, one might think that Meta would be more careful this time. But it engaged in various dubious activities, including hiring contractors in Africa to aggregate summaries of fiction and nonfiction works, copyrighted or not. Internally, it cited OpenAI’s “market precedent” for this theft. And as with the other firms, some Meta employees pushed back against this clearly illegal behavior but were rebuffed.

What this is, ultimately, is a race: OpenAI, Microsoft, Google, Meta, and any other firms with generative AI ambitions know that the key to their success is getting as much data as possible, as quickly as possible, and doing so before regulators and lawmakers can catch up with their activities. The goal is to already have consumed the data by the time they’re held accountable, and to create synthetic data capabilities with that data so that further theft is not required.

“As long as you can get over the synthetic data event horizon, where the model is smart enough to make good synthetic data, everything will be fine,” OpenAI CEO Sam Altman has said.

Thurrott © 2024 Thurrott LLC