
Google Let OpenAI Scrape YouTube Data Because Google Was Doing It Too

In a bid to secure enough data to train their AIs for years to come, OpenAI, Google, and Meta have dabbled in or seriously considered some sketchy tactics, the New York Times reports.

Updated April 6, 2024
An AI graphic being held in someone's hand (Credit: Shutterstock / LookerStudio)

OpenAI made headlines recently after its CTO couldn't say definitively whether the company had trained its Sora video generator on YouTube data, but it looks like most of the tech giants—OpenAI, Google, and Meta—have dabbled in potentially unauthorized data scraping, or at least seriously considered it.

As the New York Times reports, OpenAI transcribed more than a million hours of YouTube videos using its Whisper technology in order to train its GPT-4 AI model. But Google, which owns YouTube, did the same, potentially violating its creators' copyrights, so it didn't go after OpenAI.

In an interview with Bloomberg this week, YouTube CEO Neal Mohan said the company's terms of service "does not allow for things like transcripts or video bits to be downloaded, and that is a clear violation of our terms of service." But when pressed on whether YouTube data was scraped by OpenAI, Mohan was evasive. "I have seen reports that it may or may not have been used. I have no information myself," he said.

The Times' report focuses on the need for more and more data to train advanced AI models, and the sometimes sketchy things the tech giants have considered to get it. As OpenAI CEO Sam Altman has noted, data "will run out" eventually, putting the usefulness of these billion-dollar companies' products in question.

Meta, for example, discussed acquiring Simon & Schuster so its AI could ingest the publisher's books. It also pondered just scraping whatever it needed and hoping people didn't sue, the Times says, citing recordings of internal meetings. Execs also looked to a 2015 ruling that said Google did not violate copyright laws by digitizing books for Google Books.

Google, meanwhile, changed its terms of service (on a holiday weekend) to let it use public Google Docs, restaurant reviews on Google Maps, and other internet data to train its AI. The Docs data was used as part of "an experimental program," Google tells the Times.

The Times itself has already pushed back on this data scraping. It sued OpenAI and its partner Microsoft for using Times content to train their AI models, a case that's currently making its way through the courts.


About Emily Price

Weekend Reporter

Emily is a freelance writer based in Durham, NC. Her work has appeared in The Wall Street Journal, The New York Times, Lifehacker, Popular Mechanics, Macworld, Engadget, Computerworld, and more. You can also snag a copy of her book Productivity Hacks: 500+ Easy Ways to Accomplish More at Work--That Actually Work! online through Simon & Schuster or wherever books are sold.
