A common trope when discussing AI is to call out the vast quantities of data that models are trained on. This has led some to believe that the performance of AI models comes solely from processing more information than any human.
As I’ve worked more with large models across a variety of domains and become increasingly impressed with their capabilities, I decided to spend some time verifying these claims. How many more books did ChatGPT actually consume compared to a human? How many images did Stable Diffusion see? Before running the numbers, I would have estimated that the average large model sees five to six orders of magnitude more data than a human, mostly due to how articles have described the size of these datasets. It turns out that the gap between how much data a human and a model need is much smaller than I had anticipated.
I’ll tackle four areas of interest: images, text, speech, and energy. In each, I’ll attempt a ballpark estimate of human and model data consumption, and compare the two. Wonder where humans still have the edge and for how long? Let’s find out.
Let’s estimate how many interesting images a human sees throughout their lifetime. On average, people are awake for 16 hours per day and see around 60 images per second. However, most of these images are nearly identical, so let’s instead use an average of 6 distinct images perceived per second. Six images per second for 16 hours a day translates to about 346,000 images per day, or roughly 126 million images per year. So, by the age of 10, the average person has seen about 1.3 billion images, and about 2.5 billion by the age of 20.
What about computers?
The dataset used to train Stable Diffusion, called LAION, includes approximately 413 million pairs of images and their corresponding captions. This is equivalent to the number of images that a human sees during roughly their first three years of life. And yet, compared to most toddlers, Stable Diffusion is a pretty amazing painter.
The above is a bit of an unfair comparison because each LAION image comes with a caption. While humans aren’t provided live captions, they have other unfair advantages, such as their mastery of language, which allows them to generate those captions themselves, or 3D vision and depth perception, which make it significantly easier to make sense of the world (and to draw hands). For the purpose of our analysis, we’ll ignore these issues.
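The image estimate can be re-run from its stated assumptions (6 distinct images per second, 16 waking hours per day, 413 million LAION pairs). The Python sketch below is only a back-of-the-envelope check; the function and constant names are mine, and every result scales linearly with whatever per-second rate you pick.

```python
# Back-of-the-envelope check of the image estimate.
# All inputs are the post's stated assumptions, not measurements.
DISTINCT_IMAGES_PER_SECOND = 6
WAKING_HOURS_PER_DAY = 16
DAYS_PER_YEAR = 365
LAION_PAIRS = 413_000_000  # image-caption pairs, as quoted in the post


def images_per_day(rate=DISTINCT_IMAGES_PER_SECOND, hours=WAKING_HOURS_PER_DAY):
    """Distinct images seen in one day at the given rate and waking hours."""
    return rate * hours * 3600


def images_by_age(years):
    """Distinct images seen after `years` years of life."""
    return images_per_day() * DAYS_PER_YEAR * years


def years_to_match(dataset_size):
    """Years of human viewing needed to see `dataset_size` distinct images."""
    return dataset_size / (images_per_day() * DAYS_PER_YEAR)


if __name__ == "__main__":
    print(f"per day:      {images_per_day():,}")
    print(f"by age 10:    {images_by_age(10):,}")
    print(f"by age 20:    {images_by_age(20):,}")
    print(f"LAION equals: {years_to_match(LAION_PAIRS):.1f} years of viewing")
```

Changing `DISTINCT_IMAGES_PER_SECOND` is the quickest way to see how sensitive the comparison is to that one guess.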
Verdict: ML wins. ML models require fewer images than humans.
Assume that babies hear four hours of language a day until they reach pre-K (age 3), eight hours of conversations or lectures a day during kindergarten, and approximately twelve hours a day (including classes and personal time) in elementary school. Under those assumptions, an average human being will have heard approximately 1,460 hours of speech by their first birthday, 4,000 hours by the time they reach pre-K, 10,000 by kindergarten, 25,000 after completing elementary school, and 60,000 by the end of high school.
The machine learning model Whisper was trained on an impressive 680,000 hours of audio, which is roughly ten times what an average human has heard by the end of high school. At the same time, Whisper performs human-level speech transcription in 10 different languages. Assuming equal difficulty for all languages, Whisper is 10 times better than the average human.
Verdict: Tie. ML models require 10 times more data to perform 10 times better.
On average, Americans read four books per year. Over an active reading range of 10 to 80 years old, that comes out to 280 books. We can generously double this estimate to account for education and round up to 600 books in a lifetime.
Nowadays people don’t just read books. They read blogs (thanks by the way!), social media posts, online articles, as well as texts and email.
People spend an average of 150 minutes daily on social media, a quarter of which is spent reading content. At a reading speed of 250 words per minute, that’s an additional 3.4 million words a year, the equivalent of reading an extra 43 books per year (an average book is 80,000 words). Assuming regular social media use from age 10 to 70, we get an additional 2,623 books per person. What about texts and email?
The average email is 100 words. An average person sends and receives 60 non-spam business emails a day. All in all that gives us an additional 2.2 million words (or 28 books) a year, totaling 1,708 extra books from 10 to 70.
Ten years ago, Americans sent and received an average of 40 texts a day. Based on my experience, I’m going to double this estimate to get to current numbers. 80 texts a day, at a short average length of 7 words, gives us an additional 200,000 words a year, or 2.5 books. Assuming we text from age 10 to 80 gives us an additional 175 books. That’s a lot of texts.
Summing all those up, we have 600 + 2,623 + 1,708 + 175 = 5,106 books.
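For readers who want to vary the inputs, here is a minimal Python sketch of the same tally. It uses unrounded per-year figures and spans of 60 and 70 years, so its total lands a little below the figure above; the small difference comes from rounding and from whether the end years are counted inclusively. The variable and function names are illustrative only.

```python
# Rough tally of lifetime reading, expressed in 80,000-word "books".
# Every input is one of the post's stated assumptions.
WORDS_PER_BOOK = 80_000
DAYS_PER_YEAR = 365


def lifetime_books(words_per_day, years):
    """Convert a daily word count sustained for `years` into book equivalents."""
    return words_per_day * DAYS_PER_YEAR * years / WORDS_PER_BOOK


# 150 minutes a day on social media, a quarter of it reading, at 250 words/minute
social_words_per_day = 150 * 0.25 * 250
# 60 emails a day at 100 words each
email_words_per_day = 60 * 100
# 80 texts a day at 7 words each
text_words_per_day = 80 * 7

sources = {
    "books": 600,  # 4 books a year, doubled and rounded up
    "social media": lifetime_books(social_words_per_day, years=60),  # ages 10 to 70
    "email": lifetime_books(email_words_per_day, years=60),          # ages 10 to 70
    "texts": lifetime_books(text_words_per_day, years=70),           # ages 10 to 80
}
total_books = sum(sources.values())

if __name__ == "__main__":
    for name, count in sources.items():
        print(f"{name:>12}: {count:8,.0f} books")
    print(f"{'total':>12}: {total_books:8,.0f} books")
```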
In this analysis, I will use GPT-3 to estimate ChatGPT’s data consumption, as more public data exists about it.
GPT-3 was trained on hundreds of billions of words, corresponding to about 500 billion BPE tokens. Each BPE token encodes between 3 and 4 characters (about 3.5 on average), meaning that GPT-3 was exposed to a staggering 1,750 billion characters. In English, the average word length is 4.8 characters, so ignoring other languages and code, GPT-3 has read 365 billion words or approximately 4.6 million books, which is roughly as much text as 1,000 Americans read in their lifetimes.
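The conversion chain from tokens to "lifetimes of reading" is easy to check in a few lines of Python. This is a sketch under the post's own figures; the 3.5 characters per token is my assumed midpoint of the 3-to-4 range, and the constant names are illustrative.

```python
# Convert GPT-3's training tokens into book and human-lifetime equivalents.
TOKENS = 500e9               # BPE tokens, as quoted in the post
CHARS_PER_TOKEN = 3.5        # assumed midpoint of the 3-4 character range
CHARS_PER_WORD = 4.8         # average English word length used in the post
WORDS_PER_BOOK = 80_000      # average book length used in the post
HUMAN_LIFETIME_BOOKS = 5_106 # lifetime reading estimate from the post

characters = TOKENS * CHARS_PER_TOKEN
words = characters / CHARS_PER_WORD
books = words / WORDS_PER_BOOK
lifetimes = books / HUMAN_LIFETIME_BOOKS

if __name__ == "__main__":
    print(f"characters: {characters:.3e}")
    print(f"words:      {words:.3e}")
    print(f"books:      {books:,.0f}")
    print(f"lifetimes:  {lifetimes:,.0f}")
```

The last ratio comes out a bit under 1,000, which is why the comparison is best read as an order of magnitude rather than a precise multiple.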
Does GPT-3 exhibit reasoning capabilities that are 1,000 times better than the average American? I wouldn’t say so. What I will say is that it has a breadth of knowledge that is likely 1,000 times superior to the average human, being able to summon facts from essentially any domain of science or the humanities. At the same time, it also fails on riddles that a 5-year-old can solve, so we can’t really consider it as superior to humans for now.
Verdict: Humans win. ML models require much more text than humans.
I will note that while these large models do use orders of magnitude more data, I would have expected the difference to be even larger (on the order of a million times larger). The relatively smaller gap surprised me and motivated me to write this blog post.
One last comparison for fun.
Americans consume an average of 3,782 kilocalories per day, which translates to approximately 1,600 kWh per year or 128,000 kWh over their lifetime.
Training GPT-3 required 190,000 kWh. So GPT-3 consumes as much energy to train as a human would if they were to live 118 years. And this doesn’t even take into account the inference costs of running the model once it has been trained.
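The unit conversion behind these numbers is short enough to sketch in Python. It assumes the standard 4,184 joules per kilocalorie and 3.6 million joules per kWh, plus an 80-year lifespan, which is what the lifetime figure above implies; the names are illustrative.

```python
# Compare human dietary energy with GPT-3's quoted training energy.
JOULES_PER_KCAL = 4184
JOULES_PER_KWH = 3.6e6
KCAL_PER_DAY = 3782          # average American consumption, as quoted in the post
LIFESPAN_YEARS = 80          # assumption implied by the post's lifetime figure
GPT3_TRAINING_KWH = 190_000  # as quoted in the post


def kcal_to_kwh(kcal):
    """Convert kilocalories to kilowatt-hours."""
    return kcal * JOULES_PER_KCAL / JOULES_PER_KWH


human_kwh_per_year = kcal_to_kwh(KCAL_PER_DAY * 365)
human_kwh_lifetime = human_kwh_per_year * LIFESPAN_YEARS
gpt3_in_human_years = GPT3_TRAINING_KWH / human_kwh_per_year

if __name__ == "__main__":
    print(f"human, per year: {human_kwh_per_year:,.0f} kWh")
    print(f"human, lifetime: {human_kwh_lifetime:,.0f} kWh")
    print(f"GPT-3 training:  {gpt3_in_human_years:.0f} human-years of food energy")
```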
Verdict: Humans win. In terms of energy, we still have an edge for now.
Conclusion: converging curves
To sum up, machine learning models like Stable Diffusion, Whisper, and GPT-3 have been trained on vast amounts of data, but the difference between the amount of data they have seen and that of humans is not as significant as you may have believed. In fact, Stable Diffusion has been trained on less data than any artist has seen! Although scaling to larger datasets has been beneficial so far, the development of better training techniques (such as recent fine-tuning approaches) will continue to improve our efficiency at learning from data. I expect the gap between the amount of data machine learning models and humans need for training to shrink as progress in model training continues. It wouldn’t surprise me if language models start performing as well as humans while training on less data soon.