LLMs and the Scaling Hypothesis - Why There's Likely More to Come

Introduction

I'm going to give a slightly updated version of the talk I gave at the first Mindstone meetup. I think at the time Josh said to me, just say something spicy, or something along those lines. Those were the instructions I had when I was preparing the talk.

And it was to a much smaller audience. I think you've done an amazing job over the year growing the event; it seems to have taken on a life of its own. Let me pull up my slides. The title you were given is what Josh remembers of that talk, which I really like, and I'm going to try to be true to it.

So let me share my screen. Josh, am I right that I've got about 10 to 15 minutes? You can overrun a little bit. Okay, great. I'll try to err on the side of being shorter, and then if people have questions, we can dig in. Cool.

The Essence of LLMs and Scaling Hypothesis

So what I'm going to be talking about: the talk's title is LLMs and the scaling hypothesis. Another way of putting it is, how might we think about the future and what's coming next, given what has already happened?

Audience Engagement and Reflection

And when I did this talk last time in London, I could see the audience, and I ran a poll at this point. I'm just going to ask you to do the poll yourselves at home, or at least reflect on it: take a moment and consider, when ChatGPT came out and you first got your hands on it, were you surprised by what it was capable of?

And maybe if you weren't surprised by ChatGPT itself, then were you surprised by the capabilities that you've seen in some of the demos today? I personally was. I was extremely surprised the first time I got access to large language models, not ChatGPT.

Actually, I'd seen similar things a little bit before, but models of that scale were just light years ahead of what I thought was possible at the time and have continued to improve quite quickly.

And so what I want to talk about is a brief history of the field from a very personal perspective, what we might have got wrong and why we were surprised by ChatGPT, and then, if we want to make predictions for the future that we might get right: the people who did get this right, what did they believe that was different from what the majority of us believed? And if we were to project forward, what might that suggest for the future?

Personal Journey and Background

I'm going to start with a little bit of background about myself. The reason for this is that the talk is a personal history, in the sense that I want to talk about how the field evolved while sharing what it was like to be a researcher, a PhD student, and an entrepreneur during that time.

So I started my PhD in 2017 (actually 2016, I think the slide is wrong), the year before the Transformer paper came out. Over the course of my PhD, I got a front-row seat to some of the really big research breakthroughs that precipitated the current moment, and I also got to work at some really large, famous research labs and some startups.

Now I run HumanLoop, which I'll tell you a little bit about in a moment. I want to share the history of large language models and NLP, but also what it was like going through it, and how to try to predict what might come next. So that's going to be the core of it, plus, very briefly, what I do at HumanLoop and why that's relevant.

Qualifications to Discuss LLMs

Why am I in a position to talk about large language models? At HumanLoop, we help developers build applications on top of large language models like GPT-4, Anthropic's models, open-source models, et cetera. We give companies the tools to find the right prompts, to evaluate performance, and to iterate towards something that's good. The reason this is relevant to the talk is that in that process, we get to see hundreds or thousands of companies go from ideas to actually getting things robust and into production. We see what works and what doesn't, and we have a really good sense of what the state of the art actually is in terms of capabilities.

Practical Use Cases of LLMs

And when I was preparing this talk the first time, one of the things I wanted to do was reflect on what LLMs have ever done for us. At that point, which was May of that year, people were still skeptical that ChatGPT was anything more than hype, and so I was trying to convince people that there were practical use cases.

I think that's less needed now. But still, in just a few minutes of thinking about it, you very quickly come up with many possible applications of this technology: better code search, marketing assistance (we saw an example of that earlier), sales generation, automated note taking, et cetera. This is a list I came up with in about five minutes, all with real companies actually building examples of these things.

I've just seen something in chat. Cool. I'm just going to ignore that. If people chuck questions into chat as we go, I'll try and respond to them.

The Evolution of LLMs and Predicting the Future

So it's very clear that LLMs have burst onto the scene very quickly and become extremely useful. What I want to talk about is: where did they come from? How did we get here? And how might we have predicted that a technology that took so many of us by surprise would arrive when it did? What did the people who got it right think differently from the rest of us?

So to start with, I'm going to go back to basics. I suspect many people here will already know the answers to these questions, but I think it's worth thinking them through and reflecting on the recent history, because the models are so good today that it's easy to forget what the state of the art was even a few years ago.

So firstly, just what is a language model? So a language model is a statistical model of sequences of tokens or text. And what it tries to do is take in a sentence and then spit out a number, which is the probability of getting that sentence.

And if you have a well-trained language model, you can use it both to score sentences for how probable they are, and to generate and sample text, which is what we're doing with models like ChatGPT. Very simply, the model is given a history of a small number of words ("where are we going" in this case, or "the cat sat on the") and is asked to output a probability distribution over the next word in the sentence.
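To make that concrete, here's a minimal sketch, not something from the talk, of the two things you can do with a language model: score a sentence by multiplying the conditional next-word probabilities together, and generate text by sampling from them. The `next_word_probs` lookup table is a toy stand-in for what a trained neural model would give you.

```python
import random

# Toy stand-in for a trained model: map a history of words to a
# probability distribution over the next word.
def next_word_probs(history):
    table = {
        (): {"the": 1.0},
        ("the",): {"cat": 0.6, "dog": 0.4},
        ("the", "cat"): {"sat": 1.0},
        ("the", "cat", "sat"): {"on": 1.0},
        ("the", "cat", "sat", "on"): {"the": 1.0},
        ("the", "cat", "sat", "on", "the"): {"mat": 0.7, "sofa": 0.3},
    }
    return table.get(tuple(history), {"<eos>": 1.0})

def sentence_probability(words):
    """Score a sentence: product of P(next word | words so far)."""
    p = 1.0
    for i, word in enumerate(words):
        p *= next_word_probs(words[:i]).get(word, 0.0)
    return p

def sample_sentence(max_len=10):
    """Generate text by repeatedly sampling the next word from the model."""
    words = []
    for _ in range(max_len):
        dist = next_word_probs(words)
        word = random.choices(list(dist), weights=list(dist.values()))[0]
        if word == "<eos>":
            break
        words.append(word)
    return " ".join(words)

print(sentence_probability(["the", "cat", "sat", "on", "the", "mat"]))  # 0.42
print(sample_sentence())  # e.g. "the dog" or "the cat sat on the mat"
```

A model like GPT does exactly these two operations, just with a neural network in place of the lookup table and tokens in place of whole words.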

And I think it's worth dwelling on why we use this task. I was speaking to someone recently who thought the task was chosen to get models that were plausible at generating language, that we wanted to pass the Turing test, and that that was what motivated this learning objective. I actually don't think that's true. I think the reason language modeling is an interesting task to choose, and the reason it has been a source of interest in machine learning research for a long time,

is that to get really good at this task, you have to do unsupervised representation learning. A lot of the recent progress in deep learning has come from supervised learning: methods where you have to annotate a huge data set. That means every time you want to create a new use case, you have to start from scratch with data set construction and then train a model.

The Holy Grail for a long time has been, is there a way that we might be able to train a machine learning model to learn reasonable representations of the world or of text or something else without any labels? And next word prediction or language modeling was a candidate task for being able to do this that has turned out to work really well.

And if you've ever trained one of these language models (and if you haven't done it firsthand, I'd recommend grabbing a notebook from somewhere and following along), it's interesting, because when you first start training them, especially simple ones, the first things the model learns are character frequencies, then word frequencies, then grammar. At each stage you're going from something that's very basic, but

for the model to keep getting better, once it has learned the frequencies of words and basic grammar, it has to start learning more complicated concepts. The difference between a sentence that gets gender agreement right and one that gets it wrong might be only one letter: "Reza walked into the room and he sat down" and "Reza walked into the room and she sat down" are really similar sentences. So every incremental bit of improvement on this task requires richer and richer,

more complicated modeling, basically. And the hope is that in order to get really good at this task, you have to learn world knowledge and reasoning and a model of the world. The jury is still out on whether that actually happens.

I'm personally of the opinion that it probably is happening in these models. But that's why we use this task. And that's why language modeling is interesting. But the models that we're interacting with today are remarkably good compared to what was there even a few years ago.

A Brief History of Language Modeling

And so I wanted to give a very potted history, from the top of my head, of how this field has evolved. Firstly, language modeling is a really old task: the first language models people were building were n-gram models in the 1950s. And n-gram models are really simple.

They're just models based on counting word frequencies. You count: given that the first word is "the", how often does "cat" come up as the next word? And that frequency becomes your probability. So it's not a new task.
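As a rough sketch of how simple that is, here's a bigram (2-gram) model built purely from counts; the corpus is made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny toy corpus; a real n-gram model would use millions of sentences.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def bigram_prob(prev, nxt):
    """P(next word | previous word), estimated by relative frequency."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return counts[nxt] / total if total else 0.0

print(bigram_prob("the", "cat"))  # "cat" follows 1 of 4 uses of "the" -> 0.25
print(bigram_prob("sat", "on"))   # "on" always follows "sat" here -> 1.0
```

The task is the same next-word prediction that modern LLMs solve; they just replace the counting with a learned neural network.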

Then in the 1980s we started to get neural-network-based language models. And 2017 is usually considered one of the big breakthrough years, which is when the Transformer came out. I personally think it's 2018 with ULMFiT, but I'll dwell on that in a second.

But during my PhD, my research group was working on language modeling, and I remember a very particular meeting of my supervisor and my group. One of the team had brought in their latest version, and we had just started to be able to get this deep learning network to train at all. It was beginning to put out things that looked like words.

Right. Forget having a conversation with it, forget getting it to use tools; we weren't trying to get the model to do anything of the complexity that we've seen today.

The model was just beginning to produce words, and I remember the PhD student working on it being really pleased that he'd finally got this complicated network to at least train. Around the same time, Andrej Karpathy wrote his blog post on the unreasonable effectiveness of recurrent neural networks, talking about how incredible it was that you could train an RNN language model capable of opening and closing brackets.

This was such a big breakthrough because you had a model that could handle long-range dependencies. We're talking about 2016, 2017, 2018 as the years when we were impressed that models could open and close brackets. It's worth thinking about how much progress there has been in a short space of time.

But in my mind there have been two big breakthroughs. The first was in 2018, which was the first time someone really managed to show that you could pre-train a language model on a very large corpus of text: you could grab the internet, or another big corpus, do this next-word-prediction task, and then use that model for lots of different downstream tasks. That was the first big breakthrough.

But the next one, the one I think precipitated the huge wave we're seeing at the moment, was in 2020 when GPT-3 came out. The reason GPT-3 was such a big deal is that it was the first time that what's called in-context learning started to work. Even with the models of 2018 and 2019, if you wanted to adapt them to a specific task (a model that could send sales emails, versus one that could classify sentiment, versus one that could write code), then for each of those cases you would take the pre-trained model, grab a domain-specific data set for that use case, annotate it, and then fine-tune a model on it.

You would do further training on that domain, and that was the only way to do customization for different tasks. And that was 2019, only three or four years ago.
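For context, a rough sketch of that 2018–2019 recipe is below, assuming the Hugging Face transformers and datasets libraries (not tools mentioned in the talk): take a pre-trained model and run further supervised training on a labelled, task-specific data set, here sentiment classification.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# A labelled, domain-specific data set: thousands of annotated examples.
dataset = load_dataset("imdb")

# Start from a 2018-era pre-trained model rather than from scratch.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

dataset = dataset.map(tokenize, batched=True)

# Fine-tune: further gradient updates on the labelled data.
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-model", num_train_epochs=1),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```

Every new task meant repeating this whole loop: new data set, new annotation effort, new training run, new model to deploy.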

The Breakthrough of In-Context Learning

And GPT-3 was really the first time we saw models where you could provide examples of what you wanted in the input, in the context window of the model, and have a reasonable expectation that it would learn something from that context. I remember seeing this for the first time and having my mind blown, because it felt like a much faster and easier way to adapt these models, and it felt like this was going to completely change how we built NLP and AI systems.
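In-context learning needs no gradient updates at all; the "training data" lives in the prompt. A minimal made-up illustration of a few-shot prompt:

```python
# Few-shot prompt: the examples are provided in the context window,
# not in a fine-tuning data set. (The reviews here are invented.)
prompt = """Classify the sentiment of each review as Positive or Negative.

Review: The battery died after two days.
Sentiment: Negative

Review: Gorgeous screen and it feels really fast.
Sentiment: Positive

Review: Setup took hours and the manual was useless.
Sentiment:"""

# Send `prompt` to a sufficiently large language model and the completion
# is typically "Negative", learned purely from the pattern in the prompt.
```

Swapping the task becomes a matter of editing text rather than collecting and annotating a new data set.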

And that turned out to be true, although even GPT-3 wasn't yet that useful. Since we got in-context learning working, there has been an enormous explosion. This slide is actually from May, and since then there have been even more models than shown here.

And Oscar was talking about the huge explosion in open source that has happened. There are now hundreds of different language models of different sizes, and people keep training them. The question is: what changed to make these models get so much better?

Why did in-context learning start to work with GPT-3? Why do we continue to see so much progress?

Exploring the Scaling Hypothesis

There are a lot of possible explanations, but there's one I want to talk about specifically in this talk, which is the idea of the scaling hypothesis. The scaling hypothesis is an empirical observation, not a theory; we don't actually have a good explanation for why it holds.

But the empirical observation is that as we take these language models, or deep learning networks in general, and make them bigger in terms of the total compute they use (we increase the amount of data they're trained on and the number of parameters in the models), we see very predictable improvements in performance. The graphs I'm showing here are taken, I think, from the original GPT-3 paper, and all they show is that as we increase the compute the model gets, across ten orders of magnitude in this case, we see very reliable, very predictable decreases in the model's loss, which is to say improvements in its performance. So what the GPT-3 paper was suggesting was: if we just keep making these models bigger, will they continue to get better?
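The "scaling law" in those plots is just a power law: loss falls as a fixed power of compute (and similarly of data set size and parameter count). Here's a sketch of the functional form; the constants are placeholders for illustration, not the published fits.

```python
def predicted_loss(compute, compute_scale=1.0, alpha=0.05):
    """Kaplan-style scaling law: L(C) ~ (C_c / C) ** alpha_C.

    compute_scale and alpha are illustrative placeholders, not the
    fitted constants from the papers.
    """
    return (compute_scale / compute) ** alpha

# Ten orders of magnitude of compute, as in the GPT-3 paper's plots:
# each big jump in compute buys a predictable drop in loss.
for exponent in range(0, 11, 2):
    compute = 10.0 ** exponent
    print(f"compute = 1e{exponent:02d} -> predicted loss = {predicted_loss(compute):.3f}")
```

The striking thing is not the particular numbers but that a straight line on a log-log plot, fitted on smaller models, keeps holding as you extrapolate to models thousands of times larger.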

And it was a pretty heterodox thing to believe at the time. I think the first people to really buy into this properly were the folks at OpenAI post-GPT-2. But to say at the time that if we just made these models bigger they would continue to improve was not respected or considered serious among most of the community.

I personally did not think it would happen post-GPT-2. After GPT-3 I started to believe it, but most of academia continued, and continues to this moment, to be skeptical of the scaling hypothesis. Taken to its logical extreme, the scaling hypothesis is the idea that as we make the models bigger, they just continue to improve, and also that there are many blessings of scale: a lot of the problems you have when you train smaller models start to go away as the models get bigger.

A very famous problem in training neural networks is catastrophic forgetting, where you train the model on one task and then, when you train it on another, it forgets the first. That seems to diminish as you make the models bigger. Initialization of neural networks is extremely challenging, but seems to get easier as the models scale. And you can keep going.

There are many so-called blessings of scale, where things that seem like problems for training small neural networks just start to disappear as the models get bigger. And the people who took the scaling hypothesis seriously (people like Shane Legg, founders of OpenAI, founders of Anthropic, Rich Sutton) are the people who have been very consistently correct in forecasting the capabilities of machine learning models. The majority of the community, myself included, who were skeptical of these ideas, have been consistently wrong, I would say, for over a decade now.

And so I think that even if we don't wholeheartedly adopt this hypothesis, we should start to take it pretty seriously and ask: if this is true, what might it mean? I've also got another graph here, from the GPT-4 paper, where they used these scaling laws to predict the performance of GPT-4 very precisely before they actually trained the model.

So is scale all you need? Can we take the existing systems, just make them bigger without many new ideas, and go all the way to human-level or beyond-human-level intelligence? That's a reasonable question to ask and dwell on.

The Bitter Lesson by Richard Sutton

I grabbed this meme from somewhere on Twitter. The picture is of Richard Sutton, and the reason I chose to include him is that he wrote an essay, I think three or four years ago, called The Bitter Lesson. It was his reflection on

what had happened in the history of machine learning and AI. He considered computer chess, computer vision, machine translation, and any number of mainstay tasks for the field of AI.

Sorry, I'm just getting distracted by the Q&A. Let me put that down. His reflection was that our attempts to hand-engineer and craft human knowledge into these systems, with heuristics and other methods, had largely failed, and that the methods that worked well were the ones able to exploit more and more compute as that compute became available.

That is, if you said: Moore's law exists, computing power doubles every couple of years, it's been doing that for a long time, and we're going to assume it keeps happening, then let's just use methods that will benefit from that increase in compute and more or less ignore everything else. Those methods have consistently won. And the methods that he talks about

General Power of Methods

are really twofold: search and learning. I'll read the quote. The reason he calls it a bitter lesson is that, from a researcher's perspective, it's intellectually unsatisfying that the only really interesting thing to do is scale everything up. He writes: "One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning."

So my personal feeling on the question of whether scale is all you need: the answer is probably no, it's not all you need, because we have continued to make algorithmic improvements. But betting on methods like search and learning continues to yield fruit, and so far most of the benefits have come from learning; we haven't actually done that much with search in terms of LLMs. We should probably expect it to keep yielding fruit very quickly, which means we should expect quite rapid progress as further compute becomes available.

The Potential of Scaling for Rapid Progress

And so my personal take is that, yes, I think scale is going to continue to deliver surprisingly rapid progress, much faster than most people anticipate if they're not taking this into account, and that augmenting learning with more search-based methods is going to allow models not only to do things like what ChatGPT does, but actually to discover new knowledge and go beyond what's in the existing data set.

An example of that came just this week, when DeepMind released a paper doing something that I think is conceptually very straightforward but turned out to be very powerful for solving novel problems in mathematics. They combined large language models as generation engines, as conjecture machines that could produce hypotheses, with a scoring mechanism that could do search. The language model would propose a lot of different ideas, they would score them in some way, and then use that as a search mechanism. And they were able to solve, or improve upon, some open problems in mathematics. That's an example of the combination of scale across search and learning starting to do new knowledge discovery.
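Conceptually, that system is a propose-and-score loop. Here's a hypothetical sketch of the pattern, not DeepMind's actual code: `llm_propose` stands in for a language model call and `score` for a cheap, automatic, domain-specific evaluator.

```python
import random

def llm_propose(best_so_far):
    """Placeholder for an LLM call that mutates or extends the current best
    candidate (a program, a construction, a conjecture...)."""
    return best_so_far + random.choice("abc")

def score(candidate):
    """Placeholder for an automatic evaluator, e.g. run the candidate and
    measure how good its answer to the maths problem is."""
    return candidate.count("a") - 0.1 * len(candidate)

def search(iterations=1_000):
    best, best_score = "", float("-inf")
    for _ in range(iterations):
        candidate = llm_propose(best)   # learning: the model generates ideas
        s = score(candidate)            # search: keep only what scores well
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score

print(search())
```

The language model supplies plausible candidates, and the scorer turns that stream of guesses into a directed search; that combination is what lets the system land on solutions that were not in any training set.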

And new knowledge discovery is one of those things on the long list of things people said AI would never be able to do: AI will never be creative, AI will never solve X, and so on. Knowledge discovery, I think, is one that people have long said AI would not get to. Well, it's starting to get there as well.

Controversy and Evidence of AI Capabilities

There are data points on both sides of the question of whether scale is all you need, and it continues to be controversial. A best paper award at NeurIPS this year went to a paper arguing that a lot of the surprising abilities of large language models shouldn't be as surprising, and that learning is smoother than you might think. On the other hand, there was a paper I read a couple of years ago from DeepMind on in-context reinforcement learning that I recommend to people, which shows that transformer models are able to learn to learn. I won't go into the details too much, but the reason I think it's such an interesting paper is that, to me, it's one of the strongest bits of evidence that transformer models are doing more than memorization and parroting; they're actually learning short programs or short algorithms to solve complex tasks.

Reflecting on AI Prediction Accuracy

And so the parting thought I want to leave people with is that a lot of us were very surprised by the rate of progress in 2020 and when ChatGPT came out. Most people did not correctly anticipate it. If you were wrong, it's worth reflecting on why, and on what the people who disagreed with you, and who got it right, believed that you didn't.

I think one of the main beliefs of the people who have been consistently right in forecasting AI performance is that AI models will continue to improve with scale, and that they can saturate the available compute in order to improve further. The amount of compute that models are able to use is going up very quickly, and the amount of resources being poured into the space is going up very quickly. So I think we should at least take seriously some of their predictions about the plausibility of very advanced AI systems, human-level or better, coming online in the not too distant future.

I don't know exactly what the not too distant future means. I grabbed a prediction from Metaculus here, which polls around a thousand forecasters who are betting on when AGI might arrive, and their median prediction is 2031, which is remarkably soon for something that will probably completely transform society. I think we should take predictions like this seriously, because the people making them have been consistently right in the past, and those of us who were more conservative have been consistently wrong.

Conclusion and Open Questions

So that's a very personal romp through the last few years of how we got to where we are. If anyone has any questions, very happy to discuss them further.
