Everybody's talking about Groq.
Which drink would you like? I'll take a Coke.
Just a sec while I finalize your order. Alright, your card has been updated.
All right. Thanks.
I'll introduce myself in a second.
But, you know, what if a technology arrived today that immediately increased productivity by as much as 50 percent? For an automobile company, for instance, that might mean going from making half a million cars to three quarters of a million cars, which could be a 12 billion dollar per quarter difference in manufacturing. It's a huge impact.
And in fact, that's what's happening today with large language models like ChatGPT. For instance, if you're a developer, your productivity using Copilot or something similar might increase 25 to 50 percent. Documenting the code you're working on, or refactoring it, might speed up by 20 to 30 percent. It's an incredible impact, and it's affecting a lot of other industries as well.
The same study from McKinsey talked about how physicians were able to focus more on their patients and less on the bureaucratic work, the paperwork they need to do. And it also improves the quality of their work, because they're able to focus more on where they add value.
So let me just give you a quick demo.
Just to be clear, Groq doesn't train models. Right now we just provide inference in the cloud: we provide hardware for inference, and we provide it at a super fast speed.
So what I'm going to do is show you this. I'll just give you a demo here. Let's see.
I'd like to visit New York City for two days, and I'd like an itinerary that includes things that are not typically touristy. Can you make that itinerary for me?
So basically, in that amount of time, the audio was recorded, sent to Groq hardware, converted to text, and then sent through the large language model. And the output was generated at 1,200 tokens per second, which is a lot faster than people typically see. This is with an 8 billion parameter model from Meta called Llama. And so that's kind of interesting.
Let me try a few other things. Let's say: can you put this itinerary in a table? OK, it's in a table format now. Please add a column with the approximate cost of each event. And now, can you do this for Atlanta, Georgia as well?
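For anyone who wants to reproduce this kind of multi-turn interaction programmatically rather than in the chat UI, here is a minimal sketch. It assumes an OpenAI-compatible chat completions endpoint and uses a hypothetical model ID; the exact endpoint URL and model names should be taken from the console documentation, not from this example.

```python
# Minimal sketch of a multi-turn chat request against an OpenAI-compatible
# endpoint. The endpoint URL and model ID below are assumptions for
# illustration; check the console docs for the real values.
import os
import requests

API_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

messages = [
    {"role": "user", "content": "I'd like a two-day New York City itinerary "
                                "that avoids the typical touristy spots."},
]

def ask(history):
    """Send the running conversation and append the assistant's reply to it."""
    resp = requests.post(API_URL, headers=HEADERS, json={
        "model": "llama-3.1-8b-instant",  # hypothetical model ID
        "messages": history,
    })
    resp.raise_for_status()
    reply = resp.json()["choices"][0]["message"]
    history.append(reply)
    return reply["content"]

print(ask(messages))

# Follow-up questions reuse the same conversation history, like the live demo.
messages.append({"role": "user",
                 "content": "Can you put this itinerary in a table and add a "
                            "column with the approximate cost of each event?"})
print(ask(messages))
```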
OK, so it's pretty fast, fun to work with. Let's see. Let me switch back to the presentation.
So basically, why does it matter that it's fast?
AI, at its best, will help anyone who uses information in their work do that work better. And AI copilots make it faster and easier to access and use information. But most important, they make this effortless and fast, sort of like having a smart expert sitting next to you to bounce ideas off of and to try things with. And brainstorming like this works best when it's fast, when you can try new things and quickly make decisions.
And Groq speeds up iteration, which speeds up innovation. In fact, the demos you saw from our other presenters might see a speedup if they were to use the models on Groq hardware.
And so any of you that are developers know that when you're compiling, sometimes you're sitting around waiting a long time. Well, in fact, large language models help programmers, but they also help knowledge workers.
And there are 36 times more knowledge workers in the world than professional programmers. Generative AI and large language models work especially fast for things that require iteration or reasoning, where you're thinking about an answer, critiquing your own work, and then responding again, and also for real-time interactions like the voice example I just showed you.
There's one other demo, but I think I'm a little bit limited on time. So maybe I'll just show this to people afterwards if they're interested in seeing it. I'll skip over this.
But basically, it's being able to create a 36 to 100 page book very quickly, in seconds instead of minutes, using the Groq hardware.
So my name's John Barrus. My background is mechanical engineering, but I became a product manager back in 2015 when I joined Google, where I worked on putting GPUs and TPUs in Google Cloud. I led that effort, and then I joined Groq in 2018 because I wanted to learn about designing and building chips, especially for AI. So I'm going to talk about Groq a little bit too.
Here's Groq's background. The name comes from the word "grok," coined by Robert Heinlein; it means to understand something deeply and intuitively. So it's a good name for an AI company.
We have about 300 or so people, plus some contractors as well. We just raised $640 million in funding, so we're at about a billion in total funding right now.
Everything we do is mostly in North America. We have people scattered all over the world, but the manufacturing and the compiler effort and everything is done in North America.
Our founder is Jonathan Ross, who was one of the initial developers of the Google TPU. If you're interested in working for Groq, there's a link there you can look at.
And we have a lot of developers. When we first launched our API, I think it was back at the end of February, people started signing up to get API keys and experiment with the models on Groq.
Now, we offer mostly open source models, although we do have a few other models. We partner with Meta to get the Llama models; in fact, we just added the Llama 3.2 vision models, so I'm wearing that here.
And Groq does everything end to end. We design our own chip; we don't fabricate it ourselves, we send it out to GlobalFoundries. But we build those chips into cards, then eight of those cards into a server, and then nine of those servers into a rack.
And so we're building a cloud right now with a significant number of servers in it, so that we can provide access to large language models to as many people as want them.
And one example, just to give you a sense of the speed: you saw 1,200 tokens per second in the demo. For this model, our output speed is 250 tokens per second, and we can process 3 to 4 images per second when you request information about images. Compared to systems that run on GPUs, you'll see we're quite a bit faster, between 5 and 10 times faster.
Llama 3.2 90B is similar to a 70B text model, but it supports vision, which includes visual question answering: you can ask about something that's in an image, it can do OCR on text in an image and translate it, count objects, and things like that.
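As a rough illustration of what a visual question answering request looks like through the same kind of chat interface, here is a sketch that sends an image URL alongside a text question. The message schema follows the common OpenAI-style vision format, and both the endpoint and the model ID are assumptions rather than values confirmed in the talk.

```python
# Sketch of a visual question answering request in the OpenAI-style vision
# message format. Endpoint URL, model ID, and image URL are placeholders.
import os
import requests

API_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed endpoint
HEADERS = {"Authorization": f"Bearer {os.environ['GROQ_API_KEY']}"}

payload = {
    "model": "llama-3.2-90b-vision-preview",  # hypothetical vision model ID
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "How many people are in this image, and what does the "
                     "sign in the background say?"},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/street-scene.jpg"}},
        ],
    }],
}

resp = requests.post(API_URL, headers=HEADERS, json=payload)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```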
And so the point of the rest of my talk is to explain why we're that fast. Here are several more examples with the Llama models; the Llama 3.2 1B model runs at over 3,000 tokens per second.
LLMs are getting larger. Even though you can get a one billion parameter model that will run on your phone or your laptop, a lot of people like to use the larger models, like the 70 billion parameter models, because the quality of the answers is much higher. But as models get larger, they need more memory and the speed slows down.
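To see why bigger models slow down, a back-of-the-envelope calculation helps: when generating one token at a time, roughly all of the weights have to be read out of memory for every token, so the token rate is capped at about memory bandwidth divided by model size. The bandwidth and precision numbers below are illustrative assumptions, not measured figures.

```python
# Back-of-the-envelope: batch-1 decoding is roughly memory-bandwidth bound,
# so tokens/sec is capped near (memory bandwidth) / (bytes of weights read
# per token). All numbers below are illustrative assumptions.

def max_tokens_per_second(params_billion, bytes_per_param, bandwidth_gb_s):
    weight_gb_per_token = params_billion * bytes_per_param  # GB read per token
    return bandwidth_gb_s / weight_gb_per_token

# 8B model in FP16 on a ~3 TB/s HBM device: ceiling around 190 tokens/s.
print(max_tokens_per_second(8, 2, 3000))   # ~187

# 70B model in FP16 on the same device: the ceiling drops to ~21 tokens/s.
print(max_tokens_per_second(70, 2, 3000))  # ~21

# 8B model in FP16 on a CPU with ~30 GB/s of DRAM bandwidth: roughly the
# one-to-two tokens per second mentioned in the talk.
print(max_tokens_per_second(8, 2, 30))     # ~1.9
```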
If you run on a CPU, it might be one or two tokens per second, and that's almost not usable. So when we designed the language processing unit, Groq's LPU, we took a few things into account. One is a software-first approach, where everything is driven by software, not by hardware.
And then we have a programmable assembly-line architecture: the architecture at the chip level is like an assembly line, and that actually scales up over the network as we add more and more chips to a single solution. Our chip is also deterministic, and that's important in order to get really rapid token generation, really rapid language generation. And then we have a significant amount of on-chip compute, memory, and bandwidth in order to avoid bottlenecks.
So let me talk about how that works. One other point I'd like to make is that Groq is also a low-power solution.
So if you want to produce a million tokens per second, you can do that with about two megawatts of Groq hardware, compared to roughly six megawatts using competitive GPU hardware. We have a paper about that if anyone's interested in more details.
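That power comparison can also be read as energy per token: dividing the power draw by the aggregate token rate gives joules per token. A quick calculation using only the figures quoted above:

```python
# Energy per token implied by the quoted figures: power divided by throughput.
def joules_per_token(power_megawatts, tokens_per_second):
    return power_megawatts * 1e6 / tokens_per_second

print(joules_per_token(2, 1_000_000))  # LPU deployment cited: ~2 J per token
print(joules_per_token(6, 1_000_000))  # GPU deployment cited: ~6 J per token
```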
One of the reasons that we're fast is because we run the whole model on SRAM. With a GPU, or other competitors, you typically have HBM memory. HBM memory is about 100 times slower and much higher latency.
But you can put a lot more data in HBM. So typically what you'll do is run a model on four or eight chips with gigabytes of HBM on each. But then you have to take a set of the weights, load them onto the chip, do the calculations, then load another set of weights onto the chip and do the calculations, over and over again. And that's basically what slows down the process of generating tokens.
We run everything in SRAM. What that means is we need many more chips to get enough memory for a single model, but it also means we can run tokens through much, much more quickly, because we're not stuck loading the weights from high-bandwidth memory. We load them right from SRAM into the multiply units, which makes it much faster and also lower power.
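A toy model makes the difference concrete: in the HBM case each token pays to re-stream the full set of weights from off-chip memory, while in the SRAM case the weights are sharded across many chips and already sit next to the multiply units. The bandwidth, chip-count, and model-size numbers below are illustrative assumptions, and the sketch deliberately ignores compute time and inter-chip communication.

```python
# Toy comparison of per-token weight-access time: re-reading all weights from
# HBM versus streaming resident shards out of on-chip SRAM. All numbers are
# illustrative assumptions; compute and inter-chip traffic are ignored.

WEIGHTS_GB = 16.0  # e.g. an 8B-parameter model stored in FP16

def token_time_hbm(hbm_bandwidth_gb_s=3000):
    # Every token re-streams all of the weights from HBM into the compute units.
    return WEIGHTS_GB / hbm_bandwidth_gb_s

def token_time_sram(sram_bandwidth_gb_s=80_000, num_chips=64):
    # Weights are sharded across many chips; each chip streams only its shard
    # out of local SRAM, which is both closer and far higher bandwidth.
    shard_gb = WEIGHTS_GB / num_chips
    return shard_gb / sram_bandwidth_gb_s

print(f"HBM-bound weight time per token : {token_time_hbm() * 1e3:.2f} ms")
print(f"SRAM-bound weight time per token: {token_time_sram() * 1e6:.2f} us")
```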
The other thing we do in order to scale so high: people don't typically design chips thinking about the network as a whole. We designed this chip to scale up to 10,000 interconnected chips or more. And the other interesting thing is our compiler: when it compiles a large language model to run on the chip, it actually orchestrates it across all the chips that are deployed for that model.
So we've talked about the 70 billion parameter model running on eight racks of hardware, 576 chips. The compiler orchestrates all of that effort across all those chips and lets them communicate using deterministic timing.
So basically, it says, this operation will run on this chip. It takes this long. And as soon as you're done, communicate your answer to the next chip. And that chip starts operating right away. And so basically, when you build it that way, you can minimize the latency and speed everything up.
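Here is a small sketch of what that kind of statically scheduled pipeline looks like from a compiler's point of view: every operation is assigned a chip, a start cycle, and a duration at compile time, so each chip knows exactly when its input arrives and when to forward its result. This is only an illustration of the idea under assumed stage names and cycle counts, not Groq's actual scheduler.

```python
# Illustration of compile-time (deterministic) scheduling: each stage of a
# model is pinned to a chip with a fixed start cycle and duration, so nothing
# has to arbitrate or buffer at run time. Not the real Groq scheduler; the
# stage names and cycle counts are made up for the example.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    chip: int
    duration: int  # cycles, including the hop to the next chip

def build_schedule(stages):
    """Assign start cycles back to back: each chip begins the instant the
    previous chip's result is scheduled to arrive."""
    schedule, clock = [], 0
    for s in stages:
        schedule.append((s.name, s.chip, clock, clock + s.duration))
        clock += s.duration
    return schedule

pipeline = [
    Stage("embed",        chip=0, duration=40),
    Stage("layers 0-15",  chip=1, duration=220),
    Stage("layers 16-31", chip=2, duration=220),
    Stage("lm_head",      chip=3, duration=60),
]

for name, chip, start, end in build_schedule(pipeline):
    print(f"{name:<12} on chip {chip}: cycles {start:>4} to {end:<4}")
```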
And so I'll talk about the networking for just a minute. All the local nodes in one of those servers I showed you, with eight chips in it, are connected to enable a low-diameter network with a lot of communication. And, in fact, software controls the networking as well as the compute, so we know exactly when to send packets from one chip to the next.
That lets us have very high bandwidth between chips. We can send data directly to a chip, or we can route it through another chip if we need additional bandwidth beyond what the direct connection supplies. So each chip acts not just as compute but also as a networking router, and we don't have to put any extra switch chips between our LPUs to make this operate as a single unit. We don't do switching; we just do direct chip-to-chip communication. Again, that makes it very fast, and there's no congestion.
In fact, one way to show how that works: I just said this, so I'll skip over this slide. But basically, imagine you're commuting in New York, and someone could tell you that if you leave at exactly this time, drive at exactly this speed, and take this exact route, you won't have to stop or slow down, and you can make that eight-mile commute in 15 minutes. A typical eight-mile commute, maybe from Long Island or from New Jersey, is going to take a lot longer than that. So the idea of having everything orchestrated makes things fast.
So we have a unified system, hardware, software, and communication scaling. Our on-chip SRAM is 100 times lower latency and quite a bit lower power than HBM.
We have deterministic execution, which allows us to combine the networking and the compute and schedule it all at once, up front. And we don't have to write kernels, which makes things very simple. We're not trying to fit things onto the chip in some special way; the compiler takes care of all of that.
And then we have this very low latency chip-to-chip interconnect. Together, this leads to something like a 10x performance increase.
Now the funny thing about this is that we're running on a 14 nanometer chip, which is from 2019. Most of our competitors are running on 4 or 5 nanometer chips. We've announced version 2 of our chip, which is 4 nanometers, and that will be on its way shortly.
If you want to get started, you can go to groq.com and play with the chat; that's the demo I showed you at the beginning. Or you can go to console.groq.com, the one on the right, which lets you create an account and get an API key. Then you can start playing around with inference on Groq.
That's it. I may be over my time. I'm happy to hang around and answer questions if we don't have time for questions now, but I can also take some now.
So, how specialized is the chip in terms of model choice? Does it only support the layers for something like Llama, or does it support other models and other applications?
So I'll answer that in two ways.
The first way is that it's Turing complete. It's just a regular compute chip with a different kind of architecture, so you can execute anything, even though we're deterministic.
If you put a for loop in there, it ruins the deterministic nature. So you can do that, but you don't have to.
Second thing is, it's very simple. It's a single-core chip. Most chips have at least multiple cores and several levels of cache.
We have no cache levels at all. We use a streaming architecture for the memory: you just load things into memory, and it streams right into the matrix units and so on.
And there are papers on groq.com that talk about the detailed architecture, if you're interested in learning more. Yeah.
You mentioned that you're focused on inference right now. Uh-huh. Are there plans in the future for your cloud offering to support fine-tuning or training of models as well?
Yeah, we actually run fine-tuned models. If you give us weights, we can run fine-tuned weights on inference. That's not a problem.
We don't have a fine-tuning API today, so we don't make that part easier. But if you provide the weights to us, we can run them for you as a dedicated instance.
So we do have... Like I said, it's a general purpose chip. We have very fast FP16 and INT8 matrix multiply-and-add performance.
We also have something called TruePoint, high quality multiplies. So we can do a lot of different things, but we're not really planning on training. We may support fine-tuning better than we do today, though, yeah.
Okay, thank you very much.