All right, so quick introduction. My name is Brad Micklea. I'm the CEO of Jozu.
Jozu is basically an enterprise DevOps platform for AI. I'm purposely not using the term MLOps, because MLOps, in my experience, tends to focus more on the dev side, the training side, and the data science team. What we are focused on is the DevOps side: the things that land in production, and the people who are responsible for the success there.
So we provide a lot of different things: a centralized model registry, vulnerability scanning, tamper-proof storage, cryptographic attestations, on-prem deployment, all those good things.
Okay, so what I want to teach you today, at a high level, is how Kubernetes deployments work.
The reason we're focused on Kubernetes is that if you're in a large organization, there's an 85 to 90 percent chance that the things running in production, the things end users actually use, are running on Kubernetes. Even if it's not obvious, it is probably still true. It is by far the most ubiquitous place to run production workloads.
And this is in some ways an awesome segue from Ashni's great talk, because as people move from prototyping, to getting an MVP into production, to running in production for real, now that they've got funding, now that they're trying to scale, now that they've got a million users, Kubernetes is probably going to be part of the answer.
So we're going to talk a little bit about the different ways to deploy models on Kubernetes, because of course it's not as simple as there being only one way, and why you might want to use each of them.
So I'm going to try and keep this simple.
There are four components that you kind of need to understand.
The Kubernetes control plane is like the brain. That's the thing that controls everything. It's going to decide what goes where, when things get scaled up, when things get scaled down, how people get into the system, who doesn't get into the system, all that stuff.
Your cluster is where all the resources live. You can think of it as the bloodstream and the lungs: it's what lets everything actually move. The cluster is what allows Kubernetes to do its work.
A node is a single subsection of that cluster. Say you have 10 GPUs. You say, okay, this node is going to get two of those, and this node is going to get eight of those. You can subdivide and portion things out.
And then a pod is the actual workload, the actual model running inside the node on the cluster because the control plane said it could. So you can kind of think of it that way.
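To make those four pieces a bit more concrete, here's a minimal sketch using the official Python kubernetes client that asks the control plane what nodes and pods exist in the cluster. It assumes you have a kubeconfig pointing at a running cluster and that GPU nodes advertise the nvidia.com/gpu resource via the NVIDIA device plugin; it's illustrative, not anything specific to the talk.

```python
# Minimal sketch: asking the control plane what's in the cluster, using the
# official Python "kubernetes" client (pip install kubernetes).
from kubernetes import client, config

config.load_kube_config()          # connect to the control plane (the "brain")
v1 = client.CoreV1Api()

# Nodes: the subdivisions of the cluster that actually hold the CPU/GPU resources.
for node in v1.list_node().items:
    alloc = node.status.allocatable or {}
    print(node.metadata.name, "gpus:", alloc.get("nvidia.com/gpu", "0"))

# Pods: the actual workloads the control plane has placed onto those nodes.
for pod in v1.list_pod_for_all_namespaces().items:
    print(pod.metadata.namespace, pod.metadata.name, "->", pod.spec.node_name)
```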
So when you fit it all together: we've been talking about this side, the cluster, that's the Kubernetes side, but a key part of this is something called the registry. A registry is like a catalog for all of your apps, all of your models, everything that can be deployed. So when I create a new product, it goes into the registry first, and then it goes into Kubernetes.
So when I look at this, I've got two nodes: a node with some GPUs and a node with some CPUs. I've got some pods in there.
I've got my app container that's going to go only, of course, to a pod that runs CPUs because apps don't need GPUs. And I can have my model container as well in my registry. And that's, of course, going to go to a pod with GPUs.
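Here's a hedged sketch of how that routing happens in practice: the model pod requests nvidia.com/gpu, so the scheduler can only place it on the GPU node, while the app pod requests only CPU and memory. The image names and registry host below are placeholders, not real artifacts.

```python
# Sketch only: two pods pulled from a registry, one CPU-bound app and one
# GPU-bound model. Images and registry host are placeholders.
from kubernetes import client, config, utils

app_pod = {
    "apiVersion": "v1", "kind": "Pod",
    "metadata": {"name": "my-app"},
    "spec": {"containers": [{
        "name": "app",
        "image": "registry.example.com/acme/my-app:1.0",   # app container from the registry
        "resources": {"requests": {"cpu": "500m", "memory": "512Mi"}},
    }]},
}

model_pod = {
    "apiVersion": "v1", "kind": "Pod",
    "metadata": {"name": "my-model"},
    "spec": {"containers": [{
        "name": "model",
        "image": "registry.example.com/acme/my-model:1.0",  # model container from the registry
        # Requesting nvidia.com/gpu (exposed by the NVIDIA device plugin) means the
        # scheduler can only place this pod on a node that actually has GPUs.
        "resources": {"limits": {"nvidia.com/gpu": "1"}},
    }]},
}

config.load_kube_config()
k8s = client.ApiClient()
for manifest in (app_pod, model_pod):
    utils.create_from_dict(k8s, manifest)
```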
All right. Hopefully that all makes sense as the underlying foundation.
So when you're talking about loading one of those models, pushing it from the registry into Kubernetes and making it run in production for real people, there are a few things that affect how fast that's going to happen. Network speed, obviously. You guys know models are really, really big.
And some of them are really, really, really, really, really, really, really big. And so the network speed is going to be important.
The disk speed is going to be important because you have to unpack and deserialize those models. And then, of course, the CPU or GPU, and how much power it has to actually run the model, matters at runtime.
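To put rough numbers on that, here's a back-of-the-envelope sketch. The sizes and throughputs are made-up assumptions purely for illustration, not figures from the talk.

```python
# Back-of-the-envelope sketch of why network and disk dominate, with made-up
# numbers: ~16 GB of weights, ~1.25 GB/s effective network pull, ~2 GB/s disk.
model_gb     = 16.0    # hypothetical weight size (assumption)
network_gb_s = 1.25    # ~10 Gbps effective pull bandwidth (assumption)
disk_gb_s    = 2.0     # unpack/deserialize throughput (assumption)

pull_s = model_gb / network_gb_s   # registry -> node transfer
load_s = model_gb / disk_gb_s      # unpack + load weights into memory
print(f"pull ~{pull_s:.0f}s, load ~{load_s:.0f}s, before any CUDA graph build or server start")
```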
A couple of things I'm going to focus on here.
We're going to talk about LLMs.
LLMs can't be meaningfully compressed. Lots of predictive ML and other AI/ML workloads can be compressed, and that radically changes how important some of these resources are.
The other thing is I'm not going to talk about mixing models and standard container apps. This is going to focus just on the model.
All right, so there are essentially four deployment strategies that I can use. Standard container deployment, which is kind of just the default. It's what will happen if you don't choose to do something else.
There is something called a streaming deployment. You might have heard of TensorRT; that is going to be a streaming deployment. It starts a bit like a Netflix stream: it starts as quickly as possible, before the whole thing is necessarily loaded.
NVIDIA has something called an inference microservice, or a NIM. And a NIM has an optimized way of handling the GPU. And so that has a slightly different deployment characteristic as well.
And lastly, my company has built a rapid inference container. We really focused on getting that start time down as much as possible, because one of the things we found frustrating, both when we were developing models early on and when helping customers who were developing models, was that iteration can be quite slow when you have to wait minutes, or sometimes hours, for a model to even load before you can start to evaluate how your change turned out. If you're trying to do rapid iterations, that's your biggest handicap.
All right, so let's look at these in turn.
Let's start with a standard container deployment. First of all, a big chunk of time is just the network transfer of the container and the weights being pulled from the registry into the Kubernetes cluster. That takes a non-trivial amount of time.
Then much less time goes to loading those weights from the container, now inside Kubernetes, into memory. Then you've got to build the CUDA graphs, because this is an LLM, and start the cache and the server. At that point, the first token can be processed.
So you could call this deployment-to-first-token time. Now, I'll get to the actual times at the end, but I want to walk through the pieces so you can understand where they fall.
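As a rough illustration of what a standard container deployment looks like in practice, here's a sketch of a Deployment manifest for a model server, expressed as a Python dict. The image name, port, and health path are placeholders and the probe timings are arbitrary; the point is that the pod only becomes Ready after the pull, weight load, CUDA graph build, and server start, which is roughly that deployment-to-first-token window.

```python
# Sketch of a "standard container deployment" for a model server.
# Image, port, and health path are placeholders.
deployment = {
    "apiVersion": "apps/v1", "kind": "Deployment",
    "metadata": {"name": "llm-server"},
    "spec": {
        "replicas": 1,
        "selector": {"matchLabels": {"app": "llm-server"}},
        "template": {
            "metadata": {"labels": {"app": "llm-server"}},
            "spec": {"containers": [{
                "name": "llm",
                # One image holding the serving code and the baked-in weights,
                # pulled from the registry at deploy time (the big network cost).
                "image": "registry.example.com/acme/llm-server:1.0",
                "ports": [{"containerPort": 8000}],
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
                # Readiness flips only once the server can actually answer,
                # i.e. after weights are loaded and the graphs/cache are ready.
                "readinessProbe": {
                    "httpGet": {"path": "/health", "port": 8000},
                    "initialDelaySeconds": 60,
                    "periodSeconds": 10,
                },
            }]},
        },
    },
}
# Apply with kubernetes.utils.create_from_dict, as in the earlier sketch.
```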
Now, the little hollow part above is what the previous slide was showing.
So you can just kind of compare.
I'll be honest, when I saw streaming, I was like, this is going to be so good, I can't wait. And we tried it, and we tried it multiple times, and we were like, huh, OK, it's a little better. Not nearly as much as I expected, though.
So a streaming deployment cuts a decent amount of time out, but not a massive amount. Overall it's better, but not dramatically better.
You can see the container pull time is reduced significantly, but you trade that off against a lot more time spent streaming the weights into memory. Building the CUDA graphs and starting the cache server take roughly the same time as before. This is obviously not precise to the second, but it's roughly where things land.
Now, NVIDIA's NIM is much faster still. It cuts out a significant chunk because it entirely removes, if I just go back for a second, building the CUDA graphs. Those get pre-built with a NIM, so they're part of the build time, not the deployment time or the runtime. You remove that step entirely, which makes things quite a bit faster.
It is still streaming, in fact. NIM uses a streaming algorithm as well. So we're doing quite well.
Now, what we did was pre-cache to Kubernetes so we can remove that container pull time from the deployment entirely. We go directly to loading weights into memory, and that cuts a large amount off the time.
We do still build the CUDA graphs at deployment time. We could do that at build time instead, but we haven't implemented that yet. It didn't seem necessary.
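The talk doesn't go into how the pre-cache is implemented, so take the following as a generic Kubernetes pattern rather than what the RIC actually does: a pre-pull DaemonSet that forces every node to fetch the model image ahead of time, so the real deployment skips the registry-to-node transfer. The image name is a placeholder.

```python
# Generic pre-caching pattern (not necessarily how Jozu's RIC does it):
# a DaemonSet whose only job is to pull the model image onto every node,
# so a later deployment of the same image starts without the network pull.
prepull_daemonset = {
    "apiVersion": "apps/v1", "kind": "DaemonSet",
    "metadata": {"name": "model-prepull"},
    "spec": {
        "selector": {"matchLabels": {"app": "model-prepull"}},
        "template": {
            "metadata": {"labels": {"app": "model-prepull"}},
            "spec": {
                # The init container pulls the large model image and exits
                # immediately (assumes the image ships a shell); the pause
                # container then idles so kubelet keeps the image cached.
                "initContainers": [{
                    "name": "pull",
                    "image": "registry.example.com/acme/llm-server:1.0",
                    "command": ["/bin/sh", "-c", "true"],
                }],
                "containers": [{
                    "name": "pause",
                    "image": "registry.k8s.io/pause:3.9",
                }],
            },
        },
    },
}
```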
So as we look at the deployment race, what you see is standard deployment at the top. You can see initial deployment and then reload time. So the initial deployment is the longest.
It's a little faster with a streaming deployment. Again, this is using an AWS GPU node and a Llama 3.1 8B model. So not a massive model, but decent sized.
Streaming is a little faster, NIM is significantly faster than that, and then, by cutting out the network transfer between the registry and the Kubernetes node, you take an enormous amount off.
When you come to reload, it's interesting, because both of the streaming deployments actually take a little longer on reload. There's more processing required for streaming than for just bulk-loading something in there. So both the standard deployment and the Jozu rapid inference container are noticeably faster on reload, although reload is relatively quick in all cases.
So obviously, I'm slightly biased. The RIC, we built it. And the reason why we released it is because it is significantly faster.
And that has a lot of value, especially during training or iterative cycles. But it is more complicated, obviously.
So standard deployment, you could use it in pre-production. You could use it in production, obviously, as long as you don't need to iterate on your model very often.
So if you're doing a deploy every month, I mean, who cares if it takes half an hour versus five minutes? That's not the biggest deal when you amortize that over a month. It's a big deal when you're trying to do 10 iterations every half day before lunch.
So streaming deployments can be really good for high-throughput inference. And they have one super cool benefit, which is that they can do hot reloading. So if you're doing something advanced where you need to hot reload the model, in other words not take the container down or restart it, but actually inject the changes directly, that's quite cool.
There's a lot that goes along with it, though. You have to have a lot of processes and a lot of other tooling in place to make that work really well and safely, because it is inherently kind of dangerous. But if you can handle it, and if it makes sense for your use case, that's really where streaming comes into its own.
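For intuition only, here's what hot reloading means at the process level: swapping weights inside the running server without restarting the container. This is a generic sketch, not how TensorRT or any particular server implements it, and load_weights is a hypothetical stand-in for your framework's loader.

```python
# Illustrative sketch of hot reloading: swap weights in the running process,
# no container restart. Paths and the loader are placeholders.
import os, threading, time

WEIGHTS_PATH = "/models/current/weights.bin"   # placeholder path

def load_weights(path):
    """Hypothetical stand-in for your framework's weight loader."""
    class DummyModel:
        def generate(self, prompt):
            return f"(would run inference with {os.path.basename(path)}): {prompt}"
    return DummyModel()

class HotReloadingModel:
    """Reloads weights when the file changes; requests keep being served."""
    def __init__(self):
        self._lock = threading.Lock()
        self._mtime = 0.0
        self._model = None
        self._maybe_reload()                     # assumes the weights file exists
        threading.Thread(target=self._watch, daemon=True).start()

    def _maybe_reload(self):
        mtime = os.path.getmtime(WEIGHTS_PATH)
        if mtime != self._mtime:
            new_model = load_weights(WEIGHTS_PATH)
            with self._lock:                     # atomic swap, no restart
                self._model, self._mtime = new_model, mtime

    def _watch(self, interval=5.0):
        while True:
            time.sleep(interval)
            self._maybe_reload()

    def generate(self, prompt):
        with self._lock:
            return self._model.generate(prompt)
```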
Last thing I'll say, just echoing both of the previous talks: as important as fast deployment is, and people are of course very focused on iteration speed when they're developing something, it's really, really important not to forget about security. Because being the first to blow up, or the first to be compromised, is not a crown anybody wants to wear.
So you do need to think about the fact that loading weights from outside Kubernetes, if you're not in that registry ecosystem, can be dangerous, because all the security systems in Kubernetes are designed to work together with the registry. Jamming something in from somewhere else is tricky.
Governance and auditability become hard if you're combining artifacts across different repositories. So if you've got some things in Git, some things in S3, some things in Weights & Biases, some things in, you know, name your poison, then trying to coordinate who changed what, and when, gets hard. So does making sure that the authorization patterns for each of those registries are actually the same, so that Brad can't get access to something he shouldn't through a different registry than the one you thought he would go through. These are tricky edge cases that you still need to think through, especially for enterprises.
Hot reload I talked about a little bit.
The other thing to keep in mind, and this is a little bit tricky because, again, for non-AI workloads it really isn't a big deal, but it is a much bigger deal for AI: tamper detection tends to be done only at load time. It is not done at runtime.
So for example, when you do a hot reload, there is no check to make sure that that hot reload has not been tampered with. If you do a rollback, there is no default check to make sure that it has not been tampered with. So you need to build in those patterns.
Not because you're necessarily under attack by some nefarious person, but because we're human and we make mistakes. I was at a very large company where one of my engineers, with the best of intentions, was trying to fix a problem. He thought he pushed the change into staging, but he accidentally pushed it into production and took out an entire AZ.
It was an unfortunate mistake. It's a human mistake, and human mistakes happen. It was not a big deal in the end. We were able to fix it.
But that's the thing: tampering doesn't necessarily mean malicious. It can mean I messed up, grabbed the wrong thing, and put it in the wrong place. You want your systems to protect you from that, because humans do miss stuff.
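One concrete pattern you can build in yourself, sketched here under the assumption that you record a digest of the weights when you build and sign them: re-verify that digest before any hot reload or rollback, since Kubernetes won't do it for you at runtime. The path and digest below are placeholders.

```python
# Minimal sketch of a build-it-yourself tamper check: pin the digest of the
# weights you intended to ship, and re-verify before hot reload or rollback.
import hashlib

EXPECTED_SHA256 = "<digest recorded at build/sign time>"   # placeholder

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_before_swap(path: str) -> None:
    digest = sha256_of(path)
    if digest != EXPECTED_SHA256:
        # Refuse the reload/rollback rather than serve weights you can't account for.
        raise RuntimeError(f"weight digest mismatch: {digest}")
```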
All right, I'm nearly out of time. My shameless plug.
Okay, so part of why we built Jozu is that there wasn't anything that actually used an OCI registry to hold AI/ML artifacts and did a good job of making sure that the security systems already in place for Kubernetes could be used for AI/ML. That seemed like madness to us, because we've worked with this stuff for 10, 20 years. Why wouldn't you want your production system to apply the same security checks to AI/ML that it applies to non-AI/ML workloads, which are frankly safer in many cases anyway?
In most cases, very little new infrastructure is required. We offer full supply chain control, very dev-friendly, and this stuff is production-grade.
It's already been downloaded over 150,000 times. That's my shameless plug.
If you want to go through the nerdy details, I won't subject you to them here and now. But if you're interested, come grab me and I'm happy to walk through them. These are the results of our internal testing, which I'm happy to share.
And I think that's it.