Custom LLMs in Production with Mystic AI

Introduction

Sure thing. Thanks, Josh.

Yeah, awesome to see the review of the year. I love the music.

All right, I'll share my screen. Let's see if this works. Okay, so now you should be seeing the presentation. Cool.

Automating Machine Learning Operations

So yeah, we'll talk a bit about what AutoOps for machine learning is, which at the end of the day is about trying to automate the entire operations of machine learning. We'll come to the reason why this is important, but in simple terms it's all about deploying machine learning models and how you do that.

Case Study: Leveraging AI and LLM Models

And so I wanted to share here a classic use case that we're seeing a lot of companies leverage AI and LLM models for, which is RAG: Retrieval Augmented Generation.

Retrieval Augmented Generation Use Case

And so the usual scenario here is that you have a bunch of questions, you have a bunch of context, and you have a DB where you store and retrieve vectors that you've generated from that context with some embedding model. You then compute the similarity between these vectors, pass the matching text into the LLM, and the LLM gives you a response, which is then sent back to the user.
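To make that loop concrete, here is a minimal sketch of the flow in Python. The embed() and generate() functions are toy stand-ins (a character-frequency vector and a canned reply) so the example runs on its own; in practice they would be a real embedding model and whichever LLM you call, local or remote.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Toy embedding: character-frequency vector, just so the sketch runs.
    # In a real system this is an embedding model, local or behind an API.
    vec = np.zeros(256)
    for ch in text.lower():
        vec[ord(ch) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

def generate(prompt: str) -> str:
    # Stand-in for the LLM call (OpenAI, Anthropic, or a self-hosted model).
    return f"[LLM would answer here, given: {prompt[:80]}...]"

# The "vector DB": context chunks stored alongside their embeddings.
documents = [
    "Open source models can run inside your own AWS, GCP, or Azure account.",
    "RAG retrieves the most similar context before asking the LLM to answer.",
]
doc_vectors = [embed(d) for d in documents]

def answer(question: str, top_k: int = 1) -> str:
    q_vec = embed(question)
    # Similarity between the question vector and every stored context vector
    # (vectors are normalized, so the dot product is cosine similarity).
    sims = [float(q_vec @ v) for v in doc_vectors]
    best = sorted(range(len(documents)), key=lambda i: sims[i], reverse=True)[:top_k]
    context = "\n".join(documents[i] for i in best)
    # Pass the retrieved context plus the question into the LLM.
    return generate(f"Context:\n{context}\n\nQuestion: {question}")

print(answer("Where can open source models run?"))
```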

So all of that can happen locally, but remotely there are a lot of API calls to providers like OpenAI or Anthropic. And so the question here is: how do you get literally the same experience you'd normally have just calling OpenAI or another closed source model, but with open source models?

Advantages of Open Source Models

And the reason people might actually want to do something like this is very clear: there are many companies that don't want to send data over to OpenAI or Anthropic or companies like that. Open source really gives you the opportunity for maximum compliance and maximum security by making sure the machine learning model runs directly where your data is. So if you're on a cloud provider like AWS, GCP, or Azure, you want to make sure these LLMs are also on the cloud provider where your data lives.

And so open source really allows you to achieve something like this. It gives you more control and customization because, obviously, the model is available, so you can do whatever you want with it.

There are also cheaper running costs for actually running the model. We've done some analysis, and running Mistral, depending on whether you batch prompts into the model, can be up to 30 times cheaper than the latest GPT-4 Turbo, on top of other optimizations you get depending on which backend you use to run the LLM behind the scenes.

Open Source Model Quality and Cost-Effectiveness

And in general, and I'm sure we'll be talking about it later on, we're seeing the quality of these open source models constantly catching up. Traditionally closed source has always been at the forefront, but open source is getting there; we're already seeing areas where people are just using GPT-4 today and open source is already good enough to use instead of closed source. So in general there are a lot of really good things, and we're seeing all of this improve constantly, weekly in many cases.

I mean, recently we've had Mixtral, the latest release from Mistral, and who knows which new model is going to come out in the next few weeks. So companies should at least have a strategy for figuring out how to leverage these models, now that the quality is there and they need that privacy and security.

Challenges in Deployment

Now, the problem is that actually achieving this is rather difficult.

That is what infrastructure for machine learning is about, and unfortunately it's a hard problem.

Traditionally, teams have been required to spend quite a lot of time investing in a good software stack that lets them deploy these machine learning models and get that simple API endpoint I was referring to earlier, right?

So here we're calling APIs for these closed source models, but now we're bringing open source models into the equation. How do we achieve that same API call to the LLM or the embedding model in a few simple lines of code, easily hosted directly in your own cloud provider? Achieving that is actually pretty hard.

Mystic's Deployment Platform

And so what Mystic came in with, what we've been building for the past couple of years now, is really this platform that allows you to deploy these machine learning models directly on your own infrastructure. So you don't need that DevOps team, you don't need that software engineering expertise. It's a simple Python library that really does all the orchestration, so to speak, of these machine learning models directly on your cloud.

And you get a simple API endpoint that you can embed directly in your application, whether it's a RAG setup like I was showing earlier or whatever else your machine learning model is used for. So this gives you a platform that deploys directly on GCP, Azure, AWS, or even on-prem, giving you maximum compliance and security, and it allows for instant cost optimizations.

Serverless API for Cost Optimization

One thing that sits alongside this platform people can deploy on their own cloud provider: for startups, developers, and sometimes side projects, we also have a serverless API. And to provide that serverless API, where you only pay for the usage of a specific model, just like you do with OpenAI, we had to build a platform that could really optimize the cost of running these models at scale.

And so after many years of iterating on this platform, this is the platform we're now giving to companies to deploy directly on their own infrastructure and cloud, for example calling directly into their GCP account. Other optimizations we've added include GPU fractionalization and running on spot instances, two of the things that really let you massively lower the cost of running these LLMs on your infrastructure. We've also worked on cold starts, which is how long the model takes from the moment a GPU becomes available, through loading the weights into GPU memory, to being ready to serve the first inference. There's a lot of delay there, so we've done things like preemptive caching to reduce that cold start.

Python SDK and Inference Optimizations

And the fact that all of this uses a simple Python SDK really allows you to use any kind of open source inference optimization you may want. So for those familiar with them, vLLM or TensorRT are directly available through our Python SDK. And all of this is served with minimal latency so it doesn't add overhead to your API calls, along with monitoring and CI/CD integrations that you get out of the box.

Mystic's AutoOps Process

So the process is very simple. With Mystic, the AutoOps we've been talking about is all about getting this API endpoint. The way it works is that you package the machine learning model through our Python SDK, you deploy it, you upload it, and then you get monitoring of this pipeline. And you just keep repeating this for any machine learning model you want, whether it's an LLM, an embedding model, or any other type of machine learning model.

Demonstration: Python SDK for Machine Learning

So we should probably stop talking and I'll actually show you a bit of what I mean. Let me now share some code here; hopefully you can see it and the font isn't too small. This is basically our Python SDK, which lets you decorate this machine learning pipeline.

So here I'm using one of the inference engines I spoke about, in this case vLLM. Very simply, what I'm defining here is the API endpoint: what the inputs to this model should be, what happens inside, and what the API endpoint returns. So in this pipeline class, I'm basically defining the inputs, the prompt I'm passing to the model, other keyword arguments, which functions run under this API endpoint, and then what the output of this API endpoint should be.

And inside, it's simply a Python class where one of the methods is a load method that loads the model, which I'm just doing with vLLM. Now, I already have certain parameters here saying this function should only run once at the beginning, so it doesn't happen on every inference pass. That's really handy to lower the cold starts as well.
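The talk doesn't show the exact SDK decorators, so the sketch below uses a plain Python class to illustrate the shape being described: a load method meant to run once at startup rather than on every request, and a predict method that does the per-request inference with vLLM. The model name and default parameters are just examples.

```python
# Sketch of the pipeline structure described above. In the real SDK the class
# and methods would be decorated so load() runs once at startup and predict()
# is exposed as the API endpoint; those decorators are omitted here.
from typing import Optional

from vllm import LLM, SamplingParams

class LLMPipeline:
    def __init__(self) -> None:
        self.llm: Optional[LLM] = None

    def load(self) -> None:
        # Runs once when the instance starts, not on every request,
        # which keeps weight loading out of the per-inference path.
        self.llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model

    def predict(self, prompt: str, max_tokens: int = 256, temperature: float = 0.7) -> str:
        # The per-request inference pass: the prompt and keyword arguments
        # mirror the endpoint inputs described above.
        params = SamplingParams(max_tokens=max_tokens, temperature=temperature)
        outputs = self.llm.generate([prompt], params)
        return outputs[0].outputs[0].text
```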

And then once I've loaded the model, I can just do the prediction. So what happens is that when I deploy and upload this onto my cloud, I get a simple API endpoint, which, if I share my screen, there we go, also gives you a small front end to actually see the model and test it. So automatically you get this nice front end where you can test the model, check that the responses are good, tinker with it, and pass whatever you want as prompts. That lets you validate that the model works before you scale it through our platform.

So very simply, with this Python SDK you upload this pipeline directly onto your cloud provider, you get an API endpoint, and now you can just hit that right inside your application. So you can build that RAG project you have for your company, but now it's calling an LLM and an embedding model running directly on your cloud provider: AWS, GCP, Azure, et cetera.
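As a rough illustration of what that looks like from the application side, the snippet below posts a prompt to a deployed endpoint. The URL, auth header, and payload fields are hypothetical placeholders, since the exact request format comes from the platform, but this is the kind of call your RAG code would make instead of calling a closed source provider.

```python
# Hypothetical call to the deployed endpoint from application code; the URL,
# auth header, and payload shape are illustrative, not the exact Mystic API.
import requests

ENDPOINT = "https://your-deployment.example.com/v1/run"  # placeholder URL
API_KEY = "..."                                           # your platform API key

def generate(prompt: str) -> str:
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "max_tokens": 256},
        timeout=60,
    )
    resp.raise_for_status()
    # Field name depends on how the pipeline returns its result.
    return resp.json()["output"]
```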

Conclusion and Strategy for Companies

Yeah, so basically I just showed you a bit of what we call this whole AutoOps: how to get this API endpoint. Teams and companies can now very easily do something like this in just a matter of minutes. They deploy this platform directly on their cloud provider, and they get API endpoints for these models running at the lowest cost possible, thanks to all the optimizations we've done. You don't need to hire a team, and you get your application up and running, directly on open source models.

So it's a very good second strategy if a company relies heavily on these closed source models, and some companies can't actually use the closed source models from OpenAI and Anthropic at all, so they need to rely on open source models running directly in their AWS account, for example. This gives them a very easy way to make sure they have reliable infrastructure that just works, without needing to invest heavily in it.

Closing Remarks

And that's us. That's basically Mystic.

Contact Information and Offer

If anyone wants a free, personalized demo, please do reach out to me at mystic.ai; I'm also on LinkedIn. And yeah, I hope you all enjoy the rest of the day, and I'll be around.
