Fine-Tuning a Foundation Model for Multiple Tasks with Open Source

Introduction

In today's discussion, what we want to talk about is fine-tuning foundation LLMs with open source technologies. My partner in crime, Carlos Hernandez, will talk about ways that you can leverage the open source community to really help with your accuracy when fine-tuning your foundation LLMs.

In terms of the people today, sadly, we are missing what I like to call the nucleus of our team, Marcos Coram. But presenting today, you have myself, Sam Ajithkumar, as well as Carlos Hernandez, who will be doing a live demonstration.

If there are any questions during our session, feel free to keep them till the end. And by all means, Carlos and I will walk the floor afterwards as well, to make sure we can answer everyone's questions, everyone's curiosity, or any sort of scathing criticism.

Agenda Overview

But in terms of the agenda today, what we want to talk about is LLM application architecture and why fine-tuning is important. After that, Carlos will do a live demonstration of how you could fine-tune a pre-trained LLM in either a SaaS or a self-hosted way. And then, last but not least, we'll keep the floor open for any questions, comments, concerns, or scathing criticism.

So without further ado, I'll pass it over to Carlos. Awesome, thank you. Can you guys hear me okay? Perfect.

Okay, so first off, not going to lie, my sandbox environment, I just found out, is kind of broken. So let's see how this goes. But either way, we'll be able to show you some stuff that I hope is cool, and you will get some takeaways.

Understanding LLM Architecture

Now, before getting into the demo, we just wanted to talk a little bit about the LLM architecture. Nowadays, the enterprise use cases for LLMs are growing intensely, I would say.

What can LLMs help you with? Chatbots, quotes and tweets, users and search, summarizing large documents, regulatory compliance, and translations. But the devil is in the details, as they say.

The moment you start going beyond these foundation models, a chatbot needs serious guardrails. I don't know how many of you have heard the story of, I think it was the Bing chatbot, that in two hours learned how to be the most horrible human being ever. You don't want that, right?

The other one is tweets. I don't know about you, but I still wouldn't trust ChatGPT to tweet for me without any supervision. I also don't tweet, to be honest, so that doesn't help.

I'm sure that you guys have seen this, but hallucinations, or confabulations, depending on who you talk to, essentially they make stuff up. And when it comes to regulatory compliance, you need to fine-tune it. You need to make it specific to the use case.

And yes, they can translate languages very well, usually. But in the words of Jared Spataro, who's a corporate VP at Microsoft, ChatGPT is the very knowledgeable general employee that you don't trust. Arguably, I'm still at this level.

So how do you get to a point where the LLM will be of use for your particular use case? Well, it's all about the data. Conventional wisdom tells us that the more parameters you have and the more training you have, the better the model. Now, that is changing nowadays.

We're getting models that are smaller in terms of parameters, but they're still performing relatively well. And why is that? Well, the reason is the data that is used to fine-tune the model.

So if everybody's using ChatGPT, how do you differentiate yourself? The difference is the data that you use to train the model, right? So the more data you have, the greater the insights.

Yes, Carlos, very useful insight on that one. Thank you, Captain Obvious.

Evolution of LLM App Architectures

Now, how do we get to that point of customizing the model? Well, let's talk a little bit about the evolution of LLM app architectures. It used to be going straight from a base LLM into an app; think of ChatGPT. It's a generic use case.

Now we are getting the RAG architecture, retrieval-augmented generation. How many of you have heard of the RAG architecture before? OK, about half. Great.

So essentially what changes here is that you have a base LLM, what we call a foundation model, and you use that for your questions or whatever you want. But you do what's called fine-tuning: you actually create a vector database and you give that to the model, so that the model gets some context for the questions that you're asking. So it's text embeddings.
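
As a rough sketch of that idea, the snippet below stores a couple of text chunks as embeddings in a local Chroma collection and pulls back the closest ones to use as context; the documents, collection name, and prompt format are made up for illustration.

```python
# Minimal RAG sketch: store text chunks as embeddings, retrieve the closest
# ones for a question, and prepend them to the prompt as context.
# The documents and collection name are placeholders for illustration.
import chromadb

client = chromadb.Client()  # in-memory Chroma instance
collection = client.create_collection("docs")

# Chroma embeds these chunks with its default embedding function.
collection.add(
    documents=[
        "CML stands for Cloudera Machine Learning.",
        "CML lets you run Python workloads such as jobs and applications.",
    ],
    ids=["doc-1", "doc-2"],
)

question = "What is CML?"
hits = collection.query(query_texts=[question], n_results=2)

# The retrieved chunks become the context that the foundation model sees.
context = "\n".join(hits["documents"][0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```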

Now, I'm not going to go through all of these beautiful diagrams, but the moment you get into RAG, the next questions automatically become: what about governance? How do you authenticate? And then MLOps: how do you tune the parameters automatically? And how do you cache it?

So this is getting more and more sophisticated, but the core idea is consistent: the RAG architecture, with a vector database that keeps the text embeddings to give the model context. This is the path we're seeing our customers take, at least at Cloudera.

When we look at the project lifecycle, I'm not going to go through all of this, but it's here for your information. And we can share these slides afterwards, by the way. I don't know what the method is for that, but we're happy to share them.

We go from scoping: you have to choose a use case that is big enough to make a difference, but small enough and targeted enough that you can actually create an LLM that is fit for its purpose, rather than trying to do all the things at once.

Now, when you choose the LLM, there are a few decisions you have to make. One is SaaS or self-hosted. You have to assess the cost structure and whether the model performance fits.

Now, what do I mean by model performance? Well, you can do certain analysis before even training your model to understand: is it fast enough for you? Does it meet your SLAs, for example?

And I'm a little bit short on time, so I'll move pretty quickly through this.

Live Demonstration and Discussion

But let's look at the demo. We're going to talk about the demo really quick.

I'm done with the slides now. And now let's see how we are doing with my sandbox environment over here. We'll let that load.

All of the code that you will see is available on GitHub. Again, you'll get access to this afterwards if you want to use it as a point of reference.

Now, the code here is wrapped around our Cloudera Machine Learning, so there are some scripts that are not going to be that useful to you, but the ones I will show you hopefully will be of interest.

I have here a tab group called For Local, where I also loaded it locally. I was about to call it for emergency, but I think that would have been too on the nose.

Now, when we talk about the different types of machine learning models, actually, let me just go to this one slide just to give you an idea of what we're talking about.

So in this demo, we'll look at two chatbots, both Cloudera AI assistants. One takes a SaaS approach, and one takes a self-hosted approach.

So in the SaaS approach, we use AWS Bedrock for running the model. And we use Gradio, which is a Python framework for chat applications and other AI applications. And we store the vectors in Pinecone.

The code runs in what we call a CML job; think of this just as Python for our purposes. So this is Cloudera Machine Learning.
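
As a hedged sketch of what that SaaS path can look like from Python, something like the following calls a Bedrock-hosted model through boto3; the model ID, region, prompt format, and parameters here are illustrative assumptions, not the exact code from the repo.

```python
# Rough sketch of the SaaS path: send a prompt to a Bedrock-hosted model.
# The model ID and request body format are illustrative assumptions; check
# the Bedrock documentation for the model you actually use.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: What is CML?\n\nAssistant:",
    "max_tokens_to_sample": 256,
    "temperature": 0.2,
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",   # hypothetical choice of Bedrock model
    body=body,
    contentType="application/json",
    accept="application/json",
)

answer = json.loads(response["body"].read())
print(answer.get("completion"))
```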

The second example is the self-hosted one. In this example, we actually have everything in our environment.

So we have the model in our environment, we have the Gradio app, the vector store is Chroma, and the code is in our environment.

So those are the two examples that we're going to look at today.

And just to give you an idea of the difference, here I have, can you see it? This is just a chatbot. This is a Gradio app. Here are the answers; here are the questions.

Let's ask it just: what is CML? Now, you can select the foundation model. We have the local Mistral model here, and we have the AWS Bedrock Claude model. We can also select the temperature, or randomness, of the responses. We can select the number of tokens, and we can select which vector database to use.
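
For a rough idea of how a UI like that might be wired up, here is a minimal Gradio sketch with the same kinds of controls; query_llm is a hypothetical stand-in for the demo's actual routing logic.

```python
# Minimal Gradio sketch of a chatbot UI with model, temperature, token length,
# and vector database selectors. query_llm is a hypothetical helper that would
# route the request to Bedrock or the local Mistral model.
import gradio as gr

def query_llm(question, model_choice, temperature, max_tokens, vector_db):
    # Placeholder: in the real app this would retrieve context from the chosen
    # vector database and call the chosen foundation model.
    return f"[{model_choice} / {vector_db} / T={temperature}] answer to: {question}"

demo = gr.Interface(
    fn=query_llm,
    inputs=[
        gr.Textbox(label="Question"),
        gr.Radio(["Local Mistral", "AWS Bedrock"], label="Foundation model"),
        gr.Slider(0.0, 1.0, value=0.5, label="Temperature"),
        gr.Slider(64, 1024, value=256, step=64, label="Max tokens"),
        gr.Radio(["None", "Pinecone", "Chroma"], label="Vector database"),
    ],
    outputs=gr.Textbox(label="Answer"),
)

if __name__ == "__main__":
    demo.launch()
```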

So if we ask the Bedrock Claude model what CML is, without any vector database chosen, it doesn't know the answer to that question. But if we say, you know what, fine, what if we use the vectors from the Pinecone database? Same model, though.

That's the important part: it's the same foundation model. Now it should, in theory, know the answer to this.

Because what we have done is loaded the Cloudera documentation as vectors. So we went into the Cloudera documentation and loaded the vectors into Pinecone. And now, when we ask the exact same question using the same model, it tells us: well, CML stands for Cloudera Machine Learning. And it even tells you where it got it from.

So you can go here into a reference, and you can see where it got it from. Now, this is all running SaaS, Software as a Service. If you wanted to run a local model, you can do the same thing. You can have a local vector database, which is Chroma in this case.

Now, my environment's a little bit opinionated today. But what you can see here, this is our project that we're looking at. These are all of the same folders from the GitHub repo. We have a Chroma data folder here. And this is the Chroma database where we have also stored the text embeddings.

Now, in this case, we can ask the same question, with the same model, and a different vector database. And it should also have an answer. Now, it will be a little bit of a different answer. Remember that, at the end of the day, generative AI is probabilistic, not deterministic, so the answers will change slightly. And it's also because the embeddings that we saved into Chroma are a little bit different. I think these ones are 768 dimensions, and the Pinecone ones are, I think, a little bit more. But it's the same concept.

You have here: what is CML? Well, it stands for Cloudera Machine Learning, and it's using the same page. And really, all we do to make this work is pass it a text file with the HTML sites. So we have here, sorry, one second, the populate-vector-DB script. It's a simple Python script. All it does is normalize some of the text and make a request to each link in this file. So it makes a request to each one of these pages, and you can see they're all in the Cloudera documentation, and it puts them into a database.
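
In that spirit, a hedged sketch of such a loader script could look like the following; the URL file name, the HTML cleanup, and the Chroma collection are illustrative assumptions rather than the repo's exact code.

```python
# Sketch of a loader: read a text file of documentation URLs, fetch each page,
# normalize the text, and store the chunks in a Chroma collection.
# File name, cleanup rules, and collection name are illustrative assumptions.
import requests
from bs4 import BeautifulSoup
import chromadb

client = chromadb.PersistentClient(path="chroma-data")
collection = client.get_or_create_collection("cloudera-docs")

with open("html_links.txt") as f:          # hypothetical file of doc URLs
    urls = [line.strip() for line in f if line.strip()]

for i, url in enumerate(urls):
    page = requests.get(url, timeout=30)
    # Strip markup and collapse whitespace so only readable text gets embedded.
    text = BeautifulSoup(page.text, "html.parser").get_text(" ", strip=True)
    collection.add(documents=[text], metadatas=[{"source": url}], ids=[f"page-{i}"])
```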

The moment that you enhance your LLM with the vectors from the database, it knows answers that it wouldn't otherwise have known. And we haven't done anything to the model. We haven't trained it. We haven't done anything else to it.

You can also use the local model, so this is all self-hosted. Now, when we look at self-hosted in our case, just so you can see it, I'll wait for that to load. But that's essentially running on a different endpoint, and it's the Mistral model. Thankfully, my workloads are still OK.

So in this case, I think the Mistral model, when you ask it what CML is without any vectors, actually gives you a completely irrelevant answer in this context. Let's see. It did this to me yesterday. It says it's chronic myeloid leukemia, right? Very different context. So you can see how the important part here is that we gave it the vectors it needed to have that missing context. Okay.
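
To make the self-hosted path concrete, here is a hedged sketch of calling a locally served Mistral model over HTTP; the endpoint URL and the request and response fields are assumptions, since serving stacks differ.

```python
# Sketch of calling a self-hosted model endpoint. The URL and the JSON fields
# are assumptions for illustration; adapt them to however your model is served.
import requests

ENDPOINT = "http://localhost:8080/generate"   # hypothetical local endpoint

payload = {
    "prompt": "What is CML?",
    "max_new_tokens": 256,
    "temperature": 0.2,
}

resp = requests.post(ENDPOINT, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json().get("generated_text"))
```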

Leveraging Langchain for Model Building

So now, how do we do this? We use the LangChain library. How many of you are familiar with LangChain? Nice. Okay, great.

So for those of you that are not familiar with LangChain, LangChain is essentially a framework that makes it easier to build your LLM applications. So you can, for example, create a vector store. This one is Chroma. But you can also create one for Pinecone and a whole variety of different sources.
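
As a rough sketch of what that looks like with the classic LangChain interfaces (0.0.x-style imports), you can build a retriever over Chroma, and the same call shape works for Pinecone or another backend; the embedding model and texts here are examples, not the repo's exact code.

```python
# Sketch of building a LangChain vector store over Chroma (classic
# langchain 0.0.x import paths; the embedding model and texts are examples).
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-mpnet-base-v2")

vectorstore = Chroma.from_texts(
    texts=[
        "CML stands for Cloudera Machine Learning.",
        "CML runs Python jobs and applications.",
    ],
    embedding=embeddings,
    persist_directory="chroma-data",
)

# The same retriever interface would work for Pinecone or another backend.
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})
print(retriever.get_relevant_documents("What is CML?"))
```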

So if you go to the LangChain website or GitHub, they have a big list of all of the different vector databases that it supports, and the same with the models. What that allows you to do is make your models modular, because this space is developing incredibly fast, as you all know, and it seems like every couple of months there's a new latest and greatest.

So instead of rebuilding and refactoring all your application code, all you need to do is put it into LangChain code. So for example, you can come in here, and this is the LangChain chain, I guess. You can feed it a different type of model. You can feed it a different vector store. And you can even feed it specific prompts.

There are a number of templates. And if you want to change it, all you need to do is change the LLM model or change the vector store. So say that you wanted to say: well, we grew, we cannot use Chroma anymore, we need to be a little bit more scalable, let's use Pinecone. Well, you could just change this to Pinecone. There's obviously some other configuration that you would need to do for that one. And just like that, your LLM application will be using Pinecone.
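
A hedged sketch of that kind of chain, again with classic LangChain interfaces: swapping the LLM or the vector store only touches the lines where they are constructed, while the chain itself stays the same. The Bedrock model ID and parameters are assumptions.

```python
# Sketch of a retrieval chain where the LLM and vector store are swappable
# (classic langchain 0.0.x interfaces; model IDs and stores are examples).
from langchain.chains import RetrievalQA
from langchain.llms import Bedrock
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings()
vectorstore = Chroma(persist_directory="chroma-data", embedding_function=embeddings)

llm = Bedrock(
    model_id="anthropic.claude-v2",   # could also be a local model wrapper
    model_kwargs={"temperature": 0.2, "max_tokens_to_sample": 256},
)

chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(),
)

print(chain.run("What is CML?"))

# To move from Chroma to Pinecone, only the vector store construction changes,
# e.g. something like Pinecone.from_existing_index(...) plus its own client
# configuration; the RetrievalQA chain above stays exactly the same.
```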

And that's the power of LangChain. That's why we use it in this case. So if you haven't checked it out, it's highly recommended. We have LangChain for building the models, and we have Gradio for the UI that you saw. So all of this is Gradio, for those of you keeping track. And all of the code is available here on GitHub.

Just a quick time check. Where am I at? Five minutes, perfect, nice, okay.

Cool. So, thinking of LLMs in an oversimplified way, all LLMs really do is guess what the next word is, right? So it's all about the prompt and the prediction. The LLM will attempt to guess the next word in the sequence, so you can continue.

So , that's an idiom in Spanish. Choose promise, choose do, which is an idiom in English I didn't know.

The quick brown fox. The quick brown fox jumps. The quick brown fox jumps over. So these are all new tokens.
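
As a tiny illustration of that next-token behavior, the sketch below asks a small open model to continue the same phrase; GPT-2 is just a convenient stand-in here, not the model used in the demo.

```python
# Tiny next-token illustration: a language model continues the prompt one
# token at a time. GPT-2 is only a small stand-in for illustration.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

out = generator("The quick brown fox", max_new_tokens=5, do_sample=False)
print(out[0]["generated_text"])   # e.g. "The quick brown fox jumps over ..."
```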

Now, you don't want to be retraining the model every two months, because it's very expensive. But you can play with certain parameters to help you achieve the use case you want.

So for example, it might say "the quick brown bear" instead. This is all related to the temperature, the randomness, of the model, and it's one of the parameters that we have implemented in our example, which is, again, on GitHub. So we can increase the randomness of the response and ask again: what is CML? Let's use AWS Bedrock, and let's use Pinecone, because it's a little bit faster.

Ah, failed to connect, perfect. My environment is not cooperating today. So let's just use Chroma.

But you can see this is actually a really good example, right? Here we have some, I guess unintentional, high-availability type of design: I have a problem with Pinecone, so I just chose Chroma, and in LangChain it gets replaced with Chroma.

And that's the prediction that gets executed. And because we increased the temperature, the randomness, you can tell how the answer is completely different from what we had before.
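
To see that randomness knob in isolation, a small sketch like the following (reusing the GPT-2 stand-in from earlier) contrasts a low and a high temperature; the exact continuations will vary from run to run.

```python
# Contrast low vs. high temperature with the same prompt. Higher temperature
# flattens the next-token distribution, so answers drift more between runs.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
prompt = "The quick brown fox"

low = generator(prompt, max_new_tokens=8, do_sample=True, temperature=0.2)
high = generator(prompt, max_new_tokens=8, do_sample=True, temperature=1.5)

print("temperature 0.2:", low[0]["generated_text"])
print("temperature 1.5:", high[0]["generated_text"])
```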

So that's essentially one approach that we have seen our customers take. It's a fairly common pattern that we're seeing when it comes to the RAG architecture.

SaaS vs. Self-Hosted Models

So, just as a quick summary. Between the two approaches, SaaS and self-hosted, a lot of the customers we work with are finding that SaaS is the easiest way to start. There's no need to run GPUs. There's no need to run training.

And you can start with this, but as you mature, things change. We're speaking with some customers, for example in the financial and healthcare spaces, who don't want to be transferring their data over to these organizations. They want to have everything in one environment, a self-controlled environment.

And for that, we are using Chroma. Now, I'm not sure. This is an area where I'm still learning the differences between the various vector databases, to be completely honest with you. We can have a conversation after.

I would love to know what the different folks in this meetup are using. For us, we're using Chroma locally. But at the end of the day, it's a SQL-like database, which is not particularly performant, of course.

Pinecone has been very good, though, in our experience. And yeah, I guess the key takeaway here is that LLMs are fire, right? The technology is here to stay, but the target state is evolving very rapidly. So it is important to architect your applications in a way that is fairly modular.

In this case, of course, we did it with LangChain. Also, it's not just about the LLMs. The decisions on the surrounding components need to be considered carefully.

The performance of the model, like we said, needs to be considered carefully, and so does how you fine-tune the model. One of the areas that I didn't cover too much is that there are different ways to fine-tune the model.

We talked about vectors today, but there is also in-context learning, which is essentially prompt engineering, where you can teach your model as you go along. I don't know how many of you have tried asking ChatGPT: explain to me what happened in the 2008 financial crisis.

And then you say, explain it to me like you're a bro. And then ChatGPT's like, sure, bro, let me tell you about the financial crisis. And then it breaks it down for you.

And that's in context learning. And you can do the fine-tuning, which is what we discussed today with the vector embeddings for enhancements. You can also do reinforcement learning from human feedback.

Or you can train your own model. For most customers, it doesn't make a lot of sense to train your own model, because this is particularly expensive nowadays with the GPUs. But using a foundation model, you don't have to.

Conclusion

You just optimize your model in a way that's right for your use case. And that's how, even though you're using the same foundation model as other organizations and competitors, you will actually be able to outperform them with fine-tuning. Then the last piece, of course, is the integration of the app.

For us, we just prototyped it with Gradio, but this is also an important consideration. The ongoing feedback loop is an important consideration, and MLOps is obviously a very important consideration for this world. Now, that's a bit out of the scope of today.
