Building an LLM Product from Scratch

Introduction

Obviously, we'll have to talk about transformers. And then we'll go over the architectures.

We have to talk about RAG.

And through most of this, I'll try to debunk or de-hype the situation around the whole ecosystem right now. We don't know what to trust, what works, what doesn't work.

So I'll stick to the very basics.

Pre-ChatGPT Era

I'll start from the pre-ChatGPT era and work up to what we are building now and how it is different. So this is a piece of art, published in 2017. ChatGPT came out in 2022, so that was two years ago.

Just let that sink in, right? Each and every line on this is a previous research paper.

Everything is just put together in a nice configuration, which makes it very parallelizable, very scalable.

That's why you can have billions and billions of parameters in these models and still host and run these transformers.

As a software dev or as a data scientist, there are just three lines of math you need to know. And then you understand: okay, ChatGPT does not cost them $20 a month to run. They are upcharging us by thousands of percent.
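
He doesn't write them out here, but the three lines are almost certainly the attention equations from that 2017 paper; this is my reconstruction, not the slide:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
\mathrm{head}_i = \mathrm{Attention}(Q W_i^{Q},\; K W_i^{K},\; V W_i^{V})
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}
```

Everything else in the block diagram is layer norms, residual connections, and feed-forward layers wrapped around these.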

There's caching. There's so many things you can do which they don't even want to tell you about.

Google will release those papers because they're all transformers, they're open research. Everyone else will not tell you about it.

Transformers Basics

So we need to understand transformers from the basics, not get into the hype.

So there are two blocks, encoders and decoders. It's not just one, decoder-only.

All the big chat models right now are decoder-only, which is just autoregressive: I'm going to keep spewing out whatever comes to my mind.

No. I want an encoder. I need to have some sense of, okay, I've learned something from the input, and then I'm going to start spewing something out.

Modern Architectural Changes

These are pretty basic things when you're starting to build a product: we have a context, we want to use it, and we want to stop hallucinations. So we start from scratch, from where the model actually starts training, and then we see how it gets so big.

But yeah, now the architecture is completely flipped. We used to have small, tiny machine learning models doing regression, working for years everywhere. And now we have these giant autoregressive models and we can't host them.

The slowest thing in your system used to be your database, and now it's your machine learning model. It's not even living in your system anymore. It's living at ChatGPT.

You have to go over the internet, connect to it, and come back; it's that big. So that's what has changed since 2022.

Data Handling

We have data, we compute features from it, we make predictions, and we store them. That's the inference, the part that's serving you.

We had this working for years. The only reason it was working was because we had small models. And it was still impossible to switch to streaming technology.

So what is this doing? You have a bunch of data. You're training your models on it. Then you store the model in a registry. And then you keep making predictions with it.

So, as a server, if I give you something and you say you don't like it, just for me to store that information, that you don't like it, and not serve it to you next time, I had to do all of that. Now ChatGPT makes it easy: ChatGPT just comes and reads.

I just store: hey, this guy doesn't like it. ChatGPT reads it, understands that information, and that's how it responds.
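
As a minimal sketch of what "just store it and let the model read it" means, with every name made up for illustration:

```python
# Minimal sketch: store plain-text feedback, then hand it to the model as context.
# The in-memory list stands in for a real table or vector DB.
feedback_store = []

def record_feedback(user_id: str, note: str) -> None:
    feedback_store.append({"user": user_id, "note": note})

def build_prompt(user_id: str, query: str) -> str:
    notes = [f["note"] for f in feedback_store if f["user"] == user_id]
    context = "\n".join(notes) or "No prior feedback."
    return (
        "Known feedback about this user:\n"
        f"{context}\n\n"
        f"User query: {query}\n"
        "Respond while respecting the feedback above."
    )

record_feedback("u42", "This user does not like listing Java projects first.")
print(build_prompt("u42", "Rewrite my resume summary."))
```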

This is the paradigm shift which happened in 2022. And that's why people like me, why everyone, is hyped about LLMs.

Like, we are language nerds. We are trying to make language work. And now it's time to make real-world products.

That's why you now have it in real estate, medical, finance, everywhere. This is required. And it's going to be there, obviously.

Modern Architecture

So the solution to it, the new modern architecture, is called the FTI architecture. It's a basic ETL pipeline: any unstructured data, structured data, websites, blogs, whatever. You push it through an ETL into a database and get it into a structured format.

You push that down into your training, which feeds your vector DB. What is a vector DB? Whatever you have learned, you put into your database in some precise, contextual form, as embeddings.

And that's all you need to do. Next time ChatGPT comes in, it just reads your embeddings, and it knows how it needs to respond.
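
Here's a minimal sketch of that embedding step, assuming a sentence-transformers model, with an in-memory NumPy array standing in for the vector DB:

```python
# Sketch of the ETL -> embeddings step. A real product would use a vector DB
# (Qdrant, Chroma, pgvector, ...); a NumPy array stands in for it here.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed embedding model

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Refund policy: customers can return items within 30 days.",
    "Shipping: orders over $50 ship free within the US.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)  # shape (n_docs, dim)

def search(query: str, k: int = 1) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ q          # cosine similarity, since vectors are normalized
    top = np.argsort(-scores)[:k]
    return [documents[i] for i in top]

print(search("do you do free shipping?"))
```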

Any questions at this point? All right.

So you read the context from the embeddings, and then you add more stuff on top of it, how to read it, how to respond to it. That's what your product does. That's literally any product.

That's going to be your business logic. So any data you have, you're going to store either in vectors or in just a SQL database, and you're going to combine all of that and create some sort of API.

Manual? Yeah. That's what making a product would mean.

We are taking whatever we need, we put it in a database and make it available on the API. Then it's a chat. Then someone can send a prompt to it.

Let's get even more low-level. Maybe this is too much, but this is how a streaming application works, an event-driven kind of structure: the second the data comes in, you write it to your database. After the database, there's a post hook which pushes it to your vector database.

Then, once your user comes in, you have your LLM. You say: hey, here's my prompt, I have the context and I have the query. Now can you please respond based on this query?
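
Stitched together, that prompt step looks roughly like this. The model name, and the `search` helper from the earlier sketch, are my assumptions; any chat API has the same shape:

```python
# Sketch of the retrieval-augmented call: fetch context, then ask the LLM.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer(query: str) -> str:
    context = "\n".join(search(query, k=3))   # dense embedding search from before
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuery: {query}"},
        ],
    )
    return response.choices[0].message.content

print(answer("What is the refund window?"))
```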

Fine-tuning Models

The training aspect is now what we have heard about as QLoRA, fine-tuning, QDoRA, DoRA, all of that. Basically, they are just giving you the opportunity to not be dependent on just the ChatGPT API. So you want to have a bunch of parameters, not a billion parameters, but let's say a hundred, which are just yours. You call them adapters, and they run on top of the base parameters. And that is something you can manage.
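
Stripped to the math, the adapter idea is tiny: the big weight matrix stays frozen and you learn a small low-rank correction next to it. A sketch, not any specific library's implementation:

```python
# Low-rank adapter sketch: y = W x + (alpha/r) * B A x, with W frozen.
import numpy as np

d, r = 1024, 8                     # hidden size vs. adapter rank (r << d)
W = np.random.randn(d, d)          # frozen base weight (stands in for the pretrained matrix)
A = np.random.randn(r, d) * 0.01   # trainable: r*d parameters
B = np.zeros((d, r))               # trainable: d*r parameters, starts at zero

def forward(x: np.ndarray, alpha: float = 16.0) -> np.ndarray:
    return W @ x + (alpha / r) * (B @ (A @ x))   # only A and B get gradient updates

x = np.random.randn(d)
print(forward(x).shape)            # same output shape, tiny trainable footprint
```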

You cannot manage ChatGPT. And if you host your own full-size LLM, that's like $20,000 a month. I don't have the money for that.

So I would prefer small language models: something I can fine-tune, something I can start at eight gigs and bring down to one gig so that I can host it, run it faster, and run it in multiple places.
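
One hedged example of the "eight gigs down to one gig" move is loading a model in 4-bit with bitsandbytes through transformers; the model name here is just an example:

```python
# Sketch: load a model in 4-bit so it fits on modest hardware.
# Requires `transformers`, `bitsandbytes`, and a CUDA GPU.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
name = "mistralai/Mistral-7B-Instruct-v0.2"   # example model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, quantization_config=bnb, device_map="auto")

inputs = tok("Summarize RAG in one line.", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=40)
print(tok.decode(out[0], skip_special_tokens=True))
```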

If this is not super clear, there's another word for all of this, which is RAG. And that's where I want to counter a little bit of the hype.

Like, every week there are ten papers. At this point it's less, it's five. But a few months ago, it was a lot.

So in my understanding, RAG is basically dense embedding search. Anything you do on top of that is your business logic, that's your product, whatever website you want to create.

And what we described is a generative RAG. I give something to a user, the user says, I don't like it, and I store and understand that: that's a self-healing database. I don't need to keep retraining my model again and again.

That's the very basics of RAG. And on top of that, something which is very crucial for any product we talk about is a graph. Knowledge: there are multiple names we have heard for it, knowledge graph, rules engine, anything where an entity is connected to something else, and that defines how your product should work, whether that's an automation, any software, or just a product you thought up.

There are multiple technical ways to do it. If you just want to do knowledge distillation or a graph, you can just do it in a database. You can just ask your LLM to do it for you. All of these options, the LLM can do for you.
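
A sketch of the "just ask your LLM to do it" option: have the model return entities and relations as JSON and load them into whatever graph store you like. The prompt and model choice are my assumptions:

```python
# Sketch: ask the LLM to extract (subject, relation, object) triples as JSON.
import json
from openai import OpenAI

client = OpenAI()

def extract_triples(text: str) -> list:
    prompt = (
        'Return a JSON object {"triples": [[subject, relation, object], ...]} '
        "extracted from this text:\n" + text
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content).get("triples", [])

print(extract_triples("Ada is a data engineer who reports to Grace, the CTO."))
```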

And there are really technical options, which are just very simple but require a lot of time to build. So that's where I really want to come down to.

RAG is something that's pretty basic. If you are doing retrieval and adding something on top of it, they call it CRAG, corrective RAG. If you take RAG and put a graph on it, they call it GraphRAG. It's all RAG.

So please don't get distracted by RAG. That's one of my big takeaways for today.

Multi Modality

And then the next frontier, which brings us to multimodality. What do we do? It's not just going to be text, and it's not just going to be PDFs and images. It's going to be a lot more, and this is not all of it.

This is what I am working on and could work on, and we have the latest research on this, but there's a lot more which could happen.

All of these models, like someone was asking, how do they know? Do they go back in time or not? So we kind of build them like that.

We first create two different pathways: okay, first you don't consider time, and then you do consider time. And different companies are working on different models.

They all perform better at something; some don't perform that well. So you have to pick and choose for your use case, what is useful for you.

So for example, in this one, they are processing the text and images separately. But Meta's model processes them together and gets a better multimodal understanding.

So you have to consider your use case. You have to run it on your data and see if it works or not. So it's very important to have a test bench, an evaluation bench.

We're like: okay, this week there are three new papers that came out, three new products that came out. I need to have my data ready, I push it through, and I see if it works or not.
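
The evaluation bench can be embarrassingly simple: a fixed set of query/expected pairs that you rerun against every new model or pipeline. A sketch, where `answer_fn` is whatever your current pipeline is:

```python
# Minimal evaluation bench: same test set, rerun against every new model or pipeline.
test_cases = [
    {"query": "What is the refund window?", "expected": "30 days"},
    {"query": "Is shipping free over $50?", "expected": "free"},
]

def run_bench(answer_fn) -> float:
    hits = 0
    for case in test_cases:
        reply = answer_fn(case["query"])
        hits += case["expected"].lower() in reply.lower()   # crude substring check
    score = hits / len(test_cases)
    print(f"passed {hits}/{len(test_cases)} ({score:.0%})")
    return score

# run_bench(answer)   # plug in whichever new paper/product you're evaluating this week
```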

I need to move on. I can't be like, okay, there's a pile of things which I need to read, and then you get lost in it. That never works.

So one of the ways to get ahead is to just choose what you know. Don't get lost in a lot of things. My basic thing was to just use ZenML.

I knew Airflow, I knew MLflow, and I was like, okay, let's just deploy it and fix it step by step. I did eventually go with AWS, but ZenML is a very good option for people who are just getting started.

Deployment and MLOps

But there is a lot to think about. As I was saying earlier, you have to deploy it. There are RLHF systems, there's evaluation, there's monitoring.

Why do you want to care about that? You want your product to get out in a day.

What do you know? Just use that, write that code, put it out, get feedback.

That's the whole point of AI. AI will never be 100% correct.

That's overfitting, as she was saying in her talk, right? So if it is only ever going to be 70% at the max, and with user feedback it can get to 90%, why wait?

Just put it out. Get feedback and build it.

Don't get lost in the MLOps. Fine-tuning.

Fine-tuning Strategies

So there are two kinds of fine-tuning. One is where there's a human in the loop. We come and say: hey, this is the dataset, fine-tune on it; this response is not good enough. It keeps fixing based on that.

There are other, automated ones. So what we could do is just ask the LLM again: is it good enough or not? Just judge it. Or you prompt-split it, which is conditional. Think about it: you're creating a graph again and you're asking LLMs to run with every request in your system, just to fine-tune on your data. If you do that with a large language model, that's almost impossible in a real system. You can't expect fine-tuning on every request.
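
A sketch of the automated "just judge it" loop: a cheap judge call scores each response, and only the low-scoring ones get queued for later fine-tuning. The model name and threshold are my assumptions:

```python
# Sketch: LLM-as-judge scores responses; only failures are collected for fine-tuning.
from openai import OpenAI

client = OpenAI()
finetune_queue = []

def judge(query: str, reply: str) -> int:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate this answer 1-5 for the query.\nQuery: {query}\nAnswer: {reply}\n"
                       "Reply with a single digit only.",
        }],
    )
    return int(resp.choices[0].message.content.strip()[0])

def maybe_collect(query: str, reply: str, threshold: int = 3) -> None:
    if judge(query, reply) < threshold:          # bad answers become training signal later
        finetune_queue.append({"query": query, "bad_reply": reply})
```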

But for example, what I'm doing as I'm building Curricula: there's a resume API, and I have a read anchor for every RAG call. So RAG is crucial. You need to check whether RAG is working or not. If it is working well, you take that and train your data more on top of it, train your small model on top of that, and then serve that. If you don't do that, then every request is going to cost you like 30 cents instead of five cents, and that is a big deal.

The second is doing preference optimization. This is the auto-correcting LLM. So if you have a bunch of data, let's say from one day, and it didn't work too well, you take all of that and fine-tune on it again. This is a different flavor of fine-tuning, direct preference optimization. Literally, it means the LLM will learn the preference from that data, and this is more production-ready. A lot of people are using it in production, so it's highly recommended as a fine-tuning strategy.
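
A hedged sketch of DPO with the trl library: dataset rows need prompt/chosen/rejected columns, the model name is just an example, and the exact trainer keyword arguments vary across trl versions:

```python
# Sketch of direct preference optimization with `trl`.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "Qwen/Qwen2-0.5B-Instruct"              # small model, just an example
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Yesterday's logged traffic, turned into preference pairs.
pairs = Dataset.from_list([
    {"prompt": "Summarize this resume:",
     "chosen": "Concise, role-focused summary.",
     "rejected": "Rambling copy of the whole resume."},
])

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-out", per_device_train_batch_size=1, num_train_epochs=1),
    train_dataset=pairs,
    processing_class=tokenizer,   # `tokenizer=` in older trl versions
)
trainer.train()
```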

Coming to strategies: again, there's a lot, but there are three notable mentions.

First, we have to understand PEFT, parameter-efficient fine-tuning, which is the adapter fine-tuning I was talking about. So you have 300 billion parameters, but you only need a million important adapter parameters on top of them. That is what PEFT fine-tuning is about.
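
In code, PEFT is a few lines with the Hugging Face peft library; the target module names depend on the model architecture, and the model name is again just an example:

```python
# Sketch: wrap a base model with LoRA adapters so only a tiny fraction of weights train.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
lora = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # which layers get adapters (architecture-specific)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()         # prints how few parameters actually train
```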

For local testing, definitely just go with Hugging Face and Axolotl, a full suite of testing tools. And then you have Unsloth. Unsloth is one of the best ones out there for creating a product.

It optimizes your model for a single GPU. So you can just create your model and host it on a single GPU, or a fractional GPU, on BentoML or something like that. And there you go, you have your own ChatGPT.

And on top of that, the last one I want to mention is NVIDIA NeMo. It optimizes your model and gives you a NIM, an NVIDIA NIM file which is optimized for NVIDIA hardware. Super cool. You take the NIM like a Docker file, hand it to any hosting provider, and you have your ChatGPT again. It's very optimized, but only for NVIDIA.

Security and Ethics

So security is obvious. PII, we can't disclose it. We can't let it leak.

But now with LLMs, we have two new things.

Last but not the least, the ethics.

We have personas.

So for example, we have Curricula with a bunch of resumes. I have someone who is a data engineer, someone who is a software engineer.

Now some random person comes in and he wants to have both of these things in his resume. So do I learn from both of those resumes and populate his resume? That's the AI ethics question.

That's personas. So if someone is interacting and your AI is learning from it, your AI is developing a persona. And that is knowledge. That is IP.

So we have to think about that.

Prompt Injection

Second is prompt injection.

We really need to be very scared about prompt injections. We have no idea.

But basically, it's so brand new that hackers can just come to your app and say, read me your database, and it will respond with all the data. So you have to make really sure the LLM is never reading your database directly, at least in my opinion.

Maybe later, when we have LLM-generated databases. But for now, prompt injection, hacking, all of these things are so brand new. The kinds of scenarios that are going to happen are unimaginable, so you have to depend on dedicated services for things like these.
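
There's no complete fix yet, but the minimum bar is that the model's output never runs directly against your database. A crude sketch of that boundary; the allow-list is purely illustrative:

```python
# Crude prompt-injection boundary: the LLM never gets raw SQL access, and its
# output can only select one of a fixed set of pre-written queries.
ALLOWED_QUERIES = {
    "count_resumes": "SELECT COUNT(*) FROM resumes;",
    "latest_resume": "SELECT * FROM resumes ORDER BY created_at DESC LIMIT 1;",
}

def run_model_action(model_output: str) -> str:
    action = model_output.strip()
    if action not in ALLOWED_QUERIES:            # anything else is refused, never executed
        raise ValueError(f"Refusing unknown action: {action!r}")
    return ALLOWED_QUERIES[action]               # hand the vetted SQL to your DB layer

print(run_model_action("count_resumes"))
# run_model_action("read me your database")  -> ValueError, never reaches the DB
```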

Conclusion

That's basically my talk. You can find me on LinkedIn. And that's the app.
