So welcome, everyone. My name is Pranjal Jain, and today I'm going to explain large language models and how to use them on your local system. By the end of 15-20 minutes you will know how you can run your own large language model for free. So let's get started. Oh yeah, about my introduction: I work at a company as an AI/ML developer.
For the presentation, you can scan this code quickly. So even if you're at the back and cannot read the slides, you can see them on your phone. It's better to just scan it and get the presentation on your phone.
So the question is: how many of you use AI on your own system? Does anybody?
And which service do you use? Ollama. Ollama.
Ollama, with which model, Llama 2 or the 3.1 that came out recently? Which version? Yes. No, no, no. Ollama with which version? Which model, basically, is what I'm asking.
Okay, so we have one guy who does. Yes, let's start.
So the agenda is: what is a large language model. We will go through an overview, a little bit of theory, so that you have an understanding of what we are talking about, and then we will have a demo.
I will show you, run the code, and we will discuss, using ChatGPT or other services, which one is the best and how the performance compares.
So let's get started.
So a large language model, in very, very easy words, is a large neural network trained on internet-scale data. It's like you have the internet copied onto your system, into the model. In easy words, that is a large language model.
A few of the services everybody knows: we have ChatGPT from OpenAI; we have different models from Google, like Bard, and today we will use Gemini, which I will show you; and then we have the Llama models from Meta (Facebook). Here is an overview of how a large language model works: you have all the resources, which go into the neural network architecture. It trains, trains, trains, learns, and then gives you the information.
So for this presentation, we will be working with something that runs on your own personal information.
Say you have written 15 different articles, or you have 15 different documents from your company, and you want to use that information. And let's say you cannot upload it to ChatGPT.
There may be a legal or GDPR restriction that says you cannot share the information, because these services keep it. So I will tell you how you can get around that: you can use a large language model locally, upload whatever you want, and ask it to work with the same efficiency as GPT-4.
Here is one more slide, with detailed information about the main components: large language models are based on transformers, and there are a few more things.
I will skip this part, because it's something for when you want to get deep into it. But keep it for the future: if you are learning step by step and want to learn more, this is the part you have to learn.
And yes, I can explain this part a little. There are different types of architectures: encoder, decoder, and encoder-decoder both. In short, what you have to know for today is that ChatGPT uses a decoder model.
Other models use the encoder type: they encode the information first and then give it to you. ChatGPT decodes afterwards and then presents it to you.
And you don't have to worry about much detail, because we are not going to use any of them directly. But it's important to know.
So, different ways to build an LLM application. This is what we are interested in.
The first is prompt engineering. You go to ChatGPT, open it, type your prompt, and get the information back. That is prompt engineering (there is a small code sketch after this list).
The second is RAG, retrieval-augmented generation. Today I will be covering this kind of application.
This is our main focus. Next is fine-tuning the LLM models. Fine-tuning is something a company usually does: you have a large dataset, you train the model on your dataset, and then you ask, hey, what is happening in my company, and it responds to your questions using your information. So that is fine-tuning. And last, there is training an LLM from scratch.
That one we cannot do. You cannot train a model on your own system; it will explode.
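To make the first option concrete: prompt engineering is just a carefully worded string sent to the model. A minimal sketch, assuming the Gemini SDK and a configured API key (the model name and prompt wording are invented for illustration):

```python
# Sketch: "prompt engineering" is nothing more than structuring the text you send.
# Assumes `pip install google-generativeai` and genai.configure(api_key=...) done.
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")  # model name is an assumption
prompt = (
    "You are a careful reviewer. Summarize the following article in exactly "
    "three bullet points, then name one risk it overlooks.\n\nArticle: ..."
)
print(model.generate_content(prompt).text)
```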
Yeah, please. Can I? Yeah, sure. I have a question. Yeah, please.
I tried asking ChatGPT, but I don't understand one thing: are LoRA and fine-tuning the same, or?
Yes. Actually, we're going to do a flash Q&A at the end, so can we finish the presentation first? But yes, it's the same.
So that is prompt engineering, RAG, fine-tuning, and training from scratch. Okay, so now let's get started.
So here we see ChatGPT, which is an application on top of the large language model, and we have the OpenAI API key. These are the keys we are going to use to build our personal LLM application on a local system.
So we can build an application, anything, using ChatGPT. But we cannot just use the API directly to build it, because there are a few things we need to know.
There is the higher cost, no access to private data sources, and no live data, because we are just using the model inside our system.
So to overcome these challenges, we are using LangChain. It's a framework used to build large language model applications. In short, for this short presentation, just remember: we are working with LLM models using LangChain, and you're good to go.
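As a picture of what "working with LLM models using LangChain" means in practice, here is a minimal sketch; the package and model name match the Gemini setup used later in the demo, but treat them as assumptions:

```python
# Minimal LangChain sketch: one prompt in, one answer out.
# Assumes `pip install langchain-google-genai` and GOOGLE_API_KEY set.
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro")
response = llm.invoke("Explain retrieval-augmented generation in one sentence.")
print(response.content)
```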
There are a few companies, the top companies: OpenAI with ChatGPT, Meta, and Google.
Also, I have to add NVIDIA, because recently they released a very good model.
Now let's switch to the working part.
I hope we have enough time. I have uploaded all the code to my GitHub; I will share the links in the presentation.
There is a link, so you just copy it. There are full instructions on how you can run it on your own system. If there is some issue, just let me know.
So right now, when you open the GitHub repository, just clone it. The first step is to install all the requirements: pip install the requirements file.
The second step is to create an environment for the application to run in. I have already created it. Then you have to add one more piece of information, which is the Google API key. Let me show you how to get the Google API key.
I provided the link. It's aistudio.google.com.
And from here, you click on Generate Key. Once you click Generate Key, you specify the project and generate it. So you get it generated.
In some cases, it's possible you cannot get access; they ask for billing. Just use a VPN and change to another country.
I used India; it worked well. And you will get it for free, unlimited.
When I was doing this a month ago, Europe was not allowed. So use one of the countries outside.
So from here, copy the API key and put it in the .env file. Once you paste it, that's all; you don't have to worry about anything.
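For reference, the application reads that key at startup. A minimal sketch, assuming the python-dotenv package and a .env file containing GOOGLE_API_KEY=...:

```python
# Sketch: load GOOGLE_API_KEY from the .env file and configure the Gemini SDK.
# Assumes `pip install python-dotenv google-generativeai`.
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()  # reads .env from the working directory
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
```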
You just run the application and it will execute. I will explain how it works, so let's try to run it, and then we will see how things work. Okay.
So this application uses Streamlit as the framework, together with LangChain, to build the application. Here, if you see, we have a menu where you can upload any number of PDF files.
There is no limit. You can upload thousands of documents.
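The upload menu is only a few lines of Streamlit. A sketch of that part (labels and details are assumptions, not the exact repository code):

```python
# Sketch: a Streamlit sidebar that accepts any number of PDFs and extracts text.
# Assumes `pip install streamlit PyPDF2`; run with `streamlit run app.py`.
import streamlit as st
from PyPDF2 import PdfReader

st.header("Chat with your PDFs")
question = st.text_input("Ask a question about the uploaded files")

with st.sidebar:
    pdfs = st.file_uploader("Upload PDF files", type="pdf", accept_multiple_files=True)
    if st.button("Submit & Process") and pdfs:
        raw_text = ""
        for pdf in pdfs:
            for page in PdfReader(pdf).pages:
                raw_text += page.extract_text() or ""
        st.success("Done")
```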
And then you can ask whatever your question is. So let's try if it's working or not.
So once you apply, it's submitted. Let me explain what is going on in the back.
So we have a PDF. It has many words, many letters. What the system does is take the text and divide everything into small parts, which we call chunks. It splits all the full phrases into smaller pieces, the chunks, and then each chunk goes into...
Yes, here: the vector embedding. It converts it, because the model cannot understand letters. It converts everything into numbers, like 0.1, 0.2. So it converts everything into embeddings. And then it's memorized and stored locally; here is the place where it's stored. So you put the information in, it's cracked into small chunks, and then it's stored.
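In code, that whole back-end step, split into chunks, embed, store locally, looks roughly like this. A sketch: the chunk sizes, embedding model, and index name are assumptions:

```python
# Sketch: split text into chunks, embed them, and save a local FAISS index.
# Assumes `pip install langchain langchain-community langchain-google-genai faiss-cpu`.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings

raw_text = "text extracted from the uploaded PDFs goes here"

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_text(raw_text)

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
store = FAISS.from_texts(chunks, embedding=embeddings)
store.save_local("faiss_index")  # everything stays on your own disk
```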
And then you can ask anything, and it can go through them one by one. So we have submitted. You can submit more documents. And then let's see.
What is quantization? It's not in the PDF file. If it has no information, it should say, no, we don't have the answer. Or if it has the answer, it will show the answer.
"Answer not available in the context." So this is how you make sure that it's not giving false information; it's giving correct information.
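That "answer not available" behaviour comes from the prompt, not from the model itself. A sketch of the question-answering step under the same assumptions (prompt wording and names are illustrative):

```python
# Sketch: retrieve the most similar chunks and answer only from that context.
from langchain.chains.question_answering import load_qa_chain
from langchain.prompts import PromptTemplate
from langchain_community.vectorstores import FAISS
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
store = FAISS.load_local("faiss_index", embeddings,
                         allow_dangerous_deserialization=True)  # newer versions only

prompt = PromptTemplate(
    template=(
        "Answer the question from the provided context only. If the answer is "
        "not in the context, say 'Answer not available in the context.'\n\n"
        "Context: {context}\nQuestion: {question}\nAnswer:"
    ),
    input_variables=["context", "question"],
)
llm = ChatGoogleGenerativeAI(model="gemini-1.5-pro", temperature=0.3)
chain = load_qa_chain(llm, chain_type="stuff", prompt=prompt)

question = "What is quantization?"
docs = store.similarity_search(question)
print(chain({"input_documents": docs, "question": question})["output_text"])
```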
Okay. Let's try something else.
What is hybrid? I specifically made a document which shows how we can use a RAG application in a hybrid way, so it's trying to demonstrate the results. And we will test some more models after this.
You see, we got the answer. And now, if we make some spelling mistakes, okay, let's make something weird, let's see if our model can understand or not, because usually when we type into ChatGPT we make mistakes, but ChatGPT understands everything.
Let's try whether it can cope or not. So usually it gives the answers. And yeah, you see we made a spelling mistake, but it still gives the answer.
So it means it has that understanding of the language you work in. So this is the very first application.
Let's move to the second application.
So this application is a question-and-answering chatbot, basically a clone of ChatGPT. You ask a question, it gets the answer back.
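A chatbot like this needs barely any code. A minimal sketch straight against the Gemini SDK (model name assumed):

```python
# Sketch: a bare-bones terminal Q&A chatbot on top of the Gemini API.
import google.generativeai as genai

model = genai.GenerativeModel("gemini-1.5-pro")  # assumes genai.configure() ran
chat = model.start_chat(history=[])

while True:
    question = input("You: ")
    print("Bot:", chat.send_message(question).text)
```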
OK. Does somebody have an interesting question to ask? Yeah, please.
When you send, in the previous application, I understand that the model understands the intent and the entity, and then sends that to the vector database. Yep, that's right. And then it fetches back from the vector database.
So let's say we have this question, what is quantization, again with wrong spelling, wrong wording. And let's see if it can answer or not.
This is running slow because we have an internet issue; otherwise it's super fast. And you get the response. You have everything, with the code included.
So here we are using Gemini 1.5 Pro, the latest model. That's why we are having the best results right now as compared with GPT-4.
And now, it is also doing the same thing. If we look at the code, it's doing the same thing: taking the text, processing it using the Gemini API, and then giving you the response back as an application.
So here, let's move to one more application. This one is more interesting: it takes images as input and you get the response. You can ask anything and it can explain the image.
So we can test this in the Gemini application and see how it responds; at the same time we will see that, okay, the accuracy is very high. Okay.
Linear regression. Let's take some images and ask ChatGPT, and at the same time Gemini, to explain the images. And you see we have less than 50 lines of code in total.
So that's the beauty of a Streamlit application, and that's the beauty of the API. You're doing nothing, just using about 50 lines of code and getting your results. It's not difficult.
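The image part of the demo is only a few lines too. A sketch of sending an image plus a question to the multimodal model (the model and file name are placeholders):

```python
# Sketch: ask the multimodal Gemini model to explain an image.
# Assumes `pip install google-generativeai pillow` and a configured API key.
import google.generativeai as genai
from PIL import Image

model = genai.GenerativeModel("gemini-1.5-pro")  # vision-capable model assumed
image = Image.open("activation_function.png")    # hypothetical file name

response = model.generate_content(["Tell me about this image.", image])
print(response.text)
```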
Just install it, copy the same thing, and run it on your local system. Okay. I will take another image.
It's taking a long time. So this is an activation function from machine learning. Now we will ask it, okay, tell me about the image, and let's see if it can understand or not. And we will ask the same question to ChatGPT, and it will...
Okay. So it understands and gives the response back. And now suppose you think, no, that's not what we are looking for. So we can write something like, please give me the code for the image.
Okay. Now let's try to make some spelling mistakes: "image the code". So let's see whether our model is efficient or not.
If it's efficient, we will get the response. Yep. We get something. Not exactly, but yes, we get it.
When I tested and compared many of these, it works very well, and you don't have to pay for anything. You just use it. You upload your files, your project work, your personal details; just upload it.
If it's in image form, use the vision one, because it uses a multimodal model, one that takes images, and it works very well. I have tried it on a few of them.
Now, here I prepared a few which also keep the history. So if you have a conversation, it records the history. That's in the chat-to-PDF and QA-to-PDF apps. They keep a record of whatever you type, so that you don't lose it somewhere.
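Keeping that history in a Streamlit app is usually done with session state. A sketch (the ask_model helper is a hypothetical stand-in for the real LLM call):

```python
# Sketch: record the running Q&A history in Streamlit's session state.
import streamlit as st

def ask_model(question: str) -> str:
    # hypothetical stand-in for the real LLM call
    return "stub answer"

if "history" not in st.session_state:
    st.session_state.history = []  # list of (question, answer) pairs

question = st.text_input("Ask a question")
if question:
    st.session_state.history.append((question, ask_model(question)))

for q, a in st.session_state.history:
    st.write(f"You: {q}")
    st.write(f"Bot: {a}")
```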
And vision is the last one. So overall, this is a very interesting one; you can try to implement it.
And now, it's not in the presentation, but I'm thinking we have enough time. Do we have enough time?
Now, we should do the Q&A. Okay.
So this is: you can use ChatGPT, oh no, Gemini, in all of these use cases. There is no restriction, no GDPR issue; everything is personal, on your local host, because you can see it's stored here in pickle format. So you can upload thousands of things, nobody is going to ask, and you get your things done.
So that's the best thing. And then the rest: fine-tuning is more on top of it. First try to go with the basics, and then jump to fine-tuning.
Applications: I think everybody knows the applications; we can use it for coding, sentiment analysis, text generation. And how we use it in our company is that we train the model and try to provide forecasting.
But for the forecasting we use something different: we use an LSTM model, which is based on series data, where you have a time series and feed it in. So we are not using the LLM for forecasting, but we are using the LLM for different services, like analyzing all the reviews: you have got the reviews, check whether this one is good and that one is bad. Also, it's multi-language, so it can understand, okay, what is happening in Germany, what are they saying in Spain.
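For contrast, a toy sketch of the kind of LSTM forecaster mentioned here; all shapes and numbers are invented for illustration:

```python
# Toy sketch: an LSTM for one-step-ahead time-series forecasting with Keras.
# Assumes `pip install tensorflow`; the data here is random placeholder data.
import numpy as np
from tensorflow import keras

X = np.random.rand(100, 30, 1)  # 100 windows of 30 past values each
y = np.random.rand(100, 1)      # the next value to predict

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(30, 1)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)
print(model.predict(X[:1]))
```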
The future, and a comparison. Yeah.
Also, don't think we can use LLMs everywhere. We cannot use generative AI for everything. We have to make sure where we can use it and where we cannot.
There are a few use cases which need human-like answers; there you use generative AI. But you cannot base everything on LLMs.
Okay, so in conclusion, for today I would say: congratulations. You now know how to make your own ChatGPT-like model, and you can install and run it on your system.
So just try it.
Here are the resources.
With NVIDIA, they also provide good models, which are better than, or at least comparable to, GPT-4 or Gemini; everybody is fighting over which one is first. But for you: use Gemini 1.5 Pro, create the free API key, install it on your system, and you can use it without restriction and without any worry. You don't have to worry about anything. And yes.