Hello, everyone. Very excited to have you here.
I am Miles Harrison, a data scientist, consultant, and trainer. I'm building my business, NLP from Scratch, which provides training and consulting services in data science, AI, and now large language models.
I'm very excited to talk to you today about running LLMs locally, that is, running large language models entirely on your machine using the Ollama framework. I'm going to talk through a little bit about the bits and pieces of what Ollama is, and I'm also going to do some hands-on demoing of what it looks like to work with large language models on your local machine.
Oh, and if you want to share anything you're seeing on social media, please share on the Mindstone social media and tag them, as well as the different handles you see here, which are all just NLP from Scratch.
Okay, so first quickly, who am I? You're not here to hear about me, but I'm Miles Harrison. I'm a data scientist. I worked in consulting for many years and then ran the data science program in a global tech bootcamp based here in Toronto for about five years. I'm also a meetup junkie. I go to a lot of meetups. I have for a very long time. And now I organize a number of groups here in Toronto as well.
I recently talked to someone about presenting at events like this, and they told me that when you present at events like this, especially when they're being recorded, you should be your best self. So yes, I am the guy in the jacket, but I also wanted to mention that I caught this nine-pound pike ice fishing on the waters of Lake Couchiching, and now this fact has been recorded forevermore for everyone to see.
Okay, so if you're here, you probably know what a large language model is, but for completeness, just quickly: what is an LLM? A large language model is a type of deep learning model that is trained on very large data sets and is also very large in terms of the number of parameters it has. So the L in large language model refers both to the size of the data, which is on the order of double-digit percentages of the internet, and to the size of the models themselves in terms of parameter count.
And I probably don't have to tell anyone in this room or watching this video that this new AI boom and AI hype cycle that we are in is because of the release and smash success of ChatGPT in November of 2022. Today, when we're talking about LLMs, there are multimodal LLMs that work with other media as well as text and language. I'll touch on that a little bit, but most of what we'll be working with today is just working with text-style chat models.
Do you know where your data are? If you got that reference, you are as old as me or older.
So yeah, a lot of concerns about LLMs and Gen AI and everything we're seeing now is around data privacy because the scale of these models is such that they have to be trained and run by companies that have the infrastructure to do so. So software development for working with Gen AI is looking a bit different than traditional machine learning.
And there are a lot of concerns about privacy and people doing things like going into a cloud-based model provider and putting PII and social insurance numbers and addresses and emails and stuff like that into them. So this is a real problem.
And this is one of the motivations for working with LLMs locally on your machine where you can be assured that the information never leaves your laptop. You can turn your Wi-Fi off and the LLMs are just running completely locally on your machine.
Okay, there's also what I call the double black box problem.
So LLMs are a type of deep learning model. So they're not interpretable in terms of how they're making their decisions.
And now we have what I call the double black box problem, TM, where not only are we using a deep learning model, but we are putting our data into some API. It's being sent over the wire to some provider. We don't really know the architecture of the model, how big it is, how it works, or what's happening to that data. And then we just get a response back through the API, and it's all very mysterious and magical what's happening in the middle.
So this is the kind of thing we would like to avoid, even if we can't necessarily avoid the black box nature of deep learning models, including Gen AI models and LLMs. What we really want, the goal here, is to just have the model on our machine, in our little bubble, no worry about data going anywhere else. I'm very self-centered that way. I hope you all are as well.
So that's why I want to introduce Ollama. This is an open source project. I do not get paid by them to promote this, and I am not affiliated with Ollama, but it is a great framework and a great tool that is seeing a lot of rapid adoption.
It is the framework for working with large language models locally on your machine. It works with almost every large language model you can think of, and as new ones come out, they are very quickly and very rapidly adding support for them.
If you're technical, you can run a local web server through a Docker image that they make available. It has libraries for development in Python, as well as many other popular languages.
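Just to make that concrete, here is a minimal sketch of what using the Python library might look like. This is a sketch under assumptions, not official documentation: it assumes the ollama Python package is installed (pip install ollama), the Ollama server is running locally, and you already have a model named llama3 pulled, so double-check the current docs for the exact API surface.

```python
# Minimal sketch: chat with a local model via the ollama Python package.
# Assumes `pip install ollama`, a running local Ollama server, and a pulled
# "llama3" model. Verify call signatures against the library's docs.
import ollama

# Single request/response: send one user message and print the reply.
response = ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])

# Streaming also works: pass stream=True and iterate over the chunks.
for chunk in ollama.chat(
    model="llama3",
    messages=[{"role": "user", "content": "Write a haiku about lakes."}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)
```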
And I said I wouldn't talk about it too much, but they have actually added multimodal support. So you can do things like use the LLaVA model, which is kind of like an open-weight GPT-4V, to do multimodal things completely locally on your machine.
Okay, but how does Ollama do this? Ollama is actually a framework written in Go that is built on top of llama.cpp. This is a framework that implements large language model (transformer) inference entirely in C and C++.
So it's very, very computationally efficient, because the computations are, as we say, close to the metal. We don't see the overhead from using Python and frameworks like TensorFlow and PyTorch. So llama.cpp is a very efficient implementation of large language model inference, it is very actively developed, and new models are being added all the time.
And we have Georgi Gerganov to thank for doing this work, as well as many other contributors to this open source project.
So what models can we run? I already basically said that if you can think of a popular LLM, you can probably run it on Ollama. This is not an exhaustive list, but some of the usual suspects: the incredibly popular Llama series of models from Meta, which give the framework its name; the Mistral models; models from Stability AI; the Gemma models from Google, which are their kind of open version of Gemini; Command R from Cohere, based here in Toronto; other popular open models like Yi and Qwen from China; the lightweight Microsoft Phi models; the very popular Falcon model; and so on and so forth.
Now I'm going to break out of these slides by clicking on this link. If my Wi-Fi will cooperate, that's great. Oh, now it's deciding to cooperate. Okay, great.
If you want a comprehensive list of everything that is available, you can go to ollama.com/library. They have a list of all the different models that will work with Ollama, and you can pick your favorite model and just search for something like falcon, then download it and work with it in Ollama, which we're going to do shortly.
I'm so glad that this is being recorded. It only does this when you are being recorded.
It's a great day to be alive.
Thank you to my mobile provider for that candid moment.
And is this gonna work or not? No, okay, well, oh wow, this is special. Okay, I think I'm just, there we go.
Okay, so here's llama3 anyway, and it has a list of the model and the size of the model as well. Okay, back to the slides.
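As a quick aside before the demo: once you've picked a model from the library, pulling it down programmatically might look roughly like the sketch below. The calls assume the ollama Python package and a running local server, and the CLI equivalent would be ollama pull falcon, so treat this as an illustration to verify against the docs rather than the definitive way.

```python
# Rough sketch: download a model from the Ollama library and confirm it's
# available locally. Assumes `pip install ollama` and a running Ollama server.
# The CLI equivalent is `ollama pull falcon`.
import ollama

ollama.pull("falcon")  # download the model weights to the local machine

# Listing models mirrors what `ollama list` prints in the terminal.
print(ollama.list())
```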
So now I'm gonna do the demo and hopefully that goes better than my internet provider because it is all based locally on my laptop. So what I'm gonna do here, I am running a Windows machine and I am gonna open a PowerShell terminal and I'm gonna make the text bigger so people can read it.
The first thing I'm going to do is run... Oh yeah, by the way, Ollama runs on all platforms. You can install it on Windows, Mac, or Linux. So it's a native install in Windows as well.
So I'm going to start the Ollama server first by typing ollama serve, and that will run the Ollama server locally on my machine. Then I'm going to open a separate tab, and now that the Ollama server is running, I can type...
ollama list to get a list of the different LLMs that I have locally on my machine. You can see I have phi3, llama3, llama2, and mistral. I don't have a lot on this machine; I downloaded these some weeks ago.
So then, if I want to drop into an interactive session to chat with the LLM locally, I can type ollama run and then the name of the model. So here I'm going to do llama3, and we will see some gears grinding as Ollama loads the model through the llama.cpp library. And this will also take a long time, because I am presenting, just like how when you wash your car, it will rain.
So now I'm in an interactive session, and you can just type as you would interact with any LLM, and it will stream the responses into your terminal. So here I'm gonna say, who made you... actually, I'm not gonna say that. I'm gonna say, write a poem about applesauce. And now Llama 3 will happily write a poem about applesauce and stream the results to my terminal. And this is 100% local.
So no information is going over the wire as this is happening. If you don't believe me — why am I doing this in the demo, then? I don't know — you can take a look at your system monitor, at the GPU and CPU.
So when you write a message, your GPU or CPU is actually rendering the results, actually doing the processing. So here I can say, please write another one. And now it will stream the response. And you can see that it is actually using a lot of memory and should be hitting my GPU as well to actually stream that response.
Okay. So there you go: local LLMs in your terminal, on your laptop, no data being sent over the wire. To close the session, we just type /bye and we're back in the terminal. Okay.
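And if you'd rather script this than sit in the interactive session, you can hit the local REST API directly. The sketch below assumes Ollama's documented defaults (port 11434 and the /api/generate endpoint) and the requests package; verify the payload details against the current docs. Nothing here ever leaves localhost.

```python
# Rough sketch: call Ollama's local REST API instead of the interactive CLI.
# Assumes the server is on the default port 11434, llama3 is pulled, and
# `pip install requests`. Everything stays on localhost.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Write a poem about applesauce.",
        "stream": False,  # return one JSON object instead of streamed chunks
    },
)
resp.raise_for_status()
print(resp.json()["response"])
```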
So the other amazing thing is that very recently, in February of this year, the people who develop Ollama — who are very smart — made it interoperable with the chat completions API from OpenAI. So if you have any application that currently uses the OpenAI API, you can make three small changes in your code — the name of the model, the API key, and the base URL — and you will now be making calls locally to your LLM instead of sending data to OpenAI, and you can test other types of LLMs with the applications you've already built. You can read about this on their blog.
So this is what it would look like before. We have our laptop, and we're doing development locally on the left.
Oh no, I'm presenting so I knew everything would go swimmingly. There we go. OK.
We hit an API, it goes to OpenAI, some magic happens, and then we get a response. And our model is GPT-4, or I guess now GPT-4o.
And then we instantiate a vanilla OpenAI client from the OpenAI Python library. And I don't need to specify anything because my API key is stored as a secret and so forth. OK, but what I want to happen is to not pay money for my API requests. So I actually don't want to use that API.
And now I have to ask: OK, well, what are my model and my client going to be? Well, now I just use the name of the model that is stored in Ollama. And my client is still the OpenAI client from the OpenAI Python library; I just need to specify a base URL pointing at my local machine, and then set the API key to the word ollama, which is not actually used but is a placeholder that is still required for this to function. So now I am making API requests 100% locally to the Ollama API web server, which doesn't cost anything except compute on my machine.
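To make that concrete, here is a minimal sketch of the swap, assuming the v1-style OpenAI Python client and a llama3 model already pulled in Ollama. The base URL and placeholder key come straight from what I just described; the rest is boilerplate you'd adapt to your own app.

```python
# Minimal sketch of the swap just described: same OpenAI Python client,
# pointed at the local Ollama server. Assumes `pip install openai` and a
# pulled llama3 model.
from openai import OpenAI

# Before: the hosted OpenAI API (key picked up from the OPENAI_API_KEY env var).
# client = OpenAI()
# model = "gpt-4o"

# After: point the same client at Ollama's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # local Ollama server
    api_key="ollama",                      # placeholder; required but not checked
)
model = "llama3"

completion = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Who made you?"}],
)
print(completion.choices[0].message.content)
```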
So I guess I should also demo that. So what I'm going to do now is go to VS Code.
And here I have some Streamlit code running a chat application. And I'm not going to go too much into details. There's actually not a ton of code here.
But what I do want to point out is that I have a variable called LLM model, which here is gpt-4o. And here I also have my vanilla OpenAI client. And then the rest of the stuff here is all just making the request to the OpenAI API and rendering the results in the chat client.
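To give a rough sense of the shape of that code, here is a reconstruction of that kind of Streamlit chat app. It is not the exact code from the demo: the variable names (LLM_MODEL, the session-state key) and the structure are assumptions, and it assumes the streamlit and openai packages plus an OPENAI_API_KEY in the environment.

```python
# Rough reconstruction of a Streamlit chat app of the kind shown in the demo.
# Names and structure are assumptions, not the original code. Run with
# `streamlit run app.py`; requires `pip install streamlit openai` and an
# OPENAI_API_KEY in the environment.
import streamlit as st
from openai import OpenAI

LLM_MODEL = "gpt-4o"   # the model name that later gets swapped for "llama3"
client = OpenAI()      # vanilla client; API key picked up from the environment

st.title("Chatbot")

# Keep the conversation in session state so it survives Streamlit reruns.
if "messages" not in st.session_state:
    st.session_state.messages = []

# Re-render the conversation so far.
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.write(msg["content"])

# Take new input, send the full history to the model, render the reply.
if prompt := st.chat_input("Say something"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.write(prompt)

    completion = client.chat.completions.create(
        model=LLM_MODEL,
        messages=st.session_state.messages,
    )
    reply = completion.choices[0].message.content
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.write(reply)
```

Swapping an app like this over to Ollama is then just the model name plus the base_url and api_key arguments, which is exactly the live edit coming up.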
So hopefully this goes better than my mobile provider. Now I'm going to go back into my terminal — well, I need to be on my desktop — and say streamlit run and then this file, which is my OpenAI Ollama chatbot. So this is a Streamlit application.
And now when I type in here and say something like, who made you, when I hit the Enter key, this is making a request to the OpenAI API. And you can see that GPT-4o responds and says, I was created by OpenAI. So this is using the OpenAI API, and I'm getting charged on my account through my API key for this. But that is not what I want to happen.
I want to do local development. So I'm going to show you live now everything you have to do to swap an Ollama model into this existing application. I'm going to go back to my code editor, and all I'm going to do is change gpt-4o to llama3. And then I'm just going to add two things here to the OpenAI client. I'm going to add the base URL, which is going to be http://localhost:11434/v1, and then I'm also going to add an API key, which is just going to be the word ollama, and that is literally it. So now I have to do the hardest part, which is remembering to save the file by hitting Ctrl+S.
And now if I go back to Streamlit, when I make a request here in the chat app, it will actually be hitting the Ollama server that is running on my local machine. So we should see a request come up here when I interact with the application. So I'm going to go back here and say, who made you? And you can see now there is a delay, and that is because it is loading the model into memory. The first time you hit it, there will be a bit of a delay as it loads into RAM. But now we see a response being streamed by the Ollama library here, and you can see it says, I was created by Meta AI. So that's no longer the OpenAI model; that's now the local Llama 3 model running on my machine. Okay?
So I literally made a few tiny code changes to my application and went from a web-based model provider to 100% local LLM development. That's not the end of the presentation, but thank you. Okay. Yeah, don't thank me. Thank the LLM developers and the developers of llama.cpp. Okay, I don't need to do that. We already did that. No, no, we didn't do that. Yeah, OK. But wait, there's more. Thanks, Billy Mays, RIP.
There are actually a number of GUI clients that you can run locally that have Ollama support. Some, like LM Studio, are based on Hugging Face, which is a Python framework, so evaluation and inference are a little bit slower. But the different GUI frameworks I'll mention here are all compatible with Ollama.
So they're still running inference in C through llama.cpp, where it's much faster. A couple here: Open WebUI, which was originally the Ollama Web UI, is very advanced.
It can do things like RAG and annotate data. So there's lots of different use cases for that. Does require a little bit of technical knowledge as you do have to run it in a Docker container.
AnythingLLM is a little bit easier to use and has a fair number of features, and I'm just going to demo that quickly. And then there are other applications you can use as well, things like Jan, which I think is more for people on Mac, and then Chatbox, which works with many, many different LLM frameworks and providers, including Ollama.
So now I'm going to go to the final part of the demo. I already have my Ollama server running, so all I need to do is launch AnythingLLM.
And I am also not associated with the developers of AnythingLLM. And the application will launch, and it will take a long time because I am presenting. Is it on a different screen? What is happening?
There it is, okay. Do I need, oh, I need to minimize this. Is that what's happening? How about now?
How about now? How about... Okay. Yeah. We gotta make sure this is still recording while it doesn't work.
OK, once more with feeling, shall we? Once more with feeling. I said that this wasn't going to happen. And then now I'm enjoying it because I'm just making all these jokes about it.
And it's kind of fun. Okay, well, I'm unimpressed. So I'm just gonna wave, wave, wave my hands past that part. And instead, I'll do something more difficult, which is sure to crash and burn.
So anyway, you can check out AnythingLLM and take my word for it. But I will demonstrate using Open WebUI instead. And that one you can't actually use with the server run the way I ran it earlier in the terminal.
So you do need to actually run the server as it was installed. I'm just going to check that Ollama is running in the taskbar here. So that looks good. And now I'm going to run Docker.
And I have actually set up my machine to only have a single Docker instance — image, container, whatever you want to call it — running, which is going to be the container for Open WebUI. And you can see that it has an application running on port 8080. So I'm just going to click on that link. OK.
How about now? There we go. OK. Thank god.
So here is the chat application. And this is using Ollama that's running on my machine. And you can actually choose the different models that you want to use. So you can chat with different LLMs locally.
Let's say I try the Phi model, because it does inference a lot faster and it's smaller. And I say, who made you? And there will be a delay, as usual, as the model gets loaded into memory. But once it is loaded, you now basically have a UI that's like ChatGPT, just running locally on your machine, and no information is going over the wire at any time.
OK? So that's one framework. Again, I had some trouble with AnythingLLM, but there are also other GUI applications that integrate with Ollama. OK.
So that is pretty much it for my talk. If you are interested in more about Ollama and doing local development work with LLMs, I highly encourage you to check out the LocalLLaMA subreddit on Reddit. It's a very, very dedicated community of hackers and technologists and developers interested in working with local LLMs, including but not limited to the Llama models.
I already mentioned some of the official resources from Ollama themselves. Check out their GitHub for documentation and their blog for announcements and new features.
There's also an official Discord for Ollama, but it is quite the stream of consciousness at the moment. So if you choose to wade in there, just be aware of how it looks at this time. But there are lots and lots of people working very actively with Ollama there as well, and you can seek support there.
Okay? So that is now it. Thank you very much, and I hope you enjoyed the presentation.