The future of LLMs: how to use local LLMs by Ian Broom

Introduction

Hi, everybody. Thanks for coming out this evening.

Quick show of hands, who has played with LLMs on their local machine? OK, cool.

Everybody else in the room, we're going to show you... Does this work? Oh, it does.

Everybody else, we're going to show you how to do it tonight. And you don't need to know any code. So hopefully, you'll be able to basically take most of what we show you and apply it.

Presenters and Their Background

My name's Ian. This is Deb.

I'm the CEO of Fliplet. Deb's our AI tech lead.

And for those of you who haven't seen me present at this event before (Josh keeps asking me to, which is very nice of him; thank you, Josh): Fliplet is a no-code or low-code app builder, we have a lot of AI capabilities, and we're always adding more.

But we're not here to talk about Fliplet.

Pros and Cons of Local LLMs

We're here to talk about the pros and cons of local LLMs. So what is a local LLM? You've probably heard of these different models by now. Llama 2 is very popular. You've got Mistral. You've got Phi-2, which is Microsoft's latest local model, very much designed so that you're able to run it on your local machine. Stability AI, a UK company, is building out some really amazing models. Hugging Face produced Zephyr, and there are heaps and heaps of other ones. There are literally new local LLMs, or LLMs that you can run yourself, or open source LLMs, depending on how you want to describe it, coming out every single day. In fact, Deb and I prepared this presentation and then had to add more to it today because new models had come out so recently.

So what are the pros of using an LLM locally? As you can see, the list is quite long, so I'm going to rattle through these so we can get to the demo. The first is that it runs on your hardware. Somebody asked a great question earlier: how do you control the security of your data? Well, one great way to do it is to run it on a server that you own, on infrastructure that you control, or on your local machine. The other thing is that local LLMs are progressively getting faster as people figure out new ways to build and execute LLMs on more commoditized hardware. Reliability is a really big one. I was very frustrated using ChatGPT on the weekend because it would just stop halfway through a conversation. Very annoying.

Consistency: one of the things that is really relevant in the context of AI is the fact that your underlying model could change. OpenAI can just come along and update it at any point. Running locally gives you the ability to build on top of something you control, so you know when it changes. Cost control: a lot of people are concerned about how much LLMs will cost if you're using online LLMs. If you're running it on your local machine, you've already (hopefully) purchased the machine, so the cost shouldn't be changing. You've got security, as I mentioned. You've got a whole variety of different models and different model sizes. Deb has been able to test some, and we're going to show you some benchmarks that he explored very hands-on later in the presentation. You can evaluate lots of different models, and the mechanism we're going to use for demoing today will show you how quick and easy it is to test new models when they come out. And you'll be able to fine-tune your local LLMs, assuming they support fine-tuning locally, which is also something that can potentially save you a lot of time.

One of the main reasons I started playing around with local LLMs is, first, I had a long flight and I didn't want to be dealing with crappy Wi-Fi trying to access an LLM, and, second, I wanted some consistency and to start integrating it into different things on my machine. So I started to play around with different models to see if this was achievable, and honestly I was blown away. And this was a couple of months ago, maybe December, and the LLMs you can download now have only gotten better. The other thing is, I think most people know now that GPT-4 is impressive partly because it's a mixture-of-experts model, or MoE model. Well, you can now not only download MoE models and run them locally, you can also create your own, and we'll touch on that in a second. And then finally, everybody thought, oh my god, Gemini 1.5, a 1 million token context window, this is going to be so amazing. And then Berkeley came out with a model you can download and run locally that supports 1 million tokens in its context window. I'm surprised that hasn't been picked up more, because it kind of pulls the rug out from underneath Gemini 1.5 and suggests that local LLMs might now start to surpass hosted LLMs on context window size, which is pretty exciting. If you want to find out more about that, it's called Large World Model on Hugging Face.

The cons. Some models eat your machine alive, trust me. I have run local models and have borderline not been able to get out of them because I couldn't get the mouse to the stop button; it was literally eating every piece of resource my machine has. The cost can also be quite high: if you have to get a big machine or pay for a big server, then obviously the cost could be high. You might need some technical skills, but hopefully we'll demonstrate today that you don't need too many. You have to do the maintenance, whereas the advantage of just using Anthropic or OpenAI is that they're responsible for maintaining the model, improving it, making it safe, et cetera, and you don't have to worry about scaling it. Typically, you get a smaller context window. Although, as I said, we had to update the slides as we went because bigger context windows came out, the vast majority of local models you'll be able to access do not provide big context windows. Parallel processing is difficult; most of the models are designed to handle a single stream, so you have to figure out how you're going to deal with multiple requests if you were going to use it as back-end infrastructure. And reasoning is typically worse than GPT-4. I don't think there are really any local models that currently compete with GPT-4, so manage your expectations.

Yeah, so why would you use a local LLM? Well, as I said, I was going on a flight. To be honest, and don't tell people outside this room, I feel a little bit less smart if I can't access LLMs anymore. So knowing that I was going to be stuck on a plane working for about eight hours, I wanted to have the power that an LLM provides available. That's why I started to play around with local LLMs. But I also realized, as soon as I started to get them working, that all of a sudden I don't need to worry about how many documents I'm summarizing or how many documents I'm categorizing. All of those data processing requirements where you'd sit there and think, how much is the OpenAI bill going to be? Now, if I have a chunky Mac like Deb does, I can actually get through a hell of a lot, as you'll see in a second. There's offline access, as I've mentioned. You can get lots of different models: general knowledge models, chat models, maths models, code models. So you can pick a model that does exactly what you want; you don't have to use a general model. You can remix your own, which is quite exciting. There are technologies out there called ST-MoE or SegMoE, if you're interested in looking them up. They're quite technical, but basically they mean you can make your own models out of multiple other open source models. You can do image generation, you can do voice transcription and generation, and you can now even do video generation as well, although the videos are not good. Don't expect Sora-type quality off your local machine.

Demonstration of LM Studio

Hi, thanks, Ian. So basically, what I'm going to show you uses a piece of software called LM Studio. Some of you may be familiar with it. It's basically a nice UI for downloading models.

It has its own chat interface. It can even run local servers. So you can quickly test out Python code or Node.js, basically JavaScript code. But yeah.

These are some rough tests that I did on four different models, at least the four popular ones right now. The way I tested this is that, in the same chat, I gave each one a business question (kind of an analytics or marketing type question), then a code question, and then I asked it for its opinion on something, all in the same chat, to see how well it would perform. And the tokens per second is basically how quickly it can respond on my local machine.

Most of the time, 25 tokens per second is way faster than you can read, so it's not really an issue; 50 is on the higher end here. Overall, I found the Microsoft Phi-2 model to be really good, generally speaking. And there's StableLM, which turned out to be really good at business use cases and reasoning. So it depends on your use case.

Well, 'good' is slightly better than 'okay', and 'okay' is medium, basically. That's it.
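
To give a sense of how you might reproduce this kind of rough tokens-per-second measurement yourself, here's a minimal Python sketch. It assumes the LM Studio local server shown later in the demo is running on its default localhost port (1234), that a model is already loaded, and that the openai Python package is installed; the model name and prompt are just placeholders.

```python
# Rough tokens-per-second check against a local OpenAI-compatible server.
# Assumes LM Studio's local server is running on localhost:1234 with a model
# loaded; figures are approximate and depend heavily on your hardware.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed-locally")

start = time.monotonic()
response = client.chat.completions.create(
    model="local-model",  # placeholder; the server uses whichever model is loaded
    messages=[{"role": "user", "content": "Give me three marketing ideas for a no-code app builder."}],
)
elapsed = time.monotonic() - start

completion_tokens = response.usage.completion_tokens
print(f"{completion_tokens} tokens in {elapsed:.1f}s "
      f"~ {completion_tokens / elapsed:.1f} tokens/sec")
```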

So while Deb kind of does the demo, I'll kind of explain what he's doing just so that he can use his hands. So basically, this is LM Studio. So when you first load up LM Studio, as you scroll down, you can see it just literally lists all these models. And all you do is just click Download, and it starts downloading it.

You don't have to go out. You don't have to find the model. You don't even have to really understand what the different models mean. It also filters the models based off what it detects your machine has available, so it gives you recommendations.

So it says this model will probably work, this model probably won't, which is quite useful, because it's very annoying waiting 15 minutes to download one and then finding it doesn't work. So once you've downloaded one, there's a chat interface here. I'll start a new chat.

Thanks. Okay. Yeah, so this is basically the chat interface. And here, I've got the four models that I usually use loaded up.

So I'll just load up this one. And right now, it's loading the model into memory. Just a pro tip for people. If you have a Mac, I highly recommend ticking this option.

Otherwise, it's going to be very, very slow. And then you can start chatting with it. What's interesting is that LM Studio also tells you how much of your machine's resources it's using to run the model.

So you can see here on Deb's machine, it's using 2.2 gigabytes to run the Phi-2 model, and it topped out at about 15% of the CPU. Yeah, so basically, this is the chat. It just works like OpenAI's ChatGPT.

There are some settings here where you can enter a system prompt; those who use the API will know what that is. It's giving you the context length here. There are a few more settings where you can start playing around with sampling, the formatting of the prompts, and how the memory is managed.

So here I keep the entire model in RAM for performance reasons. And these are more settings, basically. Then, if you want to switch to a different model, you eject the current one,

and you come here and you select another one. And then it'll load the second one into memory. And off you go, basically. This one is 7 billion parameters and the other one was about 2 billion, so it's taking a little bit longer, as you can see.

So, yeah. And then the other thing is to demonstrate how you can use this as a dev server. Right. So you can come here.

Here, sorry. Here you can basically say, I want to start a local server. It even gives you sample code for how to use it. This is for curl requests, but they have Python code here as well.

And a few more versions in Python, basically. We'll just stick with curl. So I'll start the server... and the server has started.

So I'll just do a curl request. Let's just open it up. I hope people can read this, but let's see if this works. So basically, what it does is launch an OpenAI-compatible server locally and run it on a port on your local machine.

So then you can just hit it with a curl request and it'll return an OpenAI-compatible response, which means that if you've already integrated with OpenAI, all you do is change the URL and now you're running on your local machine. Not that I do development anymore, but coming back to my plane example, you'd basically be able to write OpenAI-compatible code while on the flight, land, switch it back to OpenAI, and just keep going. So pretty powerful stuff. Here you can see all the logs from the output: all the tokens it used, and the streamed output, basically.
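
To make that URL swap concrete, here's a minimal Python sketch using the official openai client pointed at a local LM Studio server instead of the hosted API. It assumes the server is running on LM Studio's default port (1234) with a model loaded; the model name, system prompt, and question are placeholders.

```python
# Minimal sketch: point existing OpenAI-style code at a local LM Studio server.
# The base_url is the only thing that changes compared with the hosted API;
# LM Studio doesn't check the API key, so any placeholder string works.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:1234/v1",  # local LM Studio server instead of api.openai.com
    api_key="not-needed-locally",
)

response = client.chat.completions.create(
    model="local-model",  # placeholder; LM Studio serves whichever model is loaded
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the pros of running an LLM locally."},
    ],
)

print(response.choices[0].message.content)
print(response.usage)  # approximate prompt/completion token counts
```

Switch base_url back to the hosted endpoint (and a real API key) when you land, and the rest of the code stays the same.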

And as the OpenAI API also does, it gives you approximately how many tokens it used, so you can test the size of your prompts and everything. But, yeah, I think the only small limitation is that while you're running the server, you can't also chat with it in the UI. You can only do one thing at a time. But, yeah.

Cool. All right, let's jump back to the slides. Excellent.

Other Tools and Resources

So we demonstrated LM Studio, but there are lots of other tools out there, and LM Studio wasn't one of the first.

You can obviously just download various models and run them locally, but then you have to spin up the whole infrastructure yourself. Personally, I'm more interested in using LLMs than in messing around with figuring out how to run them or spinning up servers locally.

One of the things that I really like about LM Studio is it's continuously getting updated. So most of the time when I load it, it's received an update. As soon as I've completed the update, it's got new models available. So I don't usually have to wait very long.

If I hear about a new cool model I want to try out, it might be a week before it's available in LM Studio. NVIDIA also now has their own piece of software available, which I'm pretty sure needs an NVIDIA GPU and a Windows machine. There's Jan AI, there's GPT4All, and there are probably heaps of others as well.

And ultimately, I think one of the biggest problems, once you've got a tool like LM Studio, is just keeping up with how fast LLMs are changing. So Hugging Face is your friend. Personally, I find the best way to keep up with what's going on is to follow them on social media; I'm pretty sure they're available on pretty much all the social channels. So if you're not following Hugging Face yet, I would recommend that you do.

And then, of course, the way I found out about a lot of the other models, and particularly the Berkeley model with the 1 million token context window, is via other people who are in this space, just following different people on social media. So, yeah.

Q&A Session

Any questions? Mainly for Deb, hopefully. Yep, at the back there. Yep, you.

Hi. Just wondering about, like, system requirements. What kind of machine do you need to run this?

Yep. That's a great question. I was very envious when I saw how many tokens a second Deb's machine was getting.

So I've got an M1 with just 8 gig of RAM, and I'm able to run these models, but I do have to go and close some tabs in Chrome first. Deb, on the other hand, his machine hardly notices. He could sit here and just run this while he was doing other things, and it wouldn't really be doing much of a drain. So I'm seriously considering upgrading after watching Deb's demo.

I mean, probably. I mean, obviously, there were some pretty small models. Sorry, we had another question.

Yep. I guess it can. Anything that can interact with a local server would work. So if you had a Chrome extension that could interact with a local server, for example: I was reading about a Chrome extension that uses Ollama, I think I'm saying it correctly, which is another way to host your own LLMs behind a local server. And there are Chrome extensions that work with that, so you can summarize the current page. It's just a matter of whether you can get Word extensions or things like that to support it.
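
As an illustration of what interacting with a local server like that can look like, here's a minimal Python sketch against Ollama's local HTTP API. It assumes Ollama is installed and running on its default port (11434) and that a model such as llama2 has already been pulled; the prompt is just an example.

```python
# Minimal sketch of calling a locally hosted model through Ollama's HTTP API.
# Assumes Ollama is running on its default port (11434) and that a model such
# as "llama2" has already been pulled (e.g. `ollama pull llama2`).
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Summarize the following page text: ...",
        "stream": False,  # return a single JSON object rather than a token stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```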

You're going to be around afterwards, right?

A round of applause to both of you.
