Hello, everyone. Good to see some nice friendly faces. It's been a while since I've spoken here now.
Is it closing up? Yeah, there you go. Thanks.
I'm actually going to be reusing one or two slides from a talk I gave a while ago at one of the Mindslam meetups to do with LLMs, and obviously here.
Originally, when I spoke here before, I think it was the night that Llama 2 was released, and my talk at that point in time was on Falcon, which was the latest state-of-the-art open source model. As I was on the way to the talk, Llama 2 got released, which then became the state-of-the-art model, which made my talk a lot less sexy.
So today what I'm going to be going through is how to make an open source ChatGPT. Now, of course, it's not going to be quite the quality you'd expect from ChatGPT itself, but a lot of the open source tools coming out nowadays are getting surprisingly close. And open source is basically the default option for a lot of companies that can't use OpenAI for an array of reasons: privacy, compliance, trust, things like that. So hopefully this is useful.
A little bit about us. So we're a YC company from the Winter 21 batch. We started off in 2019, actually focused on hardware.
We're now a team of engineers, salespeople, and, more recently, marketing, which is quite a nice new thing for us. We really just focus on the infrastructure side of running AI and ML models.
So you can't always send data to OpenAI. Now what do you do?
As I alluded to before, you can't always use OpenAI. For a lot of companies the reasons aren't necessarily rational, but it is what it is, and you just have to deal with it.
A lot of companies we work with now do have genuine needs to not use them. Healthcare companies, for example, can't let patient data go outside of their own private cloud, things like that. So adopting these open source models really is something of a necessity.
What I'm going to be going through today, though, is a very specific use case for LLMs in conjunction with something called RAG. So RAG, just for those that don't know, stands for retrieval-augmented generation. What it basically means is, as many people are aware, the ChatGPT models, or GPT models in general, were trained on data from a while ago, so if something happens today, they don't know about it.
Now, just raise your hand here, actually, if you've heard of RAG before. Oh, OK, very nice. It is an LLM meetup.
So in an ideal situation, what you do is look at the input from a user, retrieve information relevant to it, pass it to the model, and the model processes that and gives you some useful result based on that data. That's the long and short of it.
Also, if you have internal data, in an ideal world it would take that internal data, pass it into the model, and provide very specific information for you. So what I'm going to be going through today is just some general techniques to do that, along with a demo that everyone can use and some open source code that you can use to do whatever you want with.
So in general, you take some input, search data, retrieve, and then you create a context prompt that can be input to the model. So there are things like vector DBs that do actually help quite a lot with this.
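Just to make that flow concrete, here's a rough Python sketch of it. The search_documents and run_llm functions are placeholders for whatever retrieval backend and model endpoint you actually use; this is not the code from my demo.

```python
# A minimal sketch of the RAG flow: take input, retrieve relevant data,
# build a context prompt, pass it to the model.

def search_documents(query: str) -> list[str]:
    """Placeholder retrieval step: return text snippets relevant to the query."""
    raise NotImplementedError("plug in your vector DB / search engine here")

def run_llm(prompt: str) -> str:
    """Placeholder inference step: return the model's completion for the prompt."""
    raise NotImplementedError("plug in your model endpoint here")

def answer(user_query: str) -> str:
    # 1. Retrieve data relevant to the user's input.
    snippets = search_documents(user_query)
    # 2. Build a context prompt combining the retrieved data with the query.
    context = "\n\n".join(snippets)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {user_query}\nAnswer:"
    )
    # 3. Pass the prompt to the model and return its response.
    return run_llm(prompt)
```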
So LlamaIndex, maybe raise your hand if you've heard of LlamaIndex as well. Slightly fewer, but yeah, some people there. LlamaIndex is basically a way for you to pass in a big data bank that you have. It then generates a numerical representation of that data that you can compare to a user query: your data gets embedded as one representation, the query comes in and gets embedded as well, and if the two are close, they're a good match. That's a very overly simplified version of it, but LlamaIndex is how a lot of people have mapped private data or big sources of data in general, and it's something you should definitely check out if you do have a big internal data bank.
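If you want to try it, a minimal version looks something like this. I'm assuming a recent llama-index release, a data/ folder of documents, and an embedding/LLM backend configured (by default it wants an OpenAI key, but you can point it at local models); none of this is the exact setup from my experiments.

```python
# Index a folder of documents with LlamaIndex and query it.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# Load everything in ./data and embed it into a vector index.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# The query gets embedded the same way, the closest chunks are retrieved,
# and the LLM answers using those chunks as context.
query_engine = index.as_query_engine()
print(query_engine.query("What does our internal holiday policy say?"))
```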
I'm not going to go into too much detail here, but slides will be available after if you do want to read a bit more on this.
Redis is something that is really useful to use here. It's a very powerful tool. I don't think it gets enough credit, which is also a note on my conclusion at the end.
But this is a tool you can use out of the box to get very fast data retrieval without needing anything that's really specific to ML. A lot of clouds already provide it for you as well.
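As one example of what I mean, here's a sketch of using Redis as a plain cache in front of the slow steps (searching, rendering pages). It assumes a local Redis instance and the redis-py client; the key scheme is just an illustration.

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cached_search(query: str, search_fn, ttl_seconds: int = 3600) -> list[str]:
    key = f"search:{query.strip().lower()}"
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)            # served straight from Redis
    results = search_fn(query)            # fall back to the slow path
    r.set(key, json.dumps(results), ex=ttl_seconds)
    return results
```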
At the time, throwback to Falcon being the best model here, I tried looking at how I could get my personal email into a nice format that I could then query via a chatbot. So what I did was feed in an email dump using a basic API, then try summarizing it and doing a series of things to get that email into a form the model could actually understand. This proved to be very challenging.
One thing at the time that was really bad was just the model. Again, Llama 2 came out that day, which was substantially better than that fairly crappy model, and things have come substantially further since then. The Mistral models are very good, and the demo I'm going to be showing later actually uses one of them.
I tried doing a bunch of analysis on how to get this to work, so I even tried processing documents with an LLM before generating the embeddings. It gave a slight improvement, things like double negatives were handled better in those situations, but still, it was just not quite there.
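For anyone curious, the preprocessing idea was roughly this: summarize each email with an LLM first, then embed the summary instead of the raw text. This is just a sketch of that idea; summarize_with_llm is a placeholder, and the embedding model name is a common default rather than what I actually used.

```python
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def summarize_with_llm(text: str) -> str:
    """Placeholder: ask an LLM for a short, literal summary of one email."""
    raise NotImplementedError

def embed_emails(emails: list[str]):
    # Summarizing first strips signatures, quoted replies and filler,
    # which is where raw embeddings tended to go wrong.
    summaries = [summarize_with_llm(e) for e in emails]
    return embedder.encode(summaries)
```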
So fast forward to the way I like to do things today, and the way things are done with a lot of tools like ChatGPT, for example: not using this private searching, but using search engines directly. And that is very useful to a lot of people.
What's great about Google is they provide a programmable search engine that you can just create an account for and get API access to. So here you can see me creating a search engine.
I can pass in an explicit URL, so say I wanted to build a documentation chatbot specifically for Mystic, I could just pass in docs.mystic.ai and then it would exclusively search our website to provide information for the chatbot. This is something you should really check out if you ever want to build a tool like this.
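Once you have the search engine, querying it is one HTTP call to the Custom Search JSON API. The API key and engine ID below are placeholders you get from the Google console; this is roughly what the demo's source does, not a copy of it.

```python
import requests

API_KEY = "YOUR_API_KEY"
ENGINE_ID = "YOUR_SEARCH_ENGINE_ID"   # e.g. one restricted to docs.mystic.ai

def google_search(query: str, num: int = 5) -> list[dict]:
    resp = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={"key": API_KEY, "cx": ENGINE_ID, "q": query, "num": num},
        timeout=10,
    )
    resp.raise_for_status()
    # Each item carries a title, link and snippet we can feed into the prompt.
    return resp.json().get("items", [])
```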
So here are the very basic steps to build your own chatbot. You want to make a UI, add support for this custom search API, and search Google based on the inputs. I'll go into a little bit more detail on that.
You then want to render web pages and extract the text. This is something you can apply in other areas too: if it's a PDF, you will need to render it and get the important information out of it. You can't just dump it in there.
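For the PDF case, something like pypdf works; that's just one option I'm using for illustration, not necessarily what you have to use, and the filename is a placeholder.

```python
from pypdf import PdfReader

reader = PdfReader("pricing.pdf")
# Pull the plain text out of every page; scanned PDFs would need OCR instead.
text = "\n".join(page.extract_text() or "" for page in reader.pages)
```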
You will then have to create what's called a system prompt. A system prompt is basically something that guides the model: here is what you are, here is how to behave. You've seen a lot of people do prompt injections and things like that to try and trick the model, and the system prompt is normally where you get the best results for steering it.
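Here's a rough sketch of assembling that final prompt: system prompt, numbered sources, then the user's question. run_model is a placeholder for whatever inference backend you use; the demo happens to use a hosted Mistral 7B, but the wording of this system prompt is just an example, not the demo's exact prompt.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer using only the sources provided. "
    "If the sources do not contain the answer, say so."
)

def build_prompt(sources: list[str], question: str) -> str:
    # Number the sources so the model (and the UI) can cite them.
    numbered = "\n\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return f"{SYSTEM_PROMPT}\n\nSources:\n{numbered}\n\nQuestion: {question}\nAnswer:"

def run_model(prompt: str) -> str:
    """Placeholder inference call; swap in your own endpoint or local model."""
    raise NotImplementedError
```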
And then you want to run the inference on the model. So I have a tool where you can do this right now.
Apologies if it breaks, because I put it together on Thursday evening. But this is a little UI I spun up at chat.mystic.ai. And a lot of people are scanning it with their phones, so it will break now. But it's basically a chat app that I'm going to show in a second.
So it runs on Mistral 7B, performs all of this Google searching, and should stream results to you. It's got, I think, 10 A100s powering it at the moment. There's a link to GitHub up at the top where you can see all the source code for how to do all of this, as well as how to create the Google Search API access, run the model out of the box, things like that. Whoever does get it to work, well done, because no one else will.
So the demo I'm going to show you now I can actually pull up here. So this is running on my workstation over at home, and the internet isn't great. So what I can do is just ask it to tell me the time, and hopefully this does work.
All right. What would you like to tell me about? All right. Oh, sorry.
Tell me the time. What is the time? That's a better example.
So you can see here, this is an older Mistral model. It doesn't actually know what the time is right now. But what I did here was I searched just that query in Google.
So if I perform that query in Google now, what you will see is very similar to or the same results mapped over here in the sources column. Again, this is all in the source code.
So this does break down in certain situations, though, that I'm going to go into. So this chatbot is very basic and primitive. I'm going to be adding image generation and things like that to it.
If you want to do that, again, the source code is here. But you can get surprisingly good results from a basic system like this. And again, you can run it completely privately.
So one of the things I found when building this, and let me get back here, was that it's really the quality of the information you put into it, rather than the quality of the model, that matters. A lot of the time, people tend to say, this isn't working, the model sucks. That is a very naive approach to doing things.
Typically, it's your fault. Look at the data. A lot of people try to fine-tune models, try to train them as well. I was just having a conversation with someone about this earlier.
It's very easy to be frustrated with these models. But if you have low-quality data going into them, you will get very low-quality results.
So I have some examples here where it's performing very nicely. Basic questions: what is the time in Dubai? It can search that and provide it very quickly. That's up-to-date, real-time information that gets passed in.
Who is Linus Torvalds? This is easy: it just searches his Wikipedia page, gets the information from there, no problem.
But when I start to go into more in-depth, consecutive follow-up questions, things start to break down somewhat. So here I ask, how much is an A100 on Mystic AI? That's us. It gets the information perfectly fine.
I then pass in, how much is that per hour? And because it uses the most recent search result, it actually loses the web page that was passed in to get that information. So all it has is the prompt that was in there before.
And it tries to actually do maths, which is quite weird. What is more interesting, though, is that it does it correctly. It actually manages to convert the per-second rate up to an hourly one to quite a high number of decimal places. I didn't check this, but I do believe it to be correct.
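Just to show the kind of conversion it was doing, here it is with a made-up number, not our actual price:

```python
price_per_second = 0.0008                  # hypothetical USD per second, for illustration only
price_per_hour = price_per_second * 60 * 60
print(f"${price_per_hour:.4f} per hour")   # -> $2.8800 per hour
```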
So the quality of these open source models is actually relatively impressive. But I think when working with them privately, you do need to do a lot of processing to make sure stuff is passed in correctly. And at OpenAI, that is where a lot of the work actually goes: checking that things are filtered properly and passed in appropriately.
So with these systems, I think one thing that is really important is context length. Now with the GPT-4 Turbo models and things like that, the context length is going up. One next thing to try here is actually the Qwen models. I believe they have a context length of 128,000 tokens while the models themselves are still relatively small. So I very highly recommend checking them out.
And that's something that is on par basically with GPT-4, although slightly lower quality.
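Whatever the window is, you still have to budget it, because retrieved pages will happily blow past it. Here's a sketch of trimming ranked chunks to a token budget with a Hugging Face tokenizer; the model name and the 8,192-token budget are placeholders, not the demo's exact settings.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")

def fit_to_budget(chunks: list[str], max_tokens: int = 8192) -> str:
    # Chunks should already be ranked best-first; keep them until the budget runs out.
    kept, used = [], 0
    for chunk in chunks:
        n = len(tokenizer.encode(chunk))
        if used + n > max_tokens:
            break
        kept.append(chunk)
        used += n
    return "\n\n".join(kept)
```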
One difficult thing when working with these chatbots is that you get into a whole rabbit hole of other problems that are really unrelated.
For instance, with web pages, there's the question of how to render them properly. A lot of pages use JavaScript that dynamically generates the content as you go along, so if you just grab the raw text, you're not getting anything useful. There, I really recommend using tools like Selenium in Python, which drives a virtual Chrome so the web page gets fully rendered, JavaScript included. Really good tools there.
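A minimal version of that looks something like this, assuming you have Chrome installed and the selenium package; the URL is just our docs site from earlier as an example.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")          # render without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://docs.mystic.ai")
    # Grab the visible text after JavaScript has built the page.
    text = driver.find_element(By.TAG_NAME, "body").text
finally:
    driver.quit()
```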
Using a bigger model will help as well, as we've seen with a lot of the standard ChatGPT stuff. Adding in other modalities would be really important here too.
Yep, you can contribute to this on GitHub as well, or just check out the source code if you do want to know how to work with some of these models.
One real challenge, though, that we are seeing at the moment is there are lots of open source models out there. And it's pretty difficult to stay on top of it.
So I took this screenshot today. The top three models weren't there yesterday, I don't think. They did look kind of funky when I actually clicked through, because there were no sources or anything.
But it just goes to my point where, again, right after my last talk, a new state-of-the-art open source model came out pretty much straight away. Why am I saying this? I'm leading into our plug.
So the hardest part of a lot of these things is actually getting the models to run. That's what we specifically focus on.
So what we do, for Google for example, is use sign-in with Google to get access to your cloud account. Now, if you're a startup lucky enough to get funded, you have a bunch of cloud credits thrown at you by all of the cloud providers at the moment, and we let you put them to use very quickly without having to do any work yourself.
So you click buttons on us, and then we set up all of the models on your private cloud for you. And there's a nice GIF to prove it as well.
We also do GPU fractionalization and things like that.
So just some opinions to wrap up.
I think one main thing that we keep on seeing as a company is that you really want to be focused on what you specifically do with LLMs. You can't just throw data at them and expect them to work very well. And that's something that this project, I think, really highlighted, especially with context length as well.
Web pages and PDF documents can be really large, and you cannot just dump them in; that information will just get lost, and it's not useful. Pre-processing data and maintaining that pipeline is very important here as well.
Again, Redis is great and deserves more credit. Hosting models is really hard; you shouldn't be doing this yourself. Again, that is a plug for us, but it's a genuine problem.
So we see a lot of companies spend hours trying to get things to work in VMs in the cloud. They manage to get it to work, and they're so happy, and that's fine, but they haven't touched their product. So using open source tools, even if it's not us, is a really important thing there. Use pre-made things to get a model to run. Don't focus on that. Focus on your UX and your product.
And finally, I think one good message to end on is that there are lots of applications that can still be built on this. This technology is still very early on. A lot of companies we speak to, even at the more advanced stages, are in the R&D phase of using AI. And our role is definitely to help them lower costs so they can just focus on the product and other things. But I think there's a huge amount that can still be made. So it can be overwhelming to see all the new stuff coming out and think, how can I keep up with it? But yeah, there's a lot of cool stuff out there. Thank you.