Insight into using LLMs on local infrastructure

Introduction

Hey, there we go.

Who I Am and What This Talk Covers

Who am I?

My name is Ian Keddie.

If you want to see the slides at the end of this, grab that and I'm happy to send them through.

There's nothing particularly secret in there.

Just for me, I spent the first half of my career in software development, either doing it myself or spending time running software development teams.

Then I spent the second half of my career in global leadership positions, mainly in the managed service provider space.

and more recently I'm doing sort of consultancy on business automation with

AI oh first of all again thank you to clocks out for sponsoring that for the

pizza and the beer this is this is the man to thank for that what am I going to

talk about today so today I'm gonna talk about well actually now today my mission

is to try and convince three people in the audience to go home and install some

AI on their machine so that's what I'm gonna ask you at the end right if I've

convinced anyone here to do that.

What I'm going to do, I'm going to talk you

through installing it, how can you put an LLM on your machine, and give you some

examples, so I've put some realistic examples together of using that LLM on

my machine, and then at the end a little bit of a pros and cons, you know, what

was it like, what was it good for, what didn't work.

Key Concepts and Terminology

What “Small Language Models” Mean

So bit of terminology, what is a

small language model?

So we've got lots of language models out there, so the

The small language models, well, we measure them in billions of parameters, so the parameters

are in a language model.

The parameters are all the variables that get tweaked when the model decides what token

to predict next effectively.

The more parameters you have, the more capability the model has.

The small models, less than seven billion.

Those are ones that can run on edge devices, run them, small enough ones can run on your

phone your laptop the mid to large size they're still models which are capable to be run on

machines so i'm running a mid -size model on my machine for these examples but you can run a

small one and these are things you could run on your own hardware you could buy hardware and run

these quite easily the large models you could run them largely you're looking at hiring you're

looking at a sas option here you're going to be buying this as a service these are too big you

You can either buy them as pay for tokens through APIs,

or you could buy the compute

and then pay for compute, really.

And then at the end, you've got the frontier models.

These are the ones you log into ChatGPT.

What's the latest model for Grok?

Whatever, these are the frontier models.

Parameter -wise, 500 billion trillions.

They don't tend to say how many they've got,

but this is not something you're gonna run on your machine.

Mixture of Experts (MoE) in Plain Terms

Little note there,

mixture of expert models are quite interesting.

So we're talking here about billion parameters.

So normally, a token runs through the entire set, so all of these parameters, a token, will run through there to get computed.

A mixture of experts models makes it a bit more simple.

So instead of having one big model, there are a collection of smaller models that are managed within the bigger model.

So an example of a QEN model there, it's running on 400 billion, so it's a big model.

But at any token, it's usually only about 17 billion that are actually active.

So it chooses one or more experts within that larger model.

So it makes it vastly quicker

It's using a fraction of the entire lot

So it's but much much faster and you'll start seeing that in some of these smaller models

They start using mixture of experts to get them really fast

Distillation and Fine-Tuning

Another bit of terminology I want to mention because I might drop it in later is fine -tuning

So smaller models are normally made from a larger model distilled

So, you don't always train a smaller model.

What they'll do is they'll train a larger general purpose model, distill it into something

smaller.

You do lose some of that knowledge base, because they're trained on tens of billions of documents.

You lose some of that knowledge base.

But as a rule, you can get it down to about 20 % of the size, but still retain about 80 %

of the capability.

So, it's quite impressive how much you can shrink it, depending on what you measure capability

capability as but you can keep 80 of the capability for 20 of the size so these models are great

fine -tuning is taking a small model that's general and adding a veneer of your own training so to

train a large model it's going to take six months of constant training to fine -tune a smaller model

that's going to take a week or so so what you can do is you can get a smaller model and train it on

your domain.

So for example, I've got medical domain, security domain, etc.

You

can train a smaller model then with a veneer of training, so it inherently

knows your content.

So that's a good way to get a big model, make it small, and

then make it useful to you.

So that's some terminology.

Getting Set Up Locally

Hardware: GPUs, Memory, and Model Size

Okay, so for those who

I'm trying to convince to install it, you're gonna need some hardware.

Models

perform best on the GPU, so on the graphics processing unit, and they run

the entire model in memory so to really measure what you're going to use you need a graphics card

with graphics memory of a certain size a rule of thumb is for models i'm talking about they're

quantized down to a certain size about 1 billion parameters is about 1 gig of ram so if you find

out how much ram your graphics cards got you can make a choice about which model you want to use

that's kind of a useful rule of thumb it's not exact but it's a good place to start

that.

You can move it to the CPU, but it gets a lot slower

and it starts trying to use the both.

So I'd recommend

understanding your graphics card making the choice.

So you need

Software: LM Studio, n8n, and Hugging Face

software.

Once you've got the hardware, you need some

software.

I'd recommend LM studio.

It's graphical, you can

double click install, it takes minutes.

Bosch, you've got the

capability to run a model on your machine.

It's really,

really easy.

So graphical click, just double click and it's

there the examples i'm using because having a model on its own doesn't do much you can chat

with it but you can do that with chat gpt so the example i've used is n8n so n8n is a low code no

code workflow tool and you can install that in a docker it's a little bit harder but it's pretty

straightforward you install it and then the good people of n8n keep it up to date for you so you

just make sure it's refreshed and i'm going to use that in some examples then you need a model

model you actually need to put a model in all of this stuff hugging face is the place you go it's

a strange name I know but hugging face is a catalog of all the models out there really it

is the place to go last count over two million models they've got pretty much every model and

stacks of training data so if you decide you want to train your own model you can get good reference

data to train your own model if you wish I put some examples there Microsoft Google Meta IBM

IBM they've all got models on there of varying sizes there's some really tiny ones and there's

some really big ones I've just put some examples there I've tended in these examples to use quen

so quen and llama have good general purpose models uh they're benchmarked well as being useful

Practical Example 1: Job-Scraping and Job Fit Scoring

general models so in my examples that's what I've used okay a practical example so I've got my

machine it's got the model on there and it's running so i wanted to do something with it for

an example for everybody so what i created was a a job scraping workflow for those who were in the

job market so this workflow basically goes to half a dozen different job websites puts in lots of

different searches because some people are looking for multiple terms for the same role multiple

locations for the same role so the workflow gets the jobs tidies them up a little bit and what you

find if you're job hunting, a lot of the a lot of the jobs you

get pretty rubbish in the description, they don't fit.

don't want to be a fishmonger.

But sometimes it's going to come

up some of these, some of these are terrible.

So good use of AI.

I write my own prompt.

And I've got that information, I've put

it into a standard format description, how it pay where it

is that kind of thing.

And I've written a prompt then.

So the

prompt can go and do a decent job of deciding whether that

job is good for me or not.

So that's what this workflow does.

us.

Speed Comparison: API Model vs Local Models

So I've got three examples of doing it.

First one, chat GPT

five mini, very popular model.

I took 50 of those jobs, run it

through chat GPT, it took 10 minutes, 30 seconds.

So next

one, quen 2 .5.

This is a 14 billion parameter model running

on my machine.

How long do we think that took?

Who's who's

gonna?

Anyone want to guess?

You want to guess?

Well, it's a it

It's 14 billion parameters, so you can guess how much memory I've got.

Anybody want to guess?

Faster, slower?

Two minutes.

I like that.

Someone's prepared to stick their neck out.

That took five minutes.

Not bad.

Well, actually, pretty good, to be honest.

It's half the time it took to use an API, and that's just running a machine,

and that was absolutely fine.

The results were great.

There was a little tweak where GPT -5 did a bit more reasoning

because i think there were the criteria i put was an annual salary when some contracts came in they

came in as a day rate and chat tv5 actually just did the maths and swapped it over and then and

worked it out whereas quen didn't i could have probably prompted it to but i didn't

okay we've seen a bit of a improvement so next one the seven billion parameter half the size

of that model who's prepared to stick their neck out again and say how how long that took what

What did you say?

A minute.

A minute, two, three minutes.

Okay, two, three minutes.

Our survey says 46 seconds.

It was quite impressive to compare those.

The outcome, identical.

Most of this is latency, right?

I guess if you're calling across an API to America, wherever it's running,

there's a lot of latency in there, but this was only 50 calls.

So that was blindingly fast.

so that's an excellent example of the speed you can get out of the the outcome is fine absolutely

How the Workflow Looks in n8n

fine but vastly superior in speed i did mention n8n so for those who are interested this is what

it looks like you draw these things in n8n i've taken out the the scraping bit because that's not

important for an ai talk um so there you just get the rows from the database bung him into the model

there the mogul is plumbed into quen at that point 50 records and then i stick them back in the

database afterwards it takes minutes to draw one of these in an 8n you don't need to code you might

see on the bottom right there all i asked it to do at the end was to give me a rating high medium low

and to give me some reasoning so i could sort of at least understand why it shows that

so it does it just pops a rating and a reasoning in a database that's all that one does second

Practical Example 2: An “Agentic” Gmail Labeling Workflow

example try to get more interesting try to make an agentic example now because it's you know a

a useful buzzword.

By that I mean, I've taken a model, and

I've given the model a selection of tools.

So this is an example

of Gmail.

So this model, I've given it the ability to read my

Gmail, read the labels in my Gmail, create labels, if it sees

fit, and then apply labels.

So effectively, I've given this

model the tools.

And I've said, this is what I want you to do.

So I haven't actually said, do all of these things I've said I

My outcome is, here's an email, categorize it.

If you can't find a great category, make one up and then apply it.

So that's what this model is doing here for Gmail.

So first one, chat TP5 mini.

I won't ask you to guess that one.

That was 3 minutes 19.

Now, bearing in mind, this model is now quite chatty because it's got to go over to wherever Gmail is housed in the States,

go over to get the labels in the States.

so there's a lot more conversational state there just sort of setting the grounding for the next

question so quen 2 .5 14 billion parameters asking the audience how long do we think that took

less than a minute oh somebody's feeling good one and a half minutes no not bad not bad

i kind of i did precede that because this is quite chatty it has to go

So a lot of this is then turnaround time going across an API call to somewhere else.

But the results were absolutely fine.

But I'll tell you a little bit more about that afterwards.

So who would like to guess?

So the 7 billion parameter, the smallest model.

30 seconds.

25 seconds.

35 seconds.

27, actually.

27 seconds.

A minute.

So we're going around a minute to a minute 30 seconds.

I tricked you.

you, it didn't work.

It probably would work, and I shouldn't have left this to yesterday

until I did it.

Where Local Models Struggle: Tool Use and Reasoning

But I left that in on purpose because, to highlight a point, when I created

the model for ChatDB 5 Mini, I bunged in a prompt.

I quickly created a prompt.

I roughly

described the tools, gave it the tools, worked first time.

Brilliant.

I thought, this agentic

larks brilliant as easy right on my i thought right i will just plug in my local model failed

straight away and i had to spend quite a bit of time on the local model getting it to work so the

point of this is the local models are good but they're not as good at reasoning as some of these

grinder bottles so i had to spend time really clearly defining the steps it needed to take

clearly defining what each of the tools did so that took a lot more time and effort arguably

that's a good thing because I've got a better understanding of what my model's trying to do

because I was a bit more prescriptive.

But there's something to learn there because then I took that

same prompt, just plugged in another model, and it failed.

Now I'm confident if I'd have spent

another couple hours prompting the hell out of this, I could probably have got it to work because

according to the documentation, it can work.

And most of the time, to be fair, I don't need to

generate the prompts.

You give the prompt to a larger model and say to the larger model, create

me a prompt please try it and then you go back to the larger model and say this prompt isn't working

what do i do um you kind of get the gist of it after a while but that's a point to make that

The n8n Agent Workflow at a Glance

you know there are some downsides to running these as well oh yes following on here's the n8m

workflow basically there's the model i just plugged in a bunch of tools um there's the various models

that i was plugging in quinn etc uh on all that the workflow does is get the emails and pump it

into the agent and the agent makes a choice and you'll notice if you can read the bottom right

you can see for each one of these goes to the model model makes a decision to read the labels

model makes it another another decision there to apply label makes another decision

it's happy it moves on um so just an example of what anything it just took me not very long to

draw that because it's all no code right i just drew that so it's not bad tool right i won't go

Where Local Models Shine

a detail on this one just an example of what some of these use cases what you could use these local

models for so you've seen how well it worked on the first example where it tripped up on the second

Security-Focused Uses: PII Detection and Prompt-Injection Gates

example um few places you can use it there that my favorite the two on the end for security actually

because these models can run locally you've got control of your security boundaries so if you

wanted to do something like um personally identifiable information detection pi detection

detection.

You might have a large body of documents and you don't want to send those

floating around everywhere because they might be secure.

Running one of these smaller models

against that document set will run really quickly.

You've got control of the security

boundary.

It's an excellent example of what you can do there.

And you can control it through

your prompt then.

1Prompt injection detection is quite an interesting one as well.

So prompt

injection, we were chatting about earlier, for some of you is where a bad actor is trying

to inject something into your existing prompts to get the model to do something it shouldn't

so what this can be is a a very small fast gate before you pass a prompt to a larger model you

have a small fast gate which doesn't implement what's in the prompt but it checks the intentions

of that prompt and you can do that quite quickly on a small model you can do that locally you can

have control of that so that's quite a good example of where people are starting to use

these smaller models to track and try and put a little gate in front of a larger model

Pros, Cons, and Trade-Offs

right wrapping up them as you've seen costs of running these well it's running on my local

machine i can run it all day and i'll run it all night it's just electricity then maintenance of

the box so from a cost of run these are very cheap to run that's pretty good latency as we

saw earlier if you've got the content locally you don't have to bounce every request to a model

somewhere in america or wherever that is right so if you're doing a lot of local work these things

can run blindingly fast privacy i mentioned earlier it's local you've got control of your

fence i mean you can you can host these uh on cloud but you you can change the size of your

security perimeter but you can manage that yourself from a privacy and i guess vendor lock

in if you if you like to try swapping your models out here there and everywhere you're not locked

knocked into anybody you can try using different models weaknesses I mentioned earlier you need to

know what you're doing a bit more to prompt these things they are a bit more tricky to prompt that's

not a bad thing but it takes more work hardware you have to buy the hardware to run this I did

say it's cheap to run but you have to buy some hardware depending on what you want to do depends

how much hardware you need to buy so that's something to think about and governance if

you're paying for a sas you don't care about well you do but you're offloading the risk to somebody

else to keep that model up to date keep it secure keep it looked after if you're hosting your own

model you've got to make sure you're running it all up to date everything's patched everything's

looked after so that's a consideration and the last bit i mentioned was reasoning limits now

these things are distilled to be small and neat they haven't got lots of information about security

security, legal issues, whatever, right?

They've not got in there.

So they're not as good at,

they haven't got that large body of knowledge to play with.

So that's the downside as well.

Conclusion and Audience Q&A

So my last question, hand up, is anybody thinking of trying this when they go home this week?

One,

two, three, four.

I was aiming for three.

I got about six or seven there.

So I'm pleased with

that um that's it for me have i got time for questions nope not at all anybody get any

questions uh see me outside if you want and if you want if you want the slides or anything let me know

get my details there thank you very much