Hey, there we go.
Who am I?
My name is Ian Keddie.
If you want to see the slides at the end of this, grab that and I'm happy to send them through.
There's nothing particularly secret in there.
Just for me, I spent the first half of my career in software development, either doing it myself or spending time running software development teams.
Then I spent the second half of my career in global leadership positions, mainly in the managed service provider space.
and more recently I'm doing sort of consultancy on business automation with
AI oh first of all again thank you to clocks out for sponsoring that for the
pizza and the beer this is this is the man to thank for that what am I going to
talk about today so today I'm gonna talk about well actually now today my mission
is to try and convince three people in the audience to go home and install some
AI on their machine so that's what I'm gonna ask you at the end right if I've
convinced anyone here to do that.
What I'm going to do, I'm going to talk you
through installing it, how can you put an LLM on your machine, and give you some
examples, so I've put some realistic examples together of using that LLM on
my machine, and then at the end a little bit of a pros and cons, you know, what
was it like, what was it good for, what didn't work.
So bit of terminology, what is a
small language model?
So we've got lots of language models out there, so the
The small language models, well, we measure them in billions of parameters, so the parameters
are in a language model.
The parameters are all the variables that get tweaked when the model decides what token
to predict next effectively.
The more parameters you have, the more capability the model has.
The small models, less than seven billion.
Those are ones that can run on edge devices, run them, small enough ones can run on your
phone your laptop the mid to large size they're still models which are capable to be run on
machines so i'm running a mid -size model on my machine for these examples but you can run a
small one and these are things you could run on your own hardware you could buy hardware and run
these quite easily the large models you could run them largely you're looking at hiring you're
looking at a sas option here you're going to be buying this as a service these are too big you
You can either buy them as pay for tokens through APIs,
or you could buy the compute
and then pay for compute, really.
And then at the end, you've got the frontier models.
These are the ones you log into ChatGPT.
What's the latest model for Grok?
Whatever, these are the frontier models.
Parameter -wise, 500 billion trillions.
They don't tend to say how many they've got,
but this is not something you're gonna run on your machine.
Little note there,
mixture of expert models are quite interesting.
So we're talking here about billion parameters.
So normally, a token runs through the entire set, so all of these parameters, a token, will run through there to get computed.
A mixture of experts models makes it a bit more simple.
So instead of having one big model, there are a collection of smaller models that are managed within the bigger model.
So an example of a QEN model there, it's running on 400 billion, so it's a big model.
But at any token, it's usually only about 17 billion that are actually active.
So it chooses one or more experts within that larger model.
So it makes it vastly quicker
It's using a fraction of the entire lot
So it's but much much faster and you'll start seeing that in some of these smaller models
They start using mixture of experts to get them really fast
Another bit of terminology I want to mention because I might drop it in later is fine -tuning
So smaller models are normally made from a larger model distilled
So, you don't always train a smaller model.
What they'll do is they'll train a larger general purpose model, distill it into something
smaller.
You do lose some of that knowledge base, because they're trained on tens of billions of documents.
You lose some of that knowledge base.
But as a rule, you can get it down to about 20 % of the size, but still retain about 80 %
of the capability.
So, it's quite impressive how much you can shrink it, depending on what you measure capability
capability as but you can keep 80 of the capability for 20 of the size so these models are great
fine -tuning is taking a small model that's general and adding a veneer of your own training so to
train a large model it's going to take six months of constant training to fine -tune a smaller model
that's going to take a week or so so what you can do is you can get a smaller model and train it on
your domain.
So for example, I've got medical domain, security domain, etc.
You
can train a smaller model then with a veneer of training, so it inherently
knows your content.
So that's a good way to get a big model, make it small, and
then make it useful to you.
So that's some terminology.
Okay, so for those who
I'm trying to convince to install it, you're gonna need some hardware.
Models
perform best on the GPU, so on the graphics processing unit, and they run
the entire model in memory so to really measure what you're going to use you need a graphics card
with graphics memory of a certain size a rule of thumb is for models i'm talking about they're
quantized down to a certain size about 1 billion parameters is about 1 gig of ram so if you find
out how much ram your graphics cards got you can make a choice about which model you want to use
that's kind of a useful rule of thumb it's not exact but it's a good place to start
that.
You can move it to the CPU, but it gets a lot slower
and it starts trying to use the both.
So I'd recommend
understanding your graphics card making the choice.
So you need
software.
Once you've got the hardware, you need some
software.
I'd recommend LM studio.
It's graphical, you can
double click install, it takes minutes.
Bosch, you've got the
capability to run a model on your machine.
It's really,
really easy.
So graphical click, just double click and it's
there the examples i'm using because having a model on its own doesn't do much you can chat
with it but you can do that with chat gpt so the example i've used is n8n so n8n is a low code no
code workflow tool and you can install that in a docker it's a little bit harder but it's pretty
straightforward you install it and then the good people of n8n keep it up to date for you so you
just make sure it's refreshed and i'm going to use that in some examples then you need a model
model you actually need to put a model in all of this stuff hugging face is the place you go it's
a strange name I know but hugging face is a catalog of all the models out there really it
is the place to go last count over two million models they've got pretty much every model and
stacks of training data so if you decide you want to train your own model you can get good reference
data to train your own model if you wish I put some examples there Microsoft Google Meta IBM
IBM they've all got models on there of varying sizes there's some really tiny ones and there's
some really big ones I've just put some examples there I've tended in these examples to use quen
so quen and llama have good general purpose models uh they're benchmarked well as being useful
general models so in my examples that's what I've used okay a practical example so I've got my
machine it's got the model on there and it's running so i wanted to do something with it for
an example for everybody so what i created was a a job scraping workflow for those who were in the
job market so this workflow basically goes to half a dozen different job websites puts in lots of
different searches because some people are looking for multiple terms for the same role multiple
locations for the same role so the workflow gets the jobs tidies them up a little bit and what you
find if you're job hunting, a lot of the a lot of the jobs you
get pretty rubbish in the description, they don't fit.
I
don't want to be a fishmonger.
But sometimes it's going to come
up some of these, some of these are terrible.
So good use of AI.
I write my own prompt.
And I've got that information, I've put
it into a standard format description, how it pay where it
is that kind of thing.
And I've written a prompt then.
So the
prompt can go and do a decent job of deciding whether that
job is good for me or not.
So that's what this workflow does.
us.
So I've got three examples of doing it.
First one, chat GPT
five mini, very popular model.
I took 50 of those jobs, run it
through chat GPT, it took 10 minutes, 30 seconds.
So next
one, quen 2 .5.
This is a 14 billion parameter model running
on my machine.
How long do we think that took?
Who's who's
gonna?
Anyone want to guess?
You want to guess?
Well, it's a it
It's 14 billion parameters, so you can guess how much memory I've got.
Anybody want to guess?
Faster, slower?
Two minutes.
I like that.
Someone's prepared to stick their neck out.
That took five minutes.
Not bad.
Well, actually, pretty good, to be honest.
It's half the time it took to use an API, and that's just running a machine,
and that was absolutely fine.
The results were great.
There was a little tweak where GPT -5 did a bit more reasoning
because i think there were the criteria i put was an annual salary when some contracts came in they
came in as a day rate and chat tv5 actually just did the maths and swapped it over and then and
worked it out whereas quen didn't i could have probably prompted it to but i didn't
okay we've seen a bit of a improvement so next one the seven billion parameter half the size
of that model who's prepared to stick their neck out again and say how how long that took what
What did you say?
A minute.
A minute, two, three minutes.
Okay, two, three minutes.
Our survey says 46 seconds.
It was quite impressive to compare those.
The outcome, identical.
Most of this is latency, right?
I guess if you're calling across an API to America, wherever it's running,
there's a lot of latency in there, but this was only 50 calls.
So that was blindingly fast.
so that's an excellent example of the speed you can get out of the the outcome is fine absolutely
fine but vastly superior in speed i did mention n8n so for those who are interested this is what
it looks like you draw these things in n8n i've taken out the the scraping bit because that's not
important for an ai talk um so there you just get the rows from the database bung him into the model
there the mogul is plumbed into quen at that point 50 records and then i stick them back in the
database afterwards it takes minutes to draw one of these in an 8n you don't need to code you might
see on the bottom right there all i asked it to do at the end was to give me a rating high medium low
and to give me some reasoning so i could sort of at least understand why it shows that
so it does it just pops a rating and a reasoning in a database that's all that one does second
example try to get more interesting try to make an agentic example now because it's you know a
a useful buzzword.
By that I mean, I've taken a model, and
I've given the model a selection of tools.
So this is an example
of Gmail.
So this model, I've given it the ability to read my
Gmail, read the labels in my Gmail, create labels, if it sees
fit, and then apply labels.
So effectively, I've given this
model the tools.
And I've said, this is what I want you to do.
So I haven't actually said, do all of these things I've said I
My outcome is, here's an email, categorize it.
If you can't find a great category, make one up and then apply it.
So that's what this model is doing here for Gmail.
So first one, chat TP5 mini.
I won't ask you to guess that one.
That was 3 minutes 19.
Now, bearing in mind, this model is now quite chatty because it's got to go over to wherever Gmail is housed in the States,
go over to get the labels in the States.
so there's a lot more conversational state there just sort of setting the grounding for the next
question so quen 2 .5 14 billion parameters asking the audience how long do we think that took
less than a minute oh somebody's feeling good one and a half minutes no not bad not bad
i kind of i did precede that because this is quite chatty it has to go
So a lot of this is then turnaround time going across an API call to somewhere else.
But the results were absolutely fine.
But I'll tell you a little bit more about that afterwards.
So who would like to guess?
So the 7 billion parameter, the smallest model.
30 seconds.
25 seconds.
35 seconds.
27, actually.
27 seconds.
A minute.
A minute.
So we're going around a minute to a minute 30 seconds.
I tricked you.
you, it didn't work.
It probably would work, and I shouldn't have left this to yesterday
until I did it.
But I left that in on purpose because, to highlight a point, when I created
the model for ChatDB 5 Mini, I bunged in a prompt.
I quickly created a prompt.
I roughly
described the tools, gave it the tools, worked first time.
Brilliant.
I thought, this agentic
larks brilliant as easy right on my i thought right i will just plug in my local model failed
straight away and i had to spend quite a bit of time on the local model getting it to work so the
point of this is the local models are good but they're not as good at reasoning as some of these
grinder bottles so i had to spend time really clearly defining the steps it needed to take
clearly defining what each of the tools did so that took a lot more time and effort arguably
that's a good thing because I've got a better understanding of what my model's trying to do
because I was a bit more prescriptive.
But there's something to learn there because then I took that
same prompt, just plugged in another model, and it failed.
Now I'm confident if I'd have spent
another couple hours prompting the hell out of this, I could probably have got it to work because
according to the documentation, it can work.
And most of the time, to be fair, I don't need to
generate the prompts.
You give the prompt to a larger model and say to the larger model, create
me a prompt please try it and then you go back to the larger model and say this prompt isn't working
what do i do um you kind of get the gist of it after a while but that's a point to make that
you know there are some downsides to running these as well oh yes following on here's the n8m
workflow basically there's the model i just plugged in a bunch of tools um there's the various models
that i was plugging in quinn etc uh on all that the workflow does is get the emails and pump it
into the agent and the agent makes a choice and you'll notice if you can read the bottom right
you can see for each one of these goes to the model model makes a decision to read the labels
model makes it another another decision there to apply label makes another decision
it's happy it moves on um so just an example of what anything it just took me not very long to
draw that because it's all no code right i just drew that so it's not bad tool right i won't go
a detail on this one just an example of what some of these use cases what you could use these local
models for so you've seen how well it worked on the first example where it tripped up on the second
example um few places you can use it there that my favorite the two on the end for security actually
because these models can run locally you've got control of your security boundaries so if you
wanted to do something like um personally identifiable information detection pi detection
detection.
You might have a large body of documents and you don't want to send those
floating around everywhere because they might be secure.
Running one of these smaller models
against that document set will run really quickly.
You've got control of the security
boundary.
It's an excellent example of what you can do there.
And you can control it through
your prompt then.
1Prompt injection detection is quite an interesting one as well.
So prompt
injection, we were chatting about earlier, for some of you is where a bad actor is trying
to inject something into your existing prompts to get the model to do something it shouldn't
so what this can be is a a very small fast gate before you pass a prompt to a larger model you
have a small fast gate which doesn't implement what's in the prompt but it checks the intentions
of that prompt and you can do that quite quickly on a small model you can do that locally you can
have control of that so that's quite a good example of where people are starting to use
these smaller models to track and try and put a little gate in front of a larger model
right wrapping up them as you've seen costs of running these well it's running on my local
machine i can run it all day and i'll run it all night it's just electricity then maintenance of
the box so from a cost of run these are very cheap to run that's pretty good latency as we
saw earlier if you've got the content locally you don't have to bounce every request to a model
somewhere in america or wherever that is right so if you're doing a lot of local work these things
can run blindingly fast privacy i mentioned earlier it's local you've got control of your
fence i mean you can you can host these uh on cloud but you you can change the size of your
security perimeter but you can manage that yourself from a privacy and i guess vendor lock
in if you if you like to try swapping your models out here there and everywhere you're not locked
knocked into anybody you can try using different models weaknesses I mentioned earlier you need to
know what you're doing a bit more to prompt these things they are a bit more tricky to prompt that's
not a bad thing but it takes more work hardware you have to buy the hardware to run this I did
say it's cheap to run but you have to buy some hardware depending on what you want to do depends
how much hardware you need to buy so that's something to think about and governance if
you're paying for a sas you don't care about well you do but you're offloading the risk to somebody
else to keep that model up to date keep it secure keep it looked after if you're hosting your own
model you've got to make sure you're running it all up to date everything's patched everything's
looked after so that's a consideration and the last bit i mentioned was reasoning limits now
these things are distilled to be small and neat they haven't got lots of information about security
security, legal issues, whatever, right?
They've not got in there.
So they're not as good at,
they haven't got that large body of knowledge to play with.
So that's the downside as well.
So my last question, hand up, is anybody thinking of trying this when they go home this week?
One,
two, three, four.
I was aiming for three.
I got about six or seven there.
So I'm pleased with
that um that's it for me have i got time for questions nope not at all anybody get any
questions uh see me outside if you want and if you want if you want the slides or anything let me know
get my details there thank you very much