From the event: Mindstone Warsaw June AI Meetup

Practical AI GTM Use case

Introduction: A Go-to-Market Engineer’s Role

My name is Camille. I'm a go-to-market engineer. I know it sounds fancy.

Basically, the idea of this position is that sales teams and marketing teams have so much software, and that software is a bit siloed from each other. These days, whenever you want to acquire new clients, you need systems to be speaking with each other.

So there was a need to have a developer inside sales and marketing teams who is not working waterfall-style. It's more like: this project is for this week and it's not going to be used next week, because we are part of the way we are acquiring clients. And this is my job.

I have experience in sales and engineering. I'm the guy who speaks with sales teams and marketing teams, connects CRMs with AdWords, with everything, and creates automated projects to get more clients actively, basically through reaching out.

So I am the guy who writes all those cold emails and stuff you are getting and you're pissed about. But I only write the good ones, OK? So whenever you get a good email and you say, "Oh, this guy is really sneaky," that's me. The other ones are not mine, OK?

Why AI Changed Prospecting at Scale

So the big part of it is AI, actually. Just to give you an example: five years ago, when I was a sales development representative, I was able to contact 250 people in a month, manually, with an Excel spreadsheet, LinkedIn, and so on. Right now, I can contact up to 10,000, OK? And most of that is due to AI.

But today's talk is not about this. It's about something practical.

A Quick Audience Check: Are Your AI Flows Actually Monitored?

Let me just ask you some questions. How many of you have AI projects that run on their own, without you manually starting them? Hands up: who has automated flows, meaning that something gets done while you're not in front of the computer? OK.

And how many of you check the quality of that while it is running? I don't mean that it fails, you get a bad result, and then you go to the setup and check. How many of you check, let's say every day, by running an evaluation of how well your AI projects are doing? One person there... we had four. OK. Most people don't. Is this bad? Yes and no. Basically, I think yes.

The Real Problem: Output Drift and Silent Failures

Every day, my AI automations do prospecting, meaning that they find new companies for my clients, find new people in those companies, and evaluate, meaning qualify, whether they fit our ICP or not. ICP is just ideal client profile. I have to be sure that whatever is working behind the scenes, basically my army of agents, is doing well.

The question is how to actually check that: how to do a full-time evaluation of whether the AI is doing right or wrong. I'm not talking about the specifics. You can check the input, you can check the output. I'm talking about those small changes.

Who has had a project where, whatever you're working with, OpenAI, Anthropic, Gemini, somehow the same input with the same LLM produces a different output? Who has had this kind of thing? Yeah.

You know, it's because they change the weights, because a lot of users are using it and they have to figure out how to work on performance. So the question is: if I'm using the same version of the LLM and it produces different outputs on a daily basis, how do I actually have a trigger to check that and find out whether my project, sitting somewhere on a server, is giving me faulty results?

“Is My AI Stupid Today?” and Why It Matters

And just to give you an idea of how important it is, somebody created a page called Is My AI Stupid Today? The site is called AI Stupid Level.

You can check it at aistupidlevel.info. Basically, it gauges on a daily basis which LLM is a little bit out of its intelligence today. Who has experienced that? You log in today, you do the same work as yesterday, and it feels different, like you're speaking with a five-year-old instead of a student. You've experienced that? Yeah, it's true. OK.

When you use it yourself and you see that, you're probably a bit accustomed to it. So you basically say, "Yes, please read this file, because in this file you already have the instruction, it's a specific thing," and blah, blah, blah. But when you have something on automatic, basically an AI agent, you cannot do that on a daily basis, and you don't know when it cracks. So how to do it?

A Practical Framework: Observe, Evaluate, Guard

There are three layers. The first one is observe, the second is evaluate, and the third is guard. I'm going to talk about the middle one, evaluate, because not many people think about that.

Using Ground-Truth Test Cases to Evaluate Agents in Production

OK, so let's assume I have a simple project. I have a list of companies from LinkedIn. I have specific criteria against which I check whether they are a fit, meaning whether they match our ideal client profile.

I'm doing this daily for thousands of companies, and I have a specific LLM, let's say it's Opus, Sonnet, whatever, doing the check every day. So the script is done and it's doing it right. The question is how I can check whether, let's say today at 9 a.m. or maybe tomorrow at 2 a.m., my LLM is a little bit stupid and is going to give me worse results.

One thing that is very useful is a ground-truth test: a test input and a test output. Say your project consists of some input, then there's an LLM with a prompt that you've carefully created and tested, and you know the result is great. So how about, every ten or so rounds, you provide the test input, for which you know exactly what the output should be? If the output differs a little or a lot from the one you built the project on, you get a notification that says, "Hey, your LLM is not doing great today." That's one of the things.

So basically what I do is this. Let's say I have five or ten labeled companies, meaning that I know what the result of the ICP check, the qualification, should be. Every ten or fifteen rounds, I give the agent these companies to do the same job, and I test how it analyzed them and what results it provided. If those results differ from what I know they should be, I get a notification email and the project stops. That's one thing.
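The routine described above can be sketched roughly as follows. This is a minimal illustration, not the speaker's actual code: `LabeledCompany`, `run_with_canaries`, `DriftDetected`, the `qualify` callable (standing in for the LLM call), and the `notify` callable (standing in for the email) are all hypothetical names, and the default of ten rounds is just the figure mentioned in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class LabeledCompany:
    """A company whose ICP verdict was decided by a human in advance."""
    description: str
    expected_fit: bool


class DriftDetected(RuntimeError):
    """Raised when the agent disagrees with the labeled test set."""


def failed_canaries(qualify: Callable[[str], bool],
                    test_set: Sequence[LabeledCompany]) -> list[str]:
    """Return the descriptions of labeled companies the agent got wrong."""
    return [c.description for c in test_set
            if qualify(c.description) != c.expected_fit]


def run_with_canaries(companies: Sequence[str],
                      qualify: Callable[[str], bool],
                      test_set: Sequence[LabeledCompany],
                      notify: Callable[[str], None],
                      every: int = 10) -> list[tuple[str, bool]]:
    """Qualify companies, re-running the labeled test set every `every` rounds."""
    results: list[tuple[str, bool]] = []
    for round_number, description in enumerate(companies):
        if round_number % every == 0:
            wrong = failed_canaries(qualify, test_set)
            if wrong:
                # Assumption: notify() sends the email; raising stops the flow.
                notify(f"Test set failed at round {round_number}: {wrong}")
                raise DriftDetected(f"{len(wrong)} labeled companies misjudged")
        results.append((description, qualify(description)))
    return results
```

In this sketch any single mismatch stops the run. A real setup might tolerate a small number of mismatches before stopping, which is a judgment call the talk does not specify.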

Catching Cost Cuts and Model Drift with the Same Test Harness

So, for instance, let's say you set up this kind of test input and output and you're doing this evaluation. It can help you with two things.

First, let's say you want to bring down costs, as Jacek was saying. You can just swap to a cheaper model, and then you get a notification like: accuracy dropped to 78%, precision to 0.69. In-ICP companies were flagged correctly, yes, but out-of-ICP companies were let through, meaning the model said that more companies are ICP than there should be, and that we should contact them.

It can also help you with the drift I was talking about, same-model provider drift. Let's say the weights change a little bit today, because of performance, because some big companies decided so, or maybe something else. You get the exact same email, you can stop production, basically stop the project, stop the flow, and you get information on what this is about.

So that's one thing, and that's the main thing that I think people forget about. It's very, very useful, especially if you have 24/7 agents working in the background. But so far I have only spoken about the middle layer. There are other things you can use as well.
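The accuracy and precision figures in such a notification can be computed from the labeled test set with a few lines of code. The sketch below is an illustration under assumptions: the function names are hypothetical, and the 0.90 thresholds are placeholders rather than values the talk prescribes for this check.

```python
def score(expected: list[bool], predicted: list[bool]) -> dict[str, float]:
    """Compute accuracy and precision of ICP verdicts against ground truth."""
    if not expected or len(expected) != len(predicted):
        raise ValueError("expected and predicted must be non-empty and equal length")
    pairs = list(zip(expected, predicted))
    true_pos = sum(1 for e, p in pairs if e and p)
    false_pos = sum(1 for e, p in pairs if not e and p)  # out-of-ICP let through
    correct = sum(1 for e, p in pairs if e == p)
    flagged = true_pos + false_pos
    return {
        "accuracy": correct / len(pairs),
        # Convention chosen here: precision is 0.0 when nothing was flagged.
        "precision": true_pos / flagged if flagged else 0.0,
    }


def alerts(metrics: dict[str, float],
           min_accuracy: float = 0.90,
           min_precision: float = 0.90) -> list[str]:
    """Return one human-readable alert per metric below its threshold."""
    found = []
    if metrics["accuracy"] < min_accuracy:
        found.append(f"accuracy dropped to {metrics['accuracy']:.2f}")
    if metrics["precision"] < min_precision:
        found.append(f"precision dropped to {metrics['precision']:.2f}")
    return found
```

Low precision is the case described above: companies outside the ICP are being let through, so the outreach list grows with contacts that should not be on it.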

Observability Tools and Prompt Libraries

They are a bit more programmatic. First, observe. Of course, you can set up, for instance, Langfuse. Langfuse is a great service for observing things like the costs and tokens you use, and how long a given LLM took to return the result. You can also keep a sort of library of your prompts there, just to test which ones are performing better right now.

So that's observe, but those are things that you've actually been using whenever you were creating any software.

Evaluate is the second thing, and it's the main topic of my presentation. So do test rounds from time to time, basically whenever you can. And the third thing is guard.

Guardrails: Moderation, Prompt-Injection Defense, and Output Checks

Very often, if you go to the documentation from, let's say, OpenAI or Anthropic, their APIs have some sort of moderation APIs, meaning that you can say which things can go into the prompt, into the model, and which cannot.

For example, Salesforce in 2021 introduced an AI chatbot... no, sorry, it was just an informational AI system on Salesforce. Whenever somebody got into the database through the sales form, it collected information about them and provided that information to the sales rep.

And one guy used this for prompt injection, meaning that through the sales form he put in instructions and had the whole database of clients sent to his private email, basically.
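The form-injection story suggests screening untrusted text before it ever reaches the model. The sketch below is a deliberately naive, local heuristic and not a provider moderation API: the pattern list and the name `flag_untrusted_text` are hypothetical, and a real guard would combine something like this with the provider-side tools mentioned above.

```python
import re

# Hypothetical patterns for illustration only; real attacks vary widely.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"\b(send|forward|email)\b.*\b(database|all (records|contacts|clients))\b", re.I),
    re.compile(r"\byou are now\b", re.I),
]


def flag_untrusted_text(text: str) -> list[str]:
    """Return the patterns that matched; an empty list means nothing was flagged."""
    return [p.pattern for p in SUSPICIOUS_PATTERNS if p.search(text)]


def safe_to_forward(text: str) -> bool:
    """Hold back form input or scraped text that looks like an instruction."""
    return not flag_untrusted_text(text)
```

A heuristic like this produces both false positives and false negatives, so a flagged input is better routed to review than silently dropped.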

So there are providers like OpenAI and Anthropic that give you some things that can guard. That means you can not only do evaluation tests, as I said, which is the main thing I wanted to share, but you can also guard the input a little bit. If the input is much different from what you assume it to be, I can guard against that through specific functions in the API. Let's say I scrape tons of websites: maybe on the website itself, in white font on the background, there is some prompt injection. The same goes for the output: when the output is a little bit out of what you've been expecting, you get an email and you know where to stop.

Who Evaluates the Evaluator? Choosing Stable Scoring Methods

So who grades? Code for ground truth: it's cheap, fast, and stable. Whenever a regex or anything like it is possible, use that. But of course, very often, like in emails, the tone is important, so I use an LLM for subjective calls like tone.

But then you say, "Hey, that's interesting, but if you have an LLM checking the results of an LLM, can't the second LLM, the one checking the first, hallucinate too?"

Yes, it can.
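One way to keep the hallucination risk contained is to let code decide everything it can and call the LLM judge only for what is left. The sketch below assumes a cold-email use case; `grade_email`, the specific checks, the 150-word limit, and the optional `tone_judge` callable (standing in for an LLM call) are hypothetical choices for illustration.

```python
import re
from typing import Callable, Optional

# Matches template leftovers such as {{first_name}} or [FIRST NAME].
PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\[[A-Z_ ]+\]")


def grade_email(email: str,
                expected_company: str,
                tone_judge: Optional[Callable[[str], bool]] = None) -> dict[str, bool]:
    """Run cheap deterministic checks first; ask the LLM judge only if they pass."""
    checks = {
        "mentions_company": expected_company.lower() in email.lower(),
        "no_unfilled_placeholder": PLACEHOLDER.search(email) is None,
        "short_enough": len(email.split()) <= 150,
    }
    # The subjective call is skipped when a deterministic check already failed.
    if tone_judge is not None and all(checks.values()):
        checks["tone_ok"] = tone_judge(email)
    return checks
```

Because the judge only runs on emails that already passed the code checks, a drifting or hallucinating judge can affect the tone verdict but not the objective ones.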

So, just so you don't build test upon test upon test, you can use tools like promptfoo. It's very, very good, as it's just a JSON-based analysis of the result of your prompt. You basically give it the benchmark, the result that you want to achieve, and then it checks and gives you information if, let's say, the accuracy drops below 90%.

Human-in-the-Loop for High-Stakes Decisions

Then there is the human. My previous presentation was about human in the loop, so where do you put it here? Basically, when you have very important decisions, you can also have an email that just says, "Hey, accept it or deny it." That's it.

So what's important? You cannot control the model, can you? Of course, you have features like temperature and so on, but you cannot control the model itself.

Conclusion: Monitoring Isn’t Evaluation

So pin the version, but watch the drift. Please observe what it is doing and what it is producing, based on your test results.

All I want to say is that monitoring is not evaluation. Most teams have logs and so on, but they don't have a test set. Logs can very often tell you how long something ran and how much it spent, but very often they don't give you information on the quality, which is very important in AI and LLM projects.

That's all. This is my LinkedIn, if somebody would like to connect. Any questions?
