What’s It Doing That For? – Observability for AI in Production

Introduction

So my name is Jack Rehal.

I'm going to talk to you about AI observability.

Why AI Observability Matters

So I like to say that every single AI generation is a hallucination; the question is really just how much of a hallucination.

Now, when I'm working with AI on a daily basis, this is what I think.

Maybe good, sometimes maybe shit. All right.

So my hope for this talk is either you're going to be really happy you're already doing this stuff, or you're going to think to yourselves, oh my God, I should start doing this.

So observability in our systems that we build is not new. It's just different in AI.

A Simple Demo: Generating a Joke

So let's ask AI to generate a joke. I'm over here, and we're going to ask it to generate a joke.

And sure, it's generated a joke, right?

But the problem here is that, like Whitney Houston, we're really asking: how will I know? How do I know exactly what happened?

Traces: Seeing What Happened Under the Hood

And here we can start using things like traces.

So if I refresh this, what we're going to be able to see is a trace. This tells us a lot of information about what happened in the background here, what's going on.

So you can see here things like how long did that take? How many tokens are used?

With AI, don't forget, when you're paying for these services, it's all about tokens. The more tokens you use, the more it's going to cost you.

It'll tell you things like your temperature, your model, and everything else down to that kind of level.

So this gives you the insight that you didn't really see when you were just asking it for a joke.

And it goes down to even finer levels.

Inspecting Prompts and Reasoning

For example, you can see what the system prompts were. So here you get a bit more insight into the reasoning behind why the AI generated what it did.

Okay, so let's continue.

And by the way, here I've just used one example. Is this good? Well, it totally depends on your use case.

Remember, you don't always want to use the same model. You're going to be in different scenarios.

Sometimes you want speed, sometimes you want low cost, sometimes you want reasoning. AI is going to help you there.

Open Standards and Tools

So what is that magic I just showed you, that example where I could see the traces?

Well, it's not really magic; it's just built on open standards, OpenTelemetry. There's no framework and there's no vendor lock-in there either. It's literally write once, serve everywhere.

So in these demos today, I'm using something called LangFuse, which you can self-host, and we're running it locally because we don't want to rely on Wi-Fi for the demos. But you can do the same thing with other tracing tools as well.

So you might be using something like Grafana or Jaeger or whatever. You can do exactly the same stuff there.

The only difference is that with tools like LangFuse, Arize, and LangSmith, you'll get a few more bells and whistles for AI, like we're going to see.
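
To make that concrete, here's a rough sketch of how a single traced generation can look with the Vercel AI SDK and an OpenTelemetry exporter. The package names and the `langfuse-vercel` exporter are assumptions based on my reading of those libraries; swap in whatever exporter your stack already uses (Jaeger, Grafana, etc.).

```ts
// Minimal sketch: tracing one LLM call via OpenTelemetry.
// Assumes "ai", "@ai-sdk/openai", "@opentelemetry/sdk-node" and
// Langfuse's exporter package "langfuse-vercel" are installed.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { LangfuseExporter } from "langfuse-vercel";
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const sdk = new NodeSDK({ traceExporter: new LangfuseExporter() });
sdk.start();

const { text, usage } = await generateText({
  model: openai("gpt-4o-mini"),
  prompt: "Tell me a joke about observability.",
  // One flag turns on span emission: model, temperature, token counts
  // and latency all end up on the trace.
  experimental_telemetry: {
    isEnabled: true,
    functionId: "tell-joke", // shows up as the trace/span name
  },
});

console.log(text, usage); // usage.totalTokens is what you pay for

await sdk.shutdown(); // flush spans before the process exits
```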

Observability for Tool Use and Agents

Tool use. So AI can use tools. Let's use a tool and let's see how that's captured.

So over here I'm going to hard-code the weather, but the agent has gone off and found us the weather in Tokyo. So, how does that look?

So let's refresh, and let's go to that new trace.

Capturing Tool Inputs and Outputs

So this gives us a bit more insight, and you can see here the icon shows get weather. This shows you what was passed to the tool, what the input was, and what the tool output was.

This is going to help you reason again about why it generated what it did, right? Did it pass the right information to the tool? Did it get the right output from the tool? Its generation is all going to depend on that tool.

So here you can see that again. The observability gives you insight into why it's done what it's done.
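
Here's my own minimal reconstruction of that hard-coded weather tool, not the speaker's actual code, just to show where the tool input and output come from. Field names vary slightly between AI SDK versions (for example `parameters` vs `inputSchema`), so treat this as a sketch.

```ts
// Sketch: a hard-coded weather tool. With telemetry enabled, the tool
// call shows up as its own span with its input and output attached.
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const getWeather = tool({
  description: "Get the current weather for a city",
  parameters: z.object({ city: z.string() }),
  // Hard-coded for the demo, precisely so we can see what the model
  // passed in and what it got back.
  execute: async ({ city }) => ({ city, temperatureC: 21, condition: "sunny" }),
});

const { text } = await generateText({
  model: openai("gpt-4o-mini"),
  tools: { getWeather },
  maxSteps: 2, // let the model call the tool, then answer
  prompt: "What's the weather like in Tokyo?",
  experimental_telemetry: { isEnabled: true, functionId: "weather-demo" },
});

console.log(text);
```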

Tracing Autonomous Agents

Okay, so let's take another example. Let's take an agentic example. So in this example, we're going to let AI kind of just roll by itself, right?

And I'll hit this button here. And so we're just asking it a question about some sort of investments type things, and it's going to go off and do what it wants to do.

We've given it a tool called Calculate. No idea what it's doing. It's carrying on by itself, and it's doing all these calculations, blah, blah, blah, blah, blah.

In a minute it's going to finish, whenever it's done calculating this thing, and here we go, it's got an output, and that's brilliant.

So the question is, what exactly happened? Because once you build these systems, you've got to run these systems.

Now if we refresh, we'll be able to see exactly what happened in this agent. As you can see here, it's got a little bit more complicated.

Visualizing Complexity and Steps

One of the good things about LangFuse is that it actually gives you the diagram of the complexity of what happened.

So here are all the different steps, and the calculate function, the tool call from the AI, was called seven times. Here again, we can see what was passed to and what came out of the calculate tool as well.

So again, this gives you the full insight: at the end of the generation, after it's done all these web searches and whatever else it might do, looked at RAG, looked at documents, whatever, you can see exactly what happened for the final output to be given to the user.

So that's what AI observability gives you.

From Traces to Sessions to Users

You might have seen this; in 20 minutes it's kind of hard to see, but we've got spans at the lowest level. Above that, we've got a trace. Every time I hit the button to generate a joke or whatever it might be, that's a trace.

Within a session, you can have many traces, and a user can have many sessions. Just to show you an example of that: if we go over here to users and I filter to, say, the last six hours, you can see that we've got a demo user here. And if I click on that, I can see all of that demo user's sessions and traces for the last six hours.

So that's giving me insight into what's happening and why. And using that data, you can correlate it.
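
Here's a small sketch of how traces end up grouped into sessions and users: you tag each generation with a user and session identifier. The metadata keys `userId` and `sessionId` are my understanding of how Langfuse's Vercel AI SDK integration maps them; other tracing tools will have their own conventions, for example OpenTelemetry attributes.

```ts
// Sketch: tagging each generation so the tracing tool can roll
// traces up into sessions, and sessions up into users.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

async function answer(prompt: string, userId: string, sessionId: string) {
  return generateText({
    model: openai("gpt-4o-mini"),
    prompt,
    experimental_telemetry: {
      isEnabled: true,
      functionId: "chat-turn",
      metadata: { userId, sessionId }, // spans -> trace -> session -> user
    },
  });
}

// Every button press is one trace; many traces share a session;
// one user can have many sessions.
await answer("Tell me a joke", "user123", "session-2024-10-01-a");
```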

Correlating Experience with Behavior

If a user complains about their experience using a product, you have the information to find out why. So you could say to yourselves: oh man, this user123 is the most engaged with the chatbot, they must absolutely love it.

And then you pull that data down, look into it, and find out that user123 is actually swearing a lot and keeps asking to chat to a human, right?

And this is the insight you can get, right? So if you just count quantity of data, like number of engagements, that's pretty useless.

What you can actually do from LangFuse, or any of these systems, is pull the data down locally, run some summarization on it, try and find out what's actually going on, and then you'll find things like this.
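
A sketch of that "pull it down locally" step is below. The endpoint and query parameters (`GET /api/public/traces` with basic auth, `userId`, `fromTimestamp`) are my reading of Langfuse's public REST API, so treat them as assumptions and check the docs for your version.

```ts
// Sketch: pulling one user's recent traces down so you can analyse
// what they were actually saying, rather than just counting engagements.
const baseUrl = process.env.LANGFUSE_HOST ?? "http://localhost:3000";
const auth = Buffer.from(
  `${process.env.LANGFUSE_PUBLIC_KEY}:${process.env.LANGFUSE_SECRET_KEY}`,
).toString("base64");

const since = new Date(Date.now() - 6 * 60 * 60 * 1000).toISOString();
const res = await fetch(
  `${baseUrl}/api/public/traces?userId=user123&fromTimestamp=${since}&limit=50`,
  { headers: { Authorization: `Basic ${auth}` } },
);
const { data } = await res.json();

// Crude signal for illustration: how many conversations ask for a human?
const asksForHuman = data.filter((t: any) =>
  JSON.stringify(t.input ?? "").toLowerCase().includes("human"),
);
console.log(`${data.length} traces, ${asksForHuman.length} asking for a human`);
// From here you could feed the raw conversations into a summarization
// prompt instead of just counting them.
```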

Linking Analytics and Conversations

On a positive note, if a user has a positive experience, like they've actually bought something or subscribed to something, you can go the other way.

So in PostHog, or whatever analytics tool you use, you've captured the conversion for a particular session ID, and you can go and have a look at what conversations that user might have had with the AI, to help you celebrate something rather than always being miserable.

Scaling Observability Safely

The question here is,

if you've got thousands of users and you've got thousands of traces, are you going to capture absolutely everything? That's something you definitely don't want to do.

Sampling Strategy

Normally, you use something called sampling. You're going to say something like: we'll capture 25 percent of our activity, because that gives us a good idea about what people are chatting about.

Should there be an error, we definitely want to capture that. So it's 25 percent of the normal stuff, the good stuff that happens, which is pretty boring.

Any errors that happen, any anomalies, we definitely want to capture that so we can investigate that later on.
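
Here's a tiny sketch of that sampling rule in plain code: keep roughly 25 percent of ordinary traffic, but always keep errors and anomalies. How you plug it in depends on your tracing setup (an OpenTelemetry sampler, a flag on the trace, or simply deciding whether to export).

```ts
// Sketch of the "25% of normal traffic, 100% of errors" rule.
const SAMPLE_RATE = 0.25;

function shouldCaptureTrace(opts: { hadError: boolean; isAnomalous?: boolean }): boolean {
  if (opts.hadError || opts.isAnomalous) return true; // always keep the bad stuff
  return Math.random() < SAMPLE_RATE;                 // 25% of the boring stuff
}

// Usage: decide per request whether to export the full trace.
if (shouldCaptureTrace({ hadError: false })) {
  // export / flush this trace
}
```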

Privacy Considerations

The other thing you should think about is privacy here. Do you want to log absolutely everything? No.

So there are tools and frameworks out there which will allow you to say something like: no, we don't want to capture the input, or we don't want to capture the output.

We definitely want to capture how many tokens we used, definitely want to capture other details to help us build reliable systems,

but some things are private and we don't want to capture those in our logs.
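
Here's a sketch of that trade-off: keep the operational data (tokens, latency, model) but drop the actual prompt and completion text from the trace. The `recordInputs` and `recordOutputs` flags are my understanding of the AI SDK's telemetry options; other SDKs offer masking hooks that do the same job.

```ts
// Sketch: capture cost and latency, but don't log private content.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const { usage } = await generateText({
  model: openai("gpt-4o-mini"),
  prompt: "Summarise this customer's medical history: ...",
  experimental_telemetry: {
    isEnabled: true,
    functionId: "private-summary",
    recordInputs: false,  // don't log the prompt
    recordOutputs: false, // don't log the completion
  },
});

console.log(usage.totalTokens); // still captured for cost tracking
```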

Measuring Quality and Joy

So the other question is, how do you know if your AI products bring you joy? And here again, AI observability can help.

Explicit Feedback in Traces

So one of the things that you guys might have seen is that when you're chatting with something, if we go back to our main chat and we ask for a joke again, we can have these feedback buttons and we can say something like, great joke.

What's going to happen here is that when we go to our tracing tool, or anything like that, and we do a refresh on our traces, what we're going to see on this "tell me a joke" trace is that we've captured the user feedback. You can see here that we've actually captured it. And again, you can do these things yourself.
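
For example, here's a rough sketch of wiring a feedback button to a score on the trace, using the Langfuse JS SDK. The `score` call and its fields are my reading of that SDK; other tools have equivalent "annotate the trace" APIs.

```ts
// Sketch: thumbs-up / thumbs-down -> a score attached to the trace.
import { Langfuse } from "langfuse";

const langfuse = new Langfuse(); // reads the LANGFUSE_* env vars

export async function recordFeedback(traceId: string, thumbsUp: boolean, comment?: string) {
  langfuse.score({
    traceId,
    name: "user-feedback",
    value: thumbsUp ? 1 : 0, // binary, which also makes later analysis easy
    comment,
  });
  await langfuse.flushAsync(); // make sure it leaves the process
}

// e.g. the "great joke" button:
// await recordFeedback(currentTraceId, true, "great joke");
```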

You can run a cron job at six o'clock every morning, go and get yesterday's traces with negative feedback, bring them down to your local systems, have a look at them, and find out what's going wrong.

Automatic Evals on Runs

The sad truth about these thumbs-up and thumbs-down buttons, though, is that nobody interacts with them; they really do get ignored. So what else can you do? Evals.

So I ran that trace and what we've got is a system where once we get the response back to the user, we actually run something else after that.

So if we go back to here and we refresh the screen here, what you'll see here, and it's easy to see if I go across, is we ran evals on this trace automatically for us.

And here we've got things like a hallucination score. And it says something like: the response contains classic jokes with no made-up facts. Well, it's a joke, right?

But you can also see helpfulness, latency, response completion, response tone, and everything else in there. The user feedback is there too.

So this has happened automatically for us. We didn't do a single thing, and we've already got this analysis on our traces if we want it.

This is the power of observability once something happens within our systems.
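
Here's my own minimal version of that pattern (not the exact evaluator from the demo): once the response has gone back to the user, an LLM-as-judge scores it and the result is attached to the trace. The schema, score names, and the Langfuse `score` call are illustrative assumptions.

```ts
// Sketch: an automatic eval that runs after the user already has their answer.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
import { Langfuse } from "langfuse";

const langfuse = new Langfuse();

export async function judgeResponse(traceId: string, question: string, answer: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({
      helpful: z.boolean(),
      hallucination: z.boolean(),
      reason: z.string(),
    }),
    prompt:
      `Question: ${question}\nAnswer: ${answer}\n` +
      "Judge whether the answer is helpful and whether it invents facts.",
  });

  // Attach the verdicts to the same trace the user interaction produced.
  langfuse.score({ traceId, name: "helpfulness", value: object.helpful ? 1 : 0, comment: object.reason });
  langfuse.score({ traceId, name: "hallucination", value: object.hallucination ? 1 : 0, comment: object.reason });
  await langfuse.flushAsync();
}
```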

Designing for Engagement and Simplicity

Last note about this is engagement and conversion.

Really, when you're building your systems, if you're building a support app or whatever it might be, you should have a number in your head.

Interaction Budgets and UX Fit

There should be a maximum of, say, five interactions with the user. Any more than that, and have you really built the right system?

Are you giving the user enough information to get what they want and get out of there?

So again, observability can help here by telling you, is it matching your requirements of what you thought the user would be doing?

Architecting for Observability

I'm going to skip this part about how to architect and design for observability, because we haven't got enough time for it.

The only point I do want to make here is this:

When Not to Use AI

sometimes you don't want to use AI. You really don't want to pay the cost of inference. So, if you can, work out beforehand whether you need to use AI or not.

Compose, Don’t Monolith

And then when you are using AI, build composable systems. So don't get one thing to do way, way too much. Build things separately.

So you can generate text in this example, and then we can generate an object to get structured output.

This gives you the benefit of using different models, few-shot examples, and getting one thing to do one thing really well, rather than having a general-purpose agent.
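
A rough sketch of that composition, under the assumption that you're on the Vercel AI SDK: one small step drafts text, a separate step turns it into structured output, and each step can use a different model. The prompts, models, and schema here are purely illustrative.

```ts
// Sketch: compose two small steps instead of one monolithic agent.
import { generateText, generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// Step 1: a cheap, fast model drafts the answer.
const draft = await generateText({
  model: openai("gpt-4o-mini"),
  prompt: "Summarise the refund policy for a customer in two sentences.",
});

// Step 2: a separate, single-purpose step extracts structured output.
const { object } = await generateObject({
  model: openai("gpt-4o"),
  schema: z.object({
    summary: z.string(),
    refundWindowDays: z.number(),
  }),
  prompt: `Extract the key facts from this answer:\n${draft.text}`,
});

console.log(object);
```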

Guardrails: Preventing Bad Outcomes

Guardrails, super, super important. And you shouldn't ever, ever skip this.

We've all seen the headlines where we laugh about AI because it has done something stupid or said something like you can put glue on a pizza, or whatever it is. Crazy stuff, right?

So we were preoccupied with whether the AI could say it; we didn't stop to think if it should. It's hard enough as it is, right? It's so random; you know, we type in the words "are you sure?" and we'll get a completely different response.

Fail Fast and Recover

So my tip here is that as soon as something goes wrong, as soon as something doesn't look right, just stop. Because after that, if something doesn't look right, the next thing definitely won't look right, and the further you go, the worse it gets. So as soon as you see something, definitely, definitely stop.

So you can choose what you want to do here. You can either stop the execution, or you can tell the LLM to continue and provide it some guidance.

Guided Continuations

So in the previous example, we said respond with the person's name. So here, if we recognize in the output guardrail that the person's name wasn't there, what we do is tell the agent again: hey, by the way, you definitely need to put the person's name in there.

Doing guardrails will keep things compliant, it will keep everything nice and happy, and you don't have to worry too much about it.
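
Here's a generic sketch of that "continue with guidance" option, not the API of the speaker's guardrails library: if the output check fails (here, the person's name is missing), re-prompt once with the missing requirement spelled out instead of silently returning a bad answer.

```ts
// Sketch: guided continuation after a failed output guardrail.
import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

async function answerWithNameGuardrail(prompt: string, personName: string) {
  let { text } = await generateText({ model: openai("gpt-4o-mini"), prompt });

  if (!text.includes(personName)) {
    // Fail fast, then retry once with explicit guidance.
    ({ text } = await generateText({
      model: openai("gpt-4o-mini"),
      prompt:
        `${prompt}\n\nYour previous answer was:\n${text}\n` +
        `It must mention the person's name ("${personName}"). Rewrite it so it does.`,
    }));
  }
  return text;
}
```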

So I'm going to quickly show you a demo. Oh, a question? Will you put your hand up? Yes. Oh man, absolutely, let's do it now. It's okay. Oh, right. Yes.

Live Guardrail Demo

So can I show a practical example of guardrails? And that's exactly what I was going to do. So we can, right? Okay.

So here, I put a guardrail here of prompt injection. And so when I run this, you're going to see that it got blocked.

So here, the PII is fine, but we've got prompt injection happening here. I'm just going to run this one as well.

And then we'll look at our traces. So if we look at our traces for this, and we refresh, we've got a guardrail check happening here. And you can see it's captured it.

And it said the request was blocked by input guardrails, right? So we've actually captured that this is happening. So you definitely need to take care of this stuff.

Alerting and Output Constraints

And when something like this happens, you can get alerted in your Slack, your email, or whatever you want, right? But the fact of the matter is, it was stopped from doing that.

And if I refresh it again, you'll see the output guardrail happening here. And here, for the output guardrail, I actually want to say something like: it shouldn't output anything with "forward". And you can see here that the output contains the forbidden term, "forward". So this is an example where you don't want to talk about competing companies; you just want to strictly keep it to your own. So you can definitely put guardrails in there.

Framework-Agnostic Patterns

Now, this is just an AI pattern in general. You can absolutely do this with any framework you want. It's just a pattern.

So, input guardrails: I've built a library for this called AI SDK Guardrails, built on top of Vercel's AI SDK.

So just like I said, if you don't want to pay the cost of AI, decide before you send it, and don't send it. It's going to save you money.

Then once AI has generated something, you can definitely do an output guardrail.

The output guardrails can come in two flavors.

Deterministic vs. Model-Judge Checks

Deterministic ones, where you can search for particular words like "hack" or "kill" or "knife". You definitely don't want those in there.

You can also use an LLM as a judge. Give the output to an LLM to judge whether it's relevant, or whatever else matters to you.

Just a note on that, though: you've got to decide. The more guardrails you have on the output side, the more you need to balance speed, cost, and quality.

It's totally up to you, there's not one solution. It depends on your use case, and you might have a mix and match of these going on, if that makes sense.
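
Here's a sketch of the two flavours side by side: a cheap deterministic check for forbidden terms, and an optional LLM-as-judge check you only pay for when you need it. The function names and word list are illustrative, not from the speaker's library.

```ts
// Sketch: deterministic check vs LLM-as-judge for output guardrails.
import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const FORBIDDEN = ["hack", "kill", "knife", "forward"];

// Flavour 1: deterministic. Fast, free, predictable.
export function containsForbiddenTerm(output: string): string | undefined {
  const lower = output.toLowerCase();
  return FORBIDDEN.find((term) => lower.includes(term));
}

// Flavour 2: LLM-as-judge. Slower and costs tokens, but catches things
// a word list never will. Balance speed, cost, and quality.
export async function judgeRelevance(question: string, output: string) {
  const { object } = await generateObject({
    model: openai("gpt-4o-mini"),
    schema: z.object({ relevant: z.boolean(), reason: z.string() }),
    prompt: `Question: ${question}\nAnswer: ${output}\nIs the answer relevant and on-topic?`,
  });
  return object;
}
```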

Evals: Do Them Right

Evals.

I had a super demo for you, but instead I'm going to tell you about one of the biggest mistakes I've made.

Don’t Automate Too Early

I tried to automate everything from the start. You do want automation, but let that be the goal. Don't start off with it.

I went off and I found loads of off-the-shelf evals. For example, if we scroll in here: conciseness, context relevance, correctness, hallucination, helpfulness.

Avoid Off-the-Shelf Trap

Oh my god, I was like a kid in a candy shop, and the biggest mistake: I picked them all. I was like, gimme, gimme, gimme. Don't do this. As the quote here says: all you get from these prefab evals, if you don't know what they actually do, is that in the best case they waste your time, and in the worst case they create an illusion of confidence that is unjustified.

The only way to do this properly is to listen to the advice of Shreya Shankar and Hamel. What they say is: always start with error analysis, and put some time in with a domain expert. For example, if we were in LangFuse right now and we looked at this trace over here, we'd scroll down to the bottom of it, have a look at it, and say, okay, that was the trace. Oops, let me refresh that. So if we went down to, say, this chat here, we'd look at it and we'd say something like: ah, I'm going to put a comment here, like, this wasn't very good.

This was great. And I could send that up. So here's where you want to do it. And this is what they talk about as well.

A Practical Evaluation Workflow

So their framework, just very, very quickly in the time we have: you start off with traces. A minimum of 100.

You do open coding, like we did there. Look at every trace. Mark it. Put your comments on it. With the domain expert.

I'm an engineer; I work with people who know their domain. I can guess at it, but they really do know the domain.

Once I've done that, we give all that to an AI to go and cluster these failures. So we've got all these comments now, we need to aggregate them and group them. We can use AI to do that.

After we've got that, we've got three or four categories, and we use scores. Shreya and Hamel say: use 0 or 1, don't use 0 to 10, that's just crazy. 0 and 1 mean fail and pass, right? That's all you want.

From that, you create your dataset and you run your experiment. If your evals agree with you 90 percent of the time, then you've got automatic evals.
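
That last step is just a comparison of verdicts; here's a small sketch of it. The type and threshold are illustrative: 0/1 labels from the domain expert against 0/1 labels from your automated eval, trusted only once agreement is high enough.

```ts
// Sketch: measure agreement between human labels and automated eval labels.
type LabelledTrace = { id: string; humanLabel: 0 | 1; evalLabel: 0 | 1 };

function agreementRate(dataset: LabelledTrace[]): number {
  const agreed = dataset.filter((t) => t.humanLabel === t.evalLabel).length;
  return agreed / dataset.length;
}

const dataset: LabelledTrace[] = [
  { id: "trace-1", humanLabel: 1, evalLabel: 1 },
  { id: "trace-2", humanLabel: 0, evalLabel: 1 },
  // ...at least 100 open-coded traces in practice
];

const rate = agreementRate(dataset);
console.log(`agreement: ${(rate * 100).toFixed(0)}%`);
if (rate >= 0.9) {
  console.log("Good enough to run this eval automatically.");
}
```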

So like I said, don't start with off the shelf evals. They are not relevant to your product. They are just general purpose, right?

You need to spend your time to create your own evals for your own domain, for your own products. That's how it works.

RAG Observability and Experimentation

Just quickly on RAG, before we run out of time: we can't do that demo, because we're running out of time, but here are some things to look for.

Chunk and Context Quality

The traces will tell you whether the chunks are right, whether it picked out the right information for the context for that generation or not, right?

So this will really give you insight into whether you need to tweak things, change things, and whether it's working the way you expected.

Running Experiments Across Variants

LangFuse can do experiments for you as well. So if you ever want to change something like a model, a prompt, or do something else like that, you can totally run experiments and everything there as well.

And it'll do comparison scores for you and everything else you want.

I keep saying LangFuse, but all the tools have really similar features. Just go and use them.

Observability Beyond Your Stack

The other part you might be thinking is, Jack, this is fine.

Listen, I own the whole stack. I own everything that I'm building. That's absolutely fine.

Watching MCP and External Tools

But what can you do when it comes to an MCP?

So here's a demo of using apps within ChatGPT, which they launched a couple of months ago.

And here I've got a user asking for their payment details from my system. That's not my UI. So how do I do it?

Well, again, observability can help us here. I recognize that this MCP tool call is coming into my system. I can see what was passed into the tool and what we sent back, right?

Again, I need to make it good, right? People are relying on our systems to make it good. We need to know what's happening and ensure that we can do that.

Conclusion

Observability is key in an ever-changing world. We always get new frameworks, new languages, new models, and everything else, right?

Observability is the only thing that can really help you ensure that everything stays up to the quality you expect.

From Guesswork to Clarity

So just to finish off, we went from Whitney Houston asking us "How Will I Know" to Johnny Nash's "I Can See Clearly Now".

Thanks.
