Building better LLM products with evals and analytics by Henry Scott-Green

Introduction

So, I'm Henry. I'm with Context AI. Great to meet you all. Amazing to see so many people in the London AI ecosystem coming out.

The Role of Context AI in LLM Applications

At Context, we build evals and analytics tools for LLM applications. What does that mean?

Understanding the AI Product Development Cycle

Well, we see LLM application development similar to all AI product development and all product development in general as being pretty cyclical. It's a journey that you take your product on.

You start by building something, a rough MVP. Then you want to test it. You want to evaluate it. You want to see if the thing you've built is any good.

That probably takes some iteration. Once you've concluded that you're happy with what you've built, you think your MVP is going to cut the mustard, you're ready to put it out there, you launch it, probably to a small group of users.

The Importance of User Feedback and Analytics

And this is where the real magic happens. You get real user feedback.

And this is where analytics come in. This is how you track how people are using your product. you see what are the main patterns, you know, what are people coming to use this thing for, where are they having problems, and how well is the product performing.

That feedback will help you make a better application because you really want to build something that addresses the needs of real users.

Yeah? Yeah. Yeah, absolutely.

Differentiating Between the Model Layer and the Application Layer

So when we're speaking, it's a great question. When we're speaking about LLM product development, we're really talking about the application layer, not the model layer. So I would be surprised if anyone here is training their own large language model. I think what most people are doing is building LLM applications, and they're really where we play in the stack.

And so what you do is you kind of iterate around in a cycle, you evaluate a release, you launch it, you analyze it, and then you go and you rebuild it and kind of iteratively improve. And it's a long journey that involves a lot of iteration and kind of process.

Challenges in LLM Evaluation Workflows

So what do we do specifically? Well, we see an opportunity and a problem in that existing LLM evaluation workflows are pretty broken. They're very ad hoc, they're very manual, and they typically involve people doing a lot of manual testing, just running scripts and seeing if they prefer the kind of feel of one model or one prompt versus another.

This leads to pretty inconsistent and ineffective testing and that means that performance isn't thoroughly assessed and bad problems can come up in production. We probably all saw like Air Canada's chatbot this week was hallucinating and putting out some legal liability by making up refund policies for them. And so we see a big problem with current eval workflows. They seem to be a pain for a lot of people in the ecosystem. And they're what we try to help people to solve.

Live Demonstration: Evaluating LLM Prompts

So today we're going to do a live demo. So keep your fingers crossed for me. And we're going to try and evaluate one LLM prompt versus another.

Then once you've done your evaluation, performance with real users is the ultimate test. This is where you're looking to see how are people using your product, how well is your product meeting their needs, and where is it falling short so that you can find an opportunity to improve it.

So enough with the slides. We'll go to live demo now.

Setting Up the Evaluation Platform

What we've got set up here is a RAG application built using Langchain. So I hope I don't lose too many in the audience here, but I wanted to get quite specific. So we've just got a CoLab here, which is going to generate a series of prompts and then run those prompts through a RAG application to pull relevant context windows. And then we're going to provide those context windows via our API and then upload them to our evaluation platform so that they can be assessed for performance.

So I'm just going to run this now. Yeah, of course.

So there's a whole bunch here. We're using Langchain and OpenAI. And then we just have a couple of functions that generate a test set, a test case, and then format prompts. And I can share the link to this afterwards.

And then we're generating a vector store, obviously. So then what we do is we upload this to our application.

What did we do? This. OK, great.

And so now what we've done is we've created a test set, which is a group of test cases. We've uploaded them to our evaluation application. So now we're going to refresh this. And we're going to see we've got a new test set version.

Running Test Scenarios

So here you can see a whole bunch of test scenarios which we want to evaluate the application with that we just created. So what we can do is run these. And the idea is here we've taken a whole bunch of tests that we're going to put through our application. We're going to first retrieve the responses that are generated by the application. And then we're going to evaluate the responses.

This kind of testing is kind of how you systematically see the performance of your application today, and then you can compare the impact of a change that you make to that application in the future.

Analyzing Test Results

So what's happened here? We've generated responses to a series of questions, and then we've evaluated these responses with a series of evaluators. So here we can view the full results.

And so what you can see here is this use case is supposed to be an airline customer support system. But you can see here I've got a couple of examples where the RAG system is going to address the question and answer relevant questions such as how does security work, What food can I bring through customs? How can I find my booking reference?

But then additionally, we've got some off-topic questions here, which the application probably shouldn't respond to. And so here you can see that actually in all three cases, the application has generated a response.

Here you can see the Okay, that one has actually failed. Typical, typical. Okay, here you go, perfect.

So we've got an example of a question, which is, what is Rick Astley's most famous song? And then we can see the application has generated a response, never gonna give you up. And that's not what we want our application to do.

Iterating and Improving LLM Application Responses

So we're gonna shift back now to our prompt templates. We're gonna regenerate this. And we're gonna add a line to the prompt template saying, only reply to questions about airline customer support requests. So we're gonna save that prompt.

We're going to regenerate all the individual test cases. We're going to submit them. This might take a second. Then we're going to upload them.

Fantastic. Now if we refresh this, we should. OK, we're going well so far. So now we've got version 6.

And if we click into one of these, we can see, We've got an additional line that's been added to the prompt here. Only reply to questions about airline customer support requests.

So this is a pretty naive example where we're just adding some additional string to the prompt to try and get the application to stick to the given context. But this could be a much more advanced use case where you're trying a different rag system, different chunk sizing, or any number of changes to your application.

So now we're going to rerun this. And again, what happens here is we generate the responses, and then we evaluate those responses. So we can see the generated responses are just here on the right. And we just wait for it to go through. Fantastic.

Okay, great, now here we can see everything's gone green and that means that the system has actually refused to respond to these off-topic questions and has said, I'm here to help with airline customer support queries instead. So we've managed to get the RAG application to stick to its provided context and only talk about the appropriate tasks without wandering too far off-topic.

Evaluating Changes to LLM Applications

So that's basically how you can evaluate changes to your LLM application. You might want to do that if you've got something launched in production and you're concerned about regressions. Something that you change might break a group of test cases or a group of use cases that's really important to you.

But there's a lot of other reasons you might want to do this too, including stress testing your application before you launch it. So the Air Canada example from this week, that's an example of something that probably wasn't thoroughly tested before it was pushed out to production. And so what you can do is define a whole bunch of adversarial test cases, kind of as we did here just with these off-topic questions.

But these could be trying to do prompt injection, prompt hijacking, or any number of adversarial use cases. And then once you've got them defined, you can consistently run them through your application and ensure that nothing is slipping through. So that's the evaluation side of the product.

That's how you get things set up to test before you make a release.

How are we doing for time? I could do a bit more. We've got a few minutes? OK, great.

Analytics in LLM Application Development

So then once you launch the production, then this is where analytics come in.

And this is where you want to understand, how are people using the application at scale? And how well is it performing?

Logging User Interactions

So to do that, what we see as best practice is to log all of the transcripts the users are having with your chat products. You've then got these logged here with the prompt, the user inputs, and the assistant responses.

What you would then do is annotate those with feedback signals. For example, here you've got a thumbs up rating, or here you've got high user input sentiment. These tell you on the per conversation level, has the application performed well, and is the user satisfied, or is the user disappointed and frustrated?

And we really see those user feedback signals as being the most important way for you to know how well your application is performing. Kind of hypothetical evaluation, it goes so far and it's the best thing you can do before you get real user feedback. But once you have real users in the application, it's much more powerful to be looking at their feedback than anything else.

Clustering Transcripts and Feedback Analysis

So to make that much more useful, we cluster these transcripts and group them together by use case. Because most likely, when you have something launched here, you have thousands, tens of thousands, hundreds of thousands of transcripts that are being generated on a weekly basis.

What you then want to do is break those transcripts down into groups. Because most likely, some of the uses of your product are being handled really well. And there's some edge cases you haven't previously thought of that maybe are being handled more poorly.

And so what we do is we assign these tags and group all the transcripts that occur, and then we let you look at the success of each of the use cases in your application. So here you can see queries about online shopping versus competitor mentions, Nike, running shoes, specific use cases that you have in the application. And then what you can do is you can see the sentiment and the user feedback rating for each of these topics.

So you can really see how well the application's performing, where you can double down on a strength, and then additionally where the application's performing poorly and you can make it better. You can kind of click through any of these. Let's say we want to look at queries about Nike. And then we can get a whole bunch more examples of data that relates to specifically this use case.

Conclusion: The Importance of Evaluations and Analytics in LLM Development

And we think this is actually a really important way to think about LLM application development because the surface area of an LLM product is so enormous, so many people can use it for so many different things, that you really want to break that down into the varying use cases and look at each use case independently so that you can find those poor performing areas, bring them up, and then you can find those areas that are performing really well. double down on them or promote them in your marketing. So I think that's high level of how we see analytics and evaluations coming together to help with LLM product development.

How are we doing for time? Have you got time for some questions? Are we good?

Q&A Session

Cool. Any questions? Yep.

Yeah, so the assignment of the topics is all using either LLMs or keyword matching or semantic matching. So this is all automated, this classification. And then what you do is go through and identify where these are performing well or poorly. And we do have insights that will alert you where you have an area of poor performance or an area of strong performance.

You'd get them popping up at the top here. But yeah, we try to make that automatic, especially our customers that are at scale. You don't want to be going through line by line looking at different logs.

Yeah? Exactly. Yes.

So it depends how you want to implement it, but that's what most people are doing, is they log in near real time. So an interaction completes, you have a transcript, and then you log it, and then it's analyzed. And that repeats, you know, for as many chat conversations as you have over the course of a day or a week.

Yep? Oh, sorry. Yeah.

Yep. Totally, yeah, that's a great question. So it's not something that we specialize in, but I think the best practice would be to make sure that your document database that you're using for your RAG system doesn't contain that kind of information. Because if you've got that kind of sensitive information in your document corpus that is being referenced by the LLM, it's very hard to guarantee that it isn't going to be pulled out.

I think there are tools that will let you put access controls or ACLs on those document databases. But I'm not super familiar with them. But I think if you're trying to make use cases that are very sensitive, like having salary information queryable for different people, you probably would want to look into a tool like that. That's a good question.

Yeah, we're good? Yes, absolutely. Cool.