Evals in RAG and Agentic AI

Introduction

It's my second time here, and I absolutely love this space.

Evaluating GenAI Outputs

So we're going to talk about evaluating output that's produced by generative AI. Let's first try to evaluate something as humans.

How many of you here are English majors? One, two? OK, listen up.

A Practical Example

So we're going to use GPT-4o from OpenAI and ask it a question. How many R's are there in the word strawberry?

Sorry, we'll do it again. How many R's are there in the word strawberry? Still the same answer as before. There are two R's in the word strawberry.

Always a fun little fact to know.

English majors, how many R's are there in the word strawberry? Yeah.

Exploring the RAG Paradigm

So what we're going to talk about here is actually not responses that LLMs give out of their own knowledge, as ChatGPT just gave us. We're going to talk about the RAG paradigm, or retrieval-augmented generation.

It's where you do search first to find relevant information, and then the LLM phrases the answer based only on that information.

And that's how it can point to the bits of information it used. We call those references, and that will be important.
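
To make that flow concrete, here is a minimal sketch in Python; the `search_docs` and `llm_complete` helpers are assumptions for illustration, not the actual system shown later.

```python
# A minimal sketch of the RAG flow described above, assuming hypothetical
# `search_docs` and `llm_complete` helpers (not any real product's API).

def answer_with_references(question, search_docs, llm_complete):
    # 1. Retrieval: find the passages most relevant to the question.
    passages = search_docs(question, top_k=5)  # each passage has .url and .text

    # 2. Generation: the LLM answers using only the retrieved passages.
    context = "\n\n".join(f"[{i}] {p.text}" for i, p in enumerate(passages))
    prompt = (
        "Answer the question using only the sources below, "
        "and cite the source numbers you used.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    answer = llm_complete(prompt)

    # 3. Return the answer plus the URLs it could draw on, so references can be
    #    shown to the user and checked by an evaluator.
    return {"answer": answer, "references": [p.url for p in passages]}
```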

Understanding RAG's Mechanism

Actually, let me first show you how this RAG thing works.

This is a live example of our system being used by a company called Perforce, the DevOps company. They have a lot of technical products; Perfecto is one of them.

They have a search box where you can search across their documentation, and now you can do that using retrieval-augmented generation rather than keywords. So you can ask a question, and it will look at the documentation pages and public knowledge base articles and provide an answer, referencing the sources it used to come up with that answer. That's just to illustrate the retrieval-augmented generation paradigm.

Evaluating Responses

Let's think about how we could evaluate something like this. Let's take a question. When is the next MindStone event?

And what could the answer be? It could be that it's today. It could be this Tuesday, July 15th. Or maybe it's August 19th if you're asking the question tomorrow.

So the nature of the answer is not exact or very precise. If we want to evaluate it, we need to allow for that flexibility. One option is to provide a high-level description of what the correct answer should be.

For example, you could say that the correct answer should state that MindStone events take place every third Tuesday, or that it should state the actual date, whatever the next Tuesday is on that day. Alternatively, and this may be a better approach: what happens if MindStone changes when the events run, but keeps the up-to-date information on the website?

So you may want to specify the correct answer as a link to the page that provides it. In this case, that would be MindStone's own page, or maybe Eventbrite.

So you could specify multiple links. And as long as any of those links are used as references to generate the answer, you have a good indication that the answer is likely to be correct.
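
That check can be very small; here is a sketch with made-up URLs standing in for the real pages:

```python
def passes_reference_check(answer_references, expected_urls):
    """Pass if any expected URL appears among the answer's references."""
    expected = {url.rstrip("/") for url in expected_urls}
    return any(ref.rstrip("/") in expected for ref in answer_references)

# Hypothetical example: either the MindStone events page or the Eventbrite
# listing counts as evidence that the answer is grounded in the right place.
passes_reference_check(
    answer_references=["https://mindstone.com/events"],
    expected_urls=["https://mindstone.com/events",
                   "https://www.eventbrite.com/o/mindstone"],
)  # -> True
```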

Well, what if there are too many? If the question is so generic that you can answer it based on many different sources, then there's another technique that can be useful: take any one of the sources and say that, as long as the answer the LLM comes up with is equivalent to the answer the LLM would provide if it used just that one source as its knowledge, the answer is probably correct. But sometimes even that may not be enough, because in the collection of questions and correct answers you're putting together to evaluate your system, you may still get some answers that are just incorrect, because sometimes the question is simply too difficult.
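
One hedged way to implement that equivalence test is with an LLM judge: generate a reference answer from the single source, then ask whether the two answers agree. A sketch, again assuming a hypothetical `llm_complete` helper:

```python
def answers_equivalent(question, rag_answer, single_source_text, llm_complete):
    # Reference answer generated from just the one chosen source.
    reference = llm_complete(
        f"Using only this text:\n{single_source_text}\n\nAnswer the question: {question}"
    )
    # LLM-as-judge: do the two answers convey the same information?
    verdict = llm_complete(
        "Do these two answers to the same question convey the same information? "
        "Reply YES or NO.\n\n"
        f"Question: {question}\n\nAnswer A: {rag_answer}\n\nAnswer B: {reference}"
    )
    return verdict.strip().upper().startswith("YES")
```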

Maybe you cannot find the right information, maybe the information isn't even there, or maybe the system just isn't working well enough. So you may want to label some questions in your evaluation set as questions the system is currently failing on, or questions that you sometimes get right and sometimes get wrong. And there is a lot of that when you're dealing with LLMs.
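
In a validation set, that can just be a label on each entry; the field names, label names, and URLs below are invented for illustration:

```python
validation_set = [
    # Baseline: must be answered correctly; a failure here is a real problem.
    {"question": "When is the next MindStone event?",
     "expected_urls": ["https://mindstone.com/events"],  # hypothetical URL
     "status": "baseline"},
    # Sometimes right, sometimes wrong: record the result, don't alarm on it.
    {"question": "Who spoke at the last three events?",
     "instructions": "Should list the speakers or admit it doesn't know.",
     "status": "maybe"},
    # Known failure: the system currently gets this wrong, and that's expected.
    {"question": "How do I get a refund for a past event?",
     "instructions": "Should point to the refund policy page.",
     "status": "failing"},
]
```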

Creating a Test Case with SnapWeaver

So we'll make this a little more real now by creating a fake data set for a fake company. This is something ChatGPT helped us invent: SnapWeaver. It has a few documentation pages, the kind you'd expect from an early-stage startup.

It has an introduction page, an installation guide, and three other pages. And we used those as the source of knowledge.

So we took those five or six pages and instructed our RAG system to respond based only on those pages. And we came up with a validation set.

So we'll look at this in detail now.

Validation Sets and Evaluations

So this is a real validation set. Just to go through the questions one by one: we can test not just individual questions, but we can also test conversations.

So, for example, when we ask what SnapWeaver is, we want the answer to reference the introduction page. Let's actually give that a try. If I go into this chat interface and ask "What is SnapWeaver?", it will provide an answer and reference this page. If I click on the link, it takes me to the introduction page.

So this is how this particular question is evaluated: just based on the presence of the target URL among the references in the answer itself. And this answer, when evaluated, would get a plus one.

Then we could continue the conversation. We could ask, in the same chat, how to install it. If I asked "how to install it" outside of any context, it would not know what I'm asking about. But since I'm asking in the same chat, it knows I'm talking about SnapWeaver. And we'll be evaluating this question based on the presence of a link to the installation guide, which is this link.
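
A conversation test can be expressed the same way, with an expected reference per turn; the structure and URLs below are placeholders, not the real SnapWeaver pages:

```python
conversation_case = {
    "status": "baseline",
    "turns": [
        {"user": "What is SnapWeaver?",
         "expected_urls": ["https://docs.snapweaver.example/introduction"]},
        # The follow-up only makes sense with the chat context carried over.
        {"user": "How do I install it?",
         "expected_urls": ["https://docs.snapweaver.example/installation-guide"]},
    ],
}
```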

So it also passed this test. Then some questions may test the system's ability not to hallucinate when asked about things it doesn't know about. For example: what is the capital of China? If you ask that question, let me just refresh the window here, it should say "I don't know". And as long as it says that, or something along those lines, we will count the question as answered correctly. And here we are providing instructions: the correct answer should mention the lack of knowledge.

So there are two ways we can specify the correct answer: either a high-level description, which we call instructions here, or a link to the page that should be referenced in the answer.
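
A single evaluation step can then dispatch on which of the two is present; a sketch, with the judge prompt and `llm_complete` helper assumed rather than taken from the real framework:

```python
def evaluate_answer(case, answer, references, llm_complete):
    # Way 1: a target URL must appear among the answer's references.
    if "expected_urls" in case:
        return any(ref in case["expected_urls"] for ref in references)
    # Way 2: an LLM judge checks the answer against the high-level instructions
    # and returns a verdict that can also be logged to explain pass/fail decisions.
    verdict = llm_complete(
        "Instructions describing the correct answer:\n"
        f"{case['instructions']}\n\n"
        "Is the answer below consistent with them? "
        "Reply CORRECT or INCORRECT, then briefly explain.\n\n"
        f"Answer: {answer}"
    )
    return verdict.strip().upper().startswith("CORRECT")
```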

Handling Synonyms in Validation

Then let's take a look at another question.

This is "how to add SH", and we just invented this SH acronym for "shortcut". We decided we wanted to test how the system deals with synonyms, so I've instructed it that SH is a synonym for shortcut. And this question was actually validating that the system knows the synonym, which it does. So it responds here that to add a shortcut you should use this combination of characters.

And what we can also do is go into the evaluation output, which tells you which question was asked and what answer was provided, but also why it decided that the question was answered correctly or incorrectly. So you get this verdict from the LLM, which is often very useful for understanding what is wrong.

And sometimes the thing that is wrong is your evaluation: you did not design your validation framework well enough, so it gets confused. But sometimes it's actually pointing to some other problem, which is very useful.

So in this case, it just confirms that the answer was correct, meaning that it was consistent with the instructions.

Evaluation Framework

Then, once we have this collection of questions and answers, which is the starting point for any data set we ingest, what we do is configure our system and run the test on all of the questions.

And the output looks like this: for every question, if there is nothing in this left-hand column, it means the system is performing as expected. What counts as expected depends on the label. B here stands for a baseline question, and those questions should be answered correctly; as long as they are, there is no problem.

There could be some "maybe" questions, in which case it doesn't matter whether the answer is correct or incorrect; we're not going to be concerned either way. Or a question could be marked false, meaning an incorrect answer is also expected.

But if we have a baseline question that we did not get the correct answer to, then that's a problem, so we want to make sure we check what's going on. And we actually integrate this into CI/CD, so whenever we try to make changes, they're not going to be committed if any of the baseline questions are failing.

And conversely, sometimes a question that was failing before is now answered correctly. That's good news, but we also want to make sure we notice it, because maybe we want to promote it to a baseline question now that we've improved the system and fixed something.
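
The CI/CD gate then reduces to comparing each question's expected status with the result of the run; a hedged sketch, with the result fields invented for illustration:

```python
import sys

def ci_gate(results):
    # Each result: {"question": ..., "status": "baseline" | "maybe" | "failing",
    #               "correct": True/False}
    regressions = [r for r in results if r["status"] == "baseline" and not r["correct"]]
    improvements = [r for r in results if r["status"] == "failing" and r["correct"]]

    for r in improvements:
        # Good news, not a failure: worth promoting to a baseline question.
        print(f"Now passing, consider promoting to baseline: {r['question']}")

    if regressions:
        for r in regressions:
            print(f"Baseline question failed: {r['question']}")
        sys.exit(1)  # block the commit / fail the pipeline
```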

Interpreting Evaluation Results

This is a bit busy, but every letter here means something.

I'll just go briefly through this.

We can click on this link to go to the details of that question. Actually, maybe I'll show that.

So we can see the detailed output, which shows every message that goes back and forth when this question is asked. You can see the system prompt.

Agentic AI and Tool Calling

I'll just briefly show you something that we use here, which is along the lines of agentic AI and tool calling. We have two ways to process a question: semantic search or keyword search, and the LLM decides which tool to use.

So if it notices any special keywords, out-of-vocabulary words or codes like this one, it will invoke the keyword mode as opposed to the semantic search mode, and it will extract the relevant keyword. We can see that here: the keyword mode was selected, and the query that was passed was not "what is Q...", it was just the code itself.

And this is something we can also test using this evaluation framework: we can test that the correct tool was called for the given input. Then we can test the URL. The green U here means we were testing against the URL that was provided and it was correct, and the last digit is the rank of the correct URL in the search results; in this case it was always first.
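
As a rough illustration of that setup (the tool name, schema, model, and example code are assumptions, not the actual configuration), a single search tool can be exposed to the model with a mode it has to choose and a query it has to extract:

```python
from openai import OpenAI

client = OpenAI()

# One search tool with two modes; the model picks the mode and extracts the query.
tools = [{
    "type": "function",
    "function": {
        "name": "search_docs",  # hypothetical tool name
        "description": ("Search the documentation. Use 'keyword' mode for codes "
                        "or out-of-vocabulary terms, otherwise 'semantic' mode."),
        "parameters": {
            "type": "object",
            "properties": {
                "mode": {"type": "string", "enum": ["semantic", "keyword"]},
                "query": {"type": "string"},
            },
            "required": ["mode", "query"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is QX-1234?"}],  # made-up code
    tools=tools,
)

# The evaluation framework can then assert on the tool call itself, e.g. that
# mode == "keyword" and the query is just the code, not the full question.
print(response.choices[0].message.tool_calls)
```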

Using Detailed Evaluation Tools

When we provide instructions, it's the I letter that's active here: green if the answer is correct according to the instructions, red if it's incorrect. And if you want even more detail, you can look into tools like Langfuse, which let you see in slightly more detail what is happening between the LLM and the system. In particular, you can see exactly how the LLM digests all of the tools it can use, like the search tool with its semantic and keyword modes, and how it processes their different arguments.

Okay.

Conclusion

So I just want to leave you with this thought, and I think Joe set the scene very nicely at the beginning by comparing different points of view on LinkedIn. I think this is one of the nicest applications of LLMs for humankind, where you can get a much more balanced view from the LLM than you can from a human. So LLMs can actually help us evaluate our own views.

I would encourage you to give it a try. Take something that you feel very strongly about and ask the LLM to provide the opposite view and see how you react to that.

Thank you so much.
