Techniques to Measure and Control Output Quality in LLMs

Introduction

Thank you everyone for coming. My name is Brad. I'm an AI engineer and consultant. I've been doing this for about nine years, since I first got into deep learning.

We used Lua Torch before there was even PyTorch or TensorFlow or anything like that. I mean, there are people who are even more veteran than I am, but alright.

The Challenge of Quality Control in LLMs

The original idea I had for this talk was how to quality control LLMs writ large, and then I started making the slides and realized there's a lot there. Eventually I decided to narrow it down to a specific technique that I could actually demonstrate for you. So let's discuss that.

Generative AI is kind of hard to QA. If you look at traditional models, like a classifier, that's a pretty simple thing to analyze: you build a confusion matrix, or if it's binary you compute precision and recall. These are very simple things, right? There are only five possible outputs if it's a five-way classifier. But generative AI is outputting, say, one of 50,000 tokens at each step, and then it's outputting a hundred or so tokens in a row. The number of possible outputs exceeds the number of atoms in the known universe, or some wild number like that. So yeah, I love LLMs, they're great, but they're hard to control, and that's one of their biggest problems, and why they keep showing up in the news.

Like the bot that sold a car for a dollar. I'm not sure that example is real; I should dig into it more.

But the whole Air Canada thing, it's a whole thing.

The Need for Human Review

So how do we QA an LLM? At base, an LLM is so open-ended and can do so many different things that, really, the only way to fully control it would be human review.

Assertion Prompting Technique

So that's why I wanted to discuss assertion prompting. It's an extension of the same concept we use in double-entry accounting.

I didn't even plan to discuss that, but double-entry accounting is another interesting example, which I learned about in business school: any time we make changes to the accounting statements, we make them in two places.

But we're not doing the same thing twice. It's not like we're just recording the same number twice and repeating the same action. The change you make on the assets-and-liabilities side is qualitatively different from the change you make on the equity side, and yet the two still have to agree.

And we can do the same thing here. I could have made a mistake on one side. I could have made a mistake on the other. But what's the likelihood I make a mistake on both and they still agree?

So assertion prompts are just prompts that you run on the outputs of other prompts. So you can imagine...

Actually, let's just dive into an example. This might be too hard to read for most of you.

Example of Assertion Prompting

In this example, we have a prompt on the left that is designed to summarize a resume and come up with a one-paragraph career summary. I have some expectations for what I want from the prompt: I want it to identify keywords, and I want it to give a list of skills and experiences. That's my prompt.

It's like my code. Over here I have another prompt. This prompt is going to take the output of the first one, which is just a paragraph of text summarizing my career, and analyze it.

Hey, does it have the keywords? Does it have the skills? Does it have the stuff? It's like a unit test.

We have our original prompt, and we have another prompt that's checking its output, just like we do with code. I have like 19 slides and 15 minutes, so let's keep going.
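
To make the shape of this concrete, here's a minimal sketch of that prompt pair in Python. The model choice, prompt wording, and helper names are my own illustrative stand-ins, not the exact prompts from the slides.

```python
# Minimal sketch of assertion prompting: one prompt does the work, a second
# prompt checks the first prompt's output, like a unit test on generated text.
from transformers import pipeline

# Illustrative model choice; any instruction-tuned chat model works the same way.
generate = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

SUMMARY_PROMPT = """Summarize the following resume into a one-paragraph career summary.
Identify the key skills and the most important experiences.

Resume:
{resume}

Summary:"""

ASSERTION_PROMPT = """Here is a one-paragraph career summary:

{summary}

Does this summary mention the candidate's key skills and experiences?
Answer with a single word, yes or no.
Answer:"""

def summarize_and_check(resume: str) -> tuple[str, bool]:
    summary = generate(SUMMARY_PROMPT.format(resume=resume),
                       max_new_tokens=200, return_full_text=False)[0]["generated_text"]
    verdict = generate(ASSERTION_PROMPT.format(summary=summary),
                       max_new_tokens=3, return_full_text=False)[0]["generated_text"]
    return summary, verdict.strip().lower().startswith("yes")
```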

Addressing Reliability Concerns

So we already discussed this. It's a big problem, obviously, right?

The second prompt is going to be unreliable. If the first prompt has a 10% failure rate and the second prompt has a 10% failure rate, how can you possibly construct a system that is more reliable than the individual parts?
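
As a rough back-of-the-envelope, and assuming the two failure modes were independent, which they never fully are in practice:

```python
# Back-of-the-envelope: the main prompt produces a bad output 10% of the time,
# and the assertion prompt wrongly approves a bad output 10% of the time.
# If those failures were independent, a bad output slips through undetected
# only when both go wrong at once.
p_bad_output = 0.10
p_assertion_misses = 0.10
p_undetected_bad = p_bad_output * p_assertion_misses
print(f"bad outputs that slip past the assertion: {p_undetected_bad:.1%}")  # 1.0%
```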

Fortunately, one of my favorite classes in my master's degree was reliability engineering; I love this topic generally. But anyhow, you can imagine this as being like a noisy signal.

We have a prompt that analyzes some output, but it's unpredictable: maybe it works, maybe it doesn't, and it fails maybe 10% of the time. So we have to extract something from it that's much more useful.

One way to do that is to combine together several prompts, right? I'll get into that in a second. I have an example.

Another way is to get a more continuous signal. Discrete tokens, which are words, don't carry a lot of information: a yes or a no isn't telling you much. But if you look at the probabilities that the network is producing, you get a much more fine-grained signal.

Reducing Noise through Prompt Averaging

And the third is calibrating that signal. All right, so to reduce the noise, one way is to combine the results of multiple different prompts and basically produce an average over them. Let's look at this.

There's a bunch of different ways that you can write these different prompts. You can ask for different formats. You can just change up the words. Let's look at some examples.

So this is the original assertion prompt, and it's just checking: hey, did the summary have the outputs I desire?

Then we modify the prompt in various ways. Maybe I'm modifying the first paragraph. Maybe I'm modifying the last paragraph.

You can see the original just asked for a yes or no. In this case I'm asking for JSON. In this case, I'm rewording the keywords.

The neural network overfits to a degree. When you change the wording of your prompt, you change the vectors propagating through that neural network to some extent. Not completely, because the underlying data is still there. But to some extent you are changing the random overfitting that's happening in the neural network, and therefore you can regularize away a little bit of that overfitting. So you improve reliability by averaging over several prompts.
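
A sketch of that averaging step, assuming a hypothetical assertion_score() helper that turns one assertion prompt into a number (one concrete way to build such a helper is sketched in the sampling section below):

```python
# Sketch: run the same check through several rewordings and average the result.
# assertion_score(prompt) is a hypothetical helper that returns a number which
# is higher when the check passes (e.g. the yes-vs-no log-prob score shown later).
from statistics import mean

ASSERTION_VARIANTS = [
    "Does the summary below mention the candidate's key skills and experiences?\n"
    "Answer yes or no.\n\n{summary}\n\nAnswer:",
    "Here is a career summary:\n\n{summary}\n\n"
    "Are the candidate's skills and experiences covered? Answer yes or no.\nAnswer:",
    "Review this career summary:\n\n{summary}\n\n"
    "Reply yes if it lists the key skills, otherwise reply no.\nAnswer:",
]

def averaged_assertion(summary: str) -> float:
    # Each rewording perturbs the model's internal representations differently,
    # so averaging the scores regularizes away some prompt-specific noise.
    # (You can also vary the output format, e.g. ask for JSON, with its own parser.)
    return mean(assertion_score(v.format(summary=summary)) for v in ASSERTION_VARIANTS)
```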

Sampling and Signal Analysis

Then let's discuss sampling. The normal way to use a language model is just to generate text from it, right? I guess there's also generating images, but here you sample text.

The problem with sampling the text is... yes? With this technique, there's no training. There's just inference.

You're asking, during inference, how is the neural network tuning its weights? It isn't; it's just doing inference. Later, when these platforms use that log data to retrain the model, that's when the weights get retrained.

Let's hold that question for a bit, because I don't think I fully understood what you're asking. I'll come back to it at the end when I have a bit more time. Yes.

So if you sample the output, you're getting discrete tokens. Whereas if you look at the probabilities, you're getting a more variable signal. So it's more useful.

Let's look at what that would output. We can take the original prompt and look at the probabilities of the different tokens it output. These are log probabilities, so they're negative, which is sometimes weird to interpret. But once you get used to it: a value closer to zero means more probable, and a more negative value means less probable.

The problem is that you can't interpret these meaningfully as probabilities if you're only looking at two tokens. The model has produced a probability distribution over 30,000-plus tokens; if you pull out just two of them, it's not that meaningful to interpret their values as probabilities.
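
Here's a minimal sketch of reading those log probabilities with a local Hugging Face model. The model name, prompt wording, and the choice of " yes"/" no" tokens are assumptions; real tokenizers can split these words differently, which is worth checking.

```python
# Sketch: instead of sampling the answer, read the model's log probabilities
# for the next token. Model name is a stand-in; any local causal LM works.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.float16)

def next_token_logprobs(prompt: str) -> torch.Tensor:
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits[0, -1]        # scores for the *next* token only
    return torch.log_softmax(logits, dim=-1)     # log probs over the whole vocabulary

prompt = ("Here is a career summary:\n\nTen years of backend engineering...\n\n"
          "Does it mention the candidate's key skills? Answer yes or no.\nAnswer:")
lp = next_token_logprobs(prompt)
yes_id = tok.encode(" yes", add_special_tokens=False)[0]
no_id = tok.encode(" no", add_special_tokens=False)[0]
print(lp[yes_id].item(), lp[no_id].item())   # e.g. -0.8 vs -3.5: "yes" is far more likely
```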

But what we can do is interpret them as merely a signal that is correlated to what we want and normalize that signal. So how do we normalize the signal? Oh, this is another way to reduce noise.

Normalization and Noise Reduction

If you write your prompt in the right way, you can combine several different tokens that represent positive outputs with several tokens that represent negative ones. In the end, you're trying to turn this large language model into a binary classifier that can say: is the result good or is it bad? And then you compute the mean over several prompts.

Let me look at the output. This is what I was just saying: you can't meaningfully interpret those token probabilities when you look at a specific token in isolation, but it is still a useful signal that goes up and down, and I'll demonstrate that with the code.
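
Building on the next_token_logprobs helper sketched above, the pooling might look something like this; the particular positive and negative token lists are illustrative, not a fixed recipe.

```python
# Sketch: pool several "positive" and several "negative" answer tokens into one
# score, turning the LLM into a rough binary classifier over another prompt's
# output. Reuses tok and next_token_logprobs from the previous sketch.
import torch

POSITIVE_TOKENS = [" yes", " Yes", " good"]
NEGATIVE_TOKENS = [" no", " No", " bad"]

def binary_score(prompt: str) -> float:
    lp = next_token_logprobs(prompt)
    pos_ids = [tok.encode(t, add_special_tokens=False)[0] for t in POSITIVE_TOKENS]
    neg_ids = [tok.encode(t, add_special_tokens=False)[0] for t in NEGATIVE_TOKENS]
    pos = torch.logsumexp(lp[pos_ids], dim=0)   # pooled mass on "looks good" answers
    neg = torch.logsumexp(lp[neg_ids], dim=0)   # pooled mass on "looks bad" answers
    return (pos - neg).item()                   # > 0 means the model leans positive
```

This per-prompt score is also the kind of number the averaged_assertion sketch earlier would average across its reworded variants.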

So... Yeah, and here's the final piece.

Using Human Ground Truth Efficiently

Unfortunately, we do need human ground truth. You can't really escape having human-made ground truth. But you don't need as much.

You don't need the thousands of samples that we needed five years ago. Now you can get away with 30 samples.

I have 30 good samples. I have 30 bad samples. And I just compute those average signal values over the good samples and the bad samples. And I can choose a threshold that kind of divides them.

So again, you're basically building a binary classifier out of this large language model, all with the intent of using it as an assertion.
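
A sketch of that calibration step, assuming the hypothetical binary_score helper from above and roughly 30 human-labeled examples per class:

```python
# Sketch: calibrate a pass/fail threshold from a small human-labeled set
# (~30 known-good and ~30 known-bad outputs). Uses binary_score from above.
from statistics import mean

ASSERTION = ("Here is a career summary:\n\n{summary}\n\n"
             "Does it mention the candidate's key skills? Answer yes or no.\nAnswer:")

def calibrate_threshold(good_outputs: list[str], bad_outputs: list[str]) -> float:
    good = [binary_score(ASSERTION.format(summary=s)) for s in good_outputs]
    bad = [binary_score(ASSERTION.format(summary=s)) for s in bad_outputs]
    # Midpoint between the class means; picking the cutoff that maximizes
    # accuracy on the labeled set would work just as well.
    return (mean(good) + mean(bad)) / 2

def assertion_passes(summary: str, threshold: float) -> bool:
    return binary_score(ASSERTION.format(summary=summary)) > threshold
```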

Code Demonstration

So let's get to some code. I wrote this up quickly because I didn't have a whole lot of time this past weekend, but it basically walks through an example.

So at the beginning here, we're loading up a language model. I just used the latest Mixtral model.

One piece of advice: if you're in machine learning, buy a very good laptop. Get the extra RAM; it's worth it. I had never used the full 64 gigs, but then I downloaded this Mixtral model and thought, wow, I am so glad I bought that 64 gigs. These things are big.

OK, so anyhow, we have this code here that loads the language model. This example is a bit different. Instead of a yes or no, with this example I was trying to look at how we analyze and quality control the style of the output.

So instead of turning the model into a classifier of "is this output good or bad, did it meet my criteria," we use this prompt, which you can all probably just barely see. Can I zoom in? Let's see. Let me zoom that in; 22 point isn't enough, let's go with 30. Yeah, so you can see this prompt here.

It says, "Please choose the emotion that best represents the following text," then I give it the lead-in, then I give it the final word, "good." What we do is take the output and look at the alternative probabilities the model predicted for that final word. What did it predict for "good"? What did it predict for "bad"? What did it predict for "angry"?

And in this code, I have a list of emotions. I made this code public, so I'll share the link afterwards if you want to take a look. When you look at the probabilities over a sample data set, we can see that, again, they form a pretty reasonable distribution.

In this case, the data set is a common-sentences data set. And this is a great example: using this probability analysis technique, this is the sentence the model thinks most represents "scared." "Helpless infants have a mortal fear of being abandoned." That sounds pretty right to me; a good emotion for that sentence is scared, right?

At the same time, if we go to sentences that are two standard deviations in the opposite direction on that same "scared" signal, we find some pretty happy-sounding sentences. So again, using this technique... go ahead. Oh, five minutes? OK, I guess we should jump over to questions. Basically, the idea is to look at the probabilities, and that gives you a useful signal. It's not a perfect signal; it's somewhat noisy, and there are ways to reduce the noise. But it's a signal you can then use to quality control your models. And yeah, that's it.
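
The demo code itself isn't reproduced here, so this is only a rough reconstruction of its probability analysis, reusing the next_token_logprobs helper from earlier; the emotion list and prompt wording are assumptions.

```python
# Sketch of the demo's idea: score how strongly a sentence reads as each emotion
# by checking the model's probability for alternative emotion words at the final
# slot. Reuses tok and next_token_logprobs from earlier; emotion list is illustrative.
from statistics import mean, stdev

EMOTIONS = ["good", "bad", "happy", "sad", "angry", "scared"]

def emotion_logprobs(sentence: str) -> dict[str, float]:
    prompt = ("Please choose the emotion that best represents the following text.\n"
              f"Text: {sentence}\n"
              "Emotion:")
    lp = next_token_logprobs(prompt)
    return {e: lp[tok.encode(" " + e, add_special_tokens=False)[0]].item()
            for e in EMOTIONS}

def rank_by_emotion(sentences: list[str], emotion: str) -> list[tuple[float, str]]:
    scores = [(emotion_logprobs(s)[emotion], s) for s in sentences]
    # Standardize so "two standard deviations above or below the mean" is easy to read.
    mu = mean(x for x, _ in scores)
    sd = stdev(x for x, _ in scores)
    return sorted(((x - mu) / sd, s) for x, s in scores)
```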

Conclusion and Questions

So questions? Yeah, it's been like four years since I last spoke, so it's exhausting. Go ahead. What do you mean?

You said we have a probability distribution for one word, and then let's say we have 5,000 tokens; for 5,000 tokens we have 5,000 distributions, which sort of forms a signal. If you compare that signal to something else, what is that something else? So, the signal does not have a meaningful interpretation on its own. It's a unitless signal.

So the only thing you can actually do with that signal... what's the word? It really is only an ordinal metric, not an interval or ratio metric. So you can use it to compare two different versions of your main prompt, for example.

You could say, this version gets a quality value of, like, 2.5 standard deviations, and if I tweak it, now I'm getting 2.6. So I can compare two different versions of my prompt, or I can look at how production results differ from my laboratory data sets.

There's a lot of things that you can do with that signal in order to then, like, further control the system. It's just a matter of further processing it.

So how would it work if a model is checking another model's prompts in terms of quality? How would somebody, an external observer, know or understand the quality of the model? If I understand you right, you're asking: how does an external observer meaningfully interpret the results of the assertion prompt?

So if you've designed the assertion prompt right, you're asking the model to output a yes or no answer. No chain-of-thought prompting, nothing like that. You're asking for a gut-feel interpretation. And so the only signal that comes out of that is the token probabilities of the very next word, the one word you wanted it to predict, right?

And you cannot make any real interpretation of that number. It's like negative 15; what does that mean? I don't know.

But I do know that negative 13 is better than negative 15. So I can compare two versions of my model if I'm fine-tuning models, or two versions of my prompt if I'm doing prompts, or I can track it over time to see if I'm having data set poisoning.

There's a whole bunch of things you can do with that signal. With OpenAI, unfortunately, I've really struggled to get the log probabilities out of the latest models, so I've had to use...

Yeah, exactly right, so you have to use the offline models. I played around with a few different ones. Sorry?

Tons, yeah. I haven't meaningfully compared the models, but yes, you can run the same code; I've run it on Llama as well. And that's also probably a valid technique, another way to regularize that noisy signal, absolutely.

Say, ML models where there's a query and a search engine, right? A query comes in, then there's some query reformulation model. Here, you would replace that query reformulation with prompt reformulation through another model, right? Traditionally, the way these query reformulations have worked is that you generate several reformulations and pick the best one.

However, here what you're trying to do is... If I'm understanding you right, what you're asking is: how much does using multiple prompts actually regularize the signal? I don't have a really good answer for that. In practice, not a whole bunch.

Maybe 20 or 30% of the error gets reduced, but it's not game changing. You're not going down 90%, definitely not. It's a way to extract a bit more signal, but it's not the be-all and end-all.

Does this kind of model make the performance worse? Well, yeah, if you have to run 10 prompts on the output of every one prompt, that certainly costs performance, but you would do all of that in the back end, asynchronously. All right, should we wrap it there?
