Thank you so much. I'm Julie Norton, CEO and co-founder of Plum.
Plum is a developer API that evaluates and improves the quality of large language model applications. And I want to give credit to my co-founder, Prem.
Ten years ago, we were building products and models in the last generation of natural language processing. So believe me when I say we understand the problems that developers are facing today.
Now, I am sure most of the people in the audience today get LLMs. I'm sure you've used Grok or ChatGPT, and they do a great job at summarizing an article or writing basic code.
But for tech leaders, there's still a significant gap between what they're expecting out of generative AI and what they're getting, especially for B2B use cases.
An insurance company we were speaking with over the summer was building an internal tool where they would upload a 200-page insurance application so that their analysts could price risk more quickly. But only a fraction of those outputs contained the correct legal terms. So the legal team ended up blocking a wider deployment, because for them, generic outputs were a business risk blocking revenue growth.
And this was despite months of their team doing prompt engineering and RAG; the quality of the outputs just plateaued. That's why 30% of Gen AI initiatives next year are going to be abandoned: they simply don't work well enough.
And for the developers in the audience, I want you to be successful. I want you to get promoted on these new Gen AI initiatives. I want you to succeed.
But the reality is that you'd have to do at least half a dozen of these intensive data-science steps to even have a chance at improving quality for these foundational models beyond prompt engineering or RAG alone.
That's why we built Plum, a developer tool that evaluates and improves the quality of those large language model applications. And here's how it works.
Before I get into the demo, I just want to give you the high level architecture.
Based off of the specific business use case for the Gen AI model, it evaluates the system prompt and how well it's working today. Then it generates exactly the right data needed to fine-tune the model, and the process repeats. So it's not just fine-tuning once; it's an iterative, continued process.
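To make that loop concrete, here is a rough sketch of the flow in Python. The client object, method names, and threshold are assumptions for illustration, not Plum's documented API.

```python
# Hypothetical sketch of the evaluate -> generate data -> fine-tune loop.
# "plum", BASE_MODEL, SYSTEM_PROMPT, and SEED_DATA are imaginary placeholders.
model = BASE_MODEL
for _ in range(3):                                 # not a one-shot fine-tune
    report = plum.evaluate(model, SYSTEM_PROMPT)   # 1. score against the business use case
    if report.passes_threshold():                  # stop once quality is good enough
        break
    data = plum.generate_data(report, SEED_DATA)   # 2. data targeted at the failing metrics
    model = plum.fine_tune(model, data)            # 3. fine-tune, then repeat on the new model
```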
And I know the teams behind ChatGPT and Grok are working on the next generation of foundational models. The reality is that those are designed to work across a variety of business use cases.
o1 or Orion, these new ones, will never be as good at specific business use cases as fine-tuned models. It's just inherent in the architecture.
All right, demo time.
The sample use case I've defined here: I'm getting survey results from my customers, and I want a quick way to either prompt the person who gave a response for more information, or to evaluate how good that response is. Because if they just give me a response like "nothing," I don't want to put that in the good-response bucket.
And this is a really well-defined prompt. This is with prompt engineering. This is few-shot learning: me giving examples, telling the LLM how to respond. So I want you to know this is not a hobbled prompt. This is a very well-engineered prompt.
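As a rough illustration of what a few-shot prompt like that might look like (the wording and examples here are hypothetical, not the actual prompt from the demo):

```python
# Illustrative few-shot system prompt for grading survey responses.
# The instructions and examples are made up for this sketch.
SYSTEM_PROMPT = """You evaluate customer survey responses.
For each response, either (a) write one concise follow-up question, or
(b) label it GOOD if it is meaningful, relevant, in-depth, and clear.
Keep every output under 25 words.

Example 1
Response: "nothing"
Output: FOLLOW-UP: What one change would make the product more useful for you?

Example 2
Response: "The ad-placement controls saved my team hours every week."
Output: GOOD
"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": 'Response: "it\'s fine I guess"'},
]
```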
So given this, I'm still having problems with the quality of the outputs. So I'm going to show how to very quickly fine-tune a model, where you don't need hundreds or thousands of examples to get an enterprise-level customized model.
Starting with this prompt, what I'm looking for is: are the answers meaningful? Are they relevant? Are they in-depth? Are they clear?
So the first step with our API is, given the system prompt I pasted in here, to have it generate evaluation criteria. And this evaluation criteria isn't high-level metrics like faithfulness or alignment; it's defined and created specifically from that system prompt.
So from that system prompt, here are the evaluations it defined: are the responses relevant? Are they in-depth? Are they clear?
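A minimal sketch of what that first call might look like, assuming a REST-style endpoint; the base URL, path, and field names are placeholders, not Plum's documented API.

```python
# Hypothetical sketch: turn a system prompt into use-case-specific metrics.
import requests

PLUM_API = "https://api.plum.example/v1"          # placeholder base URL
HEADERS = {"Authorization": "Bearer <PLUM_API_KEY>"}

system_prompt = open("survey_eval_prompt.txt").read()

resp = requests.post(
    f"{PLUM_API}/metrics/generate",
    headers=HEADERS,
    json={"system_prompt": system_prompt},
    timeout=60,
)
resp.raise_for_status()
for metric in resp.json()["metrics"]:
    # e.g. {"metric_id": "m_relevance", "name": "relevance", "definition": "..."}
    print(metric["metric_id"], "-", metric["name"])
```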
And so the next step, once we have those generated evaluations, is to actually run them and see how well it's working. I uploaded a seed data set beforehand of about eight responses. So given that seed data here, let me just make sure this matches the metric ID.
What this will do, one second here, is measure how well the existing system is working against that seed data. This will take about 20 seconds. Using those definitions from before, it's looking at the seed data set I uploaded and scoring it on the evaluations.
Are the responses relevant? Are they in-depth? Are they clear?
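Sketched out, the evaluation step might look something like this, continuing the placeholders from the sketch above; the endpoint, metric IDs, and dataset format are all assumptions.

```python
# Hypothetical sketch: score the seed data against the generated metrics.
# Reuses PLUM_API, HEADERS, and system_prompt from the previous sketch.
seed_data = [
    {"question": "What's most helpful on the site?", "response": "nothing"},
    # ... about eight real responses in the demo
]

resp = requests.post(
    f"{PLUM_API}/evaluations/run",
    headers=HEADERS,
    json={
        "system_prompt": system_prompt,
        "metric_ids": ["m_relevance", "m_depth", "m_clarity"],  # from the previous step
        "seed_data": seed_data,
    },
    timeout=120,
)
resp.raise_for_status()
scores = resp.json()["scores"]   # per-metric pass rates on the seed data
print(scores)                    # e.g. {"m_relevance": 0.6, "m_depth": 0.4, ...}
```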
And based off of those evaluated metrics, the next step is: for the particular evaluations that aren't working well, it'll generate exactly the right data needed for synthetic data generation. So based off of the prior step, it shows the synthetic data generation here. And this isn't just created out of thin air; for the developers in the audience, I know you're going to say garbage in, garbage out.
This is generated for the particular failing metrics, based off of that real-world data. The synthetic training data has to be based off of the seed data for it to be relevant for the model.
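Continuing the same hypothetical sketch, the generation step might look roughly like this; the endpoint and the JSONL export are assumptions.

```python
# Hypothetical sketch: generate synthetic training data targeted at the
# failing metrics, seeded from the real responses above.
import json

failing = [metric_id for metric_id, score in scores.items() if score < 0.8]

resp = requests.post(
    f"{PLUM_API}/training-data/generate",
    headers=HEADERS,
    json={
        "system_prompt": system_prompt,
        "metric_ids": failing,      # only the metrics that need improvement
        "seed_data": seed_data,     # grounds the synthetic data in real examples
        "num_examples": 100,
    },
    timeout=300,
)
resp.raise_for_status()
with open("plum_synthetic_training.jsonl", "w") as f:
    for example in resp.json()["examples"]:
        f.write(json.dumps(example) + "\n")   # chat-format rows for fine-tuning
```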
So now that I have this synthetic training data, the fine-tuning itself will take about 15 minutes through the ChatGPT API, and it doesn't matter what foundational model you have. But so we don't have to wait for 15 minutes, I've already done this beforehand.
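For reference, kicking off that fine-tune through OpenAI's fine-tuning API looks roughly like this; the file name and base model are just examples.

```python
# Minimal sketch of starting the fine-tune with the synthetic data from above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the synthetic training data produced in the previous step.
training_file = client.files.create(
    file=open("plum_synthetic_training.jsonl", "rb"),
    purpose="fine-tune",
)

# Start the fine-tuning job; the model name here is an example base model.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)  # poll until it finishes (~15 minutes in the demo)
```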
And so that we're comparing apples to apples here, I have that system prompt. I'm pasting it in here. And these system prompts are synced.
So on the left-hand side, you have the unmodified, un-fine-tuned version of ChatGPT. And on the right side is the fine-tuned version that we created together here. And here, I'm going to give it a sample user response.
If you remember, at the beginning I was asking about that user feedback. I'm going to give it this sample question that I asked the user: what's most helpful on the site? And the user responded with, "I like how the product determines not to place my ads next to political posts." So I will run this.
And if we remember that system prompt I first defined, it was really important to me that it was concise. What we see here is that the fine-tuned model is not only about twice as fast, it also uses about 25% of the tokens. So not only is it cheaper, but if I were to run this not just once but hundreds of times, the overall quality improvement would be about a 50% relative improvement. So it's faster, it's cheaper, it's higher quality.
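The side-by-side itself can be reproduced with a few lines against OpenAI's chat API; the fine-tuned model ID below is a placeholder for whatever the fine-tuning job above returns.

```python
# Rough sketch of the base vs. fine-tuned comparison from the demo.
import time
from openai import OpenAI

client = OpenAI()
system_prompt = open("survey_eval_prompt.txt").read()
user_msg = (
    "Question: What's most helpful on the site?\n"
    "Response: I like how the product determines not to place my ads "
    "next to political posts."
)

def run(model: str):
    """Return latency, completion tokens, and output text for one model."""
    start = time.perf_counter()
    r = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_msg},
        ],
    )
    return time.perf_counter() - start, r.usage.completion_tokens, r.choices[0].message.content

for model in ["gpt-4o-mini", "ft:gpt-4o-mini:my-org::abc123"]:  # placeholder IDs
    latency, tokens, text = run(model)
    print(f"{model}: {latency:.1f}s, {tokens} completion tokens")
```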
And we were able to fine-tune that model together in, what was that, about five minutes?
So I believe I have about 10 minutes left for questions. Happy to see who wants to be brave enough to ask the first one.
What questions can I help answer? Yeah?
In terms of fine-tuning the model and doing the evaluation, does this also work on just a standard GPT that you've built? Like, can you evaluate a standard GPT that you've built on ChatGPT using this? So just to repeat the question: can it evaluate a standard GPT that's on ChatGPT? The short answer is yes.
So in the beginning, you talked about the insurance example, where the outputs didn't cover the specific terms, syntax, maybe taxonomy for insurance. How does it get better through the process you described? Because those terms would still be very unique to whichever use case it has to work with.
Yeah, so if I understand the question correctly, it's: how does it know how to get better on the particular domain, industry-specific use cases, or subject matter? The way it works is that before, developers were bottlenecked because subject matter experts would have to manually review every single time they made a change. Because the seed data set and evaluations are based off of those subject matter experts' judgments, the experts only need to define, evaluate, and label things once. That's what the seed data set, the evaluations, and the synthetic data generation all start from. So every time developers make a change, that original core is still based off of the subject matter experts, but they're no longer blocked on having it checked every single time.
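As a hypothetical example of what that one-time expert labeling might look like in the survey use case (the field names and labels are invented for illustration):

```python
# Illustrative SME-labeled seed records; the structure is made up for this sketch.
seed_data = [
    {
        "question": "What's most helpful on the site?",
        "response": "nothing",
        "sme_label": "bad",
        "sme_note": "No substance; should trigger a follow-up question.",
    },
    {
        "question": "What's most helpful on the site?",
        "response": "I like how the product determines not to place my ads next to political posts.",
        "sme_label": "good",
        "sme_note": "Specific, relevant, and clear.",
    },
]
```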
Does that help? Does that answer your question?
It's just a very general question.
In general, this kind of fine-tuning that you're trying to do, does it give you better results if you're aiming at a very specialized domain, or does it give you better results if you're covering a broad area? So as I understand it, does it give you better results for a specialized area or for a generalized area?
I'll say that it gives you better results if you're going for a particular kind of answer or a particular domain. So foundational models, as I said in the beginning, they are designed to be general for lots of different use cases and answer in a variety of ways. So Plum AI will work really well if you're trying to get it to answer in a very particular way across a lot of different use cases or answer in a particular way on a very targeted domain.
Just a small follow-up. I suppose if you're going for a specialized area, your ChatGPT model out of the box might do better on it compared to a more general area. Correct.
So using Plum AI, is it really worth it? If you're only going for a specialized area, does it add much value? Yeah, so I take the question as: what value does Plum AI add?
And I'll say that for the customers we work with, high-quality models are blocking revenue growth. They've figured out that raising the quality from 50% to 95% is millions of dollars in additional revenue compared to not fine-tuning the model. So for them it's just: what's the value of having a higher-quality model? And that's going to be different for every business.
Is there a bigger bang for the buck in the generalized area to use Plum AI, or...? Potentially, yeah. So just to repeat the question: where's the value? It's going to depend on the business.
I saw a hand go up in the back. And we can talk afterwards. Yeah.
I think what he's trying to ask is, does it help me as an individual? No. Yes, correct, as an individual.
Yeah, so as a business application it's very different, in the sense that it's tuned to the industry that you're working in. So it is specialized in... Yeah, just to repeat that: for individuals, would it be worth it? And the answer is no, totally.
I will say this is not for individual use cases. This works really well for the business use cases. Yes, absolutely, to what you said about the general side.
So can you explain, what do you mean, how do you define fine-tuning? Do you modify the base model, or is it more like modifying the prompt?
So for the developers in the audience, LoRA would be the actual technique, where rather than retraining the whole model you're only training a small set of additional low-rank weights on top of it. Llama makes that a lot easier, and ChatGPT allows anyone to fine-tune through their API.
So this can help developers see how much better their system prompt is working from change to change, instead of being afraid to launch it in production and wait for qualitative feedback. So the evaluation step will help with prompt engineering, but for the fine-tuning itself and the actual improvement, the product isn't writing your prompt for you; it'll just tell you how much better it is.
But the second half, the fine-tuning, is actually updating the model's weights. Does that answer your question? Yes.
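As an illustration of LoRA itself (not Plum's internal implementation; the model name and hyperparameters below are placeholders), a minimal sketch with Hugging Face PEFT:

```python
# Minimal LoRA sketch: only small low-rank adapter weights are trained,
# not the full model. Model choice and hyperparameters are examples only.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # example base model
lora = LoraConfig(
    r=8,                                   # rank of the low-rank update matrices
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()         # only a small fraction of weights are trainable
```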
Cool. So just one more question. Yeah, last one.
Yeah, so I believe you mentioned something about industry experts being involved in a sort of validation. I'm just curious about the extent of their involvement and how that works: how do you find these people, how do you define these different industries? Yeah, so to repeat the question: how do we find the subject matter experts particular to the industry?
Yeah, so I'll give an example of a company that we're working with. Their subject matter experts are internal at their company.
So when we engage with a company, their subject matter experts are the ones helping to give examples of bad outputs and examples of good outputs, and helping to define what good looks like for an output. So these are going to be people internal to the company who are already doing this.
Yeah, thank you so much. Thank you so much for having me. I'm looking forward to speaking with you after.
I will mention, if you email me, I promise I will give you a free API key, not trying to sell you anything.