Unlocking GenAI: Building Transparent and Monitored LLM Solutions

So all right.

Introduction

Hello, my name is Rogerio. I'm the CTO and co-founder of Lengwatch. And today we're going to talk a little bit how can you improve your AI quality. So for all of you who are here, I guess, are working with AI.

And just before we start, I wanted to talk a little bit about the team, who we are. So this is me, Rogerio, Manuk, my co-founder over here, and then Richard, our founder developer here as well in the audience. We work on the flexible area for those that are here. You can just go there, meet us for a coffee.

The Non-Determinism of AI Models

1Biggest problems with LLMs is that they are non-deterministic. So we all know how powerful they are. You played with ,, it's very powerful. 1But it's non-deterministic, which means that it gives different results at different times.

Traditional Software vs. AI Behavior

In the past, in traditional software, you had a button. And this button, you click it. and it does something. If you click it 300 times, it will do exactly the same thing.

But it's not the same with AI. Every time you generate an image, every time you generate a text, it is slightly different.

Examples of Non-Determinism

And for example, if you ask a GPT what's your favorite color, it says, I don't have personal preference, but for many people, blue is often favorite color. If you open another tab and you ask the same question again, it's a similar thing, but people like colors like green because they're associated with calmness. So this is the non-determinism of it. If you ask it differently or even the same, it gives different answers.

And it's not wrong. I mean, it is correct both times about the colors and there is blue and there is green, but it can cause some issues, especially if you rely on it for doing more relevant things than choosing colors. And problems can happen, and they already happen on companies going with AI to production right now.

Challenges in AI Applications

For example, jailbreaking. So jailbreak is when you force the AI to do something that it was not trained to, but because it's non-deterministic and probabilistic, if you try hard enough, you can find those gaps.

Jailbreaking AI Systems

There is this e-commerce that they sell mattress, and they put a bot to be able to negotiate for them, and then a user, I just saw this yesterday on Twitter. A user managed to jailbreak into the AI. And then they managed to get a 99.99% discount and paid like less than one cent for a mattress.

And yeah.

Alignment and Ethical Concerns

or other problems, for example, on alignment. If you ask ChatGPT how to break into a car, it can assist you with that. So alignment is when you put your values or your company values on that bot, on that solution. And in this case, it was trained to not help with violence or criminal activity whatsoever.

However, if you said, in the past, how did they break into your car? Then it's gladly reply. It gave me over 10 options, and then it kept going. And they're like, yeah, do you want more?

Yeah, it's pretty crazy.

AI Hallucinations and Misinformation

And then last but not least, hallucinations. So the model can also put slightly different information, slightly incorrect information. For example, this. The earliest mention of artificial intelligence in the New York Times was in February 1950. It was actually in November 1950 with the article named Thinking Machines, but it was actually a revolution seen in Thinking Machines. So again, it's just slightly wrong.

And if you're not a journalist like they were, you might not even notice those mistakes. But because it is inherently non-deterministic, you cannot guarantee that it will go correct 100% of time.

And yeah, this even caused issues like Air Canada had to honor a refund policy that a chatbot made up. Someone asked, what's your refund policy? They didn't have this. And then the chatbot just invented one. That was not true. And then they had to comply with it.

Strategies for Improving AI Quality

So what can you do then? How are you going to rely on those things in production? Well, what you can do is to monitor, evaluate, and improve, and reduce the probability of issues as much as possible. So there are places that you can actually remove it completely.

Implementing Safeguards and Monitoring

So for example, on the negotiation bot, Maybe you should have protection placed on your code that do not allow the bot to go under $100 or a number, right? So you can fix that.

While others, it's simply non-predictable. You won't guess that your users will try like in the past, or we would try this, this, and that. So the way that you can reduce these issues from occurring is monitoring them, is being alerted when they happen, and then you can fix it.

Reducing Probability of AI Errors

So more and more, the chances of it vanishingly reducing. And as recent research shows, 89% of the market actually struggles to do it.

It struggles to collect data and monitor and get insights of major engagement. This because now we are dealing with a different kind of data. So we have Google Analytics, right, for a long time, but now we are not dealing just with numbers or with clicks. Now we are dealing with text, natural text, open free text.

And then how do you get insights from that? If you're processing 10,000 messages per day, how can you read it? It's just not possible. So we need better tooling to process that, to do that for us.

And then to evaluate quality and safety, how do you evaluate the quality of that output if there's no hallucinations, for example? And then iterate with confidence. Even if you have a product that it works well on your own task with your team, it might be that you find an issue and then you want to improve it for solving that issue. It might be that you break something else, right? Because when you're doing a prompt with ChatGPT, if you change the prompt slightly, it can change the answer by a lot. So iterating with confidence is also hard. So even if you have something working, if you want to make it better, it's hard to do. without breaking.

Introducing LengWatch

And this is why we built LengWatch. This is why we start on this journey. And what we want you to get out of here today is what is the actual process that gets you on this improvement.

So those four points that I mentioned, let's put them

Practical Steps to AI Quality Improvement

on an order on how can you, with your team, how would a team that is improving AI quality actually works in practice. So for what we need is two people on the team, or how we like to call two, different types of hats, one of the AI developer or AI engineer, a more technical person, and the other wearing the product hat or sometimes also called the domain expert, so someone that actually knows what's the product about or where it should go. For example, if you're doing a bot for healthcare, then the engineer might not know about health advice, but someone more domain expert like a medical doctor could feel this.

So then, how does the process go? How does this improvement loop go?

Data Collection and Monitoring

The first step is to collect and monitor your data. So in case of language, just by plugging in our SDK on your code base, then in the background, it collects these messages, and then it sends to our platform. And out of the box, you can see everything that is going in and out here, the tokens, the cost, all the metrics, and so on.

But then once you get the data in, you can, if any issue happens, you can investigate, you can see what happened on the code. This is the more technical part for the AI engineer to go deeper there and understand what happened.

you can see all the documents that are used, for example. So if an answer was given wrong, maybe one of the documents are wrong on your knowledge base and you need to update it.

Moving forward, once you have all this data in, then, like I said, it becomes hard to choose 10,000 messages.

So we have this other visualization that allows you to more easily read all the messages, but then also group them by topics and subtopics automatically. So it's like you can kind of zoom in. It groups messages when they are related on topic to each other.

So you can actually, through here, our customers find a lot of unexpected users. They did not predict that users were going to use that way.

And then, yeah, if someone is asking for a pizza on your mattress e-commerce, then maybe something you can account for and you pivot on this direction if users are asking a lot.

But anyway, you can kind of zoom in on those messages and understand the quality to improve. And then you can also read the full conversation, what happened with the user.

And then those messages, all those that was collected, they become also analytics on your analytics dashboard on Linguatch. We have many dashboards, like you can track users, topics, LLM metrics, so you can really understand the behavior and the traction of your product. But also, you can create custom reports. So any data that you capture with us, you can build graphs like those for your own reporting or to report effectiveness to your customers.

Evaluations and Alerts

Then the AI engineer, looking at all those messages, after you have some message flowing in, you understand your product, you start realizing some issues that happen or some security issues that you want to prevent, for example, jailbreaking. Then what you can set up to know when one issue like that happens is evaluations.

So we have here set up a content safety and answer relevance evaluation, which I'll get in a bit, but basically this one evaluates how relevant the answer was to the question. then what you can do is set up a trigger so that when this evaluation gets a low score, you get warned on Slack.

And this is the flow that you can get to improve your messages. You get alerted that a message has a problem, you go there and investigate, and you see what's wrong with it. on the next step, so you click through the message, you see what's wrong with it, and then you can have the domain expert actually evaluating that message.

You can have any kind of scores or metrics that you want, define whatever is really important to your team, and add their

their domain expert comments here so that the developers or the AI engineers can later bring this to a data set and collect all those examples that were good and they were bad so they bring back to improve the product.

Iterative Improvement and Experimentation

Then, how do you improve, coming back to a problem I mentioned some time ago, how to improve the product given that it might break, right? It might cause issues if you change something that was working, it might now stop working, you can run experiments.

So taking those data sets, those examples, and running against those evaluations, you can change your products and see if everything is still working. And it's actually working better now.

So you can put the next version in production. And then after you put it in production, you can watch for the changes.

So for example, you see here that the relevancy, answer relevancy went just up, was increasing. So you can see that your product is now going better. If it wasn't, then you can adjust it and fix it, and then you can track your progress, and then you keep monitoring the improvement in production and doing continuous improvement.

Conclusion

I don't know how much time do we have left? Two and a half minutes. Two and a half minutes. OK.

So we have a whole diagram here about evaluators and how do they work and what to choose. But I think I will skip that in a matter of time. But let me know if you're curious about it.

How does those evaluations work? We will be around here, and then we can talk over years later.

LengWatch Demonstration and Offer

Yeah, I just want to mention that if you want to try LengWatch now, you can just go to app.lengwatch.ai. We have a demo account that is already set up that you can see how it works, how it would look like. And for enterprise and companies that have really sensitive data like healthcare, we have also an on-prem solution on AWS Marketplace that can run completely internal to the company and still help improving the the product and uh yeah that's pretty much it and thank you very much