Thank you for your attention.
So 2025 has definitely been the year of agentic AI. We have seen more and more powerful systems go into production.
Personally, I think 2026 will really be the year we adopt AI as developers, integrating it into our apps. That's why this is the right time to understand how you can test your AI and how you can build reliable applications when you are using AI.
So in this presentation I will show you how industry leaders like OpenAI and Anthropic test their own models, but I will also show you how you can test the AI models in your own apps, whether it's a pet project or something you're building professionally.
But first, let me introduce myself. My name is Hugo.
I'm a software engineer, I'm a DevOps engineer, but before all of that, I'm a passionate crafter.
I've always been building things, since my childhood. This is my second time presenting at Mindstone.
The first time was about how to build an AI agent with n8n. It's available on community.mindstone.com; you can watch it.
If you are non-technical and want to build your own AI agent, it's perfect for you.
Like some people I saw here, I did five years at CERN, but more recently I moved to Geneva because I now work at a private bank.
And I ran into a problem I was not expecting.
Finding an apartment in Geneva is an absolute nightmare, guys. I don't know who here lives in Geneva, but it's very difficult.
Geneva is the hardest city in Europe in terms of vacancy rates, followed closely by Zurich and Paris. It's really difficult.
To find an apartment, you need to be consistent. There is high competition, and there are so many different platforms.
You need to contact people and send private messages, and it's very repetitive. And we tech people hate repetitive work.
This is why I wrote FlatScoot. It's an AI agent that precisely understands what you're looking for and then browses the whole internet, whether it's the régies' websites, immobilier.ch, or even Facebook Marketplace and Facebook groups.
It watches everything, and the minute it finds something that suits your needs, it sends you a message. It's not like spending one hour every night looking through posts, only for the person not to answer you because you arrived six hours too late.
It answers you right away. It was very nice, and some of my friends were interested, so I let them test it.
But then I found out that when many people use your app, you get unexpected scenarios: irrelevant requests, user messages that aren't very clear, or sometimes issues coming from a bug in my own app. And the more users I have, the more I have to adapt.
But with AI, when you adapt, it gets better in one direction and worse in another. It's just very difficult to find the right equilibrium.
So I was thinking: maybe FlatScoot is just dumb, and I should give up on this project.
But AI can answer very complex physics questions. So is it intelligent or is it dumb? That made me wonder: how does the industry test performance?
There are thousands of benchmarks that models like ChatGPT and Anthropic's Claude are tested against. I wanted to present three of them.
The first one is GPQA Diamond, which is for graduate-level science questions. It's kind of like: do you know what reaction happens when you mix this chemical with this one and this one? If the AI hallucinates, it fails.
So this one is about general scientific knowledge, we could say.
AIME 2025 is a competitive high-school math benchmark.
And SWE-bench, for the programmers: it tests whether LLMs can resolve real GitHub issues, and it tests agentic reasoning in general.
There's a last one I want to show you. Oh, just before I skip ahead: you see these numbers, they are very high, almost saturated. These benchmarks are about one year old, and at the beginning of the year scores were barely at 30% to 40%. So progress is very fast, and benchmarks change all the time; this slide might not be relevant one week later.
One last benchmark I want to show you is a bit different, because AI is very good at what it has been trained on. But what if we try the AI against a task it has not been trained on before? This brings us to a little game I want to try with you tonight. This is ARC-AGI v2, a benchmark that gives LLMs questions they have never been trained on before.
Here it's about visual recognition. I don't know if you can see it well, guys. The goal is to look at these two examples and then solve, let's say, the test. Does someone here have the solution?
Yes, if you can solve the last one. I don't know if you can see. So in the puzzle here, you have two examples: an input and an output, and another input and output. Then you have a final input, and you have to guess what the output will be. I'm not asking you to color it in, but see if you can find the general logic.
No? Yes? So, okay, I'm not going to challenge you any longer.
But here you can see there is a number of dots delimited by the blue area: is it zero blocks, or one? And it removes the other one. Here there are two, so it removed it. It's the same logic for the yellow one: it looks for the shape with six dots here, and we can see there is no empty block, so it removes this one. So, okay.
Let's say the screen is just too small. But as of today, there is almost no AI that can solve any challenge in this benchmark, and I'm sure every one of you, if you spend two quiet minutes at home, can solve it, which makes you more intelligent than AI. That's the solution.
This reminded me of a sentence from a philosopher who worked in Geneva, who defined intelligence like this: intelligence is not what you know, but what you do when you don't know.
So let's get back to our question: how can we test FlatScoot's intelligence? The requirements: testing has to be repeatable, because I don't want to spend hours testing my bot.
It has to handle the full requirements of the user: find out what they are looking for, whether it's something small, a certain number of rooms, et cetera. It has to fit my use case.
So what we don't want is manual testing: it's slow and inconsistent. And random sampling? Don't even think about it, it's totally unreliable.
What we're looking for is an automated, repeatable evaluation framework.
There are multiple solutions. They mostly work with the same tools; some are more complex than others.
I want to show you the OpenAI evaluation platform. Very quickly, to introduce it:
You create and configure prompts in a developer-friendly interface, where you can set the temperature and version your prompts. So you don't change something and then realize the previous version was working way better; you can version it and come back to it later.
You build an evaluation dataset, which means you define the questions, or the variables you want to put in your prompt, along with the answers or solutions that will be used to determine whether the model works or not.
And you build graders. You can build one or several that will test your application. For example, if it's about something vague, like sentiment analysis, you build a grader where an LLM judges your LLM, with the help of the solution contained in the dataset, of course. Or, if it's something more deterministic, you can even write a Python script that looks for keywords, and so on. This way you can iterate on your system without losing control.
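The loop such a framework automates can be sketched in a few lines. This is a minimal illustration, not the platform's actual implementation; `call_model` is a stub standing in for a real LLM call, and the field names are illustrative.

```python
# Minimal sketch of an evaluation loop: for every dataset row, run the model,
# then run each grader on the result. `call_model` is a stub for a real LLM.
def call_model(user_description: str) -> dict:
    # A real implementation would send the prompt to an LLM here.
    return {"zones": ["Plainpalais"], "rooms": 1}

def run_eval(dataset, graders):
    """Return one pass/fail dict per dataset row, keyed by grader name."""
    results = []
    for item in dataset:
        output = call_model(item["user_description"])
        results.append({name: grade(output, item["solution"])
                        for name, grade in graders.items()})
    return results
```

Each grader is just a function taking the model output and the ground-truth solution, so an LLM judge and a keyword script plug into the same loop.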
Let me show you; let's do a quick demo.
So here I am on the evaluation platform, at platform.openai.com, and I will create a new dataset.
The name of the dataset: demo-mindstone. Let's create it.
So here I am in this beautiful little interface, and I will add some data. I could upload it or enter it manually, but I have already prepared some resources: a JSON file that contains the data I will be using.
It contains the user description, which is the prompt entered by the user: "I'm looking for two rooms in Plainpalais," et cetera. And all the other fields are the solution.
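One row of that file might look roughly like this. It's a sketch only: the field names (`user_description`, `solution`, and so on) are illustrative guesses, not the exact schema used in the demo.

```python
# Hypothetical shape of one dataset row: the user's free-text request plus the
# ground-truth fields that graders can compare the model's output against.
dataset_item = {
    "user_description": "Je cherche un studio à Plainpalais, 1500 par mois, c'est urgent.",
    "solution": {
        "zones": ["Plainpalais"],   # expected extracted neighbourhoods
        "rooms": 1,                 # a studio counts as one room
        "type": "studio",
        "max_rent_chf": 1500,
        "urgent": True,
    },
}
```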
I create the dataset, and we can see the data appear here. Let's now look at the prompts. Here I already have one prepared, and you can see my variable, which I can insert using double curly braces. I can set some parameters, like the reasoning effort, the verbosity, and whether I want the answer as plain text or as a schema. And here I have my data from before.
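A prompt template with double-curly-brace variables could look roughly like this. The template text and variable name are illustrative, and the `render` helper is just a naive stand-in for the substitution the platform performs.

```python
# Sketch of a prompt template with {{double-brace}} placeholders.
PROMPT_TEMPLATE = (
    "Extract the apartment-search criteria from the user's message.\n"
    "Return a JSON object with zones, rooms, type, budget and comfort criteria.\n"
    "User message: {{user_description}}"
)

def render(template: str, variables: dict) -> str:
    """Naively substitute each {{name}} placeholder with a dataset value."""
    for name, value in variables.items():
        template = template.replace("{{" + name + "}}", str(value))
    return template
```

During an evaluation run, each dataset row supplies the values, so the same prompt is exercised against every example.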
So, let's generate the output. It will take some time because it's running GPT... never mind, it was super fast. Perfect.
So it was super fast, and if I click on any of the rows, I get a very good view of that piece of data. Here was the user description: "je cherche un studio à Plainpalais, 1500 par mois, c'est urgent" ("I'm looking for a studio in Plainpalais, 1500 a month, it's urgent"). Here is the data I defined as the ground truth, and the output here is what the AI guessed.
It guessed the project fine, it guessed the zones fine, the number of rooms, I think, so one room, and it also got the type: studio. As for the comfort criteria, the user didn't specify any, so they're all false. And then, well, you get the idea.
This is quite fine, but as we said, we don't want to compare it ourselves. We need a tool that does this evaluation for us.
So let's create a new grader. This is the most interesting part of the talk, I think, because here you can really configure the grader to fit the needs of your application.
Let's build the simplest grader possible. It will look at my data: it will be an AI evaluating another AI, but I give it some help. I prepared a little prompt that the grader will use to evaluate the result of the LLM: "You are an expert grader," blah blah blah, "you will look specifically at the comfort criteria." So it looks at the input, it looks at the output the LLM generated, but I also give it the solution. It reasons by itself while keeping the solution in front of it, so it is much less likely to make a mistake. I give it the comfort fields from the dataset, and I can choose which model performs the evaluation, but let's leave it like this.
I can even add examples, and LLMs are very good with examples. So let's save and run. Now the scoring grader is running, and when it finishes, it will tell me either pass or fail. I think it will take a bit longer than the previous step.
In the meantime, if you have different iterations of your prompt, you can click on "add prompt" and modify it, trying things in parallel; the idea is not to wait in front of your screen. So let's see.
Yes, okay, it passed. When I click on a pass, I can see the scoring grader's reasoning behind it. So if it fails, I can read why it failed.
Let's see. So everything passed. Great.
Let's try another grader. I can add another one; I can have multiple graders for different needs, testing different parts.
Here you can really change the type. You can do a simple one, with an AI evaluating an AI. You can do a string check: verify that this value corresponds to that value. There's text similarity, if you are looking at more abstract problems, plus Labeler, Scorer, and the Python grader.
And that one I like, because it's very easy. You ask ChatGPT to generate the evaluation code for you, you put it there, you verify it works, of course, and then you can reuse this block a lot.
Which I already did, and here it is. I'm not asking you to understand this code, but basically it looks at the zone and compares it with the zone I expect, to check whether the model correctly extracted it. Because when someone says they're looking on the Rive droite... well, when I arrived in Geneva I thought the Rive droite was the Rive gauche, because of how it looks on the map. Anyway. This one is much faster, because it's not doing any LLM calls, just comparisons. And we can see: one failed, one succeeded, one succeeded, and the last one worked.
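A Python grader of that kind could look roughly like this, assuming the model output is a JSON string containing a `zones` list. The function name and schema are illustrative, not the code shown in the demo.

```python
# Hypothetical Python grader for the zone field: pass only if the extracted
# zones match the expected ones, ignoring case, whitespace and order.
import json

def grade_zones(model_output: str, expected_zones: list) -> bool:
    try:
        parsed = json.loads(model_output)
    except json.JSONDecodeError:
        return False  # malformed output counts as a fail
    got = {str(z).strip().lower() for z in parsed.get("zones", [])}
    want = {str(z).strip().lower() for z in expected_zones}
    return got == want
```

Because it's plain string and set comparison, it runs in microseconds, which is why this kind of grader finishes so much faster than an LLM judge.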
So I can see I have a real area for improvement in my app when it comes to understanding the user's zone criteria. Now I'll be able to generate a dataset of maybe 50 samples and iterate on that.
So okay, that's it for me.