From Prototype to Production: The Evaluation Gap in Gen-AI Development


Introduction

Background and perspective

A little bit about myself: I started my journey in systems neuroscience. So if you ever want to talk about whether we're living in a simulation, I'll talk to you from that point of view.

That was actually my entry point to AI, because it gave me the opportunity to start modeling visual systems using different modeling techniques, including neural networks.

From there I started working in industry, initially in computer vision, and then gradually gravitated toward NLP. That was around 2018 and 2019, when large language models like BERT were starting to show promise. That's when that part of my journey began, and then I started

taking on more leadership roles, managing teams dealing not only with AI and machine learning but also data engineering, operations research, optimization, and anything

needed for what I call data insights. My philosophy is: don't buy into the hype. Let's talk about what really works, and let's talk about real performance.

Currently I'm head of AI at Evalia, a company that helps businesses establish and define their AI strategy and then execute on it, specifically in the domains of machine learning operations, building data pipelines, and evaluation, which is the core of today's talk.

Audience check and goals for the talk

This is my first time speaking here, and I really don't know what percentage of the audience is technical, so let me ask you a couple of questions.

How many of you have lately used agentic AI for your day-to-day tasks? That includes asking ChatGPT something and watching that globe icon spin while it searches the web. Show of hands? Okay, so many, many of you.

And how many of you have developed AI agents in the past couple of years? Awesome. What a great audience.

So I'll try not to be too technical.

Why Evaluation Matters for Agentic AI

From the “year of agents” to the “hangover”

But before I get into this, I think 2025 was the year of agentic AI.

I'm sure all of you have heard about agents and the tools you can use to create them; there's been a lot of development in that area.

But 2026, at least to me, is the year we've woken up with a bit of what I'd call an AI-agents hangover: with so many tools, agent systems, and agents around us, it's really hard to understand which one is the right fit for us.

And this is exactly why evaluation matters so much, whether you're using these tools personally or using them to build agents in development.

The decision of whether to go with ChatGPT, Anthropic's Claude, or Gemini for a specific task is often something you have to work out case by case.

Or if you're a developer deciding between the Codex app, Claude Code, Antigravity, and all these coding agents, the question is which one really fits your needs and which one works for the specific use case you have in mind, right?

And if you're developing AI agents on a development team, you need to have evaluations in place, because without evaluation,

you're flying blind: you have no way to verify what effect your changes actually have.

These are some of the problems you run into when you don't have any evaluations, right?

Evaluations as a competitive advantage (data flywheels)

I want to flash back to Sheikh's presentation, where he rightly mentioned that one of the competitive moats you can create for an AI product is through data flywheels and the improvements you make across iterations of the product. And this is mainly done using evaluations.

Unless you have systems in place to know exactly what's working and what's not, using the data you're collecting, whether it comes from LLM-as-a-judge systems or from experts annotating and labeling that data for you, there's no way to understand whether the AI agent is actually doing the right work for you.

A Practical Testing Framework for AI Agents

Three layers: component, agent, and end-to-end testing

Before getting into the use case, I want to mention that we have three layers of testing. This is especially important for developers, and it has a familiar parallel: it's very similar to the testing pyramid most of you already know.

You start from the bottom up, with the small components that make up your AI agent. For example, if the agent searches the web, you want to make sure the search is done correctly and covers most of the cases you're looking for.

The next level is the agent: whether the agent can take all the necessary steps, properly, to reach the goal you're after. For example, if you ask an AI agent to look for something on the web and, instead of running one good search on a search engine, it takes lots and lots of steps, that's not optimized, and it won't get you where you want to go, right?

And then, at the top, there's the end-to-end test: you really want the expected final outcome actually delivered to you.

Whether that's pulling specific data through the agent, or getting the searches and summarization you asked for based on a deep search done by the AI.
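At the bottom layer, each tool can be checked in isolation with an ordinary unit test. Here is a minimal sketch of what that can look like; the `search_web` tool and its canned results are illustrative stand-ins, not anything from the talk:

```python
# Component-layer test: exercise one tool of the agent in isolation.
# `search_web` is a stand-in for a real web-search tool; it is stubbed
# with canned results so the sketch is self-contained and runnable.

def search_web(query: str) -> list[str]:
    # A real implementation would call a search API. The canned results
    # below only illustrate the shape of a component test.
    canned = {
        "store sales 2021": [
            "2021 annual sales report by store",
            "Top-performing stores in 2021",
        ],
    }
    return canned.get(query, [])

def test_search_returns_relevant_results() -> None:
    results = search_web("store sales 2021")
    # This test covers only this one component, not the whole agent:
    # it just checks that results exist and mention the year asked for.
    assert results, "search returned nothing"
    assert any("2021" in r for r in results)

test_search_returns_relevant_results()
```

A real suite would run such tests per component (search, SQL generation, charting) before any agent-level or end-to-end checks, mirroring the pyramid from the bottom up.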

Use Case: Data-Insights Agents

Chatting with your data: lookup, analysis, and visualization

So what's the use case today? I call it data-insights agents: for any database or data source you have, the agent lets you have a chat with your data.

The agent can pull the data, do the analysis, and, if you need it, make a chart or graph for you.

The data could be your bank transactions, or receipts you've photographed for tax purposes when filing, or a company's internal system of record, or

small databases you want to reach into. The first step for this agent is to look up the data you want from those sources. The second step is that, once the data is pulled, you

want to do some analysis on it: sometimes an average, sometimes the maximum, the minimum, or the number of times you did or bought something.

And sometimes you need a graph and sometimes you don't. It's the agent that decides which of these actions to take to reach the final outcome you're looking for.
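The three actions above can be sketched as tools the agent chooses between per request. In this toy sketch, a keyword router stands in for the LLM's routing decision, and the function names and data are invented for illustration:

```python
# Minimal sketch of a data-insights agent's action space. In a real
# system an LLM decides which tools to call and in what order; a
# keyword router stands in for that decision so the sketch is runnable.

def lookup_data(question: str) -> list[float]:
    # Stand-in for querying the database / receipts / system of record.
    return [120.0, 80.0, 100.0]

def analyze(values: list[float], question: str) -> float:
    # Stand-in for the analysis step (average, maximum, counts, ...).
    if "max" in question:
        return max(values)
    return sum(values) / len(values)

def make_chart(values: list[float]) -> str:
    # Stand-in for generating visualization code.
    return f"bar chart of {len(values)} values"

def run_agent(question: str) -> dict:
    values = lookup_data(question)        # step 1: always fetch data
    result = {"answer": analyze(values, question)}  # step 2: analysis
    if "chart" in question or "graph" in question:
        result["chart"] = make_chart(values)        # step 3: only if asked
    return result

run_agent("average sales, with a chart")
# -> {'answer': 100.0, 'chart': 'bar chart of 3 values'}
```

The point of the sketch is the shape of the decision: lookup always runs, analysis depends on the question, and visualization is optional, which is exactly what the evaluations later need to check.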

Why traces matter: understanding failures step by step

The first step is being able to see the traces of what this agent is doing. This is very important, because there are many, many steps in here, and if

any of them goes wrong, the error propagates down the pipeline and produces something horribly wrong, right?

Here, just for the sake of presentation, I'm showing you one of the tools, Phoenix, which is a great tool, but my main point is simply to understand the ins and outs of doing evaluation.

This is one run of that specific agent. You can see all the steps from the beginning, where the agent gets the input: "Which stores did the best in 2021?"

This is sales data from different stores across different locations, right?

Then, as I said, you can see the three steps executed one by one. The trace helps you a lot in understanding exactly when each step was executed and how much each step cost you.

You can also see the input and output of each of these steps, to evaluate whether the model, the agentic system, is actually doing a good job or not, right?

But this is very manual and takes a lot of time, because you have to go through all the steps yourself, right?

Automating Evaluation

LLM-as-a-judge and other evaluation options

So the next thing is, remember the testing pyramid? You really want to be able to evaluate each of these steps, and ideally you want to do it automatically, right?

LLMs are a good candidate for this, because you can ask an LLM to look at the input and output generated by each step and tell you whether it's correct or not.

This is called the LLM-as-a-judge scheme for evaluation.
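The pattern itself is small: hand a model the step's input and output, ask for a verdict, and parse it. Here is a minimal sketch; the prompt wording is invented, and `call_llm` is a trivial stub standing in for a real chat-completion call so the example runs without an API key:

```python
# LLM-as-a-judge: give a model one step's input and output and ask for
# a one-word verdict. Swap `call_llm` for a real model call in practice.

JUDGE_PROMPT = """You are grading one step of an AI agent.
Input to the step:
{input}
Output of the step:
{output}
Answer with exactly one word: correct or incorrect."""

def call_llm(prompt: str) -> str:
    # Stub judge: deems the output correct if it mentions the year 2021.
    # A real judge would be an actual LLM reasoning over the prompt.
    produced = prompt.split("Output of the step:")[1]
    return "correct" if "2021" in produced else "incorrect"

def judge_step(step_input: str, step_output: str) -> bool:
    prompt = JUDGE_PROMPT.format(input=step_input, output=step_output)
    return call_llm(prompt).strip().lower() == "correct"

judge_step("Which stores did the best in 2021?",
           "Store B led 2021 sales with $1.2M.")
# -> True with this stub judge
```

Run across all traced steps, this gives you per-component pass/fail labels automatically, which is what feeds the dashboard scores shown later.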

There are other ways to do it too: you can use code for the evaluation, or you can do human evaluation, where people go through it themselves.

Those are experts who look at the code, the input, and the output and tell you what's working and what's not. So the next step is, for each of

these components, you start defining your evaluations. For example, for step one you can use LLM-as-a-judge.

For the sales-data lookup there are two steps involved: first, after you prepare your dataset, you generate a SQL query, a formal query to work with the database and fetch the data, and then you execute that SQL, right?

Then for step two, the data analysis, you can again use LLM-as-a-judge and ask about the clarity of the analysis, right?

You give the input and output to a model and say: given the data, the table I fetched from my database, did the LLM give a good answer, a good analysis, or not?

And the third one is the data-visualization code: generated code that you can run to get the charts you expect.

These are things you can set up easily just by writing prompts and some code; then, using Phoenix, you deploy them and they'll do the evaluation for you. So let me show

Operationalizing evaluators in a dashboard

you on the dashboard how it works. If you go to this evaluation agent, you can see different traces from different sources, because there have been different runs, right? And if you look at

the session, these are the evaluations you defined, each very specific to one of the components. We have one evaluation for tool calling: when you ask the AI to do something, you want to make sure it picks the right tool among those three steps, if you remember, pulling data from the database, doing the analysis, and the data visualization. The LLM should correctly pick which one goes first, which one goes next, and then the third, right? Sometimes they make mistakes, and this is where you find them.

You can use those LLM-as-a-judge evaluators, run them on all the data you have, and get a score, here 87, which tells us that on one of the samples the LLM did not do a good job: it didn't pick the right tool to work with.

All the evaluators I showed you on the slide you can see at the top, and you also have the ability to filter and see what's not working. Based on that, you can change your prompt, or change how the system is designed, to get a better outcome.

This is the power you get: you can run this in seconds, see what's not working, come up with a new version that's perhaps better, and keep it. And this prevents you from regressing to problems you had before. So that was

Trajectory and cost metrics: steps, convergence, and spend

the first step. The second step, as I said, is about the trajectory, right? How many

steps the agent takes to achieve a goal is very important. Compare an agent that takes three steps to do the job with one that takes a hundred.

Of course, one is better than the other in terms of cost, in terms of time, and perhaps in terms of performance, right? And you have the ability to evaluate exactly those metrics.

And when you set that up, this is what you get.

What you see here is a list of inputs given to the agent, all essentially the same request rephrased in different ways, right?

For example, you can ask, "What was the average of sales?" Or, "Can you tell me the mean amount of money the company made from this period to that period?"

Since these are all semantically the same request, you expect the agent's output to always be the exact same number: the average is the same however the question is phrased. This is the power of these tools, and almost all advanced AI evaluation tools provide it:

you can run these evaluations again and again and measure all these metrics: whether the model's outputs converge or not, how many steps the agent takes, and, on average, exactly how much it costs you to do a specific job.
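One way to compute the convergence, step-count, and cost metrics just described is sketched below. The per-run records are fabricated examples of what a tracing tool such as Phoenix would hand you; the field names and numbers are invented for illustration:

```python
# Trajectory metrics: run the same request, rephrased several ways, and
# check that the numeric answer converges, while tracking steps and cost.

runs = [
    {"question": "What was the average of sales?",
     "answer": 100.0, "steps": 3, "cost_usd": 0.004},
    {"question": "Can you tell me the mean sales amount?",
     "answer": 100.0, "steps": 4, "cost_usd": 0.006},
    {"question": "On average, how much did we sell?",
     "answer": 100.0, "steps": 3, "cost_usd": 0.005},
]

def convergence_report(runs: list[dict]) -> dict:
    answers = {r["answer"] for r in runs}
    return {
        # Semantically identical questions must yield exactly one answer.
        "converged": len(answers) == 1,
        "avg_steps": sum(r["steps"] for r in runs) / len(runs),
        "avg_cost_usd": sum(r["cost_usd"] for r in runs) / len(runs),
    }

convergence_report(runs)
# with these records: converged=True, avg_steps about 3.33, cost about $0.005
```

A regression suite can then assert thresholds on this report, for example that `converged` is true and `avg_steps` stays under some budget, so a prompt change that bloats the trajectory fails loudly instead of silently.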

Choosing the Right Tool for the Job

Comparing agents and coding assistants by measured trade-offs

A good example of that these days: I'm sure many of you are deciding between Claude Code, the Codex app, Antigravity, and everything like that.

Each of them is very, very different. It really depends on what you want to do.

Claude Code, for instance, is very token-hungry; it's going to cost you a lot, but it's very, very rigorous.

Codex, unless you specifically ask for more, is pretty cheap and does the job, but kind of so-so.

So whichever tool you pick, you should really be aware of its specs and of the evaluations done on it; you have to be familiar with those.

Even if you're non-technical and using it for your day-to-day job, this is very important. If you don't know the difference between electric, gas, and diesel cars, or between 200 and 400 horsepower, or how that affects the car's acceleration or maybe even

its fuel consumption, it's hard to believe you can pick the right car for your needs. It's the same with AI tools, whether you're picking one personally for your own purposes or building an AI tool for an end user.

Conclusion

That's it. So thank you very much.
