Reliable Agents in Production

Introduction

All right, so today I'm going to talk about reliable agents in production. Everybody on LinkedIn is being bombarded right now with posts like, oh my God, agents are the new thing. The best thing since sliced bread, right?

Yeah, I'm here maybe to present a bit of a controversial opinion, right? To cut a bit through the hype. And it has to do exactly with how to make agents reliable in production.

All right, yeah, so I'm the guy over there responsible for AI.

What is an AI Agent?

So what is an agent, right? Like maybe let's start there.

So there is no consensus among the experts, so people can call just about anything an agent, but I like to do a little bit of an etymological analysis, if you will. An agent is something that acts, something that has agency, right?

It's pretty simple. But is it an LLM? Is it an LLM that does multiple stuff at the same time? What is it?

So the lines are kind of blurry here. But basically an agent is like us: we take information from our environment and we act upon that information.

So there are multiple types of agents. Pardon my reductionism in this slide, but it could be just a simple LLM. We're chatting with ChatGPT. Is that an agent?

We don't know, possibly. Then we have other kinds of agents. We have, for example, ReAct agents, which follow what's called the ReAct loop: they look at the environment, they think for a bit, and then they act. They try to do something useful with the tools or actions available to them.
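The ReAct loop described above can be sketched in a few lines. This is a minimal illustration, not a real framework: `call_llm` is a hypothetical stand-in for an actual model call, and the single `lookup` tool is a toy.

```python
# Minimal sketch of a ReAct-style loop with a stubbed "LLM" and a toy tool.
# call_llm is a hypothetical stand-in: a real agent would query a model here.

def call_llm(history):
    # Stub policy: ask the tool first, then answer once an observation exists.
    if not any(step[0] == "observation" for step in history):
        return ("act", "lookup", "office address")
    return ("finish", "The office is at 123 Main St.")

TOOLS = {
    "lookup": lambda query: "123 Main St.",  # toy tool: pretend knowledge base
}

def react_loop(question, max_steps=5):
    history = [("question", question)]
    for _ in range(max_steps):
        decision = call_llm(history)       # think: model inspects the history
        if decision[0] == "finish":
            return decision[1]             # done: return the final answer
        _, tool, arg = decision            # act: run the chosen tool
        observation = TOOLS[tool](arg)     # observe: feed the result back
        history.append(("observation", observation))
    return "gave up"

print(react_loop("Where is the office?"))
```

The `max_steps` cap matters in practice: without it, a confused model can loop forever.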

We have search agents, which are more in line with the traditional AI approaches you know from robotics, from as far back as we've had AI. Basically, they think in a tree structure: they evaluate, okay, I have three possible actions, which one is the best? Okay, it's the middle one, based on their own evaluations. Then we have planning agents. You could say a search agent is kind of a planning agent, right? But in today's world of LLMs, a planning agent in reality just tries to produce a plan that it can act upon.

So you have all the papers up there, for the technical people who want to dig in. And then we have multi-agent systems, which are just a collection of whatever agent types we decide to use in our architecture. In this case they collaborate with one another, like we humans are doing right now trying to put an event together. It's exactly the same thing.

Agent Conundrum

So what is the agent conundrum? What is the problem? Everybody is saying on LinkedIn that it's the greatest thing, but can we actually use them in production settings? It's a difficult question, right?

While an AI's thinking might seem sophisticated, with these elaborate chains of reasoning, and these models have a gigantic, enormous number of parameters and consume a lot of resources to run, they can sometimes still be pretty dumb. So let's see why. Basically, this is a graph that shows the success rate of an agent as it goes through the steps of its plan.

So if I ask it a question and I get an answer, that's one step. But you can clearly see, if you squint your eyes a little bit, that the more steps you take, the more rapidly the success rate decays. Even with the strongest models that everybody is talking about now, like DeepSeek R1, o1, or o3, the reasoning models, the models that think. For example, at four steps they would be at, let's say, 80%. At six or eight steps, they're already down to 30%, or depending on the case, 60%. Which is pretty good, right? Well, maybe not for production.
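The shape of that decay follows from simple compounding: if each step succeeds independently with probability p, a plan of n steps succeeds with probability p to the n. The numbers below are illustrative, not the slide's data.

```python
# If each step succeeds independently with probability p, an n-step plan
# succeeds with probability p ** n. Illustrative numbers, not real benchmarks.
def plan_success_rate(p, n):
    return p ** n

for n in (1, 4, 8, 16):
    print(n, round(plan_success_rate(0.95, n), 2))
# Even at 95% per step, eight steps already drop overall success to about 66%.
```

This is the quantitative reason long autonomous plans fall apart even with strong models: small per-step error rates multiply.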

Imagine telling your stakeholders, all right, this thing is wrong 30% of the time, but it's okay, I'm sure our clients will get it, right? That's wrong, at least in my view.

So what is happening? These models are huge black boxes. Black box means you get an input and an output, and you have no idea what's going on in between. For example, if I talk to you, I have no idea what's going on inside your head, even though you can try to explain it to me, right?

So this poses a bit of a challenge, because we have no traceability and no interpretability. If anything goes wrong, all we can say to our stakeholders is, hey, it's a black box, I don't really know what happened. That's very bad, especially in Europe, because we are starting to be heavily regulated. It has also been shown that LLMs can't plan. They are getting better, yes.

But if you give them a very complex task in a real-world setting, chances are they're going to blunder. Basically, this paper argues that we need a human in the loop to help the AI reason, to build the plan, and to verify it.

Practical Applications

So, going back to a more practical case that is not so abstract: at Noxus, we are automating a customer complaint department at hospitals. They get these complaints that are handwritten.

So they would have humans inspecting these documents manually to see whether each one is a complaint or a compliment, and what type of complaint it is. It's a very tedious process.

So we have an agent workforce, let's say, in place that analyzes these complaints and helps the human make a decision, right? But it requires human supervision.

Let's say that 80% of these cases are easy. For the other ones, we just don't let the AI run loose and decide, right? We need a human in the loop to verify.

All right, maybe you have felt this too when you're talking with ChatGPT. There is this nice saying about LLMs that I like: an LLM seems to know everything that I don't know, but on the things that I do know, it's right maybe 60% of the time. So there is this perceived quality: if we don't know anything about the subject, the model just seems like Pythagoras or something like that.

All right, so what is the problem then? Do we think it's okay to have agents that perform at 80% success rate in production? So maybe not.

Solutions for Reliable AI Agents

So one of the solutions is workflow-based intelligence. This is what we do at Noxus, and this is an example of a workflow that we have in production for the customer complaint department. It's highly simplified here.

Each one of these boxes is an intelligent box, right? One of them is going to categorize the complaint. Another is going to send the data to the client's database, for example. So each of these boxes has its own specific action. But if you look at this, it's nothing like, oh, the agent does everything, it's super automated, and we don't need to do anything.

We just let it run wild. No, this is exactly the opposite. What we are effectively doing is mapping the business logic of our clients into a workflow that is interpretable and traceable.

So in case anything goes wrong, we can go and say, oh, the bot failed at this step because it said this, this, and that. We are reducing the scope of action of that chatbot. I like to call this materialized intelligence: we know what the process is.
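The traceable-workflow idea above can be sketched as a sequence of small, single-purpose steps whose inputs and outputs are logged. The step functions here are hypothetical stand-ins for the LLM-backed boxes; only the structure is the point.

```python
# Sketch of a traceable workflow: each "box" is one small, single-purpose step,
# and every step's outcome is recorded so a failure can be pinpointed.
# categorize and store are hypothetical stand-ins for LLM-backed components.

DATABASE = []  # stand-in for the client's database
TRACE = []     # audit log: (step name, result) pairs

def categorize(complaint):
    return "complaint" if "unhappy" in complaint.lower() else "compliment"

def store(record):
    DATABASE.append(record)
    return "stored"

def run_workflow(text):
    steps = [
        ("categorize", lambda: categorize(text)),
        ("store", lambda: store({"text": text})),
    ]
    for name, step in steps:
        result = step()
        TRACE.append((name, result))  # traceability: who did what, with what outcome
    return TRACE

run_workflow("I am unhappy with the waiting time.")
```

Because each box does one thing and logs it, "the bot failed at this step because it said this" becomes a lookup in `TRACE` rather than guesswork inside a black box.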

We don't need to let the AI run loose; we know what the process is, and we just let it execute it. So if you give one very big task to an agent that tries to do everything, there can be a lot of error, and the error will propagate across the steps, as we saw in the first graph. If we apply this divide-and-conquer approach with very small tasks, the chance of error is way smaller, right?

So, solution two. I'm going to skip this; it's a bit more technical, but basically you want to align your AI with human preferences, right? If I say that this is blue and the model says it's red, it's good if the model is aligned with my preference. My preference can be wrong, though. And that's sometimes a problem in building these tailored solutions for customers, because sometimes the human logic is faulty.

For example, we have a retail case where we have to categorize products. And there was this case where a helicopter toy had to be categorized as a Lego toy, because they had no other category for it.

If you try to fit the model to that, it would be very weird, right? So the third solution is confidence bounds: how confident is the LLM when it says something? There is a plethora of methods to achieve this. Basically, you try to inspect the internals of the model to see if you can catch something, but that can be very hard to do.

One easy way is semantic similarity. It's the most cost-effective, in my opinion. It's basically the same as saying: if I ask the same subjective question to five people and all of them give different answers, then there is no consensus, right? The answer is not consistent. So yeah, use statistics, basically.
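The consensus check above can be sketched as: sample the same question several times and measure how similar the answers are. Real systems usually compare embeddings; as a self-contained stand-in, this sketch uses the standard library's `difflib.SequenceMatcher`, and the two answer lists are made-up examples.

```python
# Sketch of a semantic-consistency confidence check: low agreement across
# repeated samples means low confidence, so flag the case for human review.
# Embedding similarity is the usual choice; difflib is a stdlib stand-in.
from difflib import SequenceMatcher
from itertools import combinations

def consensus_score(answers):
    """Average pairwise string similarity across sampled answers (0..1)."""
    pairs = list(combinations(answers, 2))
    sims = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(sims) / len(sims)

consistent = ["Paris", "Paris", "Paris", "Paris", "Paris"]
scattered = ["Paris", "Lyon", "Marseille", "Nice", "Toulouse"]

print(consensus_score(consistent))  # 1.0: full agreement, high confidence
print(consensus_score(scattered))   # much lower: no consensus, escalate to a human
```

The threshold for "escalate to a human" is a business decision, which ties back to only letting the AI decide on the easy cases.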

Conclusion

So to conclude, we have a Portuguese saying, I don't know if they have it in other countries, but a prepared man is worth a thousand.

So before you go into production, of course, you must have a good evaluation framework. That's where basic statistics help you.

So you have to be able to build this data flywheel: get the data from your customers, curate it, and harness it to be able to iterate on your solution. And only fine-tune your models around your data moat. Your data moat is basically your oil: data that is unique to your business, right?

Basically, you can't compete with these giants, and there is no advantage in trying. But if you optimize around the data that is unique to you, you have better chances of success.

So yeah, then use this consensus algorithm with your models, insert a human in the loop for the models' planning and verification, and create traceable, interpretable logic that can be regulated and that you can justify to your stakeholders.

So yeah, that's basically workflows and the main value proposition of what we do at Noxus.

So yeah, a couple of pictures about our platform. So here you have a collection of your workflows, for example, a recent news digest, a CV reviewer for your company. So you can basically join these Lego pieces to automate a lot of stuff.

So this is what a workflow looks like. You just pass information from one side to the other. We have integrations with, basically, your company's documentation: you can build this huge knowledge base that you can then integrate into the rest of the platform.

And of course, we also support the chat that integrates with these workflows, these knowledge bases, and what have you. So yeah, thank you.
