Why Your AI Agent Will Fail at Row 347

Introduction

Thank you, thank you for having me. This is actually my first event with MindStone, so it's cool to be speaking and experiencing it all at once. My name is Jay, and I'm the founder of QX Labs. We're a software business that produces automation products for the financial services industry.

Personally, I started my career as a software engineer, and then I spent eight years working as a software investor. So my experience with AI comes from two different angles: on the one hand I've always had the engineering hat, and on the other I've experienced a lot of unusual inefficiencies in the way that investing happens.

I'm not going to talk too much about that; I want to be a bit more practical today. I want to talk a little bit about agents and agent systems, a term we hear a lot, and about one of the challenging limitations of agents. Then I'll go into some of the things we do to try and overcome those limitations, and maybe there are some practical takeaways for how you think about these things as well.

A Simple Agent Example in Excel

I'm going to give you one very simple example just to illustrate the point.

We have Microsoft Excel open here with a list of companies, and I'm going to ask it: for each of these companies, add a new column with the company's revenue and some financial metrics. What you have on the right side is Claude for Excel, something Anthropic released not too long ago. It's an agent that sits inside your Excel workbook and is able to run a bunch of tasks: search the web, look at what's going on inside your Excel sheet, and then actually manipulate the Excel file accordingly.

So what you can see is that I've asked it a question, it's read all of the information inside my Excel sheet, it's looked up the relevant financial metrics for each of these items, and, as I asked, it's added these two columns. It's even gone out of its way to include some notes with background information. So that's great.

Why Conversational Agents Struggle at Scale

Context explosion, cost, and performance degradation

The problem is that we work with a lot of people who might not be dealing with three rows of data. Often they'll be dealing with 3,000, 300,000, sometimes 3 million rows of data.

And the way this kind of chat agent on the right works is that it's a conversational LLM, and for every additional word that comes out of it, it's essentially reprocessing the entire conversation history.

So these LLMs, they predict the next token. That's effectively how they work.

And as a conversation becomes longer and longer, it retains the history of all the web searches it's done, all the information it's gathered, all of the back-and-forth Q&A that you've done with it.

And the longer the conversation gets, the more expensive it becomes in terms of spend with Claude, but also the less performant the system becomes. If I go from dealing with three rows to suddenly dealing with 200 rows, or 2,000 rows, we're going to start running into problems, because this becomes a very, very long sequential stream of thinking and tokens that compound and get more and more expensive.

So at a fundamental level, agents are great. They're able to do all of these wonderful things and connect with different tools and so on. But they suffer from a context problem. We sometimes call it context explosion; people also refer to it as the "lost in the middle" problem. As you start to feed an agent more information, as your conversations with it get longer, and as it starts to deal with more data, the performance starts to degrade and the costs tend to skyrocket.
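To make that compounding concrete, here's a back-of-the-envelope sketch comparing total tokens processed when every row's work lands in one shared conversation versus in isolated per-row contexts. The per-row token count is a made-up round number purely for illustration:

```python
# Back-of-the-envelope comparison: the per-row token count (500) is a
# made-up round number purely for illustration.

def single_conversation_tokens(rows, tokens_per_row=500):
    """Tokens processed when every row appends to one shared history."""
    total = 0
    history = 0
    for _ in range(rows):
        history += tokens_per_row  # this row's searches and answers
        total += history           # the whole history is reprocessed
    return total

def per_row_worker_tokens(rows, tokens_per_row=500):
    """Tokens processed when each row runs in its own small context."""
    return rows * tokens_per_row

for rows in (3, 200, 2000):
    chat = single_conversation_tokens(rows)
    flat = per_row_worker_tokens(rows)
    print(f"{rows} rows: {chat:,} vs {flat:,} tokens ({chat / flat:.1f}x)")
```

The shared-history total grows quadratically with row count, which is why three rows feels cheap and 2,000 rows doesn't.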

Consistency drift and recovery failures

The other problem is consistency. When I'm asking it to look for certain things for three companies, it'll likely do each thing consistently: it'll have the same process for finding the revenue for company one as it does for company three.

But as I start to scale that to more and more line items, there's going to be some drift and inconsistency. How do I ensure that it deals with row number one in the same way that it deals with row number 1,000?

And then there's the issue that if this thing fails halfway through, before it has written the results to my Excel sheet, I'm lost, having spent a whole bunch on API tokens with no output to show for it. So there's a recovery problem I need to deal with as well.

So when we think about the customers we work with, who are dealing with lots and lots of data but want to be able to take these agentic actions on it, there are three things we're trying to solve for. One: I want to do things in a way that's repeatable, i.e. it does the same thing for row one as it does for row 500. Two: I want to do it in a way that's traceable. Investors, and a lot of people in other enterprises, are real sticklers for process; they want to make sure I'm following a very guard-railed process every time I do something. And three: I need to do it in a way that is scalable, so that I can do this on 10 companies or on 1,000, depending on whatever the entity might be.

Design Goals: Repeatable, Traceable, Scalable Automation

And so that's where we started to think about, okay, well, what are the different ways we can solve this problem?

From no-code drag-and-drop to “vibe mode”

And actually, in the early days, what we ended up building was something that's now somewhat out of fashion: a sort of no-code drag-and-drop builder.

We can go and build automations and tell it: I have an Excel file, I want to drag that Excel file in, and then I want to run some web searches, or pass it into an AI and run some queries and all the rest.

And that kind of works, but what we've found increasingly as time has gone on is that we're living in a generation of AI where people don't have the patience for drag-and-drop. We're now living in the generation of vibes, as everyone likes to call it. So what we then evolved into is what we call vibe mode.

Constraining agents into modular steps (and using deterministic nodes)

And the key thing here is that instead of giving an AI agent an unconstrained problem, i.e. "go and run this entire process for me end to end, and you have a bunch of tools available", we constrain the role and the task that each agent has at different steps of a process.

And, in fact, some of the steps of the process might not require an agent at all; they're completely deterministic. Reading a file is deterministic. The way we want to run a web search might be entirely deterministic. The way we want to interact with certain tools might be entirely deterministic. So we take what we did earlier and port it over to this.
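As a sketch of that idea, a workflow can be modelled as an ordered list of small nodes, where deterministic steps are plain functions and only the steps that need judgment call a model. Everything here, the node names and the `call_llm` stub, is hypothetical, not our actual implementation:

```python
# Minimal sketch: a workflow of small nodes, each with one narrow job.
# Deterministic steps are plain functions; only steps that need
# judgment call a model. call_llm is a stand-in stub, not a real API.

def call_llm(prompt, context):
    # Stand-in for a real model call; returns canned output so the
    # sketch runs end to end.
    return "Acme Ltd\nGlobex Corp"

def read_file(state):
    # Deterministic node: no model involved in reading the input.
    with open(state["path"]) as f:
        state["text"] = f.read()
    return state

def extract_companies(state):
    # Constrained agent node: it sees only the text it needs, not a
    # growing conversation history.
    state["companies"] = call_llm(
        "List the companies mentioned, one per line.", state["text"]
    ).splitlines()
    return state

WORKFLOW = [read_file, extract_companies]

def run(state):
    for node in WORKFLOW:
        state = node(state)  # each node reads and writes only its slice
    return state
```

Because each node's input is explicit, a step can be swapped, rerun, or audited on its own.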

Demo: Building a Reusable Workflow for Broker Reports

And I'll give you an example of something our customers deal with quite often: they'll get these big files, which can look something like this, broker reports with information on lots of companies. This one is a 120-page file on the RegTech industry.

Often they want to take that file, parse all of the companies mentioned in it, and then start to do a bunch of things with them. So let's say I receive lots of PDF broker reports from brokers with information on companies. I want an automation that can extract the names and descriptions of all companies mentioned in a file, figure out the website of each company with web search, and then create a score list for each company.

Let the agent assemble modules into a guarded workflow

So we give it kind of a broad task, and then what we have is instead of having an agent that goes and tries to sort out all of these things on its own, we instead have an agent that is figuring out what are the different modules that are required for each of these tasks.

Some of those modules might be pre-built. The way we need to extract data from a file, or read a PDF, is a pretty bog-standard thing that a lot of different people need to do, so we have a tried-and-tested way to do that. But a lot of the actions it takes might be completely novel.

They might be interacting with external services that we have integrations with, but that involves certain actions that aren't prescribed. And so for those, it can just go and write code and execute code inside those nodes. So it's starting to come up with a workflow plan.

We want to read a PDF. We want to extract some data. We're going to select the best website and then create a score list.
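One way to picture the plan the agent assembles: a small declarative structure where each step names a node, pre-built or ad-hoc code, along with the earlier steps whose output it consumes. This shape is purely illustrative, not our actual schema:

```python
# Hypothetical shape for the plan the agent emits: an ordered list of
# node specs, each either a pre-built node or an ad-hoc code node, with
# the earlier steps whose output it consumes. Node names are invented.
plan = [
    {"id": "read",    "node": "read_pdf",      "inputs": []},
    {"id": "extract", "node": "extract_table", "inputs": ["read"]},
    {"id": "site",    "node": "web_search",    "inputs": ["extract"]},
    # A novel action the planner has no pre-built node for: it writes
    # code to run inside the node instead.
    {"id": "score",   "node": "run_code",      "inputs": ["site"],
     "code": "def run(rows): ..."},
]

def execution_order(plan):
    """Check every step's inputs come from earlier steps, then return the order."""
    seen = set()
    for step in plan:
        assert all(dep in seen for dep in step["inputs"])
        seen.add(step["id"])
    return [step["id"] for step in plan]
```

Validating the dependency order up front is what lets the runner know which steps can be parallelised and which must wait.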

And then it's asking me: is this a reusable thing? So let's say it's reusable.

Just so you're aware of what it's doing here: these are tool calls. It's an agent that knows it has access to certain information about the nodes that are available, and on the fly it's figuring out which of those nodes it needs to pick.

So here, let's say we've got a prebuilt scoring system, so let's just pick that for scoring companies. I think we've given it all the information it needs to go and run this process.

Fingers crossed, it'll try and build this workflow now that it has all this information. Bear with me. And there we go.

So it's now come up with a template for what is effectively a reusable automation. This is something I might want to run every time one of these files comes in. It's got a prescribed workflow.

We have our guardrails. I want it to read the PDF the same way every time, I want it to extract data the same way, and I want it to be able to run these web searches to figure out the companies, etc.

Audit trails, model choices, and editable prompts

It's run some validation and so on. And beneath that, if we're sticklers for process, we can go and look at detailed information about every step it's carrying out. So, you know, if I'm particularly keen on using one AI model versus another, I can say: for the data extraction, I'd rather use GPT-5 Mini. Or if I want to adjust the prompt for exactly how it does the extraction, I can edit that here.

And so it provides a balance between control and guardrails, but also being able to vibe your way into things.

So we can save this and call it "broker scores". We save that here, and because we told it we wanted a reusable automation, it's created a little interface for us. We can go and upload, I think it was called RegTech, here we are, we can upload this file, and I can share this across the team and say: hey, I've run and tested this automation a bunch of times, and it fits the purpose of what we're trying to achieve every time we get these broker reports and need to run this process.

And then effectively what happens is it runs, and you have a full audit trail: this run is processing this PDF file, and I imagine the read-PDF step might take a while because it's 120 pages.

Reusing modules across workflows and grid-based execution

What we actually try to build for is this sort of modularity. If I have a workflow that I've created and I want to reuse it inside another workflow, I can do that here. So I have a workflow for getting the website of a company, and I can just reuse that inside another flow like this.

Similarly, sometimes people work with flow-based systems; other times people work with grid-based systems. We effectively have exactly the same thing in the form of a grid. So instead of working with an Excel file, at the top I might just have a big list of companies. In fact, I might have one already set up prior to this, though this one's pretty small. And I can take that workflow I had for retrieving the website for each of those companies, say I want the company name to come from here and the description to come from the description column, add that to the grid, and then go and run this flow. It's going to use that exact prescribed, defined process, because people in companies really need to make sure things go exactly the way they want, and it'll run those processes exactly according to that prescribed flow.

It gives you more granular control, but going back to what I originally said, it also solves the problem I talked about: each of these is a little agent with a small task, and I can be sure that little agent is going to do that one task very well, rather than drifting away with a big problem that's going to explode my costs.

If we go back to this, let's see. Yep, so it's read the PDF file. You can see that it's kind of pulled in all of this information.

I think I've only got 30 seconds left, so this is gonna be a bit of a longer automation to run. But effectively, you can allow these things to run for hours, days, whatever it might be, and it provides resilience. You can pause it, you can resume. If there's an error, you can check all the checkpoints and then pick back up from where you left off.

Conclusion: Guardrails Against Context Explosion

So a quick demonstration of some of the ways that you can create guardrails to manage this context explosion problem.

Thank you. I think that we're open for questions just in case anyone has anything. Yeah.

Q&A

Sorry, can you say that again? Sorry, I'm missing what you're saying. I'll come closer.

We have a million rows. Yeah.

Handling a million rows with elastic workers

Yeah, understood. So the question was: what happens if you have a million rows? Effectively, when I hit this run button and it's running on all of the rows, each of those items is being sent to an individual worker that works on that task.

That creates two benefits. One is what's called elastic scaling: you can flexibly scale. If someone suddenly sends a request dealing with a million rows, you can scale from having five workers to having 100, so that they can process those rows concurrently. Obviously there are rules, because some things might be dependent on others, so one thing will still need to process before another can happen, but we create that taxonomy, that set of rules for what needs to run first and what needs to run second, beforehand.

The second problem it solves is that if one of the workers fails for whatever reason, it doesn't fail the entire process. It just means that one row will come up with an error, and you'll see that, but the rest of the process can still run.
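A minimal sketch of that fan-out pattern, using a thread pool as a stand-in for the elastic worker fleet; `process_row` is a hypothetical placeholder for the per-row agentic work:

```python
# Sketch: fan each row out to its own worker so one failure doesn't
# sink the run. The thread pool stands in for the elastic worker
# fleet; process_row is a placeholder for the per-row agentic work.
from concurrent.futures import ThreadPoolExecutor

def process_row(row):
    if row.get("name") == "BadCo":     # simulate one row failing
        raise ValueError("lookup failed")
    return {**row, "revenue": "£1m"}   # placeholder result

def run_rows(rows, max_workers=5):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_row, r) for r in rows]
        for row, fut in zip(rows, futures):
            try:
                results.append(fut.result())
            except Exception as e:
                # This row is marked as errored; the rest still complete.
                results.append({**row, "error": str(e)})
    return results
```

Because each row is an independent unit of work, scaling up is just raising `max_workers`, and a single failure surfaces as one errored row rather than a dead run.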

Yeah, for sure.

Adapting extraction to a deduced data model

So what you can do is set up an agent that looks at the PDF and determines what the appropriate data model is. Where we had a prescriptive data extraction, just saying take the name, take the description, take the financials, whatever it might be, you could flexibly set up that extraction to be based on whatever data model the agent deduced was relevant from the input side.

Yeah. Yeah, it could. It could, yeah.

So you could say: this is my model document; I want you to base the extraction on that model document rather than something I've just hand-typed.

Yeah. Yeah. Yeah. Sorry.

Go ahead. Yeah. Yes, you can. Yeah.

Yeah, so let me see if we can do it. Let me open it up here. Yeah, exactly, yeah.

Checkpointing: rerun from a failure point to control cost

So maybe if I go into an existing execution here, we might be able to see it.

Yeah, so you can do a rerun from here, for example, as a checkpoint. So if it failed at a certain point, you could just say: I want to rerun it from the failure point. Or if it still worked but you want to rerun it from a specific point, you can do that just to manage cost. Yeah.
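The checkpointing idea can be sketched like this, with a JSON file standing in for a real execution log; all the names are illustrative:

```python
# Sketch of checkpointing: persist each completed step's output so a
# failed or paused run resumes from the last good point instead of
# re-spending tokens on work already done. The JSON file is an
# illustrative stand-in for a real execution log.
import json
import os

def run_with_checkpoints(steps, state, path="checkpoints.json"):
    """steps: list of (name, fn) pairs applied in order to state."""
    done = {}
    if os.path.exists(path):
        with open(path) as f:
            done = json.load(f)
    for name, step in steps:
        if name in done:        # completed on a previous run: skip it
            state = done[name]
            continue
        state = step(state)
        done[name] = state      # checkpoint after each step
        with open(path, "w") as f:
            json.dump(done, f)
    return state
```

Rerunning from a chosen point is then just a matter of dropping the later checkpoints from the store before running again.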

Passing only critical context between steps

I guess just to quickly show again: the way this data extraction works is that it goes into the file and pulls out the hundred-odd, 180-odd names that came out of it. Then that list gets passed into the next step, which is the point about passing only the information that's critically needed at each step, rather than a huge conversation history or a huge amount of context.
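A tiny sketch of that hand-off: the extraction step emits only the structured list the next step needs, so the full document text never travels forward as context. The names and the trivial parsing here are illustrative stand-ins:

```python
# Sketch of the hand-off between steps: extraction returns only the
# small structured list the next step needs, so the 120-page source
# text never travels forward as context. All names are illustrative.

def extract_names(pdf_text):
    # In the real system an agent does this; a trivial parse stands in.
    return [line.split(":")[0] for line in pdf_text.splitlines() if ":" in line]

def lookup_websites(names):
    # This step sees only the names, never the full document text.
    return {name: f"https://{name.lower()}.example.com" for name in names}

doc = "Acme: a RegTech vendor\nGlobex: a compliance platform"
sites = lookup_websites(extract_names(doc))
```

Each step's output contract is the only thing the next step depends on, which is exactly what keeps per-step context small and cost flat.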

Great. Thank you.
