From prompt to production: what it takes to build a reliable AI agent

Introduction

Thanks very much for the invite. It's a really cool moment for myself and the team to actually be demoing this. We've been building this thing for two years.

So you mentioned companies taking care of the boring stuff so people can actually get the benefit of AI and agents. That is exactly why we exist.

Founding Story and Prior Lessons

It's like the world's most 40-year-old person's business. We have, for 15 years, built startups in the Toon and learned so much.

So back in 2010 I was one of five co-founders of a company called Partnerize, an ad tech company. We started out with five Geordies and grew it to 250 people. It was a mad journey, and we almost died a few times. We ended up scaling that into offices around the world, and we were lucky enough to have clients like Apple and Google on the books. That is not a soft flex. The reason I mention it is that we got to deeply understand what it takes for big businesses, and really businesses of all sizes, to actually integrate third-party tools into their systems and into a production environment.

From Ad Tech to Regulated Crypto

After eight years of that, great times but also a grind, I went off and did the most cliché thing a techie can do and went to work in crypto. That ended up becoming Bottlepay, which was like a payment rail, but we used Bitcoin and a scaling network called Lightning to try and take on the behemoths of Visa and Mastercard.

A bit of a lofty goal, definitely too early, but again, that was a regulated industry, something totally new for me, and wow, building in a regulated industry is like creative quicksand. It was really difficult, but again, very nuanced, and we've piled all of the learnings from those two journeys into what's become Rightbrain.

Co-founder Experience and the Rise of LLMs

And my co-founder, Matt Wells, was involved on and off in those two startups and then went and did four years at GitHub. He ended up scaling that thing, heading up the deployment of their cloud compute.

And for the last two years of that tenure he worked on Copilot, which was kind of the first commercialization of LLMs at scale. So we realized that experience would be really useful to build a platform, or the fancy word now is a harness, around this technology.

What This Talk Will Cover

What I want to do with this live demo, which is going to go swimmingly well, is set it going, and hopefully it's something relatable. Then I'm going to take you through all of the pitfalls and the gotchas that people have experienced. We've talked to over a hundred companies around the world, and most haven't got AI into production. I want to list off the ten things that we found are most repeatable across all of those journeys, and how we've tried to solve them.

So it's a soft shill for Rightbrain, but hopefully the information is relatable to whatever you're doing and whatever platform you want to use, or if you want to build it yourself.

Quick Audience Check: Prototype vs Production

Okay, so what I'm actually going to demo... well, actually, before that I'll do a little bit of audience participation. Hopefully this isn't too cringe.

I just wanted a show of hands for people that have used AI in any way, like ChatGPT or Claude Code, or built something themselves. Min. Okay, now keep your hands up.

Of the hands that are up, how many have transitioned the things that they've built into a live production environment that clients use or people at your companies use? Ooh, okay, cool. So for those that put their hands down, this talk is very much for you.

For those that kept your hands up, fair play, you've done really well, and I definitely want to chat to you after.

Live Demo: Turning Call Transcripts Into Action

So for a demo, I'm going to show you an agent that does something hopefully relatable. It takes a large amount of text, which could be anything, but for this demo it's a call transcript. We use Google Meet internally, and every call we have with people on the team or with clients, we automatically transcribe.

Our security and regs guy is here. So obviously, we ask for permission before we do that.

And I want to show you taking that and effectively coming out with some actionable insight. So I'm going to pick one of these, which is a chat with Ollie. He's here. Hello, Ollie. And hopefully we spoke some sense on this call.

And we want to basically break down the thousands of words into something actionable and then push that into another app. And I'm just going to show you Google Sheets, but it could be anything.

Why Workflow Integration Matters

So the whole reason that we exist, and I think the way to really unleash the power of AI, is to enrich the workflows and the platforms that you already use in your business, rather than context switch everyone to yet another dashboard or yet another app.

That's a big, tough, long sell. Nobody wants that.

So I've grabbed the URL of that doc, and I've created a new spreadsheet here. And I'm going to go into an agent that I created before and literally just give it this instruction: "Take the transcript from that Google Doc and update the sheet." All right.

Why Getting AI Into Production Is Hard

So what's going on here seems really simple, but under the hood, like I said, this is two years of effort, and there's a reason for that. We absolutely could have skipped doing certain things, but the reason that AI hasn't made it into large business and into the enterprise is that there are a lot of things missing that you would traditionally get with other software. So I'm going to list those out for you as this does its thing.

Auditability, Compliance, and Trust

The first is auditability, the audit trail. For those who've used any of the large closed-source models, so ChatGPT or Claude or Google, you don't actually get the checks and balances about what information went in. You get the eventual result, but there's no auditability.

At Bottlepay, we were regulated by the FCA. Once a year, the FCA would come in, pick 30 customers that we had onboarded that year, and say: for all third parties that you used, show us why you trusted the response or the reasoning, the pass result that you got from that third party.

Now, imagine that you built a little app in ChatGPT and you hooked that in via API. How would you actually do that? You wouldn't be able to show them: this was the information that I sent in, this was the information about the customer, here was the pass result that came out, and this is why I trusted it.

So one of the first things that we have done is, on this right-hand side here, you can see the raw inputs and outputs of every tool and every model that's being used, so that you have this retrospective analysis you can do: not just what information went in and what came out, but who called it, when it was called, what model was used, and what version of the model was used.

And again, these things are hard to predict at the start. But it's when you create something with AI and put it on the table of the business to say, let's transition this into production, that the risk guy goes: hang on a minute, we need all of these checks and balances. We are in the EU, this is customer data, and we have to adhere to GDPR. You're sending this information into a black box that exists in the US. Obviously it's just getting a big fat red X.
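As a rough illustration of the kind of audit trail described above, here is a minimal Python sketch. It is not the platform's implementation: the `AuditLog` and `AuditRecord` names, the field layout, and the checksum are all assumptions made for the example. It wraps any model or tool call and records who called it, when, which model and version, and the raw input and output.

```python
import hashlib
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any, Callable


@dataclass(frozen=True)
class AuditRecord:
    """One immutable entry per model or tool call (hypothetical layout)."""
    caller: str
    called_at: str
    model: str
    model_version: str
    raw_input: Any
    raw_output: Any
    checksum: str


class AuditLog:
    """Append-only, in-memory log; a real system would persist this."""

    def __init__(self) -> None:
        self._records: list[AuditRecord] = []

    def call(self, caller: str, model: str, model_version: str,
             payload: Any, fn: Callable[[Any], Any]) -> Any:
        # Run the wrapped call, then capture everything needed to replay it.
        output = fn(payload)
        called_at = datetime.now(timezone.utc).isoformat()
        body = json.dumps(
            {"caller": caller, "called_at": called_at, "model": model,
             "model_version": model_version, "input": payload,
             "output": output},
            sort_keys=True, default=str,
        )
        # The checksum makes later tampering with a stored record detectable.
        checksum = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self._records.append(AuditRecord(caller, called_at, model,
                                         model_version, payload, output,
                                         checksum))
        return output

    def records(self) -> list[AuditRecord]:
        return list(self._records)
```

With something like this in place, the question "what went in, what came out, and who asked" can be answered from the log instead of from memory.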

This has now completed, and I'm just going to show you what it's done. Essentially, it's gone through, taken that information from the Google Doc, and created the actionable insights: key insights and action items, split out by owner. Here you can see all the various things that have been mined from that large document.

Now, why is this useful? So much golden information sits in a business unused because it's a huge amount of data. And prior to AI, it would take an incredibly long time to mine all of that data, summarize it, and come out with that kind of searchable, actionable insight.

We were talking to a customer very recently, and they had 5,000 prospecting calls, with a document that underpinned each one of those calls and all of the learnings, but they didn't actually reuse any of that information.

The beauty of AI is that you can now talk to that information. You should be able to take all of the things you've learned in the past to fuel the decisions that you make in the future. And we all know this, but we can't actually do it in our day-to-day jobs because of the various things that are stopping us putting this into production.

Okay, so we saw this demo execute. And what I want to do is actually break down what happened here and the various gotchas to get this into production.

How am I doing for time? Good, I think, yeah. I'll actually do it myself. Okay. Class, mate. All right. Am I? Okay.

So the first thing is this looked very simple. You know, I just put in a super simple input. But the configuration behind it, this is where a lot of the magic is.

Reliability and Resilience (When Models Go Down)

And from a platform like ours, or a harness, these are the things you need to be looking for. The first is that if you're going to put something into production, it needs to be reliable and resilient.

Okay, so I picked Claude Opus 4.6 here as a model. If we go to the Claude status page and look at it, it's like Platoon. Absolutely brutal uptime, and it's because these models are getting hammered all of the time.

But if you are doing something in production and you become reliant on the functionality of whatever the AI tool is, you need to have a solution for these models going down, because they go down all the time. For those of you that have used Claude Code, I'm sure you've run into the dreaded 401 or some red output where suddenly you just can't work anymore. You can't vibe code because it's down for the next two hours. So how did we come up with a solution to this?

We came up with the idea of a fallback model. So essentially what we do is we package up the functionality of the agent into something that can be fed into any model.

And you could do that with a framework such as LangChain, or you could roll your own. But, you know, that's a decent amount of work, to have one queryable format that any model can read and execute.
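The fallback idea can be sketched in a few lines of Python. This is a minimal roll-your-own version, not how Rightbrain or LangChain does it: each model is assumed to be wrapped as a plain function from prompt to text, and `run_with_fallback`, `flaky_primary`, and `steady_backup` are hypothetical names standing in for real provider clients.

```python
from typing import Callable, Sequence

# Assumption: every provider is wrapped behind the same prompt -> text shape.
ModelFn = Callable[[str], str]


class AllModelsFailed(RuntimeError):
    """Raised when every model in the chain has errored."""


def run_with_fallback(prompt: str,
                      models: Sequence[tuple[str, ModelFn]]) -> tuple[str, str]:
    """Try each model in order; return (model_name, response) from the first
    one that succeeds."""
    errors: list[str] = []
    for name, call in models:
        try:
            return name, call(prompt)
        except Exception as exc:  # outage, rate limit, timeout, and so on
            errors.append(f"{name}: {exc}")
    raise AllModelsFailed("; ".join(errors))


# Stand-in models for illustration only.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("provider unavailable")


def steady_backup(prompt: str) -> str:
    return f"summary of: {prompt}"


used, answer = run_with_fallback(
    "call transcript",
    [("primary", flaky_primary), ("backup", steady_backup)],
)
```

Returning the name of the model that actually answered matters for the audit trail: a fallback that silently swaps models would make the record of which model was used unreliable.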

Repeatable Response Structures (Schemas)

The second is repeatable response structures. So what do I mean by that?

In our platform, you can create a tool that you can make available to an agent. Now let me go to the tool I made here.

The prompt is actually reasonably complex, but what it's doing is informing the model about exactly how I'd like it to interpret the document. And we can pass in, for the nerds in the crew, a JSON schema describing exactly how the model must respond. If you just fire a generic query at any model, then depending on its creativity settings it could respond in a different format each of the ten times you execute that same query. So if you're going to interoperate an AI agent with your own internal platforms, or hook it into an external API, you need guarantees around how the agent will respond.

So that was one of the big things that we realized early on: we need to be able to constrain the model and say, every single time you respond for this particular tool, you must always have a summary node and a call date and a call title. That way you, or your techies or your mobile developers, can rely on this AI app that's doing something quite abstract or intelligent with the information you're feeding it, because it will respond in the exact same way every time. So again, things that you don't get from a model by default. All right.
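To make the "same shape every time" guarantee concrete, here is a small Python sketch that rejects any model response missing the fields mentioned above. The key names `summary`, `call_date`, and `call_title` are assumed spellings, and the hand-rolled check stands in for a full JSON Schema validator so the example stays standard-library only.

```python
import json

# Assumed field names, based on the summary / call date / call title example.
REQUIRED_FIELDS = {"summary": str, "call_date": str, "call_title": str}


class SchemaError(ValueError):
    """The model response did not match the agreed structure."""


def parse_response(raw: str) -> dict:
    """Parse a model response and enforce the required fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            raise SchemaError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise SchemaError(f"{field} must be of type {expected.__name__}")
    return data
```

Downstream code can then read `data["summary"]` without defensive checks, and a `SchemaError` becomes a natural trigger for a retry or a fallback model.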

Practical Integrations Without Blowing the Context Window

The next thing that I did for this agent is that I made the actual integration of Google Docs and Google Sheets available.

Now, if you, say, with Claude Code, ask it to do things with Google Sheets and you auth your account, the tools that it offers, the 17 by default with Google Sheets, if you enable all of those tools, it actually generates 900,000 tokens in the request to the model, which, for most models, would go over the context window. So it's just going to insta-fail, and you're not really going to understand why.

So you need a way to intelligently pass the features of the tools of an integration and only use those that are actually required to facilitate the request.
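One naive way to picture that selection step is the Python sketch below. It scores each tool by word overlap with the request and keeps the best matches that fit a token budget. Everything here is an assumption for illustration: the `Tool` shape, the tool names, the token counts, and the scoring itself, which a real system would likely replace with embeddings or a routing model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    schema_tokens: int  # assumed pre-computed cost of sending this tool's schema


def select_tools(request: str, tools: list[Tool], budget: int) -> list[Tool]:
    """Keep only tools whose description overlaps the request, best match
    first, skipping any tool that would push the total over the budget."""
    words = set(request.lower().split())
    scored: list[tuple[int, Tool]] = []
    for tool in tools:
        overlap = len(words & set(tool.description.lower().split()))
        if overlap:
            scored.append((overlap, tool))
    scored.sort(key=lambda pair: pair[0], reverse=True)

    chosen: list[Tool] = []
    used = 0
    for _, tool in scored:
        if used + tool.schema_tokens <= budget:
            chosen.append(tool)
            used += tool.schema_tokens
    return chosen


# Hypothetical tools and token costs.
catalogue = [
    Tool("read_document", "read the text of a document", 1200),
    Tool("append_rows", "append rows to a sheet", 1500),
    Tool("create_chart", "create a chart from a range", 4000),
]
picked = select_tools(
    "read the transcript document and append rows to the sheet",
    catalogue,
    budget=3000,
)
```

The point of the budget is that the failure becomes explicit and local: a tool is left out on purpose, rather than the whole request failing at the provider for exceeding the context window.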

Versioning, Drift, and Iterating Safely

Another one that's really important is versioning. So for the audit guys and the risk guys, understanding drift is a real big problem, again, if you're using a generic model.

Because for some reason, especially with Anthropic, the models seem to get more stupid over time, and it's only when they release a new version that it seems to get intelligent again. So within our system, very much like Git, we have version control. Every single instance or change for an agent is captured. So if the prompt changed: okay, well, what changed, who changed it, how did that affect the output? These are things that are really important to your internal teams when they're trying to understand why something changed or why it went wrong.

But it's also very important to actually iterate on an agent, especially the prompt. And the more times you run something, you learn from its responses how it's behaving. And you can use those real world responses to tune your prompt and hopefully get a better output.
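A minimal Python sketch of Git-like prompt versioning follows. It is an illustration under assumptions, not the platform's design: `PromptHistory` and `PromptVersion` are hypothetical names, storage is in memory, and only the prompt text is versioned.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PromptVersion:
    number: int
    author: str
    created_at: str
    prompt: str


class PromptHistory:
    """Append-only history: every change creates a new numbered version."""

    def __init__(self) -> None:
        self._versions: list[PromptVersion] = []

    def commit(self, author: str, prompt: str) -> PromptVersion:
        version = PromptVersion(
            number=len(self._versions) + 1,
            author=author,
            created_at=datetime.now(timezone.utc).isoformat(),
            prompt=prompt,
        )
        self._versions.append(version)
        return version

    def get(self, number: int) -> PromptVersion:
        return self._versions[number - 1]

    def diff(self, old: int, new: int) -> list[str]:
        """Answer 'what changed?' as unified diff lines between two versions."""
        return list(difflib.unified_diff(
            self.get(old).prompt.splitlines(),
            self.get(new).prompt.splitlines(),
            fromfile=f"v{old}", tofile=f"v{new}", lineterm="",
        ))
```

If each run's audit record also stores the prompt version number, a change in output can be traced back to either a prompt edit or a change on the model side.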

Better output might mean more interesting information. It might also mean shaving off 50% of the tokens, and suddenly you've made this thing half the price. Those are the kind of big gotchas, and I would say there's one more to touch on here.

Observability and Picking the Right Model

This one is for your infrastructure team, because they're going to be the ones who actually have to support this thing. Again, with a black box model it's very hard to understand what actually happened under the hood. So you want observability. You want to know, for that query, where did it spend all of its time? And you can see here Opus took a minute and a half. That's 4.6; I told you it got more stupid. And a minute and a half here. So what could we do with that information?

Well, let's try a different model. What we found is that some of the open-source models are incredibly fast. You don't need all of the understanding of Reddit, I mean the internet, that's loaded into a big generic model. You can use something that can maybe just manipulate text in a better way. Some of them are great at creating content that sounds more human. Some are great at more generic requests.

So understanding actually how these things are working and the overall reliability and resilience of these things is crucial in putting it live.
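For the "where did it spend all of its time" question, a small Python sketch of per-step timing is shown here. The `Trace` class and the step names are assumptions for the example; production systems would more likely emit these spans to a tracing backend than keep them in a list.

```python
import time
from contextlib import contextmanager
from typing import Iterator


class Trace:
    """Collects (step name, seconds) pairs for one agent run."""

    def __init__(self) -> None:
        self.spans: list[tuple[str, float]] = []

    @contextmanager
    def span(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the step raised.
            self.spans.append((name, time.perf_counter() - start))

    def slowest(self) -> tuple[str, float]:
        return max(self.spans, key=lambda item: item[1])


trace = Trace()
with trace.span("fetch_document"):
    pass  # stand-in for reading the transcript
with trace.span("model_call"):
    time.sleep(0.05)  # stand-in for a slow model response
```

Calling `trace.slowest()` then points at the step worth optimizing, which in the demo was the model call and is the cue to try a faster model for that step.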

Time check? I think about 15 minutes, right? Three minutes left.

All right. Okay. Strong finish required. Check me notes.

Is it Q&A after, or is that...? Yeah. Okay. Cool.

Common Reasons Companies Stall

So I guess to wrap up, anecdotally from all of those companies that we spoke to, why aren't more using it in production? So there's sort of three groups of companies.

The first is they had something built, but by an external dev house or software company, and it was hard-coded to a model, which then is deprecated, or the inputs change, the APIs change, and it breaks. They don't have the domain knowledge in-house, so they can't do AI anymore.

The second is they've prototyped something, because with these AI tools it's incredibly easy to create a compelling demo, something that looks super shiny. But transitioning that into a production environment is basically what we've been talking about for the last 13 minutes.

The third, and actually this is probably the biggest group: most businesses that aren't massive enterprises that can just hire AI teams for themselves are so focused on their own core competency and their own product, and the roadmap's like two to three years on average, that they just don't have the time. They know that they need to do AI, but they don't know which of all their problems to point it at. And of course every business has, you know, a massive sheet of 50 problems. Of that 50, what's the best fit for AI? Where am I going to get the most bang for my buck? So they need to work with a company that actually understands the domain. Those are the three buckets.

Conclusion

So I realize this has been very much a Rightbrain demo, but I hope these lessons about why we haven't been able to get things into production as businesses are generic in their nature.

So either you're going to be in a build or a buy scenario, and if you're going to build it yourself, hopefully some of that was useful, and if you're going to buy it, just come chat to us. We'll help you out.

Thank you.