From Production Alerts to Automated Fixes

Introduction: Curiosity and Criteria in Building with AI

One thing I would love for you to take away from our chat here is that you really need two things to start building: curiosity and criteria. To the point you were making, you cannot let models do everything.

You need to have human criteria to do things. And in our humble experience, we would love to talk about what we've done with our curiosity and criteria, and what we've built. If that serves as an example, great.

It doesn't have to be something as sophisticated as this. It could be: I have a problem that I want to solve. Maybe it's scheduling with my family.

I have so many different invites. I'm manually putting them into Gmail, right, into our calendar. Can I just quickly find a faster way to manage our schedules? It doesn't have to be anything crazy, but just think of a problem that you want to solve throughout this conversation that we have today, and try to build something off of that.

Meet the Speakers

Before I get into too much detail, maybe I should introduce myself. My name is Yumi Jo.

I'm from the Bay Area. Before we came to Madrid, I spent the last decade or so working in go-to-market in the Bay Area, and I'm a Stanford grad. And Andy, you can introduce yourself.

Andy’s Background in AI, Search Ranking, and On-Call Pain

Yeah, my name is Andy. I spent 10 years in the Bay Area as well. I worked in AI and machine learning for a company called Zillow, which is like Idealista for the US real estate market. I was leading the search ranking team there. Search ranking is a pretty important system, since all personalized search goes through it, and as part of that I was essentially on call for four years, meaning that when something broke, I was the person who opened the laptop in the middle of the night and started troubleshooting. That's what led us to the idea of our base.

And today we're going to show you not a demo demo, but more how we're using AI in our developer workflow, because for me as a developer, as a principal applied scientist, it's really been a game changer. So, yeah.

AI and the Future of Developer Work

Awesome. And before we go into this in more detail, one thing that I say often, and I want everyone to hear my point of view on this: there's a lot of this doomer type of AI apocalypse talk, like jobs are going to disappear, there's not going to be any more need for developers. I don't know how many of you are developers or have some technical background, but I wouldn't be so worried about that.

Yeah, honestly, the way that you work, the way that all of us work, is going to change, right? You're not going to need to write lines of code one by one like this, right? But the things that you'll be able to do now as a result of having AI on your side will let you do a lot more and give you a lot more freedom and fewer dependencies. So I think that's something interesting to think about.

The Problem: Production Incidents Are Getting Harder

So very quickly, what is the problem that we wanted to solve? And then maybe this helps you think about the problem that you want to solve in your own experience and your own ideas.

So, there's a problem that even my dad had way back when, when he was a technical person dealing with problems in production, which is that it's very complicated, right? Especially now with AI, because now everyone's vibe coding things, and everyone thinks they know, and that they're automatically some superpower developer. And it's really hard to maintain code. It's not so easy to fix errors in production.

In fact, what usually happens is, you know, uh-oh, something broke. You get an alert. You get an angry customer on the phone. Hopefully it's not that late. You have to open your laptop, you have to look at many different things. Maybe they're logs, maybe they're metrics, maybe they're lab dashboards. You're pulling at your hair, you don't know.

Then you have to call the senior, the most senior person, and bother that person again, right? And it's Slack and more meetings, and it's going to get worse, because there's more code being thrown out there, and not everyone that's launching code into production actually understands system thinking, actually understands architecture or anything that they should probably plan for before they just launch things, right? This is the problem that we've seen.

Andy and I worked together at the last company we were at before we exited as well, and he started building something because this is an ongoing problem.

So maybe you can give them a little show and tell on what it is.

The Approach: Human-in-the-Loop, Agent-to-Agent Workflows

Absolutely, so I think the idea of our base boils down to, in the future we'll have more and more workflows that are agent to agent or machine to machine, right?

I think no human being was designed to look through stack traces or browse through thousands of log files. And that's where we pick up with our base.

We believe in an idea where, number one, humans can do what humans do, which is being at the reasoning level and understanding: okay, I have four hypotheses of why this broke. Now I need to actually dig in with AI to understand which of those four proves to be valid, right?

Human in the loop. I think when we talk about human in the loop, we talk about accountability. And that for me is like the most human trait that we can have. And if we build accountable systems, we can get there to a degree with agent to human interaction.

Accountability and How Agents Differ from Humans

And number two is that the way agents behave is different from humans, right?

GitHub was down today. I don't know who tried to run CI/CD flows on GitHub Actions today, but it was down for a good two or three hours or so. And I think if you compare the GitHub status page from the pre-agentic era to the post-agentic era, you see that pre-agentic it's all green, right? It's beautiful, the SLAs are heavenly. And then agents start to interact with the system: Claude Code, Codex, me myself, right? I don't know how many pull requests I do a day, but it's insane, the volume that goes up. And the same things happen in observability as well.

Why Observability Gets Harder in the Agentic Era

And observability is kind of a fancy way of saying: how can I make sure that I understand whether my system is up or down, or whether there's any error in it?

Product Walkthrough: Automated Incident Triage and Investigation

And what I'm going to show you now is a little sneak peek of one of the workflows that we tackle, which is: an error happens in your system. It can be an API, it can be an expense report generator, it can be a chatbot, whatever it is, right?

What happens in the moment that something breaks? Usually a human being would, back in the day, analyze this. And we now have a workflow that automates all of this.

So I'm just going to show you teasers.

Triggering an Error to Simulate a Real Incident

My background is an Idealista kind of scenario. So this is a house price prediction example, with endpoints that you'll see here.

For anyone not familiar, think of it as a backend system, or as what happens when you try to log in somewhere. And I'm going to just trigger a little payload here, let me say one is. And what's going to happen is that we created an error that is essentially sitting behind this API. So, sorry, no API key provided. Here we go. Very secure. All right.

And you should focus on, let me zoom in a little bit, down here. This should, in theory, switch to a nice little 500 error. I guess so, right? All right. Here we go. All right. Internal server error. Beautiful. Okay.
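The two responses seen in this part of the demo (a rejected request with no API key, then a 500 once the planted error is hit) can be sketched roughly as follows. This is a minimal illustration, not the demo's actual code: the `X-API-Key` header, the `square_meters` field, the pricing constant, and the choice of a 401 status for the missing key are all assumptions made up for the example.

```python
# Hypothetical stand-in for the house price prediction endpoint in the demo.
# Header name, payload field, and pricing constant are assumptions.
PRICE_PER_SQM = 3000.0


def predict_price(headers: dict, payload: dict) -> tuple:
    """Return (status_code, body) for a prediction request."""
    # Requests without an API key are rejected before any work is done.
    if not headers.get("X-API-Key"):
        return 401, {"error": "no API key provided"}
    try:
        square_meters = float(payload["square_meters"])
        return 200, {"price": square_meters * PRICE_PER_SQM}
    except Exception as exc:
        # Any unhandled failure surfaces as a 500; this is the kind of
        # event a monitoring tool would pick up as an incident.
        return 500, {"error": "Internal Server Error", "type": type(exc).__name__}
```

Calling `predict_price({}, {})` gives the missing-key rejection, while a request with a key but a malformed payload falls through to the 500 branch.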

So now over here, this is the dashboard that we essentially have. We have this inference server that we're listening to. I'm just going to reload this real quick. And in just a second, if everything goes right, you should see an issue pop up here at the end, where you see, OK, there's an inference server issue.

Let me check. I don't know why it's not running right now. Could be. This is a nice little demo issue. So while this is hopefully happening in the background, one thing that I'm going to show is what this will look like if you actually have an issue that's running somewhere in production.

From Stack Trace to Triage Loop

So the incident comes in. It's a stack trace. We listen to that. A first agent gets spun up.

The agent has access to your code base and has access to the stack trace. You can add custom context to it with regards to, okay, this is what the service is doing.

Here are the coding conventions that we're using. Here's some information about stuff that you will have to run if you are in the CI/CD process, for example linting, all the technical details.

With that information and the runbooks that are available for the server, it kicks off a triage loop, which follows a read-only paradigm, runs on top of the code base, and works out: okay, with the stack trace and the code base, is there something that I can actually fix?
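As a rough sketch of what a read-only triage step could look like, the snippet below bundles the stack trace with the custom context and checks which stack frames point into the code base, without modifying anything. The names `TriageContext` and `frames_in_repo` are hypothetical, and the frame pattern assumes Python-style tracebacks; the talk does not describe the actual implementation.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the triage step only reads its inputs
class TriageContext:
    stack_trace: str
    service_description: str = ""
    coding_conventions: str = ""
    runbooks: tuple = ()


# Matches frames of the form:  File "path", line N, in func
FRAME_RE = re.compile(r'File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>\S+)')


def frames_in_repo(ctx: TriageContext, repo_files: dict) -> list:
    """Return (path, line, func) for stack frames that point into the repo.

    repo_files maps repo-relative paths to file contents and is never mutated.
    """
    hits = []
    for match in FRAME_RE.finditer(ctx.stack_trace):
        if match["path"] in repo_files:
            hits.append((match["path"], int(match["line"]), match["func"]))
    return hits
```

If at least one frame lands in the repository, there is something concrete for a later step to look at; if none do, the error likely originates outside the code the agent can see.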

Now we can click into this real quick. It runs an error analysis, steps through a multi-incident flow, and then afterwards comes up with an investigation summary and has a classification head on top of that, which essentially decides how important an issue this is. In this case, it said, okay, well, this is noise, so we can actually discard this.
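To make the idea of a classification step on top of the investigation summary concrete, here is a minimal sketch. It uses hand-written rules as a stand-in; the talk does not say how the real classifier works, and every field name, threshold, and label below is an assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    NOISE = "noise"
    LOW = "low"
    HIGH = "high"


@dataclass
class InvestigationSummary:
    error_type: str
    affected_requests: int
    fixable_in_repo: bool


# Hypothetical error types that are known to be harmless.
KNOWN_NOISE = frozenset({"ClientDisconnected"})


def classify(summary: InvestigationSummary) -> Severity:
    """Rule-based stand-in for the classification step."""
    if summary.error_type in KNOWN_NOISE or summary.affected_requests == 0:
        return Severity.NOISE
    if summary.fixable_in_repo and summary.affected_requests >= 10:
        return Severity.HIGH
    return Severity.LOW


def next_step(severity: Severity) -> str:
    """Map a severity to what happens next: discard, backlog, or investigate."""
    return {
        Severity.NOISE: "discard",
        Severity.LOW: "backlog",
        Severity.HIGH: "investigate",
    }[severity]
```

Keeping the routing (`next_step`) separate from the scoring (`classify`) means the rules could later be swapped for a learned model without touching what happens downstream.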

Reducing Alert Fatigue with Classification

This is one of the coolest features, I think, that we have right now. A lot of our clients are dealing with issues when it comes to overflow of systems and alert fatigue, so this cuts alerts down by 70% with 97% accuracy.

One thing that we do is we focus heavily on recall, meaning we want to be super, super careful not to miss the actual errors that we should be alerting on. So we're doing pretty well on that end.
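Since alert reduction, accuracy, and recall measure different things, a small worked example may help. The counts below are invented purely for illustration and are not the product's measured results.

```python
def alert_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute filter metrics from confusion-matrix counts.

    tp: real errors that were surfaced      fp: noise that was surfaced
    fn: real errors that were suppressed    tn: noise that was suppressed
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,              # share of correct decisions
        "recall": tp / (tp + fn),                   # share of real errors caught
        "precision": tp / (tp + fp),                # share of surfaced alerts that are real
        "alert_reduction": 1 - (tp + fp) / total,   # share of alerts no longer shown
    }


# Made-up numbers: 1000 raw alerts, of which 300 are real errors.
example = alert_metrics(tp=290, fp=20, fn=10, tn=680)
```

Focusing on recall means keeping `fn` small: a suppressed real error is more costly than an occasional noisy alert that slips through.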

Investigation Loop and Auto-Drafted Pull Requests

What happens if the triage loop continues and says, okay, this is actually something they should be worrying about? Then you can, here we go, click into one of these guys. An investigation loop gets spun up, and this investigation loop then calls a coding agent. The coding agent talks to the orchestrator, the orchestrator has all the information from the triage loop already, and at the end of the day it creates a merge request, or a pull request in this case, because it's integrated with GitHub.

This pull request, back in the day, would have been essentially what an engineer would have to come up with after troubleshooting. What we say is this is not a final solution. This is not going to be a self-healing loop that runs automatically, but it brings you from "I know nothing about the error" to "hey, I have a pretty good understanding of what could be broken and where it could be broken" way, way quicker than before. And I think this goes back to the accountability piece.
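The flow described in this section (triage, then investigation, then a coding agent, ending in a draft pull request that a human still has to approve) could be wired together along these lines. This is a sketch under assumptions: the three stages are passed in as plain callables, and `DraftPullRequest` and `handle_incident` are hypothetical names, not the product's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class DraftPullRequest:
    title: str
    diff: str
    approved: bool = False  # stays False until a human reviews the draft


def handle_incident(
    stack_trace: str,
    triage: Callable[[str], str],
    investigate: Callable[[str], str],
    coding_agent: Callable[[str], str],
) -> Optional[DraftPullRequest]:
    """Run triage, and only if it is worth pursuing, draft a fix."""
    verdict = triage(stack_trace)
    if verdict == "discard":
        return None  # noise: nothing is investigated and nothing is drafted
    findings = investigate(stack_trace)
    diff = coding_agent(findings)
    # The result is a draft for review, never an automatic merge.
    return DraftPullRequest(title=f"Draft fix: {findings}", diff=diff)
```

Nothing in this function merges or deploys; the `approved` flag is only ever flipped by a person, which is where the accountability sits.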

Testing, Coverage, and Human Approval

As part of this, we write unit tests, integration tests. We try to increase coverage as much as possible.

And then after you have the fix merged or the fix in your pull request, you can decide whether this is actually the pull request that you want to go with.

Using the Tool Internally via MCP

Now, how I'm using it in my workflow is I use it a lot as an MCP server. We talked about Claude Code. So I thought it might be interesting to talk a little bit about how I'm using our own tool.

Our own tool is actually monitoring our own tool as well, which is kind of like this nice little inception moment.

So I'm just going to authenticate the MCP server here. Continue. Let's do this for 90 days. Yep. All right.

Pulling Prioritized Incidents with Full Context

And now it has access to essentially all the issues that are happening in our code base across all the services that we're monitoring, inventorying, which is pretty cool, because when something like this happens, it helps me be way less stressed. Now I can ask stuff like, hey, can you give me the incidents I should focus on?

And this now has a connection to our MCP server. The MCP server understands, okay, let me pull your current incidents. It understands what is in the focus group and what is stuff that's in the backlog, and it pulls these out.
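The kind of answer a "which incidents should I focus on?" request might return can be sketched as a simple filter and sort. This does not use or represent any real MCP SDK; the `Incident` fields, the `"focus"` and `"backlog"` status values, and the numeric severity are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    id: str
    service: str
    severity: int          # higher means more urgent (assumed scale)
    status: str            # "focus" or "backlog" (assumed labels)
    has_draft_fix: bool = False


def incidents_to_focus_on(incidents: list, limit: int = 3) -> list:
    """Return the most severe incidents currently marked for focus."""
    focus = [incident for incident in incidents if incident.status == "focus"]
    return sorted(focus, key=lambda incident: incident.severity, reverse=True)[:limit]
```

A tool exposed to a coding assistant could return this list together with the attached triage notes and draft fix, so the context arrives in one place.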

And then what really helps is that it has additional information about the triage loop. It has additional information about the fix loop.

It has the merge request attached to it. It has all this contextual information that we would usually have to piece together from all the different information sources, right here at my fingertips.

And I'm working in my code base, which speeds up the debugging and troubleshooting steps so much. And these are the workflows where I'm saying I'm the human in the loop.

I'm still part of this process but I have kind of like this magical connection at my fingertips now to talk to my system and understand when something breaks.

And yeah this is in a nutshell what we're doing at our base.

Conclusion: AI Expands What Humans Can Tackle

Once again, we don't believe humans will go away. We believe humans will be able to go after way more difficult problems in way shorter time, and that's what I'm excited about, whether it's, you know, troubleshooting your services or cancer research. I feel like the true gift of AI is that it pushes us to the frontier wherever possible.

But to some of the remarks today, it has to be in balance with our creativity, with our minds. And if we put that to good use, I think we can really achieve great things. So yeah, thanks a lot.