Shipping Without Superstition: Executable Specs and Observability for AI Code

Introduction: Coding with AI, Magic vs. Science

Hi everyone, my name is Jag Rehal. I'm going to be talking about coding with AI: magic versus science.

And in case you can't tell from the picture: the magic is Oxford, and we're in Cambridge.

So these are the people from Oxford, right? The fictitious wizards. And we're the scientists.

So we're like: yeah, okay.

Why AI Changes Everything for Developers

Hey, look, I've been coding for a long, long time, and I think that coding with AI changes absolutely everything. I've seen the progression over the last six months, and what it can do today is honestly unbelievable. Over last Christmas I thought to myself: I'm not sure where the future is going, but what I do know is that 2026 is probably the last year that I write a line of code. I don't think I'll be writing code in 2027, and I definitely don't think I will be in 2028. And it's not just me.

I came to a software craftsmanship meetup group a couple of weeks back, and we had a round table where people talked about AI. Remember, this is a software craftsmanship meetup, where people do test-driven development and all the practices you'd want from XP and everywhere else. A lot of the people at my table are coding without ever opening an IDE. They're literally just using AI, all the time, 100%.

That's where we are right now. At the company I work for, people who aren't engineers, working in sales, client management and so on, are using Lovable, v0 and everything else. The things they're building would have taken me about a month, and they're turning them out in two days as apps in front of clients. People are using them, and people are loving them. That game has absolutely changed.

The “Magic” Approach: Tools, Tricks, and Hype Cycles

So what do we have now? Well, we've got a load of tools, and they come with models: some good, some bad, some really expensive. And we've got plenty of tricks.

So we've got all these things: Ralph loops, Clawdbot, spec files, plans, skills, CLAUDE.md files, AGENTS.md files, right?

And there's that one person on YouTube who's telling us how to do it all better with just one trick. And it looks like this, right? And then we cook, right?

So what we do is we have this cauldron called AI, and there's Oxford over here, throwing in their skills, their four-leaf clovers, a rabbit's foot, horseshoes, everything into this pot, right? Because that's what they want to do, and that's what they've been told to do. And the result? Well, sometimes it just doesn't work, and you're left disappointed.

It kind of reminds me of something. Way back in 2004 I joined a business information company, working on a business search engine, and I became a wizard in the dark arts of SEO. What I was trying to do, or what the company's game was, was to manipulate some kind of black box into doing what you want.

Kind of similar to AI. And unlike Harry Potter here, I'm not a wizard. So forget all of that kind of stuff, because it keeps moving, right?

When someone brings out a trick like that, the Claude team will immediately launch a brand-new feature, it'll move into their product, and all that hype around something that was magic two minutes ago is built into the product, right?

The products are amazing. The models might not move on much, but the products themselves are just incredible, and they'll keep getting better and better.

The “Science” Approach: Build on Deterministic Foundations

So here I'm talking about the science of AI coding, right? These are things that will stick around far longer than some guy's hack on YouTube. And we're in the era of AI-first libraries that are built specifically for coding agents.

And just to remind you: AI will only ever be as good as the foundation it builds upon, right? You're going to see this running throughout this entire talk.

The first step has to be scientific. AI is awesome, but if the foundation is wobbly, the next step is going to be wobbly, and it stays wobbly forever onwards.

Workflow Discipline: Using Agent Hooks to Force Checks

So the first thing I'm going to talk about is hooks. If you use Claude Code you have them already, and Codex launched them as an experimental feature just in the last week. You can see it there, and this is such an underused feature.

This is the science of AI coding. When you're sat there with an AI coding agent and you're telling it to do something, there's a feature in there called hooks, and this is what actually happens.

You tell it to implement a feature, it goes and edits some files, and then your hook kicks in, which says: actually, you have to do this after you've edited a file. There are no ifs or buts about it.

There are no vibes. You're not asking it nicely, as a skill or a "please". It has to do it, because it's part of the feature set. And what it does is go and run something.

And here, for TypeScript, it's the compiler. It will go off and run that. And if any errors are found by that hook, which has to run every single time, they come back to the coding agent, it has to fix them, and the hook fires again. This is deterministic. There's no choice about it.

It's finished editing a file, so it has to do this, right? And it will keep doing it until it gets it right.
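For the curious, a hook like that is just configuration. Here's a minimal sketch of the shape it takes in Claude Code's settings file (the matcher and command are illustrative; check the hooks docs for the exact schema your version expects):

```jsonc
// .claude/settings.json (sketch): after every file edit or write,
// run the TypeScript compiler. If the command fails, the compiler
// errors are fed straight back to the agent, which must fix them.
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx tsc --noEmit" }
        ]
      }
    ]
  }
}
```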

This is such an underused feature, and we should all be using it. Again: it's checking its own work. It's making sure it compiles. Just to go over this really quickly.

Types and Constraints: Making Errors Explicit

I know some people here aren't engineers, but this is fundamental. Why are types so important? Okay, so this is TypeScript,

and over here, as you can see, the agent edits some code and the hook fires, just like we talked about before. So this is TypeScript, and if you look over here, we're saying that this function, whatever it might be, is going to return a number. We can call it with a number, and that's absolutely fine. If we don't call it with a number, we get an error. And this is what I mean: if a coding agent did this, the hook would fire, it would run the TypeScript compiler, and this is what the AI agent would see. Once it sees that, it goes and corrects its own work.
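Here's a tiny standalone sketch of that idea (the function name is mine, not the slide's):

```typescript
// The contract lives in the signature: number in, number out.
function double(x: number): number {
  return x * 2;
}

double(21);      // fine: called with a number
// double("21"); // tsc error: Argument of type 'string' is not
//               // assignable to parameter of type 'number'.
```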

So we come back to that, and we talk about errors. Errors are so important here, because they're going to keep coming up. So here's an example, and you can do this in any language, it doesn't matter.

We're trying to do something, and we have what I call a magic catch block. If we do this, it's okay, and loads of people write code like this, but you're going to make mistakes, and this is exactly the low-level AI code that you just don't want to have.
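A sketch of what I mean by a magic catch block (a deliberately bad, hypothetical function):

```typescript
// The "magic catch block": every failure, whatever it was, silently
// becomes a default value. The caller can never tell what went wrong.
function parseAgeUnsafe(input: string): number {
  try {
    return JSON.parse(input).age;
  } catch {
    return 0; // bad JSON, a missing field, and a typo all look identical
  }
}
```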

So you introduce a new concept, and these are all building blocks. So far, like I said, we said we wanted something to be a string, it returned back a number, and the compiler said no. The next thing we introduce is something like a basic result type.

So here, what we want to do is take in some string, we don't know what it is, and try to parse the age; we want to convert it over to an age. And we can say: hey, parse the age, and it's either going to return a number or an error. When we hover over this, we know that the result is either going to be a number or it's going to be an error. Basic steps.
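A minimal sketch of that result type, assuming the shape on the slide is roughly this:

```typescript
// A basic result type: the signature itself says the call can fail,
// and the compiler forces the caller to handle both cases.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

function parseAge(input: string): Result<number, Error> {
  const age = Number(input);
  return Number.isNaN(age)
    ? { ok: false, error: new Error(`"${input}" is not a number`) }
    : { ok: true, value: age };
}
```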

But our functions aren't simple like this; they get more complicated as we go along. And the rules have changed now, because now we're checking ages and we want to check whether someone is too young. So we create these errors here, and now, when we validate the age, we know exactly what could go wrong with it. We're still going to get a number back, and if it's a number, that's fantastic.

But what else could come out of it? Well, it might not actually be a number, we might not be able to read it, or the person might be too young, right? So now there's no guessing; we know exactly what's going to come out of it.

It's not just "I had an error, and I don't know what the error might be"; it has to be one of these errors. And so it continues and continues and continues.
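Sketching that out (the error names here are illustrative):

```typescript
// As the rules grow, the error channel grows with them: the signature
// now lists every way validation can fail, so there's no guessing.
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };

type NotANumber = { _tag: "NotANumber"; input: string };
type TooYoung = { _tag: "TooYoung"; age: number };

function validateAge(input: string): Result<number, NotANumber | TooYoung> {
  const age = Number(input);
  if (Number.isNaN(age)) return { ok: false, error: { _tag: "NotANumber", input } };
  if (age < 18) return { ok: false, error: { _tag: "TooYoung", age } };
  return { ok: true, value: age };
}
```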

And you might be asking why I'm telling you this, right? What is the magic of these constraints, this bit of friction, that we want AI to write code with?

Static Analysis: Truthful Diagrams and Reviewable Flows

Static analysis. Static analysis is amazing.

Anybody who's done any coding and has used an OpenAPI spec knows that when specs are generated from code, we can 100% guarantee they're correct. There's no guesswork about it.

We're not going to hand an AI a load of endpoints and say, can you create a document I can give to my clients? No.

We generate this stuff from code, because we know it's going to be correct. The same applies to what I've just told you.
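As a sketch of the principle, using zod and zod-to-json-schema as stand-ins for whatever schema-first stack you prefer:

```typescript
// The published schema is derived from the exact schema the code
// validates with at runtime, so the two can never drift apart.
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

const TransferRequest = z.object({
  iban: z.string(),
  amountGBP: z.number().positive(),
});

// The same object validates real requests at runtime...
TransferRequest.parse({ iban: "GB33BUKB20201555555555", amountGBP: 250 });

// ...and generates the schema that goes into the OpenAPI document.
const publishedSchema = zodToJsonSchema(TransferRequest, "TransferRequest");
```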

Just last week, Cloudflare published a post about how they're using static analysis in their workflow tooling. I wrote a tool called Effect Analyzer. Let's talk about why, and what it does.

Like I said, you have a thing where you're saying: we're going to validate something, and it's going to succeed or error. Because it's statically analysed, you can guarantee that if you run it once, it gives you the same result. If you run it 10 times, it gives you the same result.

If you run it a billion times, it's deterministic: it will draw the same diagram. Now ask an AI to draw the same diagram for your code. Who wants to bet on it staying the same over 10 runs? 50? Definitely not a thousand. It will draw a different diagram. And because it draws a different diagram, if you're trying to do code reviews or work out any diffs visually, you can't. You just can't rely on that foundation. It's not strong enough.
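To make the contrast concrete, here's a toy sketch (mine, not Effect Analyzer's actual output) of a diagram derived purely from statically known structure:

```typescript
// The diagram is a pure function of the pipeline definition, so it is
// byte-for-byte identical on run 1 and on run 1,000,000,000.
const steps = ["validate", "fetchRate", "getBalance", "convert", "transfer", "confirm"];

const mermaidSource = [
  "flowchart LR",
  ...steps.slice(1).map((step, i) => `  ${steps[i]} --> ${step}`),
].join("\n");

console.log(mermaidSource); // always the same Mermaid diagram source
```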

Static analysis gives you that foundation. As we go through this and the thing gets bigger, because it's trying to process payments, we've added a fetch-exchange-rate step. We can start seeing how this flow has evolved over time.

So now, if you're looking at this in a code review, because we might not be reading code anymore, we're seeing diagrams, it instantly becomes comfortable. It's like walking into a building and seeing a map. We know where we're going. The cognitive load is lower. Now we're able to reason about what's happening here.

So this is fantastic, and it goes on and on, and it gets complicated, but it gets complicated because the actual logic is complicated, and it stays canonical the whole time. There's no lying about it: if someone goes and takes out one of those steps, you can see it in a code review. It's so easy to see. You're not reading lines of code, you're reading diagrams. Static analysis is an absolute game-changer. Again: strong foundations here, deterministic foundations.

Bonus: Enforce Team Conventions with Custom Lint Rules

Bonus tip, right?

So we've talked about types. We've talked about static analysis.

You can make your own lint rules too, and lint rules have to run; remember the hooks we talked about before.

So if there's a particular pattern you want code written in, or a particular thing that you like, variable names long or short, whatever it is, you can write lint rules for it.

Lint rules have to run every time an AI coding agent has finished touching any code.
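As a sketch, here's what a small custom rule looks like with ESLint's standard rule API, banning the magic catch block from earlier:

```typescript
// Custom ESLint rule (sketch): report any catch block with an empty
// body, so errors can't be silently swallowed anywhere in the codebase.
import type { Rule } from "eslint";

export const noEmptyCatch: Rule.RuleModule = {
  meta: {
    type: "problem",
    messages: { empty: "Don't swallow errors silently; handle or rethrow them." },
    schema: [],
  },
  create(context) {
    return {
      CatchClause(node) {
        if (node.body.body.length === 0) {
          context.report({ node, messageId: "empty" });
        }
      },
    };
  },
};
```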

And it doesn't even have to be code. You could say: write this document, and if you find em dashes in it, write it again. That's the kind of thing that hooks do.

There's nothing stopping you here. It's free. It's built into most of these frameworks.

Testing and Requirements: Getting to “Living Documentation”

So: we've got type-safe code. How do we know it works?

Tests are definitely not documentation. And you can write tests, loads of them. 100% coverage is a lie.

You'd really want to be doing mutation testing at this point.

But the major problem is: how do we link tests back to the actual requirements, right?

So, not a new problem.

Why Docs Drift (and Why Engineers Don’t Maintain Them)

Marker payments at Cambridge Assessment. I worked at Cambridge Assessment on a marker payment solution: how we paid markers for the exam scripts they marked.

I went away for a year, came back to the project, and there was a bug. The bug was a nightmare.

The documents on Confluence were no longer kept up to date, and the code was different. And so we were confused, right?

We were like, oh, my God, this is ridiculous. The code's there.

As engineers, we're lazy. We don't go and update documentation. It's the last thing we want to do.

There were huge issues here. But that's fine: we have a solution.

Cucumber’s Promise—and Why It Often Fails in Practice

That solution is called Cucumber, which came along to try to give us living documentation. Here's how it works: you have some sort of feature file where you say, here's a spec of how I want this to work, and then somebody else goes and codes up that feature file, right? I've tried to do this about

two or three times, because it's an absolute game-changer. It makes things accessible for sitting around in a meeting discussing it: business stakeholders, testers, the whole lot. This is making the code accessible to all. And here's the kind of thing you get when it runs: here's a green tick, here's how it works, and you know that it's working.

Brilliant. Okay, so you've got to ask yourself: if it's so good, why isn't anybody using it? Or, the big question here: every time I've tried to use it in big companies, with so many clever people, why have so many people given up on it? It's because of how the Cucumber suite is built.

You end up keeping two systems in sync, like I showed you above. The things you've got to keep in sync are this feature file, which lives over here, and this steps file, which lives somewhere else.

AI coding agents are okay. Sometimes they're good, sometimes they're bad. Asking them to keep two things up to date is just asking for trouble.

One Source of Truth: “Stories” Embedded Directly in Tests

So I wrote a library called Stories. It works across Go, Rust, TypeScript, JavaScript, Python, Java and .NET.

To implement it, you're in your tests anyway. Remember what I talked about: tests are deterministic. They compile.

You know they're correct. You can't go wrong with it. We just take this test here.

That one hasn't got it; this one has. And it hasn't really changed that much.

All we've done is take a normal test, which we're writing in our IDEs anyway, and swap those arrange/act/assert comments for a bit of actual prose. Not too much headache. And you get this, right?

This is the result of running that, and it works, right? And this is the thing I'm talking about: the actual test code itself. You're not maintaining two things, you're maintaining one. One file has generated this document.
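To give a flavour, here's a rough sketch of the shape (my naming, not the library's actual API), written as an ordinary Vitest test:

```typescript
// The arrange/act/assert comments become the prose the living document
// is generated from, so the test *is* the documentation.
import { describe, it, expect } from "vitest";
import { validateAge } from "./validate-age"; // the function from earlier

describe("Story: signing up for an account", () => {
  it("turns away applicants who are too young", () => {
    // Given a visitor who types their age into the sign-up form
    const input = "15";
    // When we validate it
    const result = validateAge(input);
    // Then they are rejected with a specific, typed reason
    expect(result).toEqual({ ok: false, error: { _tag: "TooYoung", age: 15 } });
  });
});
```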

And it works with AI, right? So here's how you'll see this. This is the exact file, and what I've told the AI to do is go and write a thing where... let me start that again.

I've told it to go and write this file, to start with a failing test, and then to write it in the persona of Batman. Off it goes.

It's gone off in there, and you can see what's happening. The first thing it does is read the file, and off it goes. It's going to write a failing test first, and you'll be able to see that happen in a second. That's what you get with a failing test: it tells you exactly where it's failed, and it's told you there's one test failing. It's run it, it's confirmed it over there, and now what it's going to do is write the Batman calculator tests, and off it goes. Because it knows the language to write it in, it just writes it. I've done this with the personas of product owners or stakeholders: do I want the document to be more technical, do I want it lighter, whatever it is. There you go, it's finished. It's written this story: five tests about a calculator, as if it were Batman. Easy enough, right?

And I know it's working, because I can see that when tests are failing they'll be red, and when they're passing they'll be green. And it's all done in a single file.

And I will tell you, it's not just tests.

Using AI to Generate Real, Verifiable Learning Artifacts

So prompt injection. I wanted to learn it myself and teach it to others.

So I took the post that someone else had written about the lethal trifecta, and I got it to generate a whole living document.

Not just that, it drew the diagrams for me. There's not one bit of this that I actually touched. This was purely done by Claude Code. It wrote the diagrams for me.

Fair enough. Then I asked it: can you write an app to demo prompt injection? Because I'd love to see it. Absolutely. So here you go.

So it's gone: okay, I'll show you in an app. And remember, these have all got green ticks, so this must have happened, because otherwise they'd be red. And now it's taught me how prompt injection works, because it's written me a demo. I helped it decide what the UI should look like, but it's gone and done it.

It went deep in there, talked about the different models, and taught me about the things it found and everything else. It's taught me in a way that is absolutely amazing. It's taken all this content and actually made it real. I can interact with it, read it, and guarantee it works. It's broken the whole thing down for me in a way I wouldn't have been able to do otherwise.

From Specs to Understanding: Visualizing and Testing an API Quickly

So, one more thing. I work on a payments product where you talk to different payment providers. One of those is a company called Airwallex. Airwallex have an OpenAPI spec.

Back in the day, I think it would have taken me about two or three weeks to go through their APIs and try to understand them. Then: no, hang on a minute. Why don't I just give it to this tool?

Why doesn't the tool go and draw me exactly what's going on underneath, from the calls it's been making? Go and draw it; go and show me what the UI screens would look like when I embed this into my own system. And it's absolutely fantastic: off it goes. Not only that, but we call their APIs: give me 200 tests exercising their API. And it's told me exactly how it works and what's going on. This would have taken me forever to write, if it were possible at all. But look at this: it's done everything for me, and I know it's correct because it's given me the green ticks to show the tests have passed.

It's all done with tests, it's all visible, I can just read it. It's a game-changer.

Observability in Production: Proving What Actually Happened

So: the code is type-safe, it compiles, lint passes, the tests pass. How do we know it works in production? You don't.

OpenTelemetry Defaults That Are Safe for AI (Redaction, Sampling)

So, last December, I talked about OpenTelemetry. OpenTelemetry is super, super important. But I built a library on top of it. Why?

The official OpenTelemetry libraries are good, but not great. My library gives you defaults out of the box that are safe for AI, things like redaction and sampling, which AI can't do for you.

AI will happily log all your private data, and that's not great. So how do you use it?

Node.js, oh my god, has an issue here. Other languages don't.

Setting up OpenTelemetry in Node.js is loads of code. With my library, it's about three lines. That's it. Done.
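For contrast, here's roughly what the vanilla Node.js setup looks like with the standard upstream OpenTelemetry packages, before you've even thought about redaction or sampling:

```typescript
// Vanilla OpenTelemetry bootstrap for Node.js (abridged): an SDK,
// an exporter, and auto-instrumentation all wired up by hand.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  serviceName: "money-transfer-api",
  traceExporter: new OTLPTraceExporter({ url: "http://localhost:4318/v1/traces" }),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start(); // must run before any instrumented module is loaded
```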

And you get the redaction and everything else on top. Write once, observe everywhere.

Product Analytics as Facts: Planning from Real Customer Data

So one of the things that comes out of this, and you'll see it in the Linear blog posts and things like that, which we'll come to in a minute, is product analytics. Product analytics is super, super important. How do we bring that in?

So here we add it in, and we say we want a PostHog subscriber and we want a Slack subscriber. And there you go: the lines haven't changed.

We've just told it we want subscribers this time, and the way we use it, we just tell it to track an event: this event has happened. Brilliant.
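To ground that, here's what "track an event" boils down to, shown with posthog-node, PostHog's official SDK (the event name and properties are just examples):

```typescript
// One line at the call site: record the fact that a transfer happened,
// plus the properties you'll want to slice by later.
import { PostHog } from "posthog-node";

const posthog = new PostHog("phc_your_project_key", { host: "https://eu.posthog.com" });

posthog.capture({
  distinctId: "user-123",
  event: "money_transfer_completed",
  properties: { amountGBP: 250, provider: "Airwallex" },
});

await posthog.shutdown(); // flush queued events before the process exits
```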

Okay, so what does that result in? Well, this is the span. This is the OpenTelemetry you get for free, like you'd see in any other product. And we can see here, in this money transfer application, which I vibe-coded too, remember, I'm not opening my IDE at this point, I just want to know what's happened. And I now have the science of what's happened: it's told me exactly which calls were made. These are the steps we saw earlier in that little train-track diagram: validate, fetch rate, get balance, convert, transfer, confirm. This is the science. This has to have happened. There are no two ways about it.

And here it is calling Slack, easily enough. Brilliant.

Okay, so you might be saying: okay, hang on a minute, Jag, the tests could have lied. They could just have given us a green tick anyway. How do we actually know this is actually working?

Okay, so here's another demo; I'll just try to talk through it. Here, Claude Code is connected up to something called Jaeger, which is an observability tool. Imagine this is running in production.

I've asked it a question: go and use the MCP tool to find errors in the last five minutes. Claude Code is connected to this tool, it could be production, and off it goes.

It's not going to go anywhere until I hit play, because I've pre-paused it. Okay, so off it goes.

It's connected to this tool, and it comes back to me: I've had a look, I've seen a money transfer API, there are no errors found.

Now, I have a script which is going to cause a whole load of random bugs. So off it goes; it carries on and says: look, I found some more services in there, do you want me to check those?

Nope. So I ran my script, the one that's going to create a whole load of errors. And I can't spell "about", but that's fine, because I go back and correct it in a second. And I asked: what about now? So I let these errors happen.

I Can see the errors are there because this is Jager over here, right? I can see the errors did actually happen right in science science.

So I ask Claude Code: what about now? It's gone off, it's looked at this, and this time it's gone: whoa, whoa, I've found some errors. What should I do?

I'm like: well, can you find the issue? You've got the source code. You're linked up to my production OpenTelemetry tools. Can you go and fix it for me? Because we're relying on science, right?

It knows exactly what to do. It's looked at the spans and all the errors we've got. Look at the detail coming out of it. It ain't guessing. It is not guessing. It's got the facts it needs to make a better decision.

And this is what I'm talking about. These are the kinds of things we want to be relying on, right? These are the kinds of things that are scientific now.

So off it goes, and it did actually find the thing. I don't know where I stopped it, but it comes back to me, because I'd asked it: can you find the issue? And it found exactly where things had gone wrong.

I'd put a little thing in there: if you enter the IBAN as "boom", it throws an error, and that's exactly the thing we wanted it to find. Okay.

One more thing: product analytics. When we develop all these specs and plans and everything else, all magical and everything, what do we base them on? What we think we know, or should we base them on customer data?

See here, I'm using the PostHog MCP. Remember what I said: write once, observe everywhere. The OTel events flow straight into PostHog, no problem. And I asked it how many money transfers happened today, and because it can connect up to my product analytics, and these are all facts that actually happened, I'm able to make better decisions on it.

So here I have to give it permission to go and do that. It's gone off, it's looked at PostHog, and it's talked to it. And it said: I've seen some transfer events, do you want me to tell you about them?

I'm like: yeah, go ahead, you've got permission to do that. And it's told me exactly what I wanted. These are product analytics happening in production right now. I can make better plans based on that than on me just guessing.

So one more thing.

“Show Your Homework”: Traces as Evidence, Not Green Ticks

You might be saying to me: Jag, it's still guesswork. It's still guesswork at this point. And I'll say to you: fair enough.

What if we took those stories that got the green ticks and asked the AI to show its homework? To make sure it didn't just get a green tick, but to actually prove what happened: when you're telling me these things actually happened, can you go and show me the trace for them?

It can't lie. It had to make those calls. It can't have pulled those spans out of thin air, right? This is what actually happened behind the scenes of that test run.

Now imagine a test had failed. We'd have all the data right here to diagnose it. It's super easy. It just makes the guesswork go away.

So one more thing.

Legacy Code: A Safe, Reversed Order of Operations

What about existing code, you may be asking? It's completely backwards from what I've just told you.

There are books on this. Michael Feathers wrote Working Effectively with Legacy Code.

I wrote a step-by-step guide, because I work with other teams in the business who are dealing with legacy code. It's called something like "bringing order to the chaos without breaking anything", but basically you have to start with OpenTelemetry first, before you do a single thing, right? You have to do it all a bit backwards, but you still do it all with AI.

But you must do it in that order. It's all reversed.

Conclusion

That's my talk. Thank you very much.
