Harnessing Long-Running Agents: Stop Babysitting, Deploy While You're Sleeping

Introduction

So, my name is Rosario Moscato, as Omar said, and in my association we deal with AI from a double perspective. On one side, we try to spread knowledge, to educate people, because we would really like people to use AI in the best way. On the other side, we are involved in coding, developing, et cetera.

Why We’re Still “Babysitting” AI Coding Agents

So tonight I would like to talk to you about something quite interesting: the idea is to stop babysitting AI agents. I have been dealing for quite a long time with AI agents that are supposed to write code on our behalf.

And the dream is to stop babysitting AI agents and let them deploy software applications while we are doing something else, for example while we are sleeping. This is the dream. If you want to dive deeper into this topic, there is a very fresh paper that you can download at this link here. So we are talking about harnessing long-running agents, about how to orchestrate long-running agents. I believe that the topic is very well known.

From “Vibe Coding” to Real-World Complexity

A little more than a year ago, in February 2025, Andrej Karpathy was the first one to use this expression: vibe coding.

So now everybody knows it. He said there is a new kind of coding where we can even forget that the code exists. We talk with an LLM, for example Composer, so maybe he was using Cursor, and the LLM writes the code for us.

Even if we have bugs, the LLM just randomly tries to fix the bugs, and sooner or later everything will be okay.

And the point is that this kind of approach is really good for, let's say, quite simple, quite small software. If the application we want to build is complex, we have a lot of trouble, because we have to deal with fault tolerance, with the operational state, and with the cognitive coherence of the model.

Context Window Limits: Amnesia, False Execution, and Summarization Drift

So, in a few words, when we want to write complex software, all the problems we have are, more or less, related to the context window. When the context windows of the LLMs, of the agents, fill up, we really have a lot of problems and disasters.

For example, we have AI amnesia. The agents start forgetting system prompts and foundational design constraints, and the software is a disaster.

Then we have false execution: the agent claims in the chat to have executed commands, but we actually discover that the interface remains unchanged.

And finally, we have another big problem, the so-called summarization drift. When the context window fills up, the agent starts compacting the memory, and this leads to losing vital architectural context.

Okay, so the vibe coding approach is good, but for simple software. The point is this: we start from requirements, we start with a prompt, we want to make an application. When the prompt is something bigger, because the application is complex, the problem is split into isolated phases, into microtasks. And the orchestration of these microtasks can be a nightmare.

It's like having several workers working for us in shifts, completely disconnected from each other. One finishes his task and leaves the floor, another one enters, and the new one has zero context.

So he really doesn't know what happened previously. And this, talking about software, can introduce duplications in the code and bugs in parts of the code that were previously working well.

In the end, the result is that we have to babysit the agent. We started with the idea that the agent can write software for us. In the end, it's the opposite.

We start babysitting the agent. We double check. We oversee. We tell it how to fix problems.

Now, especially in the open source world, there are several approaches to building harnesses, to building open orchestrators for this kind of agent. The real problem is that many of these tools, these frameworks, approach the problem from different points of view.

The Goal: Enduring, Long-Running Agents

So what we did was, let me say, to make a Frankenstein. We put several tools together in order to manage the problem. And what we would like is not faster AI. It's enduring AI.

So the idea is to have long-running agents, agents that can rely on a self-orchestrating harness and on a separate session manager that help them operate autonomously for hours or even for days.

The basic idea is to transform scattered generation into a very structured, continuous mechanical loop that can work, in principle, indefinitely, beyond standard token limits, beyond the context window.
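
A minimal sketch of such a loop, under stated assumptions: the talk does not show the harness code, so `Store`, `run_session`, and `harness_loop` are hypothetical names, and `run_session` is only a placeholder for a real agent session. The point it illustrates is that every session starts with a fresh context, while the state lives outside the context.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Store:
    """Stand-in for persistent memory; a real harness would keep this on disk."""
    pending: list
    done: dict = field(default_factory=dict)


def run_session(task: str) -> str:
    """Hypothetical placeholder for one agent session working on one task.

    The context is created here and discarded on return, so no session
    inherits a half-full window from an earlier one.
    """
    context = [f"Task: {task}"]  # fresh context, built from scratch
    context.append("...agent work would happen here...")
    return f"completed '{task}' with {len(context)} context entries"


def harness_loop(store: Store, max_sessions: Optional[int] = None) -> int:
    """Run one session per task until nothing is pending or the budget is hit."""
    sessions = 0
    while store.pending and (max_sessions is None or sessions < max_sessions):
        task = store.pending.pop(0)            # read the next task from the store
        store.done[task] = run_session(task)   # write the result back to the store
        sessions += 1
    return sessions


store = Store(pending=["import notes", "build embeddings", "chat UI"])
print(harness_loop(store))  # number of sessions that ran
print(store.done)
```

Because the loop only depends on what is in the store, it can be stopped and restarted without losing track of the work.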

A Structured Loop Built to Outrun Token Limits

So, to get straight to the point, our approach is based on four pillars, quite standard pillars. The first one, the first phase, is the master stencil phase.

We define the prerequisites, the technical stack, and the core logic. Every time, we start with a prompt.

After that, in the second phase, we have an initializer agent. This agent takes the whole app spec, sets up the project baseline, and splits the spec into really micro, granular features.

Once we have the project and the micro features, we move to the memory.

The main problem with memory is that, if we approach it the traditional way, with the context window, it breaks, OK?

So we save everything in a persistent memory, for example a SQLite database. In this way, we have a central database that can track all the features, all the states, and all the tasks.
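
As a rough illustration of such a central database, here is a sketch using Python's standard `sqlite3` module. The table layout, the status names, and the helper functions (`open_memory`, `add_feature`, `claim_next`, `complete`, `counts`) are assumptions made for this example, not the schema of the actual tool.

```python
import sqlite3

# Hypothetical schema: one row per micro feature, with its current state.
SCHEMA = """
CREATE TABLE IF NOT EXISTS features (
    id          INTEGER PRIMARY KEY,
    description TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'pending'
                CHECK (status IN ('pending', 'in_progress', 'done')),
    result      TEXT
)
"""


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store; pass a file path instead of ':memory:' to persist it."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_feature(conn, description):
    cur = conn.execute(
        "INSERT INTO features (description) VALUES (?)", (description,)
    )
    conn.commit()
    return cur.lastrowid


def claim_next(conn):
    """Mark the oldest pending feature as in progress; return (id, description).

    Note: this simple version is not safe for several processes claiming at once.
    """
    row = conn.execute(
        "SELECT id, description FROM features "
        "WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    conn.execute(
        "UPDATE features SET status = 'in_progress' WHERE id = ?", (row[0],)
    )
    conn.commit()
    return row


def complete(conn, feature_id, result):
    conn.execute(
        "UPDATE features SET status = 'done', result = ? WHERE id = ?",
        (result, feature_id),
    )
    conn.commit()


def counts(conn):
    """Return a {status: number of features} summary."""
    rows = conn.execute("SELECT status, COUNT(*) FROM features GROUP BY status")
    return dict(rows.fetchall())


conn = open_memory()
for text in ("import notes", "create embeddings", "chat interface"):
    add_feature(conn, text)
task = claim_next(conn)
complete(conn, task[0], "tests passed")
print(counts(conn))
```

Any agent instance that opens the same database file sees the same features and states, which is what lets a new session pick up where the previous one stopped.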

After that comes the last phase, the coding agents. Finally, we can fire up some independent instances that pull single tasks from the memory, implement the code, test what they're doing, and save the result back in the memory.

So even if this approach is quite simple, it works, because it overcomes the problems I told you about at the very beginning: the fault tolerance, the operational state (everything is saved), and the cognitive coherence.

Two “Secret Weapons”: Visual QA and Regression Alignment

On top of this, we added another couple of, let's say, secret weapons. The first one is visual quality control.

So, as I told you, each microtask comes together with its test cases. So we use a test-driven development approach.

And the agent can really open a local browser window, interact with the application in real time, and look for any kind of issue, for example styling issues, a broken UI, et cetera. And when required, it can loop back, fix the code, test again, and so on.

The second smart feature is the regression alignment protocol. Any time we pick a new feature among all the microtasks, we also pick another three features from, let's say, the backlog.

In this way, the agent can test the new feature alongside the old ones. And that's a good way to avoid any kind of side effect on what we did previously.
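
The regression alignment idea can be sketched as follows. The talk only says that three other features are picked together with the new one; how they are chosen is not stated, so the random sampling here is an assumption, and `build_test_batch` and `run_batch` are hypothetical helper names.

```python
import random


def build_test_batch(new_feature, completed, k=3, rng=None):
    """Return the new feature plus up to k earlier features to re-test.

    Assumption: the earlier features are sampled at random; the real
    protocol may use a different selection rule.
    """
    rng = rng or random.Random()
    candidates = [f for f in completed if f != new_feature]
    regression = rng.sample(candidates, min(k, len(candidates)))
    return [new_feature] + regression


def run_batch(batch, tests):
    """Run each feature's test callable; return the names of those that fail."""
    return [name for name in batch if not tests[name]()]


completed = ["import notes", "embeddings", "search", "dashboard"]
batch = build_test_batch("chat UI", completed, rng=random.Random(0))
tests = {name: (lambda: True) for name in completed + ["chat UI"]}
print(batch)
print(run_batch(batch, tests))  # an empty list means no regressions were found
```

If an old feature shows up in the failure list, the agent knows the new change caused a side effect before moving on to the next microtask.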

So I believe that I have really talked a lot about the theory, OK? So let me show you some pictures of what we are doing. As I told you, in the first phase... sorry, let me go back a little bit.

In the first phase, we start from the master stencil: the definition of prerequisites, et cetera.

Case Study: Building a Personal “Second Brain” App

Now, this example is about an application that I made recently. I take notes with a small piece of software named UpNote. It's something like Notion, but very, very small.

So I wanted to create my personal second brain, my personal NotebookLM, based on the notes coming from UpNote. So that was the prompt.

The system must import the notes, must create embeddings in order to have semantic search, and must provide a ChatGPT-style interface. In this way I can really talk to my notes, ask questions, and receive answers. All the answers should contain the sources, linkable sources. Moreover, I want this software to have a dashboard with the knowledge graph, the automatic classification of the topics, et cetera. So that was the prompt.

From Prompt to Plan: Defining the Stack and Splitting Work

This prompt was given to the agent. The agent asked several questions in order to define the technical stack, the rules, et cetera, and made a plan.

This is one of the interfaces of the tool. The plan was divided into 128 microtasks. These 128 microtasks were given to the orchestrator.

We call the orchestrator Maestro. So the orchestrator gets the microtasks and spawns task-specific subagents, and here we have the subagents.

Subagents are in charge of the concurrent execution of everything: UI, backend, logic, features.

In this way, we can move from a traditional software development that usually is linear to something a little bit different.

Because we have an LLM that orchestrates writing, testing, and shipping code simultaneously. And the orchestrator can give the orders sequentially, but also in parallel, because it is able to understand the dependencies inside the software, among the microtasks.
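
One way to picture dependency-aware dispatching is a topological ordering that groups microtasks into waves. This is only a sketch: the talk does not say how Maestro resolves dependencies, and `plan_waves` and the task names are hypothetical. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def plan_waves(dependencies):
    """Group tasks into waves; tasks in the same wave do not depend on each other.

    `dependencies` maps each task to the set of tasks it must wait for.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are done
        waves.append(ready)
        sorter.done(*ready)                 # unlocks the tasks that depend on them
    return waves


deps = {
    "db schema": set(),
    "landing page": set(),
    "import notes": {"db schema"},
    "embeddings": {"import notes"},
    "search API": {"embeddings"},
    "chat UI": {"search API"},
}
for number, wave in enumerate(plan_waves(deps), start=1):
    print(number, wave)
```

Each wave could be handed to several subagents at the same time, while the waves themselves run one after the other.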

Managing Microtasks: Kanban Workflow and Human Steering

All the microtasks are managed in a Kanban-style view. So this is another screenshot of the tool. We have the microtasks that are pending, in progress, and done.

For sure, we can double-click on each task, open the task, change it, skip it, cancel it. We can do everything on the task on the fly. And this way, we have a human steering feature.
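
The board and the human steering actions can be modeled as a small state machine. The pending, in progress, and done columns come from the talk; the `SKIPPED` and `CANCELLED` states, the transition table, and the `Board` class are assumptions made for this sketch.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    DONE = "done"
    SKIPPED = "skipped"      # assumed state for the "skip" action
    CANCELLED = "cancelled"  # assumed state for the "cancel" action


# Hypothetical rules: which moves are allowed from each column.
ALLOWED = {
    Status.PENDING: {Status.IN_PROGRESS, Status.SKIPPED, Status.CANCELLED},
    Status.IN_PROGRESS: {Status.DONE, Status.PENDING, Status.CANCELLED},
    Status.DONE: set(),
    Status.SKIPPED: {Status.PENDING},
    Status.CANCELLED: set(),
}


class Board:
    def __init__(self):
        self.tasks = {}  # task name -> Status

    def add(self, name):
        self.tasks[name] = Status.PENDING

    def move(self, name, new_status):
        """Move a task to another column, rejecting moves the rules forbid."""
        current = self.tasks[name]
        if new_status not in ALLOWED[current]:
            raise ValueError(f"cannot move {name!r} from {current.value} to {new_status.value}")
        self.tasks[name] = new_status

    def column(self, status):
        return [name for name, s in self.tasks.items() if s is status]


board = Board()
for name in ("import notes", "embeddings", "team pages"):
    board.add(name)
board.move("import notes", Status.IN_PROGRESS)  # an agent picks up a task
board.move("import notes", Status.DONE)
board.move("team pages", Status.CANCELLED)      # a human steers on the fly
print(board.column(Status.PENDING))
```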

And in this interface, there is also the classic chat bubble in the bottom corner. In this picture the chat bubble is open, so you can see it better. So we can interact with the agents during the software development if we want.

For example, in this case, I just asked in the chat if the design system had been saved in a folder, in a directory. And it replied: yes, I already saved it, with typography, hierarchy, and so on.

What the System Produced: The Final Application

Finally, when the activity is completed, so in this case at the end of the 128 microtasks, we can see the software. We can look at it at any time, but here we see the final software.

And what the tool made was this kind of software. So we have quite modern software: Next.js, React, with authentication and theme management, light and dark.

This is the landing page, the first part of the landing page. This is the second part of the landing page, with all the features. So we are advertising the features, which are the features we specified in the prompt: intelligent chat, knowledge graph, UpNote importer. Sorry, if I go back to the prompt, it is exactly what we asked for. Okay, here are all the features we required in the prompt.

As I told you, the software is quite good, quite modern, also in terms of look and feel: the light theme, the dark theme. The dark theme is quite good because there is a very good contrast between the colors. This is the landing page in dark. This is the part of the landing page with the functions of the software, in dark, and there is a very high contrast, with the glow effect, et cetera.

And after we log in, we enter the software. So this is the software, the menu. We have the dashboard. In the dashboard we have the total notes (we imported just seven notes), with some information on the chunks, the topic classification, as we required, and the knowledge graph, so the nodes, which are the notes, and the connections.

In the menu we have the search, because we asked for the search. And this search is quite interesting because, for sure, we can type whatever we want. We just searched for “intelligence” here, and we get the results in this way. So we get all the results, but the results are sorted by the proximity score. So the answer is not just keyword matching, it's a real semantic answer.

And finally, the last item in the menu is the chat, because in the prompt we asked for a ChatGPT-style chat. So we can chat with our notes, and every time we get a reply there is the source citation, which we can click on. So I believe that, more or less, that's the whole story.
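
The search sorted by proximity score can be sketched in a few lines. The talk does not say which metric the application uses, so cosine similarity is an assumption here, and the tiny hand-written vectors stand in for real embeddings; `cosine` and `search` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec, chunks, top_k=3):
    """Rank (source, vector) chunks by similarity to the query, best first.

    Each result keeps its source, so an answer can cite where it came from.
    """
    scored = [(cosine(query_vec, vec), source) for source, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings"; a real system would get these from a model.
chunks = [
    ("note-1: machine learning basics", [0.9, 0.1, 0.0]),
    ("note-2: pasta recipes", [0.0, 0.2, 0.9]),
    ("note-3: neural networks", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend this is the embedding of "intelligence"
for score, source in search(query, chunks, top_k=2):
    print(f"{score:.3f}  {source}")
```

A note can rank high even if it never contains the query word, which is the difference from keyword matching.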

Conclusion

That is the methodology we have tried to develop. To sum up, the points here are quite simple. We start from a prompt, the prompt is converted into a plan, and the plan is converted into microtasks. The agents autonomously manage the microtasks until they are completed.

This kind of work can last days, or even weeks if you want. Thank you so much.
