On The Different Engineering Requirements of Different AI Agents

Alrighty, this is exciting. I was actually presenting at MindStone exactly one year ago, and that talk kicked off this very business. Now, a year later, I'm coming back to discuss some of the challenges.

So my talk today is about how the different communication channels an agent operates on impact the engineering and its requirements. At Prospera Labs, we specifically make conversational agents; we're not making the kind of AI agent that goes off on its own and autonomously does a bunch of stuff.

We're laser-focused on agents that talk to a consumer and interact with some sort of back-end system, just to distinguish us from that more speculative, future-facing stuff.

So when I first started building AI agents, I really believed that everything was a conversation: ChatGPT can respond to messages, and all messages are the same. An email is a message. Me speaking to you right now is a message. An SMS is a message.

Begrudgingly, I learned that this is really just not the case. Each of the communication mediums a bot might exchange messages through functions quite differently, and each comes with different expectations for how our AI agents need to interact with it.

Take email: you have greetings, you have signatures, and you're also trying to conserve emails. You don't really want to send a whole barrage of emails back and forth the way you might exchange turns in a voice conversation.

So although you may have the same AI agent, and you may have a system that can respond, the requirements of these two communication mediums are really quite different.

So what are some of the differences that come up? Obviously, length. Emails are really long. SMS is shorter. Voice is even shorter than that.

Then there's the format of the emails. Response speed, which comes up a lot in trying to make an AI agent really work well. Conciseness. How do I explain this?

The number of back-and-forths, right? I don't know if you've ever tried to schedule a meeting with someone over email, but if you've done it a lot, you're falling over backwards to use as few emails as possible. It's kind of just, here's a time and here's a time, please choose the first one and accept the invite, and we're not exchanging any more emails. That's very different from, say, how a voice conversation might go, where you're back and forth: OK, 5 PM. No, 5 PM doesn't work. 6 PM, great. That's fine in voice; it's fine to have that back and forth. Whereas with email, you're really trying to get it all done in one place.

Another interesting piece is the available information. When a bot is reaching out to someone via, say, a phone number, there's a lot you can infer from that number: where they live, what language they likely speak, what time zone they're likely in, all very important things if you're trying to schedule a meeting, or really have any functional conversation at all. Whereas an email address doesn't actually tell you much about who the person is or what language they speak. So if a conversational agent has nothing but an email to go on, it needs to ask the user what time zone they're in before it goes ahead and schedules a meeting at 4 PM. So there are a lot of nitty-gritty differences between these channels.
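As a concrete illustration, here's roughly the kind of inference you can do from nothing but a phone number. This is a minimal sketch using the open-source phonenumbers library, not our production code, and the helper function wrapping it is ours.

```python
# Minimal sketch: inferring context from a phone number alone,
# using the open-source `phonenumbers` library (pip install phonenumbers).
import phonenumbers
from phonenumbers import geocoder, timezone

def context_from_phone(raw_number: str) -> dict:
    """Best-effort guesses about a contact, given only their phone number."""
    num = phonenumbers.parse(raw_number, None)  # expects an E.164-style "+..." number
    if not phonenumbers.is_valid_number(num):
        return {}
    return {
        "region": phonenumbers.region_code_for_number(num),      # e.g. "CA"
        "location": geocoder.description_for_number(num, "en"),   # e.g. "Toronto, ON"
        "time_zones": timezone.time_zones_for_number(num),        # e.g. ("America/Toronto",)
    }

# An email address gives you none of this, so the agent has to ask the user instead.
print(context_from_phone("+14165550123"))
```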

Here's a long list of differences that we've come up with over time.

Email is one of the most difficult because you get the least information about the user. If you're just conversing with them over email and you know nothing else, you have to ask the user a lot. And at the same time, you're trying to use as few emails as possible, so it's challenging.

With phone and SMS, there's a lot of information we can grab from the phone number itself: it conveys location and, most of the time, time zone. Same with the web browser: if you're doing web chat, you can pull all of that information from the browser.

But unlike with phone, you might not know the user's name, because you didn't have it in a database beforehand. Web chat is almost all inbound, while SMS and phone are more outbound, so a lot of differences come up.

Transcription errors in particular, this one down here, are a source of constant pain and issues. It's really easy to design an agent that works reliably if you can assume the user typed their phone number correctly or that their email address didn't have any typos. It's a lot more difficult to have an agent read things back to people, correct typos, and all of those things. So transcription errors really are a big source of differences.

Here's another example. When you're trying to come up with a really good scheduling flow, how an agent goes back and forth to get something booked varies completely based on the communication channel. For example, considering those transcription mistakes, if you're on a phone call you want the agent to confirm back: hey, is your email address this, G-E-N-I-X-P-R-O, or kitty1231945, whatever it may be. It confirms it.

Whereas that would feel quite silly in a web chat. If you paste in your email address and the agent says, hey, is that actually right?, you're thinking, well, obviously it's right, I pasted it in. There's not the same concern for errors in web chat that there is in a phone conversation.
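To make that concrete, here's a tiny, hypothetical helper in the spirit of what a voice flow needs; the function name and formatting choices are ours for illustration, not any library's API.

```python
# Hypothetical sketch: formatting an email address so a TTS voice can read it
# back character by character for confirmation on a phone call.
SPOKEN = {".": "dot", "@": "at", "-": "dash", "_": "underscore", "+": "plus"}

def spell_for_readback(email: str) -> str:
    """Turn 'kitty123@example.com' into 'k, i, t, t, y, 1, 2, 3, at, example, dot, com'-style text."""
    local, _, domain = email.partition("@")
    spelled_local = ", ".join(SPOKEN.get(ch, ch) for ch in local)
    # Domains are usually common words, so speak them whole and only call out the dots.
    spoken_domain = ", ".join(domain.replace(".", " dot ").split())
    return f"{spelled_local}, at, {spoken_domain}"

# Voice channel: "Just to check, is that {spell_for_readback(email)}?"
# Web chat: skip this step entirely; the pasted address is almost certainly correct.
```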

And again with email, although there are no transcription errors, you're trying hard to be concise, so you're often packing more into that initial email.

Whereas with voice, you might be using API calls, and there's more back and forth. With SMS, something came up recently.

So we get a lot of international clients. And often, the telecom charges in whatever country they're in are just off the charts. In America, it's cheap.

But if we're trying to send SMSes to Dubai, it's not. So in those cases, for SMS, we just send a booking link. It's easier than having the back and forth, and it saves money too.

And WhatsApp actually turns out to be the easiest channel for a natural conversation, where you just go back and forth, suggest some times, and book it.
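Pulling those behaviors together, here's a rough sketch of how channel-specific scheduling behavior can be expressed in code. The channel names and policy fields are illustrative assumptions, not our actual implementation.

```python
# Illustrative sketch: one agent, different scheduling behavior per channel.
from dataclasses import dataclass

@dataclass
class SchedulingPolicy:
    confirm_spelled_email: bool   # read contact details back character by character
    max_round_trips: int          # how many exchanges before falling back
    offer_booking_link: bool      # punt to a self-serve link instead of negotiating

POLICIES = {
    "voice":    SchedulingPolicy(confirm_spelled_email=True,  max_round_trips=10, offer_booking_link=False),
    "whatsapp": SchedulingPolicy(confirm_spelled_email=False, max_round_trips=6,  offer_booking_link=False),
    "email":    SchedulingPolicy(confirm_spelled_email=False, max_round_trips=2,  offer_booking_link=False),
    # International SMS is expensive per message, so skip the negotiation entirely.
    "sms_intl": SchedulingPolicy(confirm_spelled_email=False, max_round_trips=1,  offer_booking_link=True),
}

def scheduling_policy(channel: str) -> SchedulingPolicy:
    return POLICIES[channel]
```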

So that's the AI agent behavior. The behavior really has to change between these different channels. Same with the underlying engineering. Deep in the guts of an AI agent system, these different communication channels have very different requirements.

Most notably, voice is the most different, in my experience. You really can't have any third-party API calls. I know people try to use these vector stores, where you go out and query information on every single message. That's how you end up with an agent that takes eight seconds to respond and isn't conversational.

When it comes to voice, you almost can't have any network-based requests in the flow at all. Everything has to be preloaded in memory. The response has to be extremely fast, and the LLM itself already takes up so much of the time budget that adding a round trip for a vector match against 1,000 entries or so is just too much. It's not feasible.

Email. What a nightmare email is. I don't know if any of you have had to write code for email in general. It's a nightmare.

There's so many problems and issues. Getting emails actually into the inbox is a whole can of worms. You would think it's easy. You just send the email. It shows up in someone's inbox. Not so.

There are layers and layers of algorithms between you and that user's inbox, and those algorithms, for the most part, do not want your email to reach it. They're filtering 90, 95, 97 percent of emails. Only a tiny fraction of all the emails that are sent actually lands in the inbox.

Lastly, with web chat, we've found that reliable delivery is quite challenging. It's probably the biggest engineering hurdle we've overcome. You have to combine different techniques to ensure that your message gets there quickly and also reliably when there's a network disruption, or when people switch between Wi-Fi and LTE, which is a common place for connections to break. Your web chat really needs to handle that in a totally reliable way.

So each of these channels has its own deep engineering requirements, and that's why a lot of startups, frankly probably the smarter, wiser startups, choose one channel, because it's way easier. It's way easier to get an agent working just on web chat, which is where most people go, than it is to get an agent that can communicate naturally on all of these different mediums at once.

This is just an example I whipped up before the show to get a screenshot that would make sense. In reality it's quite a bit more complex than this; there are a lot of tweaks across these different channels to keep things feeling natural.

So, lessons from voice. To say it again: for voice, you really can't have many network requests or database requests at all. Your system is pretty much taking data in, going straight to an LLM, and feeding the result right out as audio. That's the only way to get the response speed people expect, so your engine needs to be built around that response speed.

You're not making requests up to Pinecone after every single message. It's just not reasonable. It's not fast.
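As a sketch of the alternative, and this is an assumption-laden illustration rather than our actual engine, you can embed the knowledge base once when the call starts and do the similarity match in local memory, so the only network hop per turn is the LLM call itself. The `embed` callable here stands in for whatever embedding model you already use.

```python
# Sketch: preload embeddings at call start, then match in memory on every turn.
import numpy as np

class InMemoryKnowledge:
    def __init__(self, entries: list[str], embed):
        self.entries = entries
        self.embed = embed
        # One batch of embedding calls at call setup, before the user is even connected.
        vectors = np.array([embed(text) for text in entries], dtype=np.float32)
        self.vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

    def top_k(self, query: str, k: int = 3) -> list[str]:
        """Cosine similarity against ~1,000 entries is a trivial amount of in-memory numpy work."""
        q = np.asarray(self.embed(query), dtype=np.float32)
        q = q / np.linalg.norm(q)
        scores = self.vectors @ q
        best = np.argsort(scores)[::-1][:k]
        return [self.entries[i] for i in best]

# Per turn: kb.top_k(user_utterance) -> prompt context -> LLM -> TTS.
# No vector-store round trip sits inside the latency-critical path.
```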

Another neat trick we learned: background audio. If you introduce it, it hides the fact that responses are slow. So if you're having a problem with slow responses, just throw in some background audio. Honestly, it shifts people's perception by about a second and a half, which is meaningful when you're measuring everything in milliseconds; a second and a half of perceived latency is quite a bit.

The other big thing is that how you handle interruptions matters a lot in voice, and you should only include words in the conversation history that the bot actually said. So if you have any interruption handling in your system, be careful with something like OpenAI's newer Responses API, where you just chain off the previous message. It's a very convenient API to use, and other providers have similar things.

But that actually makes it a lot more difficult to do things like rewriting the conversation history. With OpenAI's previous, now more or less defunct agent API, there would be no way to do this: if the agent said X, that's permanently recorded in the conversation history. Whereas if you use the older completions-style API, you can rewrite things to remove words. So if the AI didn't actually say the words out loud because it was interrupted, you don't include them in the conversation history. Leaving them in causes a lot of misunderstandings in those flows.
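Here's a rough sketch of that idea, assuming a completions-style message list that you fully control and a TTS layer that can report how much of the reply was actually spoken before the interruption; the helper name is ours.

```python
# Sketch: only commit the words the bot actually got to say before being interrupted.
def commit_assistant_turn(history: list[dict], full_reply: str, words_spoken: int) -> None:
    """Append only the portion of the reply that was audible to the caller."""
    spoken_text = " ".join(full_reply.split()[:words_spoken]).strip()
    if spoken_text:
        history.append({"role": "assistant", "content": spoken_text})
    # If the bot was cut off before saying anything, record nothing at all;
    # otherwise the model believes it already gave information the caller never heard.

history = [{"role": "user", "content": "Can you do Thursday?"}]
full_reply = "Sure, Thursday works. I have 3 PM or 5 PM available, which do you prefer?"
commit_assistant_turn(history, full_reply, words_spoken=3)  # caller interrupted after "Sure, Thursday works."
# history now ends with: {"role": "assistant", "content": "Sure, Thursday works."}
```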

Email.

One lesson we learned from email is to make sure you use a robust parsing library. Get one that can handle encrypted messages, for example if people are using public-key encryption on their emails. You want something thorough that can handle all the different types and varieties of email out there.
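For illustration, and only as one option rather than a recommendation of a specific library, Python's standard-library parser with the modern policy already copes with a lot of MIME variety; signed or encrypted mail (S/MIME, PGP) shows up as extra multipart structures that you have to handle on top of this.

```python
# Sketch: inbound email parsing with the Python standard library.
# Encrypted or signed mail (multipart/encrypted, multipart/signed) needs extra
# handling beyond this; the point is to lean on a real parser, not regexes.
from email import policy
from email.parser import BytesParser

def parse_inbound(raw_bytes: bytes) -> dict:
    msg = BytesParser(policy=policy.default).parsebytes(raw_bytes)
    body = msg.get_body(preferencelist=("plain", "html"))
    return {
        "from": msg["From"],
        "to": msg["To"],
        "cc": msg["Cc"],
        "subject": msg["Subject"],
        # Includes the quoted thread below the new text, which we deliberately keep.
        "text": body.get_content() if body else "",
        "attachments": [part.get_filename() for part in msg.iter_attachments()],
    }
```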

We also found that a lot of people like to CC the agent into an existing conversation. Once we put our email bot in the wild, that became almost the main way people used it. They're talking to someone, and then they decide, OK, I want to bring the agent into this conversation, so they just add it on CC.

And it actually works surprisingly well: if you take that first message, decode the whole quoted conversation history, and feed it into the agent, the agent will just get it. It's caught up, it understands it's being brought into an existing conversation, and you honestly don't need much system prompting for it. It works pretty well.
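A minimal sketch of that handoff, building on the parse_inbound helper above; the prompt wording is illustrative, not our actual system prompt.

```python
# Sketch: when the agent is CC'd mid-thread, hand the whole quoted history
# to the model as the first user turn and let it catch itself up.
def messages_for_ccd_thread(parsed: dict) -> list[dict]:
    return [
        {
            "role": "system",
            "content": "You are an email assistant who has just been CC'd into an "
                       "ongoing conversation. Read the thread and continue it helpfully.",
        },
        {
            "role": "user",
            "content": f"Subject: {parsed['subject']}\n\n{parsed['text']}",
        },
    ]
```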

But where was I going with that? Yes: deliverability is a big issue as well.

This has been an issue with email for, I don't know, two decades, maybe three decades. I don't know if you've ever done email marketing or tried to write code for email marketing. Email sucks, and deliverability is really important.

And actually, if you use the exact same domain name your company already sends business email from, that improves deliverability a lot. Rather than having your AI agent send from its own separate domain, reuse your existing company domain and send through the same mail server; your deliverability goes up something like 10x, and you can almost always get your email into the inbox.

Last, for web chat: a big lesson we learned is that making your web chat interface look sexy is really important.

As an engineer, I tend to think very functionally, and I'm partially colorblind, so I'm not a good person to reference for style. But as it turns out, people pay a lot of attention to all the little colors in your web chat user interface. They look at it closely.

And I find clients in particular want to customize it. They'll change the chat bubbles to their company brand color and whatnot. So that's really important.

Deliverability is super important. People notice immediately if your bot normally responds in one second and then suddenly takes 10 or 30 seconds, or worse. It's a really bad user experience.

I get complaints almost immediately whenever our real-time server goes down. So we combine multiple delivery mechanisms to ensure the message always gets to the web browser within about seven seconds. You can't rely on one piece of infrastructure to do that. I wish you could, but in practice, having both real-time messaging and database polling improves your reliability a lot.
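Here's a server-side sketch of that dual path, with the store and push transport left abstract; the class and method names are assumptions for illustration, not a particular framework's API.

```python
# Sketch: write every outbound message to a durable store first (so the client's
# periodic poll will always find it), then attempt a best-effort real-time push.
# `store` and `push` are placeholders for whatever infrastructure you already run.
import asyncio

class WebChatDelivery:
    def __init__(self, store, push, poll_interval_s: float = 5.0):
        self.store = store            # durable, e.g. your database
        self.push = push              # best-effort, e.g. a WebSocket connection
        self.poll_interval_s = poll_interval_s

    async def deliver(self, conversation_id: str, message: dict) -> None:
        # 1) Durable write: polling guarantees delivery within roughly one poll interval.
        await self.store.append(conversation_id, message)
        # 2) Fast path: push immediately; a dropped socket just falls back to polling.
        try:
            await asyncio.wait_for(self.push.send(conversation_id, message), timeout=2.0)
        except (asyncio.TimeoutError, ConnectionError):
            pass  # the poller will pick it up; no message is ever lost

    async def poll(self, conversation_id: str, since_id: str) -> list[dict]:
        # The client calls this on a timer, and on reconnect after Wi-Fi/LTE switches.
        return await self.store.messages_after(conversation_id, since_id)
```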

And lastly, we've experimented a lot with the ideas from those old 2018 chatbots. When people realized the AI piece didn't work, they started putting buttons in the chatbot interface. You remember all those chatbots? They weren't really chatbots, they were button-based chatbots, and those chatbots suck.

But there are cases where bringing a traditional user interface into a chat widget can improve the experience; it's just something to use very sparingly. We found that, for the most part, it only helps when you're presenting tabular data, like choosing from a big list of options. Having a non-chat UI can help in those cases, but for the most part, the web chat widget should stick to chat.

That's what users expect. They don't want traditional user interfaces popping up in their chat window. We thought they would, and we tried to amaze people, but when we presented those user interfaces, people just didn't like them.

Conclusion

We believe that multi-channel AI agents are really where it's at. A lot of people are designing systems where you have to enter the agent's world: you log into an interface, you open an app, and you enter its world to converse with it and give it instructions.

But I think where the future is really going is that AI agents are going to enter our world. They're going to communicate the way we do, and they're going to use the same communication mediums. And that means they also have to adapt to those different mediums the same way we do.
