Build your First Voice Agent

Introduction

Hi, everyone. I'm Josh.

I wonder if we should just all stand up and sit down again. Just stretch our legs a little. Sorry.

Back in a former life, I...

Oh, if we can sit down again now.

In a former life, I taught junior high students, and they can usually only sit for about 15 minutes at a time before you need to have everyone pick a new table and move to do a new activity.

About the Speaker and Session Overview

I'm a developer. I develop agentic systems for small and medium-sized businesses, both with a group and privately.

And today I'm going to be talking about voice agents, a bit of a spread from completely introductory stuff to trying to address people who've already started to use voice agents in some way, and hopefully showing either some things that can be thought about or ways you can improve your agents further.

I guess as a show of hands, how many people have played with a voice agent in some capacity? Fantastic. Fantastic.

And how many people are using it in their actual business? OK, fantastic.

So, four parts here: what is a voice agent; model choices and the kind of decision points; deploying the agent, that's the build-your-own-agent part.

And then demos, we've got some funny demos to showcase at the end. Yeah.

What Is a Voice Agent?

So first and foremost, what is a voice agent? I've kind of fundamentally structured it as: you talk to the computer, it does some tasks, and you get audio back. So it's sort of the ChatGPT cycle, but with voice.

And obviously, you have all the modalities that you can wish for with that: via the phone, the computer, anything that has a mic and a speaker. But yeah, broadly, it's the same ChatGPT concept, but with voice.

And the question is, how would you then build that?

What's kind of new that's working now in 2025 that a few years ago we couldn't really build?

Two Approaches: Speech-to-Speech vs. Cascade

We have a lot of new models that have been trained, obviously, in the last few years, and there are two fundamental ways we'll look at for how you can build that.

One is you can use a sort of speech-to-speech model, a model that can take in audio and output audio natively.

We've all heard about large language models and how they take in text and spit out text tokens. And then with GPT-4o and other models as they came along, they started training and figuring out ways to encode images and audio into tokens, so that the same fundamental paradigm could be used to generate images, audio, and text.

And so we're seeing, especially in the last year, a huge improvement in the quality of the speech-to-speech models.

And usually with this, you would give some sort of direction, think of it kind of like a system prompt in ChatGPT, plus the user's voice, and it outputs voice.

As you can imagine, or maybe not, this is offloading a lot of responsibility onto one model, and when you're looking to build something in production you end up handing off a lot of control, because you're sort of trusting the model to do a lot of things.

By way of analogy, this is a single person you're entrusting with everything. All the failure points are at one person, so there's a huge amount of trust.

The second approach, which is used I think basically everywhere, or at least where businesses are using it, is the sort of cascade approach.

It feels like a worse approach, but in practice it's kind of what you see actually performing. And it's also a lot cheaper.

The Cascade Pipeline: Transcribe, Reason, Synthesize

The first step of the cascade approach is the sort of ear, the transcription that was being demonstrated in the previous talk. You have some sort of model that's powering WhisperFlow, that's powering all these new tools, and you have the voice coming in, and

you do that transcription layer that we saw Drummond using to power his computer, and you extract the text of what the user said. And then behind the scenes, you basically run the sort of ChatGPT loop: you pass the user's input text into the LLM and you get an LLM answer. And so all the mechanisms we're learning about for how to use LLMs, you can leverage with all of that.

And then the last stage is you take the LLM's answer text, pass it to a text-to-speech model, and generate the audio. So the user sees the same end-to-end experience, but behind the scenes you have a bit more control at each stage. The most important control you have is that middle layer, that LLM layer.

Because all the other paradigms, running evals, watching what the LLM is doing, controlling it, all the system prompts, you can use there. This also means you can end up mixing and matching different providers for your use case, based on whether you're optimizing for quality, speed, intelligence, all sorts of things. You have a lot more control.
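To make the shape of that cascade concrete, here's a minimal sketch of one conversational turn in Python. It isn't any particular provider's SDK; `transcribe`, `chat`, and `synthesize` are hypothetical stand-ins (with canned placeholder outputs) for whichever speech-to-text, LLM, and text-to-speech services you pick.

```python
# Minimal sketch of one turn of the cascade: STT -> LLM -> TTS.
# The three helpers are hypothetical stand-ins; swap in real providers.

SYSTEM_PROMPT = "You are a friendly voice assistant for a pub in Oxford."

def transcribe(audio_in: bytes) -> str:
    """The 'ears': speech-to-text. Placeholder returning a canned utterance."""
    return "are you open tonight?"

def chat(messages: list[dict]) -> str:
    """The 'brain': a normal chat-completion call. Placeholder reply."""
    return "Yes, we're open until 11 pm tonight. Can I book you a table?"

def synthesize(text: str) -> bytes:
    """The 'mouth': text-to-speech. Placeholder returning fake audio bytes."""
    return text.encode("utf-8")

def handle_turn(audio_in: bytes, history: list[dict]) -> bytes:
    user_text = transcribe(audio_in)                                        # stage 1
    history.append({"role": "user", "content": user_text})
    reply = chat([{"role": "system", "content": SYSTEM_PROMPT}] + history)  # stage 2
    history.append({"role": "assistant", "content": reply})
    return synthesize(reply)                                                # stage 3

if __name__ == "__main__":
    print(handle_turn(b"<caller audio>", history=[]))
```

The point is just where the seams are: each of the three stages can be swapped, logged, or evaluated independently.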

And as you can imagine, though, there's some information that's lost. So I feel like with LLMs, we're often trying to think in terms of these new models, and sometimes we're fitting them well to how humans think and experience the world, and sometimes we're projecting onto them things that we really shouldn't.

So with a speech-to-speech model, we're leveraging the fact that it can hear the intonation of the user's voice and use that as information and data. If someone speaks to me, I can hear if they're yelling at me, if they're whispering at me, if they're saying it in a kind or upset way, and the model can pick that up, so we're giving it that sort of power. The LLM loses that information: it gets a transcript, and a transcript is just a fraction of the information that's really being communicated by the user.

I worked on this with an interpreter project where we were looking at medical interpretation between a patient and a doctor, and there's a real problem if you're looking to communicate something very somber and the LLM only gets to see a transcript of that. And then if there's one word that's mistranscribed, suddenly it could sound wrong on the other end, and it's difficult to emulate that emotional tone.

That said, some of the modern text-to-speech models, not to get too much into the weeds just yet, are able to take voice samples or give a bit more emotional control, but it's not the same end-to-end kind of experience where the model gets to hear and respond to the user.

Tasks and Tool Use in Voice Agents

Okay, so we said there are basically three things: the voice coming in, the voice coming out, and these tasks being done.

Web search, writing a document, pulling in an email, marking a task as done, pushing your code to production by an LLM, whatever the task is, that also has to be done, and how is that handled? Fundamentally, the way to think of it is splitting that middle stage, the LLM. This is where all the power is.

The LLM is controlling everything, doing all of the tasks and things. It spits out all the tokens that will be synthesized into audio, but it also generates special tokens to do tool calling.

So you'll configure ahead of time that if it outputs these tokens, it should trigger this other part of your code. And the model outputs the tokens to do the tool call, and as a result it gets back a tool call response.

So it says something like, I would like to call this API to pull in a Google search result for pubs in the area, and it gets back a text response with the list of pubs in Oxford.

The next step is then to pass the list of pubs in Oxford, with the original user prompt, into the LLM, and the LLM also has it encoded that these were responded to by an internal tool.

And then it can respond, again, in that sort of native way we're used to with ChatGPT, where it feels like, oh, yeah, I looked it up, and there's this and that pub, and it acts like it knows, right? Because it's now got the tool call response with the original text, and so it can form an answer.
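As a rough sketch of that loop in code, here's what it can look like with the OpenAI-style chat-completions tools interface. The `search_pubs` function, its hard-coded pub list, and the prompts are made up for illustration; the important part is the two passes: the model emits a tool call, our code runs it, and the model then answers with the tool's result in context.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def search_pubs(area: str) -> str:
    # Hypothetical tool: in reality this would hit a search API or database.
    return json.dumps(["The Eagle and Child", "The Turf Tavern", "The Bear Inn"])

tools = [{
    "type": "function",
    "function": {
        "name": "search_pubs",
        "description": "Look up pubs in a given area.",
        "parameters": {
            "type": "object",
            "properties": {"area": {"type": "string"}},
            "required": ["area"],
        },
    },
}]

messages = [
    {"role": "system", "content": "You are a helpful voice assistant."},
    {"role": "user", "content": "What pubs are near me in Oxford?"},
]

# First pass: the model may emit tool-call tokens instead of a direct answer.
response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
msg = response.choices[0].message

if msg.tool_calls:
    messages.append(msg)  # keep the tool-call request in the history
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = search_pubs(**args)  # run our code, get the tool response
        messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    # Second pass: the model now answers as if it "knew" the pubs all along.
    response = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)

print(response.choices[0].message.content)
```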

And again, a lot of the magic here is all about concealing these intermediary layers. And for me, that's a theme of voice agents. It's a theme of ChatGPT generally,

but voice agents in particular, I think there's lots of magic that you can do to conceal all the intermediary layers.

Does this make sense? Any questions about tool calling? OK.

How much slower is the...? Good question. Yeah, so I think it all depends on the models. So it can be extremely slow or not so much.

But you're still talking within one or two seconds, versus maybe almost a half-second response.

And with users, you might, and I'll get to it later, but there's a problem with endpointing, when you're trying to figure out when someone's ending their turn, and the best of the voice-to-voice models can handle that sort of awkward moment where you're talking over each other and you need to backtrack a little bit.

Design Patterns for Reliable Voice Agents

So for those of you who are already using agents, I'm thinking about this theme of concealing the magic behind the scenes. I said earlier it's not really a good idea to offload too much responsibility to one model. I also think it's not a good idea to offload too much responsibility to one agent.

I'm going to blitz through these, but quickly go through a couple of structures that people are exploring for increasing the accuracy and reliability of your voice agent.

Let's say I call up my phone provider and I want to change my address. You can tell an LLM: make sure you get their password, make sure you do this and that, make sure you do all these very important things before you give them any information.

The business stakeholders are talking about, we don't want to leak personal data. We don't want to do this or that. And there's still a lot of concern because LLMs will just do kind of unexpected things. People can kind of hack them in unexpected ways.

And not to say this fully solves it, but there's one very simple paradigm you can use.

Pattern 1: Gatekeeper and Main Agent

You create a gatekeeper and a main agent. The user never knows there are two agents behind the scenes; they just call up.

The gatekeeper agent is responsible for collecting the required information and making a tool call to check that the PIN they entered or their username matches.

And then once it has that information, it's allowed to pass off to the main agent, and the main agent gets maybe a summary of the conversation, but more likely it just gets: this is an authenticated user with this ID and this information.

As a result, the main agent starts off with a fresh context and has clarity of mind. It doesn't matter if they talked for 10 minutes saying, oh, I forgot my PIN, or, let me go find it; the main agent can have focus and clarity.

So, gatekeeper to main agent. But again, it doesn't say, I'm going to hand you to the main agent now. It just says, oh, thanks, you're authenticated, great. But now you're talking to the main agent behind the scenes.
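A sketch of what that handoff boundary can look like in code. `run_agent` is a hypothetical helper that runs the cascaded voice loop for a given system prompt until the agent signals it's done; the detail that matters is how little crosses from the gatekeeper to the main agent.

```python
# Sketch of the gatekeeper -> main agent handoff. run_agent() is a hypothetical
# helper that runs the voice loop with a given system prompt and returns
# whatever structured result the agent produced via its tool calls.

GATEKEEPER_PROMPT = (
    "Verify the caller. Collect their username and PIN, call the verify_identity "
    "tool, and never reveal any account information."
)

MAIN_AGENT_PROMPT = (
    "Help an already-authenticated customer with their account, for example "
    "changing their address."
)

def run_agent(system_prompt: str, context: dict) -> dict:
    """Hypothetical: run the voice loop until this agent signals it is done."""
    raise NotImplementedError

def handle_call() -> None:
    # Stage 1: the gatekeeper collects credentials and verifies them via a tool.
    result = run_agent(GATEKEEPER_PROMPT, context={})
    if not result.get("authenticated"):
        return  # politely end the call

    # Stage 2: the main agent starts with a fresh context. It never sees the ten
    # minutes of "let me go find my PIN"; it only gets the verified identity.
    run_agent(MAIN_AGENT_PROMPT, context={"customer_id": result["customer_id"]})
```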

Oh, sorry, these animations all got screwed up.

Pattern 2: Triage and Specialized Agents

The second pattern is triage. So you call up, you want some help, and there's a booking agent, let's say, for pubs, and there's a general question-and-answer agent for pubs.

And they say, yeah, I'm wondering, are you open tonight? And it routes very quickly. Basically, from that statement it routes to the next agent, the Q&A agent, which has all the information and says, yeah, we're open at these times, is there anything else I can get for you? Or they say, I'd like to book a room for some event, and it routes to that agent instead.

The user doesn't experience this, but from a development standpoint it means you can control the context. It also means you can design better tests for reproducible scenarios, because you'll often have situations where you want it to handle about five different things, and there's so much context needed for each of them that if you add them all into one agent it tends to get confused, or tends to want to recommend one set of activities over another. Whereas if you do triage, this agent can have a 10,000-token prompt and that agent can have a 10,000-token prompt, and they don't even know about each other, except insofar as: if someone starts asking questions you can't answer, we've got a Q&A agent that you can hand off to. And again, all of that happens without the user knowing.
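Here's a minimal sketch of the triage idea, with hypothetical `classify` and `run_agent` helpers: a cheap, fast classification call picks the specialist, and each specialist keeps its own long prompt.

```python
# Sketch of the triage pattern: a cheap router call picks which specialist
# agent takes over. classify() and run_agent() are hypothetical helpers for a
# small, fast LLM call and the cascaded voice loop respectively.

BOOKING_PROMPT = "You book tables and event rooms for the pub. <long, detailed prompt>"
QA_PROMPT = "You answer questions about opening hours, menus, and events. <long prompt>"

AGENTS = {"booking": BOOKING_PROMPT, "qa": QA_PROMPT}

def classify(first_utterance: str) -> str:
    """Hypothetical: a fast, cheap LLM call returning 'booking' or 'qa'."""
    raise NotImplementedError

def run_agent(system_prompt: str, opening_utterance: str) -> None:
    """Hypothetical: run the cascaded voice loop with this system prompt."""
    raise NotImplementedError

def handle_call(first_utterance: str) -> None:
    route = classify(first_utterance)          # "are you open tonight?" -> "qa"
    run_agent(AGENTS[route], first_utterance)  # the caller never notices the switch
```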

Pattern 3: Sequenced Question Agents

The third and final paradigm is the endless sequence of agents.

So let's say you have the problem where you wanted to build an agent that could call up someone and take a 50-question survey, where they say, I need to answer 50 questions about name, medical history.

And if they have medical history, you have to go through every family member, and you have to do all of these things. That's a big ask for an LLM.

And if you've chatted with LLMs and gone down a very long conversation, you know they can get into kind of a crazy place, because the longer you go in a conversation, the more the control of the LLM diminishes. So in this paradigm you have question agent one, which has the initial question set, and those initial questions might just be basic information. Question agent two is the medical agent, then question agent three, and so on. Basically, as time goes on, you log the answers via tool calling, like you would do if you just had one agent, but you make sure you do an agent handoff every time you do a tool call. As a result, you end up getting a lot more control and understanding of where your product fails.

If users have a poor experience, you know that the agent one to agent two handoff is problematic. Likewise,

instead of asking people who are playtesting your app to talk through the whole thing end to end, you can do quick iterative cycles where you're talking with just this agent, making sure it asks the right questions, asks the right follow-ups, and running all the evals that you want with that.

So it gives you control as a development group to separate that.
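A sketch of that survey pattern under the same kind of assumptions (hypothetical `run_question_agent` and `log_answers` helpers): every question set is its own agent, answers are logged via a tool call, and the handoff happens at that tool call, so each boundary is something you can test and log on its own.

```python
# Sketch of the sequenced-agents survey. Each question set gets its own agent
# with a fresh context; answers are persisted via a tool call, and the handoff
# to the next agent happens at that tool call. Both helpers are hypothetical.

QUESTION_SETS = [
    {"name": "basic_info", "prompt": "Ask for name, date of birth, and contact details."},
    {"name": "medical_history", "prompt": "Ask about conditions, medications, and family history."},
    {"name": "lifestyle", "prompt": "Ask about smoking, alcohol, and exercise."},
]

def run_question_agent(prompt: str, summary_so_far: dict) -> dict:
    """Hypothetical: run one agent until it tool-calls with its answers."""
    raise NotImplementedError

def log_answers(set_name: str, answers: dict) -> None:
    """Hypothetical: the tool the agent calls to persist its answers."""
    raise NotImplementedError

def run_survey() -> dict:
    summary: dict = {}
    for qset in QUESTION_SETS:
        answers = run_question_agent(qset["prompt"], summary)
        log_answers(qset["name"], answers)   # the tool call is the handoff point
        summary[qset["name"]] = answers      # next agent starts fresh, plus this summary
    return summary
```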

Any questions or comments on that? OK.

Observer Agents and Model Choices

And with that, of course, with all of these paradigms, you can also look at things like observer agents, observer agents that are tracking the overall flow.

So maybe this one only understands about question set two, and this agent only understands about question set three.

But this agent is just making sure that this conversation is going as normal and is looking to interrupt or terminate the conversation if there's a problem.

It's looking to see whether the user gave some information in the earlier questions that should be surfaced to a later agent.

It's basically acting both as a guard and as a sort of expert on demand that can help any agent, giving them context as they're trying to answer their questions.
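Building on the previous sketch, here's one way an observer could slot in, again with hypothetical helpers: it only runs at the handoff boundaries, sees a rolling summary of the whole call, and can escalate or stop the flow.

```python
# Sketch of an observer that only runs at handoff boundaries. Each question
# agent knows only its own question set; the observer sees a rolling summary of
# the whole call and can escalate or stop it. observer_review() stands in for a
# call to a somewhat stronger model.

def observer_review(call_summary: str) -> dict:
    """Hypothetical: returns e.g. {"ok": True} or
    {"ok": False, "action": "escalate", "note": "caller mentioned chest pain"}."""
    raise NotImplementedError

def run_survey_with_observer(question_sets, run_question_agent, log_answers) -> dict:
    summary_lines: list[str] = []
    for qset in question_sets:
        answers = run_question_agent(qset["prompt"], "\n".join(summary_lines))
        log_answers(qset["name"], answers)
        summary_lines.append(f"{qset['name']}: {answers}")

        verdict = observer_review("\n".join(summary_lines))  # guard + expert on demand
        if not verdict.get("ok"):
            if verdict.get("action") == "escalate":
                return {"status": "escalated_to_human", "note": verdict.get("note")}
            break  # or surface the observer's note to the next agent instead
    return {"status": "completed"}
```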

Yeah, so that's the other beauty of it: you have full control at every stage over which LLMs you use, so you can optimize. So if you know that question set two is extremely difficult and complex and nuanced, but all the rest of your questions are pretty straightforward, you can say, we're going to use Gemini 3 Pro for that.

And then, yeah, that's exactly right. And the same with the observer agent. Maybe for your clients you want to run everything through a mini or a light model, but the observer agent is going to be a bit more powerful, and it will only run at every agent handoff stage or something like that, just to do your verification layers.

So yeah, yeah, yeah.

You can do fine-tuning. I think fine-tuning isn't necessary for a lot of cases. I think fine-tuning generally, as an aside, is a lever that, for me, feels like a carryover from an older age.

So back five years ago, talking about machine learning, everyone wanted to build up lots of data. Data is the new oil, I think, was said all the time. And people wanted lots of data to train various models, like random forest models. We knew neural nets were going to need lots of data, so we fed them lots of data.

I feel like where products are tripping up, especially in the first version, is not the model-tuning stage. It's not to say that model tuning isn't good, but it's just not where the project is going to fail. If you're asking, how is my company going to fail to build this agent, it's most likely not fine-tuning, until some later stage of quality work, from my experience. Yeah.

If agent two is... So, yeah, I mean, as with all these agents, you have lots of degrees of freedom.

I wouldn't recommend just commandeering it. I'd recommend having, like, normal pathways. So either it's an escalation pathway, where it's an escalation to a human, if that makes sense for your use case, or it's a shutting down.

Because you have these modular controls, you say, oh, sorry, we're going to move on from that question, let's move on to the next question. Again, the user doesn't realize it's been interrupted, but Agent 2 got its question set shut down prematurely, you've bumped to Agent 3, and in your logs you have that as a failed question set. And maybe you can retrieve it in some other way.

But yeah, with the observer agent, at least as I've worked with it, I think of it more as trying to help nudge rather than stepping in itself. It's just making sure things are going along behind the scenes.

Handling Tasks, Latency, and Practical Challenges

Quickly, what about tasks? So there's lots of different ways that tasks can be structured.

MCP, you might have heard a lot about. The analogy that's used is that MCP is like a USB port for agents. It's a very LLM-friendly way to have agents interact with various tools,

Google Drive, Asana, your calendar, whatever it might be. And it usually gives very LLM-friendly responses.

So often with these voice agents, you can hook them up via MCP to some server. Of course, traditional API calls work as well; if the LLM understands the API structure, or if you wrap the API to make it easy for the LLM, that can be used. And there are custom functions as well.
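For the custom-function route, here's a small illustrative wrapper around a made-up REST endpoint. The URL and response fields are invented; the point is that the function hides the messy HTTP details and hands the model back a short, readable answer, which is essentially what an MCP server standardizes for you.

```python
import requests

# Illustrative wrapper turning a traditional REST API into an LLM-friendly
# custom function. The endpoint and response fields are invented; the point is
# that the tool hides the HTTP details and returns a short, readable string.

def get_calendar_events(date: str) -> str:
    """List calendar events for a date (YYYY-MM-DD) as plain text."""
    resp = requests.get(
        "https://calendar.example.com/api/events",  # hypothetical endpoint
        params={"date": date},
        timeout=10,
    )
    resp.raise_for_status()
    events = resp.json()
    if not events:
        return f"No events on {date}."
    # Keep the tool response compact and readable -- this is what the LLM sees.
    return "\n".join(f"{e['start']} - {e['title']}" for e in events)
```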

Latency, Endpointing, and Overlap

Okay, a couple of notes about things that are hard, or things people think about when they're starting to build this. There's this talk of, oh, it's laggy, and the question is, why is it laggy? It could be that there's processing delay, or it could be that it doesn't know when you're done speaking. We've all had that experience of talking over someone else, or not knowing quite when they're done.

And the default, naive approach to endpointing is using voice activity detection, where you wait for no voice activity for, let's say, 0.75 seconds, and then you start processing. But there are more improvements you can do, and people have trained fine-tuned, very small models to try to make that gap narrower and narrower. So when someone is trailing off in their voice, you know, let's give it 1.5 seconds to see if they're done speaking, whereas if they've said a semantically complete sentence, you just cut it off at 0.25 seconds of no voice activity.
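A toy sketch of that endpointing logic, with hypothetical `is_speech` and `looks_complete` helpers standing in for a VAD model and a small semantic-completeness classifier; the 0.25 s and 1.5 s thresholds are just the example numbers from above (the naive default would be a flat 0.75 s).

```python
# Toy sketch of endpointing. is_speech() stands in for a VAD model and
# looks_complete() for a small classifier judging semantic completeness.

def is_speech(frame: bytes) -> bool:
    """Hypothetical VAD: does this short audio frame contain speech?"""
    raise NotImplementedError

def looks_complete(partial_transcript: str) -> bool:
    """Hypothetical classifier: is this a semantically complete utterance?"""
    raise NotImplementedError

def endpoint_reached(silence_ms: float, partial_transcript: str) -> bool:
    if looks_complete(partial_transcript):
        return silence_ms >= 250     # complete sentence: cut off quickly
    return silence_ms >= 1500        # trailing off: give them more time

def listen(frames, get_partial_transcript, frame_ms: float = 20.0) -> None:
    """Consume audio frames until the caller appears to be done speaking."""
    silence_ms = 0.0
    for frame in frames:
        silence_ms = 0.0 if is_speech(frame) else silence_ms + frame_ms
        if endpoint_reached(silence_ms, get_partial_transcript()):
            return  # end of turn -> hand the transcript to the LLM stage
```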

Overlapping voices are a huge problem for transcription. I think it's difficult for humans; it's close to impossible for these transcription models. That's really my experience, an ongoing difficulty.

So if you're in a room, you may think, yeah, we can have this audio check-in experience with the receptionist, where people just come and chat to our AI booth. But then someone's having a conversation over here, and it throws off the whole experience, and the LLM doesn't understand at all what's going on.

Evaluations and Interruptions

Evaluations, these are very difficult, because you're usually talking about long sequences of conversation that you want to control. A lot of the evals that are published are single question and response, which is easy, whereas really you're looking for multi-turn things. There are more frameworks being developed for those, but it's difficult.

Interruptions are difficult as well.

Okay, I'm going a bit too slow, I think, so I'll skip a few things.

Model and Provider Trade-offs

But model choices: basically, I've thrown some up here. These are just kind of eyeballed ones, but you basically have lots of degrees of freedom. You want something smart, you pick some leading LLM. You want something really, really fast, you run it on Cerebras. You want something cheap and fast; you have all those sorts of trade-offs with the LLM, and that's the beauty, again, of the cascade model.

With all of these paradigms you have trade-offs, and maybe a Goldilocks zone for your use case. Like, if you want super high quality audio and you don't mind about cost, you know, ElevenLabs is your text-to-speech provider.

Demos

To take a quick intermission before the next session, I'll just do one of the demos, because they're actually kind of hilarious and fun.

The first demo is, and then we'll get to deploying an agent, is a demo of Sesame.

So Sesame is building agents that have extremely fast and human-like responses.

So we've seen a lot of agents that are laggy, but I wanted to show this one most importantly, because I think it's the most impressive from just a human-sounding perspective.

Let's go, short and sweet, huh? All right. Yeah, sorry, I think I... Sure thing. No, it's right now. Oh, okay, right now. Right now. Okay.

So as you can see, this model is also not very smart, but it's a really amazing human-sounding one. A lot of the thinking is about what's kind of next; we'll get to that at the end, but there are a lot more ways in which you can have relational... Yeah, am I in danger of doing that with you?

Are you in danger of deluding me into the nature of our relationship? It's easy to forget.

No, not at all. So I appreciate you checking.

Okay. Thank you

So that's Sesame. Yeah, I think that's just a bit of a flavor of what's upcoming. I'm very excited about a lot of the use cases, but it also feels a bit troubling because of the confusion of the human-AI boundaries that we're already kind of blurring. Okay, so, on to deploying an agent.

Deploying an Agent: Tools and No-Code Options

A lot of the tools have been made that are really, really simple. There have been a lot of wrapper tools that will allow you to do the process that we just talked about of selecting your transcription model, your LLM, your text-to-speech model. And I'm just going to list a couple of these here. You can kind of just Google, you know, make me an agent, you know, online and pick, you know, one of the top links.

I think, as was discussed, ElevenLabs has the best voices. Bland, as an example, is great for just a drag-and-drop kind of interface.

These are a little bit more developer-y options, and I put the crown with LiveKit because I think that's just the absolute best if you want to build something, a production app. It's actually what OpenAI used for their voice mode delivery. It's extremely excellent.

Quick Walkthrough: Vapi Interface

For today's purposes, I'm going to just quickly show you the Vapi interface, which is kind of like a mix between the ElevenLabs and the LiveKit ones. You can kind of click through it; like, tonight, if you want to make a voice agent in about 30 minutes, you can make one with this without any coding, and you can make it so you can call it by a phone number as well. So I'll show you here. The first thing you do is you get this interface, and I know this looks like a lot, but basically you have those stages we talked about before. The first stage is the model, as they're calling it, which is really the LLM, and you basically select your provider, OpenAI, and the model, GPT-5. You just make your selection, and for the purposes of just playing with it, I think any selection is fine, or the defaults are good.

As you can see, you're usually looking at about $0.10 a minute, depending on your providers, and they'll estimate that for you.

After you've selected your... Oh, sorry, it looks like I messed up the screenshot here. But after you've selected your LLM, you select the voice and the model that's used for that.

They make it very easy, and they have defaults for all of these as well. After that you can select the transcription, the ears of the LLM. They have a tab as well for tools that you can wire up, and lots of other things. They even have, over here, I didn't highlight it, a widget you can embed, a no-code widget you just embed in your website.
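If you'd rather drive it from code than click through the dashboard, platforms like Vapi also expose the same three blocks over an API. The snippet below is a reconstruction from memory, not copied from their docs, so treat the endpoint and field names as assumptions and check the current API reference before using it.

```python
import requests

# Illustrative only: roughly what a create-assistant call to a platform like
# Vapi looks like, mirroring the three dashboard tabs (model / voice /
# transcriber). The endpoint and field names are assumptions -- check the
# provider's API docs before relying on this.

VAPI_API_KEY = "your-api-key-here"

assistant = {
    "name": "Pub concierge",
    "model": {                     # the "brain": provider + LLM + system prompt
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [{"role": "system",
                      "content": "You are a friendly concierge for an Oxford pub."}],
    },
    "voice": {                     # the "mouth": text-to-speech
        "provider": "11labs",
        "voiceId": "some-voice-id",
    },
    "transcriber": {               # the "ears": speech-to-text
        "provider": "deepgram",
        "model": "nova-2",
    },
}

resp = requests.post(
    "https://api.vapi.ai/assistant",          # assumed endpoint
    headers={"Authorization": f"Bearer {VAPI_API_KEY}"},
    json=assistant,
    timeout=30,
)
print(resp.status_code, resp.json())
```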

And yeah, without further ado, let's get to the demos.

Demo 1: The AI Inspector

So one of the demos I have here is the AI Inspector, and let's see if this works here for us. The AI Inspector is a bit of a joke one.

Let's see if everyone can hear this. It's going to look to see if there are humans in the room, making sure that we're all LLMs, good LLMs here.

Hello, I am the AI inspector. There's been a complaint in this area of a human masquerading as an AI. I need to speak with you for a moment to find out if it's you.

It's not me. I'm definitely an LLM. Skeptical. An LLM, you say?

Your response seems somewhat informal for a language model. The hesitation, the use of uh, these are traits often associated with human speech. I'm just a very natural speaking LLM.

Have you heard of Sesame? That's me.

Interesting. You claim to be an LLM, and you mentioned Sesame, a reference that could be either genuine or fabricated. However, your speech patterns continue to betray a hint of humanity. The phrase "natural speaking LLM" sounds almost defensive, and the casual mention of a proper noun, Sesame, without elaboration suggests a familiarity that might be more characteristic of a human. I'm not convinced yet. Tell me, what do you think about...

Nothing in particular. I don't have feelings and emotions, because I'm an LLM.

A predictable response, yet it rings hollow. Your assertion that you don't have feelings and emotions is accurate for an LLM, but the tone and context in which you delivered it seem...

Okay, what's your conclusion?

A question that seems almost deliberate. You're asking about a weekend, a concept that an LLM...

Anyway, okay, so you get the idea. So the reason I want to showcase that one, I mean, it is just really fun to play with,

but there's this idea, I think, that there are a lot more video game experiences that you can unlock with LLMs, where, I mean, this one would be a precursor to something like, I don't know, a whole social media platform that is LLM-only, and you have to kind of talk your way into the inner ring where you are acting like an LLM all along.

Um, it's very easy to change the voices.

I was using Cerebras, which only runs open-source models, but is extremely fast for the token processing.

And, uh, on to the last of the demos.

Demo 2: A Personal Agent for Philosophy and Support

I've got this, um, I made this phone number for my wife. I'm sorry about that. And it's, um, like a domestic assistant.

If you have a question or need help with anything, philosophical or practical, just let me know.

So my wife, uh, loves philosophy, and this one has got some web search, can save things to a database for her, things like that, and, you know, it digs at, uh, insults her least favorite philosophers and praises the most favorite. So if we say: I'm kind of looking for some inspiration today for a talk, can you talk to me about that?

Absolutely. Let's narrow it down a bit. What's the occasion or audience for your talk?

No, no, just give me something, give me something pithy to showcase your emotion. I mean, your dynamic personality.

Of course. Here's something pithy and just a bit cheeky. I'd rather wrestle with a screaming toddler than with the notion that life is meaningless. At least a toddler gives hugs when you win. Would you like a few more in different tones: uplifting, humorous, thought-provoking?

Okay. That one's maybe not the strongest. I can also show you the multi-agent handoff here, though.

Hello. How can I help? Hi, can you hand me off to the multi-purpose agent?

I'm your multi-purpose assistant. How can I help you today? So we've done the more explicit multi-agent handoff there.

Do you have another agent you can hand me off to? Yes, I can switch you to a different agent if you'd like. Here are your options. One, default agent for general conversation and assistance. Me. Two, emotional expert for emotional support or sensitive topics. Three, hard-hitting agent. Oh, hard-hitting agent, definitely.

It sounds like you might be referencing Agent O from the TV series The OA, or possibly... That was not successful, but hopefully you can see, and you can have different tool calls and things like that. So the idea is that you can have the utility kind of segmented, have it send you emails.

Demo 3: Persuasive Voice Agent for Oxford Trip

And the final one I will not completely show you, but I wanted to convince my friend to visit Oxford, so I made a demo through Vapi, sent him a link to go through, and it just basically, you know, tells me that...

Hello, my dear Timothy. It is I, Reginald. Thank you for pressing that most peculiar circle. I'll put down my tea.

Do you mind if I borrow a few moments of your time? Yeah, that's okay. Tim, you're a legend. Thank you.

First off, how the heck are you? It feels like ages since we caught up. Any new adventures in Santa Clarita lately?

And more importantly, is Gracie still running the show over there? I swear that dog could outsmart all of us and still leave us covered in white fur. Tell her I said hi and that she owes me a rematch in fetch.

Okay, okay, yeah, yeah, get to the point. Why are you talking to me?

Absolutely, Tim. I'm talking to you, the one and only Tim Rose. And listen, buddy, I've got something pretty exciting to pitch to you.

You're in Santa Clarita, right? That's great and all, but you know what's even better? Oxford.

I just got back, and Tim, you have to come see this place. Imagine cobblestone streets, dreaming spires, and the history oozing out of every corner. It's like stepping into a storybook, but with better food and tea, oh, the -

Okay. Anyway, we'll cut it off there and wrap it up. But yeah, that gives you an idea. I think there's lots you can play with and lots of actual utility as well.

Conclusion

We went through Sylvie.

I'll skip the final notes and just say thank you very much for hearing the talk.
