Well, hello everyone.
My name is Juan and I'm here to talk to you about how to build a voice AI agent. I'm currently working on a startup, as a founder. It's not yet official what we do, but we do a lot of voice. That's why I'm here to talk about this, and we can talk more off the record.
Before this, I was an engineer at Google, and before that I was doing research at MIT. So I've been working with AI for quite a while. It's something I love, and I also love teaching people how to approach it the best way.
So the first question I have is: who here knows how to code? Okay, that looks like about 60% of the room, and I'm going to show a bit of code. This is supposed to be a technical talk, but I'll try to make it so everyone can understand it.
So I think we've all started to experience these voice AI agents over the last couple of years: we take our phone, we open ChatGPT, and there's a button where we can talk to ChatGPT directly. So it's voice interaction, and we're starting to see how you can build a voice AI agent, for example, for customer support, so instead of having humans behind it, you have an AI.
And I was wondering how do they do this? And I found a library that ChatGPT uses basically to build these voice AI agents. So I got very interested.
And I've been building with that library and contributing to it for the past year.
And today, I'm going to show you how to get started with this: what the main blocks are that make up these voice agents, and then how you can expand from there if you want to go further and build multi-agent systems, connect them to your data, and so on and so forth.
So the main building blocks of a voice agent are actually pretty simple.
If you look at the user on the left side, the user basically talks to the phone, so you get a speech input. What you have inside is what's called STT, speech-to-text: it's basically transcribing what the user is saying. Then you have an LLM, which can be any LLM you want: it can be Gemini, it can be a GPT model, it can be Llama. And then once you get the answer, you turn the LLM's answer back into audio with TTS, which is the opposite: instead of speech-to-text, it's text-to-speech. And then you get the speech output back to the user.
And this runs on a loop with every user interaction. So these are like the most basic building blocks, but we can expand from here and this is what we are going to see.
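For reference, here is a purely illustrative Python sketch of that loop. The three stub functions are placeholders, not a real API: in practice a framework like LiveKit wires each step to a real provider and runs the loop for you.

```python
# Illustrative sketch of the voice-agent loop: STT -> LLM -> TTS.
# The stubs below are placeholders; a real agent delegates each step to a provider.

def speech_to_text(audio: bytes) -> str:
    return "what's the weather in Paris?"  # placeholder transcript

def llm_reply(prompt: str) -> str:
    return "The weather in Paris is cloudy."  # placeholder model answer

def text_to_speech(text: str) -> bytes:
    return text.encode()  # placeholder "audio"

def handle_turn(user_audio: bytes) -> bytes:
    transcript = speech_to_text(user_audio)  # 1. transcribe the user's speech
    answer = llm_reply(transcript)           # 2. let the LLM decide the answer
    return text_to_speech(answer)            # 3. synthesize speech for the user
```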
And I'm not supposed to be showing slides all the time, so I'm going to show a bit of how this is done under the hood. I just have one more slide, which is this one.
So basically, this is the library I'm using. It's called LiveKit, and I like it because it's very, very flexible. It allows you to build voice agents like Lego blocks, and you can expand it by connecting it to your data.
You can change the LLM or the speech-to-text, the text-to-speech. You can basically customize the building blocks as you want.
What we have to understand here is, as I was saying, that this is a loop. The speech comes in, then you get the STT, the LLM, and then the TTS, and then it goes back to the user.
So if you know a bit of Python, you basically have what's called the entry point. This is the code that runs whenever a user connects: you go through this entry point, where you define what's called the session.
The session is basically like a call. When a user calls the agent, they enter a session, and in the session you have all the data from the user: for example, their ID, their name. Maybe you go to your database and fetch more data from there.
It's there that you define what you want, for example, for the STT, which I have here: it's a model from a company called Deepgram, which is pretty good at transcribing. You also define the LLM you want. So as you can see, these really are building blocks that you can customize.
In this case, what I have for the LLM is Gemini, with the latest Gemini model that came out last week, with reasoning. And then finally I have the TTS, the part where the model speaks, what's going to generate the audio output.
So here, again, I can define whatever I want. In this case, I've chosen Google; I have a bit of a bias for Google, as you can see. And I've chosen one of the latest models, which sounds very human.
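As a rough sketch, this is what that session setup can look like with the LiveKit Agents SDK. The plugin classes are real, but the specific Gemini model name and defaults here are assumptions based on the description above; check the shared repo for the actual code.

```python
from livekit import agents
from livekit.agents import AgentSession
from livekit.plugins import deepgram, google, silero

async def entrypoint(ctx: agents.JobContext):
    # Runs every time a user connects; the session is "the call".
    session = AgentSession(
        stt=deepgram.STT(),                        # speech-to-text (Deepgram)
        llm=google.LLM(model="gemini-2.5-flash"),  # any LLM plugin: Gemini, GPT, Llama, ...
        tts=google.TTS(),                          # text-to-speech (Google)
        vad=silero.VAD.load(),                     # voice activity detection for turn-taking
    )
    # ... the agent is defined and the session started further below
```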
And then basically what you do is: okay, I have the session, but inside the session, what I have to define is the agent. The agent is where you define the instructions, like the prompt. You're defining an entity and telling it: okay, this is your role.
Why do we define the session and the agent separately? Because then we can define multiple agents, so we can create a multi-agent system, which we'll see later: for example, specialized agents that handle different things depending on the user or on what the user wants.
So the next step is defining the agent. The agent I have is a very simple one.
By the way, I'm going to share all this code with you if you're interested. You can git clone it and run it, and I've also written a README so you have every step.
In instructions, I'm just saying you're a helpful voice AI assistant.
And, as you can see, you can define the tools you want, like the functions. I have a very basic one, which is a function to get the weather: depending on the location, it will tell you what the weather is in that location.
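A minimal sketch of what that agent class can look like with LiveKit's function tools follows; the weather data is a hard-coded placeholder matching the behavior in the demo below, not a real weather lookup.

```python
from livekit.agents import Agent, RunContext, function_tool

class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")

    @function_tool()
    async def get_weather(self, context: RunContext, location: str) -> str:
        """Called when the user asks about the weather in a location."""
        # Placeholder data: only a couple of cities are known, as in the demo.
        known = {"Paris": "cloudy", "Berlin": "sunny"}
        if location in known:
            return f"The weather in {location} is {known[location]}."
        return f"Sorry, I don't have the weather for {location}."
```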
So you can imagine this as: you can have, for example, a customer support agent that has a function to, say, retrieve data about this user, or log that this user has had an issue, or book a call or an appointment for this user. These are the kinds of things you can do with an agent.
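For example, a hypothetical customer support agent could expose tools like these. The tool names and bodies are made up for illustration; in a real agent they would call your ticketing or booking backend.

```python
from livekit.agents import Agent, RunContext, function_tool

class SupportAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a customer support voice agent.")

    @function_tool()
    async def log_issue(self, context: RunContext, user_id: str, summary: str) -> str:
        # Hypothetical: write the issue to your ticketing system or database.
        return f"Issue logged for user {user_id}."

    @function_tool()
    async def book_appointment(self, context: RunContext, user_id: str, date: str) -> str:
        # Hypothetical: call your booking backend here.
        return f"Appointment booked for user {user_id} on {date}."
```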
So here I have just one. I'll show you later that you can do as many agents as you want.
And then I have the session. So in the global scope, I have the agent.
And by the way, you might wonder: okay, what if I want to have multiple agents with different voices? You can do that as well. You can customize all of these, the STT, TTS, and LLM, for every agent that you define.
Here I have defined global ones for the session, which are then used. And then you just start the session. Starting the session basically starts the loop.
So every time the user says something, as we'll see, it gets transcribed, it goes to the assistant, the assistant decides the answer and whether it needs to call the get_weather function or not, and then the answer comes back to the user as speech. And finally, the first thing that's going to happen is that instead of waiting for the user's input, I'm going to greet the user and offer assistance.
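Putting it together, the entry point looks roughly like this, repeating the session setup from the sketch above and assuming the same Assistant class; the greeting instruction and model name are paraphrased, not the exact code from the repo.

```python
from livekit import agents
from livekit.agents import Agent, AgentSession
from livekit.plugins import deepgram, google, silero

class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You are a helpful voice AI assistant.")

async def entrypoint(ctx: agents.JobContext):
    session = AgentSession(
        stt=deepgram.STT(),
        llm=google.LLM(model="gemini-2.5-flash"),
        tts=google.TTS(),
        vad=silero.VAD.load(),
    )

    # Starting the session starts the loop: user speech -> STT -> LLM (+ tools) -> TTS.
    await session.start(room=ctx.room, agent=Assistant())

    # First turn: greet the user instead of waiting for their input.
    await session.generate_reply(
        instructions="Greet the user and offer your assistance."
    )

if __name__ == "__main__":
    agents.cli.run_app(agents.WorkerOptions(entrypoint_fnc=entrypoint))
```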
So this is all the code I have. It's pretty short; it's basically the quick start.
So now I'm going to run it, and I'm going to talk to it, and I hope it will work. Hello, I'm a helpful voice AI assistant. How can I help you today?
Hey, what's the weather like in Paris? The weather in Paris is cloudy.
So far so good. And how is it in Berlin? I'm sorry. No, sorry, sorry. How is it in Tokyo? I'm sorry, I don't know the weather in Tokyo.
What else can you do for me? I can also tell you the weather in other cities. Just let me know where you'd like to know.
Okay, so basically this is what you've seen: I asked for the weather in Paris and it had an answer, but when I asked about Tokyo, the return of the function is basically, hey, I don't have the weather for Tokyo, so that's why it said it didn't have it.
So this is the most basic agent you can build, and from here you can just vibe-code, let your imagination run, and see where you can land. I can tell you later about the app I've been building, so you can see how you can expand it and what you can do with it. Now, back to the slides.
So this is a diagram to show you a bit more of what you can do. What I've built is what I had in the first slide, the most basic blocks.
But as I've shown you, like the big block is LiveKit, so it would be like the session.
But inside you could define, for example, a triage agent. This is very common in customer support, for example: you have a general agent that decides which specialized agent should handle the call.
For example, imagine you run this in a hospital. Someone can call to book an appointment, or someone can call to ask about a bill or a refund. So you can have specialized agents for each of these processes.
So you would have, as you see, the triage agent, which would be where the user is routed at the very beginning, with the speech-to-text, the LLM, and the TTS. The LLM you can connect to functions, as we've seen with get_weather, but you can also use it to search, either in your database or on Google, and you can also connect it to MCPs. I don't know if you've heard of this concept, but it's basically a way of defining functions, the same idea, except they're defined by companies so you can leverage them. For example, imagine a CRM that has defined functions you can call straight away; it's an easier way to do it.
And then with LiveKit, for example, depending on the user's need, you can hand off to a specialized agent, or you could also transfer the user to a human. So if the user has a need that no specialized agent can handle, you transfer them to a human, and then maybe they come back to the agent afterwards.
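One way to sketch that triage pattern, assuming LiveKit's handoff convention where a function tool returns the next agent to take over the call; all the agent names and instructions here are made up for illustration.

```python
from livekit.agents import Agent, RunContext, function_tool
from livekit.plugins import google

class BillingAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="You handle billing questions, invoices, and refunds.",
            # As mentioned above, each agent can also get its own voice / LLM / STT.
            tts=google.TTS(),
        )

class SchedulingAgent(Agent):
    def __init__(self) -> None:
        super().__init__(instructions="You book, move, and cancel appointments.")

class TriageAgent(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="Greet the caller, figure out what they need, and route them."
        )

    @function_tool()
    async def route_to_billing(self, context: RunContext) -> Agent:
        # Returning another Agent hands the rest of the conversation off to it.
        return BillingAgent()

    @function_tool()
    async def route_to_scheduling(self, context: RunContext) -> Agent:
        return SchedulingAgent()
```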
You're in this world where, depending on what you need, you build your Lego, basically. So that would be it.