My presentation is about generating speech. We are going to build a full system that connects an LLM to produce text with a text-to-speech model to produce the speech, plus a model to clone our own voice. Okay? And we are going to embed all of this into Telegram so we can use it on our phones.
So before diving into the code section, I would like to talk a bit about the fundamentals, from a very high level, of how to work with audio in artificial intelligence. Because this is my background: I am a PhD candidate at the Universidad Politécnica de Madrid, and this is what we do every day. Most people know how to work with text, how to work with images, but what about audio?
So the first thing I would like you to notice about audio is that... okay, the pointer won't work, so I will point with my hand. It's a massive amount of information. Just one second at CD quality is 44,100 samples of information. That is huge.
It's like a very big image. And one second of audio is nothing; imagine we need to work with three minutes for a full song. That's a lot. That's an immense image.
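To put numbers on that, here is a quick back-of-the-envelope calculation of my own (not from the talk's slides):

```python
# Raw audio size at CD quality: 44,100 samples per second.
SAMPLE_RATE = 44_100

one_second = SAMPLE_RATE * 1        # 44,100 samples
full_song = SAMPLE_RATE * 3 * 60    # 7,938,000 samples for a 3-minute song

print(f"1 second:  {one_second:,} samples")
print(f"3 minutes: {full_song:,} samples")
```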
It's a massive textbook. So usually we don't work with the raw audio like that. We work with a small, fundamental piece of information: a representation that captures, perfectly or at least as a best effort, the most important characteristics of the audio. That is what we use when doing artificial intelligence with audio.
Let me explain a bit better, because in engineering we usually talk about this scheme. First we convert the audio into that smaller piece of information, which in engineering fashion is called encoding. It's a funnel: something big is converted into something smaller. Then, to make sure this piece of information is meaningful, we would like it to reconstruct exactly the same audio we had at the beginning. It can't be exactly the same, because we are reducing information, but it has to be as similar as possible.
However, I usually like to give this example in class: it's like trying to pass an exam. That's the exam. This is the textbook, which is huge. And this is the cheat sheet that I will keep in my pocket: all the information I need to try to pass the exam. It doesn't need to be a perfect exam, just a pass. And this piece of information is the one we usually work with.
I also like to comment on why we call it latent, or embedding. Imagine, when working with music, we have human words to describe it: instrument, rhythm, note, pitch. But AI doesn't understand these. It works with the fundamental features that make it work, and we don't know what they are; we just call them latent, and we are fine with that. That's why.
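To make that funnel concrete, here is a toy sketch of my own (not the model used in the talk): a tiny autoencoder that squeezes one second of CD audio into a small latent and tries to reconstruct it.

```python
import torch
import torch.nn as nn

class ToyAudioAutoencoder(nn.Module):
    """Toy funnel: 1 second of CD audio -> tiny latent -> reconstruction."""
    def __init__(self, n_samples=44_100, latent_dim=128):
        super().__init__()
        self.encoder = nn.Linear(n_samples, latent_dim)   # the funnel
        self.decoder = nn.Linear(latent_dim, n_samples)   # the reconstruction

    def forward(self, waveform):
        latent = self.encoder(waveform)      # the "cheat sheet"
        return self.decoder(latent), latent  # best-effort copy of the audio

model = ToyAudioAutoencoder()
audio = torch.randn(1, 44_100)               # stand-in for 1 s of audio
reconstruction, latent = model(audio)
loss = nn.functional.mse_loss(reconstruction, audio)  # "pass the exam" objective
```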
But today we are working with a text-to-speech model, so we need an extra block here: something that accepts text as input and speaks it out loud. That is a base speaker model, and in my opinion it is state of the art. The problem is that, being a base speaker model, it will speak like someone we don't know.
But that part is pretty much solved, in my opinion. This works.
But I would like it to have the characteristics of, for example, a random speaker like me. So I need something extra: another box, which is a tone color extractor. We call it that when we mean the characteristics of my vocal tract.
This is something that is going to understand the characteristics of my voice and inject them into the latent information, the fundamental information, but change only the characteristics of the voice, not its content. That's why we have these two things.
So here we have the text. Here we have my voice. And here we have the text with my voice. Is that clear?
The only extra thing we have left is that somewhere we need an LLM. There.
I'm not going to write down any text to produce speech, right? An LLM is going to do it for me. I'm just going to input something.
And now that we know the fundamentals, the high-level schematics, let's take this to Telegram. Okay? And yes, using open-source models.
So because today we are here at MindStone and Google is one of the sponsors, we are going to use Google Gemma for the LLM. And for the text-to-speech we have two stages: the VITS model, which is the plain base text-to-speech, and then the one that is going to convert the speech into my voice, which is OpenVoice.
In my opinion, it's quite good. The latest release was two weeks ago, so let's see if this works or not.
Let's head to the coding part. I'm going to open PyCharm, that's my favorite IDE. Can you see it? Sorry, I'm going to close this.
Okay, so the first thing: the imports. What we are going to use for the LLM is LangChain. You probably know about it; it's pretty common these days.
For the text-to-speech, oh, don't worry about photographing this, because I'm going to show the GitHub afterwards. Don't worry about that.
For the text-to-speech we are going to use the OpenVoice API, and then the Telegram API. I've set some environment variables. The first one is the device: in case you have a GPU, this will run faster; in case you don't, it will still work, just slower.
And the Telegram token. You need this. I have attached to my presentation a tutorial on how to get this token. It's quite straightforward, but it's nothing you can code; it's something you have to do on the Telegram website.
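In code, that boils down to something like this (the variable names are my own illustration, not necessarily the ones in the repo):

```python
import os

import torch

# Run on GPU if available; everything still works on CPU, just slower.
DEVICE = os.environ.get("DEVICE", "cuda" if torch.cuda.is_available() else "cpu")

# Obtained from @BotFather on Telegram (see the linked tutorial).
TELEGRAM_TOKEN = os.environ["TELEGRAM_TOKEN"]
```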
So first thing, we're going to code the LLM. We need to set it up, and then create the function to generate text.
It's super simple to have an LLM these days. With LangChain I'm using Ollama, which is like a wrapper: we make an HTTP call to its server and receive the resulting text from, in this case, the Gemma model. And generate text, well, we just call invoke from the LangChain tools, and then we receive the resulting text.
Super simple. And you can set a lot more attributes here, but these are the simplest ones. When playing around with this code, you can change them.
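A minimal sketch of that setup, assuming the langchain-community Ollama wrapper (the exact imports and parameters in the repo may differ):

```python
from langchain_community.llms import Ollama

# Ollama serves Gemma locally over HTTP; LangChain wraps the calls.
llm = Ollama(model="gemma", temperature=0.7)

def generate_text(prompt: str) -> str:
    """Send the prompt to the local Gemma model and return its reply."""
    return llm.invoke(prompt)

print(generate_text("How are you today?"))
```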
So, second part: the text-to-speech. This is a bit trickier because it comes in two stages.
So the base one, the one that just speaks out loud in a generic voice, is just this line. Very simple: English, and on the device, GPU or CPU, whatever.
But for the colorizer, the one that is going to make this speech sound like my own voice, it's a bit trickier, because we need two things. First, we need the embeddings of the source speaker the model was trained with. To get those, we choose one of the ones they ship in the library.
I chose this one, which is English, and as you can see, I have a perfect British accent, so I decided to choose the British one. And we also need the target embeddings of my own voice. For that, I recorded 30 seconds of my voice, which I think I cannot play, but it's just me saying random things in English: "Hello, my name is Mateo, I'm a student, I would really like to speak in front of you", whatever, okay?
I need the model to capture the nuances of my voice. This is how we do it.
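Put together, the setup looks roughly like this. I'm assuming the OpenVoice V2 demo layout, with MeloTTS as the base VITS-style speaker; the checkpoint paths, the en-br embedding file, and the reference file name are taken from that demo and may differ from the actual repo.

```python
import torch
from melo.api import TTS as MeloTTS
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

# Base speaker: a VITS-style model that speaks the text in a generic voice.
base_tts = MeloTTS(language="EN", device=DEVICE)
speaker_ids = base_tts.hps.data.spk2id  # e.g. EN-US, EN-BR, ...

# Tone color converter: re-colors the base speech with a target voice.
converter = ToneColorConverter("checkpoints_v2/converter/config.json", device=DEVICE)
converter.load_ckpt("checkpoints_v2/converter/checkpoint.pth")

# Source embedding: the base speaker the converter knows (British English here).
source_se = torch.load("checkpoints_v2/base_speakers/ses/en-br.pth", map_location=DEVICE)

# Target embedding: extracted from ~30 seconds of my own recorded voice.
target_se, _ = se_extractor.get_se("my_voice_30s.wav", converter, vad=True)
```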
And then the text-to-speech function itself: that was just the setting up, so this time it's very simple. We just call the base model's text-to-file method, creating a temporary file with the speech from the plain speaker.
And then we use this temporary file to push my voice on top of it, with this sentence. So it's harder to set up, but this part is pretty simple.
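A sketch of that two-stage function, reusing the assumed objects from the previous sketch (tts_to_file and convert are the calls used in the MeloTTS and OpenVoice demos):

```python
import tempfile

def text_to_speech(text: str) -> str:
    """Two stages: base speech into a temp file, then re-color it with my voice."""
    tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
    out = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)

    # Stage 1: the plain base speaker reads the text out loud.
    base_tts.tts_to_file(text, speaker_ids["EN-BR"], tmp.name)

    # Stage 2: push the characteristics of my voice on top of the base speech.
    converter.convert(
        audio_src_path=tmp.name,
        src_se=source_se,
        tgt_se=target_se,
        output_path=out.name,
    )
    return out.name
```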
Now we can head to the Telegram part. We need to create an Updater object: this is the one that is able to go to the conversation in Telegram, read the messages, and send messages.
So we just need to set a handler, these things where, you know, you write slash-something and then some input, and it fires an event inside the code. This is how we do it: with a CommandHandler, and then you decide what the slash word is; in my case, mateo.
So I can say something like "/mateo how are you?" and I expect Mateo to answer, okay?
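Roughly like this, assuming the pre-v20 python-telegram-bot API, where Updater still exists (the placeholder handler is replaced by the real one in the next sketch):

```python
from telegram.ext import Updater, CommandHandler

def handle_mateo(update, context):
    # Placeholder; the real logic is wired up in the next sketch.
    update.message.reply_text("Mateo is thinking...")

# The Updater reads messages from the chat and lets us send replies back.
updater = Updater(token=TELEGRAM_TOKEN, use_context=True)

# "/mateo <message>" fires handle_mateo with the rest of the line as arguments.
updater.dispatcher.add_handler(CommandHandler("mateo", handle_mateo))

updater.start_polling()  # keep answering until the process is stopped
updater.idle()
```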
And the last part is wrapping everything up. So we have some text that I've sent to Telegram.
Then I input this text into the generate-text function, so I get the LLM response. I'm going to send this response to the conversation in Telegram, but I'm going to send it to the text-to-speech as well. Then I will get the audio, and I will send it to Telegram too.
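Putting the pieces together, the handler could look like this (same assumed helper names as in the sketches above):

```python
def handle_mateo(update, context):
    """/mateo <prompt>: reply with Gemma's text, then the same text in my voice."""
    prompt = " ".join(context.args)       # the text after "/mateo"
    reply = generate_text(prompt)         # LLM response

    update.message.reply_text(reply)      # send the text to the conversation

    audio_path = text_to_speech(reply)    # same response, spoken with my voice
    with open(audio_path, "rb") as audio:
        update.message.reply_voice(voice=audio)
```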
Do you want to see if it works or not? Okay, it's already running because it takes a bit to set everything up.
But I have it here. This is my Telegram; I have hidden everything personal. So I can say something like: Mateo, how are you today?
Usually it takes a while, since this is not streaming the text. In ChatGPT we are very used to seeing everything appear progressively; here it will appear in just one message, so it takes a bit longer. It should be like five seconds more.
Nothing else, I think. Hopefully. Yeah. There you go.
Well, at least he's working optimally. But he doesn't have feelings. That's great.
But what about saying it out loud? Let me prove that I'm doing it: I'm going to receive it on my phone. There you go.
As a language model, I do not have personal experiences or emotions. However, I am functioning optimally and ready to assist you with any information or tasks you may have. How can I help you today?
Is it my voice? Not really, because I'm not British. In Spanish it usually works better, but I had to do it in English. So, yeah, that was everything on my side.
I'm going to leave you some more, oh, sorry, comments here, because, I mean, everything is on the GitHub. So you have here the link to the repo.
Here is the tutorial to create the bot in Telegram. Here is how to set up Ollama, the server. More information about these guys.
And of course, if you have any questions, if you want to connect with me, if you just want to chat, this is my LinkedIn. What I've just shown is a very simple example. Now the doors are open; it's up to you, whatever you want to create.
Thank you.