Training a new AI voice for Piper TTS with only 4 words

Introduction

This is just a bit of fun; it's nothing to do with my day job.

Why Piper TTS?

So this is about training a new AI voice for Piper TTS, which is text-to-speech, with only four words.

Lightweight, AI-powered text-to-speech

So, has anyone heard of Piper at all? Piper is quite a lightweight text-to-speech engine; it's AI-powered and it's pretty good.


How it compares to classic engines

It kind of exists alongside quite a few other engines that have been around for several years, such as eSpeak, Festival, and Pico TTS.

The trouble is they all sound quite robotic. They're not fantastic. They're quite easy to understand, but they don't sound very natural.

Legacy TTS demos: eSpeak and Festival

Here's eSpeak.

Unfortunately, that's not going over the speakers, so you'll have to rely on my laptop speakers.

Very, very robotic.

Here's Festival. Festival sounds a little better.

This was a project from the University of Edinburgh.

They also released some commercial voices which you can't have, which wasn't great.

Why voices matter for home automation

The reason I like these voices and use them is because I've always had some kind of home automation thing over the years, alert systems, that sort of thing, and I've always wanted to be able to use a decent voice with

Enter Piper for a more natural sound

this system, something better than Alexa, because Alexa is somewhat evil. So here's Piper. Piper came about a few years ago and it's closely associated with the Home Assistant project. I'm not sure if a lot of you have used it or heard of it, but I bet most of you have, considering this audience.

So yeah, Piper sounds a lot more natural: "System failure. Release Rinzler."
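For reference, running a phrase through a Piper voice is about one line; here's a minimal sketch in Python calling the piper CLI as shown in Piper's README (the model file is just an example voice you'd download first):

    # Minimal sketch: synthesise one phrase with the Piper CLI, per its README.
    # The model file is an example voice, not necessarily the one used here.
    import subprocess

    subprocess.run(
        ["piper", "--model", "en_US-lessac-medium.onnx", "--output_file", "demo.wav"],
        input="System failure. Release Rinzler.".encode(),
        check=True,
    )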

Cloning voices with minimal data

What about heavyweight AI? So there's something called Chatterbox, and you may be familiar with AI voice cloning.

Chatterbox TTS: four-word voice cloning

Generally, voice cloning is quite expensive for both training and inference. Technically, Chatterbox is quite amazing, really: you literally give it a few words and it will clone a voice. Of course,

you need a fairly beefy GPU to run it, and the problem with that is it's expensive. You don't really want that running in your closet at home to power your home automation system.
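To give an idea of how little code the cloning takes, here's a minimal sketch following the Chatterbox README, where reference.wav stands in for a few words of the voice you want to clone:

    # Minimal voice-cloning sketch with Chatterbox TTS (API as in its README).
    # "reference.wav" is a short sample of the target voice; path is an example.
    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    model = ChatterboxTTS.from_pretrained(device="cuda")  # needs a beefy GPU
    wav = model.generate(
        "System failure. Release Rinzler.",
        audio_prompt_path="reference.wav",
    )
    torchaudio.save("cloned.wav", wav, model.sr)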

Quick demo and feasibility

So here it is. There's actually a demo on Hugging Face, and it's fairly amazing; you can go and check that out. You can literally say something, or stick in a sample like I did from that Festival demo of one of

Training Piper on a budget

the commercial voices. So, training a Piper voice traditionally: Piper, as I mentioned, is fairly low-resource, so you can run it on a Raspberry Pi, but you need a fair amount of beef to train it, and you also need a fairly large training data set of thousands

of phrases and so on. So you've probably figured out where I'm going with this.

Choosing a target voice and generating data

So here's some text: a test generated by this mystery commercial TTS engine (I've just said what it is): "System failure. Release Rinzler." Sounds a bit better. I wanted to clone it because I think it sounds sinister, and I want my AI voice to sound sinister.

So here's that same phrase regenerated with Chatterbox TTS; I literally fed that into Chatterbox: "System failure. Release Rinzler." Sounds about the same, which is pretty cool.

Scaling up phrase generation

So the idea, of course, is to generate loads and loads of phrases with Chatterbox TTS, and you can

run it just on your desktop PC. I left it churning away for several days making lots of phrases. My GPU and my computer are a bit rubbish; the machine is about 10 years old and the card's a 980 Ti.
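That batch job is essentially a loop over the phrase list, writing a wav and a metadata line per phrase; a rough sketch, not the original script (the LJSpeech-style layout is an assumption, chosen because Piper's training tools accept it):

    # Rough sketch of the overnight batch job: one wav per phrase plus an
    # LJSpeech-style metadata.csv ("id|text") for the Piper training step later.
    # Paths and the phrase file are assumptions, not the original script.
    import os

    import torchaudio
    from chatterbox.tts import ChatterboxTTS

    os.makedirs("dataset/wav", exist_ok=True)
    model = ChatterboxTTS.from_pretrained(device="cuda")

    with open("phrases.txt") as f, open("dataset/metadata.csv", "w") as meta:
        for i, phrase in enumerate(line.strip() for line in f):
            wav = model.generate(phrase, audio_prompt_path="reference.wav")
            torchaudio.save(f"dataset/wav/{i:04d}.wav", wav, model.sr)
            meta.write(f"{i:04d}|{phrase}\n")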

Hardware hurdles and heatwaves

The architecture was too old, so a mate of mine lent me his Tesla GPU, which is here, and it was during a heat wave in the UK, so this was running outside, so it didn't overheat. It was ridiculous.

It was a very janky test setup, as you can see, with a really old CPU, but it doesn't matter. The CPU doesn't really matter in this case.

It all runs on the GPU to do the training.

Docs, tooling, and dependency management

So Piper TTS provides documentation on how to do this training, but it's not as well supported as the rest of the project; you've got to figure some stuff out and

such along the way. So you can see, it's a bit small. But

yeah, this is a terminal screenshot. Of course, I've got to have htop in the terminal screenshot. It's the law.

Anyway, that shows the GPU being a bit hot.

Software-wise, I like NixOS. I use that with Docker just to make things easier and reproducible.

I ran into a fair bit of dependency hell in this process, but Docker certainly helped me figure it all out.

Scripts and environment setup

I'm going to show you a few scripts. These are all available at the end if you want to run them yourself, so I don't expect you to memorize the script or anything or indeed read it.

It's a big chunk of text, but we're just installing dependencies and Piper in this case.

So this is a wrapper script which just allows you to run something in Docker; it's quite useful for this, since you don't have to build containers all the time.
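The gist of a wrapper like that, sketched here in Python for consistency with the other examples (the image tag is just an example, not necessarily what I used):

    # Sketch of a Docker wrapper: run whatever command you pass it inside a
    # pinned CUDA/PyTorch container, with the current directory mounted, so
    # you don't rebuild an image for every tweak. The image tag is an example.
    import os
    import subprocess
    import sys

    IMAGE = "pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime"

    cmd = [
        "docker", "run", "--rm", "--gpus", "all",
        "-v", f"{os.getcwd()}:/work", "-w", "/work",
        IMAGE, *sys.argv[1:],
    ]
    sys.exit(subprocess.call(cmd))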

Building a clean training corpus

What about text gathering?

Gathering phrases

So there are several bits of software around that help you train a voice for Piper, and they expect you to sit there for four hours recording lots of different phrases to train a voice of your own.

So I grabbed a big list from one of those projects. The name escapes me at the moment.

The problem with stochastic TTS errors

The trouble is, being AI-powered, it's stochastic by nature, and sometimes it can just do crazy things and get stuff wrong.

So, going through this big list of phrases, one of them was 'zero'; it was trying to say 'zero' in this case. Let's listen to it.

"Beach and pious." As you can imagine, that's really going to mess up the training, potentially, and there's quite a lot of that.

And I think there were about 3,000 phrases; I wasn't going to sit there and verify all of them. So that's why it's a problem.

Automatic validation with Whisper

So yeah, verifying it without listening is the answer, and there's a project called Whisper from OpenAI, one of the few projects that allows them to technically live up to their namesake. Whisper is great; I mean, it's state of the art. It can understand anyone in any conditions, in my experience. It's brilliant, it's an open source project, and it's speech-to-text.
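Transcribing a clip only takes a few lines; something like this, where the model size and clip path are just examples:

    # Minimal Whisper sketch: transcribe one generated clip.
    # "base.en" is a small English-only model; larger ones are more accurate.
    import whisper

    model = whisper.load_model("base.en")
    result = model.transcribe("dataset/wav/0001.wav")
    print(result["text"])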

Connect the two together, though, and you have these problems: some things can sound the same but be spelled differently, or perhaps you've got a '0' instead of the word 'zero' spelled out; there could be punctuation differences, a lack of spaces, maybe other things too. So what's the answer here?

Using phonemes to compare outputs

Phonemes.

My daughters were learning about phonemes at school, and I figured it actually makes sense.

Leveraging eSpeak for IPA conversion

So, as part of how Piper works, it depends on something called eSpeak, the TTS engine I mentioned earlier, and that has a phoneme conversion system which can convert text to the International Phonetic Alphabet.

So here's a demonstration of that. If you tell it to convert your words into how they sound, it converts them into this IPA, the International Phonetic Alphabet, and it will tell you that A sounds the same as B: so 'color zero' and 'colour 0', say, come out the same. That allows us to actually properly verify the output.
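That conversion is available straight from the espeak-ng command line, so a small helper covers it; a sketch (the example words are mine):

    # Sketch: convert text to IPA phonemes by shelling out to espeak-ng.
    # -q suppresses audio; --ipa prints the International Phonetic Alphabet.
    import subprocess

    def to_ipa(text: str) -> str:
        out = subprocess.run(
            ["espeak-ng", "-q", "--ipa", text],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()

    print(to_ipa("colour zero"))  # compare with to_ipa("color 0")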

I mean, again, I'm not going to ask you to read that, but it's all available at the end.

This is just the script that goes through and does that; there are a few other hacks in there to try a few other things.
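Minus those hacks, the core of the verification looks something like this, reusing the to_ipa helper from the sketch above (paths follow the earlier examples; it's a sketch, not the actual script):

    # Sketch of the verification pass: keep a clip only when the IPA of what
    # Whisper heard matches the IPA of the text we asked for. Reuses to_ipa()
    # from above and the dataset layout from the earlier sketches.
    import csv

    import whisper

    model = whisper.load_model("base.en")
    kept, rejected = [], []
    with open("dataset/metadata.csv") as f:
        for clip_id, text in csv.reader(f, delimiter="|"):
            heard = model.transcribe(f"dataset/wav/{clip_id}.wav")["text"]
            if to_ipa(heard) == to_ipa(text):
                kept.append(clip_id)
            else:
                rejected.append(clip_id)
    print(f"kept {len(kept)}, rejected {len(rejected)}")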

Boosting validation accuracy

But doing that, I think I got the success rate from about 80% to 90-odd percent. And, you know, I was happy with that.

From text to training data

So there are a few more steps to convert that for training.
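Roughly, that's Piper's preprocessing step; a sketch per its training docs (paths are examples):

    # Sketch of Piper's preprocessing step (flags per its training docs):
    # turns the wavs + metadata.csv into the config and tensors training needs.
    import subprocess

    subprocess.run([
        "python3", "-m", "piper_train.preprocess",
        "--language", "en-us",
        "--input-dir", "dataset/",       # wavs + metadata.csv, LJSpeech layout
        "--output-dir", "training_dir/",
        "--dataset-format", "ljspeech",
        "--single-speaker",
        "--sample-rate", "22050",
    ], check=True)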

Training strategy and checkpoints

The nice thing about training in general is that, because I think this is PyTorch-based, the tooling is really good.

So what I did was train from a checkpoint. You can actually grab an existing Piper voice and sort of steer it in a different direction, and that's what I did.

Time, VRAM, and batch sizes

It took much less time; I think maybe 24 hours to do this training on this little Tesla GPU.

So yeah, again, here are the things I did to get it to work on the Tesla P4, because it doesn't have much VRAM; you can look at them at the end if you like.

So I had to limit the batch size, which is the amount of data it trains on at once. But other than that, it mostly worked.
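The training invocation then ends up looking roughly like this, per Piper's training docs (the batch size and epoch count are illustrative, not my exact values):

    # Sketch of the fine-tuning run (flags per Piper's training docs; the
    # checkpoint is whichever existing voice you're steering, and the batch
    # size is kept small to fit the Tesla P4's limited VRAM).
    import subprocess

    subprocess.run([
        "python3", "-m", "piper_train",
        "--dataset-dir", "training_dir/",
        "--accelerator", "gpu",
        "--devices", "1",
        "--batch-size", "8",                              # illustrative
        "--max_epochs", "6000",                           # illustrative
        "--resume_from_checkpoint", "existing-voice.ckpt",
        "--checkpoint-epochs", "1",
        "--precision", "32",
    ], check=True)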

I could have trained from scratch, though it's advisable to have a bigger corpus, a bigger training set, for that. Maybe I'll do that one day.

Monitoring with TensorBoard

TensorBoard is a bit of software that just allows you to see the training progress. Generally, when you're training, you're trying to minimize a loss function, and hopefully you can see that going down over time. Sometimes things go crazy and it just shoots off, or whatever, but this is right at the start, so there's not really any interesting stuff to see there.
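Launching it is just a matter of pointing it at the logs; a sketch, assuming the usual PyTorch Lightning log location under the training directory:

    # Sketch: launch TensorBoard against the PyTorch Lightning logs that
    # piper_train writes under the training directory (path is an assumption).
    import subprocess

    subprocess.run(
        ["tensorboard", "--logdir", "training_dir/lightning_logs"],
        check=True,
    )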

I've got a picture of the loss function going down in a bit. So, results: this is just an idea of the mistakes the model can make,

Loss curves and convergence

some unfortunate mistakes in some cases. Yeah, so this is the loss function. You can see the grey bit was actually when I decided to move the machine out of the closet because it was overheating; I had to take it back outside. And it does indeed converge, which is what you're looking for. It's quite noisy because of the batch size; apparently the noise would be quite a

Results and audio samples

bit less if the batch size was bigger and I was doing it on a big GPU. So this is the final result with Piper TTS: "System failure. Release Rinzler." Perfect. I might turn it down a little... I can't turn the volume down. OK, so that sounds all right, doesn't it? Kind of what I wanted, and it can run

General clarity test

on a Pi or something. Here it is: "A rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets, resulting in a spectrum of light

appearing in the sky." I just realised I can't pause it, so sorry about that. It sounds quite clear,

What I’d improve next

anyway. So, yeah, the conclusion: it's worth doing, and it's quite simple if you want a unique-sounding voice. What would I do after? Perhaps the training data wasn't perfect; I could have

clipped some silence from it. I could have trained from scratch. What about making voices?

Designing character voices

I kind of like games and films with evil AIs, such as the Red Queen from Resident Evil, or, I don't know if you guys have played System Shock 2, its evil AI Xerxes, or SHODAN from System Shock.

So I kind of did that.

Audio styling: stutter, echo, and reverb

The way these AIs speak is very kind of stuttery and echoey, with reverb.
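For a rough idea of that kind of effect, here's a sketch using sox's echo and reverb effects; the parameters are just illustrative starting points:

    # Sketch: add echo and reverb to a clip with sox, for that evil-AI feel.
    # Effect parameters are illustrative starting points, not a recipe.
    import subprocess

    subprocess.run([
        "sox", "voice.wav", "styled.wav",
        "echo", "0.8", "0.88", "60", "0.4",   # gain-in, gain-out, delay ms, decay
        "reverb", "50",                        # reverberance percentage
    ], check=True)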

So just for a bit of fun, I'm not gonna explain the context of this because it's very strange,

Letting an LLM write the lines

but I connected an LLM to make some very strange-sounding quotes, some sinister quotes, which you're about to see. Because it generates loads every day, I have no idea what it's going to sound

like or what it's going to say. In the worst case it could be lame, and that would be embarrassing, but in the best case it's going to sound sinister and I'll look OK.

Sample sinister quote

Do not underestimate the dark side of institutions.

Conclusion

And that's it, so thank you very much.
