This is just a bit of fun, it's nothing to do with my day job.
So this is training a new AI voice for Piper TTS, which is text-to-speech, with only four words.
So has anyone heard of Piper at all? Piper is quite a lightweight text-to-speech engine. It's AI-powered and it's pretty good.
It kind of exists alongside quite a few other engines that have been around for several years, such as eSpeak, Festival, and Pico TTS.
The trouble is they all sound quite robotic. They're not fantastic. They're quite easy to understand, but they don't sound very natural.
Here's eSpeak.
Unfortunately, that's not going over the speakers. You'll have to rely on my laptop speakers.
Very, very robotic.
Here's Festival. Festival sounds a little better.
This was a project from the University of Edinburgh.
They also released some commercial voices which you can't have, which wasn't great.
The reason I like these voices and use them is that I've always had some kind of home automation thing over the years, alert systems, that sort of thing. I've always wanted to be able to use a decent voice with this system, something better than Alexa, because Alexa is somewhat evil.
So here's Piper. Piper came about a few years ago and it's closely associated with the Home Assistant project. I'm not sure if a lot of you have used it or heard of it. I bet most of you have, considering this audience.
So yeah, Piper sounds a lot more natural. "System failure. Release Rinzler."
What about heavyweight AI? So there's something called Chatterbox, and you may be familiar with AI voice cloning.
Generally, voice cloning is quite expensive for both training and inference. Technically, Chatterbox is quite amazing, really. You literally give it a few words and it will clone a voice.
Of course, you need a fairly beefy GPU to run that, and the problem with that is it's expensive. You don't really want that running in your closet at home to power your home automation system.
So here it is. There's actually a demo on Hugging Face and it's fairly amazing. You can go and check that out: literally say something, or stick in a sample like I did from that Festival demo of one of the commercial voices.
So, training a Piper voice traditionally. Piper, as I mentioned, is fairly low-resource, so you can run it on a Raspberry Pi, but you need a fair amount of beef to train it, and you also need a fairly large training data set of, you know, thousands of phrases and so on.
So you've probably figured out where I'm going with this.
So here's some text, a test generated by this mystery commercial TTS engine. I've just said what it is. "System failure. Release Rinzler." Sounds a bit better. I wanted to clone it because I think it sounds sinister, and I want my AI voice to sound sinister.
So here's that same phrase regenerated with Chatterbox TTS. I literally fed that into Chatterbox. "System failure. Release Rinzler." Sounds about the same, which is pretty cool.
So the idea, of course, is to generate loads and loads of phrases with Chatterbox TTS, and you can run it just on your desktop PC. I left it churning away for several days making lots of phrases.
My GPU and my computer are a bit rubbish. It's about 10 years old, a 980 Ti, so the architecture was too old. So a mate of mine lent me his Tesla GPU, which is here, and it was during a heat wave in the UK, so this was running outside so it didn't overheat. It was ridiculous.
It was a very janky test setup, as you can see, with a really old CPU, but it doesn't matter. The CPU doesn't really matter in this case.
It all runs on the GPU to do the training.
So Piper TTS, they provide documentation on how to do this training. It's not as well supported as the rest of the project, so you've got to figure some stuff out along the way.
So you can see, it's a bit small, but yeah, this is a terminal screenshot. Of course, I've got to have htop in the terminal screenshot. It's the law.
Anyway, that shows the GPU being a bit hot.
Software-wise, I like NixOS. I use that with Docker just to make things easier and reproducible.
I've met a fair bit of dependency hell in this process, but Docker certainly helped with that just to figure it all out.
I'm going to show you a few scripts. These are all available at the end if you want to run them yourself, so I don't expect you to memorize the script or anything or indeed read it.
It's a big chunk of text, but we're just installing dependencies and Piper in this case.
So this is a wrapper script, which just allows you to run something in Docker, quite useful for this. You don't have to build containers all the time.
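The wrapper script itself is not reproduced in this transcript. As a rough sketch of the same idea in Python, assuming a pre-existing image (the name `piper-train:local` is hypothetical) and a Docker install with NVIDIA GPU support, the current directory can be mounted into a throwaway container like this:

```python
import os
import shlex
import subprocess


def docker_command(image, command, workdir=None, gpus="all"):
    """Build a `docker run` argument list that runs `command` in a throwaway container."""
    workdir = os.path.abspath(workdir or os.getcwd())
    args = ["docker", "run", "--rm"]           # --rm: delete the container afterwards
    if gpus:
        args += ["--gpus", gpus]               # needs the NVIDIA container toolkit on the host
    args += ["-v", f"{workdir}:/work",         # mount the project directory
             "-w", "/work",                    # and start inside it
             image]
    return args + list(command)


def run_in_docker(image, command, **kwargs):
    """Print the command, then run it, raising if it exits non-zero."""
    args = docker_command(image, command, **kwargs)
    print("+", shlex.join(args))
    return subprocess.run(args, check=True)


if __name__ == "__main__":
    # Hypothetical image name; only prints the command rather than running it.
    print(shlex.join(docker_command("piper-train:local", ["nvidia-smi"])))
```

Because the project directory is mounted rather than copied, scripts can be edited on the host and re-run without rebuilding the image.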
What about text gathering?
So there are several bits of software around that allow you to train for Piper, and they expect you to sit there for four hours reading out lots of different phrases to train a voice of your own.
So I grabbed a big list from one of those projects. The name escapes me at the moment.
The trouble is, being AI-powered, and AI is stochastic by its nature, sometimes it can just do crazy things and get stuff wrong.
So going through this big list of phrases, one of them was "zero". It was trying to say "zero" in this case, and, like, let's listen to it.
"Beach and pious." As you can imagine, that's really going to mess up the training, potentially. There's quite a lot of that.
And I think there were about 3,000 phrases. I wasn't going to sit there and verify all of them. So that's why it's a problem.
So yeah, verifying it with Whisper, without listening, is the answer. There's a project called Whisper from OpenAI, and that's one of the few projects that allows them to technically live up to their name. Whisper is great. I mean, it's state of the art. It can understand anyone in any conditions, in my experience. It's brilliant. It's an open source project, and it's speech-to-text.
So connect the two together. You do have these problems, though: some things can sound the same but be spelled differently, or perhaps you've got a "0" instead of the word "zero" spelled out. You could have punctuation, lack of spaces, maybe other things too. So what's the answer here?
Phonemes.
My daughters were learning about phonemes at school, and I figured it actually makes sense.
So as part of how Piper works, it depends on something called eSpeak, the TTS engine I mentioned earlier, and that has a phoneme conversion system which can convert text to the International Phonetic Alphabet.
So here's a demonstration of that. If you tell it to convert your words into how they sound, it converts them into IPA, the International Phonetic Alphabet, and it will tell you that A is B. So "color zero", "color zero": the same sounds. That allows us to actually properly verify the output.
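As a minimal sketch of that check (not the script from the talk), the intended phrase and the speech-to-text transcript can both be converted to IPA and compared. This assumes `espeak-ng` is installed and on the PATH; the function names, the `en-us` voice, and the choice to ignore stress marks are illustrative assumptions.

```python
import subprocess


def espeak_ipa(text, voice="en-us"):
    """Ask espeak-ng for the IPA phonemes of `text` without playing audio."""
    result = subprocess.run(
        ["espeak-ng", "-q", "--ipa", "-v", voice, text],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def normalise(ipa):
    """Drop whitespace and stress marks so only the sounds are compared (an assumption)."""
    return "".join(ch for ch in ipa if not ch.isspace() and ch not in "ˈˌ")


def sounds_the_same(expected_text, heard_text, phonemize=espeak_ipa):
    """True when both texts reduce to the same phoneme string."""
    return normalise(phonemize(expected_text)) == normalise(phonemize(heard_text))
```

Here `heard_text` would be whatever the speech-to-text step returned for the generated clip. Clips where the comparison fails can be regenerated or dropped instead of being listened to by hand.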
I mean, again, I'm not going to ask you to read that, but it's all available at the end.
This is just the script that goes through and does that. There are a few other hacks in there to try a few other things.
But doing that, I think I got the success rate from about 80% to 90-odd percent. And I was, you know, I was happy with that.
So there are a few more steps to convert that for training.
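The conversion steps are not spelled out in the talk. One plausible piece, sketched here under the assumption that the trainer reads an LJSpeech-style metadata file with one `id|text` line per clip, is writing out only the clips that passed verification. The tuple layout and function name are hypothetical.

```python
def metadata_lines(clips):
    """Yield `id|text` lines for verified clips.

    `clips` is an iterable of (clip_id, text, verified) tuples; this layout
    is an assumption made for the sketch.
    """
    for clip_id, text, verified in clips:
        if not verified:
            continue  # skip clips whose audio did not match the text
        # A pipe inside the text would break the id|text format, so replace
        # it, then collapse runs of whitespace into single spaces.
        clean = " ".join(text.replace("|", " ").split())
        yield f"{clip_id}|{clean}"


if __name__ == "__main__":
    clips = [("0001", "System failure.", True), ("0002", "zero", False)]
    print("\n".join(metadata_lines(clips)))
```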
The nice thing about training in general, because I think this is PyTorch-based, is that the tooling is really good.
So what I did was train from a checkpoint. So actually, you can grab an existing Piper voice and sort of steer it in a different direction. And that's what I did.
It took much less time, I think it took maybe 24 hours to do this training on this little Tesla GPU.
So yeah, again, here are the things that I did to get it to work on the Tesla P4, because it doesn't have much VRAM, and you can look at it at the end if you like.
So you had to limit the batch size, which is the amount of stuff that it trains on at once. But other than that, it mostly worked.
I could have trained from scratch and it's advisable to have a bigger corpus, bigger training set for that. Maybe I'll do that one day.
TensorBoard is a bit of software that just allows you to see the training progress. So generally when you're training, you're trying to minimize a loss function, and hopefully you can see that going down over time. Sometimes things go crazy and it just shoots off or whatever. But this is right at the start, so there's not really any interesting stuff to see there.
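To judge whether a noisy loss is really going down, one common approach is to smooth the logged values with an exponential moving average. The sketch below is a generic, bias-corrected version of that idea, not code from the talk, and the `weight` default is an arbitrary assumption.

```python
def smooth(values, weight=0.9):
    """Bias-corrected exponential moving average of a sequence of loss values.

    `weight` must be in [0, 1); higher values smooth more heavily.
    """
    smoothed = []
    running = 0.0
    for step, value in enumerate(values, start=1):
        running = weight * running + (1.0 - weight) * value
        # Dividing by (1 - weight**step) corrects the bias from starting at zero.
        smoothed.append(running / (1.0 - weight ** step))
    return smoothed


if __name__ == "__main__":
    noisy = [5.0, 3.5, 4.2, 2.9, 3.3, 2.1, 2.6, 1.8]
    for raw, avg in zip(noisy, smooth(noisy)):
        print(f"{raw:5.2f}  {avg:5.2f}")
```

If the smoothed curve keeps falling while the raw values bounce around, the run is probably still converging.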
I think I might have, yeah, I've got a picture of the loss function going down in a bit. So, results. This is just an idea of the mistakes the model can make. Some unfortunate mistakes in some cases.
So this is the loss function. You can see, actually, the gray bit was when I decided to move the machine out of the closet because it was overheating. I had to take it back outside. And it does indeed converge, so that's what you're looking for. It's quite noisy because of the batch size. Apparently the noise would be quite a bit less if the batch size was bigger and I was doing it on a big GPU.
So this is the final result with Piper TTS. "System failure. Release Rinzler." Perfect. I might turn it down a little. I can't turn the volume... okay. So that sounds all right, doesn't it? Kind of what I wanted, but it can run on a Pi or something.
Here it is: "A rainbow is a meteorological phenomenon that is caused by reflection, refraction and dispersion of light in water droplets, resulting in a spectrum of light appearing in the sky."
Just realized I can't pause it, so sorry about that. It sounds quite clear.
Anyway, so, the conclusion: it's worth doing, and it's quite simple if you want a unique-sounding voice. What would I do after? Perhaps the training data wasn't perfect. I could have clipped some silence from that. I could have trained from scratch. What about making voices?
I kind of like games and films with evil AIs, such as, yeah, the Red Queen from Resident Evil. I don't know if you guys have played System Shock 2, with an evil AI, XERXES, or SHODAN from System Shock.
So I kind of did that.
The way these AIs speak is very kind of stuttery and echoey, there's reverb.
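The effects chain used in the talk is not described. As a toy sketch of the general idea, an echo can be made by mixing delayed, quieter copies of the signal back in, and a stutter by repeating a short slice. The code works on a plain list of float samples, and every parameter value is an arbitrary assumption.

```python
def add_echo(samples, sample_rate, delay_s=0.12, decay=0.45, repeats=3):
    """Mix `repeats` delayed copies of the signal back in, each quieter than the last."""
    delay = round(delay_s * sample_rate)
    out = list(samples) + [0.0] * (delay * repeats)  # room for the echo tail
    for n in range(1, repeats + 1):
        gain = decay ** n
        offset = delay * n
        for i, sample in enumerate(samples):
            out[offset + i] += gain * sample
    return out


def stutter(samples, sample_rate, at_s, length_s=0.08, times=3):
    """Repeat the slice starting at `at_s` seconds `times` extra times."""
    start = round(at_s * sample_rate)
    end = start + round(length_s * sample_rate)
    chunk = list(samples[start:end])
    return list(samples[:end]) + chunk * times + list(samples[end:])
```

Summing echoes can push samples outside the original range, so the result may need scaling down before it is written to a file.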
So just for a bit of fun, and I'm not going to explain the context of this because it's very strange, I connected an LLM to make some very strange-sounding quotes and some sinister quotes. What you're about to see, because it generates loads every day, I have no idea what it's going to sound like or what it's going to say.
In the worst case it could be lame, and that would be embarrassing, but in the best case it's going to sound sinister and I'll look okay.
Do not underestimate the dark side of institutions.
And that's it, so thank you very much.