Yeah, so I'm Paolo Di Prodi.
Well, originally I was a neuroscientist, so I did a lot of work on the human brain and the animal brain, modeling parts of their functionality. But I've also always been a bit of a nerd: as a child I loved to break things apart and rebuild them, not always in the same order, so I've been hacking since a very young age.
My main job is cybersecurity, so I do a lot of work defending against attackers in the cybersecurity space.
But today I'm not talking about cybersecurity at all; I'm going to talk about speech-to-text.
How many people know what that thing on the top right is? Only two or three people? I'm amazed.
That device is called the Plumbus, and I think it was introduced in Rick and Morty season two or three. It's kind of a metaphor for all this transformer technology: when you look at these objects, it's like, what the heck is that?
How does it even work, right? And in the cartoon they show how the object is made, which is very confusing: all these parts with obscure names and whatnot, right?
I think most of these systems are like that: when you look at them, it's like, what does that even mean? So this presentation is going to be more fun and try to explain how these transformers work in the audio space.
So as I mentioned before, I did a lot of electronic and software engineering. I was actually a lecturer at the University of Glasgow teaching DSP, which is digital signal processing.
And I don't know how many of you have used Motorola chips or these kinds of old microcontrollers, you know, eight bits,
or 16 bits. I remember one of my teachers saying we were going to do 8.5 bits, and I was like, how does that work, can you have fractional bits? That's a story for later. But yeah, I've done a lot of that. Essentially we would program these chipsets to do some sort of distortion or voice modification, or to process something serious like detecting birds.
Or something more serious like heart rate: you can put a microphone on your body, or even inside your body, and try to detect the heartbeat and anomalies in the blood flow. There's really cool stuff you can do with DSPs nowadays.
And many, many years ago we were actually doing text-to-speech and speech-to-text, and it was kind of challenging. I mean, we didn't have transformers back in those days.
It was mostly based on phonemes. A phoneme is basically a unit of sound you produce by modulating your voice with your vocal cords in your throat, your mouth and your lips, right?
And essentially you take the microphone input and digitize it. There was actually a talk in a previous webinar where somebody was talking about ADCs and DACs. It's the same concept: you basically digitize the voice.
You basically have to look at these patterns. Is that an A? Is that a B?
Obviously that changes from speaker to speaker. And while speech is universal, different languages have different tones.
I tried to learn Mandarin Chinese. And if you're over five years old, you just can't get the tones. There are four tones in Mandarin, and you simply can't learn them, right?
I mean, unless you're a prodigy; maybe at 20 you can still learn them. But I just couldn't get the tones. As you can imagine, when I was in China I couldn't really communicate: there are certain words that mean completely different things depending on the tone. I was trying to ask for something and it was like, what, did you just say butt? So anyway, it's kind of an interesting problem, right?
So when we talk about speech-to-text, the principle is quite simple, right? You take a human voice from a microphone and you sample it, so you have an analog-to-digital converter, which is usually a chipset.
You get these formats: MP3, MP4, WAV, and many others you can use. Some of them are lossless and keep the entire audio spectrum; others use compression or resampling to keep the file a decent size.
Then you do the processing, which can be language specific or format specific. You can run it on a CPU, a GPU nowadays, DSPs, ASICs, FPGAs. I've done all of that.
And it doesn't always have to be real time: you can also do batch processing, where you record something first and have plenty of time, and at the end of that you can transcribe or translate it.
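Just to make the front end concrete, here's a minimal sketch of loading a recording and resampling it to the 16 kHz mono format most of these pipelines expect. It assumes librosa is installed, and "speech.wav" is a hypothetical file name.

```python
# Minimal sketch: load a recording and resample it the way most
# speech-to-text front ends expect (16 kHz, mono, float32).
# "speech.wav" is a hypothetical file name.
import librosa

audio, sr = librosa.load("speech.wav", sr=16000, mono=True)
print(f"{len(audio)} samples at {sr} Hz = {len(audio) / sr:.1f} s of audio")
```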
I think I've seen some demos where Meta was doing live translation and it failed. There are all these funny clips; people are trying to do it, but it's very hard, right? Especially on a low-power device.
And at the end of that, you get text. But actually it's not just text: punctuation is really important, right? Because if you're listening to a lecture and the commas and semicolons are in the wrong place, you might get the wrong meaning.
Language is sometimes ambiguous. Also, you might want to distinguish multiple speakers. If you're talking in a meeting, like on Teams, which is terrible and never gets it right, and you have five people, obviously you don't want to mix up what one person said with another, right?
Someone might say, hey, you're fired, and you want to know who said it: the CEO or the CTO, right? So it's a really big problem.
And we had some very fun stories with siblings and identical twins who basically had the same tone spectrum. I don't know if people are able to solve that problem now, but things you would think are easy, like distinguishing two people, are actually very hard, right?
Voice. A lot of people have tried to use voice recognition for passwords. That never worked, right? I've worked on that, and I think I've seen some movies where people do that, but it just doesn't work.
It's not that unique, and it's very easy to fool.
So I found this report tracking progress from 2013 onwards; I think it was from Google, so maybe it's biased. The main success metric here is essentially the word error rate, right? If I give you one minute of speech and you get 100% of the words wrong, that's zero accuracy and it's totally useless. In 2013, with these modern techniques, we started off at something like 78%.
And now I think we are over 95%, right? But there are other important things: maybe you get the word right but the spelling wrong. And you have homonyms: if I say 'race', it can mean two different things, so it depends on the context. There are some subtleties there.
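Just to make that metric concrete, here's a minimal Python sketch of word error rate as a word-level edit distance (substitutions, deletions, and insertions divided by the number of reference words); the example sentences are made up.

```python
# Hedged sketch of word error rate (WER): word-level edit distance
# divided by the number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("ask not what your country can do for you",
          "ask now what your country can do for you"))  # 1 error / 9 words ≈ 0.11
```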
Noise: you're in a tram or a bus, and you still want to get it right. Accents: I lived in Scotland for about ten years and I never got the Scottish accent. It's really hard, right? If you train your model on a British Cambridge speaker, it's not going to work on a Scottish speaker. And believe me, I've seen people from Cambridge who got maybe 1% accuracy when they were trying to communicate there.
And then the other thing is time to process: how long does it take to actually produce the text, right?
With this kind of deep learning, I think the first really interesting result was Whisper, which was produced by OpenAI. It's probably one of the decent things they have done.
What they did was take 680,000 hours of multilingual and multitask supervised data collected from the web; not sure if it was legal or illegal, I think they're still figuring that out in court, but they did it.
And there are some examples there, like the English transcription.
So, you know, the blue one is the actual audio, right?
That was from JFK, I think: ask not what your country can do for you. And then a human transcriber had written out the actual speech.
But they also add any-to-English translation, so something like "rápido zorro marrón salta sobre", which I think is Spanish, and they basically translate that to English. Or non-English transcription; I think that example was Korean. And also non-speech: there's music playing and obviously you don't want to transcribe anything.
So the way it works, and let's see if somebody can spot the object in the middle. Anyone? No?
Knight Rider, good, yeah. Good TV series.
And so the way it works, you have a transformer architecture, which you probably already know by now, but it starts from the bottom left: you take the audio and produce a log-mel spectrogram, which is kind of like what KITT was doing in the show, with those three bars. That's not really a log spectrogram, it's an equalizer, but it gives you the idea.
So you have to translate the time-domain signal into a frequency-domain signal, which is done with a windowed FFT. Just that part involves a lot of DSP maths to do the Fourier calculation, right? And I think it would be interesting to see if the Hive can find a better implementation of the butterfly algorithm, which dates back to the 1960s.
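Here's a hedged sketch of that log-mel front end using librosa; the window, hop, and mel-band values (25 ms window, 10 ms hop, 80 mel bands at 16 kHz) match how Whisper's front end is commonly described, but treat them as assumptions rather than a definitive spec.

```python
# Minimal sketch of a log-mel spectrogram front end (windowed FFT + mel
# filterbank + log). Parameter values are assumptions; "speech.wav" is a
# hypothetical file.
import librosa

audio, sr = librosa.load("speech.wav", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(
    y=audio, sr=sr, n_fft=400, hop_length=160, n_mels=80)  # 25 ms windows, 10 ms hop
log_mel = librosa.power_to_db(mel)                         # log scale, in dB
print(log_mel.shape)  # (80 mel bands, number of 10 ms frames)
```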
I remember teaching that FFT in every course, and everybody was like, how the heck did they come up with that idea, right? And I think we're still using the same algorithms in DSP now. So that would be kind of interesting; you could make a lot of money for sure if you found a better one.
So you do all this magic, which you can actually see in the source code. If you want to study DSP, that code is quite interesting; there are a couple of bugs, but in general it's really cool if you want to see how they do it efficiently.
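For the butterfly idea itself, here is a toy recursive radix-2 decimation-in-time FFT; this is just an illustration, not the whisper.cpp implementation, and the input length has to be a power of two.

```python
# Toy radix-2 Cooley-Tukey FFT showing the butterfly: each output pair is
# one addition and one subtraction of an even-half term and a twiddled
# odd-half term.
import cmath

def fft(x):
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])   # FFT of even-indexed samples
    odd = fft(x[1::2])    # FFT of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle           # butterfly add
        out[k + n // 2] = even[k] - twiddle  # butterfly subtract
    return out

print([round(abs(v), 3) for v in fft([1, 0, 0, 0, 1, 0, 0, 0])])  # [2, 0, 2, 0, ...]
```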
And then in the second step they apply one-dimensional convolutions. So basically you get these neurons that just convolve over the time-frequency representation.
You obviously have to encode the position via sinusoidal encoding, which is more maths, but essentially you want to know where each part of the speech sits in the dialogue, so the position is very important. You basically go around a circle with sines and cosines, and that encodes the position. Obviously you're limited by floating-point precision: you can do a full circle, but after a few minutes you basically run out of distinct angles, right? So there's a lot of floating-point maths in there too.
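As a rough illustration, here's a minimal sketch of sinusoidal positional encoding in the style of the original Transformer paper: each position gets a vector of sines and cosines at geometrically spaced frequencies. The sizes used below are assumptions for the example, and Whisper's exact variant may differ in detail.

```python
# Hedged sketch of sinusoidal positional encoding: sin/cos pairs at
# geometrically spaced frequencies, one vector per position.
import numpy as np

def sinusoidal_positions(n_positions: int, d_model: int) -> np.ndarray:
    pos = np.arange(n_positions)[:, None]     # (n_positions, 1)
    dim = np.arange(0, d_model, 2)[None, :]   # even embedding dimensions
    angle = pos / (10000 ** (dim / d_model))  # one angle per (position, frequency)
    enc = np.zeros((n_positions, d_model))
    enc[:, 0::2] = np.sin(angle)              # sine on even dimensions
    enc[:, 1::2] = np.cos(angle)              # cosine on odd dimensions
    return enc

pe = sinusoidal_positions(n_positions=1500, d_model=384)  # sizes are illustrative assumptions
print(pe.shape)  # (1500, 384)
```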
And then the more boring part: you have these encoder blocks, stacked deep, encoder on encoder on encoder, with the usual attention, and then the decoder blocks that basically reconstruct the prediction. It's the same as the GPT approach: you get discrete tokens, which here come from frequencies and amplitudes, and you want to figure out the sequence corresponding to the actual text.
Plus you have some extra tokens, like start-of-transcript and the actual language, because sometimes you already know which language you're listening to, so you can give it a hint. Then you have the start time, and then the actual tokens of the English text. That's the overall picture; it gets more complicated, because they also added things like language identification.
So, can it also guess the language, or can I give it a hint? And voice activity: I don't want to waste tokens just listening to nobody speaking. They also built in something so that, in theory, if there's a dog barking it shouldn't trigger. There are some really cool examples; maybe next time I can show them, it gets too complicated.
But in theory there are several blocks that essentially help you segment the actual text, get the timestamps, and give some sort of bias to what you're about to listen to, right? And then an end-of-transcript token, so you basically just finish and you get the actual text.
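To make that flow concrete, here's a minimal sketch with OpenAI's open-source `whisper` Python package, assuming it's installed (`pip install openai-whisper`) and that "speech.wav" is a hypothetical local file; the language hint and the task map onto the special tokens described above, and the segments come back with timestamps.

```python
# Minimal sketch using the open-source `whisper` package; "speech.wav"
# is a hypothetical file, and the language hint is optional.
import whisper

model = whisper.load_model("tiny")   # tiny / base / small / medium / large
result = model.transcribe("speech.wav", language="en", task="transcribe")
print(result["text"])
for seg in result["segments"]:       # each segment carries start/end timestamps
    print(f'[{seg["start"]:.2f} -> {seg["end"]:.2f}] {seg["text"]}')
```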
So if you know transformers, obviously you can have a very large transformer, which needs a lot of data, or a smaller one, which is more efficient in terms of compression but maybe loses some accuracy. So they did seven, from tiny to turbo, turbo like in the cartoon, and the number of parameters in the transformer goes from about 39 million to about 800 million. They take different amounts of GPU RAM, so if you want to run the tiny one, which I'm going to demo today, you can even run it in your browser; but if you go to one that needs about 10 gigabytes, you probably need a video-game sort of computer, right?
And then you have the relative speed. The tiny one is actually quite fast, but maybe not so accurate. So there are trade-offs, right?
When you look at these charts, I was kind of impressed, because why is Dutch the language with the lowest error rate? I just couldn't explain this; I didn't have time to dig in.
But if you look at the top, there's Dutch, Spanish, Korean, and Italian, so I'm very proud to be in fourth place. English, though, is quite far down, and I couldn't find an explanation for that.
There's even Turkish at around 12%. So yeah, the word error rate depends on the actual language, and it gets quite big towards the bottom of the chart; there are very large errors.
And I think it's some combination of the samples they got from the internet. Maybe there were more examples of Dutch, which I can't believe is even possible.
I don't know, there aren't even that many movies in Dutch. I don't know where they got those samples; I just couldn't explain some of these things.
But obviously some languages come from the same roots, so you can find similarities in the phonemes; for example, Italian and Arabic have some things in common. So there's probably something like that going on, but I couldn't find anyone explaining it, just some hypotheses.
So then, just to give you an idea: if you want to transcribe or translate a certain number of minutes of audio, depending on the implementation you might get faster or slower performance, and also different precision. For example, if you take the standard implementation from OpenAI, which I think is still closed source, it will take two minutes. But if you take the one I'm using today, whisper.cpp, which is open source and uses flash attention, it takes about one minute, right? So the performance improved by almost 50%.
And Faster-Whisper, from some guy I found online, optimizes with batching, which is another technique: instead of processing the audio sequentially, you chunk it and run the chunks in parallel on your GPU, and you get even faster, right?
So something like 13 minutes of audio in 16 seconds, which is kind of impressive, right? With 8-bit integer precision.
So essentially there are a bunch of implementations that are open source. You can compile them for different architectures and pick different model sizes, and depending on the speed and the accuracy you need, you choose one model or the other, right?
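As a hedged example of that batched, quantized path, here's a minimal sketch with the faster-whisper package (the CTranslate2-based one), assuming it's installed and again using a hypothetical file name; compute_type="int8" is the 8-bit integer precision mentioned above.

```python
# Minimal sketch with faster-whisper; "speech.wav" is a hypothetical file.
from faster_whisper import WhisperModel

model = WhisperModel("base", device="cpu", compute_type="int8")  # int8 quantization
segments, info = model.transcribe("speech.wav", beam_size=5)
print(f"Detected language: {info.language} (p={info.language_probability:.2f})")
for seg in segments:                       # segments carry start/end timestamps
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```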
So what I'm going to do now, hopefully, if the demo works, because I need to use my microphone, is show something I adapted; you can see it, right? Somebody was crazy enough to compile whisper.cpp to WebAssembly, so it actually runs in your browser, which here is Chrome, right?
So I'm going to refresh it now. Okay. Ta-ta-ta.
So I already downloaded the tiny model, which you can see there. Just to give you an idea of the main parameters: how many mel bands does the spectrogram use? There are 80 of them.
How many languages does it support? 99. Then there's the attention mechanism and its size, the buffers, and how many threads it's using from my CPU: eight. Okay, okay, okay.
So I'm a chess geek, so I'm not going to play this for long, but... Okay, just to give you this, so I'm going to, let's see. Oh, sorry, I made a mistake.
Pawn, b2 to b4.
Okay, pawn... it got b6. Wait, wait a minute, I think it's the markers... no, the problem is probably that it's picking it up from the audio feedback. Let me see. Okay, oh yeah, it was there. One second. Okay, let me try again: pawn, b2 to b4.
Okay, so you can see, yeah, it got it with something like 65% confidence; it also thought it could have been b2 to b3 with some probability, because this is a very tiny model, so you lose some reliability. But it's something you can play with. And I'm just going to show you this one, if it's still working.
What you can do is basically batch mode. I'm using a base model now, which is bigger; it runs and, I think that was the JFK speech, it essentially transcribes the actual text with the timestamps, you can see zero to eleven seconds, and it gives you all the stats, right?
So yeah, you can go online and basically play with these things; most of them are open source. Obviously you need enough hardware to run them, but just have fun with that. And if you need any help, just send me an email; I'm also on LinkedIn.
My email is there; that's my personal email, so if you want to send me something, just use that.
That's yeah, that's enough for the demo.