Bridging the Gap Between Computer Vision and Emotional Intelligence - Luis Gomez-Acebo

Hello, everyone. I am Luis. I'm studying computer science and AI, and I'm also working as a research assistant at IU University.

Introduction

Today, we're going to talk about bridging the gap between computer vision and emotional intelligence.

So we already know AI can talk to us. We can have a conversation with AI through ChatGPT.

However, to understand humans, we have to go beyond verbal communication. But can computers get there? Can computers see? Can computers do all of that? Well, I argue that with computer vision, they can.

Understanding Computer Vision

So what is computer vision? Computer vision is an interdisciplinary field combining elements from AI, neuroscience, and signal processing. It allows us to derive meaningful information from images, videos, and any visual input. Its goal is to mimic human visual systems.

But why is it useful? Well, it's applied in Tesla's cars, our friend Elon Musk's company. It can also be used in healthcare, in education, in mass surveillance.

So I'm gonna explain a bit of the core technologies to get to neural networks and CNNs. So I'm gonna get a bit technical, so please follow me.

Core Technologies in Computer Vision

So image filtering. Image filtering is a fundamental technique in image processing. It allows us to enhance and extract information from images.

We apply a filter, or kernel, to an image, and then we modify each pixel value based on a function of the values of its neighboring pixels. So for example, we can use image filtering for a simple use case like blurring an image, or for edge detection.
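To give a rough sense of what applying a kernel looks like in practice, here is a minimal sketch; the tiny array and the kernel choices are just illustrative examples, not anything from the talk.

```python
import numpy as np
from scipy import ndimage

# A tiny grayscale "image": pixel values from 0 (dark) to 255 (bright)
image = np.array([
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
], dtype=float)

# Box blur: each output pixel is the average of its 3x3 neighborhood
blur_kernel = np.ones((3, 3)) / 9.0
blurred = ndimage.convolve(image, blur_kernel, mode="nearest")

# Horizontal-gradient kernel: responds strongly at vertical edges
edge_kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)
edges = ndimage.convolve(image, edge_kernel, mode="nearest")

print(blurred)
print(edges)  # large values where dark meets bright
```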

The second core technology we're going to talk about is feature detection and matching. Feature detection is used in object recognition and scene reconstruction. It allows us to identify key features and points in an image and capture their essential properties, and then we can match those features between different images.
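As a concrete illustration, here is a minimal sketch of detecting and matching keypoints between two views of a scene using OpenCV's ORB detector; the file names are placeholders I made up, and the parameter values are arbitrary choices.

```python
import cv2

# Load two views of the same scene as grayscale images
# (the file names here are just placeholders)
img1 = cv2.imread("view_a.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("view_b.png", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute descriptors with ORB
orb = cv2.ORB_create(nfeatures=500)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Match descriptors between the two images (Hamming distance suits ORB)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

print(f"{len(matches)} matched features between the two images")
```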

The Evolution of Machine Learning in Vision

So machine learning was crucial for computer vision. However, it's gotten a bit left behind. We used notable algorithms such as SVMs (support vector machines) and decision trees, enhanced by random forests. However, since the advent of deep learning, we really don't use them anymore.

Moving on to deep learning. Well, deep learning actually revolutionized computer vision with neural networks, and especially convolutional neural networks, also referred to as CNNs. CNNs allow us to process pixel data and leverage spatial hierarchy.

When we first built and developed CNNs, we actually first tested them out with this dataset. I just wanted to give you guys an example, because it's actually one of the most famous ones. It's called the MNIST dataset.

It actually became so famous that even Zalando, the clothing company, made their own version of it, Fashion-MNIST. Right here, you can see. I found it pretty funny. OK, moving back.
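For reference, here is one common way to load both datasets, a minimal sketch assuming TensorFlow/Keras is installed (the built-in loaders shown here are standard, but the talk doesn't prescribe any particular tooling).

```python
from tensorflow.keras import datasets

# Classic MNIST: 70,000 grayscale 28x28 images of handwritten digits 0-9
(x_train, y_train), (x_test, y_test) = datasets.mnist.load_data()

# Zalando's Fashion-MNIST: same format, but clothing items instead of digits
(fx_train, fy_train), (fx_test, fy_test) = datasets.fashion_mnist.load_data()

print(x_train.shape, fx_train.shape)  # (60000, 28, 28) each
```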

Convolutional Neural Networks (CNNs) Explained

So CNNs. Well, we can actually separate CNNs into two parts. First, we have the feature extraction, and then we have the classification.

So let's start off with an input image. An image in a digital device is stored as a matrix, a 2D matrix of values. Actually, in digital cameras, there are three channels, red, green, and blue, also known as RGB. However, for simplicity, we'll just keep it as one.

Each pixel will have values from 0 to 255, 0 being dark and 255 being bright. So moving on to convolutional layers. Convolutional layers actually get their name from the operation that is applied.

Convolution is an operation that combines two functions to produce a third one. In this case, we take the dot product of the input image with the filter, also called the feature detector, at every position. That small matrix of weights is often referred to as a kernel.

How does the kernel work? Kernels allow us to do quick computations, producing a new output. And as you can see, the kernel will scan through the image, detecting its feature, and then give us a feature map with all of the features of the initial image.

However, images will have different shapes, different edges, different textures, so we'll use different kernels to look for different patterns. When starting off, we'll use simple kernels for simple shapes and lines, such as horizontal lines, vertical lines, some corners. That's what we'll begin looking for.
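Here is a minimal sketch of the sliding-kernel mechanics just described, in plain NumPy; the toy image and the horizontal-line kernel are illustrative values I chose, not the talk's demo data.

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over the image, taking a dot product at each position."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    feature_map = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

# Toy 6x6 image containing a bright horizontal stroke
image = np.zeros((6, 6))
image[2, 1:5] = 255

# A simple horizontal-line detector: bright row between darker rows
kernel = np.array([
    [-1, -1, -1],
    [ 2,  2,  2],
    [-1, -1, -1],
], dtype=float)

print(convolve2d(image, kernel))  # strongest responses along the stroke
```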

Demonstration of CNNs

Well, to start off, I wanted to show you guys a little demo on how all of that would work. So for example, let's draw a number. We can draw a five.

So as you can see, let me hide everything so you guys don't get mixed up. So here we have our input image.

We will have six different kernels. Here you can imagine that if we had RGB, we would have 18 instead of six, which would be a lot more complicated. This kernel, as you can see, will start identifying the horizontal lines of the number 5.

In this same image, for example, you can also start to see some horizontal lines beginning to get identified, or even just the difference between the number and the background.

OK. So we've just seen convolutional layers. Convolutional layers are for extraction, feature extraction.

Now we have the pooling layers. Pooling layers actually allow us to downsample our feature map. As you can see, we'll have a 28 by 28 feature map matrix, which will be reduced to a 14 by 14. How does it do it?

Well, it actually takes a little kernel, as you can see, and it will scan through the whole image, taking the max value every time. That way we can keep the feature map, just much more reduced, allowing us to create more complex networks using less computing power.
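A minimal sketch of that 2x2 max pooling step on a 28 by 28 feature map; random values stand in for a real feature map here.

```python
import numpy as np

def max_pool_2x2(feature_map: np.ndarray) -> np.ndarray:
    """Downsample by keeping the max of every non-overlapping 2x2 window."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.random.rand(28, 28)   # stand-in for a 28x28 feature map
pooled = max_pool_2x2(feature_map)
print(pooled.shape)  # (14, 14): same features kept, a quarter of the values
```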

Let me go back into the demo. So, as you can see, now this is our pooling layer. We've done the convolutional operation, and we've reduced it with a pooling layer. We still keep the horizontal lines, as you can see here and here, and we maintain most of the features.

So now that you've understood that we have a convolutional layer, which allows us to extract features, and a pooling layer, which allows us to downsample while keeping the features, we can actually repeat that process many more times and build up an abstraction in our network. However, we're not going to use the same kernels that we used at the beginning. At the beginning, we used simple kernels with simple patterns to detect. But once we start creating a more complex network, we can actually start playing around with the shapes and patterns we want to look for, such as circles, diamonds, et cetera.

So now that we've understood the first part, which is the feature extraction, let's move on to the classification. Once we take all of these features, we'll put them into some fully connected layers, which are neurons that take high-level features instead of just the raw input image. We'll take those high-level features and assign each class a probability. That way, we can classify depending on the features.
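Putting the two halves together, here is a minimal sketch of the kind of CNN described above, written with Keras; the exact number of filters and layer sizes are my own illustrative choices, not something specified in the talk.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # Feature extraction: convolutions detect patterns, pooling downsamples
    layers.Input(shape=(28, 28, 1)),           # one channel, as in the talk
    layers.Conv2D(6, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(16, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # Classification: fully connected layers turn features into probabilities
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),    # one probability per digit 0-9
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```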

So now that we've understood how CNNs work, if there's one idea that I want you guys to keep, it's this: convolutional layers extract, pooling layers downsample.

Emotional Intelligence and Computer Vision

So the first step, when tying it back to emotion, is identifying key facial landmarks. What I mean by that is the regions of the face that move the most or express the most in our emotions: that could be the eyebrows, that could be the lips, that could be the jawline, whatever in the face moves the most when we express emotion. Once we've identified them, we classify them.

And that's where the CNN comes in. Convolutional neural networks will take all of that labeled data, labeled facial expressions, and will be able to classify them depending on the emotion and the features we've selected.
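As a rough sketch of that idea, here is a small CNN for classifying face crops into emotions. The 48 by 48 grayscale input and the seven emotion classes follow a common convention for facial expression datasets; they are assumptions for illustration, not details given in the talk.

```python
from tensorflow.keras import layers, models

# Assumed setup: 48x48 grayscale face crops, each labeled with one of seven
# emotions (e.g. angry, disgust, fear, happy, sad, surprise, neutral).
NUM_EMOTIONS = 7

emotion_model = models.Sequential([
    layers.Input(shape=(48, 48, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

emotion_model.compile(optimizer="adam",
                      loss="categorical_crossentropy",
                      metrics=["accuracy"])
# Training would then use labeled facial expressions:
# emotion_model.fit(face_images, emotion_labels, epochs=...)
```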

Challenges in Emotion Recognition

However, there are some challenges. On one side, we have the emotional challenges, for example, personal differences: what for one person is a happy state might not be the same for somebody else, or it may depend on your country. Somebody from China might not express emotion or use the same gestures as somebody from Italy.

Second of all, we have subtle facial cues that can convey emotions. What I mean by that is one emotion might involve just a small lip movement, or just a subtle eyebrow movement. And that just adds to the complexity of identifying them.

And then cultural norms. An expression in one culture might be misinterpreted in another, depending on the emotion and depending on the culture.

And then moving on to the technical challenges. It's very complicated to train something that is applied in the real world, especially visual systems, because settings are constantly changing. The light, for example: as you can see, we've had to put the shades down for our cameraman. Everything is constantly changing. So it's very hard to train something visual on the real world.

Conclusion

So in sum, emotion recognition has played a crucial role in computer vision. And it still has a long way to go.

However, the idea is for these systems to interpret and understand human emotions. They have to identify them first, and then interpret them in their emotional context. They can do that by analyzing facial expressions, posture, gestures, et cetera.

This could be used, for example, for enhancing interfaces and the user experience, knowing whether the user is happy or sad. It could also be used for mental health assessment.

So we're still developing AI to grow alongside humans. And I'll leave you with this to ponder: isn't it essential for AI to really, deeply understand our human intentions and emotions in order to align with our interests and needs?

Thank you very much.
