Beyond Detection: Unleashing AI Image Analysis with CLIP

Introduction

My name is Shafiq, and I'm an ML and games engineer. I currently work at The New York Times, and I have a background in research, specifically in computer vision and ML.

I've been in that space in different capacities over the last several years. It's been a passion of mine, and with the recent evolution of AI to where it is right now, it's been a great space to keep building on my passion, my work, and my expertise. Just to gauge the room: how many people here have a background in computer vision, or some understanding of it? Okay, great, a few people. Cool.


Basics of Computer Vision

In this talk, I'm going to go over basic computer vision. I give several different kinds of computer vision talks that fall under the same umbrella as what I'm describing here, but today I'll just be covering the basics.

Understanding the YOLO Model

I'll be talking about the You Only Look Once, or YOLO, model, which is one of the first major, truly ubiquitous object detection models out there. It does a lot and brought object detection to the masses, so to speak.

Exploring Detectron Framework

Then I'll be talking about Detectron. I don't know how many people have heard about this framework before, but it's a powerful framework by Meta.

Introduction to CLIP by OpenAI

Then I'll move on to CLIP, which is what was in the promo. It's OpenAI's multimodal image-and-text model, which underpins a lot of modern object detection and segmentation work.

Advanced Topics in Computer Vision

And then, if I have time, I'll talk a bit about ImageBind, which goes into the multimodal space.

I know some of this might jump around, high level, low level. Again, my background is research, but I wanted to bring some of that ubiquity to the technology as I speak to it, especially for those of you who don't have a background in computer vision. I hope that helps. So, for anybody who's not familiar: computer vision is a subfield of artificial intelligence that allows AI to process visual information from the real world.

It's in everything around you: ChatGPT's vision features, your camera, which runs all kinds of vision algorithms. Neural nets that process signal and image information are everywhere; even in that camera right there, there are optimization algorithms going on.

There are several techniques in computer vision, but what I'm going to talk about are the most fundamental ones used in deep learning today, the ones that produce some of the amazing results you've seen. And there are still a lot of challenges in computer vision that are being addressed with multimodal AI, which means incorporating vision with sound and other types of modalities.

Just a quick precursor on how computer vision started, without going too deep into it. Pre-2006, computer vision was a very handcrafted field.

People were using things like Canny edge detectors to do this kind of thing: take an image, turn it into an edge map, and then do machine intelligence on that. It required a lot of work, and there were many templates and other things you had to build yourself with equations and so on.

There were methods like histograms of oriented gradients and Lucas-Kanade tracking, things that were very mathematical in nature and not very accessible to most people unless you were a vision engineer like myself; I was doing this kind of stuff back in the day.

The great thing about modern computer vision is that it's now all boiled down to the convolutional neural net and the vision transformer. The CNN, for those not familiar, is basically a magical deep learning model that automatically extracts intelligence and information from an image. You don't need to do any handcrafted filter design.

You just need to train it on a set of images that has already been labeled or segmented, and it can do things like pick out the features of an image automatically, break it down into little pieces, and then do classification, detection, and more on top of that. There are lots of different optimization techniques, but as you see in the GIF at the bottom, this kind of image segmentation for cameras, tracking, and so on now comes almost out of the box with ConvNets.

That's as far as I'm going to go into that, because I want to get to some of my demos.

But I will say this about why it's so good: prior to 2012, there was a challenge called ImageNet. The ImageNet challenge was a competition to classify images and improve computer vision models, and up until 2012 there hadn't been much progress in it.

I think the best models were doing around 30 to 40% on image classification, so they could tell what a horse was 30 or 40% of the time, something like that. People had collected large sets, around 14 million images across several different categories, and wanted to improve on this over the years.

Then in 2012, AlexNet arrived. Well, it was produced before that, but it was demoed at ImageNet, and its classification was dramatically better relative to the other models. It opened the floodgates to wide-scale adoption of deep learning for images.

I go into this in much more detail in some of my other talks, where I dissect the layers of this model. But for here, I'll just say that this model revolutionized the field; so much of what came after is based on the AlexNet model.

And yeah, the ImageNet challenge is no more, because there are other challenges now. Image classification by machines is now better than human performance at this point, and that's been shown statistically: machines can classify these images better than people can.

YOLO Model: Single-Pass Object Detection

So the first model that I'm going to talk a little bit about is the YOLO model, You Only Look Once. What the YOLO model does is object detection in what's called a single pass, one pass of the network over the image, and it's used for very fast detection.

I'll demonstrate a few things in this video that I put together. You probably can't hear it very well, so I'll just explain what's going on.

I do all these demos in Python, and occasionally I build tools out of them. Specifically, what's going on here is that I take an image of a city street, and what I'm doing with this image is applying what's called a false fog modifier to it.

It occludes the image, and it shows how the model's ability to detect objects in the image degrades. For example, in the first, least foggy image, the model does pretty well and can detect just about everything. It's harder to see here, but it detects anything from traffic lights to cars, trucks, and so on.

Then, as I increase the artificial fog intensity, you can see the model start to fail. Now, what's great is that you can apply different filters and improve the detections, so you can partially undo the effect of the fog. That's one of the nice things about YOLO.

It's a very fast, very easy to implement model, and it's used in a lot of different AI vision applications.
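For anyone who wants to reproduce something like this, here is a minimal sketch of the fog experiment in Python. It assumes the Ultralytics YOLO package and OpenCV; the file name, the pretrained weights, and the white-blend approach to faking fog are my stand-ins, not necessarily what was used in the demo.

    # Minimal sketch: detect objects at increasing levels of synthetic fog.
    # pip install ultralytics opencv-python
    import cv2
    import numpy as np
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")              # small pretrained detection model
    street = cv2.imread("city_street.jpg")  # hypothetical city-street photo

    for fog in (0.0, 0.3, 0.6, 0.9):
        # "False fog": blend the image toward a flat white layer.
        white = np.full_like(street, 255)
        foggy = cv2.addWeighted(street, 1.0 - fog, white, fog, 0)

        # Single forward pass; boxes, classes, and confidences come back together.
        result = model(foggy)[0]
        names = [result.names[int(c)] for c in result.boxes.cls]
        print(f"fog={fog:.1f}: {len(names)} objects -> {sorted(set(names))}")

As the fog fraction rises, the detection count typically falls off, which is exactly the degradation the demo shows.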

Detectron: Background Removal and Object Identification

The next model that I'm going to quickly talk about is called Detectron. Detectron and Detectron2 are by Meta, and the Detectron models are based on what's called the R-CNN family, region-based CNNs. They let you do all kinds of fantastic things with images.

This is one example: background removal. These are demos that I put together using this model. I take an image, and then I can remove the background completely, even a complex background. What it does is use what are called region masks, in the Mask R-CNN style.

It subtracts things that are not detected foreground objects: as it detects, it decides this is now in the background, this is in the foreground. It's great for things like visual investigations or object identification. As you can see, it still has some quirks with faces; it thinks this face in the background is part of the image foreground.
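Here is roughly what that looks like in code, a sketch built on Detectron2's standard pretrained COCO Mask R-CNN config. The input file name and the way the masks are composited are illustrative choices on my part.

    # Sketch: background removal by keeping only pixels inside detected instance masks.
    import cv2
    import numpy as np
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
    predictor = DefaultPredictor(cfg)

    image = cv2.imread("portrait.jpg")  # hypothetical input photo
    masks = predictor(image)["instances"].pred_masks.cpu().numpy()  # one boolean mask per detection

    # Union of all instance masks = foreground; everything else gets blacked out.
    keep = np.any(masks, axis=0) if len(masks) else np.zeros(image.shape[:2], dtype=bool)
    cv2.imwrite("no_background.jpg", image * keep[..., None].astype(np.uint8))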

So it has a preference for faces. Then there's keypoint skeleton detection. If you're familiar with how keypoint skeletons work, the human body can be represented computationally by what's called an 18-point landmark model. That basically means you take a body frame, apply machine learning, and break it into roughly 18 joints into which a person can be segmented.

And you can see this model essentially does that: it does the skeleton segmentation on the side, and you can do it for video. There's a complex set of details that go into this, but essentially it's using what's called a feature pyramid network and a dataset called COCO, Common Objects in Context, to detect that these are human bodies. You can imagine there are several uses for this, especially in complex environments with fog and so on, where you might need to identify a person.
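The keypoint part is a small variation on the same setup, just swapping in Detectron2's pretrained COCO keypoint config. Note that this particular model predicts the 17 COCO keypoints; the 18-point convention mentioned above adds a neck joint. The frame path is a placeholder.

    # Sketch: human keypoint ("skeleton") detection on a single frame.
    import cv2
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultPredictor

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Keypoints/keypoint_rcnn_R_50_FPN_3x.yaml")
    predictor = DefaultPredictor(cfg)

    frame = cv2.imread("pedestrians.jpg")  # hypothetical frame with people in it
    people = predictor(frame)["instances"].pred_keypoints.cpu().numpy()  # (num_people, 17, 3)

    # Each keypoint is (x, y, score); draw the confident ones as green dots.
    for person in people:
        for x, y, score in person:
            if score > 0.05:
                cv2.circle(frame, (int(x), int(y)), 3, (0, 255, 0), -1)
    cv2.imwrite("keypoints.jpg", frame)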

Then it gets more advanced. You have instance segmentation and dense pose estimation, where you can now get the entirety of a person in a box. These people can be completely segmented out, and you can get more characteristics about each person and identify more details about them, even against very complex backgrounds. Dense pose is very interesting because you can use dense pose estimation to do things like auto-generated computer graphics of auto-generated people.

There are actually some really fantastic videos and demos out there about how this was done and how it can scale. And then I think the most powerful technique Detectron lets you do is what's called panoptic segmentation, which means you take an entire scene and break it down into every region of that scene, including blobs that indicate: okay, this is concrete on the ground, this is a person, this is forest. And it does pixel-level detection; this model is literally cutting out the exact outlines of everything in the scene. It's a very powerful technique for many reasons, because you can do very complex geometrical evaluations of images and scenes with panoptic segmentation. So this is me: I took some video driving, just driving through the city with a dash cam.

What I did here was apply all these filters to that video. This is essentially the video picking out objects and completely segmenting out the background, blacking it out in real time as the person drives. This is keypoint estimation in the same scene, so I can pick out all the people as they walk through it.

Very easy to do. And I can do this actually with relatively little data. And finally, you can actually do panoptic segmentation. And you can see it can perfectly segment out the cars, the trucks, the light poles, everything like that.

So again, all of this is done in Colab, by the way, and Python. It's kind of amazing what you can do once you get a handle on the model.
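Here is a sketch of what the dash-cam loop can look like with the pretrained panoptic model, in the same Colab/Python spirit as the demos. The video file name, the model choice, and the on-screen visualization are my assumptions.

    # Sketch: panoptic segmentation over dash-cam video frames.
    import cv2
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.data import MetadataCatalog
    from detectron2.engine import DefaultPredictor
    from detectron2.utils.visualizer import Visualizer

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-PanopticSegmentation/panoptic_fpn_R_101_3x.yaml")
    predictor = DefaultPredictor(cfg)
    metadata = MetadataCatalog.get(cfg.DATASETS.TRAIN[0])

    video = cv2.VideoCapture("dashcam.mp4")  # hypothetical driving footage
    ok, frame = video.read()
    while ok:
        panoptic_seg, segments_info = predictor(frame)["panoptic_seg"]
        vis = Visualizer(frame[:, :, ::-1], metadata)  # Visualizer expects RGB
        out = vis.draw_panoptic_seg_predictions(panoptic_seg.to("cpu"), segments_info)
        cv2.imshow("panoptic", out.get_image()[:, :, ::-1])
        if cv2.waitKey(1) == 27:  # Esc to stop
            break
        ok, frame = video.read()
    video.release()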

CLIP: Multimodal Model for Object Identification

So now I'll talk a little bit about CLIP. CLIP is essentially OpenAI's multimodal model, and it lets you do more specific identification of objects.

So here's a video of me doing this. Essentially, I take a bunch of categories; in the interest of time I won't go over all the details, but my initial categories are cat, dog, mouse, horse, and boat. Then, as I run the CLIP model on a test cat image I have in the background, you can see that it says this cat is 96% cat, 1% mouse, 1% dog, and virtually zero horse and zero boat.

What's more powerful than that small, high-level label set is that I can use the ImageNet labels as captions and do very specific classification. So I've updated the demo to pull from the 1,000-label ImageNet caption set, and now the model can report 25% Madagascar cat, 9% Egyptian cat, 6% tabby, 5% Persian, 3% tiger cat. So you can get very, very specific with this label set.
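Stripped down, the zero-shot classification demo looks roughly like this, using OpenAI's open-source clip package. The label list, prompt template, and image path are placeholders; swapping in the 1,000 ImageNet labels is what gives the finer-grained breakdown.

    # Sketch: zero-shot classification with CLIP.
    # pip install git+https://github.com/openai/CLIP.git
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    labels = ["cat", "dog", "mouse", "horse", "boat"]  # or the 1,000 ImageNet labels
    image = preprocess(Image.open("test_cat.jpg")).unsqueeze(0).to(device)  # hypothetical test image
    text = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

    with torch.no_grad():
        logits_per_image, _ = model(image, text)
        probs = logits_per_image.softmax(dim=-1).squeeze(0)

    for label, p in sorted(zip(labels, probs.tolist()), key=lambda x: -x[1]):
        print(f"{label}: {p:.0%}")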

CLIP is extremely powerful because it's a combined multimodal text-and-image model, and CLIP and its variants show up in a huge share of segmentation and vision-language pipelines right now. Other interesting things you can do with CLIP include visual image search.

For example, you can match textual descriptions to relevant images. Say you wanted to do something like a reverse Google Lens, but in a more machine-learning way. If I have a bunch of images and I search for something like a dog, it will just pick out the dog from my group of images here.

Now let's say I search for cats, and it will pick out all the cats without anything in the images being labeled cat. Nothing in the image, the metadata, or anywhere else indicates that it's a cat picture; the file is just "image two" and that's it, a bunch of pixels.

Then I can pick out a cow, and one of the interesting things is that CLIP is not perfect: with the cow image, as you can see, it says 77% cow, but it also found a 70% person in there just because a shadow was shaped a particular way. So these models have to be augmented in different ways.
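The search demo is the same machinery run in the other direction: embed every image once, embed the text query, and rank by cosine similarity. A sketch with assumed file names follows.

    # Sketch: text-to-image search over an unlabeled image pool with CLIP embeddings.
    import clip
    import torch
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    paths = ["image_1.jpg", "image_2.jpg", "image_3.jpg"]  # hypothetical unlabeled images
    images = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)

    with torch.no_grad():
        image_feats = model.encode_image(images)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)

        query = clip.tokenize(["a photo of a cat"]).to(device)
        text_feat = model.encode_text(query)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

        scores = (image_feats @ text_feat.T).squeeze(1)  # cosine similarity per image

    for p, s in sorted(zip(paths, scores.tolist()), key=lambda x: -x[1]):
        print(f"{s:.3f}  {p}")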

Neural Style Transfer and Artist Attribution

Another neat thing, and I talked about this last week, is called neural style transfer. Neural style transfer is basically when you take an image and use Stable Diffusion or some other image generation model to restyle it in the style of a particular artist. So I took a car and styled it in the style of Picasso; this one is in the style of Van Gogh, a car in the style of Picasso, and so on, a car in the style of da Vinci. Now, what's really interesting with neural style transfer is that I can take this further and style one image in the style of another. This is me doing that here.

I'm taking a content image, which is this woman here, and a style image. When I run it, let's see, it styles the image like this: I extracted the style from the style image, this rainbow waterfall, and applied it to the woman.

And you can do this for any image in any group of images and do the same thing. Now, I can do this for video as well. So I have a video of me talking. I apply the style. And as you can see, the style applies to the video of me talking using neural style transfer algorithms.
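One easy way to reproduce this kind of content/style pairing is with a pretrained arbitrary style transfer model; the sketch below uses the Magenta model on TensorFlow Hub, which is just one option (a classic Gatys-style optimization loop or a diffusion pipeline would also work). The image paths are placeholders.

    # Sketch: arbitrary neural style transfer with a pretrained TF-Hub model.
    # pip install tensorflow tensorflow-hub pillow
    import numpy as np
    import tensorflow as tf
    import tensorflow_hub as hub
    from PIL import Image

    def load(path, max_dim=512):
        # Load an image as a float32 batch of shape (1, H, W, 3) in [0, 1].
        img = Image.open(path).convert("RGB")
        img.thumbnail((max_dim, max_dim))
        return tf.constant(np.array(img, dtype=np.float32)[np.newaxis, ...] / 255.0)

    stylize = hub.load("https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2")
    content = load("woman.jpg")             # hypothetical content image
    style = load("rainbow_waterfall.jpg")   # hypothetical style image

    stylized = stylize(content, style)[0]   # first output is the stylized batch
    Image.fromarray(np.uint8(stylized[0].numpy() * 255)).save("stylized.jpg")

For video, you would run the same call frame by frame, ideally with some temporal smoothing so the style doesn't flicker.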

I'll quickly mention why, in the context of CLIP, this is important: I'm working on a project to make sure artists get compensated if their styles get ripped off via these techniques. You can use CLIP to do image attribution, which means that I can, in theory, build an attribution system. This is just me doing it live again: I can upload images and then say, okay, these images are a certain percentage of these particular artists, Monet, Van Gogh. I've chosen Impressionists because their styles are very distinct, well, except Matisse, but at least the images are very clear about what kinds of styles they have. So I can essentially build a system using CLIP, because CLIP can identify things like style very easily. It's very powerful in that regard.

ImageBind: Multimodal Synthesis and Analysis

And then the last thing that I'll demo, for anyone who's familiar with multimodal analysis, is a thing called ImageBind. ImageBind is essentially a system that takes a bunch of different modalities of data and allows you to combine them all together, and with that you can do a very powerful kind of emergent classification.

This takes computer vision to the next level, where you not only take vision, but you can take sound, accelerometer data, temperature data, depth data. You can imagine how many modalities you can take in.

So basically, I don't know if this is audible enough, but what this is doing is taking the sound of a dog barking together with a beach. This is from the ImageBind demo site, but you can actually build a model using ImageBind to do this yourself. You might not be able to hear it, but let's see. We click on this image of a beach here, and it retrieves a dog on a beach.
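Under the hood, that kind of cross-modal retrieval is just comparing embeddings in one shared space. Here is a minimal sketch using the open-source facebookresearch/ImageBind repo, assuming its packaged layout; the audio and image files are placeholders.

    # Sketch: audio-to-image retrieval with ImageBind embeddings.
    import torch
    from imagebind import data
    from imagebind.models import imagebind_model
    from imagebind.models.imagebind_model import ModalityType

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = imagebind_model.imagebind_huge(pretrained=True).eval().to(device)

    image_paths = ["beach.jpg", "city.jpg", "forest.jpg"]  # hypothetical candidate images
    audio_paths = ["dog_barking.wav"]                      # hypothetical query sound

    inputs = {
        ModalityType.VISION: data.load_and_transform_vision_data(image_paths, device),
        ModalityType.AUDIO: data.load_and_transform_audio_data(audio_paths, device),
    }
    with torch.no_grad():
        emb = model(inputs)

    # Score each image against the barking sound; the highest score is the retrieval.
    scores = torch.softmax(emb[ModalityType.AUDIO] @ emb[ModalityType.VISION].T, dim=-1)
    print(dict(zip(image_paths, scores.squeeze(0).tolist())))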

Yeah, so very interesting stuff. If we query with a pouring sound, it finds the pouring. And this is real, this is not faked or anything.

This is actually doing generative, diffusion-based multimodal synthesis: a faucet with apples and water pouring on the apples. Now, this could be generated, or it could be a real retrieved image.

Then let's say we have a car engine sound. You can do really fascinating things with this, and I'm talking about this specifically at my multimodal talk next week, where I'll go more into how multimodal synthesis works. So that's basically some of the things you can do. And just one of my favorite videos that I generated, to end off this presentation: I took a Space Odyssey video, or rather an image, and then I just started morphing it using WarpFusion, CLIP, and a few other things

to create, well, you've probably seen these videos online, but I'm de-aging this person from Space Odyssey back into the magical baby at the end of the movie, for those who are familiar with and fans of it. It honestly gets kind of weird, because you can sort of control the diffusion process, and also you can't; that's the funny thing about generative AI. But the whole point is that the image models essentially control what happens to this image. I have tons more of these, but it's just kind of cool what you can do with that.

Conclusion

And so that's pretty much the end of my talk. Any questions?
