Computer vision x LLMs - AI for the visually impaired

Introduction to My Journey as a Data Scientist and Engineer

I'm a data scientist, and I have been one for almost five years. To be honest, I'm an engineer as well.

I studied mechanical engineering for my bachelor's in India. The thing is, I was always into tech. I've been playing around with computers for as long as I can remember, tinkering with whatever random PC was at home in my free time, so it was always very close to me. After joining engineering, I worked on projects that were close to software engineering, at the intersection of mechanical engineering and a bit of software, and that's where I learned Python, I guess.

To be honest, I didn't learn much coding as an engineer. Then I kind of started picking it up on my own.

I was just taking on random projects from my friends and trying to build things for them. I think that is where I learned a little bit of Python. So that is the short story.

Current Pursuits in AI Development

Currently, I'm doing a master's at Queen Mary University, and I'm also an AI developer at a startup. We build LLMs for generative AI and explainable AI. It's quite a new field.

Exploring Explainable AI

I think most people don't know about explainable AI. Basically, the issue is that traditional NLP didn't have the capacity to actually understand the context of a particular paragraph or document, what was really going on. Think of it as a parrot that could just mimic the words, right?

LLMs started kicking off around 2017. That is when the transformer architecture skyrocketed, and it became the main turning point for language models. From there, the growth of LLMs was pretty much exponential.

The transformer is the basis for all the GPT models we use currently. ChatGPT, Gemini, Copilot, they all run on LLMs, right? We might think it's a complex thing that most people cannot really understand.

But that is exactly why it got so popular recently: a lot of use cases are now being built by people who don't know how to code, people with startup ideas, founders, it just started kicking in for them. So that is a small introduction to LLMs.

If you have used ChatGPT, you probably know there's an LLM running behind it.

Apart from that, computer vision has been here for a very long time. To be technical, computer vision just extracts the information that is present in images, things like edges, and it's the portion of AI that has the capability of understanding what is in an image.
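To make that concrete, here is a minimal, purely illustrative sketch of the kind of low-level feature extraction classical computer vision does, using OpenCV's Canny edge detector; the file names are made up and this is not code from our project:

```python
import cv2

# Load an image from disk (hypothetical path) and convert it to grayscale,
# since classical edge detectors operate on single-channel intensity values.
image = cv2.imread("street_scene.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Canny marks pixels where the intensity gradient falls between the two
# thresholds, producing a binary map of edges.
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

cv2.imwrite("street_scene_edges.png", edges)
```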

So both of these are pretty good on their own. And they do their job.

But I think there's a gap that is going to be bridged in the coming days. Computer vision works pretty well alone, and LLMs work pretty well by themselves.

Integrating Computer Vision with LLMs

But there's a fine intersection between these two technologies where they can be really useful, from an engineering perspective and in terms of solving problems. That is why I thought it would be pretty cool to use both of them together.

The LLM is basically a mind, and then you're giving it eyes, vocabulary, and speech. You're trying to integrate all of it together and make it useful as a whole tool rather than just bits and parts of smaller tools, so it can act as one whole integration.

This is a small idea that is pretty new, pretty young. It started just a week back.

The Hackathon Challenge

My friends and I were invited to one of the events at Microsoft, a 48-hour hackathon. With all of this in mind, we thought of building something that is, I can't quite say cool, but kind of helpful.

You might know about Jarvis and the cool voice assistants that are popular in movies and pop culture. That kind of thing can make you feel superhuman, but I don't think that is necessary or required right now. Using both of these technologies just to make people more like superhumans feels like putting them in the wrong place. I think we can instead leverage them for people who really need them.

Think about people who don't have vision, people who cannot see, or people who cannot hear. Why don't we use these technologies to actually help them? If it's computer vision, why can't it give vision to the people who actually need it?

That is where I thought, okay, maybe it would be a cool idea to do this. We drafted it in around two hours, I guess, the architecture and everything.

I'm from AI and data science, my friend is from IoT and electronics engineering, and one more friend helped with the software development. So basically, we were sitting together.

We sketched a rough copy and then tried to build it. So I'll just go through it.

So this is our team. Basically, there's Pawan, and one of my teammates is Munir. The other two people couldn't come.

Assisting Visually Impaired Individuals

These are the struggles of people who cannot see properly. They cannot navigate to a lot of places, they always need company to go from point A to point B, and they're always worried that something bad might happen or that they'll get lost. The worst-case scenario is that they won't even go out at all.

Those are most of the problems visually impaired people are facing right now. It gave me the idea of building this because I used to help my grandmother back then. She really needed me at that point, and I would tell her where she needed to go.

I used to walk with her, and then I moved here. So I thought, OK, maybe there's a way I can help people who need all of this, with LLMs and computer vision together.

It offers independence, and it can help with real-time descriptions of places and with navigation. We are also planning to integrate custom voices, so it can sound more like the people they know and are close to. That way it doesn't feel like a robot constantly telling you what to do, like mind control or something; instead you have the rapport of talking with people you like.

It also helps them read, and it gives them access to places they cannot navigate on their own, places where they would otherwise always need someone to help them.

There is also some future scope we are planning. Currently it can identify the details in the surroundings, but there is a small latency behind it.

We are planning to make it real time. That is what we are working on now.

We are also planning to integrate different cultural aspects into the language models, and to support different languages.

It can also act as a personal assistant. Visually impaired users often cannot use apps or really do anything on their phone, so a mobile app alone is basically useless for them.

Instead, it acts on voice commands. To do all this, I thought there was a pretty easy way, and it didn't take a lot of time, not even a week. We built it in about 48 hours.

Building the Assisting Tool

I'll just go through the breakdown of what we did. This is the process flow of the model.

The user takes a picture, but they don't have to point and click or anything. We are planning to use a small wearable device, a small Bluetooth device they can wear and carry with them.

They can just tap it like this, and it captures the surroundings and takes in the details. Currently we are using GPT-4 for the image processing and the extraction of details from the images. We tried different models.

There is the YOLO model, which is pretty good at object detection. The way it works is that object detection runs behind the scenes, gathers what is in the image, and that output is then integrated with the language model.
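As a rough illustration of that object-detection step, here is a minimal sketch using the Ultralytics YOLOv8 API; the model file and image path are just examples, not our exact code:

```python
from ultralytics import YOLO

# Load a small pretrained YOLOv8 model (weights are downloaded on first use).
model = YOLO("yolov8n.pt")

# Run detection on a single captured image (example path).
results = model("street_scene.jpg")

# Collect the detected class names so they can be handed to the language model.
detected = [model.names[int(box.cls)] for box in results[0].boxes]
print(detected)  # e.g. ["car", "traffic light", "person"]
```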

Everything that is detected in the image gets translated into text, and that text is fed into the large language model. Even when there are multiple objects and different things going on, the reason I prefer LLMs here is that they are highly contextual. Traditional NLP is pretty good at language processing, but it lacks context. LLMs are good at this, and they are built in a way that they can be customized to any idea or knowledge base. If you take a single LLM, you can train it to do a different range of work, say healthcare, or you can train it on financial data.

That's called fine-tuning. Basically, you feed it data in a prompt-and-response structure, and with those combinations it can be used as a chat agent, for sentence completion, or for generative AI. There's a huge scope for fine-tuning LLMs and using transfer learning to adapt them to any specific domain.
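As a hedged illustration of what that prompt-and-response data can look like, here is a small Python sketch that writes a fine-tuning file in the common JSONL chat format; the example pairs themselves are invented:

```python
import json

# Made-up prompt/response pairs in a chat-style fine-tuning layout.
examples = [
    {"messages": [
        {"role": "system", "content": "You describe scenes for visually impaired users."},
        {"role": "user", "content": "What is in front of me?"},
        {"role": "assistant", "content": "A pedestrian crossing about three metres ahead; the signal is red."},
    ]},
    {"messages": [
        {"role": "system", "content": "You describe scenes for visually impaired users."},
        {"role": "user", "content": "Is it safe to cross?"},
        {"role": "assistant", "content": "Not yet. Two cars are approaching from your left."},
    ]},
]

# One JSON object per line, which is the JSONL layout fine-tuning jobs expect.
with open("finetune_data.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```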

So there's a huge engineering aspect to using LLMs. Anyway, whenever our tool captures an image, it describes it: what is going on in the image, what is happening in real time?

Think about a blind person trying to cross the street. They want to understand the situation: is it safe to cross, is the traffic signal green? All of these things seem like a small deal to us, but they're very difficult tasks for people who are visually impaired.

So it kind of gives them that understanding of the environment. And then they can move around using our tool.

And once it has described an image, the user can actually interact with the information. It speaks to the user, and the user can ask, OK, how far is this car? How far is this light?

It can give that kind of contextual information back to the user. It can also be activated by voice command. Say the user is waiting for a bus that is coming in another five or ten minutes. They just say, I'm waiting for the bus, keep an eye on it and tell me when it comes. The tool handles that and then tells the user, OK, your bus is here, you probably need to move about three metres to the left, and then you can board the bus and go safely. That is the kind of interface we are trying to build with this model.

The Technical Architecture of the Model

That is pretty much the model architecture. To build this, we used Django as the framework for the software part, and we used the OpenAI GPT-4 model.
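Since Django is the framework here, this is only a minimal sketch of how an upload endpoint for the captured photo could look; the view name and the describe_image() helper are hypothetical, not our actual code:

```python
# views.py - hypothetical sketch of the image upload endpoint.
import base64

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt
from django.views.decorators.http import require_POST


@csrf_exempt  # simplification for the sketch; a real app would handle CSRF properly
@require_POST
def describe(request):
    # The wearable (or the web demo) posts the captured photo as multipart form data.
    photo = request.FILES["photo"]
    image_b64 = base64.b64encode(photo.read()).decode("utf-8")

    # Hand the encoded image to the vision + LLM pipeline
    # (describe_image is a hypothetical helper wrapping the GPT-4 Vision call).
    description = describe_image(image_b64)

    return JsonResponse({"description": description})
```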

It is also reasonably efficient when it comes to image processing, I think. If you think about generative AI for text, there is a token cap on the text, and you get billed based on the prompts and on how many tokens you use. In comparison, the image processing worked out pretty cheap for us compared to text-heavy generation.

We also need speech models integrated so the user can communicate with GPT-4. That is where we used Google Text-to-Speech, an open-source tool that converts text to speech. Whatever GPT-4 generates gets converted into speech.

Then the user can interact with it, and that is how the pieces integrate with one another.
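Assuming the open-source gTTS Python package is what is meant by Google Text-to-Speech here, a minimal sketch of that conversion step could look like this (the description string is just an example):

```python
from gtts import gTTS

# Text produced by the language model for the current scene (example only).
description = "The pedestrian light ahead is red. A white car is waiting at the junction."

# Convert the description to spoken audio and save it so it can be played back
# to the user through the wearable device's speaker.
gTTS(text=description, lang="en").save("description.mp3")
```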

So there is computer vision, there is a large language model, there is speech-to-text and text-to-speech, and we are also using pre-built libraries like Pillow, which is used for reading and storing the images.

This is a small part of the code I would like to show. It's a pretty straightforward process; I just used GPT-4 as the base model.

It's called GPT-4 Vision Preview. OpenAI's models are very good with textual data and generative AI, but they're still doing R&D on and improving the vision model. They're good at generating images, but understanding images and identifying the parts of an image is still a difficult task for them. So I think there is good scope for improvement and engineering on this part, and that is what we are working with.
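Since the slide with the code isn't reproduced here, the following is only a minimal sketch of what a GPT-4 Vision Preview call for this kind of scene description might look like with the OpenAI Python client and Pillow; the prompt wording and file names are assumptions, not our exact code:

```python
import base64
from io import BytesIO

from openai import OpenAI
from PIL import Image

client = OpenAI()  # expects OPENAI_API_KEY in the environment

# Read the captured photo with Pillow and encode it as base64 so it can be
# sent inline in the chat request (file name is just an example).
buffer = BytesIO()
Image.open("capture.jpg").convert("RGB").save(buffer, format="JPEG")
image_b64 = base64.b64encode(buffer.getvalue()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this scene in detail for a visually impaired pedestrian, "
                     "including traffic lights, vehicles, and safe walking directions."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
    max_tokens=300,
)

print(response.choices[0].message.content)
```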

Software Interface and User Interaction

Apart from that, we also have a small demo I would like to show. This is the software interface, a small web interface where I can put in an image.

Imagine a scenario where someone is trying to cross the street; that is the image I have put in as input. This is the description it produced: "The image depicts a street scene with two sets of traffic lights. The closer set is displaying a red light. There are three cars visible. The first one, a white car, is in the forefront waiting at the red light, followed by two other cars queued behind it. The road markings indicate a junction ahead, and trees and a grassy area can be seen on the left side of the street."

What we have done is use prompt engineering to come up with a specific set of instructions so that the descriptions suit visually impaired people. You cannot give visually impaired people the same instructions you would give to people who can see; it has to be very specific and very descriptive.
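For illustration only, here is the rough shape such an instruction prompt could take; this is a hypothetical reconstruction, not one of the prompts we actually tested:

```python
# Hypothetical example of the kind of instruction prompt given to the vision model.
SCENE_PROMPT = (
    "You are assisting a visually impaired pedestrian. Describe the scene in "
    "concrete, spatial terms: name each relevant object, give its approximate "
    "distance and direction (e.g. 'about three metres to your left'), state the "
    "colour of any traffic signals, and finish with a clear recommendation on "
    "whether it is safe to move and in which direction."
)
```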

This was a crucial part, so we tried different prompts, experimenting with around 50 to 60 of them. The other main bottleneck was making the whole thing faster and quicker.

There are three or four integrations going on behind the scenes, but it still needs to be very quick, almost real time. That is where our bottleneck was.

We tried to optimize everything we could. The average time as of now is around five seconds to produce an understanding of the environment, and we are planning to make it real time.

Think about someone just wearing it and knowing exactly what is going on through the integrated voice. That is where we are trying to move ahead.

You can also see there's some misjudgment on the image: it said there are three cars, but there are only two. We're trying to fine-tune that as well, because we want to be very precise about the things it reports.

This is how it looks on the software end, and my friend is working on the hardware demo.

He's setting it up on a Raspberry Pi, and we wanted to show you how it actually works in real time. I can also take a picture of this location right now and put it into the model, so you can see it working live.

This one, I guess. Let's see. Fingers crossed.

"The image shows a group of approximately 28 individuals seated and standing in a room with stylish decor, featuring a mix of blue plush chairs, wooden tables, and patterned carpet, complemented by greenery-inspired wall motifs. The individuals appear to be predominantly male, dressed in casual to smart-casual attire, with one person standing in the center background holding what seems like a camera on a tripod."

Yeah, so it's very specific about the details.

As I was telling you, we tried to use other vision models, like YOLO and the earlier GPT vision model. Segment Anything? From Meta?

Meta, yeah. I think they have something similar in the Meta glasses they're using, so computer vision as well. But they're mostly using it for AR and VR, and for recording the whole scenery that the wearer sees. There is not that much descriptive analysis of the environment going on. I think there was a demo Mark Zuckerberg posted where he was just walking through a forest, looking at whatever he was watching, and it was pretty much just that.

Future Directions and Potential Impact

The plan is also to use it to warn visually impaired users: say they're in an environment that is potentially dangerous and something might suddenly happen, it can warn them in real time and help prevent it. Another end goal is to make it a full personal assistant specifically for visually impaired people.

So that is basically how it works. We are trying to make it useful for people who have problems with vision, and ultimately it could be leveraged for people who cannot hear as well, but for now we are solving one problem at a time. That is why I think computer vision can be very useful for people who don't have vision, and that is my idea behind creating this tool.

Conclusion and Audience Q&A

We only have time for one more question, and this gentleman has his hand up first. Yes, I'm pointing to you. Yes, you. Okay.

You mentioned you tried around 60 different prompts. How did you judge which one was better? Did you try them with different images? What was your metric for deciding which one was better?

Ideally, it should be very descriptive; to be honest, that was our criterion. We tried different prompts, and some of them gave output that is not useful for visually impaired people, something like just "there is a big green chair."

If the description has very little detail, it doesn't help them, so we tried to make it much more descriptive and detail-oriented rather than just a broad impression of the scene. That was the problem we were trying to solve: initially it was just giving us a summary of whatever was in the image.

That is the reason we tried different prompts. I can show you examples of the prompts I used; it might just take a minute.

I see, that's fine. But is that done manually, or did you have some automated way? Yeah, so basically, we used LLMs again.

We engineered five or six prompts and then asked GPT to come up with variations of them for the same task. Then we ran each of them and manually checked how exactly it describes things, because that is crucial for the people who need to understand it; we couldn't really automate that evaluation.

We were trying to understand what context it was giving. Based on that, we tested the different prompts, and once one seemed good to us, we shortlisted it. All right.
