I'm Tanya. It's nice to be here in Geneva. I live in Zagreb in Croatia, so it's a little bit farther away, but I love to travel, so it's nice.
Yeah, so let's just jump into multimodal search. This is what we do at the startup that I work for.
Yeah, I'm from Silicon Valley in California. I grew up there, and I went to school and university in California and everything.
And I really come from a multimedia design background. We had an incredible multimedia design program at my high school, after school hours. So I come from this UX and media space all through my life; I've been on Adobe since I was a child.
So that's kind of a fun place for me, and I've been really active in arts and culture and organizing events in our community.
I've had really cool experiences in machine learning before this, working at a startup in Berkeley that competed head-to-head with SageMaker and with Google's machine learning models. Unfortunately, because they were a little too early in the space, they got one round of funding and didn't get another round. There are a lot of things to learn from failed startups; I think that's a great experience. I've also worked with big data: I worked at a company that does GIS data. If you don't know what that is, those are the maps and all the data points on maps, millions of data points per millisecond that you have to process and analyze in order to make predictions. That's the kind of company I was working for. And then I've headed up some product design teams, very big-data product design, bringing products to users in a friendly way that people can actually use. That's my background with AI.
And so just a little bit of what I do with OmniSearch, or what we do in general with OmniSearch, we do multimodal search. That's search in different categories.
So we do images, audio, video, text, presentations, pictures, text on pictures, speech to text, logo and face recognition, vector search: any media content, we can search it in a different way. And we do that through different models, so it's not just one model answering all your questions, but different parallel models. That's definitely possible now because of stronger GPUs, and this is something that has revolutionized all of AI, especially computer vision.
GPUs were traditionally used for video games, as we know, in stronger laptops, those arts-and-design kinds of laptops, but now we can also use them for mathematical computing, to answer the problems we had at OmniSearch in searching vector-based images.
So that's how we search: we classify, we index things, and then we're able to use LLMs to generate better queries in order to find our answers.
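To make that concrete, here's a minimal sketch of vector-based search, assuming a CLIP-style model that embeds both images and text into one shared space. This is illustrative only, not OmniSearch's code; embed() here is a random stand-in for a real embedding model.

```python
# Illustrative sketch only: embed() is a random stand-in for a real
# multimodal embedding model (CLIP-style models map images and text
# into the same vector space).
import numpy as np

def embed(item):
    # Deterministic fake embedding per input, normalized to unit length.
    rng = np.random.default_rng(abs(hash(item)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def search(query, index, top_k=3):
    # Cosine similarity of the query vector against every indexed vector.
    # On a GPU this collapses into one big matrix multiply, which is why
    # stronger GPUs made this kind of search practical.
    q = embed(query)
    scores = {name: float(vec @ q) for name, vec in index.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

index = {name: embed(name) for name in ["beach.jpg", "lunch.mp4", "logo.png"]}
print(search("person eating lunch", index))
```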
Ask away.
When you say multimodal here: when a search hits OmniSearch, does it translate everything back into text and do forms of text search, or do you have a universal language that goes across audio, visual, and text, and then you do a vector search?
Definitely. So it's a little bit of both.
The multimodal aspect is: let's say I'm searching "Tanya is eating lunch in a Nike t-shirt," or "Tanya Nike eating lunch," or maybe something I'm saying at lunch, like "Tanya told me about chia seeds at lunch," right? Now, say I filmed that conversation, I filmed the podcast, and I threw that video in there. The first step is generating the speech to text; there were no subtitles, so it generates speech to text. Now it can say: ah, Tanya was talking about, I don't know, chia seeds and bracelets during lunch.
It can identify my face because I'm not going to say my own name while I'm talking about something. So, this is Tanya and she's talking about this. So, now when I type in that search query, that's the multimodal aspect.
The face of somebody who's talking about something, maybe wearing something, maybe it's a rainy day: different layers of queries that add together to find that one second you might need. I'll show that in the demo too.
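To make the "layers of queries" idea concrete, here's one rough way those layers could be intersected on a video timeline. This is purely illustrative; the hit lists and time ranges are made up, not OmniSearch's implementation.

```python
def intersect(a, b):
    # Overlap of two (start_s, end_s) ranges, or None if they don't touch.
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo < hi else None

def find_moments(*layers):
    # Start from the first layer's hits and narrow them with every other
    # layer; whatever survives is a moment where all layers agree.
    moments = list(layers[0])
    for layer in layers[1:]:
        narrowed = []
        for m in moments:
            for hit in layer:
                seg = intersect(m, hit)
                if seg:
                    narrowed.append(seg)
        moments = narrowed
    return moments

face_hits       = [(10, 95)]              # "Tanya" visible in frame
transcript_hits = [(40, 55), (300, 310)]  # "chia seeds" spoken
object_hits     = [(0, 120)]              # Nike shirt detected
print(find_moments(face_hits, transcript_hits, object_hits))  # [(40, 55)]
```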
Does that mean that you basically index the video, and then you do the search on the indexed video? Yeah, exactly. It's in the next slide, or the next two slides. This is a little bit of history, what happened. For us, the AlexNet breakthrough in computer vision was definitely instrumental to what we do today. We wouldn't be able to do what we do with multimodal search today if those kinds of shifts in AI and computing hadn't happened, and especially that shift in computer vision, because we really start from a computer vision point of view at OmniSearch. That enabled us to do images and text and audio in a way that wasn't possible before; it was more of a metadata tagging approach back then.
So this is something we were able to present at the NAB Show in Las Vegas this year, 2024, in February.
And it was a huge talk about going beyond metadata tags to do exact search. This is the exact example: Patrick Mahomes in the rain with a Pepsi sign behind him and a fan in face paint in the background, right? How are you going to write a metadata tag for that kind of image that somebody will actually be able to find? Maybe you'll be able to index the person; somebody in the TV archive will be able to write a nice, neat note. But you're really not going to be able to find that Pepsi logo with Patrick Mahomes, with a fan behind him going crazy, in all those different contexts, or somebody eating popcorn.
So this is something that maybe isn't possible even today, but it's what we're working toward: that literally anything you could possibly think of, you could search, and if it exists, you will find it.
And this is the part you were asking about, Josh. So our indexer goes through either a UI or an API; I mean, we made our own UI.
And that's the thing. It's going to index in layers. So it's going to index all the faces in a layer.
Each one's its own model. There's a model for speech to text. There's a model for objects, model for text that's on a picture in a video.
So it's not text like a text file, but actual text in the image: if I took a picture with writing in it, it's going to read it. Landmarks, logos, scene description. For known faces you do need one picture, but it's also going to identify unknown faces.
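As a sketch of that layered indexing (one model per layer, run in parallel), it could look something like this. Every function here is a stub standing in for a real model; none of this is OmniSearch's actual pipeline.

```python
# Stub models: each one stands in for a real speech-to-text, face,
# OCR, or logo model producing one searchable layer.
from concurrent.futures import ThreadPoolExecutor

def speech_to_text(path):     return ["pumping", "lemma"]
def detect_faces(path):       return ["known_professor"]
def read_onscreen_text(path): return ["sigma", "star"]
def detect_logos(path):       return []

LAYERS = {
    "transcript": speech_to_text,
    "faces": detect_faces,
    "ocr": read_onscreen_text,
    "logos": detect_logos,
}

def index_media(path):
    # Run every modality-specific model over the same file in parallel;
    # each one contributes its own independently searchable layer.
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, path) for name, fn in LAYERS.items()}
        return {name: f.result() for name, f in futures.items()}

print(index_media("lecture.mp4"))
```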
Can I ask how many people in this room know what multimodal means? Ish. Ish. So it might be worth... It's a strong ish.
And that's slightly different from another approach to multimodal, where you actually no longer talk about text at all; you talk in what's called vector space.
We want to combine both: the kind of old-school (I mean, it's not that old) textual search aspect of AI, what we're getting from ChatGPT, plus that vector search, and blend them together. That's what we're really trying to specialize in, because that's what's going to be better in the future.
ChatGPT does the textual search. Gemini does the vectors; so far it's the only one of the models that does it properly.
So it's interesting that you're combining both. We're combining both.
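A hedged sketch of what blending the two could look like: a literal keyword score mixed with a semantic vector score. The weighting and both scoring functions are assumptions for illustration, not OmniSearch internals.

```python
def keyword_score(query, text):
    # Fraction of query words that literally appear in the document text.
    words = query.lower().split()
    return sum(w in text.lower() for w in words) / len(words)

def blended_rank(query, docs, vector_score, alpha=0.5):
    # alpha balances exact keyword matching against semantic similarity;
    # vector_score would come from an embedding model in a real system.
    scored = [(alpha * keyword_score(query, d["text"])
               + (1 - alpha) * vector_score(query, d), d["name"])
              for d in docs]
    return sorted(scored, reverse=True)

docs = [{"name": "lecture1", "text": "Python exception handling"},
        {"name": "lecture2", "text": "regular expressions and automata"}]
# Dummy vector scorer here; a real one would compare embeddings.
print(blended_rank("python exception", docs, vector_score=lambda q, d: 0.0))
```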
And then the other thing is that our tech team decided to build in C++ in order to increase speed, because a lot of this stuff is built in Python. We switched over to C++, and it's been ten times faster.
Our indexing is about ten times faster than real time on a normal GPU computer, like my laptop. A video of over an hour is going to be indexed in about six minutes, so that's nice.
I was going to ask, how can you be faster than real time, but does that make sense? Yes, you can. It's possible.
And then searching, as you're going to see today, is pretty much, once it's indexed, it's pretty much based on your Wi-Fi connection, and that's it. Or if you have it local, it's very fast.
All right, yeah, this is kind of visual representation of what we're going to dive into the demo really fast here, just searching faces, searching the transcript, catching occurrences, you know, getting kind of some logos. We'll look at some logo things and dive in. So bear with me. And feel free to ask.
I'm super sorry, I don't have the beta version of our Google Drive integration live online, so I wasn't able to show you the Google Drive version that's coming out in a month or so. Let's call it Q1 2025. That will be much more fun for real people to see.
So I'm just showing you the enterprise version, which has the same software we're implementing into Google Drive. Business Google Drives will be able to access the same APIs we're connecting. So that'll be cool for normal people.
That'll be a fun one. I'll record a video and send it over.
So what we have here is just sort of a normal demo. We have some professors; it's all open data on this demo, professors presenting a few of their lectures. There we go.
Yeah, so when we look in here: this is our backend version, but I'll show you a version from one of our clients, and with the APIs you can design it however you want. We can see here's the professor, and this is the text the end client put in, description, title, so we can search all of these things. But the really cool thing to search is the video and this automatically generated transcript, so we can just skip around and follow along if we want to learn something.
Strange looking thing. Sigma, sigma, sigma, parenthesis, star. It's really just a regular expression.
A normal back-to-school lecture, right? And then obviously a presentation. So I haven't searched anything yet; I'm just showing you what files are available in each lecture. And if we go back, let's just do a super easy one. This is kind of just a one-way search: "Python exception." We can do that.
Yep, obviously autocomplete, for people who don't like to type. We can see 42 occurrences: 24 of them are text-based, 12 are in the audio, and six are in the document. So it automatically tells you where you're finding it. In this case, there's only one person talking about Python exceptions here, but if you have
millions, like one of our clients, the Croatian National TV. They have 30 years of content across five channels, 24/7, so we're talking about petabytes and petabytes of content. We indexed their archive, and for them, when they do a search, it's really important to get down to maybe 10 or 15 good videos to use.
In this case I just brought up one. You can kind of see a preview of what you're going to find. And then here we go.
This is sorted best-first. You can filter best-first; you can filter audio or text. And yeah, just full-screen this. Here we can see it's in the text, when she's talking about the exceptions. Pretty simple.
Let's go back and sort of talk about recognizing the face. What happens if you search for exceptions in glasses?
Not in this one, but... I can't show that one because it's a customer's. But we have one, a large football league that literally has popcorn.
Because it's not on this demo. That's a beta version we're doing specifically for this football client, where they're literally indexing what their fans are doing.
So it's everything from glasses, to eating popcorn, to yellow flags, to red cards. This one is a very basic demo, just to show a few things.
Depends what APIs are implemented into it. And a lot of things are still in progress because our team is very research-based.
I know, so Sipser was one of the professors, so let's find his lecture about the pumping lemma. There's a few of them. The first one here is probably the most relevant; the first is always the most relevant.
You can also sort them however you want. This lecture is literally about that. And then the cool thing is just him.
Here we go, the second one. So the pumping lemma says that you can divide. So it's both on the screen.
He's talking about it, and it's a picture of him. There's no "Sipser" written here, but it found him, because we have him in the known people. It's easy to find.
So if you remember, I don't know, your school professor talking about something, and that video is online, you're going to be able to find it, especially something really abstract. I remember when I was in school: it's eight hours of lectures, and then something is just ringing in your head, and you don't have enough time to go back and rewatch all those eight lectures. It would be cool to be able to search that.
So that's kind of like the point of this little demo.
Next, I'm going to go into some extra face options. Let's see. Sorry, jumping across a few demos.
So the cool thing about this one is known people. We pulled open data from the recent political elections, the White House race, what's been happening. These are all public videos; that's why I wanted to show public things, because I know this is being recorded.
I chased down some clients to give us data, and some of them were like, no. Yeah, media companies are very strict with their data, so getting anything is very hard. But okay.
"Windmills and whales." So let's look at Mika Brzezinski, White House race. I know she talked about that a lot. So here we go.
I mean, the photo identification is extremely accurate. You just need to upload one high-quality photo, and it will catch somebody literally running past in the background.
We always joke around at the Croatian national television: some friends who have been in interviews before are out shopping at the market, and we catch them. Maybe that's not exactly fun, but it's for private use. There was this MIT thing, last week or the week before last actually, where they built these glasses: real-time face identification. You just walk around, and everyone's face pops up with exactly their social profiles. It's actually real.
Yeah, exactly. So pretty simple stuff. It's happening.
You can add new people. It's very easy. You can add your mom, your dad, whoever you want, and hope they're not rioting behind Trump.
The next one, this one's pretty cool too. "Joe Scarborough Trump." Oops, let me spell Joe right. That would be fun.
And then I'll show you the unknown faces. You can scroll through everything here; we can scroll through and find Trump in there. And you can always sync with the video time, so we can always jump ahead. No doubt.
"Those issues that Democrats felt like they had to pull back on." So this is Joe Scarborough talking about Trump. It catches him, and he speaks about Donald Trump, so that's where it's catching it. Yeah, exactly: since I searched Joe Scarborough, and that's Joe Scarborough, and since he's speaking about Trump, we catch him. If Trump were talking about Joe, we would also be able to catch it, but he doesn't talk about Joe, obviously.
It doesn't go the other way. Same thing. I mean, I can show a couple more, but we can save some time.
I would like to show you the logo identification once and just one more Kamala Harris campaign interviews. That one's fun. And just to show you a little bit of this.
So this is kind of always the first suggestion. That value has not changed. Okay. "...as she faces criticism for having Governor Walz by her side during the interview, Politico is reporting, quote, a senior campaign official also made the important stipulation..."
So in the transcript, you always get the highlights. You can always download the transcript, and it's going to be highlighted with the search. The cool thing we just implemented, so it's fresh and new, is this people and logos feature. So we can find these unknown people, and then in the timeline we're going to see them. "...she should have to sit and answer tough questions about why exactly..." Exactly. This one is not a great example; we're still working on this whole aspect, but that's the point. We find the unknown person in, I don't know, 100 videos throughout the whole system, and then we can say: oh, that's my mom, oh, that's my dog, or, oh, that's Kamala Harris. You don't have to upload a photo first; it works backwards. So if we later identify that person, they're labeled everywhere. That's kind of the goal of it.
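A rough sketch of that "backwards" idea: cluster unknown face embeddings across videos first, and attach a name to the whole cluster later. The distance threshold and the embeddings here are made up.

```python
import numpy as np

THRESHOLD = 0.6  # assumed distance under which two embeddings are one person

def cluster_faces(detections):
    # detections: list of (video, embedding). Greedy single-pass clustering.
    clusters = []
    for video, emb in detections:
        for c in clusters:
            if np.linalg.norm(c["centroid"] - emb) < THRESHOLD:
                c["videos"].append(video)
                break
        else:
            clusters.append({"centroid": emb, "videos": [video], "label": None})
    return clusters

def label_cluster(cluster, name):
    # Naming one cluster retroactively names every past appearance at once.
    cluster["label"] = name
    return cluster["videos"]

rng = np.random.default_rng(0)
face = rng.normal(size=128)
hits = [("video1.mp4", face), ("video2.mp4", face + 0.01)]
clusters = cluster_faces(hits)
print(label_cluster(clusters[0], "Kamala Harris"))  # ['video1.mp4', 'video2.mp4']
```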
Also, it's really good for the security industry, because sometimes we don't know who a person is. What we're working on with CCTV cameras is that you can identify whether a threat is passing by multiple times in front of your house, and you can mark: this person is my neighbor, that's fine. I don't know his name, but he's my neighbor.
And this person's not okay. He's causing problems. Or this is the mailman. That's fine.
And then when somebody who broke my trash can last week is by my house again, I get an alert on my phone, instead of getting an "unknown face" alert every time the mailman comes. So you don't necessarily need to tag the name of the person, but you can tag them and alarm on them differently, which is really cool for the security industry.
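As a sketch, the tag-to-alert mapping could be as simple as this; the tags and policies are illustrative assumptions, not a real product API.

```python
# Tags attached to face clusters map to alert behavior; no names needed.
ALERT_POLICY = {
    "neighbor": "silent",  # known but unnamed, never alert
    "mailman":  "silent",  # expected visitor, never alert
    "flagged":  "alarm",   # caused trouble before, always alert
}

def on_face_detected(cluster_tag):
    # Untagged clusters fall back to a plain "unknown face" notification.
    action = ALERT_POLICY.get(cluster_tag, "notify_unknown")
    if action == "alarm":
        print("ALERT: flagged person is near the house again")
    elif action == "notify_unknown":
        print("Note: unknown face detected")

on_face_detected("mailman")  # silent
on_face_detected("flagged")  # fires the alarm
on_face_detected(None)       # generic unknown-face note
```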
Tricky! Here's my ex! Exactly.
Exactly, yeah. This one's from the Croatian national television; they gave us a few of their videos from the European Cup, so that's fun. It is in Croatian, sorry, but we can go through here. Yay, football. We can see a couple: okay, the UEFA logo is going to be everywhere, but let's say the Nike logo here, and then obviously when it pops up on their shirts.
It's catching his little logo up there. Maybe I can... For the logos?
So the objective with logos: we actually got this as a request from one of our clients, for two reasons, and we had only thought of one reason in advance. The cool thing about logos is sometimes they need to hide them, and sometimes they need to know how often they show up. And in what size?
Yeah, exactly. So same thing with face recognition. Sometimes they need to know somebody's face, not to tag them, but to blur them.
So it's like: I identified a child ten times, their mom didn't give a signed release because of GDPR, and I need to fuzz out that person's face every single time I see them. And here they know exactly when to do that.
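A minimal sketch of that blurring step, assuming OpenCV is available and the face detector hands you a bounding box per hit; this is illustrative, not the client's actual pipeline.

```python
import cv2           # OpenCV, assumed available
import numpy as np

def blur_region(frame, box):
    # box = (x, y, w, h) from the face detector; blur just that patch.
    x, y, w, h = box
    frame[y:y+h, x:x+w] = cv2.GaussianBlur(frame[y:y+h, x:x+w], (51, 51), 0)
    return frame

# One fake 1080p frame; in practice you'd apply this at every timestamp
# where the index says the flagged face appears.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
frame = blur_region(frame, (100, 50, 192, 108))
```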
It's much faster for the team to do it. They could theoretically have two sponsoring contracts. Exactly.
We have this with one client we're currently working with. The marketing team wants to know when the logos are there and at what size, to price their ads. So they can say, oh, we're...
Exactly. Number of impressions and size. So this small logo would get, you know, whatever index they want.
We leave it to them to decide what they want to do with it. But we give them percent of TV screen, commercial block, and so on. We can index all of that for them and make a dashboard of their choice, whatever they want.
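As a sketch of those metrics: given per-frame bounding boxes for one logo, percent of screen and seconds on screen fall out directly. The frame size, frame rate, and box format are assumptions.

```python
FRAME_W, FRAME_H = 1920, 1080  # assumed resolution
FPS = 25                       # assumed frame rate

def logo_metrics(detections):
    # detections: list of (frame_number, x, y, w, h) for one logo.
    if not detections:
        return {"seconds_on_screen": 0.0, "avg_percent_of_screen": 0.0}
    areas = [w * h / (FRAME_W * FRAME_H) * 100 for _, _, _, w, h in detections]
    return {
        "seconds_on_screen": len(detections) / FPS,
        "avg_percent_of_screen": sum(areas) / len(areas),
    }

hits = [(f, 100, 50, 192, 108) for f in range(250)]  # 10 s of a small logo
print(logo_metrics(hits))  # ~10 s on screen at about 1% of the frame
```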
You know, with those large clients, we work with them to make a solution that works for them. Then there's the other tech team, the one with all the students working on speech to text and hiding all the faces and hiding all the logos, for all the interviews where somebody came in with, say, the Mindstone logo and they're not supposed to promote it. You know, you're not supposed to promote some brands, or who knows what.
That team wants it so they can say: ah, there's this logo, and I need to fuzz out this logo. And so we were talking to two different departments about our software for months and months and months. And we're like, can we have a meeting with everyone together?
Because the software can do both at the same time. You know, they didn't understand the concept of one solution for two problems. They're like, oh no, this is two different pieces of software.
So definitely. One of the things that happens a lot with this stuff now (I don't know, slightly outside this, but it is still AI) is that if you look at the football feeds, the logos around the field are actually replaced based on the geography you're watching from. So you're watching a football match in real time, but the logos you see around the field are swapped out in real time on the feed itself. Yeah, yeah. When I worked... I used to work at Amagi, which actually does that solution: localizing ads based on your geolocation, through knowledge about your TV and about you. When you buy your TV service, the provider kind of knows your demographics, so they push ads toward you based on your demographics. So they're building this out.
They're starting to do this on TV too; they specifically do it for broadcast TV. So this is really cutting-edge tech.
They must love you in that agency. So that's kind of why we developed this logo thing going on.
You can see Coca-Cola. It's kind of fun to play with. You can see them all in the background there.
I should full screen it. So yeah, we're working on this now. You can multi-select them. Fun stuff.
That's pretty much it, what I wanted to show you guys. And that's it.
I will definitely share a video of the Google Drive when it's live. I wanted to show that today, but here it is.
Yeah, connect with me if you want, but we're so small, so we can't do that in person.
Do you guys have any questions?