We are NeuroPixel, and we are building Canva for fashion. We work at the intersection of deep generative models and computer vision for the fashion e-commerce vertical.
To set the context: fashion cataloging is a very expensive process, and it is slow.
It also lacks context, because you cannot shoot a contextual image for each and every product. MOQs are high, and relatability is low, because what you see on the model and what you would look like yourself often do not relate.
Coming to our solution, we have M-SWAP. We change the model: you shoot only one person, and we change them into any person you want, with any ethnicity and any attributes you want.
We can change the background: you can shoot a person anywhere and swap in any preferred backdrop. This image, for example, was actually shot in a studio, and the background is a very Barcelona-looking street.
We can do both at once, changing the person and the background, so that a plain vanilla image becomes a completely creative shot.
We can also create short videos: change the person, change the background, and generate a video out of it. And if you do not have a person at all, that is absolutely fine.
You can shoot the garment on a mannequin and we change it into a person. The impact for clients: 15 to 20 percent higher CTR and at least a 5 to 7 percent jump in conversion. It is very cheap compared to an actual photoshoot.
It is very fast, and I will show you how fast. You can also produce any quantity: from one single image, you can theoretically generate an infinite number of images.
And it is very diverse. You can shoot one person, say someone Asian, and change them to African-American, Southeast Asian, Caucasian, Latin American, whatever you want.
We will come to the tech moat, but before we go there, I will show you the demo.
Broadly, as I said, there are three capabilities. Let us start with model swap: say I upload this model.
I select the gender, female, let it upload, and then select any of these models. Now, about these models, and this is what Stefan was referring to: we have of course crafted them ourselves, but with a generative model there is a chance that a model resembles an actual person.
That is the risk. So what we do is use a third-party tool that crawls the web and tries to find images with high similarity to ours, based on a similarity index such as cosine similarity. For these models there are no such matches, which means they do not resemble any actual person who exists.
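A minimal sketch of that kind of check, assuming CLIP-style image embeddings and a pre-built index of real-face embeddings (the actual third-party tool and its index are not specified in the talk):

```python
import numpy as np
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(image: Image.Image) -> np.ndarray:
    # Unit-normalize so a dot product equals cosine similarity.
    inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**inputs).detach().numpy()[0]
    return feats / np.linalg.norm(feats)

def resembles_real_person(generated: Image.Image,
                          real_face_index: np.ndarray,  # (N, D) unit-norm rows; hypothetical crawl index
                          threshold: float = 0.9) -> bool:
    sims = real_face_index @ embed(generated)  # cosine similarity to every indexed face
    return float(sims.max()) >= threshold
```

A generated model would only be released if this check finds no match above the threshold.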
Let us say I select a model, maybe Ashley, select a background, and generate; it goes to the queue. It will take around five minutes, so let it run; normally this would take more than an hour. In the meantime I can queue some more. Let us do a background swap; maybe I upload this one, or actually the lifestyle image is better, because it shows both.
So, this is the lifestyle option.
You can pick any image, maybe just walls and panes, something simple, or maybe another one, and then you generate. Now you have selected a background and also a model, so it will get generated.
Now, let us come back to this. On the tech moat: there are broadly two moats that we have.
One is that our tech stack is very modularized. We have three or four broad modules: one is segmentation, where in the first stage we segment the image; then there is a generation module, where we generate the person, skin tones, everything, and the background.
There is also what we call multimodal understanding: from the image, we extract certain attributes, which are again fed into the conditional generation. That is the modular framework.
What this helps with: earlier we were building a very monolithic system, but we realized these models are quite independent. For segmentation, right now it could be SAM 2 or Florence from Microsoft. For generation, right now we use Stable Diffusion from Stability AI, but a better model might come along, and then we could replace it.
That is one advantage. For multimodal understanding, we have found that DeepSeek is pretty good for our case, but Claude Sonnet or others might also be better.
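A minimal sketch of what such a modular framework might look like, with each stage behind a small interface so the underlying model can be swapped (all names here are illustrative, not NeuroPixel's actual code):

```python
from typing import Protocol
from PIL import Image

class Segmenter(Protocol):
    # Could be backed by SAM 2, Florence, or any future model.
    def segment_apparel(self, image: Image.Image) -> Image.Image: ...

class MultimodalUnderstanding(Protocol):
    # Could be backed by DeepSeek, Claude, or another multimodal model.
    def extract_attributes(self, image: Image.Image) -> dict: ...

class Generator(Protocol):
    # Could be backed by Stable Diffusion or a successor.
    def generate(self, prompt: str, conditions: dict) -> Image.Image: ...

def run_pipeline(image: Image.Image, segmenter: Segmenter,
                 understanding: MultimodalUnderstanding,
                 generator: Generator, prompt: str) -> Image.Image:
    mask = segmenter.segment_apparel(image)               # stage 1: segmentation
    attributes = understanding.extract_attributes(image)  # stage 2: attributes
    conditions = {"apparel_mask": mask, **attributes}
    return generator.generate(prompt, conditions)         # stage 3: conditional generation
```

Because each stage depends only on the interface, replacing SAM 2 with a newer segmenter means writing one adapter class, not rebuilding the pipeline.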
The second is a proprietary stack on top of this, which you can think of as a combination of traditional computer vision and image processing algorithms with deep learning. I will show you some of it once the generation finishes.
When you put the apparel back, there are certain nuances, and in fashion the bar is very high because we are replacing actual photoshoots. For that we use some traditional computer vision algorithms as well, like matting and certain filters. Let us see if it has generated.
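One plain example of such a traditional step, as a sketch: feathering a binary apparel mask and alpha-compositing the original garment back over the generated image so the seam does not show (the mask and images here are assumed inputs, not NeuroPixel's actual pipeline):

```python
import cv2
import numpy as np

def composite_apparel(generated: np.ndarray,  # HxWx3 uint8 generated image
                      original: np.ndarray,   # HxWx3 uint8 original shot
                      mask: np.ndarray,       # HxW uint8 binary apparel mask (0 or 255)
                      feather_px: int = 7) -> np.ndarray:
    # Blur the hard mask into a soft alpha so the edge blends smoothly.
    k = 2 * feather_px + 1
    alpha = cv2.GaussianBlur(mask.astype(np.float32) / 255.0, (k, k), 0)[..., None]
    blended = alpha * original.astype(np.float32) + (1.0 - alpha) * generated.astype(np.float32)
    return blended.astype(np.uint8)
```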
One is generated already; it took just less than four minutes. You can see the skin texture and so on.
We will go to the queue; I think one is still generating. Let it finish, and we will come back here.
This is the broad architecture.
As we were discussing, we start with the input image of the model, which you have seen. We estimate the depth map and the light map, so that we understand the depth of the scene, the background and so on, and the lighting conditions; and of course we segment the apparel, that is the first part. From the input model we also do pose estimation.
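A sketch of that estimation stage using off-the-shelf components: monocular depth and body pose from the input image. The light-map estimator is part of the proprietary stack and has no obvious off-the-shelf stand-in, so it is omitted here:

```python
from PIL import Image
from transformers import pipeline
from controlnet_aux import OpenposeDetector

image = Image.open("input_model.png")  # hypothetical input shot

# Monocular depth estimation (DPT is one common choice).
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")
depth_map = depth_estimator(image)["depth"]  # PIL image of per-pixel depth

# Body pose estimation rendered as an OpenPose stick figure.
pose_detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_map = pose_detector(image)
```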
The estimated depth map, the light map, and the estimated pose all feed forward, and from the segmented apparel we also extract an understanding of the garment. Along with all of that goes the identity you saw selected, which is actually a certain form of low-rank adaptation (LoRA), together with the prompt carrying any desired features you may have. And for a lifestyle image, the reference image of the background goes in as well; for the earlier image, we gave a reference of a simple textured wall.
All of this goes into the conditional generation. After it generates, the apparel that was segmented has to come back: it is warped onto the generated model and blended in.
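As a rough analogue of this conditioning pattern with public tools: multiple ControlNets (depth and pose) plus an identity LoRA on Stable Diffusion. This only illustrates the idea; the light-map control and the actual identity models belong to the patented stack:

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnets = [
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth",
                                    torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose",
                                    torch_dtype=torch.float16),
]
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnets,
    torch_dtype=torch.float16,
).to("cuda")
pipe.load_lora_weights("path/to/identity_lora")  # hypothetical identity LoRA

depth_map = Image.open("depth_map.png")  # from the estimation stage above
pose_map = Image.open("pose_map.png")

result = pipe(
    prompt="full-body photo of a woman, simple textured wall background",
    image=[depth_map, pose_map],  # one control image per ControlNet
    num_inference_steps=30,
).images[0]
```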
Then a couple of other things happen, which are very interesting. When the image generation happens, the hands and feet almost always get deformed; in most cases these models are just not there yet. So what we do is take the structure of the hand from the input image, because that is where it comes from.
But the texture comes from the output image. Say my input is a white person and the output is a dark-skinned person: you want the structure of the hand to follow the initial image, but the texture to follow the dark-skinned output.
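A minimal sketch of that structure-from-input, texture-from-output idea, assuming the hand regions are already cropped: keep the input hand's detail and match its color statistics to the output person's skin via a Reinhard-style transfer in LAB space (the production module is proprietary and certainly more involved):

```python
import cv2
import numpy as np

def fix_hand(input_hand: np.ndarray,   # HxWx3 uint8 crop from the input image
             output_skin: np.ndarray   # HxWx3 uint8 skin patch from the output
             ) -> np.ndarray:
    src = cv2.cvtColor(input_hand, cv2.COLOR_BGR2LAB).astype(np.float32)
    ref = cv2.cvtColor(output_skin, cv2.COLOR_BGR2LAB).astype(np.float32)
    # Match per-channel mean and std, shifting tone while keeping structure.
    for c in range(3):
        s_mean, s_std = src[..., c].mean(), src[..., c].std() + 1e-6
        r_mean, r_std = ref[..., c].mean(), ref[..., c].std() + 1e-6
        src[..., c] = (src[..., c] - s_mean) / s_std * r_std + r_mean
    src = np.clip(src, 0, 255).astype(np.uint8)
    return cv2.cvtColor(src, cv2.COLOR_LAB2BGR)
```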
That goes back in, and the hands and feet get fixed. After that come the final adjustments: sometimes, because of the lighting, a shadow needs to be generated, otherwise you do not feel that immersion; so there is a shadow generation module. And finally there is an enhancement module if required: we normally generate at 1024 × 1536, but if you want 2x or 4x of that, we run a kind of AI upscaler on top of it.
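For the enhancement step, one off-the-shelf stand-in would be a diffusion upscaler; the production upscaler is not named in the talk:

```python
import torch
from PIL import Image
from diffusers import StableDiffusionUpscalePipeline

pipe = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
).to("cuda")

low_res = Image.open("generated_1024x1536.png")  # hypothetical base generation
high_res = pipe(prompt="high quality fashion photo", image=low_res).images[0]
high_res.save("generated_4x.png")
```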
So, that is the broad architecture, which is of course patented.
Coming back to this, this is the moment of truth.
As you can see, these shadows all had to be generated.
You can see this is perfectly immersed; that normally does not happen.
These things are difficult to generate with vanilla AI.
That is what we are working on, and we are still working on it.
Now, coming to where we have contributed algorithmically.
As I said, there is the light-map-based control, which is why the generations come out so well immersed. Then there is human consistency: one image can come out beautifully, but in a catalogue, as you may have seen, there will be four poses of the same person. When you replace them with AI, it has to be exactly the same identity from all those angles. That is one thing.
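A rough sketch of how one might hold identity fixed across catalogue poses with the public pipeline from the earlier sketch (`pipe` with its identity LoRA loaded): keep the LoRA and the noise seed fixed and vary only the pose conditioning. This is an illustration, not NeuroPixel's consistency method:

```python
import torch
from PIL import Image

pose_maps = [Image.open(f"pose_{i}.png") for i in range(4)]  # four catalogue poses
depth_map = Image.open("depth_map.png")

images = []
for pose in pose_maps:
    # Re-seed per pose so every generation starts from the same noise.
    gen = torch.Generator("cuda").manual_seed(42)
    out = pipe(prompt="full-body photo of a woman, studio background",
               image=[depth_map, pose], generator=gen)
    images.append(out.images[0])
```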
There are also consistent hairstyles and hair length, the hand and feet fixing I discussed, the fine-tuned segmentation of the apparel I talked about, and then the blending. I will show you some of these.
Here you can see one result with no light-map conditioning, and here one with the light map as a condition; the immersion is far better with it. Similarly, if you look at this one, the generated hand is deformed, whereas here the hands are perfect.
These are minute details, but for fashion they are very important.
You can see the line here in the segmentation, while here there is no line. Broadly, these are places where we have applied some traditional image processing algorithms.
Here also, when you put the apparel back, it sometimes happens that the sizes are not exactly the same, so you can see this gap. What we do is blend it.
If you look here, there is no gap. What happens is that we detect that gap, do a context-aware generation (inpainting) for it, and on top of that we do matting. This is what I talked about.
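A simple sketch of that gap fix with classical tools, assuming the gap mask is already known: context-aware inpainting (Telea's method) to fill the gap, then a feathered blend as a crude stand-in for the matting step:

```python
import cv2
import numpy as np

def fill_gap(image: np.ndarray,    # HxWx3 uint8 composited image
             gap_mask: np.ndarray  # HxW uint8 mask of the gap (255 = gap)
             ) -> np.ndarray:
    # Fill the masked gap from the surrounding image context.
    filled = cv2.inpaint(image, gap_mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
    # Soften the transition around the repaired region.
    alpha = cv2.GaussianBlur(gap_mask.astype(np.float32) / 255.0, (9, 9), 0)[..., None]
    out = alpha * filled.astype(np.float32) + (1.0 - alpha) * image.astype(np.float32)
    return out.astype(np.uint8)
```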
That is all. Thank you.