So before we start with what I'm going to talk about, which is persona-based simulation, I'm gonna start with a bit of motivation.
So there's a well-known quote that 2024 is gonna be the year of RAG. And just to keep you up to date, now it's 2025, and RAG is still not operational.
So we see a lot of attempts, but did anybody here get a chance to implement RAG? Does anybody here have RAG working in production?
So there's a reason for that, and it's that RAG stands for retrieval-augmented generation, but apparently the R part, the retrieval, is the really tricky one.
So we are going to talk about how we can use generative AI, the G, to help the R, and not vice versa. So with that in mind, let's see how we can do that.
So my pitch is that we're using generative AI upside down, the wrong way. We actually need to focus on how we can use LLMs to make search better, and then we can use RAG and all of the other beautiful theoretical frameworks.
Okay, so this is our agenda for today. We are gonna rush through that.
I hope we'll make 50% of that. This talk can last for 30 minutes, so I'll try to speed through the more important things.
Let's start.
This is me. While this might not be obvious, I'm human.
And if you didn't recognize already by the accent, I'm also Israeli. I moved to the States last year and still trying to figure out how to speak English, just kidding.
And I started working on AI, or back then computer vision, at Microsoft, working on the Kinect. It was state-of-the-art computer vision in 2010.
Then I transitioned into natural language processing, working at AT&T and Yahoo Mail until eventually I co-founded BestPractics, a legal tech startup that was founded in 2017 and acquired in 2020. And then I founded Argmax, a natural language processing-focused consulting firm.
This is a partial list of our clients and a partial list of our employees.
And with that in mind, let's talk about search. So did any of you get a chance to implement classical search architectures, Elasticsearch, something like that? So I see a few hands being raised.
This is how you typically see search being architected. We have a list of articles. This could be a knowledge base, medical articles, products.
We pre-process them, split them into chunks, paragraphs, and we clean the data, we enrich it. And at the end of the day, we have documents that we want to index in a vector database. And for the purposes of this talk, Elasticsearch, Lucene are also vector databases.
Once we have these documents, in real time, we're going to query by keyword or any other technique and find the most relevant items.
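As a toy illustration of that pipeline, here is a minimal sketch: articles are split into chunks, indexed, and then queried by keyword. The in-memory inverted index is a stand-in for Elasticsearch, Lucene, or a vector database; none of this is a real client system.

```python
# Toy version of the classic search pipeline: preprocess articles into
# chunks, build an index, then query for the most relevant items.
from collections import defaultdict

def chunk(article: str) -> list[str]:
    """Split an article into paragraph-like chunks."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]

def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Inverted index: keyword -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word.strip(".,?!")].add(doc_id)
    return index

def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Rank document ids by how many query keywords they contain."""
    scores = defaultdict(int)
    for word in query.lower().split():
        for doc_id in index.get(word, ()):
            scores[doc_id] += 1
    return sorted(scores, key=scores.get, reverse=True)
```

Keyword matching like this is exactly what fails on the "yellow eyes" example later on, which is the gap the LLM enrichment is meant to close.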
Now, the most trivial way to kind of incorporate LLMs into that pipeline
is to use ChatGPT or any other LLM to enrich the data. This is typically done by a list of questions. So using LLMs to answer questions, that's like the most straightforward thing we can do to make search better.
For example, what is this article about? Please summarize the data. Who is the target audience for this article?
And it's a very useful technique.
And one paper that I think is being underutilized is the HyDE paper. Has any of you heard about HyDE, Hypothetical Document Embeddings?
Cool, so I see like two hands being raised. I think HyDE is extremely underrated and RAG is overused.
So I think you should really consider that for your next architecture. The main idea behind HyDE is: what if we used an LLM not to answer questions, but to ask them? So that's the main idea.
Now, if we have the same knowledge base, let's assume, for example, a list of medical providers or conditions or articles, then instead of enriching it with additional information using an LLM, we are going to ask questions.
And then we are going to use the questions and index them.
Now that seems kind of weird, but it's actually more useful.
And let's see a brief example.
So for example, if a patient encounters yellow eyes, yellow skin, that patient is more likely to query something like, hey, I have yellow eyes. What is that? What does that condition imply?
And on the other hand, we have the medical jargon over here. This is a real article taken from WebMD. It doesn't have the keyword yellow eyes listed anywhere. It has a lot of Latin words, like, I don't know how you pronounce them. But this article wouldn't show up in the search results when the user searches for yellow eyes.
There are a lot of examples like this, and they are very common in the legal domain and in the medical domain. This is a language gap, essentially a different dialect.
We have the medical or legal dialect, and an LLM can help us bridge that gap. So that's the main motivation behind using HyDE.
So how do we implement it in practice?
We have our medical article.
We use ChatGPT to ask a question, for example. And keep in mind, we want to use various prompts. We want relatively short queries, because that's what users are likely to ask; ChatGPT can be pretty chatty, so we want it to be concise. And another important parameter is to use a high temperature, because we actually want to see a variety of questions for which this article is the answer.
So that's how we use an LLM in practice.
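Here is a rough sketch of what that HyDE-style step could look like. `ask_llm` is a stand-in for a real chat-completion call, and the prompt wording and temperature value are illustrative assumptions, not actual production prompts.

```python
# HyDE-style enrichment sketch: for each article, ask an LLM for several
# short, patient-style questions (high temperature for variety), then
# index the *questions*, each pointing back to its source article.

def ask_llm(prompt: str, temperature: float = 1.0) -> str:
    # Placeholder for a real chat-completion API call.
    raise NotImplementedError

HYDE_PROMPT = (
    "You are a patient with no medical background. "
    "Ask one short question (under 10 words) for which the "
    "following article is the answer:\n\n{article}"
)

def generate_questions(article: str, n: int = 5, llm=ask_llm) -> list[str]:
    # High temperature so the n questions are varied, not near-duplicates.
    return [llm(HYDE_PROMPT.format(article=article), temperature=1.2)
            for _ in range(n)]

def index_questions(articles: dict[str, str], llm=ask_llm) -> list[tuple[str, str]]:
    """Return (question, article_id) pairs ready to embed and index."""
    pairs = []
    for article_id, text in articles.items():
        for question in generate_questions(text, llm=llm):
            pairs.append((question, article_id))
    return pairs
```

The `(question, article_id)` pairs would then be embedded and indexed so that a layperson query like "I have yellow eyes" can match the question rather than the jargon-heavy article text.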
It's worth mentioning that we don't always use ChatGPT. There are a lot of LLMs, but for the purpose of this talk, ChatGPT is the generic LLM.
And let's see how we can use this to evaluate and improve a search system. So evaluation is always challenging. And in theory, there are two ways.
Either you run an online experiment: we let people play with the system, and then we look at their behavior and measure the results. But it's hard to automate, and online evaluation doesn't really scale.
So we'll say a few words about online experimentation.
Typically, when we think about online experimentation, we think about A/B testing, where we split the population into two buckets, and each bucket is gonna get a different treatment.
And this part is pretty challenging with search, because unlike predicting click-through rate or other binary measures, measuring search is trickier. There are a lot of different measures, and there are a lot of them because it's hard to optimize for different things.
Some of them measure the ranking, some of them the precision, but in most cases we care about revenue, if we can directly link search results to revenue; typically that's more common in e-commerce. And measuring search is hard.
What we see with clients is that we end up seeing these dashboards with a lot of different KPIs that no one can understand. So that's how typically search systems are being monitored. At the end of the day, we see some kind of mix and match to have some sort of a unified KPI, which is sometimes helpful, sometimes it's not.
And online experimentation is really challenging. So that's one thing you can get out of this talk: online experimentation is hard to scale with search.
An additional reason why search is different from testing a landing page is that typically users think about something and try to formulate it in their own words.
For example, John over here can have a migraine, and, I don't know, Kathy over here can have a neck pain. Both of them are gonna search with the same query. So we have a latent state here that we cannot observe.
And that makes things even more complicated. And I'm not going to go into much details. I'm just going to say that typically we tend to look at search sessions.
If you look at Google, it's really hard to understand anything about user intentions from a single query. A lot of PhDs wrote a lot of words about session-based search recommendation. But we're going to highlight a different approach.
What if we were to use LLMs as simulated users?
So the main idea is as follows. I have an item that is in my catalog, for example, this ice cream. I have a user that is thinking, a simulated user that is thinking about that item.
And then I can use an LLM to generate the query that that user is gonna search for. It can be a word or two, it can be a description of the image. We can be very creative with how we formulate that query.
So for example, the query might be something sweet, or something that's completely off. Now, this is where LLMs and gen-AI become very, very important. We can use an LLM to describe the image, for example.
We can use it to transcribe and describe the video of an item. We can use it to ask questions, and we can even inject some context. For example, an example prompt would be: you are a 24-year-old woman interested in something sweet.
Please describe the following item. So it's a good opportunity for us to build audiences.
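A minimal sketch of how such a persona prompt could be assembled. The fields (age, gender, intent) are illustrative assumptions, not an actual production schema.

```python
# Persona-conditioned query generation: inject persona context into the
# prompt so the LLM formulates the query the way that user would.

def persona_prompt(age: int, gender: str, intent: str, item_description: str) -> str:
    """Build a prompt asking the LLM to act as a specific persona."""
    return (
        f"You are a {age}-year-old {gender} interested in {intent}. "
        f"In a few words, write the search query you would type "
        f"to find the following item:\n\n{item_description}"
    )
```

The resulting string would be sent to the LLM, ideally with a high temperature, so the same item yields different queries for different personas.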
Okay, how much time do I have? Take your time. Okay, cool. Because I can go on and on.
So thank you so much for sticking until 10 p.m. It's going to be a long night.
So how do we define a persona? We have several parameters. First is what are we going to ask about?
We can ask about the image. We can ask about the text. We can use the context.
And what's going to be the prompt? Are we going to filter it according to some item parameters? Maybe certain items are available in only certain geographies. So we have a lot of hyperparameters to control for.
But I think that the most important thing is that persona-based simulation is measurable.
If I'm trying to describe item X, then it's really easy to measure: did item X appear in the search results or not? You don't need to go into NDCG and all of those different and hard-to-understand metrics, especially if you need to explain them to your manager.
So this is how the simulation scheme works. We have the item, we define a query with our desired persona. We trigger the search system and we measure the result.
This is actually a screenshot, some sort of screenshot, from one of our active clients, where we have several personas. And what we see here is the precision at K, meaning: was the item that we imagined the top result, yes or no, or in the top three, top five, top seven?
So that's a pretty easy, simple to understand measure. And we can understand what personas work and don't work, and we can translate that into the valuable personas for our business.
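Because we know exactly which item the simulated persona had in mind, the measurement step can be as simple as this sketch (item and result identifiers are assumed to be plain strings):

```python
# Measurement step of the simulation loop: did the target item show up
# in the top k results? No NDCG needed.

def hit_at_k(target_item: str, ranked_results: list[str], k: int) -> bool:
    """True if the item the persona imagined appears in the top k results."""
    return target_item in ranked_results[:k]

def precision_at_k(trials: list[tuple[str, list[str]]], k: int) -> float:
    """Fraction of simulated queries whose target item appeared in the top k.

    Each trial is (target_item, ranked_results) from one simulated search.
    """
    hits = sum(hit_at_k(target, results, k) for target, results in trials)
    return hits / len(trials)
```

Aggregating this per persona gives exactly the kind of dashboard described above: which personas find their items, and which don't.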
Uncovering intentions is hard.
And another interesting use case we use personas for is making suggestions to users during the autocomplete stage. Now, if we use persona-based simulation in the autosuggest stage, we can try and poke around: what is the user actually looking for?
Now, this is not a new technique. Google has been doing this for years.
But the persona-based simulation helps us to understand, to classify the user as she clicks on the suggestions. So it's a good opportunity to get more information and shorten the typing time.
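One way this could be sketched, under the assumption that each persona pre-generates its own candidate queries offline: the suggestion a user clicks then maps back to a persona. The data shapes here are illustrative, not Google's or any production system's.

```python
# Persona-aware autosuggest sketch: offline, each persona generates its own
# candidate queries for the catalog; online, the suggestion the user clicks
# tells us which persona she likely belongs to.

def build_suggestions(persona_queries: dict[str, list[str]]) -> dict[str, str]:
    """Map each suggested query back to the persona that generated it."""
    suggestion_to_persona = {}
    for persona, queries in persona_queries.items():
        for query in queries:
            suggestion_to_persona[query.lower()] = persona
    return suggestion_to_persona

def suggest(prefix: str, suggestion_to_persona: dict[str, str], limit: int = 5) -> list[str]:
    """Prefix-match autocomplete over the persona-generated queries."""
    p = prefix.lower()
    return sorted(q for q in suggestion_to_persona if q.startswith(p))[:limit]

def classify_click(clicked: str, suggestion_to_persona: dict[str, str]):
    """Infer the user's likely persona from the suggestion she clicked."""
    return suggestion_to_persona.get(clicked.lower())
```

So a single click gives two things at once: a shorter typing session for the user, and a persona signal for us.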
Okay.
So I've kind of tried to make it as quick and concise as possible. We do have a live demo, but those typically go wrong. So we can give it a shot. But I'll take the time in the meantime for questions.
Most of this is done in text space or token space, not in the attention space? All of this is done in text space, yeah.
Have you tried to test how the personas generated with the LLM differ from actual personas, from humans, from people? So I mean, seeing how the LLM's impersonation of the old woman behaves compared to the actual old woman? So this is a great question. The question was: do LLM personas have anything to do with real users?
So I just came back two weeks ago from Haystack. Haystack is a search relevance conference in Virginia. I presented this talk there, and while I thought I was being pretty creative, 50% of the conference talked about the same thing: about using LLMs as a judge, about using synthetic raters. And actually the videos are out, so you can watch them.
But roughly 50% of the conference, some of them in the legal domain, some in the medical domain, some in e-commerce, found that human annotators also have roughly 70% inter-annotator agreement, depending on the domain, which means that if you take two users from the same audience, they're not going to agree on everything. And the LLM-human agreement is roughly the same. Now obviously, it changes from one domain to another, but overall, it seems to be a valid approach.
Any other questions? Can you evaluate whether it actually improves the user experience, augmenting your data with an LLM-generated query? Sure, so the question was: do LLM-generated queries help the experience?
And the short answer is we wouldn't have recurring business if it didn't. But yeah, I guess that query expansion and enrichment is really kind of an old art form, right? And LLMs make it a lot easier. So it does improve, but at the end of the day, if your prompts aren't any good, the augmentation is not going to be any good. So we kind of throw that challenge onto the users, if that makes sense.
Any other questions? Oh, yeah, I think you have time for a demo. So let's give it a shot. Five minutes, I guess, is more than enough.
That's demo, no? No, I mistyped the domain, it's fine. Okay, so what we see here is results from one of our live customers.
This is the online metrics. For example, we see that people, when they log into that hospital, search for MyChart for understandable reasons. We see several specialists. And actually, that specific client does not hire a dermatologist, so it's really interesting.
Now, let's talk about the online metrics, which look something like that. Now it's really hard to understand what is going on with that information, right? What do we need to improve as the website owner? What do we need to improve in the webpage, in the search, in order to drive more traffic and more conversions? And that's how you typically see search analytics going on.
We let the users define several personas and apply them to categories or specific items. And what you see here essentially is a prompt template. This is the generic, the classical HyDE persona: given the following post content, please ask a question about that. We also have more specific questions. Now, our users define these personas.
Not all of them are in use, but it's a really dynamic system. Let's see if the offline simulation works. Obviously, it doesn't. Everything is zero. But hey, there's no live demo without live crashes, right? I guess I'll end it here.