My name is Ivan, and I'm head of AI at Anthill. I've been in this position for about three years now, and what I do, essentially, apart from messing up the presentation, is build use cases and business cases for artificial intelligence for clients in Germany, the US, and Bulgaria. So my job is to find the most practical, quickest solution to a particular use case, work out what's possible, fit it within some budget and time constraints, and develop it into a working prototype or working product.
The topic I've chosen is targeting audiences and personalizing content in RAG systems. But you can just think about it as a practical way of thinking about how we work with large language models to solve very specific, complex tasks, and how to control what's coming out of them.
Probably everyone who's tried AI has faced the issue of hallucinations. LLMs have a bunch of issues: they are very eager to please, and this sometimes leads to jarring results, and sometimes to very irrelevant results, whenever you try to embed one into a system.
To start with, I'm going to keep this very practical, but I want to give a bit of background on where we're coming from. I would like to imagine what text really is. Text is a very, very weird thing. It's a linear progression of words, but those words have a very non-linear dependence in how exactly their meanings interact.
Take a book, your favorite book, Harry Potter or Dostoevsky or something. Some word on page 34 can inform the entire meaning of a character whose actions come on page 205. The meaning of what's going on in the text is so irregular and so interdependent that it's really hard to even imagine how meaning is constructed in text.
Yet what we do with large language models is put that meaning into a mathematical model. First, and thank you, Sergei, for giving the overview of how large language models work, we embed the words into a vector. Some of the best models out there work with a vector of 512 numbers or 4,000 numbers or something like that, and that vector needs to hold all the meaning that's within the text. That's a difficult thing to do, and you can imagine why large language models are unstable: small differences in the input necessarily need to lead to large differences in the output. That's what we call a discontinuity, right? The large language model needs to do that, and that leads to instability.
I'm going to lead into what a RAG is. I mean, I can assume that most of you know what a RAG is, but I'm going to explain it anyways.
So you have, essentially, a user who's interacting with a large language model, and they need to access information, right? Let's go back to the book case. If you want to ask a large language model who Ron Weasley from Harry Potter is, you cannot just put the whole book in the prompt. That's the problem, essentially. So you need a way to extract information from the book and then put the relevant information in the prompt, so that the large language model has the context of the query.
If you ask who Ron Weasley is, the model needs to get the relevant passages from the book that let it answer your question. So this is the way a RAG works. It's retrieval-augmented generation.
And you have this user, the user queries, and you have this vector database. And the vector database is essentially a database of embeddings. So what an embedding is, is these vectors that encode the meaning of words, right?
So... I'm not going to get too technical here, but the idea is that if two passages are close in meaning, they're close as vectors. So that's what a vector database does. It gives you what's closest in meaning to the query.
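As a rough illustration of "close in meaning means close as vectors", here is a minimal Python sketch. The three-number vectors and the passages are made up for the example; a real system would get its vectors from an embedding model and keep them in a vector database, and cosine similarity is just one common choice of closeness measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings". Real models use hundreds
# or thousands of dimensions; these numbers are invented.
passages = {
    "Ron is Harry's best friend": [0.9, 0.1, 0.0],
    "Ron's family lives at the Burrow": [0.8, 0.2, 0.1],
    "Voldemort returns": [0.1, 0.9, 0.2],
}

def nearest(query_vector, store, k=2):
    """Return the k passages whose vectors are closest to the query vector."""
    ranked = sorted(
        store,
        key=lambda text: cosine_similarity(query_vector, store[text]),
        reverse=True,
    )
    return ranked[:k]

# Stands in for the embedding of the question "Who is Ron Weasley?"
query = [0.85, 0.15, 0.05]
top = nearest(query, passages)
print(top)
```

With these toy numbers the two Ron passages come back and the Voldemort passage does not, which is the behaviour you want from the lookup.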
So if you ask about Ron Weasley, it's going to look for everything that's going on with Ron Weasley and pull out that information, right? And that's a difficult thing to do, and it's going to lead to some of the issues I'm going to talk about.
And you have the data sources that go into that vector database, right? So if you have the whole Harry Potter series, those are data sources. Every page of that is a data source, and it all needs to be embedded into this vector database so it can be queried.
It is a database, right? You just query it using a more text-specific method. It pulls out the relevant information, and then the user query and the relevant information go into the prompt, and at the end you just ask the large language model a very basic question. You already have the answer in the prompt, so the large language model is unlikely to hallucinate at that point.
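The retrieve-then-prompt flow just described can be sketched in a few lines of Python. This is a simplified sketch, not our production code: the word-overlap `retrieve` function is only a stand-in for the vector database lookup, and the prompt wording is an assumption.

```python
import re

def words(text):
    """Lowercased set of words in a text."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, passages, k=2):
    """Stand-in for the vector database: rank passages by shared words."""
    query_words = words(query)
    return sorted(
        passages,
        key=lambda p: len(query_words & words(p)),
        reverse=True,
    )[:k]

def build_prompt(query, context):
    """Put the retrieved passages and the user query into one prompt."""
    lines = ["Answer the question using only the context below.", "", "Context:"]
    lines += [f"- {passage}" for passage in context]
    lines += ["", f"Question: {query}"]
    return "\n".join(lines)

passages = [
    "Ron Weasley is Harry's best friend.",
    "Ron Weasley has six siblings.",
    "Hogwarts is a school of magic.",
]
query = "Who is Ron Weasley?"
prompt = build_prompt(query, retrieve(query, passages))
print(prompt)
```

The resulting string is what would be sent to the large language model; the answer is already sitting in the context, so the model only has to read it back.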
So really, where a RAG system may fail is in pulling out the relevant information. An obvious use case you might encounter is user support. You have a whole database of maybe 5,000 or 50,000 documents that outline different technical support cases, and you have a user who's asking, you know, "my thing doesn't work", and you need to pull out the specific information from those 50,000 files. Then the large language model can give them instructions on how to fix their own thing, so you can basically skip the manual part of interacting with the user. Chatbots, all of those things are prevalent nowadays.
I'm going to show an interesting take on how you can use a RAG. But here are some of the limitations, and you can encounter those very easily.
Even on Amazon's website, AWS, Amazon Web Services, you have a chatbot where you can ask a question and it's going to try to answer it. And it sucks. It's very, very bad, because it has too much information and because the system itself is very complex.
So RAG systems have these inherent limitations. So what are the limitations? First off, you cannot just put the whole Harry Potter series or all the 50,000 documents for tech support into a single vector. That doesn't make sense. You're not going to be able to put it in the prompt, and it also just doesn't work, because embeddings take a specific chunk of information.
So, larger articles and documents need to be stored in chunks, which already leads to a problem, which is, if you give me just a page of Harry Potter, I'm not gonna be able to understand it. It loses context. Some articles, some documents need that additional context. They come in large pieces of text.
Documentation usually comes in 200 pages, 300 pages. And you're going to break that up into maybe two or three paragraphs per chunk. It's going to pull up those chunks, but those chunks don't give you the whole information. And that's why it's going to basically give you a nonsense answer.
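To show how chunking loses context, here is a small Python sketch that splits a document into chunks of a couple of paragraphs each. The router manual text is invented for the example, and splitting on blank lines is just one simple chunking strategy.

```python
def chunk_paragraphs(text, paragraphs_per_chunk=2):
    """Split a document into chunks of a fixed number of paragraphs."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + paragraphs_per_chunk])
        for i in range(0, len(paragraphs), paragraphs_per_chunk)
    ]

# Invented documentation snippet; only the first paragraph names the product.
manual = """Router X200 manual.

To reset the device, hold the button for ten seconds.

Wait for the light to blink.

Then reconnect the cable.

If the problem persists, contact support."""

chunks = chunk_paragraphs(manual)
for number, chunk in enumerate(chunks):
    print(f"--- chunk {number} ---\n{chunk}")
```

The second chunk says "wait for the light to blink, then reconnect the cable", but nothing in it says which device or which procedure it belongs to. That missing context is exactly what gets lost.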
Noise. Noise is very, very bad. Because, as we said, text needs to be embedded into a single vector. Maybe you have a whole page that needs to be embedded into a single vector. That vector needs to represent everything that that page talks about. And if that page talks about two different things, it talks about Ron Weasley and Voldemort, it's going to add noise.
If you're asking about Ron Weasley, that page might hold very relevant information about Ron Weasley, but its vector is going to tilt towards Voldemort, and you might not pull the page out. So when you try to pull out the information, the most relevant things might not come up because of noise.
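Here is a toy Python illustration of that tilt. It assumes, as a deliberate simplification, that the embedding of a page behaves like a weighted average of the directions of the topics it covers; real embedding models are not that simple, and the topic vectors and weights are invented.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mix(topic_vectors, weights):
    """Weighted average of topic directions (assumed model of a page embedding)."""
    dims = len(topic_vectors[0])
    return [sum(w * v[d] for v, w in zip(topic_vectors, weights)) for d in range(dims)]

# Invented topic directions in a 3-dimensional toy space.
RON = [1.0, 0.0, 0.0]
VOLDEMORT = [0.0, 1.0, 0.0]
QUIDDITCH = [0.0, 0.0, 1.0]

query = RON  # "tell me about Ron Weasley"

# A page with the key Ron passage, buried in a chapter mostly about Voldemort.
mixed_page = mix([RON, VOLDEMORT], [0.3, 0.7])
# A page that is mostly about Ron in general, with a little Quidditch.
generic_page = mix([RON, QUIDDITCH], [0.8, 0.2])

score_mixed = cosine(query, mixed_page)
score_generic = cosine(query, generic_page)
print(round(score_mixed, 3), round(score_generic, 3))
```

The page with the key passage scores much lower than the generic page, so under this toy model it is the one that drops out of the results.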
And failure to deal with multiple candidates. That's another very big issue: it's a needle in a haystack, right? If you're asking a question on AWS about S3 or EC2 or some other service they have, and there are something like 500 articles that talk about it, the system might pick out 50 or 100 articles on that service. But from then on, picking out the exact article that's relevant to you is the big issue.
And that is going to be very susceptible to noise. It's going to pick out stuff that's generally relevant, but it might not pick the stuff that's most relevant. That's basically the biggest issue, and it scales terribly. The more documents you have, the more likely you are to see these two specific issues: there are going to be more documents with noise, and there are going to be more and more candidates from which you have to choose the best one.
With that very sharp segue, I'm going to talk about our case. Our case was in media, and our client was a very big German media conglomerate. We can't say exactly which one, but there aren't that many of them. These guys are so big, they have something like 200 different brands, right?
And each brand has its own articles. They very often talk about the same thing, but they cannot make their own recommendation system, because the brands are very siloed: there is not enough business communication between them, and there is no single content management system across the whole conglomerate. So what they want is that you click on one of their brands, looking for an article on, I don't know, Bill Gates or something, and then that article recommends articles from other brands. So you click on a tech article and it brings you the latest news article about Bill Gates, right?
And that was the issue. They had all this content, but they couldn't really track users across their different brands. So we needed to start with no data about what's a good recommendation and what isn't, which is called a cold start. And that's very bad for data scientists, because what do you do when you have no data? And it needs to be cross-brand.
And we essentially said, yeah, but all we have is the content of the article itself. So it needs to be content-based recommendation. And finally, because they're Germans and they're really ethical, they wanted it to be explainable. There were some issues with Netflix recently. Their recommendation engine was very, I would say, pay to win. So if you pay Netflix, it's going to recommend your shows more, and there was no visibility of that. So users started hating it. These users had a negative reaction to the whole brand because of it. And these guys wanted to avoid this. They wanted to give you an explanation of why exactly we're recommending this to you, so you have a better experience and you have more trust in the brand.
And on this we worked with Identrix, which I want to shout out. It's another Bulgarian company. They work in media intelligence, great people, very professional, with a great data science team. So they worked with us, and they provided some of the models that we're going to talk about a bit later. And here we ran into a problem.
It's the issues we talked about earlier. There are too many articles, and the articles talk about too many things. You have an article about Elon Musk buying Tesla, and that article, for obvious reasons, has a paragraph about climate change. So when you try to get recommendations from news about climate change, it gives you an article about Elon Musk, and you don't want that. You want to learn more about climate change. So the issue with RAG systems, as they are usually implemented, is that you put the whole text in, you chunk it, it splices it, and it gives you all the noise along with all the relevant information.
So what we thought about was, as a data scientist, you usually work with features. You pre-process the data, and you give the model only the most important thing that they need to think about, right? If you have information about all the house prices in Sofia, you might want to give the average house price in Sofia and the maximum house price in Sofia and not give the model all the other noise around it. Just because it's going to pick up on noise, it's going to hallucinate, it's going to make up stuff.
So data features, or rather feature engineering, is a basic part of working with data. Features are extracted, pre-processed parts of the data. They come into direct contact with the model, and they carry distinct meanings about the data. When features are optimized properly, they don't overlap too much. You don't give the model the same average housing price twice, or something like that.
And we thought about, okay, what's the equivalent of that in text? Can we leverage that somehow? And there's a lot of things, features of text that are extractable but people don't think of them as features, they think of them as outputs of models. So you might want to extract all the people that are named in an article or the companies, or the geographical locations, those are called named entities. And there is a whole science around that. It's called named entity recognition and named entity extraction. And usually people do that as an output.
You have sentiment analysis, which is another thing: do people like your product or not? That's another output, but it can also be an input. You have themes, topics, keywords, you have summaries and abstracts of text, and you have style. And when you think about orthogonality, or carrying distinct meanings, the content or the summary and the style are as orthogonal as they can get. The very fact that you can ask, for example, ChatGPT, "please explain quantum mechanics in the style of Shakespeare", and it can do that, means that the content of the quantum mechanics itself and the style in which it's written are two very separate things. You can control those separately, and that's one of the things we leveraged a bit later in what we did.
So, this is our sort of augmented RAG system for recommendations. We extracted all these features. There are more of them, but I didn't want to clutter the whole space. We extracted the article title. We extracted the keywords. We extracted the summary of the article. We extracted named entities such as Elon Musk or Trump or whatever.
We did the same with the candidates, right? So this single article is the one the user has already clicked on, and you want to find similar articles in a whole database of maybe 500,000 articles. Instead of looking up a single metric of what similarity means, where you just take the text and compare it with the whole text of all the other articles, you create these four different databases of, essentially, different features of the text. You embed these features and you have four RAG systems. Except they're not four RAG systems, they're actually 16 RAG systems, because you can ask, okay, for the summary of the original article, which title is closest to the summary? So you can mix and match everything with everything.
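The "four features, sixteen lookups" idea can be sketched in Python like this. It is only a sketch under assumptions: the word-overlap `similarity` function stands in for comparing embeddings, and the two example articles are invented.

```python
import re
from itertools import product

# The four features named in the talk; the real system had more.
FEATURES = ("title", "keywords", "summary", "entities")

def tokens(text):
    """Lowercased set of words in a text."""
    return set(re.findall(r"\w+", text.lower()))

def similarity(a, b):
    """Jaccard word overlap, standing in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def cross_feature_scores(source, candidate):
    """Score every source feature against every candidate feature (4 x 4 = 16)."""
    return {
        (s, c): similarity(source[s], candidate[c])
        for s, c in product(FEATURES, repeat=2)
    }

# Invented example articles.
source = {
    "title": "Trump on climate policy",
    "keywords": "climate policy election",
    "summary": "Trump sets out a new position on climate change policy",
    "entities": "Donald Trump",
}
candidate = {
    "title": "Climate change report draws a response",
    "keywords": "climate report",
    "summary": "Trump responds to a new climate change report",
    "entities": "Donald Trump",
}

scores = cross_feature_scores(source, candidate)
print(len(scores), scores[("summary", "title")])
```

Each key in `scores` is one of the sixteen pairings, for example `("summary", "title")` for "which title is closest to this summary".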
And it turns out that approach was so much more stable because you could ask, okay, you know, give me the top 100 articles by summary, you know, the articles that are closest in terms of summary. And that already takes out a lot of the garbage because summaries are 100 words. And if, you know, it's an article about Elon Musk and it talks about, you know, there's one paragraph about climate change, it's not going to put that in. So, summaries keep the general idea. And then, if, you know, you're reading an article about Trump and climate change, you can then ask, okay, what are the named entities in that whole article and do they match any of the hundred candidates? And now if they match, oh, that's completely different. Now you're talking about a recommendation on climate change and Trump.
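That two-step flow, a shortlist by summary and then a check on named entities, might look roughly like this in Python. Word overlap again stands in for embedding similarity, the candidate articles are invented, and the shortlist is cut to 2 instead of 100 so the example stays small.

```python
import re

def tokens(text):
    """Lowercased set of words in a text."""
    return set(re.findall(r"\w+", text.lower()))

def similarity(a, b):
    """Jaccard word overlap, standing in for embedding similarity."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def recommend(source, candidates, shortlist_size=100, k=3):
    """Shortlist by summary similarity, then re-rank by shared named entities."""
    shortlist = sorted(
        candidates,
        key=lambda c: similarity(source["summary"], c["summary"]),
        reverse=True,
    )[:shortlist_size]
    # sorted() is stable, so ties keep their summary-similarity order.
    reranked = sorted(
        shortlist,
        key=lambda c: len(source["entities"] & c["entities"]),
        reverse=True,
    )
    return reranked[:k]

# Invented articles for illustration.
source = {
    "summary": "Trump announces new climate change policy",
    "entities": {"Donald Trump"},
}
candidates = [
    {"id": "A", "summary": "Climate change policy debated in parliament",
     "entities": {"Angela Merkel"}},
    {"id": "B", "summary": "Trump responds to climate change report",
     "entities": {"Donald Trump"}},
    {"id": "C", "summary": "Elon Musk buys company",
     "entities": {"Elon Musk"}},
]

result = recommend(source, candidates, shortlist_size=2, k=2)
print([article["id"] for article in result])
```

The Elon Musk article never makes the shortlist, and within the shortlist the article that shares the named entity moves to the top.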
So we experimented a lot. There was a lot of manual QA, because you don't have real user data. But we ended up just tweaking the parameters for which feature matters most. And we ended up with a little bit of a bonus: we get the top recommendations, and because we went through this whole process, we also get a lot of additional information about how we got there. We recommended this because we're talking about the same people, or we're talking about the same topic, or we're talking about the same thing.
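One way to picture the tweaking of parameters and the "bonus" explanation is a weighted score that also remembers which features contributed. This is a hedged sketch: the weights below are invented, not the values we ended up with, and the per-feature scores are assumed to come from the lookups described earlier.

```python
# Hypothetical hand-tuned weights; the real values came out of manual QA.
WEIGHTS = {"summary": 0.5, "entities": 0.3, "keywords": 0.15, "title": 0.05}

def score_with_reasons(feature_scores, weights=WEIGHTS):
    """Combine per-feature similarities and report which ones drove the result."""
    contributions = {
        name: weights[name] * feature_scores.get(name, 0.0) for name in weights
    }
    total = sum(contributions.values())
    ranked = sorted(contributions, key=contributions.get, reverse=True)
    reasons = [name for name in ranked if contributions[name] > 0]
    return total, reasons

# Invented similarities between a clicked article and one candidate.
feature_scores = {"summary": 0.4, "entities": 1.0, "keywords": 0.2, "title": 0.0}
total, reasons = score_with_reasons(feature_scores)
print(round(total, 2), reasons)
```

Here the named entities contribute most, so the explanation would lead with "these articles are about the same people".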
So this is yet to be done, but one of the things we'd really love to do is actually put some user data in there and use ML to tweak the weights of these features. So it's going to get better and better and better. It's going to understand more and more about what it's doing. And so it worked pretty great.
And you have this recommendation. The article is about Ian Fleming's James Bond novels being reworded, and we recommend an article on changes to Poirot and Miss Marple, because Agatha Christie's novels are being revised. And those are from two different sources. One of them is just entertainment, and the other one is more of a news source. Then we could take all that information and put this prompt in, which kind of does the same thing. It extracts different parts of all the information we've gathered, and, as Sergei mentioned, we ask the model to think about the different aspects of all the information we're giving it. So, please give us a contrast between the two articles, please give us a comparison between the two articles, and please give us what's new, what's different, and all of that.
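A prompt of that kind could be assembled along these lines. The wording, the field names, and the placeholder summaries are assumptions for illustration; this is not the actual prompt from the project.

```python
def describe(label, article):
    """Render the extracted features of one article as a text block."""
    return (
        f"{label}\n"
        f"Title: {article['title']}\n"
        f"Source: {article['source']}\n"
        f"Summary: {article['summary']}\n"
        f"Named entities: {', '.join(article['entities'])}"
    )

def build_explanation_prompt(source, recommended):
    """Ask the model to compare, contrast, and explain a recommendation."""
    instructions = (
        "Using only the information above:\n"
        "1. Compare the two articles: what do they have in common?\n"
        "2. Contrast them: how do they differ?\n"
        "3. Say what is new in the recommended article.\n"
        "Then write one sentence explaining why the reader might want to read it."
    )
    return "\n\n".join([
        describe("ARTICLE THE USER IS READING", source),
        describe("RECOMMENDED ARTICLE", recommended),
        instructions,
    ])

# Titles follow the example from the talk; the other fields are placeholders.
source = {
    "title": "Ian Fleming's James Bond novels are being reworded",
    "source": "entertainment brand",
    "summary": "(summary of the first article)",
    "entities": ["Ian Fleming", "James Bond"],
}
recommended = {
    "title": "Agatha Christie's novels are being revised",
    "source": "news brand",
    "summary": "(summary of the second article)",
    "entities": ["Agatha Christie", "Hercule Poirot", "Miss Marple"],
}

prompt = build_explanation_prompt(source, recommended)
print(prompt)
```

Only the extracted features go into the prompt, not the full article text, which keeps the noise out at this step as well.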
And then we get that result as formatted text, we take out the best part, and we get this very click-baity recommendation explanation, which users liked. And that was basically my presentation.