Hi, everybody. My name is Michael. Currently, I work on API design and developer experience at Cohere. Cohere is a company that's originally from Canada, and now we have offices all over the world.
And what we do is we train language models for enterprise use cases. And previously, I worked at Amazon and at Snapchat as a software engineer.
I also play bass in a band called Good Kid. This is us playing in Manchester. It was really fun.
And in fact, today I had a pretty crazy day because right across the street, our band got nominated for a Juno Award, which is the Canadian Grammys. Pretty sick. And maybe that's why our presentation is going to be sort of loose today, because I didn't have all the time I was hoping for to prepare the slides.
But here's what I want to talk to you about today. I want to talk to you about LLM hallucinations. This is a term you might have heard before.
Generally, what it means is: you ask a question of an assistant such as ChatGPT or another one, and it just lies to you. And it lies so confidently.
Maybe let's do an example of that.
So an example I came up with is I have a friend. Her name is Leandra. She plays trumpet in a band called Lovejoy.
We toured with them before. So let's ask Coral, which is our user interface for interacting with a Cohere model. Let's ask it: who plays trumpet in the band Lovejoy?
And it just goes off: "In the TV series Lovejoy, the trumpet is played by Ian."
None of this is true. And yeah, this is just a complete fabrication.
And I think a lot of people see this and they're like, well, LLMs are garbage. What is this useful for? If I ask it a question, it's going to lie to me.
And I think the way to think about this is that the model is not trained on all the information in the world. It's not an oracle. It doesn't know everything.
All it knows is how to sound human when it talks back to you. But this is still very useful because we can ask it to look things up first before it answers the question.
So for example, if I put someone on the spot here and said, "Who plays trumpet in the band Lovejoy? You can't mess it up." Then, to not sound stupid, they might just lie. But if I gave them a computer and they could look things up first, they would probably give me the correct answer.
And so that's kind of the whole idea behind what we call RAG, which is retrieval augmented generation. Look it up first.
So here's an example here. There might be a conversation. The user says, who is the current president of the USA? And the chat bot knows. And it will say, Joe Biden is the current president.
And then the user might ask, how tall is he? So if we enabled RAG, here's what would happen. Step one, the model needs to figure out what we need to look up, and it would say, yeah, we should first maybe Google Joe Biden's height. Step two, actually do the Googling, the web search. So we go use some kind of web search engine, look up Joe Biden height, and it might find some articles: a Wikipedia article, a BuzzFeed article, something else. And now step three, go back to the model and say, OK, how tall is Joe Biden, given the following information? And now the model is going to be way, way more accurate at telling you this.
So we can actually check this by saying, Who plays trumpet in the band Lovejoy? And what I'm going to do is I'm going to click this grounding button here and enable web search. And it's saying, searching, who plays trumpet in band Lovejoy? And there we see that, first of all, Zoe, who's an excellent trumpet player, plays in that band. And here's Leandra, right there, playing trumpet with Lovejoy as well. So suddenly, this technology becomes really useful, because you can get accurate answers, or at least more accurate answers to your questions.
And right here, it gives you citations. So you can actually go and verify this information and not just trust it blindly, like in the previous case. Okay, so this exists.
This is called RAG. Here's the example with Joe Biden and his height.
Okay, well, how would you do this with code? Well, with code, we have an API. It's the chat API.
And you can say, first, get me the search queries. That's the first call: here's the question, how tall is Joe Biden, now give me the search queries to run.
Step two, go write some code that looks things up on some kind of web search and get those documents back. And then step three, collect all the documents that you found, call the chat API again, and say, OK, how tall is Joe Biden, given the following documents? And that's what powers what you saw in the UI behind the scenes.
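Here's roughly what those three steps look like with the Python SDK. This is a minimal sketch, assuming the chat API's search_queries_only and documents parameters; my_web_search is a hypothetical stand-in for whatever search engine you call in step two.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")
question = "How tall is Joe Biden?"

# Step 1: ask the model what to look up.
queries = co.chat(message=question, search_queries_only=True).search_queries

# Step 2: run the searches yourself. my_web_search is a hypothetical
# helper standing in for your search engine of choice.
documents = []
for q in queries:
    for hit in my_web_search(q.text):
        documents.append({"title": hit["title"], "snippet": hit["text"]})

# Step 3: ask the question again, this time handing over the documents.
response = co.chat(message=question, documents=documents)
print(response.text)       # grounded answer
print(response.citations)  # spans tying the answer back to the documents
```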
However, there's an even simpler way of doing this. And it's a framework that we came up with called Connectors. Basically, all you do is one line of code. You say, how tall is Joe Biden, and please conduct a web search while you're at it. As a developer, this is pretty sick. You can just start building RAG-powered assistants really, really quickly.
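In SDK terms, that really is one call; a sketch assuming the managed connector id "web-search":

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

# One call: query generation, the web search, and the grounded answer
# all happen behind the API.
response = co.chat(
    message="How tall is Joe Biden?",
    connectors=[{"id": "web-search"}],
)
print(response.text)
```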
But at Cohere, we are focused on the enterprise. And that means that a lot of the questions that people are going to be asking the model require information that just doesn't exist on the open web. You would have to look it up in some kind of private space.
And so what is Connectors? Basically, the idea is that you should be able to look things up in your own private repository. It can be your Confluence, your company's Wiki, Google Docs.
How am I doing on time? I still have time. Great.
So I'll give you an example. So over here, I'm going to say, who is the manager of team endpoints at Cohere? I'm going to enable grounding, but I'm going to disable web search, because this information probably doesn't exist on the web. Although now that I'm Juno nominated, who knows?
And I'm going to enable our Google Drive connector. And let's see what it finds. So the search query got generated, who's the manager of Team Endpoints, and it found that Michael Kozakoff, which is in fact me, is the manager of the Endpoints engineering squad.
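In code, that demo is the same chat call pointed at a private connector. A sketch: "google-drive" here is a hypothetical id, use whatever id your registered connector actually has.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

# Grounding on a private data source instead of the open web.
# "google-drive" is a hypothetical connector id.
response = co.chat(
    message="Who is the manager of team endpoints at Cohere?",
    connectors=[{"id": "google-drive"}],
)
print(response.text)
print(response.documents)  # the retrieved snippets behind the answer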
So you can see that this is pretty useful if you're trying to build a chat bot that enables people at your company to answer questions based on information that exists only privately. So I think at this point, I'm out of slides.
And the only thing I wanted to mention in addition to this is that when you look up information this way, if you enable a connector such as Google Drive, well, I don't want it to be searching documents that I don't normally have access to. For example, the CEO of my company might have some private documents about people's salaries, and I shouldn't be able to ask, what is my manager's salary?
The way this works is there's an authentication flow to it when you register a connector. And so using OAuth2, it enables search only to documents that you have access to. And that way, I can search things in my private documents and all the public documents that go here, but I cannot search over documents that somebody else owns.
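Registration is where that OAuth2 configuration comes in. Here's a hedged sketch of creating a connector over plain HTTP; the endpoint and field names follow my reading of the Connectors API and may differ, so treat them as assumptions and check the docs.

```python
import requests

# Register a connector and attach an OAuth2 app, so searches run with
# each user's own credentials. Field names here are assumptions.
resp = requests.post(
    "https://api.cohere.ai/v1/connectors",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "name": "internal-docs",
        "url": "https://connector.example.com/search",
        "oauth": {
            "client_id": "YOUR_OAUTH_CLIENT_ID",
            "client_secret": "YOUR_OAUTH_CLIENT_SECRET",
            "authorize_url": "https://accounts.google.com/o/oauth2/auth",
            "token_url": "https://oauth2.googleapis.com/token",
            "scope": "https://www.googleapis.com/auth/drive.readonly",
        },
    },
)
print(resp.json())
```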
Yeah, so this is a framework that enables people to build RAG-powered solutions for their own use cases. So if you're a developer and you think this is something cool to play with: I think it's really cool to play with, and you should.
I think that's all I have to present and I kind of want to open the floor to see if there's any questions about anything.
What if you take documents that you do have access to, and you ask it to piece together information, possibly inferring data from a document that you don't actually have access to?
So you're saying, given a certain document, synthesize some additional information that is not true? Yeah, like predict. How likely is it this person is the manager of this?
You could try. But I suspect that if you asked a human to do that, they wouldn't do a very good job of it either.
Since you give it access to a private pool of information with permissions, how do you manage what is done with that information? Since it has access to privileged information inside the enterprise, how do you make sure that information doesn't go anywhere? So the idea here is that this is a tool that's only available to you within your company.
And instead of you going and looking things up by hand yourself, you ask this thing to go look it up for you and synthesize a summary as your answer. And then what you do with this information is exactly the same as what you would do with information you found in the documents you already have access to. So it's kind of like a helper tool.
Does that answer your question? Yes, totally. Thanks. Yeah.
If in my company I want to do this, what data formats are supported? Because our company has Microsoft files, it's got JPEGs, it's got a million different file types. Is this limited in its ability to handle those? And I think that gets to the core of what a connector is.
A connector is basically a service that you spin up yourself that can search a data source at your company. A connector takes a search query as input, and it's responsible for conducting a search over your own data and outputting text. Whatever format your documents are in, it's your responsibility to spin up something that can search them.
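Concretely, a connector can be a tiny web service. Here's a minimal sketch with Flask, assuming the contract is a POST /search endpoint that takes {"query": ...} and returns {"results": [...]}; search_my_docs is a hypothetical stand-in for whatever actually searches your files.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.post("/search")
def search():
    query = request.get_json()["query"]
    # search_my_docs is a hypothetical helper: Confluence, SQL,
    # an embeddings index, whatever holds your documents.
    hits = search_my_docs(query)
    # The only hard requirement: come back with text the model can read.
    return jsonify({
        "results": [{"title": h["title"], "text": h["text"]} for h in hits]
    })

if __name__ == "__main__":
    app.run(port=5000)
```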
It's just, yeah, you register something that provides us access to search through things and then brings back the text. And so the only requirement here is that you can accept a search query and you can spit out some text for us to work with. I'll add one more thing that I had mentioned before, which is kind of like an interesting case. When you work with models, they have a limit to how much text they can actually accept.
It's called a context length limit. And at this moment in time, it's actually not that huge. So what do you do if the retrieved documents you grabbed don't all fit? How do you decide which ones to give it and which ones to drop?
And so one of the things that powers things under the hood is another language model called re-rank. What it does is it takes the search query and it takes the documents and it sorts them in order of semantic relevance to the original query. And so when we decide which documents to omit from passing on to the model, we use the re-rank and we only take the top documents or as many as we can fit within the context window. And that's how we decide what documents go to it.
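Sketched with the rerank endpoint from the Python SDK (the exact model name varies by version):

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

query = "How tall is Joe Biden?"
docs = [
    "Joe Biden is 6 feet 0 inches (183 cm) tall.",
    "Lovejoy is a British comedy-drama television series.",
    "The president's motorcade arrived at noon.",
]

# Sort by semantic relevance to the query, then keep only as many
# documents as fit in the model's context window.
ranked = co.rerank(query=query, documents=docs, top_n=2,
                   model="rerank-english-v2.0")
for r in ranked.results:
    print(r.relevance_score, docs[r.index])
```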
In that case, can I input my entire code base and then ask it to generate a class that potentially uses some of the other classes in my code base? And even if my code base is large, Re-rank will filter it down? I think you could do that, but I don't think RAG is the right approach for it.
RAG is more for asking questions about existing documents. So when you use our API that way, the prompt that gets constructed is all in question-answering form. So what you could do is provide your code base as documents and start asking questions about it. Give it a go.
Yeah, shouldn't be too hard. Is there a different approach that maybe one should take to try to create a whole new class for their existing code base? I think you could do it literally just in your message to the model: can you generate me a class, or something like that.
My only concern is that if I have a large code base and I need to be using 10 different classes inside of the new class I want to create, how do I copy and paste my entire code base into the chat message UI? I see, yeah. So if your concern is that you have way too much code that's not going to fit in the context length of the model, you could re-rank your code using the Re-rank endpoint, figure out which code to omit and which to pass, and then pass it in. Perhaps it might work well with the RAG framework with the connectors.
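As a sketch, that could look like reranking your source files against the request and passing only the top ones along as documents; the file names here are hypothetical.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

request = "Generate a CacheClient class that reuses our existing classes."
# Hypothetical: one string per source file (or per chunk of a large file).
paths = ["db.py", "config.py", "http_util.py"]
code_chunks = [open(p).read() for p in paths]

# Keep only the chunks most relevant to the request.
top = co.rerank(query=request, documents=code_chunks, top_n=2,
                model="rerank-english-v2.0")
docs = [{"title": paths[r.index], "snippet": code_chunks[r.index]}
        for r in top.results]

response = co.chat(message=request, documents=docs)
print(response.text)
```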
After you rank it by semantic relevance, would you have a provision for the bot to ask counter-questions, to clarify the question? That would be very powerful, in my opinion, because there are a lot of cases where my questions are not clear in the first place.
So the approach we take to this is that anytime you ask another question as a follow-up, we conduct the search all over again because it might be a different search query now that you asked a clarifying question. And so it's going to be a brand new set of documents and a brand new ranking for those documents. And so hopefully when you ask that clarifying question, the collection of documents used is going to be different.
So can the bot ask a clarifying question to figure out which document it needs? I think it depends case by case on how we trained the bot. Sometimes if you ask it a question and it doesn't know, we've trained the bot to say, sorry, I don't know, or I can't answer that question given the following information. And then you know to maybe pass it more information.
How are we doing on time? We're still good. We're still good. All right.
Can you link it to something like a vector database? And are there any vendors you would recommend?
That's a really good question. So a vector database is something that stores embeddings. The whole idea behind embeddings is that you can take some text, you can extract the meaning of that text, represent it as numbers.
And numbers are really useful because if you have text A and text B and you've extracted their meanings, you can compare them to each other. And now you can tell whether they're talking about the same thing or not, even if they use completely different language, different words. So this is different from lexical search.
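For instance, with the embed endpoint (the model name and input_type follow the v3 embedding models; your version may differ):

```python
import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")

texts = [
    "How tall is the US president?",
    "Joe Biden's height is 183 cm.",
    "Lovejoy is a band from the UK.",
]

# Turn each text into a vector that captures its meaning.
vecs = np.array(co.embed(texts=texts,
                         model="embed-english-v3.0",
                         input_type="search_document").embeddings)

# Cosine similarity: texts about the same thing score higher,
# even when they share almost no words.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos(vecs[0], vecs[1]))  # question vs. matching fact: higher
print(cos(vecs[0], vecs[2]))  # question vs. unrelated text: lower
```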
So what you're asking about is, what if you're searching over a database of those embeddings? Yes. I don't have enough time to demo this to you, but we have actually built a connector using a vector database that we built in-house.
But we are actually in partnership with Pinecone and a bunch of other databases. So yeah, you can totally build a connector around a vector database.
The only complex thing about searching over a vector database is in most cases, there's no permissions set up in the database. So it's very hard to tell which documents you have access to and which ones you don't. So if it's just all public data, yeah, sure.
Dump it into a vector database and then conduct a search over that database. But if now you need to be able to say which vectors you have permissions for and which not, that gets a little bit harder.
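One common workaround is to attach permission metadata to each vector and filter at query time. A hedged sketch using Pinecone's metadata filtering; the index name and the allowed_users field are hypothetical, and newer Pinecone clients use a different initialization.

```python
import pinecone

pinecone.init(api_key="YOUR_PINECONE_KEY", environment="us-west1-gcp")
index = pinecone.Index("company-docs")  # hypothetical index name

def search_for_user(query_vector, user_email, top_k=5):
    # Only match vectors whose metadata grants this user access.
    return index.query(
        vector=query_vector,
        top_k=top_k,
        filter={"allowed_users": {"$in": [user_email]}},
        include_metadata=True,
    )
```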
Thank you.
I have a follow-up question about the context. What are your thoughts on models that have a very high context, like 2.1, which can take something like 100K tokens? Would that kind of solve some of the limitations of models that have a lower context length?
Certainly. I think the more context length you have, obviously, the better because you can provide more documents to the model and it's more likely to give you an informed answer.
However, the more context length you have, the more expensive your model is to run. And for some enterprises, it just doesn't make sense to pay that much money for a model that's that big.
And so you have to come up with other solutions. And re-rank is an excellent approach to do the same thing, but a little bit cheaper. Basically, before asking the model the question and providing it access to all this information, you first filter down the information that you give it, and then that's how you get around it.