Human-centric AI - Natan Vidra

Introduction

Well, thanks so much initially to the MindStone team for setting this up. Really appreciate Josh and this whole community building AI in New York and helping make it more accessible. And thanks so much to the CIBC team for hosting this event. Really grateful.

Essentially, my name's Natan Vidra. I'm the founder of an AI startup in New York called Anode.

And I'm going to be giving a technical talk today about a topic called human-centered AI. I'll skip all the product-selling stuff.

The Challenge of Unstructured Data in Financial Enterprises

But essentially to kind of picture this kind of use case, say you're like a financial enterprise, right? And you essentially have millions of rows of unstructured text data, you know? And maybe you have all these documents. They could be 10Ks, 10Qs, earnings calls.

And currently you might have a team of about 1,200 analysts that are paid a lot of money and spend tons of time going through these documents, and you're looking into ways you can use generative AI or LLMs to help answer questions or extract information from these documents.

Initial Solutions with AI Chatbots

So initially, you might have a solution that might look a little bit like this, where you have this chat bot, and you upload your files. And you have these different chats, and maybe you ask questions on all your documents. Right.

And you ask a question, maybe you get an answer, and you can go and actually click on this item to see where the answer came from and what the source was. And you can see it gives a chunk of text as well as an answer.

And then you can scroll down, and you might see, hey, you ask it a question, and it gives some silly answer that you can't really make much sense of. The model doesn't really understand what it's doing. The chunks it gives along with the answer might not even be correct.

So even if you wanted to use AI or LLMs to go through these documents, it would be kind of difficult with just a raw large language model.

Seeking Accurate Large Language Models

So from a technical perspective, there's this really open-ended question of how you can build these really accurate large language models that you can kind of input via a fine-tuned model ID. And you can actually get the right answers to questions as well as the appropriate chunk of text. That way, your model at least gets more answers right and actually has some relevant sources to back up where the answers came from.

Proposed Architecture for Human-Centered AI

So I want to talk today about a proposed architecture for how you can do this technically. I'd be happy to answer any questions on it, and if we have time, we can go into some of the technical details of how such a system could work.

So essentially, fast forward to here, the idea is you want to input a fine-tuned model ID into your chatbot. So you upload your documents, you want to chat with them and get answers, and you want some way to evaluate whether the answer's right. Maybe it's private or public. Maybe there's a software development kit that you can use to see how it works in code if you're a developer.
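As a rough illustration of that last step, here is a minimal sketch of what plugging a fine-tuned model ID into the chat flow could look like, assuming an OpenAI-style chat completions API; the model ID, chunk, and question are hypothetical placeholders, not the actual product.

```python
# Minimal sketch: plugging a fine-tuned model ID into the chat flow.
# Assumes an OpenAI-style chat completions API; the model ID, chunk, and
# question below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()

FINE_TUNED_MODEL_ID = "ft:gpt-4o-mini:my-org:finance-qa:abc123"  # hypothetical ID

def answer_question(question: str, retrieved_chunk: str) -> str:
    """Ask the fine-tuned model a question, grounded in a retrieved chunk of a filing."""
    response = client.chat.completions.create(
        model=FINE_TUNED_MODEL_ID,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{retrieved_chunk}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content

print(answer_question("What was year-over-year revenue growth?", "...chunk pulled from the 10-K..."))
```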

The question is, how do you do this fine tuning? And how do you know the model is actually doing well?

Approaches to Fine-Tuning AI Models

So to do the fine-tuning, there are different ways of doing it. The first is what I call the most common.

It's like you just have all these raw documents. Maybe it's 10-Ks from the past few years. And you don't really want to do any labeling, but you just want to feed this into a model as unsupervised learning and use that information, via some sort of masked language modeling objective, to fine-tune large language models.
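A minimal sketch of that unsupervised route, assuming Hugging Face transformers and datasets; the base model, file path, and hyperparameters are illustrative, not a prescribed setup.

```python
# Sketch: "no labels" fine-tuning on raw filings via masked language modeling.
# Assumes Hugging Face transformers/datasets; model choice and file path are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "distilroberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Raw 10-K text, one passage per line (path is a placeholder).
dataset = load_dataset("text", data_files={"train": "filings_10k.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"],
)

# The collator randomly masks tokens, so the model learns the domain language without labels.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-filings-mlm", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```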

The second is what I call supervised fine-tuning. So it's like, hey, you have labeled data. Maybe it's question, answer, chunk, or maybe you're trying to fine-tune a model for classification, so it's text plus a category label.

And you take this labeled data, and you essentially use it to train or fine-tune a model that you can put into the chatbot. And the third is this idea of RLHF/RLAIF, where maybe there's some sort of interface, and as you're actually labeling or annotating, the model is automatically being fine-tuned. And you can put the model into the chatbot.
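Going back to the supervised route, here is a rough sketch of turning labeled question / chunk / answer rows into a fine-tuning job, assuming an OpenAI-style fine-tuning API; the rows, file name, and base model are illustrative.

```python
# Sketch: turning labeled (question, chunk, answer) rows into a supervised fine-tuning job.
# Assumes an OpenAI-style fine-tuning API; rows, file name, and model are illustrative.
import json
from openai import OpenAI

labeled_rows = [  # in practice, thousands of analyst-labeled examples
    {"question": "What drove the margin decline?",
     "chunk": "...excerpt from the 10-K MD&A section...",
     "answer": "Higher freight and input costs, per the MD&A."},
]

with open("sft_train.jsonl", "w") as f:
    for row in labeled_rows:
        f.write(json.dumps({"messages": [
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{row['chunk']}\n\nQuestion: {row['question']}"},
            {"role": "assistant", "content": row["answer"]},
        ]}) + "\n")

client = OpenAI()
training_file = client.files.create(file=open("sft_train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id,
                                     model="gpt-4o-mini-2024-07-18")
print(job.id)  # when the job finishes, the resulting fine-tuned model ID goes into the chatbot
```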

The general idea is you would upload the documents you want to fine-tune on. You might add the questions or categories you care about. And as you either label all your items, label actively, or don't label anything, your model is learning.

And when you finish, you can export that fine-tuned model as an API endpoint into your chat.

Product Perspective and Fine-Tuning Outcomes

So from a product perspective, it might look something like this, where maybe you're trying to fine-tune a model for emotions. And you have these categories. Maybe you've defined the questions or categories you care about.

You can kind of see some items on a GUI. And maybe you make some labels. And as you would kind of make these labels, your model is learning. And then you're using that to kind of predict or fine tune a large language model.
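One way that "the model learns as you label" could work under the hood is to keep an incrementally trained classifier on top of text embeddings. A minimal sketch, assuming sentence-transformers for embeddings and scikit-learn for the classifier; the categories and example texts are made up.

```python
# Sketch: a model that keeps learning as labels come in, for an "emotions"-style
# classification task. Assumes sentence-transformers + scikit-learn; labels and texts are made up.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import SGDClassifier

CATEGORIES = ["positive", "negative", "neutral"]  # illustrative label set
encoder = SentenceTransformer("all-MiniLM-L6-v2")
clf = SGDClassifier(loss="log_loss")

def on_new_label(text: str, label: str) -> None:
    """Called each time an annotator labels an item in the GUI."""
    embedding = encoder.encode([text])
    clf.partial_fit(embedding, [label], classes=CATEGORIES)

def predict(text: str) -> str:
    """Use the partially trained model to pre-label the next item."""
    return clf.predict(encoder.encode([text]))[0]

on_new_label("Management sounded upbeat about guidance.", "positive")
print(predict("The outlook was cautious and demand is softening."))
```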

And I guess before I pause to answer any questions and go into the details, why you might do this fine-tuning is that when you finish, you'd be able to evaluate that your initial model answers are getting somewhere between 5% and 10% accuracy. But with this fine-tuning or retrieval, maybe you get 20% to 25%. And then if you enhance RAG, maybe you can get to around 30% accuracy or better on the answers to the questions. Yeah, I guess I'll stop talking for now.

Q&A: Clarifying the Approach and Use Cases

And before I get into any of the details, I just want to take some time to answer any questions on anything covered thus far. Yeah.

Who's actually using this at the moment? So this would be for a financial enterprise, like Morningstar, if you have 1,200 analysts going through all these 10-Ks or annual reports, but you want to use AI to answer those financial questions.

Are you using a vectorized database? Yeah. So that's one way you can go about it, yeah.

So I think it depends on whether the data is private or public. So sometimes with financial companies or enterprises, the key thing is that the data needs to be private, on a device. And in that case, you can use Chroma, or you can just store things in a local MySQL DB. And if that's not a requirement, if it can be public, then you might use Pinecone or Weaviate or one of those vector databases.
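For the private, on-device case, here is a minimal sketch of a local vector store using Chroma's persistent client; the path, document chunks, IDs, and metadata are placeholders.

```python
# Sketch: keeping the vector store local when the data has to stay private.
# Assumes the chromadb client with on-disk persistence; IDs, chunks, and path are placeholders.
import chromadb

client = chromadb.PersistentClient(path="./local_vector_store")  # stays on the device
collection = client.get_or_create_collection("filings")

collection.add(
    ids=["10k-2023-chunk-001", "10k-2023-chunk-002"],
    documents=["...chunk about revenue recognition...",
               "...chunk about liquidity and capital resources..."],
    metadatas=[{"doc": "10-K 2023", "section": "Notes"},
               {"doc": "10-K 2023", "section": "MD&A"}],
)

results = collection.query(query_texts=["How is revenue recognized?"], n_results=2)
print(results["documents"][0])
```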

So basically, I think the question is not really, should you do it? It's basically, how well can you actually do this kind of research?

Enhancing Model Performance for Financial Queries

So while I have a few minutes, I want to just talk a little bit about how to actually enhance the model's performance so it answers these questions right. So I think there are a few things that are worth talking about from a technical perspective.

So imagine, for instance, you have this query-answer-context database on financial services, right? And you're trying to optimize not only the answer the model provides, but also the chunk.
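For concreteness, one row of such a dataset might look like the sketch below; the values are made up.

```python
# Sketch: one row of a query / answer / context dataset. Both the answer and the
# supporting chunk are things you want to get right; all values here are illustrative.
example_row = {
    "query": "What was the company's total debt at year end?",
    "answer": "Total debt was $4.2 billion as of December 31.",
    "context": "...the exact 10-K passage the answer should be grounded in...",
    "metadata": {"doc": "10-K 2023", "section": "Liquidity and Capital Resources"},
}
```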

Techniques for Fine-Tuning

The first thing you might want to do to make the model better is a process called fine-tuning. And within fine-tuning, there's a technique called parameter-efficient fine-tuning, which uses things like LoRA and QLoRA.

And the idea is you're taking this domain knowledge from your training data set, you're doing supervised fine-tuning on a new model, and you're using that information to make predictions, in a really fast but easy-to-use way, on your test data set. And there's also QLoRA, which does the same fine-tuning process, but a lot faster and quantized for efficiency.
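A minimal sketch of parameter-efficient fine-tuning with LoRA, using 4-bit loading for a QLoRA-style setup, assuming Hugging Face transformers, peft, and bitsandbytes; the base model name and hyperparameters are illustrative.

```python
# Sketch: parameter-efficient fine-tuning with LoRA, with 4-bit loading for a QLoRA-style setup.
# Assumes transformers + peft + bitsandbytes; model name and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",        # placeholder base model
    quantization_config=quant_config,  # quantized base weights are the "QLoRA-like" part
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # only these small adapter matrices get trained
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights
```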

And why this is useful is that rather than using just pure RAG, with this fine-tuned model you're actually able to get the answers to questions in the finance domain more right, not perfect.

Improving Retrieval Processes

And the second approach, which I think is really interesting, is if you're using pure RAG but you're trying to enhance the retrieval process. Essentially, there are a lot of approaches to improve RAG by enhancing retrieval.

These could be things that solve some of the core limitations, like, man, I have three or four documents where the text is in different sections, and I want to find the answer. Or my model is retrieving a chunk, but not the right chunk of text. Or I'm getting an answer, but it's picking the most similar chunk, not the most relevant chunk.
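To address that last problem, the most similar chunk not being the most relevant one, a common fix is to re-rank retrieved chunks with a cross-encoder. A minimal sketch, assuming the sentence-transformers CrossEncoder and an illustrative model name.

```python
# Sketch: re-ranking retrieved chunks so the most *relevant* one wins, not just the most similar.
# Assumes a sentence-transformers cross-encoder; the model name and chunks are illustrative.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, retrieved_chunks: list[str], top_k: int = 3) -> list[str]:
    """Score each (query, chunk) pair jointly and keep the top_k highest-scoring chunks."""
    scores = reranker.predict([(query, chunk) for chunk in retrieved_chunks])
    ranked = sorted(zip(retrieved_chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```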

So there's these approaches like query expansion, re-ranking, and metadata filtering that could be interesting to look into. But yeah, thanks so much for your time.

I appreciate you. And thanks to CIBC and MindStone for setting this up.

Evaluating Model Accuracy and Benchmarks

Sorry, I have a question. So you mentioned that you evaluate the results. What's the benchmark? How do you actually know that the results are coming back and they're true?

I would say this is actually the most important question. So basically, it depends. If you have labeled, structured data, so you have a human who's gone through and labeled, for this question, here's the answer I saw, and here's the context or chunk I found, something that might look like this data set, right? Because the human went through it as an expert.

Yeah, and you can compare the model answers to the human answers. And there are metrics like cosine similarity to measure string similarity, or ROUGE score to measure the longest common subsequence, or you have a human measure it.
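A minimal sketch of those two comparisons, assuming sentence-transformers for embeddings and the rouge-score package; the human and model answers are made up.

```python
# Sketch: comparing a model answer to an analyst-labeled answer with cosine similarity
# (over embeddings) and ROUGE-L (longest common subsequence).
from sentence_transformers import SentenceTransformer, util
from rouge_score import rouge_scorer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

human_answer = "Total debt was $4.2 billion at year end."
model_answer = "The filing reports total debt of $4.2 billion as of December 31."

cos_sim = util.cos_sim(encoder.encode(human_answer), encoder.encode(model_answer)).item()
rouge_l = rouge.score(human_answer, model_answer)["rougeL"].fmeasure

print(f"cosine similarity: {cos_sim:.2f}, ROUGE-L F1: {rouge_l:.2f}")
```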

But then I think in most use cases in industry, you actually don't have this labeled data from the human. So evaluation of these models is really difficult. There are metrics like RAGAS, which essentially measures the faithfulness of large language models. So it's not only how similar the answer is, but how relevant. And there's an algorithm for how to do that.

And there's also LLM eval, where you're basically asking GPT or your model to tell you if it's good or bad. And that's, I think, kind of subjective and an open research area. Yeah.
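A minimal sketch of that LLM-as-judge style of evaluation, assuming an OpenAI-style API; the grader model, rubric, and score parsing are simple placeholders, and the subjectivity mentioned above still applies.

```python
# Sketch: LLM-as-judge evaluation, where another model grades the answer against the context.
# Assumes an OpenAI-style API; grader model, rubric, and score parsing are placeholders.
from openai import OpenAI

client = OpenAI()

def judge(question: str, context: str, answer: str) -> int:
    """Ask a grader model for a 1-5 faithfulness/relevance score."""
    prompt = (
        "Rate the answer from 1 (unsupported or irrelevant) to 5 (fully supported by the "
        "context and directly answers the question). Reply with a single digit.\n\n"
        f"Question: {question}\nContext: {context}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # grader model is a placeholder choice
        messages=[{"role": "user", "content": prompt}],
    )
    return int(response.choices[0].message.content.strip()[0])
```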

Differentiation and Future Directions

What is the most novel or differentiated about what you're doing? I think where we are at is we've realized that there's a problem. Basically, there's a limitation of what we can do, which is basically for our customers, it's like, yeah, we can get it from this 5% to 30% where most other folks are. But they want 90% to 100%, and they're paying us for the really accurate results.

So we are trying to think of a solution of how to actually do that. So it's like, how do you build this really robust fine-tuning library? And we have ideas, right? And we have approaches.

And I think our main approach is supervised and unsupervised fine-tuning with labeling, where what is learned feeds into a model that you can evaluate, and you try different models. And maybe you take different parts of the product and chunk it to filter your items before doing it. But I think we're still trying to figure out how to make it the best, right?

I think it's technically possible if you had a lot of resources and a smart team. But I think it's definitely possible to do it at least incrementally better, for sure.

Use Cases in Private Equity and Banking

So there's a use case within private equity or banking around looking at, like you said, 10-Ks and 10-Qs to figure out comparable companies. Let's say you're looking at a shipping company and you want to go through these 10-Ks and figure out what other comparable companies there are, so that you can look at the P/E ratio, for example. So would that be a use case? Exactly.

If the product was completely amazing, right, that would be a use case. If the answers to the questions and the sources were completely accurate, it would just save a lot of time and money, and it would make a ton of sense, right?

It would give you prompts and ask you what category you think this falls into, and you rate it. And you can either label them all, not label any of them, or actively improve the labels.

And each approach is good and works. And you can see how each one performs. And you can measure, based on your humans or based on your initial model, how each one got better or something. But yeah, that's exactly it.

Closing Remarks

Anyways, I don't want to take up too much time, but I'm really thankful for the opportunity, and I guess my name's Natan Vidra, so feel free to just reach out to me and connect, and we can always talk more. Yeah.
