So today's topic is LLM reasoning and multi-agent conversations, using text-to-SQL for complex problem solving.
First of all, as a data scientist, just to collect a little bit of data: how many of you have used LangChain, for example? Okay. And how many of you have used SQL for databases? OK, a little bit more, maybe half of you.
And how many of you have something to do with AI or machine learning? OK, so AI, machine learning and SQL, nearly the same.
And of course, the blockchain folks are upset, so... OK, that's my cue to start the more technical part of the presentation. I hope even those of you who are not in this field will find something useful. So let's proceed.
So this is the agenda.
First, we'll go through large language models. What are they?
Then I will introduce the case. And then we'll go a little bit down the rabbit hole.
We'll see what ReAct is, an approach with a human in the loop to resolve the case, and then another approach with multi-agent conversations.
Then we'll see what challenges and conclusions we have, and we'll take a look into the future.
So here is a brief overview of LLMs. This is the famous Transformer architecture from 2017, published by Google researchers. This is where the whole Transformer and LLM journey started. It was a big breakthrough.
But in general, what are LLMs? Those are large deep learning models. Most of them use this transformer architecture, which you see here.
We have the encoder on the left side and the decoder on the right side. Some of the most famous models now, like ChatGPT or Mistral, use only the right side, the decoder, but these are details.
There is no clear definition, but usually models with more than a billion parameters are called large language models. They were a big revolution in NLP, natural language processing: some tasks that used to be hard are suddenly not so difficult anymore. The main purposes are question answering, summarization, classification, translation, and code generation. We will see how an LLM handles some of these tasks.
This is a brief tree of how the field has evolved since 2017. There are three branches along which LLMs have developed. It's hard to see from here, so I will read it for you: a decoder-only tree, an encoder-decoder tree, and an encoder-only tree. Let me continue.
So you can see that some branches continue to grow, even now in 2022 and 2023, and some of them have more or less stopped. The most popular tree, or the biggest branch, is the decoder-only branch. Those are models which are very good at generating text.
That's partly why we see so many hallucinations in ChatGPT: those models sit somewhere in this branch. We had GPT-3 in 2020, then Codex, then Cohere. I can't find ChatGPT on the chart, but that's fine; at the end of 2022 we have the ChatGPT that everyone has heard of. So this is in the decoder-only tree.
The Mistral model is also in this branch. Then we see some encoder-decoder models, like Google's T5; that branch is still growing. And we see the encoder-only branch, with BERT, which more or less stopped around 2020, although sentence transformers still use the BERT architecture today. So this is a general overview of how the field has evolved. We're in 2024 now, and beyond this chart we also have GPT-4, Claude and so on, so development is still going on.
So there were many applications in education and in different domains.
One of the emergent properties of LLMs is reasoning, and there is a lot of debate about whether LLMs can actually reason or whether they just predict the next word. These are the most famous papers in this field.
First there was the chain-of-thought paper, where researchers found that prompting an LLM to lay out a chain of thought leads it to better solutions and better decisions.
Then other researchers found that, instead of crafting elaborate example prompts, it is enough to simply append "Let's think step by step", and the LLM generates better answers for difficult tasks.
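To make that concrete, here is a minimal sketch of the zero-shot trick, assuming the OpenAI Python client; the model name and the task are illustrative, not from the talk.

```python
# Minimal sketch of zero-shot chain-of-thought prompting.
# Assumes the OpenAI Python client; reads OPENAI_API_KEY from the environment.
from openai import OpenAI

client = OpenAI()

task = ("A bat and a ball cost $1.10 together. The bat costs $1.00 more "
        "than the ball. How much does the ball cost?")

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        # Appending "Let's think step by step." is the whole trick:
        # it nudges the model to spell out its intermediate reasoning.
        {"role": "user", "content": task + "\nLet's think step by step."}
    ],
)
print(response.choices[0].message.content)
```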
Then there was a really popular paper, ReAct, where the task was not only to solve different reasoning problems but also to act while doing so: reasoning plus acting based on the reasoning. Over multiple steps, the LLM states its reasoning and then acts based on it.
And then Tree of Thoughts, which led to better and better decisions. These are just a few of the studies in this field.
Now the case. As part of my PhD program, I'm a lecturer at a university, and we teach students an SQL course. This is the first time they meet SQL code. SQL is just a language to query databases, right? You write the SQL, and you extract some information, as data analysts do.
There is this case, which is open source; you can scan the QR code to read about it. In general, at the end of the semester, when students are able to code and query a database, their task is to go into a database where there is a hint that a murder happened. They act as detectives, and their task is to find the murderer.
You will now see the tasks they are supposed to do. This is the database: it has different tables, with different columns inside.
So their task is actually to find the murderer. And this can be split into different subtasks. Of course, they need to understand in which table to find the clues. They need to write the SQL code.
Based on the extracted information, they need to understand and decide: what's the next step? How do we investigate this? And then they need to connect all the dots and finally find a solution.
So the question was: OK, now that we have these powerful LLMs with emergent capabilities like reasoning and code writing, can an LLM actually solve these complex tasks, which require multiple steps? Let's see how it went.
The first approach is ReAct with a human in the loop. This is the proposed solution. As I mentioned, first there was research on reasoning only and on acting only; ReAct is the paradigm that combines them, reasoning plus acting.
Those of you who use LangChain or work with LLMs can check it out: this paper is implemented there and already heavily used. The language is Python.
I used the GPT-3.5 API and some heuristic prompt engineering. There were two approaches. The first is ReAct with zero-shot prompting; zero-shot prompting means we just state the task and see what happens.
The other adds user prompts, helping the LLM by guiding it through the process. This is the workflow, what actually happens. First, we have the user question, which contains some hints, like the date and where the murder happened. I will tell you a little about the case so you can follow along.
So those are the hints. Based on them, we feed this user question to the LLM, and the LLM should query the database.
Then, based on the answers, it should decide whether this is the final answer, whether it is enough or not. If it's not enough, it decides by itself to repeat the loop until it finds the final answer. And here we use the magic of the ReAct prompt template and the action prompt template, which you will see on the next slide.
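As a rough sketch of that loop (not the exact LangChain implementation), the control flow looks roughly like this; the prompt template is abbreviated, and `llm` and `run_sql` are hypothetical placeholders for the model call and the database access:

```python
# Abbreviated, illustrative ReAct-style instructions for the model.
REACT_PROMPT_TEMPLATE = (
    "Answer the question by interleaving Thought, Action and Observation steps.\n"
    "To query the database, write 'Action: query' and 'Action Input: <SQL>'.\n"
    "When you are certain, write 'Final Answer: <your answer>'.\n\n"
)

def react_loop(question, llm, run_sql, max_steps=10):
    """One self-driven ReAct loop: think, act, observe, repeat."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(REACT_PROMPT_TEMPLATE + transcript)  # one Thought + Action
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        if "Action Input:" in step:
            sql = step.split("Action Input:")[-1].strip()
            observation = run_sql(sql)  # execute against the mystery database
            transcript += f"Observation: {observation}\n"
    return None  # the loop did not converge within max_steps
```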
This is the actual prompt, a product of the ReAct research, and it's well described in the paper. How it works: the LLM is prompted to generate thoughts to itself and then actions based on them.
Based on this, it is able to produce better results. And it is up to the LLM to decide: is this the final answer or not? If not, the loop continues.
And this is the prompt related to the SQL database. When we connect to the database programmatically, through Python, and by the way, this part is also implemented in LangChain, the LLM can detect the SQL dialect, because the dialect matters. And because the database carries all the table information, the columns, the whole schema, that is also submitted into the prompt.
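For illustration, here is how the dialect and schema can be pulled in with LangChain's SQL utilities; this is a minimal sketch, and import paths vary between LangChain versions, so treat the exact path as an assumption about your installed version:

```python
# How the dialect and schema end up in the prompt, via LangChain's SQL utilities.
from langchain_community.utilities import SQLDatabase

db = SQLDatabase.from_uri("sqlite:///sql-murder-mystery.db")  # path is illustrative

print(db.dialect)           # e.g. "sqlite" -- injected so the LLM writes matching syntax
print(db.get_table_info())  # CREATE TABLE statements plus sample rows, also injected
```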
So the LLM knows what tables exist and where it should go next. And the question, of course, is in the prompt as well. Now we will go into the nitty-gritty details.
I will try to go through the most important ones, so I'll be jumping over some of the slides, but you should still be able to understand the features and capabilities of this approach.
First of all, in our question the date is written just normally. But in the SQL, because the LLM already understands the syntax it should use and how the data is stored, it immediately generates a query with the proper date format, something students usually have to stop and think about.
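For example, the hint mentions a date in plain words, while in the open-source case's schema the crime_scene_report table stores dates as integers like 20180115; a query in the spirit of what the LLM generated might look like this (the file path is illustrative):

```python
# Dates are stored as integers (YYYYMMDD) in the open-source case's schema;
# the model matches that format directly in its generated SQL.
import sqlite3

conn = sqlite3.connect("sql-murder-mystery.db")  # path is illustrative
rows = conn.execute(
    "SELECT description FROM crime_scene_report "
    "WHERE date = 20180115 AND type = 'murder' AND city = 'SQL City'"
).fetchall()
print(rows)
```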
Some other cool things. It generates the query, and because it's an LLM, it can read all the output and understand what kind of murder happened, regardless of how the extract is formatted. You can see the extract is not very human-readable.
Then it comes to a conclusion, though not a very correct one here. Here we see some loops that jump too far forward: the LLM is trying to work out who the murderer is, because that's the main goal, but it's too early, so it doesn't reach correct conclusions.
In this case, the task was too complex for it. But in some of the examples we see that a correct thought is generated, "I need to find out more information about the murder", and it generates a more complex query. It is actually able to produce a query that refers to the correct city, even though the extracted records don't mention the city that was asked about. Because it understands the language, it can make the connection and find the right answer.
So this is a real capability: even if we extract rows with some unneeded ones mixed in, the LLM is able to understand which details are relevant.
I will jump over these examples a little; they are described here. Just to mention about this next attempt: the progress was good, but the LLM couldn't find the right answer.
So another approach was: OK, can we guide the LLM a little, rather than just letting it go and make internal loops? Can we help it a bit?
That's fine: if we help it and it finds an answer quicker, that's OK for us.
Here the user is able to add additional input to the question, and the context, all the conclusions so far, is also inserted into the prompt.
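A minimal sketch of that guided loop, assuming a hypothetical `llm_step` function that performs one reason/query/observe round:

```python
def guided_investigation(initial_hints, llm_step):
    """ReAct with a human in the loop: the user can steer each iteration."""
    context = initial_hints  # accumulated hints and conclusions so far
    while True:
        result = llm_step(context)  # one reason/query/observe round
        context += "\n" + result    # fold the new conclusions into the prompt
        if "Final Answer:" in result:
            return result
        hint = input("Add a hint for the next step (or press Enter to skip): ")
        if hint:
            context += "\nUser hint: " + hint
```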
We see that the SQL query is corrected already in the second iteration, which is quicker. We can also see examples of more complex queries; this one is a query inside a query, a subquery.
We also see that the LLM tries to follow the main goal, even though this part of the investigation was not exactly what the question asked for. Because the LLM knows the final goal, it keeps the final answer in mind and pushes forward.
So it was a bit less controllable: the model went as far as finding suspects for the investigation, although it wasn't asked to in that iteration. You can see the loops, loop after loop, as the LLM tries to carry out the investigation.
In those attempts, the majority of the answers were actually incorrect. The problems were that a loop finished too early, or there were infinite loops, so it was just stuck in the same place, or there were problematic SQL queries: if a query extracts 10,000 rows, that becomes an error, right? And of course, some plain wrong answers.
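Two cheap guards against those failure modes, sketched here under the assumption that queries run against SQLite: cap the number of iterations, and cap the rows a query may return before the result is pasted back into the prompt.

```python
MAX_STEPS = 15   # hard stop against infinite loops (pass as max_steps to the loop above)
MAX_ROWS = 50    # hard stop against 10,000-row results blowing the token limit

def safe_run_sql(conn, sql):
    """Run a generated query with a row cap (a crude, illustrative check)."""
    if "limit" not in sql.lower():            # only if the model forgot one
        sql = sql.rstrip("; \n") + f" LIMIT {MAX_ROWS}"
    return conn.execute(sql).fetchall()
```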
Then, I think in the second half of last year, a library called AutoGen appeared from Microsoft as a result of their research. It has around 22,500 stars now. What the research found is that with multi-agent conversations, LLMs could make better decisions and were able to handle even more complex problems. They also used the ReAct framework.
LLMs could write code, run it, and then, based on the results and the errors, fix it, taking larger tasks off humans' hands. In these multi-agent conversations there are, of course, different structures, but it was possible to create teams and, through the prompts, assign different roles within them, imitating human teams.
Based on this, I created a custom architecture, similar to what Microsoft is doing but a bit customized. Now we have LLMs with their own roles: through the prompt, each is strong at its particular role and task. We have one decision maker, which analyzes the whole investigation and says whether this is the final answer or not.
We have a database analyst, an expert in coding, which queries the database and summarizes the results into a report. And we have a planner, which provides tasks to the database analyst: it reviews all the clues so far and says, OK, based on these findings, what is the best next step? What is the next task?
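A sketch of how such a three-role team can be wired up with AutoGen's group-chat pattern; the system messages are abbreviated, and the llm_config details depend on your setup:

```python
# Three-role team sketched with AutoGen's group-chat pattern (pyautogen 0.2-style API).
import autogen

llm_config = {"config_list": [{"model": "gpt-3.5-turbo"}]}  # API key comes from your config

planner = autogen.AssistantAgent(
    name="planner",
    system_message="Review all clues so far and propose the next investigation task.",
    llm_config=llm_config,
)
analyst = autogen.AssistantAgent(
    name="db_analyst",
    system_message="Write SQL for the task, read the results, summarize them in a report.",
    llm_config=llm_config,
)
decision_maker = autogen.AssistantAgent(
    name="decision_maker",
    system_message="Decide whether the collected evidence identifies the murderer.",
    llm_config=llm_config,
)
user_proxy = autogen.UserProxyAgent(
    name="user_proxy",
    human_input_mode="NEVER",     # fully automated run
    code_execution_config=False,  # queries are executed elsewhere in this sketch
)

groupchat = autogen.GroupChat(
    agents=[user_proxy, planner, analyst, decision_maker],
    messages=[],
    max_round=20,
)
manager = autogen.GroupChatManager(groupchat=groupchat, llm_config=llm_config)

user_proxy.initiate_chat(
    manager,
    message="A murder occurred on Jan 15, 2018 in SQL City. Find the murderer.",
)
```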
And you can see that in this architecture, as a future direction, it's possible to swap LLMs: this one could be an LLM specialized in code generation, and these could be LLMs with better reasoning capabilities.
So let's see what happens with this setup. This slide is just for you to understand the iterations. Everything is printed, so you can watch the LLMs talking among themselves and see what's actually happening. Here is an example of one iteration.
There is a prompt, and the planner generates a question for the analyst. The analyst generates a query, which goes to the database; then comes the answer, the database result, and here the LLM reads that answer. It goes into the report, and then the second iteration starts.
This way, by splitting the tasks, each LLM focuses on only one thing, and it's a more controllable way to execute such operations. Let's see some more examples.
You can see how the investigation goes. There were license plate numbers and different clues, of course. First it found the witness, and then, I think by the fifth or eighth iteration, the murderer was found, which was much quicker; it didn't cost this setup much effort to solve the task.
Here is also an AI-generated diagram: in the final report, the AI was asked to lay out all these relationships. There is witness testimony, one table with those testimonies, gym membership details, vehicle information, gym check-ins, alibis, and through its reasoning the LLM was able to connect all the dots and conclude that Jeremy Bowers is the murderer. So we had the case closed.
Let's go through the challenges and draw some conclusions. As I mentioned, in some cases the SQL query output was too large and exceeded the token limit. Sometimes the generated SQL used wrong column names, so there were errors related to that.
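One way to catch the wrong-column-name errors before they hit the database is a pre-flight check of the generated SQL against the real schema; here is a sketch for SQLite, with a deliberately minimal and illustrative keyword list:

```python
# Pre-flight check: flag identifiers in generated SQL that don't exist in the schema.
import re
import sqlite3

SQL_KEYWORDS = {"select", "from", "where", "and", "or", "join", "on", "in",
                "like", "limit", "order", "by", "group", "as", "not", "null",
                "count", "distinct", "inner", "left"}

def schema_identifiers(conn):
    """Collect all table and column names from a SQLite database."""
    names = set()
    for (table,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
        names.add(table.lower())
        for col in conn.execute(f"PRAGMA table_info({table})"):
            names.add(col[1].lower())  # second field is the column name
    return names

def unknown_identifiers(conn, sql):
    """Return tokens that are neither SQL keywords nor schema names."""
    stripped = re.sub(r"'[^']*'", "", sql.lower())  # drop string literals first
    tokens = set(re.findall(r"[a-zA-Z_]\w*", stripped))
    # If this set is non-empty, ask the LLM to regenerate instead of executing.
    return tokens - SQL_KEYWORDS - schema_identifiers(conn)
```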
And there is some cost, of course, from OpenAI usage. The lessons learned: better results are achieved when a problem is solved step by step, or distributed via different prompts between different LLMs or LLM setups. Multi-agent conversations provide more control and visibility into what happens. And the context in the prompt plays an important role.
Some future work: exploring reasoning techniques like Tree of Thoughts or self-checking, which are also emerging capabilities, and groups or trees of agents. Sometimes, as in this case, if the investigation went in a wrong direction, it snowballed into a totally different direction. With trees, for example, the right decision can be picked from among several explored directions.
And of course, fine-tuning an open-source or smaller LLM to compare the performance. Just future directions.
In general, this is just to spark your thoughts and ideas. We have some limitations here, with a trade-off between them: cost, speed or efficiency, and quality.
We know we have more powerful LLMs, like GPT-4, which are more expensive. We could use less powerful LLMs instead, but if we have a single, very important task, do we want to sacrifice quality? Do we want to risk it? And then there is speed. It matters here too: if you want many iterations, maybe you set up 10 different LLMs, let them work overnight and see what happened in the morning; of course, that is not very quick either. Some other challenges.
This year, some papers have been debating whether LLMs actually have reasoning capabilities. Some of the papers I cite here found that in some cases performance was very dependent on the training data.
If a similar task was included in the training data, the LLM would perform well; if not, if the task falls outside the training data, it would fail and performance decreases a lot. Similarly, biases picked up during training were found: when an LLM didn't know the correct answer, its reasoning would just try to justify an incorrect answer. This is something we should be mindful of.
And some conclusions. Yes, LLMs are powerful technologies, of course, and in this case the reasoning capabilities are evident, although we still cannot trust the results 100%. The type of task matters, of course.
We should be mindful of whether the task was included in the training data or not. There are also studies where the task was completely new, something unseen, not a common task, and there LLMs would fail.
LLMs can still make incorrect claims; we now know the hallucination problem very well, and it can be misleading. Especially for chained tasks like the ones we just saw, where the next answer depends heavily on the previous one, we get a snowball effect, so we should be particularly mindful there.
And potential future directions: domain-oriented tasks, more powerful LLMs; we now also see smaller LLMs evolving, and different agent structures. Just some ideas about future directions.
As I mentioned: hierarchical chat, more sophisticated multi-agent structures with agent teams and tree-based structures, and very recent research from January and February this year, where researchers found a way to weight different LLMs. Why not? In machine learning we have plenty of applications of ensembles of trees for classification. So maybe this is a way to solve more complex tasks where we cannot risk false positives, tasks that are very important, where maybe we have only two or three of them to do.
Throughput is not so important there, but quality matters most; maybe that's the direction to take. And something as simple as confidence scores and voting: why not run 100 different LLMs with different setups, then look at the distribution and take the most frequent answer? That can still help on a task like this, on SQL tasks, and it can still save a lot of human effort.
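A minimal sketch of that voting idea, assuming a hypothetical `run_investigation` callable that returns one verdict per run:

```python
# Majority voting over repeated runs, in the spirit of self-consistency.
from collections import Counter

def majority_verdict(run_investigation, n_runs=100):
    """Run the whole investigation n times and vote on the answer."""
    answers = [run_investigation() for _ in range(n_runs)]
    verdict, votes = Counter(answers).most_common(1)[0]
    agreement = votes / n_runs  # crude agreement score, not a true probability
    return verdict, agreement
```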
Thank you very much.