Hi everyone, hope you can hear me fine. My name is Prashant. And yeah, I work at Kuzu. It's a graph database startup from Waterloo, Ontario.
And I've been at Kuzu for a year. It's basically a young startup, but the goal is to build a graph database that's very fast, easy to use, and scales to really large graphs.
So first of all, I think I just want to make sure I ground everyone in terms of what the terminology is. Graphs are also what you may call networks in the literature.
So it's not those squiggly lines on a chart. It's more like this, where you have an abstract relationship between entities in the real world. So essentially, nodes represent a real-world entity, and you have relationships that connect these entities.
Graphs provide a more object-oriented viewpoint to your data modeling than the relational model you may have in Postgres and other SQL databases. But in general, graphs are very natural data structures for representing real-world data, and I'll illustrate why with the example I'm about to walk through.
So imagine you work in a healthcare setup, and you have some data in PDFs. Now, as you know, PDF data can be varied. You have a lot of multifaceted content. You have text, you have tables, you have images.
There could be a lot of stuff in the PDF that has some level of inherent structure, but it's not exactly structured, and it's not pure text either. So there's a lot of rich information in PDFs, as you know, in all enterprises.
So the scenario I'm covering here is: imagine you have a table of medications and drugs. I know it's not clearly visible here, but I'll go into the actual example.
But essentially, the table shows medical conditions, the generic name of the drug, the brand name of the drugs, and the side effects that the drugs have. So again, I'll show the zoomed-in version of this data set soon.
But you also have, let's say, a scenario where a nurse practitioner is interviewing patients. The patients report side effects and symptoms, they're taking certain dosages of medication, and all these notes are stored in unstructured form. The nurse practitioner is noting things down in their system; it's digitized but not structured.
So this is, again, rich information that would otherwise lie unutilized. You would have it sitting somewhere in a warehouse or data lake, just dumped there, and nobody's really analyzing it.
So what we're trying to say here is that a graph, or what we call a knowledge graph (graph being the more general term), is a very useful tool in these situations. You have the idea of entities: in this case, the central red node is the generic name of a drug, the orange nodes are the symptoms or side effects, the purple node is the patient (in this case the patient ID), and the green node is the brand name. So again, the goal is to take the data from unstructured form into some structured form, in this case a graph, and use it to answer questions about the data.
So the first thing to do is understand what we are modeling. This is called a graph schema. So a schema is essentially me sketching what entities I'm going to capture.
So as you can see here, I have condition, which is a medical condition that's being treated. It has an arrow here, pointing leftward toward the drug, which is the generic name of the drug. The drug has a brand, which is this green node here.
And the drug also can cause a symptom, which is a side effect. So all this information in our example comes from PDF. We don't have this in a database. It's currently stored in the form of a PDF.
And the second part we're talking about is the patient data. You can notice the mapping I have is very natural: the patient has a condition that may also appear in my PDF document, the patient is prescribed a drug whose generic name is in my PDF, and the patient experiences a side effect that is also in that PDF.
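To make the schema concrete, here's a rough sketch of how it could be declared in Kuzu from Python. The table and relationship names (Condition, Drug, Brand, Symptom, Patient, IS_TREATED_BY, HAS_BRAND, and so on) are my own illustrative choices, not necessarily the exact names used in the demo.

```python
import kuzu

# Open (or create) an embedded Kuzu database on disk.
db = kuzu.Database("./medical_graph")
conn = kuzu.Connection(db)

# Node tables: one per entity type in the sketched schema.
conn.execute("CREATE NODE TABLE Condition(name STRING, PRIMARY KEY(name))")
conn.execute("CREATE NODE TABLE Drug(name STRING, PRIMARY KEY(name))")
conn.execute("CREATE NODE TABLE Brand(name STRING, PRIMARY KEY(name))")
conn.execute("CREATE NODE TABLE Symptom(name STRING, PRIMARY KEY(name))")
conn.execute("CREATE NODE TABLE Patient(patient_id STRING, PRIMARY KEY(patient_id))")

# Relationship tables mirroring the arrows described above.
conn.execute("CREATE REL TABLE IS_TREATED_BY(FROM Condition TO Drug)")
conn.execute("CREATE REL TABLE HAS_BRAND(FROM Drug TO Brand)")
conn.execute("CREATE REL TABLE CAUSES(FROM Drug TO Symptom)")
conn.execute("CREATE REL TABLE HAS_CONDITION(FROM Patient TO Condition)")
conn.execute("CREATE REL TABLE IS_PRESCRIBED(FROM Patient TO Drug)")
conn.execute("CREATE REL TABLE EXPERIENCES(FROM Patient TO Symptom)")
```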
So I'm bringing together two heterogeneous sources of data, one from a textual format and another from a PDF, to model it and then answer questions about it via RAG or other methods.
So where does this tie in with agents? The idea here is that constructing a knowledge graph has historically been very challenging, because you have to bring in the data and figure out what entities to extract, and rule-based systems don't do well here because the data is really varied and you don't necessarily know what rules apply. People used natural language processing before the age of LLMs, which worked pretty well, but LLMs are changing the game, and that's what I want to show here.
So this is where the idea of modular agents comes in, and the term agent here is important to understand. At the core, you have a prompt, which is sent to a language model, an LLM. The agent is essentially a wrapper around a prompt that performs a task.
The goal of the first agent, in this case, is to process images or PDFs into structured data. The goal of the second agent is to process the nurse's notes, a textual document, into a structured form. We're going to take both of those outputs and push them into a graph database, in this case Kuzu. And we're going to use the framework BAML, which I'll talk about briefly.
But the idea here is that Kuzu is an embedded graph database. It offers a persistence layer, and it's also very fast and scalable while being open source.
So after we build the graph, we have a RAG agent, which again works from a specified prompt, and again we use BAML for the prompting.
But basically, the goal is we have modular agents that are independently constructed, optimized, and we can orchestrate these systems in much more intelligent ways. So this is a very basic orchestration that I'm showing. You could obviously add a lot more fail-safes and custom logic in place to orchestrate these in different ways.
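As a rough sketch of that orchestration, here's one way it could be wired together in Python. The agent functions are passed in as callables because their names and signatures here are purely illustrative, not the demo's actual code.

```python
from typing import Any, Callable, Iterable

def build_and_query(
    page_images: Iterable[str],                 # paths to PNG pages from the PDF
    nurse_notes: Iterable[str],                 # free-text nurse notes
    question: str,
    image_agent: Callable[[str], Any],          # agent 1: image -> structured rows
    notes_agent: Callable[[str], Any],          # agent 2: text -> structured rows
    load_graph: Callable[[list, list], Any],    # writes both into Kuzu, returns a connection
    rag_agent: Callable[[Any, str], str],       # agent 3: graph + question -> answer
) -> str:
    # Each agent is a self-contained module; the orchestration is just
    # a linear pipeline here, but fail-safes or retries could be added
    # around any individual step.
    drug_rows = [image_agent(img) for img in page_images]
    patient_rows = [notes_agent(note) for note in nurse_notes]
    conn = load_graph(drug_rows, patient_rows)
    return rag_agent(conn, question)
```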
So yeah, I think I won't go too much into slides. I'll go into a demo.
The first thing I want to show is that, because of the kind of data PDFs hold, it generally makes sense to think of them as multimedia content. They're not pure text; in many cases you have images and other things in there.
So the first thing I'll do is run this application, like a web application, where I just dump a PDF, transform it into images, and store them page by page. So I'm just running this document here, and of course, in a live demo, there's always going to be some issue. But the point is, when I run this application, I get PNG image files, and I use those images as input to a multimodal language model.
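One way to do that page-to-image conversion, assuming the pdf2image package (which wraps Poppler) is available; the demo may use a different conversion library.

```python
from pathlib import Path
from pdf2image import convert_from_path  # requires Poppler to be installed

def pdf_to_page_images(pdf_path: str, out_dir: str = "pages") -> list[str]:
    """Render each PDF page to a PNG file and return the file paths."""
    Path(out_dir).mkdir(exist_ok=True)
    pages = convert_from_path(pdf_path, dpi=200)
    paths = []
    for i, page in enumerate(pages, start=1):
        out_path = f"{out_dir}/page_{i:03d}.png"
        page.save(out_path, "PNG")
        paths.append(out_path)
    return paths
```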
I could choose any language model that understands images. So that's basically the main code demo that I want to show.
I know there are a lot of lines here, so I'm trying to maximize my screen utilization. The idea is to prompt for and extract data out of unstructured content, in this case the image file I just pointed to, a single page that has tables. If I zoom in closely, we can see what's in the tables. We have rows of data, and each row contains nested information: the reason for the medication, the names of the medications, and so on. And as you can see, this is not standard, cleanly formatted data. You have all sorts of generic names here, but then you have the brand names in parentheses. Along with the brand names, you have comma-separated values, registered trademark symbols, a lot of extraneous content that you don't really need for the purposes of analysis. And in the side effects, you have a list of values. It's not one single side effect; it's a collection of side effects.
So the power of modern LLMs is that by using modern frameworks and prompting the LLM in an intelligent way, you can actually obtain a lot of useful information in a very, very reliable fashion. And that's what BAML, the framework I'm using here, provides. So I'll very briefly walk through what it's doing.
Essentially, the concept of BAML is that it guarantees you a structured output from an LLM. Traditionally, with ChatGPT, when you type a question in, you get back a stream of words in English or natural language. That's what you'd term unstructured data, because it's essentially just a blob of text being spat out at you. What we're saying is that to gain value from data, you want some structure. You want to be able to query it, get aggregate numbers, and analyze it in ways that aren't just asking an LLM to do it for you. So BAML imposes that structure.
What it does is let us define the model we are trying to capture. I have a Drug: it has a generic name of type string and brand names, which is an array of strings, and I tell it very clearly here, don't give me special characters, give me only alphanumeric characters. Then I have each row of information, which references the Drug type I just defined above. So I have a condition, an array of drugs, and a side effect array here. All of these together are the data model that I'm expressing.
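In the demo this data model is written in BAML itself; as a rough Python equivalent of the same shape (using Pydantic purely for illustration, with field names that are my guesses rather than the demo's exact ones), it looks something like this:

```python
from pydantic import BaseModel

class Drug(BaseModel):
    generic_name: str        # e.g. "omeprazole"; alphanumeric only, no trademark symbols
    brand_names: list[str]   # e.g. ["Prilosec"]; parsed out of the parentheses in the table

class DrugTableRow(BaseModel):
    condition: str           # the medical condition being treated
    drugs: list[Drug]        # one or more drugs listed for that condition
    side_effects: list[str]  # a list of side effects, not a single string
```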
And here is a prompting function. Now, this is basically a very standard prompt, right? You have a system prompt here where I tell it the goal: extract the condition, drug names, and side effects from the provided table. And note that the input to this is an image. I'm not giving it clean text; I'm just giving it the raw image. What BAML does is a lot of intelligent stuff under the hood. I can't go into the details here, but I'm happy to talk more after. It does a lot of work under the hood to get that prompt into a shape that an LLM can really understand and use well. To show what it's doing, I'll run a test.
It's basically transforming this code that I've written as a function into this prompt here, in this little middle section of my editor. This is the actual prompt being sent to the LLM. It tells the LLM to answer with a JSON array that follows the schema, and it includes the schema itself in the instructions. And what I'll do is run a test, which is basically calling this with the image.
So I call an LLM. In this case, I'm calling GPT-4o mini, which is not an expensive model; it's a relatively cheap model. As you can see, what's happening under the hood is that it's generating structured data from the unstructured image that I've provided. The structured data is guaranteed to respect the data model that I've asked for. If I ask for strings, it's going to give me strings. If I ask for a float, it's going to give me a float. So as an engineer and a developer, I have guarantees that my data has structure, and I know the types that I'm getting out of my model.
And the great thing about this is you can test it end to end. Every single time I change an LLM or change my prompt, I can immediately run my test, swap out the model, and get my results. And as you can see, it shows me whether the test has passed or failed. So there are many ways I can write an intelligent test suite to make sure that I get the data that I want.
So this is the first step. I get my structured JSON blob of conditions, drugs, brand names, et cetera, et cetera.
So this is the image portion of my pipeline. Similarly, I can do a very similar thing with the nurse's notes portion of my pipeline. I have medications, dosages, frequencies, and side effects, again as experienced by a patient.
And I again give it a prompt saying, extract medication information from the nurse's notes. In this case, my input is a string. It's not an image. So it's an unstructured blob of text.
So again, I can open a playground here and basically run my test, where I'm giving it information saying the patient was given metformin and they report mild nausea. Note here that it says the patient denies having diarrhea. These are all logical things that a human might read and say, okay, one side effect experienced is nausea, and diarrhea is not experienced. And the LLM also understands this. When you pass in a prompt this way and the prompt is structured well, the LLM is able to reason about the fact that "denies" is a negation term, and it does not report that as a side effect even though the term was mentioned in the text.
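To make that concrete, here is a rough sketch of how the generated BAML Python client could be called and checked from a test. The function name ExtractMedications and the field name side_effects are assumptions on my part; BAML code generation exposes whatever names you declared in your .baml files, so the demo's actual names may differ.

```python
# Assumed: BAML codegen produced a typed client under baml_client,
# and a function ExtractMedications(notes: string) was declared in BAML.
from baml_client import b

def test_negation_is_handled():
    note = (
        "Patient was started on metformin 500 mg twice daily. "
        "Reports mild nausea. Denies diarrhea."
    )
    result = b.ExtractMedications(note)

    # The output is a typed object, not a blob of text, so we can
    # assert directly on its fields.
    side_effects = {s.lower() for s in result.side_effects}
    assert "nausea" in side_effects
    assert "diarrhea" not in side_effects  # "denies" should be treated as negation
```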
So again, these are all very powerful techniques for getting the best out of modern LLMs by using prompt engineering and testing the workflow end to end. So how is this useful? Essentially, I have some code that I'm not going to show here, because it's a lot of code.
But in a relatively straightforward way, I'm able to transform the JSON information that I've extracted from the LLM into basically this graph. And what I'm showing here is a very simple subgraph.
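A heavily simplified sketch of what that loading code does, assuming the schema sketched earlier and the illustrative field names from the data-model sketch, is to merge the extracted records into Kuzu with parameterized Cypher:

```python
def load_drug_rows(conn, rows):
    """Merge extracted drug-table rows into the graph (illustrative names)."""
    for row in rows:
        conn.execute("MERGE (c:Condition {name: $name})", {"name": row.condition})
        for drug in row.drugs:
            conn.execute("MERGE (d:Drug {name: $name})", {"name": drug.generic_name})
            conn.execute(
                "MATCH (c:Condition {name: $c}), (d:Drug {name: $d}) "
                "MERGE (c)-[:IS_TREATED_BY]->(d)",
                {"c": row.condition, "d": drug.generic_name},
            )
            for brand in drug.brand_names:
                conn.execute("MERGE (b:Brand {name: $name})", {"name": brand})
                conn.execute(
                    "MATCH (d:Drug {name: $d}), (b:Brand {name: $b}) "
                    "MERGE (d)-[:HAS_BRAND]->(b)",
                    {"d": drug.generic_name, "b": brand},
                )
```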
Let me zoom in very quickly. So I have this purple node, which is a patient.
So that's basically what is used to power this web app here. And I can type in any sort of question, like, I don't know, what drug brands treat heartburn? What it's doing is creating a sequential chain query that asks the database: the condition name is heartburn, it's treated by such-and-such drug, and so on.
Actually, in certain cases, the LLM does mess it up. But as you can see here, it does obtain the context and provides me the brand names of the drugs that are in the database in natural language.
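For reference, the kind of query generated for "what drug brands treat heartburn?" would look roughly like this, again using my illustrative schema names and assuming the Kuzu connection opened earlier; the demo's generated Cypher will differ in its details.

```python
question_query = """
MATCH (c:Condition {name: "Heartburn"})-[:IS_TREATED_BY]->(d:Drug)-[:HAS_BRAND]->(b:Brand)
RETURN d.name AS generic_name, b.name AS brand_name
"""
result = conn.execute(question_query)
while result.has_next():
    # Each row pairs a generic drug name with one of its brand names,
    # e.g. ['omeprazole', 'Prilosec'] (hypothetical output).
    print(result.get_next())
```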
What you could also do is take this query, go back to the UI, and return all the information there.
So basically, once you have a certain level of UI development on top of the database, you can interactively explore the graph and learn a lot more about the data, in a way that would have been a lot harder if the data were in a table or some form other than a graph.
So I think to close things up, I just want to highlight some takeaways. The role of graphs in agentic workflows is, I think, something that needs to be appreciated a bit more.
And the key thing to understand here is that graphs are great when the data has some degree of underlying structure, like, for example, the PDF that I showed, where you have nested data within tables. A traditional vector RAG-based approach, where you convert the data into embeddings and naively put it into a vector database, does not really apply there, because the kinds of questions you are asking involve the connectivity in the data.
You would rather have the data structured than potentially lose those connections within a vector store. And the other thing is factual accuracy.
So the good thing about knowledge graphs specifically is that you are able to see exactly what context the LLM used and visualize it through those connections. So when you get a result from your RAG bot, you're able to actually show its citation: you can see the source of the data and visualize the actual paths that were traversed to provide that answer. And of course, that plays a lot into explainability, because a lot of modern systems are opaque.
As you can see, a lot of people are building systems whose internals they are not able to explain fully. So while LLMs themselves are hard to interpret, the output from LLMs is something that we as engineers need to be very mindful of. We need to be evaluating them and explaining why we get the results that we get.
Until recently, it was very hard to construct graphs from unstructured data; a lot of machine learning pipelines and information extraction processes were involved. But with modern tools like BAML and Kuzu, I think the engineering rigor, ease of use, and open source tooling have caught up to a point where you can pretty much turn any proprietary data in your organization into a structured form that can be queried by an LLM.
And yeah, this is kind of why I hold the thesis that graphs can form core components of modular and composable agentic workflow systems; those are the key terms here. Modular basically means you have self-contained blocks that do subtasks well, and you compose these modules together in intelligent ways and build an orchestration layer on top. That's basically what we mean when we say agentic workflow.
An agentic workflow is not a monolithic entity; there are subsystems operating within it. And before going into the multi-agent paradigm, I think it makes sense to optimize these sequential, or you could say more regulated, systems, where we know the inputs and outputs at each stage, and then use those for downstream tasks.
So yeah, I just want to highlight the two frameworks and tools I've used here. I've used BAML, which is made by a company called Boundary ML. They have a very active Discord channel, and I personally am on there a lot because I use their tool a lot.
It's amazing. So I highly recommend checking BAML out.
But yeah, Kuzu is where I work, and I'm planning to use BAML and Kuzu a lot more side by side because I think the two tools go very well hand in hand. And Kuzu has a Discord as well; it's at kuzudb.com/chat.
So yeah, I'd love to interact more with people who are using LLMs to transform their unstructured data into more structured forms. And of course, people who are working with graphs, I'm always happy to chat more about graphs.
If you want to actually look at the code and how this demo sort of works, you can check out this GitHub repo or you can chat with me after and I'll share the links with you.
And yeah, please follow Kuzu Inc. on LinkedIn and yeah, happy to chat more with people who want to learn more.