Kùzu: a fast, scalable and easy-to-use graph database for AI

Introduction

So hi, everyone. I'm Prashant. I'm an AI engineer at Kuzu.

And I know a lot of the audience may not be deeply technical. I consider myself a relatively technical person.

I'm a builder. I work on AI tooling. And Kuzu is a graph database company, so we sit at the upstream end of the infrastructure stack, on top of which tools like LLMs and other downstream applications can be built.

Understanding Graphs

So to get started with my talk, just a quick show of hands: how many people in the audience have actually come across graphs or know what a graph is? OK, so that's a reasonable number of people.

So I'm hoping that this image resonates with you because I spend a lot of time with graphs every day. And I think that a lot of people use the term graphs to describe 2D plots like the one above. But in reality, a graph is essentially an abstract representation of entities, which are represented as nodes, and the relationships between them, which are represented as edges.

And in the real world, a lot of data is connected. If you look at the data closely and then you zoom out, like this image here, you actually see that the data is organized in a specific way. And it reveals some very interesting structures.

Introduction to Kuzu

So with that, I'll jump into what Kuzu is. Kuzu is a graph database management system built by Kuzu Inc. We're based in Waterloo, Canada, and I myself am based in Toronto.

So Kuzu is a database that is built with usability at the forefront. And the data model that Kuzu uses is called the property graph model. It's a technical term, but I'll go into an example that demonstrates what that is.

Any database system requires a query language. If you've ever worked with relational systems, you'll know SQL; a graph database has its own query language, and in Kuzu's case, it's called Cypher.

Kuzu is also an embeddable graph database that makes it much simpler to use. It's quite similar to DuckDB and SQLite. So there are no servers to set up. It supports ACID transactions. It's permissively licensed open source. And it's interoperable with many data formats. But most importantly, Kuzu is designed to be integrated with the Python AI and data ecosystem.
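To make that concrete, here's a minimal sketch of what embedded usage looks like in Python (the database path and query are placeholders, not from the talk):

```python
import kuzu

# The database is just a directory on disk; no server process to manage.
db = kuzu.Database("./demo_db")
conn = kuzu.Connection(db)

# Queries are plain Cypher strings issued over the in-process connection.
result = conn.execute("RETURN 'hello from Kuzu' AS greeting;")
while result.has_next():
    print(result.get_next())
```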

So a lot more information is available on the website, kuzudb.com. And I'll describe a bit more in our demo as well.

Interchangeable Data Models

Before I go into the demo, I want to describe that data models are interchangeable. Essentially, this slide shows three data models, all of which show the same data. You have a person living in a city, in this case, two persons, Wendy and Jeff, who live in Toronto.

A person follows another person. A person lives in the city. So in the left image, it's shown as a relational model, which is a collection of tables that are connected to one another. The middle image shows a document model, which is a semi-structured blob of information like this. But the third image shows the same data organized as a graph. And this is what I mean when I say property graph model.

A person follows another person, a person lives in a city, and you have information or metadata about that person attached to that node, which is basically an ID and a name in this case.

What Kuzu does is bring structure to the property graph model. We call it the structured property graph model; you might be familiar with labeled property graphs, which are used in other systems. What this image shows on the left is essentially mapped to what's shown on the right, which is tables. Person and City are modeled as node tables, and Follows and LivesIn are modeled as relationship tables. So every graph under the hood consists of these tables.
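Reusing the connection from the earlier snippet, that mapping could be declared in Kuzu's Cypher DDL roughly like this (table and property names here are illustrative, not the exact ones from the slide):

```python
# A sketch of the mapping: two node tables and two relationship tables.
# (Table and property names are illustrative assumptions.)
conn.execute("CREATE NODE TABLE Person(id INT64, name STRING, PRIMARY KEY(id));")
conn.execute("CREATE NODE TABLE City(id INT64, name STRING, PRIMARY KEY(id));")
conn.execute("CREATE REL TABLE Follows(FROM Person TO Person);")
conn.execute("CREATE REL TABLE LivesIn(FROM Person TO City);")
```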

The broader vision of Kuzu is to be the go-to system for graph data modeling and graph data science. And it's well integrated with a lot of packages in the Python data science ecosystem. So if you've ever used any of these packages, it's relatively simple to move your data back and forth between Kuzu and the other systems. And the same holds true for the data formats that you may come across in the wild.
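For instance, a query result can be pulled straight into a pandas DataFrame; a minimal sketch, reusing the tables from the previous snippet:

```python
# Query results convert directly to familiar Python data structures.
result = conn.execute("MATCH (p:Person) RETURN p.name AS name;")
df = result.get_as_df()  # pandas DataFrame
print(df)
```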

Live Demo: Wine Reviews Graph

So I'll jump into a demo. And because I was in Napa Valley in California recently, I thought this would be a very apt thing to demo. I'm going to pull this data set of wine reviews, which is from Kaggle.

And essentially, you have the IDs of the wines, the descriptions and titles of the wines being reviewed, and the tasters who have tasted these wines and written those reviews. But most importantly, there's additional structured information like price, points, country, and things like that.

So I'll jump into the code, and we can actually look at an example of how this is modeled as a graph. So at a very high level, I'm just going to jump into the schema here. What I'm calling a schema is essentially just like the diagram I showed before.

This is the data I'm trying to model. I have wine. It's tasted by a taster, which is this red node here. The wine is in the green node. The wine is from a country, which is in this orange node here.

And you also have a data set of customers who purchase these wines, who follow these tasters on Twitter. And the customers themselves live in a country.

So what I'm first going to do, without walking through the code, is run this. Essentially, we have all this data organized in external files, in this case, Parquet files. And once the code finishes running, it's essentially going to construct a graph, which I'll visualize very shortly.
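Under the hood, that loading step boils down to bulk COPY statements from the Parquet files; a hedged sketch, with file paths and table names as assumptions rather than the demo's exact ones:

```python
# Bulk-load each node and relationship table from its Parquet file.
# (File paths and table names are assumptions, not the demo's exact ones.)
conn.execute("COPY Wine FROM 'data/wine.parquet';")
conn.execute("COPY Taster FROM 'data/taster.parquet';")
conn.execute("COPY Tasted FROM 'data/tasted.parquet';")
```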

But before we go into the visualization, I just want to show what Cypher looks like. It's a query language for graphs. So this query here, if it's visible, is reminiscent of SQL, if you've ever written a SQL query for relational databases. You have a match statement, which says that a taster tasted a wine. This little arrow here shows the direction of the relationship: the taster tasted the wine, and not the other way around. And you're returning the name of each taster, counting the number of reviews they've written, and ordering by that count. So it's quite intuitive to read from a declarative standpoint.
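The query on screen would look roughly like this (the labels and property names are my guesses, not necessarily the demo's exact ones):

```python
# Count reviews per taster, most prolific first.
query = """
MATCH (t:Taster)-[:Tasted]->(w:Wine)
RETURN t.name AS taster, COUNT(w) AS num_reviews
ORDER BY num_reviews DESC;
"""
print(conn.execute(query).get_as_df())
```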

And the results from this query show that there's one particular reviewer called Roger Voss who has reviewed about 25,000 wines. As for the total data set, I won't go into the individual lines, but there's an earlier query I ran which shows that there are about 130,000 wines in this data set. So one person alone has reviewed close to 20% of the wines.

So now that we know that the data is in there, I'm going to visualize it. What I'll do is fire up a user interface where we can actually view this data graphically. I'm just going to write the simplest Cypher query that I can possibly write, which is: match everything in the graph and return everything, limited to just 200 results. Once I do this, you'll see structured information come back, right?
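That "match everything" query is about as short as Cypher gets; run through the Python API, it would be:

```python
# Match any node-edge-node pattern and return everything, capped at 200 rows.
conn.execute("MATCH (a)-[r]->(b) RETURN * LIMIT 200;")
```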

And you have a bunch of information here. You have person purchased wines, which are the green nodes here. I have the metadata of all that information, including the description, the variety, the points, and so on.

So just to give an additional idea of what kind of query we're writing, I'm going to copy this one here. This query is essentially asking the question: give me the number of customers who purchased wine reviewed by this reviewer called Kerin O'Keefe. In this case, it's returning four as the answer. So essentially, the idea here is you're able to use a declarative query language to query your structured data in a way that models the connections in that data in a very intuitive manner.
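That multi-hop question translates to a single Cypher pattern, roughly like this (labels, relationship names, and the exact spelling of the taster's name are assumptions):

```python
# Customers who purchased a wine reviewed by a given taster.
query = """
MATCH (c:Customer)-[:Purchased]->(w:Wine)<-[:Tasted]-(t:Taster)
WHERE t.name = "Kerin O'Keefe"
RETURN COUNT(DISTINCT c) AS num_customers;
"""
print(conn.execute(query).get_as_df())
```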

Incorporating LLMs with Kuzu

Now, where it gets interesting is the second notebook here, where we're going to use an LLM to do the same sort of query. So what I've done is attach an OpenAI key, which means I can use an OpenAI model.

In theory, I can use any language model to do this. But now that I've done this, I'm going to close this notebook, and I'm going to run this one. Sorry, I just need to remove the database lock first.

OK, so I'm just going to go ahead and run these initial lines. And as always in a live demo, something's going to be off. Yes, I need to run this.

OK, so what I'm essentially doing is using the OpenAI endpoint with a model called GPT-3.5 Turbo, and I'm asking a question in natural language. Just like before, when I answered the question of who the reviewer with the maximum number of wines tasted was, this is a natural language query.

How many wines has Roger Voss tasted? This essentially enters a graph QA chain, the model writes the Cypher query for me, and I'm able to output an answer with this context.

This context is passed as a prompt to the language model, and that's essentially used to formulate this response that Roger Voss has indeed tasted 25,000 wines. So the second question I'm asking here is, give me the full names of customers who purchased wine that was tasted by this reviewer, Roger Voss. It does a similar thing, constructs a query, and it outputs the names of these people.
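The chain in the demo looks like LangChain's Kuzu integration; a minimal sketch of that setup, assuming the KuzuGraph and KuzuQAChain classes from langchain_community (import paths have shifted across LangChain versions):

```python
import kuzu
from langchain_community.chains.graph_qa.kuzu import KuzuQAChain
from langchain_community.graphs import KuzuGraph
from langchain_openai import ChatOpenAI

db = kuzu.Database("./wine_db")  # path is illustrative
graph = KuzuGraph(db)

# A single LLM both writes the Cypher query and phrases the final answer.
chain = KuzuQAChain.from_llm(
    llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    graph=graph,
    verbose=True,
)
chain.invoke("How many wines has Roger Voss tasted?")
```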

So the idea here is we have a lot of LLMs at our disposal. We can use any of them to fulfill the required goal. And nowadays, there's a lot of open source language models as well, not just OpenAI ones.

So in this case, what I'm doing is using a local endpoint where I'm running a Mistral 7-billion-parameter local model, which is much cheaper and much smaller than the OpenAI models. And I'm instead telling this chain that the question-answering portion of the pipeline will be handled by the local language model, whereas the Cypher-writing, or query-writing, portion of the pipeline will be handled by the GPT model. When I run this query,

you'll see that it's going to invoke the chain. But once it writes the initial query, it's going to actually pass the output to the open source LLM. And actually, I need to invoke the chain for that to happen. So now that it's doing this, it's written the query. And you'll see that the tokens are now being output from the open source model. So this essentially means that as a user, you have the flexibility to choose which language model powers your application and interchangeably move these models around to power the application of your choice.
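Assuming the chain accepts separate models for query generation and answer generation, the way LangChain's graph QA chains expose cypher_llm and qa_llm arguments, that split would look roughly like this (the local endpoint details are placeholders):

```python
# Hypothetical split: GPT-3.5 writes the Cypher, while a local Mistral 7B
# served behind an OpenAI-compatible endpoint phrases the final answer.
chain = KuzuQAChain.from_llm(
    cypher_llm=ChatOpenAI(model="gpt-3.5-turbo", temperature=0),
    qa_llm=ChatOpenAI(
        base_url="http://localhost:8000/v1",  # local inference server (placeholder)
        api_key="not-needed",
        model="mistral-7b-instruct",
    ),
    graph=graph,
    verbose=True,
)
chain.invoke("Give me the full names of customers who purchased wine tasted by Roger Voss.")
```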

Concluding Thoughts

So with that, I'll actually quickly go to some concluding thoughts. The idea of retrieval augmented generation is not new. It's been in the space for more than a year now.

And nowadays, I think if you see a lot of these frameworks, they're pivoting towards what they call agentic workflows. And the idea here is your raw data sits in a variety of sources upstream. LLMs can access that data, no doubt. But you may have proprietary data. You may have data that is not structured in a way that the LLM has the right context to answer those questions.

So what normally happens is you build workflows upstream to organize that data that you have into structured sources. And then you use combinations of what are called routers or agents. Routers are slightly more rule-based systems, whereas agents are slightly more autonomous systems. And the level of autonomy can vary based on what you're trying to build.

But the idea is that these routers and agents orchestrate the workflow in a way that your data that is upstream can be passed to a downstream store. It could be an API, or it could be a vector store. It could be a graph store, like I just showed. And each of these serves a different purpose.

The APIs are more like tools. Essentially, your LLM is very poor at calculation, and it's very poor at weather forecasting because it doesn't have that data, right? So all the LLM does in that situation is formulate a request and query the API to retrieve that information. Those calls are typically done through tools.

Vector search is a very well-known topic right now. And vector search, as you know, it matches on semantic similarity. So your LLM is very good at that.

It can retrieve semantically similar terms and use that information to generate a useful response. But where graphs come in, and where I really think a tool like Kuzu can help, and this is where we're actually working to bring a lot more functionality to users, is the fact-checking portion. Because even with vector search, there's a possibility that when you retrieve a response, the context is lost and the model hallucinates. A graph, by contrast, is factual information.

It stores structured information in a way that you can actually retrieve a response and then have it factually checked against the graph via a query language, like I just showed. And as models improve on a daily basis, they get smaller, they can be fine-tuned more easily, and they can potentially write better Cypher. So this is where I see the space moving. You see a lot more interest in graphs in the world of RAG and agentic workflows, and if you look at the frameworks that are pivoting towards building agentic tooling, a lot of them are thinking about knowledge graphs being at the core of it. So I think that's where I'm going to end this talk:

Kuzu as a Tool for LLM-based Applications

Kuzu is an open source tool. It's very, very scalable. It's easy to use. And we are looking for people to build LLM-based applications on top of us.

And I think this audience is very apt for this particular topic, because the way RAG is going right now, this is a very prominent area where things can be built in a very different way than before. So, yeah, you can get started by running pip install kuzu, and please give us a star on GitHub.

Call to Action

All the code that I've shown, I didn't have time to go through in depth, but you can always go to our GitHub organization, github.com/kuzudb, and check out the code there. And, yeah, you can scan the QR code to join our Discord community, where we have a lot of active users who are working with these applications and using a bunch of tools to build on top of us. So I'm looking forward to chatting with any of you who are interested in this. Yeah, thank you very much.
