Learnings from Using LLMs in Production by Marko Klopets

Introduction

Awesome. Hey, I'm Marko. I'm one of the co-founders and the CEO of a company called Supersimple.

I'm going to talk about some of the learnings we got from building a couple of real LLM-based products in production over the past couple of years.

Background and Context

I'm going to start off with a little bit of context and background on what we do so that any of this makes sense. And then I'll zoom in on one specific LLM-based AI that we have and talk about how we thought about product design in the context of an AI-native world, whatever that means, and how it works technically.

Can we get a quick show of hands, how many of you are technical, somewhat technical? And how many are product people? Awesome.

So I used to be an engineer. I used to build a bunch of data and AI apps for the enterprise. Then, over time, I moved over to a product role, did some data science, trained a few models here and there. And now, for roughly the past two years, I've been building this company called Supersimple (not Go Supersimple).

Supersimple: A Data Exploration Platform

Supersimple is a data platform for B2B SaaS companies. And that means we actually do a few things.

Core Features of SuperSimple

At its core is a modern business intelligence platform: a tool that allows people to interact with data to answer questions.

Now, our fundamental bet here is that most of the value in data comes from people going really deep into specific questions rather than just checking their KPI dashboards. This sort of deep data exploration is what Supersimple was built for.

Joined at the hip with that is what we call our AI Insight Engine, which acts like a thousand little automated data scientists who continuously test different hypotheses and watch your metrics. When it finds something, we give you a little nudge and send over a report that has all the underlying data, so that you can draw your own conclusions.

And then thirdly, we have an API that these same B2B SaaS companies use to build customer-facing reporting into their products and to give their own customers AI-generated insights.

The Genesis and Evolution of the Insight Inbox

But where we started from, a little over two years ago now, was something like this. A very central part of our product was what we called the Insight Inbox. And again, the idea was that we can use AI to automatically find nuggets of gold for you to then look further into.

And so we gave you an objective fact: this is the observation, this is what we see happening in the world. And we gave you a couple of numbers to back it up. But very quickly, we realized that this has two problems.

Or rather, humans have two problems. First, people always want to see the context of what they were just told. Imagine a very smart data scientist coming to you with a little report that says, hey, this is what's happening, and it's pretty surprising.

You're going to want to see proof. You're going to want to understand what assumptions they made and how they got to these answers. And that's even more true if you're getting these insights from an AI that just claims to be smart.

And then humans very much always want to go deeper. If you get to something that's actually interesting, you want to ask follow-up questions, right? It doesn't end after just one shot.

So very unexpectedly, we ended up doing this massive detour of building an entire business intelligence tool to make these insights actually useful. And we built one that's centered around these sort of ad hoc questions, exploring data, not just building dashboards. And so that's the core of our product today.

Developing AI-Based Products

But the first AI that we properly shipped to actual customers ended up being what we call a TTQ or text to query. As some of you might guess, this means you can indeed use natural language to ask some types of questions. And when you ask a question, we put together some type of query for you.

And let me do a live demo, because otherwise there's no chance of this going badly. So, what am I looking for? For example, I want to understand, in a B2B setting, how the average number of users per account is changing over time. If I type this in, then two things happen at the same time.

First, we search through your entire data catalog, which is a collection of pre-built reports, database tables, and metrics that you have available. We use semantic search with embeddings from an LLM to, in this case, surface a report that doesn't really share any of the same words as my question. It's called Customer Team Sizes, and it actually almost answers my question, except it doesn't handle the changing-over-time bit.
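As a rough illustration of that kind of catalog search (a minimal sketch, not Supersimple's actual implementation; the embedding model, catalog entries, and helper functions are assumptions), here's what embedding-based semantic search can look like:

```python
# Minimal sketch of embedding-based semantic search over a data catalog.
# Assumes the OpenAI Python client (v1.x); the model name and catalog items
# are illustrative only.
import numpy as np
from openai import OpenAI

client = OpenAI()

catalog = [
    "Customer team sizes: number of users per account",
    "Monthly recurring revenue by payment plan",
    "Feature adoption by account segment",
]

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def search(question: str, top_k: int = 3) -> list[tuple[float, str]]:
    doc_vecs = embed(catalog)
    q_vec = embed([question])[0]
    # Cosine similarity between the question and every catalog entry.
    sims = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
    return sorted(zip(sims, catalog), reverse=True)[:top_k]

print(search("average number of users per account changing over time"))
```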

But at the same time, our AI, with its cute little penguin at the bottom, put together a new mini report for us. If we click into this, we see that we asked for the average number of users per account changing over time. And we see this as a time series chart. We see a table. Cool.

AI UX and Text to Query (TTQ) Functionality

Let's zoom out for a second and think about what the obvious way for this to work is. What's the obvious AI UX? I think no matter the context you're in, the obvious AI UX these days seems to be this, right?

I'm assuming a few of you have seen this product. It's by a San Francisco startup. They're somewhat famous.

And indeed, you can ask it questions about data, or you can ask it to put together a SQL query. It'll do it happily. Sometimes, occasionally, it'll even be correct. And this is pretty interesting.

So a bunch of companies are now building something based off of this: slightly more purpose-built apps where you have an even nicer text box where you type in your question, you get the SQL query out, and you have a button to run that query. Cool, that works. And there are literally hundreds of companies doing this right now, including every major competitor that we have.

So the thing is, we think there are a couple of things wrong with this. First, I would argue that reading and debugging SQL isn't a lot of fun. Again, I'm pretty technical; I can write SQL. But most of the time, it'll actually be faster for me to write 10 lines of SQL than to read somebody else's 10 lines and make sure they do exactly the right thing. And not everybody is technical.

Second, as we're betting on this sort of deeper analysis, not just getting to one report and ending it there, you want to explore around. You want to make changes. You want to go deeper. And you don't necessarily want to go into a weird prompt-engineering mode where you're like, hey, Siri, can you please put back in those enterprise customers that I just asked you to filter out?

And then thirdly, I would argue that off-the-shelf LLMs, things like GPT-4, just aren't smart enough for this. So when you type a question into our platform, the way we solve for these three problems is by having the AI not write a single line of SQL at all.

On our platform, the main way that humans interact with data is through this sidebar, where you have a few of these no-code steps that you can string together. And all that the AI did was use these very same no-code steps in the sidebar to answer my question. We can read these top to bottom: it started from looking at all accounts; then, for each account, it figured out how many users that account has; and then it took the average of that for each cohort the accounts signed up in. And we can click into any of these, see the details, and change something about it.

And what makes this interesting is that my grandma can look at these steps. My grandma is pretty meh at SQL, but she can look at these and understand what this query does and what this data actually shows her. Which means that you can actually trust these results, and you have the context that's required to draw some conclusions, which I think is critical.

To make that happen, you first need a deep platform that's able to solve the problem together with a human. And then the AI needs to be extremely well intertwined with it.

And secondly, I talked about going deeper, making changes. If we want to, for example, look at this same thing, average number of users per account changing over time, but how it's different for each payment plan, we can do that in literally two clicks. Again, because there's a platform that was built for this specific problem of using data.

Technical Challenges and Solutions

And then thirdly, I mentioned that state-of-the-art LLMs can't really do SQL. So why can we do this? For one thing, because we don't do SQL. But the answer is a bit more nuanced than that. There are two reasons.

One is that we're taking a base model, something like a GPT, and specializing it down to one very specific use case. So our model is pretty bad at making small talk, but it's really good at giving you specific bits of data, because we trained it only for that. I'm going to get into how we actually trained it in a sec.

And secondly, we're reducing the amount of work it needs to do. Because it doesn't need to worry about things that are going on in the database, things like CTEs or joins or using the right keys, it only needs to worry about these very high-level steps that my grandma could understand, such as, hey, I want to add a new column that says how many users there are. It just needs to do less work per output token, which means that the work it does do is of higher quality.

The Training Process

The way this works is, at a high level, pretty damn simple, super simple even. There are two inputs that we give the LLM. First, there's your entire data model, a description of everything available in your databases and warehouses and how things relate to each other, and then the question. And the output is just JSON that describes one of these explorations or reports on our platform, with all of the steps in the sidebar, every chart that's been added, and so on. There's nothing magical about this JSON.
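To make that concrete, here's a purely hypothetical sketch of what such an input/output pair could look like. The field names, step types, and data-model notation are invented for illustration; they are not Supersimple's actual schema.

```python
# Hypothetical example of the model's input and output; field names and step
# types are invented for illustration and are not Supersimple's real schema.
data_model = """
model Account: fields(id, name, plan, signed_up_at); has many Users
model User: fields(id, account_id, created_at)
"""

question = "Average number of users per account, changing over time"

expected_output = {
    "steps": [
        {"type": "start_from", "model": "Account"},
        {"type": "add_column", "name": "user_count",
         "aggregation": "count", "of": "Users"},
        {"type": "summarize", "metric": "avg(user_count)",
         "group_by": "signed_up_at", "granularity": "month"},
    ],
    "charts": [
        {"type": "time_series", "x": "signed_up_at", "y": "avg(user_count)"},
    ],
}
```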

And how it technically works is that we started out with GPT-3 DaVinci, the original GPT-3 model that OpenAI also allowed you to fine-tune. So we're fine-tuning these models with a dataset of a couple thousand examples of these triplets of data model, question, and perfect report. Over time, we tried literally every model out there that's at all significant, everything from StarCoder to the Llamas. We're currently using a fine-tuned version of GPT-3.5 Turbo in production. Going from DaVinci to 3.5 was something like a 2x boost in our performance, based on how we measure it internally.
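For a sense of what that kind of fine-tuning setup can look like with OpenAI's current API (a minimal sketch, not Supersimple's actual pipeline; the file names, system prompt, and serialization are assumptions):

```python
# Minimal sketch of preparing (data model, question, report) triplets as
# chat-format fine-tuning data and kicking off an OpenAI fine-tuning job.
# Assumes the OpenAI Python client (v1.x); file names, prompts, and the
# serialization format are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

examples = [
    {
        "data_model": "model Account: ...; model User: ...",
        "question": "Average number of users per account over time",
        "report": '{"steps": [...], "charts": [...]}',  # the "perfect" output
    },
    # ... a couple thousand of these
]

with open("train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "You turn questions into reports."},
                {"role": "user", "content": f"{ex['data_model']}\n\nQ: {ex['question']}"},
                {"role": "assistant", "content": ex["report"]},
            ]
        }
        f.write(json.dumps(record) + "\n")

training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=training_file.id, model="gpt-3.5-turbo")
print(job.id)
```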

And we'll also get back to how we measure that. Outside of OpenAI's world, very anecdotally, we got the best results from fine-tuning PaLM 2, which is also a closed-source model, by Google.

Now, even though we're fine-tuning these models with a bunch of data, showing them a bunch of examples, we're still also using a system message, effectively a prompt, to further guide the model. Some things are just very hard to learn from even a few thousand examples, especially if they're more nuanced. So there's a very detailed system message that we pass along at the beginning of every query, explaining exactly how the AI is supposed to interact with our product.
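At inference time, that could look roughly like this; the fine-tuned model ID and the system message text here are placeholders, not the real thing:

```python
# Sketch of calling a fine-tuned model with a detailed system message plus the
# per-request context (data model + question). Model ID and prompt text are
# placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM_MESSAGE = """You generate data explorations as a sequence of steps.
Only use models and fields that appear in the data model below. Never write SQL.
...(detailed instructions about the product's steps and charts)..."""

def ask(data_model: str, question: str) -> str:
    resp = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:acme::abc123",  # placeholder fine-tuned model ID
        temperature=0,
        messages=[
            {"role": "system", "content": SYSTEM_MESSAGE},
            {"role": "user", "content": f"{data_model}\n\nQ: {question}"},
        ],
    )
    return resp.choices[0].message.content
```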

And thirdly, I said we're outputting JSON. I lied. We used to output JSON. One of our engineers figured out that you can instead create a domain-specific language that's designed especially for your use case, which for us means these steps and these charts. For one thing, it's more token-efficient, so it's a bit cheaper and faster. But making something that's easier for humans to read also makes it a lot easier for the machine to read, in our experience.
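To illustrate the token-efficiency point, here's an invented comparison between a JSON encoding and a terser DSL-style encoding of the same exploration; neither is Supersimple's actual format.

```python
# Invented example comparing a JSON encoding with a terser DSL encoding of the
# same exploration; neither is Supersimple's actual format.
json_version = """
{"steps": [
  {"type": "start_from", "model": "Account"},
  {"type": "add_column", "name": "user_count", "aggregation": "count", "of": "Users"},
  {"type": "summarize", "metric": "avg(user_count)", "group_by": "signed_up_at", "granularity": "month"}
]}
"""

dsl_version = """
from Account
add user_count = count(Users)
summarize avg(user_count) by signed_up_at per month
chart time_series x=signed_up_at y=avg(user_count)
"""
# Roughly the same information, far fewer tokens, and easier for a human
# (or a model) to scan line by line.
```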

We also tried other types of models, few-shot prompting being the most basic one. This is essentially a fancy phrase for, hey, just have a good prompt. And with this, obviously, you can get to relatively good results on day one; you can definitely get to a pretty good Twitter demo. In our experience, GPT-4 is way better than everything else for this, including both commercial and open-source models. But for our use case, or for anything that isn't trivial and has a lot of depth, it's not very well suited, because you need a very long instruction manual to properly describe everything you can do with data.

And the models aren't that great at following instructions. And then thirdly, we did try agents. These used to be all the rage: working through the problem step by step, sort of in a chain-of-thought manner. We built a V1 using LangChain. We realized that LangChain was super popular, but it literally just replaced about 20 lines of Python for us. So we replaced LangChain with 20 lines of Python. We tried OpenAI's function calling, and it was great. But the problem with agents is that while you can get to pretty good results, they're extremely slow. For us, in our context, we need the AI to come back in a matter of seconds; more than four seconds feels really bad. For agents to work well, we needed something like 400 seconds, which also means it's about 100 times more expensive than what we're doing with fine-tuned models.
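For a sense of what a 20-line replacement for a framework can look like, here's a minimal, generic tool-calling loop using OpenAI's function-calling API; the tool, its schema, and the model choice are illustrative, not Supersimple's actual agent.

```python
# A minimal, generic tool-calling loop of the kind that can replace a heavier
# framework. The tool, its schema, and the model choice are illustrative only.
import json
from openai import OpenAI

client = OpenAI()

def run_query_step(step: str) -> str:
    """Hypothetical tool: pretend to execute one exploration step."""
    return f"executed: {step}"

TOOLS = [{
    "type": "function",
    "function": {
        "name": "run_query_step",
        "description": "Execute one no-code exploration step and return its result.",
        "parameters": {
            "type": "object",
            "properties": {"step": {"type": "string"}},
            "required": ["step"],
        },
    },
}]

def agent(question: str, max_turns: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        resp = client.chat.completions.create(model="gpt-4", messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:          # model is done: return its final answer
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:     # run each requested tool, feed back the result
            args = json.loads(call.function.arguments)
            result = run_query_step(**args)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
    return "gave up"
```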

And so we're very excited about fine-tuning. It allows for, theoretically, infinite future improvement: you get more data, you get data about customers using your product, you get to train more. And with fine-tuned models, if you have access to the actual model weights, which you don't have with something like OpenAI but you do have if you're running things on your own GPUs, that's a lot of fun. There's a lot you can do, a lot of parameters you can tweak, and then you can do things with fancy acronyms like RLHF, Reinforcement Learning from Human Feedback, which is part of the magic sauce behind ChatGPT, and DPO, Direct Preference Optimization. Effectively, these are ways of aligning the model better with how you want the world to work.

Internal Tooling and Evaluation

For any of this to work, we had to build a bit of internal tooling. One of the main questions is how you evaluate the model. How do we figure out whether it works well or not? And if we make a change, does it make things better or worse?

With LLMs, as they're being trained, the loss essentially just measures how far off the output tokens are, how similar the text is to what's in the training set. But for us, the models need to be evaluated on whether they do exactly the right thing: does the output even have the right syntax, and then does it answer the question properly and without mistakes?

Our solution to this is we have a separate evaluation set with a bunch of questions. Here's one random one that's relatively long. We have a set of gold standard solutions to each of these questions. We don't do direct string comparison on these. We use a slightly more abstract way of comparing things.

But effectively, if the model says something that we have deemed to be exactly correct, then it gets a score of one. If it outputs something that doesn't even compile, something that our query engine, which takes in these steps, would say isn't logical at all, then it gets a zero. And if it's anything in between, then we actually have a human in the loop looking at each of these completions, marking them correct or wrong, and, if needed, updating our gold-standard set.
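A minimal sketch of that kind of scoring, assuming the comparison, compilation check, and human review live behind hypothetical helpers:

```python
# Sketch of the evaluation loop described above. The helpers (compiles,
# matches_gold, ask_human) are hypothetical stubs, not real Supersimple code.
from statistics import mean

def compiles(completion: str) -> bool:
    """Would the query engine accept these steps at all? (stub)"""
    ...

def matches_gold(completion: str, gold: str) -> bool:
    """Abstract comparison against the gold-standard answer, not a string diff. (stub)"""
    ...

def ask_human(question: str, completion: str) -> float:
    """Human-in-the-loop review: returns 1.0 (correct) or 0.0 (wrong). (stub)"""
    ...

def score(question: str, completion: str, gold: str) -> float:
    if not compiles(completion):
        return 0.0                      # query engine says this isn't logical at all
    if matches_gold(completion, gold):
        return 1.0                      # exactly what we deemed correct
    return ask_human(question, completion)  # anything in between goes to a human

def evaluate(model_fn, eval_set) -> float:
    """eval_set: iterable of (question, gold) pairs; model_fn: question -> completion."""
    return mean(score(q, model_fn(q), gold) for q, gold in eval_set)
```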

We also have a nice integration test running through the entire stack. Things broke one too many times right ahead of a demo like this. So now we have proper tests that go from something being typed in by the user, to us collecting all the context required to feed into the model, calling the model, getting results back, compiling those results into JSON, validating them, and showing them in the UI. These just run in CI.
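Such an end-to-end test might be shaped something like this; the module path and function names are hypothetical stand-ins for the internal stack:

```python
# Sketch of an end-to-end test of the kind described, with every stage behind a
# hypothetical internal API (collect_context, call_model, compile_to_json, ...).
import pytest

from myapp.pipeline import (  # hypothetical module path
    collect_context, call_model, compile_to_json, validate, render_for_ui
)

@pytest.mark.integration
def test_question_to_ui_roundtrip():
    question = "Average number of users per account, changing over time"

    context = collect_context(question)          # data model, catalog, etc.
    completion = call_model(context, question)   # hits the fine-tuned model
    report = compile_to_json(completion)         # model output -> JSON exploration
    assert validate(report), "query engine rejected the generated steps"

    view = render_for_ui(report)
    assert view.charts, "expected at least one chart in the rendered report"
```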

For managing training data: something like half a year ago, I very proudly wrote this tweet about how we went from having a bunch of CSV files to building a custom app in Retool to manage training data. A couple of months later, as more machine learning engineers joined, we realized that we were right in the middle of the midwit meme, where on the left you have dumb people like us just holding stuff in CSV files, and then we had our moment of enlightenment.

Like, no, you need to build something very custom and very purpose-built for this type of training data and manage it in a nice UI. And what we do now is we literally have one single file: a markdown file with code blocks in it. It's ridiculously long; your MacBook needs to be sort of beefy to open it up in VS Code. But it works, and we can do diffs, we can do pull requests. It works fine. You don't need anything fancy.
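As an illustration of how little tooling that needs, here's a sketch of parsing one big markdown file into training examples; the file layout (a heading per example plus fenced blocks for the data model, question, and expected output) is an assumption for the example.

```python
# Sketch of parsing one big markdown training file into examples. The exact
# layout (one '## ...' heading per example, fenced code blocks for the data
# model, question, and expected output) is assumed for illustration.
import re

EXAMPLE_PATTERN = re.compile(
    r"## .*?\n"
    r"```data_model\n(.*?)```\s*"
    r"```question\n(.*?)```\s*"
    r"```expected\n(.*?)```",
    re.DOTALL,
)

def load_examples(path: str) -> list[dict]:
    text = open(path, encoding="utf-8").read()
    return [
        {"data_model": dm.strip(), "question": q.strip(), "expected": out.strip()}
        for dm, q, out in EXAMPLE_PATTERN.findall(text)
    ]

examples = load_examples("training_data.md")
print(f"loaded {len(examples)} examples")
```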

Miscellaneous Challenges and Learnings

And then I'm going to finish this off with a bunch of very miscellaneous, quick-fire learnings, or things that we thought were interesting. One of the toughest things about doing this is that we essentially have an AI that needs to be very well aligned with some sort of API. In this case, the API is our product, which has constraints and a set of features the AI is allowed to use.

And so when you make changes to that API or when we make changes to our product, then we now have thousands of training examples that we somehow need to migrate over or rewrite. We have a few micro solutions to this, but overall, it's just a pretty tough problem.

We technically don't have one single model anymore. For any question that we take in, we run a little classification step, and then route it to one of several more specialized models. For example, one model is really great at understanding user churn and answering questions around that.
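The classify-then-route idea can be sketched roughly like this; the categories, classifier prompt, and fine-tuned model IDs are all hypothetical.

```python
# Sketch of the classify-then-route idea. The classifier prompt, categories,
# and fine-tuned model IDs are all hypothetical.
from openai import OpenAI

client = OpenAI()

SPECIALIZED_MODELS = {
    "churn": "ft:gpt-3.5-turbo:acme:churn::aaa111",      # placeholder IDs
    "general": "ft:gpt-3.5-turbo:acme:general::bbb222",
}

def classify(question: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the question as 'churn' or 'general'. Reply with one word."},
            {"role": "user", "content": question},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in SPECIALIZED_MODELS else "general"

def answer(data_model: str, question: str) -> str:
    model = SPECIALIZED_MODELS[classify(question)]
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"{data_model}\n\nQ: {question}"}],
    )
    return resp.choices[0].message.content
```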

A weird thing: at some point we realized that models were giving wildly different results from call to call. In our own internal evaluation score, we saw up to 2x differences using the exact same input. Somewhat recently, OpenAI introduced an option to set a seed to get consistent results. Highly recommend that.
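Using that option is just a matter of passing a seed (ideally together with temperature 0); the prompt and model here are placeholders.

```python
# Setting a seed (plus temperature 0) to make completions as reproducible as
# the API allows. Prompt and model are placeholders.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    seed=42,                      # same seed + same inputs -> (mostly) same output
    messages=[{"role": "user", "content": "Average users per account over time?"}],
)
# system_fingerprint identifies the backend configuration; if it changes between
# calls, results may differ even with the same seed.
print(resp.system_fingerprint, resp.choices[0].message.content)
```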

And then, when we get something back out of the LLM that we automatically determine is incorrect, we actually use another LLM, a few-shot version of GPT-4 with some instructions on how to fix certain types of errors. We feed that model the initial question, the initial context, and the validation error we got, and we ask it to fix it, which works a decent percentage of the time.
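A rough sketch of that repair pass, with the validator and the few-shot instructions stubbed out as assumptions:

```python
# Sketch of the "fixer" pass: when the generated report fails validation, ask a
# few-shot GPT-4 prompt to repair it. validate() and the prompt are hypothetical.
from openai import OpenAI

client = OpenAI()

FIXER_SYSTEM = """You fix invalid data explorations.
Given the original question, the invalid output, and the validation error,
return a corrected version in the same format. Examples:
...(a handful of few-shot error -> fix examples)..."""

def validate(report: str) -> str | None:
    """Hypothetical: returns an error message, or None if the report is valid."""
    ...

def fix_if_needed(question: str, report: str) -> str:
    error = validate(report)
    if error is None:
        return report
    resp = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": FIXER_SYSTEM},
            {"role": "user",
             "content": f"Question: {question}\n\nOutput:\n{report}\n\nError: {error}"},
        ],
    )
    return resp.choices[0].message.content
```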

And then, related to fixing stuff, there are some issues that we know the AI keeps running into. For example, in certain cases it keeps mistaking the name of a data model for the name of a relationship between data models. In cases where there's no possible ambiguity, we automatically detect that in code and fix it for the model.
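As an illustration of that kind of deterministic auto-fix (the step and schema shapes here are invented, not Supersimple's real data structures):

```python
# Sketch of a deterministic auto-fix: if the model used a data-model name where
# a relationship name was expected, and only one relationship points at that
# model, rewrite it. The step and schema shapes are invented for illustration.
def fix_relationship_names(steps: list[dict], relationships: dict[str, str]) -> list[dict]:
    """relationships maps relationship name -> target data-model name."""
    # Invert the mapping, keeping track of how many relationships hit each model.
    by_target: dict[str, list[str]] = {}
    for rel, target in relationships.items():
        by_target.setdefault(target, []).append(rel)

    fixed = []
    for step in steps:
        rel = step.get("relationship")
        if rel and rel not in relationships and len(by_target.get(rel, [])) == 1:
            # The "relationship" is actually a model name with a single obvious match.
            step = {**step, "relationship": by_target[rel][0]}
        fixed.append(step)
    return fixed
```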

And then finally: effectively fine-tuning open-source models. I said it's very fun. It's also pretty damn hard.

There's a lot of stuff for you to worry about, even just handling the infrastructure: spinning things up, keeping hold of a GPU that won't be shut down under you because somebody who pays enterprise-level money gets preference. One of the things that we loved, and still love, for doing this is axolotl, which is an open-source library for essentially just fine-tuning models. It handles a lot of the complexity for you, so you just have to figure out all the different knobs you need to turn, and there are a lot of knobs. With something like OpenAI and their fine-tuning endpoints, you just pop in the data and it sort of works; you can choose the number of epochs, and that's about it. If you're doing something more custom, you're going to see wildly varying loss curves (lower is better here) depending on what your settings are.

To finish this off, three things. First, nobody knows what they're doing here. Second, you shouldn't copy anyone, for that very reason, including me. And third, most chat interfaces are garbage.

Conclusion

Thank you.
