Hey, everyone. Thank you for coming.
I'll just go in the middle. My name's Sahil. That's Sid. We're with Lytics.
Lytics is all about getting your LLM applications up, live, in front of customers, and working as you expect. So what we want to talk about is what happens once you've built something cool and it's deployed and users are using it. How do you effectively maintain what you've got? How do you monitor it, keep tabs on what's going on, and then find those opportunities for iteration? That's what this talk is about.
Real quick, just how we're gonna go about this.
We'll introduce ourselves really quickly, mention who we are, what we've been working on, and what lets us talk about this, and then get stuck into the life cycle of a deployed LLM application, which we think of in three steps.
So the first is: you build something, now you've deployed it, users are using it. Is what you've built performing the way that you expected? And how can you get alerted anytime it's going wrong or you're having an interaction that's undesirable?
The second is, how can I improve what I've built? So users are using it. In theory, you'd want to make their user experience even better. How can you identify those opportunities for improvement and identify what needs to be improved?
And the last one is: OK, you have a change in mind. You're going to make a change to your code, to your RAG pipeline, to your prompt, to your underlying data. How do you make sure that change is only making the impact you want it to and not having any adverse effects?
So really, really quick: my name's Sahil, previously product at QuickBooks, with a smattering of data experience, including a bit of time in data at Square. I'm Sid; I previously worked at Titan, another startup in the fintech industry, and I've worked at startups my whole life. We've built a smattering of small LLM projects between ourselves, and this is also based on our time working with companies from our YC batch, the most recent Winter '24 batch.
So just getting stuck into it, one, is my system performing as expected? And the challenge here is that LLMs are these inherently non-deterministic functions. Each output of an LLM is going to be unique. And that's actually required.
The underlying problem here is that traditional developers are used to setting up alerts on some condition. And when that condition is breached, give me an alert. Let me know what happened.
That randomness is required for the inference and the creativity that we all love about LLMs. But with that comes the problem that if you want to alert on what your LLM has produced or your LLM stack has produced, you're always going to need to do a little bit of work to look at what's been created and decide if you need to make an alert on that. So as an example, if you want to alert based on, let's say, toxicity or similarity to prompt, you will always need to spin up all these lines of code that decide, OK, how toxic is this statement?
You'll need infrastructure and jobs running to consistently evaluate that. That's just tech debt that you'll need to maintain and carry with you, and it fundamentally takes you away from building something that your users need.
The corollary to this is that we've also seen LLM teams typically expand to about three or four different vendors and LLM providers as they deploy and start to make more changes. And so as you have more LLMs talking to each other, that gets even more challenging. You have all of these different dashboards, each with its own bug management tools.
We've seen users take user IDs and timestamps and try and compare them across all these different dashboards for root cause analysis. And it's just not a good use of anyone's time.
So next we'll talk about how you might be able to set this up natively. Oh, sorry, quick example first.
Here is an LLM-based tool that we use for our own LinkedIn outreach, which called one of our prospective connections a "professional button presser." Thankfully, we were able to catch that before the message went out. But this is exactly the kind of thing you would want to be alerted on, and you can see why all of that infrastructure would be needed to decide that you need to alert on this.
Yeah, so the format will be I'll first show how you can do this at home and how you might do this by yourself. And then the second part will be how we solve it here at Lytics, our company.
The first way that you can solve it by yourself is by using open source models that just extract toxicity and sentiment and all these fun things. On the left-hand side, you can see a makeshift pipeline. Again, all I'm using are open source Hugging Face models to extract toxicity.
This is one dimension that you might care about, for example. And you can see on the right-hand side, one of them is very toxic and the other one is not so toxic. This is a really basic example of just a single dimension that you might care about, which is toxicity.
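For reference, here's a minimal sketch of what that kind of makeshift pipeline might look like at home. The model name and the alert threshold are illustrative assumptions, not the exact pipeline from the slide; any open source toxicity classifier would slot in the same way.

```python
# Minimal DIY toxicity check with an open source Hugging Face model.
# Assumptions: unitary/toxic-bert and the 0.8 threshold are illustrative
# choices; label names depend on the model you pick.
from transformers import pipeline

toxicity_classifier = pipeline("text-classification", model="unitary/toxic-bert")

ALERT_THRESHOLD = 0.8

def check_toxicity(llm_output: str) -> None:
    """Score one LLM output and raise an alert if it looks toxic."""
    result = toxicity_classifier(llm_output)[0]  # e.g. {"label": "toxic", "score": 0.97}
    if result["label"] == "toxic" and result["score"] > ALERT_THRESHOLD:
        # Swap the print for Slack, PagerDuty, email, etc. in a real deployment.
        print(f"ALERT: toxic output ({result['score']:.2f}): {llm_output!r}")

check_toxicity("Happy to walk you through the setup whenever works for you.")
check_toxicity("You are a worthless idiot and your product is garbage.")
```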
But you can imagine there are a bunch of other things, like sentiment, reading comprehension, theme detection, things like that. So you can imagine that as you start to scale this out, if you decide to build this in-house, it's essentially another project that you're going to have to manage. You'll have to make sure it's up to date: as new models come out and the landscape generally changes, you're going to have to keep maintaining it. It works, but it's not ideal if you're trying to scale and, as Sahil said, focus on the product you're actually trying to build.
At Lytics, we kind of take all that and we do it all for you. So you have a dashboard where we'll extract all the aforementioned metrics. You can alert on them specifically.
We also have more interesting ones. We worked with a group that was really specifically interested in regex alerting: as these events come in, they're immediately alerted as soon as something they care about shows up.
But this can take the pressure off your internal team and let you focus on what you're building, and we'll deal with the metrics that we've decided are the most relevant.
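If you wanted to wire up something like that regex alerting yourself, it might look roughly like this. The rule names, patterns, and the notify hook are placeholders, not how Lytics implements it internally.

```python
# Rough sketch of regex-based alerting on incoming LLM input/output events.
# The rule names, patterns, and notify() hook are placeholders.
import re

ALERT_RULES = {
    "refund_request": re.compile(r"\brefund\b", re.IGNORECASE),
    "pii_email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def notify(rule: str, text: str) -> None:
    # Replace with Slack, PagerDuty, email, etc.
    print(f"ALERT [{rule}]: {text!r}")

def on_llm_event(text: str) -> None:
    """Run every incoming event through the alert rules."""
    for rule, pattern in ALERT_RULES.items():
        if pattern.search(text):
            notify(rule, text)

on_llm_event("Can I get a refund on last month's invoice?")
```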
So the second part is now you have something live. You know when something's going wrong. But how can you make it even better? How can you identify those opportunities for improvement?
And more importantly, why can you not use the tools that are already out there? There are some great product analytics tools out there already. The fundamental problem we've seen LLM teams have with using what's out there is that LLMs are fundamentally based on free-form strings: strings getting passed from a user to a model, or between models.
And the question is, how are you going to include that string in your analytics infrastructure in a way that really makes sense and lets you answer questions about what your users are doing? Anybody who has worked with tools like PostHog or Amplitude or Looker can maybe project out what that challenge would be. Do I just include the raw string as it is? Do I start to find keywords?
If I want to group it in some way, what's the right way to do that? And we've also seen that analytics needs change as teams get bigger, and so that complexity compounds as you're changing which analytics events you care about on the fly.
The second behavior we've noticed is that teams, when they start out, are perfectly happy to just log everything manually and look through those logs at some cadence. That works perfectly well when you have a few users and maybe a hypothesis of what kind of changes you want to make and you're looking for data. But not only does that become unscalable as you get more users, we've found teams are frustrated by the product opportunities they're leaving on the table by not having some kind of automated tool.
So as an example, let's say that I was on the OpenAI team. And here's a user on ChatGPT asking, what's the capital of France? As the PM or as the developer, what's the right way for me to include that in my PostHog dashboard? Do I just include the raw string?
If I do, how do I use that in my funnel analysis or my retention charts? If I want to group it, what's the right way to group it? Ideally, I would just not think about this at all. But I do need it so that I can kind of keep tabs on what my users are doing and start to include my LLM metrics in my product analysis.
So again, Sid's going to go into how you might be able to set this up by yourself. Yeah, so there are a bunch of ways to approach this problem. One that we're going to share today, which we thought was kind of creative, is the idea of creating a baseline string metric that you care about.
So in this example, the baseline string metric is, I want to buy this product. That's what I care about. That's what I want to see if people are actually doing. That's the string that I want inputted from my users.
And what you can do is embed both the user input string, like "No, I'm good, I don't want it" or "Yes, I'm really excited to buy it," and then do a cosine similarity against this baseline metric that you have, to see how close those embeddings are. So you can see that the one that says "I'm really excited to buy this product" is really similar to the baseline metric I've defined, and the one that says "No, I'm good, I don't really care about this product" is not as similar.
You do have to know what this baseline metric is, so it puts a little bit of onus on the person that's actually developing this pipeline. But you can see it works.
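Here's a small sketch of that baseline-metric trick, assuming the sentence-transformers library; the model name and the example strings are just illustrative.

```python
# Embed a baseline string and each user input, then compare with cosine
# similarity. Assumption: all-MiniLM-L6-v2 is an arbitrary small embedding model.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

BASELINE = "I want to buy this product"
baseline_emb = model.encode(BASELINE, convert_to_tensor=True)

def purchase_intent_score(user_input: str) -> float:
    """Cosine similarity between the user input and the baseline metric."""
    user_emb = model.encode(user_input, convert_to_tensor=True)
    return util.cos_sim(baseline_emb, user_emb).item()

print(purchase_intent_score("Yes, I'm really excited to buy it!"))   # high similarity
print(purchase_intent_score("No, I'm good, I don't want it."))       # lower similarity
```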
And then you can imagine that I can take this and funnel it into my PostHog dashboard, and this could be my conversion metric. I've defined that anyone who says "I really want to buy this" gets a thumbs up; that means I've converted someone. And I'm using cosine similarity to figure this out.
At Lytics, we go a little bit more in depth. What we actually do is take all of your input/output events and automatically start creating themes from them. So we'll extract them and create some high-level themes, like programming questions.
And then underneath, we'll automatically start deciding that inside that theme there are questions relating to Python concurrency, Python built-in functions, et cetera. So we take all of that data from your I/O events, actually extract the user intention from it, and then give you a nice dashboard where you can see exactly what your users are doing and the number of interactions on each of those themes.
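If you wanted to approximate that theme extraction yourself, one rough approach is to ask an LLM to label each interaction. The prompt, model name, and JSON keys below are assumptions for illustration, not the actual Lytics pipeline.

```python
# Ask an LLM to tag each input/output pair with a theme and sub-theme.
# Assumptions: gpt-4o-mini as the labeling model and the "theme"/"sub_theme"
# keys are illustrative choices.
import json
from openai import OpenAI

client = OpenAI()

def extract_theme(user_input: str, llm_output: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": (
                'Label this interaction as JSON with keys "theme" '
                '(e.g. "programming questions") and "sub_theme" '
                '(e.g. "Python concurrency").\n'
                f"User: {user_input}\nAssistant: {llm_output}"
            ),
        }],
    )
    return json.loads(response.choices[0].message.content)

print(extract_theme(
    "How do I run two coroutines at once?",
    "You can use asyncio.gather to run them concurrently...",
))
```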
Cool. So how does that actually help you with your product analytics?
So here you can see on the left side there, in the Lytics dashboard, you have a couple of interactions. Some are about programming questions, some are about scuba diving. And the other thing we do is automatically tag each one with a bunch of eval metrics.
And then what we do is we take all of those events and we import that over to your product analytics tool, so in this case, PostHog. And we include all of those themes that you added or that we added for you and those eval metrics as metadata in the PostHog event. And so what that lets me do as a non-technical PM or a dev that's interested in product insights is I can start to do my product analysis cut by different LLM metrics.
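Concretely, an enriched event like that could be sent with the posthog-python client along these lines. The property names and values here are illustrative rather than the exact schema we export, and the capture signature depends on your SDK version.

```python
# Send one LLM interaction to PostHog with themes and eval metrics as metadata.
# Assumptions: property names, values, and the API key are placeholders.
from posthog import Posthog

posthog = Posthog(project_api_key="phc_your_key_here", host="https://app.posthog.com")

posthog.capture(
    "user_123",            # distinct_id
    "llm_interaction",     # event name
    {
        "theme": "programming questions",      # auto-extracted theme
        "sub_theme": "Python concurrency",
        "toxicity": 0.02,                      # eval metrics ride along as metadata
        "similarity_to_baseline": 0.81,
        "model": "gpt-4o-mini",
        "prompt_version": "v3",
    },
)
```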
So in this case, I can see my funnel conversion rate for a purchase event cut by what my users were talking to my bot about and do analysis like, oh, people talking about programming questions have a much higher conversion rate than people asking about scuba diving. In this case, we just did it on theme, but you could imagine this could extend to which prompt you're using, which data set you're using, which model you're using, any eval like that.
Sweet. So the last one is: OK, I have a change in mind. I want to make a change to my whole LLM stack. But before my users see it, I want to make sure that it's doing what I expected and that there are no unexpected changes, no adverse effects.
So how can I test for that? What's the unit test equivalent for LLMs? And unfortunately, again, due to the randomness of LLMs, clean unit testing the way that traditional devs are used to is near impossible.
Why is that the case? There's kind of two reasons. One is that there's just way too many unknown unknowns with how users could be interacting with your product and what the LLM could be spitting out.
As an example, we spoke with somebody who works on the Gen AI platform at Google. And they said that even at Google, if there's like 1,000 things that could go wrong with their LLM stacks, their team may only know about 100 of them.
And so what that means is that you could, in theory, manually test each of those cases that you are aware of. But there's a couple problems with that. One is you're not testing all the things that you don't know about, so that's just left to your users to figure out, which ideally you wouldn't want.
But the second thing is that manually testing cases is also a pretty inefficient and time-consuming way of testing. You're not going to be confident that your manual test accurately reflects the case you have in mind, you're still leaving all those other cases on the table, and again, it's a terrible use of time.
And again, we've seen that this problem really starts to compound as your stack includes more and more models, because that randomness compounds with every model you add. So yeah, this next part is going to go into how we have been thinking about this.
Yeah, so just starting with a disclosure: this area is very new, and this is just one approach that we've settled on.
But thinking about how I would start testing a natural language-based output, the first thing that comes to mind is maybe regex. Maybe it's a really clean, easy case where I just care about a specific keyword being in the output. That's really easy to test for; I could just use regex.
If you are interacting with your model with JSON that has predictable values like booleans, numbers, and stuff like that, another one is just testing with a JSON parser. But then you can imagine, as I alluded to, as soon as you get to natural language-based stuff, it becomes a little bit harder to test. How am I going to test the vibe of something or some stuff like that?
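Those two easy cases might look something like this; the keyword and the JSON field names are made up for illustration.

```python
# The "easy" cases: a regex check for a required keyword, and a JSON parse
# check for structured outputs. Field names here are illustrative.
import json
import re

def check_mentions_keyword(llm_output: str, keyword: str) -> bool:
    """Pass if the required keyword shows up anywhere in the output."""
    return re.search(rf"\b{re.escape(keyword)}\b", llm_output, re.IGNORECASE) is not None

def check_valid_order_json(llm_output: str) -> bool:
    """Pass if the output parses as JSON and has the expected typed fields."""
    try:
        parsed = json.loads(llm_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed.get("in_stock"), bool) and isinstance(parsed.get("quantity"), int)

assert check_mentions_keyword("Our refund policy lasts 30 days.", "refund")
assert check_valid_order_json('{"in_stock": true, "quantity": 3}')
assert not check_valid_order_json("Sure! The item is in stock.")
```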
At Lytics, the approach that we're going with is using LLMs to test LLMs. So I'll go through an example here.
In this example, on our Lytics dashboard, you define a test called the no-profanity test. You basically say: hey, LLM, please return 0 or 1, and given this input and output, tell me if there's any profanity in it. This is a simple example, but you can imagine this can extend to pretty much anything that you want or any interaction that you see.
And then on the developer side, it's exactly what you'd expect from a unit test framework. You have a decorator that you add on top of your tests. You just say, hey, no profanity test. And then similar, you just set the input and output.
And then this, you can see on the other side, is going to, again, look just like Jest or any other unit test framework that you're used to, where it'll run the tests and tell you, hey, one of them failed. And it's because it had a bad word in it.
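For a sense of the pattern, here's a rough LLM-as-judge check written with a plain OpenAI call and an ordinary assert. This is not the actual Lytics decorator API shown on the slide; the judge model and the prompt wording are illustrative.

```python
# LLM-as-judge sketch: ask a judge model to return 0 or 1 and assert on it.
# Assumptions: gpt-4o-mini as the judge and the prompt wording are illustrative;
# this is not the Lytics decorator from the slide.
from openai import OpenAI

client = OpenAI()

def llm_judge(instruction: str, model_input: str, model_output: str) -> int:
    """Ask the judge model to grade an interaction; expects '0' or '1' back."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"{instruction}\nReturn only the digit 0 (fail) or 1 (pass).\n"
                f"Input: {model_input}\nOutput: {model_output}"
            ),
        }],
    )
    return int(response.choices[0].message.content.strip())

def test_no_profanity():
    output = "Thanks for reaching out! Happy to help with your order."
    assert llm_judge(
        "Return 1 if the output contains no profanity, otherwise 0.",
        "Where is my order?",
        output,
    ) == 1
```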
We're hopefully going to open source a lot of this, so it'll be there for anyone to see and anyone to use. That's the approach that we've seen a lot of our customers be comfortable with. It's a little weird, it feels a little uncomfortable, but it does get a lot of mileage, and to Sahil's point, you don't have to waste too much time getting into the nitty-gritty.
So that's us. That's our LinkedIn. We're continuing to develop in the space, so you can watch that for more.
And if you're building yourself and you're close to deploying and want someone to help you think about some of these challenges, get ahead of them, we'd love to chat. Always very curious to see what people are building. And we just want to make sure that you get that over the line as smoothly as possible.
So that's us. That's Lytics. Thank you.