AI-powered product teams

Introduction

So hi everyone, thanks for coming and thanks to Qualtrics for hosting us this evening.

My name is Liam Collins. I do product management and engineering at Mindstone, and today I'm going to take you through a couple of the ways we're using AI in our product team.

So I'm going to go through a few of the use cases.

One around using personas to evaluate content, another on regression and performance testing in our app, a weekly planning playbook, and finally some of the problems AI causes for us and how AI can also solve them.

Building an Internal AI OS

So for context, the tool I'm going to be using to show some of these is something we've been developing internally, called the Mindstone OS.

The background to this is that, like a lot of companies, when AI exploded in the last two years we had a lot of enthusiasm internally, and everybody was off experimenting and trying things. But it was quite disconnected and disparate. We had some people in ChatGPT creating lots of custom GPTs, some people in Claude doing stuff, some people doing interesting stuff in Zapier, but it was a bit all over the place and nobody was benefiting from each other's work. So we wanted to try and build something a bit more centralized, where we could all work together and all benefit from each other's work.

Core Components and Capabilities

A very simplified way of thinking about this is that it's an AI agent in the center, and that agent has access to a bunch of tools to help it do things or get external data.

It has access to our internal knowledge base, so lots of knowledge about the company, about our products, about our people.

And it has access to a playbook library. Playbooks are essentially long prompts that we use to guide agents.

Why We Chose Cursor as Our Agent

For our AI agent, we use Cursor. Cursor is an AI editor used primarily for software engineering.

It's become quite big in the last few years. We looked at other solutions for this, like Claude and ChatGPT, and while they were able to do some of the things we wanted, they weren't quite able to do everything we needed. So we went for a more custom approach.

The benefit of Cursor is that it works really well with your local file system, and we try to keep as much as possible on your local machine, because that helps the agent run quickly, stay performant, and manage context.

And it's just a really good agent. It's been trained to handle very long, complicated software problems and to build features, so it's really good at creating to-do lists and checkpoints and iterating on what it's doing.

Live Demos and Use Cases

So I will try and do a couple of these live, so pray for me that the AI behaves itself.

So here is our OS. As you can see on the left, there's a bunch of playbooks, a knowledge base, and some other stuff around memory.
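To give a rough picture of what that can look like on disk (this exact layout is my assumption; the talk only shows playbooks, knowledge, and memory in the sidebar), an OS like this can just be a folder that Cursor opens:

```
mindstone-os/
├── playbooks/                 # long markdown prompts the agent follows
│   ├── weekly-planning.md
│   └── performance-analyzer.md
├── knowledge/                 # internal knowledge base: company, products, people
├── memory/                    # longer-term notes the agent keeps
└── notes.md                   # scratchpad for long-running tasks
```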

So the first one I'm going to run is this, and I'll get it running and then I'll explain a bit about what it's doing.

Using Personas to Evaluate Content

So this is "review program day as persona". Like Wouter and Anne were speaking about earlier, personas and synthetic users are a really interesting space with AI, because one of the things AI is really good at is impersonation. And from our perspective,

obviously speaking to users and getting feedback on what we're doing is really important, but there are limitations to that. It's difficult to get users to speak to you, and if they do speak to you, you get one shot. You can't call them up five times a day saying, hey, actually, I want your opinion on this now.

So we have put a bit of effort into creating some really detailed personas, which we feed all our real user interviews and real user feedback into, and then we try to leverage these for various tasks in our product.

So our UX designer uses these a lot to go like, grab me a persona and then take me through a user flow. Does this make sense? What are your expectations here?

What do you think will happen when I click this button?

And you get really interesting results when you go with a couple of different personas. Maybe something made sense for one type of person but not another. It gives you really nice on-the-spot insights, and you can tweak something and then run it all again, so it scales really nicely.
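For flavour, a persona file in a setup like this might look something like the following. The format and details here are my sketch; the talk only says the personas are detailed and built from real interviews and feedback:

```markdown
# Persona: Sarah, Marketing Manager

## Background
Mid-size B2B company, 8 years in marketing. Comfortable with tools
but not technical. Time-poor; skims rather than reads.

## Goals
Wants practical AI workflows she can apply the same day.

## Frustrations (from real interviews)
- Jargon-heavy explanations
- Exercises that assume engineering knowledge

## Voice
Direct; asks "why does this matter to me?" early and often.
```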

Applying Personas to Program Day Content

In this one, I've asked it to look at our program days. Mindstone runs three- and four-week learning programs where we teach people about using AI, and each day in the program is personalized, but it's personalized based on base lesson content. Obviously it's really important for us that that content is really good, so I've asked it here to go and get that content. What it's done is use our internal MCP to grab the program data. An MCP, for those who don't know, is a standard that allows agents and AIs to speak to other services.

So Slack, Linear, Notion, Qualtrics might have one. A service releases an MCP, and it allows your AI, once authenticated, to go and speak to it, get data, and post data.

So we have one for our internal data, and it has used that to get the program info. I've asked it to look for the reverse-prompting day, so it's gotten that.
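To make MCP concrete, here's a minimal sketch of an internal-data server, using the official MCP Python SDK. The tool names and data are my invention, not Mindstone's actual server:

```python
# Minimal sketch of an internal-data MCP server, using the official
# MCP Python SDK (pip install mcp). Tool names and data are illustrative.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("internal-data")

# Stand-in store; a real server would query an internal API or database.
PROGRAMS = {
    "reverse-prompting": {
        "title": "Reverse Prompting",
        "lesson_content": "Base lesson content for the day...",
    },
}

@mcp.tool()
def get_program_day(day_slug: str) -> dict:
    """Fetch the base lesson content for a program day."""
    return PROGRAMS.get(day_slug, {"error": f"unknown day: {day_slug}"})

@mcp.tool()
def update_program_day(day_slug: str, lesson_content: str) -> str:
    """Overwrite the base lesson content for a program day."""
    PROGRAMS.setdefault(day_slug, {})["lesson_content"] = lesson_content
    return f"updated {day_slug}"

if __name__ == "__main__":
    mcp.run()  # speaks MCP over stdio so an agent like Cursor can connect
```

Once registered with the agent, tools like these are what let it "go and grab the program data" or update content directly, as described here.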

Then it's gone, grabbed the persona, and it's done the analysis on the day imitating that persona.

It comes back and says, okay, here's what was good, what worked, what was confusing, some assumptions. It's given some suggestions on how things could be improved.

And then I'll skip down to the end. It's given me options to say, hey, do you want to send a Slack DM to somebody to get their second opinion on this?

Do you want to use the internal MCP to update the content directly? It's a really nice way of leveraging the persona to improve an aspect of the product.

Browser-Powered Performance Testing

So another workflow I'm going to go through is our performance analyzer playbook. In this one, it's going to use the browser MCP to visit our app at a specific URL.

I've asked it to go to the tools route here, and it's going to do performance testing on that.

Again, I wanted to show this because giving the AI access to the browser is another one of those tools that has lots of different use cases. We use it for testing user journeys, you can use it for end-to-end testing, and you can do stuff like this performance testing. It's one of those things that opens up a lot of avenues for you. So hopefully this will work, and it should open Chrome soon.
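In the talk, the agent drives the browser through a browser MCP. As a rough illustration of the kind of measurement involved, here's a hand-rolled sketch using Playwright instead; the URL and metrics printed are assumptions:

```python
# Rough sketch of a scripted performance check, using Playwright
# (pip install playwright; playwright install chromium). Mindstone does
# this via a browser MCP driven by the agent; this just shows the kind
# of measurement involved. The URL is hypothetical.
from playwright.sync_api import sync_playwright

APP_URL = "https://app.example.com/tools"  # stand-in for the tools route

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(APP_URL, wait_until="load")

    # Pull navigation timing from the browser's Performance API.
    # JSON round-trip makes the PerformanceNavigationTiming serializable.
    nav = page.evaluate(
        "JSON.parse(JSON.stringify(performance.getEntriesByType('navigation')[0]))"
    )
    print(f"time to first byte: {nav['responseStart']:.0f} ms")
    print(f"DOM content loaded: {nav['domContentLoadedEventEnd']:.0f} ms")
    print(f"full load:          {nav['loadEventEnd']:.0f} ms")

    browser.close()
```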

Inside a Playbook: Weekly Planning Example

While it's doing that, I might just show you an example of what one of our playbooks looks like. So where is this one? Here is one I use for weekly planning. They're quite detailed; this one runs to 271 lines of prompt.

We have certain sections: on agent use, the things the agent needs to know; a persona you give the agent for the task; a goal, what you're trying to achieve here; and then just a bunch of other things, including a really detailed process, so it knows to go through step by step, with checkboxes to tick things off as it goes, to give it as much help as possible.
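A rough skeleton of what a playbook like this might contain. The headings and wording are my reconstruction from the sections just described and the weekly planning workflow covered below, not the actual 271-line file:

```markdown
# Playbook: Weekly Planning

## Agent
You are an experienced product lead at Mindstone...

## Things to know
- We run one-week sprints; planning happens once a week.
- Tickets live in Linear, bugs in the Slack bugs channel, the roadmap in Notion.

## Goal
Produce a prioritized plan for next week and notify the team.

## Process
- [ ] Fetch in-progress tickets from Linear and identify their owners
- [ ] DM each owner on Slack for a status update; wait for replies
- [ ] Collect last week's reports from the bugs channel
- [ ] Pull the roadmap from Notion
- [ ] Write everything to notes.md as you go, to keep context small
- [ ] Draft a plan with prioritization suggestions and discuss it with me
- [ ] Create or assign Linear tickets and send the plan via Slack DMs
```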

Now, typically this has gotten stuck. Would you believe me if I said every time I tested this before, it worked perfectly?

While that is struggling, I might demo the next one, which is this weekly planning.

So I won't run this live because it's quite time -consuming.

Oh, here we go. So here you can see the agent; it's opening the browser and trying to navigate

Weekly Planning Workflow

to a page. Yeah... I doubt that. Anyway, while that is struggling, I'll show you our weekly planning workflow. So this is a playbook I use to do planning for the team. We run one-week sprints at Mindstone, where we plan one week at a time.

So the aim of this playbook is to go and get all the data I need to plan the week ahead for the team.

The first thing it does is go to Linear, our project management tool, and grab all the tickets that are in progress at the time. It figures out which team members are responsible for those, creates DMs to send to them so I can get status updates, and sends the DMs via Slack. Then it listens for responses, picks them up, and moves forward.

Data Gathering and Synthesis

Then it goes to Slack. We have a bugs channel in Slack where all the bugs reported by users or by us go, so we have visibility there. It goes there and gets all the bugs from the last week. It goes to Notion and gets our roadmap. Then, once it has all that data, plus the work-in-progress responses from the team members, it can

create a plan for me: okay, here's the current state of things, here are some ideas I think could be good work for next week, and it gives some suggestions around prioritization. Then I have a sort of ready-made planning partner that has all the context it needs to help me make decisions on what we should do for the next week. So I get into a discussion with it, say yes, let's do this, let's not do that; it will go into Linear and assign or create tickets as it needs to; and in the end it will send DMs on Slack: hey, here's the plan for next week, here are some tickets I want you to work on, and some notes, let me know if anything's unclear. The reason I wanted to show this one is that it's a nice example of pulling data in from a bunch of different places, synthesizing it, and having it help you make really good decisions. One additional note on this: on the right here, you

Managing Context with External Notes

might see it referring to notes.md as it goes. One of the issues with these sorts of longer-running playbooks, when you're getting a lot of data, is that there's a chance you overwhelm the agent's context.

So if the agent has to keep too much stuff in memory, performance starts to degrade and it can start getting confused.

So one of the things we do is create a little notepad, basically, and you say, okay, as you get data, write it here and then forget about it.

And it's a way of keeping its context small and the agent performant, while all the data is still right here on the file system, where it can access it really easily.
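In playbook terms, that scratchpad pattern is just an instruction along these lines (my paraphrase of the idea, not the exact wording from the file):

```markdown
## Context management
As you gather data, append it to notes.md under a heading per source
(Linear, Slack, Notion), then drop it from your working context.
When you need a detail back, re-read the relevant section of notes.md
instead of keeping everything in memory.
```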

And this is still unfortunately stuck, so we'll move on.

AI Strengths and Challenges

The final thing I want to show a bit about, and it was mostly a way of shoehorning a Simpsons meme into the presentation, is that AI solves a lot of problems for us, but it also causes problems.

Like most software teams, in the last two years we've been leaning into AI, and a lot of our features are underpinned by AI, which is great when it works well, but not so great when it doesn't. In the initial months especially, in the earlier days when models weren't great, it was difficult to ensure that the AI was performing consistently. We have AI underpinning a couple of features where a user has to get into a discussion with an AI, and you would go and test this: the first time it would all work great, the second, the third, the fourth, and on the fifth time the AI would do something strange. So you'd go back, tweak your instructions for the AI, test again, and this time the sixth run did something funny.

Extremely boring, extremely frustrating.

So we figured out quite quickly

Building AI Evaluators

that we were going to have to lean into AI evaluators: basically, have another AI monitor the performance of the AI that's doing the thing you want it to do.

There are some solutions out there starting to do evaluators, but we couldn't find anything that would do exactly what we wanted, so we started building some custom ones. Here is roughly what one looks like. This is one for a personalizer chat that users have at the beginning of their journey with Mindstone, where they get into a discussion with an AI, and the AI tries to find out as much as it can about their role so that we can personalize the program for them. Here are the instructions, which you can edit.

You decide how many times you want it to run a conversation, and you select which model you want it to use and a reasoning effort.

Simulated Conversations and Scoring

And then what's happening in this conversation is another AI is playing the human role.

And to try and make this as realistic as possible, we have it rotating through a number of different personas and engagement styles.

So sometimes it will be our marketing manager, and they will be enthusiastic about engaging in the conversation.

But sometimes it'll be a CEO who is resistant to giving us information. We wanted to have some confidence that the AI would perform well in realistic scenarios. And here's an example of a run of this, GPT-5 on low reasoning effort, and then you get detail on the scores.

So there are three AIs here, I guess: the AI we're testing, an AI pretending to be a human to have the conversation, and a third AI evaluating the conversation. It's that third AI that gives these scores on the conversation.

And the great thing is you can run this 20 times quite quickly.

So here you've got a bunch of different personas, and you're saying, okay, the scores look pretty good, and you get confidence in the performance.
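A stripped-down version of that three-AI loop, sketched with the OpenAI Python SDK. The model name, personas, and scoring rubric are all placeholders, not Mindstone's actual evaluator:

```python
# Minimal sketch of an evaluator loop: one AI is tested, one plays the
# human, one judges the transcript. Uses the OpenAI Python SDK
# (pip install openai); model, personas, and rubric are illustrative.
import json
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; selectable per run, as in the talk

PERSONAS = [
    "an enthusiastic marketing manager, happy to share details",
    "a time-pressed CEO, resistant to giving information",
]

ASSISTANT_PROMPT = "You onboard new users: learn their role to personalize their program."
JUDGE_PROMPT = (
    "Score this onboarding conversation 1-5 on: stayed on task, "
    "asked good questions, handled resistance. Reply as JSON, e.g. "
    '{"on_task": 5, "questions": 4, "resistance": 3}.'
)

def chat(system: str, messages: list[dict]) -> str:
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "system", "content": system}] + messages,
    )
    return resp.choices[0].message.content

def run_conversation(persona: str, turns: int = 4) -> str:
    transcript = []
    user_msg = "Hi."
    for _ in range(turns):
        # The AI under test responds to the simulated human.
        assistant_msg = chat(ASSISTANT_PROMPT,
                             transcript + [{"role": "user", "content": user_msg}])
        transcript += [{"role": "user", "content": user_msg},
                       {"role": "assistant", "content": assistant_msg}]
        # The second AI plays the human, in character
        # (simplified: it only sees the latest assistant message).
        user_msg = chat(f"You are {persona}. Reply as this person would.",
                        [{"role": "user", "content": assistant_msg}])
    return json.dumps(transcript, indent=2)

for persona in PERSONAS:
    transcript = run_conversation(persona)
    scores = chat(JUDGE_PROMPT, [{"role": "user", "content": transcript}])
    print(persona, "->", scores)
```

Running the outer loop 20 times per instruction edit or model upgrade is what gives the quick, repeated confidence check described next.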

Staying Current with Models

Really useful when you want to edit the instructions, and also really useful when you want to update models. We've had times in the past where we had a feature working well, we updated to the latest model, and stuff just started breaking, because that model interprets instructions slightly differently, or for some other reason. So it's allowed us to move a lot faster and stay on the latest models all the time.

Conclusion and Q&A

Yeah, so that is it. Any questions?
