A Peek at the Internal Tools We Built to Manage 2M+ OpenAI Calls a Day

Introduction

Hello. Today I'm going to give you a look at the internal tools of an AI startup.

Overview of Find AI

So we make Find AI. Find AI is a search engine for people and companies.

So we built it because we were tired of opening LinkedIn profiles one by one in Chrome tabs and trying to see if someone matched our criteria. I feel like a lot of the world runs on opening LinkedIn profiles in a bunch of tabs and going through them one by one. Whether you're in recruiting, sales, or investing, there's a lot of time spent just clicking around.

And we built Find AI originally to automate that. So it's a search engine, and you can ask questions on it.

People ask all kinds of different questions. We see searches for people who used to do one thing and are now doing another, people who work at a particular company, heads of engineering at Series A startups. And you can even ask things like "founders that have golden doodles." That's a real search you can run on Find.ai.

Use Cases

So hypothetically, let's give a use case for this. If you were trying to sell some kind of golden doodle toy and looking for early adopters, maybe in your head you're thinking, OK, founders are a really good early adopter segment.

So let's look for people who are founders and have a golden doodle. We can go through, run some research, and find people that are founders and have doodles.

We run through it criterion by criterion. So we break the search down into whether they're a founder and whether they own a golden doodle, and you can see the specific evidence for a particular person having a golden doodle, like a mention of their doodle's breeder.

And so that is Find AI.

Key Facts About Find AI

So some quick facts. We have customers ranging from Fortune 500 companies to investors. A lot of early stage investors use us for sourcing potential investments.

And we use OpenAI a lot under the hood. We call it a search engine, but it's really more of an analysis engine. We made 19 million requests to OpenAI last week. So we use it really extensively, across all of their different endpoints, from moderation to batches to chat.

And we're a globally distributed asynchronous crew of 10 people. So I've never even talked to half of the people I work with, actually. So we try to keep things really asynchronous and written first.

And we're seed funded by Felicis Ventures and Daniel Gross. And I'm Philip. I'm one of the co-founders and the CTO.

I previously made an engineering marketplace called Moonlight Work that was acquired in 2020. And I spent a lot of time clicking through LinkedIn profiles one by one and wanted something better.

I'm based here in New York City. And I'm not really on social media, but I write a blog at contraption.co.

Internal Tools for an AI-Powered Search Engine

So building an AI-powered search engine requires a lot of internal tools. When you use a product, there's a lot going on behind the scenes that's not visible to the end user: tools built just for the people at the company. This is how employees debug what's happening in production, how they issue refunds for credit cards, how they edit things that aren't right.

Semantic Search Engine

So I'm going to show you a semantic search engine that we built. And then we're going to go through a model comparison tool that we built.

So starting with semantic search: one of the lesser-known features of OpenAI is called embeddings. What it does is you pass in a piece of text, like sandwich or cat or dog, and it gives you back a vector. So basically a coordinate.

You can kind of think of it as like an XY coordinate that you put on a graph. And what this allows you to do is more or less plot these points on a graph and see how close they are. And what this endpoint does really effectively is say that cat and dog are more similar to each other than either is to sandwich.

And so you can put all these different points in a database and start to see the relatedness of words, which is really cool. It can know that Chihuahua and Golden Retriever are related without having any overlap in terms of the actual text between them. And we use this in a variety of ways.
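To make this concrete, here's a minimal sketch in Python with the official OpenAI client. The helper names are mine, and text-embedding-3-large is an assumption that lines up with the 3,072-dimension vectors mentioned below:

    # Minimal sketch: get embeddings and compare them by cosine distance.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def embed(text: str) -> list[float]:
        # One call per string; the endpoint also accepts a list of inputs.
        resp = client.embeddings.create(model="text-embedding-3-large", input=text)
        return resp.data[0].embedding

    def cosine_distance(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm_a = sum(x * x for x in a) ** 0.5
        norm_b = sum(x * x for x in b) ** 0.5
        return 1 - dot / (norm_a * norm_b)

    cat, dog, sandwich = embed("cat"), embed("dog"), embed("sandwich")
    print(cosine_distance(cat, dog))       # smaller distance: more related
    print(cosine_distance(cat, sandwich))  # larger distance: less related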

One of the really common ways that people use relatedness is for "similar to" tools. Hypothetically, if you're on Amazon and you see related products, you could build that with vector embeddings: take the text of this page and the text of that page, and measure which product is closest based on the text distance.

And so that's one of the ways we use embeddings on our website. Every company profile and every user profile on Find.ai gets embedded, and then we show similar companies and similar people based on which profiles are closest in distance to that particular one. But there's a lot more advanced use cases you can do with vector embeddings.
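As an illustration (we won't name our actual stack here), a "similar people" lookup might look roughly like this in Postgres with the pgvector extension; the table, column, and connection names are hypothetical:

    # Hedged sketch: Postgres + pgvector, where <=> is cosine distance.
    import psycopg

    def similar_profiles(profile_id: int, limit: int = 10):
        with psycopg.connect("dbname=findai") as conn:
            return conn.execute(
                """
                SELECT p.id, p.name,
                       p.embedding <=> (SELECT embedding FROM profiles WHERE id = %s) AS distance
                FROM profiles p
                WHERE p.id != %s
                ORDER BY distance  -- nearest neighbors first
                LIMIT %s
                """,
                (profile_id, profile_id, limit),
            ).fetchall()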

And I'd say this is one of those endpoints that's really hard to navigate and get an intuitive feel for, because finding the closest match to something is just one way of using it. But at a certain scale, it can get really difficult. We use one of the larger OpenAI embedding models.

So it returns a 3,072-dimension vector. And we have a database with hundreds of millions of vectors in it. It's absolutely massive.

It's our biggest database right now. And it's where a lot of our intelligence for our search product comes from. So one of the things we use it for is understanding how to build better search.

Building the Search Engine

So I'm going to walk through how we try to figure out the best way to use vector embeddings to improve search. And to do that, we built a semantic search engine.

So this is one of our internal tools. You can see our internal tools have a big red production bar at the top so we know not to break things in production. And we can try searching.

We embed different kinds of content. We embed company profiles, people profiles. And there's some other primitives I'll talk about here in a minute.

So with, for instance, a person: I had just run the search "founders that have a golden doodle." So let's try doing this. When I run the search here for people, "founders that have a golden doodle," what it's doing is taking this text, turning it into a vector with OpenAI, and then going into this database and retrieving the people whose vectors are closest in distance to this particular vector.
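In code, that flow is roughly the earlier embed() helper plus a nearest-neighbor query; again, the pgvector schema here is hypothetical:

    # Sketch of the search flow: query text -> vector -> nearest profiles.
    def search_people(query: str, limit: int = 100):
        vec = embed(query)  # 3,072-dimension vector from OpenAI
        vec_literal = "[" + ",".join(str(x) for x in vec) + "]"  # pgvector text format
        with psycopg.connect("dbname=findai") as conn:
            return conn.execute(
                "SELECT id, name, embedding <=> %s::vector AS distance "
                "FROM profiles ORDER BY distance LIMIT %s",
                (vec_literal, limit),
            ).fetchall()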

And these tools I'm showing you are going to be a little ugly because they're internal tools. They're not customer facing. But we can go through and see that we have distances here of some people that it found in the database.

And you'll see that it found, like Josh here, he's a co-founder of Good Dog, which is a dog platform. Found Patrick, who runs a dog site. And it's indexing really heavily on kind of the words founder and doodle, but it's not really understanding the intersection of those words, right?

So it's saying, like, these are people that have these words pop up in their description, but none of these people really have a golden doodle. Like, the top results are people that work in the dog industry.

So it's like the overlap of founder and dog without a deeper understanding here. And so the first takeaway is that we can't just build our search engine on top of vector search. A lot of companies try to build products on vector search alone.

Experiments and Results

But for this type of question, it's not working on its own. So what we did instead, as you saw when you went to our search engine, is break a search down into something called criteria.

That's what we call it internally. So we go through and say: for a founder that has a golden doodle, one criterion is that they have the job title founder. The other criterion is that they own a golden doodle.
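As a rough illustration of that decomposition step, you could ask a chat model to split a query into criteria, something like this (not our production prompt, and the model choice is arbitrary):

    # Hedged sketch: ask a model to break a search into checkable criteria.
    from openai import OpenAI

    client = OpenAI()

    def decompose_query(query: str) -> list[str]:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system",
                 "content": "Break this search query into independent, checkable criteria, one per line."},
                {"role": "user", "content": query},
            ],
        )
        return resp.choices[0].message.content.splitlines()

    print(decompose_query("founders that have a golden doodle"))
    # e.g. ['has the job title founder', 'owns a golden doodle']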

So that was the next experiment we ran with our semantic search engine. We went through and said, let's search "has job title of founder." And we decomposed all of our profiles into smaller snippets of text we call facts, and embedded those.

And what you can see here, looking at how the results come back, is that the results with shorter pieces of text are way more relevant. So we're getting facts like "is the founder of the company" and "previously held the title of founder and CEO."

And so you're seeing here that we're getting closer to what we need, but this fact is still saying "previously held the title of founder and CEO," which is semantically close but not the same thing as currently being the founder and CEO. And so we can also search "has a dog" and see how that turns out. What we're showing on the left-hand side here is the score.

So it's the distance. And so we see that smaller scores are closer. And the larger the score gets, the further it is in distance.

I'm showing just the first 100 on this because it's an internal tool. And you can see it has a dog, has a dog, owns a dog. So the semantic distance is really close on these.

But as we get to a higher score, we see facts like "has a strong interest in dogs" and "has two children and a dog." So it becomes slightly less relevant the higher the score gets. We typically see that a distance of less than one works, though depending on how you calculate the distance, there are a lot of intricacies here.

But one of the things we wanted to do was find a cutoff score for relevance. And for us, a distance of under one, in terms of cosine distance, tends to work really well. But let's try something more specific.
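In code, that's just a filter on the distances that come back; the 1.0 threshold is the empirical value we landed on for cosine distance:

    # Keep only matches under the empirically chosen cosine-distance cutoff.
    RELEVANCE_CUTOFF = 1.0

    def relevant(matches: list[tuple[str, float]]) -> list[tuple[str, float]]:
        # matches are (fact_text, cosine_distance) pairs, closest first
        return [(text, dist) for text, dist in matches if dist < RELEVANCE_CUTOFF]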

So has a golden doodle. And so we just showed that searching for the job title founder works really well. Searching for has a dog works really well.

But we're going to get more specific here and say has a golden doodle. And this is kind of transitioning from being a generic term, a dog, to something more specific that would be almost like a proper noun, right?

A golden doodle. And so you can see that we're getting some matches here like "owns a golden doodle," but we're also seeing top matches like "has a golden retriever."

And so the takeaway for everyone here is that semantic search with OpenAI embeddings is really powerful, but it's not a perfect solution. You can see that it understands the difference between founder and dog, but it can't really differentiate between specific dog breeds. So the semantics are not getting that deep.

And so this search engine was a really powerful way for us to understand how we can do this matching in terms of vector distances. We use different algorithms internally, right? A Levenshtein distance, for example, would be better than a semantic search when you're trying to match things like proper nouns.
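For reference, Levenshtein distance counts the minimum number of single-character edits between two strings, which is why it handles near-exact tokens like breed names better than embeddings do. A minimal implementation:

    # Classic dynamic-programming Levenshtein (edit) distance.
    def levenshtein(a: str, b: str) -> int:
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(
                    prev[j] + 1,               # delete a character
                    curr[j - 1] + 1,           # insert a character
                    prev[j - 1] + (ca != cb),  # substitute
                ))
            prev = curr
        return prev[-1]

    # Large edit distance: embeddings call these close, edit distance does not.
    print(levenshtein("golden doodle", "golden retriever"))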

So that was our semantic search engine that we built internally to help understand how to use OpenAI vectors. One of the takeaways here was that similarity is not equal to accuracy. So things like proper nouns don't really work well in terms of semantic search.

We found that a cutoff distance of about 1.0 works really well for relevance. That's with cosine distance; it really depends on the algorithm.

Vectors are complicated, but for our internal purposes, we needed to find a cutoff that's relevant. And the other thing here was that if you're really trying to get accuracy, then decomposing complex queries into smaller pieces, like breaking a search into individual criteria, will get you higher accuracy. But that approach doesn't work as well for "related people" or "related companies" features.

Model Comparison Tool

So that was the semantic search.

Next, I'm going to show you a model comparison tool that we built internally. So one of the questions for all startups that are using LLMs is, what model do you use? There's a lot of different providers. There's OpenAI. There's Anthropic.

You can self-host Facebook's open models. I'm sure we're going to have even more coming out soon. And within each of those providers, there are different models. So there's always a question of: what models are out there? Which one should we use? And when a new one comes out, should we switch? Are the new ones better?

There's always a stream of new models coming out. There was GPT-3, then GPT-4, then GPT-4o, then o1, and all these different models. But then within that, you typically have different classes of models. So OpenAI is running three tiers of models: the standard GPT-4o; models with more reasoning built in, like o1, which is really expensive; and some less expensive models, like GPT-4o mini, that are on the order of 100x cheaper. And that's a really big difference in price. So for us, when we're looking at our spend, we want to know: can we decrease our costs by 100 times? Because when you're making millions of requests per week, that really matters.

So one of the things we built was a tool to compare the models that are available to us. The way these AI chat queries work is you have a prompt, and you ask the model to give you a response. And we have these internal prompt libraries, with different prompts that we use for different things. So what we decided to do was build a way in our internal tools to take a particular prompt and run it across a bunch of different models.

And the cool thing we did here, which was not my idea, someone on our team built this because they were trying to solve this problem, is that they then asked OpenAI to evaluate every single model's output. That was a really key unlock for us, because just getting a lot of data isn't very useful, and it's hard to go through a bunch of different AI model responses and judge which one is good. So internally, we actually use AI to pick our AI models, based on qualitative evaluations.
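Here's a hedged sketch of that flow: run one prompt across several models, then ask a model to critique and score each response. The prompts and model list are illustrative, not our production code:

    # Sketch: compare models on one prompt, then have a model judge them.
    from openai import OpenAI

    client = OpenAI()
    MODELS = ["gpt-4o-mini", "gpt-4o"]  # plus whatever edge snapshot is current

    def compare(system_prompt: str, user_prompt: str) -> dict[str, str]:
        # Run the same prompt across every model we want to compare.
        outputs = {}
        for model in MODELS:
            resp = client.chat.completions.create(
                model=model,
                temperature=0.2,  # low temperature keeps runs comparable
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
            )
            outputs[model] = resp.choices[0].message.content
        return outputs

    def judge(outputs: dict[str, str]) -> str:
        # Ask one model to critique all responses: pros, cons, score /100.
        listing = "\n\n".join(f"--- {m} ---\n{o}" for m, o in outputs.items())
        resp = client.chat.completions.create(
            model="gpt-4o",
            messages=[{
                "role": "user",
                "content": "For each response below, list pros, cons, and a score out of 100:\n\n" + listing,
            }],
        )
        return resp.choices[0].message.content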

Evaluating Models

So let's take a look at our model evaluations tool. This is another internal tool. We can go through and see some of the different prompts that we have internally. One of the prompts we have is extracting facts from a company; that's a prompt we use in production. I showed you the facts search in the semantic search tool, where you can search for "is founder," "has a dog," things like that. So we have a model that goes and looks at a company and extracts the facts. Let's try putting in a slug, which is a database identifier. I'm going to use Find AI. So we're going to ask about the Find AI record and retrieve the prompt.

And so it's populating the system prompt and the user prompt that we use internally. Temperature is more or less the randomness within the model. Smaller numbers are less random; a temperature of 2 is going to be extremely random. And so we can run and compare the models.

And so what this is doing: this is the prompt we use in production all the time, and this is a way for us to run it across GPT-4o mini, GPT-4o, and there's always kind of an edge version of GPT-4o, so this is comparing those three. It's running them, and then it's going to return the results. So let's see how that goes.

Well, it's loading. Oh, here we go. So I'm going to run down to the lower part here. These are internal tools, so they're less polished. We can see that GPT-4o, that's one of the models, gave this array of facts: offers a pricing plan called Outreach, provides a research plan. And for me, looking at this, I'm like, this looks fine. I don't really know if it's good or not.

GPT-4o mini offers fewer facts; it gave us 11. But the cost: look, GPT-4o's cost here was about $0.30, while GPT-4o mini was less than a penny, which is a massive difference in cost if you're running this across millions of records. And so looking at this, I'm not sure. Maybe this is better. It gives us two fewer facts, but how's the quality? And then we've also got the edge version of GPT-4o. So this isn't the default model right now, but it's an edge version that OpenAI has made available. I think it's going to become the default in the next couple of weeks. And you can see that it gave us 19 facts on the same prompt.

And the temperature is pretty low here, so it shouldn't be that different. It's not necessarily just the temperature here. But you can see it gave a lot more facts. And the cost was $0.16, so that's about halfway between those prices. And this is the part that's really cool for us: we then had GPT-4o go through, give a score to each of the models, and critique them for us.

And this is my favorite part. For GPT-4o it's saying: provides detailed pricing plans, highlights use of technologies, mentions specific searches; cons: includes some repetitive information and lacks mention of free tier details; score 85 out of 100. GPT-4o mini: concise and accurate, mentions free pricing plan, but lacks details on pricing plans and co-founders; score 80. And the GPT-4o edge version: comprehensive, but contains repetitive information and lacks clarity on unique features; score 78. So we can use this information internally, run it across a couple of different records, and decide: is a $0.30 price difference per record worth it for GPT-4o over GPT-4o mini?

But the other thing is, we know that OpenAI is going to switch this edge version to being the default model in the next couple of weeks. So we can anticipate that maybe we need to go in and adjust our prompts to be more narrow on some of these things; maybe a prompt that worked well on GPT-4o needs to be tuned for this edge model. You can also turn the temperature up high and see how it does, which is really useful for gaining an intuitive understanding of what the models are doing with real data. So this is our internal model comparison tool. Let's see if this gets results in a minute. Now we can move on.

Learnings

So learnings. We learned that AI is really good at analyzing its own responses. That has been a really big unlock for us, because when you're busy and trying to sit down and look at a bunch of different responses with a lot of text, it's hard to qualitatively measure their quality. And we found that OpenAI is really good at evaluating its own responses. We also found that many models hallucinate and miss info, but that's not always a deal breaker, especially on things that are qualitative. It can be OK. And we found that it's worth keeping up with incremental model improvements. But sometimes those incremental improvements make a model more detailed and smarter in a way that requires us to update our prompts to avoid hallucinations or excess detail.

Conclusion

So in conclusion, we looked at semantic search and our model comparison tool. If you want to try out the product, it's at usefinds.ai. You can run searches for free there.
