I'm Alex. For those who don't know me, I only moved to New York a couple of months ago, so I think most of you don't know me.
I've given quite a few talks at the Mindstone AI group in London over time. It's incredible: I think I gave one here in New York maybe nine months ago, and it was maybe a third of this size. So it's incredible to see how quickly this has scaled.
One of the things I love about Mindstone, and about where we are in the world right now generally, is that if you want to build a technical skill set, it's so much more accessible than ever before. And I can speak from that perspective, because my first Mindstone talk was literally showcasing how you can build systems by hacking together Google Sheets, Zapier, and ChatGPT, and showing how you can manage your email or something like that from there.
And so I've usually done the practical talk, because there are technical, practical, and I guess theoretical tracks. This is my first time doing the technical one. I know people usually fall asleep in some of these, so I'll try and keep it interesting.
But I've gone from hacking with Google Sheets to working at a company called bigdata.com. What we do is provide a high-precision search and retrieval endpoint that people can query for business intelligence or knowledge. And what does that mean? It means we ingest an enormous amount of content, so news, corporate filings, transcripts, we do a lot of document processing on it, and we make it available for people to query.
So if you're building an agent, you can query it and get information about the markets, companies, things like that. Today I'm going to talk through the architecture of how this system works, what it looks like, and how it differs from your typical RAG process and solution.
So if you think about RAG, retrieval-augmented generation, I assume everybody at this stage is familiar with it, right? You have a series of documents, they get chunked up into smaller components, and then you have queries you want to run over them. Ultimately it's a pretty straightforward process. But that's for, let's say, one document.
Maybe if you've got one document that's 10 pages, it's okay. If it's 500 pages, the systems have gotten better over the past year to be able to handle that. But if you've got five documents of 500 pages, or you've got 1,000 documents, or in our case, if you have 100 million documents, you have to think about the architecture very differently.
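Just to ground that, here's a minimal single-document sketch of the flow: chunk, index, retrieve, and stuff the top chunks into a prompt. TF-IDF from scikit-learn stands in for a real embedding model and vector store, the document text is made up, and the final LLM call is left out.

```python
# Minimal single-document RAG sketch: chunk, index, retrieve, build a prompt.
# TF-IDF stands in for a real embedding model; the LLM call is left as a stub.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def chunk(text: str, max_words: int = 120) -> list[str]:
    """Split a document into roughly paragraph-sized chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def retrieve(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks against the question and return the top k."""
    vec = TfidfVectorizer().fit(chunks + [question])
    scores = cosine_similarity(vec.transform([question]), vec.transform(chunks))[0]
    top = sorted(range(len(chunks)), key=lambda i: scores[i], reverse=True)[:k]
    return [chunks[i] for i in top]

document = (
    "Acme Corp reported strong quarterly results. Revenue grew 12 percent, "
    "driven by its cloud segment. Management also flagged rising input costs. "
    "The board approved a new share buyback program for next year."
)
context = retrieve("What drove revenue growth?", chunk(document, max_words=15))
prompt = "Answer using only this context:\n" + "\n---\n".join(context)
print(context[0])
# At one 10-page document this works fine; at 100 million documents, this
# in-memory approach is exactly what breaks down.
```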
How we think about the architecture, and this is what I'm going to get into, is a lot more involved. The background here is that Big Data's parent company is called RavenPack, and they've been ingesting information and providing signals to quantitative trading firms for the past 20 years.
And so in this architecture, what we do is ingest all this information and do our extraction, chunking, all that stuff, in real time at very low latency. We then do entity detection, which I'll talk a little bit about, and sentiment analysis. That gets stored in a big vector store, alongside a knowledge graph, so that if you search for Whole Foods, it defaults to telling you that Amazon is its parent company.
So let me just talk about that process, and then I'll talk about what you can do with it. Ultimately, the system we provide is effectively like a ChatGPT for the financial markets. But my day-to-day is working with our large institutional clients, your Goldman Sachs, JPMorgan, the big banks of the world, as well as large asset managers, to showcase how you can do much more than just chat with it. And there are workflows we come up with that I'm genuinely energized by. So the first part is just: we've got 55,000 sources of news, filings, and transcripts, and that is coming in in real time at very low latency. What we do with that is first chunk it into paragraph-sized chunks.
For every chunk, we then have a model that does entity detection. You can imagine if, say, Apple is mentioned in a transcript, you want to pick out that it's Apple the company, and the same Apple that's mentioned somewhere else, as opposed to an article that says "I ate an apple yesterday," which is a different apple. That's actually a tricky thing to solve, but we've got some very good engineers who have done it. So entity detection is one piece. Then each chunk also runs through sentiment analysis.
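As a rough sketch of that per-chunk enrichment step, it looks something like the following, where the entity detector and sentiment model are crude placeholders rather than the low-latency models we actually run:

```python
# Sketch of the per-chunk enrichment pipeline: paragraph chunking, entity
# detection, and sentiment scoring. Both models are placeholders here; the
# production versions are purpose-built, low-latency classifiers.
from dataclasses import dataclass, field

@dataclass
class EnrichedChunk:
    text: str
    entities: list[str] = field(default_factory=list)   # resolved entities, e.g. "Apple Inc."
    sentiment: float = 0.0                               # -1.0 (negative) .. +1.0 (positive)

def detect_entities(text: str) -> list[str]:
    """Placeholder: resolve mentions to canonical entities (Apple the company vs. the fruit)."""
    return ["Apple Inc."] if "Apple" in text else []

def score_sentiment(text: str) -> float:
    """Placeholder: return a sentiment score for the chunk."""
    return -0.6 if "lawsuit" in text.lower() else 0.1

def ingest(document: str) -> list[EnrichedChunk]:
    chunks = [p.strip() for p in document.split("\n\n") if p.strip()]   # paragraph-sized chunks
    return [EnrichedChunk(c, detect_entities(c), score_sentiment(c)) for c in chunks]

print(ingest("Apple faces a new lawsuit.\n\nI ate an apple yesterday."))
```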
And so by doing both of these, and I'm going to jump ahead a little bit, if you search for something like AppTech, for instance, the system can do a number of things to guide you to the right entity. Equally, there are different types of entities that we track.
So when we look at an entity in here, I'll give an example of what that looks like. Let me just go to bigdata.com. I can go home. This is ultimately what it looks like: here's our chat component, and I can ask example questions here.
But what I wanted to show was entity detection. So let me just grab a file. These are just newsletters that I've uploaded here. And you can see when we upload content, what we're doing here is we're detecting and categorizing the different entities that get picked up in that news instantly.
So whenever I throw new content into the system, it picks up, okay, here are the companies that are mentioned, here are the topics, here are the organizations. And that's really useful, because later I can use that to search for those companies and fetch all the documents that mention them. And then I can combine that with sentiment and search for, okay, I want all the negative stuff about, you know, maybe Elon Musk or Doge or whatever it may be. So that's kind of the core base of what we do.
[Audience] Do you use LLMs for those features, or more traditional models? This part is done with more traditional AI, because it has to run at very low latency. You can imagine, if you're building a system for quantitative trading funds that are trading in milliseconds, they want to be able to ingest this information and get it quickly.
And so they've built reliable systems to do this within 200 milliseconds. By the time a document hits our system, it's available via the API within that time span. So let's go back here. Yeah, this file just showed that.
So if you imagine, okay, I've got this enormous document store that I can query with precision, this is what we then work with our clients on: if you can query something with precision, what do you want to do with it? I mentioned a few examples of combining an entity and sentiment. But you can also imagine combining an entity, sentiment, and a text string.
And I've got a good example of that. A couple of weeks ago, China retaliated against tariffs by increasing restrictions on molybdenum and tungsten, metals that some companies rely on. Now, because we have this ability to search for companies, portfolios, and text strings, we can quickly search across this 100-million-document store to find, okay, which companies talk about tungsten?
And what do they talk about? And what is their exposure to it? And so then we can build a report out of that, which I'll showcase a little bit later, to kind of demonstrate that.
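To make that kind of screen concrete, here's a hypothetical sketch of filtering an enriched chunk store by entity, keyword, and sentiment. The field names and sample records are assumptions for illustration, not our actual schema.

```python
# Hypothetical screen: which companies in a watchlist talk about tungsten, and
# how negatively? The chunk records below mimic an enriched store.
from collections import defaultdict

chunks = [
    {"entities": ["Company A"], "sentiment": -0.4,
     "text": "Company A warned that tungsten export restrictions may raise input costs."},
    {"entities": ["Company B"], "sentiment": 0.2,
     "text": "Company B reported record subscriber growth."},
]

def screen(store, watchlist, keyword, max_sentiment=0.0):
    """Return chunks that mention a watched entity and the keyword, at or below a sentiment cap."""
    hits = defaultdict(list)
    for c in store:
        if keyword.lower() not in c["text"].lower() or c["sentiment"] > max_sentiment:
            continue
        for entity in c["entities"]:
            if entity in watchlist:
                hits[entity].append(c["text"])
    return hits

exposure = screen(chunks, {"Company A", "Company B"}, keyword="tungsten")
for company, passages in exposure.items():
    print(company, "->", len(passages), "negative tungsten mention(s)")
```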
So this is the API documentation; I'll come back to that. Now, if you're building chat on top of this, let's go back here.
Great. So here we've got the part where we take in the documents and store them, and now I'm starting on the chat component. But if you have chat and you start talking to it and you ask a question, well, where does that question go, and what happens with it? We do multiple things with it, because this has evolved over time.
Instead of just taking your question and running it right off the bat, we work out what sub-questions need to be answered in order to answer your main question; I'll talk through an example in a bit. We convert those into searches, go retrieve the documents we have that are most relevant, concatenate those documents, and give you a result. And so, here, this is maybe a good example of a query. Quick question?
Yeah, sure. [Audience] Is this kind of what a lot of the larger LLMs are now doing with deep research, where you ask one thing but then it asks a bunch of other questions and goes to a bunch of sources to try to collate a much better answer?
Yeah, this is one component of that. The ability to do query routing and query orchestration, and break the question out like that, is one element; this is the core of our chat component. We do very similar things for building reports: when we want to build a report, we essentially do the same sub-question decomposition.
But then there are other follow-ups we might want to do. We might want to ask: is the information I'm retrieving genuinely new? Because you might filter for articles from the past week, but those articles might mention something that happened two or three weeks ago. So you can imagine an agent that finds the information and then checks how similar it is to what was published over the past three weeks, to make sure that thing didn't already happen.
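A rough sketch of such a novelty check, using TF-IDF similarity against a trailing window as a stand-in for whatever the production agent actually uses:

```python
# Sketch of a novelty check: is a freshly retrieved passage genuinely new, or a
# rehash of something already seen in the trailing three weeks?
from datetime import datetime, timedelta
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def is_novel(candidate: str, history: list[tuple[datetime, str]],
             now: datetime, lookback_days: int = 21, threshold: float = 0.8) -> bool:
    """Flag the candidate as novel unless a very similar passage exists in the lookback window."""
    window = [text for ts, text in history if now - ts <= timedelta(days=lookback_days)]
    if not window:
        return True
    vec = TfidfVectorizer().fit(window + [candidate])
    sims = cosine_similarity(vec.transform([candidate]), vec.transform(window))[0]
    return max(sims) < threshold

now = datetime(2025, 3, 1)
history = [(datetime(2025, 2, 20), "Netflix announced a price increase for its standard plan.")]
print(is_novel("Netflix raises standard plan prices.", history, now))
# TF-IDF at this threshold still treats the rephrasing as novel; a real
# embedding model would be more likely to catch the overlap.
```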
And there's lots of tweaking and playing with it to get the best outputs. So here's an example: okay, write a detailed report on Netflix's planned price increase, discuss the likely negative impacts of the decision, and summarize the subscriber figures, how they've changed in the past and how they impact revenue.
So if I chuck this into Big Data, you'll see, let's go here, to chat. Hopefully the internet is still connected. Great, so we get a report here: this is the dollar increase they were proposing, these are the potential impacts, and here's what we're seeing in subscriber trends. And for each of these we can check the source material it's coming from, so we've got articles from different sources and we can go track them down.
So this is what a lot of the fundamental investment managers on our platform are using it for. But what is happening behind that? That's what I want to get into: behind that, there are a couple of steps happening.
The first step, which happened while you were watching, is that it takes that question and decides what sub-questions need to be asked and answered in order to accurately answer the question you provided. It will break it down into anywhere between one and six sub-queries, and each of them will be a pretty simple string that we then build into a search.
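A minimal sketch of that decomposition step might look like this; the prompt wording and the `call_llm` stub are illustrative, not our actual implementation.

```python
# Sketch of the decomposition step: ask an LLM to turn the user's question into
# one to six short, searchable sub-queries. `call_llm` is a stand-in for
# whatever model endpoint you use; the hard-coded reply only shows the shape.
import json

DECOMPOSE_PROMPT = """Break the question below into 1-6 short sub-queries,
each answerable by a document search. Return a JSON list of strings.

Question: {question}"""

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (OpenAI, Anthropic, in-house, etc.)."""
    return json.dumps([
        "Netflix planned price increase",
        "Netflix price increase potential negative impacts",
        "Netflix subscriber growth past year",
    ])

def decompose(question: str) -> list[str]:
    raw = call_llm(DECOMPOSE_PROMPT.format(question=question))
    sub_queries = json.loads(raw)
    return sub_queries[:6]          # cap at six, per the orchestration step described above

print(decompose("Write a detailed report on Netflix's planned price increase."))
```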
And the second step is, once we have that individual sub-query, that individual string, another model converts it into a series of search parameters. A search parameter here is, okay, find me documents that mention the entity Netflix, plus whatever other filters apply. So for instance, if you look at the queries, one is Netflix planned price increase, simple, right?
Another is potential negative impacts, and another is Netflix subscriber growth in the past year. And the parameters being set here are, you should see it, yeah, it's setting a sentiment parameter of negative, so it's only going to get the pieces of articles that our sentiment model has scored as negative. And here it's setting a date parameter, so it's building multiple parameterized queries.
So here we're going to run, let's say, five queries in parallel to answer the question we just asked. Once we do that, the next step is basically deduplicating and putting all those results together, and that's what gets returned in chat. And when we do our search here, we actually do multiple searches. We search the big vector store that we have. We also search the web to see if there's anything else out there that could be added to our results that might be useful; basically, Perplexity's whole business model is using a web search API to do a very similar process. And we also search across structured data.
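Pulling those pieces together, here's a sketch of that fan-out step; the parameter names and the three `search_*` back ends are assumptions, not a real SDK.

```python
# Sketch of the fan-out step: each sub-query becomes a parameterised search
# (entity, sentiment, date filters), the searches run in parallel across several
# back ends, and the results are de-duplicated before the final synthesis call.
from concurrent.futures import ThreadPoolExecutor

def build_params(sub_query: str) -> dict:
    """Placeholder for the model that maps a sub-query onto search parameters."""
    params = {"query": sub_query, "entities": ["Netflix"]}
    if "negative" in sub_query.lower():
        params["sentiment"] = "negative"
    if "past year" in sub_query.lower():
        params["date_from"] = "2024-03-01"
    return params

def search_vector_store(params): return [f"chunk for {params['query']}"]
def search_web(params):          return [f"web result for {params['query']}"]
def search_structured(params):   return [f"table row for {params['query']}"]

def fan_out(sub_queries: list[str]) -> list[str]:
    param_sets = [build_params(q) for q in sub_queries]
    with ThreadPoolExecutor() as pool:
        batches = []
        for backend in (search_vector_store, search_web, search_structured):
            batches.extend(pool.map(backend, param_sets))
    seen, merged = set(), []
    for batch in batches:
        for item in batch:
            if item not in seen:            # naive de-duplication on exact text
                seen.add(item)
                merged.append(item)
    return merged

print(fan_out(["Netflix planned price increase", "Netflix negative impacts"]))
```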
So we have examples where you might want to ask, okay, show me the income statement, or show me the stock price, or other things like that. And I've got a few examples to show that. So this is a query I ran earlier today; rather than run it live, I just wanted to show a screenshot I took, which was the correlation between Apple's PE ratio and analyst sentiment.
And so again, it pulled together a similar report, but we also have structured data in here, so it pulls the actual price information and shows it as a graph. That gets added to the process. So when we perform those searches, they go across three different areas.
And then from there, that's what I just did; that slide was a reminder to myself. So when we perform a search, what happens is we get back these various different chunks.
When I first started talking, I mentioned chunking. Chunking is basically taking articles and documents and breaking them down into smaller sub-components that you can then iterate over. And this is an example of what a chunk might look like. On our end, when we perform a search, we get back the background, the sentiment, and the details of this sub-paragraph. And when I run this, there would probably be 50 or 100 of them.
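For illustration, a single retrieved chunk might be shaped roughly like this; the field names are assumptions based on what I've described, not the exact response format.

```python
# Illustrative shape of one retrieved chunk. Field names are assumptions drawn
# from the talk (entities, sentiment, source, the paragraph text itself).
retrieved_chunk = {
    "document_id": "doc_123",
    "headline": "Netflix to raise standard plan price",
    "source": "news",                       # news, filing, or transcript
    "timestamp": "2025-01-22T14:05:00Z",
    "entities": ["Netflix"],
    "sentiment": -0.3,
    "text": "Netflix said it will raise the price of its standard plan ...",
    "relevance": 0.87,                      # similarity of this chunk to the sub-query
}
# A single question typically fans out into 50-100 of these, which are then
# concatenated and handed to a final LLM call that writes the report.
```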
They all then get passed into a final language-model layer that organizes them into the nice little report you saw on bigdata.com. So yeah, here's the example of what you saw, and this is just showcasing what chat looks like.
Now, that's just the background. What I'm really excited about is the next steps, and the next steps here are multiple things.
So where I sit, I work with asset managers to look at things like qualitative screening. So I gave the example of tungsten and molybdenum. That's a real example.
And I can show a very similar example that we recently did with a client, looking at the components of a smart grid technology ETF and how many of them are actually exposed to the theme of the ETF.
But there's also a whole other use case here: instead of just chatting with it, which puts the onus on the user, can I just get a report?
A curated information report, daily or weekly or whatever it may be, because chat is not necessarily the most intuitive medium of exchange, I would say. That's something we've been building out now, and I can showcase an internal POC that we've done. So how much time do I have, actually? What's my timing like?
Five minutes, two minutes, three minutes? What does this mean?
Great, so in five minutes, let's see what we can get done. Okay, let's go through the example. This might be a little bit more technical, but I guess that's what we're here for, a technical talk, so it's okay.
So what we're doing here is an example of the qualitative screening I was talking about: this is how we could use the API to do something more interesting or cool. This part is just initializing how we're set up. Let me step back and say what we're trying to do here.
So I kind of talked about it briefly. But we had a client that wanted to look at all of the ETFs.
So an ETF is a basket, typically of equities, though it can hold other things as well, and many ETFs are built around a theme, right?
You can look at BOTZ, for example, which is about robotics, or ones about smart grid, which is what I'm looking at here: smart grid technology. And this client wanted to say, okay, there are 15,000 ETFs out there. I want a process where we can look at the holdings of all of those ETFs, compare them to the description of the ETF, the theme, and tell me which of those companies are actually most relevant to that theme.
Because typically, when someone builds an ETF, they'll put 150 names in it. Some of them are highly relevant, but then they end up putting in a bunch of others that are maybe less relevant.
And the way we approached that was to say, OK, great. You want to look for a smart grid? Well, we can first pull the entities of that ETF.
We then basically built a way of querying a language model to generate a theme tree. What we did is set "smart grid infrastructure" here and build out related terms to search for, because if I search for, in this case I believe it's about 18 related terms, instead of just one term, I'm going to get a lot more texture in my search, a lot more information around what I'm querying.
So here we have a process where I can change this to whatever it may be, and it will find the sub-themes and implications. For smart grid, that's smart meters, load forecasting, solar power integration, these different things.
This is what that looks like: each of the nodes in the tree then uses an LLM to generate a description, a sentence related to that sub-theme, and that sentence basically seeds our search. So we're searching across our document store for any document that's highly relevant to, say, smart meters measuring consumption.
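A sketch of that theme-tree step, with a hard-coded `call_llm` stub standing in for the real model call, might look like this:

```python
# Sketch of the theme-tree step: an LLM expands a theme ("smart grid
# infrastructure") into related sub-themes, and each node gets a one-sentence
# description that seeds a document search. The stubbed response only
# illustrates the expected shape.
import json

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call."""
    return json.dumps({
        "smart meters": "Deployment of smart meters to measure household consumption in real time.",
        "load forecasting": "Using demand forecasting to balance electricity load on the grid.",
        "solar integration": "Integrating rooftop and utility-scale solar into grid operations.",
    })

def build_theme_tree(theme: str) -> dict[str, str]:
    prompt = (f"List the key sub-themes of '{theme}' as JSON, mapping each "
              f"sub-theme to a one-sentence description suitable for a document search.")
    return json.loads(call_llm(prompt))

def seed_searches(theme: str) -> list[dict]:
    """Turn each node's sentence into a search request."""
    return [{"sub_theme": name, "query": sentence}
            for name, sentence in build_theme_tree(theme).items()]

for request in seed_searches("smart grid infrastructure"):
    print(request["sub_theme"], "->", request["query"])
```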
So coming down here, we set our search parameters. We can search across news, filings, and transcripts. Let's say I want to find which management teams are talking about this topic; then I would change it to transcripts.
If I want to see who's being mentioned in the news, it should search the news. And then I run that.
These are just examples of how to format a query. And ultimately, I get back this big set of results.
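As an illustration, a query for one theme-tree node, scoped by document type, might be formatted roughly like this; the parameter names are assumptions.

```python
# Illustrative query format for one theme-tree node, scoped by document type.
# Swap "transcripts" for "news" or "filings" depending on whether you want
# management commentary or press coverage.
def format_query(sub_theme: str, sentence: str, universe: list[str],
                 doc_type: str = "transcripts", lookback_days: int = 365) -> dict:
    return {
        "query": sentence,                 # the node's seed sentence
        "entities": universe,              # the ETF's holdings
        "document_type": doc_type,
        "lookback_days": lookback_days,
        "label": sub_theme,                # carried through so results can be pivoted by theme
    }

etf_holdings = ["ABB", "Schneider Electric", "Itron"]     # a few illustrative holdings
print(format_query("smart meters",
                   "Deployment of smart meters to measure household consumption.",
                   etf_holdings))
```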
It's kind of difficult to see here, so I've got it in Excel. What we fetch is basically: here's the company mentioned, here's the industry they're in, here's the theme that matched in the search, and here's the exact text that was pulled from the document.
So, okay, here it's talking about ABB and the application of electronic meters, and so on. I've now got thousands of these that have come back from my query. With that, I can basically build a big pivot table, and with my pivot table I can show, okay, there are 100 names in this ETF.
And out of them, the ones that are talked about most frequently on each of these themes, say distributed smart grid resources, are these ones, right? As opposed to some other companies in here, like Littelfuse or Belvin, which are a bit more industrial and might not be the most appropriate. The reason I want to do this is that if I really want to invest in smart grid companies, I might just want to invest in the 10 that are actually talking about these topics, rather than picking an ETF to invest in.
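The aggregation itself is straightforward; as a sketch, with illustrative rows rather than real results, a pandas pivot does the job:

```python
# Sketch of the final aggregation: pivot the retrieved rows (company, theme,
# passage) into a company-by-theme mention count, then rank holdings by total
# relevance to the ETF's theme. The rows below are made up for illustration.
import pandas as pd

rows = pd.DataFrame([
    {"company": "ABB",        "theme": "smart meters",     "text": "ABB discussed electronic meter rollouts..."},
    {"company": "ABB",        "theme": "load forecasting", "text": "ABB noted grid balancing software demand..."},
    {"company": "Itron",      "theme": "smart meters",     "text": "Itron highlighted smart meter contracts..."},
    {"company": "Littelfuse", "theme": "smart meters",     "text": "Brief mention of grid components..."},
])

pivot = pd.pivot_table(rows, index="company", columns="theme",
                       values="text", aggfunc="count", fill_value=0)
pivot["total_mentions"] = pivot.sum(axis=1)
print(pivot.sort_values("total_mentions", ascending=False))
# The most frequently and substantively mentioned names become the candidates
# for a concentrated "actually exposed to the theme" basket.
```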
So that's that example.
If I have any time left, I could do report building, but yeah, it looks like I'm at time.
Are we good? You're cool. All right, you call it, yeah.
Well, thank you so much. That's cool, yeah. Thanks, everybody.