DocRouter.AI as a drag-and-drop AI document processing tool

Introduction

OK, so what is DocRouter? It's a nights-and-weekends project, but it's becoming a real thing, so I'm building a company around it.

What is DocRouter?

It's basically the backbone for document workflows in any product, if you will, with a human in the loop.

And it's open source.

Understanding DocRouter as a Technology

So let's talk about it a bit and let's see how it ties into this discussion about agents and MCP and what we can do with it.

Key Functionality

So that's what it is. It's an AI tech stack enabler.

This is what the UI looks like.

Designed for Simplification

The data could be any type of document: faxes, emails, PDFs, data you extract from ERP systems. And it's designed to simplify people's work.

So let's actually open it up and take a look at what it looks like.

So this is an example.

Exploring DocRouter's Interface

OK, so first of all, there's multiple workspaces.

This app is available openly on the web, app.cloud.ai. Everybody can create an account.

If you need help, you can reach out to me. There's no paywall yet; I haven't implemented the Stripe integration yet.

It's available on GitHub. It can be installed in a VPC for projects that need more control over the information.

And multiple users uploaded different types of documents.

Hands-On Use Cases

So this gentleman from London, he uploaded a bunch of resumes. And we defined prompts and extraction schemas that extract information from these resumes. That's one way to do this.

Essentially, you can think of DocRouter as a system of record for documents and for the prompts that you use for extraction. It's a minimal layer on top of language models that allows you to do everything you can do with documents.

And everything that's available in the UI, drag and drop, point and click, is also available as FastAPI REST endpoints, just like Michael showed. So you'll see how powerful this idea actually is.

And it's going to go very much along the lines that Michael described in his presentation. So yeah, these are the documents.

Annual Report Extraction

In this case, check this out. There's an annual company report from Monzo with a different type of extraction.

This one is tagged "annual report" because you want to run only the prompts that apply to annual reports.

So we have multiple prompts and multiple documents. We don't want to run all the prompts against all the documents because that's a quadratic problem. So we use tags to assign which document runs against which prompt.
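The tag-routing idea described above can be sketched in a few lines. This is an illustrative sketch, not DocRouter's actual implementation; the function and field names are assumptions.

```python
# Tag routing sketch: run a prompt on a document only when their tags
# overlap, instead of running every prompt on every document (quadratic).

def prompts_for_document(doc_tags, prompts):
    """Return only the prompts whose tags intersect the document's tags."""
    return [p for p in prompts if set(p["tags"]) & set(doc_tags)]

prompts = [
    {"name": "cv_extraction", "tags": ["cv"]},
    {"name": "annual_report_extraction", "tags": ["annual-report"]},
]

# A document tagged "cv" matches only the CV prompt.
matched = prompts_for_document(["cv"], prompts)
print([p["name"] for p in matched])  # ['cv_extraction']
```

With N documents and M prompts, tags cut the work from N×M runs down to only the pairs that actually belong together.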

So you could just come, upload, I don't know, any type of document. Then you can select the tag that you want. And as soon as you do that, the system will run the prompts for that document.

Prompt Assignment and Logic

And in this case, let's say the CV prompt is actually, let's see, what does it have? It's actually very simple. It just says, hey, extract the CV.

But the trick is: use one of the supported models (most models are supported), and tag it saying you want to run it on CVs, not on other documents. And then the whole logic, if you will, is in the schema.

And this is mapped to what these language models support as extraction schemas. The language model vendors have their own pipelines when they build these models, allowing you to extract structured information out of documents.

They support the full power of JSON Schema. I have a drag-and-drop UI where you can configure exactly what type of extraction you want.

The parameters could be standalone strings, arrays, or embedded objects. If the document has embedded tables with multiple rows, you have the full power of JSON to extract them.
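As a concrete illustration of the kind of schema this supports, here is a hypothetical CV-extraction schema with a string, an array, and a nested array of objects for a multi-row table. The field names are assumptions for illustration, not DocRouter's actual schema.

```python
import json

# Illustrative JSON Schema of the kind structured-extraction APIs accept:
# plain strings, arrays, and embedded objects for tables with many rows.
cv_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "skills": {"type": "array", "items": {"type": "string"}},
        "experience": {  # an embedded table: one object per row
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "role": {"type": "string"},
                    "years": {"type": "number"},
                },
            },
        },
    },
    "required": ["name"],
}

# The schema is just JSON, so it can be stored, diffed, and sent to a model.
serialized = json.dumps(cv_schema, indent=2)
```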

DocRouter in Action

OK, so that's the CV example. Let's look at something more interesting.

So last Sunday, there was a hackathon at MIT with an education theme. And I met a woman who's an education student at Harvard. And she wanted to build a system that evaluates fifth grade quizzes according to an evaluation schema.

Educational Hackathon Example

So how would you use something like DocRouter to do this?

Well, first of all, DocRouter would not be the front-end UI. For an application like this, you need to develop your own UI and exercise these APIs through DocRouter.

But this is a simple problem now, because you can go to tools like Manus, for example, and say, hey, this is the API, just build me this particular application for it. So I'm not too worried about the front-end UI.

But I am worried about the back end, the AI tech enabling of this application. Especially since, if you're a teacher, you get the AI to grade the quizzes, but you want to be in the loop, right? You don't want to directly send whatever the AI produces to the student. On the other hand, teachers have such a workload of quizzes to handle that they don't have time. You need to know exactly where to go and how to quickly correct things. So if the AI does 80% of your work, that's a great win.

And this is just one example of application.

OK, so how would this work? Well, first of all, we generated the quizzes and the rubrics for grading them synthetically from GPT. Say, hey, GPT, give me an example of a water cycle fifth grade quiz, and give me an example of how it should be graded.

And the trick was basically in the prompt. The prompt says: you're a middle school teacher, students were given the following quiz, read the response, evaluate it against the rubric below. If the student answers the wrong question, note so and don't give points. Give constructive feedback, and so on.

This can be arbitrarily complicated. And then we basically just pasted in the rubric suggested by GPT to say what the scoring criteria are. This goes in the prompt, so you can think of it as being attached to the prompt. You want to pick a good model for this. We picked GPT-4o, but maybe you want one of the chain-of-thought type models that are better.

It's tagged to run only on the water cycle fifth grade quizzes. Essentially, you end up with one prompt per type of quiz that you run. The schema is actually going to be the same for all the quizzes, and you can kind of tell what the schema is when you look at the results.
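The prompt-plus-rubric pattern described above can be sketched as a template where the rubric is pasted in and the quiz text is filled per document. The wording and rubric here are illustrative, not the exact prompt used in the demo.

```python
# Sketch of a grading prompt with the rubric attached. One such prompt
# exists per quiz type; the extraction schema stays the same for all.
GRADING_PROMPT = """You are a middle school teacher. Students were given the
following quiz. Read each response and evaluate it against the rubric below.
If the student answers the wrong question, note so and give no points.
Give constructive feedback for each answer.

Quiz:
{quiz}

Rubric:
{rubric}
"""

# Hypothetical rubric of the kind GPT can generate synthetically.
rubric = "Q1: 2 points for naming evaporation, condensation, precipitation."
prompt = GRADING_PROMPT.format(
    quiz="Describe the stages of the water cycle.",
    rubric=rubric,
)
```

Because the rubric lives in the prompt rather than in code, a teacher-facing UI only needs to manage rubrics and quizzes; everything else stays fixed.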

Let's see what one of these generated quizzes looks like. Well, let's see here. So this is the example quiz GPT came up with.

A pretend student giving pretend responses on precipitation and so on. Obviously, when you do this in production, it's going to be with handwriting, but the language models can handle handwriting. And in terms of what got extracted: the list of questions, the answers from the student, and then the model generated an evaluation score and evaluation feedback.

Human-in-the-Loop Interaction

And yeah, that's it. And if you want to correct this with a human in the loop, you can come in here and change, let's say, let's give it a five. So that's pretty much an example.

But this example where you have to match the quizzes against the rubric comes again and again when you do document processing with the human in the loop. It could be, for example, you need to check invoices against the contract to see if the invoice is correct per line items in the contract. Or you might want to check invoices against purchase orders to see if they're matching. Or you might want to check insurance reimbursement requests against insurance contracts.

So this seems deceptively simple, just a few prompts and schemas and some matching, but really it's a generic problem that solves many, many applications.

So for example, I'm talking to a gentleman who's building a transportation certification startup. And he's got exactly this problem: there are federal security certifications, and there are standard operating procedures for the companies that need to satisfy those certifications.

Broader Applications

So we plan to use this exactly for this application.

In my consulting work, we deal with durable medical equipment providers, where you've got all these workflows with medical orders and face sheets and insurance reimbursement requests and labs and sleep studies that need to be matched against the requirements of the insurance. So you always have this one thing that you need to match against another.

Sector-Specific Utility

So far, so good for DocRouter. But remember, you still need to build the front-end application.

Market Position and Workflow

So where does DocRouter sit in the pecking order of applications? Well, here's an intelligent automation market map to understand what we're trying to build.

Are we trying to build a vertical-specific application in one of these verticals? Actually, we sit in the data extraction layer. So we're a horizontal enabler.

And you can put a UI front end on it if you want. The reason I like this a lot is that language models will be used in programming, but they have to be integrated with procedural applications, like the front-end UI.

Seamless Integration

And this integration is actually very tricky, because the language model works with prompts that you need to change, try, and evaluate, whereas procedural applications are just, what I call, function calls. So if you have a good separation between the manual workflow, the language model workflow with the human in the loop, and the procedural application, I think that could be a winner.

And in terms of how this could be structured: in the example with the quizzes, you can develop a UI front end that is as simple as managing the rubrics and managing the quizzes. DocRouter is on the back end, you run it through APIs, and you have yourself an application.

And actually, think a bit about how Cursor or Windsurf work, these AI editors, or how Manus works. There's a bunch of tools where you can use prompts to do vibe coding and start an application from scratch. They get you so far, but wouldn't it be nice to interface that with a mature application that handles the workflows underneath? I think that could be very powerful.

So I'm evaluating; maybe that's a direction I can take this, given that the DocRouter interfaces are pretty well set.

Further Ideas and Directions

But there are other ways to use DocRouter.

So for example, there are hospitals that use Epic as an EHR, and for durable medical equipment, there are other systems of record.

These bits of software are very mature. It's also very difficult to persuade the vendor to open up features or to implement AI workflows in them.

Challenges and Opportunities

So what you can do is put something like DocRouter in front of the document pipeline and use the tagging function to select which documents you process with AI. You can then pre-insert these structured results, with a human in the loop, and save quite a bit of manual work.

Okay.

Future Connectivity and MCP Integration

So that's DocRouter. Now let's talk a bit about the transition towards agentic AI and MCP; it dovetails really nicely with Michael's presentation.

It's the same thing that he presented.

So you start with the FastAPI REST implementation: you create a token on DocRouter, and everything that you do in the UI, uploading documents, listing documents, uploading schemas, uploading prompts, has a REST API.
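A minimal client for this kind of token-authenticated REST API might look like the sketch below. The endpoint path, base URL, and payload shapes are assumptions for illustration; check the GitHub repo for the actual documented routes.

```python
import json
import urllib.request

class DocRouterClient:
    """Hypothetical client for a DocRouter-style REST API with a bearer token."""

    def __init__(self, base_url: str, token: str):
        self.base_url = base_url.rstrip("/")
        self.headers = {
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        }

    def _request(self, method: str, path: str, payload=None):
        # Build and send one authenticated JSON request.
        data = json.dumps(payload).encode() if payload is not None else None
        req = urllib.request.Request(
            self.base_url + path, data=data, headers=self.headers, method=method
        )
        with urllib.request.urlopen(req) as resp:  # network call
            return json.loads(resp.read())

    def list_documents(self):
        # Assumed endpoint name; mirrors "listing documents" in the UI.
        return self._request("GET", "/v0/documents")

client = DocRouterClient("https://app.example.com/api", "my-token")
print(client.headers["Authorization"])  # Bearer my-token
```

The point is that every drag-and-drop action in the UI maps one-to-one onto a call like `client.list_documents()`.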

API Integration

OK, that's good. So that's one way to integrate.

But you can also have this new MCP server architecture.

MCP Server Architecture

So Michael's presentation is great because it approaches this from the ChatGPT, OpenAI side. MCP comes from Anthropic, so they're a bit competing with each other.

But MCP itself is not mature. So it's exactly what he said. It's not clear how it's going to play out, right?

But the good news is, in some sense, it doesn't matter. Because if you're ready for it and you have kind of a right mindset, you'll just kind of catch the wave. And maybe MCP is the wave, you know?

So that's what it is. You basically have an API to do this, you have an authentication token, and now what you need is an MCP client to derive value out of the MCP server interface.

I do want to talk a bit about the connectivity aspect of MCP versus the application level of MCP because it's quite subtle and it's not mature. There's more to watch for there.

Cloud Desktop and Interface Exploration

But before we do that, let's just think about this MCP client and start, for example, with Claude Desktop, which I think Michael mentioned. Claude Desktop can be pointed to the DocRouter tools.
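Pointing Claude Desktop at an MCP server is done through its `claude_desktop_config.json` file under an `mcpServers` key. The server name, command, and URL below are placeholders, not DocRouter's actual published configuration:

```json
{
  "mcpServers": {
    "docrouter": {
      "command": "npx",
      "args": ["mcp-remote", "https://app.example.com/mcp"]
    }
  }
}
```

Once configured, the client discovers the server's tools on startup, which is the interface listing shown next.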

It's going to discover all the interfaces. The interfaces have embedded comments in the code that say what each interface can do.

This is already showing a bit the peril of this. You can connect everything to everything. Look at this. There's all these interfaces.

But here's the trick. DocRouter has the documents, it's got the OCR output of the documents, and it's got the extractions.

So if you want to run a function, you have three ways to do it. Are the language models smart enough to pick the most efficient way to call these APIs? I think that's the next question.

So that's not the application; that's the frontier of where MCP is and how we want to do this. With Claude Desktop, I'm able to ask, hey, what documents are available in DocRouter?

And it gives me all these resumes and the annual report that was there. And then I can say, OK, what are all the names of the candidates? And then it's trying a bunch of APIs.

Ultimately, it decides to take the OCR output and rerun the language model prompt itself. I don't mind, because I pay $20 a month and it's a fixed cost, but you can see there's already an efficiency problem there.

So I have to figure out how to make it run exactly the interfaces that I need, because I already extracted this output. Why run the extraction once in the system and once again in the MCP client? So this is kind of the frontier.

OK, it gets the candidates. Then, OK, so this is where it becomes interesting.

Solving Optimization Problems

Now, imagine you're a consulting company. You've got a large group of engineers somewhere, let's say in Argentina. And you've got these projects that are available at a big enterprise in Boston somewhere.

And now you've got an optimization problem. You want to say: OK, out of this pool of engineers, I'd like a team of two junior engineers and one senior engineer for a full-stack project with a Next.js front end and a FastAPI back end, deployed on Kubernetes in AWS. This actually comes from a test question given to a salesperson at one of these consulting companies, to see if they're trained to understand the intricacies of assembling a team of consultants for a project.

And somebody was asking me, can you help with this? But now the AI can actually solve these optimization problems with something like DocRouter on the back end, with the names of the consultants, and say: based on the CVs I've analyzed, here's an option. You can take this person as the senior engineer and these people as junior engineers. Multiple options, but I recommend this one.

Now, obviously, this is about human work, so you've got to be really careful here, because these resumes, who knows? People are representing themselves.

But it shows you how to solve these optimization problems, again, some requirement and some solution. It doesn't have to be resumes, but it gives you an idea of where this goes.

Dashboard Creation

The other thing you can do is exactly what Michael was showing: build me a dashboard with the candidates mapped against programming languages, and make it dynamically get the latest information from DocRouter.

OK. I think I've asked too much.

So what it was able to do was build a static, one-page dashboard. And the nice thing with Claude Desktop is that it actually shows you this page, because it's got a sandbox in there. It reserves some folders to create the artifacts, and then it displays them.

And I think it's pretty neat. Nice.

And furthermore, if I want it to actually call the APIs directly to do it, I'm pretty sure that if I spend a bit more time, I can make it do that. And if not, it can be programmed.

OK. So this is what we can do right now.

Collaborative Development

And the project we have next, and we're a group of loosely connected contractors and entrepreneurs, is to build an open source MCP client, because we don't want to be tied to Claude. And we want to be able to replicate the same type of function.

And each of us has slightly different interests; one of us might want to do a trading platform for crypto, for example. But there's enough commonality that if we build an open source MCP client and get it to the point where it does things like this, it can serve as the basis for the next product, and the next.

So this is great, and that's what we've been doing at the hackathon this Sunday. So that's kind of one direction we can go.

And really, the other direction is what I've shown you before: this idea that you have a system of record somewhere, with a well-defined, complete set of interfaces that are well thought out, something like DocRouter. And then you can create something like Manus on top of it.

So basically, application creation as a service, where you, even if you're not very technical and have just a bit of engineering, can prompt it and say, I want this. But now with the full power of AI with the human in the loop on the back end, solving this boundary problem that is so difficult: integrating language-model-level stuff with classical procedural website implementations.

This is my presentation.

Thank you.

Yes.

One specific question that I had with respect to MCPs, because as you mentioned, they are used in tandem with an AI agent. I want to ask this question from a more business perspective.

In business frameworks, what you expect from your agent or from your model is really to respond to your prompts.

Now, the problem with these agents, and it may be a good thing or a bad thing depending on what framework you're looking at it from, is that these agents are very creative in the way they operate.

So when it comes to putting guardrails on how these agents think about certain things, purely from a business perspective: when you deployed some of your applications, did you see those guardrails breaking, with the LLM going full throttle and just deploying all of its creativity?

It goes to the core of how these language models are used. I work with financial data and insurance information, and it's absolutely necessary to have a human in the loop for these workflows, to the point where initially you almost have to design the system so the human goes point by point to verify what the language model does. Otherwise it's just not good enough.

DocRouter doesn't have this function yet, but you can use LLM-as-a-judge to evaluate the results: you basically ask a simpler LLM model, what is the accuracy of the previous response on a scale from one to five? If you get enough confidence that five out of five is usually correct, then you can take the human out of the loop for those applications.
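The gating logic described above can be sketched as follows. Here `judge_score` is a stand-in for a real call to a cheaper judge model; the threshold and routing labels are assumptions for illustration.

```python
# LLM-as-a-judge gating sketch: a cheaper model rates each extraction 1-5,
# and only low-confidence results are routed to a human reviewer.

def judge_score(extraction: dict) -> int:
    """Placeholder for asking a simpler LLM: 'rate the accuracy 1 to 5'.
    A real implementation would send the extraction and source text to a
    judge model and parse its numeric answer."""
    return 5 if extraction.get("name") else 2

def route(extraction: dict, threshold: int = 5) -> str:
    # Keep the human in the loop unless the judge is fully confident.
    score = judge_score(extraction)
    return "auto-accept" if score >= threshold else "human-review"

print(route({"name": "Jane Doe"}))  # auto-accept
print(route({}))                    # human-review
```

The design point is that the human is removed per result, not per application: anything scoring below the threshold still lands in the review queue.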

Actually, the language models have become really good if you give them the right context. So this issue of context comes again and again. I can put anything in the prompt, right?

So one good example of that is when you use Windsurf or Cursor as a text editor. It's got three modes: agentic, ask, and manual.

But if I don't give it the right context, and I run the agent and it needs to index itself and look for things, then we get into trouble, right? So now, in terms of industry applications, because maybe there's a business aspect to your question, not just a technical aspect:

Business-wise, if you deal with financial data, you have to have a human in the loop. But maybe not if your application is customer support or marketing or something more loosely tied, like some workflows, for example.

Say I go get a list of 200 names of potential leads, and then I pass it through a language model to say: how do I address these people in a way that they're likely to respond? That's an application of a language model that doesn't require perfection.

So you mentioned using LLM-as-a-judge to evaluate the responses. One thing that I've seen with a lot of these evaluation metrics is that sometimes they really do not focus on the key performance indicators that you want to evaluate.

So for example, you want to evaluate whether the LLM is capturing a few important numbers from a specific report. The LLM generates a response, and let's just assume that it captures all the important metrics and has the same numbers, but it's framed differently; the difference is just in the semantics or the way it's worded.

Now the problem is that with LLM-as-a-judge, what I've seen is that if the wording becomes slightly different, even though the KPIs are captured really well, it gives it a worse score. Have you observed something similar?

Actually, no. I think it depends on the quality of the model itself. If you use an expensive model, which may not be feasible, then you're going to get the correct extraction.

Typically I see this with medical records. The models do great work, but you just have to pass the exact context, not too much, not too little. And then it's actually pretty reliable.

So in my consulting projects, I have the people working as human in the loop. They rarely find problems with the language model actually not catching things. So that's one thing.

Now, LLM-as-a-judge is actually a simpler problem than actual extraction or an actual chain-of-thought type application. It's supposed to work pretty well. So I think experience shows that you can do it.

Now, in practice, do people actually use LLM-as-a-judge efficiently for all these problems? No. But I think the reason is that there are too many steps to implement before you actually get to that point. So you need an accelerator to do it.

You need something that gives you a ready-made, well-implemented LLM-as-a-judge with the ability to change the prompts and see the effect. One typical problem is that you need to update the prompt when 90% of things are working and you want to fix 2%; you don't want to fix that 2% and break another 2%, right? So you need a system that really lets you see quickly where the problems are. And these LLMs are actually expensive; you can end up running thousands of dollars in costs. So there are all these considerations.

It's still a difficult engineering lift. But that's why you want to have this middleware helping you accelerate what you're doing.

Thank you so much. Oh, OK. An observation and a question.

Something that blew my mind about your example, the CV extraction, is that the model had a chance to add its own wisdom. I noticed that in the team selection challenge, it figured that people who had a React background would be suited to learn Next.js. That wasn't in the resumes; it was something Claude 3.7 knows and added to its analysis, which I thought was super cool.

It was like a real amalgamation of its world knowledge plus... It's its own chain of thought, basically. Yeah, right? That was so cool.

So I've got to tell you, I spent like two minutes generating that example. It just worked. Awesome.

My top user request at Volcker, actually, is something people want that we haven't done, because everyone has lab work from some different company.

They're all just PDFs. They're like, oh, we're going to have to parse this. It's going to be so annoying.

Is that the kind of thing DocRouter does? OK, so that's a really important point, because that's what I see in my consulting work, too.

It's like, yeah, the LLMs can do everything, but the world is so bespoke that abstracting it out is really complicated. And the other problem is that when humans do it, they're already very efficient, especially when some of the work is outsourced.

So it's the kind of business where you have to do many things at a small cost premium and collect at the end; out of 10,000 things, it adds up.

So it's this bespokeness that makes things very complicated, exactly what you're describing. Now, maybe with DocRouter it's easier, because I'm enabling you to do it, but then it becomes your problem, because you need to stitch it up.

But when you design your business around this, you definitely have to think hard about this bespoke problem and how you're going to mark it up to get a benefit out.

So really, really good question. Yeah, thank you so much.
