An LLM will fill that excel for you

Introduction

My talk is about using AI and LLMs to fill excel files.

The Problem: Turning PDFs Into Excel Data

That's a contract for changing a heat pump. It's a heating system.

My customers are real estate asset managers and they have this amount of pdfs to manage one building except my customers have dozens or hundreds of buildings.

We have a floor full of PDFs and then they have pretty operational questions. The answer lies in those PDFs.

What Manual Extraction Looks Like

So one of the job is to do like this. Okay, so contract number HP whatever. Okay, internal reference.

Now Now, first part is Braxton, blah, blah, blah. Okay, I cannot copy -paste, so I have to go here and continue.

Should I do the 15 fields or... You get it, right? Okay. And that's actually a real big issue.

Riccata: Automating Document Understanding for Real Estate

So, I'm co -founder of Ricarta. Not Mati, Ricarta. And our job is to help real estate asset managers manage their buildings.

I will show you only the UI that we have at Riccata, the one that our customers are using.

Product Walkthrough: Upload, Classify, Extract

You select the PDF, you select the type of documents, we can discuss about it later on. And then, boom, let's launch it.

Pipeline Overview: PDF → Text → Structured Data

We'll do two things. Well, actually it's written. First we transform the PDF in a text file, and then we transform this text file into structured data. When I say structured data, we'll see later on what it is.

Data Models by Document Type

We said it's a contract, because contracts are different from rental agreements. They are different from quotes, they are different from invoices. And each time we will map the PDF to one specific data model. model.

The goal at the end is to make sure that you don't have to input this Excel file manually first.

So we'll wait maybe 10 seconds more.

Why a Single LLM Call Isn’t Enough

It's about AI, it's about LLMs, but actually LLMs are quite bad at doing this. Or let's put it another way.

If you just give it to an llm you'll have very bad content very bad result and what we do we orchestrate many many llm calls to get good accuracy of the data maybe i don't know if it's network or something it's not very fast today because at the end of the day what our customers want is to extract data but also

Beyond Extraction: Comparing Quotes, Contracts, and Invoices

So compare this PDF with this PDF, with this quote, with this contract, with this invoice. OK, do we did the invoice, the amount that they said they would? If there is a difference, what it is?

So from the PDF we had before, we analyzed it. We found one building. We found two parties, Brixton, Grimtham, effective date,

signature date currency blah blah blah everything okay so it took about one minute it should be faster but yeah demo effect and then what's important is our

Compliance Checks Against Asset-Manager Guidelines

customers they have some guidelines so like oh if I work with heat pumps I want to do this specifically so I want to make sure there is no performance contract or there is no indexation clause so what we do is we have analysis

oh but maybe I will do it another way we have this new contract here that we want to come to compare with let's say so three ones with two ones and we'll check the is there indexation clause and performance let what we are here it's It's kind of the guidelines of the user, of the real estate asset manager.

And they want to make sure that each time they receive a contract, they are absolutely sure that there is a performance clause inside or there is anything. So what we do, like we did before, the document is already in text form so it's super fast and then we go and extract the data. data.

Again, the idea is that the real estate asset manager doesn't spend time checking its two pages document. The biggest we have, we add these 350 pages, and they don't want to scroll and see, oh, there is this first part regarding indexation for the price, and there is this thing about performance whatever and we automatize probably i took also very long pdfs

again demo effect yeah sorry there it is and then we have the same analysis for those three uh documents so you have the three documents here and we want to to know if it's a renovation or construction contract it should include energy performance clauses and we don't say only yes or no we say yes and why yes in this case no because there is no data and here same for indexation so when they have new

documents they upload the document they just say oh it's a contract it's rental agreement and then out of this is that the full analysis of is this document compliant and then they can link it to some processes internal processes and of course you can extract it to Excel it's only the instructions but you get the

Multilingual Documents and Broader Applicability

document name we work with French German English Italian so for example if you have a building in Fribourg you might have some contracts in French some contract in German maybe even some contract in English and you want to have

the same analysis even if you maybe you get or you don't get anything of of German but still you you get the data behind it it's a really b2b case as you can see but actually what's interesting

here is that extracting structured data from pdf or any document is something quite generic it's making sure that all documents that you receive let's say in your mailbox box can be used by the machine to understand what's going on, like extract the important pieces of the context, find inconsistencies between two documents.

Real-World Examples: Inconsistencies and Typing Errors

Recently, for the real estate case, we received a data room for an acquisition to acquire a building.

And actually, inside, we had three different values used for the heating system as we get it was like oh here it's it's gas here it's oil now you user work on it so we are not taking decisions we are extracting data so user can take decisions

and maybe to put it even more concrete one of our customer uploaded 93 rental agreements and he wanted to extract which are the flat number and parking spot number and he called me and he said oh joe i have i have an issue because some data that has

been extracted is not shouldn't be here and i looked at it and actually inside the rental agreement at the top it was flat number three parking spot number 23 at the bottom it was flat number three but parking spot number two and we were extracting um three two and 23. out of 93 documents they had eight issues like this and then we discussed and he said oh yeah

And I have another case for you. Actually, then we compare the data that you've extracted with the data we have in our system.

And out of those 93 documents, we have 17 typing errors. So it's like 20, 25 % of errors for this whole set of 93 documents.

Conclusion

That's what we do at Riccata. I'm happy to take questions.

Q&A: What the Solution Does (and Why It’s Different)

I want to make sure I understood what your solution does, so make sure there's some information that people are looking for on some contract. That's the main thing that Riccata does.

I want to make sure I understand it perfectly because I have a question about code in Excel after that.

Yeah. Yeah.

So probably two things. First one is we are really focused on real estate and construction. And we said a contract will look like this in terms of data model. It's very important because then you want to compare documents.

So you have to share the same data model. That's the first thing. And then we fill this data model with data inside the document.

And then that's docked to data, and then when we have the data, then we have all those processes, things that we call data to value, then we can do a lot of things on top of it.

Excel Extensions vs. Purpose-Built Extraction Pipelines

Basically, at the moment, because Cloud released an extension for Excel two days, less than two days ago i think and um is it not possible to do that with this kind of uh the same work that you same thing that you um with your solution but with uh this kind of extension that uh anthropic just released i haven't tested it uh so i didn't test it yet yeah i want to make sure because

Because there is already on the market some solutions, which are extensions for Excel. For example, there's Numerous, which is quite useful, especially for people who are not really, who have not, cannot, I don't have a lot of experience with Excel, so it's kind of a useful extension. I don't know if the close for Excel is that much better, but based on the current release, I think so.

So that's why I'm asking the question because there's already some competitors on that kind of stuff. So that's just a question if it can be done with that kind of extension already. Yeah.

Accuracy Challenges and Edge Cases at Scale

So we're not the only one doing this, that's for sure. and actually when we began like two years ago we haven't at that time we hadn't seen any solution with a good level of accuracy I have a lot of good cases like edge cases for some people but actually for us it's like cases that

we have to manage especially LLMs don't scale if the input goes bigger the accuracy goes down and with the same input if the output goes bigger it's worse yeah even you don't have to reach the end of the context window

actually yeah we said no we said non -technical people but i can have take you a little bit um but yeah if if the input goes bigger the lm is like yeah i don't care really it's like some parts of the document that like don't read it and for the output is even it's even worse so you put the same input but you ask even more data out of it and then it's like yeah i don't know and

1And again, so we have a lot of orchestration, like splitting the document in small chunks to make it work.

That's why we have a lot of, it was edge cases that now are really like what we want to support.

So until now, we haven't seen any good solution managing those edge cases. Yeah.

Q&A: Orchestration Principles

How does the orchestration work? I mean, you don't have to go into the tech details, but the principle.

Making Outputs More Deterministic

1Yeah, we like deterministic things. And LLMs are not. So what we do is we tame LLMs to have the smallest call possible. So the output is the smallest possible.

And then on top of it, it's like we split the document or we split the output. Sometimes it's quite easy to do. Sometimes it's not.

Chunking Long Documents Without Losing Context

when we have a document which is 300 pages long you cannot just split by let's say 10 pages long chunks you know because if you ask for um indexation close then you have something chunk number two and something chunk number 22 and how do you put them together so it's a lot of of understanding the context, breaking things together

without having sum -ups, because if you sum it up, then it doesn't, it lose all the finesse.

So we think of what is the expected output, and we kind of go back to how many calls we will do.

I can show you something. thing. That's my 30 seconds with technical things.

Observability and Repairing Ill-Formatted Outputs

Okay, that's Phoenix. It's a very, very good tool, guys, if you are into looking what's inside the LLM call.

And here we have a call and two answers. So this answer, that's my system prompt, user prompt, the output. output, but then actually the output was ill -formatted.

It found some errors. So we had another call to get the correct data format. So for the small document that I showed you before, we had at least four calls, four LLM calls that were sent to get data.

and our job for this part is to have very very good data also something you mentioned entropic and these all those tools if it's only an LLM always click twice click twice and you'll get two different answers that's not what we want to do

Q&A: Industry Scope and Data Modeling

What if I'm in the sports industry, should I expect the same level of accuracy, or will it arise? Yeah.

So pipeline is industry -agnostic, but the data model are not. And actually, it's at inference time, so in the prompt, that we add a lot of industry data or context.

Q&A: Security, Deployment, and Model Choices

How do you manage the data security for the projects?

Yes, good one. Thanks for asking. It's a big topic for us as well.

Data Security and Sovereign Infrastructure

At the moment, we are using an LLM through Azure. It's a temporary solution.

Since end of 2024, we have a project with EPFL, LFL, Swiss Data Science Center, Iverdon Engineering School, Exoscale, to have our own LLM on Swiss sovereign infrastructure.

It's a very, very important topic. It was a big topic for us like 15 months ago. It's even more big topic today.

But then when you move from one LLM to another, there are a lot of topics actually to manage. manage it's not like the beginning thing oh yeah let's let's switch to something else it's it's a big thing it's a very very big topic i can discuss technical things later on as well yeah so are you

Models Used Today and Migration Constraints

using a multimodal analysis or is it only one can you disclose which model you're using yeah oh it it was written before, we were in the transition, but we are using 4 .1 .mini and 4 .0 .mini.

4 .0 .mini is very, very good. Well, it's OpenAI and all of this, but 4 .0 .mini is very good.

The bad thing is six weeks ago, we received an email from Azure saying, saying, oh yeah, in one month and a half, you won't have access to 4 .0 mini anymore. And then they pushed the deadline, but we use 4 .1 mini and 4 .0 mini.

But as I said before, we are moving to an open source model with which we have better performance actually. It's sovereign and better performance.

Q&A: OCR, Traceability, and Handwriting

So you don't do email analysis? No. No.

Why Text-First Extraction Enables Source Traceability

We tried a few things, but for several reasons, it's better to have a text analysis, especially because maybe I should have shown you before.

When we extract some data, let's say building, we have the section of the document here. So we want to go back to where the data come from. So here, it's two pages long, so we don't care.

but if it's a 300 pages long document we want to know oh this close comes from somewhere and the user can click yeah I can try but it's a small one it won't change much yeah let me find another one and you cannot easily easily do that

that with, yeah, this one is first page, but let's go to signature date, let's say. Yeah, it goes to page two, whatever.

And so it's not that easy to have it with image extraction from an image document.

Yeah, kind of. Yeah. Yeah, thanks.

Q&A: Format Variability and Robustness

So if I understood well, so you need for those contracts to be analyzed, they have to respect a format that you... No, no, no, they don't. That's the value. OK, that's the value.

Whatever the format, we extract data. I can show you. I've prepared it, but that's our best.

We love this one. It's a contract. Oh, we should. Yeah, whatever.

I should stop the camera. Here we are. 611.

Handling Scans, Forms, and Handwritten Additions

It's a TypeScript with many things, and we are able to extract it. Usually when a customer asks for this specific question, we're like, yeah, let's try this one, and we show them that. So we are doing OCR, coming back to your question, we are doing OCR in that case, but then we kind of always go through the text version of the document. Yes?

So I was wondering how you handle handwriting. So this is all, handwriting is typically for OCR, for kind of challenging, when the tools get pretty expensive, if you want to really recognize handwriting, really read handwriting, like Docker and stuff.

So you, do you decide then for each document, is the handwriting on or not, and then you use OCR or not, or how do you?

Yeah, of course, handwriting is a big topic. Actually, with this OCR that we use, we have quite good result.

There are several kinds of PDFs. First one is a form, which is a proper form, and we have data as accessible as data inside the PDF. So that's one branch.

Then we have the PDF that has been typically printed from Word, let's say. That's another kind of PDF. We can extract data directly from it.

Then we have the scanned document. when we have the scan document we have either something which was written with the computer

has been printed signed and then scanned and then we have the unwriting thing four branches

four ways of extracting the text from the document

when your system decides automatically how it extracts, so I imagine also the difficult case is then if someone

if it's printed or or typed and then someone crosses out something and add something per hand maybe change the number

Prompt Injection and Consistency Checks

and have you thought about the prompt injection scenarios like yes you know someone something yes write something in right in just in white form yeah and then then how does your system handle

this yeah um so actually for us it's quite easy to manage this case uh because we as we go through through this text document, we can simply have a prompt which says, if there is a kind of prompt injection pattern, then let's put it somewhere else instead of going through the pipeline first.

And then if it goes through the pipeline, it might have access at some point to some data in our database. Yes, but it cannot really attack the system.

well I mean it's documents are not with users data or something so we have kind of segregation of data especially with the QR code with the Swiss invoices we We make sure that the QR code has the same data as what's inside the rest of the invoice.

And also that's why we compare quotes with the invoice with the QR code to make sure that there is no inconsistencies.

Well, thank you for that. We could give him a round of applause.