My name is Robert van Dijk. I'm an AI solution scientist at Cellvoint, a company that actually, we used to be here in origin for the past almost four years. Recently we've moved to science crates because we needed a lab space next to our office.
My background is in medical engineering, specifically working across disciplines in AI, biology, computer science and all that type of stuff, and I've been working specifically in image analysis, and actually just microscopy analysis, for the past four years.
Building AI systems has been somewhat more novel to me, and I've only really been doing that for the past four months, so I'm sure that there will be people here that can do that much better than me. But I have had a bit of experience in the more generic AI backgrounds, and I've seen
some things that I think can be translated from computer vision, which would be understanding what you can see in images, to building agentic systems. So I'm trying to translate between those two.
If you're new to building systems then I hope that either this will be very familiar or you'll see a new perspective or maybe if you're not even building anything yet maybe this will help you understand how these AI systems actually behave in the background and understand why you're getting certain responses from your chatbots.
All right, so first to set the scene and framing, I want you to start thinking of AI or the agents as a system. So I think for people who build regularly with this, that's a common thing.
But it's important to understand that when you talk to one of the LLM systems, so the agents, they iterate through understanding what's happening in their environment, and I'm trying to focus in this talk on the environment itself.
So it's an iteration of you querying the system, it then performs an action, for example, it wants to understand the environment better, gets the feedback, and that goes into iterations until at some point it's found that it knows enough and it can give you an output.
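As a rough illustration of that loop, here is a minimal sketch. Everything in it is hypothetical: `run_agent`, `toy_decide` and the `lookup` tool are stand-ins invented for this example, and the "model" is a hard-coded function rather than a real LLM call.

```python
from typing import Callable


def run_agent(query: str,
              decide: Callable[[str, list[str]], tuple[str, str]],
              environment: dict[str, Callable[[str], str]],
              max_steps: int = 5) -> str:
    """Iterate: decide on an action, act on the environment, collect feedback."""
    observations: list[str] = []
    for _ in range(max_steps):
        action, argument = decide(query, observations)
        if action == "answer":               # the agent judges it knows enough
            return argument
        tool = environment.get(action)
        if tool is None:                     # the environment does not offer this action
            observations.append(f"unknown action: {action}")
            continue
        observations.append(tool(argument))  # feedback from the environment
    return "stopped: step budget exhausted without an answer"


# Hypothetical stand-in for the model: look something up once, then answer.
def toy_decide(query: str, observations: list[str]) -> tuple[str, str]:
    if not observations:
        return "lookup", query
    return "answer", f"Based on '{observations[-1]}'"


toy_environment = {"lookup": lambda q: f"notes about {q}"}
result = run_agent("cell counts", toy_decide, toy_environment)
print(result)
```

The point of the sketch is that the answer depends on what `environment` contains as much as on what `decide` does.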
But what is that environment?
That is where it gets a bit more complex, because really often when we interact with these systems, we just see the model itself and the instructions that we give it. But
really, there's a lot more around it, and specifically here, I've kind of framed that in this box on the right, the operating environment, so you can think of it as the documents you give it or the data and the context, what type of tools, so what do you allow it to do, what do you not allow it to do, and then you've got the state or the memory from your
previous conversations, and then there's the more top-level orchestration of different agents, maybe once you start building in hierarchies. And what's important is that really reliable agents depend on more than just that model call and the structures you give it; they depend on the entire environment that sits around it.
So that sounds maybe a bit abstract, and the core idea here is that I'd like to make that environment a bit more visible through imaging, because that's what I've been working on.
So in computer vision, that environment itself, like the different systems that you have, and specifically the context, and maybe the underlying information, really is carried in the images themselves.
So you can have, like for microscopy, you have the focus depth, you've got noise, you've got contrast, post-processing, pre-processing of images. And all of those are either subtle or immediately visually apparent in what you're doing.
These here are three example images, just microscopy images, and actually those are all images of exactly the same content, so those are the same cells that you can see under the microscope.
So for us it's immediately obvious those are different images, even though the content that we're trying to understand here is the same.
Now in my job, one of the tasks that we work on a lot is trying to find where the cells are, and that's quite a difficult job, as you can see, because you can look at this and be like, I'm not sure, maybe those dots are cells or whatever.
But the point is that the environment can change even though the content that we're trying to understand is the same.
So to make that more apparent, so here on the left we've got one of those images and on the right we've got a prediction of what we call cell detection. So we're trying to find out where in this image are the cells.
So we've got the same cells, same task, same model. But for each of the images that I've just shown you, we get slightly different outputs because we've got a slightly different context of the cells that we're looking at. And if that seems really obvious to you, then great, because that's the whole point.
So in that type of environment, you don't really just test how well did my model work, but really how well did my model work in this environment. So you're testing the environment around the task that you're trying to test for.
So that could be one microscope where you set up your system, so if you're building a system, then you're testing it in the end, and that can be in this case just one microscope, so we've got a data set and all of the images are acquired by the same company or by the same person operating the microscopes.
But then when you're testing it, we're testing it in different worlds, different microscopes, and that's when you can get completely different results just because your world around the task has changed a little bit.
So with computer vision and then imaging, we typically solve these types of problems by doing ETL, which is fancy language for extract, transform, load, and really it means how can we normalize or standardize the images that we see.
So if you get images from three different microscopes, how can we make sure that the model in the end sees exactly the same information every single time.
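To make the standardization idea concrete, here is a small sketch. It assumes z-score normalization of pixel intensities, which is one common choice and not necessarily the method used in the pipeline described in the talk; the two "microscopes" are made-up 2x2 arrays that differ only in gain and offset.

```python
from statistics import mean, pstdev


def standardize(image: list[list[float]]) -> list[list[float]]:
    """Rescale pixel intensities to zero mean and unit variance."""
    pixels = [p for row in image for p in row]
    mu = mean(pixels)
    sigma = pstdev(pixels)
    if sigma == 0:                      # flat image: nothing to rescale
        return [[0.0 for _ in row] for row in image]
    return [[(p - mu) / sigma for p in row] for row in image]


# Same hypothetical cells seen by two microscopes with different gain and offset.
scope_a = [[10.0, 20.0], [30.0, 40.0]]
scope_b = [[2 * p + 100 for p in row] for row in scope_a]

norm_a = standardize(scope_a)
norm_b = standardize(scope_b)

# After standardization the model would see (numerically) the same input.
same = all(abs(a - b) < 1e-9
           for row_a, row_b in zip(norm_a, norm_b)
           for a, b in zip(row_a, row_b))
print(same)
```

This only removes linear brightness and contrast differences. Blur or noise would need other handling, which is where routing comes in.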
And additionally, you can then also do something that's called task routing. And we'll get to that a bit more based on the agents.
So what would that look like for an agentic task? So let's take a simple agent workflow, turning meeting notes into project tasks. This is kind of the sort of workload that I think a lot of people could relate to.
On the left, we've got our demo world, which is quite explicit. Everything that we would need to perform the task is present in your inputs, in your structured notes. So if you're trying to classify what is the action, who is the owner, the deadline, all of that is immediately visible in this structured note, and we can get a very clear answer: these are the two actions, these are the owners, and this is the deadline for those different tasks.
If we then move to a real-world scenario, so let's say we've built this beautiful setup, we're now trying to deploy it for a new client or a new user, and somehow the entire system starts misbehaving or starts behaving uncontrollably, and that's because the transcripts are much more messy than what we expected.
So we could have uncertain conditions in the transcript where it's unclear who's the owner, it's unclear what exactly the deadline is, and we don't really know how to behave. But what tends to happen is that these agentic systems tend to just guess without telling you that they've guessed. So there's no real visibility on the task confidence or the quality, but also no real visibility on the uncertainty of it.
So what we really want to take away is agents shouldn't just guess. That's not what we want them to do.
And you can kind of try to visualize that in this sense, where on the left we had a very clean, nice, beautiful microscopy image, and on the right we then moved into the real world, and all of a sudden we get blurry images. They're from a different microscope and we're not used to handling those types of things.
So for computer vision, this is typically the workflow that we get. Let's say we have the task, which is the prediction of the cells, where are the cells in the image, and on the left we've got the environment,
the setup of where that environment comes from, and in this case we simplify it to the microscope. So microscope setup, you can have prep, and typical instrument settings. Then you've got the input surface, so that's just the image, and then we tend to set up
different variations of standardisation, and maybe routing, which happens when, let's say, you've got such a blurry image that you don't really know what to do with it. At that point you can classify it and say we don't know what to do with this image, and you tell the user that actually we don't know how to deal with this, so maybe you should try and
take the image again. And only then do we do predictions. And that's only for the images that we're certain about.
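A quality gate like that could look roughly like the sketch below. It assumes the variance of a simple Laplacian as the blur score, which is a widely used heuristic but not something stated in the talk; the threshold of 100 and the tiny 4x4 images are arbitrary illustration values.

```python
from statistics import pvariance


def laplacian_variance(image: list[list[float]]) -> float:
    """Variance of a 4-neighbour Laplacian over interior pixels.

    Low values suggest few sharp edges, i.e. a blurry image.
    Assumes the image is at least 3x3 pixels.
    """
    responses = []
    for y in range(1, len(image) - 1):
        for x in range(1, len(image[0]) - 1):
            responses.append(
                image[y - 1][x] + image[y + 1][x]
                + image[y][x - 1] + image[y][x + 1]
                - 4 * image[y][x]
            )
    return pvariance(responses)


def route_image(image: list[list[float]], threshold: float = 100.0) -> str:
    """Only sharp-enough images go on to prediction; the rest go back to the user."""
    if laplacian_variance(image) < threshold:
        return "reacquire"
    return "predict"


sharp = [[0.0, 0.0, 0.0, 0.0],
         [0.0, 255.0, 0.0, 0.0],
         [0.0, 0.0, 255.0, 0.0],
         [0.0, 0.0, 0.0, 0.0]]
blurry = [[100.0, 101.0, 102.0, 103.0],
          [101.0, 102.0, 103.0, 104.0],
          [102.0, 103.0, 104.0, 105.0],
          [103.0, 104.0, 105.0, 106.0]]

print(route_image(sharp), route_image(blurry))
```

The design choice is that the gate runs before the model, so the model never has to produce a prediction for an input it was not built for.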
And for an agentic system, that workflow is actually very similar. So, you have your user query, your docs, your tools, then you have the context, the tools and the state of the system, but then the question is, what prepares that whole system for your agents to actually deal with it in the right way,
and then act in the way that you'd expect? So you've got the same lesson across the two systems: you want to understand better the world that your agent sees, and not just the model itself or just the prompt.
But we really want to think about how can we prepare the data or the environment for a model.
So to take that away, practical operating model for that.
All of that is a bit vague, but these are more of the steps you can take to do that.
So trying to chop up your workflow into different stages instead of just taking all the information and dumping it into your system. So you can have raw, ambiguous data, and you can try to then first prepare it.
So understand for this specific case, when we look at the transcript, was there even a task or was there not a task that we want to assign? Then the second one is can we structure it?
So for each of the bits, do we have a task? Do we have an owner? Do we have a deadline?
Do we have evidence for that deadline or for that information? And what is the confidence that you can assign to it?
And then finally, you can do quality control checks. Do we have all the information to actually write the task? And then can we in the end say, yes,
that's good enough, or no, that's not good enough, and then we route it back to the user and say, actually, we're not certain about this task or we're not certain about the owner. Can you provide us with extra information or you could go back to the environments and maybe query other documents?
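The structure and quality-control stages can be sketched as follows. The field names, the 0.7 confidence threshold and the example sentences are all assumptions made for illustration; in a real pipeline the `TaskCandidate` would be filled in by an earlier extraction step.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskCandidate:
    task: str
    owner: Optional[str]
    deadline: Optional[str]
    evidence: str       # the transcript sentence the fields were taken from
    confidence: float   # 0.0 to 1.0, assigned by the structuring step


def quality_check(candidate: TaskCandidate, min_confidence: float = 0.7) -> list[str]:
    """Return the list of problems; an empty list means the task can be written."""
    problems = []
    if not candidate.owner:
        problems.append("owner is missing")
    if not candidate.deadline:
        problems.append("deadline is missing")
    if candidate.confidence < min_confidence:
        problems.append("confidence below threshold")
    return problems


def route_task(candidate: TaskCandidate) -> str:
    """Act only when the checks pass; otherwise hand the uncertainty back to the user."""
    problems = quality_check(candidate)
    if problems:
        return "ask user: " + "; ".join(problems)
    return f"create task: {candidate.task} ({candidate.owner}, due {candidate.deadline})"


clear = TaskCandidate("Send the report", "Anna", "Friday",
                      "Anna will send the report by Friday.", 0.95)
vague = TaskCandidate("Look into pricing", None, None,
                      "Someone should look into pricing at some point.", 0.4)

print(route_task(clear))
print(route_task(vague))
```

Keeping the `evidence` sentence next to each field is what lets a person verify the extraction instead of trusting a silent guess.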
So in the bottom right there, that's really the mental shift: from just taking raw inputs, putting them into the agent and expecting action, to structuring them in a way that we always control what type of information we give to the final act step. That's the important part.
So reliable systems expose uncertainty before they act.
So we want to move from environment dump, where we just throw all the information that we have into the agent and hope that it gives us the same reliable result every time, to environment design, where we instead start thinking about how we can shape the world that the agent sees.
So can we select the inputs that we give it? Can we curate the context a bit better? Can we define specific tools for specific situations? And can we track the state and add checks?
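One way to picture those five questions is as fields of a single object that gets built before the agent runs. This is a hypothetical sketch: `AgentEnvironment`, the document library and the topic tags are invented for the example and do not come from any particular framework.

```python
from dataclasses import dataclass, field


@dataclass
class AgentEnvironment:
    inputs: list[str]                                   # what we selected
    context: list[str]                                  # what we curated
    tools: list[str]                                    # what the agent may call
    state: dict[str, str] = field(default_factory=dict)
    checks: list[str] = field(default_factory=list)


# Hypothetical document library, tagged by topic.
DOCUMENTS = {
    "style_guide": {"topics": {"email"}, "text": "Keep replies short."},
    "price_list": {"topics": {"invoice"}, "text": "Widget: 5 EUR."},
    "holiday_rota": {"topics": {"hr"}, "text": "Office closed in August."},
}
TOOLS_BY_TOPIC = {"email": [], "invoice": ["calculator"]}


def design_environment(request: str, topics: set[str]) -> AgentEnvironment:
    """Pass on only the documents and tools that match the topics of the request."""
    context = [doc["text"] for doc in DOCUMENTS.values() if doc["topics"] & topics]
    tools = sorted({tool for topic in topics for tool in TOOLS_BY_TOPIC.get(topic, [])})
    return AgentEnvironment(
        inputs=[request],
        context=context,
        tools=tools,
        state={"status": "new"},
        checks=["all required fields filled"],
    )


env = design_environment("Please send an invoice for 3 widgets", {"email", "invoice"})
print(env.context, env.tools)
```

The unrelated holiday rota never reaches the agent, which is the difference between dumping and designing.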
So if you already build systems like this, then it would be good to think about: am I always focusing on one of those five, and maybe I can add one of the other four to my pipelines?
Or maybe a better takeaway is the perspective of if you're building systems and you're not certain about how to proceed, you can ask yourself, how would my system currently respond if I gave it a blurry image example? Would I just get noise back or a random answer back? Or can I actually reliably trust the system itself?
So we get the same model, but better shaped environments and more reliable outcomes.
Thank you very much. That was it.
Have you benchmarked the last step at all, the evidence that it is actually performing better?
I haven't benchmarked that for that specific task.
This pipeline is something that is described by AI engineers at Anthropic and Google as this is how you should kind of view or approach your problems in most of the cases.
So it's more of a general principle rather than something that I specifically tried for one of the topics.
Yeah, I was wondering, you mentioned confidence as one of the metrics in there.
I know with a lot of AI tools in biology, like protein design and protein structure prediction, that's a metric which they actually present to the customer at the end, so you've got a kind of pLDDT
score at the end. Is that something which you'd have embedded into your model, where it actually shows the customer how confident you are in the prediction of how many cells there are, and then potentially having a kind of threshold for whether it says you need to redo it or not?
Now you're talking about the computer vision example specifically, right? Yeah. So definitely.
You can set that up in different ways, but we very often have a confidence or even some form of interpretability of what the prediction is based on, what it's looking at in the image, because people don't tend to trust if we just give them, oh, this image looks bad or this image looks good.
We always have to explain where we're coming from with those systems. So confidence is one way to do it, but interpretability, saying,
well, actually it's these cells that are defining this prediction, that's even more powerful.
Thank you for the talk, firstly.
The question I had was around, I'm a bit hooked on this statement or this phrase, sort of chasing determinism where the system allows
and quantifying uncertainty where it doesn't. It'd be good to understand how you quantify that uncertainty, so maybe feeding into
the question from the person behind me there. But yeah, it'd be good to understand how to quantify the uncertainty of the system.
So then you're asking about how can we quantify the uncertainty that it predicts itself? It's more about that could be potential. It's quite open -ended, sorry. But I think whichever way you interpret it is probably best for purposes of right now.
Yeah.
So with the uncertainty that we typically use for our computer vision models, you can never really say that they are actual probabilities, because they're always coming from more of a probabilistic system. But what we tend to use is, you can train it to actually add a confidence to its own predictions,
and based on that, if it gets predictions wrong with very high confidence you penalize it very heavily, and the other way around. And that type of
confidence, it would get very technical, but it is what you would get with, like, the sigmoid transfer functions at the end of your models or the softmax transfer functions at the end of your models.
Those are, I'd say proxy measurements of it, but you could train your system to give you the confidence as an output, if that makes sense.
So you can add that as your target, yeah. Thank you, Robert.
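For readers who want to see what that looks like numerically, here is a small sketch of sigmoid and softmax outputs and of a loss that punishes confident mistakes. It assumes cross-entropy (log loss) as the penalty, which is a standard choice with that property; the talk does not say which loss is actually used, and the logits are made-up numbers.

```python
import math


def sigmoid(logit: float) -> float:
    """Squash a single raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-logit))


def softmax(logits: list[float]) -> list[float]:
    """Turn raw class scores into values that are positive and sum to 1."""
    peak = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_loss(confidence_in_true_class: float) -> float:
    """Cross-entropy for one example: small when confident and right, large when confident and wrong."""
    return -math.log(confidence_in_true_class)


probs = softmax([2.0, 0.5, -1.0])      # hypothetical raw scores for three classes
confident_right = log_loss(0.99)       # model put 0.99 on the correct class
confident_wrong = log_loss(0.01)       # model put 0.99 on a wrong class, 0.01 on the correct one

print(probs, confident_right, confident_wrong)
```

These outputs behave like confidences, but as noted above they are proxies: a softmax value of 0.9 is not guaranteed to mean the model is right 90% of the time unless it has been calibrated.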
I actually want to ask a question, just if you can bring back this box. And yeah, this, back to the last slide, I think. This one?
Yeah, we shaped the world before the agent acted. So there's quite a bit of technicality in that. For those less technical people,
let's take drafting a response, Draft email. So let's say it goes into your inbox. You've got a workflow, and it looks at your emails in your inbox, and it comes up with drafts.
I'm trying to think of a really simple example. Looking at each of these, inputs, context, tools, state, checks, could you perhaps give some examples of what they might be?
Right, you're putting me on the spot. It's like a quiz exercise.
Yeah, so let's say you've got an email coming in, and you want your agent to produce a draft. That's the task we've got. So first things first would be selecting the inputs that you have, so the obvious one is selecting the incoming email.
But based on how you want to reply to that person, you might want to give access to different types of documentation, maybe like a style of replying, or maybe certain contacts that you need to give that reply. That would be first, and curating that, so you don't just add stuff that it doesn't need, but actually making sure that you only add the things that it wants or could use for that answer.
Then, I would say for just replying to an email you might not need that many different tools. But if it becomes more technical, you could also add different types of tools. So let's give a simple example.
Let's say you need to calculate something you could give it access to a calculator Maybe you need to calculate like what the price is for a new type of invoice or something like that You can give it access to the calculator to calculate that price, and it comes back and gives you the draft.
And then the state is really trying to structure it in a table, the types of things that you want it to reply with. So you might have, now I'm starting to think about the invoice example,
but for an invoice, let's say you have specific parts of your invoice that you always want filled out, and you always want to make sure that every invoice is structured in the same way for every response. You can track that, and make sure that it's filled in every part of your invoice.
And then at the end, add checks and say it hasn't filled in this one completely right. You can loop back or you can maybe query back or maybe just flag it to you to say, oh, I didn't know how to actually calculate this 2 plus 2 example because I couldn't find the access to the calculator or whatever.
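The invoice example can be sketched as tracked state plus a check that runs before the draft goes out. The field names, the `calculator_multiply` tool and the amounts are hypothetical; prices are kept in whole cents here to avoid floating-point rounding.

```python
REQUIRED_FIELDS = ("customer", "item", "quantity", "unit_price_cents", "total_cents")


def calculator_multiply(a: int, b: int) -> int:
    """Hypothetical tool the agent is allowed to call."""
    return a * b


def check_invoice(state: dict) -> list[str]:
    """Return open issues; an empty list means the draft can be shown to the user."""
    issues = [f"missing field: {name}"
              for name in REQUIRED_FIELDS if state.get(name) is None]
    if not issues and state["total_cents"] != state["quantity"] * state["unit_price_cents"]:
        issues.append("total does not match quantity * unit price")
    return issues


# State: the same table of fields is tracked for every invoice.
state = {"customer": "Acme", "item": "widget", "quantity": 3,
         "unit_price_cents": 500, "total_cents": None}

first_pass = check_invoice(state)      # the check flags what is still missing
state["total_cents"] = calculator_multiply(state["quantity"], state["unit_price_cents"])
second_pass = check_invoice(state)     # after looping back through the tool

print(first_pass, second_pass)
```

If the tool were unavailable, `first_pass` is the message that would be flagged to the user rather than a guessed total.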
Brilliant. Thank you.
Sorry for putting you on the spot. I think it's been a fantastic talk, and I think people will probably be able to ask you even more in-depth questions, or more superficial questions, because I think this model is really helpful for anyone wanting to start developing more complex workflows, and following these steps is really important
to get them working reliably. So thank you very much, Robert.
And can we all give a round of applause? Thank you.