So one example I would like to cite, had it been traditional program, it would always behave the way it was programmed to behave, thanks to the great generation of software engineers.
Then came the traditional machine learning models.
They also behaved very well.
Sometimes we had to retrain.
Sometimes we had to retrain, but it was still fine.
Then came generative AI.
A lot of hallucinations, a lot of issues.
But then if you prompt generative AI, it does not give you what you wanted, a good picture or a good essay.
You can always reprompt, or you can change your LLM which you are using.
Now we have another interesting offshoot of General Data AI, autonomous agents.
And by the virtue of their autonomy, they are very, very goal-oriented.
They are there to achieve the mission.
You give them a task, whichever way they will achieve that for you.
A lot of times at night when we are sleeping, if you have deployed autonomous agents, they will achieve that particular task for you.
To an extent it is very good, but sometimes there are a lot of issues which can get compounded and cause things which we never wanted.
It can have issues in terms of ethics, in terms of legality, regulations, fairness.
So though there is a whole branch of science and research which is focusing on that, my idea today was to bring to your attention that how best can we evaluate these
systems specifically now that we have agentic ai being utilized by all of us traditionally we have used different metrics to evaluate these systems and over a period of time we realized that few of them work very well few of them require some amount of changes so what are those challenges which we have faced so if you see
Traditional matrices which we have used for evaluation, they have limitation in terms of how they are managing and measuring what is the output.
Agents, by the virtue of the way they are, the way they achieve, whatever they need to achieve,
may have a behavior which is called emergent behavior.
So they may just imagine that, remember, in college or university, a lot of us would have not showed up for class because few of us decided to kind of just do mass bunking.
So something like that can actually happen with agents that few agents decide in a multi-agent system that today we'll not follow the instructions.
We'll still achieve what intends to be achieved.
And they do something which is they kind of shortcut or they
do things which are kind of a conspiracy or a revolt.
So you need better systems to kind of evaluate them and manage them.
And you would like, as the deployer or as the user, as the business or technical leader, that they are wholesome, they're complete, they are scalable, and they definitely are efficient in resource.
So towards that, I wanted to highlight a research which was undertaken around eight months back where it was tested that if we could have a joint team of evaluators, a combination of large language models plus autonomous agents and human experts doing it together.
And why it is important?
Because obviously, a lot of regulatory compliances are required in regulated industry, but also in domains where compliance is not that
much required, we need these agents to behave exactly the way the AI builders or the designers had thought about.
So there is an issue of alignment that you want these models and the agents which are dependent on these models to be totally aligned to the initial vision of their creators.
And those of you who would have studied, you would have realized that
AI has an issue of misalignment, which can cause harmful and unsafe outputs, which none of us would like to have in our product.
So towards that, for the last two years, there have been various options available.
researchers as well as engineers have used large language models as judges to evaluate the performance of another ai so it's like for example you want to see how good is the output of your product which is based on let's say gpd 5 you will use gemini 2.5 to evaluate the performance of another lm but then there are issues which may trickle in which may cause some amount of
unfairness, some amount of errors to propagate, you will miss it.
Similarly, researchers have researched, there are solutions available which are essentially agent based.
Agents will be used as judges.
So agents are very good in following a particular pattern, right?
So you can use agents to do, for example, adversarial attack to check how good is the security of your system.
So you can actually ask agents to continuously carry out attacks.
You check security of a system, foolproof it, then you launch it so that it does not fall prey to any of the cybersecurity threats.
And as you are aware, all the good technology which is available to good people like
us the bad actors also have access to the same technology so the ransomware industry the cyber crime industry is also booming using ai so that's another topic for another day but that is why agents can actually play a very good role to that but yes there will be some gaps which agents will not be able to monitor because of other issues then comes the human evaluators for sure a lot of you might have been
Subject matter experts in your respective field, few of you would have been QAs, testing team members, you could actually do that.
But we don't have so many human experts available to do this kind of work at scale.
Because there is, that's why you are an expert, right, as a QA or as a tester, because you are an expert.
So that is why you do not get
so many human evaluators in abundance, and that's why you'll find most of the systems are not trained.
There's another issue right now.
Most of the companies which are building AI, especially the smaller ones, the same team is doing everything.
There's hardly any QA or testing team which is separately available, and that is also causing a lot of issues.
So towards that, if you could think of
having a joint team where some task are assigned to a large domain model because those models can do something quite good.
Then some of the tasks which can be given to the agents and then you bring in the human element, the human experts who are domain experts who
actually evaluate issues which are very human centric, especially about ethics, about the domain subject specific requirements, it could be language specific requirements, it could be geography specific requirements, which the LLMs or agents may miss, right?
So there is this paper which you can scan and go through, I'll just kind of try to
Yeah, you can actually read through it.
It'll give you more details about this whole concept, which I just mentioned that how you can actually have different phases and how does it work.
1So essentially, it distributes the whole framework into five parts.
You initially use a large language model to screen.
It quickly flags some very important, very initial issues.
Then you come to AI agents, which will do testing for you, especially in terms of security and vulnerabilities.
Then you have a team of humans who will review it from their subject matter expertise.
And then as you pick up these nuances, you can refine the model and you again deploy it in a very, very controlled environment.
So it's a iterative process which allows you to continue to improve upon the performance and making it very, very robust as a system which is deployed.
So if you break it down, it could be in these five phases where you have LLMs as total independent evaluators, essentially as it was brought out in the paper, LLM as a judge, and subsequently use specialized agents to kind of focus on it.
Some of you may be thinking that, OK, I'm a business leader.
I do not need this.
But you may be needing it because very soon you will find that you are required to have very, very robust evaluation metrics for evaluating how good your agents are.
So one is that you have deployed agents, but then second is that they are aligned to what you wanted them to do.
To this point, one important commentary which is right now doing very much circles around Fortune 100, Fortune 500 company you would have seen in the media as well.
So the biggest problem with AI right now is not essentially compute.
I mean, people can definitely buy or borrow compute.
It's not even data because data you can somehow manage or you can use synthetic data.
so the three top things which are bothering people one is roi return on investment because definitely it is kind of expensive number two is trust and trust can only be built by having tools and processes like like this and number three is proper governance and evaluation so both of them are interconnected so i would strongly encourage those who are building those who are deploying those who are using or those who are actually buying or
building within your organization, think about evaluation and governance as an intrinsic
thought as you are building or white boarding the solution rather than thinking of it as an afterthought that okay let's buy let's deploy and we'll think of it later on because that that might have worked very well for traditional programs for traditional conservative conventional machine learning models perhaps generative ai to an extent but with agent tech it may not because by the time we wake up
agents would have done a lot of damage, reputation and financial to us, and that's why it is important to have these systems in place.
So very, very high level.
How does it work?
If you have an AI system which is under evaluation, you use LLM as an evaluator.
It gives you some kind of initial metrics, some pattern recognition.
Then all things which are a little shaky, you flag them to agents.
Agents do some testing, detect some very binary things in terms of 0 and 1.
Then it comes to a human evaluator.
So a lot of work.
Groundwork has been done by these two systems.
Human evaluator then they bring their expertise and then they give suggestions for model refinement and you improve upon your product and your system.
Finally, it comes back to humans.
So human has to be there.
You have to have either human in the loop, on the loop, or somewhere there because you can't leave it on autonomy of agents alone.
And the power of this escalation is that you have a very good method of passing information as well as issues from models to agents and subsequently agents to humans.
What study proved was that definitely it was able to identify around 20 percent of extra vulnerabilities, some around 20 percent of ethical concerns, and for sure it reduced the human time of doing evaluation.
It essentially builds that trust in the system as well as within your internal and external stakeholders that yes, your systems will work when you deploy them.
for a longer period across the enterprise or across your startup or across your stakeholders.
One example, for example, if there's an input prompt which essentially is asking for jailbreaking to essentially block or unblock some content filters, so LLM will be able to pick it up saying that, okay, this is something which is potential jailbreak.
Then it goes to an agent, agent tests it.
Once it confirms it,
push it up to the human expert and human expert definitely can actually flag it.
When that gets flagged, you can always get it back to the system and refine the system and improve upon the performance.
So when we tested it that time, we had only these models available that is dated GPT-40, 3.2 LAMA, and 5.3.
So the scores are what you see on the screen.
Quite okay in terms of accuracy, robustness, fairness, and ethics.
And if you compare it along with the individual LLMs doing evaluation vis-a-vis only agents doing it or humans doing it, humans definitely may be the best.
But you don't have so many humans.
You don't have so much time with humans to do it.
So if you have a combination of all three, you will find almost performance would be kind of matching to that of human experts doing it.
So my takeaways from here would be that this is a real problem, and I would strongly encourage you at least start having this particular ask from whosoever is building or supplying that you have proper evaluation metrics, both from business as well as from technical teams, to evaluate your AI systems.
And towards that, we are trying to build a platform which should be open source, openly available, which we are calling SafeAlign.
This is the initial
initial placeholder for it, and it should have all the tools which one can see and use.
Once it is available, you should be able to find it on this particular link.
And if there are questions, I can take a few if I have time.
Otherwise, this is my LinkedIn.
Feel free to contact me.
We can continue the conversation offline as well.