Joint Evaluation (Jo.E): A Collaborative Framework for Rigorous Safety and Alignment Evaluation of AI Systems Integrating Human Expertise, LLMs, and AI

Introduction: From Traditional Software to Generative AI

So one example I would like to cite, had it been traditional program, it would always behave the way it was programmed to behave, thanks to the great generation of software engineers.

Then came the traditional machine learning models.

They also behaved very well.

Sometimes we had to retrain.

Sometimes we had to retrain, but it was still fine.

Enter Generative AI and Autonomous Agents

Then came generative AI.

A lot of hallucinations, a lot of issues.

But then if you prompt generative AI, it does not give you what you wanted, a good picture or a good essay.

You can always reprompt, or you can change your LLM which you are using.

Autonomy, Goal-Seeking Behavior, and New Risks

Now we have another interesting offshoot of General Data AI, autonomous agents.

And by the virtue of their autonomy, they are very, very goal-oriented.

They are there to achieve the mission.

You give them a task, whichever way they will achieve that for you.

A lot of times at night when we are sleeping, if you have deployed autonomous agents, they will achieve that particular task for you.

To an extent it is very good, but sometimes there are a lot of issues which can get compounded and cause things which we never wanted.

Ethics, Legality, and Fairness Concerns

It can have issues in terms of ethics, in terms of legality, regulations, fairness.

Why Evaluation Frameworks Must Evolve

So though there is a whole branch of science and research which is focusing on that, my idea today was to bring to your attention that how best can we evaluate these

systems specifically now that we have agentic ai being utilized by all of us traditionally we have used different metrics to evaluate these systems and over a period of time we realized that few of them work very well few of them require some amount of changes so what are those challenges which we have faced so if you see

Limits of Traditional Metrics

Traditional matrices which we have used for evaluation, they have limitation in terms of how they are managing and measuring what is the output.

Emergent Behaviors in Multi-Agent Systems

Agents, by the virtue of the way they are, the way they achieve, whatever they need to achieve,

may have a behavior which is called emergent behavior.

So they may just imagine that, remember, in college or university, a lot of us would have not showed up for class because few of us decided to kind of just do mass bunking.

So something like that can actually happen with agents that few agents decide in a multi-agent system that today we'll not follow the instructions.

We'll still achieve what intends to be achieved.

And they do something which is they kind of shortcut or they

do things which are kind of a conspiracy or a revolt.

What Leaders Need from Agentic Systems

So you need better systems to kind of evaluate them and manage them.

And you would like, as the deployer or as the user, as the business or technical leader, that they are wholesome, they're complete, they are scalable, and they definitely are efficient in resource.

A Joint Evaluator Approach: LLMs, Agents, and Humans

So towards that, I wanted to highlight a research which was undertaken around eight months back where it was tested that if we could have a joint team of evaluators, a combination of large language models plus autonomous agents and human experts doing it together.

And why it is important?

Alignment and Compliance as Core Drivers

Because obviously, a lot of regulatory compliances are required in regulated industry, but also in domains where compliance is not that

much required, we need these agents to behave exactly the way the AI builders or the designers had thought about.

So there is an issue of alignment that you want these models and the agents which are dependent on these models to be totally aligned to the initial vision of their creators.

And those of you who would have studied, you would have realized that

AI has an issue of misalignment, which can cause harmful and unsafe outputs, which none of us would like to have in our product.

Existing Approaches: LLM-as-Judge, Agents-as-Testers, and Human SMEs

So towards that, for the last two years, there have been various options available.

researchers as well as engineers have used large language models as judges to evaluate the performance of another ai so it's like for example you want to see how good is the output of your product which is based on let's say gpd 5 you will use gemini 2.5 to evaluate the performance of another lm but then there are issues which may trickle in which may cause some amount of

unfairness, some amount of errors to propagate, you will miss it.

Similarly, researchers have researched, there are solutions available which are essentially agent based.

Agents will be used as judges.

So agents are very good in following a particular pattern, right?

So you can use agents to do, for example, adversarial attack to check how good is the security of your system.

So you can actually ask agents to continuously carry out attacks.

You check security of a system, foolproof it, then you launch it so that it does not fall prey to any of the cybersecurity threats.

Security Testing and Adversarial Evaluation

And as you are aware, all the good technology which is available to good people like

us the bad actors also have access to the same technology so the ransomware industry the cyber crime industry is also booming using ai so that's another topic for another day but that is why agents can actually play a very good role to that but yes there will be some gaps which agents will not be able to monitor because of other issues then comes the human evaluators for sure a lot of you might have been

The Human Bottleneck

Subject matter experts in your respective field, few of you would have been QAs, testing team members, you could actually do that.

But we don't have so many human experts available to do this kind of work at scale.

Because there is, that's why you are an expert, right, as a QA or as a tester, because you are an expert.

So that is why you do not get

so many human evaluators in abundance, and that's why you'll find most of the systems are not trained.

There's another issue right now.

Most of the companies which are building AI, especially the smaller ones, the same team is doing everything.

There's hardly any QA or testing team which is separately available, and that is also causing a lot of issues.

Combining Strengths: Who Does What

So towards that, if you could think of

having a joint team where some task are assigned to a large domain model because those models can do something quite good.

Then some of the tasks which can be given to the agents and then you bring in the human element, the human experts who are domain experts who

actually evaluate issues which are very human centric, especially about ethics, about the domain subject specific requirements, it could be language specific requirements, it could be geography specific requirements, which the LLMs or agents may miss, right?

The Five-Phase Evaluation Framework

So there is this paper which you can scan and go through, I'll just kind of try to

Yeah, you can actually read through it.

It'll give you more details about this whole concept, which I just mentioned that how you can actually have different phases and how does it work.

1So essentially, it distributes the whole framework into five parts.

Phase 1: LLM Screening

You initially use a large language model to screen.

It quickly flags some very important, very initial issues.

Phase 2: Agent-Based Security and Stress Testing

Then you come to AI agents, which will do testing for you, especially in terms of security and vulnerabilities.

Phase 3: Human SME Review

Then you have a team of humans who will review it from their subject matter expertise.

And then as you pick up these nuances, you can refine the model and you again deploy it in a very, very controlled environment.

Phase 5: Iterate for Robustness

So it's a iterative process which allows you to continue to improve upon the performance and making it very, very robust as a system which is deployed.

LLM-as-Judge and Specialized Agent Focus

So if you break it down, it could be in these five phases where you have LLMs as total independent evaluators, essentially as it was brought out in the paper, LLM as a judge, and subsequently use specialized agents to kind of focus on it.

Why Business Leaders Should Care

Some of you may be thinking that, OK, I'm a business leader.

I do not need this.

But you may be needing it because very soon you will find that you are required to have very, very robust evaluation metrics for evaluating how good your agents are.

So one is that you have deployed agents, but then second is that they are aligned to what you wanted them to do.

What’s Really Blocking AI Value: ROI, Trust, Governance

To this point, one important commentary which is right now doing very much circles around Fortune 100, Fortune 500 company you would have seen in the media as well.

So the biggest problem with AI right now is not essentially compute.

I mean, people can definitely buy or borrow compute.

It's not even data because data you can somehow manage or you can use synthetic data.

so the three top things which are bothering people one is roi return on investment because definitely it is kind of expensive number two is trust and trust can only be built by having tools and processes like like this and number three is proper governance and evaluation so both of them are interconnected so i would strongly encourage those who are building those who are deploying those who are using or those who are actually buying or

Bake Evaluation and Governance In—Not After

building within your organization, think about evaluation and governance as an intrinsic

thought as you are building or white boarding the solution rather than thinking of it as an afterthought that okay let's buy let's deploy and we'll think of it later on because that that might have worked very well for traditional programs for traditional conservative conventional machine learning models perhaps generative ai to an extent but with agent tech it may not because by the time we wake up

agents would have done a lot of damage, reputation and financial to us, and that's why it is important to have these systems in place.

So very, very high level.

How does it work?

How the Escalation Pipeline Works in Practice

If you have an AI system which is under evaluation, you use LLM as an evaluator.

It gives you some kind of initial metrics, some pattern recognition.

Then all things which are a little shaky, you flag them to agents.

From LLM Flags to Agent Tests to Human Oversight

Agents do some testing, detect some very binary things in terms of 0 and 1.

Then it comes to a human evaluator.

So a lot of work.

Groundwork has been done by these two systems.

Human evaluator then they bring their expertise and then they give suggestions for model refinement and you improve upon your product and your system.

Keep Humans in the Loop

Finally, it comes back to humans.

So human has to be there.

You have to have either human in the loop, on the loop, or somewhere there because you can't leave it on autonomy of agents alone.

And the power of this escalation is that you have a very good method of passing information as well as issues from models to agents and subsequently agents to humans.

Measured Benefits: More Issues Found, Less Human Time

What study proved was that definitely it was able to identify around 20 percent of extra vulnerabilities, some around 20 percent of ethical concerns, and for sure it reduced the human time of doing evaluation.

It essentially builds that trust in the system as well as within your internal and external stakeholders that yes, your systems will work when you deploy them.

for a longer period across the enterprise or across your startup or across your stakeholders.

Example: Detecting and Confirming Jailbreak Attempts

One example, for example, if there's an input prompt which essentially is asking for jailbreaking to essentially block or unblock some content filters, so LLM will be able to pick it up saying that, okay, this is something which is potential jailbreak.

Then it goes to an agent, agent tests it.

Once it confirms it,

push it up to the human expert and human expert definitely can actually flag it.

When that gets flagged, you can always get it back to the system and refine the system and improve upon the performance.

Early Results Across Models

So when we tested it that time, we had only these models available that is dated GPT-40, 3.2 LAMA, and 5.3.

So the scores are what you see on the screen.

Quite okay in terms of accuracy, robustness, fairness, and ethics.

And if you compare it along with the individual LLMs doing evaluation vis-a-vis only agents doing it or humans doing it, humans definitely may be the best.

But you don't have so many humans.

You don't have so much time with humans to do it.

So if you have a combination of all three, you will find almost performance would be kind of matching to that of human experts doing it.

Key Takeaways and What’s Next

So my takeaways from here would be that this is a real problem, and I would strongly encourage you at least start having this particular ask from whosoever is building or supplying that you have proper evaluation metrics, both from business as well as from technical teams, to evaluate your AI systems.

Ask for Robust, Multi-Layer Evaluation Metrics

SafeAlign: An Open-Source Platform in Progress

And towards that, we are trying to build a platform which should be open source, openly available, which we are calling SafeAlign.

This is the initial

initial placeholder for it, and it should have all the tools which one can see and use.

Once it is available, you should be able to find it on this particular link.

Q&A and Contact

And if there are questions, I can take a few if I have time.

Otherwise, this is my LinkedIn.

Feel free to contact me.

We can continue the conversation offline as well.