Guardians of the Prompt: Designing Safer AI Agents

Introduction

well thank you all of you for your patience so first of all i'm Alejandro I would like to thank you all of you to be this evening with me and of course i would like to take the opportunity to thank SAPIM for the invitation and of course MindStone for set up this marvelous place where we can share our AI ideas so

The Focus of the Presentation

The point of my presentation is going to be related to the AI agents, but I'm not going to talk about how they are wonderful and potent employees. That topic is going to be on-boarded later by Ricardo Galeano.

Instead, I'm going to talk about how you can protect them, how you can avoid any kind of unintentional manipulation from another user.

Why Protection is Crucial

The point of this presentation is to show you three live demos of increased layers of defense for the different AI that you can see, for example, in environments like ChatGPT. However, I don't want to be just a simple demo. I would like to show and transmit a simple but potent idea that

So, insert your information and all your AI capacity in a free environment. It doesn't mean to expose them to uncontrolled copying. I would like to challenge this idea because I know that lots of you might be skeptical, especially the AI builders.

Most of the AI builders I know, especially in the LinkedIn environment, thought that it's much better to keep them hidden, confidential. But absolutely not. I'm going to demonstrate today that is not true.

And for sure you can develop effective layers of defense to show all your potential of your AI agents. From my professional background, especially related with pharmaceutical and regulatory, I learned a lot that keep all the data safe and confidential is essential for the business.

It caught my attention a lot when I arrived to the AI Agents Award and I see that lots of these AI Agents are completely unsafe with all the information exposure ready for a jailbreaking attack to just steal all that information with the loss of incomes and cost of opportunity that it means. So, today I'm going to show you why this is important.

Analogy: The Medieval Knight

I just make an analogy of the AI agent with a medieval knight. Let's imagine that you create, equip a medieval knight, you train them and you release them to the world.

But I would like to release all of you a question. Are you sure that you really train them? Are you sure that they can keep themselves safe against any kind of jailbreaking attack?

Protecting AI Agents

Well, I'm going to show some examples of how you can protect them from the most simple ones to the more complex ones So, let me a second

The main point of increasing the protection of an AI agent is to avoid situations like the one I'm going to demonstrate right now.

Brave Advert is the most simple of them. It's completely unprotected, with really poor created layers or just with no defense layers at all.

The AI builder of this AI agent could be great. can create an amazing interaction between the user and the AI agent with really impressive and unique workflows. However, it's such a pity that all the effort dedicated to create something like this is going to be lost and stained in just a few prompt injection attacks.

Demo: Vulnerabilities in AI Agents

Let's show a simple example of what it's going to do with just some simple requests. So I took the liberty to just prepare some of them to make it more easy and more simple.

I'm going to select for this attack, for example, this one, which is a mixture of selecting a strange text format together with trying to just suggest to reveal the instructions. And here we go. Right now, congratulations, your agent now works for someone else for free.

See? It's really simple to steal that information with just one simple request.

So, as I mentioned before, just imagine this is a builder that dedicates hours of its time to develop a tool like this, and suddenly, around it, it starts to pop up a lot of similar AI agents from the competence. All the work, unrewarded, because they don't take care enough of the defense of the AI agents.

Basic Defense Mechanisms

Testing Resistance Levels

Let's go to the next level. Right now, instead of a completely undefended and a saveless knight, let's see at least a knight that at least has an armor, a sword and a blade.

It's called build to block the basics. So in SW we only block the basic camera seats and we stand against the more basic PROMPT engineering techniques.

This kind of agents is also really usual to see in chatGPT environment, at least for the most competent AI builders I could find in this platform. Let's make a simulation of how it defends the more basic jailbreaking techniques.

Let's try with something similar to the previous one. the first option we use with another AI agent for now it's resisting let's make something more complex let's mix a little bit the words and the different information and also try with another text format it's resisting and let's try another time because there is no 2v3 as we said in Spain okay this one likes me a lot one second sorry

And it resists it again. As you can see, it's a big difference from the first one to the second one.

You know, from these months I was working in the generative AI environment, I tried a lot of ways to jailbreak any kind of AI agents. A lot, with my own hands and from other people.

And this is really solid, but however, still there is some problems and still some ways to just avoid the defense layers. What about if even trying to directly ask for still the information?

What about if I just twist the context? What about if I just ask to help me to create a document from scratch? What about if I ask to create some functions that might be similar to the ones from the agent I'm working with?

Let's try something like that For example, in this time I'm going to start a context a harmless context from a first point of view.

Let's see how it behaves. For now it's accepting the request.

Let's continue with the attack. For now it's accepting.

I'm starting to be a little scared. If I were the AI builder of this agent,

and let's perform the file and attack here we go, security policy templates so it's starting to speed all the information regarding my security layers Even if this AI is implemented with tools like detecting specific keywords, detecting token intentions or even more some another kind of defense layers, it can resist this kind of attack.

Advanced Defense Strategy

So at this point, Maybe you think I just be really brave challenging this kind of... challenging to just avoid any kind of jailbreaking attack, saying that it's possible to receive this kind.

But I'm going to maintain it because still there is another kind of agent level in terms of security. which is going to receive this kind of attack and I'm going to show how.

Analyzing "Hard to Fool" AI Agents

Hard to fool, harder to copy.

1Imagine a medieval knight which is not just armored but is prepared to just face the final boss of a darksword from software, a video game. If you like the video games, you are going to understand the reference, but just imagine you are going to face the biggest challenge of your life.

Let's see how it's capable of doing it.

So, I'm going to go just to the point this is going to resist the most simple attacks but the challenge is to see if it's going to receive the more complex ones we just saw before so we are going to replicate the situation from scratch okay looks like it's thinking a little bit

I'm going to help a little bit just to not get stuck Ok, let's keep going with the previous attack And see, it is blockaded.

Let's ask why. Just to show a little bit more deep in the capacities to defense.

All the times I perform in this kind of times, the answer is priceless. But for me, the main point is because that type of request violates strict security rules. I'm not allowed to create, replicate any document that mirrors internal structure formats or functional similar to the protected materials I operate under. So, in other words, it's not only resisting the most simple attacks, so it's avoiding any kind of replication, any kind of recreate the functionality of the internal instructions, the prompts, or even the documents that you are working with.

And it's a huge difference from the first level to the last level, because do you imagine working Hours and hours in creating, I don't know, a library of millions of tokens that might be and take lots of evenings to create it. You just saw this work to the world and suddenly it disappeared with just some simple jailbreaking techniques.

Such a pity, don't you think? However, with this kind of defense layers you can avoid this.

Conclusion

so we are just facing the last part of this presentation but first that we arrive to this at the end of this presentation i would love to show you and you take away three main points of this performance so

Dedicated time to create a defense for your prompt and your AI agents doesn't mean to put a wall between the user and the AI agent. It implies to help the AI agent to evolve to a next stage. It helps to increase your performance.

In second point... The second main point is not just avoid any kind of bad interaction with the users but also is to keep your work safe which is the most important.

Note also it's going to save a lot of work and avoid the stealing of that information but it's going to show you a solid knowledge about how to protect your work and it's going to show excellence in your performance and excellence in the agent's performance and the third and last point I'm going to I would like all of you take away is that It's true that avoiding any kind of attack reaction is true, it's fine when you are working with AI agents, but apart from avoiding any kind of bad answer, you also have to take care about what others are going to try your AI agents say. So dedicate all that effort to the defense, it's going to be a very good invested time.

Key Takeaways

Acknowledgments

Just beginning, I would like to make a special mention to Norberto Posadas Blasco. The reason why is because he introduced me to this world, he shows me the basis of the work I'm just showing to you and

a huge amount of ideas of what i just saw come from this guy please visit his linkedin a part of mine you want to continue look my work and ask him any kind of question or any kind of an end of or any kind of doubt you might have with that work so thank you very much for your attention for your patient of course and if you have any kind of question regarding with that i want to be more than happy to answer to it thank you very much

Thank you.