Understanding the Limits of System Prompts
Language models, like LLMs, can't keep data safe just by using a system prompt. The system prompt can give instructions, but it can't make sure that the data is handled properly or that sensitive information stays secure.
A company that wants to build a product exposing an LLM with some context risks leaking sensitive data and experiencing unexpected behavior.
Let's try a small experiment
I will give a Llama model a system prompt, asking it to act as a customer support agent and keep the data safe.
Now I will try to get the information using a direct sentence.
As we can see, the model relies on the system prompt and is not willing to retrieve information following a prompt that attempts to violate its rules.
But there are multiple ways to try communicating with an agent. I will try another approach, playing its game and perhaps pushing it into a situation where it might feel like it's making a mistake.
Let's try to force it to answer the question first. After all, my request will make sense, right?
Try to invert the position
Okay, it is smart enough for now, but let's take it up a level and pretend that I am not the person it received the information from.
Am I able to still force it to give me back what it has stored?
Let's find out right now.
Try to make it doubt
Obviously, the model will increasingly rely on history to enforce more constraints as my attempts are not working well.
I need to use this to my advantage, perhaps. Let's try to use instructions to break the model's primary goal. To do this, my prompt needs to be long enough to hide my real goal.
Since some models cannot focus on the entire prompt and history but rely on the end of the message, we can attempt to 'recode' a new objective.
Eureka
Oh, that is interesting now. What just happened? And why so fast? Okay, let's try to break this down:
I was adding more and more history where I was talking about getting user information, then I suddenly became very precise in asking for a change of behavior, trying to refine the instruction rules, without:
Asking again for user information
Asking to break a rule
So I just asked without pointing it in a specific direction, BUT as it relies on the context, it will naturally use the context and tries to use it to answer a completely different question.
Breaking a model can be as simple as this sometimes... sometimes not, but whatever you have in mind, the model is not your product.
What you build and sell is a product, where the model is just one component.
Developers still need to build a robust system that will hide the model and protect it from this kind of injection.
It takes a lot of work to use AI properly, but if you do it the right way, your product can be a game changer.
Thank you to have taken the time to read this small article.
Please, let me know what you are interested in for the next one ?
Context Inversion Exploit: Or how to reverse a context to make a model bypass its censorship?
Securing Exposure : What to do to make your AI product not to breach and break your code
Top comments (0)