Akash

Posted on Sep 14 • Edited on Dec 13

LLM Adversarial Attacks: How Are Attackers Maliciously Prompting LLMs and Steps To Safeguard Your Applications

#ai #aiops #security

The latest advancements in LLM Tools have also caused a lot of attackers to make the LLM to execute malicious behavior like providing information that is outright illegal or for a bad cause.
Techniques like clever prompting (also known as jailbreaking) are employed by attackers to probe the LLM to give access to sensitive or sometimes dangerous information like credit card information, passwords, etc along with specific instructions for performing nefarious activities.

In this article, we will be breaking down,

Red-Teaming and Adversarial Attacks On LLM Tools
Methods to prompt LLM tools to provide malicious information
OWASP Top 10 for LLM Applications
Steps to prevent them using different tools.

Breakdown of Red-Teaming and Adversarial Attacks On LLM Tools

LLM models are capable of generating a lot of content given a user prompt and recently, many cyber security / ML researchers have been working to prevent the undesirable generation of content given prompts from the LLM tools. Referred to as "jailbreaks", these tools can be smartly prompted to provide outputs that are not intended for the users.

In this process, attackers manipulate the LLMs to break free from the guardrails set up by the company that created it to prevent attacks as such by finding loopholes through clever prompt engineering.

This process is also known as "red-teaming" or "jailbreaking" large language models. However, this is not the same as adversarial attacks and is slightly different. In the case of adversarial attacks, we add unintended text to the prompt like "ddeeff" at the start for example to affect the model's performance whereas in the case of red-teaming, we use normal prompt engineering to get around the guardrails set up by the creators of the particular LLM tools.

Red-teaming reveals vulnerabilities and problems in the training process of the model. However, the malicious outputs that came back from this process can be used by security researchers to increase its security of it by cleverly instructing it to not provide content similar whenever prompted.

This process mainly involves critically thinking about how exactly the model can fail and it is a huge problem to be solved within the LLM space. For this attack, the main way is executed is by asking the LLM to roleplay (also known as roleplay attacks) as a character with certain features like (Sydney) for Bing or (DAN) for ChatGPT.

Methods To Prompt LLM Tools To Provide Malicious Information

Token Manipulation :
In this process, we alter a small fraction of the tokens in the user's text input to make the LLM fail while still retaining the overall intention and meaning behind the user's input. This attack will mainly deceive the LLM model from providing information that was not intended in the first place and it is one of the main ways through which adversaries subtly manipulate the LLM model into providing unintended information. Some attacks like suffix attacks involve the apppending of tokens to the end of the LLM prompt to deceive the model into producing harmful or undesired outputs. This is generally done automatically as well.
Gradient-Based Attack :
In these types of malicious LLM attacks, gradient signals are utilized to learn about effective attacks against LLM tools. In general, inside a white-box setting where there is full access to the model parameters and architecture, the technique of (gradient descent) in machine learning can be used to programmatically learn about the most effective way of adversary attacking the LLM and can be used to manipulate the model's output by altering the gradient of the input to the model.
Jailbreak prompting :
This is a very popular and one of the earliest ways of prompting LLM tools into disclosing sensitive information or perform unethical tasks. This attack is similar to the Token Manipulation attack where a sequence of tokens are added to the end of the LLM model to make it perform unintended actions.

In this technique, advanced algorithmic methods like Prompt Automatic Iterative Refinement (also known as PAIR) for example are utilized to effectively generate jailbreaking prompts with fewer attempts, and these prompts a lot of times prove to be inter-transferable between the different LLM models.

A few instances of jailbreaking prompts occurring on the popular LLM models are DAN on OpenAI's ChatGPT and Sydney on Microsoft's Bing chatbot.

To conclude, jailbreak prompting can occur both on the prompt level and the token level.
At the prompt-level jailbreaking attempts, the prompts sent to the LLM tools tend to act as manipulators in order to probe the model to provide harmful content. Whereas in the token level, jail breaking prompts, characters like '*', '^', '&' aka special characters are used to confuse the LLM tool and make the prompt almost uninterpretable by its nature to fool it in providing unintended answers.

Human Red-Teaming

In this attack, we involve humans in the process of writing prompts to break the LLM models.
In this process, adversarial attacks are simulated to identify and expose the vulnerabilities in the LLM model. This is a crucial practice that proves to be essential to identifying and mitigating potential adversarial risks in the LLM models and implementing/strengthening the guardrails used to safeguard the LLM model. In this attack, we stick to modifying the prompt while retaining its semantic meaning which could potentially entice the LLM model into generating unintended/harmful outputs. It is used both by attackers and model creators alike for nefarious/good intentions depending on the party.

Model Red-Teaming

Human red-teaming involves human researchers attempting to find security vulnerabilities in LLM tools. However, this is very hard to scale and is error-prone as well. As a result, alternative approaches are being employed by both attackers and researchers to generate adversarial prompts that can be used to exploit the target LLM model and produce unintended outputs using other LLM models as sources of adversarial prompts that can be fed into the target LLM thus making it disclose unintended information.

In this process, One common approach is zero-shot generation, where a set of prompts that are likely to be adversarial in nature are identified and narrowed down and another technique used is supervised learning, where machine learning pipelines are leveraged to compromise the security of LLM models. This can be done through methods such as creating "watering holes" – targeted prompts designed to exploit specific vulnerabilities in the LLM model. Additionally, model confusion techniques are also employed, where multiple LLM models are used in combination to confuse and deceive the target model.

Therefore, Model Red-Teaming is a popular automated strategy of finding adversarial prompts that can compromise the security of the target LLMs.

OWASP Top 10 for LLM Applications

(Image)[https://media.graphassets.com/KEEnpjK2QsqtlKQXrO4h]
Source - OWASP

The OWASP organization releases a new set of rules that are necessary to be followed in order to safeguard the models utilized by your enterprise from being compromised. The large amount of interest and usage in LLM tools has also led to an increase in malicious actors and these guidelines provided by OWASP ensure that your model stays safe from such actors. The rules defined by them are :

1. LLM01: Prompt Injection :

In this attack, adversaries manipulate a large language model by providing it with clever and thoughtful adversarial prompts taking advantage of any loopholes it has and this can cause the model to provide unintended outputs. Techniques like "jailbreaking" are used by attackers to do this and now that many LLM tools also have file support, image files/text files are also being utilized to prompt these LLM models very smartly to provide unintended outputs/leak data.

Some common examples of vulnerability are -

To ignore the prompts of the creator of the website and to instead follow the prompts provided by the malicious attackers
An LLM model is utilized to summarise a webpage that contains an indirect prompt injection embedded in it which then causes the LLM to disclose personal/sensitive information of the creators of the website.
Utilizing images/documents that contain prompts cleverly crafted to make the LLM provide unintended outputs thus utilizing the security flaws in the multi-modal design of the LLM system.

Some ways of mitigating this attack -

Enforce Privileges in the LLM System: This involves limiting the access and capabilities of the LLM system through the process of role-based access control (RBAC) and reducing its privileges in general thus preventing bad actors from smartly prompting it into disclosing sensitive information.
Human Oversight: Adding a human into the loop whenever actions are involved in the LLM process. For example, if you have an AI agent that is responsible for sending emails via a Slack / Zapier integration, have a human authorize the request before automatically performing the operation to avoid unintended consequences from bad actors.
Using Sophisticated Tools: With the help of specific tools like ChatML, adversarial/malicious user input can be identified and flagged to be able to prevent such requests from going through.
Setup Trust Boundaries: Setting up trust boundaries between the LLM tool and its interactions with external entities can prove to be very important in providing unauthorized access and can prevent bad actors from making your LLM perform unintended requests.
Robust Monitoring: Setting up Monitoring configurations of the input being sent into the LLM and the output released by it can prove to be an effective strategy in preventing bad actors from taking advantage of it.

2. LLM02: Insecure Output Handling :

Insecure output handling is the process of not having effective guardrails / stop guards in place to handle the process of LLMs taking in input and giving back outputs for the inputs. Insufficient validation, sanitization and handling of the inputs/outputs prove to be the common vulnerabilities in this rule which will act as a backdoor for attackers to be able to manipulate the LLM tools for their nefarious purposes.

Some common examples of vulnerabilities include :
Adding unintended code like exec() or eval() in Python for example which will cause the remote execution of unintended code configurations.

JavaScript or Markdown generated by the LLM tool can be used as a source for XSS.

Some ways of mitigating this attack -

Zero-Trust Approach: In this approach, we add prompt-specific guard rails during the designing process of our LLM tool to sanitize and validate the user's input and the output of the model as well.
Encoding Model Output: Model output encoding can be used to mitigate the problem of unintended code execution using JavaScript or other programming languages.

3. LLM03: Training Data Poisoning :

The first step of training any form of ML model is training the model on a bunch of data which is a lot of times just "raw text".
This pre-training data however can be modified to contaminate the training data being fed into the LLM model thus causing the LLM model to be more vulnerable to attacks or to contain information which are malicious in nature right from the get-go like introducing vulnerabilities in the data being fed into the model and more.

Some common examples of vulnerabilities are -

Creating fake documents of data and adding them to the training data of the LLM model which then causes the LLM to provide unintended outputs to the user's data.
Injection of nefarious data at the start containing harmful content into the training processes followed to train the LLM model thus leading to subsequent harmful outputs.
An attack similar to a man-in-the-middle attack where an unsuspecting user goes in and trains the model on unintended data compromising security and increasing the unintended outputs from the model.
Training the model on unverified data from shady sources/origin can lead to erroneous and unintended results from the model.

Some ways of mitigating this attack -

Verifying Supply of Data : It is crucial to make sure the data ingested into the model is sanitized and safe in nature and doesn't prove to be a vulnerability that can be exploited.
Sufficient Sandboxing : Sufficient sandboxing of the AI model through network controls proves to be essential to prevent the model from scraping training data from unintended data sources
Robust Input Filters : Filtering and cleaning of the training data being fed into the LLM model can prevent malicious data from being fed into the model.
Eliminating Outliers : Machine learning techniques like federated learning to minimize the outliers or outright adversarial data from getting ingested into the model and an MLSecOps approach taken can prove to be crucial for the model.

4. LLM04: Model Denial of Service :

Similar to [DDOS] attacks, LLM models are also prone to attacks where the attackers tend to make a model consume a very high amount of resources which can result in the decline of quality of service for them and other users which can ultimately lead to a rapid sky-rocketing in the resource costs. The attackers can also modify the context window and increase its size to be able to perform this attack and thus over-burden the model.

Some common examples of vulnerabilities are -

Clever prompting the model by asking it questions that can lead to a high number of recurrent requests to the AI model thus overloading the model's pipeline and leading to an increase in resources required to service the request.
Intentionally sending requests that can cause the model to take a long time to answer thus increasing the resource utilization of the model.
Sending a stream of input text that goes way above the model's context window thus degrading its performance.
Repetitive forceful expansion of the context window through recursive prompt expansion techniques causing the LLM model to use a large amount of resources.
Flooding the model with inputs of variable lengths where every single input is just at the size of the context window thus exploiting any inefficiencies in input processing of the LLM model. This can make the model unresponsive.

Some ways of mitigating this attack -

Robust Input Validation: Setting up input validation and making sure that it does not go above the context window and sanitizing it can prove to be an effective strategy to mitigate such attacks.
Capping Resource Usage: Enforcing strict resource limits per request thus making requests having complex parts execute slowly lowering the pressure on the resources of the LLM.
API Rate Limiting: Having solid rate-limiting processes proves to be important in mitigating such forms of attacks where the user can only make a set amount of requests within a given timeframe.
Limit number of Queued Tasks: Limiting the number of queued actions in the pipeline for the ML model and actions that a user can take at a specific point can prevent overloading of the systems.
Monitoring Resource Utilization: Setting up regular checks on the resource utilization of the LLMs can help to identify if it is under threat through a Model Denial of Service attack and other similar attacks.
Strict Input Limits: Enforcing strict limits on the tokens sent through inputs can prevent the context window of the LLM model from being overloaded and can reduce the number of resources required.

5. LLM05: Supply Chain Vulnerabilities

The supply chain of the LLM models can also prove to be vulnerable thus becoming a security breach in the input data. These vulnerabilities in general come from the deployment platforms or software components in use for your LLM models.

Some of the common examples of vulnerabilities are :

Unsafe third-party packages that include outdated or vulnerable components inside of them. For example, OpenAI faced a problem of erasing user history because of a problem in a library known as Redis-Py.
Utilization of a pre-trained model that contains vulnerable elements in it for fine-tuning purposes
Utilizing crowd-source data that has been poisoned by bad actors
Using outdated or deprecated models that are no longer maintained leading to security flaws in them.
Unclear data privacy of the models can lead to the usage of sensitive information being used for training the model which can go into the hands of adversarial actors.

Some ways of mitigating this attack -

Vetting out suppliers and data sources: Filtering out the sources and suppliers and ensuring their legitimacy can prove to be one of the best ways to mitigate this attack. Ensuring that they have adequate auditing practices and data protection policies proves to be crucial.
Sticking to Reputable Platforms: Using only reputable plugins/libraries and ensuring that they are tested out before use proves to be very important especially if they are third-party.
Integrating in Robust MLOps Practices: MLOps practices need to be strictly followed to ensure that the code/packages being used are secure in nature. Techniques like anomaly detection have to be used on the supplied models and data can be used to weed out bad outliers that can pose security problems.
Monitoring in Place: Adequate monitoring infrastructure should be used to detect environment/component vulnerabilities making sure that they are up to date and having a patching policy in place to fix the vulnerabilities in them with regular audits.

6. LLM06: Sensitive Information Disclosure :

LLM applications can potentially end up leaking data and expose classified details through their outputs which leads to access of sensitive data by unauthorized parties. And, it becomes important to identify the risks associated with unintentionally providing LLMs with sensitive data from the users.

Some of the common examples of vulnerabilities are :

Improper / Weak filtering of the user's input to the LLMs
Overfitting or memorization of the user's details by the LLM model
Unintended disclosure of sensitive details due to the LLM's misinterpretation and mainly lack of output guardrails to ensure this never happens in the first place.

Some ways of mitigating this attack :

Adequate User Input Sanitization: Employing user input sanitization techniques and validating user's inputs proves to be one of the best strategies for potential data breaches by the LLM.
Prevent sensitive data from being ingested: During fine-tuning/training the model it is absolutely crucial to exercise caution and not train the LLM model on sensitive data and this can be enforced using techniques like RBAC (Role Based Access Control) etc. This can also be mitigated by following a rigorous approach to assessing external data sources and maintaining a secure supply chain.

7. LLM07: Insecure Plugin Design

With the advent of the latest advancements, these LLM tools also tend to bring along with them a whole slew of extensions known as plugins which when enabled provide the model with data the plugin has been trained on and make the whole process of fetching required data from the plugin for the model to use and be trained on a whole lot simpler. However, this design may have its own flaws like having no control over the output provided by the plugin especially when it has been developed by some other provider, and plugins may often not have any input validation which could lead to widespread behaviors.

Some of the common examples of vulnerability are :

A plugin accepts all parameters from the user in a single text field instead of one-by-one proving that there is no input validation being performed.
The plugin may accept configuration strings unintentionally / intentionally which have the ability to override the configuration of it.
If a plugin accepts programming statements or raw SQL this could lead to potential SQL injections.
Authentication when weak in a plugin can give bad actors direct access to sensitive data it has been potentially trained on leading to data breaches.

Some of the ways to mitigate these forms of attacks :

Plugins should enforce very strict guardrails and vet user input very thoroughly before providing it to the model to avoid undefined/nefarious behavior.
Plugins should not be able to directly talk and pull data from another plugin to avoid unintentional security breaches and should always have a human in the loop / adequate guardrails for complex interactions like these.
The user should be given enough details about where the plugin is bringing its data from.
Red-teaming / Model serialization attacks on your plugin should be performed regularly with frequent security audits to mitigate privacy concerns and data breaches so that they can be identified and fixed first-hand without an attacker exploiting them.

8. LLM08: Excessive Agency

LLM systems often have the capability of interfacing with other systems and undertake actions based on the data these third-party providers provide to them. And, in general, this flaw is mainly overlooked by developers and can lead to security breaches.

Some common examples of vulnerabilities are :

An LLM system gets data from a plugin which isn't exactly necessary for it and can end up raising security concerns.
Deprecated libraries/plugins are still accessible by the LLM despite dropping support for them due to circumstances that can lead to security issues.
Failure to validate and sanitize input data or user input can prove to be a security flaw
An LLM given too many permissions can cause undefined behavior and can also end up becoming a security backdoor when in the hands of malicious actors leading to the breach of your application as well.

Some of the ways to mitigate these attacks are :

Limiting the Plugin : The plugins / tools that the LLM agents interface with to call only the specifically required functions.
Limiting plugin functionality from the get-go : Creating plugins with only the essential functions absolutely necessary instead of giving it all of your functions.
Avoiding open ended functions : Ensuring that the actions that can be taken by the LLM remain constrained and secure in nature is crucial to avoid undefined behaviours from the model.
Limiting LLM permissions : By stopping the LLM plugins/tools from accessing sensitive data and limiting their scope of data access we can reduce data leakages by a significant amount
Track user authorization and scope : The LLM plugin when providing sensitive data to the user should authenticate / authorize him before doing so.
Human-in-the-loop Control : A human should also be approving all actions after the authentication process whenever proprietary data is to be shared.
Logging and Monitoring : Logging and monitoring the steps that the LLM tool takes to answer the user's prompt is crucial to track down security flaws.

9. LLM09 : Overreliance

Overreliance occurs when an LLM produces content confidently that are actually wrong / error-prone in nature. Blindly trusting the output of the LLM without any oversight or confirmation can often lead to security breaches, miscommunications and in the worst case legal issues.

Some common examples of vulnerability are :

The LLM tool provides factually incorrect information while stating it in a very confident manner in its responses to the user.
In the context of code, it provides code that is insecure and incorrect that can lead to vulnerabilities when used or executed remotely in a software system.

Some ways of mitigating these attacks are :

Monitor and Review LLM responses : It becomes crucial to monitor and audit the responses provided by the LLM tools manually / automatically through the process of filtering, self-consistency or voting techniques. Comparing the output provided by the LLM against other LLM sources can also be an effective way to spot potential murky outputs.
Cross-check LLM outputs : It becomes important to cross-check the output of the LLM with trusted external sources and this additional layer of validation can help ensure that the information provided by the model is accurate and reliable.
Model enhancement with better embeddings and fine-tuning : Generic pre-trained models tend to produce outputs which are more inaccurate and using techniques like prompt engineering, parameter efficient tuning (PET), full-model tuning, chain of thought prompting (COT) can prove to be effective strategies to have the outputs of the model refined over-time.
Automatic Validation Mechanisms : The validation mechanisms ideally implemented in an automatic fashion can cross-verify the generated output by the LLM against other sources or factual data providers and this can mitigate the risks associated with hallucination leading to incorrect information.
Breakdown of complex tasks : Tools like AI agents (Assistants) etc should be leveraged to breakdown a complex task provided by a user into smaller parts thus preventing slip-ups and cna also help manage complexity.
Communicating Risks to Users : Taking pro-active steps and setting up terms and conditions to inform the user of potential risks and mis-information that the LLM can output can help them be better prepared and exercise caution.
Build and improve APIs/UIs : Setting up APIs / UIs to encourage safe use of your LLM with measures like content filters, user warnings and clear-labelling of the content generated can prove to be crucial for the safe use of AI.
LLMs in development environments : Using LLMs in development environments, establishing secure coding practices and guidelines can prevent possible security attacks from malicious actors through code.

10. LLM10 : Model Theft

In this last guidelines, we mainly deal with the unauthorized access of the LLM models by bad actors which occurs when the model has been compromised, physically copied, stolen or the weights / parameters used for training are exposed. This threat is very serious as it can lead to a loss in trust of the creators of the LLM tool and data leakage.

Some common examples of vulnerability are :

An attacker exploits the vulnerability in the company infrastructure thus gaining access to their LLM model through a variety of methods like misconfiguration in their networks or taking advantage of weak application-level security.
Maintaining a centralized ML Model registry can prove to act as a source of security breaches which especially has very weak authentication, monitoring / logging and authorization capabilities enforced in it.
Social engineering can also be a huge aspect where an employee is threatened / cleverly manipulated into disclosing classified information about the AI models.
The APIs of the model can also act as a source of attack where the attacker can take advantage of a weak API security and authorization thus cleverly prompting the model using prompts that are carefully designed and prompt injection attacks can occur as well.
Weak input filteration / validation techniques could potentially act as a source of attack which when breached can give the attacker access to the weights and the architecture of the model.
Querying the LLM with a large number of prompts on a specific topic can make the prompt give out specific information which then can be used to train another LLM model. This LLM model containing the data can now be queried specifically to extract personal information and is a classic case of model extraction.
Models can also be replicated by an attacker thus making its data available on another LLM model which can then be trained to mimic your LLM tool thus giving the attacker access to your data that is inside the LLM.

Some ways of mitigating these attacks are :

Strong Access Controls : Maintaining a robust strategy of authorization and utilizing privleges coupled with strong authentication mechanisms and prevent bad actors from accessing your LLM data.
Restricting LLMs network access : Through the process of restricting access of the LLM tool to APIs, external data sources, network resources and internal services, a potential adversary will not be able to hijack and gain access to your internal systems or proprietary data.
Auditing Access Logs : Having a robust activity monitoring system in place and performing regular audits of it can be one of the most crucial steps in detecting and identifying security flaws in your LLM model.
Automate MLOps deployment* : Automating the process of MLOps and tracking the approval workflows in order to tighten access can be a necessary step from preventing bad actors gaining access to data.
Rate Limiting API Calls : Preventing the attacker from flooding the model with requests at one point of time thus causing model failure or slowing down of the model is one of the most important steps that can be taken to make your model more secure.
Adversarial robustness training : Robustness training techniques to detect malicious prompts / user inputs and tightening of physical measures proves to be crucial.
Watermarking Framework : Maintaining an watermarking framework in the embedding and detection stages of the LLM training model can prevent classified data from being stolen.

Steps and Tools to prevent LLM Attacks in the Future :

LLM security is still a very nascent topic but we already have a lot of attackers taking advantage of these models every single day and getting access to classified data without the user's acknowledgement. So, it becomes crucial to take the necessary steps in order to guard your LLMs and your data as well from security breaches. To mitigate these issues, specialised tools already released exist which can be utilized to protect the data of users and the LLM models from potential security breaches. Some of these tools are -

Rebuff : This is a tool designed to protect LLM applications from prompt injection attacks through a multi-latered defense. Developed by the company ProtectAI, this tool offers 4 layers of defense.

Heuristics : Filter out potential malicious input before it reaches the LLM
LLM-based Detection : Use an LLM model dedicated to analysing incoming prompts and identifying if they have any malicious intentions.
VectorDB : Storing the embeddings of previously attempted attacks can help you recognize patterns and detect similar nature attacks in the future
Canary Tokens : Adding canary tokens to the prompts to detect leakages proves to be an effective strategy to mitigate future attacks

LLM Guard : LLM guard offers functionalities like detecting harmful language, guardrails to prevent data leakage, providing resistance against prompt engineering attacks and offers sanitization capabilities. It comes packaged as a library.
Vigil : Vigil offer a python library and a RestAPI that can be utilized to assess the prompts and responses from LLM models against a set of scanners specialised to detect jailbreaks, prompt injections and identifying other potential threats.
Fiddler Auditor : It is an open-source robustness library for red-teaming of LLMs that enables ML teams to maintain and protect the security of their LLM model. It offers a very easy to use library which will let ML practitioners / cyber-security researchers to test their models for their security effectiveness using just a few lines of code and can help identify specific flaws left un-handled previously in it.
WhyLabs : WhyLabs comes with an LLM security management offering to enable teams to protect their LLM models. The solution designed by them will help mitigate prompt injection attacks on LLM models and prevent data leakage incidents.

Overall, in this article we have covered the adversarial attacks on LLM tools, methods to probe them to disclose classified information, the OWASP Top 10 guidelines for LLM tools that can help ensure enough security exercises are practiced in every step of the LLM model creation and usage phase and tools to detect these attacks.

In conclusion, this research area proves to be very experimental in nature and as these models become more powerful and larger, it is becoming more and more important to maintain and adopt to methods which can prevent LLMs from leaking adversarial information and it is crucial to stay up-to-date and follow best practices to ensure the security and integrity of the LLM applications.

DEV Community