DEV Community

Cover image for Deep dive: Privacy risks of fine-tuning LLMs
Daniel Huynh
Daniel Huynh

Posted on • Updated on

Deep dive: Privacy risks of fine-tuning LLMs

Key Takeaways:

LLMs can leak data through two mechanisms:

Input privacy: data is exposed when sent to a remote AI provider, e.g. Hugging Face or OpenAI, and can be at risk if these admins are compromised or malicious.
Output privacy, aka a user/attacker, can send prompts to make the LLM regurgitate parts of the training/fine-tuning set, which can leak confidential information. This is what happened to Samsung.
Input privacy issues arise when relying on external SaaS AI solutions like GPT4 APIs, while output privacy issues arise when people fine-tune LLMs on their private data and don’t restrict who can query the LLM.
Imagine a world where typing ‘My credit card number is…’ into a chatbot results in it auto-completing with a real person’s credit card details. Shocking but true. This article explores the inherent risks in fine-tuning Large Language Models (LLMs).

Privacy issues with LLMs have made the news, the most mediatized being Samsung’s data leakage after using OpenAI at the end of 2022.

But what exactly happened? What was the mechanism that was involved in this data leakage? Was it OpenAI’s fault? How are LLMs any different from other technologies in terms of privacy?

This article is highly inspired by the paper “Beyond Privacy Trade-offs with Structured Transparency” (Trask, A., Bluemke, E., Garfinkel, B., Cuervas-Mons, C.G. and Dafoe, A., 2020, arXiv preprint arXiv:2012.08347) which introduces the 5 pillars of privacy, including input and output privacy.

To understand how those attacks work in practice, let’s start with a concrete example.

Example — Bank Assistant Chatbot

Let’s say a bank provides customers with a Chatbot for account support and financial advice.

To deploy this Chatbot, they first have to source a good AI model. Instead of developing an in-house AI solution, utilizing an external SaaS AI service provider such as OpenAI, Cohere, or Anthropic is more efficient. Let’s consider that they choose this option.

Firstly, fine-tuning is performed to ensure good quality of the final ChatBot. (Fine-tuning is a process where the model is further trained (or “tuned”) on a new dataset to adapt it for specific tasks or to improve its performance.) To improve the performance of the Chatbot, previous conversations between customers and counselors are used to train the AI. For instance, OpenAI recently announced their fine-tuning feature, which allows such customization.

Image description

The AI provider starts by allocating a dedicated instance where their model is loaded and only available to the bank. The bank uploads past customer interactions to fine-tune the model. Note here that at this stage, the model’s weights implicitly contain information about the bank’s training set. This would not be the case if the foundational model were used as-is, without fine-tuning.

The model is then fine-tuned on the data and can then be used for deployment. Note here that at that stage, the model’s weights implicitly contain information about the bank’s training set, while this is not the case if the foundational model is used per se, without finetuning.

Image description

Finally, the Chatbot can be used in production in a deployment phase where users can send queries and prompts to receive counsel from the AI.

Let’s see now what different attacks can be performed on this system.

Input privacy

Input privacy means sensitive data shared with providers remains confidential and protected, even while applying a managed proprietary AI.

Input privacy issues arise a lot in the world of SaaS AI, with, for instance, OpenAI, Hugging Face, Vertex AI, or Cohere’s AI APIs.

Image description

The threat in that scenario comes from the AI provider, who is able to see the bank’s data, as the bank is sending the historical conversations for fine-tuning and live conversations for production.

Were the AI provider’s admins malicious or their infrastructure compromised, the bank data containing customers’ sensitive banking information would be exposed.

This threat is nothing new. It is just plain old data exposure to third-party SaaS vendors, where a data owner of sensitive data resorts to an external SaaS supplier, and this supplier gets compromised.

While it is sensible not to want one’s private data to roam freely on the internet, so far, OpenAI has not been suffering from any data exposure following external attacks.

So how was Samsung’s data leaked by OpenAI systems? Well, it’s through a totally different channel that data got leaked, and this threat is specific to LLMs and potentially affects most of them: it’s called data memorization of LLMs.

Let’s dive into this to better understand it.

Output privacy

To get a better understanding of how such data exposure happens, one has to understand what an LLM does at the core.

LLM stands for Large Language Model, which basically means it is a big neural network that is trained to complete sentences on a corpus like Wikipedia.

The formulation is as follows: given n previous tokens, which we can assume to be words for simplicity, the LLM has to predict the next word.

For instance, if n=4, we might show to the LLM “The cat sat on …” and the LLM has to answer the most statistically likely word based on the dataset it has been shown during training, which can be “ground,” “bench,” “table,” etc.

But this means the LLM has to learn by heart its training set, which means if sensitive information is included inside the training set, it is possible that if one starts inputting the beginning of a sentence, it might be completed with confidential information.

Image description

For example, an external attacker, or even a benevolent user, could prompt the LLM with “My credit card number is …” it might fill it with a credit card number from a real person whose data was leaked into the training set!

It is through this mechanism of memorization that LLMs are able to leak potentially sensitive information when prompted by users.

This is at the core of output privacy, which is the property that interactions with an LLM should not disclose sensitive interactions with a language model and shouldn’t reveal personal information possibly found in the training data.

The Samsung privacy leakage when using LLMs was due to output privacy issues. At the end of 2022, OpenAI soft-launched ChatGPT but advised users to proceed with caution and not share sensitive information, as they would use the requests to improve their model further.

Some Samsung employees either ignored or didn’t know about the rules. They sent confidential data like source code to ChatGPT to help with their work.

Unfortunately, OpenAI’s model learned by heart their data during their fine-tuning period. Reportedly, external users — potentially competitors — managed to make ChatGPT reveal Samsung’s confidential data.

While this episode made the news widely, it is not going to be a single event. LLMs have been shown to memorize large parts of their training set. The paper “Quantifying Memorization Across Neural Language Models” showed that at least 1% of the training set was learned by heart by LLMs such as GPT-J, and the bigger the model, the more likely the memorization.

A blog post also indicates that even rare examples (even with a single occurrence) in the training data can be memorized during fine-tuning.

Any company fine-tuning an LLM on their private data and exposing it to arbitrary users could potentially expose confidential information!

Unlike input privacy, which is a concern across the SaaS industry, output privacy is a unique issue for Large Language Models (LLMs). This is because LLMs have the ability to memorize their training data.

The key element of output privacy is that even innocent queries could accidentally reveal sensitive training data! The risk isn’t just from malicious attackers, unlike the typical landscape of machine learning attacks involving sophisticated hackers.


To summarize what we said, we provide the following table to explain the key elements of each exposure style:

Input Privacy

  • Threat Agents: Malicious admins or outside attackers
  • Vectors: Regular attacks to compromise the AI provider
  • Party Compromised and Responsible: Third-party AI provider

Output Privacy

  • Threat Agents: Regular users or outside attackers
  • Vectors: Regular queries to a publicly available LLM
  • Party Compromised and Responsible: Data owner

We have seen in this article that data exposure when resorting to LLMs can happen in mainly two ways: Input privacy issues (you sent data to a third-party AI supplier who got your data compromised) and Output privacy issues (a model was fine-tuned on your data, and external users queried such model and made it regurgitate your data).

Different techniques can address both of these issues. Local deployment of models and Privacy Enhancing Technologies (PETs) help guarantee input privacy. Data flow control techniques can be used to ensure output privacy.

We will explore the landscape of methods to solve these privacy issues in a later article.

Top comments (3)

srbhr profile image
Saurabh Rai

Hey @mithrilsecurity
Fine tuning has many challenges, and thanks for posting about this. This indeed is a topic that needs to be discussed thoroughly. Nice article.

Recently I've made a post about RAG (Retrieval Augmented Generation) as a better and safer alternative.

randellbrianknight profile image
Randell Brian Knight

Hello, Daniel. 👋 Welcome to the Dev community.

robinamirbahar profile image