
Sergey Bolshchikov

Posted on • Originally published at bolshchikov.net

🧹🧹 Sanitizing user input with OpenAI under $1

The objective is to extract a person's name as it appears on LinkedIn, without any surrounding noise. For example, if the input is "John, Smith", the desired output is "John,Smith". Here's a slightly more complex example: if the input is "🌯 John,Smith", the output should still be "John,Smith".

The Simplistic Solution

The most straightforward solution is to use a library that removes unwanted characters from the input. The npm package string-sanitizer does exactly that.

const { sanitize } = require("string-sanitizer");

const names = [
  " John,Smith ",
  "🌯John,Smith",
  "John,Smith ✔️",
  "John,Smith  🇺🇦"
];

names
  .map(name => name.split(',')) // split because sanitize removes ','
  .map(([firstName, lastName]) => ([sanitize(firstName), sanitize(lastName)]))
  .map(parts => parts.join(','));

// ["John,Smith", "John,Smith", "John,Smith", "John,Smith"]

This solution looks effective until it meets less predictable names.

Dealing with Unpredictable Input

LinkedIn users often get creative with their names. The following examples show how this variation breaks the code.

const { sanitize } = require("string-sanitizer");

const names = [
  "John (Johnny),Smith",
  "John,\"Smith, CPA\"",
  "John,Smith-Perry",
  "John,Smith Jr.",
  "John,\"Smith, Ph.D\"",
  "John,Smith ✰ I'm Hiring ✰",
];

names
  .map(name => name.split(',')) // also splits inside quoted last names such as "Smith, CPA"
  .map(([firstName, lastName]) => ([sanitize(firstName), sanitize(lastName)]))
  .map(parts => parts.join(','));

// ["JohnJohnny,Smith", "John,Smith", "John,SmithPerry", "John,SmithJr", "John,Smith", "John,SmithImHiring"]

One might attempt to handle each of these cases with increasingly complex regular expressions. However, this approach has two major problems:

  1. There will almost always be some input for which the code returns an incorrect result.
  2. The maintenance cost, which includes testing and improving the code, is high: every newly discovered case requires changing the function.
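
To make the maintenance problem concrete, here is a minimal sketch of a hand-written cleaner. The function name `cleanNamePart` and its three rules are hypothetical, chosen only for illustration; they are not the author's code.

```typescript
// Hypothetical hand-written cleaner: each rule patches one known case.
function cleanNamePart(part: string): string {
  return part
    .replace(/\(.*?\)/g, "")          // drop nicknames in parentheses, e.g. "John (Johnny)"
    .replace(/[^A-Za-z\s.'-]/g, "")   // keep ASCII letters, whitespace, dots, apostrophes, hyphens
    .replace(/\s+/g, " ")             // collapse runs of whitespace
    .trim();
}

const samples = ["John (Johnny)", "Smith-Perry", "Smith ✰ I'm Hiring ✰", "Smith 🇺🇦"];
const cleaned = samples.map(cleanNamePart);
console.log(cleaned);
```

The nickname, the hyphenated surname, and the flag are handled, but the "I'm Hiring" banner survives because it consists of ordinary letters. The ASCII-only character class would also strip names written in other alphabets, so each fix tends to call for yet another rule.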

Probabilistic Approach

For this use case, probabilistic models can yield significantly better results than hand-written rules.

One example is calling the OpenAI API with a simple prompt that asks for the person's name. This method handles more complex cases well, such as the "I'm Hiring" banner. However, it may overlook various suffixes.

Chat GPT untrained answers
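
A minimal sketch of such a call, without any fine-tuning, might look like the following. It assumes the Chat Completions REST endpoint called through `fetch`; the function names, the `gpt-3.5-turbo` model choice, and the reuse of the system prompt from the training file shown later in this article are illustrative assumptions, since the article does not show the exact prompt used.

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };

// Assumption: reuses the system prompt from the article's training file.
const basePrompt = "A given phrase contains a name. Your task is to extract it.";

// Build the request body; kept separate from the network call so it can be inspected.
function buildNameRequest(rawName: string, model = "gpt-3.5-turbo") {
  const messages: ChatMessage[] = [
    { role: "system", content: basePrompt },
    { role: "user", content: rawName },
  ];
  return { model, temperature: 0, messages };
}

// Send the request and return the model's answer (empty string if none).
async function extractName(rawName: string, apiKey: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildNameRequest(rawName)),
  });
  if (!response.ok) throw new Error(`OpenAI request failed: ${response.status}`);
  const data = (await response.json()) as {
    choices?: { message?: { content?: string | null } }[];
  };
  return (data.choices?.[0]?.message?.content ?? "").trim();
}
```

Calling `extractName("Smith ✰ I'm Hiring ✰", apiKey)` would send one system message and one user message; what comes back depends on the model.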

OpenAI provides a mechanism to refine results according to specific needs via fine-tuning the model.

Fine-tuning the model has three key advantages:

  1. The model is trained based on specific data, which yields more precise results.
  2. It is cost-effective: a fine-tuned model needs a shorter system prompt, so each request consumes fewer tokens.
  3. The process is expedited.

Although it may seem excessive to employ machine learning for a seemingly simple task, this task is not as simple as it appears. Solutions like OpenAI offer fine-tuning capabilities that are easy to implement, providing superior results in less time than traditional coding approaches.

Fine-Tuning the Model

Here's how we can prepare a fine-tuned model in three steps:

  1. Prepare training and validation datasets.
  2. Train the model.
  3. Use the fine-tuned model in the code.
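
Step 1 can be sketched as a simple deterministic split of the labelled examples. The `splitExamples` helper and the choice to send every fourth example to validation are assumptions made for illustration; the article does not say how its datasets were divided.

```typescript
type NameExample = { input: string; expected: string };

// Send every `every`-th example to the validation set; the rest is used for training.
function splitExamples(examples: NameExample[], every = 4) {
  const training: NameExample[] = [];
  const validation: NameExample[] = [];
  examples.forEach((example, index) => {
    if ((index + 1) % every === 0) validation.push(example);
    else training.push(example);
  });
  return { training, validation };
}

// Labelled pairs taken from the article's sample training file.
const labelled: NameExample[] = [
  { input: "Smith 🇮🇱", expected: "Smith" },
  { input: "John גיון סמיט", expected: "John" },
  { input: "John-Perry", expected: "John-Perry" },
  { input: "John Smith", expected: "John Smith" },
];

const split = splitExamples(labelled, 4);
console.log(split.training.length, split.validation.length);
```

With a real dataset, shuffling before splitting would avoid any ordering bias in the source list.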

Training a model, in simplest terms, involves supplying the OpenAI model with a file containing examples that include user input and the correct answer that the model should return.

OpenAI's fine-tuning guide requires at least 10 such examples and reports clear improvements with 50 to 100.

The training file in .jsonl format might look like this:

{"messages": [{"role": "system", "content": "A given phrase contains a name. Your task is to extract it."}, {"role": "user", "content": "Smith 🇮🇱"}, {"role": "assistant", "content": "Smith"}]}
{"messages": [{"role": "system", "content": "A given phrase contains a name. Your task is to extract it."}, {"role": "user", "content": "John גיון סמיט"}, {"role": "assistant", "content": "John"}]}
{"messages": [{"role": "system", "content": "A given phrase contains a name. Your task is to extract it."}, {"role": "user", "content": "John-Perry"}, {"role": "assistant", "content": "John-Perry"}]}
{"messages": [{"role": "system", "content": "A given phrase contains a name. Your task is to extract it."}, {"role": "user", "content": "John Smith"}, {"role": "assistant", "content": "John Smith"}]}


Each line is a separate example. It contains a system prompt (what you want the model to do), user input, and assistant output (the correct answer that the model should provide).
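
Writing such a file by hand is error-prone, so it can be generated from plain input/answer pairs. The following is a minimal sketch; the helper name `toTrainingLine` is hypothetical, and the pairs are the ones from the sample file above.

```typescript
const SYSTEM_PROMPT = "A given phrase contains a name. Your task is to extract it.";

// Serialize one labelled pair into a single line of the chat fine-tuning format.
function toTrainingLine(userInput: string, correctAnswer: string): string {
  return JSON.stringify({
    messages: [
      { role: "system", content: SYSTEM_PROMPT },
      { role: "user", content: userInput },
      { role: "assistant", content: correctAnswer },
    ],
  });
}

// Labelled pairs taken from the sample file above.
const pairs: [string, string][] = [
  ["Smith 🇮🇱", "Smith"],
  ["John גיון סמיט", "John"],
  ["John-Perry", "John-Perry"],
  ["John Smith", "John Smith"],
];

// JSONL is one JSON document per line, separated by newlines.
const jsonl = pairs.map(([input, answer]) => toTrainingLine(input, answer)).join("\n");
console.log(jsonl);
```

Because `JSON.stringify` escapes quotes and newlines inside the strings, each example stays on a single line even when a name contains unusual characters.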

Once the file is ready, upload it through OpenAI's fine-tuning UI to fine-tune the model. If your data preparation process is more complex, the OpenAI SDK can also be used for fine-tuning models.
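
For a scripted workflow, the same two steps (upload the file, then start a job) can be sketched against the REST API directly. This is not the SDK route mentioned above but a plain `fetch` equivalent; the function names, the file name `names.jsonl`, and the base model `gpt-3.5-turbo` (inferred from the fine-tuned model id used later in this article) are assumptions.

```typescript
type FineTuneJobBody = { model: string; training_file: string; validation_file?: string };

// Upload the JSONL content; returns the file id that the fine-tuning job refers to.
async function uploadTrainingFile(jsonlContent: string, apiKey: string): Promise<string> {
  const form = new FormData();
  form.append("purpose", "fine-tune");
  form.append("file", new Blob([jsonlContent], { type: "application/jsonl" }), "names.jsonl");
  const response = await fetch("https://api.openai.com/v1/files", {
    method: "POST",
    headers: { Authorization: `Bearer ${apiKey}` },
    body: form,
  });
  if (!response.ok) throw new Error(`Upload failed: ${response.status}`);
  const file = (await response.json()) as { id: string };
  return file.id;
}

// Build the job request; the validation file is optional.
function buildJobBody(trainingFileId: string, validationFileId?: string): FineTuneJobBody {
  const body: FineTuneJobBody = { model: "gpt-3.5-turbo", training_file: trainingFileId };
  if (validationFileId) body.validation_file = validationFileId;
  return body;
}

// Start the fine-tuning job; returns the job id.
async function startFineTuning(body: FineTuneJobBody, apiKey: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/fine_tuning/jobs", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify(body),
  });
  if (!response.ok) throw new Error(`Job creation failed: ${response.status}`);
  const job = (await response.json()) as { id: string };
  return job.id;
}
```

The job runs asynchronously on OpenAI's side, so the returned id is only a handle for checking on its progress.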

The training duration will depend on the number of examples provided. Once complete, the model can be integrated into your code.

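import OpenAI from 'openai';

// Assumes the official `openai` npm package (v4+), which reads OPENAI_API_KEY from the environment.
const openai = new OpenAI();

// Assumption: the same system prompt that was used in the training file.
const systemPrompt = 'A given phrase contains a name. Your task is to extract it.';

async function sanitizeName(someName: string): Promise<string> {
  const res = await openai.chat.completions.create({
    model: 'ft:gpt-3.5-turbo-0613:personal::some-weird-code', // fine-tuned model
    temperature: 0, // keep the output as deterministic as possible
    top_p: 1,
    frequency_penalty: 0,
    presence_penalty: 0,
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: someName },
    ],
  });

  // Fall back to an empty string if the response carries no content.
  return res.choices[0]?.message?.content ?? '';
}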


Evaluating the Cost

Finally, let us consider the cost. Using OpenAI costs money, but that should not be a deterrent.

It is more useful to look at efficiency: how much of your time it takes to reach the best achievable results.

The Simplistic Approach

The efficiency of the simplistic approach is rather low. For instance, you might spend approximately four hours writing and testing a function that covers all known cases. The challenge with this approach is the inability to predict all possible scenarios, resulting in a high likelihood of errors.

Fine-Tuning Approach

The cost of the fine-tuning approach consists of three components:

Your time to prepare training data + Cost to train the model + Cost to use it.

Although preparing the training data is the most time-consuming part, it would likely take less than half the time spent writing code using the simplistic approach.

Fine-tuning the model with OpenAI is a cost-effective solution. For instance, it took about 15 minutes to train a model with 151 examples at a cost of $0.13.

Price spent on fine-tuning

The final component is the cost of usage, which is also not substantial.
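
The monetary part of the formula above (training plus usage) can be estimated with simple arithmetic. In the sketch below every number is a placeholder chosen for illustration, not OpenAI's actual pricing or the author's token counts, and a single price is applied to prompt and completion tokens for simplicity.

```typescript
type CostInputs = {
  trainingTokens: number;     // tokens in the training file
  epochs: number;             // passes over the training file
  trainingPricePer1K: number; // USD per 1,000 training tokens
  requests: number;           // expected number of API calls
  tokensPerRequest: number;   // prompt + completion tokens per call
  usagePricePer1K: number;    // USD per 1,000 usage tokens
};

// Training cost + usage cost; the author's own time is not included.
function estimateCostUsd(c: CostInputs) {
  const training = ((c.trainingTokens * c.epochs) / 1000) * c.trainingPricePer1K;
  const usage = ((c.requests * c.tokensPerRequest) / 1000) * c.usagePricePer1K;
  return { training, usage, total: training + usage };
}

// Placeholder numbers for illustration only.
const estimate = estimateCostUsd({
  trainingTokens: 5000,
  epochs: 3,
  trainingPricePer1K: 0.008,
  requests: 1000,
  tokensPerRequest: 40,
  usagePricePer1K: 0.012,
});
console.log(estimate.total.toFixed(2));
```

Plugging in the current prices from OpenAI's pricing page and your own token counts gives a rough budget before training anything.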

However, the fundamental question is whether the benefits outweigh the costs. Can you truly obtain better results for unpredictable input?

Final results of fine-tuned model

Consider this: the fine-tuning approach works not only with known scenarios but also with names that mix languages or are written entirely in another language.
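
One way to check such a claim on your own data is to score any sanitizer against held-out examples. The sketch below is an assumption about how that could be done, not the author's evaluation; `accuracy`, `asciiOnly`, and `identity` are hypothetical names, and the stand-in sanitizers exist only so the code runs offline.

```typescript
type LabelledName = { input: string; expected: string };
type Sanitizer = (input: string) => Promise<string>;

// Fraction of examples for which the sanitizer returns exactly the expected name.
async function accuracy(sanitizer: Sanitizer, examples: LabelledName[]): Promise<number> {
  if (examples.length === 0) return 0;
  let correct = 0;
  for (const { input, expected } of examples) {
    const actual = (await sanitizer(input)).trim();
    if (actual === expected) correct += 1;
  }
  return correct / examples.length;
}

// Stand-in sanitizers: one keeps ASCII letters, spaces and hyphens; one changes nothing.
const asciiOnly: Sanitizer = async (input) => input.replace(/[^A-Za-z -]/g, "").trim();
const identity: Sanitizer = async (input) => input;

// Held-out pairs; here simply the examples from the article's sample file.
const heldOut: LabelledName[] = [
  { input: "Smith 🇮🇱", expected: "Smith" },
  { input: "John גיון סמיט", expected: "John" },
  { input: "John-Perry", expected: "John-Perry" },
  { input: "John Smith", expected: "John Smith" },
];

accuracy(asciiOnly, heldOut).then((score) => console.log("asciiOnly accuracy:", score));
```

Passing a function that calls the fine-tuned model in place of `asciiOnly` would give a comparable score for the model on the same examples.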

Photo by PAN XIAOZHEN on Unsplash
