Large language models have become powerful tools for a wide range of tasks, including the analysis of unstructured text. One common application is extracting structured attributes from free-form text.
This is particularly relevant in e-commerce, where editors are frequently tasked with transforming plain-text product descriptions into structured product cards, a process where large language models can offer considerable assistance.
To extract structured attributes from a product description, we feed it to a large language model together with a prompt that specifies which attributes to extract. If the result isn't quite right, we refine the prompt. A minimal sketch of such a call is shown below.
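Here is a minimal sketch of such a request using the official openai Python client (v1+). The model name, prompt wording, and attribute list are illustrative placeholders, not necessarily what the example repository uses:

```python
# A minimal sketch: extract structured attributes from a product
# description. Model name, prompt text, and attributes are placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

description = "Cotton T-shirt, navy blue, sizes S-XL, machine washable."
prompt = (
    "Extract the following attributes from the product description "
    "and return them as JSON: material, color, sizes.\n\n"
    f"Description: {description}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": prompt}],
)

# Assumes the model returned valid JSON; real code should validate this.
attributes = json.loads(response.choices[0].message.content)
print(attributes)
```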
Developing an effective prompt is an ongoing task. We continually add new test examples and tweak the prompt to get better results.
Since every OpenAI request costs money, our aim is to process each "prompt + example" pair exactly once. Accidentally re-running a large test set through OpenAI can lead to considerable expense.
Problem
The challenge lies in tracking: knowing which examples have already been processed with which prompt, and what the results were.
Writing code to implement this tracking requires handling several situations (a rough sketch of such bookkeeping follows this list):
If new examples are added, the prompt should be executed only for these new instances.
If any example undergoes changes, it needs to be re-processed with the prompt.
When the prompt itself changes or new prompts are added, all examples must be reprocessed.
If an example or a prompt is deleted, the corresponding processing results should also be removed.
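To make the bookkeeping concrete, here is a rough hand-rolled sketch (hypothetical, and not how Datapipe works internally): hash every prompt and example, reprocess a pair whenever either hash is new or changed, and drop results for deleted pairs.

```python
# Hypothetical hand-rolled tracking: store content hashes per
# (prompt, example) pair and recompute only new or changed pairs.
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# (prompt_id, example_id) -> (prompt_hash, example_hash, result)
processed: dict = {}

def sync(prompts: dict, examples: dict, run) -> None:
    """prompts/examples map ids to texts; run(prompt, example) calls the LLM."""
    live_pairs = set()
    for pid, prompt in prompts.items():
        for eid, example in examples.items():
            live_pairs.add((pid, eid))
            hashes = (content_hash(prompt), content_hash(example))
            cached = processed.get((pid, eid))
            if cached is None or cached[:2] != hashes:  # new or changed pair
                processed[(pid, eid)] = (*hashes, run(prompt, example))
    # remove results whose prompt or example was deleted
    for pair in set(processed) - live_pairs:
        del processed[pair]
```

Even this toy version has to get several corner cases right, and a production version would also need persistence, error handling, and concurrency control.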
Solution
Datapipe is an open-source library that solves exactly this problem. It tracks completed computations and recalculates results only when their inputs change, which takes the bookkeeping burden off the developer.
A developer writes a plain Python function as if it operated on the entire dataset, without thinking about computation status. Datapipe then automatically applies the function only to new or modified data.
For the OpenAI use case, the developer works with two dataframes: one containing all the prompts and another containing all the examples. The code is written as though every prompt were applied to every example, and Datapipe ensures that no "prompt + example" pair is processed more than once. A rough illustration of such a function follows.
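As an illustration of the developer's mental model (this is not Datapipe's actual API), the user-side function might look like a plain pandas cross join; call_openai here is a hypothetical stand-in for the request sketched earlier:

```python
# Illustrative only: the developer writes a full cross join of prompts
# and examples; Datapipe's job (not shown) is to invoke this logic only
# for pairs that are new or changed.
import pandas as pd

def call_openai(prompt: str, text: str) -> str:
    # hypothetical stand-in for a real OpenAI request
    return f"[{prompt[:20]}] applied to [{text[:20]}]"

def apply_prompts(prompts: pd.DataFrame, examples: pd.DataFrame) -> pd.DataFrame:
    pairs = prompts.merge(examples, how="cross")  # every prompt x every example
    pairs["result"] = [
        call_openai(p, t) for p, t in zip(pairs["prompt"], pairs["text"])
    ]
    return pairs[["prompt_id", "example_id", "result"]]

prompts = pd.DataFrame({"prompt_id": [1], "prompt": ["Extract color and material."]})
examples = pd.DataFrame({"example_id": [1, 2], "text": ["Navy cotton tee", "Oak desk"]})
print(apply_prompts(prompts, examples))
```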
Practical Application
To explore this tool, you can download and run an example from the open-source Datapipe examples repository: https://github.com/epoch8/datapipe-examples/tree/master/openai_inference
To set up:
- Clone this repository.
- Navigate to the example directory.
- Install dependencies using Poetry (poetry install).
- Set your OpenAI API key in the script.
To run the example:
- Create all necessary SQLite databases by running datapipe db create-all
- Run the data processing with datapipe run
The prompts and examples for processing are stored in the SQLite database data.sqlite in two tables: "input" for examples and "prompt" for prompts.
After executing "datapipe run", the processing results will appear in the "output" table.
When run for the first time, Datapipe processes all examples and saves the results in the "output" table. Any modification or addition to a prompt in the "prompt" table triggers reprocessing of all examples, and adding a new example to the "input" table causes it to be processed with each prompt.
Now you can add, modify, or delete prompts and examples in the database and watch how Datapipe tracks these changes, processing only the data that was actually added, modified, or deleted. One way to do this is shown in the sketch below.
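For instance, a new example row can be inserted directly with Python's standard sqlite3 module. The column names below ("id", "text") are assumptions for illustration; inspect the actual schema of data.sqlite before running anything like this:

```python
# Insert a new example into the "input" table, then re-run `datapipe run`.
# Column names ("id", "text") are assumed; check the real schema first.
import sqlite3

with sqlite3.connect("data.sqlite") as conn:
    conn.execute(
        "INSERT INTO input (id, text) VALUES (?, ?)",
        ("example-42", "Wireless mouse, 2.4 GHz, black, USB receiver included."),
    )
```

After the next datapipe run, only this new row should be processed against each prompt.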
We highly value your insights and experiences. Have the challenges outlined in this article resonated with your own practice? If you've applied these methods or are considering them, we invite you to share your thoughts, experiences, or reflections in the comments section below.