Introduction
Large language models (LLMs) have substantially changed the way developers and data scientists, myself included, approach data analysis tasks. They make it easier to automate workflows and to improve predictive analysis. One way to analyze data with LLMs is to use the ReAct AI agent model.
Usage of ReAct AI Agent Model in Data Analysis
The ReAct AI agent model, the main subject of this post, is a framework that uses LLMs to tackle complex data analysis challenges. It combines an LLM's reasoning with the ability to take actions, which makes it a versatile tool for working with data, in particular tabular data, which makes up a large portion of the data we work with.
The way ReAct AI agent combines reasoning and taking action to provide an observation back is shown below.
Source: https://arize.com/blog-course/react-agent-llm/
At the core of the ReAct agent model is the LLM, which serves as the engine for generating verbal reasoning traces and actions. This powerful language model, such as GPT-4, is capable of producing coherent and contextually relevant responses, enabling the agent to understand the task at hand and formulate appropriate solutions. So:
ReAct AI agent model can both generate reasons and extract observations while producing responses to user queries.
To interact with the external environment and gather information, the ReAct agent model uses a suite of tools. A few examples:
- Search APIs for accessing external data
- Mathematical utilities for performing complex calculations
- NLP libraries for understanding and processing text-based data
I recommend selecting tools based on the specific requirements of the task, so that the agent's capabilities match the problem at hand. Avoid adding too many tools, especially ones that overlap in functionality: every extra tool increases the probability of errors or inaccurate answers and can degrade the AI agent's performance in other ways.
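To make the advice about keeping the tool set small and explicit more concrete, here is a minimal sketch of a tool registry. The `Tool` class, `build_registry`, `render_tool_section`, and the two toy tools are hypothetical names made up for illustration, not part of any agent library. The duplicate-name check is only a crude guard; overlapping functionality still has to be reviewed by hand.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Tool:
    """A named capability the agent may call (hypothetical structure)."""
    name: str
    description: str
    func: Callable[[str], str]


def build_registry(tools: List[Tool]) -> Dict[str, Tool]:
    """Index tools by name and reject duplicate names."""
    registry: Dict[str, Tool] = {}
    for tool in tools:
        if tool.name in registry:
            raise ValueError(f"Duplicate tool name: {tool.name}")
        registry[tool.name] = tool
    return registry


def render_tool_section(registry: Dict[str, Tool]) -> str:
    """Format the tool list roughly the way it could appear in a prompt."""
    return "\n".join(f"> {t.name}: {t.description}" for t in registry.values())


# Two toy tools, used only to show the shape of the registry.
registry = build_registry([
    Tool("word_count", "Counts the words in a text.", lambda s: str(len(s.split()))),
    Tool("upper", "Upper-cases a text.", lambda s: s.upper()),
])
print(render_tool_section(registry))
```

Rendering the descriptions from one registry keeps the prompt and the callable tools in sync, so the agent is never told about a tool that does not exist.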
ReAct Agent Model Architecture
The ReAct agent model encompasses various agent architectures, each designed for specific applications and interactions. These include:
- ZERO_SHOT_REACT_DESCRIPTION
- REACT_DOCSTORE
- SELF_ASK_WITH_SEARCH
- CONVERSATIONAL_REACT_DESCRIPTION
- OPENAI_FUNCTIONS
Each of these agent types is suited to different types of tasks and interactions.
A crucial component of the ReAct agent model is the Chain-of-Thought (CoT) prompting.
CoT prompting enables the LLM to carry out reasoning traces, create and adjust action plans, and even handle exceptions.
This capability enhances the decision-making and problem-solving abilities of the agent, making it a powerful tool for tackling complex data analysis challenges.
The ReAct prompting technique is another essential element of the ReAct agent model. This method guides the LLM in generating both reasoning traces and actions, ensuring that the agent's responses are well-structured and aligned with the task requirements.
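The interplay of reasoning traces, actions, and observations can be sketched as a small loop. This is a simplified illustration under stated assumptions, not how any particular framework implements it: `scripted_llm` is a stand-in that replays canned replies instead of calling a real model, and `react_loop`, `ACTION_RE`, and the `adder` tool are hypothetical names.

```python
import re
from typing import Callable, Dict, List


def scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Return a fake LLM that replays canned replies (stand-in for a real model)."""
    iterator = iter(replies)
    return lambda prompt: next(iterator)


# Matches "Action: <tool>" followed by "Action Input: <input>".
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.*)", re.DOTALL)


def react_loop(question: str, llm: Callable[[str], str],
               tools: Dict[str, Callable[[str], str]], max_steps: int = 5) -> str:
    """Alternate between model replies and tool calls until a final answer appears."""
    scratchpad = ""
    for _ in range(max_steps):
        reply = llm(f"Question: {question}\n{scratchpad}")
        match = ACTION_RE.search(reply)
        if match is None:
            # No action requested: treat the text after "AI:" as the final answer.
            return reply.split("AI:", 1)[-1].strip()
        tool_name, tool_input = match.group(1), match.group(2).strip()
        observation = tools[tool_name](tool_input)
        # Feed the observation back so the next reply can build on it.
        scratchpad += f"{reply}\nObservation: {observation}\n"
    raise RuntimeError("No final answer within max_steps")


tools = {"adder": lambda s: str(sum(float(x) for x in s.split("+")))}
llm = scripted_llm([
    "Thought: Do I need to use a tool? Yes\nAction: adder\nAction Input: 2+3",
    "Thought: Do I need to use a tool? No\nAI: The sum is 5.0",
])
answer = react_loop("What is 2 + 3?", llm, tools)
print(answer)
```

The scratchpad is what lets the model adjust its plan: each observation is appended to the next prompt, so a failed or surprising result can change the following action.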
By leveraging the power of LLMs and the versatile ReAct agent model, researchers and data analysts can unlock new possibilities in the realm of tabular data analysis. This framework empowers users to extract valuable insights, uncover hidden patterns, and make informed decisions, ultimately transforming the way we approach data-driven tasks.
Workflow for Tabular Data Insight Extraction Using a Large Language Model (LLM) ReAct Agent
1. Explore and Understand the Data
Let's assume you have a tabular dataset with various columns and features, including both numerical and categorical data. The tricky part is that many available tools work well with only one type of data, or require complicated workflows to process complex tables. With an LLM agent, things are different. Here, you need to perform two preliminary steps:
Source: Intelliarts
- Perform an initial Exploratory Data Analysis (EDA) to understand the structure, distributions, and characteristics of the data.
- Clean and preprocess the dataset, handling missing values, outliers, and other data quality issues as needed.
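The two preliminary steps above could look roughly like this with pandas. This is a sketch under assumptions: the first three data rows are taken from the file preview shown in the system prompt section of this post, while the last row (`Example County`, with a `Suppressed` marker and an empty rate) is made up purely to demonstrate a non-numeric entry and a missing value.

```python
import io

import pandas as pd

# Rows 1-3 come from the file preview in this post; the last row is a made-up
# example that adds a non-numeric entry and a missing value.
CSV_TEXT = '''State,County,First Year,Last Year,Injury Intent,Total Deaths (#),Population,"Age-adjusted Rate (per 100,000)"
Alabama,Autauga County,2018,2021,Firearm Suicide,22,226710,9.704027171
Alabama,Autauga County,2018,2021,Firearm Homicide,13,226710,5.734197874
Alabama,Baldwin County,2018,2021,Firearm Suicide,99,909837,10.88106991
Alabama,Example County,2018,2021,Firearm Suicide,Suppressed,1000,
'''

df = pd.read_csv(io.StringIO(CSV_TEXT))

# Step 1: explore structure and data quality.
print(df.shape)
print(df.dtypes)
missing_before = df.isna().sum()
print(missing_before)

# Step 2: clean. Non-numeric entries become NaN, then those rows are dropped.
df["Total Deaths (#)"] = pd.to_numeric(df["Total Deaths (#)"], errors="coerce")
clean = df.dropna(subset=["Total Deaths (#)"])
print(clean["Total Deaths (#)"].describe())
```

Whether to drop, impute, or keep rows with missing values depends on the analysis; dropping is used here only because it is the shortest option to show.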
2. Set Up the Python Execution Environment
- Utilize an external provider, such as Bearly, to set up a separate sandboxed Python environment for executing your code. This allows you to leverage the capabilities of the Python ecosystem, including libraries and tools, within the context of the LLM agent.
- Customize the instructions for working with the Python execution tool, specifying any necessary dependencies or configuration details.
Source: Author
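For local experiments, the idea of a Python execution tool can be approximated with a fresh interpreter process. The sketch below is a stand-in, not Bearly's API, and a subprocess is not a real security sandbox, so it should not be used for untrusted code. The function name `run_python` is hypothetical; the result keys mirror part of the observation format shown later in this post.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_python(script: str, timeout: float = 30.0) -> dict:
    """Run a script in a fresh interpreter and a throwaway working directory."""
    with tempfile.TemporaryDirectory() as workdir:
        # Scripts are expected to write any files to "output/".
        (Path(workdir) / "output").mkdir()
        script_path = Path(workdir) / "script.py"
        script_path.write_text(script, encoding="utf-8")
        proc = subprocess.run(
            [sys.executable, str(script_path)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return {
            "stdout": proc.stdout,
            "stderr": proc.stderr,
            "exitCode": proc.returncode,
        }


result = run_python("print(2 + 2)")
print(result)
```

Because every call starts a new process in a new directory, state does not carry over between executions, which matches the instruction to send the whole script every time.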
3. Define the System Prompt
Then, prepare a system prompt that provides instructions and context for the LLM agent to work with the tabular data.
The system prompt should include information about the data structure, any preprocessing steps, and the expected output format. For example, the system prompt could include instructions like "Treat all numbers as floats" or "Provide the output in a well-formatted table."
The prompt can be as extensive as necessary. You may need to modify the prompt several times to achieve better results.
A prompt example:
Assistant is a large language model trained by OpenAI.
Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, the Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.
Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.
Overall, Assistant is a powerful tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.
Then, you specify tools, i.e., predefined instruments the assistant has access to. Here's an example of the bearly_interpreter tool:
> bearly_interpreter:
It evaluates Python code in a sandbox environment. The environment resets on every execution. You must send the whole script every time and print your outputs. Script should be pure Python code that can be evaluated. It should be in Python format NOT markdown. The code should NOT be wrapped in backticks. All Python packages including requests, matplotlib, scipy, numpy, pandas, etc are available. If you have any files outputted write them to "output/" relative to the execution path. Output can only be read from the directory, stdout, and stdin. Do not use things like plot.show() as it will not work instead write them out output/ and a link to the file will be returned. print() any output and results so you can capture the output.
The following files are available in the evaluation environment:
- path: cdc_county_data.csv
first four lines: ['State,County,First Year,Last Year,Injury Intent,Total Deaths (#),Population,"Age-adjusted Rate (per 100,000)"\n', 'Alabama,Autauga County,2018,2021,Firearm Suicide,22,226710,9.704027171\n', 'Alabama,Autauga County,2018,2021,Firearm Homicide,13,226710,5.734197874\n', 'Alabama,Baldwin County,2018,2021,Firearm Suicide,99,909837,10.88106991\n']
description: CDC data on deaths
To use a tool, please use the following format:
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [bearly_interpreter]
Action Input: the input to the action
Observation: the result of the action
When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:
Thought: Do I need to use a tool? No
AI: [your response here]
Begin!
Previous conversation history:
{chat_history}
New input: {input}
{agent_scratchpad}
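The prompt above ends here. Its file section and placeholders can be filled in programmatically; the following is a minimal sketch of one way to do it, assuming plain `str.format` templating. The names `describe_file` and `PROMPT_TEMPLATE` are hypothetical, and the template contains only a shortened tail of the full prompt.

```python
import itertools
import tempfile
from pathlib import Path

HEAD_LINES = 4  # the prompt above shows the first four lines of the file

# Shortened tail of the prompt; the placeholders match the ones shown above.
PROMPT_TEMPLATE = (
    "The following files are available in the evaluation environment:\n"
    "{file_info}\n\n"
    "Previous conversation history:\n{chat_history}\n\n"
    "New input: {input}\n{agent_scratchpad}"
)


def describe_file(path: Path, description: str) -> str:
    """Build the file entry: name, first lines, and a short description."""
    with path.open(encoding="utf-8") as handle:
        head = list(itertools.islice(handle, HEAD_LINES))
    return (
        f"- path: {path.name}\n"
        f"first four lines: {head}\n"
        f"description: {description}"
    )


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "cdc_county_data.csv"
    # Same four lines as in the prompt above, written out so the sketch runs.
    csv_path.write_text(
        'State,County,First Year,Last Year,Injury Intent,Total Deaths (#),'
        'Population,"Age-adjusted Rate (per 100,000)"\n'
        "Alabama,Autauga County,2018,2021,Firearm Suicide,22,226710,9.704027171\n"
        "Alabama,Autauga County,2018,2021,Firearm Homicide,13,226710,5.734197874\n"
        "Alabama,Baldwin County,2018,2021,Firearm Suicide,99,909837,10.88106991\n",
        encoding="utf-8",
    )
    file_info = describe_file(csv_path, "CDC data on deaths")

prompt = PROMPT_TEMPLATE.format(
    file_info=file_info,
    chat_history="",
    input="What is the state with the most deaths?",
    agent_scratchpad="",
)
print(prompt)
```

Showing the model the header and a few rows is what lets it write code against the real column names, such as `Total Deaths (#)`, without seeing the whole file.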
4. Integrate the LLM Agent and Python Execution Tool
Create an agent structure that combines the capabilities of the LLM agent with the Python execution tool. This integration allows the LLM agent to leverage the Python environment to perform data processing, analysis, and other operations on the tabular data.
You should ensure that the combined prompt, which includes both the main agent prompt and the tool prompt instructions, is clear and comprehensive.
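On the integration side, the agent's text reply has to be turned into either a tool call or a final answer. Below is a minimal parser for the format defined in the prompt. It is a sketch that assumes well-formed replies; `ToolCall`, `FinalAnswer`, and `parse_agent_reply` are hypothetical names, not a library API.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass
class ToolCall:
    tool: str
    tool_input: str


@dataclass
class FinalAnswer:
    text: str


def parse_agent_reply(reply: str, allowed_tools: Set[str]) -> Union[ToolCall, FinalAnswer]:
    """Split a reply into a tool call or a final answer, per the prompt format."""
    if "AI:" in reply and "Action:" not in reply:
        return FinalAnswer(reply.split("AI:", 1)[1].strip())
    if "Action:" in reply and "Action Input:" in reply:
        # Everything after "Action Input:" is the (possibly multi-line) script.
        header, tool_input = reply.split("Action Input:", 1)
        tool = header.split("Action:", 1)[1].strip()
        if tool not in allowed_tools:
            raise ValueError(f"Unknown tool requested: {tool!r}")
        return ToolCall(tool, tool_input.strip())
    raise ValueError("Reply does not follow the expected format")


reply = (
    "Thought: Do I need to use a tool? Yes\n"
    "Action: bearly_interpreter\n"
    "Action Input:\n"
    "import pandas as pd\n"
    "print(1)"
)
step = parse_agent_reply(reply, {"bearly_interpreter"})
print(step)
```

Rejecting tool names that are not in the allowed set is a cheap safeguard against the model inventing a tool that was never described in the prompt.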
5. Process Queries and Retrieve Insights
When a user submits a query, such as "What is the state with the most deaths?", the LLM agent will:
- Identify the appropriate tool (in this case, the Python execution tool) to use for the task.
- Generate the necessary Python code to retrieve the requested information from the tabular data.
- Execute the Python code in the sandboxed environment and capture the output.
The first attempt looks like this:
Thought: Do I need to use a tool? Yes
Action: bearly_interpreter
Action Input:
import pandas as pd
# Load the data
data = pd.read_csv('cdc_county_data.csv')
# Convert columns to float where necessary
data['Total Deaths (#)'] = data['Total Deaths (#)'].astype(float)
# Group by state and sum the total deaths
state_deaths = data.groupby('State')['Total Deaths (#)'].sum()
# Find the state with the most deaths
state_with_most_deaths = state_deaths.idxmax()
most_deaths = state_deaths.max()
print(state_with_most_deaths, most_deaths)
Observation: {'stdout': '', 'stderr': 'Tr..'}
If the initial attempt encounters any errors, the LLM agent will analyze the error, refine the code, and retry the execution until a successful result is obtained. In this run, the retry replaces astype(float) with pd.to_numeric(..., errors='coerce'), which turns non-numeric entries into NaN instead of raising an error:
Thought: Do I need to use a tool? Yes
Action: bearly_interpreter
Action Input:
import pandas as pd
# Load the data
data = pd.read_csv('cdc_county_data.csv')
# Handle non-numeric values in 'Total Deaths (#)' column
data['Total Deaths (#)'] = pd.to_numeric(data['Total Deaths (#)'], errors='coerce')
# Group by state and sum the total deaths
state_deaths = data.groupby('State')['Total Deaths (#)'].sum()
# Find the state with the most deaths
state_with_most_deaths = state_deaths.idxmax()
most_deaths = state_deaths.max()
print(state_with_most_deaths, most_deaths)
Observation: {'stdout': 'Texas 16227.0\n', 'stderr': '', 'fileLinks': [], 'exitCode': 0}
Finally, the LLM agent will formulate a comprehensive answer based on the output of the Python code execution and present it to the user.
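The retry behavior described in this step can be sketched as a loop that feeds the error text back into code generation. Everything here is an assumption made for illustration: `scripted_generator` replays two fixed scripts instead of asking a real LLM, `execute` uses a plain subprocess rather than a sandbox, and the toy values are made up. The second script treats non-numeric entries as 0.0, which gives the same sum as skipping them.

```python
import subprocess
import sys
from typing import Callable, Optional


def execute(code: str) -> dict:
    """Run code in a fresh interpreter and capture its output."""
    proc = subprocess.run([sys.executable, "-c", code], capture_output=True, text=True)
    return {"stdout": proc.stdout, "stderr": proc.stderr, "exitCode": proc.returncode}


def run_with_retries(generate_code: Callable[[Optional[str]], str],
                     max_attempts: int = 3) -> dict:
    """Generate code, run it, and pass any error back to the generator."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        result = execute(generate_code(error))
        if result["exitCode"] == 0:
            return result
        error = result["stderr"]
    raise RuntimeError(f"All {max_attempts} attempts failed; last error:\n{error}")


# First attempt fails on the non-numeric entry; the second one handles it.
FIRST_TRY = "print(sum(float(v) for v in ['22', '13', 'Suppressed']))"
SECOND_TRY = """
values = ['22', '13', 'Suppressed']

def to_float(value):
    try:
        return float(value)
    except ValueError:
        return 0.0

print(sum(to_float(v) for v in values))
"""


def scripted_generator(error: Optional[str]) -> str:
    """Stand-in for the LLM: return the fixed script once an error was seen."""
    return FIRST_TRY if error is None else SECOND_TRY


result = run_with_retries(scripted_generator)
print(result["stdout"])
```

Capping the number of attempts matters in practice, since a model that keeps producing failing code would otherwise loop indefinitely.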
Final Take
By following this workflow, you can effectively leverage the capabilities of an LLM agent to work with tabular data, combining the agent's language understanding and reasoning abilities with the data processing power of Python.
LLMs, and the ReAct AI agent model in particular, are useful for processing large amounts of tabular data. Using the ReAct AI agent model boils down to exploring and preprocessing the data, setting up a Python execution environment, defining the system prompt, integrating the agent with the Python execution tool, and processing user queries to extract insights from the tabular data.
About the author: Oleksii Babych, Machine Learning Engineer