
Siddhant Saxena

Posted on • Originally published at Medium

Extract Valuable Insights from Your Data Using AutoGPT with Qdrant

This article presents an in-depth guide to extracting valuable insights from raw data using AutoGPT and Qdrant Database.

Workflow of AutoGPT

In the ever-evolving landscape of data science, extracting meaningful insights from vast datasets has been akin to finding a needle in a haystack. But what if we could transform this daunting task into an intuitive, efficient, and surprisingly straightforward journey? Welcome to my exploration of AutoGPT and Qdrant, two revolutionary tools that reshape how we interact with and understand our data.

Whether you are a seasoned data scientist, a curious beginner, or somewhere in between, this exploration is designed to illuminate the path to extracting valuable insights from your data. So, fasten your seatbelts and let's embark on this exciting adventure together!

In this article, we will give autonomous prompting powers to GPT and develop a chat-like system for interacting with large data files. I will use LangChain as the interface between AutoGPT, which handles the retrieval tasks, and Qdrant, which serves as the cloud vector store. This tech stack provides seamless integrations for data profiling, retrieval, and generation tasks in modern-day ecosystems.

Let’s first have a look at AutoGPT and understand its capabilities for executing complicated tasks.

AutoGPT: Supercharge GPT-4 with Autonomous Task Execution

AutoGPT represents an innovative leap in the field of automation, moving beyond the conventional boundaries of a standalone model. It’s a groundbreaking experiment that effectively harnesses the impressive capabilities of advanced Large Language Models like GPT-4 and GPT-3. The core objective of AutoGPT is to automate a variety of tasks by utilizing the vast knowledge and understanding embedded within these models. It does so by generating a series of instructions from the LLM and then executing them, primarily focusing on tasks that involve programming logic and step-by-step execution.

Key differences between LLMs and AutoGPT

To put this into perspective, consider the task of conducting exploratory data analysis (EDA) on a dataset as an example. AutoGPT employs a logical, step-by-step methodology for this complex process. Initially, it identifies the dataset and understands the type of analysis required. Then, it proceeds to write and execute a Python script for importing the data, often from a CSV or Excel file. Next, AutoGPT performs various data cleansing steps, such as handling missing values or outliers, followed by executing a series of commands for data visualization, like creating histograms, box plots, or scatter plots to understand data distributions and relationships. This approach simplifies the intricate process of EDA, which is fundamental in data science.
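To make this concrete, here is a minimal sketch of the kind of Python script AutoGPT might generate for such an EDA task; the file name and cleaning choices below are hypothetical, not AutoGPT's actual output.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset path; AutoGPT would substitute the real file.
df = pd.read_csv("dataset.csv")

# Basic cleansing: drop duplicate rows, fill missing numeric values with medians.
df = df.drop_duplicates()
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Quick visual profiling: a histogram for every numeric column.
df[numeric_cols].hist(figsize=(10, 8), bins=30)
plt.tight_layout()
plt.show()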

The true brilliance of AutoGPT lies not just in the automation of these analytical steps, but in its ability to dynamically create and adapt Python scripts tailored to the specific needs of the dataset and the analysis objectives, making the exploratory process both efficient and insightful.

All right, now that we have understood the framework of AutoGPT and how it differs from LLMs like ChatGPT, it is time to gain a deeper understanding of its workings and current interfaces, and to execute some custom tasks for better evaluation.

Executing Generative LM tasks on AutoGPT

We can easily use the official AutoGPT agent in Google Colab. Here we will perform inference tasks with AutoGPT; let's first pull the official GitHub repo, Significant-Gravitas/AutoGPT. Keep your OpenAI API key ready for the AutoGPT environment configuration.

!git clone https://github.com/Significant-Gravitas/Auto-GPT.git -b stable --single-branch
%cd Auto-GPT/
!pip install -r requirements.txt
!cp .env.template env.txt

Edit the “env.txt” file and add your API keys, then load the new configuration into the environment with “!cp env.txt .env”.
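For reference, the relevant entry in “env.txt” looks roughly like this (the value is a placeholder):

OPENAI_API_KEY=sk-your-key-here

With the keys in place, we initialize the AutoGPT CLI interface using the command below.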

!python -m autogpt             # If you have GPT-4 accessible keys
!python -m autogpt --gpt3only  # If you do not have GPT-4 keys

# We can also use the --continuous argument for recursive agent execution.

Here we leave Continuous Mode disabled, so the AutoGPT agent pauses for confirmation at each step rather than recursing autonomously through complex tasks.

Welcome to Auto-GPT!
Once this workflow is activated, AutoGPT takes the initial step of inquiring about the foundational task at hand, in this scenario ‘generation’. Based on this primary task, AutoGPT assigns a task-specific agent tailored to its needs. For generation tasks, AutoGPT automatically designates the GenGPT agent, a specialized module designed for this purpose.

Workflow adopted by the GenGPT agent

The GenGPT agent operates using four pivotal components, each playing a unique role in the generation process:

  • GenGPT Thoughts: This is the core idea generation component of GenGPT. It involves the gathering and processing of information relevant to the task. This component synthesizes data from its trained knowledge base and integrates it with the context of the current request, essentially forming the ‘thoughts’ behind the response.
  • Reasoning: Here, GenGPT applies logical analysis and critical thinking to the information at hand. This step is crucial for ensuring that the response is not just based on data but is also logically sound and contextually appropriate. It’s where the agent evaluates different aspects of the information, checks for consistency, and forms coherent arguments or explanations.
  • Criticism: In this stage, GenGPT engages in a self-evaluation process. It critically assesses the response it has formulated, looking for potential flaws, biases, or inaccuracies. This internal review mechanism is key to maintaining the quality and reliability of the responses, ensuring that they meet a high standard of accuracy and relevance.
  • Speak: The final component is the delivery of the response. ‘Speak’ encompasses the way GenGPT articulates its response, ensuring it’s in a clear, coherent, and user-friendly manner. This involves not only the linguistic aspects but also tailoring the response to fit the user’s style of inquiry and the platform’s requirements.

Together, these components enable GenGPT to generate responses that are not only informed and logical but also critically sound and effectively communicated, showcasing the advanced capabilities of AutoGPT in handling generation tasks.

Alice in Wonderland

User Input: “Generate a storyline about Alice in Wonderland”

GenGPT Response and Next ACTION

Title: “Alice’s Whimsical Journey Through the Clockwork Garden”

In the quaint village of Darlington, young Alice, known for her boundless curiosity, discovered a hidden, moss-covered doorway in the heart of the old forest. Pushing the door open, she stepped into a world unlike any other — a Clockwork Garden, where time itself seemed to dance and swirl around. In this fantastical land, the trees were made of shimmering gears and the flowers ticked and tocked, blooming in sync with the rhythm of time. Alice was greeted by the Clockwork Rabbit, a dapper creature with a pocket watch embedded in its chest. “Hurry now, Alice, the Timekeeper’s Festival is about to begin,” he urged, leading her through the maze of gears and springs. As they journeyed, they met the Gearwork Caterpillar, lounging atop a mushroom made of cogs, who offered Alice riddles and wisdom about the nature of time. Further along, they encountered the Mad Tinker, a peculiar inventor hosting an endless tea party with robots and automatons, celebrating the moment’s fleeting beauty.
The heart of the Clockwork Garden was ruled by the Tick-Tock Queen, a majestic figure adorned in robes of woven silver threads, who guarded the Great Clock and controlled the flow of time in the garden. However, the Great Clock was malfunctioning, causing time to run amok. With her innate cleverness and the help of her newfound friends, Alice set out to fix the Great Clock. Through a series of adventures, from scaling the Tower of Seconds to navigating the Labyrinth of Hours, Alice learned the intricate dance of time and its precious value.


Powering AutoGPT with Qdrant Vector Database

Qdrant is an innovative vector database designed to handle complex, high-dimensional data efficiently. At its core, Qdrant specializes in storing and indexing vector embeddings, making it an invaluable tool for tasks that require rapid and accurate retrieval of similar items, such as in recommendation systems or image searches. What sets Qdrant apart is its user-friendly API, which simplifies the process of integrating advanced search capabilities into various applications. Furthermore, Qdrant offers a unique feature that allows for filtering results based on additional metadata, enhancing the relevance and precision of search outcomes. As an open-source alternative to other vector databases like Pinecone, Qdrant is not only accessible but also stands at the forefront of technology, offering state-of-the-art speed in nearest-neighbor searches. Its approach to handling vector data and its commitment to continuous improvement make it a standout choice for developers and organizations dealing with complex data landscapes.
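As a small illustration of that metadata filtering, here is a sketch of a filtered nearest-neighbor search with the Python client; the collection name, payload field, and query vector are placeholders, not part of this article's setup.

from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient(url="http://localhost:6333")

# Return the 5 nearest neighbors, restricted to points whose payload
# field "source" equals "kaggle".
hits = client.search(
    collection_name="demo-collection",
    query_vector=[0.1] * 768,  # placeholder query embedding
    query_filter=models.Filter(
        must=[
            models.FieldCondition(key="source", match=models.MatchValue(value="kaggle")),
        ]
    ),
    limit=5,
)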

Creating Collections in AutoGPT-DA Cluster

Now, let’s start building a cloud vector database as a cluster on the Qdrant cloud. First, we will create a free-tier cloud cluster for our experimental purposes. You can adjust the configuration for specific use cases and requirements.

If you are new to the Qdrant Cloud Database, check my previous post about setting up the Qdrant Cloud cluster and monitoring the vector database using the Thunder-HTTP client.

To check the status of the AutoGPT-DA cluster using the Python Qdrant client, we first need to install the following dependencies and export the necessary API keys to the environment:

!pip install qdrant-client
# In Colab, `!export` runs in a throwaway subshell and does not persist;
# use %env (or os.environ) so that later cells can read the keys.
%env OPENAI_API_KEY=sk-SpCU2Iz2aoBEKS7F5QzDT3BlbkxxxxxxxxxxuyPX1hRCklJy
%env QDRANT_API_KEY=ZcnKdbf9617SH5sy-wklOxxxxxxxxxvs3vsPeSo0_Zv3cOjQbg

Now we will import the Qdrant Python client and create our collection, which will store the vector embeddings as data points.

import os
from qdrant_client import QdrantClient
from qdrant_client.http import models

qdrant_client = QdrantClient(
    url="https://9444ba5f-a4a1-xxxx-xxxx-b24c6d459624.us-east4-0.gcp.cloud.qdrant.io",
    api_key=os.getenv("QDRANT_API_KEY"),
)

# The vector size must match the output dimension of the embedding model you use.
vectors_config = models.VectorParams(size=768, distance=models.Distance.COSINE)

qdrant_client.recreate_collection(
    collection_name="autogpt-collection",
    vectors_config=vectors_config,
)

The above Python script initializes a vector configuration for Qdrant, specifying vectors of size 768 and cosine distance as the comparison metric. The size must match the output dimension of whichever embedding model you use; together these settings determine how vectors are stored and how similarity between them is measured within the Qdrant collection.

Now that we have created our collection, we need some embeddings to store in our vector database.

Using the “get_collections()” method of the qdrant_client class, we can check the status of the created cluster.
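For example:

print(qdrant_client.get_collections())
# e.g. CollectionsResponse(collections=[CollectionDescription(name='autogpt-collection')])
# (the exact output shape may vary by client version)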

Creating a Vector Database for Tabular Dataset

Now that we have created our collection, we will have to append the data points in the form of vectorized embeddings as per the vector configuration. In this article, I will pick a finance dataset related to stock markets from Kaggle. Let’s have a look at the initial format of our dataset.

Finance dataset: it contains 24 feature columns related to stock markets

import pandas as pd
import openai

file_path = "/content/Finance_data.csv"
df = pd.read_csv(file_path)

selected_columns = ['gender', 'age', 'Investment_Avenues', 'Expect', 'Avenue',
                    'Reason_Equity', 'Reason_Mutual', 'Reason_Bonds', 'Reason_FD', 'Source']
df['concatenated_text'] = df[selected_columns].astype(str).agg(' '.join, axis=1)

openai.api_key = "sk-SpCUxxxxxxxxJCw44h5sOuyPX1hRCklJy"  # Your API key

def get_embedding(text):
    # Uses the legacy (pre-1.0) OpenAI Python SDK interface.
    try:
        response = openai.Embedding.create(input=text, engine="text-similarity-babbage-001")
        return response['data'][0]['embedding']
    except Exception as e:
        print(f"Error in getting embedding: {e}")
        return None

df['content_vector'] = df['concatenated_text'].apply(get_embedding)

# The collection below expects two named vectors, so we embed a short
# per-row label (here the 'Avenue' column) as the "title" vector as well.
df['title_vector'] = df['Avenue'].astype(str).apply(get_embedding)

final_df = pd.DataFrame({
    'title_vector': df['title_vector'],
    'content_vector': df['content_vector'],
})

output_file = 'title_content_embeddings.csv'
final_df.to_csv(output_file, index=False)

The above Python script creates a vectorized dataset with two columns: “title_vector”, which stores the embedding of a short per-row label, and “content_vector”, which stores the OpenAI embedding of each row’s concatenated feature text.

Index Data

In Qdrant, a data management system designed for vector search, data is organized into structures called “collections.” Each collection serves as a container for multiple objects, with each object represented by one or more vectors. These vectors are essentially multi-dimensional data points that capture the essence of the object’s characteristics in a numerical form. Additionally, objects can be accompanied by “payloads,” which are metadata providing extra contextual information about the object.

In our scenario, we have established a collection named ‘autogpt-collection’ within Qdrant. This collection is unique in that each object within it is characterized by two different types of vectors: one representing the “title” and the other representing the “content.” This dual-vector approach allows for a more nuanced and detailed representation of each object, enhancing the accuracy and relevance of search results within the collection.

from qdrant_client.http import models as rest

# Infer the embedding dimension from the data we just generated.
vector_size = len(final_df["content_vector"].iloc[0])

qdrant_client.recreate_collection(
    collection_name="autogpt-collection",
    vectors_config={
        "title": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
        "content": rest.VectorParams(
            distance=rest.Distance.COSINE,
            size=vector_size,
        ),
    },
)

Next, we will upsert the vector payload into our collection using the following script:

qdrant_client.upsert(
    collection_name="autogpt-collection",
    points=[
        rest.PointStruct(
            id=k,
            vector={
                "title": v["title_vector"],
                "content": v["content_vector"],
            },
            # Keep the payload small: store only the row index as metadata,
            # rather than duplicating the vectors themselves into the payload.
            payload={"row_id": int(k)},
        )
        for k, v in final_df.iterrows()
    ],
)

Now that we have upserted all of our vector embeddings as points, we can move on to integrate the vector database with AutoGPT.
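Before doing so, we can sanity-check the collection with a direct search against the “content” named vector, reusing the get_embedding helper from earlier (the query text is just an example):

query_vec = get_embedding("young investors who prefer mutual funds")

hits = qdrant_client.search(
    collection_name="autogpt-collection",
    query_vector=("content", query_vec),  # search against the "content" named vector
    limit=3,
)
for hit in hits:
    print(hit.id, hit.score)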

Enhancing AutoGPT with Qdrant Vector Database

We can easily integrate Qdrant with AutoGPT by updating the “env.txt” file with our keys from OpenAI and Qdrant, and then refreshing the environment:

!cp env.txt .env

Now we will use the AutoGPT integration from LangChain, with our Qdrant vector store serving as the agent’s memory. First, let’s install some required dependencies:

!pip install langchain google-search-results openai tiktoken

Langchain-Qdrant Integration

We currently have a vector store set up in the cloud, which can be accessed by any of our applications. With the appropriate access credentials, we can easily connect to this store without having to regenerate the embeddings each time. Our next step is to integrate this vector store with our application, enabling AutoGPT to use it for query processing in our question-answering tasks.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Qdrant

embeddings_model = OpenAIEmbeddings()

# Wrap the existing cloud collection; no need to regenerate embeddings.
# Our collection uses named vectors, so we point the wrapper at "content"
# (support for `vector_name` depends on the LangChain version).
vector_store = Qdrant(
    client=qdrant_client,
    collection_name="autogpt-collection",
    embeddings=embeddings_model,
    vector_name="content",
)
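As a quick check that the wrapper can reach the collection, we can pull a few documents through the same retriever interface the agent will use later (the query text is illustrative):

retriever = vector_store.as_retriever(search_kwargs={"k": 3})
docs = retriever.get_relevant_documents("investment preferences of young investors")
print(len(docs), "documents retrieved")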

Langchain-SerpApi Integration

We will use SerpAPI for search engine results page (SERP) queries. Using LangChain, we set up the agent’s tools as follows (see the script after this list):

  1. SerpAPIWrapper is initialized as a search tool, wrapping an API for search engine results page (SERP) queries.
  2. The tools list is created, consisting of multiple Tool instances, each designed for a specific functionality.
  3. A search tool named “search” is added, tailored for answering questions about current events through targeted queries.
  4. Tools for writing to (WriteFileTool()) and reading from (ReadFileTool()) files are also included, enhancing the agent's file management capabilities.
import os
os.environ['SERPAPI_API_KEY'] = "b5eafbade1f9a4423fxxxxxxxxxx006ace4f1c9c408f1f3f22f5705513e186050"
os.environ['OPENAI_API_KEY'] = "sk-SpCU2Iz2aoBEKS7F5QzDT3BlbkxxxxxxxxxxuyPX1hRCklJy"
from langchain.utilities import SerpAPIWrapper
from langchain.agents import Tool
from langchain.tools.file_management.write import WriteFileTool
from langchain.tools.file_management.read import ReadFileTool

search = SerpAPIWrapper()
tools = [
    Tool(
        name = "search",
        func=search.run,
        description="useful for when you need to answer questions about current events. You should ask targeted questions"
    ),
    WriteFileTool(),
    ReadFileTool(),
]
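Before handing the tools to the agent, we can smoke-test the SERP tool on its own (this assumes a valid SERPAPI_API_KEY; the query is just an example):

# Quick sanity check of the search tool.
print(search.run("latest stock market news"))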

Langchain-AutoGPT Integration

We will be using the ChatOpenAI model and initialize an AutoGPT agent with specific configurations:

from langchain.experimental import AutoGPT  # moved to langchain_experimental.autonomous_agents in newer releases
from langchain.chat_models import ChatOpenAI

agent = AutoGPT.from_llm_and_tools(
    ai_name="AutoEDA",
    ai_role="Analyst",
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    memory=vector_store.as_retriever(),
)
# Set verbose to True for detailed logging of the agent's reasoning.
agent.chain.verbose = True
  1. agent = AutoGPT.from_llm_and_tools(...) creates an instance of the AutoGPT agent, configuring it with a set of tools and a language model.
  2. The agent is named “AutoEDA” and assigned the role of “Analyst”, indicating its purpose or functionality.
  3. The language model used is ChatOpenAI with a temperature setting of 0, which controls the randomness of the model's responses.
  4. The agent’s memory is linked to our Qdrant vector store (vector_store.as_retriever()), enabling it to retrieve information from this store.

Additionally, the line agent.chain.verbose = True sets the agent's verbose mode to true, likely to enable detailed logging or output of its operations.
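With everything wired together, we can hand the agent a goal and let it plan and execute; the goal text below is just an example:

agent.run(["Profile the finance dataset and summarize the top reasons investors choose mutual funds"])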

Hurray! We have successfully developed an AutoGPT agent that can understand large raw datasets for question-answering tasks. I hope this journey has been enlightening, particularly in understanding vector databases, LangChain, and OpenAI. Keep an eye out for more exciting blog posts.

Conclusion

In this journey through AutoGPT and Qdrant, I’ve explored how these innovative tools can transform data analysis into an intuitive, efficient process. AutoGPT, with its autonomous task execution, pairs seamlessly with the Qdrant vector database, enabling effective handling of complex, high-dimensional data. This combination simplifies tasks such as exploratory data analysis, ensuring responses are not only data-driven but also contextually sound. My exploration into their integration and application in real-world scenarios highlights their potential in modern data ecosystems, offering a glimpse into the future of automated data processing and insight extraction.

Follow me on Twitter: @sidgraph

(Note: This blogpost is in collaboration with Superteams.ai.)
