Why build AI agents for the workplace?
AI agents (or “assistants”) are an extension of large language models (LLMs): on top of generating text, they can access external tools, such as search engines, via API calls. This lets them overcome some common shortcomings of LLMs, such as the lack of access to internal business data.
Suppose you query a basic LLM, like ChatGPT, with a question about your own business data, such as how many views your landing page received last month. The model has no access to your analytics, so the best it can do is apologize, or worse, hallucinate a plausible-looking number. In contrast, an agent with website analytics access could approach the problem differently: it can call the analytics tool, retrieve the real figures, and answer accurately.
Agents can be augmented with a large number of tools, which opens up endless exciting possibilities. Importantly, these tools allow the LLM to perform actions (like sending an email) as well as retrieval (like summarizing an email).
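At its core, the pattern is a loop: the LLM either answers directly or requests a tool call, and the tool’s output is fed back into the conversation. Here’s a minimal, hypothetical sketch of that loop in Python. The function names (call_llm, web_search, send_email) and the JSON action format are illustrative assumptions, not any specific framework’s API:

```python
# Minimal agent loop (illustrative sketch, not a specific framework's API).
import json

def web_search(query: str) -> str:
    """Placeholder: call a search engine API and return a text summary."""
    return f"(search results for {query!r})"

def send_email(to: str, subject: str, body: str) -> str:
    """Placeholder: send an email via your mail provider's API."""
    return f"(email sent to {to})"

TOOLS = {"web_search": web_search, "send_email": send_email}

def run_agent(user_query: str, call_llm) -> str:
    """call_llm is assumed to return JSON: either {"answer": ...}
    or {"tool": "web_search", "args": {"query": ...}}."""
    messages = [{"role": "user", "content": user_query}]
    while True:
        action = json.loads(call_llm(messages, tools=list(TOOLS)))
        if "tool" not in action:
            return action["answer"]  # the LLM has finished the task
        # Execute the requested tool and feed the result back to the LLM.
        result = TOOLS[action["tool"]](**action["args"])
        messages.append({"role": "tool", "content": result})
```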
How an agent handles complex queries
Let’s now compare approaches to a more complicated query:
“Can you send an email to Olly with a chart of our landing page views over the past month?”
A basic LLM could neither access the data source nor send an email.
An approach like Retrieval Augmented Generation (RAG) could get the data on landing page views, but couldn’t send the chart to Olly.
An agent with access to the right tools could break the request into steps (see the sketch after this list):

1. Query the website analytics tool for landing page views over the past month.
2. Generate a chart from the retrieved data.
3. Look up Olly’s email address.
4. Draft an email, attach the chart, and send it.
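As a hedged sketch, here’s what that plan could look like as code. The tool functions here are illustrative stubs standing in for real analytics, charting, directory, and email APIs:

```python
from datetime import date, timedelta
import random

def get_page_views(page, start, end):
    # Stub: a real tool would call the analytics provider's API.
    return [random.randint(200, 400) for _ in range((end - start).days)]

def render_chart(views, title):
    # Stub: a real tool would render a PNG with a charting library.
    print(f"[chart] {title}: {len(views)} data points")
    return "landing_page_views.png"

def lookup_contact(name):
    # Stub: a real tool would query the company directory.
    return {"name": name, "email": "olly@example.com"}

def send_email(to, subject, body, attachments):
    # Stub: a real tool would call the email provider's API.
    print(f"[email] to={to}, subject={subject!r}, attachments={attachments}")

# The agent's plan, expressed as a sequence of tool calls:
end = date.today()
start = end - timedelta(days=30)
views = get_page_views("landing", start, end)                    # step 1
chart = render_chart(views, "Landing page views, last 30 days")  # step 2
olly = lookup_contact("Olly")                                    # step 3
send_email(                                                      # step 4
    to=olly["email"],
    subject="Landing page views over the past month",
    body="Hi Olly, here's the chart you asked for.",
    attachments=[chart],
)
```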
Like LLMs, agents are driven by natural language, which opens up many exciting possibilities. For example, an agent could be accessed via a Slack message, massively reducing friction for employees in their basic daily tasks. MindsDB has easy tutorials for setting up Slack bots; a minimal sketch is shown below.
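As an illustration, here’s a minimal Slack entry point using the slack_bolt library in Socket Mode. run_agent is a placeholder for whatever agent backend you use; the full setup (tokens, app configuration) is covered in the MindsDB tutorials:

```python
import os
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

def run_agent(text: str) -> str:
    # Placeholder: forward the message to your agent and return its reply.
    return f"(agent reply to: {text})"

@app.event("message")
def handle_message(event, say):
    # Ignore bot messages to avoid replying to ourselves.
    if event.get("bot_id"):
        return
    say(run_agent(event.get("text", "")))

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()
```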
AI agents can accomplish tasks that would otherwise take hours of a skilled analyst’s time: pulling the data from an analytics dashboard, loading it into a charting tool, and drafting an email. Now, all of this can be executed with a single Slack message instead.
The challenges with building AI agents for the workplace
Initial benchmarking of agents’ capabilities has been promising: they’ve been able to execute complex tasks with external tools. Some benchmarking studies have focused on agents in video game settings like Minecraft. While agents have performed impressively there, far outperforming basic LLMs, this doesn’t necessarily translate into business value.
Outside the game world, other benchmarking studies have largely focused on abstract tasks. Let’s look at a challenging example from Meta’s GAIA benchmark:
Level 3
Question: In NASA's Astronomy Picture of the Day on 2006 January 21, two astronauts are visible, with one appearing much smaller than the other. As of August 2023, out of the astronauts in the NASA Astronaut Group that the smaller astronaut was a member of, which one spent the least time in space, and how many minutes did he spend in space, rounded to the nearest minute? Exclude any astronauts who did not spend any time in space. Give the last name of the astronaut, separated from the number of minutes by a semicolon. Use commas as thousands separators in the number of minutes.
Ground truth: White; 5876
This requires up-to-date information (from the Wikipedia API in this case), complex reasoning (using the LLM’s internal logic), and mathematical accuracy (from a Math API). However, it doesn’t represent a realistic daily business task very well.
It’s therefore difficult to know how reliable agents tested solely on these benchmarks will be in a business setting. Think of the best mathematicians you know: would they also schedule meetings reliably?
How can agents' true capabilities be evaluated for the real business environment?
MindsDB is continuously assessing the true capabilities of agents in the workplace as the field evolves. In the next articles of the “Agents in the workplace” series, we’ll publish our internal benchmarks of agent performance across five key domains: email, calendars, website analytics, project management, and customer relationship management (CRM).
In line with our open-source ethos, we’ll also publish the benchmarking dataset, allowing the open-source community to continuously test their agents in a realistic business setting.
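For illustration, an evaluation harness over such a dataset could be as simple as the sketch below. The JSONL format with “task” and “ground_truth” fields is an assumption on our part; the real schema will be defined when the dataset is released:

```python
import json

def evaluate(agent, dataset_path: str) -> float:
    """Run an agent over a benchmark file and return exact-match accuracy."""
    correct = total = 0
    with open(dataset_path) as f:
        for line in f:
            # Assumed format: one JSON object per line with
            # "task" and "ground_truth" fields.
            example = json.loads(line)
            answer = agent(example["task"])  # run the agent on the task
            correct += int(answer.strip() == example["ground_truth"].strip())
            total += 1
    return correct / total

# Usage (hypothetical file name): evaluate(my_agent, "workplace_benchmark.jsonl")
```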
These state-of-the-art capabilities are built into MindsDB agents for businesses. Your employees can get started quickly with our agents, accessible via common chat interfaces such as Slack and Microsoft Teams.
Who is designing the evaluation for these agents?
Our team includes MindsDB engineers, AI PhDs, and associate professors. The MindsDB engineers are Dr. Olly Styles, Dr. Sam Miller, and Patricio Cerda Mardini. They’re supported by Associate Professors Tanaya Guha (Glasgow University) and Victor Sanchez Silva (Warwick University), and by Dr. Bertie Vidgen (Oxford University).
What’s next for MindsDB agents?
Our next blog post will feature results from our internal benchmarking across the five business domains. We’ll follow that up with the CTO’s Guide to Building Agents in the Workplace, along with a rigorous peer-reviewed study supporting our results.