Lina Lam
What is LLM Observability and Monitoring?

Building well with LLMs in production is incredibly difficult. You have probably heard the term LLM Observability. But what is it? How does it differ from traditional observability? What is being observed? Our team at Helicone AI has the answers.


The TL;DR

LLM Observability is complete visibility into every layer of an LLM-based software system - the application, the prompt, and the response. LLM Observability goes hand-in-hand with LLM Monitoring. While monitoring tracks application performance metrics, observability is more investigative.

| | LLM Observability | LLM Monitoring |
| --- | --- | --- |
| Purpose | Event logging | Collect metrics |
| Key Aspects | Trace the flow of requests to understand system dependencies and interactions | Track application performance metrics, such as usage, cost, latency, and error rates |
| Example | Correlate different types of data to understand issues and complex behaviours | Set up thresholds for unexpected behaviors |
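To make the monitoring column concrete, here is a minimal sketch of threshold-based alerting on per-request metrics. The metric names and limits are hypothetical, not taken from any particular tool:

```python
# Minimal sketch of threshold-based monitoring for LLM requests.
# Metric names and limits are illustrative, not from any specific tool.

THRESHOLDS = {
    "latency_ms": 5000,   # alert if a request takes longer than 5s
    "cost_usd": 0.50,     # alert if a single request costs more than $0.50
    "error_rate": 0.05,   # alert if more than 5% of recent requests failed
}

def check_thresholds(metrics: dict) -> list[str]:
    """Return an alert message for every metric that exceeds its threshold."""
    alerts = []
    for name, limit in THRESHOLDS.items():
        value = metrics.get(name)
        if value is not None and value > limit:
            alerts.append(f"{name}={value} exceeds threshold {limit}")
    return alerts

# Example: one slow but cheap, mostly healthy request
alerts = check_thresholds({"latency_ms": 7200, "cost_usd": 0.12, "error_rate": 0.01})
```

In a real system these checks would run over aggregated metrics and page someone; the point is only that monitoring is about comparing known metrics to known limits.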

What's the difference between LLM and Traditional Observability?

Traditional development is typically transactional. Developers observe how the application handles an HTTP request/response, a database query, or a published message. In contrast, LLMs are much more complex.

Here's a comparison of the logs:

| Traditional | LLMs |
| --- | --- |
| Simple, isolated interactions | Indefinitely nested interactions, creating a complex tree structure |
| Clear start and end points | Encompass multiple interactions |
| Small body size (low KBs of data) | Massive payloads (potentially GBs) |
| Predictable behavior (easy to evaluate) | Unpredictable behavior (difficult to evaluate) |
| Primarily text-based logs and numerical metrics | Multi-modal data (text, image, audio, video) |
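The "complex tree structure" above is usually modeled as spans nesting inside a trace. A toy sketch (the span model here is illustrative, not any vendor's schema):

```python
from dataclasses import dataclass, field

# Toy model of a trace: each span can contain child spans, so an
# agentic workflow becomes a tree rather than a flat request/response log.
@dataclass
class Span:
    name: str
    children: list["Span"] = field(default_factory=list)

    def total_spans(self) -> int:
        """Count this span plus all nested child spans."""
        return 1 + sum(child.total_spans() for child in self.children)

trace = Span("handle_user_question", children=[
    Span("retrieve_documents"),
    Span("llm_draft_answer", children=[
        Span("llm_self_critique"),  # e.g. a Reflexion-style check
    ]),
])
```

A traditional request would be a single span; an LLM workflow can nest indefinitely, which is why trace-level visibility matters.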

Issues with LLMs

Hallucination: An LLM's objective is to predict the next tokens, not to be accurate. This means that responses are not necessarily grounded in facts.

Complex use cases: LLM-based software systems require an increasing number of LLM calls to execute a complex task (e.g. an agentic workflow). Reflexion is a technique engineers use to get LLMs to analyze their own results, but it means making multiple calls across multiple spans just to check for hallucinations.

Proprietary data: Managing proprietary data is tricky. You need it to answer specific customer questions, but it can accidentally find its way into the responses.

Quality of response: Is the response in the wrong tone? Is the amount of detail appropriate for your users' ask?

Cost (the big elephant in the room): As usage grows and your LLM setup becomes more complicated (e.g. adding Reflexion), costs can add up quickly.

Third-party models: Their APIs can change, and new models and guardrails can be added, causing your LLM app to behave differently than before.

Limited competitive advantage: LLMs are hard to train and maintain. Chances are that you are using the same model as your competitor. Your differentiator becomes your prompt engineering and proprietary data.
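To illustrate the Reflexion pattern mentioned above: the model drafts an answer, a second call critiques it, and a third revises it if needed. `call_llm` is a hypothetical stand-in for a real model client, stubbed out here so the sketch runs on its own:

```python
# Sketch of a Reflexion-style loop. `call_llm` is a hypothetical
# placeholder for a real model call (e.g. an OpenAI or Anthropic client).
def call_llm(prompt: str) -> str:
    # Stubbed out for illustration; a real implementation hits an API.
    return f"<response to: {prompt[:30]}...>"

def answer_with_reflexion(question: str, max_revisions: int = 2) -> str:
    """Draft an answer, then critique and revise it up to max_revisions times."""
    answer = call_llm(f"Answer the question: {question}")
    for _ in range(max_revisions):
        critique = call_llm(f"Critique this answer for factual errors: {answer}")
        if "no issues" in critique.lower():
            break  # the critic is satisfied; stop spending tokens
        answer = call_llm(f"Revise the answer using this critique: {critique}")
    return answer
```

Note that every revision round adds two more LLM calls, which is exactly why the cost and tracing concerns above compound in Reflexion-style setups.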


What LLM Observability Tools Have In Common

Developers working on LLM applications need effective tools to understand and address bugs and exceptions, and to prevent regressions. They require unique visibility into the functioning of these applications, including:

  • Real-time monitoring of AI models
  • Detailed error tracking and reporting
  • Insights into user interactions and feedback
  • Performance metrics and trend analysis
  • Multi-metric correlations
  • Tools for prompt iterations and experimentation

Further reading

Arize AI created a very in-depth read about the Five Pillars of LLM Observability, covering common use cases and issues with LLM apps, the importance of LLM observability, and the five pillars (evaluation, traces and spans, retrieval augmented generation, fine-tuning, prompt engineering) crucial for making your application reliable.


The author

Aparna Dhinakaran is the Co-Founder and Chief Product Officer at Arize AI, a leader in machine learning observability. She is recognized in Forbes 30 Under 30 and led ML engineering at Uber, Apple, and TubeMogul (Adobe).


What we've learned

At Helicone AI, we've seen the complexities of productizing LLMs first-hand. Effective observability is key to navigating these challenges, and we strive to help our customers produce reliable and high-quality LLM applications, making the observability process easier and faster.

What are your thoughts?