LLMs (Large Language Models) have tremendous potential to enable new types of AI applications. However, turning simple prototypes into robust, production-ready applications remains challenging.
We've been supporting dozens of companies in bringing applications to production, and we're excited to share our learnings with you.
When building LLM applications for production use, certain capabilities rise above the rest in importance.
Carefully crafted prompts are key to achieving reliable performance from LLMs. Think about:
- Where and how to store your prompts for quick iteration.
- Ability to experiment with prompts (A/B testing, user segmentation).
- Collaboration: non-technical stakeholders can contribute immensely to prompt engineering.
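To make the idea concrete, here is a minimal sketch of a prompt registry with versioning and deterministic A/B assignment. All names here (`PromptRegistry`, `ab_pick`) are hypothetical, not any specific library's API:

```python
import hashlib

# Hypothetical prompt registry: versioned templates plus a deterministic
# A/B split so the same user always sees the same variant.
class PromptRegistry:
    def __init__(self):
        self._prompts = {}  # name -> {version: template}

    def register(self, name, version, template):
        self._prompts.setdefault(name, {})[version] = template

    def get(self, name, version):
        return self._prompts[name][version]

    def ab_pick(self, name, user_id, split=0.5):
        """Hash the user id into a bucket to pick variant A or B."""
        versions = sorted(self._prompts[name])
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return versions[0] if bucket < split * 100 else versions[-1]

registry = PromptRegistry()
registry.register("greet", "v1", "Hello {name}, how can I help?")
registry.register("greet", "v2", "Hi {name}! What can I do for you today?")

version = registry.ab_pick("greet", user_id="user-123")
prompt = registry.get("greet", version).format(name="Ada")
```

Because the split is hash-based rather than random, you can compare conversion rates per variant without storing per-user assignments.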
We've all been there: we designed prompts that worked well with test data, then went live and disaster struck. Poorly handled LLM calls lead to long waiting times, inappropriate responses, missing context and more. That translates to a bad user experience, and can hurt your brand and increase churn.
It's important to carefully think about the observability and monitoring aspects of your LLM operations, and have the ability to quickly identify issues and troubleshoot them. Think about tracing, the ability to track an entire conversation, replay it and improve it over time. Consider anomaly detection as well as emerging trends.
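A minimal sketch of what such tracing can look like, assuming an in-memory trace store (in production this would be a database or an observability backend) and a stubbed model function:

```python
import time
import uuid

# Every call in a conversation shares a trace id, so the whole exchange
# can be reconstructed and replayed later. In-memory store for illustration.
TRACES = []

def traced_llm_call(trace_id, prompt, llm_fn):
    start = time.time()
    response = llm_fn(prompt)
    TRACES.append({
        "trace_id": trace_id,
        "prompt": prompt,
        "response": response,
        "latency_ms": (time.time() - start) * 1000,
    })
    return response

def replay(trace_id, llm_fn):
    """Re-run every prompt from a past conversation, e.g. after a prompt fix."""
    prompts = [t["prompt"] for t in TRACES if t["trace_id"] == trace_id]
    return [llm_fn(p) for p in prompts]

# Usage with a stubbed model:
tid = str(uuid.uuid4())
traced_llm_call(tid, "Summarize our refund policy.", lambda p: "stubbed answer")
```

Storing latency alongside each call also gives you the raw data for anomaly detection, such as alerting when p95 latency drifts.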
It's important to know what "good" looks like. Having the ability to mark good LLM responses (e.g. ones that led to a conversion) versus bad ones (e.g. ones that led to churn) will really pay off in the long run.
LLM API costs can quickly spiral out of control, so it's important to be prepared and budget accordingly. In fact, we've seen cases where a trivial parameter change increased costs by 25% overnight.
Granular tracking of API usage and billing helps identify expensive calls. With detailed visibility into LLM costs, you can set custom budgets and alerts to proactively manage spend. By analyzing logs and performance data, expensive queries using excessive tokens can be identified and reworked to be more efficient. With rigorous cost management tools, LLM costs can be predictable and optimized.
You'll often find yourself chaining multiple calls, so think about which models you use at each step. Do you really need GPT-4 for everything? Where possible, reserve GPT-4 for scoring, labeling, and classification calls, where the output is short; this alone can save plenty of money. When you need to generate long responses, GPT-3.5-Turbo may be the more cost-effective choice.
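That routing rule can be expressed as a tiny function. The task labels and the 50-token threshold are assumptions for the sake of the example:

```python
# Task-based model routing: reserve the expensive model for short
# classification/scoring outputs, use the cheaper model for long generations.
def pick_model(task, expected_output_tokens):
    short_tasks = {"classify", "score", "label"}
    if task in short_tasks and expected_output_tokens <= 50:
        return "gpt-4"           # short output: premium model stays affordable
    return "gpt-3.5-turbo"       # long generations: cheaper model

model = pick_model("classify", expected_output_tokens=10)
```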
Rigorous evaluation using datasets and metrics is key for reliability when building LLM applications. With a centralized dataset store, relevant datasets can be easily logged from real application queries and used to frequently evaluate production models. Built-in integration with open source evaluation libraries makes it simple to assess critical metrics like accuracy, response consistency, and more.
Evaluation frameworks help you efficiently validate new prompts, chains, and workflows before deploying them to production. Ongoing evaluation using real user data helps identify areas for improvement and ensures optimal performance over time.
Evaluation doesn't have to be complicated. You can sample a percentage of your LLM responses and run them through another, simpler scoring prompt. Over time, this yields valuable data.
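Here is a sketch of that sampling approach, using a second LLM call as a "judge". The scoring scale and the `llm` callable are assumptions; the judge below is stubbed:

```python
import random

# Score only a fraction of responses with a cheap judge prompt;
# the 1-5 scale is an arbitrary choice for illustration.
SCORING_PROMPT = (
    "Rate the following answer from 1 (poor) to 5 (excellent). "
    "Reply with the number only.\n\nQuestion: {q}\n\nAnswer: {a}"
)

def maybe_score(question, answer, llm, sample_rate=0.05, rng=random):
    """Score roughly `sample_rate` of responses; return None for the rest."""
    if rng.random() >= sample_rate:
        return None
    reply = llm(SCORING_PROMPT.format(q=question, a=answer))
    return int(reply.strip())

# Usage with a stubbed judge model that always answers "4":
score = maybe_score("What is LLMOps?", "Ops for LLM apps.",
                    llm=lambda p: "4", sample_rate=1.0)
```

A 5% sample rate keeps the evaluation cost negligible while still accumulating a meaningful quality signal over time.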
There's a limit to what you can do with off-the-shelf models. If using LLMs becomes an important aspect of your operations, you'll likely resort to fine-tuning at some point.
With integrated data pipelines, real user queries can be efficiently logged and processed into clean training datasets. These datasets empower ongoing learning: models can be fine-tuned to better handle terminology and scenarios unique to your business use case.
Invest in tooling to generate datasets and fine-tune models early to ensure LLMs deliver maximum value by keeping them closely aligned with evolving business needs.
Apart from yielding better results, fine-tuning can dramatically reduce costs. For example, you can fine-tune gpt-3.5-turbo on data produced by GPT-4 or other capable (and expensive) models.
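A minimal sketch of turning logged production calls into a fine-tuning dataset, using the JSONL chat format that OpenAI's fine-tuning API expects. The `logged_calls` list stands in for your own log storage, and the system message is a placeholder:

```python
import json

# Example logged calls; in practice these come from your trace/log store.
logged_calls = [
    {"prompt": "Explain our refund policy.",
     "response": "Refunds are available within 30 days of purchase..."},
]

def to_finetune_jsonl(calls, path, system="You are a helpful support agent."):
    """Write logged prompt/response pairs as chat-format fine-tuning records."""
    with open(path, "w") as f:
        for call in calls:
            record = {"messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": call["prompt"]},
                {"role": "assistant", "content": call["response"]},
            ]}
            f.write(json.dumps(record) + "\n")

to_finetune_jsonl(logged_calls, "train.jsonl")
```

Filtering the logs to only the responses you've marked as "good" (see the evaluation section) is what keeps the resulting dataset clean.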
Besides the pillars mentioned above, there are a few more concepts you need to consider when building production-grade, LLM-powered applications:
- Performance: Depending on your application, it might be crucial to optimize for fast response times and minimal latency. Make sure to design your prompt chains for maximum throughput.
- Multi-Model Support: If you use multiple LLMs like GPT-3.5, GPT-4, Claude, LLaMA-2, consider how you consume these. Adopting a unified, abstracted way to consume various models will make your application more maintainable as you scale.
- User Feedback: Understanding how real users interact with your LLMs is invaluable for guiding improvements. Make sure to capture real usage data and feedback so you can improve the user experience over time.
- Enterprise Readiness: Depending on your target market, enterprise-grade capabilities might be important. Think about fine-grained access controls and permissions, predictability and reliability SLAs, data security, privacy, and compliance assurance, automated testing and validation frameworks to ensure reliability, and more.
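On the multi-model point above, a unified interface can be as simple as a small abstract base class. The provider classes below are stubs; real implementations would wrap the OpenAI, Anthropic, or local-model clients:

```python
from abc import ABC, abstractmethod

class ChatModel(ABC):
    """Vendor-agnostic interface the application codes against."""
    @abstractmethod
    def complete(self, prompt: str) -> str: ...

class OpenAIChat(ChatModel):
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str) -> str:
        return f"[{self.model}] stubbed response"  # real SDK call goes here

class AnthropicChat(ChatModel):
    def __init__(self, model: str):
        self.model = model
    def complete(self, prompt: str) -> str:
        return f"[{self.model}] stubbed response"  # real SDK call goes here

MODELS: dict[str, ChatModel] = {
    "gpt-4": OpenAIChat("gpt-4"),
    "claude": AnthropicChat("claude-2"),
}

def ask(model_name: str, prompt: str) -> str:
    return MODELS[model_name].complete(prompt)
```

Swapping or adding a provider then means registering one new class, with no changes to application code.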
Pezzo is an open-source (Apache 2.0) LLMOps platform. It addresses prompt management, versioning, instant delivery, A/B testing, fine-tuning, observability, monitoring, evaluation, collaboration and more.
Regardless of where you're at in your LLM adoption journey, consider giving Pezzo a try. It takes about a minute to integrate, and it will pay dividends as you scale.
If you’d like to learn more about Pezzo: