AIRabbit

The Hidden Costs of Sticking to OpenAI (and What to Do Instead)

In the beginning, there was OpenAI… then came the flood. :)

Over the past two years, the number of available LLMs has grown dramatically, each with its own strengths and weaknesses. Yet many companies still rely heavily on OpenAI, whether out of convenience or a lack of familiarity with the alternatives.

In some cases, OpenAI is a good choice. In many of the use cases I've observed, however, it might not be ideal, especially when speed is crucial to your application. Speed here means how quickly the response is generated, usually measured in output tokens per second.

Disclaimer: I am not recommending any specific provider or model in this post. Instead, I want to illustrate how comparing models can make a significant difference when you care about quality, price, and speed. These factors really matter once you go into production.


Comparing Models and Providers

Before considering any alternatives, it’s essential to:

  1. Look at leaderboards and benchmark metrics for the models themselves.
  2. Compare performance across different providers.

Even the same model can behave differently depending on the provider. For example, a Llama model on Fireworks may have drastically different response times compared to the same model on Together.
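
A quick way to verify this yourself is to time the same prompt against each provider and compute output tokens per second. Here is a minimal sketch, assuming the openai Python SDK and the OpenAI-compatible endpoints that Together and Fireworks document (the base URLs and model IDs below are assumptions and may change, so check each provider's docs):

```python
import os
import time
from openai import OpenAI

# Assumed OpenAI-compatible endpoints and model IDs -- verify against
# each provider's documentation before relying on them.
PROVIDERS = {
    "together":  ("https://api.together.xyz/v1",
                  "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
                  "TOGETHER_API_KEY"),
    "fireworks": ("https://api.fireworks.ai/inference/v1",
                  "accounts/fireworks/models/llama-v3p1-8b-instruct",
                  "FIREWORKS_API_KEY"),
}

prompt = "Summarize the trade-offs between LLM providers in two sentences."

for name, (base_url, model, key_env) in PROVIDERS.items():
    client = OpenAI(base_url=base_url, api_key=os.environ[key_env])
    start = time.perf_counter()
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    elapsed = time.perf_counter() - start
    tokens = resp.usage.completion_tokens
    print(f"{name}: {tokens} tokens in {elapsed:.2f}s "
          f"({tokens / elapsed:.0f} tokens/s)")
```

Run it a few times: single requests are noisy, and leaderboard numbers are averages.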

Benchmark Example:

https://artificialanalysis.ai/models/llama-3-2-instruct-1b/providers


Concrete Example: Summarizing API

Below is a real-world scenario. We built a summarizing API (it does a bit more than just summarizing, but that’s not crucial here). The key metric is the speed-to-cost ratio. Initially, we used OpenAI as the provider.

When provider = openai is selected, the API automatically chooses gpt-4o-mini, the fastest OpenAI model (the Azure-hosted version may be slightly faster).
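
Under the hood, that flag is essentially a lookup from provider name to default model. A simplified, illustrative sketch (not our exact production code):

```python
# Illustrative provider -> default-model mapping (not the exact
# production code); "openai" resolves to gpt-4o-mini.
DEFAULT_MODELS = {
    "openai": "gpt-4o-mini",
    "together": "meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
}

def resolve_model(provider: str) -> str:
    return DEFAULT_MODELS[provider]
```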

Running a summary on one of my blog posts took about 4.5 seconds.
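
For reference, the timing was measured roughly like this (a minimal sketch with the openai Python SDK; post_text is a stand-in for the actual blog post content):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
post_text = open("blog_post.md").read()  # stand-in for the real input

start = time.perf_counter()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Summarize the following blog post:\n\n{post_text}"}],
)
elapsed = time.perf_counter() - start
print(f"{elapsed:.1f}s, "
      f"{resp.usage.completion_tokens / elapsed:.0f} tokens/s")
```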


Switching Providers: Llama on Together.ai

Next, we switched to Llama 3.1 8B on Together.ai. The difference was significant, even though the model generated more output tokens.
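
Because Together exposes an OpenAI-compatible API, the switch amounts to changing the base URL and the model name; the rest of the client code stays the same. A sketch (the model ID follows Together's current naming and may differ):

```python
import os
from openai import OpenAI

# Same client code as before, just pointed at Together's
# OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)
resp = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Summarize this post: ..."}],
)
print(resp.choices[0].message.content)
```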

After multiple tests, the speed difference remained huge: almost a 3x improvement over OpenAI. And it wasn't just Together; Fireworks.ai showed similar improvements.

Note: Different models on the same provider can also yield different results. On Together, the 8B model ran at 240–270 tokens/second, more than double the throughput of gpt-4o-mini.

Read more on my blog.
