Steve Sewell for Builder.io

Originally published at builder.io

Is OpenAI's o1 model a breakthrough or a bust?

OpenAI's latest model, o1, isn't getting nearly as much buzz as its predecessors GPT-3 and GPT-4. Let's dive into why that is and what it means for the future of AI.

AI progress is slowing down

Let's start with a real-world example. Here's a comparison between o1 and Claude converting a Figma design into code via Builder.io:

See full clip and details below

As you can see, o1 is significantly slower. It's also more expensive, and only sometimes better.
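
If you want a rough sense of how a comparison like this is set up, here's a minimal timing sketch using both providers' Python SDKs. The prompt is a stand-in, not Builder.io's actual Figma-to-code pipeline, and the model names are simply the API versions referenced later in this post.

```python
# Rough timing sketch: send the same prompt to Claude and o1 and compare
# latency and output size. The prompt is a placeholder, not Builder.io's
# real Figma-to-code pipeline. Requires OPENAI_API_KEY and ANTHROPIC_API_KEY.
import time

import anthropic
from openai import OpenAI

prompt = "Convert this design spec into a responsive HTML/CSS page: ..."

openai_client = OpenAI()
anthropic_client = anthropic.Anthropic()

start = time.perf_counter()
claude = anthropic_client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[{"role": "user", "content": prompt}],
)
print(f"Claude: {time.perf_counter() - start:.1f}s, "
      f"{claude.usage.output_tokens} output tokens")

start = time.perf_counter()
o1 = openai_client.chat.completions.create(
    model="o1-preview-2024-09-12",
    messages=[{"role": "user", "content": prompt}],
)
print(f"o1: {time.perf_counter() - start:.1f}s, "
      f"{o1.usage.completion_tokens} output tokens")
```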

So why is OpenAI investing so heavily in o1? It all comes down to this:

a graph modeling the timeline of various AI model releases and their capabilities

Source: Alex Albert

Each new large language model (LLM) that comes out seems to be only incrementally better than the last. In fact, it's expected that OpenAI's next model, Orion, isn't even always better than GPT-4.

Why? We're simply running out of data to train on.

LLMs are already trained on the data we have available

Source: Will we run out of data? Limits of LLM scaling based on human-generated data

The need for speed

If we're hitting limits on how smart we can make AI models, what's the next move? Making them faster.

logos of Microsoft, OpenAI, Google, Nvidia, AMD, and Amazon

AI players are investing in specialized hardware to speed up inference and lower costs. They're building their own data centers and even looking into nuclear power for more sustainable, cheaper energy.

Groq and Cerebras are accelerating inference

Companies like Groq and Cerebras have seen up to 10x performance increases with LLM-optimized hardware. This isn't just theoretical - Amazon's already released their new chips, and Apple is planning to use them.

Faster inference could open up new workflows that weren't feasible before due to long wait times and poor user experience.

The smartness vs. speed trade-off

But here's the million-dollar question: does being faster matter if AI can't get any smarter? Or to put it another way:

If the way we train models today isn't getting better, if AI is plateauing on intelligence, can we use increased speed and decreased cost to find another path to smarter AI outputs?

does faster equal smarter?

The answer might surprise you.

Let's borrow a concept from Daniel Kahneman. He talks about two systems of thinking:

  1. System 1: Fast, automatic thinking. Like knowing 3 + 4 = 7 without having to calculate it.
  2. System 2: Slower, more deliberate thinking. Like solving a complex math problem step by step.

three plus four equals what?

System 1 thinking is automatic.

thirty-nine plus forty-eight equals what?

System 2 thinking requires taking things step-by-step.

System 2 thinking is used to break down complicated problems into simpler steps.

For example, you likely don’t have the answer to 39 + 48 memorized the same way as you know the answer to 3 + 4.

Instead, you’d need to break the more complicated question down into steps like this:

breaking down 39 + 48 into steps to solve it, equaling 87
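
To make that decomposition concrete, here's a toy sketch of the same idea in code: the sum is split into tens and ones, which are easy "System 1" sub-problems, and then recombined.

```python
# Toy illustration of "System 2" decomposition: instead of recalling
# 39 + 48 directly, break it into tens and ones and combine the parts.
def add_step_by_step(a: int, b: int) -> int:
    tens = (a // 10 + b // 10) * 10   # 30 + 40 = 70
    ones = a % 10 + b % 10            # 9 + 8 = 17
    return tens + ones                # 70 + 17 = 87

print(add_step_by_step(39, 48))  # 87
```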

Current LLMs work a lot like System 1 thinking. They give you an answer fast, but as complexity increases, accuracy can suffer.

O1, on the other hand, is more like System 2 thinking. It breaks down complex problems into smaller, manageable steps. This approach can help with one of LLMs' biggest weaknesses: hallucinations.

o1 breaking down a math problem into steps to correctly solve for 892 + 847, which equals 1739

For example, if you ask o1 how many Rs are in "strawberry", it might first guess incorrectly. But then it'll go through the word letter by letter, count the Rs, and give you the right answer.

o1 regurgitating the wrong number of Rs in "strawberry" before correcting itself letter by letter

This step-by-step approach isn't entirely novel. Chain-of-thought techniques have been around for a while.

In fact, open-source models, like the QwQ model from Alibaba, are already beginning to use this approach with similar performance. What's new is training a model to use this approach specifically.

Alibaba's QwQ model has already caught up in using chain-of-thought techniques and performs similarly to OpenAI's and Anthropic's models.

Source: Alibaba
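
If you haven't used chain-of-thought prompting, here's a rough sketch of the difference using the strawberry example. This isn't o1's internal mechanism, just the prompting pattern it was trained to internalize; the model name and prompt wording here are illustrative.

```python
# Rough sketch of chain-of-thought prompting with an ordinary chat model.
# o1 bakes this behavior in via training; with other models you ask for it
# explicitly in the prompt. Model name and wording are illustrative.
from openai import OpenAI

client = OpenAI()

question = 'How many times does the letter "r" appear in "strawberry"?'

# Direct, "System 1"-style prompt: just ask for the answer.
direct = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

# Chain-of-thought, "System 2"-style prompt: ask for step-by-step work first.
cot = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": question + " Go through the word letter by letter, "
                   "keep a running count, and only then state the final answer.",
    }],
)

print(direct.choices[0].message.content)
print(cot.choices[0].message.content)
```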

As speeds increase and costs decrease, we might be able to afford this extra time to get better answers without hurting the user experience.

The downsides of o1

The problem is, o1 doesn't always give a better answer to every type of problem. But it is always more expensive and slower.

o1 is not the best model at solving coding problems

Source: Aider LLM Leaderboards

Currently, o1 Preview costs four times more per token than Claude Sonnet. It also outputs 2-10 times more tokens because of all the "thinking" it does.

This means an o1 output could cost up to 40 times more than Sonnet - and it's not always better.
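
The back-of-the-envelope math behind that estimate is simple: the 4x per-token price and the 2-10x token multiplier are the only inputs, and the same multiple drives the $800-per-month figure a bit further down.

```python
# Back-of-the-envelope math behind the "up to 40x" estimate.
price_ratio = 4        # o1 Preview costs ~4x more per token than Claude Sonnet
token_multiplier = 10  # o1 emits 2-10x more tokens; take the worst case

cost_ratio = price_ratio * token_multiplier
print(cost_ratio)       # 40 -> an o1 answer can cost ~40x a Sonnet answer

# Applied to a $20/month product, a 40x cost multiple implies ~$800/month.
print(20 * cost_ratio)  # 800
```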

The slowness is a big issue too. You often don't see any results from o1 for 10, 20, 30 seconds or more. That's a much worse user experience.

Conversion above done with Builder.io using claude-3-5-sonnet-20241022 and o1-preview-2024-09-12 (latest API-accessible model versions). I hand checked the results on ChatGPT with the latest o1 and didn't see any major differences in speed or accuracy.

The product in the video above costs $20 per month now. If the o1 model costs up to 40 times more, it doesn’t seem worth it to pay $800 a month for an offering that is slower and only marginally better, if at all.

something that is 40 times more expensive than a product that costs 20 dollars a month will cost 800 dollars a month

The potential of AI agents

So why is this still interesting? AI agents.

a picture of Agent Smith from The Matrix

We want to use AI to complete a series of tasks without constant human supervision.

For that to work, AI needs to be better at breaking things down and completing tasks step by step. It also needs to have a lower failure rate and catch its own mistakes sooner.

Traditional LLMs are great at completing the next word in a sentence, but they weren't trained to break down and execute tasks.

For instance, Claude’s computer use today only has a 15% success rate at accomplishing real-world tasks. O1 is showing us what happens when we do train for that.

Claude 3.5 Sonnet scored a 14.9% success rate in its ability to solve real world problems

Source: Anthropic
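
To see why training for step-by-step execution matters here, consider a minimal sketch of the plan-execute-check loop an agent runs. The three helpers below are stand-ins for model and tool calls, not part of any real framework.

```python
# Minimal sketch of the plan -> execute -> check loop an AI agent needs.
# The three helpers are stand-ins for LLM calls and tool use; a real agent
# would call a model or run a tool instead of returning canned values.

def plan_steps(goal: str) -> list[str]:
    # Stand-in: a real agent would ask the model to decompose the goal.
    return [f"step 1 of {goal!r}", f"step 2 of {goal!r}"]

def execute(step: str) -> str:
    # Stand-in: a real agent would run a tool or model call here.
    return f"result of {step}"

def looks_correct(step: str, output: str) -> bool:
    # Stand-in: a real agent would ask the model (or a test) to verify the output.
    return output.startswith("result")

def run_agent(goal: str, max_retries: int = 2) -> list[str]:
    results = []
    for step in plan_steps(goal):
        for _ in range(max_retries + 1):
            output = execute(step)
            if looks_correct(step, output):   # catch mistakes before moving on
                results.append(output)
                break
        else:
            raise RuntimeError(f"gave up on step: {step}")
    return results

print(run_agent("convert this Figma design into code"))
```

The lower the per-step failure rate and the sooner the agent catches its own mistakes, the longer a chain of steps it can complete without a human stepping in.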

The big questions

How much better can this get with novel training methods? Will this lead to new breakthroughs, or are we heading for an AI bubble burst?

a graph of the rug pull from the Hawk Tuah cryptocurrency

The average person's life hasn't changed much despite all the AI hype. If AI models aren't getting that much better, and these new techniques don't lead to major breakthroughs, some of the hype might start to cool off.

And you know what? That might not be the worst thing.

Software hype cycle chart

Lessons from the dot-com bubble

As the dot-com bubble peaked in 2000, people invested in the web as if its major challenges had already been solved. There was widespread excitement about the web's potential for explosive growth and profitability.

However, that excitement deflated when companies burned through hundreds of millions in venture capital and failed to become profitable. Only a few companies survived, and they now serve as models for how online businesses can succeed.

four companies and their founding years: Google in 1998, eBay in 1995, Amazon in 1994, and PayPal in 1998

Today, we're seeing a similar pattern with AI—growing excitement and venture capital flowing freely. The enthusiasm mirrors what we saw during the dot-com bubble's growth.

But there will be hard problems. Progress will come in S-curves. While AI could solve an immense number of problems, it might not happen today. Some AI applications might not be effective enough for mass adoption for another decade.

Remember Webvan and Pets.com? They were viable internet businesses - just 10 years too early.

Webvan became Instacart in 2012, and Pets.com became Chewy in 2011

Before you hit me with the "but this time it's different, man" – don't forget that that's what people say every time.

What's working now

While the future is uncertain, some AI use cases are already working well:

  1. AI chat assistants for brainstorming, researching, writing, and editing
  2. AI-assisted coding, whether through specialized IDEs, Copilot, full application builders, or tools that convert designs into high-quality code

These categories are seeing major adoption with generally happy users.

a list of AI products and their primary uses

The bottom line

Maybe AI agents will change everything as soon as next year. Or maybe it'll take longer. We just don't know yet.

What we do know is that some AI applications are already proving their worth. Those are the ones I'm watching closely, and I'm excited to see what comes next.
