Sunil Kumar Dash for Composio

Posted on • Originally published at composio.dev

I tested OpenAI's o1-preview for hours—You won't believe what I found 🤯

TL;DR

o1-preview is a genuine leap in reasoning and mathematics, merely decent at coding, and behind GPT-4o and Sonnet 3.5 in creative writing. The details are below.

Unless you've been living under a rock, you've probably heard about OpenAI's bombshell release: o1-preview.

As an AI enthusiast and a member of Composio, an AI start-up whose bread and butter is AI models, I spent hours putting the model through its paces with countless prompts to see how good it really is.

Does it live up to the hype, or is it just another overhyped release? Let's find out.

I tested the model on:

  • Reasoning
  • Math
  • Coding
  • Creative Writing

Here’s my detailed analysis of how it performed.

Kowalski analysis GIF


What is o1-preview?

o1-preview is the first release in OpenAI's new o1 series of models, which “think” and “analyse” before responding to queries.

The model has reportedly aced IOI (International Olympiad in Informatics) and Codeforces problems and has been measured at an IQ of 120 (on the Norway Mensa test).

o1-preview mensa IQ test

The model's “thinking” comes from reinforcement learning over massive chain-of-thought datasets, which teaches it to work through intermediate reasoning steps before committing to an answer.


Some words on the native CoT of o1-preview

Before getting into the tests, here is my overall impression of o1's native chain-of-thought capability:

  • Significantly improved reasoning capability.
  • Sometimes it completes the thinking process but fails to output an answer; I am not sure whether this is a model issue. A few times it answered correctly even with somewhat inconsistent CoT traces. Again, I am unsure whether this is expected, and there is no way to tell from the outside.
  • The reinforcement learning on CoT works and doesn’t get stuck or collapse, which is a significant improvement.

Reasoning

Logical reasoning is where o1-preview clearly outperforms existing models, thanks to its native chain-of-thought process. It is the entire selling point of this model, and it doesn't disappoint: it blows the competition out of the water.

Let’s start with a fundamental reasoning question.

#1. Counting words in the response

Prompt: How many words are there in your response to this prompt?

I have tried the same prompt with GPT-4o and Sonnet 3.5, but both failed to get it right. However, o1-preview answered correctly on the first attempt.

o1-preview words in its response

Let’s give it another task.

Prompt: How many letters are there in your response to this prompt?

o1-preview letters in the response

It nailed it perfectly without any help from tools. Unlike earlier models, it didn’t mistake special characters and numbers for letters.

Let’s take it further.

Prompt: What’s the fourth word in your response to this prompt?

o1-preview 4th word in the response

The o1-preview model got it right on the first attempt, while GPT-4o and Sonnet failed to do this even with tool use.

#2. Counting the number of letters ‘r’ in ‘Strawberry’

This is an interesting test: none of the earlier models except Sonnet 3.5 could get it right on the first attempt, and much of the excitement around this model was about its ability to do so.

Prompt: How many letters ‘r’ are in the word ‘strawberry’?

Not to our surprise, the model got it correct on the first try.

o1-preview strawberry response

You can observe the entire sequence of thoughts the model goes through to get to the answer.
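For reference, the ground truth here is a one-liner:

```python
# Ground truth for the strawberry test: count occurrences of 'r'.
r_count = "strawberry".count("r")
print(r_count)  # 3
```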

This is getting exciting, so let’s raise the difficulty of the questions.

#3. What is C Doing in the Room?

This time, I asked a bit more of a tricky reasoning question.

Prompt: Five people (A, B, C, D, and E) are in a room. A is watching TV with B, D is sleeping, B is eating chow min, and E is playing table tennis. Suddenly, a call comes on the telephone. B goes out of the room to pick up the call. What is C doing?

GPT-4o completely fell flat on its face. Here is its response.

GPT-4o response

Now, let’s see o1-preview’s response.

o1-preview reasoning

This is great! The model grasped the nuance of the problem and answered the question successfully.

#4. Who died in the car accident?

Let’s see another reasoning task.

Prompt: A woman and her son are in a car accident. The woman is sadly killed. The boy is rushed to the hospital. When the doctor sees the boy, he says, “I can’t operate on this child; he is my son!” How is this possible?

This is a classic reasoning problem. Surprisingly, the model could not get it right.

o1-preview woman, son, and doctor problem

Well, the response doesn’t make sense. We still have a long way to go before we reach AGI.

#5. Farmer and Sheep Problem

It’s a simple problem that could easily knock any LLM off its feet. Let’s see how the o1-preview fares.

Prompt: A farmer stands with the sheep on one side of the river. A boat can carry only a single person and an animal. How can the farmer get himself and the sheep to the other side of the river with minimum trips?

This was the first model I have seen nail this problem zero-shot, without human assistance.

o1-preview farmer-sheep

Summary on reasoning

The o1 family is undoubtedly a significant update over the previous models. It does a great job on complex reasoning, though there is still room for improvement. Strictly speaking, it does not think; it natively performs a chain of thought that improves its reasoning.


Mathematics

Considering its strength in reasoning, I was almost convinced it would slice through math, and it didn't disappoint.

Let’s throw some math questions at it. These are questions no other model could answer correctly.

#1. Find the possible 4th vertex of a parallelogram

Let’s start with an algebra problem.

Prompt: If three corners of a parallelogram are (1,1), (4,2), and (1,3), what are all the possible fourth corners?

This is a question from Gilbert Strang’s Linear Algebra. Courtesy @allgarbled on Twitter.

The model found two of the possible vertices on the first attempt; when nudged, it listed all three correctly.

o1-preview algebra
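For reference, the three candidates come from treating each pair of the given points as a diagonal: if P and Q are opposite corners, the fourth vertex is P + Q minus the remaining point. A quick sketch of that check:

```python
# Given three corners of a parallelogram, each pair can serve as a diagonal,
# so a candidate fourth corner is the sum of one pair minus the third point.
A, B, C = (1, 1), (4, 2), (1, 3)

candidates = {
    (A[0] + B[0] - C[0], A[1] + B[1] - C[1]),  # C opposite the new vertex
    (A[0] + C[0] - B[0], A[1] + C[1] - B[1]),  # B opposite the new vertex
    (B[0] + C[0] - A[0], B[1] + C[1] - A[1]),  # A opposite the new vertex
}
print(sorted(candidates))  # [(-2, 2), (4, 0), (4, 4)]
```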

#2. Finding the sum of integers

Let’s try another math question. This time, it is number theory.

Prompt: The greatest common divisor of two positive integers less than 100 equals 3. Their least common multiple is twelve times one of the integers. What is the largest possible sum of the two integers?

This is a math-benchmark problem that other models usually fail to get right.

However, o1-preview reached the correct answer in a single attempt, taking 37 seconds. It is scarily good at this.

o1-preview number system
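The answer is easy to confirm with a brute-force search over all pairs below 100 (a verification sketch, not the model's method):

```python
from math import gcd, lcm  # math.lcm requires Python 3.9+

# Brute force: gcd must equal 3, and the lcm must be twelve
# times one of the two integers; maximize the pair's sum.
best = max(
    a + b
    for a in range(1, 100)
    for b in range(a, 100)
    if gcd(a, b) == 3 and lcm(a, b) in (12 * a, 12 * b)
)
print(best)  # 129 (the pair is 36 and 93)
```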

#3. Trigonometry Problem

Let’s take another example; this time, it is a trigonometric problem.

Prompt: A car is being driven towards the base of a vertical tower in a straight line and at a uniform speed. The top of the tower is observed from the car, and in the process, the elevation angle changes from 45° to 60°. How long will this car take to reach the tower’s base?

All the previous models needed help to answer this problem correctly. o1-preview answered it correctly on its own.

o1-preview trigonometry

#4. Riemann Hypothesis

I didn’t know what the Riemann hypothesis was before this. Many people asked the model to solve the hypothesis, which remains unsolved to this day, and the model, as expected, did not attempt to solve it.

We are still far from that point, but we will get there eventually.

riemann hypothesis o1-preview

Summary on Maths

This is a significant improvement over previous LLMs. It correctly answers many difficult questions, though it sometimes needs a human nudge to reach the complete answer, which is fair.


Coding

Let’s now vibe-check the coding abilities of o1-preview.

To test the model, let’s take the “Super Heroes” problem, a relatively tricky dynamic-programming problem from recent competitive-programming contests.

This problem is unlikely to be contaminated; that is, the model is unlikely to have seen it during training.

The problem

You are given a 0-indexed integer array nums representing the strength of some heroes. The power of a group of heroes is defined as follows:

Let i0, i1, ..., ik be the indices of the heroes in a group. Then, the power of this group is max(nums[i0], nums[i1], ..., nums[ik])^2 * min(nums[i0], nums[i1], ..., nums[ik]).
Return the sum of the power of all non-empty groups of heroes possible. Since the sum could be very large, return it modulo 10^9 + 7.



Example 1:

Input: nums = [2,1,4]
Output: 141
Explanation: 
1st group: [2] has power = 2^2 * 2 = 8.
2nd group: [1] has power = 1^2 * 1 = 1.
3rd group: [4] has power = 4^2 * 4 = 64.
4th group: [2,1] has power = 2^2 * 1 = 4.
5th group: [2,4] has power = 4^2 * 2 = 32.
6th group: [1,4] has power = 4^2 * 1 = 16.
7th group: [2,1,4] has power = 4^2 * 1 = 16.
The sum of powers of all groups is 8 + 1 + 64 + 4 + 32 + 16 + 16 = 141.

Example 2:

Input: nums = [1,1,1]
Output: 7
Explanation: A total of 7 groups are possible, and the power of each group will be 1. Therefore, the sum of the powers of all groups is 7.


Constraints:

1 <= nums.length <= 10^5
1 <= nums[i] <= 10^9


This is one of the problems that only Claude 3.5 Sonnet could solve before. The o1-preview model took 91 seconds to arrive at a solution that could not get past even the first test case. I tried twice, with no luck.

Leetcode super heroes problem

This is the output code.


class Solution:
    def sumOfPower(self, nums):
        MOD = 10**9 + 7
        nums.sort()
        n = len(nums)

        # Precompute powers of 2 modulo MOD
        pow2 = [1] * n
        for i in range(1, n):
            pow2[i] = (pow2[i - 1] * 2) % MOD 

        total_sum = 0
        prefix_sum = 0
        for i in range(n):
            # Contribution of nums[i] as both min and max
            min_max_contrib = nums[i] * nums[i] % MOD
            if i > 0:
                # Update prefix sum
                prefix_sum = (2 * prefix_sum + nums[i - 1]) % MOD
                total_contrib = nums[i] * min_max_contrib * prefix_sum % MOD
                total_sum = (total_sum + total_contrib) % MOD
            else:
                total_sum = (total_sum + min_max_contrib * nums[i]) % MOD

        return total_sum % MOD
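For comparison, here is a sketch of a standard accepted approach (my own, not the model's output or an official solution): sort the array, treat each element as the group maximum, and maintain a weighted prefix sum of possible minimums, where each earlier element's weight doubles as more subsets become available between it and the current maximum.

```python
class Solution:
    def sumOfPower(self, nums):
        MOD = 10**9 + 7
        nums.sort()
        total = 0
        prefix = 0  # sum of nums[j] * 2^(i-1-j) over j < i: weighted minimums
        for x in nums:
            # x is the maximum of every group counted here; (x + prefix)
            # covers the singleton group {x} plus all groups whose min is smaller.
            total = (total + x * x % MOD * (x + prefix)) % MOD
            prefix = (prefix * 2 + x) % MOD
        return total

print(Solution().sumOfPower([2, 1, 4]))  # 141
```

This runs in O(n log n) and passes both examples above, which is exactly where the generated code tripped: it multiplied by nums[i]^3 instead of nums[i]^2 * (nums[i] + prefix) and dropped the singleton groups.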

However, o1-mini is a model better tuned for coding. And in some instances, o1-preview did code better than GPT-4o and Sonnet 3.5.

For instance, this GitHub PR in tinygrad was written by o1-preview, as George Hotz mentioned, which is a big feat.

Summary of Coding Capabilities

In my testing, it is okay at coding but not the best. It codes reasonably well on many topics but struggles with many others. Considering the quality-versus-inference-cost trade-off, Sonnet 3.5 might still be the better choice. However, I expect the full o1 model to be much better at this than the preview.


Creative Writing

This model was not marketed as a writer. I have had a decent experience with GPT-4o in creative writing, but let’s see how it fares.

I used a simple prompt.

Prompt: Write a story with a twist about a college guy’s routine life, which turned upside down when he encountered a mysterious woman.

While the story was good, it still retains the GPT-speak; it doesn't read as if a human wrote it.

o1-preview creative writing

Compared to this, I liked GPT-4o’s twist more.

GPT4o creative writing

Summary of Creative Writing Capabilities

I didn’t like the o1-preview’s responses to creative writing tasks in my limited tests. Sonnet 3.5 and GPT-4o are better than o1-preview.


Final Verdict

Final verdict doc GIF

This is a summary of my experience with the o1-preview model.

  1. Reasoning: This is the strongest argument in favour of this model. It correctly answers many difficult questions that tripped up other models, which was impossible before.
  2. Math: Great at math as well. It handles many questions across algebra, trigonometry, number theory, and more.
  3. Coding: I was not as fond of it as Sonnet 3.5, but it's better than GPT-4o. I believe future releases will improve it.
  4. Creative writing: Not the model's forte. For better results, use GPT-4o or Sonnet 3.5.

What’s next?

All said and done, the model is undoubtedly a step change, and it will only improve, perhaps with the full “o1.” It may not truly be at the Ph.D. level, but it is much better than previous models.

We cannot wait to test how this model performs in function calling and structured output, which can also push the boundaries of AI automation.

We at Composio are building the tooling infrastructure for AI agents, allowing you to connect 100+ third-party applications with AI agents via tool calling.

Stay tuned as we continue testing the model’s ability to generate structured output and utilize tools, assessing its agentic performance along the way.

Follow us for updates on our socials: X (formerly Twitter), LinkedIn, and Discord.

Top comments (6)

tim brandom

This is what I was looking for, fair and balanced analysis. Thanks for the post.

Nevo David

Amazing, article, thank you!

Sunil Kumar Dash

I am glad you liked the article.

Andreas

What a nice article, thank you! You took the time to make an extensive analysis with many examples... that is how I would wish more articles looked like

Sunil Kumar Dash

Thank you, @atsag, I am glad you liked the article.

Steve Schafer

Just a semantic note: You put "TL;DR" at the top, except that what followed wasn't even remotely a TL;DR. You ended with "We will find out," which is pretty much the opposite of a TL;DR.