Shrijal Acharya for Composio

Posted on • Originally published at composio.dev


✨ Gemini 2.5 Pro vs. Claude 3.7 Sonnet Coding Comparison 🔥

Google just launched a new model on March 26th, which they claim is the best at coding, reasoning, and, well, everything. 🥴 But what I mostly care about is how it compares against the best available model, Claude 3.7 Sonnet, which itself was released at the end of February.

Let's compare these two models on coding and see whether I need to change my favorite coding model or whether Claude 3.7 still holds up. 😮‍💨


TL;DR

If you want to jump straight to the conclusion: comparing these two of the finest coding models, I'd say go for Gemini 2.5 Pro, based on our tests and the model benchmarks. However, Claude 3.7 Sonnet is not that far behind.

Just an article ago, Claude 3.7 Sonnet was the answer to every model comparison, and I thought it would stay that way for quite some time. But here you go, Gemini 2.5 Pro takes the lead. It feels like we've officially entered the AI era. 🫠

Tweet praising Gemini 2.5 Pro AI Model


Brief on Gemini 2.5 Pro

Gemini 2.5 Pro, currently an experimental thinking model, has been the talk of the town within a week of its release. Everyone's talking about this model on Twitter (X) and YouTube. It's trending everywhere, like seriously, everywhere.

And it's #1 on LMArena, just like that. But what does this mean? It means this model is beating all the other models not just in coding but also in math, science, image understanding, and whatnot.

Gemini 2.5 Pro AI Model tops LMARENA

Gemini 2.5 Pro comes with a 1 million token context window, with a 2 million token context window coming soon. 🤯

You can check out other folks like Theo-t3 talking about this model to get a bit more insight into it:

It is said to be the best model to date for coding, scoring about 63.8% on SWE-bench Verified, which is definitely higher than our previous top coding model, Claude 3.7 Sonnet, at about 62.3%.

Gemini 2.5 Pro AI Model SWE Benchmark

This is a quick demo that Google has shared of this model building a dinosaur game.

Here's a quick benchmark of this model on reasoning, mathematics, and science. It confirms that the model is not just suited to coding but to all your other needs as well; they're essentially claiming it's an all-rounder. 🤷‍♂️

Gemini 2.5 Pro Benchmarks

This is all cool, and the claims seem to check out, but in this article I'll mainly be comparing the models on coding, so let's see how well Gemini 2.5 Pro performs against Claude 3.7 Sonnet.


Coding Problems

💁 Let's compare these two models on coding. We'll run a total of 4 tests, mainly on web dev, animation, and a tough LeetCode question.

1. Flight Simulator

Prompt: Create a simple flight simulator using JavaScript. The simulator should feature a basic plane that can take off from a flat runway. The plane's movement should be controlled with simple keyboard inputs (e.g., arrow keys or WASD). Additionally, generate a basic cityscape using blocky structures, similar to Minecraft.
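Before looking at the outputs, it's worth seeing what a prompt like this boils down to: a per-frame update that turns key input into motion. Here's a minimal sketch of that step (my own illustration with made-up state shapes and tuning constants, not the code either model produced); a real page would wire it to keydown/keyup listeners and a Three.js render loop:

```javascript
// Minimal, illustrative control-and-physics step for the plane
// (hypothetical state/input shapes, not either model's generated code).
function stepPlane(state, input, dt) {
  const s = { ...state };
  // Throttle builds speed up to a cap; pitch and heading respond to keys.
  if (input.throttleUp) s.speed = Math.min(s.speed + 20 * dt, 80);
  if (input.pitchUp) s.pitch = Math.min(s.pitch + 0.5 * dt, 0.6);
  if (input.pitchDown) s.pitch = Math.max(s.pitch - 0.5 * dt, -0.6);
  if (input.yawLeft) s.heading += 0.8 * dt;
  if (input.yawRight) s.heading -= 0.8 * dt;
  // Move along the current heading; only climb once past takeoff speed.
  s.x += Math.sin(s.heading) * s.speed * dt;
  s.z += Math.cos(s.heading) * s.speed * dt;
  if (s.speed > 40) s.y = Math.max(0, s.y + Math.sin(s.pitch) * s.speed * dt);
  return s;
}
```

The "plane facing sideways" issue you'll see in one of the outputs below is exactly the kind of bug that creeps in here: the mesh's forward axis not matching the heading used in this update.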

Response from Gemini 2.5 Pro

You can find the code it generated here: Link

Here’s the output of the program:

I definitely got exactly what I asked for, with everything functioning, from plane movements to the basic Minecraft-styled block buildings. I can't really complain about anything here. 10/10 for this one. 🔥

Response from Claude 3.7 Sonnet

You can find the code it generated here: Link

Here’s the output of the program:

I can see some issues with this one. The plane is clearly facing sideways, and I don't know why. On top of that, it was simply out of control once it took off and flew clearly outside the city. Basically, I'd say we didn't really get a completely working flight simulator here.

Summary:

Fair to say, Gemini 2.5 Pro really got this one correct, and in one shot. The issues with the Claude 3.7 Sonnet code aren't that hard to resolve, but we didn't get the output as expected, and definitely nothing close to what Gemini 2.5 Pro got us.

2. Rubik’s Cube Solver

This is one of the toughest questions for LLMs. I've tried it with many other LLMs, but none of them could get it right. Let's see how these two models handle it.

Prompt: Build a simple 3D Rubik’s Cube visualizer and solver in JavaScript using Three.js. The cube should build a 3x3 Rubik’s Cube with standard colors. Have a scramble button that randomly scrambles the cube. Include a solve function that animates the solution step by step. Allow basic mouse controls to rotate the view.
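Worth noting what makes this prompt tractable at all: a common shortcut (which is what I'd expect an LLM to reach for here; this sketch is mine, not the linked code) is to record the scramble sequence and "solve" by replaying the inverted moves in reverse order, rather than implementing a real cube solver:

```javascript
// Illustrative scramble/solve bookkeeping (my sketch, not the linked code).
const MOVES = ["U", "U'", "D", "D'", "L", "L'", "R", "R'", "F", "F'", "B", "B'"];

// Pick n random face turns for the scramble button.
function scramble(n) {
  const seq = [];
  for (let i = 0; i < n; i++) {
    seq.push(MOVES[Math.floor(Math.random() * MOVES.length)]);
  }
  return seq;
}

// Undo a scramble: reverse the order and invert each turn (U -> U', U' -> U).
function invert(seq) {
  return seq.slice().reverse().map(m => (m.endsWith("'") ? m[0] : m + "'"));
}
```

The solve animation then just replays `invert(scrambleSeq)` move by move; the genuinely hard part, and where models usually fail, is applying each turn to the 26 cubie meshes without corrupting their colors.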

Response from Gemini 2.5 Pro

You can find the code it generated here: Link

Here’s the output of the program:

It's really impressive that it could do something this hard in one shot. I can truly see how powerful this model seems to be with the 1 million token context window.

Response from Claude 3.7 Sonnet

You can find the code it generated here: Link

Here’s the output of the program:

And again, I'm kind of disappointed that it fell into the same trap as some other LLMs, messing up the colors and completely failing to solve the cube. I did try to help it come up with the answer, but it didn't really help.

Summary:

Here again, Gemini 2.5 Pro takes the lead. And the best part is that all of it was done in one shot. Claude 3.7 was really disappointing, as it could not get this one correct, despite being one of the finest coding models out there.

3. Ball Bouncing Inside a Spinning 4D Tesseract

Prompt: Create a simple JavaScript script that visualizes a ball bouncing inside a rotating 4D tesseract. When the ball collides with a side, highlight that side to indicate the impact.
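The tricky part of this prompt is the 4D math itself: rotate points in a plane involving the w axis, then perspective-project them down to 3D for rendering. A minimal sketch of those two steps (my own illustration, not either model's generated code):

```javascript
// Rotate a 4D point in the x-w plane (one of the rotations that makes a
// tesseract look like it's turning "inside out"). Illustrative math only.
function rotateXW([x, y, z, w], angle) {
  const c = Math.cos(angle), s = Math.sin(angle);
  return [x * c - w * s, y, z, x * s + w * c];
}

// Same idea as 3D -> 2D perspective, one dimension up: divide by depth in w.
function projectTo3D([x, y, z, w], dist = 2) {
  const k = dist / (dist - w);
  return [x * k, y * k, z * k];
}
```

Collision with a "side" is then checked in 4D space (is any ball coordinate beyond the cell's boundary?) before projection, which is why the highlight has to be tracked per 4D face rather than per drawn 3D polygon.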

Response from Gemini 2.5 Pro

You can find the code it generated here: Link

Here’s the output of the program:

I can't spot a single issue in the output. The ball and the collision physics all work perfectly; even the part where I asked it to highlight the collided side works. This free model seems to be insane for coding. 🔥

Response from Claude 3.7 Sonnet

You can find the code it generated here: Link

Here’s the output of the program:

Wow, finally, Claude 3.7 Sonnet got an answer correct. It also added colors to each side, but who asked for it? 🤷‍♂️ Nevertheless, can’t really complain much here, as the main functionality seems to work just fine.

Summary:

This one is a tie. Both models got the answer correct, implementing everything I asked for. I won't say I like Claude 3.7 Sonnet's output more, but it definitely put in quite some extra work compared to Gemini 2.5 Pro.

4. LeetCode Problem

For this one, let's do a quick LeetCode check to see how these models handle a tricky LeetCode question with an acceptance rate of just 14.9%: Maximum Value Sum by Placing 3 Rooks.

Claude 3.7 Sonnet is known to be super good at solving LC questions. If you want to see how Claude 3.7 compares to some top models like Grok 3 and o3-mini-high, check out this blog post:


Prompt:

You are given a m x n 2D array board representing a chessboard, where board[i][j] represents the value of the cell (i, j).

Rooks in the same row or column attack each other. You need to place three rooks on the chessboard such that the rooks do not attack each other.

Return the maximum sum of the cell values on which the rooks are placed.

Example 1:

Input: board = [[-3,1,1,1],[-3,1,-3,1],[-3,2,1,1]]
Output: 4
Explanation:
We can place the rooks in the cells (0, 2), (1, 3), and (2, 1) for a sum of 1 + 1 + 2 = 4.

Example 2:

Input: board = [[1,2,3],[4,5,6],[7,8,9]]
Output: 15
Explanation:
We can place the rooks in the cells (0, 0), (1, 1), and (2, 2) for a sum of 1 + 5 + 9 = 15.

Example 3:

Input: board = [[1,1,1],[1,1,1],[1,1,1]]
Output: 3
Explanation:
We can place the rooks in the cells (0, 2), (1, 1), and (2, 0) for a sum of 1 + 1 + 1 = 3.

Constraints:

3 <= m == board.length <= 100
3 <= n == board[i].length <= 100
-10^9 <= board[i][j] <= 10^9
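For reference, here's one way to solve it (my own sketch, not either model's actual submission): keep only the top 3 cells of each row, since with just two other rooks at most two columns can be blocked, so one of a row's top 3 is always usable. Then brute-force all row triples, which is C(100, 3) × 27 ≈ 4.4M combinations, comfortably within limits:

```javascript
// Place 3 non-attacking rooks for maximum cell-value sum (illustrative).
function maximumValueSum(board) {
  const m = board.length;
  // For each row, keep the top 3 (value, column) cells: the other two rooks
  // block at most two columns, so one of the top 3 is always available.
  const top3 = board.map(row =>
    row.map((v, c) => [v, c]).sort((a, b) => b[0] - a[0]).slice(0, 3)
  );
  let best = -Infinity;
  for (let i = 0; i < m; i++)
    for (let j = i + 1; j < m; j++)
      for (let k = j + 1; k < m; k++)
        for (const [v1, c1] of top3[i])
          for (const [v2, c2] of top3[j])
            for (const [v3, c3] of top3[k])
              if (c1 !== c2 && c1 !== c3 && c2 !== c3)
                best = Math.max(best, v1 + v2 + v3);
  return best;
}
```

A naive search over all cell triples is roughly what earns a TLE here, which is exactly the failure mode we'll see below.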

Response from Gemini 2.5 Pro

💁 I have quite high hopes for this model, given how easily it answered all three of the coding questions we tested.

You can find the code it generated here: Link

It did take quite some time to answer this one, though, and the code it wrote is kind of super complex to make sense of. I think it answered it in a more complicated way than required. But still, the main thing we're looking for is whether it can answer correctly.

And as expected, it got this tough LeetCode question in one shot as well. This is one of the questions I got stuck on when learning DSA. I’m not sure if I’m happy that it got it right in one shot. 😮‍💨

LeetCode accepted code from Gemini 2.5 AI Model

Response from Claude 3.7 Sonnet

💁 I have high hopes that this model will crush this one; in all the other coding tests I've done, Claude 3.7 Sonnet has answered every LeetCode question correctly.

You can find the code it generated here: Link

It did write correct code but got TLE (Time Limit Exceeded). That said, if I have to compare code simplicity, this model's code is simpler and easier to understand.

LeetCode TLE code from Claude 3.7 Sonnet AI Model

Summary:

Gemini 2.5 Pro got the answer correct and wrote code within the expected time complexity, while Claude 3.7 Sonnet ran into TLE. If I have to compare code simplicity, though, Claude 3.7's generated code seems better.


Conclusion

For me, Gemini 2.5 Pro is the winner. We've compared two models that are said to be the best at coding. The big difference I see in the model stats is just that Gemini 2.5 Pro has a larger context window, but let's not forget that this is an experimental model and improvements are still on the way.

Imagine how good this model is going to be with a 2M token context window. 😵

Google's been killing it recently with such solid models, previously with the Gemma 3 27B model, a super lightweight model with unbelievable results, and now with this beast of a model, Gemini 2.5 Pro.

If you’d like to take a look at the Gemma 3 27B model comparison, here you go:

What do you think about Gemini 2.5 Pro? Let me know your thoughts in the comments! 👇


Top comments (26)

Brain R. Byron:
Love the short, sweet intro to Gemini 2.5. Yes, this is a beast of a model.

"Google's been killing it recently with such solid models, previously with the Gemma 3 27B model, a super lightweight model with unbelievable results, and now with this beast of a model, Gemini 2.5 Pro."

Agree 100% on Gemini 2.5. Haven't tried out Gemma.

Shrijal Acharya:
Good to hear that. BTW, if you want to try Gemma 3 out locally, you might find this repository of mine, which helps set up LLMs on a VM, helpful.

Brain R. Byron:
Thank you. I don't use it locally. I use it in the AI Studio.


Kevin Naidoo (edited):
Nice comparison. Gemini 2.5 is a poor option for UI dev; maybe that's because it's still experimental. The couple of times I tried to generate components using Tailwind, it did a terrible job: either the layout looked broken, or it was too basic.

Claude Sonnet 3.5 still seems to be the best; in one shot or with just a few tweaks, it can generate great frontend code. I prefer backend, and I write 90% of that myself, so Gemini might do better there, but as a replacement for Claude on the frontend side, not anytime soon.

Shrijal Acharya:
Surely, that could be the case. Gemini 2.5 performed quite well in these tests. I haven't really tested it on the UI side with Tailwind and all that, but I can't agree more on how good Claude 3.5/3.7 is with backend stuff. It's awesome. Thank you, Kevin! I'm glad you took the time to read this one!

Sebastian Schürmann:
Upload it a bit of context and it does not. I had the issue with PlantUML diagrams and threw 200K tokens of documentation into the context as PDF, and kaboom: most problems are gone.

Benny Schuetz:
Stunning results. It's really hard to catch up with the constant updates of all the LLMs. Just experimented with the improved image generation in ChatGPT.

Thanks again for sharing your results. I really like the flight sim one by Gemini 2.5!

Shrijal Acharya:
Completely understandable; with so many LLMs, it's hard to keep up with the updates. And thank you for checking it out, Benny! 🔥

Nabin Bhardwaj:
Thank you for this comparison! I recently got to know about this model from Mathew Berman and am really excited to try it out in my day-to-day workflow. Good job with the comparison! 🔥🫶

Shrijal Acharya:
Glad you enjoyed!

Nabin Bhardwaj:
🥰

Lara Stewart - DevOps Cloud Engineer:
Always love your comparisons, Shrijal. 👍🏻

You seem to be a go-to nowadays for AI model comparisons. Love it!! How do you like the new DeepSeek v3?

Shrijal Acharya:
That means a lot. Thank you, Lara! ✌️

I haven't really tried it yet, but I will soon, and I'll share my thoughts with you.

Aayush Pokharel:
Fire, friend! But don't mess up tomorrow's exam, okay? 😂

Shrijal Acharya:
Thank you! I won't mess it up :)

Mukesh Singhania:
How do I use Gemini 2.5? I can't find it anywhere. I still use GPT.

Shrijal Acharya:
You can find it in Google AI Studio: aistudio.google.com

Shekhar Rajput:
Good model comparison. 💯

Shrijal Acharya:
Thank you, @shekharrr 🙌

Shrijal Acharya:
Guys, do let me know your thoughts in the comments! ✌️

Shrijal Acharya:
You can also find this blog here: Link
