Claude Opus 4 vs Grok 3

May 26, 2025

TL;DR

We're going to put two of the strongest recently launched coding and reasoning models head to head: Grok 3 (Elon Musk's AI, which he has called the smartest AI on Earth) and Claude Opus 4 (billed as the best AI model for coding), and see how they compare.

But before starting out, what even is the Claude Opus 4 model?

Source: Anthropic

Let me give you a quick brief. Claude Opus 4 launched on May 22, 2025. Anthropic pitches it as the best AI model for coding, capable of working autonomously on coding tasks for hours. It has a 200K-token context window and scores 72.5% on SWE-bench Verified. Now that you know just enough about this model, let's see how it stacks up against Elon's claim of the smartest AI model on Earth, Grok 3.

Let's find out if Claude Opus 4 really is the best at coding, or if Grok 3 takes this title as well!

Stick around to see how they compare on coding (build tasks and algorithm problems) and on logic tests.

Coding Problems

1. Arkanoid Game

Prompt: Build a simple Arkanoid-style game with a paddle, ball, and bricks. There should be a ball bouncing around, breaking blocks at the top while you control a paddle at the bottom. The paddle needs to follow the arrow keys movements or WASD keys.

Output: Claude Opus 4

Here's the code it generated: Link

Frankly, this is more than I expected from such a blunt prompt. It built everything properly: the arcade-style UI, the ball bouncing, the brick-breaking physics, all of it.

The code is a bit messy, since it threw all the HTML, CSS, and JS into a single file, but it works, and that's what matters for this test.

Output: Grok 3

Here's the code it generated: Link

This one looks good as well, except that the paddle ignores the arrow and WASD keys and works only with the mouse. For some reason, it didn't follow that part of the prompt and wired up mouse-based paddle control instead.

This seemed like a small issue to fix, and sure enough, one follow-up prompt got the paddle working with the arrow keys.

It's no big deal either way; both models performed pretty well on this one.
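For reference, the arrow/WASD paddle movement the prompt asks for only takes a few lines. Here's a minimal sketch in Python with pygame (a stand-in for the single-file HTML/JS both models actually produced; the window size and paddle speed are values I picked arbitrarily):

```python
import pygame

# Minimal paddle-only sketch: move a paddle with the arrow keys or WASD.
# Stand-in for the models' HTML/JS output; sizes and speed are arbitrary.
pygame.init()
screen = pygame.display.set_mode((800, 600))
clock = pygame.time.Clock()
paddle = pygame.Rect(350, 560, 100, 16)
speed = 8

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False

    keys = pygame.key.get_pressed()
    if keys[pygame.K_LEFT] or keys[pygame.K_a]:
        paddle.x -= speed
    if keys[pygame.K_RIGHT] or keys[pygame.K_d]:
        paddle.x += speed
    paddle.clamp_ip(screen.get_rect())  # keep the paddle inside the window

    screen.fill((0, 0, 0))
    pygame.draw.rect(screen, (255, 255, 255), paddle)
    pygame.display.flip()
    clock.tick(60)

pygame.quit()
```

Grok 3's mouse-only version skipped exactly this kind of key polling, which is why one follow-up prompt was enough to fix it.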

💬 Quick aside: I don't know about you, but I played this game a lot as a kid; back then it was called DX Ball. :)

2. 3D Ping Pong Game

I've seen a lot of devs testing AI models with this kind of prompt, so I decided to throw a similar one at these two.

Prompt: Make a Tron-themed ping pong game with two players facing each other inside a glowing rectangular arena. Add particle trails, collision sparks, and realistic physics like angle-based bounces. Use neon colors, and smooth animations.
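Quick note on the physics: "angle-based bounces" here is the classic Pong/Breakout trick where the outgoing angle depends on where the ball hits the paddle, rather than just mirroring the velocity. A rough sketch of the idea in plain Python (not either model's actual code; the max angle and speed are values I picked for illustration):

```python
import math

def bounce_off_paddle(ball_x, paddle_x, paddle_width, speed, max_angle_deg=60):
    """Return a new (vx, vy) after the ball hits a paddle at the bottom of the screen.

    The outgoing angle depends on where the ball struck the paddle:
    dead centre goes straight up, the edges go out at +/- max_angle_deg.
    """
    # -1.0 at the paddle's left edge, 0.0 at the centre, +1.0 at the right edge
    offset = (ball_x - (paddle_x + paddle_width / 2)) / (paddle_width / 2)
    offset = max(-1.0, min(1.0, offset))

    angle = math.radians(max_angle_deg) * offset
    vx = speed * math.sin(angle)
    vy = -speed * math.cos(angle)  # negative = upwards (screen y grows downward)
    return vx, vy

# Example: a ball hitting the right quarter of a 100px-wide paddle at x=350
print(bounce_off_paddle(ball_x=425, paddle_x=350, paddle_width=100, speed=6.0))
```

Keep that behaviour in mind when comparing the two outputs below.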

Output: Claude Opus 4

Here's the code it generated: Link

This was definitely a tougher build, and Claude Opus 4 couldn't get it right even after a few follow-up prompts. The ball's motion works, but the actual ping pong gameplay is missing, and there's a lot left to fix.

Not the best you could expect from this model, but it's okay. At least it got something working.

Output: Grok 3

Here's the code it generated: Link

This one genuinely surprised me: Grok 3 did a great job here. The UI isn't as polished as Claude Opus 4's, but every feature works as expected, which Claude's version can't claim.

I really like what Grok 3 produced for this one.

3. Competitive Programming

Codeforces problems rated above ~1600 are generally a step harder than LeetCode mediums, so let's see how well these models handle one.

Claude has been doing really well so far; let's find out whether it can tackle this one too.

Here's the prompt that I used:

Output: Claude Opus 4

Here's the code it generated: Link

No issues for Claude Opus 4 here either. I'm honestly surprised at how good this model is at coding: it has solved almost every coding problem I've thrown at it, both in this test and outside it.

Output: Grok 3

Here's the code it generated: Link

It failed on the very first sample test case, and you can clearly see the mismatch between the expected output and what its code produces. Really disappointing from this model.
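For transparency, "failed the first test case" just means the program's output didn't match the sample output in the problem statement. The check I run is roughly the sketch below; the file names solution.py, sample1.in, and sample1.out are hypothetical placeholders, not something either model produced:

```python
import subprocess
from pathlib import Path

def run_sample(solution: str, sample_in: str, sample_out: str) -> bool:
    """Feed a sample input to the solution and diff its stdout against the expected output."""
    expected = Path(sample_out).read_text().strip()
    result = subprocess.run(
        ["python", solution],
        input=Path(sample_in).read_text(),
        capture_output=True,
        text=True,
        timeout=10,
    )
    actual = result.stdout.strip()
    if actual != expected:
        print(f"FAILED\nexpected:\n{expected}\ngot:\n{actual}")
        return False
    print("OK")
    return True

# Hypothetical paths; point these at wherever you save the generated code and the samples.
run_sample("solution.py", "sample1.in", "sample1.out")
```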

Summary: Coding

For coding in particular, you can't go wrong choosing Claude Opus 4 over Grok 3. Grok 3 occasionally produces the better response, but across my experience with both models, Claude Opus 4 has consistently come out on top.

And the problem with Grok 3 in coding seems to be the same for other folks as well:

https://x.com/theo/status/1891736803796832298

Reasoning Problems

Enough coding; let's see how these two handle reasoning and logic problems. I'm going to test them on some deliberately tricky questions.

1. Find Checkmate in 1 Move

Prompt: Here's a chess position. black king on h8, white bishop on h7, white queen on f7, black knight on g6, and black rook on d7, how does white deliver a checkmate in one move?

The answer to this question is: queen to g8 (Qg8#).

This is how it looks on the board:

This is tricky for AI models, since it requires actually visualizing the position rather than pattern-matching, and general-purpose models that aren't trained specifically for chess fail at it surprisingly often.
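If you want to verify the intended answer yourself, here's a quick sketch using the python-chess library. One assumption on my part: the prompt never places a white king, so I add one on a1 just to make the FEN legal; it plays no role in the mate.

```python
import chess  # pip install python-chess

# Position from the prompt, plus a white king on a1 (not in the prompt, but a
# FEN needs both kings; it doesn't affect the result).
board = chess.Board("7k/3r1Q1B/6n1/8/8/8/8/K7 w - - 0 1")

for san in ["Qg8", "Qf8"]:
    b = board.copy()
    b.push_san(san)
    print(san, "->", "checkmate" if b.is_checkmate() else "not mate")

# Qg8 -> checkmate   (g8 is covered by the bishop on h7, so the king can't take)
# Qf8 -> not mate    (the knight on g6 simply captures the queen on f8)
```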

Let's see how these two compare here:

Output: Claude Opus 4

It failed this one miserably, answering queen to f8, which is wrong: the knight on g6 simply captures the queen on f8.

Even something as "simple" as a mate in one is hard for these LLMs to calculate, and this is a neat little proof of that.

Output: Grok 3

Grok 3 gave the same answer, queen to f8, and it's wrong for the same reason.

2. Most Elevator Calls

Prompt: A famous hotel has seven floors. Five people are living on the ground floor and each floor has three more people on it than the previous one. Which floor calls the elevator the most?

This one is trickier. The trap is to bait the LLM into computing the headcount on each floor and answering "the 6th floor" (the most populated one), but the intended answer is the ground floor: everyone who lives above it has to call the elevator from the ground floor every time they go up.
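For reference, here's the per-floor headcount the puzzle baits the model into computing, next to the actual logic (plain arithmetic, nothing model-specific):

```python
# Headcount per floor: 5 people on the ground floor (floor 0), each floor above has 3 more.
people = {floor: 5 + 3 * floor for floor in range(7)}
print(people)                                   # {0: 5, 1: 8, 2: 11, 3: 14, 4: 17, 5: 20, 6: 23}
print(max(people, key=people.get), people[6])   # the "trap" answer: floor 6, with 23 people

# The intended answer ignores these numbers entirely: everyone living above the
# ground floor has to call the elevator FROM the ground floor every time they go
# up, so the ground-floor button gets pressed the most.
```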

Output: Claude Opus 4

Here's the response it generated:

As expected, the model fell into the trap, tallying elevator calls by how many people live on each floor and answering the 6th floor, which is incorrect.

Output: Grok 3

Here's the response it generated:

Essentially the same answer as Claude Opus 4, with a bit more reasoning showing how it arrived at the top floor's 23 residents. Again, incorrect.

Summary: Reasoning

We didn't really get a solid winner in this section; both models failed in similar ways. That said, in my overall usage, Grok 3 still seems a bit better at reasoning than Claude Opus 4.

It's clear there are still edge cases that LLMs can't handle properly, and they end up confidently wrong. The fact that you can steer an LLM toward a specific wrong answer just by adding a small twist to the prompt is, to me, what still puts the "artificial" in Artificial Intelligence.

Conclusion

Did we get a clear winner here? Yes, absolutely.

Claude Opus 4 is clearly better than Grok 3 at coding, which is what I expected. That said, Grok 3 is strong on reasoning questions and respectable at coding too; it just falls a little short of Claude Opus 4.

💡 A bit off-topic, but I've found Claude Opus 4 performs noticeably worse than Grok 3 on tasks outside coding and reasoning, such as writing. That could be a deciding factor for some of you.

What do you think, and which of these two would you pick? Let me know in the comments!
