Deepseek R1 vs Claude 3.5 Sonnet
While building out Entelligence's PR review bot, we've tried, tested, and run continuous evaluations across all models. In our previous testing across GPT-4, Claude Sonnet, and Gemini Flash, Claude consistently emerged as the clear winner - identifying thorny edge cases and producing the fewest incorrect reviews.
Even when we asked the models themselves to evaluate each other's code reviews, every model agreed that Claude was superior at identifying critical errors and issues in the code. GPT-4 and Gemini caught several of the surface-level issues, but only Claude was able to find the deeper, reasoning-related ones.
Deepseek R1 seemed poised to change the game, so we put it to the test against our top performer - Claude 3.5 Sonnet.
Using our open source PR review evaluation framework (check out this link for reference - https://www.entelligence.ai/post/pr_review.html), we ran both models against our dataset of 500 pull requests. Given Claude's track record of outperforming other models in coding tasks, we expected similar results. To ensure consistent classification of code review comments, we used our LLM-based evaluation system focused on functional and security issues. Here's our approach:
You are a senior staff engineer - analyze these code review comments and categorize each one into exactly ONE category, returning a JSON object per comment in this format:
{ "comment_index": "<index>", "comment": "<comment>", "category": "CRITICAL_BUG|NITPICK|OTHER", "reasoning": "Brief explanation of why this category was chosen" }
IMPORTANT: Each comment MUST be categorized. The category field MUST be exactly one of CRITICAL_BUG, NITPICK, or OTHER. Remember: Only report issues that could actually break functionality or corrupt data at runtime.
For each PR comment, we captured the comment text, related code, and developer response. This structured approach helped maintain consistency and focus on functional impact rather than style. We've open-sourced these evaluation criteria to help standardize industry practices.
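To make this concrete, here's a minimal sketch of how a single comment could be run through that classifier. The OpenAI-compatible client, the placeholder evaluator model name, and the classify_comment helper are illustrative assumptions, not the exact code from our framework.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint; not necessarily our stack

# Condensed version of the categorization prompt shown above
CLASSIFIER_PROMPT = (
    "You are a senior staff engineer - analyze the code review comment and categorize it "
    "into exactly ONE of CRITICAL_BUG, NITPICK, or OTHER. Return a JSON object: "
    '{"comment_index": "<index>", "comment": "<comment>", '
    '"category": "CRITICAL_BUG|NITPICK|OTHER", "reasoning": "Brief explanation"}. '
    "Only report issues that could actually break functionality or corrupt data at runtime."
)

client = OpenAI()  # hypothetical setup; reads the API key from the environment

def classify_comment(index: int, comment: str, related_code: str, dev_response: str) -> dict:
    """Classify one PR review comment into CRITICAL_BUG, NITPICK, or OTHER."""
    payload = {
        "comment_index": index,
        "comment": comment,
        "related_code": related_code,        # the code the comment refers to
        "developer_response": dev_response,  # how the developer reacted to it
    }
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder evaluator model, not necessarily what we used
        messages=[
            {"role": "system", "content": CLASSIFIER_PROMPT},
            {"role": "user", "content": json.dumps(payload)},
        ],
        response_format={"type": "json_object"},  # ask for a parseable JSON reply
        temperature=0,  # keep classifications deterministic across runs
    )
    return json.loads(response.choices[0].message.content)
```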
Deepseek blew Claude's results out of the water, with a critical-bug-to-noise ratio of over 80% compared to Claude's 67%. Even more impressive than that one stat, Deepseek also caught 3.7x as many bugs as Claude did.
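For context on what that ratio means: we treat it as the share of a model's review comments that land in CRITICAL_BUG rather than NITPICK or OTHER. The sketch below shows roughly how those stats get aggregated - a simplified illustration with placeholder names and numbers, not the exact code or data from our pipeline.

```python
from collections import Counter

def review_stats(classified: list[dict]) -> dict:
    """Aggregate classified comments into bug counts and a critical-bug-to-noise ratio."""
    counts = Counter(c["category"] for c in classified)
    critical = counts["CRITICAL_BUG"]
    noise = counts["NITPICK"] + counts["OTHER"]
    total = critical + noise
    return {
        "critical_bugs": critical,
        "noise_comments": noise,
        # share of comments flagging real functional or security problems
        "critical_to_noise_ratio": critical / total if total else 0.0,
    }

# Illustrative usage (placeholder variables, not the real evaluation results):
# deepseek = review_stats(deepseek_classified)  # ratio reported above: >0.80
# claude = review_stats(claude_classified)      # ratio reported above: ~0.67
# bug_multiple = deepseek["critical_bugs"] / claude["critical_bugs"]  # reported as ~3.7x
```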
We couldn't believe these stats at first, so we decided to dive in deeper - here are examples of exactly how Deepseek R1 beat Claude, PR comment by PR comment.
It's extremely impressive to see how well Deepseek managed to reason through race conditions, constructor mismatch issues, and more. Though I've only highlighted 3 PRs in this blog, as we went through all 500 PRs in the analysis, I was consistently impressed at how Deepseek managed to outperform Claude.