Deepseek vs Claude PR Reviews
TL;DR
Deepseek R1 beat our long-standing champion, Claude 3.5 Sonnet - catching 3.7x as many bugs with a better signal-to-noise ratio across 500 real pull requests.
While building out Entelligence's PR review bot, we've tried, tested and run continuous evaluations across all models. In our previous testing across GPT-4, Claude Sonnet, and Gemini Flash, Claude has consistently emerged as the clear winner - identifying thorny edge cases and producing the fewest incorrect reviews.
Even when we asked the models themselves to evaluate each other's code reviews, every model agreed that Claude was superior at identifying critical errors and issues in the code. While GPT-4 and Gemini caught several of the surface-level issues, only Claude was able to find the more critical reasoning-related issues.
Every model we polled kept confirming the same verdict. Then Deepseek R1 arrived, poised to change the game, so we put it to the test against our top performer - Claude 3.5 Sonnet.
Evaluating Deepseek
Using our open source PR review evaluation framework (check out this link for reference - https://www.entelligence.ai/post/pr_review.html), we ran both models against our dataset of 500 pull requests. Given Claude's track record of outperforming other models in coding tasks, we expected similar results.
To ensure consistent classification of code review comments, we used our LLM-based evaluation system focused on functional and security issues. Here's our approach:
Analysis Prompt
You are a senior staff engineer - analyze these code review comments and categorize each one into exactly ONE of:
- CRITICAL_BUG: Comments identifying serious issues that could cause crashes, data loss, security vulnerabilities, etc.
- NITPICK: Minor suggestions about style, formatting, variable names, or trivial changes that don't affect functionality
- OTHER: Everything else - general suggestions, questions, or feedback that don't fit the above.
Respond with a JSON array where each object has:
{ "comment_index": "", "Comment": , "category": "CRITICAL_BUG|NITPICK|OTHER", "reasoning": "Brief explanation of why this category was chosen" }
IMPORTANT: Each comment MUST be categorized. The category field MUST be exactly one of CRITICAL_BUG, NITPICK, or OTHER.
Remember: Only report issues that could actually break functionality or corrupt data at runtime.
For each PR comment, we captured the comment text, related code, and developer response. This structured approach helped maintain consistency and focus on functional impact rather than style. We've open-sourced these evaluation criteria to help standardize industry practices.
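To keep the pipeline honest, we validate the model's output against a fixed schema. Here's a minimal TypeScript sketch of the per-comment record and the expected response shape - the field names follow the prompt above, but the type definitions themselves are illustrative, not our production code:

```typescript
// Illustrative types for the evaluation pipeline (names are ours).

// What we capture for each PR comment before classification.
interface ReviewComment {
  commentText: string;          // the review comment body
  relatedCode: string;          // the diff hunk the comment is attached to
  developerResponse?: string;   // how the PR author replied, if at all
}

// The shape each element of the classifier's JSON array must match.
type Category = "CRITICAL_BUG" | "NITPICK" | "OTHER";

interface ClassifiedComment {
  comment_index: number;
  comment: string;
  category: Category;
  reasoning: string;            // brief explanation for the chosen category
}
```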
Deepseek
Deepseek blew Claude's results out of the water: a critical-bug-to-noise ratio of over 80% - that is, the share of its comments classified as CRITICAL_BUG rather than NITPICK or OTHER - compared to Claude's 67%. Even more impressive than that one stat, Deepseek also caught 3.7x as many bugs as Claude. We couldn't believe these numbers at first, so we dove in deeper - here are examples of exactly how Deepseek R1 beat Claude, PR comment by PR comment.
PR 1454: Cache and Message Repository Changes
Deepseek R1:
- ✅ Identified critical race condition in the CachedValue class due to concurrent access between dirty() and the value getter
- ✅ Caught type mismatch between the setMessages function (expecting T[]) and its readonly array input (readonly T[])
- ✅ Detected unsafe casting that bypasses readonly constraints in the ThreadMessage[] conversion
Claude:
- ❌ Missed the race condition issue
- ❌ Did not identify the type mismatch problem
- ⚠️ Made a generic suggestion to "add type safety checks" without identifying the specific readonly constraint issues
Claude seemed unable to understand concurrent code execution risks, catch subtle type system violations, or identify specific type constraint bypasses.
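To make the race concrete, here's a minimal sketch of the pattern Deepseek flagged. The class and method names come from the review comment itself; the body is our illustrative reconstruction, not the actual code from the PR:

```typescript
// Illustrative reconstruction of the CachedValue race (not the PR's code).
class CachedValue<T> {
  private _dirty = true;
  private _value: T | undefined;

  constructor(private compute: () => Promise<T>) {}

  // Marks the cache stale, e.g. after the underlying data changes.
  dirty(): void {
    this._dirty = true;
  }

  async value(): Promise<T> {
    if (this._dirty) {
      this._dirty = false;               // flag cleared *before* compute() resolves...
      this._value = await this.compute();
      // ...so a dirty() call that lands during the await is silently lost,
      // and later reads are served a stale value.
    }
    return this._value as T;
  }
}
```

The readonly finding is the same flavor of subtlety: `readonly T[]` is not assignable to `T[]`, and casting with `as ThreadMessage[]` compiles but silently discards the immutability guarantee.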
PR 1428: Runtime and Memory Management
Deepseek R1:
- ✅ Caught the full context of the hideToast timeout issue:
  - Identified the memory leak
  - Found potential race condition risks after unmount
- ✅ Caught missing chainId in a useEffect dependency array, leading to stale closures
- ✅ Detected incompatible types between BaseThreadRuntimeCore and ModelConfigProvider/ModelContextProvider
Claude:
- ⚠️ Mentioned the timeout not being cleared but missed the race condition implications
- ❌ Missed the stale closure issue with the chainId dependency
- ❌ Did not catch the architectural type mismatch issue
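Both hook-level issues are easy to reproduce. Here's a hedged sketch with hypothetical component and prop names (only the hideToast-style timeout and chainId come from the review comments):

```tsx
import { useEffect, useState } from "react";

// Hypothetical component illustrating the two hook bugs Deepseek flagged.
function Toast({ chainId }: { chainId: string }) {
  const [visible, setVisible] = useState(true);

  useEffect(() => {
    const id = setTimeout(() => setVisible(false), 3000);
    // Without this cleanup, the timeout leaks and can fire after unmount,
    // racing a state update against a component that no longer exists.
    return () => clearTimeout(id);
  }, []);

  useEffect(() => {
    const onFocus = () => console.log("active chain:", chainId);
    window.addEventListener("focus", onFocus);
    return () => window.removeEventListener("focus", onFocus);
  }, []); // BUG: chainId is read inside the effect but missing from the deps,
          // so onFocus keeps a stale closure over the first render's value.
          // The fix is [chainId].

  return visible ? <div>Toast for chain {chainId}</div> : null;
}
```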
PR 1442: External Message Converter Updates
Deepseek R1:
- ✅ Found incorrect property access (message.inputs vs message.input)
- ✅ Identified a potential infinite re-render loop from concurrent state updates
- ✅ Detected missing null checks on specific metadata properties
Claude:
- ❌ Only caught missing imports; missed the property access issue
- ❌ Missed the concurrent state update re-render loop
- ⚠️ Noted the need for validation but didn't specify which properties needed checks
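The re-render loop is the classic set-state-inside-an-effect-that-depends-on-that-state pattern. A minimal sketch, with hypothetical names (only the input/inputs property and the metadata null check come from the review comments):

```tsx
import { useEffect, useState } from "react";

interface ExternalMessage {
  input?: string; // the PR read `message.inputs`, which is always undefined
  metadata?: { custom?: Record<string, unknown> };
}

// Hypothetical converter component (names are ours, not the PR's).
function ConvertedMessage({ message }: { message: ExternalMessage }) {
  const [parts, setParts] = useState<string[]>([]);

  useEffect(() => {
    const next = message.input ? [message.input] : [];
    setParts(next); // a fresh array identity on every run...
  }, [parts, message]); // BUG: depending on `parts` re-triggers the effect
                        // each time setParts stores a new array: infinite loop.

  const custom = message.metadata?.custom ?? {}; // the null check Deepseek wanted
  return <pre>{JSON.stringify({ parts, custom })}</pre>;
}
```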
Conclusion:
It's extremely impressive to see how well Deepseek reasons through race conditions, type constraint mismatches, and more. Though we've only highlighted three PRs in this post, Deepseek consistently outperformed Claude as we worked through all 500 PRs in the analysis. Overall, here are the main themes where Deepseek provided the most value:
Better Runtime Understanding:
- Identifies race conditions
- Catches state management issues
- Understands component lifecycle implications
Broader Impact Analysis:
- Identifies both immediate and downstream effects
- Catches subtle interaction issues
More Specific, Actionable Feedback:
- Provides specific rather than generic feedback
- Names specific properties and types
- Explains exact error scenarios
Deepseek's ability to provide more precise, actionable feedback is a game changer for our PR review process. We're excited to see how we can leverage this model to improve our code quality and reduce the number of bugs that make it to production.