
Gemini Flash 2.0 vs OpenAI O1, O1-Mini, and O3-Mini PR Reviews

Raghav Dixit

Feb 10, 2025

5 min read

Introduction

With the rapid evolution of AI-powered PR reviewers, developers now have access to increasingly sophisticated tools that can identify critical bugs, improve code quality, and boost team productivity. Our most recent evaluation pits Gemini Flash 2.0 against OpenAI’s O1, O1-Mini, and O3-Mini models using our open-source evaluation framework. The new analysis—based on a total of 419 review comments from multiple pull requests—provides fresh insights into each model’s performance on real-world code issues.


TL;DR

Gemini Flash 2.0

  • Generated 187 review comments detecting 96 critical bugs (a 51.3% critical bug ratio)
  • Found 2.67× more critical bugs than OpenAI O1, which flagged 36 critical issues from 56 comments
  • Offers a comprehensive review, with 20 nitpicks (10.7%) and 71 additional feedback items (38.0%) adding valuable context and runtime insight

OpenAI O1-Mini & O3-Mini

  • O1-Mini: 107 comments with 85 critical bugs (79.4% ratio), plus minimal nitpicks and other feedback
  • O3-Mini: 69 comments with 55 critical bugs (79.7% ratio), demonstrating highly focused feedback
  • High critical bug ratios indicate strong precision on core issues, though overall volume is lower

OpenAI O1

  • Produced 56 comments with 36 critical bugs (64.3% ratio)
  • While precise, its output volume is modest compared to Gemini's extensive coverage

OpenAI's Track Record in PR Reviews

Over time, while developing Entelligence's PR review bot, we rigorously tested multiple models—including GPT-4, Claude Sonnet, and earlier versions of Gemini Flash. OpenAI O1 consistently emerged as a strong performer, reliably identifying edge cases and critical issues with minimal false positives. Even when models evaluated each other's reviews, O1's focus on critical bug detection was frequently validated.

However, with the advent of Gemini Flash 2.0, we sought to answer one key question: Can it not only match but extend the depth of OpenAI's models—especially when it comes to complex code reasoning and runtime behavior?

Evaluating Gemini Flash 2.0

Using our open-source PR review evaluation framework (see our detailed guide), we analyzed a dataset that produced a total of 419 review comments. Each bot was run on its own subset of pull requests, and the results were classified using our LLM-based evaluation system focused on both functional and security issues.

Analysis Prompt

You are a senior staff engineer – analyze these code review comments and categorize each one into exactly ONE of:

  1. CRITICAL_BUG: Identifying serious issues (e.g., crashes, data loss, security vulnerabilities)
  2. NITPICK: Minor suggestions on style, formatting, or naming that do not affect functionality
  3. OTHER: General suggestions, questions, or miscellaneous feedback
Return each classification as a JSON object in the following format:

{
  "comment_index": "",
  "Comment": "",
  "category": "CRITICAL_BUG|NITPICK|OTHER",
  "reasoning": "Brief explanation for the chosen category"
}

For every PR comment, we captured the comment text, related code snippet, and developer response. This structured approach—now open-sourced—ensures industry-standard consistency and helps focus on feedback that truly impacts functionality.
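As a concrete illustration, here is a minimal TypeScript sketch of how each captured comment and its LLM-assigned category could be represented and rolled up into the ratios reported below; the type and field names are assumptions for illustration, not the framework's actual schema.

type Category = "CRITICAL_BUG" | "NITPICK" | "OTHER";

interface ReviewComment {
  commentIndex: number;
  comment: string;            // the review comment text
  codeSnippet: string;        // the code the comment refers to
  developerResponse: string;  // how the developer replied
  category: Category;         // assigned by the LLM evaluator
  reasoning: string;          // brief explanation for the chosen category
}

// Roll classified comments up into the per-model stats reported below,
// e.g. 96 critical bugs out of 187 comments ≈ 51.3% for Gemini Flash 2.0.
function criticalBugRatio(comments: ReviewComment[]): number {
  const critical = comments.filter((c) => c.category === "CRITICAL_BUG").length;
  return comments.length > 0 ? critical / comments.length : 0;
}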

Gemini Flash 2.0 vs. OpenAI Models

Gemini Flash 2.0

  • Output: 187 review comments
  • Critical Bugs: 96 (51.3% of comments)
  • Nitpicks: 20 (10.7%)
  • Other Feedback: 71 (38.0%)

Gemini Flash 2.0 delivered a high volume of actionable insights. Although its critical bug ratio is lower compared to the OpenAI Mini models, the raw numbers tell a compelling story: Gemini flagged 2.67× more critical bugs than OpenAI O1. Moreover, its extensive feedback—spanning both deep functional issues and additional contextual suggestions—demonstrates an enhanced understanding of code runtime behavior.

OpenAI Models

OpenAI O1:

  • Output: 56 comments
  • Critical Bugs: 36 (64.3%)
  • While precise, it produced a modest number of critical detections relative to Gemini Flash 2.0

OpenAI O1-Mini:

  • Output: 107 comments
  • Critical Bugs: 85 (79.4%)
  • Highly focused on flagging core issues with minimal extra commentary

OpenAI O3-Mini:

  • Output: 69 comments
  • Critical Bugs: 55 (79.7%)
  • Similar to O1-Mini in precision, delivering nearly 80% critical bug content

Key Takeaways

Volume vs. Precision:

While OpenAI's Mini variants offer a very high critical bug ratio—demonstrating exceptional focus—Gemini Flash 2.0 provides a broader review. Its larger output means it uncovers more total critical bugs and delivers nuanced feedback on runtime issues, even if the proportion of critical feedback is lower.

Depth of Analysis:

Gemini's reviews include deeper runtime insights and contextual commentary, which can be invaluable when addressing complex issues such as race conditions and state management problems.

Trade-offs:

The choice between a high ratio of critical bug flags and a comprehensive review largely depends on team needs. For teams looking for pinpointed feedback, OpenAI's Mini models shine. For those requiring broader insights, Gemini Flash 2.0 offers significant advantages.

PR Examples

PR #1573 - docs: re-add attachments to example

o1-mini & o3-mini Findings (Better at Bug Detection)

  • ✅ o1-mini flagged an animation issue in tooltips, where removing specific classes could cause tooltips to appear/disappear abruptly, impacting user experience
  • ✅ o3-mini identified the removal of 'use client', which could break client-side behavior in Next.js environments (sketched below)
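
For context on the 'use client' finding, here is a minimal Next.js App Router sketch, assuming a hypothetical Tooltip component rather than the actual code from PR #1573, showing why removing the directive breaks interactive behavior.

'use client'; // removing this directive makes Next.js render the file as a
              // Server Component, where the useState/onMouseEnter below fail

import React, { useState } from "react";

export function Tooltip({ label, children }: { label: string; children: React.ReactNode }) {
  const [open, setOpen] = useState(false);

  return (
    <span onMouseEnter={() => setOpen(true)} onMouseLeave={() => setOpen(false)}>
      {children}
      {open && <span role="tooltip">{label}</span>}
    </span>
  );
}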

o1 Findings (Limited Detection)

  • ⚠️ Flagged missing focus/disabled classes in button styles, which could impact accessibility, but missed the critical impact of removing UI elements

Gemini Findings (Minimal Detection)

  • ⚠️ Identified a non-existent autohideFloat prop, which was causing an error, but failed to recognize multiple UI and accessibility issues
  • ❌ Did not catch the missing close button issue in dialogs
  • ❌ Missed accessibility concerns related to buttons

PR #2847 - feat: Implement new authentication flow

o3-mini & o1-mini Findings (Better at Bug Detection)

  • ✅ o3-mini correctly flagged a security vulnerability where user tokens were stored in localStorage, recommending the use of httpOnly cookies for improved security (see the sketch after this list)
  • ✅ o3-mini identified a missing event.preventDefault() in form submission, which could lead to unexpected page reloads
  • ✅ o1-mini found a missing try-catch block in API calls, leading to potential unhandled errors
  • ✅ o1-mini caught a missing dependency in a useEffect hook, which could cause stale closures
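
Here is a generic React/TypeScript sketch of the patterns flagged above, assuming hypothetical component, endpoint, and field names rather than the actual code from PR #2847.

import React, { useEffect, useState } from "react";

export function LoginForm({ userId }: { userId: string }) {
  const [email, setEmail] = useState("");
  const [profile, setProfile] = useState<unknown>(null);

  // Stale-closure pattern: omitting userId from the dependency array would
  // leave this effect reading an outdated value after the prop changes.
  useEffect(() => {
    fetch(`/api/profile/${userId}`)
      .then((res) => res.json())
      .then(setProfile)
      .catch((err) => console.error("profile load failed", err));
  }, [userId]);

  async function handleSubmit(event: React.FormEvent<HTMLFormElement>) {
    // Without preventDefault, the browser reloads the page before the
    // async login call resolves.
    event.preventDefault();

    // Without this try/catch, a rejected fetch surfaces as an unhandled error.
    try {
      const res = await fetch("/api/login", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ email }),
      });
      const { token } = await res.json();
      // Persisting the token with localStorage.setItem("token", token) would
      // expose it to injected scripts (XSS); an httpOnly cookie set by the
      // server keeps it out of reach of client-side JavaScript.
      console.log("logged in", Boolean(token));
    } catch (err) {
      console.error("login failed", err);
    }
  }

  return (
    <form onSubmit={handleSubmit}>
      <pre>{JSON.stringify(profile)}</pre>
      <input value={email} onChange={(e) => setEmail(e.target.value)} />
      <button type="submit">Sign in</button>
    </form>
  );
}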

o1 Findings (Limited Detection)

  • ⚠️ Flagged inconsistent CSS theming, recommending CSS variables for maintainability
  • ⚠️ Identified missing ARIA attributes in menu items, improving accessibility but not directly related to functionality or security

Gemini Findings (Minimal Detection)

  • ⚠️ Suggested lazy loading for performance improvement, which is a good optimization but not a critical bug or security issue
  • ⚠️ Flagged outdated installation instructions, which is useful but not related to the core functionality or correctness of the PR

Key Takeaways from the PR Examples

  • o3-mini and o1-mini provided the most impactful PR feedback, catching security vulnerabilities, unhandled errors, and functional issues
  • o1 performed decently but focused more on styling and accessibility, which, while important, were not as critical as security and functional bugs
  • Gemini was the weakest in real-world issue detection, flagging documentation and performance optimizations but missing core security and functional problems
  • For teams handling authentication, API calls, and security-sensitive code, o3-mini and o1-mini are the best choices for ensuring robust and bug-free implementations

Conclusion

Our latest evaluation reveals a clear trade-off between focused precision and comprehensive coverage:

  • OpenAI O1-Mini and O3-Mini deliver a high percentage of critical bug feedback, making them excellent for pinpointing the most severe issues
  • OpenAI O1 offers a balanced but modest output
  • Gemini Flash 2.0, while exhibiting a lower critical bug ratio (51.3%), produced a much larger volume of feedback—identifying 96 critical bugs versus O1's 36—thus catching 2.67× more critical issues overall. Its broader scope, enriched with additional context and runtime insights, makes it a powerful tool for comprehensive code reviews

Gemini Flash 2.0's ability to combine in-depth analysis with actionable, contextual feedback is a game-changer for PR reviews. Whether you need pinpoint precision or an expansive review, these insights can help you choose the right tool to improve code quality and reduce production bugs.
