Aiswarya Sankar
Jan 23, 2025
5 min read
Critical bug detection rates across different code review tools
I'm Aiswarya Sankar, founder of Entelligence.ai. We build AI-powered developer tools that help engineering teams ship better code faster. While building our AI code review system, we ran into a fundamental problem:
We set out to build the first OSS PR review eval platform, and we're excited to announce that we hit an 81% bug detection rate, higher than other tools in the space.
Every major AI company has its own code review bot now. The market is getting crowded, but there's no standardized way to compare them. Some vendors claim to catch 3x more bugs (a meaningless metric), others tout time saved (hard to verify).
We needed answers to basic questions: which tools actually catch critical bugs, and which ones just add noise?
We decided to create the first open evaluation framework for code review bots. Here's how we did it:
The magic was in having multiple bots review identical PRs. This created a natural A/B test environment - we could see how different bots commented on exactly the same code changes, and more importantly, which comments developers actually addressed. We gathered data from over 200 PRs across these codebases, giving us a unique opportunity to compare bot effectiveness on a level playing field.
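To make the A/B setup concrete, here is a minimal sketch of the kind of per-PR record such a comparison needs. The class and field names are illustrative assumptions, not the actual schema from our framework.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class BotComment:
    """A single review comment left by one bot on a pull request."""
    bot_name: str            # which tool under test left the comment
    file_path: str           # file the comment was anchored to
    body: str                # the comment text itself
    addressed: bool = False  # did the developer change the code in response?

@dataclass
class ReviewedPR:
    """One PR that several bots reviewed on identical code changes."""
    repo: str
    pr_number: int
    comments: list[BotComment] = field(default_factory=list)

    def addressed_rate(self, bot_name: str) -> float:
        """Fraction of a bot's comments the developer actually acted on."""
        own = [c for c in self.comments if c.bot_name == bot_name]
        return sum(c.addressed for c in own) / len(own) if own else 0.0
```

Tracking which comments were addressed alongside the raw comment text is what turns identical PRs into a level playing field for comparing bots.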
Building effective evaluation systems for AI code review presents several significant challenges. One of the most fundamental issues is subjectivity - what constitutes a critical bug versus a simple nitpick can vary dramatically between different engineering teams and organizations. This variability makes it difficult to create standardized evaluation criteria that work across different contexts.
To address these challenges, we implemented a systematic evaluation framework using large language models to ensure consistent classification across all reviews. This automated approach allowed us to apply identical criteria to every comment, eliminating the variability that often comes with human reviewers.
To ensure consistent classification of code review comments, we developed an LLM-based evaluation system focused on functional and security issues. Here's our approach:
You are a senior staff engineer - analyze these code review comments and categorize each one into exactly ONE of: CRITICAL_BUG, NITPICK, or OTHER.
Respond with a JSON array where each object has:
{ "comment_index": "<index>", "Comment": <comment>, "category": "CRITICAL_BUG|NITPICK|OTHER", "reasoning": "Brief explanation of why this category was chosen" }
IMPORTANT: Each comment MUST be categorized. The category field MUST be exactly one of CRITICAL_BUG, NITPICK, or OTHER.
Remember: Only report issues that could actually break functionality or corrupt data at runtime.
For each PR comment, we captured the comment text, related code, and developer response. This structured approach helped maintain consistency and focus on functional impact rather than style. We've open-sourced these evaluation criteria to help standardize industry practices.
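As a rough sketch of how this classification step can be wired up, the snippet below drives the prompt through a chat-style LLM API (here the OpenAI Python client, with a placeholder model name). The real framework in the repo may differ in provider, prompt wording, and error handling.

```python
import json
from openai import OpenAI  # assumes the openai Python package, v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CATEGORIZATION_PROMPT = (
    "You are a senior staff engineer - analyze these code review comments and "
    "categorize each one into exactly ONE of CRITICAL_BUG, NITPICK, or OTHER. "
    "Respond with a JSON array where each object has comment_index, comment, "
    "category, and reasoning. Only report issues that could actually break "
    "functionality or corrupt data at runtime as CRITICAL_BUG."
)

def classify_comments(comments: list[dict], model: str = "gpt-4o") -> list[dict]:
    """Classify a batch of review comments (each with comment text, related code,
    and developer response) using identical criteria for every comment."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep the judge as deterministic as possible
        messages=[
            {"role": "system", "content": CATEGORIZATION_PROMPT},
            {"role": "user", "content": json.dumps(comments, indent=2)},
        ],
    )
    # In practice you would validate this JSON and retry on malformed output.
    return json.loads(response.choices[0].message.content)
```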
Our analysis revealed significant differences in how effectively different code review tools identify critical bugs:
The data clearly shows that most code review tools are not optimized for finding critical issues that actually matter to developers.
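To make that comparison reproducible, a per-tool metric can be computed directly from the classified comments. Treating "critical bugs found versus total comments" as the signal-to-noise measure follows the chart's framing, so consider the function below an illustrative sketch rather than the exact formula in our framework.

```python
from collections import Counter

def signal_ratio(classified_comments: list[dict]) -> dict[str, float]:
    """For each bot, compute the share of its comments classified as
    CRITICAL_BUG - a rough signal-to-noise measure per tool."""
    totals: Counter = Counter()
    critical: Counter = Counter()
    for comment in classified_comments:
        bot = comment["bot_name"]  # assumes each classified comment keeps its bot label
        totals[bot] += 1
        if comment["category"] == "CRITICAL_BUG":
            critical[bot] += 1
    return {bot: critical[bot] / totals[bot] for bot in totals}
```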
We went through multiple iterations to improve our own system:
Our initial approach failed to meaningfully improve the noise-versus-real-bug problem. The review bot often ended up categorizing trivial issues as critical bugs without real backing. Additionally, the review bot still lacked enough awareness of the codebase to distinguish real issues from trivial nits.
We added codebase context to the review. These changes allowed the review to pull in the necessary method, class, and function definitions from other parts of the codebase, ensuring that the review is targeted and focused on real issues. To parse these file and function definitions efficiently, we use language-based code parsing libraries to cross-reference definitions within the rest of the code. This led to more real, actionable comments.
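As an illustration only, the sketch below uses Python's built-in ast module to do this kind of cross-referencing for Python files; the production pipeline relies on language-specific parsing libraries so the same idea works across languages, and the function names here are hypothetical.

```python
import ast

def extract_definitions(source: str) -> dict[str, str]:
    """Map each function/class name defined in a Python file to its source text."""
    definitions = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            segment = ast.get_source_segment(source, node)
            if segment:
                definitions[node.name] = segment
    return definitions

def context_for_change(changed_code: str, other_files: dict[str, str]) -> list[str]:
    """Pull in definitions from the rest of the codebase that the changed code
    references, so the reviewer sees them instead of guessing.
    Simplified: assumes the changed hunk parses as Python on its own and only
    tracks plain name references, not attribute accesses."""
    referenced = {
        node.id
        for node in ast.walk(ast.parse(changed_code))
        if isinstance(node, ast.Name)
    }
    context = []
    for source in other_files.values():
        for name, definition in extract_definitions(source).items():
            if name in referenced:
                context.append(definition)
    return context
```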
We then built a system that combines that codebase context with learning from user feedback. This final approach allows the PR review bot to include the appropriate context from the codebase while also learning from what users do with its comments: it first narrows the review down to real issues within the codebase, then further filters those issues based on the user's prior behavior.
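A simplified sketch of that two-stage filter is shown below. The data shapes (comment dicts carrying referenced_symbols and category, a per-category feedback history) and the 20% cutoff are assumptions for illustration, not the production logic.

```python
def filter_review_comments(candidates, known_symbols, feedback_history):
    """Two-stage filter: (1) keep comments grounded in real codebase context,
    (2) drop the kinds of comments this team has historically ignored."""
    # Stage 1: the comment must reference symbols we actually resolved in the
    # codebase, which weeds out speculation about code the bot never saw.
    grounded = [
        c for c in candidates
        if any(sym in known_symbols for sym in c.get("referenced_symbols", []))
    ]

    # Stage 2: suppress categories with a historically low "addressed" rate.
    # The 20% threshold is a placeholder, not a tuned production value.
    def historically_useful(comment):
        stats = feedback_history.get(comment["category"], {"addressed": 1, "total": 1})
        return stats["addressed"] / max(stats["total"], 1) >= 0.2

    return [c for c in grounded if historically_useful(c)]
```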
Our current system achieves an 81% bug detection rate on this evaluation, higher than the other tools we measured.
Comparison of critical bugs found vs total comments across different tools
We've open-sourced our entire evaluation framework at https://github.com/Entelligence-AI/code_review_evals, including the categorization prompt and evaluation criteria described above.
We've open-sourced this evaluation system because the industry and customers need a standard way to measure code review quality. We hope this helps teams make informed decisions about their tooling.