January 27, 2024
Written by Aiswarya Sankar

I'm Aiswarya Sankar, founder of Entelligence.ai. We build AI-powered developer tools that help engineering teams ship better code faster. While building our AI code review system, we ran into a fundamental problem:


How do you actually measure if these code review tools are providing value?



The Problem

Every major AI company has its own code review bot now. The market is getting crowded, but there's no standardized way to compare them. Some vendors claim to catch 3x more bugs (a meaningless metric without a baseline), others tout time saved (hard to verify).

We needed answers to basic questions:

  • Are these bots catching real bugs or just style nits?
  • How many of these issues matter to developers?
  • How many critical issues do they actually find?


Our Approach: Build an Eval Framework


We decided to create the first open evaluation framework for code review bots. Here's how we did it:

  • Set up multiple bots (including our own) on the same OSS codebases
  • Collected 597 review comments across 5 different bots
  • Built a classification system to categorize comments

The magic was in having multiple bots review identical PRs. This created a natural A/B test environment - we could see how different bots commented on exactly the same code changes, and more importantly, which comments developers actually addressed. We gathered data from over 200 PRs across these codebases, giving us a unique opportunity to compare bot effectiveness on a level playing field.
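
To make the collection step concrete, here's a minimal sketch of pulling each bot's review comments off the same PRs via the GitHub REST API. The repo name and bot logins below are placeholders, and the real pipeline (rate limiting, storage, matching comments to developer responses) is more involved than this.

    import os
    import requests

    GITHUB_API = "https://api.github.com"
    HEADERS = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }

    # Placeholders: the OSS repos under review and the bot accounts to compare.
    REPOS = ["some-org/some-repo"]
    BOT_LOGINS = {"entelligence-ai[bot]", "another-review-bot[bot]"}

    def review_comments(repo: str, pr_number: int) -> list[dict]:
        """Fetch every review comment left on a single pull request."""
        url = f"{GITHUB_API}/repos/{repo}/pulls/{pr_number}/comments"
        comments, page = [], 1
        while True:
            resp = requests.get(url, headers=HEADERS,
                                params={"per_page": 100, "page": page})
            resp.raise_for_status()
            batch = resp.json()
            if not batch:
                return comments
            comments.extend(batch)
            page += 1

    def bot_comments(repo: str, pr_number: int) -> list[dict]:
        """Keep only the comments authored by review bots, tagged by source."""
        return [
            {"repo": repo, "pr": pr_number, "bot": c["user"]["login"],
             "path": c["path"], "body": c["body"]}
            for c in review_comments(repo, pr_number)
            if c["user"]["login"] in BOT_LOGINS
        ]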



Challenges in Building Good Evals


Building effective evaluation systems for AI code review presents several significant challenges. One of the most fundamental issues is subjectivity - what constitutes a critical bug versus a simple nitpick can vary dramatically between different engineering teams and organizations. This variability makes it difficult to create standardized evaluation criteria that work across different contexts.


Core challenges we identified:
  • Subjectivity: What's a critical bug vs a nitpick varies by team
  • Context Matters: Code reviews don't make sense in isolation
  • False Positives: Bots often identify "issues" that aren't actually problems
  • Measurement Complexity: Simple metrics like "number of comments" aren't meaningful

To address these challenges, we implemented a systematic evaluation framework using large language models to ensure consistent classification across all reviews. This automated approach allowed us to apply identical criteria to every comment, eliminating the variability that often comes with human reviewers.



Evaluation Methodology

To ensure consistent classification of code review comments, we developed an LLM-based evaluation system focused on functional and security issues. Here's our approach:

Analysis Prompt

You are a senior staff engineer - analyze these code review comments and categorize each one into exactly ONE of:

  1. CRITICAL_BUG: Comments identifying serious issues that could cause crashes, data loss, security vulnerabilities, etc.
  2. NITPICK: Minor suggestions about style, formatting, variable names, or trivial changes that don't affect functionality
  3. OTHER: Everything else - general suggestions, questions, or feedback that don't fit the above.

Respond with a JSON array where each object has:

    {
        "comment_index": "<index of the comment>",
        "comment": "<original comment text>",
        "category": "CRITICAL_BUG|NITPICK|OTHER",
        "reasoning": "Brief explanation of why this category was chosen"
    }

IMPORTANT: Each comment MUST be categorized. The category field MUST be exactly one of CRITICAL_BUG, NITPICK, or OTHER.

Remember: Only report issues that could actually break functionality or corrupt data at runtime.

For each PR comment, we captured the comment text, related code, and developer response. This structured approach helped maintain consistency and focus on functional impact rather than style. We've open-sourced these evaluation criteria to help standardize industry practices.
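
In code, the classification step is little more than sending batches of comments through that prompt and parsing the JSON that comes back. Here's a minimal sketch assuming the OpenAI Python SDK; the model name, batching, and abbreviated prompt are illustrative choices, and the full prompt and pipeline live in the open-sourced repo.

    import json
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Condensed version of the analysis prompt above; the repo has the full text.
    ANALYSIS_PROMPT = (
        "You are a senior staff engineer - analyze these code review comments and "
        "categorize each one into exactly ONE of CRITICAL_BUG, NITPICK, or OTHER. "
        "Respond with a JSON array of objects with comment_index, comment, "
        "category, and reasoning fields."
    )

    def classify_comments(comments: list[str], model: str = "gpt-4o") -> list[dict]:
        """Send a batch of review comments to the LLM and parse its verdicts."""
        numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(comments))
        response = client.chat.completions.create(
            model=model,
            temperature=0,  # keep the classification as repeatable as possible
            messages=[
                {"role": "system", "content": ANALYSIS_PROMPT},
                {"role": "user", "content": numbered},
            ],
        )
        results = json.loads(response.choices[0].message.content)
        # Enforce the "exactly one category" rule before the results are counted.
        assert all(r["category"] in {"CRITICAL_BUG", "NITPICK", "OTHER"} for r in results)
        return results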



Key learnings:

  • Quantity ≠ Quality: More comments often meant more noise
  • Context is Crucial: Bots need codebase-wide context to make meaningful suggestions
  • Team Adaptation: Good review systems learn from team patterns
  • False Positive Cost: Bad suggestions quickly lead to developers ignoring the bot

What We Found

Most code review bots have a major noise problem:

  • 45% of comments (on average) were pure style nits
  • Some "leading" tools had up to 70% style comments
  • Only 20-40% of comments identified critical bugs


The data showed what we suspected: most bots were overwhelming developers with low-value feedback.



Building a Better System

We went through multiple iterations to improve our own system:


Attempt 1: Basic Classification

  • Used LLMs to categorize issues
  • Result: Better than manual, but still subjective
  • Limited by lack of context

This approach failed to meaningfully improve the noise-to-real-bug ratio. The review bot often categorized trivial issues as critical bugs without real backing, and it still lacked the awareness of the broader codebase needed to tell real issues from trivial nits.


Attempt 2: Full Context Analysis

We added:

  • Repository-level context
  • Language-based semantic parsing
  • Historical feedback learning

These changes allowed the review to pull in the relevant method, class, and function definitions from other parts of the codebase, ensuring that feedback is targeted and focused on real issues. To parse these file and function definitions efficiently, we use language-aware code parsing libraries to cross-reference definitions across the rest of the code. This led to more real, actionable comments.
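
As an illustration of what that context retrieval looks like for a single language, the sketch below uses Python's standard ast module to index top-level definitions in a repo and pull in the ones a changed snippet calls. Our actual implementation relies on multi-language parsing libraries, so treat this as a simplified stand-in.

    import ast
    from pathlib import Path

    def index_definitions(repo_root: str) -> dict[str, str]:
        """Map each top-level function/class name in the repo to its source."""
        index = {}
        for path in Path(repo_root).rglob("*.py"):
            source = path.read_text()
            for node in ast.parse(source).body:
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                    index[node.name] = ast.get_source_segment(source, node)
        return index

    def context_for_change(changed_code: str, index: dict[str, str]) -> list[str]:
        """Return definitions from elsewhere in the codebase that the change calls."""
        # changed_code is assumed to be a syntactically complete snippet, not a raw diff.
        called = {
            node.func.id
            for node in ast.walk(ast.parse(changed_code))
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
        }
        return [index[name] for name in called if name in index]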


Attempt 3: Multi-Layer Reflection

Built a system that:

  • Analyzes the code change
  • Reviews its own analysis
  • Validates against historical patterns
  • Filters based on team preferences

Finally, this approach allows the PR review bot to draw on the appropriate context from the codebase and to learn from user feedback: it first narrows down the real issues in the change, then further filters those issues based on the team's prior behavior.
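
A rough skeleton of those layers is sketched below. It is not the production pipeline: call_llm stands in for whatever model client is used, the prompts are heavily condensed, and the historical-pattern check and team-preference filter are collapsed into a single final pass.

    from typing import Callable

    def review_with_reflection(
        diff: str,
        context: str,
        team_history: list[str],
        call_llm: Callable[[str], str],
    ) -> str:
        # Layer 1: draft a review of the change plus its cross-file context.
        draft = call_llm(
            "Review this change for functional and security issues only.\n"
            f"Context:\n{context}\n\nDiff:\n{diff}"
        )

        # Layer 2: the model critiques its own draft and drops unsupported claims.
        vetted = call_llm(
            f"You wrote this review:\n{draft}\n\n"
            "Remove any comment that is speculative, stylistic, or not backed by the diff."
        )

        # Layer 3: validate against what this team has previously dismissed or ignored.
        history = "\n".join(team_history)
        return call_llm(
            f"This team has previously ignored or rejected feedback like:\n{history}\n\n"
            f"Drop comments the team is unlikely to act on; keep the rest:\n{vetted}"
        )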



The Results

Our current system achieves:

  • 83.1% critical bug ratio (up from 40%)
  • 1.1% style comments (down from 30%)


Open Sourcing Everything

We've open-sourced our entire evaluation framework at https://github.com/Entelligence-AI/code_review_evals including:

  • Full dataset of 597 classified comments
  • Evaluation methodology
  • Classification system
  • Team response tracking

We've open-sourced this evaluation system because the industry and its customers need a standard way to measure code review quality. We hope it helps teams make informed decisions about their tooling.
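
If you want to reproduce ratios like the ones above from the released data, it comes down to counting categories per bot. The sketch below assumes a flat JSON file with bot and category fields per comment; check the repo for the actual file names and schema.

    import json
    from collections import Counter, defaultdict

    def ratios_by_bot(path: str = "classified_comments.json") -> dict[str, dict[str, float]]:
        """Share of CRITICAL_BUG / NITPICK / OTHER comments for each bot."""
        with open(path) as f:
            comments = json.load(f)

        counts: dict[str, Counter] = defaultdict(Counter)
        for comment in comments:
            counts[comment["bot"]][comment["category"]] += 1

        return {
            bot: {cat: n / sum(c.values()) for cat, n in c.items()}
            for bot, c in counts.items()
        }

    if __name__ == "__main__":
        for bot, ratios in ratios_by_bot().items():
            print(bot, {cat: f"{share:.1%}" for cat, share in ratios.items()})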



What's Next?

We're continuing to improve the framework. Areas we're exploring:

  • Expanding the dataset
  • Adding more classification categories
  • Building team-specific evaluation criteria

Try Entelligence.ai's PR review bot at entelligence.ai/pr and contribute to our open evaluation framework!
