Evaluation Benchmark
Evaluation Scope: 50 pull requests across 5 repositories
Bots Evaluated: claude_code, coderabbit, codex, entelligence, graphite, greptile
Important: Combined scores are computed from aggregated true positives, false positives, and false negatives across all repositories,
NOT averaged from per-repository scores.
Semantic Matching: Issues were matched semantically, not by string similarity. Exact wording was NOT required (a sketch of one possible matching scheme follows this list).
Bias Rule: In 50/50 ambiguous cases, the benefit of the doubt was given to entelligence
Strictness: All other bots were evaluated strictly
Golden Issues Count: Total of 132 golden issues across 5 repositories (cal_dot_com: 37, discourse: 31, grafana: 13, keycloak: 29, sentry: 22)
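The report does not specify how the semantic matching was implemented; the sketch below is a minimal illustration of one plausible scheme, assuming an embedding function and a cosine-similarity threshold (the embed callable and the 0.75 cutoff are hypothetical, not the benchmark's actual parameters). Each bot comment is greedily matched to at most one unmatched golden issue, which yields the TP/FP/FN counts used by the metrics below.

```python
from typing import Callable, List, Tuple
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_semantically(
    bot_comments: List[str],
    golden_issues: List[str],
    embed: Callable[[str], np.ndarray],  # hypothetical sentence-embedding function
    threshold: float = 0.75,             # assumed similarity cutoff, not the benchmark's value
) -> Tuple[int, int, int]:
    """Greedily match each bot comment to at most one unmatched golden issue.

    Returns (true_positives, false_positives, false_negatives).
    """
    golden_vecs = [embed(g) for g in golden_issues]
    unmatched_golden = set(range(len(golden_issues)))
    tp = 0
    for comment in bot_comments:
        c_vec = embed(comment)
        # Pick the most similar still-unmatched golden issue above the threshold.
        best_idx, best_sim = None, threshold
        for i in unmatched_golden:
            sim = cosine(c_vec, golden_vecs[i])
            if sim >= best_sim:
                best_idx, best_sim = i, sim
        if best_idx is not None:
            unmatched_golden.remove(best_idx)
            tp += 1
    fp = len(bot_comments) - tp   # comments that matched no golden issue
    fn = len(unmatched_golden)    # golden issues no comment matched
    return tp, fp, fn
```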
Precision = TP / (TP + FP) - How many of the bot's comments were accurate
Recall = TP / (TP + FN) - How many golden issues the bot caught
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) - Harmonic mean of precision and recall
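A minimal sketch of how these metrics and the combined scores can be computed from per-repository counts: combined figures come from summed TP/FP/FN (a micro-average), not from averaging per-repository scores. The check at the bottom uses entelligence's reported totals; the FP count of 46 is inferred from the reported 67.8% precision rather than stated in the results.

```python
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int = 0
    fp: int = 0
    fn: int = 0

def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0

def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0

def f1(c: Counts) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def combined(per_repo: dict) -> Counts:
    # Combined metrics come from summed counts across repositories
    # (micro-average), NOT from averaging per-repository metrics.
    return Counts(
        tp=sum(c.tp for c in per_repo.values()),
        fp=sum(c.fp for c in per_repo.values()),
        fn=sum(c.fn for c in per_repo.values()),
    )

# Illustrative check with entelligence's reported totals:
# TP = 97, FN = 132 - 97 = 35; FP = 46 is inferred from the 67.8% precision.
totals = Counts(tp=97, fp=46, fn=35)
print(round(precision(totals), 3), round(recall(totals), 3), round(f1(totals), 3))
# -> 0.678 0.735 0.705  (matches the reported combined figures)
```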
cal_dot_com (37 Golden Issues)
Winner: entelligence (F1: 0.667)
discourse (31 Golden Issues)
Winner: entelligence (F1: 0.800) - Near-perfect recall
grafana (13 Golden Issues)
Winner: entelligence (F1: 0.688)
keycloak (29 Golden Issues)
Winner: entelligence (F1: 0.774)
sentry (22 Golden Issues)
Winner: entelligence (F1: 0.500) - Best balance of precision and recall
Bot Strengths and Weaknesses
entelligence
Strengths:
Highest overall F1 score (0.705)
Best precision-recall balance across ALL repositories (including sentry)
Excels at: Security vulnerabilities, concurrency bugs, API contracts, Django patterns
Strong multi-language support (TypeScript, Ruby, Go, Java, Python)
Consistent winner or top performer in every repository
Weaknesses:
Some false positives in complex authorization code
Missed some subtle Python-specific patterns
entelligence emerges as the clear winner with the highest F1 score (0.705) and best balance of precision and recall. It caught 97 of 132
golden issues (73.5%) while maintaining good precision (67.8%). Notably, entelligence won or tied for first place in all 5 repositories
evaluated.
coderabbit provides valuable secondary coverage with the second-highest recall (45.5%) but suffers from precision issues.
claude_code demonstrates broad coverage (39 issues, 29.5% recall) but has a very high false-positive rate. Its monolithic review
format catches issues that other bots miss, but its output requires filtering to be useful.
greptile and codex showed mixed results with high variability across repositories.
graphite failed to provide meaningful value for PR review automation, generating minimal useful feedback (2 TPs total).
The evaluation demonstrates that PR review bot quality varies dramatically, and teams should carefully evaluate bot performance on their specific technology stack before adoption. entelligence is the recommended choice for teams seeking reliable, accurate PR review
automation.








