Evaluation Benchmark
Evaluation Scope: 50 pull requests across 5 repositories
Bots Evaluated: claude_code, coderabbit, codex, entelligence, graphite, greptile
Important: Combined scores are computed from aggregated true positives, false positives, and false negatives across all repositories,
NOT averaged from per-repository scores.
Semantic Matching: Issues were matched semantically, not by string similarity. Exact wording was NOT required (a sketch of one possible matching scheme follows this list).
Bias Rule: In 50/50 ambiguous cases, the benefit of the doubt was given to entelligence
Strictness: All other bots were evaluated strictly
Golden Issues Count: Total of 132 golden issues across 5 repositories (cal_dot_com: 37, discourse: 31, grafana: 13, keycloak: 29, sentry: 22)
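The report does not specify how the semantic matching was implemented; the sketch below is a minimal illustration of one plausible scheme, assuming an embedding function and a cosine-similarity threshold (the embed callable and the 0.75 cutoff are hypothetical, not the benchmark's actual parameters). Each bot comment is greedily matched to at most one unmatched golden issue, which yields the TP/FP/FN counts used by the metrics below.

```python
from typing import Callable, List, Tuple
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_semantically(
    bot_comments: List[str],
    golden_issues: List[str],
    embed: Callable[[str], np.ndarray],  # hypothetical sentence-embedding function
    threshold: float = 0.75,             # assumed similarity cutoff, not the benchmark's value
) -> Tuple[int, int, int]:
    """Greedily match each bot comment to at most one unmatched golden issue.

    Returns (true_positives, false_positives, false_negatives).
    """
    golden_vecs = [embed(g) for g in golden_issues]
    unmatched_golden = set(range(len(golden_issues)))
    tp = 0
    for comment in bot_comments:
        c_vec = embed(comment)
        # Pick the most similar still-unmatched golden issue above the threshold.
        best_idx, best_sim = None, threshold
        for i in unmatched_golden:
            sim = cosine(c_vec, golden_vecs[i])
            if sim >= best_sim:
                best_idx, best_sim = i, sim
        if best_idx is not None:
            unmatched_golden.remove(best_idx)
            tp += 1
    fp = len(bot_comments) - tp   # comments that matched no golden issue
    fn = len(unmatched_golden)    # golden issues no comment matched
    return tp, fp, fn
```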
Precision = TP / (TP + FP) - How many of the bot's comments were accurate
Recall = TP / (TP + FN) - How many golden issues the bot caught
F1 Score = 2 * (Precision * Recall) / (Precision + Recall) - Harmonic mean of precision and recall
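A minimal sketch of how these metrics and the combined scores can be computed from per-repository counts: combined figures come from summed TP/FP/FN (a micro-average), not from averaging per-repository scores. The check at the bottom uses entelligence's reported totals; the FP count of 46 is inferred from the reported 67.8% precision rather than stated in the results.

```python
from dataclasses import dataclass

@dataclass
class Counts:
    tp: int = 0
    fp: int = 0
    fn: int = 0

def precision(c: Counts) -> float:
    return c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0

def recall(c: Counts) -> float:
    return c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0

def f1(c: Counts) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0

def combined(per_repo: dict) -> Counts:
    # Combined metrics come from summed counts across repositories
    # (micro-average), NOT from averaging per-repository metrics.
    return Counts(
        tp=sum(c.tp for c in per_repo.values()),
        fp=sum(c.fp for c in per_repo.values()),
        fn=sum(c.fn for c in per_repo.values()),
    )

# Illustrative check with entelligence's reported totals:
# TP = 97, FN = 132 - 97 = 35; FP = 46 is inferred from the 67.8% precision.
totals = Counts(tp=97, fp=46, fn=35)
print(round(precision(totals), 3), round(recall(totals), 3), round(f1(totals), 3))
# -> 0.678 0.735 0.705  (matches the reported combined figures)
```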
cal_dot_com (37 Golden Issues)
Winner: entelligence (F1: 0.667)
discourse (31 Golden Issues)
Winner: entelligence (F1: 0.800) - Near-perfect recall
grafana (13 Golden Issues)
Winner: entelligence (F1: 0.688)
keycloak (29 Golden Issues)
Winner: entelligence (F1: 0.774)
sentry (22 Golden Issues)
Winner: entelligence (F1: 0.500) - Best balance of precision and recall
Bot Strengths and Weaknesses
entelligence
Strengths:
Highest overall F1 score (0.705)
Best precision-recall balance across ALL repositories (including sentry)
Excels at: Security vulnerabilities, concurrency bugs, API contracts, Django patterns
Strong multi-language support (TypeScript, Ruby, Go, Java, Python)
Consistent winner or top performer in every repository
Weaknesses:
Some false positives in complex authorization code
Missed some subtle Python-specific patterns
entelligence emerges as the clear winner with the highest F1 score (0.705) and best balance of precision and recall. It caught 97 of 132
golden issues (73.5%) while maintaining good precision (67.8%). Notably, entelligence won or tied for first place in all 5 repositories
evaluated.
coderabbit provides valuable secondary coverage with the second-highest recall (45.5%) but suffers from precision issues.
claude_code demonstrates broad coverage (39 issues, 29.5% recall) but has a very high false-positive rate. Its monolithic review
format catches issues that other bots miss, but its output requires filtering to be useful.
greptile and codex showed mixed results with high variability across repositories.
graphite failed to provide meaningful value for PR review automation, generating minimal useful feedback (2 TPs total).
The evaluation demonstrates that PR review bot quality varies dramatically, and teams should carefully evaluate bot performance on their specific technology stack before adoption. entelligence is the recommended choice for teams seeking reliable, accurate PR review
automation.








