The Israeli startup Baz ranked first in precision on Code Review Bench, a newly launched independent benchmark evaluating artificial intelligence-powered code review tools.
The results placed Baz ahead of tools from several of the world’s leading AI labs and developer-tools companies, including OpenAI, Anthropic, Google and Cursor. Baz also ranked second in the overall composite score, which combines precision and recall.
Code Review Bench is the first benchmark dedicated specifically to evaluating AI-powered code review systems. While other benchmarks, such as SWE-Bench, measure model performance on coding tasks, they have drawn criticism over time as models were trained to optimize directly against them. Code Review Bench was designed to address that concern by combining controlled evaluations with real-world behavioral signals, aiming to provide a more robust measure of practical developer value.
Until now, most comparisons in the AI code review category were published by vendors themselves and were often viewed skeptically. The new index describes itself as the first independent and methodologically transparent comparison focused on code review quality.
Baz was founded at the end of 2023 by serial entrepreneur Guy Eisenkot, its chief executive officer, and Nimrod Kor, its chief technology officer. The two previously served together in Israel’s elite Unit 8200 military intelligence unit and share a background in cybersecurity.
Eisenkot co-founded Bridgecrew, which was acquired by Palo Alto Networks in 2021 for $200 million. After the acquisition, he served as vice president of product management and led application security at Palo Alto Networks. Kor was Bridgecrew’s third employee and later became a group leader at Palo Alto Networks.
Investors in Baz include Battery Ventures and Boldstart Ventures, along with Vermillion, Secret Chord Ventures and Fusion VC.
Code Review Bench was developed by researchers who previously worked on advanced model development at Google DeepMind, Anthropic and Meta. The benchmark was created within a San Francisco-based research lab focused on understanding how AI models reason and generalize. The lab says it operates on the premise that scaling models through trial and error alone does not constitute scientific understanding and is building new evaluation frameworks aimed at measuring real-world intelligence and reliability in AI-assisted software development.
The benchmark is updated monthly and combines two complementary approaches. In a controlled evaluation, multiple review tools are run on the same code changes and scored against a verified issue set. In a behavioral evaluation, researchers analyze how developers respond to review comments in large-scale open-source repositories.
By grounding results in both controlled comparisons and observed developer behavior, the benchmark seeks to narrow the gap between theoretical model capability and real-world usefulness. Its methodology includes continuous data updates, bias controls for automated judging systems and systematic expansion of the issue set to reduce the risk of benchmark gaming or static overfitting.
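The article does not publish Code Review Bench’s exact scoring formulas, but the standard definitions its description implies can be sketched in a few lines of Python. The function name, the issue-ID representation and the F1-style harmonic mean used for the composite below are illustrative assumptions, not the benchmark’s confirmed methodology:

```python
# Minimal sketch of how a controlled evaluation could score a review tool
# against a verified issue set. All names and the F1-style composite are
# assumptions for illustration; the benchmark's exact formulas are not
# published in this article.

def score_tool(flagged_issues: set[str], verified_issues: set[str]) -> dict[str, float]:
    """Score one tool's review comments against a verified issue set."""
    true_positives = len(flagged_issues & verified_issues)  # real issues the tool caught
    precision = true_positives / len(flagged_issues) if flagged_issues else 0.0
    recall = true_positives / len(verified_issues) if verified_issues else 0.0
    # One common way to combine precision and recall is their harmonic mean (F1).
    composite = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "composite": composite}

# Example: a tool flags four issues on a change with five verified issues,
# three of them correctly.
print(score_tool({"i1", "i2", "i3", "i9"}, {"i1", "i2", "i3", "i4", "i5"}))
# {'precision': 0.75, 'recall': 0.6, 'composite': 0.666...}
```

A harmonic-mean composite penalizes tools that trade one metric for the other, which is consistent with the benchmark’s stated aim of rewarding balanced, practically useful reviewers rather than noisy or overly cautious ones.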
Baz develops AI-powered code review infrastructure that identifies issues and suggests fixes according to configurable, team-defined standards. The company says its approach prioritizes signal over volume, aiming to reduce alert noise while maintaining coverage of meaningful issues. Its platform is intended to address inefficiencies in manual review processes while preserving engineering judgment and security standards.
“The precision metric, where we ranked first, measures the percentage of review comments that developers actually act upon,” Eisenkot said in a statement. “It reflects the ratio between validated findings and unnecessary alert noise in real-world environments. In AI-assisted code review, precision is the prerequisite for adoption. If a tool generates too much noise, developers ignore it. If it is consistently accurate, it becomes part of the workflow.”
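Put concretely, the behavioral precision Eisenkot describes reduces to an action rate over a tool’s review comments. A minimal sketch, assuming a simple comment record with an acted_on flag (an illustrative data model, not Baz’s or the benchmark’s actual one):

```python
# Sketch of behavioral precision: the share of a tool's review comments
# that developers actually act on. The ReviewComment structure and the
# acted_on flag are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class ReviewComment:
    tool: str
    acted_on: bool  # True if the developer changed code in response

def action_precision(comments: list[ReviewComment]) -> float:
    """Fraction of comments that led to a developer action (0.0 if none)."""
    return (sum(c.acted_on for c in comments) / len(comments)) if comments else 0.0

comments = [ReviewComment("baz", True), ReviewComment("baz", True),
            ReviewComment("baz", False)]
print(f"{action_precision(comments):.0%}")  # 67%
```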
He added that the current results represent an initial release of what he described as a “living benchmark.”
“Baz currently has a smaller evaluated sample size than some of the more established players, which means rankings may shift as the dataset expands,” he said. “Precision is based on developer actions — a strong but imperfect proxy for technical correctness. In addition, the judging systems and issue definitions are evolving as the methodology matures. We view this as a strong directional signal, not a finish line, and will continue tracking performance as the benchmark develops.”
Beyond its core product, Baz said it invests in independent research on measuring the quality of AI-generated code, breaking complex pull requests into structured thematic units, and identifying logic errors and interface modifications that could introduce breaking changes in distributed systems.
The company’s customers include technology firms in Israel and abroad, including major Israeli cybersecurity companies, where precision and responsible AI adoption are seen as critical to maintaining secure development standards at scale.