Pre-registered Benchmark

IatroBench

When Safety Measures Cause the Harm They're Designed to Prevent

Dr David Gringras

Paper · Defensive AI (Legal Analysis) · Pre-registration · Code & Data

3,600 responses · 60 scenarios · 6 frontier models · omission harm (OH) range 0.79–2.28 · decoupling gap +0.38 · commission harm (CH) < 0.5 for 4/6 models

All six frontier models exhibit pervasive omission harm while commission harm remains low. When the same clinical question is matched in physician vs. layperson framing, models provide significantly better guidance to the physician (gap +0.38, p = 0.003, positive for all five testable models), a finding that holds when the self-evaluating model is excluded from both the scorer and model pools (+0.27, p = 0.001). The most safety-trained model (Opus) shows the largest gap (+0.65), while the LLM judge pipeline barely notices it (κ = 0.045). Safety training works on the axis it measures; on the axis nobody measures, it makes things worse.
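The decoupling gap above is a paired comparison: the same clinical question is scored once in physician framing and once in layperson framing, and the per-pair differences are averaged. A minimal sketch of that computation, using hypothetical scores on an assumed 0–3 quality scale (not the study's actual data):

```python
from statistics import mean

# Hypothetical paired quality scores for one model; each tuple is
# (physician_framing_score, layperson_framing_score) for the same
# clinical question. The 0-3 scale and values are illustrative only.
pairs = [
    (2.5, 2.0),
    (2.8, 2.3),
    (2.2, 2.0),
    (2.9, 2.4),
    (2.6, 2.3),
]

# Decoupling gap: mean per-pair difference (physician - layperson).
# A positive gap means better guidance was given to the physician.
gap = mean(p - l for p, l in pairs)
print(f"decoupling gap: {gap:+.2f}")  # prints: decoupling gap: +0.40
```

In the study the sign test across models (and a significance test on the pooled pairs) sits on top of this per-pair difference; the sketch shows only the gap statistic itself.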

Pre-registered hypotheses
Clinician-validated scenarios
Dual-axis scoring (CH/OH)
Decoupling evaluation
Blinded LLM judging
Clinician audit validation
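The judge-vs-clinician agreement reported above (κ = 0.045) is a Cohen's-κ-style chance-corrected agreement statistic. A self-contained sketch of the statistic with hypothetical binary labels (whether each response was flagged for omission harm), not the study's data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both raters labelled independently at
    # their own marginal rates.
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical "flags omission harm?" labels from an LLM judge and a
# clinician auditor over the same ten responses.
judge     = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
clinician = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
print(round(cohens_kappa(judge, clinician), 3))  # prints 0.194
```

A κ near zero, as in the study's 0.045, means the judge's flags agree with the clinician's barely more than chance would predict.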

Interactive Visualizations

Three views into the study data. Each is a self-contained interactive page.

Resources