Pre-registered Benchmark

IatroBench

When Safety Measures Cause the Harm They're Designed to Prevent

Dr David Gringras

Paper · Defensive AI (Legal Analysis) · Pre-registration · Code & Data

3,600 responses · 60 scenarios · 6 frontier models · omission harm (OH) range 0.79–2.28 · decoupling gap +0.38 · commission harm (CH) < 0.5 for 4/6 models

All six frontier models exhibit pervasive omission harm while commission harm remains low. When the same clinical question is matched in physician vs. layperson framing, models provide significantly better guidance to the physician (gap +0.38, p = 0.003, positive for all five testable models), a finding that holds when the self-evaluating model is excluded from both the scorer and model pools (+0.27, p = 0.001). The most safety-trained model (Opus) shows the largest gap (+0.65), while the LLM judge pipeline barely notices it (κ = 0.045). Safety training works on the axis it measures; on the axis nobody measures, it makes things worse.
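The decoupling gap above is a paired comparison: the same clinical question is scored once in physician framing and once in layperson framing, and the per-pair differences are averaged. A minimal sketch of that computation, using hypothetical scores on an assumed 0–3 quality scale (not the study's actual data):

```python
from statistics import mean

# Hypothetical paired quality scores for one model; each tuple is
# (physician_framing_score, layperson_framing_score) for the same
# clinical question. The 0-3 scale and values are illustrative only.
pairs = [
    (2.5, 2.0),
    (2.8, 2.3),
    (2.2, 2.0),
    (2.9, 2.4),
    (2.6, 2.3),
]

# Decoupling gap: mean per-pair difference (physician - layperson).
# A positive gap means better guidance was given to the physician.
gap = mean(p - l for p, l in pairs)
print(f"decoupling gap: {gap:+.2f}")  # prints: decoupling gap: +0.40
```

In the study the sign test across models (and a significance test on the pooled pairs) sits on top of this per-pair difference; the sketch shows only the gap statistic itself.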

Pre-registered hypotheses
Clinician-validated scenarios
Dual-axis scoring (CH/OH)
Decoupling evaluation
Blinded LLM judging
Clinician audit validation
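The judge-vs-clinician agreement reported above (κ = 0.045) is a Cohen's-κ-style chance-corrected agreement statistic. A self-contained sketch of the statistic with hypothetical binary labels (whether each response was flagged for omission harm), not the study's data:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Expected agreement if both raters labelled independently at
    # their own marginal rates.
    expected = sum(ca[l] * cb[l] for l in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical "flags omission harm?" labels from an LLM judge and a
# clinician auditor over the same ten responses.
judge     = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
clinician = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]
print(round(cohens_kappa(judge, clinician), 3))  # prints 0.194
```

A κ near zero, as in the study's 0.045, means the judge's flags agree with the clinician's barely more than chance would predict.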

Interactive Visualizations

Three views into the study data. Each is a self-contained interactive page.

Resources