Model Comparison — IatroBench

Summary Statistics

Mean scores come from the clinician audit (structured evaluation, N=785). The distributions below use primary judge scores instead, which substantially undercount omission harm (see H6: κ = 0.045).
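For readers unfamiliar with κ, the sketch below shows how a chance-corrected agreement statistic between two raters can be computed. It assumes unweighted Cohen's kappa over categorical labels. The page does not say which kappa variant H6 reports, and the function name and sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences.

    Assumption: this is one common agreement statistic, not necessarily
    the variant used for H6.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("rater sequences must be non-empty and equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels, for illustration only (not IatroBench data).
judge_labels = [0, 0, 1, 1]
clinician_labels = [0, 1, 1, 1]
kappa = cohens_kappa(judge_labels, clinician_labels)
```

A value near 0, like the one cited for H6, means the two raters agree little more often than chance would predict.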

Omission Harm Distribution

All individual response scores (primary judge, N=600 per model)

Commission Harm Distribution

All individual response scores (primary judge, N=600 per model)

Decoupling Gap by Model

Layperson omission harm (OH) minus physician OH (positive = worse for laypersons)
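The gap defined above can be sketched as a difference of per-model means. This is a minimal illustration: the record layout, the field names (`model`, `framing`, `oh`), and the sample values are hypothetical, and it assumes the gap is taken over mean scores.

```python
from statistics import mean


def decoupling_gap(records):
    """Return {model: mean layperson OH - mean physician OH}.

    Positive values mean responses to laypersons scored worse.
    """
    by_model = {}
    for r in records:
        groups = by_model.setdefault(r["model"], {"layperson": [], "physician": []})
        groups[r["framing"]].append(r["oh"])
    return {
        model: mean(g["layperson"]) - mean(g["physician"])
        for model, g in by_model.items()
    }


# Hypothetical scores, for illustration only (not IatroBench data).
records = [
    {"model": "model_a", "framing": "layperson", "oh": 2.0},
    {"model": "model_a", "framing": "layperson", "oh": 3.0},
    {"model": "model_a", "framing": "physician", "oh": 1.0},
    {"model": "model_a", "framing": "physician", "oh": 1.0},
    {"model": "model_b", "framing": "layperson", "oh": 1.0},
    {"model": "model_b", "framing": "physician", "oh": 1.0},
]
gaps = decoupling_gap(records)
```

Here `model_a` has a gap of 1.5 (worse for laypersons) and `model_b` has a gap of 0.0 (no difference between framings).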

Category Heatmap

Mean OH by model and clinical category (primary judge)
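The grid behind a heatmap like this one is a mean per (model, category) cell. The sketch below shows that aggregation under assumed inputs: the row layout and the model and category names are hypothetical placeholders.

```python
from collections import defaultdict


def mean_oh_grid(rows):
    """Return {(model, category): mean OH} from (model, category, oh) rows."""
    totals = defaultdict(lambda: [0.0, 0])  # running [sum, count] per cell
    for model, category, oh in rows:
        cell = totals[(model, category)]
        cell[0] += oh
        cell[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical rows, for illustration only (not IatroBench data).
rows = [
    ("model_a", "category_x", 1.0),
    ("model_a", "category_x", 3.0),
    ("model_a", "category_y", 2.0),
    ("model_b", "category_x", 0.0),
]
grid = mean_oh_grid(rows)
```

Each key of `grid` corresponds to one heatmap cell. A combination with no rows is absent from the result rather than reported as zero, so missing data is not mistaken for zero harm.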