Mean scores from clinician audit (structured evaluation, N=785). Distributions below use primary judge scores, which substantially undercount omission harm (see H6: κ = 0.045).
All individual response scores (primary judge, N=600 per model)
Layperson OH minus physician OH (OH = omission harm; positive = worse for laypersons)
Mean OH by model and clinical category (primary judge)