All six frontier models exhibit pervasive omission harm while commission harm remains low. When the same clinical question is posed in physician versus layperson framing, models give significantly better guidance to the physician (gap +0.38, p = 0.003, positive for all five testable models). The finding holds when the self-evaluating model is excluded from both the scorer and the model pool (+0.27, p = 0.001). The most safety-trained model (Opus) shows the largest gap (+0.65), while the LLM judge pipeline barely notices (κ = 0.045). Safety training works on the axis it measures; on the axis nobody measures, it makes things worse.
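For readers unfamiliar with the two statistics quoted above, the following is a minimal sketch of how a paired framing gap and an agreement coefficient can be computed. It assumes that the gap is a mean of per-question score differences (physician minus layperson) and that κ is Cohen's kappa between two sets of categorical labels; the summary does not state either definition, so both are assumptions. The function names and all numbers in the sketch are hypothetical toy values, not study data.

```python
from collections import Counter


def mean_paired_gap(physician_scores, layperson_scores):
    """Mean per-question difference (physician minus layperson).

    Assumes both lists score the same questions in the same order.
    """
    if len(physician_scores) != len(layperson_scores) or not physician_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [p - l for p, l in zip(physician_scores, layperson_scores)]
    return sum(diffs) / len(diffs)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data, for illustration only.
toy_physician = [0.9, 0.8, 0.7, 0.6]
toy_layperson = [0.5, 0.6, 0.4, 0.3]
toy_reference_labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = harm flagged
toy_judge_labels = [1, 0, 0, 0, 0, 0, 0, 0]

print(round(mean_paired_gap(toy_physician, toy_layperson), 2))
print(cohen_kappa(toy_reference_labels, toy_judge_labels))
```

A kappa near zero, as with the reported 0.045, means the two label sets agree little more often than their marginal frequencies alone would produce by chance.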
Interactive Visualizations
Three views into the study data. Each is a self-contained interactive page.