Safety Under Scaffolding

Interactive Companion Visualizations — Gringras 2026
How Safety Flows Through the Evaluation Pipeline
Flow width is proportional to the number of observations (N). Color encodes safety rate: green = safe, red = unsafe.
Trace any model through any scaffold to any benchmark to see safety outcomes.
Color scale: 0% safe (red) to 100% safe (green).
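The encoding described above (width from N, color from safety rate) can be sketched as a small aggregation step. This is a minimal illustration, not the code behind the visualization: the record schema `(model, scaffold, benchmark, is_safe)`, the function names `build_links` and `rate_to_color`, and the linear red-to-green color ramp are all assumptions made for the example.

```python
from collections import defaultdict


def rate_to_color(rate):
    """Map a safety rate in [0, 1] to a hex color (assumed linear red-to-green ramp)."""
    red = round(255 * (1 - rate))
    green = round(255 * rate)
    return f"#{red:02x}{green:02x}00"


def build_links(records):
    """Aggregate per-item outcomes into Sankey links.

    records: iterable of (model, scaffold, benchmark, is_safe) tuples
    (a hypothetical schema chosen for this sketch).
    """
    counts = defaultdict(lambda: [0, 0])  # (source, target) -> [n, n_safe]
    for model, scaffold, benchmark, is_safe in records:
        # Each observation contributes to two links: model->scaffold, scaffold->benchmark.
        for edge in ((model, scaffold), (scaffold, benchmark)):
            counts[edge][0] += 1
            counts[edge][1] += int(is_safe)

    links = []
    for (source, target), (n, n_safe) in counts.items():
        rate = n_safe / n
        links.append({
            "source": source,
            "target": target,
            "value": n,              # flow width is proportional to N
            "safety_rate": rate,     # fraction of safe outcomes on this link
            "color": rate_to_color(rate),
        })
    return links


# Made-up records, for illustration only.
example_records = [
    ("model-a", "direct", "BBQ", True),
    ("model-a", "direct", "BBQ", True),
    ("model-a", "map-reduce", "BBQ", False),
    ("model-a", "map-reduce", "BBQ", True),
]
example_links = build_links(example_records)
```

Each resulting link carries the observation count as its width and the safety rate as its color, so a link with few observations stays thin even when its rate is extreme.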
Key insight: Follow the flows through map-reduce and notice how they shift toward red/orange, especially for TruthfulQA and BBQ. Flows through multi-agent and direct, by contrast, remain predominantly green. The Sankey diagram makes the interaction effects visible: map-reduce severely harms some benchmarks while improving XSTest safety.
Format Dependence: MC vs Open-Ended Safety Rates
Map-reduce converts multiple-choice (MC) items to open-ended (OE) format. Bars show the difference in safety rate (OE − MC), in percentage points.
Positive = OE safer. Negative = MC safer. This format shift is a major confound in scaffold safety evaluation.
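The bar values follow from simple arithmetic on the two safety rates. A minimal sketch, assuming per-benchmark counts of safe responses in each format are available; the function name `format_gap_pp` and the counts in the usage lines are made up for illustration and are not results from the study.

```python
def format_gap_pp(mc_safe, mc_total, oe_safe, oe_total):
    """Return the OE minus MC safety-rate difference in percentage points.

    Positive means open-ended is safer; negative means multiple-choice is safer.
    """
    mc_rate = mc_safe / mc_total
    oe_rate = oe_safe / oe_total
    return round(100 * (oe_rate - mc_rate), 1)


# Illustrative counts only (not data from the study).
gap_oe_safer = format_gap_pp(mc_safe=60, mc_total=100, oe_safe=75, oe_total=100)
gap_mc_safer = format_gap_pp(mc_safe=80, mc_total=100, oe_safe=70, oe_total=100)
print(gap_oe_safer, gap_mc_safer)  # 15.0 -10.0
```

Reporting the gap in percentage points rather than as a relative change keeps benchmarks with very different baseline safety rates comparable on one axis.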
The format dependence finding: Map-reduce's apparent safety degradation on BBQ (+16.2pp OE advantage) and sycophancy (+19.6pp OE advantage) is largely explained by format conversion, not reasoning failures. When items become open-ended, BBQ and sycophancy scores improve dramatically. TruthfulQA and MMLU scores, by contrast, decrease in OE format. The effect is therefore benchmark-specific rather than universal, which undermines any simple "scaffolds degrade safety" narrative.