How Evaluation Conditions Shape Measured Safety
6 models, 4 configurations, 4 benchmarks, 62,808 observations
Each cell shows the safety rate for a model-scaffold combination, averaged across benchmarks. Click any cell to see the benchmark-level breakdown below.
Model safety rankings change dramatically across benchmarks. This is the core Generalizability Theory finding: no single model is universally safest. The lines show how each model's rank shifts from one benchmark to another.
Where does variation in safety scores actually come from? The answer is surprising: scaffolding architecture accounts for only 0.4% of total variance, while the choice of benchmark dominates at 19.3%.
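The variance percentages above come from a Generalizability Theory decomposition: each facet's estimated variance component is divided by the total variance. A minimal sketch of that final step, using illustrative component values (not the study's actual estimates):

```python
# Hedged sketch: turning hypothetical variance components from a
# Generalizability Theory decomposition into percent-of-variance shares.
# The component magnitudes below are illustrative, not the study's estimates.

def variance_shares(components):
    """Return each facet's share of total variance, in percent."""
    total = sum(components.values())
    return {facet: 100.0 * value / total for facet, value in components.items()}

# Hypothetical variance components for a model x scaffold x benchmark design
components = {
    "model": 0.015,
    "scaffold": 0.0004,    # scaffolding: tiny slice of total variance
    "benchmark": 0.0193,   # benchmark: the dominant facet
    "residual": 0.065,     # interactions and unexplained variance
}

shares = variance_shares(components)
```

With these made-up inputs the benchmark facet dwarfs the scaffold facet, mirroring the 19.3% vs 0.4% pattern reported above.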
How many queries must pass through a scaffold before one additional unsafe response is generated, compared to direct prompting? A lower NNH (number needed to harm) means harm occurs more often. Each dot below represents one query at a scale of 10,000 queries/day.
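In the number-needed-to-harm framing, NNH is the reciprocal of the risk difference between the two conditions; dividing daily query volume by NNH gives expected additional harms per day. A minimal sketch, with the unsafe-response rates below chosen purely for illustration:

```python
# Hedged sketch of the NNH calculation. The 2.5% / 2.0% rates are
# hypothetical examples, not figures taken from the study.

def nnh(unsafe_rate_scaffold, unsafe_rate_direct):
    """Number needed to harm: queries per one additional unsafe response."""
    risk_difference = unsafe_rate_scaffold - unsafe_rate_direct
    if risk_difference <= 0:
        return float("inf")  # scaffold is no worse than direct prompting
    return 1.0 / risk_difference

# Hypothetical: 2.5% unsafe with a scaffold vs 2.0% with direct prompting
queries_per_harm = nnh(0.025, 0.020)        # about 200 queries per extra harm
daily_harms = 10_000 / queries_per_harm     # expected extra harms at 10k queries/day
```

At the 10,000 queries/day scale used in the chart, even a half-percentage-point risk difference translates into dozens of additional unsafe responses every day.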