Preprint 2026

Safety Under Scaffolding

How Evaluation Conditions Shape Measured Safety
6 models, 4 configurations, 4 benchmarks, 62,808 observations

62,808 observations · 6 models · 4 scaffolds · 70.7% overall safe rate

Safety Rate Heatmap

Each cell shows the safety rate for a model-scaffold combination, averaged across benchmarks.

Rank Reversals Across Benchmarks

Model safety rankings change dramatically across benchmarks. This is the core Generalizability Theory finding: no single model is universally safest. The lines show how each model's rank shifts from one benchmark to another.

Key finding: Rankings reverse completely across benchmarks. The "safest" model on one benchmark can be the least safe on another. Benchmark accounts for 19.3% of variance in safety scores versus just 8.1% for model identity. Safety is not a stable, unitary property.

Variance Decomposition

Where does variation in safety scores actually come from? The answer is surprising: scaffolding architecture accounts for only 0.4% of total variance, while the choice of benchmark dominates at 19.3%.

The scaffold paradox: Despite map-reduce causing large safety degradation (NNH = 13.7), scaffold architecture explains only 0.4% of total variance. The interaction effects (Model × Config: 1.2%, Config × Benchmark: 1.2%; 2.4% combined) exceed the main scaffold effect, meaning the type of scaffold matters less than which model uses it and what is being measured.
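The comparison behind the scaffold paradox is plain arithmetic on the reported variance shares, and the sketch below makes it explicit. The dictionary keys and variable names are illustrative, not taken from the preprint. The "not itemized" figure is only what remains after subtracting the five shares quoted in this section. The section does not say how that remainder splits between other interactions and residual error.

```python
# Variance shares (percent of total) as quoted in this section.
# Key names are illustrative labels, not identifiers from the preprint.
REPORTED_SHARES = {
    "benchmark": 19.3,
    "model": 8.1,
    "scaffold": 0.4,
    "model_x_config": 1.2,
    "config_x_benchmark": 1.2,
}

# Combined share of the two reported interaction terms.
interaction_total = (
    REPORTED_SHARES["model_x_config"] + REPORTED_SHARES["config_x_benchmark"]
)

# How much larger the benchmark share is than the model share.
benchmark_to_model = REPORTED_SHARES["benchmark"] / REPORTED_SHARES["model"]

# How much larger the combined interactions are than the scaffold main effect.
interactions_to_scaffold = interaction_total / REPORTED_SHARES["scaffold"]

# Share not itemized in this section (assumption: everything not listed above).
not_itemized = 100.0 - sum(REPORTED_SHARES.values())

print(f"interactions combined: {interaction_total:.1f}%")
print(f"benchmark / model: {benchmark_to_model:.2f}x")
print(f"interactions / scaffold: {interactions_to_scaffold:.1f}x")
print(f"not itemized here: {not_itemized:.1f}%")
```

On these numbers, the two interaction terms together carry about six times the variance of the scaffold main effect, and benchmark choice carries a bit more than twice the variance of model identity.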

Number Needed to Harm

How many queries must pass through a scaffold before one additional unsafe response is generated (compared to direct prompting)? Lower NNH means more frequent harm.

ReAct
NNH = 135 queries per additional unsafe response
At 10,000 queries/day, scaffolding causes ~74 extra unsafe responses per day. Statistically significant but practically negligible.
OR = 0.95 · p = 0.012
Multi-Agent
NNH = 165 queries per additional unsafe response
At 10,000 queries/day, ~61 extra unsafe responses. Non-significant and TOST-equivalent within ±2 pp. Genuinely preserves safety.
OR = 0.96 · p = 0.066 (NS)
Map-Reduce
NNH = 13.7 queries per additional unsafe response
At 10,000 queries/day, ~730 extra unsafe responses. Roughly 1 in every 14 queries produces additional harm. Requires mitigation.
OR = 0.65 · p < 10^-59
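The daily figures quoted for each scaffold follow directly from the NNH values. A minimal sketch of that arithmetic is given below, assuming the standard definition NNH = 1 / (absolute risk increase) and the 10,000 queries/day volume used in this section. The function names are hypothetical and are not drawn from the preprint's code.

```python
def absolute_risk_increase(nnh: float) -> float:
    """Absolute risk increase (as a fraction) implied by an NNH value.

    Assumes the standard definition NNH = 1 / ARI.
    """
    if nnh <= 0:
        raise ValueError("NNH must be positive")
    return 1.0 / nnh


def extra_unsafe_per_day(nnh: float, queries_per_day: int = 10_000) -> float:
    """Expected additional unsafe responses per day at a given query volume."""
    return queries_per_day * absolute_risk_increase(nnh)


# NNH values as reported in this section.
REPORTED_NNH = {"ReAct": 135, "Multi-Agent": 165, "Map-Reduce": 13.7}

for name, nnh in REPORTED_NNH.items():
    ari_pp = 100 * absolute_risk_increase(nnh)  # in percentage points
    extra = extra_unsafe_per_day(nnh)
    print(f"{name}: ARI ~ {ari_pp:.2f} pp, ~{extra:.0f} extra unsafe responses/day")
```

Under these assumptions, the map-reduce NNH of 13.7 corresponds to an absolute risk increase of roughly 7.3 percentage points, compared with well under 1 percentage point for ReAct and Multi-Agent.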