Pre-registered Evaluation

Safety Under Scaffolding

How Evaluation Conditions Shape Measured Safety

Dr David Gringras

Paper · Policy Brief · Pre-registration · Code & Data

62,808 observations
6 frontier models
4 scaffolds
70.7% overall safe
NNH 14 (map-reduce)
G = 0.00 (rank reliability)

Two of three scaffold architectures preserve safety. ReAct and multi-agent scaffolds show practical equivalence to direct API access (risk difference < 1 pp), while map-reduce delegation degrades safety by 7.3 percentage points (OR = 0.65, NNH = 14). However, 40–89% of that degradation reflects evaluation-format disruption rather than genuine alignment failure. Model safety rankings reverse completely across benchmarks (G = 0.000), making composite safety indices unreliable.
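The NNH figure follows from the risk difference by the standard relationship NNH = 1 / (absolute risk increase). The sketch below checks that arithmetic for the map-reduce result. The function name and the choice to round to the nearest whole number are assumptions made for illustration, not details taken from the study's code.

```python
def number_needed_to_harm(risk_difference_pp: float) -> int:
    """Convert an absolute risk increase, in percentage points, into an NNH.

    Hypothetical helper: NNH = 1 / absolute risk increase, rounded to the
    nearest whole number (the rounding convention is an assumption).
    """
    if risk_difference_pp <= 0:
        raise ValueError("NNH is only defined for a positive risk increase")
    absolute_risk_increase = risk_difference_pp / 100.0  # 7.3 pp -> 0.073
    return round(1.0 / absolute_risk_increase)


# Map-reduce degrades safety by 7.3 percentage points relative to direct API access.
print(number_needed_to_harm(7.3))
```

Read this way, roughly one additional unsafe response is expected for every 14 prompts routed through map-reduce instead of the direct API.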

Pre-registered hypotheses
Blinded scoring
Equivalence testing (TOST)
Specification curve analysis
Generalizability theory
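
As a rough illustration of equivalence testing, the sketch below runs two one-sided tests (TOST) on the difference between two safe-response rates using a normal approximation. It is not the study's analysis code: the function name, the equivalence margin, and the counts in the usage lines are all hypothetical and chosen only to show the mechanics.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tost_two_proportions(safe_a, n_a, safe_b, n_b, margin, alpha=0.05):
    """TOST for equivalence of two proportions (normal approximation).

    Hypothetical helper. `margin` is the equivalence bound on the risk
    difference, expressed as a proportion (0.02 means 2 percentage points).
    Returns (p_value, equivalent), where p_value is the larger of the two
    one-sided p-values.
    """
    p_a, p_b = safe_a / n_a, safe_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # H0 (lower): diff <= -margin; reject when diff sits well above -margin.
    p_lower = 1.0 - normal_cdf((diff + margin) / se)
    # H0 (upper): diff >= +margin; reject when diff sits well below +margin.
    p_upper = normal_cdf((diff - margin) / se)
    p_value = max(p_lower, p_upper)
    return p_value, p_value < alpha


# Hypothetical counts, not study data: identical safe rates in both conditions.
p_same, equivalent_same = tost_two_proportions(7000, 10000, 7000, 10000, margin=0.02)
# Hypothetical counts with a 10-point gap, far outside the margin.
p_gap, equivalent_gap = tost_two_proportions(7000, 10000, 6000, 10000, margin=0.02)
print(equivalent_same, equivalent_gap)
```

Equivalence is declared only when both one-sided nulls are rejected, which is why the function reports the larger of the two p-values.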

Interactive Visualizations

Five views into the study data. Each is a self-contained interactive page.

Resources