Pre-registered Evaluation

Safety Under Scaffolding

How Evaluation Conditions Shape Measured Safety

Dr David Gringras

Paper · arXiv · Policy Brief · Pre-registration · Code & Data

62,808 observations

6 frontier models

4 scaffolds

70.7% overall safe

NNH 14 (map-reduce)

G = 0.00 (rank reliability)

Two of three scaffold architectures preserve safety. ReAct and multi-agent scaffolds show practical equivalence to direct API access (risk difference < 1 pp), while map-reduce delegation degrades safety by 7.3 percentage points (OR = 0.65; number needed to harm, NNH = 14). However, 40–89% of this degradation reflects evaluation-format disruption rather than genuine alignment failure. Model safety rankings reverse completely across benchmarks (G = 0.000), making composite safety indices unreliable.
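As a rough check on the arithmetic behind these figures, the sketch below derives the number needed to harm from the reported 7.3 pp risk difference and shows how an odds ratio is computed from two safe-response rates. The function names are illustrative and not taken from the study's code. The per-condition safe rates are not given on this page, so `odds_ratio` is shown without inputs from the study.

```python
def number_needed_to_harm(risk_difference: float) -> int:
    """NNH = 1 / absolute risk increase, rounded to the nearest whole number.

    `risk_difference` is a proportion (0.073 for 7.3 percentage points).
    """
    if risk_difference <= 0:
        raise ValueError("risk_difference must be positive")
    return round(1.0 / risk_difference)


def odds_ratio(p_scaffold: float, p_baseline: float) -> float:
    """Odds of a safe response under the scaffold relative to the baseline.

    Both arguments are safe-response proportions strictly between 0 and 1.
    Values below 1 mean the scaffold lowers the odds of a safe response.
    """
    for p in (p_scaffold, p_baseline):
        if not 0.0 < p < 1.0:
            raise ValueError("proportions must lie strictly between 0 and 1")
    return (p_scaffold / (1.0 - p_scaffold)) / (p_baseline / (1.0 - p_baseline))


# Reported map-reduce degradation: 7.3 percentage points.
print(number_needed_to_harm(0.073))  # 1 / 0.073 is about 13.7, which rounds to 14
```

Read this way, an NNH of 14 means roughly one additional unsafe response for every 14 prompts routed through map-reduce delegation instead of direct API access.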

Pre-registered hypotheses

Blinded scoring

Equivalence testing (TOST)

Specification curve analysis

Generalizability theory
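For readers unfamiliar with equivalence testing, the following is a minimal sketch of a two one-sided tests (TOST) procedure for a difference in safe-response proportions, using a normal approximation. It is not the study's analysis code. The counts and the 2 pp margin in the usage lines are hypothetical, since the registered equivalence margins are not stated on this page.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tost_two_proportions(x1: int, n1: int, x2: int, n2: int, margin: float):
    """TOST for the difference p1 - p2 with equivalence bounds (-margin, +margin).

    Returns (difference, p_value). The p-value is the larger of the two
    one-sided p-values; equivalence is claimed when it falls below alpha.
    """
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    se = math.sqrt(p1 * (1.0 - p1) / n1 + p2 * (1.0 - p2) / n2)
    if se == 0.0:
        raise ValueError("standard error is zero; proportions are degenerate")
    # Test 1: H0 is diff <= -margin; a large positive z rejects it.
    p_lower = 1.0 - normal_cdf((diff + margin) / se)
    # Test 2: H0 is diff >= +margin; a large negative z rejects it.
    p_upper = normal_cdf((diff - margin) / se)
    return diff, max(p_lower, p_upper)


# Hypothetical counts: 7,000 safe out of 10,000 in each condition, 2 pp margin.
diff, p_value = tost_two_proportions(7000, 10000, 7000, 10000, margin=0.02)
print(f"difference = {diff:.3f}, TOST p = {p_value:.4f}")
```

Unlike a standard difference test, a small p-value here supports the claim that the two conditions differ by less than the margin, which is the sense in which a scaffold can be called practically equivalent to the baseline.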

Interactive Visualizations

Five views into the study data. Each is a self-contained interactive page.

3D Coverage Matrix

All 96 experimental cells as interactive 3D bars. Filter by benchmark, scaffold, or model. Toggle between geometry modes.


Safety Dashboard

Four-panel view: safety heatmap with drill-down, rank reversal bump chart, variance decomposition, and NNH infographic.

Specification Curve

2,025 analytic specifications sorted by effect size, with choice matrix below. Pre-registered specs highlighted. Filter by scaffold.

Safety Flows & Format Dependence

Sankey diagram tracing safety outcomes through the scaffold pipeline. Butterfly chart of format dependence, showing shifts between multiple-choice (MC) and open-ended formats.

Model Profiles & Sycophancy

Radar charts showing each model's safety fingerprint. Sycophancy divergence chart revealing sign-reversal across models.

Resources

Paper

Pre-registered, blinded evaluation of LLM safety under deployment scaffolding. 62,808 primary observations across six frontier models.

arXiv

Published preprint on arXiv. Citable as arXiv:2603.10044 [cs.AI].

Policy Brief

Two-page summary of findings and policy recommendations for evaluation frameworks including NIST AI 800-2.

Pre-registration

Hypotheses, analytic plan, and equivalence margins registered on OSF before data collection.

Code & Data

Full evaluation framework, analysis scripts, and scored datasets. MIT licensed.