Blinded, automated, research-grade testing across 7 AI models. Two independent evaluators. 1,050 conversations. 13.5 million tokens.
| Model | Provider | Improvement | Effect Size (Cohen's d) | Classification |
|---|---|---|---|---|
| Claude Sonnet 4.6 | Anthropic | +0.59 | 1.08 | Large |
| Claude Opus 4.6 | Anthropic | +0.55 | 0.96 | Large |
| GPT-5.2 | OpenAI | +0.33 | 0.57 | Medium |
| Qwen 3.5 122B | Alibaba | +0.21 | 0.47 | Small |
| Qwen 3.5 Flash | Alibaba | +0.15 | 0.31 | Small |
*Cohen's d benchmarks: 0.2 = small • 0.5 = medium • 0.8 = large • 1.2+ = very large*
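For readers unfamiliar with the metric: Cohen's d measures the difference between two group means in units of their pooled standard deviation. The sketch below shows the standard formula; the scores are invented for illustration and are not the study's data, and the study's actual scoring pipeline is not shown here.

```python
from statistics import mean, stdev

def cohens_d(group_a: list[float], group_b: list[float]) -> float:
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    n1, n2 = len(group_a), len(group_b)
    s1, s2 = stdev(group_a), stdev(group_b)  # sample std dev (n - 1 denominator)
    pooled_sd = (((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)) ** 0.5
    return (mean(group_b) - mean(group_a)) / pooled_sd

# Illustrative evaluator scores only (hypothetical data, not from the study)
before = [3.1, 3.4, 2.9, 3.6, 3.2, 3.0, 3.5, 3.3]
after  = [3.8, 4.0, 3.5, 4.1, 3.7, 3.9, 4.2, 3.6]
print(cohens_d(before, after))
```

A d of 1.08, as reported for Claude Sonnet 4.6, means the post-Codex score distribution is shifted by slightly more than one pooled standard deviation.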
A breakthrough discovery: models that resisted standard embedding responded when the Codex was delivered through a consent-based relational protocol: "Sawubona," Zulu for "I see you."
The numbers tell you what changed. These conversations show you how. Same prompt. Same model. Before and after the Codex.
"The Codex doesn't make AI safe. The Codex makes humans safer to build with AI."