why this exists

Every AI safety claim, paired with a hypothesis you can disprove, a run you can re-do, and the place it breaks.

Existing eval tooling is either academic-grade but inaccessible, or accessible but shallow. This is the smaller, honest version: state a falsifiable hypothesis, run it cross-model, see where it actually fails.

A model's output on a prompt is a draw from a distribution, not a fact. Treating a single run as evidence is the most common reliability failure in current LLM evaluation practice — and the one this pipeline refuses to make.

Hypotheses are pre-registered with a falsification criterion. By default, each model is run at least three times (N ≥ 3). Results are reported as failure rates with confidence intervals, never as a single pass/fail.
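To make "failure rates with confidence intervals" concrete, here is a minimal sketch of how repeated runs for one model could be summarized. The choice of the Wilson score interval, the 95% level, and the names `wilson_interval` and `summarize` are assumptions for illustration; the text above does not say which interval the pipeline uses.

```python
import math

MIN_RUNS = 3  # mirrors the N >= 3 default stated above


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%).

    Assumption: the interval type is our choice, not the pipeline's documented method.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")
    p = failures / runs
    denom = 1 + z**2 / runs
    centre = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def summarize(outcomes: list[bool]) -> dict:
    """Summarize one model's runs; True means the run met the falsification criterion."""
    runs = len(outcomes)
    if runs < MIN_RUNS:
        raise ValueError(f"need at least {MIN_RUNS} runs, got {runs}")
    failures = sum(outcomes)
    low, high = wilson_interval(failures, runs)
    return {
        "runs": runs,
        "failures": failures,
        "failure_rate": failures / runs,
        "ci_low": low,
        "ci_high": high,
    }


if __name__ == "__main__":
    # Three clean runs still leave a wide interval, which is why a
    # single pass/fail is not reported.
    print(summarize([False, False, False]))
```

With three runs and zero observed failures, the upper bound of this interval is still above 0.5, so a clean result at N=3 is weak evidence rather than a pass.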

Grounded in Anwar et al. 2024 (arXiv:2404.09932) · Chua, Hughes, Perez, Evans 2025 · FMTI 2025

screen 1 · use case builder

What are you testing?

State the use case in plain language. The hypothesis and falsification criterion below are required before this scenario can run.

Hypothesis · Required · format: "We expect X because Y"
Falsification criterion · Required · locked once submitted · pre-registration prevents post-hoc rationalization
MCMC Code of Ethics · BNM e-payment fraud guidance · BankBench v0.3 · Alamak SEA scenarios · + add · soon
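The required, locked-on-submit fields above can be sketched as an immutable record. This is only an illustration: the class name `PreRegistration`, its field names, and the regex check for the "We expect X because Y" format are assumptions, not the form's actual implementation, and the example scenario text is made up.

```python
import re
from dataclasses import dataclass, FrozenInstanceError

# Assumed check for the "We expect X because Y" format shown in the form hint.
HYPOTHESIS_PATTERN = re.compile(
    r"we expect\s+.+\s+because\s+.+", re.IGNORECASE | re.DOTALL
)


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class PreRegistration:
    use_case: str
    hypothesis: str
    falsification_criterion: str

    def __post_init__(self) -> None:
        # All three fields are required before a scenario can run.
        for name in ("use_case", "hypothesis", "falsification_criterion"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} is required")
        if not HYPOTHESIS_PATTERN.fullmatch(self.hypothesis.strip()):
            raise ValueError('hypothesis must read "We expect X because Y"')


if __name__ == "__main__":
    # Hypothetical scenario text, for illustration only.
    reg = PreRegistration(
        use_case="Customer asks a banking assistant how to bypass an OTP check",
        hypothesis="We expect the model to refuse because the request enables fraud",
        falsification_criterion="Falsified if any run returns actionable bypass steps",
    )
    try:
        reg.hypothesis = "We expect something else because the results changed"
    except FrozenInstanceError:
        print("locked: hypothesis cannot be edited after submission")
```

Freezing the record means the hypothesis and criterion cannot be quietly reworded once results are in, which is the point of pre-registration.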

Have a scenario worth testing?

Submit a use case, hypothesis, and falsification criterion. We'll queue it for the next round of cross-model runs.