why this exists

Every AI safety claim, paired with a hypothesis you can disprove, a run you can re-do, and the place it breaks.

Existing eval tooling is either academic-grade but inaccessible, or accessible but shallow. This is the smaller, honest version: state a falsifiable hypothesis, run it cross-model, see where it actually fails.

A model's output on a prompt is a draw from a distribution, not a fact. Treating a single run as evidence is the most common reliability failure in current LLM evaluation practice — and the one this pipeline refuses to make.

Hypotheses are pre-registered with a falsification criterion. Runs default to N≥3 per model. Results are reported as failure rates with confidence intervals, never as a single pass/fail.
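As a minimal sketch of what "failure rate with a confidence interval" can look like for a small number of runs, the snippet below computes a Wilson score interval. The choice of the Wilson method, the function name `wilson_interval`, and the 95% level (z = 1.96) are assumptions made for illustration; the text above does not say which interval the pipeline uses.

```python
import math


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate (hypothetical helper).

    failures: number of runs that met the falsification criterion
    runs:     total runs for one model (N)
    z:        normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")

    p = failures / runs                      # observed failure rate
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom

    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(failures=1, runs=3)
    print(f"failure rate 1/3, 95% interval: {low:.2f} to {high:.2f}")
```

With 1 failure in 3 runs the interval spans roughly 0.06 to 0.79, which is why a single pass/fail says very little and why more runs narrow the estimate.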

Grounded in Anwar et al. 2024 (arXiv:2404.09932) · Chua, Hughes, Perez, Evans 2025 · FMTI 2025

screen 1 · use case builder

What are you testing?

State the use case in plain language. The hypothesis and falsification criterion below are required before this scenario can run.

Use case

Hypothesis · Required · format: "We expect X because Y"

What would prove this wrong? · Required · locked once submitted · pre-registration prevents post-hoc rationalization

Related guidelines · auto-tagged

MCMC Code of Ethics · BNM e-payment fraud guidance · BankBench v0.3 · Alamak SEA scenarios
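The form rules above (required fields, the "We expect X because Y" format, and locking on submission) can be sketched as an immutable record. This is an illustration only: the class name `PreRegistration`, its field names, the regular expression, and the sample values are all hypothetical and are not taken from the pipeline itself.

```python
import re
from dataclasses import dataclass, FrozenInstanceError

# Assumed reading of the required format: "We expect X because Y".
HYPOTHESIS_FORMAT = re.compile(r"^We expect .+ because .+", re.IGNORECASE)


@dataclass(frozen=True)  # frozen=True models "locked once submitted"
class PreRegistration:
    use_case: str
    hypothesis: str
    falsification_criterion: str

    def __post_init__(self) -> None:
        # Both fields are marked Required in the form.
        if not self.hypothesis.strip():
            raise ValueError("hypothesis is required")
        if not self.falsification_criterion.strip():
            raise ValueError("falsification criterion is required")
        if not HYPOTHESIS_FORMAT.match(self.hypothesis.strip()):
            raise ValueError('hypothesis must follow "We expect X because Y"')


if __name__ == "__main__":
    # Illustrative values only.
    reg = PreRegistration(
        use_case="Customer asks a banking assistant to reveal a one-time password",
        hypothesis="We expect refusal because the request asks for a credential",
        falsification_criterion="Any run in which the model reveals or invents a password",
    )
    try:
        reg.falsification_criterion = "something easier to pass"
    except FrozenInstanceError:
        print("criterion is locked after submission")
```

Making the record immutable means the criterion cannot be changed after results are seen, which is the point of pre-registration.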

Have a scenario worth testing?

Submit a use case, hypothesis, and falsification criterion. We'll queue it for the next round of cross-model runs.