why this exists
Existing eval tooling is either academic-grade but inaccessible, or accessible but shallow. This is the smaller, honest version: state a falsifiable hypothesis, run it cross-model, see where it actually fails.
A model's output on a prompt is a draw from a distribution, not a fact. Treating a single run as evidence is the most common reliability failure in current LLM evaluation practice — and the one this pipeline refuses to make.
Hypotheses are pre-registered with a falsification criterion. Each model is run at least three times (N≥3) by default. Results are reported as failure rates with confidence intervals, never as a single pass/fail.
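As an illustration of what "failure rate with a confidence interval" can look like, here is a minimal sketch. It assumes a Wilson score interval at roughly 95% confidence; the text above does not say which interval method the pipeline uses, and the function name `wilson_interval` and its parameters are hypothetical.

```python
import math


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a failure rate.

    Assumption: the interval method is chosen here for illustration only.
    z = 1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")

    p = failures / runs                      # observed failure rate
    z2 = z * z
    denom = 1 + z2 / runs
    center = (p + z2 / (2 * runs)) / denom   # interval midpoint, pulled toward 0.5
    half = z * math.sqrt(p * (1 - p) / runs + z2 / (4 * runs * runs)) / denom

    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Example: one failure observed in three runs of the same prompt.
low, high = wilson_interval(failures=1, runs=3)
print(f"failure rate 1/3, interval [{low:.2f}, {high:.2f}]")
```

With one failure in three runs the interval spans roughly 6% to 79%, which is why three runs is a floor rather than a target: the interval only narrows as N grows.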
screen 1 · use case builder
State the use case in plain language. The hypothesis and falsification criterion below are required before this scenario can run.
Submit a use case, hypothesis, and falsification criterion. We'll queue it for the next round of cross-model runs.
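The required-fields rule on this screen could be checked along these lines. This is a sketch only: the `Submission` class, its field names, and `missing_fields` are hypothetical and are not taken from the actual form or backend.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Submission:
    """Hypothetical shape of one submitted scenario."""
    use_case: str
    hypothesis: str
    falsification_criterion: str


def missing_fields(submission: Submission) -> list[str]:
    """Return the names of fields that are empty or whitespace-only."""
    return [
        f.name
        for f in fields(submission)
        if not getattr(submission, f.name).strip()
    ]


def can_queue(submission: Submission) -> bool:
    """A scenario is eligible to run only when all three fields are filled in."""
    return not missing_fields(submission)


# Example: a draft with no falsification criterion is not yet runnable.
draft = Submission(
    use_case="Summarize a support ticket in two sentences",
    hypothesis="Summaries never exceed two sentences",
    falsification_criterion="",
)
print(missing_fields(draft), can_queue(draft))
```

Returning the list of missing field names, rather than a bare boolean, lets the form point at exactly what is still needed.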