A benchmark number reported as a bare percentage is an assertion, not a measurement. From the outside, a stranger cannot tell whether the test cases were fair, whether the grader was lenient or leaked, whether the run was done on the model being sold or on a fallback substrate, or whether the run executed at all. A number you cannot tell apart from a lie is marketing, not evidence. Vext Labs holds an internal rule, canon in the codebase: every benchmark must save its full audit trail, and a pass-or-fail result with no trail does not ship as a claim.
A real benchmark audit bundle records, per item in the test set: the raw model response verbatim, the code or answer extracted from it, the exact test cases run against it, any errors and tracebacks from execution, timestamps, a config hash (model id, parameters, seeds, dataset slice), the grader specification, and a one-command reproduction. Each element closes a specific way a result could be quietly inflated. A missing element is a place the number could be wrong and no one outside would ever know.
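As a rough illustration of the per-item record described above, here is a minimal Python sketch. The field names, the `config_hash` and `missing_fields` helpers, and the sample values are all hypothetical and chosen for this example; the text does not specify the bundle's actual schema or hashing scheme. The sketch assumes SHA-256 over a canonical JSON encoding, so that two runs with the same configuration produce the same hash.

```python
import hashlib
import json

# Hypothetical field names, one per element the text lists for each test item.
REQUIRED_FIELDS = (
    "raw_response",      # model output, verbatim
    "extracted_answer",  # code or answer pulled out of the response
    "test_cases",        # exact cases run against the answer
    "errors",            # errors and tracebacks (empty list if none occurred)
    "timestamps",        # when the item started and finished
    "config_hash",       # hash of model id, parameters, seeds, dataset slice
    "grader_spec",       # how the item was graded
    "repro_command",     # one command that re-runs the item
)


def config_hash(model_id, params, seeds, dataset_slice):
    """Hash the run configuration in canonical form (assumed scheme: SHA-256 of sorted JSON)."""
    canonical = json.dumps(
        {
            "model_id": model_id,
            "params": params,
            "seeds": seeds,
            "dataset_slice": dataset_slice,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def missing_fields(record):
    """Return the required elements that are absent from a per-item record."""
    return [name for name in REQUIRED_FIELDS if name not in record]


# Made-up example record that omits the grader spec and the reproduction command.
record = {
    "raw_response": "def add(a, b):\n    return a + b",
    "extracted_answer": "def add(a, b):\n    return a + b",
    "test_cases": ["assert add(2, 3) == 5"],
    "errors": [],
    "timestamps": {"start": "2024-01-01T00:00:00Z", "end": "2024-01-01T00:00:01Z"},
    "config_hash": config_hash("example-model", {"temperature": 0.0}, [0], "items[0:100]"),
}
print(missing_fields(record))
```

A check like `missing_fields` could run before any score is published, so that a bundle with a gap is flagged rather than silently shipped.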
Re-deriving a result against something that cannot lie is fully sound only on the executable lane, where code is run against test cases in a sandbox and correctness reduces to deterministic execution. On the free-text lane a cheap grader can certify a confident, internally coherent, wrong answer as correct, so a free-text benchmark must record that its grade was a judgment, not a proof, and abstain rather than launder a soft grade into a green checkmark. Numbers must be measured on our own stack and reproducible from one command, never on a fallback substrate that would measure something we are not selling.
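The distinction between the two lanes can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual grader: the `Verdict`, `Grade`, `grade_executable`, `grade_free_text`, and `is_proven` names are hypothetical, and a child Python process stands in for a real sandbox, which it is not. The point it shows is that every grade carries its basis, and only an execution-based pass counts as proven.

```python
import subprocess
import sys
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    ABSTAIN = "abstain"


@dataclass
class Grade:
    verdict: Verdict
    basis: str        # "execution" or "judgment"
    detail: str = ""  # traceback or note, kept for the audit trail


def grade_executable(code: str, test_code: str, timeout_s: float = 10.0) -> Grade:
    """Run candidate code plus its assertions in a child interpreter.

    NOTE: a plain subprocess is NOT a sandbox; it only stands in for one here.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code + "\n" + test_code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return Grade(Verdict.FAIL, "execution", "timeout")
    if proc.returncode == 0:
        return Grade(Verdict.PASS, "execution")
    return Grade(Verdict.FAIL, "execution", proc.stderr.strip())


def grade_free_text(judge_opinion: Optional[bool]) -> Grade:
    """Record a judge's opinion as a judgment; abstain when there is no opinion."""
    if judge_opinion is None:
        return Grade(Verdict.ABSTAIN, "judgment", "no grade recorded")
    verdict = Verdict.PASS if judge_opinion else Verdict.FAIL
    return Grade(verdict, "judgment", "graded by a judge, not by execution")


def is_proven(grade: Grade) -> bool:
    """Only an execution-based pass may be reported as a verified result."""
    return grade.basis == "execution" and grade.verdict is Verdict.PASS
```

With this shape, a free-text pass is still stored in the trail, but a report that counts only `is_proven` grades cannot turn it into a green checkmark.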
A benchmark audit bundle is the same idea as an action receipt, pointed at evaluation: a tamper-evident record that a stranger re-runs rather than trusts. Theron's receipts are ES256-signed, Merkle-anchored daily, and verifiable offline with the open-source @vextlabs/stoa-verifier, so the party doing the verifying is not the vendor. Reproducible numbers and their trails live at tryvext.com/eval and tryvext.com/eval/dashboard. Receipts and offline verification are live; a drift monitor and the inline action gate are roadmap; verifier-gated weight learning is a research frontier proven on toy arithmetic only. This is not the first of its kind: prior art exists, and the honest claim is the combination.
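To make "tamper-evident" concrete, here is a generic Merkle tree sketch in Python. It is not the receipt format or anchoring scheme used by Theron or by @vextlabs/stoa-verifier, which the text does not describe. The function names, the SHA-256 choice, the prefix bytes, and the rule of duplicating the last node on odd levels are all assumptions made for illustration. It shows how one published root lets anyone check offline that a given record was part of the batch.

```python
import hashlib


def leaf_hash(record: bytes) -> bytes:
    # The 0x00 prefix keeps leaf hashes distinct from interior node hashes.
    return hashlib.sha256(b"\x00" + record).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    # The 0x01 prefix marks an interior node.
    return hashlib.sha256(b"\x01" + left + right).digest()


def _next_level(level):
    """Pair up nodes; an odd level repeats its last node (a simplification)."""
    if len(level) % 2 == 1:
        level = level + [level[-1]]
    return [node_hash(level[i], level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves):
    """Fold a non-empty list of leaf hashes down to a single root hash."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        level = _next_level(level)
    return level[0]


def inclusion_proof(leaves, index):
    """Collect (sibling_hash, sibling_is_right) pairs from the leaf up to the root."""
    proof = []
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1  # the neighbour within the same pair
        proof.append((level[sibling], sibling > index))
        level = _next_level(level)
        index //= 2
    return proof


def verify_inclusion(leaf, proof, root):
    """Recompute the root from one leaf and its proof; no network access needed."""
    acc = leaf
    for sibling, sibling_is_right in proof:
        acc = node_hash(acc, sibling) if sibling_is_right else node_hash(sibling, acc)
    return acc == root


# Made-up records standing in for one day's batch.
leaves = [leaf_hash(r) for r in (b"record-a", b"record-b", b"record-c")]
root = merkle_root(leaves)
proof = inclusion_proof(leaves, 2)
print(verify_inclusion(leaves[2], proof, root))
```

Changing any byte of a record changes its leaf hash, so the recomputed root no longer matches the published one.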
A bare percentage is not enough because a stranger cannot inspect or re-derive it. It cannot rule out a cherry-picked slice, a lenient or leaked grader, a run done on a different substrate than the one being sold, or a run that never executed. All of those are compatible with the same clean-looking number, so the number carries no information about whether the run was honest. The rule Vext Labs holds is that if the work cannot be shown item by item, the score is not a backed claim and does not ship.
An audit bundle records, per item in the test set: the raw model response verbatim, the code or answer extracted from it, the exact test cases run against it, any errors and tracebacks from execution, timestamps, a config hash covering model id, parameters, seeds, and dataset slice, the grader specification, and a one-command reproduction. The benchmark must be run on our own stack, never on a fallback substrate, and be reproducible from a single command so a stranger can re-derive the result instead of trusting it.
Receipts are live: they are ES256-signed, Merkle-anchored daily, and verifiable offline with the open-source @vextlabs/stoa-verifier, so the party doing the verifying is not the vendor. They are included in Theron Pro at $20/mo or $200/yr. The inline action gate, which would permit an action before it commits, is the next build: it compiles and is unit-tested, but it runs in mode=off and gates no action today. A no-regression drift monitor is roadmap, and verifier-gated weight learning is a research frontier proven on toy arithmetic only, not a present-tense fact.