Every AI product hands you an answer and asks you to trust it. At Vext Labs, every action Theron takes can be sealed into a receipt that you verify yourself, offline, with an open-source verifier that runs independently of us. But a receipt is only worth something if it records a real check, so we measured what a parameter-free verifier can actually prove.
A parameter-free verifier re-derives the work against something that cannot lie, and it does so in two lanes. On the executable lane, code runs against tests in a sandbox, so the pass-or-fail outcome is deterministic. On the free-text lane, the embedded arithmetic is re-run and the final answer is checked against the last verified step.
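The free-text lane can be illustrated with a minimal sketch. This is not Theron's verifier: the `STEP` pattern, the `check_free_text` function, and the verdict strings are hypothetical names chosen for the example, and the sketch only handles simple steps of the form `a op b = c`.

```python
import re
from fractions import Fraction

# Hypothetical pattern: matches simple steps such as "12 * 4 = 48".
STEP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s*([+\-*/])\s*(-?\d+(?:\.\d+)?)\s*=\s*(-?\d+(?:\.\d+)?)"
)

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b,
}


def check_free_text(text, final_answer):
    """Re-run each embedded arithmetic step, then compare the final answer
    with the result of the last verified step."""
    steps = STEP.findall(text)
    if not steps:
        # Nothing to re-derive: abstain instead of passing.
        return "insufficient"
    last = None
    for a, op, b, c in steps:
        a, b, c = Fraction(a), Fraction(b), Fraction(c)
        if op == "/" and b == 0:
            return "fail"
        if OPS[op](a, b) != c:
            return "fail"
        last = c
    try:
        answer = Fraction(final_answer)
    except ValueError:
        # A non-numeric final answer cannot be compared with a step result.
        return "insufficient"
    return "pass" if answer == last else "fail"
```

Exact rational arithmetic via `Fraction` avoids floating-point mismatches when a step contains decimals. Note what the check does not cover: it never asks whether the steps were the right ones to take.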
The result: on the executable lane the verifier is sound, catching wrong code with a near-zero false-positive rate and a perfect catch rate. On free text it is not. A confident answer that is wrong but internally coherent, the most common way a language model fails, was certified as correct 58 times out of 60 (about 97%). We also corrected an earlier internal result whose near-zero number turned out to be measuring only the weakest failure class.
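To see why a coherent wrong answer slips through, consider a small sketch. The question, the model text, and the function name `arithmetic_is_consistent` are all invented for illustration and are not drawn from the probe corpus.

```python
import re

# Hypothetical pattern for steps such as "60 * 2 = 120".
STEP = re.compile(r"(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)")


def arithmetic_is_consistent(text):
    """True when the text has at least one 'a op b = c' step and every
    such step recomputes correctly."""
    ops = {
        "+": lambda a, b: a + b,
        "-": lambda a, b: a - b,
        "*": lambda a, b: a * b,
    }
    steps = STEP.findall(text)
    return bool(steps) and all(
        ops[op](int(a), int(b)) == int(c) for a, op, b, c in steps
    )


# Hypothetical question: "A train travels at 60 km/h for 3 hours. How far?"
true_answer = 180

# The model misreads 3 hours as 2 hours. Its arithmetic is still coherent.
model_text = "Distance = speed * time, so 60 * 2 = 120."
model_answer = 120

certified = arithmetic_is_consistent(model_text)   # the arithmetic checks out
actually_correct = model_answer == true_answer     # the answer is still wrong
```

Here `certified` is true while `actually_correct` is false: the error sits in the premise, which an arithmetic re-run never inspects.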
So we route work to the lane where it is checkable, abstain honestly where it is not (returning insufficient, never a fake pass), and put the verdict and its lane into the receipt. The check is reproducible on a laptop with no GPU. That is the bet behind Theron: not an AI you must trust, but one that tells you, lane by lane, exactly how much of its work it can prove.
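The routing and receipt idea might be sketched as follows. Everything here is an assumption made for illustration: the receipt fields, the SHA-256 seal, and the names `seal`, `route`, and `verify` are hypothetical, the tests are plain callables rather than a real sandbox, and the free-text branch simply abstains instead of re-running arithmetic.

```python
import hashlib
import json


def seal(lane, verdict, output):
    """Build a receipt whose digest covers the lane, verdict, and output."""
    payload = json.dumps(
        {"lane": lane, "verdict": verdict, "output": output}, sort_keys=True
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return {"lane": lane, "verdict": verdict, "sha256": digest}


def route(output, tests=None):
    """Use the executable lane when tests exist; otherwise abstain."""
    if tests is None:
        # Not checkable in this sketch: report insufficient, never a pass.
        return seal("free_text", "insufficient", output)
    try:
        ok = all(test(output) for test in tests)
    except Exception:
        # A crashing test counts as a failure, not as a pass.
        ok = False
    return seal("executable", "pass" if ok else "fail", output)


def verify(receipt, output):
    """Offline check: recompute the digest from the claimed lane and verdict."""
    return receipt == seal(receipt["lane"], receipt["verdict"], output)
```

For example, `route("4", tests=[lambda out: out == "4"])` yields a receipt with lane `executable` and verdict `pass`, and `verify` returns `False` if the output is later altered. A production receipt would presumably carry a signature rather than a bare hash; that part is outside this sketch.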
The verifier is sound on the executable lane, where work can be run against tests. There it catches wrong outputs with a near-zero false-positive rate and a perfect catch rate.
It is not sound on the free-text lane. A confident, internally coherent wrong answer was certified as correct 58 of 60 times, so the verifier abstains there rather than fake a pass.
The result is reproducible. The adversarial probe is deterministic, seeded, and needs no GPU: build the corpus with scripts/eval/build_stse_corpus.py, then run scripts/eval/adversarial_eps_fp_probe.py against it.