Research Direction

Harnesses for Coding and Reasoning Systems

Evaluation harnesses are the missing link between models and operations.

Type
Research Direction
Status
Published
Published
April 8, 2026
Systems
boundary
Benchmarks are useful, but they are not harnesses. A benchmark scores a model on a fixed test set; a harness exercises a system inside scenarios that look like the work it is supposed to do.

### Scenarios, Not Test Sets

A coding harness runs a system against realistic codebases, with realistic errors, in realistic loops. A reasoning harness runs a system against realistic decision settings, with realistic uncertainty and realistic tools. The unit of evaluation is the scenario, not the prompt.

### Harness as Operational Surface

Once a harness exists, it becomes the surface on which improvements are evaluated, regressions are caught, and policies are stress-tested. Boundary is being built around this idea: evaluation infrastructure that operations teams can actually depend on.
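To make the scenario-as-unit idea concrete, here is a minimal sketch of what such a harness loop might look like. All names (`Scenario`, `run_harness`, `toy_system`) and the example scenarios are hypothetical illustrations, not part of Boundary's actual implementation: the point is only that each scenario bundles setup state with an outcome check, and the harness reports per-scenario results rather than a single score on a test set.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Scenario:
    """One realistic task: a name, setup state, and a check on the outcome."""
    name: str
    setup: dict
    check: Callable[[str], bool]  # did the system's output resolve the scenario?

@dataclass
class HarnessReport:
    """Per-scenario results, so regressions are attributable to a scenario."""
    passed: list = field(default_factory=list)
    failed: list = field(default_factory=list)

def run_harness(system: Callable[[dict], str],
                scenarios: list) -> HarnessReport:
    """Exercise the system inside each scenario and record pass/fail."""
    report = HarnessReport()
    for sc in scenarios:
        outcome = system(sc.setup)
        (report.passed if sc.check(outcome) else report.failed).append(sc.name)
    return report

# Hypothetical coding scenarios: fix an off-by-one, then keep an API stable.
scenarios = [
    Scenario("fix-off-by-one", {"bug": "range(1, n)"},
             lambda out: "range(n)" in out),
    Scenario("keep-api-stable", {"bug": "renamed export"},
             lambda out: "restore" in out),
]

def toy_system(setup: dict) -> str:
    # Stand-in for a real coding system; always proposes the same fix.
    return "use range(n) instead"

report = run_harness(toy_system, scenarios)
print(report.passed)  # → ['fix-off-by-one']
print(report.failed)  # → ['keep-api-stable']
```

Because the report is keyed by scenario name, a change to the system can be evaluated by diffing reports across runs, which is exactly the operational surface the section describes.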

Citation Artifact

DBRL-RESEARCH-HARNESSES-FOR-CODING-AND-REASONING-SYSTEMS-2026