M-Class Harness
A governed evaluation harness for coding, reasoning, and multi-step AI workflows, focused on planning, revision, recovery, and evidence production.
As AI agents move beyond single responses, evaluation must measure sustained work: state tracking, tool use, failure recovery, handoff quality, and evidence trails.
Problem Space
Most AI evaluations are too short to expose failures in sustained reasoning, context retention, tool discipline, and recovery from bad intermediate states.
System Direction
M-Class studies extended agent sessions through public-safe traces, scored artifacts, recovery checkpoints, and evidence-led review patterns.
Public Capabilities
- 01 Long-form workflow evaluation
- 02 Coding and reasoning task review
- 03 Evidence-led artifact inspection
- 04 Failure and recovery analysis
- 05 Public-safe benchmark packaging
M-Class is presented publicly as an evaluation research program. Internal scoring rubrics, prompts, traces, and harness mechanics are not disclosed.
What Is Not Disclosed
Private implementation details, security-sensitive internals, and unreleased runtime architecture are intentionally not disclosed.