Internal R&D · Evaluation & Governance

M-Class Harness

A governed evaluation harness for coding, reasoning, and multi-step AI workflows, focused on planning, revision, recovery, and evidence production.

As AI agents move beyond single responses, evaluation must measure sustained work: state tracking, tool use, failure recovery, handoff quality, and evidence trails.

Status

Internal R&D

Type

Governed Evaluation Harness

Problem Space

Most AI evaluations are too short to expose failures in sustained reasoning, context retention, tool discipline, and recovery from bad intermediate states.

System Direction

M-Class studies extended agent sessions through public-safe traces, scored artifacts, recovery checkpoints, and evidence-led review patterns.

Public Capabilities

01 Long-form workflow evaluation
02 Coding and reasoning task review
03 Evidence-led artifact inspection
04 Failure and recovery analysis
05 Public-safe benchmark packaging

Disclosure Boundary

M-Class is presented publicly as an evaluation research program. Internal scoring rubrics, prompts, traces, and harness mechanics are not disclosed.

What Is Not Disclosed

Private implementation details, security-sensitive internals, and unreleased runtime architecture are intentionally not disclosed.