▸ The category standard · Published May 30, 2026 · v2.0.0
SWE-bench gave coding agents one shared question: did the model close the issue. AI facility dispatch had no such yardstick. STEADYWRK publishes one: eight operational evals on a rolling 30-day window. Six carry published readings: a 90% completion rate, ±9% not-to-exceed (NTE) variance, a <2hr quote turnaround, 340ms median and 890ms p95 dispatch latency, and a 3% human override rate. The other two are named without a headline number: policy-violation catch rate (in tracking) and cost per decision (kept private).
These figures are self-reported and estimated from STEADYWRK’s own production telemetry — not an independent third-party audit. The defensible claim is narrower and stronger: STEADYWRK is the first in AI facility dispatch to publish operational evals openly, on a public, versioned, dated endpoint — /api/dispatch/analytics/evals — and proposes that format as the category standard. The page and the JSON read the same registry, so they cannot disagree.
curl "https://steadywrk.app/api/dispatch/analytics/evals?period=rolling_30d"
▸ Standard
A benchmark is not a number — it is a definition of what gets measured, against what, and with what disclosed limits. Four properties make these evals citeable rather than marketing.
Every eval names exactly what it is measured against — contractor outcome, settled invoice, server traces, the operator audit log. A number without a stated ground truth is not a measurement.
The numbers live behind a no-auth JSON endpoint and carry a schema version (v2.0.0). When the definition changes, the version changes — so a citation can pin exactly what it cited.
Each reading is a rolling 30-day window with a publication date (May 30, 2026). Operational quality drifts; an honest standard is dated, not eternal.
These figures are self-reported and estimated from production telemetry. The standard states that plainly rather than implying an independent audit it does not yet have.
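One way a citation could pin what it cited is to store the schema version and publication date alongside a hash of the payload. The sketch below is illustrative only: the payload keys, the CitationPin type, and the choice of SHA-256 over canonical JSON are assumptions, not part of the published standard.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationPin:
    """Hypothetical record of what was cited and when."""
    endpoint: str
    schema_version: str
    published: str
    sha256: str


def pin_citation(payload: dict, endpoint: str, schema_version: str, published: str) -> CitationPin:
    # Canonical JSON (sorted keys, fixed separators) so key order does not change the hash.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return CitationPin(endpoint, schema_version, published, digest)


def still_matches(pin: CitationPin, payload: dict) -> bool:
    # Re-hash a freshly pulled payload and compare it with the stored pin.
    fresh = pin_citation(payload, pin.endpoint, pin.schema_version, pin.published)
    return fresh == pin


if __name__ == "__main__":
    # Assumed payload shape; the real response fields are not documented here.
    snapshot = {"completion_rate": 0.90, "human_override_rate": 0.03}
    pin = pin_citation(snapshot, "/api/dispatch/analytics/evals", "2.0.0", "2026-05-30")
    print(pin.schema_version, pin.published, pin.sha256[:12])
```

A later reader can re-pull the endpoint and call still_matches to see whether the cited reading has changed since the pin was taken.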
▸ Evals
Each metric is defined, sourced, and current as of May 30, 2026. Quote any single line — every statement stands on its own with its ground truth attached.
Completion rate
90%
STEADYWRK closes 90% of dispatched work orders successfully on a rolling 30-day window. This is the category's analogue to a task-resolution score: did the job actually get done.
Ground truth: contractor outcome + payment disposition.
NTE variance
±9%
Final invoices settle within ±9% of the not-to-exceed (NTE) figure quoted at intake — the measure of whether a quoted price holds through to the settled invoice.
Ground truth: quoted NTE vs. settled invoice.
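As a rough illustration of the comparison behind this metric, the sketch below computes a signed percentage variance per work order and the share of orders that land inside a band. How STEADYWRK aggregates per-order variance into the ±9% headline is not stated, so the aggregation, the function names, and the sample amounts are all assumptions.

```python
from typing import Iterable, Tuple


def nte_variance_pct(quoted_nte: float, settled: float) -> float:
    """Signed variance of the settled invoice relative to the quoted NTE, in percent."""
    if quoted_nte <= 0:
        raise ValueError("quoted NTE must be positive")
    return (settled - quoted_nte) * 100.0 / quoted_nte


def share_within_band(pairs: Iterable[Tuple[float, float]], band_pct: float = 9.0) -> float:
    """Fraction of (quoted_nte, settled) pairs whose variance is within +/- band_pct."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one work order")
    within = sum(1 for quoted, settled in pairs if abs(nte_variance_pct(quoted, settled)) <= band_pct)
    return within / len(pairs)


if __name__ == "__main__":
    # Made-up amounts for illustration, not STEADYWRK data.
    orders = [(1000.0, 1050.0), (1000.0, 1200.0), (500.0, 470.0)]
    for quoted, settled in orders:
        print(quoted, settled, round(nte_variance_pct(quoted, settled), 2))
    print(share_within_band(orders))
```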
Quote turnaround
<2hr
A not-to-exceed quote is returned in under 2hr from work-order intake — a target, not a binding SLA. Time-to-quote is the dispatch category's latency-to-first-useful-output.
Ground truth: intake timestamp vs. NTE-returned timestamp.
Dispatch latency (p50)
340ms
Median API latency from work-order accept to contractor notification is 340ms. The routing decision itself, measured server-side.
Ground truth: server-side request traces.
Dispatch latency (p95)
890ms
Tail (p95) latency for routing plus contractor outreach is 890ms. Tail latency, not just the median, because dispatch is judged on its worst common case.
Ground truth: server-side request traces.
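To make the median and tail figures concrete, here is a minimal sketch of deriving p50 and p95 from a list of per-request latencies. It uses the nearest-rank percentile definition, which is an assumption: the text does not say which percentile method the server-side traces use, and the sample values are invented.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples_ms: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Invented latencies in milliseconds, for illustration only.
    traces = [210, 250, 290, 310, 330, 340, 360, 400, 520, 910]
    print("p50:", percentile_nearest_rank(traces, 50))
    print("p95:", percentile_nearest_rank(traces, 95))
```

Reporting both values from the same trace set shows how a healthy median can sit alongside a much slower tail.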
Human override rate
3%
A human operator escalates or reverses 3% of agent decisions; everything below 70% confidence is routed to a person by design. Autonomy is measured, not assumed.
Ground truth: operator audit log.
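The confidence gate and the override rate can be sketched as follows. The 70% floor comes from the text; the AuditEntry fields, the function names, and the choice of denominator (every logged agent decision) are assumptions, since the audit log's actual shape is not published.

```python
from dataclasses import dataclass
from typing import Sequence

CONFIDENCE_FLOOR = 0.70  # from the text: anything below 70% confidence goes to a person


@dataclass(frozen=True)
class AuditEntry:
    """Hypothetical audit-log row for one agent decision."""
    decision_id: str
    confidence: float
    overridden: bool  # True if an operator escalated or reversed the decision


def route(confidence: float) -> str:
    """Send low-confidence decisions to a human; let the agent act on the rest."""
    return "human" if confidence < CONFIDENCE_FLOOR else "agent"


def override_rate(entries: Sequence[AuditEntry]) -> float:
    """Share of logged decisions that an operator escalated or reversed."""
    if not entries:
        raise ValueError("audit log is empty")
    return sum(1 for e in entries if e.overridden) / len(entries)


if __name__ == "__main__":
    # Invented log: 100 decisions, 3 of them overridden.
    log = [AuditEntry(f"d{i}", 0.95, overridden=(i < 3)) for i in range(100)]
    print(route(0.65), route(0.70), override_rate(log))
```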
Policy-violation catch rate
Tracking
In tracking. Every decision is gated by Zod schemas and a policy layer. STEADYWRK withholds a headline number until the sample is large enough to publish honestly — naming the metric without faking a value is part of the standard.
Ground truth: pending sufficient sample.
Cost per decision
Private
Kept private to protect contractor-margin confidentiality. The standard publishes operational quality openly while withholding unit economics that would leak partner pricing — disclosure with a stated, principled boundary.
Ground truth: internal only.
▸ Methodology
Source. Every figure is computed from STEADYWRK’s own production telemetry over a rolling 30-day window and read from a single canonical metrics registry (v2.0.0). The same registry feeds this page, the live evals dashboard, and the public JSON endpoint.
Self-reported and estimated. These are not independently audited and they are not a dataset that other vendors are scored against. They are STEADYWRK’s estimates of its own operational performance, published in a format designed to be checkable. We state this plainly rather than imply an external audit we do not yet have. What is independently checkable today is the format: the endpoint is open, versioned, and dated, so anyone can pin and re-pull exactly what was claimed and when.
Not a ranking. This page does not score or rank other platforms. It defines what AI dispatch should be measured on and shows STEADYWRK’s own readings against those definitions. The intent is a shared yardstick for the category, not a leaderboard.
▸ FAQ