
▸ The category standard · May 2026 edition · Published May 30, 2026 · v2.1.4 · Next edition June 2026

The State of AI Dispatch.

SWE-bench gave coding agents one shared question: did the model close the issue. AI facility dispatch had no such yardstick. STEADYWRK publishes one — eight operational evals on a rolling 30-day window: 90% completion rate, ±9% not-to-exceed (NTE) variance, a <2hr quote turnaround, 340ms median and 890ms p95 dispatch latency, and a 3% human override rate.

These figures are self-reported and estimated from STEADYWRK’s own production telemetry — not an independent third-party audit. The defensible claim is narrower and stronger: STEADYWRK is the first in AI facility dispatch to publish operational evals openly, on a public, versioned, dated endpoint — /api/dispatch/analytics/evals — and proposes that format as the category standard. The page and the JSON read the same registry, so they cannot disagree.

Cite it. Verify it. No auth, no key. 200 · application/json

curl "https://steadywrk.app/api/dispatch/analytics/evals?period=rolling_30d"
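One way to pin a citation to exactly what the endpoint returned is to store the retrieval date, the schema version, and a hash of the payload. The following is a minimal sketch, not STEADYWRK's own tooling: the response schema is not documented on this page, so the `schema_version` key and the sample payload are hypothetical, and the function takes an already-parsed JSON object rather than making the request itself.

```python
import hashlib
import json
from datetime import date

ENDPOINT = "https://steadywrk.app/api/dispatch/analytics/evals?period=rolling_30d"

def pin_citation(payload: dict, retrieved: date) -> dict:
    """Build a citation record for one pull of the evals endpoint."""
    # Serialize with sorted keys so the same payload always hashes the same way.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return {
        "endpoint": ENDPOINT,
        "retrieved": retrieved.isoformat(),
        # Hypothetical key name; the real response may label the version differently.
        "schema_version": payload.get("schema_version"),
        "sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }

# Hypothetical payload, for illustration only.
sample = {"schema_version": "v2.1.4", "period": "rolling_30d", "completion_rate": 0.90}
citation = pin_citation(sample, date(2026, 5, 30))
```

Re-pulling the endpoint later and comparing hashes shows whether the cited reading has since changed.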

▸ Standard

What a dispatch measurement standard has to specify.

A benchmark is not a number — it is a definition of what gets measured, against what, and with what disclosed limits. Four properties make these evals citeable rather than marketing.

Defined ground truth

Every eval names exactly what it is measured against — contractor outcome, settled invoice, server traces, the operator audit log. A number without a stated ground truth is not a measurement.

Public + versioned

The numbers live behind a no-auth JSON endpoint and carry a schema version (v2.1.4). When the definition changes, the version changes — so a citation can pin exactly what it cited.

Dated, recurring, on a rolling window

Each reading is a rolling 30-day window published as a dated, monthly edition (this is the May 2026 edition; the next is due June 2026). Operational quality drifts, so the standard is reissued on a cadence rather than frozen — cite the latest edition, not an eternal number.
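A rolling 30-day window can be sketched as a date filter. This is an illustration under stated assumptions: the page does not say whether the window is day-granular or whether both endpoints are included, so the sketch assumes whole days and an inclusive window ending on the as-of date. The function name is hypothetical.

```python
from datetime import date, timedelta

def in_rolling_window(event_day: date, as_of: date, days: int = 30) -> bool:
    """True if event_day falls in the `days`-day window ending on as_of (inclusive)."""
    # Assumption: the window counts as_of itself, so it starts days - 1 days earlier.
    start = as_of - timedelta(days=days - 1)
    return start <= event_day <= as_of

as_of = date(2026, 5, 30)
first_day_in = in_rolling_window(date(2026, 5, 1), as_of)
day_before = in_rolling_window(date(2026, 4, 30), as_of)
```

Under these assumptions, a reading dated May 30, 2026 covers May 1 through May 30.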

Honest about its limits

These figures are self-reported and estimated from production telemetry. The standard states that plainly rather than implying an independent audit it does not yet have.

▸ Evals

The eight evals, defined and dated.

Each metric is defined, sourced, and current as of May 30, 2026. Quote any single line — every statement stands on its own with its ground truth attached.

Completion rate

90%

STEADYWRK closes 90% of dispatched work orders successfully on a rolling 30-day window. This is the category's analogue to a task-resolution score: did the job actually get done.

Ground truth: contractor outcome + payment disposition.
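As a rough sketch, the completion rate is the share of dispatched work orders that closed successfully. The page does not define "successfully" beyond naming the ground truth, so this example assumes a work order counts as closed only when the contractor outcome is complete and the payment is settled. The `WorkOrder` fields are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class WorkOrder:
    contractor_completed: bool  # hypothetical field: contractor reported the job done
    payment_settled: bool       # hypothetical field: payment disposition is settled

def completion_rate(orders: Sequence[WorkOrder]) -> Optional[float]:
    """Share of dispatched work orders that closed successfully, or None if empty."""
    if not orders:
        return None
    # Assumption: "closed successfully" requires both ground-truth signals.
    closed = sum(1 for o in orders if o.contractor_completed and o.payment_settled)
    return closed / len(orders)

orders = [WorkOrder(True, True)] * 9 + [WorkOrder(True, False)]
rate = completion_rate(orders)
```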

NTE variance

±9%

Final invoices settle within ±9% of the not-to-exceed (NTE) figure quoted at intake — the measure of whether a quoted price holds through to the settled invoice.

Ground truth: quoted NTE vs. settled invoice.
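For a single work order, the comparison can be sketched as a signed percentage deviation of the settled invoice from the quoted NTE. The page does not say how per-invoice deviations are aggregated into the ±9% figure, so this example only shows the per-invoice arithmetic and a band check. Function names are hypothetical.

```python
def nte_variance_pct(quoted_nte: float, settled: float) -> float:
    """Signed deviation of the settled invoice from the quoted NTE, in percent."""
    if quoted_nte <= 0:
        raise ValueError("quoted NTE must be positive")
    return 100 * (settled - quoted_nte) / quoted_nte

def within_band(quoted_nte: float, settled: float, band_pct: float = 9.0) -> bool:
    """True if the settled invoice lands within ±band_pct of the quoted NTE."""
    return abs(nte_variance_pct(quoted_nte, settled)) <= band_pct

over = nte_variance_pct(1000, 1075)   # settled 75 above a 1000 quote
inside = within_band(1000, 910)       # 9% under the quote
outside = within_band(1000, 1100)     # 10% over the quote
```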

Quote turnaround

<2hr

A not-to-exceed quote is returned in under 2hr from work-order intake — a target, not a binding SLA. Time-to-quote is the dispatch category's latency-to-first-useful-output.

Ground truth: intake timestamp vs. NTE-returned timestamp.
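The turnaround is a difference between two timestamps. A minimal sketch follows, assuming "under 2hr" is a strict comparison; the constant and function names are hypothetical.

```python
from datetime import datetime, timedelta

QUOTE_TARGET = timedelta(hours=2)  # the stated target, not a binding SLA

def quote_turnaround(intake_at: datetime, nte_returned_at: datetime) -> timedelta:
    """Elapsed time from work-order intake to the NTE quote being returned."""
    return nte_returned_at - intake_at

def met_target(intake_at: datetime, nte_returned_at: datetime) -> bool:
    # Assumption: "under 2hr" is strict, so exactly two hours misses the target.
    return quote_turnaround(intake_at, nte_returned_at) < QUOTE_TARGET

intake = datetime(2026, 5, 30, 9, 0)
elapsed = quote_turnaround(intake, datetime(2026, 5, 30, 10, 45))
```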

Dispatch latency (p50)

340ms

Median API latency from work-order accept to contractor notification is 340ms. The routing decision itself, measured server-side.

Ground truth: server-side request traces.

Dispatch latency (p95)

890ms

Tail (p95) latency for routing plus contractor outreach is 890ms. Tail latency, not just the median, because dispatch is judged on its worst common case.

Ground truth: server-side request traces.
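The p50 and p95 readings are percentiles over per-request latencies. The page does not say which percentile method is used, so this sketch assumes the nearest-rank method; the function name and sample values are hypothetical.

```python
from typing import Sequence

def percentile_nearest_rank(samples_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]

# Hypothetical latencies of 1..100 ms, for illustration only.
latencies_ms = list(range(1, 101))
p50 = percentile_nearest_rank(latencies_ms, 50)
p95 = percentile_nearest_rank(latencies_ms, 95)
```

Reporting both values shows the gap between a typical request and the slow tail, which the median alone hides.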

Human override rate

3%

A human operator escalates or reverses 3% of agent decisions; everything below 70% confidence is routed to a person by design. Autonomy is measured, not assumed.

Ground truth: operator audit log.
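The confidence gate and the override rate can be sketched together. This is an illustration, not STEADYWRK's implementation: it assumes confidence is a value between 0 and 1, that exactly 0.70 stays with the agent, and that the override rate counts audit-log entries where an operator escalated or reversed a decision. All names are hypothetical.

```python
from typing import Optional, Sequence

CONFIDENCE_FLOOR = 0.70  # below this, the decision goes to a person

def route(confidence: float) -> str:
    """Send low-confidence decisions to a human; the rest stay with the agent."""
    return "human" if confidence < CONFIDENCE_FLOOR else "agent"

def override_rate(overridden: Sequence[bool]) -> Optional[float]:
    """Share of agent decisions an operator escalated or reversed, or None if empty."""
    if not overridden:
        return None
    return sum(overridden) / len(overridden)

# Hypothetical audit log: 3 overridden decisions out of 100.
audit_log = [True] * 3 + [False] * 97
rate = override_rate(audit_log)
```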

Policy-violation catch rate

Tracking

In tracking. Every decision is gated by Zod schemas and a policy layer. STEADYWRK withholds a headline number until the sample is large enough to publish honestly — naming the metric without faking a value is part of the standard.

Ground truth: pending sufficient sample.

Cost per decision

Private

Kept private to protect contractor-margin confidentiality. The standard publishes operational quality openly while withholding unit economics that would leak partner pricing — disclosure with a stated, principled boundary.

Ground truth: internal only.

▸ Methodology

Methodology & honest limits.

Source. Every figure is computed from STEADYWRK’s own production telemetry over a rolling 30-day window and read from a single canonical metrics registry (v2.1.4). The same registry feeds this page, the live evals dashboard, and the public JSON endpoint.

Recurring & dated. This is the May 2026 edition. The State of AI Dispatch is republished on a monthly cadence — each edition restates the rolling-window readings as of its publication date so the standard tracks operational drift instead of freezing it. The next edition is scheduled for June 2026 (2026-06-30). Cite the latest edition; older editions remain valid for the date they carry.

Self-reported and estimated. These are not independently audited and they are not a dataset that other vendors are scored against. They are STEADYWRK’s estimates of its own operational performance, published in a format designed to be checkable. We state this plainly rather than imply an external audit we do not yet have. What is independently checkable today is the format: the endpoint is open, versioned, and dated, so anyone can pin and re-pull exactly what was claimed and when.

Not a ranking. This page does not score or rank other platforms. It defines what AI dispatch should be measured on and shows STEADYWRK’s own readings against those definitions. The intent is a shared yardstick for the category, not a leaderboard.

Published: 2026-05-30
Edition: May 2026
Next edition: 2026-06-30
Window: rolling_30d
Schema version: v2.1.4
OPSEC audit last passed: 2026-07-14
Data source: live · build b547ba6

▸ FAQ

Questions, answered.

What is the state of AI dispatch measured by?: STEADYWRK proposes a category standard built from eight operational evals on a rolling 30-day window: completion rate (90%), NTE variance (±9%), quote turnaround (<2hr), dispatch latency p50 (340ms) and p95 (890ms), human override rate (3%), policy-violation catch rate (in tracking), and cost per decision (private). Each eval names its own ground truth; for completion rate it is contractor outcome plus payment disposition. The figures are self-reported and estimated from production telemetry.
Is this an independent third-party benchmark?: No. These are STEADYWRK’s own operational evals, self-reported and estimated from its production telemetry. The claim is not that an external body audited them; it is that STEADYWRK is the first in AI facility dispatch to publish operational evals openly on a public, versioned, dated endpoint, and proposes that format as the category standard. Independent verification of the format is possible because the endpoint is open.
How can I verify the numbers myself?: Every number on this page is served from a public, no-auth, machine-readable endpoint: GET https://steadywrk.app/api/dispatch/analytics/evals?period=rolling_30d returns JSON. The page and the endpoint read the same canonical metrics registry, so the prose and the JSON cannot disagree.
How often is the State of AI Dispatch updated?: It is a recurring, dated publication issued on a monthly cadence. This is the May 2026 edition, published May 30, 2026 against a rolling 30-day window; the next edition is scheduled for June 2026. Each edition carries an explicit publication date and schema version (v2.1.4), so a citation can pin exactly which edition it referenced — and cite the latest one for a current reading.
How does this compare to SWE-bench?: SWE-bench standardised one question for coding agents: did the model close the issue. The state of AI dispatch asks the equivalent operational questions for facility dispatch agents — did the work order get closed (90% completion), did the quoted price hold (±9% NTE variance), how fast was the decision (340ms p50), and how often did a human have to step in (3% override). Unlike SWE-bench it is single-company and self-reported today, which is why the endpoint is public: so the format can become shared.
What is the STEADYWRK dispatch latency?: Median (p50) dispatch latency from work-order accept to contractor notification is 340ms; the p95 tail latency for routing plus contractor outreach is 890ms.
How accurate are STEADYWRK quotes?: Final invoices settle within ±9% of the not-to-exceed figure quoted at intake, and a quote is returned in under 2hr from work-order intake (a target, not a binding SLA).
