
▸ Ops Agent System Card · v1.0

The agent behind every dispatch decision.

Published under the NIST AI Risk Management Framework (AI RMF) and OpenAI/Anthropic system-card conventions. Read this before integrating, auditing, or relying on agent outputs for consequential decisions.

▸ At a glance

Quote accuracy: 90% (within NTE, i.e. not-to-exceed, bands; rolling 30-day window).
NTE variance: ±9% (quoted vs. settled, across dispatched work orders).
Human override: < 3% (targeted ceiling; every override is logged).
Escalation gate: confidence < 70% → human (low-confidence outputs leave autonomous handling and enter a human queue).

▸ Card sections

Intended use

Dispatch work-order intake, NTE quoting, contractor routing, talent application scoring, and security scan initiation across US commercial field-service accounts. Operated by the platform under human ownership.

Out-of-scope

Defense contracting, export-controlled technologies (ITAR), high-risk medical decisions, financial instrument advice, child-directed services, or any use case on the EU AI Act's list of prohibited practices.

Training & tuning data

Base models are proprietary frontier LLMs (Anthropic, OpenAI, Google) accessed via API. SteadyWrk does not train foundation models. Contextual data for routing comes from our own operational history plus public reference corpora.

Eval methodology

Rolling 30-day window across 8 evals (see /evals). Each eval reports a Wilson score confidence interval and a 95% bootstrap interval. Ground truth is contractor outcome and payment disposition. Drift is monitored weekly.
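
As a rough illustration of the two interval types named above, the sketch below computes a Wilson score interval and a percentile bootstrap interval for a pass rate. It is a minimal sketch, not the platform's eval harness: the function names, the seeded PRNG, the resample count, and the sample of 180 passes out of 200 are all assumptions made for the example.

```typescript
// Illustrative sketch only. Names, seed, and sample counts are hypothetical.

interface Interval {
  lower: number;
  upper: number;
}

// Wilson score interval for a binomial proportion (z = 1.96 for roughly 95%).
function wilsonInterval(successes: number, trials: number, z = 1.96): Interval {
  if (trials <= 0) throw new Error("trials must be positive");
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { lower: Math.max(0, center - half), upper: Math.min(1, center + half) };
}

// Small seeded PRNG (mulberry32-style) so the resampling is reproducible.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap: resample outcomes with replacement, recompute the
// pass rate each time, and take the 2.5th and 97.5th percentiles.
function bootstrapInterval(outcomes: boolean[], resamples = 2000, seed = 42): Interval {
  if (outcomes.length === 0) throw new Error("outcomes must be non-empty");
  const rand = seededRandom(seed);
  const n = outcomes.length;
  const rates: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let hits = 0;
    for (let i = 0; i < n; i++) {
      if (outcomes[Math.floor(rand() * n)]) hits++;
    }
    rates.push(hits / n);
  }
  rates.sort((a, b) => a - b);
  return {
    lower: rates[Math.floor(0.025 * (resamples - 1))],
    upper: rates[Math.ceil(0.975 * (resamples - 1))],
  };
}

// Hypothetical eval result: 180 of 200 quotes settled within their NTE band.
const sampleOutcomes: boolean[] = Array.from({ length: 200 }, (_, i) => i < 180);
const sampleWilson = wilsonInterval(180, 200);
const sampleBootstrap = bootstrapInterval(sampleOutcomes);
console.log(sampleWilson, sampleBootstrap);
```

Reporting both intervals gives a cross-check: the Wilson interval is analytic and behaves well near 0% or 100%, while the bootstrap makes no distributional assumption beyond the sample itself.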

Capability limits

Quote accuracy is approximately 90% within NTE bands. Routing confidence correlates with data completeness. The agent defers to a human operator when confidence is below 70% (QANAT rule 4).
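
The deferral rule above can be sketched as a single gate function. Only the 70% threshold comes from this card; the type names, the `gate` function, and the choice to send malformed confidence values to the human queue are assumptions for illustration.

```typescript
// Illustrative sketch only. Types and names are hypothetical; the 0.7
// threshold mirrors the "confidence < 70%" rule stated in the card.

const CONFIDENCE_THRESHOLD = 0.7;

type Disposition = "autonomous" | "human_queue";

interface AgentOutput {
  workOrderId: string;
  confidence: number; // expected range 0..1
}

function gate(output: AgentOutput, threshold = CONFIDENCE_THRESHOLD): Disposition {
  const c = output.confidence;
  // Assumption: a missing or out-of-range score is treated as low confidence,
  // so the output fails toward human review rather than autonomy.
  if (!Number.isFinite(c) || c < 0 || c > 1) return "human_queue";
  return c < threshold ? "human_queue" : "autonomous";
}

console.log(gate({ workOrderId: "wo-example", confidence: 0.64 }));
```

Note the strict comparison: an output at exactly 70% stays autonomous, which matches the "< 70%" wording.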

Safety mitigations

Claims-based authorization. Zod schema validation on all inputs. Upstash rate limiting. Fingerprint tracking for anomalous request patterns. Kill switch via a feature flag at the edge.
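
To make two of these mitigations concrete, here is a minimal sketch of a request check that consults a kill-switch flag before evaluating claims. The claim shape, scope string, flag name, and reason codes are hypothetical, and the sketch deliberately leaves out schema validation, rate limiting, and fingerprinting.

```typescript
// Illustrative sketch only. Claim fields, scope names, and the flag shape
// are assumptions, not the platform's actual authorization model.

interface Claims {
  subject: string;
  scopes: string[];
}

interface Flags {
  agentEnabled: boolean; // hypothetical kill-switch flag
}

type Decision = { allowed: true } | { allowed: false; reason: string };

function authorize(claims: Claims | null, requiredScope: string, flags: Flags): Decision {
  // The kill switch is checked first so it overrides every other outcome.
  if (!flags.agentEnabled) return { allowed: false, reason: "kill_switch" };
  if (claims === null) return { allowed: false, reason: "missing_claims" };
  if (!claims.scopes.includes(requiredScope)) {
    return { allowed: false, reason: "missing_scope" };
  }
  return { allowed: true };
}

const exampleDecision = authorize(
  { subject: "operator-1", scopes: ["dispatch:quote"] },
  "dispatch:quote",
  { agentEnabled: true },
);
console.log(exampleDecision);
```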

Failure modes

Known: overlapping NTE bands produce ambiguous routing; new accounts without historical data route conservatively; PDF parse errors on non-standard work-order formats fall back to manual review.
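
One way to surface the first failure mode before it reaches routing is to scan the configured bands for overlaps. The card does not specify how NTE bands are represented, so the sketch below assumes each band is a half-open dollar range with a label; that shape and the function name are hypothetical.

```typescript
// Illustrative sketch only. The band shape (label, min inclusive, max
// exclusive) is an assumption; the card does not define it.

interface NteBand {
  label: string;
  min: number; // inclusive lower bound
  max: number; // exclusive upper bound
}

// Returns every pair of bands whose ranges intersect.
function findOverlaps(bands: NteBand[]): Array<[string, string]> {
  const sorted = [...bands].sort((a, b) => a.min - b.min);
  const overlaps: Array<[string, string]> = [];
  for (let i = 0; i < sorted.length; i++) {
    // After sorting by min, band j overlaps band i exactly when it starts
    // before band i ends; once that fails, no later band can overlap i.
    for (let j = i + 1; j < sorted.length && sorted[j].min < sorted[i].max; j++) {
      overlaps.push([sorted[i].label, sorted[j].label]);
    }
  }
  return overlaps;
}

const exampleBands: NteBand[] = [
  { label: "A", min: 0, max: 500 },
  { label: "B", min: 400, max: 1000 },
  { label: "C", min: 1000, max: 2000 },
];
console.log(findOverlaps(exampleBands));
```

A non-empty result could be used to send affected work orders to manual review instead of letting ambiguous routing proceed.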

Escalation & human oversight

Confidence < 70% → human queue. Targeted override rate is below 3%. Every decision emits an audit event to an append-only log with 7-year retention (QANAT rule 5).
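
The audit requirement can be sketched as an in-memory log that exposes only an append operation and stamps each event with its retention deadline. This is a sketch under stated assumptions: the class, field names, and in-memory storage are hypothetical, and a production log would rely on durable, write-once storage rather than an array.

```typescript
// Illustrative sketch only. Names and the in-memory store are assumptions;
// the 7-year retention period comes from the card.

const RETENTION_YEARS = 7;

interface AuditEvent {
  readonly id: number;
  readonly decision: string;
  readonly recordedAt: Date;
  readonly retainUntil: Date;
}

class AppendOnlyAuditLog {
  private readonly events: AuditEvent[] = [];

  // The only write path: there is no update or delete method.
  append(decision: string, now: Date = new Date()): AuditEvent {
    const retainUntil = new Date(now.getTime());
    retainUntil.setUTCFullYear(retainUntil.getUTCFullYear() + RETENTION_YEARS);
    // Shallow freeze: prevents reassigning fields on the stored event.
    const event: AuditEvent = Object.freeze({
      id: this.events.length + 1,
      decision,
      recordedAt: new Date(now.getTime()),
      retainUntil,
    });
    this.events.push(event);
    return event;
  }

  // Returns a copy so callers cannot reorder or truncate the log.
  all(): readonly AuditEvent[] {
    return [...this.events];
  }
}

const exampleLog = new AppendOnlyAuditLog();
exampleLog.append("route:wo-example", new Date("2024-03-01T00:00:00Z"));
console.log(exampleLog.all().length);
```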

▸ FAQ

Questions integrators ask.

Who operates the agent, and who owns its decisions?

The agent is operated by the platform under human ownership. It routes work, quotes within bounds, and scores applications; consequential calls remain accountable to a human operator, and every decision emits an audit event.

When does the agent stop and ask a human?

When confidence drops below 70%, the output leaves autonomous handling and enters a human review queue. New accounts without history route conservatively, and non-standard documents fall back to manual review.

What models power it, and do you train your own?

It runs on proprietary frontier LLMs from Anthropic, OpenAI, and Google accessed via API. STEADYWRK does not train foundation models; routing context comes from our own operational history plus public reference corpora.

How are the published numbers verified?

Evals run on a rolling 30-day window. Each reports a Wilson score confidence interval and a bootstrap 95% interval, with contractor outcome and payment disposition as ground truth. Drift is monitored weekly and surfaced on the public evals page.

What are the known failure modes?

Overlapping NTE bands can produce ambiguous routing, accounts without historical data route conservatively, and non-standard work-order formats can trip PDF parsing. Each of these degrades to manual review rather than a wrong autonomous action.

What standard does this card follow?

It is published under the NIST AI Risk Management Framework and the system-card conventions established by OpenAI and Anthropic, covering intended use, limits, mitigations, failure modes, and human oversight.
