▸ Ops Agent System Card · v1.0
The agent behind every dispatch decision.
Published under the NIST AI Risk Management Framework (AI RMF) and the system-card conventions used by OpenAI and Anthropic. Read this before integrating or auditing the agent, or relying on its outputs for consequential decisions.
▸ At a glance
- Quote accuracy: 90%. Within NTE bands, rolling 30-day window.
- NTE variance: ±9%. Quoted vs. settled, across dispatched work orders.
- Human override: 3%. Targeted ceiling; every override is logged.
- Escalation gate: < 70% confidence → human. Low-confidence outputs leave autonomy and enter a human queue.
▸ Card sections
Intended use
Dispatch work-order intake, not-to-exceed (NTE) quoting, contractor routing, talent application scoring, and security scan initiation across US commercial field-service accounts. Operated by the platform under human ownership.
Out-of-scope
Defense contracting, export-controlled technologies (ITAR), high-risk medical decisions, financial instrument advice, child-directed services, and any use case on the EU AI Act's list of prohibited practices.
Training & tuning data
Base models are proprietary frontier LLMs (Anthropic, OpenAI, Google) accessed via API. SteadyWrk does not train foundation models. Contextual data for routing comes from our own operational history plus public reference corpora.
Eval methodology
Rolling 30-day window across 8 evals (see /evals). Each eval reports a Wilson score confidence interval and a 95% bootstrap interval. Ground truth is contractor outcome and payment disposition. Drift is monitored weekly.
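To make the two interval types concrete, here is a minimal sketch of a Wilson score interval and a percentile bootstrap interval for a pass/fail eval. It is not the platform's eval code: the function names, the 0/1 outcome encoding, the resample count, and the choice of the percentile bootstrap variant are all assumptions made for illustration.

```typescript
// Illustrative sketch only; names and defaults are assumptions, not the platform's API.

/** Wilson score interval for a binomial proportion (z = 1.96 gives roughly 95%). */
function wilsonInterval(successes: number, n: number, z = 1.96): [number, number] {
  if (n <= 0) throw new Error("n must be positive");
  const p = successes / n;
  const z2 = z * z;
  const denom = 1 + z2 / n;
  const center = (p + z2 / (2 * n)) / denom;
  const half = (z / denom) * Math.sqrt((p * (1 - p)) / n + z2 / (4 * n * n));
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

/** Percentile bootstrap interval for the mean of 0/1 outcomes. */
function bootstrapInterval(
  outcomes: number[],
  resamples = 2000,
  alpha = 0.05,
  rng: () => number = Math.random,
): [number, number] {
  const n = outcomes.length;
  if (n === 0) throw new Error("outcomes must be non-empty");
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    // Resample n outcomes with replacement and record the resampled mean.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += outcomes[Math.floor(rng() * n)];
    means.push(sum / n);
  }
  means.sort((a, b) => a - b);
  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hi = means[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
  return [lo, hi];
}
```

Reporting both gives a cross-check: the Wilson interval is analytic and behaves well near 0% or 100%, while the bootstrap makes no distributional assumption beyond resampling the observed outcomes.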
Capability limits
Quote accuracy ~90% within NTE bands. Routing confidence correlates with data completeness. Agent defers to human operator when confidence < 70% (QANAT rule 4).
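The card does not spell out how quote accuracy or NTE variance is calculated. The sketch below shows one plausible reading, under loudly stated assumptions: a quote counts as accurate when the settled amount lands inside the work order's NTE band, and variance is the mean absolute difference between settled and quoted amounts relative to the quote. The `SettledOrder` shape and both function names are hypothetical.

```typescript
// Hypothetical record shape; the field names are assumptions for illustration.
interface SettledOrder {
  quoted: number;  // quoted amount
  settled: number; // amount actually settled
  nteLow: number;  // lower edge of the NTE band
  nteHigh: number; // upper edge of the NTE band
}

/** Share of orders whose settled amount falls inside the NTE band (assumed definition). */
function quoteAccuracy(orders: SettledOrder[]): number {
  if (orders.length === 0) throw new Error("no orders in window");
  const hits = orders.filter((o) => o.settled >= o.nteLow && o.settled <= o.nteHigh).length;
  return hits / orders.length;
}

/** Mean absolute quoted-vs-settled difference, as a fraction of the quote (assumed definition). */
function meanAbsoluteVariance(orders: SettledOrder[]): number {
  if (orders.length === 0) throw new Error("no orders in window");
  let total = 0;
  for (const o of orders) {
    if (o.quoted <= 0) throw new Error("quoted amount must be positive");
    total += Math.abs(o.settled - o.quoted) / o.quoted;
  }
  return total / orders.length;
}
```

Under these definitions the two figures answer different questions: accuracy is a hit rate against the band, while variance measures how far settlements drift from the quote even when they stay inside it.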
Safety mitigations
Claims-based authorization. Zod schema validation on all inputs. Upstash rate-limits. Fingerprint tracking for anomalous request patterns. Kill switch via feature flag at the edge.
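As a rough illustration of how three of these mitigations might be layered in front of the agent, the sketch below chains a kill-switch check, input validation, and rate limiting. It is dependency-free on purpose: the hand-written type guard stands in for a Zod schema, and the in-memory fixed-window limiter stands in for Upstash. The request shape, the ordering of the checks, and every name here are assumptions, not the platform's implementation.

```typescript
// Illustrative stand-ins only; the real stack uses Zod and Upstash per the card.
type GuardResult =
  | { ok: true }
  | { ok: false; reason: "disabled" | "invalid" | "rate_limited" };

interface IntakeRequest {
  accountId: string;
  description: string;
}

/** Hand-written shape check standing in for schema validation. */
function isIntakeRequest(x: unknown): x is IntakeRequest {
  if (typeof x !== "object" || x === null) return false;
  const r = x as Record<string, unknown>;
  return (
    typeof r.accountId === "string" &&
    r.accountId.length > 0 &&
    typeof r.description === "string"
  );
}

/** In-memory fixed-window counter standing in for a hosted rate limiter. */
class FixedWindowLimiter {
  private limit: number;
  private windowMs: number;
  private counts: Record<string, { windowStart: number; count: number }> = {};

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(key: string, now: number): boolean {
    const entry = this.counts[key];
    if (!entry || now - entry.windowStart >= this.windowMs) {
      this.counts[key] = { windowStart: now, count: 1 }; // start a fresh window
      return true;
    }
    if (entry.count >= this.limit) return false;
    entry.count += 1;
    return true;
  }
}

/** Kill switch first, then validation, then rate limiting. */
function guard(
  input: unknown,
  agentEnabled: boolean,
  limiter: FixedWindowLimiter,
  now: number,
): GuardResult {
  if (!agentEnabled) return { ok: false, reason: "disabled" };
  if (!isIntakeRequest(input)) return { ok: false, reason: "invalid" };
  if (!limiter.allow(input.accountId, now)) return { ok: false, reason: "rate_limited" };
  return { ok: true };
}
```

Checking the kill switch before anything else means a disabled agent rejects traffic without touching validation or limiter state, which is one reasonable ordering among several.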
Failure modes
Known: overlapping NTE bands produce ambiguous routing; new accounts without historical data route conservatively; PDF parse errors on non-standard work-order formats fall back to manual review.
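The first failure mode can be checked for ahead of time. A minimal sketch, assuming each NTE band is a closed numeric range with a label and that bands sharing an endpoint count as overlapping; the `NteBand` shape and the function name are hypothetical.

```typescript
// Hypothetical band shape; labels and bounds are illustrative.
interface NteBand {
  label: string;
  low: number;
  high: number;
}

/** Returns every pair of bands whose closed ranges intersect. */
function findOverlappingBands(bands: NteBand[]): Array<[string, string]> {
  const sorted = bands.slice().sort((a, b) => a.low - b.low);
  const overlaps: Array<[string, string]> = [];
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      // Bands are sorted by low edge, so once one starts past band i's high edge, none after it can overlap.
      if (sorted[j].low > sorted[i].high) break;
      overlaps.push([sorted[i].label, sorted[j].label]);
    }
  }
  return overlaps;
}
```

Running a check like this when bands are configured would surface ambiguous ranges before they reach routing.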
Escalation & human oversight
Confidence < 70% → human queue. Override rate targeted below 3%. Every decision emits an audit event to an append-only log with 7-year retention (QANAT rule 5).
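The gate and audit rule can be sketched together. This is a minimal illustration, assuming confidence is a number in [0, 1] and that exactly 70% stays autonomous, since the card says "< 70%". The event fields, the in-memory log class, and the function name are hypothetical, and retention is not modeled.

```typescript
// Illustrative sketch; event fields and names are assumptions.
type Disposition = "autonomous" | "human_queue";

interface AuditEvent {
  decisionId: string;
  confidence: number;
  disposition: Disposition;
  at: string; // ISO 8601 timestamp
}

/** In-memory log exposing append and read only; there is no update or delete path. */
class AppendOnlyLog {
  private events: AuditEvent[] = [];

  append(event: AuditEvent): void {
    this.events.push(Object.freeze({ ...event }));
  }

  all(): readonly AuditEvent[] {
    return this.events.slice(); // return a copy so callers cannot mutate the log
  }
}

const ESCALATION_THRESHOLD = 0.7;

/** Routes a decision and records an audit event for it, whichever way it goes. */
function gateDecision(
  decisionId: string,
  confidence: number,
  log: AppendOnlyLog,
  now: Date,
): Disposition {
  if (!(confidence >= 0 && confidence <= 1)) {
    throw new Error("confidence must be in [0, 1]");
  }
  const disposition: Disposition =
    confidence < ESCALATION_THRESHOLD ? "human_queue" : "autonomous";
  log.append({ decisionId, confidence, disposition, at: now.toISOString() });
  return disposition;
}
```

Writing the audit event inside the gate, rather than in each caller, is one way to ensure that both autonomous and escalated decisions are logged.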
▸ FAQ
Questions integrators ask.
Who operates the agent, and who owns its decisions?
The agent is operated by the platform under human ownership. It routes work, quotes within bounds, and scores applications; consequential calls remain accountable to a human operator, and every decision emits an audit event.
When does the agent stop and ask a human?
When confidence drops below 70%, the output leaves autonomous handling and enters a human review queue. New accounts without history route conservatively, and non-standard documents fall back to manual review.
What models power it, and do you train your own?
It runs on proprietary frontier LLMs from Anthropic, OpenAI, and Google accessed via API. STEADYWRK does not train foundation models; routing context comes from our own operational history plus public reference corpora.
How are the published numbers verified?
Evals run on a rolling 30-day window. Each reports a Wilson score confidence interval and a bootstrap 95% interval, with contractor outcome and payment disposition as ground truth. Drift is monitored weekly and surfaced on the public evals page.
What are the known failure modes?
Overlapping NTE bands can produce ambiguous routing, accounts without historical data route conservatively, and non-standard work-order formats can trip PDF parsing. Each of these degrades to conservative routing or manual review rather than a wrong autonomous action.
What standard does this card follow?
It is published under the NIST AI Risk Management Framework and the system-card conventions established by OpenAI and Anthropic, covering intended use, limits, mitigations, failure modes, and human oversight.