
Autonomous Orchestration Policy — Decision-Rule Research for Agent-to-Agent Behavior Switching (2026)


What this document is. A research synthesis of when and how an orchestrator should autonomously change behavior on the user’s behalf in an agent-to-agent system: model tier, planning mode, research invocation, sub-agent dispatch, retrieval, compaction, escalation. It catalogs the external state of the art per decision axis with citations, then maps each axis onto Vox’s existing surface and isolates the decisions that can reasonably be automated today from those that cannot.

What this document is not. Not a routing mechanics audit (see model-orchestration-ssot-audit-2026.md). Not a non-routing orchestrator surface critique (see orchestrator-companion-audit-findings-2026.md). Not a research deck on enterprise failure modes (see nextgen-orchestrator-research-2026.md). Not an architecture decision (see ADRs in docs/src/adr/). It assumes those have been read.

Reading order if you are touching the orchestrator code.

  1. model-orchestration-ssot-audit-2026.md — what the routing surface actually does today
  2. docs/src/adr/005-socrates-anti-hallucination-ssot.md — confidence gating contract
  3. docs/src/adr/030-state-machine-ssot.md — mode/state machine
  4. This document — the cross-cutting decision policy that ties them together
  5. nextgen-orchestrator-research-2026.md — where the field is going and what we’re missing

An orchestrator that automates behavior changes on the user’s behalf is making a small, finite set of decisions on every turn. Each decision has inputs (signals it can observe), an automatable core (rules that can run without asking the user), and a deferral edge (where it must hand off to HITL). The surface is:

| # | Decision | Inputs available now | Frequency |
|---|---|---|---|
| D1 | Model tier: which model handles this turn (Haiku / Sonnet / Opus / local / custom) | task category, predicted complexity, prior model_scoreboard outcomes, budget, sensitivity tag, prompt-cache prefix overlap | every LLM call |
| D2 | Planning mode: act-immediately (ReAct), plan-then-execute, or extended-thinking | task length-of-horizon estimate, irreversibility, prior failures on similar tasks | per task |
| D3 | Research invocation (Socrates): trust the answer, sample more, retrieve, or escalate | grounding score, semantic entropy / self-consistency, abstention signals, source-citation quality | per claim or per turn |
| D4 | Sub-agent dispatch: do inline, spawn one specialist, or fan out parallel | task decomposability, independent subtasks count, latency budget | per task |
| D5 | Mode switch: autonomous ↔ interactive ↔ approval-required | risk dimensions of next action (irreversibility, blast radius, compliance), cumulative confidence, novelty | per action |
| D6 | Continue ↔ replan ↔ abort: detect doom-loops and unproductive trajectories | repeated tool args, n-gram overlap on actions, semantic drift score, no-progress counter, iteration budget | every tool call |
| D7 | Context strategy: continue, compact, snip, branch | token pressure %, salience of older turns, task-boundary signal | every N turns |
| D8 | Privacy routing: which providers are eligible for this prompt | PII/secret detection, sensitivity tag, ZDR/on-prem requirement, jurisdiction | every LLM call |
| D9 | HITL escalation: ask user, queue for review, or proceed | risk × confidence matrix, anomaly detection, time-budget exhaustion | event-driven |
| D10 | Adaptation: update the router itself | reward signal (task success), preference feedback, drift detection | background |

The next nine sections survey what the field has learned about each axis. The last two sections fold the findings back into Vox: what is automatable today on Vox’s existing telemetry, what is not, and the contract skeletons for the gaps.


Cascade routing. Send the query to a cheap model first; only escalate to a stronger model when the cheap model’s response fails a reliability check. The reliability check is a scoring function on (query, response) → [0, 1]; the cascade exits at the first index where the score crosses a threshold.

“An incoming query first goes to a small LLM, and if the small model’s confidence score is below a chosen threshold, the cascade forwards the query to a larger and more powerful LLM. When the confidence of a small model is high, the system can safely stop the inference, saving the most powerful model’s inference cost while maintaining its accuracy.” — Bi-directional Model Cascading with Proxy Confidence 1

FrugalGPT showed a cascade can match GPT-4 quality at up to 98% lower cost 2; C3PO and GATEKEEPER are 2025 refinements that learn the threshold from a held-out set 3 4.
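The cascade exit rule described above can be sketched in a few lines. This is a minimal illustration, not Vox code: `Tier`, `run_cascade`, and the `score_for` callback (standing in for "call the model, then score the response") are hypothetical names introduced here.

```rust
/// Hypothetical tier labels; illustrative only, not Vox types.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Tier {
    Cheap,
    Mid,
    Strong,
}

/// Walk the tiers from cheapest to strongest and stop at the first one whose
/// response scores at or above `threshold`. `score_for` stands in for
/// "call the model, then score (query, response) in [0, 1]".
/// The last tier is always accepted because there is nothing left to escalate to.
pub fn run_cascade<F>(tiers: &[Tier], threshold: f64, mut score_for: F) -> Option<(Tier, f64)>
where
    F: FnMut(Tier) -> f64,
{
    // An empty tier list has no answer to give.
    let last = tiers.len().checked_sub(1)?;
    for (i, &tier) in tiers.iter().enumerate() {
        let score = score_for(tier);
        if score >= threshold || i == last {
            return Some((tier, score));
        }
    }
    None
}
```

With a threshold of 0.65 and scores of 0.4 / 0.7 / 0.9 for the three tiers, the cascade stops at `Tier::Mid` and never pays for the strong model.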

Predictive routing. A small classifier looks at the prompt and predicts which model is best, in a single shot. RouteLLM (Berkeley/LMSYS) showed routing-by-classifier hits 95% of GPT-4 quality with 14–26% strong-model calls — a 75–85% cost reduction 5. Hybrid LLM (ICLR 2024, Microsoft) used a DeBERTa router and routed 22–40% of queries to a smaller model with <1% quality drop 6. Martian and Not Diamond commercialized this pattern; Not Diamond reports 7–15 point accuracy lift when routing across multiple frontier models with consensus aggregation 7.

Online (bandit) routing. Instead of a frozen classifier, a contextual bandit updates the policy from per-call outcomes. BaRP (2025) frames it as a contextual bandit conditioned on a user preference vector (accuracy / cost trade-off) and adapts online from binary feedback 8. PILOT (2025) adds explicit budget constraints and preference priors 9. RouterArena (2025) is the standard benchmark 10.

Independent of academic routers, the production consensus in 2026 for Anthropic-only stacks is the three-tier orchestrator:

“Haiku 4.5 serves as the router, classifying incoming requests and handling simple ones directly. Sonnet 4.6 processes the bulk of medium-complexity tasks — code generation, document analysis, and data extraction. Opus 4.6 handles the 10 to 15% of requests that require deep reasoning or complex multi-step problem solving. … This approach can reduce total API costs by 60 to 70% compared to using Opus for everything — without any meaningful quality compromise.” — Augment Code, AI Model Routing Guide 11

Vox already implements the mechanism of tier-based routing (crates/vox-orchestrator/src/models/registry.rs::best_for(), scoring.rs); what is missing is the complexity classifier — the function prompt → predicted_tier that the routing decision depends on. Today this is a hand-tuned StrengthTag × TaskCategory table. The literature suggests three replacement options, ranked by maturity:

  1. Rule-based prefilter on regex/keyword features — simplest, deployable today on existing telemetry. ~70% of public router work uses this as a layer-1 filter even when ML-based routing is layered on top 12.
  2. Small-classifier router (BERT/DeBERTa) — needs labeled data; the labels can come from model_scoreboard outcomes (which already exist in the v59 schema). One offline training pass per quarter.
  3. Contextual bandit — needs the binary “did this task succeed” signal, which Vox has via llm_interactions.context_utilization_pct and socrates outcomes. Strongest long-term, weakest day-one.

fn select_model_tier(task: &Task) -> ModelTier {
    // Privacy is a hard filter and runs first; see Part 8.
    if task.has_pii || task.sensitivity == Critical {
        return restrict_to(privacy_eligible_providers(task));
    }
    if task.budget_remaining < cheapest_strong_model_cost {
        return cheapest_provider_meeting_quality_floor();
    }
    let predicted = router.predict(task); // BERT classifier or rules
    if predicted.tier == Cheap {
        return Cheap; // direct routing path
    }
    if cascade_enabled {
        return Cascade { start: Cheap, escalate_threshold: 0.65 }; // FrugalGPT path
    }
    predicted.tier
}

The escalation threshold (0.65) is an empirical hyperparameter; see §10.4 for how to tune it from model_scoreboard.


Part 3 — D2: Planning Mode (Plan-Execute vs ReAct vs Extended Thinking)

| Pattern | When | Cost | Inspectable? | Source |
|---|---|---|---|---|
| ReAct (think → act → observe) | exploratory, real-time, you can’t list the tools up front | low–medium | partial | 13 14 |
| Plan-and-Execute | you can list the tools up front; cost-of-wrong-turn high | low (planner once + cheap executor) | yes: plan is a reviewable artifact | 15 16 |
| Tree-of-Thoughts | combinatorial / search problem with intermediate scorer | 10–100× CoT | yes (search tree) | 17 18 |
| Reflexion (verbal RL) | repeated trials on same task class; learn from failure | medium per trial | partial | 19 20 |
| Extended thinking (Anthropic adaptive) | one expensive decision step (tool pick, refactor under constraint) | medium–high | thinking trace | 21 22 |

Numbers worth memorizing.

  • Plan-and-Execute hits 92% task completion vs ReAct’s 85% on multi-step workflows 23.
  • ToT spends 10–100× the tokens of CoT for the same answer when there is no useful intermediate scorer 18. This is the failure mode of “always use ToT for hard tasks.”
  • Adaptive computation papers show 20–60% reduction in thinking tokens at no quality loss when the model picks its own budget 24 25.

The literature converges on a question rather than a threshold: can you list the tools that will be called before execution starts?

  • Yes → Plan-and-Execute. The plan is the artifact you can review, parallelize independent steps, and re-plan on failure with a structured score-and-replan loop (see LangGraph PEV 26).
  • No, but the task is one expensive decision (pick the right tool with the right parameters; refactor under a strict constraint) → Extended thinking. Anthropic’s guidance is do not default-enable; pick the 2–3 steps where a wrong choice cascades and turn it on there 21.
  • No, exploratory → ReAct.
  • No, but you are repeating the same task class (debugging the same kind of bug, generating tests for the same kind of function) → Reflexion-on-top-of-ReAct so the agent accumulates verbal lessons across trials 19.

crates/vox-orchestrator/src/mcp_tools/chat_tools/plan.rs and plan_loop.rs::maybe_refine_plan() already implement the mechanism; the trigger (“when do I plan vs just act?”) is a hardcoded heuristic today. The proposed rule:

fn pick_planning_mode(task: &Task) -> Mode {
    if task.estimated_steps >= 4 && task.tools_predictable {
        return PlanThenExecute; // 92% vs 85% task completion
    }
    if task.has_irreversible_side_effects {
        return PlanThenExecute; // plan is a reviewable artifact
    }
    if task.requires_constraint_satisfaction {
        return ExtendedThinking { budget: 8192 };
    }
    if task.is_repeated_class && reflexion_memory.exists() {
        return ReActWithReflexion;
    }
    ReAct // default
}

task.tools_predictable is itself a small classifier or a structural check (“does the task description name specific tools?”).


Part 4 — D3: Research / Socrates Invocation


This is the hardest of the ten decisions, because the orchestrator has to ask: “do I trust my own answer enough to ship it?” The literature has produced four distinct signals for this. The right policy is to fuse them, because no single one is reliable.

Signal A: Token logprobs. The model’s own per-token confidence. Simple, cheap, and produced for free by every supporting API. The catch: token probability conflates factual confidence with lexical uncertainty (“Paris” vs “the capital of France” can both be right but one has lower token prob); naive thresholds miscalibrate [^verbalized-vs-logprob]. LogU / LogTokU (2025) extracts uncertainty from logits without multi-sampling, fixing the “loss of evidence strength” failure mode of vanilla probability methods 27.

Signal B: Self-consistency. Sample N responses at high temperature; measure their mutual agreement. If the model “knows” the answer, samples align; if it’s confabulating, they diverge. SelfCheckGPT is the reference implementation 28. Limit: if the model is confidently wrong, all samples agree on the same wrong answer — high consistency but zero correctness 29.

Signal C: Semantic entropy. Cluster N samples by meaning (entailment-equivalent), then compute entropy over clusters rather than over surface strings. Farquhar et al. (Nature 2024) showed this beats token-level entropy for confabulation detection. Semantic Entropy Probes (NeurIPS 2024) approximate it from a single generation’s hidden states for a 5–10× cost reduction 30 31. Semantic Energy (2025) beats Semantic Entropy by +13% AUROC 32.
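Once samples have been clustered by meaning, the entropy step itself is small. The sketch below is a count-based estimate only: it assumes the entailment clustering has already been done elsewhere and that each sample arrives with a cluster id. `semantic_entropy` is a hypothetical helper name, not an existing Vox function.

```rust
use std::collections::HashMap;

/// Shannon entropy (in nats) over meaning clusters.
/// `cluster_ids[i]` is the cluster that sample i was assigned to by an
/// external entailment check, which is assumed here and not implemented.
pub fn semantic_entropy(cluster_ids: &[u32]) -> f64 {
    if cluster_ids.is_empty() {
        return 0.0;
    }
    // Count how many samples landed in each cluster.
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for &id in cluster_ids {
        *counts.entry(id).or_insert(0) += 1;
    }
    let n = cluster_ids.len() as f64;
    // H = -sum(p * ln p) over cluster probabilities.
    counts
        .values()
        .map(|&c| {
            let p = c as f64 / n;
            -p * p.ln()
        })
        .sum()
}
```

Samples that all fall in one cluster give an entropy of zero; samples spread evenly over k clusters give ln(k), the maximum for k clusters.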

Signal D: Verbalized confidence. Just ask the model “how sure are you, 0–1?” Empirical results conflict: some studies find verbalized scores better calibrated than logprobs; others find ECE > 0.377 (catastrophic miscalibration) 33. The gap is almost entirely prompt design — there are prompt formats that produce well-calibrated verbalized scores and prompt formats that don’t.

The 2026 production consensus is to fuse 2–3 signals rather than rely on one:

“Initial confidence thresholds are set conservatively — 0.85 for irreversible actions and 0.70 for reversible actions, and after 30 days of production data, thresholds are recalibrated based on Expected Calibration Error (ECE) and adjusted to achieve a target false-positive rate matching reviewer capacity.” — Human-in-the-Loop AI Agents 34

Translated to orchestrator terms: compute a composite score per claim that combines logprob entropy, SE-probe estimate, and (for high-risk claims) a fresh self-consistency check. Define two thresholds, one for “ship the answer” and one for “must escalate,” with a middle band for “invoke research.”
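A minimal sketch of such a composite, assuming each signal has already been normalized to a confidence in [0, 1]. The weights (0.4 / 0.4 / 0.2), the `Band` enum, and the function names are placeholders introduced for illustration; real values would come from calibration.

```rust
/// Weighted mean over whichever signals are present. Inputs are confidences
/// in [0, 1] (1.0 = fully confident); the normalization is assumed, not shown.
/// `self_consistency` is optional because it is only sampled for high-risk claims.
pub fn fuse(logprob_confidence: f64, sep_confidence: f64, self_consistency: Option<f64>) -> f64 {
    // Placeholder weights, to be replaced by calibrated values.
    let mut weighted = 0.4 * logprob_confidence + 0.4 * sep_confidence;
    let mut total_weight = 0.8;
    if let Some(sc) = self_consistency {
        weighted += 0.2 * sc;
        total_weight += 0.2;
    }
    (weighted / total_weight).clamp(0.0, 1.0)
}

/// The three bands separated by the two thresholds.
#[derive(Debug, PartialEq)]
pub enum Band {
    Ship,
    Research,
    Escalate,
}

/// `ship` is the upper threshold, `escalate` the lower one; scores in between
/// fall into the "invoke research" band.
pub fn band(score: f64, ship: f64, escalate: f64) -> Band {
    if score >= ship {
        Band::Ship
    } else if score < escalate {
        Band::Escalate
    } else {
        Band::Research
    }
}
```

Renormalizing by the total weight keeps the score comparable whether or not the optional self-consistency check was run.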

Once the orchestrator decides “I don’t trust this answer,” it has three follow-on options:

  1. Re-sample at higher temperature with self-consistency check (cheap, ~3× cost).
  2. Retrieve — adaptive RAG. Self-RAG learns to emit a “retrieve” reflection token when confidence is low 35; the threshold is tunable at inference time for accuracy / cost trade-off.
  3. Spawn a Socrates research agent that asks structured follow-up questions and gathers evidence (Princeton SocraticAI’s Socrates/Theaetetus/Plato pattern) 36. Most expensive; reserve for high-stakes / low-confidence intersections.

The escalation order matters: cheap-then-expensive saves cost on the long tail of medium-confidence claims.

ADR-005 already names this surface: vox-socrates-policy, RiskDecision::Abstain, ConfidencePolicy, RiskBand. What is missing is the fusion function — today the decision is a single heuristic gate, not a composite of logprobs + entropy + self-consistency. The composite is implementable today on the llm_interactions schema (v59) once two columns are added: per-call entropy estimate (LogU or SEP) and per-call sample-disagreement score for high-stakes calls.

fn should_invoke_research(claim: &Claim) -> ResearchAction {
    let score = fuse(claim.logprob_entropy, claim.sep_estimate, claim.self_consistency);
    if score >= ship_threshold { return Ship; }
    if score >= research_threshold { return ReSample { temp: 0.7, n: 5 }; }
    if score >= retrieve_threshold { return Retrieve(adaptive_rag); }
    if score >= socrates_threshold { return SpawnSocrates(claim); }
    Escalate // user must decide
}

The thresholds are calibration parameters tuned per-task-category from model_scoreboard history.


Part 5 — D4: Sub-agent Dispatch (When to Spawn vs Inline)


Six concrete patterns in production use:

| Pattern | Where | Trigger to spawn |
|---|---|---|
| Supervisor (clear control flow, one router node) | LangGraph, OpenAI Agents SDK | description-driven: subagent’s description field matches subtask 37 38 |
| Swarm (peer-to-peer handoffs) | LangGraph swarm, OpenAI Swarm | transfer_to_X tool call from current agent 39 |
| Hierarchical / manager-worker | CrewAI hierarchical | allow_delegation=True + complexity decomposable into specialist domains 40 |
| Selector group chat (LLM picks next speaker) | AutoGen SelectorGroupChat | model picks based on agent descriptions + current state; constrained by allowed_or_disallowed_speaker_transitions 41 |
| Parallel fan-out | Anthropic multi-agent research | independent subtasks with no shared mutable state 42 |
| Cross-vendor A2A | Google Agent2Agent protocol | Agent Cards advertise capabilities; orchestrator queries cards to find specialist 43 44 |

5.2 The decision rule the field converged on


“The orchestration layer decomposes incoming requests into executable subtasks and assigns them to the most suitable agent based on capability, context, and real-time system state.” — N-iX, AI Agent Orchestration 45

“Use hierarchical delegation when you have complex, multi-faceted problems that benefit from task decomposition and specialist agents rather than trying to handle everything with a single agent.” — ActiveWizards, CrewAI Delegation Guide 46

In practice: spawn a sub-agent when (a) the subtask is independent of the main agent’s running state, AND (b) there is a specialist whose description matches the subtask, AND (c) the parallelism saves wall-clock time. Otherwise, inline.

The Anthropic pattern is description-driven: “When you define subagents, Claude determines whether to invoke them based on each subagent’s description field” 38. Vox already has this surface in crates/vox-skills/skills/*.skill.md — every skill is a sub-agent description.

The cumulative reliability of an agent chain is the product of per-agent reliability. A 5-agent chain at 95% per-agent ≈ 77% overall. The literature explicitly flags this as a HITL trigger: “multi-agent chain complexity where compound uncertainty across autonomous agent handoffs degrades cumulative reliability” 47. Vox’s multi-agent-vcs-replication-spec-2026.md implicitly handles this with state-recovery checkpoints; the trigger — “this chain is now too long, fork to HITL” — is not yet in the policy.
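The compounding arithmetic can be turned into a chain-length cap. The sketch below assumes agents fail independently, which is the same assumption behind the 77% figure; `chain_reliability` and `max_chain_len` are hypothetical names, and the reliability floor is a policy choice not specified in the source.

```rust
/// Reliability of a sequential chain, assuming each agent succeeds
/// independently with probability `per_agent`.
pub fn chain_reliability(per_agent: f64, agents: u32) -> f64 {
    per_agent.powi(agents as i32)
}

/// Longest chain whose compound reliability stays at or above `floor`.
/// Returns None when the inputs are out of range or no finite cap applies
/// (per_agent >= 1.0 never degrades; floor <= 0.0 is never violated).
pub fn max_chain_len(per_agent: f64, floor: f64) -> Option<u32> {
    if per_agent <= 0.0 || per_agent >= 1.0 || floor <= 0.0 || floor > 1.0 {
        return None;
    }
    let mut len = 0u32;
    let mut reliability = 1.0;
    // Keep adding agents while the next handoff still clears the floor.
    while reliability * per_agent >= floor {
        reliability *= per_agent;
        len += 1;
    }
    Some(len)
}
```

At 95% per agent and a floor of 0.75, the cap works out to five agents: the fifth handoff leaves about 0.774, and a sixth would drop to about 0.735.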


Part 6 — D5/D6: Mode Switching and Doom-Loop Detection


The HITL/autonomy literature has converged on four-dimensional risk scoring 34:

  1. Irreversibility — can the action be undone?
  2. Blast radius — how many people / records does it affect?
  3. Compliance exposure — does it create legal or regulatory obligations?
  4. Confidence — how certain is the agent?

The product (or weighted sum) of dimensions 1–3, scaled by (1 − confidence), is a risk score that maps to autonomy level. This is the AURA framework’s contribution: parse, score, and gate high-impact actions according to predefined thresholds, with optional HITL review for uncertain cases 48.
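The weighted-sum variant of that score might look like the following. This is a sketch under stated assumptions: every dimension is already normalized to [0, 1], and `RiskInputs`, `risk_score`, and the weight vector are hypothetical names, not part of AURA or of Vox.

```rust
/// The four dimensions, each normalized to [0, 1]. How a raw blast radius or
/// a compliance tag maps onto that range is left to the caller (assumption).
pub struct RiskInputs {
    pub irreversibility: f64,
    pub blast_radius: f64,
    pub compliance_exposure: f64,
    pub confidence: f64,
}

/// Weighted-sum variant: impact is the weighted mean of the first three
/// dimensions, then scaled by the agent's uncertainty (1 - confidence).
pub fn risk_score(r: &RiskInputs, weights: [f64; 3]) -> f64 {
    let total: f64 = weights.iter().sum();
    if total <= 0.0 {
        return 0.0;
    }
    let impact = (weights[0] * r.irreversibility
        + weights[1] * r.blast_radius
        + weights[2] * r.compliance_exposure)
        / total;
    (impact * (1.0 - r.confidence)).clamp(0.0, 1.0)
}
```

One property worth noting: a fully confident agent scores zero risk under this formula regardless of impact, which is why compliance-tagged actions are better handled as a separate hard gate than as a term in the score.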

The EU AI Act (Article 14, August 2026 enforcement) makes a HITL surface legally required for any high-risk AI system 49. Vox is below that threshold for most use cases but needs the mechanism in place to enable it for users who deploy in those domains.

6.2 Doom-loop detection: what triggers the circuit breaker


The production consensus on circuit breaker conditions 50 51:

| Signal | Threshold (typical) | Source |
|---|---|---|
| No file/state change for N consecutive loops | N=3 | 52 |
| Same error message N consecutive loops | N=5 | 52 |
| Output decline (response getting shorter / lower quality) | >70% reduction | 52 |
| Tool call count without state progression | >15 | 53 |
| Repeated tool-args (Jaccard / n-gram similarity on action) | n=4 grams; >0.85 cosine on action embeddings | 54 55 |
| Semantic drift from baseline (Sentence-BERT distance to canary) | task-specific; needs baseline | 56 |
| Hard turn cap | 30–50 turns / max_iterations | 57 |

The graduated warning pattern (NousResearch hermes-agent) is worth copying: a CAUTION tier 10 turns before the cap and a WARNING tier 3 turns before, so the agent can wrap up cleanly instead of hitting the wall 58.
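The two warning tiers reduce to a lookup on remaining turns. The sketch below uses the 10-turn and 3-turn margins quoted above; `BudgetSignal` and `budget_signal` are hypothetical names and are not taken from hermes-agent.

```rust
/// What the agent should be told about its remaining iteration budget.
#[derive(Debug, PartialEq)]
pub enum BudgetSignal {
    Ok,
    Caution,
    Warning,
    Exhausted,
}

/// CAUTION from 10 turns before the cap, WARNING from 3 turns before,
/// EXHAUSTED at or past the cap.
pub fn budget_signal(turn: u32, cap: u32) -> BudgetSignal {
    // saturating_sub keeps turns past the cap at zero remaining.
    let remaining = cap.saturating_sub(turn);
    match remaining {
        0 => BudgetSignal::Exhausted,
        1..=3 => BudgetSignal::Warning,
        4..=10 => BudgetSignal::Caution,
        _ => BudgetSignal::Ok,
    }
}
```

With a cap of 40 turns, turn 30 is the first CAUTION and turn 37 the first WARNING.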

orchestrator-companion-audit-findings-2026.md FIX-B-11 names the doom-loop detector as a P1 gap. The literature gives us an executable spec:

struct CircuitBreaker {
    no_progress_count: u32, // increments when state hash unchanged
    same_error_count: u32,
    output_decline_ratio: f64,
    tool_calls_without_progress: u32,
    action_n4gram_history: RingBuffer<NgramSet>,
    drift_baseline: Embedding,
}

impl CircuitBreaker {
    fn should_trip(&self) -> Option<TripReason> {
        if self.no_progress_count >= 3 { return Some(NoProgress) }
        if self.same_error_count >= 5 { return Some(StuckOnError) }
        if self.tool_calls_without_progress > 15 { return Some(ToolThrash) }
        if self.action_repetition_score() > 0.85 { return Some(ActionLoop) }
        if self.semantic_drift() > drift_threshold { return Some(Drifting) }
        None
    }
}
```

The trip action is not “abort” — it is “hand to replanner with the trip reason in the prompt, and if replanning also fails within K attempts, escalate to HITL with a partial-progress report.” This matches the LangGraph PEV pattern 26.
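One way to compute an action-repetition score for the breaker is Jaccard similarity between the 4-gram sets of two action windows. This is a sketch of the n-gram variant only (the embedding-cosine variant is not shown); `ngrams` and `jaccard` are hypothetical helpers, and representing each action as a string label is an assumption.

```rust
use std::collections::HashSet;

/// Set of n-grams over a sequence of action labels (for example a tool name
/// plus a hash of its arguments). Returns an empty set if the sequence is too short.
pub fn ngrams<'a>(actions: &[&'a str], n: usize) -> HashSet<Vec<&'a str>> {
    if n == 0 || actions.len() < n {
        return HashSet::new();
    }
    actions.windows(n).map(|w| w.to_vec()).collect()
}

/// Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty.
pub fn jaccard<T: std::hash::Hash + Eq>(a: &HashSet<T>, b: &HashSet<T>) -> f64 {
    let union = a.union(b).count();
    if union == 0 {
        return 0.0;
    }
    a.intersection(b).count() as f64 / union as f64
}
```

Comparing the most recent window of actions against the one before it gives a score of 1.0 for a perfect loop and 0.0 for entirely fresh behavior, which can then be checked against the 0.85 trip level.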

The cleanest framing in the literature is “governed autonomy” 59: the agent runs autonomously inside a defined operating envelope; outside the envelope it switches to interactive. The envelope is defined by the four risk dimensions above. The 2026 conservative defaults seen in multiple sources 34:

  • Irreversible action + confidence < 0.85 → require approval
  • Reversible action + confidence < 0.70 → notify but proceed
  • Blast radius > N records (N depends on tenant) → require approval
  • Compliance-tagged action (PII, financial, regulated) → require approval regardless of confidence

Earn-based expansion is the discipline: start strict, log every (recommendation, decision, outcome), shift from prior-approval to after-the-fact review only after sustained alignment is demonstrated 59.


Part 7 — D7: Context Pressure and Compaction


Anthropic’s Claude Cookbook exposes a configurable threshold; the default is 0.9 (90%) of the context window 60. Microsoft Agent Framework’s compaction default is 50% 61. Why the gap?

Because the right trigger isn’t “I’m running out of room” — it’s “I’m at a natural task boundary” 62:

“It is not ideal to compact when you’re in the middle of a complex refactor; It is better to compact when you are starting a new task or otherwise believe that prior context will lose relevance.” — Autonomous Context Compression, LangChain blog

The 95%-trigger pattern degrades performance before compaction fires 63. The agent-driven pattern (give the agent a compact_context tool it calls proactively) outperforms threshold-driven on long-horizon tasks 60.

Multiple production agents converge on a layered strategy 63:

  1. Budget reduction — truncate oversized individual tool outputs
  2. Snip — drop turns older than a window
  3. Microcompact — summarize within-call state to free cache
  4. Context collapse — summarize long histories to a structured digest
  5. Auto-compact — full semantic compression as a last resort

Each layer fires at a different pressure level. The orchestrator’s job is to pick the layer, not to skip straight to (5).
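A sketch of the layer picker follows. The pressure thresholds are placeholders chosen for illustration only; the source does not give per-layer numbers, and `CompactionLayer`, `LAYER_THRESHOLDS`, and `pick_layer` are hypothetical names.

```rust
/// The five layers, lightest to heaviest.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum CompactionLayer {
    BudgetReduction,
    Snip,
    Microcompact,
    ContextCollapse,
    AutoCompact,
}

/// (token-pressure threshold, layer) in ascending order.
/// The numbers are placeholders, not recommended settings.
const LAYER_THRESHOLDS: [(f64, CompactionLayer); 5] = [
    (0.50, CompactionLayer::BudgetReduction),
    (0.60, CompactionLayer::Snip),
    (0.70, CompactionLayer::Microcompact),
    (0.80, CompactionLayer::ContextCollapse),
    (0.90, CompactionLayer::AutoCompact),
];

/// Heaviest layer whose threshold has been reached, or None below the first one.
/// `pressure` is the fraction of the context window in use.
pub fn pick_layer(pressure: f64) -> Option<CompactionLayer> {
    LAYER_THRESHOLDS
        .iter()
        .rev()
        .find(|(threshold, _)| pressure >= *threshold)
        .map(|&(_, layer)| layer)
}
```

A fuller policy would also take the task-boundary signal into account, for example by allowing a heavier layer earlier when a new task starts; that is left out here.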

7.3 Cache-aware routing — the missing dimension


Routing today optimizes for {capability, cost, latency}. Cache state is rarely a routing input, but it should be: prompt caching saves up to 90% per cached token 64 65. If two providers can both handle a query, prefer the one whose cache contains the longest matching prefix. vLLM’s router and SGLang’s PrefixCacheAffinityRouter formalize this: “consistent hashing ensuring that requests with the same routing key are routed to the same worker replica, maximizing KV cache reuse” 66 67.

For Vox this is a net-new routing dimension. nextgen-orchestrator-research-2026.md §7.3 names it as P2; the cited papers give it concrete shape: maintain a per-provider approximate radix tree keyed on prompt prefix; route to the worker with the longest match unless capability/budget overrides.
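The longest-match rule can be sketched without the tree. In the example below a linear scan over cached prefixes stands in for the approximate radix tree, which keeps the code short but would not scale; `pick_by_prefix` and the shape of `caches` are assumptions made for illustration.

```rust
/// Length of the shared leading run of two token sequences.
fn common_prefix_len(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Pick the provider whose cached prefixes share the longest leading run with
/// `prompt`. `caches` maps provider name to the token prefixes believed to be
/// cached there. Returns None when no provider overlaps at all.
pub fn pick_by_prefix<'a>(
    prompt: &[u32],
    caches: &'a [(&'a str, Vec<Vec<u32>>)],
) -> Option<(&'a str, usize)> {
    caches
        .iter()
        .map(|(name, prefixes)| {
            // Best overlap this provider can offer across all its cached prefixes.
            let best = prefixes
                .iter()
                .map(|p| common_prefix_len(prompt, p))
                .max()
                .unwrap_or(0);
            (*name, best)
        })
        .filter(|&(_, len)| len > 0)
        .max_by_key(|&(_, len)| len)
}
```

The returned match length can then be fed into scoring as one more input, so that a capability or budget constraint can still override the cache preference.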

Per-tenant token-and-spend caps (orchestrator-companion-audit-findings-2026.md FIX-F-05) need three things, per the gateway literature 68 69:

  1. In-memory budget tracking at every request — no per-call DB hit
  2. Hierarchical buckets — per-tenant and per-app inside it
  3. Token-based, not request-based — request-based does not capture variable cost per call
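The three requirements fit in a small in-memory structure. The sketch below is single-threaded and omits persistence and refill; `TokenBudget` and its methods are hypothetical names, not an existing Vox type.

```rust
use std::collections::HashMap;

/// In-memory, token-denominated budget with two levels:
/// a bucket per tenant and a bucket per (tenant, app) inside it.
pub struct TokenBudget {
    tenant_remaining: HashMap<String, u64>,
    app_remaining: HashMap<(String, String), u64>,
}

impl TokenBudget {
    pub fn new() -> Self {
        Self {
            tenant_remaining: HashMap::new(),
            app_remaining: HashMap::new(),
        }
    }

    pub fn set_tenant(&mut self, tenant: &str, tokens: u64) {
        self.tenant_remaining.insert(tenant.to_string(), tokens);
    }

    pub fn set_app(&mut self, tenant: &str, app: &str, tokens: u64) {
        self.app_remaining
            .insert((tenant.to_string(), app.to_string()), tokens);
    }

    /// Debit both buckets, or neither: nothing is charged unless the tenant
    /// bucket and the app bucket can both cover `tokens`.
    pub fn try_spend(&mut self, tenant: &str, app: &str, tokens: u64) -> bool {
        let key = (tenant.to_string(), app.to_string());
        let tenant_ok = self.tenant_remaining.get(tenant).map_or(false, |&r| r >= tokens);
        let app_ok = self.app_remaining.get(&key).map_or(false, |&r| r >= tokens);
        if !(tenant_ok && app_ok) {
            return false;
        }
        if let Some(r) = self.tenant_remaining.get_mut(tenant) {
            *r -= tokens;
        }
        if let Some(r) = self.app_remaining.get_mut(&key) {
            *r -= tokens;
        }
        true
    }
}
```

Unknown tenants or apps are rejected rather than treated as unlimited, which is the conservative default for a spend cap.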

Part 8 — D8: Privacy / Sensitivity Routing


Production deployments converge on an AI-Gateway-shaped PII boundary 70 71:

“When text enters an AI Gateway, it’s inspected for PII in real time, and if PII is identified, it’s automatically rerouted to a more secure, on-premises model instead of a cloud-hosted one.”

The detection layer is two-pass: regex/dictionary patterns (deterministic, fast, ~80% recall) layered with an ML PII classifier (Microsoft Presidio, PII Tools, Nightfall) for the long tail 72 73. Detected matches elevate the sensitivity tag (e.g., a note containing medical keywords moves from Medium to Critical even if the surface string lacked a regex match).

Once a sensitivity tag is set, the routing decision is a filter, not a scorer:

fn privacy_eligible_providers(task) -> Vec<Provider> {
    match task.sensitivity {
        Public | Internal => all_providers(),
        Confidential => providers_with_zdr(),
        Restricted | Critical => self_hosted_only(),
    }
}

Composing with §2: pick the highest-quality model in the eligible set, not the highest-quality model overall.
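The filter-then-score composition can be sketched as below. The `Provider` fields and the integer `quality` score are assumptions made for the example; in Vox the score would come from the existing scoring path rather than a static field.

```rust
/// Sensitivity levels as used in the filter above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Sensitivity {
    Public,
    Internal,
    Confidential,
    Restricted,
    Critical,
}

/// Hypothetical provider record; field names are illustrative.
pub struct Provider {
    pub name: &'static str,
    pub zdr: bool,
    pub self_hosted: bool,
    pub quality: u32,
}

/// Privacy is a hard filter: a provider is either eligible or not.
pub fn is_eligible(p: &Provider, s: Sensitivity) -> bool {
    match s {
        Sensitivity::Public | Sensitivity::Internal => true,
        Sensitivity::Confidential => p.zdr,
        Sensitivity::Restricted | Sensitivity::Critical => p.self_hosted,
    }
}

/// Highest-quality provider inside the eligible set, not overall.
pub fn best_eligible<'a>(providers: &'a [Provider], s: Sensitivity) -> Option<&'a Provider> {
    providers
        .iter()
        .filter(|p| is_eligible(p, s))
        .max_by_key(|p| p.quality)
}
```

Returning `None` when the eligible set is empty forces the caller to handle the "no compliant provider" case explicitly instead of silently falling back to a cloud model.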

The 2026 consensus on guardrails 74 75 76:

  • Input guardrails — prompt injection / jailbreak detection, PII redaction before the model sees the prompt
  • Output guardrails — PII / secrets / toxicity scan before output reaches the user
  • Runtime guardrails — tool-call authorization at the moment of invocation

Each layer combines rule-based (deterministic, fast) and ML-based (catches paraphrased attacks). Vox’s vox-secrets already enforces secret-egress prevention; the missing piece is input PII detection that informs the routing decision (not just blocks it).


Compose Part 6’s risk dimensions with Part 4’s confidence signal into a 2D matrix 34 48:

| | Confidence ≥ 0.85 | 0.70 ≤ c < 0.85 | c < 0.70 |
|---|---|---|---|
| Reversible, low blast | proceed | proceed + log | notify + proceed |
| Reversible, medium blast | proceed + log | notify + proceed | escalate |
| Irreversible, any blast | notify + proceed | escalate | escalate |
| Compliance-tagged | escalate | escalate | escalate |

The matrix collapses to two thresholds: a proceed gate and an escalate gate. The middle band is “notify and proceed” — a low-friction in-band notification rather than a blocking modal.
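The matrix translates directly into a lookup. The sketch below encodes the cells exactly as tabulated; `RiskClass`, `Action`, and `decide` are hypothetical names for illustration.

```rust
/// Outcomes that appear in the matrix cells.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Action {
    Proceed,
    ProceedAndLog,
    NotifyAndProceed,
    Escalate,
}

/// Rows of the matrix.
#[derive(Debug, Clone, Copy)]
pub enum RiskClass {
    ReversibleLowBlast,
    ReversibleMediumBlast,
    Irreversible,
    ComplianceTagged,
}

/// Look up the cell for a risk row and a confidence value.
pub fn decide(risk: RiskClass, confidence: f64) -> Action {
    use Action::*;
    // Column index: 0 = confidence >= 0.85, 1 = middle band, 2 = below 0.70.
    let col = if confidence >= 0.85 {
        0
    } else if confidence >= 0.70 {
        1
    } else {
        2
    };
    let row = match risk {
        RiskClass::ReversibleLowBlast => [Proceed, ProceedAndLog, NotifyAndProceed],
        RiskClass::ReversibleMediumBlast => [ProceedAndLog, NotifyAndProceed, Escalate],
        RiskClass::Irreversible => [NotifyAndProceed, Escalate, Escalate],
        RiskClass::ComplianceTagged => [Escalate, Escalate, Escalate],
    };
    row[col]
}
```

Keeping the table as data rather than nested conditionals makes it straightforward to move the same cells into a contract file later.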

LangGraph’s interrupt() and interrupt_before=["tools"] are the cleanest reference implementation 77 78. Pattern:

  • Static breakpoint — wrap risky nodes (financial transactions, deletes, external sends) with interrupt_before
  • Dynamic interrupt — runtime call to interrupt() when the matrix above lands in “escalate”

The persistence layer captures full state at the interrupt; the user resumes via Command(resume=...) with the same thread_id.

Conservative defaults aren’t permanent. The discipline 59:

  1. Log every (recommendation, user decision, outcome) tuple.
  2. After 30 days of production data, recalibrate thresholds based on Expected Calibration Error (ECE).
  3. Adjust to a target false-positive rate matching reviewer capacity: if reviewers ignore 90% of escalations, the gate is escalating too eagerly and should be relaxed.
  4. Shift from prior-approval to after-the-fact review only after sustained alignment.
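Step 2 depends on computing ECE from the logged tuples. A minimal sketch with equal-width bins follows; the function name, the bin count, and the `(confidence, correct)` sample shape are assumptions about how the log would be projected.

```rust
/// Expected Calibration Error with equal-width bins over [0, 1].
/// `samples` holds (predicted confidence, whether the outcome was correct).
pub fn expected_calibration_error(samples: &[(f64, bool)], bins: usize) -> f64 {
    if samples.is_empty() || bins == 0 {
        return 0.0;
    }
    let mut conf_sum = vec![0.0_f64; bins];
    let mut correct = vec![0_usize; bins];
    let mut count = vec![0_usize; bins];
    for &(conf, ok) in samples {
        let c = conf.clamp(0.0, 1.0);
        // A confidence of exactly 1.0 belongs to the last bin.
        let idx = ((c * bins as f64) as usize).min(bins - 1);
        conf_sum[idx] += c;
        count[idx] += 1;
        if ok {
            correct[idx] += 1;
        }
    }
    let n = samples.len() as f64;
    // Weighted average of |accuracy - mean confidence| over non-empty bins.
    (0..bins)
        .filter(|&b| count[b] > 0)
        .map(|b| {
            let size = count[b] as f64;
            let accuracy = correct[b] as f64 / size;
            let mean_conf = conf_sum[b] / size;
            (size / n) * (accuracy - mean_conf).abs()
        })
        .sum()
}
```

For example, four decisions logged at 0.9 confidence of which two were correct give an ECE of 0.4, a sign that the 0.9 level is overconfident.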

Vox has the data substrate (llm_interactions, model_scoreboard); the calibration loop is the missing automation.


Part 10 — D10: Adaptation (Learning the Router)


A frozen classifier-router degrades as model capabilities, prompt distributions, and pricing shift. Online learning closes the loop:

  • BaRP — contextual bandit conditioned on user preference vector, learning from binary “good response?” feedback. Single policy, multiple operating points at inference 8.
  • PILOT — bandit + budget constraint with online cost policy 9.
  • Dueling bandits — pairwise preference feedback (which response is better?), label-efficient 79.

All three need only binary or pairwise outcome signals, which Vox already collects (socrates pass/fail, user thumbs, task completion).
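To show how little state a binary-feedback router needs, here is a plain UCB1 bandit over tiers. This is a deliberately simplified, non-contextual stand-in: it is not BaRP or PILOT, it ignores the prompt and the budget, and `Ucb1` is a hypothetical type.

```rust
/// Per-arm running statistics for a non-contextual UCB1 bandit.
/// Each arm would correspond to one model tier.
pub struct Ucb1 {
    pulls: Vec<u64>,
    reward_sum: Vec<f64>,
}

impl Ucb1 {
    /// Assumes at least one arm.
    pub fn new(arms: usize) -> Self {
        Self {
            pulls: vec![0; arms],
            reward_sum: vec![0.0; arms],
        }
    }

    pub fn pulls(&self, arm: usize) -> u64 {
        self.pulls[arm]
    }

    /// Pick the arm with the highest upper confidence bound;
    /// arms that have never been tried go first.
    pub fn select(&self) -> usize {
        if let Some(untried) = self.pulls.iter().position(|&n| n == 0) {
            return untried;
        }
        let total: u64 = self.pulls.iter().sum();
        let ln_total = (total as f64).ln();
        let mut best = 0;
        let mut best_ucb = f64::NEG_INFINITY;
        for arm in 0..self.pulls.len() {
            let n = self.pulls[arm] as f64;
            // Mean reward plus an exploration bonus that shrinks with more pulls.
            let ucb = self.reward_sum[arm] / n + (2.0 * ln_total / n).sqrt();
            if ucb > best_ucb {
                best = arm;
                best_ucb = ucb;
            }
        }
        best
    }

    /// Record a binary outcome (task succeeded or not) for the chosen arm.
    pub fn update(&mut self, arm: usize, success: bool) {
        self.pulls[arm] += 1;
        if success {
            self.reward_sum[arm] += 1.0;
        }
    }
}
```

A contextual variant would condition `select` on prompt features and a preference vector, which is where the cited methods differ from this sketch.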

A separate concern from “is the router good” is “is the router still good.” The semantic-drift detection literature 80 81 gives us a baseline: compute a Sentence-BERT embedding of typical-task responses; flag when current responses diverge by more than 2σ from baseline. The trip surface is request retraining, not model swap.
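The 2σ check itself is simple once responses have been reduced to a scalar distance from the baseline. The sketch below assumes that reduction (for example, embedding distance to a baseline centroid) happens elsewhere, and treats drift as one-sided: only distances above the baseline mean count. `mean_std` and `is_drifting` are hypothetical helpers.

```rust
/// Mean and population standard deviation; None for an empty slice.
pub fn mean_std(xs: &[f64]) -> Option<(f64, f64)> {
    if xs.is_empty() {
        return None;
    }
    let n = xs.len() as f64;
    let mean = xs.iter().sum::<f64>() / n;
    let var = xs.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n;
    Some((mean, var.sqrt()))
}

/// True when `current` lies more than `k` standard deviations above the
/// baseline mean. `baseline` holds distances observed during normal operation.
pub fn is_drifting(baseline: &[f64], current: f64, k: f64) -> bool {
    match mean_std(baseline) {
        Some((mean, std)) => current > mean + k * std,
        // Without a baseline there is nothing to compare against.
        None => false,
    }
}
```

With baseline distances of 0.10, 0.12, 0.11, 0.09, and 0.08, the 2σ line sits near 0.128, so a current distance of 0.11 passes and 0.20 raises the flag.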

OpenTelemetry’s GenAI Semantic Conventions (SIG since April 2024) standardize span/event shapes for LLM calls, agent steps, vector queries, token usage, and cost 82 83. Datadog began native support in v1.37; Grafana followed. Vox’s v59 telemetry schema predates this standard but maps cleanly; an explicit conformance pass is a low-cost win and unblocks third-party observability.

loop every 24h:
    samples = sample_recent_completions(n=10000)
    for tier in [Cheap, Mid, Strong]:
        observed_quality = score(samples.filtered(tier))
        observed_cost = sum(samples.filtered(tier).cost)
        update_router_weights(tier, observed_quality, observed_cost)
    if drift_score(samples) > 2σ:
        emit_alert(RouterDriftDetected)
        flag_for_retraining()

This sits in vox-orchestrator as a background task; it does not block any user-facing call.


Part 11 — Vox Mapping: Reasonably Automatable Today


These rules can be implemented on Vox’s current surfaces (data, contracts, crates) without new external dependencies. Each cites the existing artifact it builds on.

| Decision | Automatable layer | Existing surface | Net-new code |
|---|---|---|---|
| D1 model tier | Three-tier (Cheap/Mid/Strong) routing with rule-based prefilter | vox-orchestrator/src/models/registry.rs::best_for() | Tier classifier (rule-based v1, BERT v2) |
| D1 cascade | Optional cascade for medium-confidence calls | Same | Confidence-gated escalation wrapper |
| D2 plan-vs-act | Step-count + irreversibility heuristic | mcp_tools/chat_tools/plan.rs | Trigger function pick_planning_mode() |
| D3 confidence fusion | Logprob-entropy + verbalized + per-claim self-consistency for stakes-tagged claims | vox-socrates-policy, ADR-005 | Two new columns on llm_interactions; fuse function |
| D4 sub-agent dispatch | Description-driven dispatch + chain-length cap | vox-skills/skills/*.skill.md, vox-orchestrator agent queue | Chain-length tracker, fanout decider |
| D5 mode switch | 4-dim risk score + autonomy envelope | ADR-030 (state machine SSoT) | RiskScore calculator, envelope config |
| D6 doom-loop | Five-signal circuit breaker (no-progress / same-error / tool-thrash / action-loop / drift) + graduated warnings | orchestrator-companion-audit-findings-2026.md FIX-B-11 | CircuitBreaker struct from §6.3 |
| D7 context | Five-layer pipeline triggered at distinct thresholds | vox-orchestrator compaction code | Per-layer trigger config; agent-driven compact_context tool |
| D7 cache-aware | Approximate radix tree per provider; route to longest-prefix match unless overridden | none yet (P2 in nextgen-orchestrator-research) | New routing dimension in scoring.rs |
| D8 privacy | Two-pass PII detection → eligible-provider filter | vox-secrets egress guard | Input-side detector + filter wrapper |
| D9 HITL | Risk×confidence matrix → interrupt or notify | none yet | Matrix evaluator + interrupt point |
| D10 calibration | Daily recalibration + drift alert | model_scoreboard v59 | Background calibration job |

These decisions look automatable but the literature consistently warns against full automation. They belong in the HITL surface or in advisory mode.

  1. Ambiguous-intent disambiguation. When the user’s request is genuinely ambiguous (specification uncertainty, not model uncertainty), the literature is unanimous: ask 84 85. EVPI (Expected Value of Perfect Information) gives a calibrated cost-benefit for asking; do not paper over with assumptions.
  2. Compliance-tagged actions. EU AI Act Article 14 makes HITL on high-risk actions legally required, not optional 49. No confidence threshold should bypass this — confidence and compliance are independent axes.
  3. Goal redefinition under semantic drift. If the agent’s interpretation of the goal is drifting, it cannot self-detect reliably (the drift detector does, but the fix is not “let the agent re-decide what the goal is”) 80. Hand back.
  4. Cross-tenant boundary changes. Routing within a tenant’s eligible-provider set is automatable; changing what’s eligible (e.g., promoting a new provider into the privacy-restricted pool) is a configuration change that needs explicit operator intent.
  5. Money / external messaging. Per CLAUDE.md and the agent autonomy literature, transactions and outbound messages are by-default-HITL regardless of confidence 34.
  6. Reasoning-fine-tuned model abstention. AbstentionBench (NeurIPS 2025) found that reasoning fine-tuning degrades abstention by 24% on average 86. Reasoning-tuned models that confidently answer unanswerable questions cannot be trusted to abstain on their own — the orchestrator must impose external abstention.
  7. Self-consistency-only confidence on confidently-wrong claims. Self-consistency reports high confidence when the model is uniformly wrong, because every resample repeats the same error. Never use it as the only signal; fuse it with the other confidence signals 29.
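
The EVPI test in item 1 can be made concrete. The sketch below is illustrative only: the function names, the prior over interpretations, and the utility table are hypothetical, and it assumes the orchestrator can enumerate candidate interpretations and score each action against each one. EVPI is the expected utility of acting after the true interpretation is revealed, minus the best expected utility of acting now; asking is worthwhile when that gap exceeds the cost of interrupting the user.

```rust
/// Expected value of perfect information for a clarification decision.
///
/// `priors[s]` is the probability that interpretation `s` is what the user meant.
/// `utility[a][s]` is the payoff of taking action `a` when `s` is true.
/// Both inputs are hypothetical; nothing here is an existing Vox API.
fn evpi(priors: &[f64], utility: &[Vec<f64>]) -> f64 {
    // Best expected utility when acting now, without asking.
    let act_now = utility
        .iter()
        .map(|row| row.iter().zip(priors).map(|(u, p)| u * p).sum::<f64>())
        .fold(f64::NEG_INFINITY, f64::max);

    // Expected utility if the true interpretation were revealed first:
    // for each interpretation, take the best action for it, weighted by its prior.
    let informed: f64 = priors
        .iter()
        .enumerate()
        .map(|(s, p)| {
            let best = utility
                .iter()
                .map(|row| row[s])
                .fold(f64::NEG_INFINITY, f64::max);
            p * best
        })
        .sum();

    informed - act_now
}

/// Ask the user only when the information is worth more than the interruption.
fn should_ask(priors: &[f64], utility: &[Vec<f64>], ask_cost: f64) -> bool {
    evpi(priors, utility) > ask_cost
}
```

With two equally likely interpretations and actions that only pay off under their matching interpretation, EVPI is large and the rule says ask; when one interpretation dominates the prior, EVPI shrinks and the rule says proceed.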

Part 13 — Proposed Decision-Rule Contracts

These are skeleton schemas, not full proposals — a starting point for ADRs. Each is one YAML/Rust file that becomes the SSoT for one decision.

13.1 contracts/orchestration/tier-routing.v1.yaml (D1)

```yaml
version: 1
classifier:
  type: rule_based  # or "bert" once a model is trained
  rules:
    - if: prompt.length < 200 and prompt.tools_named == 0
      tier: cheap
    - if: prompt.contains_code_block and prompt.language in [rust, ts, py]
      tier: mid
    - if: prompt.has_keyword(["prove", "derive", "design"])
      tier: strong
cascade:
  enabled: true
  start_tier: cheap
  escalate_threshold: 0.65  # tunable; see model_scoreboard calibration
  max_escalations: 1
budget_floor:
  enforce: true  # block strong tier if budget < cheapest_strong_cost
```
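
One possible reading of the `cascade` and `budget_floor` keys, as a minimal sketch rather than a specification. It assumes tiers are ordered cheap < mid < strong, that escalation moves one tier at a time, and that a hypothetical `confidence_at` callback returns the confidence of the answer produced at a given tier. None of these names exist in Vox today.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Tier {
    Cheap,
    Mid,
    Strong,
}

impl Tier {
    /// Next tier up, assuming the ordering cheap < mid < strong.
    fn next(self) -> Option<Tier> {
        match self {
            Tier::Cheap => Some(Tier::Mid),
            Tier::Mid => Some(Tier::Strong),
            Tier::Strong => None,
        }
    }
}

/// Mirrors the `cascade` and `budget_floor` keys of the skeleton above.
struct Cascade {
    start_tier: Tier,
    escalate_threshold: f64,
    max_escalations: u32,
    enforce_budget_floor: bool,
}

/// Returns the tier whose answer is finally accepted.
fn run_cascade(
    cfg: &Cascade,
    budget: f64,
    cheapest_strong_cost: f64,
    confidence_at: impl Fn(Tier) -> f64,
) -> Tier {
    let mut tier = cfg.start_tier;
    let mut escalations = 0u32;
    // Escalate while the current answer is under threshold and the cap allows it.
    while escalations < cfg.max_escalations && confidence_at(tier) < cfg.escalate_threshold {
        let next = match tier.next() {
            Some(t) => t,
            None => break, // already at the strongest tier
        };
        // Budget floor: block the strong tier if the budget cannot cover it.
        let blocked = cfg.enforce_budget_floor
            && next == Tier::Strong
            && budget < cheapest_strong_cost;
        if blocked {
            break;
        }
        tier = next;
        escalations += 1;
    }
    tier
}
```

With `max_escalations: 1` and `start_tier: cheap`, a low-confidence cheap answer escalates once to mid and stops there, even if the mid answer is also under threshold. Whether that case should then fall through to HITL is left open by the skeleton.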

13.2 contracts/orchestration/risk-confidence-matrix.v1.yaml (D5/D9)

```yaml
version: 1
risk_dimensions:
  irreversibility:  # boolean
  blast_radius:     # int, records or users affected
  compliance_tag:   # enum: none | pii | financial | regulated
confidence_thresholds:
  proceed: 0.85
  notify_and_proceed: 0.70
  escalate: 0.0
matrix:
  - {irreversible: false, blast: <10, compliance: none}: proceed_at_0.0
  - {irreversible: false, blast: <100, compliance: none}: proceed_at_0.70
  - {irreversible: true, blast: any, compliance: any}: notify_at_0.85_else_escalate
  - {irreversible: any, blast: any, compliance: pii_or_higher}: always_escalate
calibration:
  recalibrate_every_days: 30
  signal: ece_on_logged_decisions
```
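
The four matrix rows can be read as a small decision function. The sketch below is one interpretation, not the contract itself, and all type and function names are hypothetical. Two assumptions go beyond the skeleton: the compliance row is evaluated first, so that no confidence value can bypass it, and any combination the matrix does not list (for example a reversible action with a blast radius of 100 or more) defaults to escalation.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Compliance {
    None,
    Pii,
    Financial,
    Regulated,
}

#[derive(Debug, PartialEq)]
enum Decision {
    Proceed,
    NotifyAndProceed,
    Escalate,
}

/// The three risk dimensions named in the skeleton.
struct ProposedAction {
    irreversible: bool,
    blast_radius: u32,
    compliance: Compliance,
}

fn decide(action: &ProposedAction, confidence: f64) -> Decision {
    // Assumption: compliance wins over every other row (always_escalate).
    if action.compliance != Compliance::None {
        return Decision::Escalate;
    }
    // Irreversible actions: notify at >= 0.85, otherwise escalate.
    if action.irreversible {
        return if confidence >= 0.85 {
            Decision::NotifyAndProceed
        } else {
            Decision::Escalate
        };
    }
    match action.blast_radius {
        0..=9 => Decision::Proceed,                         // proceed_at_0.0
        10..=99 if confidence >= 0.70 => Decision::Proceed, // proceed_at_0.70
        _ => Decision::Escalate,                            // assumption: unlisted cases escalate
    }
}
```

Checking compliance before irreversibility matters because the skeleton lists the irreversible row first; a literal first-match evaluation would let an irreversible PII action proceed with a notification at 0.85 confidence.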

13.3 contracts/orchestration/circuit-breaker.v1.yaml (D6)

```yaml
version: 1
trips:
  no_progress_loops: 3
  same_error_loops: 5
  tool_calls_no_progress: 15
  action_ngram_overlap: 0.85  # 4-gram cosine
  semantic_drift_sigma: 2.0
  hard_turn_cap: 50
warnings:
  caution_at_remaining: 10  # turns
  warning_at_remaining: 3
trip_action: handoff_to_replanner
replanner_max_retries: 2
on_replanner_failure: escalate_to_hitl
```
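
A partial sketch of how these thresholds might drive a breaker, covering only the no-progress, same-error, and turn-cap trips plus the graduated warnings. The tool-thrash, n-gram overlap, and drift signals are omitted because they need inputs this example does not model. The struct layout and method names are hypothetical and are not the `CircuitBreaker` from §6.3.

```rust
#[derive(Debug, PartialEq)]
enum Status {
    Ok,
    Caution,
    Warning,
    Tripped(&'static str),
}

struct CircuitBreaker {
    turns: u32,
    no_progress: u32,
    same_error: u32,
    last_error: Option<String>,
}

impl CircuitBreaker {
    // Values copied from the skeleton above.
    const NO_PROGRESS_LOOPS: u32 = 3;
    const SAME_ERROR_LOOPS: u32 = 5;
    const HARD_TURN_CAP: u32 = 50;
    const CAUTION_AT_REMAINING: u32 = 10;
    const WARNING_AT_REMAINING: u32 = 3;

    fn new() -> Self {
        Self { turns: 0, no_progress: 0, same_error: 0, last_error: None }
    }

    /// Record one agent turn and report the breaker status after it.
    fn record_turn(&mut self, made_progress: bool, error: Option<&str>) -> Status {
        self.turns += 1;
        self.no_progress = if made_progress { 0 } else { self.no_progress + 1 };
        // Count consecutive repeats of the same error; any change resets the run.
        match error {
            Some(e) if self.last_error.as_deref() == Some(e) => self.same_error += 1,
            Some(e) => {
                self.same_error = 1;
                self.last_error = Some(e.to_string());
            }
            None => {
                self.same_error = 0;
                self.last_error = None;
            }
        }
        if self.no_progress >= Self::NO_PROGRESS_LOOPS {
            return Status::Tripped("no_progress");
        }
        if self.same_error >= Self::SAME_ERROR_LOOPS {
            return Status::Tripped("same_error");
        }
        // Graduated warnings based on turns remaining before the hard cap.
        match Self::HARD_TURN_CAP.saturating_sub(self.turns) {
            0 => Status::Tripped("hard_turn_cap"),
            r if r <= Self::WARNING_AT_REMAINING => Status::Warning,
            r if r <= Self::CAUTION_AT_REMAINING => Status::Caution,
            _ => Status::Ok,
        }
    }
}
```

On `Tripped`, the caller would apply `trip_action` (hand off to the replanner, up to `replanner_max_retries` times) and then `on_replanner_failure`. That retry loop is left out here.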

13.4 contracts/orchestration/socrates-fusion.v1.yaml (D3)

```yaml
version: 1
signals:
  logprob_entropy: {weight: 0.4, source: llm_interactions.logprob_entropy}
  sep_estimate: {weight: 0.4, source: llm_interactions.sep_estimate}
  self_consistency: {weight: 0.2, source: per_claim_resample, fire_when: stakes >= medium}
thresholds:
  ship: 0.80
  resample: 0.65
  retrieve: 0.50
  spawn_socrates: 0.30
  abstain: 0.0
abstention_override:
  if_compliance_tagged: always_require_explicit_evidence
  if_user_disabled_socrates: never_abstain_silently  # surface uncertainty instead
```
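
A minimal sketch of how the weights and thresholds could combine, under assumptions the skeleton does not state: each signal is already normalized to a confidence in [0, 1] where higher means more confident, the thresholds are lower bounds checked from the top down, and a signal that is unavailable (no logprobs from the provider, or self-consistency not fired at low stakes) is dropped and the remaining weights are renormalized. The fusion function defined in §4 may differ. All names are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum NextStep {
    Ship,
    Resample,
    Retrieve,
    SpawnSocrates,
    Abstain,
}

/// Weighted mean over the signals that are present.
/// Each entry is (weight, confidence); `None` marks an unavailable signal.
/// Returns `None` when no signal is available at all.
fn fuse(signals: &[(f64, Option<f64>)]) -> Option<f64> {
    let (mut num, mut den) = (0.0, 0.0);
    for &(weight, value) in signals {
        if let Some(v) = value {
            num += weight * v;
            den += weight;
        }
    }
    if den > 0.0 {
        Some(num / den)
    } else {
        None
    }
}

/// Map a fused score onto the threshold ladder from the skeleton.
fn next_step(score: Option<f64>) -> NextStep {
    match score {
        Some(s) if s >= 0.80 => NextStep::Ship,
        Some(s) if s >= 0.65 => NextStep::Resample,
        Some(s) if s >= 0.50 => NextStep::Retrieve,
        Some(s) if s >= 0.30 => NextStep::SpawnSocrates,
        // Below 0.30, or no signal at all: abstain.
        _ => NextStep::Abstain,
    }
}
```

Renormalizing is only one candidate fallback. It keeps the score on the same scale when a signal is missing, but it also means a decision can rest on a single signal, which item 7 of the non-automatable list warns against for self-consistency. The `abstention_override` keys are not modeled here.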

These are starting points; each will need an ADR.


Part 14 — Open Questions / Research Gaps

Items the literature does not answer cleanly and that need primary investigation in Vox:

  1. The complexity classifier ground-truth. RouteLLM-style classifiers are trained from preference data (Chatbot Arena). What is the equivalent for Vox? Best candidate today: model_scoreboard outcomes filtered for high-quality verifiers, but the noise floor is unknown.
  2. Logprob availability across providers. OpenRouter does not consistently surface logprobs from all backends; the LogU family of techniques degrades to verbalized-only on those calls. The fusion function in §13.4 needs a fallback path.
  3. Cache-prefix radix tree at scale. vLLM/SGLang’s structures are designed for a single inference cluster’s KV cache. Vox routes across multiple external providers — the radix tree becomes a prediction of likely cache state, not ground truth. Calibration unknown.
  4. Drift-of-the-drift-detector. Sentence-BERT embeddings themselves shift across model versions; the canary-baseline pattern needs a re-baselining cadence which the literature does not specify.
  5. Mesh agent chain-length cap. Single-agent chain-length advice (§5.3) does not directly translate to populi-mesh A2A delivery, where the chain is partly determined by network topology. ADR-025 names lock coherence but not chain depth.
  6. Reasoning-tuned abstention failure mode. AbstentionBench’s 24% degradation 86 applies to reasoning-fine-tuned models; this is the class Vox routes to most often for hard tasks. External abstention enforcement (the orchestrator imposing “I don’t know” when the model overclaims) is undocumented in the framework literature.
  7. The earn-back path for autonomy. Conservative defaults are well-described; the automated expansion path (when does the orchestrator promote a class of decisions from “always escalate” to “notify-and-proceed”?) is described qualitatively but no production implementation publishes thresholds.

Part 15 — How This Document Relates to Existing Vox Documents

| Existing | Relationship |
|---|---|
| model-orchestration-ssot-audit-2026.md | This doc consumes routing mechanics from there; it adds the decision rules that drive the mechanics. |
| nextgen-orchestrator-research-2026.md | This doc converts that doc’s failure modes into decision-rule contracts. |
| orchestrator-companion-audit-findings-2026.md | This doc gives specs (e.g., circuit breaker §6.3) for several P1 gaps that doc only names. |
| docs/src/adr/005-socrates-anti-hallucination-ssot.md | This doc operationalizes Socrates by defining the fusion function (§4) and the trigger thresholds (§13.4). |
| docs/src/adr/025-multi-agent-lock-coherence.md | This doc adds chain-length and handoff decision rules on top of that doc’s lock mechanics. |
| docs/src/adr/030-state-machine-ssot.md | This doc supplies the risk × confidence matrix that drives the state-machine’s mode transitions. |
| multi-agent-vcs-replication-spec-2026.md | Sub-agent dispatch logic in §5 is consistent with that doc’s handoff protocol. |
| populi-mesh-a2a-durability-spec-2026.md | A2A handoff in §5.1 maps to that doc’s transport. |
| telemetry-driven-cost-accounting-research-2026.md | The calibration loop in §10.4 reads from the telemetry surface that doc defines. |
| planning-meta/00..12 | Plan-mode trigger in §3.3 fits inside that planning framework’s exception policy. |

This document is horizontal: it crosses the routing/policy/observability/security boundaries that the others draw vertically. New ADRs proposed here (§13) should each cite this doc as research foundation.


Citations are split into three categories: academic (peer-reviewed papers and arXiv preprints), framework (production-system documentation), and industry (analysis posts and platform blogs).


End of document.

  1. Bi-directional Model Cascading with Proxy Confidence (2025). https://arxiv.org/pdf/2504.19391

  2. FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance (Chen et al., 2023). https://arxiv.org/abs/2305.05176

  3. C3PO: Optimized Large Language Model Cascades with Probabilistic Cost Constraints for Reasoning (2025). https://arxiv.org/html/2511.07396v1

  4. GATEKEEPER: Improving Model Cascades Through Confidence Tuning (2025). https://arxiv.org/pdf/2502.19335

  5. RouteLLM: Learning to Route LLMs from Preference Data (Ong et al., 2024). https://arxiv.org/html/2406.18665v3 / https://github.com/lm-sys/RouteLLM

  6. Hybrid LLM: Cost-Efficient and Quality-Aware Query Routing (ICLR 2024). https://arxiv.org/abs/2404.14618

  7. VentureBeat — Not Diamond automatically routes your query to the best LLM. https://venturebeat.com/ai/not-diamond-automatically-routes-your-query-to-the-best-llm

  8. Learning to Route LLMs from Bandit Feedback: One Policy, Many Trade-offs (BaRP, 2025). https://arxiv.org/abs/2510.07429

  9. Adaptive LLM Routing under Budget Constraints (PILOT, 2025). https://arxiv.org/html/2508.21141v1

  10. RouterArena: An Open Platform for Comprehensive Comparison of LLM Routers (2025). https://arxiv.org/html/2510.00202v1

  11. Augment Code — Best AI Model for Coding Agents in 2026: A Routing Guide. https://www.augmentcode.com/guides/ai-model-routing-guide

  12. MegaNova — The 3-Tier Routing Cascade: Rule-Based → Semantic → LLM. https://blog.meganova.ai/the-3-tier-routing-cascade-rule-based-semantic-llm/

  13. DEV.to — ReAct vs Plan-and-Execute: A Practical Comparison. https://dev.to/jamesli/react-vs-plan-and-execute-a-practical-comparison-of-llm-agent-patterns-4gh9

  14. Oracle Integration — ReAct vs Plan & Execute. https://blogs.oracle.com/integration/react-vs-plan-execute-choosing-the-right-agent-thinking-pattern-in-oracle-integration

  15. LangGraph Plan-and-Execute example. https://github.com/langchain-ai/langgraph/blob/main/examples/plan-and-execute/plan-and-execute.ipynb

  16. Medium — Built with LangGraph! #33: Plan & Execute. https://medium.com/@okanyenigun/built-with-langgraph-33-plan-execute-ea64377fccb1

  17. IBM — What is Tree Of Thoughts Prompting? https://www.ibm.com/think/topics/tree-of-thoughts

  18. More Effectively Searching Trees of Thought. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1244/final-projects/KamyarJohnSalahiPranavGurusankarSathyaEdamadaka.pdf

  19. Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., NeurIPS 2023). https://github.com/noahshinn/reflexion

  20. Agent Patterns — Reflexion Agent Pattern. https://agent-patterns.readthedocs.io/en/stable/patterns/reflexion.html

  21. Anthropic — Building with Extended Thinking. https://platform.claude.com/docs/en/build-with-claude/extended-thinking

  22. Anthropic — Adaptive Thinking. https://platform.claude.com/docs/en/build-with-claude/adaptive-thinking

  23. louisbouchard — ReAct vs Plan-and-Execute: The Architecture Behind Modern AI Agents. https://louisbouchard.substack.com/p/react-vs-plan-and-execute-the-architecture

  24. Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs (2025). https://arxiv.org/html/2507.02076v1

  25. Learning How Hard to Think: Input-Adaptive Allocation of LM Computation (2024). https://arxiv.org/abs/2410.04707

  26. Building a Reliable LangGraph Workflow: Plan-Execute-Validate (PEV). https://dev.to/manjunathgovindaraju/building-a-reliable-langgraph-workflow-plan-execute-validate-pev-automated-retries-and-mcp-1pik

  27. Estimating LLM Uncertainty with Evidence (LogU / LogTokU, 2025). https://arxiv.org/abs/2502.00290

  28. SelfCheckGPT — discussed in Beyond Self-Consistency in Black Box Hallucination Detection (2025). https://www.arxiv.org/pdf/2502.15845

  29. Consistency Is the Key: Detecting Hallucinations in LLM Generated Text (2025). https://arxiv.org/html/2511.12236

  30. Detecting hallucinations in large language models using semantic entropy (Farquhar et al., Nature 2024). https://www.nature.com/articles/s41586-024-07421-0

  31. Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs (NeurIPS 2024). https://arxiv.org/abs/2406.15927

  32. Semantic Energy: Detecting LLM Hallucination Beyond Entropy (2025). https://arxiv.org/pdf/2508.14496

  33. On Verbalized Confidence Scores for LLMs (2024). https://arxiv.org/html/2412.14737v2

  34. Medium — Human-in-the-Loop AI Agents (Anna Jey, 2026). https://medium.com/@arvisionlab/human-in-the-loop-ai-agents-how-to-add-approvals-escalation-and-safe-autonomy-in-production-0a21e359781c

  35. Self-RAG: Learning to Retrieve, Generate and Critique through Self-Reflection. https://selfrag.github.io/

  36. The Socratic Method for Self-Discovery in Large Language Models (Princeton NLP). https://princeton-nlp.github.io/SocraticAI/

  37. LangGraph Multi-Agent Supervisor. https://reference.langchain.com/python/langgraph-supervisor

  38. Anthropic — Subagents in the SDK. https://docs.anthropic.com/en/docs/claude-code/sdk/subagents

  39. OpenAI Cookbook — Orchestrating Agents: Routines and Handoffs. https://cookbook.openai.com/examples/orchestrating_agents

  40. CrewAI Agents documentation. https://docs.crewai.com/en/concepts/agents

  41. Microsoft AutoGen — Selector Group Chat. https://microsoft.github.io/autogen/stable/user-guide/agentchat-user-guide/selector-group-chat.html

  42. Anthropic — How we built our multi-agent research system. https://www.anthropic.com/engineering/multi-agent-research-system

  43. A2A Protocol. https://a2a-protocol.org/latest/

  44. Agent2Agent (A2A) Protocol Specification. https://a2a-protocol.org/latest/specification/

  45. N-iX — AI agent orchestration. https://www.n-ix.com/ai-agent-orchestration/

  46. ActiveWizards — Hierarchical AI Agents: A Guide to CrewAI Delegation. https://activewizards.com/blog/hierarchical-ai-agents-a-guide-to-crewai-delegation

  47. Strata — Human-in-the-Loop: A 2026 Guide to AI Oversight. https://www.strata.io/blog/agentic-identity/practicing-the-human-in-the-loop/

  48. AURA: An Agent Autonomy Risk Assessment Framework (2025). https://arxiv.org/html/2510.15739v1

  49. Knowlee — Human-in-the-Loop AI Policy Template (2026) + AI Act SLAs. https://www.knowlee.ai/blog/human-in-the-loop-ai-policy-template

  50. DEV.to — AI Agent Circuit Breakers. https://dev.to/waxell/ai-agent-circuit-breakers-the-reliability-pattern-production-teams-are-missing-5bpg

  51. paperclipai/paperclip — feat: Agent circuit breaker issue. https://github.com/paperclipai/paperclip/issues/390

  52. DEV.to — ralph-claude-code: The Technology to “Stop” AI Agents. https://dev.to/tumf/ralph-claude-code-the-technology-to-stop-ai-agents-how-the-circuit-breaker-pattern-prevents-3di4

  53. MindStudio — AI Agent Failure Pattern Recognition. https://www.mindstudio.ai/blog/ai-agent-failure-pattern-recognition

  54. A Study of Thought-Action-Result Trajectories (ASE 2025). https://software-lab.org/publications/ase2025_trajectories.pdf

  55. Medium — The Agent Loop Problem: When “Smart” Won’t Stop. https://medium.com/@Modexa/the-agent-loop-problem-when-smart-wont-stop-ccbf8489180f

  56. Detecting Sleeper Agents in Large Language Models via Semantic Drift Analysis (2025). https://arxiv.org/html/2511.15992

  57. LangChain GitHub — Agent stopped due to iteration limit or time limit. https://github.com/langchain-ai/langchain/discussions/27264

  58. NousResearch hermes-agent issue — Iteration Budget Pressure. https://github.com/NousResearch/hermes-agent/issues/414

  59. NIST AI Risk Management Framework — Generative AI Profile (NIST AI 600-1). https://www.nist.gov/itl/ai-risk-management-framework

  60. Anthropic Cookbook — Automatic Context Compaction. https://platform.claude.com/cookbook/tool-use-automatic-context-compaction

  61. Microsoft Agent Framework — Compaction. https://learn.microsoft.com/en-us/agent-framework/agents/conversations/compaction

  62. LangChain Blog — Autonomous context compression. https://blog.langchain.com/autonomous-context-compression/

  63. Morph — Cursor Context Window (2026). https://www.morphllm.com/cursor-context-window

  64. Anthropic — Prompt Caching. https://www.anthropic.com/news/prompt-caching

  65. ProjectDiscovery — How We Cut LLM Costs by 59% With Prompt Caching. https://projectdiscovery.io/blog/how-we-cut-llm-cost-with-prompt-caching

  66. vLLM Router Release Blog (2025). https://blog.vllm.ai/2025/12/13/vllm-router-release.html

  67. Prefix Caching — SGLang vs vLLM. https://medium.com/byte-sized-ai/prefix-caching-sglang-vs-vllm-token-level-radix-tree-vs-block-level-hashing-b99ece9977a1

  68. Truefoundry — Rate Limiting in AI Gateway: The Ultimate Guide. https://www.truefoundry.com/blog/rate-limiting-in-llm-gateway

  69. DEV.to — Building Hierarchical Budget Controls for Multi-Tenant LLM Gateways. https://dev.to/pranay_batta/building-hierarchical-budget-controls-for-multi-tenant-llm-gateways-ceo

  70. DEV.to — PII-aware routing. https://dev.to/micelclaw/pii-aware-routing-how-to-use-cloud-ai-and-keep-your-sensitive-data-local-1m40

  71. Blue Prism — AI Gateway for PII Sanitization. https://www.blueprism.com/resources/blog/ai-gateway-pii-sanitization/

  72. Microsoft Presidio. https://github.com/microsoft/presidio

  73. PII Tools. https://pii-tools.com/

  74. LangChain Guardrails. https://docs.langchain.com/oss/python/langchain/guardrails

  75. Datadog — LLM guardrails: Best practices. https://www.datadoghq.com/blog/llm-guardrails-best-practices/

  76. Wiz — AI Guardrails: Safety Controls for Responsible AI Use. https://www.wiz.io/academy/ai-security/ai-guardrails

  77. LangGraph Interrupts. https://docs.langchain.com/oss/python/langgraph/interrupts

  78. LangGraph Static Breakpoints. https://langchain-ai.github.io/langgraph/cloud/how-tos/human_in_the_loop_breakpoint/

  79. LLM Routing with Dueling Feedback (2025). https://arxiv.org/html/2510.00841

  80. I Built a Policy Drift Detector for LLM Agents. DEV.to (2025). https://dev.to/gnomeman4201/i-built-a-policy-drift-detector-for-llm-agents-heres-what-four-versions-taught-me-2be

  81. RIVA: Leveraging LLM Agents for Reliable Configuration Drift Detection (2026). https://arxiv.org/pdf/2603.02345v1

  82. OpenTelemetry — Semantic conventions for generative AI systems. https://opentelemetry.io/docs/specs/semconv/gen-ai/

  83. OpenTelemetry — Semantic Conventions for GenAI agent and framework spans. https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/

  84. Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents (2026). https://arxiv.org/abs/2603.26233

  85. Structured Uncertainty guided Clarification for LLM Agents (2025). https://arxiv.org/html/2511.08798v1

  86. AbstentionBench: Reasoning LLMs Fail on Unanswerable Questions (NeurIPS 2025). https://arxiv.org/pdf/2506.09038