Skip to content

Orchestrator Companion Audit — Non-Routing Surface Critique & Improvement Plan

Orchestrator Companion Audit — Non-Routing Surface Critique & Improvement Plan

Section titled “Orchestrator Companion Audit — Non-Routing Surface Critique & Improvement Plan”

What this document covers. Everything in the vox-orchestrator system except model selection, model catalog, and vox-secrets secret distribution — those are covered by model-orchestration-ssot-audit-2026.md (FIX-01..75) and nextgen-orchestrator-research-2026.md (P0–P3 gaps). Items here that touch routing are labelled Prior audit ref: FIX-NN and are net-new extensions, not duplicates.

In scope:

  • crates/vox-orchestrator/src/ — all modules except models/ and catalog.rs
  • crates/vox-orchestrator/src/mcp_tools/ — all eight tool subdirectories
  • crates/vox-db/src/store/ops_orchestrator.rs
  • contracts/communication/orchestrator-persistence-outbox.schema.json
  • .vox/agents/orchestrator.md, .vox/agents/vox-orchestrator.md, docs/agents/orchestrator.md, crates/vox-skills/skills/orchestrator.skill.md
  • crates/vox-integration-tests/tests/orchestrator_*.rs
  • docs/src/adr/022-orchestrator-bootstrap-and-daemon-boundaries.md

Out of scope for this document: crates/vox-orchestrator/src/models/, catalog.rs, scoring.rs, key_guard.rs, crates/vox-secrets/, Populi mesh transport (covered separately by vox-populi architecture docs).


Each row is one FIX item. Columns: ID | surface | axis | concern | P | effort | title.

IDSurfaceAxisConcernPEffortTitle
FIX-A-01orchestrator-corehygienemodularityP2MFlatten Arc<RwLock> nesting with typed guard wrappers
FIX-A-02orchestrator-coreriskcorrectnessP1MGuard every db.read().as_ref() dereference against None
FIX-A-03orchestrator-corehygienemodularityP2LSplit Orchestrator struct into domain-bounded sub-stores
FIX-A-04orchestrator-coreriskcorrectnessP1SEnforce feature-contract: runtime feature ↔ runtime fields
FIX-A-05orchestrator-corehygienecontractsP3SDocument Orchestrator state invariants in a module-level doc-comment
FIX-A-06orchestrator-coreriskcorrectnessP1MAdd state-machine guard for agent lifecycle transitions
FIX-B-01runtimeriskcorrectnessP0MExtract AiTaskProcessor::process() into focused sub-functions
FIX-B-02runtimeriskcorrectnessP1SReplace UUID split().next().unwrap() with validated parse
FIX-B-03runtimeriskcorrectnessP1SUpgrade AtomicUsize::fetch_add to Ordering::AcqRel for spawn counter
FIX-B-04runtimehygienemodularityP2MSplit ActorAgent::handle_command ProcessQueue branch into helper
FIX-B-05runtimehygienemodularityP2MExtract queue-lock-find-update pattern into orchestrator helper
FIX-B-06runtimeriskcorrectnessP1MAudit Mutex lock inside reconciled_cost closure for deadlock
FIX-B-07runtimeriskcorrectnessP1SAdd structured error type; remove bare String errors in process()
FIX-B-08runtimecapabilityobservabilityP2SEmit structured trace span at start of each TaskPhase transition
FIX-B-09runtimeperfperfP2MDeduplicate provider match arm (10-branch) into a method
FIX-B-10runtimeriskcorrectnessP1MVerify async cancel safety in handle_command under task abort
FIX-B-11runtimecapabilitycapabilityP1LImplement doom-loop detector (cost_delta/progress_delta watchdog)
FIX-B-12runtimeriskcorrectnessP2SCheck_scaling: guard AgentFleet span line 657 with overflow-safe arithmetic
FIX-C-01handoffhygienemodularityP3SExtract JSON find+parse into a single parse_metadata_key() helper
FIX-C-02handoffriskcorrectnessP2SAggregate transcript-bleed errors instead of three separate calls
FIX-C-03handoffriskcorrectnessP2SFix double-parse of HARNESS_SPEC_JSON_METADATA_KEY when no envelope
FIX-C-04handoffhygienecontractsP2MWrite a JSON Schema for ContextEnvelope and validate at handoff boundary
FIX-C-05handoffrisktestingP1MAdd tests: handoff with no verification_criteria, with bleed, with bad harness JSON
FIX-C-06handoffcapabilityobservabilityP2SEmit handoff telemetry event with invariant check results
FIX-C-07handoffriskcorrectnessP2SValidate harness spec side-effects before accepting as invariant
FIX-D-01eventsriskcorrectnessP1SReplace fetch_add(Relaxed) with AcqRel for event ID ordering
FIX-D-02eventsriskcorrectnessP1SLog dropped events when broadcast channel is at capacity
FIX-D-03eventshygienemodularityP2LSplit AgentEventKind (50+ variants) into TaskEvent/AgentEvent/SystemEvent sub-enums
FIX-D-04eventshygienecontractsP2MReplace string IDs in ActivityStarted/Completed/Retried with AgentId/TaskId
FIX-D-05eventshygienemodularityP3MExtract TaskStarted/Failed/Completed shared fields into a TaskEventCore struct
FIX-D-06eventscapabilityobservabilityP2SAdd subscriber_count telemetry metric for event bus health
FIX-D-07eventsriskcorrectnessP2MDefine and enforce event channel capacity; document lossy vs lossless lanes
FIX-D-08eventscapabilitycapabilityP2MAdd lock-propagation event variants for multi-agent schema-mutation coordination
FIX-E-01groundingriskcorrectnessP1SReplace expect(“char boundary”) at line 235 with safe UTF-8 boundary check
FIX-E-02groundinghygienemodularityP2MDecompose split_summary_into_claim_segments() 132-line function into named steps
FIX-E-03groundinghygienemodularityP3MExternalize English phrase list in classify_line_claim_kind into config/YAML
FIX-E-04groundingrisktestingP2MAdd tests: emoji input, surrogate pairs, German mixed-script, empty summary
FIX-E-05groundinghygienemodularityP3SDeduplicate str/strasse suffix checks in locale detection helper
FIX-E-06groundingcapabilitycapabilityP2LAdd entropy/confidence scoring from token probabilities to ground truth check
FIX-F-01usage-finopsriskperfP0MReplace global static Mutex in record_call_detailed (line 256) with async channel
FIX-F-02usage-finopsriskcorrectnessP1SValidate JSON shape before as_u64()/as_f64() fallbacks (lines 265-268)
FIX-F-03usage-finopshygienemodularityP3SReplace custom today() date calc with chrono::Utc::now().date_naive()
FIX-F-04usage-finopshygienemodularityP3SRemove record_call() wrapper that just calls record_call_detailed() with constants
FIX-F-05usage-finopscapabilitycapabilityP1LAdd per-tenant and per-fleet-segment budget caps alongside global cap
FIX-F-06usage-finopscapabilityobservabilityP2SExpose real-time budget utilization via a /v1/budget/status HTTP endpoint
FIX-F-07usage-finopsriskcorrectnessP2MWrite unit tests: budget exhaustion halts task routing; tenant cap isolation
FIX-F-08usage-finopscapabilitycapabilityP2MImplement pre-execution token estimation using tool output length prediction
FIX-G-01mcp-dispatchriskcorrectnessP1SAdd per-tool execution timeout in handle_tool_call_inner (no timeout today)
FIX-G-02mcp-dispatchriskcorrectnessP1MNormalize 960-line match into a tool registry (name → handler fn ptr)
FIX-G-03mcp-dispatchrisksecurityP1SMove hardcoded sentinel strings (SYSTEM_INTERVENTION, etc.) to typed constants
FIX-G-04mcp-dispatchriskcorrectnessP2SReplace parse::().unwrap_or(0) AgentId/TaskId fallbacks with explicit error
FIX-G-05mcp-dispatchcapabilityobservabilityP2SEmit dispatch latency span per tool call with tool_name attribute
FIX-G-06mcp-dispatchcapabilitycapabilityP2MAdd per-tool rate limiter configurable from Vox.toml [orchestrator.tool_limits]
FIX-G-07mcp-dispatchhygienemodularityP3LSplit dispatch.rs into per-tool-group modules (chat, scientia, llm, memory, task, dei)
FIX-G-08mcp-dispatchriskcorrectnessP2SAdd backpressure: reject tool call when agent queue exceeds configurable high-watermark
FIX-G-09mcp-dispatchhygienecontractsP2MDefine a ToolResult error taxonomy enum (Budget/Auth/Validation/Internal/Timeout)
FIX-H-01chat-toolsriskcorrectnessP1SFix .expect(“telemetry JSON must parse”) at mod.rs:176 — return error instead
FIX-H-02chat-toolsriskcorrectnessP1SFix .expect(“search_refinement field”) at mod.rs:214 — guard with if-let
FIX-H-03chat-toolsriskcorrectnessP1SFix .unwrap() on static Regex::new() in mentions.rs:9 — use expect with message
FIX-H-04chat-toolshygienemodularityP1LDecompose plan_goal() (514 lines, plan.rs:186-699) into planning pipeline stages
FIX-H-05chat-toolshygienemodularityP2MDecompose maybe_refine_plan() (337 lines, plan_loop.rs:151-487)
FIX-H-06chat-toolshygienecontractsP2SMove hardcoded token caps (3072/4096/8192) in plan.rs:55-60 to Vox.toml
FIX-H-07chat-toolshygienecontractsP2SMake max_tasks default (plan.rs:187) and max_refine_rounds cap (plan_loop.rs:119) configurable
FIX-H-08chat-toolshygienecontractsP2SMove reserve_tokens=4096 and refine_budget=18000 constants to config
FIX-H-09chat-toolshygienecontractsP2SMake refine threshold 0.28 (plan_loop.rs:162) and inadequacy defaults configurable
FIX-H-10chat-toolsriskcorrectnessP2MAdd retry limit enforcement in plan_loop refinement; guard against infinite refine
FIX-H-11chat-toolscapabilityobservabilityP2SEmit plan generation span with task_count, depth, token_budget attributes
FIX-H-12chat-toolsrisktestingP2MAdd tests: plan with zero tasks, plan exceeding budget, replan with empty base
FIX-H-13chat-toolshygienemodularityP3SRemove .unwrap_or(None).unwrap_or(0) chained option at plan.rs:495
FIX-I-01scientia-toolshygienecontractsP2SMove hardcoded score thresholds (0.85, 0.62, 0.58) to config/YAML
FIX-I-02scientia-toolshygienecontractsP2SMove common.rs:87 evidence threshold (0.85) to Vox.toml [orchestrator.scientia]
FIX-I-03scientia-toolsriskcorrectnessP2MAdd retry limit and timeout guard on external.rs job submission/replay loops
FIX-I-04scientia-toolscapabilityobservabilityP2SEmit scholar lifecycle events (submit, discover, publish) as typed AgentEventKind variants
FIX-I-05scientia-toolsrisktestingP2MAdd tests: novelty check with empty corpus, discovery with no results, lifecycle failure paths
FIX-I-06scientia-toolshygienemodularityP2LBreak novelty.rs and scholar.rs functions >100 lines into named computation steps
FIX-I-07scientia-toolsriskcorrectnessP2SAdd backpressure on discovery refresh calls: enforce min_refresh_interval
FIX-J-01llm-bridgeriskcorrectnessP0MAdd explicit HTTP timeout Duration to all provider HTTP clients in llm_bridge/
FIX-J-02llm-bridgehygienecontractsP1MReplace Result<T, String> error type in infer.rs with typed ProviderError enum
FIX-J-03llm-bridgehygienemodularityP1LDecompose mcp_infer_tool_completion() (390 lines, infer.rs:210-599)
FIX-J-04llm-bridgehygienecontractsP2SReplace hardcoded vision token estimate (1000) with per-model config lookup
FIX-J-05llm-bridgeriskcorrectnessP2MAdd exponential backoff on HTTP 429/408/504 (detected but no backoff today)
FIX-J-06llm-bridgecapabilityobservabilityP2SEmit provider attempt span with status_code, attempt_number, provider_name
FIX-J-07llm-bridgecapabilitycapabilityP2MAdd schema-aware message translation layer for Anthropic tool-call alternation
FIX-J-08llm-bridgeriskcorrectnessP2SValidate provider response schema before silent .unwrap_or_default() fallbacks
FIX-J-09llm-bridgehygienemodularityP3MConsolidate per-provider adapters behind a single ProviderAdapter trait impl
FIX-J-10llm-bridgecapabilitycapabilityP2MAdd PII-sensitive routing: AgentTask.sensitivity_marker gates provider selection
FIX-K-01http-gatewayrisksecurityP0SRemove DEBUG println! at mod.rs:176-177 (env var leak to logs)
FIX-K-02http-gatewayrisksecurityP0SFix origin_guard.rs starts_with URL matching (prefix bypass: 127.0.0.1.attacker.com)
FIX-K-03http-gatewayrisksecurityP1MAdd request body size limit on eval endpoint; reject EvalRequest.code > 64KiB
FIX-K-04http-gatewayrisksecurityP1MMove WebSocket auth-on-first-message to connection handshake; reject unauthenticated upgrades
FIX-K-05http-gatewayrisksecurityP1SRemove query-param token fallback in ws.rs (?token=, ?bearer=); tokens in URLs appear in logs
FIX-K-06http-gatewayrisksecurityP1SFix constant_time_eq: ensure max(a,b) branch doesn’t leak token length via timing
FIX-K-07http-gatewayrisksecurityP1SAudit bearer-token-to-dashboard-token downgrade (mod.rs:227); document role mapping
FIX-K-08http-gatewayriskcorrectnessP1SAdd SIGTERM/SIGINT graceful shutdown handler to the daemon and gateway
FIX-K-09http-gatewaycapabilityopsP1SAdd /healthz and /readyz endpoints for k8s-style probe routing
FIX-K-10http-gatewayriskcorrectnessP1SReplace fragile unwrap() after guarded branch in ws.rs:87 with pattern-match
FIX-K-11http-gatewayhygienemodularityP2MDecompose spawn_http_gateway_if_enabled (124 lines) into named setup phases
FIX-K-12http-gatewayhygienecontractsP2SMake dashboard token TTL (30 days, token.rs:19) configurable via Vox.toml
FIX-K-13http-gatewayriskcorrectnessP2SValidate token.rs file parse failure: log and require re-auth rather than silently regenerate
FIX-K-14http-gatewayrisksecurityP2SEnforce Windows ACL semantics for token file (token.rs platform path); match Unix mode 0o600
FIX-K-15http-gatewaycapabilityopsP2MAdd per-IP rate limiting to eval endpoint (CPU DoS risk, eval.rs:87)
FIX-K-16http-gatewayhygienecontractsP2SCentralise auth role-decision logic; remove cross-layer auth in eval.rs vs origin_guard.rs
FIX-K-17http-gatewayhygienemodularityP3MPre-compute role-to-tool permission matrix at startup instead of per-call check
FIX-K-18http-gatewaycapabilityobservabilityP2SAdd structured access log with trace_id, role, tool_name on every gateway request
FIX-L-01memory-toolshygienecontractsP3SMove hardcoded item limits (20, 10) and confidence defaults (0.5, 1.0) to config
FIX-L-02memory-toolscapabilityobservabilityP3SPropagate REM_* remediation constants to all handler error paths consistently
FIX-L-03memory-toolsriskcorrectnessP2MAdd rate limiting on memory store/recall operations per agent session
FIX-L-04memory-toolsrisktestingP2MAdd tests: store near capacity, concurrent reads/writes, session isolation
FIX-L-05memory-toolshygienemodularityP3MDocument memory isolation guarantees: can agent A read agent B’s scratchpad?
FIX-M-01task-toolshygienemodularityP2MDeduplicate companion lookup/upsert pattern (3× in lifecycle.rs) into a helper
FIX-M-02task-toolsriskcorrectnessP1SLog errors from fire-and-forget companion gamification async tasks
FIX-M-03task-toolsriskcorrectnessP2SValidate priority string at parse time; reject unknown priority rather than defaulting
FIX-M-04task-toolsrisktestingP2MAdd tests: task priority parsing, companion update failure, lifecycle abort
FIX-M-05task-toolscapabilityobservabilityP2SEmit task lifecycle events (submit/start/complete/cancel) as typed events via EventBus
FIX-N-01dei-shimhygienemodularityP1LPlan and execute dei_shim (2769 LOC) retirement: audit callers, migrate, gate with feature flag
FIX-N-02dei-shimhygienemodularityP2MAudit dei_tools/ for callers of shim functions; migrate each to canonical module
FIX-N-03dei-shimhygienedocsP2SAdd #[deprecated] attribute to all dei_shim public symbols with migration path
FIX-N-04dei-shimhygienemodularityP2MConfirm orchestrator_snapshot.rs in dei_tools/ still needed; extract to dei-independent path
FIX-N-05dei-shimhygieneopsP3SAdd CI lint: fail if any new non-test file imports from dei_shim
FIX-O-01daemonriskcorrectnessP1MAdd SIGTERM/SIGINT handler with graceful orchestrator flush in vox_orchestrator_d.rs
FIX-O-02daemonriskcorrectnessP1STreat init_db failure (line 75) as fatal with structured error; do not continue
FIX-O-03daemonriskcorrectnessP1SReturn Result from SessionManager initialization (line 120); eliminate panic!
FIX-O-04daemoncapabilityopsP1SAdd /healthz readiness probe (FIX-K-09 provides HTTP; daemon needs to wire it)
FIX-O-05daemonhygienedocsP2MUpdate ADR-022 with Phase B TCP/stdio RPC flag matrix current status
FIX-O-06daemonriskcorrectnessP2SLog and alarm if spawn_http_gateway_if_enabled returns Err; daemon should not silently continue
FIX-O-07daemoncapabilityopsP2MAdd —dry-run flag to daemon startup: validates config + DB without binding sockets
FIX-O-08daemoncapabilityopsP2MEmit structured startup metrics (uptime, build_sha, feature_flags) on boot
FIX-O-09daemonhygienedocsP3SWrite a new ADR for daemon split-plane RPC flag matrix (Phase B from ADR-022)
FIX-P-01persistence-outboxriskcorrectnessP1MGuard fence token increment in ops_orchestrator.rs for i64::MAX overflow
FIX-P-02persistence-outboxriskcorrectnessP1MAdd clock-skew tolerance (±5s window) to distributed lock TTL comparison
FIX-P-03persistence-outboxhygienecontractsP2MAdd schema version field to orchestrator-persistence-outbox.schema.json
FIX-P-04persistence-outboxhygienecontractsP2SReplace additionalProperties:true in replayPayload with explicit discriminated union
FIX-P-05persistence-outboxriskcorrectnessP2MWrite idempotence tests for every outbox replay_op (all 6 operations)
FIX-P-06persistence-outboxcapabilityobservabilityP2SAdd lifecycle tick observability: emit metric for queued/pruned/retried/replayed counts
FIX-P-07persistence-outboxriskcorrectnessP2MClarify dual Result<Result<T, String>, StoreError> return in acquire_distributed_lock
FIX-P-08persistence-outboxhygienedocsP3SWrite a mini-ADR for degraded-mode persistence outbox semantics and replay ordering
FIX-P-09persistence-outboxriskcorrectnessP2MAdd circuit-breaker recovery test: simulate DB unavailability then reconnect
FIX-P-10persistence-outboxcapabilityobservabilityP2STrack outbox queue depth as a Prometheus-compatible gauge exposed via /metrics
FIX-Q-01planninghygienecontractsP2SReplace string-based policy selection (policy.rs:6-10) with a typed PlanPolicy enum
FIX-Q-02planningriskcorrectnessP2MAdd maximum recursion depth guard to router to prevent infinite replan loops
FIX-Q-03planningrisktestingP2MAdd tests: plan synthesis with no nodes, replan after partial completion, policy conflict
FIX-Q-04planningcapabilityobservabilityP2SEmit planning decision telemetry: chosen policy, node_count, synthesis_latency_ms
FIX-Q-05planninghygienemodularityP3MDocument planning module’s relationship to chat_tools/plan.rs; separate responsibilities
FIX-R-01scaling-serviceshygienemodularityP2MExtract services/ embeddings, routing-policy, gateway helpers into typed service traits
FIX-R-02scaling-servicesriskcorrectnessP2MAdd invariant test: ScalingAction decisions are idempotent under concurrent check_scaling
FIX-R-03scaling-servicescapabilityobservabilityP2SEmit scaling decision events (scale_up/scale_down/no_op) with agent_count, load metrics
FIX-R-04scaling-serviceshygienecontractsP3SDocument scaling cooldown logic; make cooldown_secs configurable via Vox.toml
FIX-R-05scaling-servicesriskcorrectnessP2SVerify check_scaling arithmetic is overflow-safe for large agent counts (line 657)
FIX-S-01agent-promptshygieneagent-promptsP2MReconcile MCP tool lists across .vox/agents/orchestrator.md and docs/agents/orchestrator.md
FIX-S-02agent-promptshygieneagent-promptsP2MAudit .vox/agents/orchestrator.md tool categories against dispatch.rs tool registry
FIX-S-03agent-promptshygieneagent-promptsP2SAdd explicit version/date stamp to each agent prompt so staleness is detectable
FIX-S-04agent-promptshygieneagent-promptsP2MConsolidate Socrates gate policy between .vox/agents/orchestrator.md and docs/agents/orchestrator.md
FIX-S-05agent-promptshygieneagent-promptsP3SDefine canonical read vs. write tool categories in vox-orchestrator.md scope declaration
FIX-S-06agent-promptshygieneagent-promptsP3MReview orchestrator.skill.md for drift from crate layout (agent lifecycle, handoff, budget)
FIX-S-07agent-promptscapabilityagent-promptsP2MAdd doom-loop intervention policy to orchestrator.md (cost/progress watchdog instructions)
FIX-S-08agent-promptscapabilityagent-promptsP2MAdd lock-propagation awareness: when agent performs schema mutation, instruct broadcast
FIX-S-09agent-promptshygienedocsP3SLink each agent prompt to its governing ADR and SSOT section
FIX-T-01testsrisktestingP0MAdd tests: Socrates gate enforce mode (task blocked), shadow mode (task proceeds but logged)
FIX-T-02testsrisktestingP1MAdd tests: trust-gate-relax path (agent reliability below threshold)
FIX-T-03testsrisktestingP1MAdd tests: handoff invariant failures (each of the 6 invariants violated independently)
FIX-T-04testsrisktestingP1MAdd tests: conflict resolution between concurrent agents modifying same file
FIX-T-05testsrisktestingP1LAdd tests: outbox degraded-mode round-trip (DB offline → queue → reconnect → replay)
FIX-T-06testsrisktestingP1MAdd tests: doom-loop detector triggers under simulated runaway cost
FIX-T-07testsrisktestingP1MAdd tests: distributed lock expiry and foreign-holder rejection
FIX-T-08testsrisktestingP2MAdd tests: event bus capacity reached; verify dropped events are logged
FIX-T-09testsrisktestingP2MAdd tests: multi-agent task fan-out with budget exhaustion mid-flight
FIX-T-10testshygienetestingP2MUnify MCP tool tests from inline #[cfg(test)] into a coherent per-tool test module
FIX-T-11testsrisktestingP2MAdd tests: daemon graceful shutdown drains outbox before exiting
FIX-T-12testsrisktestingP2MAdd tests: gateway origin guard — loopback pass, external reject, prefix-bypass blocked
FIX-T-13testshygienetestingP3SAdd deterministic seeds to e2e tests (orchestrator_e2e_test.rs) for reproducibility
FIX-T-14testshygienetestingP3SAdd bootstrap parity test for all Orchestrator fields, not only repository_id and shard paths
FIX-U-01contractshygienecontractsP2MDefine contracts/communication/handoff-envelope.v1.schema.json; validate in handoff.rs
FIX-U-02contractshygienecontractsP2MDefine contracts/communication/agent-event.v1.schema.json aligned with AgentEventKind
FIX-U-03contractshygienecontractsP2MDefine contracts/communication/mcp-dispatch-error.v1.schema.json for ToolResult errors
FIX-U-04contractshygienecontractsP3SAdd schema version field to all orchestrator contract JSON files
FIX-U-05contractshygienecontractsP3SAdd vox ci contracts-validate guard checking all contracts/ JSON against their schemas
FIX-V-01adrs-docshygienedocsP2MUpdate ADR-022 Phase B section: document current RPC flag matrix and remaining gaps
FIX-V-02adrs-docshygienedocsP2SWrite ADR-023: event bus semantics (lossy vs lossless lanes, channel capacity policy)
FIX-V-03adrs-docshygienedocsP2SWrite ADR-024: distributed lock fence-token discipline and clock-skew policy
FIX-V-04adrs-docshygienedocsP3SAdd “related ADRs” footer to each orchestrator ADR linking to others
FIX-V-05adrs-docshygienedocsP3SUpdate docs/agents/orchestrator.md to reflect dei_shim retirement plan (FIX-N-01)
FIX-X-01error-handlingriskcorrectnessP1LAudit all unwrap()/expect() in crates/vox-orchestrator/src/**/*.rs; replace panic paths
FIX-X-02error-handlinghygienemodularityP1MDefine OrchestratorError enum with variants per subsystem; propagate through public API
FIX-X-03error-handlinghygienecontractsP2MStandardize on err_with_remediation() pattern across all MCP tool handlers
FIX-X-04error-handlinghygienemodularityP2MAdd error context (file, agent_id, task_id) to every error propagation in the call stack
FIX-X-05error-handlingriskcorrectnessP2SBan bare String as error type in Result<_, String>; enforce with a clippy lint
FIX-X-06error-handlinghygienemodularityP3SCentralise all REM_* remediation string constants into one remediation.rs module
FIX-Y-01async-cancelriskcorrectnessP1MAudit all async fn holding RwLock guard across .await boundaries; replace with scope guard
FIX-Y-02async-cancelriskcorrectnessP1MAudit tokio::select! arms for dropped futures holding locks; ensure cleanup on cancel
FIX-Y-03async-cancelriskcorrectnessP2MAdd cancel-safety documentation to every async fn in runtime.rs and orchestrator.rs
FIX-Y-04async-cancelrisktestingP2MAdd test: task abort mid-execution leaves no orphan lock or outbox entry
FIX-Z-01observabilitycapabilityobservabilityP1MAdd slow-operation detector: emit WARN if any tool handler exceeds 5s wall-clock
FIX-Z-02observabilitycapabilityobservabilityP2SEnforce tracing span names follow vox_orchestrator:: convention (no vox_dei::)
FIX-Z-03observabilitycapabilityobservabilityP2MAdd log-level audit: DEBUG in production code; demote or gate behind —verbose flag
FIX-Z-04observabilitycapabilityobservabilityP2SAdd metric: active_agents gauge, queue_depth gauge, handoff_count counter
FIX-Z-05observabilitycapabilityobservabilityP3MEmit structured startup audit log listing feature flags, config sources, DB connection state
FIX-Z-06observabilityhygienedocsP3SDocument metric naming convention in docs/src/reference/observability.md
FIX-AA-01securityrisksecurityP0SRemove DEBUG println! (already filed as FIX-K-01 — cross-reference only)
FIX-AA-02securityrisksecurityP1MInstall tracing redaction middleware across all orchestrator spans (see prior FIX-53)
FIX-AA-03securityrisksecurityP1MVerify no SecretSpec-managed value reaches a tool handler parameter or log line
FIX-AA-04securityrisksecurityP2MAdd replay-attack protection to A2A delivery: reject duplicate jwe_payload nonces
FIX-AA-05securityrisksecurityP2MAdd SSRF guard in eval endpoint: block outbound network from eval subprocess
FIX-AA-06securityrisksecurityP2SDocument authenticated surface area in a threat model table in docs/src/security/
FIX-AA-07securityhygienesecurityP3MRun cargo-audit and cargo-deny on vox-orchestrator in CI; gate on advisories
FIX-AB-01lifecycleriskcorrectnessP1MImplement Drop for Orchestrator: flush outbox, drain event bus, close DB handle
FIX-AB-02lifecycleriskcorrectnessP1MEnsure shutdown ordering: stop accepting tasks → drain queue → flush outbox → close DB
FIX-AB-03lifecycleriskcorrectnessP2MAdd RAII guard for distributed lock: auto-release on drop even if holder panics
FIX-AB-04lifecycleriskcorrectnessP2MDocument and test restart determinism: same repo → same shard paths and memory layout
FIX-AB-05lifecyclerisktestingP2MAdd test: orchestrator restart after crash recovers from outbox replay without duplicates

Each cluster has a Design Note (architectural context) followed by its FIX items (pulled from Part 2 above by ID prefix). Use the concern index for cross-referencing.


Cluster A — Core orchestrator.rs and Submodule Tree

Section titled “Cluster A — Core orchestrator.rs and Submodule Tree”

Design Note. The Orchestrator struct (orchestrator.rs:53-128) is a 75-field god object where every major subsystem — affinity map, lock manager, context store, budget, event bus, agent registry, JJ oplog, conflict manager, snapshot, runtime handles, DB handle — is held behind Arc<RwLock<T>> without consistency invariants between fields. The optional db: Option<Arc<VoxDb>> at line 98 is consumed across ~20 downstream sites without a unified access pattern, creating silent None-dereference risk. The submodule split into core, agent_lifecycle, scaling, vcs_ops helps but the struct root remains an anti-pattern against domain cohesion. The goal is to move toward bounded sub-stores with typed accessors, deferring a full God-object split to Cluster B (runtime) and Cluster R (scaling).

FIX-A-01 [OPEN] surface: orchestrator-core | axis: hygiene | concern: modularity | P2 | effort: M

  • Problem. Every field in Orchestrator is wrapped Arc<RwLock<T>> with no shared consistency boundary. Write to two related fields (e.g., agent registry + task assignments) requires locking each separately; no atomic update across the pair.
  • Operation. Introduce typed guard wrappers: AgentRegistryGuard, TaskQueueGuard. Each wraps a coarse RwLock covering all related fields. Replace raw Arc<RwLock<T>> at each related field cluster with a single shared guard type.
  • Success. rg 'Arc<RwLock' crates/vox-orchestrator/src/orchestrator.rs returns ≤10 hits (from ~40 today). Existing tests green.

FIX-A-02 [OPEN] surface: orchestrator-core | axis: risk | concern: correctness | P1 | effort: M

  • Problem. self.db.read().unwrap().as_ref() dereference pattern at numerous call sites. If db is None (offline mode) the call site must handle it; inconsistent handling produces silent no-ops or panics.
  • Operation. Add fn db_required(&self) -> Result<Arc<VoxDb>, OrchestratorError> and fn db_optional(&self) -> Option<Arc<VoxDb>>. Replace all direct field reads. Callers that require DB explicitly propagate OrchestratorError::DatabaseUnavailable.
  • Success. rg '\.db\.read\(\)' crates/vox-orchestrator/src returns zero hits outside the new accessor.

FIX-A-03 [OPEN] surface: orchestrator-core | axis: hygiene | concern: modularity | P2 | effort: L

  • Problem. 75-field struct resists reading and modification. Adding a new subsystem requires editing the god object.
  • Operation. Group fields into four domain structs: AgentDomain { agents, groups, task_assignments, task_traces }, VcsDomain { snapshot, oplog, conflict_manager, workspace_manager }, BudgetDomain { budget, usage }, InfraDomain { db, event_bus, runtime_handle }. Expose via fn agents(&self) -> &AgentDomain, etc.
  • Success. Orchestrator struct body ≤20 fields; domain struct fields are its own concern.

FIX-A-04 [OPEN] surface: orchestrator-core | axis: risk | concern: correctness | P1 | effort: S

  • Problem. Runtime handles at orchestrator.rs:84-86 are #[cfg(feature = "runtime")] but DB at line 98 is unconditional. A build without the runtime feature still requires DB, producing asymmetric feature contracts.
  • Operation. Audit all feature-gated fields. Either gate DB access behind the same feature or document explicitly why they differ (add a // INVARIANT: comment).
  • Success. cargo check --no-default-features -p vox-orchestrator produces zero unexpected type errors.

FIX-A-05 [OPEN] surface: orchestrator-core | axis: hygiene | concern: contracts | P3 | effort: S

  • Problem. No invariant documentation exists on the struct. A developer reading it cannot know valid state transitions.
  • Operation. Add a //! # State invariants block at the top of orchestrator.rs listing: (1) db can be None only in offline mode, (2) agents map keys match task_assignments keys, (3) event_bus is always initialized.
  • Success. Block exists; reviewed in code review.

FIX-A-06 [OPEN] surface: orchestrator-core | axis: risk | concern: correctness | P1 | effort: M

  • Problem. Agent lifecycle transitions (idle → assigned → running → completed) have no state-machine guard. Callers can set contradictory states.
  • Operation. Define AgentState enum; add transition(&mut self, to: AgentState) -> Result<()> that validates the transition table. Reject invalid transitions with OrchestratorError::InvalidStateTransition.
  • Success. Test: attempt idle→completed transition returns Err; idle→assigned→running→completed returns Ok.

Cluster B — Runtime Supervisor (runtime.rs)

Section titled “Cluster B — Runtime Supervisor (runtime.rs)”

Design Note. The AiTaskProcessor::process() function at runtime.rs:139-417 (279 lines) combines model routing, phase-loop execution, drift detection, cost reconciliation, and transcript persistence in a single method. This is the most complex function in the codebase and the highest-risk change target. The phase loop at lines 266-349 has embedded match TaskPhase::* control flow across ~80 lines. ActorAgent::handle_command (83 lines, lines 467-549) contains a ProcessQueue branch worth 60 lines of nested conditionals. Scaling check at lines 657-771 is independently well-factored. The panic at line 733 (UUID format assumption) is the only hard-crash path found.

FIX-B-01 [OPEN] surface: runtime | axis: risk | concern: correctness | P0 | effort: M

  • Problem. AiTaskProcessor::process() (279 lines) violates SRP: routing + phases + drift + persistence in one function makes testing and modification high-risk.
  • Operation. Extract: select_model_for_task(), execute_phase_loop() -> PhaseLoopResult, detect_drift(PhaseLoopResult) -> DriftDecision, persist_transcript(PhaseLoopResult). Keep process() as the orchestrating thin wrapper.
  • Success. process() body ≤60 lines; each extracted fn has its own unit test.

FIX-B-02 [OPEN] surface: runtime | axis: risk | concern: correctness | P1 | effort: S

  • Problem. runtime.rs:733.split('-').next().unwrap() on a UUID string. Panics if UUID format changes or value is empty.
  • Operation. Replace with s.split('-').next().ok_or(RuntimeError::MalformedUuid(s.to_string()))?.
  • Success. rg '\.split.*\.next\(\)\.unwrap\(\)' crates/vox-orchestrator/src/runtime.rs returns zero hits.

FIX-B-03 [OPEN] surface: runtime | axis: risk | concern: correctness | P1 | effort: S

  • Problem. runtime.rs:722-746 spawn counter uses AtomicUsize::fetch_add(1, Ordering::Relaxed). Concurrent scale-up decisions may read a stale counter, triggering over-spawning.
  • Operation. Change to Ordering::AcqRel for the fetch_add and Ordering::Acquire for loads.
  • Success. Test with two concurrent check_scaling calls: total spawned agents ≤ configured max.

FIX-B-04 [OPEN] surface: runtime | axis: hygiene | concern: modularity | P2 | effort: M

  • Problem. handle_command’s ProcessQueue branch (lines 474-532) is 60 lines with four levels of nesting. Hard to follow cancellation and error paths.
  • Operation. Extract process_queued_task(orchestrator: &Orchestrator, agent_id: AgentId) -> Result<ProcessOutcome> as a standalone function.
  • Success. handle_command ≤40 lines; process_queued_task has a dedicated unit test.

FIX-B-05 [OPEN] surface: runtime | axis: hygiene | concern: modularity | P2 | effort: M

  • Problem. Queue-lock → find_task_mut → update pattern duplicated at lines 278-286 and 392-410 with subtle differences masking bugs.
  • Operation. Add fn with_task_mut<F, R>(orchestrator: &Orchestrator, task_id: TaskId, f: F) -> Result<R> helper; use at both call sites.
  • Success. Duplication gone; both call sites share one implementation.

FIX-B-06 [OPEN] surface: runtime | axis: risk | concern: correctness | P1 | effort: M

  • Problem. runtime.rs:259-261 acquires a Mutex inside a reconciled_cost closure that may run while another Mutex is held. Potential deadlock under contention.
  • Operation. Move cost reconciliation outside the closure; compute the cost value first, then acquire the Mutex, then write.
  • Success. Deadlock stress test (1000 concurrent tasks) completes without hang.

FIX-B-07 [OPEN] surface: runtime | axis: hygiene | concern: contracts | P1 | effort: S

  • Problem. Several paths in process() return Err(String). String errors lose structured context and cannot be matched.
  • Operation. Add RuntimeError enum (variants: ModelNotFound, PhaseTimeout, DriftHalted, PersistenceFailed). Replace all Err("...") in runtime.rs.
  • Success. rg 'Err(String' crates/vox-orchestrator/src/runtime.rs returns zero hits.

FIX-B-08 [OPEN] surface: runtime | axis: capability | concern: observability | P2 | effort: S

  • Problem. No span emitted at TaskPhase transitions; phases are invisible in traces.
  • Operation. At start of each match TaskPhase::* arm, emit tracing::info_span!("phase", phase = %phase, task_id = %task_id).
  • Success. A trace for a multi-phase task shows one child span per phase.

FIX-B-09 [OPEN] surface: runtime | axis: perf | concern: perf | P2 | effort: M

  • Problem. runtime.rs:163-180 matches on 10 ProviderType variants to build a provider config. Same pattern exists in model resolution.
  • Operation. Move to ProviderType::to_route_config(&self) -> RouteConfig method; call it at both sites.
  • Success. rg 'ProviderType::OpenRouter\|ProviderType::Anthropic' crates/vox-orchestrator/src/runtime.rs returns zero hits (moved to impl block).

FIX-B-10 [OPEN] surface: runtime | axis: risk | concern: correctness | P1 | effort: M

  • Problem. handle_command at runtime.rs:452-454 receives orchestrator references across an .await. If the future is cancelled mid-call, the borrow may outlive the orchestrator’s expected state.
  • Operation. Audit with #[must_use] and tokio-cancel-safety doc tags. Replace borrowed references with owned Arc clones that are cancel-safe.
  • Success. cargo clippy with clippy::await_holding_refcell_ref equivalent passes; cancel test does not corrupt agent state.

FIX-B-11 [OPEN] surface: runtime | axis: capability | concern: capability | P1 | effort: L

  • Problem. No doom-loop detector exists. An agent spending unbounded tokens with no task progress can run indefinitely. (Identified as P0 gap in nextgen research §4.1.)
  • Operation. Track (token_spend_delta, task_progress_delta) per agent per window (configurable, default 5 min). If cost_delta / progress_delta > threshold for 3 consecutive windows → emit AgentEventKind::DoomLoopDetected → halt agent and notify operator.
  • Success. Test: agent that loops 100 iterations with no task completion is halted within 3 windows; trace shows DoomLoopDetected event.

FIX-B-12 [OPEN] surface: runtime | axis: risk | concern: correctness | P2 | effort: S

  • Problem. AgentFleet::check_scaling at line 657 performs arithmetic on agent counts without overflow protection.
  • Operation. Replace plain arithmetic with .checked_add() / .saturating_sub(); return ScalingError::ArithmeticOverflow if bounds exceeded.
  • Success. Test with usize::MAX agent count returns Err rather than panic.

Design Note. handoff.rs is the most structurally sound cluster in the codebase: invariant validation is well-separated from execution, and the six invariant checks are readable. Three issues undermine it: (1) three independent calls to detect_transcript_bleed() instead of one aggregating pass, (2) nearly identical JSON find+parse for two metadata keys creating copy-paste divergence risk, (3) double-parsing of HARNESS_SPEC_JSON_METADATA_KEY when no context envelope is found. FIX-C items are mostly S-effort cleanup with high return.

FIX-C-01 [OPEN] surface: handoff | axis: hygiene | concern: modularity | P3 | effort: S

  • Problem. JSON find+parse pattern for CONTEXT_ENVELOPE_JSON (lines 361-371) and HARNESS_SPEC_JSON_METADATA_KEY (lines 372-391) is nearly identical but differs subtly — divergence risk.
  • Operation. Extract fn parse_metadata_key<T: DeserializeOwned>(metadata: &Value, key: &str) -> Result<Option<T>, HandoffError>. Call it twice.
  • Success. Lines 361-391 replaced by two calls; unit test covers missing key, malformed JSON, and correct parse.

FIX-C-02 [OPEN] surface: handoff | axis: risk | concern: correctness | P2 | effort: S

  • Problem. detect_transcript_bleed() called three times (lines 340, 345, 354) with separately constructed errors. If the first check fails, downstream checks still run and may produce confusing compound errors.
  • Operation. Collect all three results, then aggregate: let bleed_errors: Vec<_> = [check_context, check_plan, check_metadata].into_iter().flatten().collect(). Return once with all violations.
  • Success. Test with bleed in two fields returns one error listing both violations.

FIX-C-03 [OPEN] surface: handoff | axis: risk | concern: correctness | P2 | effort: S

  • Problem. handoff.rs:405-418 re-parses HARNESS_SPEC_JSON_METADATA_KEY (lines 406-411 and 437-440) when no context envelope is found — parses the same key twice.
  • Operation. Parse once, store in a local let harness_spec = parse_metadata_key(...)?, reuse.
  • Success. rg 'HARNESS_SPEC_JSON_METADATA_KEY' crates/vox-orchestrator/src/handoff.rs shows ≤2 occurrences (one parse, one reference).

FIX-C-04 [OPEN] surface: handoff | axis: hygiene | concern: contracts | P2 | effort: M

  • Problem. ContextEnvelope shape is validated only by Rust type-level deserialization; no JSON Schema contract for external producers (e.g., agent runtimes calling over MCP).
  • Operation. Write contracts/communication/handoff-envelope.v1.schema.json. Call jsonschema::validate() in validate_handoff_invariants before deserialization.
  • Success. A misshapen envelope from an external caller is rejected at the boundary, not silently defaults.

FIX-C-05 [OPEN] surface: handoff | axis: risk | concern: testing | P1 | effort: M

  • Problem. No dedicated handoff test file. The six invariants have no failing-path coverage.
  • Operation. Create crates/vox-orchestrator/tests/handoff_tests.rs. Cover: (1) no verification criteria with pending tasks, (2) transcript bleed in context_notes, (3) remote handoff without A2A context, (4) malformed harness JSON, (5) valid full handoff, (6) empty handoff.
  • Success. cargo test -p vox-orchestrator handoff passes 6 test cases.

FIX-C-06 [OPEN] surface: handoff | axis: capability | concern: observability | P2 | effort: S

  • Problem. Handoff success/failure is not visible in traces or event bus.
  • Operation. At end of execute_handoff, emit AgentEventKind::PlanHandoff with invariant_check_passed, violation_count fields. On failure, emit AgentEventKind::HandoffFailed with error_kind.
  • Success. A handoff violation is visible in the event log without checking handoff.rs source.

FIX-C-07 [OPEN] surface: handoff | axis: risk | concern: correctness | P2 | effort: S

  • Problem. validate_agent_harness_ingest() at line 385 is called for its side effects; result is only error-checked, not acted upon for state changes. If it modifies shared state on success, accepting the result without storing it is a logic gap.
  • Operation. Document explicitly whether validate_agent_harness_ingest is pure (validation only) or mutating. If mutating, capture return value and apply. Add a // PURE or // MUTATES: comment.
  • Success. Code comment present; reviewer can audit without digging into implementation.

Design Note. The EventBus implementation (lines 575-633) is a thin wrapper around a Tokio broadcast channel, which is correct and minimal. The problem is upstream: the AgentEventKind enum at lines 90-569 has grown to 50+ variants in a single discriminated union. String-typed IDs in some variants contradict typed IDs in others. The broadcast channel’s silent-drop-on-overflow behavior at line 610 is the most operationally dangerous issue — a monitoring system that silently loses events will produce misleading dashboards.

FIX-D-01 [OPEN] surface: events | axis: risk | concern: correctness | P1 | effort: S

  • Problem. events.rs:597fetch_add(1, Ordering::Relaxed). Relaxed ordering means the assigned ID may be observed out-of-order by subscribers, producing non-monotonic event IDs.
  • Operation. Change to fetch_add(1, Ordering::AcqRel) on write; load(Ordering::Acquire) on read.
  • Success. 1000-event concurrent test: subscriber sees monotonically non-decreasing IDs.

FIX-D-02 [OPEN] surface: events | axis: risk | concern: correctness | P1 | effort: S

  • Problem. events.rs:610let _ = self.sender.send(event) silently discards SendError (channel at capacity). Subscribers miss events with no indication.
  • Operation. Replace with: if send fails, emit tracing::warn!(dropped_event = ?event.kind, "event bus at capacity; event dropped") and increment EVENTS_DROPPED counter metric.
  • Success. Forcing channel overflow in a test shows a WARN log and a non-zero counter.

FIX-D-03 [OPEN] surface: events | axis: hygiene | concern: modularity | P2 | effort: L

  • Problem. 50+ variants in AgentEventKind make adding, reading, and matching events error-prone. Pattern-matching the whole enum to handle task events means touching agent and system events.
  • Operation. Split into TaskEventKind, AgentEventKind, SystemEventKind. Wrap in enum OrchestratorEvent { Task(TaskEventKind), Agent(AgentEventKind), System(SystemEventKind) }. Update all match sites.
  • Success. Each sub-enum has ≤20 variants; rg 'AgentEventKind::Task' crates/ returns zero hits.

FIX-D-04 [OPEN] surface: events | axis: hygiene | concern: contracts | P2 | effort: M

  • Problem. ActivityStarted, ActivityCompleted, ActivityRetried (lines 406-415, 420, 424) use raw String for agent_id and task_id, while other variants use typed AgentId/TaskId.
  • Operation. Replace String with AgentId / TaskId in these three variants. Update constructors at call sites.
  • Success. rg 'ActivityStarted.*String\|ActivityCompleted.*String' crates/vox-orchestrator/src/events.rs returns zero hits.

FIX-D-05 [OPEN] surface: events | axis: hygiene | concern: modularity | P3 | effort: M

  • Problem. TaskStarted, TaskFailed, TaskCompleted repeat (task_id, agent_id, session_id, audit_report) fields.
  • Operation. Extract struct TaskEventCore { task_id, agent_id, session_id, audit_report }. Embed in each variant.
  • Success. Three variants share one struct; adding a new task field is a one-line change.

FIX-D-06 [OPEN] surface: events | axis: capability | concern: observability | P2 | effort: S

  • Problem. No metric for event bus subscriber count or throughput; cannot diagnose slow subscribers causing channel pressure.
  • Operation. Expose subscriber_count() result as a Prometheus gauge vox_orchestrator_event_bus_subscribers. Add events_emitted_total counter.
  • Success. curl /metrics shows both gauges.

FIX-D-07 [OPEN] surface: events | axis: risk | concern: correctness | P2 | effort: M

  • Problem. Channel capacity is implicit (Tokio broadcast default). No documented policy on lossy vs. lossless semantics. Dashboard and audit consumers need lossless; monitoring consumers can be lossy.
  • Operation. Write contracts/communication/agent-event.v1.schema.json including a channel_policy: lossy | lossless field. Create two bus instances: a lossless unbounded channel for audit (backed by a ring-buffer spill to disk), a lossy broadcast for real-time monitoring.
  • Success. Audit consumer test verifies zero events dropped under 10k event burst.

FIX-D-08 [OPEN] surface: events | axis: capability | concern: capability | P2 | effort: M

  • Problem. No lock-propagation event variant. An agent performing a DB schema mutation cannot inform other agents to pause data-retrieval operations. (Gap identified in nextgen research §8.2.)
  • Operation. Add SystemEventKind::LockAcquired { resource_id, holder_agent_id, ttl_ms } and LockReleased { resource_id }. Emit from acquire_distributed_lock. Agents subscribed to SystemEventKind can pause on LockAcquired.
  • Success. Test: agent A acquires schema lock → agent B’s data-retrieval tool pauses until LockReleased emitted.

Design Note. Grounding is the citation and evidence verification layer. The implementation is sound conceptually but split_summary_into_claim_segments() at lines 139-270 (132 lines) embeds Unicode boundary logic, German locale special-casing, and claim segmentation in one function with nested helpers. This creates brittleness for non-English content and makes the function resistant to testing. The expect("char boundary") at line 235 is the only hard-crash path found.

FIX-E-01 [OPEN] surface: grounding | axis: risk | concern: correctness | P1 | effort: S

  • Problem. grounding.rs:235.expect("char boundary") when indexing summary[i..]. If summary contains corrupted UTF-8 or if i lands on a multi-byte boundary, panics in production.
  • Operation. Replace with summary.get(i..).ok_or(GroundingError::InvalidUtf8Boundary(i))?. Add fuzz test with random byte strings.
  • Success. Fuzz corpus of 10k random strings never panics; test with mid-codepoint boundary returns Err.

FIX-E-02 [OPEN] surface: grounding | axis: hygiene | concern: modularity | P2 | effort: M

  • Problem. split_summary_into_claim_segments (132 lines) mixes char-boundary detection, locale rules, and segmentation loop. Unit testing one concern requires exercising all.
  • Operation. Extract: fn detect_char_boundary(s: &str, i: usize) -> Option<usize>, fn apply_locale_rules(token: &str) -> bool, fn segment_loop(s: &str) -> Vec<Range<usize>>. Keep top-level as coordinator.
  • Success. Each extracted fn has its own unit test; top-level fn ≤30 lines.

FIX-E-03 [OPEN] surface: grounding | axis: hygiene | concern: modularity | P3 | effort: M

  • Problem. classify_line_claim_kind() (lines 79-125) hardcodes English phrases (“according to”, “suggests that”, etc.). Non-English citations are unclassifiable.
  • Operation. Move phrase list to contracts/grounding/claim-phrases.v1.yaml keyed by language code. Load at startup. Default to en. Add --grounding-locale flag.
  • Success. Adding German phrases requires only a YAML edit; existing English tests still pass.

FIX-E-04 [OPEN] surface: grounding | axis: risk | concern: testing | P2 | effort: M

  • Problem. No tests for edge cases in claim segmentation: emoji sequences, surrogate pairs, empty summary, mixed-script (Latin + CJK), German compound streets.
  • Operation. Add crates/vox-orchestrator/tests/grounding_tests.rs with a parameterized test for each edge case.
  • Success. All 8 edge cases pass; no panics.

FIX-E-05 [OPEN] surface: grounding | axis: hygiene | concern: modularity | P3 | effort: S

  • Problem. grounding.rs:153-156 and 163-164 duplicate suffix checks for "str" and "strasse". Four lines of near-identical code.
  • Operation. Extract fn is_street_suffix(token: &str) -> bool covering all street variants.
  • Success. Duplication gone; adding “straße” (ß variant) is a one-liner.

FIX-E-06 [OPEN] surface: grounding | axis: capability | concern: capability | P2 | effort: L

  • Problem. Socrates confidence is computed from evidence heuristics, not from model token probability distributions. Hallucination entropy scoring from logprobs would give a model-native confidence signal. (nextgen research §6.2.)
  • Operation. When provider returns logprobs (OpenAI/OpenRouter support; flag required), compute per-token entropy H = -Σ p log p over the completion. Average entropy > configurable threshold → confidence_estimate lowered. Feed into existing SocratesGateOutcome.confidence.
  • Success. High-entropy completion on a factual question lowers Socrates confidence score; test with known uncertain model output.

Design Note. usage.rs implements per-provider cost tracking and daily budget enforcement. The global static Mutex at lines 256-261 is the most severe performance bottleneck in the non-routing surface: it serialises every API call through a single lock, which is untenable at scale. The custom today() date implementation at lines 183-200 is mathematically correct but adds cognitive overhead and leap-second edge-case risk when chrono is already a transitive dependency.

FIX-F-01 [OPEN] surface: usage-finops | axis: risk | concern: perf | P0 | effort: M

  • Problem. usage.rs:256-261OnceLock<Mutex<()>> serialises all record_call_detailed calls. Every API call (potentially hundreds per second) blocks on this single global lock.
  • Operation. Replace with a tokio::sync::mpsc::Sender<UsageRecord> channel. Spawn a single writer task that drains the channel and batches writes to DB. record_call_detailed becomes a non-blocking channel.send(record).ok() call.
  • Success. Load test: 500 concurrent record_call_detailed calls complete in <100ms total vs. serialized baseline.

FIX-F-02 [OPEN] surface: usage-finops | axis: risk | concern: correctness | P1 | effort: S

  • Problem. usage.rs:265-268.as_u64().unwrap_or(0) and .as_f64().unwrap_or(0.0) silently accept schema drift. A column type change produces wrong math with no error.
  • Operation. Replace with explicit type-check: value.as_u64().ok_or(UsageError::SchemaFieldTypeMismatch { key, expected: "u64" })?. Fail-open only in Profile::Dev.
  • Success. Test with wrong-type JSON field returns Err in production profile; passes with fallback in dev.

FIX-F-03 [OPEN] surface: usage-finops | axis: hygiene | concern: modularity | P3 | effort: S

  • Problem. usage.rs:183-200 — custom date calculation. chrono is already a transitive dependency.
  • Operation. Replace with chrono::Utc::now().date_naive(). Remove the custom function.
  • Success. rg 'fn today' crates/vox-orchestrator/src/usage.rs returns zero hits.

FIX-F-04 [OPEN] surface: usage-finops | axis: hygiene | concern: modularity | P3 | effort: S

  • Problem. record_call() (lines 203-226) just calls record_call_detailed() with constant defaults. An extra indirection with no added value.
  • Operation. Inline the defaults at call sites or make record_call_detailed() take an impl Into<UsageRecord> so callers can use a builder.
  • Success. rg 'fn record_call\b' crates/vox-orchestrator/src/usage.rs returns one hit.

FIX-F-05 [OPEN] surface: usage-finops | axis: capability | concern: capability | P1 | effort: L

  • Problem. BudgetManager::max_financial_cost_micros is a single global cap. No per-tenant or per-fleet-segment budget (nextgen research §4.2 P1 gap).
  • Operation. Add TenantBudget { tenant_id, daily_cap_micros, current_spend } table. On record_call_detailed, look up tenant from AgentTask.tenant_id and enforce. Emit BudgetExceeded { tenant_id } event. vox config budgets set <tenant> <amount>.
  • Success. Test: two tenants with separate caps; one exhausting budget does not affect the other.

FIX-F-06 [OPEN] surface: usage-finops | axis: capability | concern: observability | P2 | effort: S

  • Problem. Budget utilization is only checkable by querying DB. No real-time observable.
  • Operation. Add /v1/budget/status HTTP endpoint (gateway, authenticated) returning { global: { cap, spend, remaining }, tenants: [...] }.
  • Success. curl /v1/budget/status returns JSON; integration test asserts spend updates after a task.

FIX-F-07 [OPEN] surface: usage-finops | axis: risk | concern: correctness | P2 | effort: M

  • Problem. No unit tests for budget enforcement: exhaustion, tenant isolation, reconciliation.
  • Operation. Add crates/vox-db/tests/budget_tests.rs: (1) spend to cap → next call rejected, (2) tenant A exhausted → tenant B unaffected, (3) reconciliation updates balance.
  • Success. Three tests green; budget enforcement path exercised.

FIX-F-08 [OPEN] surface: usage-finops | axis: capability | concern: capability | P2 | effort: M

  • Problem. Pre-execution token estimation uses 4 chars/token heuristic (nextgen research §4.2). Tool output length is unpredictable and not estimated.
  • Operation. For each tool call, look up p95_output_tokens from model_scoreboard (if available). Use as the pre-execution estimate. Gate task dispatch if estimated_cost > remaining_budget.
  • Success. Test: task with estimated cost > budget cap is rejected before dispatching any LLM call.

Cluster G — MCP Dispatch (mcp_tools/dispatch.rs)

Section titled “Cluster G — MCP Dispatch (mcp_tools/dispatch.rs)”

Design Note. dispatch.rs is 1,224 LOC with a handle_tool_call_inner match statement spanning 960 lines routing ~150 tools. This is the single largest function in the codebase. It has no per-tool timeout, no rate limiting, no backpressure, and no structured error taxonomy — all tool errors are stringly typed. The hardcoded sentinel strings (SYSTEM_INTERVENTION, LAZY_GENERATION_DETECTED, RBAC_VIOLATION) are used in error-handling logic but belong in typed constants. FIX-G-02 (tool registry) is the foundational change that enables most others.

FIX-G-01 [OPEN] surface: mcp-dispatch | axis: risk | concern: correctness | P1 | effort: S

  • Problem. handle_tool_call_inner has no per-tool execution timeout. A stalled tool call holds the dispatch slot indefinitely.
  • Operation. Wrap each tool dispatch branch: tokio::time::timeout(Duration::from_secs(tool_timeout_secs), handler()).await.map_err(|_| ToolError::Timeout { tool_name }). Default tool_timeout_secs = 30; configurable per-tool via Vox.toml [orchestrator.tool_limits].
  • Success. Test: a handler that sleeps 60s returns ToolError::Timeout within 31s.

FIX-G-02 [OPEN] surface: mcp-dispatch | axis: risk | concern: correctness | P1 | effort: M

  • Problem. 960-line match statement is unmaintainable. Adding a tool means touching a 1k-line file; removing one requires careful search.
  • Operation. Introduce ToolRegistry: HashMap<&'static str, Box<dyn ToolHandler + Send + Sync>>. Each tool module exposes fn register(r: &mut ToolRegistry). handle_tool_call_inner becomes registry.get(tool_name)?.call(params).await.
  • Success. handle_tool_call_inner ≤20 lines; adding a tool is a new impl + one register() call; no match arm needed.

FIX-G-03 [OPEN] surface: mcp-dispatch | axis: risk | concern: security | P1 | effort: S

  • Problem. Hardcoded string sentinels "SYSTEM_INTERVENTION", "LAZY_GENERATION_DETECTED", "RBAC_VIOLATION" at dispatch.rs:60, 77, 103 are used in conditional logic. A typo breaks gating silently.
  • Operation. Define enum DispatchGateReason { SystemIntervention, LazyGenerationDetected, RbacViolation }. Replace all string comparisons with enum matching.
  • Success. rg '"SYSTEM_INTERVENTION"\|"RBAC_VIOLATION"' crates/vox-orchestrator-mcp/src/dispatch.rs returns zero hits.

FIX-G-04 [OPEN] surface: mcp-dispatch | axis: risk | concern: correctness | P2 | effort: S

  • Problem. dispatch.rs:50,135agent_id.parse::<u64>().ok().unwrap_or(0) silently maps parse failures to agent ID 0. Agent 0 may be a valid agent, producing misrouted calls.
  • Operation. Return ToolError::InvalidParam { param: "agent_id", reason: "not a valid u64" } on parse failure instead of defaulting to 0.
  • Success. Test: malformed agent_id string returns ToolError::InvalidParam, not a call to agent 0.

FIX-G-05 [OPEN] surface: mcp-dispatch | axis: capability | concern: observability | P2 | effort: S

  • Problem. No span or metric per tool call. Latency and error rates are invisible per tool.
  • Operation. In the registry dispatch path (FIX-G-02), wrap each call in info_span!("tool_call", tool_name = %name, agent_id = %agent_id). Emit tool_calls_total{tool_name} and tool_call_duration_seconds{tool_name} metrics.
  • Success. After 10 tool calls, /metrics shows non-zero counts per tool.

FIX-G-06 [OPEN] surface: mcp-dispatch | axis: capability | concern: capability | P2 | effort: M

  • Problem. No per-tool rate limiting. A single agent can flood the dispatch layer.
  • Operation. Add [orchestrator.tool_limits.<tool_name>] rate_per_minute = N in Vox.toml. Apply governor::RateLimiter per (agent_id, tool_name). On exceeded: return ToolError::RateLimited { retry_after_ms }.
  • Success. Test: 100 rapid calls to a limited tool returns RateLimited after the configured limit.

FIX-G-07 [OPEN] surface: mcp-dispatch | axis: hygiene | concern: modularity | P3 | effort: L

  • Problem. All tool registrations live in dispatch.rs. The file is a coupling point for all tool groups.
  • Operation. Move registration into per-group modules: chat_tools::register(), scientia_tools::register(), etc. dispatch.rs calls each group’s register() at startup.
  • Success. wc -l crates/vox-orchestrator-mcp/src/dispatch.rs < 100.

FIX-G-08 [OPEN] surface: mcp-dispatch | axis: risk | concern: correctness | P2 | effort: S

  • Problem. No backpressure mechanism. When agent queue is saturated, new tool calls continue to be accepted and pile up.
  • Operation. Add max_pending_tool_calls_per_agent: usize (default 50, config). On exceed: return ToolError::QueueFull { agent_id }. Emit AgentEventKind::SystemEventKind::BackpressureTriggered.
  • Success. Test: 100 concurrent tool calls for one agent → 51st returns QueueFull.

FIX-G-09 [OPEN] surface: mcp-dispatch | axis: hygiene | concern: contracts | P2 | effort: M

  • Problem. Tool errors are stringly-typed ToolResult::err(string). Callers cannot branch on error class.
  • Operation. Define enum ToolErrorKind { Budget, Auth, Validation, Internal, Timeout, RateLimited, QueueFull }. ToolResult carries error_kind: Option<ToolErrorKind>. All dispatch paths use it.
  • Success. `rg ‘ToolResult::err.*String’ crates/vox-orchestrator/src/mcp_tools’ returns zero untyped hits.

Cluster H — Chat Tools (mcp_tools/chat_tools/)

Section titled “Cluster H — Chat Tools (mcp_tools/chat_tools/)”

Design Note. The chat tools cluster contains the planning pipeline, which is the most user-visible MCP surface. plan_goal() at plan.rs:186-699 is 514 lines — the largest single function in the tooling layer. It builds prompts, resolves models, synthesizes tasks, handles budget, and formats output all in one pass. plan_loop.rs maybe_refine_plan() is 337 lines. Both suffer from hardcoded numeric constants scattered across 8+ locations. Three .expect() calls in production code are P1 crash risks.

FIX-H-01 [OPEN] surface: chat-tools | axis: risk | concern: correctness | P1 | effort: S

  • Problem. chat_tools/mod.rs:176.expect("telemetry JSON must parse"). If telemetry struct serialization fails (e.g., serde_json panic on NaN float), process crashes.
  • Operation. Replace with .unwrap_or_else(|e| { tracing::error!(err = %e, "telemetry serialization failed"); Value::Null }).
  • Success. Test: serializing a struct with NaN field returns Value::Null and a WARN log; no panic.

FIX-H-02 [OPEN] surface: chat-tools | axis: risk | concern: correctness | P1 | effort: S

  • Problem. chat_tools/mod.rs:214.expect("search_refinement field"). Field may be absent if schema evolves.
  • Operation. Replace with .and_then(|v| v.get("search_refinement")).cloned().unwrap_or(Value::Null).
  • Success. Test: JSON without search_refinement field returns Null; no panic.

FIX-H-03 [OPEN] surface: chat-tools | axis: risk | concern: correctness | P1 | effort: S

  • Problem. chat/mentions.rs:9Regex::new(pattern).unwrap() in a LazyLock. If pattern is invalid (e.g., after editing), process panics on first use.
  • Operation. Replace with Regex::new(pattern).expect("mentions regex: compile error — check pattern"). Add a unit test that instantiates the regex.
  • Success. Deliberate syntax error in pattern fails at test time, not at mention parse time.

FIX-H-04 [OPEN] surface: chat-tools | axis: hygiene | concern: modularity | P1 | effort: L

  • Problem. plan.rs:186-699 (514 lines) mixes prompt construction, model resolution, task synthesis, budget accounting, and output formatting in one function.
  • Operation. Extract a PlanningPipeline with stages: build_context() -> PlanContext, resolve_model(ctx) -> ModelHandle, synthesize_tasks(ctx, model) -> Vec<PlanNode>, enforce_budget(nodes) -> Vec<PlanNode>, format_output(nodes) -> PlanResult. plan_goal() orchestrates the pipeline.
  • Success. plan_goal() ≤60 lines; each stage has its own unit test; total coverage of error paths increases.

FIX-H-05 [OPEN] surface: chat-tools | axis: hygiene | concern: modularity | P2 | effort: M

  • Problem. plan_loop.rs:151-487 (337 lines) mixes refinement logic, budget tracking, and result formatting.
  • Operation. Extract: assess_adequacy(plan, budget) -> AdequacyResult, refine_once(plan, budget, model) -> RefinedPlan, check_round_limit(round, max) -> Result<(), PlanError>. Loop wrapper ≤50 lines.
  • Success. Each extracted fn tested independently.

FIX-H-06 [OPEN] surface: chat-tools | axis: hygiene | concern: contracts | P2 | effort: S

  • Problem. Token caps at plan.rs:55-60 are hardcoded const values (3072, 4096, 8192) keyed to plan depth. Changing them requires a code change and rebuild.
  • Operation. Move to Vox.toml [orchestrator.planning.token_caps_by_depth] as an array. Read at plan creation time. Keep Rust consts as compile-time fallbacks.
  • Success. Changing caps in Vox.toml takes effect without recompile; vox config planning show prints active caps.

FIX-H-07 [OPEN] surface: chat-tools | axis: hygiene | concern: contracts | P2 | effort: S

  • Problem. plan.rs:187 max_tasks.unwrap_or(30) and plan_loop.rs:119 .min(8) round cap are magic numbers with no operator control.
  • Operation. Add Vox.toml [orchestrator.planning] max_tasks_default = 30 and max_refine_rounds = 8. Operators can increase for large codebases.
  • Success. Setting max_tasks_default = 50 in Vox.toml is reflected in plan generation.

FIX-H-08 [OPEN] surface: chat-tools | axis: hygiene | concern: contracts | P2 | effort: S

  • Problem. plan_loop.rs:103 reserve_tokens = 4096_u32 and :164 refine_budget_tokens.unwrap_or(18_000) are hardcoded.
  • Operation. Move to config keys planning.reserve_tokens and planning.refine_budget_tokens.
  • Success. Config keys documented; tests pass with non-default values.

FIX-H-09 [OPEN] surface: chat-tools | axis: hygiene | concern: contracts | P2 | effort: S

  • Problem. plan_loop.rs:162 .unwrap_or(0.28_f32) refinement adequacy threshold is unexplained magic.
  • Operation. Add Vox.toml [orchestrator.planning] adequacy_threshold = 0.28 with a comment explaining the 0–1 scale. Document: below threshold → refinement triggered.
  • Success. Threshold documented; unit test uses non-default value.

FIX-H-10 [OPEN] surface: chat-tools | axis: risk | concern: correctness | P2 | effort: M

  • Problem. plan_loop.rs refinement loop enforces max_refine_rounds but has no guard against the adequacy check always returning false (degenerate case → runs max rounds regardless).
  • Operation. After max_refine_rounds without adequacy passing: emit AgentEventKind::PlanRefinementExhausted and return the best plan seen so far, tagged as quality: BestEffort.
  • Success. Test: always-inadequate adequacy stub terminates after max_refine_rounds with a BestEffort result.

FIX-H-11 [OPEN] surface: chat-tools | axis: capability | concern: observability | P2 | effort: S

  • Problem. Plan generation is a black box in traces. Duration, task count, depth, and refinement rounds are invisible.
  • Operation. Emit tracing::info_span!("plan_generation", task_count, depth, refine_rounds, budget_tokens) wrapping plan_goal execution.
  • Success. A plan trace shows the span with non-zero attributes.

FIX-H-12 [OPEN] surface: chat-tools | axis: risk | concern: testing | P2 | effort: M

  • Problem. No test coverage for plan error paths: zero tasks, budget exceeded on synthesis, replan with empty base plan.
  • Operation. Add crates/vox-orchestrator/tests/plan_tests.rs covering: (1) zero-task result, (2) budget cap hit before all tasks generated, (3) replan with no prior nodes, (4) refinement budget exhausted.
  • Success. Four tests green.

FIX-H-13 [OPEN] surface: chat-tools | axis: hygiene | concern: modularity | P3 | effort: S

  • Problem. plan.rs:495.unwrap_or(None).unwrap_or(0) chained Option. The middle unwrap_or(None) is a no-op that adds confusion.
  • Operation. Replace with .flatten().unwrap_or(0) or restructure the Option chain.
  • Success. rg 'unwrap_or(None)' crates/vox-orchestrator/src/mcp_tools/chat_tools/plan.rs returns zero hits.

Cluster I — Scientia Tools (mcp_tools/scientia_tools/)

Section titled “Cluster I — Scientia Tools (mcp_tools/scientia_tools/)”

Design Note. The Scientia tools implement the scholar research lifecycle: preflight, discovery, novelty checking, publication, and external job submission. The implementation is defensively coded (mostly safe .unwrap_or() patterns) but leaks policy into constants: score thresholds at 0.85, 0.62, 0.58 are embedded in code with no documentation of their derivation. External job submission loops have no retry limit guard. Functions over 100 lines in novelty.rs and lifecycle.rs need decomposition.

FIX-I-01 [OPEN] surface: scientia-tools | axis: hygiene | concern: contracts | P2 | effort: S

  • Problem. Score thresholds (0.85, 0.62, 0.58) in novelty.rs are undocumented magic numbers.
  • Operation. Add Vox.toml [orchestrator.scientia] novelty_high_threshold = 0.85 # reject as duplicate, novelty_medium_threshold = 0.62 # flag for review, etc. Document thresholds in docs/src/reference/scientia-thresholds.md.
  • Success. `rg ‘0.85|0.62|0.58’ crates/vox-orchestrator/src/mcp_tools/scientia_tools’ returns zero code hits (moved to config read).

FIX-I-02 [OPEN] surface: scientia-tools | axis: hygiene | concern: contracts | P2 | effort: S

  • Problem. common.rs:87 .unwrap_or(0.85) evidence threshold is separate from the novelty thresholds and undocumented.
  • Operation. Add Vox.toml [orchestrator.scientia] evidence_confidence_floor = 0.85. Document its role (minimum confidence to count a source as evidence).
  • Success. Config key present and documented; test with non-default value shows different gating.

FIX-I-03 [OPEN] surface: scientia-tools | axis: risk | concern: correctness | P2 | effort: M

  • Problem. external.rs job submission and replay loops have no explicit retry limit. A stuck external job could loop indefinitely.
  • Operation. Add max_retries: u32 (default 3) and retry_backoff_ms: u64 (default 1000, exponential) to external job config. After max retries: emit JobFailed { job_id, reason: "max_retries_exceeded" }.
  • Success. Test: mock external endpoint always returns 500 → job fails after 3 retries with structured error.

FIX-I-04 [OPEN] surface: scientia-tools | axis: capability | concern: observability | P2 | effort: S

  • Problem. Scholar lifecycle events (submit, discover, publish) are not emitted to the event bus. Orchestrator cannot react to publication completion.
  • Operation. Add TaskEventKind::ScholiaSubmitted, ScholiaDiscovered, ScholiaPublished variants. Emit from lifecycle.rs at each milestone.
  • Success. Integration test: publishing a scholia emits ScholiaPublished event visible to a subscriber.

FIX-I-05 [OPEN] surface: scientia-tools | axis: risk | concern: testing | P2 | effort: M

  • Problem. No tests for edge cases: novelty check with empty corpus, discovery returning no results, lifecycle failure mid-way.
  • Operation. Add crates/vox-orchestrator/tests/scientia_tests.rs: (1) empty corpus → NoveltyCheckSkipped, (2) discovery with zero results → DiscoveryEmpty, (3) lifecycle step failure → partial state rollback.
  • Success. Three tests green.

FIX-I-06 [OPEN] surface: scientia-tools | axis: hygiene | concern: modularity | P2 | effort: L

  • Problem. novelty.rs and scholar.rs contain functions >100 lines mixing computation logic with side effects (DB writes, event emission).
  • Operation. For each >100-line function: separate pure computation (returns data) from side effects (writes to DB, emits events). Apply the same stage-extraction pattern as FIX-H-04.
  • Success. No function in scientia_tools > 80 lines; each stage independently testable.

FIX-I-07 [OPEN] surface: scientia-tools | axis: risk | concern: correctness | P2 | effort: S

  • Problem. Evidence refresh calls in discovery.rs have no minimum refresh interval. An agent could trigger rapid re-fetches.
  • Operation. Apply the same min_refresh_interval_secs pattern as OpenRouter catalog (from prior audit FIX-28). Gate discovery refresh calls with a per-session TTL.
  • Success. Rapid consecutive refresh calls: only first executes; subsequent return cached result.

Cluster J — LLM Bridge (mcp_tools/llm_bridge/)

Section titled “Cluster J — LLM Bridge (mcp_tools/llm_bridge/)”

Design Note. The LLM bridge is the provider-facing inference layer: it translates tool call params into provider-specific HTTP requests and normalizes responses. infer.rs::mcp_infer_tool_completion (390 lines) is the second-largest function in the tooling layer. The most critical gap is the absence of explicit HTTP timeout Durations on any provider client — a hanging provider connection will hold the dispatch slot and eventually exhaust the thread pool. The Result<T, String> error type prevents structured error handling upstream.

FIX-J-01 [OPEN] surface: llm-bridge | axis: risk | concern: correctness | P0 | effort: M

  • Problem. No explicit Duration timeout found in any provider HTTP client setup in llm_bridge/. A stalled provider holds the connection indefinitely.
  • Operation. Add connect_timeout(Duration::from_secs(10)) and timeout(Duration::from_secs(provider_timeout_secs)) to every provider’s reqwest::Client::builder(). Default 60s; configurable per provider in Vox.toml [providers.<name>] timeout_secs.
  • Success. Test: provider that hangs for 120s is aborted after 61s with ProviderError::Timeout.

FIX-J-02 [OPEN] surface: llm-bridge | axis: hygiene | concern: contracts | P1 | effort: M

  • Problem. infer.rs:190,224 returns Result<(String, String, u64), String>. String errors cannot be matched by callers.
  • Operation. Define enum ProviderError { Timeout, RateLimited { retry_after_ms }, Unauthorized, ServiceError { status: u16, body: String }, Serialization(serde_json::Error) }. Replace all Err(String) in llm_bridge/.
  • Success. `rg ‘Result<.*String>’ crates/vox-orchestrator/src/mcp_tools/llm_bridge’ returns zero hits.

FIX-J-03 [OPEN] surface: llm-bridge | axis: hygiene | concern: modularity | P1 | effort: L

  • Problem. mcp_infer_tool_completion (390 lines, infer.rs:210-599) handles vision tokens, model selection, context packing, streaming, provider dispatch, fallback, and response normalization in one function.
  • Operation. Extract pipeline stages: pack_context(params) -> PackedContext, dispatch_to_provider(ctx, model) -> ProviderResponse, normalize_response(raw) -> InferResult. Orchestrating wrapper ≤50 lines.
  • Success. Each stage unit-tested; mcp_infer_tool_completion ≤50 lines.

FIX-J-04 [OPEN] surface: llm-bridge | axis: hygiene | concern: contracts | P2 | effort: S

  • Problem. infer.rs:278estimated_vision_tokens += 1000. Hardcoded estimate regardless of image size or model context expectations.
  • Operation. Replace with model_spec.vision_tokens_per_image.unwrap_or(1000) — add vision_tokens_per_image: Option<u32> to ModelSpec populated from catalog (OpenRouter returns image token cost).
  • Success. Vision-capable models show provider-specific estimates; fallback to 1000 documented.

FIX-J-05 [OPEN] surface: llm-bridge | axis: risk | concern: correctness | P2 | effort: M

  • Problem. HTTP 429/408/504 are detected at infer.rs:544-565 but no exponential backoff is applied. Provider is immediately retried, worsening rate-limit conditions.
  • Operation. On 429: wait retry_after header value (or 1s default), then retry up to 3 times with exponential backoff (1s, 2s, 4s). On 408/504: retry once after 2s. Log each attempt with ProviderError::attempt_number.
  • Success. Test: mock 429 → after 3 retries returns ProviderError::RateLimited; retry_after respected.

FIX-J-06 [OPEN] surface: llm-bridge | axis: capability | concern: observability | P2 | effort: S

  • Problem. Provider attempt details (status code, attempt number, provider name) are logged but not in structured spans.
  • Operation. Emit tracing::info_span!("provider_attempt", provider, attempt_number, status_code, latency_ms) for each attempt. Link to parent trace_id from AgentTask.
  • Success. A retried call shows multiple child spans under the parent trace.

FIX-J-07 [OPEN] surface: llm-bridge | axis: capability | concern: capability | P2 | effort: M

  • Problem. Routing to Anthropic via OpenRouter can silently fail on tool-call chains due to schema differences (nextgen research §3.1). No schema-aware translation layer.
  • Operation. Add fn translate_tool_calls(messages: &mut Vec<Message>, target: ProviderType) that enforces Anthropic’s strict alternating-turn structure when target == Anthropic. Apply in provider_adapter.rs before dispatch.
  • Success. Test: OpenAI-style tool result chain translated to Anthropic format passes Anthropic validation rules.

FIX-J-08 [OPEN] surface: llm-bridge | axis: risk | concern: correctness | P2 | effort: S

  • Problem. Many provider response field fallbacks use .unwrap_or_default() silently. A changed provider response shape produces wrong token counts or costs.
  • Operation. Add validate_provider_response(raw: &Value, provider: ProviderType) -> Result<(), ProviderError> that checks required fields before extraction. Log a WARN if optional fields are unexpectedly absent.
  • Success. Test: response missing usage.input_tokens field → validation WARN emitted; call still succeeds with 0 tokens.

FIX-J-09 [OPEN] surface: llm-bridge | axis: hygiene | concern: modularity | P3 | effort: M

  • Problem. Per-provider adapters (gemini.rs, anthropic.rs, openai.rs, ollama_chat.rs) share structure but lack a common trait. Adding a new provider requires copying boilerplate.
  • Operation. Formalise trait ProviderAdapter { fn adapt_request(&self, ctx: &PackedContext) -> Value; fn extract_response(&self, raw: &Value) -> InferResult; }. Each file implements it. dispatch_to_provider calls the trait.
  • Success. New provider requires only implementing the two trait methods.

FIX-J-10 [OPEN] surface: llm-bridge | axis: capability | concern: capability | P2 | effort: M

  • Problem. No PII-aware routing. Tasks involving sensitive data (medical records, credentials) can be routed to any provider including cloud providers (nextgen research §3.2 P1 gap).
  • Operation. Add AgentTask.sensitivity: Option<SensitivityLevel { Public, Internal, Restricted }>. In best_for() (prior audit) and mcp_infer_tool_completion, filter out cloud providers when sensitivity == Restricted; allow only PopuliMesh or Ollama.
  • Success. Test: Restricted task never dispatched to OpenRouter or Anthropic.

Cluster K — HTTP Gateway (mcp_tools/http_gateway/)

Section titled “Cluster K — HTTP Gateway (mcp_tools/http_gateway/)”

Design Note. The HTTP gateway received significant recent hardening (bearer auth, CORS, origin guard — see recent commits). Three P0/P1 security issues remain: DEBUG println! leaking env vars (line 176-177), an origin guard starts_with prefix-bypass, and WebSocket tokens in URL query parameters appearing in access logs. The eval endpoint runs untrusted code with only a 5-second wall-clock timeout and no input size limit — a CPU-bound loop can exhaust the timeout but continue burning CPU. All FIX-K items are S/M effort; P0 and P1 items should be landed in the next sprint.

(FIX-K-01 through FIX-K-18 are fully specified in Part 2 above — all mechanical operations with clear success criteria. Key security items: FIX-K-01 and FIX-K-02 are P0, should land together.)


Cluster L — Memory Tools (mcp_tools/memory_tools/)

Section titled “Cluster L — Memory Tools (mcp_tools/memory_tools/)”

Design Note. Memory tools are the most defensively-written cluster in the MCP layer. The err_with_remediation() pattern with typed REM_* constants is excellent — it should be adopted as the standard across all MCP tool handlers (see FIX-X-03). The main gaps are operational: hardcoded item limits (10, 20) that should be operator-tunable, absence of rate limiting, and unclear memory isolation semantics between agents.

(FIX-L-01 through FIX-L-05 fully specified in Part 2.)


Cluster M — Task Tools (mcp_tools/task_tools/)

Section titled “Cluster M — Task Tools (mcp_tools/task_tools/)”

Design Note. Task tools manage task submission, completion, and cancellation. The repeated companion lookup/upsert pattern (3×, lifecycle.rs) is a copy-paste debt risk — one of the three could silently diverge. Gamification companion updates are fire-and-forget async tasks with no error logging, so companion state can silently desync from task state. Priority string parsing with a silent _ → Normal default is a foot-gun.

(FIX-M-01 through FIX-M-05 fully specified in Part 2.)


Design Note. dei_shim/ is a 2,769 LOC legacy compatibility layer retained after the vox-dei crate retirement. The nomenclature convergence commits (Mens→Populi, etc.) reduced the shim’s necessity but did not remove it. Its continued presence creates import ambiguity, keeps retired symbol names alive in code paths, and contradicts the retirement policy in AGENTS.md:140. The plan below follows a conservative deprecation path: annotate → migrate callers → gate with feature flag → remove.

(FIX-N-01 through FIX-N-05 fully specified in Part 2.)


Cluster O — Daemon Binary + ADR-022 Boundaries

Section titled “Cluster O — Daemon Binary + ADR-022 Boundaries”

Design Note. vox_orchestrator_d.rs is the daemon entry point. Three correctness gaps stand out: no signal handler (SIGTERM leaves the process without flushing the outbox), DB init failure at line 75 that logs a warning and continues (rather than halting), and a panic! in SessionManager initialization at line 120. The ADR-022 Phase B split-plane RPC flag matrix is implemented but not documented in the ADR (which was written before Phase B was completed).

(FIX-O-01 through FIX-O-09 fully specified in Part 2.)


Cluster P — Persistence Outbox + ops_orchestrator.rs

Section titled “Cluster P — Persistence Outbox + ops_orchestrator.rs”

Design Note. The persistence outbox is the degraded-mode safety net: when DB is unavailable, operations are queued and replayed on reconnect. Two correctness gaps: (1) the fence token at line 45-56 has no overflow guard at i64::MAX — a long-lived deployment accumulates tokens indefinitely, (2) clock-skew between nodes can cause a lock to appear expired when it is actually valid. The outbox schema uses additionalProperties: true and has no version field, making schema evolution undetectable.

(FIX-P-01 through FIX-P-10 fully specified in Part 2.)


Design Note. The planning module contains routing, synthesis, and replanning policy. Policy selection at policy.rs:6-10 is string-based and case-insensitive with no compile-time verification. An infinite replan loop is possible if the replanning condition never resolves. The planning module’s relationship to chat_tools/plan.rs is unclear — both deal with plan generation but apparently at different abstraction levels.

(FIX-Q-01 through FIX-Q-05 fully specified in Part 2.)


Cluster R — Scaling & Services (services/, scaling.rs)

Section titled “Cluster R — Scaling & Services (services/, scaling.rs)”

Design Note. The scaling and services layer wraps embeddings, routing-policy shim, gateway helpers, and agent count management. check_scaling at lines 657-771 is well-structured. The bundled services/ directory lacks explicit service factory patterns, making it harder to stub for testing. Scaling decisions need idempotence guarantees and overflow-safe arithmetic.

(FIX-R-01 through FIX-R-05 fully specified in Part 2.)


Design Note. Four agent prompt/policy documents exist: .vox/agents/orchestrator.md (pilot-facing operational manual), .vox/agents/vox-orchestrator.md (crate specialist scope), docs/agents/orchestrator.md (contributor-facing SSOT for boundaries), and crates/vox-skills/skills/orchestrator.skill.md (agent skill). The prior exploration found these are deliberately non-redundant — but they lack version stamps, making staleness undetectable. The doom-loop intervention policy and lock-propagation awareness (identified as P0/P1 capability gaps) have no representation in any prompt.

(FIX-S-01 through FIX-S-09 fully specified in Part 2.)


Design Note. Three test files exist for the orchestrator: orchestrator_e2e_test.rs (675 LOC, good concurrency and watchdog coverage), orchestrator_bootstrap_surface_parity_test.rs (50 LOC, thin), and ops_orchestrator_tests.rs (144 LOC, good lock/heartbeat/circuit-breaker coverage). Critical gaps: Socrates gate enforce vs. shadow mode has no test, trust-gate-relax path has no test, handoff invariants have no dedicated test file, conflict resolution is untested, and the doom-loop detector (FIX-B-11) needs a test from day one.

(FIX-T-01 through FIX-T-14 fully specified in Part 2.)


Design Note. Only one orchestrator-specific contract file exists: orchestrator-persistence-outbox.schema.json. The handoff envelope, agent event schema, and MCP dispatch error taxonomy have no machine-readable contracts. Adding schemas closes the feedback loop between code changes and external consumers (agent runtimes, CLI tools, monitoring systems).

(FIX-U-01 through FIX-U-05 fully specified in Part 2.)


Design Note. ADR-022 covers bootstrap and daemon boundaries but was written before the Phase B split-plane flag matrix was implemented. Two missing ADRs: event bus semantics (lossy vs. lossless lanes — a design decision with production consequences) and distributed lock fence-token discipline (clock-skew policy is undocumented, meaning future contributors will rediscover the race).

(FIX-V-01 through FIX-V-05 fully specified in Part 2.)


Design Note. The codebase has inconsistent error handling: memory_tools uses an excellent err_with_remediation() pattern; llm_bridge uses Result<T, String>; runtime.rs uses bare Err("...") strings; dispatch.rs uses ToolResult::err(string). This inconsistency means callers cannot write uniform error-handling code. The fix is twofold: define a top-level OrchestratorError enum and standardize the err_with_remediation() pattern across all MCP tools.

(FIX-X-01 through FIX-X-06 fully specified in Part 2.)


Design Note. Tokio task cancellation drops a future at the next .await point. Any future that holds a RwLock guard, a Mutex guard, or a DB transaction across an .await will leak the guard on cancellation — a deadlock waiting to happen. This is not a hypothetical: handle_command in runtime.rs holds orchestrator references across .await boundaries. A systematic audit is required.

(FIX-Y-01 through FIX-Y-04 fully specified in Part 2.)


Cluster Z — Observability (Beyond GenAI)

Section titled “Cluster Z — Observability (Beyond GenAI)”

Design Note. The GenAI telemetry layer is well-specified in the prior routing audit. The non-GenAI observability gap is in: slow-operation detection (no alert if a tool handler stalls for >5s), metric naming consistency (some spans still use vox_dei:: targets), log-level audit (DEBUG statements in hot paths), and per-agent/per-queue gauges that would make scaling decisions legible in dashboards.

(FIX-Z-01 through FIX-Z-06 fully specified in Part 2.)


Cluster AA — Security (Beyond vox-secrets/Gateway)

Section titled “Cluster AA — Security (Beyond vox-secrets/Gateway)”

Design Note. The gateway audit (Cluster K) covers the HTTP surface. Three additional security concerns span the broader system: (1) no tracing redaction middleware — a developer can debug!("{:?}", secret) and leak credentials to log aggregators; (2) no replay-attack protection on A2A delivery — a duplicate jwe_payload delivery is currently accepted; (3) the eval endpoint can initiate outbound network connections from the subprocess (SSRF risk).

(FIX-AA-01 through FIX-AA-07 fully specified in Part 2.)


Design Note. Orchestrator has no Drop implementation. When the process exits unexpectedly, the outbox is not flushed, the DB handle is not cleanly closed, and distributed locks may not be released. Restart recovery depends entirely on the outbox replay mechanism — which only works if the outbox was written before the crash. An RAII guard for distributed locks and a structured shutdown sequence are the two highest-value items.

(FIX-AB-01 through FIX-AB-05 fully specified in Part 2.)


StageFIX clustersThemeGate
1 — Immediate securityK-01, K-02, K-03, K-05, K-06Remove P0 security issuesAll P0 items pass security review
2 — Crash preventionB-02, B-06, E-01, H-01, H-02, H-03, J-01, O-02, O-03, X-01Eliminate panics and unchecked unwrapscargo test -p vox-orchestrator green; no new panics in fuzz
3 — Correctness & testsC-05, D-01, D-02, F-02, G-04, T-01..T-07, Y-01, Y-02, AB-01, AB-02Add critical test coverage; fix ordering bugs; shutdown/RAIITest count increases by ≥40; CI green
4 — FinOps & capabilityB-11, F-01, F-05, F-08, G-01, G-06, G-08, J-05, J-10Doom-loop, budget isolation, per-tool limits, backpressureLoad test passes; doom-loop test passes
5 — ModularityB-01, B-04, B-05, G-02, G-07, H-04, H-05, I-06, J-03Decompose large functions; introduce tool registryNo regression; dispatch.rs < 100 lines
6 — ObservabilityD-06, F-06, G-05, J-06, K-18, Z-01..Z-06, B-08, I-04, M-05Metrics, spans, slow-op detection/metrics shows orchestrator gauges; traces show per-phase spans
7 — Contracts & docsC-04, D-07, P-03, P-04, U-01..U-05, V-01..V-05, S-01..S-09JSON schemas, new ADRs, prompt reconciliationvox ci contracts-validate green; ADRs committed
8 — DEI shim retirementN-01..N-05Full shim removalrg 'dei_shim' crates/vox-orchestrator/src returns zero hits

The following are grep-able or test-verifiable done-conditions for the audit as a whole:

  • rg 'unwrap\(\)\|expect\(' crates/vox-orchestrator/src/**/*.rs | grep -v test | grep -v '#\[cfg(test' | wc -l decreases by ≥70% from baseline.
  • rg 'Err(String\|Result<.*String>' crates/vox-orchestrator/src | grep -v test | wc -l returns zero.
  • cargo test -p vox-orchestrator includes ≥6 new handoff tests, ≥6 Socrates gate tests, ≥5 outbox replay tests.
  • rg '"SYSTEM_INTERVENTION"\|"RBAC_VIOLATION"\|"LAZY_GENERATION"' crates/vox-orchestrator/src returns zero.
  • rg 'target: "vox_dei' crates/ | grep -v archive | wc -l returns zero.
  • rg 'dei_shim' crates/vox-orchestrator/src | grep -v 'deprecated\|tombstone' | wc -l returns zero (Stage 8 gate).
  • wc -l crates/vox-orchestrator-mcp/src/dispatch.rs < 100 (post FIX-G-02/G-07).
  • curl localhost:<port>/metrics returns vox_orchestrator_event_bus_subscribers, vox_orchestrator_queue_depth, vox_orchestrator_active_agents.
  • curl localhost:<port>/healthz returns 200 (FIX-K-09/O-04).
  • cargo audit -p vox-orchestrator returns zero high-severity advisories (FIX-AA-07).
  • All 8 rollout stage gates pass in CI.

  • model-orchestration-ssot-audit-2026.md — FIX-01..75 covering model routing, catalog, vox-secrets secret distribution, and telemetry alignment.
  • nextgen-orchestrator-research-2026.md — Research synthesis with P0–P3 capability gaps that motivated FIX-B-11 (doom-loop), FIX-D-08 (lock propagation), FIX-E-06 (entropy scoring), FIX-F-05 (fleet throttle), FIX-J-07 (schema-aware translation), FIX-J-10 (PII routing).
  • docs/src/adr/022-orchestrator-bootstrap-and-daemon-boundaries.md — ADR governing daemon/MCP boundary and Phase B RPC flag matrix.
  • docs/agents/orchestrator.md — Contributor-facing SSOT for orchestrator boundaries and Socrates policy.
  • OpenTelemetry GenAI semantic conventions v1.37 — Informs FIX-Z-02 span naming.
  • tokio-cancel-safety documentation — Informs FIX-Y-01..Y-04.
  • OWASP Top 10 2025 (A01 Broken Access Control, A04 Insecure Design) — Informs FIX-K-01..K-18.