Vox Model Autonomic System — L1/L2/L3 Design (2026-Q2)
Council-ratified 2026-05-15. Companion to model-selection-2026-q2.md §8.
0. Problem
The current model pipeline mixes three concerns into one hand-edited file:
the bootstrap catalog is simultaneously (a) the cold-start fallback, (b) the
SSOT humans edit when a new model ships, and (c) the pin list for
reproducibility. Result: every new frontier model release blocks on a manual
PR to model-catalog.bootstrap.v1.json, and selection logic is scattered
across two parallel paths (select() + resolve_model_with_registry_fallbacks)
plus several provider-default constants.
1. The three loops
    ┌───────────────────────────────────────────────────────────────┐
    │ L3  Council-review (quarterly + on alert)                     │
    │   • approves Provisional → Confirmed tier promotions          │
    │   • approves premium_alias rotations                          │
    │   • reads council-report.md (auto-generated)                  │
    └───────────────────────────────────────────────────────────────┘
                     ▲ council-report.md
    ┌───────────────────────────────────────────────────────────────┐
    │ L2  Continuous classification (NEW)                           │
    │   • classifier LLM (Haiku/Flash-tier) consumes                │
    │       (id, description, param_graph, sample_pricing)          │
    │       → emits {tier, strengths[], confidence}                 │
    │   • shadow-runs new model on eval panel for N samples         │
    │   • DB scoreboard converges on success/cost/latency           │
    │   • confidence ≥ threshold → Provisional → Confirmed          │
    └───────────────────────────────────────────────────────────────┘
                     ▲ provisional ModelSpec + DB rows
    ┌───────────────────────────────────────────────────────────────┐
    │ L1  Continuous discovery (extension of existing)              │
    │   OpenRouter /models • LiteLLM pricing • Anthropic /models    │
    │   • runs on a schedule (nightly cron), not just at startup    │
    │   • diffs against registry → emits DiscoveryEvent             │
    └───────────────────────────────────────────────────────────────┘

2. Existing pieces we reuse
| Surface | What it does today | How L1/L2/L3 uses it |
|---|---|---|
| OpenRouterCatalog::refresh() | One-shot /models fetch | L1: scheduled refresh |
| LiteLLMCatalog::fetch() | Pricing enrichment | L1: scheduled refresh |
| AnthropicDirectCatalog::refresh() | Key-gated Anthropic catalog | L1: scheduled refresh |
| infer_strengths() (catalog.rs) | Parameter-graph → strengths | L2: prior for classifier |
| ModelRegistry::premium_alias_for() | Task→pin lookup | L3: read from pins.yaml |
| select(intent, registry) | Multi-axis SSOT picker | runtime: unchanged |
| vox-db model_scoreboard | Per-model success/cost/latency | L2: convergence signal |
| vox-db model_pricing_catalog | Telemetry-confirmed pricing | L2: confidence promotion |
3. New contracts
3.1 contracts/orchestration/model-pins.v1.yaml

Council-reviewed and infrequently changed. The pin list lives separately from the catalog:

    schema: vox.orchestration.pins/v1
    premium_alias:
      codegen: anthropic/claude-opus-4.7
      research: google/gemini-3.1-pro
      review: anthropic/claude-sonnet-4.6
      planning: openai/gpt-5.5-pro
      # …
    version_pins:
      # When CR-L0 eval-panel reproducibility matters, lock these.
      llm-panel.claude-sonnet: claude-sonnet-4-6
      llm-panel.gpt-frontier: gpt-5.4
    council_signoff:
      rotation_id: 2026-Q2-rotation-2
      approved_by: [council]
      approved_at: 2026-05-15

3.2 contracts/orchestration/catalog-fallback.v1.json
What model-catalog.bootstrap.v1.json becomes after the rename: a minimal
emergency-offline subset (~5–10 models) covering each tier, used only
when L1 discovery has never succeeded. Not the SSOT.
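
The "only when L1 discovery has never succeeded" rule could be sketched roughly as follows. This is a minimal sketch under assumptions: `CatalogSource` and `choose_catalog` are hypothetical names, and it assumes the most recent successful L1 snapshot is persisted somewhere and handed in as an `Option`.

```rust
/// Where the active catalog came from. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
enum CatalogSource {
    Discovered,
    EmergencyFallback,
}

/// Pick the catalog to load.
///
/// `last_discovered` is the most recent successful L1 snapshot, if one was
/// ever recorded (assumption: it is persisted between runs). The fallback
/// is consulted only when no such snapshot exists.
fn choose_catalog<'a>(
    last_discovered: Option<&'a [String]>,
    fallback: &'a [String],
) -> (CatalogSource, &'a [String]) {
    match last_discovered {
        // Any earlier successful discovery wins, even if it is stale.
        Some(models) => (CatalogSource::Discovered, models),
        // Discovery has never succeeded: use the emergency-offline subset.
        None => (CatalogSource::EmergencyFallback, fallback),
    }
}
```

Under this reading, a stale snapshot still takes precedence over the fallback file, which keeps the fallback from quietly becoming a second source of truth.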
3.3 Confidence states
Each ModelSpec carries a confidence: Confidence field:

    Provisional  // discovered, classifier-tagged, no scoreboard data yet
    Shadowed     // running on eval panel; not eligible for production routing
    Confirmed    // scoreboard data passes thresholds; eligible everywhere
    Deprecated   // failing thresholds OR retired by council

4. New telemetry events
    // fired by select() on every selection
    SelectionDecisionEvent {
        intent_caller: Option<&'static str>,  // "repair-loop", "research", …
        task: TaskCategory,
        axes: (u8, u8, u8),                   // (cost, responsiveness, intelligence)
        chosen_model: String,
        reason: SelectionReason,
        timestamp_ms: u64,
    }

    // fired by L1 when a model id appears that wasn't in the prior catalog
    DiscoveryEvent {
        source: DiscoverySource,              // OpenRouter | LiteLLM | Anthropic | Mesh
        model_id: String,
        seen_at_ms: u64,
    }

    // fired by L2 when classifier completes
    ClassificationEvent {
        model_id: String,
        classifier_model: String,             // which LLM classified it
        tier: ModelTier,
        strengths: Vec<StrengthTag>,
        confidence: f32,                      // 0.0–1.0
        timestamp_ms: u64,
    }

    // fired when confidence crosses a state boundary
    ConfidencePromotionEvent {
        model_id: String,
        from: Confidence,
        to: Confidence,
        evidence: PromotionEvidence,          // ScoreboardThreshold | CouncilApproval
        timestamp_ms: u64,
    }

These feed the L3 council report and CR-L8 corpus-feedback flywheel.
5. New CLI surfaces
    vox models discover         # run L1 refresh manually
    vox models classify <ID>    # run L2 classifier on a model id
    vox models shadow <ID>      # run eval-panel against a Provisional model
    vox models council-report   # generate the L3 quarterly markdown

6. Roll-out phases
| Phase | What | Status |
|---|---|---|
| A | SelectionDecisionEvent emit from select() | landed 2026-05-15 |
| B | Migrate registry_model_resolve to wrap select() | landed 2026-05-15 |
| C | Migrate vox-code-audit::default_*_model() to select() | landed 2026-05-15 |
| D | Split pinning from cataloging: model-pins.v1.yaml + rename bootstrap | landed 2026-05-15 |
| E | vox models classify scaffold + classifier prompt schema | landed 2026-05-15 |
| F | Nightly catalog-diff infrastructure + DiscoveryEvent | landed 2026-05-15 |
| G | Shadow-eval hook into llm-panel for Provisional models | landed 2026-05-15 |
Phases E/F/G land as scaffolds — the surfaces, types, and entry points exist and are testable, but real LLM-classifier calls and the cron scheduler are gated behind feature flags until council approves go-live.
7. Backwards compatibility
- bootstrap_inference::* constants kept as last-resort fallbacks; not in the hot path.
- model-catalog.bootstrap.v1.json kept at its filename for one release with a deprecation pointer to catalog-fallback.v1.json.
- resolve_model_with_registry_fallbacks retained as a thin wrapper over select() so older callers don’t break.
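
The thin-wrapper arrangement in the last bullet might look something like the sketch below. Every type and signature here is an assumption made for illustration: the source states only that the legacy function wraps select(), not what either function takes or returns.

```rust
// All types and signatures below are hypothetical stand-ins.

/// Hypothetical stand-in for the selection intent.
struct Intent {
    task: &'static str,
}

/// Hypothetical stand-in for the model registry.
struct Registry {
    pins: Vec<(&'static str, &'static str)>, // (task, model id)
    default_model: &'static str,
}

/// Stub for the single selection path: the pinned model for the task,
/// or the registry default when no pin matches.
fn select(intent: &Intent, registry: &Registry) -> String {
    registry
        .pins
        .iter()
        .find(|(task, _)| *task == intent.task)
        .map(|(_, model)| model.to_string())
        .unwrap_or_else(|| registry.default_model.to_string())
}

/// Legacy entry point kept for older callers. It only adapts its
/// arguments and delegates, so there is one decision path to maintain.
fn resolve_model_with_registry_fallbacks(task: &'static str, registry: &Registry) -> String {
    select(&Intent { task }, registry)
}
```

Because the wrapper carries no logic of its own, both entry points return the same model for the same task.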
8. Open questions
- Classifier model choice. Haiku 4.5 vs Gemini 3.1 Flash-Lite. Both support structured-output JSON schema. Leaning toward Haiku for its prompt-cache TTL during shadow runs.
- Promotion threshold. Currently proposed: 30 successful calls + p50 latency < 2× catalog median + cost telemetry confidence High. Council to ratify before turning Phase F on.
- Mesh peer classification. PopuliMesh entries don’t have an OpenRouter description. Treat them as Confirmed+Local at registration time and let the scoreboard drive demotion.
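
The proposed promotion threshold can be written as a single predicate. This is a sketch of the proposal as stated, not a ratified rule. It assumes "30 successful calls" means at least 30, and that "catalog median" refers to the median p50 latency across the catalog. `ScoreboardRow`, its field names, and the `PricingConfidence` levels other than `High` are hypothetical.

```rust
/// Hypothetical confidence levels for cost telemetry; the proposal names only `High`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PricingConfidence {
    Low,
    Medium,
    High,
}

/// Hypothetical view of one model's scoreboard data.
struct ScoreboardRow {
    successful_calls: u32,
    p50_latency_ms: f64,
    pricing_confidence: PricingConfidence,
}

// Values taken from the proposal; subject to council ratification.
const MIN_SUCCESSFUL_CALLS: u32 = 30;
const MAX_LATENCY_RATIO: f64 = 2.0;

/// True when all three proposed conditions hold at once.
fn meets_promotion_threshold(row: &ScoreboardRow, catalog_median_p50_ms: f64) -> bool {
    row.successful_calls >= MIN_SUCCESSFUL_CALLS
        // Strict inequality, matching "p50 latency < 2× catalog median".
        && row.p50_latency_ms < MAX_LATENCY_RATIO * catalog_median_p50_ms
        && row.pricing_confidence == PricingConfidence::High
}
```

Keeping the numbers in named constants means a council change to the threshold touches one place.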
Document dated 2026-05-15. SSOT for the model-autonomic system; supersedes the bootstrap-as-SSOT model.