Vox Model Autonomic System — L1/L2/L3 Design (2026-Q2)

Council-ratified 2026-05-15. Companion to model-selection-2026-q2.md §8.

The current model pipeline mixes three concerns into one hand-edited file: the bootstrap catalog is simultaneously (a) the cold-start fallback, (b) the SSOT humans edit when a new model ships, and (c) the pin list for reproducibility. Result: every new frontier model release blocks on a manual PR to model-catalog.bootstrap.v1.json, and selection logic is scattered across two parallel paths (select() + resolve_model_with_registry_fallbacks) plus several provider-default constants.

┌───────────────────────────────────────────────────────────────┐
│ L3 Council-review (quarterly + on alert) │
│ • approves Provisional → Confirmed tier promotions │
│ • approves premium_alias rotations │
│ • reads council-report.md (auto-generated) │
└───────────────────────────────────────────────────────────────┘
▲ council-report.md
┌───────────────────────────────────────────────────────────────┐
│ L2 Continuous classification (NEW) │
│ • classifier LLM (Haiku/Flash-tier) consumes │
│ (id, description, param_graph, sample_pricing) │
│ → emits {tier, strengths[], confidence} │
│ • shadow-runs new model on eval panel for N samples │
│ • DB scoreboard converges on success/cost/latency │
│ • confidence ≥ threshold → Provisional → Confirmed │
└───────────────────────────────────────────────────────────────┘
▲ provisional ModelSpec + DB rows
┌───────────────────────────────────────────────────────────────┐
│ L1 Continuous discovery (extension of existing) │
│ OpenRouter /models • LiteLLM pricing • Anthropic /models │
│ • runs on a schedule (nightly cron), not just at startup │
│ • diffs against registry → emits DiscoveryEvent │
└───────────────────────────────────────────────────────────────┘

| Surface | What it does today | How L1/L2/L3 uses it |
|---|---|---|
| OpenRouterCatalog::refresh() | One-shot /models fetch | L1: scheduled refresh |
| LiteLLMCatalog::fetch() | Pricing enrichment | L1: scheduled refresh |
| AnthropicDirectCatalog::refresh() | Key-gated Anthropic catalog | L1: scheduled refresh |
| infer_strengths() (catalog.rs) | Parameter-graph → strengths | L2: prior for classifier |
| ModelRegistry::premium_alias_for() | Task→pin lookup | L3: read from pins.yaml |
| select(intent, registry) | Multi-axis SSOT picker | runtime: unchanged |
| vox-db model_scoreboard | Per-model success/cost/latency | L2: convergence signal |
| vox-db model_pricing_catalog | Telemetry-confirmed pricing | L2: confidence promotion |

3.1 contracts/orchestration/model-pins.v1.yaml

Council-reviewed, infrequently changed. Pin list separate from catalog:

schema: vox.orchestration.pins/v1
premium_alias:
codegen: anthropic/claude-opus-4.7
research: google/gemini-3.1-pro
review: anthropic/claude-sonnet-4.6
planning: openai/gpt-5.5-pro
# …
version_pins:
# When CR-L0 eval-panel reproducibility matters, lock these.
llm-panel.claude-sonnet: claude-sonnet-4-6
llm-panel.gpt-frontier: gpt-5.4
council_signoff:
rotation_id: 2026-Q2-rotation-2
approved_by: [council]
approved_at: 2026-05-15
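
As a minimal sketch of how the pin file above might look once loaded, the following models the two maps and a task-to-pin lookup. The `ModelPins` type, its field names, and both lookup methods are hypothetical (the document only names `ModelRegistry::premium_alias_for()` without giving its signature), and the values are hand-built from the YAML above rather than parsed.

```rust
use std::collections::HashMap;

/// Hypothetical in-memory view of model-pins.v1.yaml.
/// Field names mirror the YAML keys; the real loader and types may differ.
#[derive(Debug, Default)]
pub struct ModelPins {
    pub premium_alias: HashMap<String, String>,
    pub version_pins: HashMap<String, String>,
}

impl ModelPins {
    /// Task -> pinned premium model, or None when the task has no pin.
    pub fn premium_alias_for(&self, task: &str) -> Option<&str> {
        self.premium_alias.get(task).map(String::as_str)
    }

    /// Eval-panel slot -> locked version id, or None when unlocked.
    pub fn version_pin_for(&self, slot: &str) -> Option<&str> {
        self.version_pins.get(slot).map(String::as_str)
    }
}

/// Builds the pins listed in the YAML above by hand.
/// A real implementation would deserialize the file instead.
pub fn example_pins() -> ModelPins {
    let mut pins = ModelPins::default();
    for (task, model) in [
        ("codegen", "anthropic/claude-opus-4.7"),
        ("research", "google/gemini-3.1-pro"),
        ("review", "anthropic/claude-sonnet-4.6"),
        ("planning", "openai/gpt-5.5-pro"),
    ] {
        pins.premium_alias.insert(task.to_string(), model.to_string());
    }
    for (slot, version) in [
        ("llm-panel.claude-sonnet", "claude-sonnet-4-6"),
        ("llm-panel.gpt-frontier", "gpt-5.4"),
    ] {
        pins.version_pins.insert(slot.to_string(), version.to_string());
    }
    pins
}
```

Returning `Option` keeps the "no pin for this task" case explicit, so a caller can fall through to `select()` rather than receive a silent default.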

3.2 contracts/orchestration/catalog-fallback.v1.json

What model-catalog.bootstrap.v1.json becomes after the rename: a minimal emergency-offline subset (~5–10 models) covering each tier, used only when L1 discovery has never succeeded. Not the SSOT.
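
The "only when L1 discovery has never succeeded" rule can be sketched as a single branch. Everything named here (`CatalogState`, `CatalogSource`, `active_catalog`, and the idea of tracking the last successful discovery as a timestamp) is an assumption made for illustration, not the registry's actual shape.

```rust
/// Which catalog the registry is currently serving from (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CatalogSource {
    Discovered,
    Fallback,
}

/// Hypothetical registry state: model ids only, for brevity.
pub struct CatalogState {
    /// None means L1 discovery has never succeeded.
    pub last_successful_discovery_ms: Option<u64>,
    pub discovered: Vec<String>,
    pub fallback: Vec<String>,
}

/// Serves the emergency-offline subset only when discovery has never
/// succeeded; any earlier success, however old, keeps the discovered catalog.
pub fn active_catalog(state: &CatalogState) -> (CatalogSource, &[String]) {
    match state.last_successful_discovery_ms {
        Some(_) => (CatalogSource::Discovered, state.discovered.as_slice()),
        None => (CatalogSource::Fallback, state.fallback.as_slice()),
    }
}
```

Under this reading a stale discovered catalog still wins over the fallback file, which is what keeps the fallback from drifting back into an SSOT role.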

Each ModelSpec carries a confidence: Confidence field with one of four states:

Provisional // discovered, classifier-tagged, no scoreboard data yet
Shadowed // running on eval panel; not eligible for production routing
Confirmed // scoreboard data passes thresholds; eligible everywhere
Deprecated // failing thresholds OR retired by council
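
One way to read the four states is as a small lifecycle with a routing gate. In the sketch below the variant names come from the list above, but `eligible_for_production` and `can_move_to` are hypothetical helpers. The list only states that Shadowed is ineligible and Confirmed is eligible everywhere, so treating Provisional and Deprecated as ineligible, and the specific set of allowed transitions, are assumptions.

```rust
/// Lifecycle states as listed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Confidence {
    Provisional, // discovered, classifier-tagged, no scoreboard data yet
    Shadowed,    // running on eval panel; not eligible for production routing
    Confirmed,   // scoreboard data passes thresholds; eligible everywhere
    Deprecated,  // failing thresholds OR retired by council
}

impl Confidence {
    /// Routing gate. Only Confirmed is stated to be eligible;
    /// excluding Provisional and Deprecated is an assumption.
    pub fn eligible_for_production(self) -> bool {
        matches!(self, Self::Confirmed)
    }

    /// Hypothetical transition table: forward promotion, plus
    /// deprecation from any live state. Deprecated is terminal here.
    pub fn can_move_to(self, next: Confidence) -> bool {
        matches!(
            (self, next),
            (Self::Provisional, Self::Shadowed)
                | (Self::Provisional, Self::Confirmed)
                | (Self::Shadowed, Self::Confirmed)
                | (Self::Provisional, Self::Deprecated)
                | (Self::Shadowed, Self::Deprecated)
                | (Self::Confirmed, Self::Deprecated)
        )
    }
}
```

Keeping the gate as a method on the enum means a picker can filter candidates without knowing how a model reached its state.
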
// fired by select() on every selection
SelectionDecisionEvent {
intent_caller: Option<&'static str>, // "repair-loop", "research", …
task: TaskCategory,
axes: (u8, u8, u8), // (cost, responsiveness, intelligence)
chosen_model: String,
reason: SelectionReason,
timestamp_ms: u64,
}
// fired by L1 when a model id appears that wasn't in the prior catalog
DiscoveryEvent {
source: DiscoverySource, // OpenRouter | LiteLLM | Anthropic | Mesh
model_id: String,
seen_at_ms: u64,
}
// fired by L2 when classifier completes
ClassificationEvent {
model_id: String,
classifier_model: String, // which LLM classified it
tier: ModelTier,
strengths: Vec<StrengthTag>,
confidence: f32, // 0.0–1.0
timestamp_ms: u64,
}
// fired when confidence crosses a state boundary
ConfidencePromotionEvent {
model_id: String,
from: Confidence,
to: Confidence,
evidence: PromotionEvidence, // ScoreboardThreshold | CouncilApproval
timestamp_ms: u64,
}

These feed the L3 council report and CR-L8 corpus-feedback flywheel.
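
The L1 diff that produces a DiscoveryEvent can be sketched as a set difference between the prior catalog and a freshly fetched id list. The struct fields follow the definition above; `diff_catalog`, its parameters, and the choice to de-duplicate repeated ids within one fetch are assumptions for illustration.

```rust
use std::collections::BTreeSet;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DiscoverySource {
    OpenRouter,
    LiteLLM,
    Anthropic,
    Mesh,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DiscoveryEvent {
    pub source: DiscoverySource,
    pub model_id: String,
    pub seen_at_ms: u64,
}

/// Emits one event per model id that is present in `fetched` but absent
/// from `prior`. Ids repeated within a single fetch yield one event
/// (an assumption; the source does not say how duplicates are handled).
pub fn diff_catalog(
    prior: &BTreeSet<String>,
    fetched: &[String],
    source: DiscoverySource,
    now_ms: u64,
) -> Vec<DiscoveryEvent> {
    let mut seen: BTreeSet<&str> = BTreeSet::new();
    let mut events = Vec::new();
    for id in fetched {
        // `insert` returns false when the id was already seen in this fetch.
        if !prior.contains(id.as_str()) && seen.insert(id.as_str()) {
            events.push(DiscoveryEvent {
                source,
                model_id: id.clone(),
                seen_at_ms: now_ms,
            });
        }
    }
    events
}
```

Passing `now_ms` in rather than reading the clock inside the function keeps the diff deterministic, which makes the nightly job straightforward to test.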

vox models discover # run L1 refresh manually
vox models classify <ID> # run L2 classifier on a model id
vox models shadow <ID> # run eval-panel against a Provisional model
vox models council-report # generate the L3 quarterly markdown

| Phase | What | Status |
|---|---|---|
| A | SelectionDecisionEvent emit from select() | landed 2026-05-15 |
| B | Migrate registry_model_resolve to wrap select() | landed 2026-05-15 |
| C | Migrate vox-code-audit::default_*_model() to select() | landed 2026-05-15 |
| D | Split pinning from cataloging: model-pins.v1.yaml + rename bootstrap | landed 2026-05-15 |
| E | vox models classify scaffold + classifier prompt schema | landed 2026-05-15 |
| F | Nightly catalog-diff infrastructure + DiscoveryEvent | landed 2026-05-15 |
| G | Shadow-eval hook into llm-panel for Provisional models | landed 2026-05-15 |

Phases E/F/G land as scaffolds — the surfaces, types, and entry points exist and are testable, but real LLM-classifier calls and the cron scheduler are gated behind feature flags until council approves go-live.

  • bootstrap_inference::* constants kept as last-resort fallbacks; not in the hot path.
  • model-catalog.bootstrap.v1.json kept at its filename for one release with a deprecation pointer to catalog-fallback.v1.json.
  • resolve_model_with_registry_fallbacks retained as a thin wrapper over select() so older callers don’t break.
  • Classifier model choice. Haiku 4.5 vs Gemini 3.1 Flash-Lite. Both support structured-output JSON schema. Leaning toward Haiku for its prompt-cache TTL during shadow runs.
  • Promotion threshold. Currently proposed: 30 successful calls + p50 latency < 2× catalog median + cost telemetry confidence High. Council to ratify before turning F on.
  • Mesh peer classification. PopuliMesh entries don’t have an OpenRouter description. Treat them as Confirmed+Local at registration time and let scoreboard drive demotion.
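
The proposed promotion threshold can be written as a single predicate, sketched below. The three conditions come from the proposal above; the type and function names are hypothetical, as are the `Low` and `Medium` confidence levels (only `High` is named). Reading "30 successful calls" as "at least 30" and "catalog median" as the median p50 latency across the catalog are interpretations, not ratified values.

```rust
/// Cost-telemetry confidence. Only `High` appears in the proposal;
/// the other levels are placeholders.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CostConfidence {
    Low,
    Medium,
    High,
}

/// Hypothetical slice of a model_scoreboard row.
pub struct ScoreboardRow {
    pub successful_calls: u32,
    pub p50_latency_ms: f64,
    pub cost_confidence: CostConfidence,
}

/// Proposed values, pending council ratification.
pub const MIN_SUCCESSFUL_CALLS: u32 = 30;
pub const LATENCY_MULTIPLIER: f64 = 2.0;

/// True when all three proposed conditions hold:
/// enough successful calls, p50 latency strictly below 2x the
/// catalog median, and cost telemetry confidence of High.
pub fn meets_promotion_threshold(row: &ScoreboardRow, catalog_median_p50_ms: f64) -> bool {
    row.successful_calls >= MIN_SUCCESSFUL_CALLS
        && row.p50_latency_ms < LATENCY_MULTIPLIER * catalog_median_p50_ms
        && row.cost_confidence == CostConfidence::High
}
```

Keeping the limits as named constants means a ratified change touches one place, and the predicate stays a pure function of the scoreboard row.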

Document dated 2026-05-15. SSOT for the model-autonomic system; supersedes the bootstrap-as-SSOT model.