Vox v1.0 LLM-Target Implementation Plan (2026)
Vox v1.0 LLM-Target Implementation Plan
Section titled “Vox v1.0 LLM-Target Implementation Plan”Companion to:
v1-release-criteria.md— defines CR-L0..CR-L8.vox-as-llm-target-audit-and-plan-2026.md— audit, gaps, attainability verdict, self-critique §9.
This doc moves from “what we measure” to “how we get there.” It exists because §9.6 of the audit names what the audit-doc-as-input-to-planning was missing: measurement specification, fixture-corpus budget, owners, dependency graph, CI subcommand contract, risk register, and rollback policy.
Scope of this plan. CR-L0..CR-L8 only. Mesh phases, vox-language-rules phases, agentic-VCS phases, and MENS distributed-training are tracked under their own plans and only referenced where a CR-L depends on them.
Timeline assumption. Today is 2026-05-15. Council ratification 2026-05-15 (D11=a, corpus absorbed into team at ~1 day/week per track) accepts a ~2-month slip risk on P3. Honest revised target: v1.0 GA Q1-2027 (was end-2026). Two-track parallelism is required; serial execution does not fit.
Council ratification 2026-05-15. All 25 council-decisions surfaced in this plan + sibling manifests (D1..D25) ratified by council on 2026-05-15. Decision log appended as §8.1 Ratification Log. Tier-1 ratifications (D1 reconciliation, D2 realistic bars, D3 two-track owner model) are operational immediately; Tier-2 (D11=a corpus path) shifts the P3 timeline as noted above.
§1 Phasing — P0 through P5
Section titled “§1 Phasing — P0 through P5”Five sequenced phases. Each phase lists its gate — the condition for moving to the next phase. Phases overlap; the gate is a logical predicate, not a calendar boundary.
§1.1 P0 — Prep (weeks 1–2, target close: 2026-05-29)
Section titled “§1.1 P0 — Prep (weeks 1–2, target close: 2026-05-29)”Goal. Land the prerequisites without which no CR-L can be measured.
| Task | Owner role | Effort | Gates |
|---|---|---|---|
P0.1 Define contracts/marquee/manifest.v1.yaml listing 3–5 Marquee apps with feature inventory, fixture path, expected deploy/run behavior. | DevOps lead | 3 days | Closes G3. |
P0.2 Define contracts/eval/ directory structure — subdirs humaneval-vox/, repair-corpus/, plan-fidelity/, spec-to-app/. README per subdir. | Corpus eng | 2 days | Closes G4 prereq. |
P0.3 Specify reference-LLM panel: {MENS-current, Claude-Sonnet-current, GPT-current}. Document version-pin policy + cost rate-card. | Council liaison | 2 days | Closes G5. |
P0.4 Draft vox audit <thing> CI contract (§4 below). One short doc; not yet implemented. | Language platform lead | 3 days | Closes G17. |
| P0.5 Council review of CR-D ↔ CR-L reconciliation notes added 2026-05-15. | Council | 1 week | Closes G2. |
P0.6 Fix the stale AGENTS.md §Grammar Unification claim that workflow/activity are “fully supported.” Replace with truthful “reserved per ADR-028.” | Language platform lead | 1 hour | Closes the immediate LLM-target footgun named in audit §3.5 + §7 OQ-4. |
| P0.7 Assign owners across roles to CR-L0..CR-L8 (filling the role slots used throughout this plan). | Council | 1 week | Closes G13. |
P0 gate. Marquee manifest exists; eval directories exist; reference-LLM panel pinned; CI contract drafted; council has approved reconciliation; AGENTS.md drift fixed; owners assigned. Approximately 2 calendar weeks.
§1.2 P1 — Quick wins (weeks 3–6, target close: 2026-06-26)
Section titled “§1.2 P1 — Quick wins (weeks 3–6, target close: 2026-06-26)”Goal. Land the cheap CR-L items that unblock measurement.
| Task | Owner role | Effort | Closes |
|---|---|---|---|
P1.1 Flip OrchestratorConfig::agentos_aci_envelope_enabled default to true. Add deprecation warning for explicit false. Migration shim emits guidance to any consumer reading the old shape. | Agent infra lead | 1 week | CR-L5 |
P1.2 Ship 5 of ~10 retirement-guard detectors from AGENTS.md §Retired Surfaces — start with @component fn, @server fn, @query fn, @mutation fn, @py.import. Each is one vox-code-audit detector following the anonymous_error.rs template. | Language platform lead | 2 weeks | CR-L6 (50%) |
P1.3 Generate contracts/retirement/retired-surfaces.v1.yaml from AGENTS.md §Retired Surfaces. Ship vox ci retirement-audit CI gate asserting every row has a wired detector and vice versa. | Language platform lead | 1 week | CR-L6 parity gate |
P1.4 Land remaining 5 retirement-guard detectors (recall(), @capacitor/*, axum::serve, rust-embed, vox-sherpa-transcribe, TURSO env var aliases). | Language platform lead | 1 week | CR-L6 (100%) |
P1.5 Implement vox audit umbrella CLI dispatching to per-thing subcommands; ship the contract from §4. Even stub subcommands return the right shape. | Language platform lead | 1 week | Closes G17 implementation. |
P1 gate. CR-L5 and CR-L6 measurably shipped. vox audit umbrella running with stub subcommands. Approximately 4 calendar weeks.
§1.3 P2 — Measurement infrastructure (weeks 5–12, parallel with P1)
Section titled “§1.3 P2 — Measurement infrastructure (weeks 5–12, parallel with P1)”Goal. Make CR-L1, CR-L2, CR-L4, CR-L8 measurable. (Bar achievement comes in P4.)
| Task | Owner role | Effort | Closes |
|---|---|---|---|
P2.1 Ship vox.lint.* + vox.repair.* telemetry collectors emitting structured events with rule-id, severity, source-span, autofix-state, repair-outcome. | Runtime/repair lead | 2 weeks | CR-L8 collector |
P2.2 Ship the quarterly export job (Phase 4 Task 7 of vox-language-rules-phase4-runtime-monitors-2026.md). Emits contracts/reports/corpus-feedback/<quarter>.json. | Runtime/repair lead | 2 weeks | CR-L8 export |
P2.3 CI gate: contracts/reports/corpus-feedback/ artifact older than 90 days fails CI. | Runtime/repair lead | 2 days | CR-L8 gate |
P2.4 Build vox audit humaneval runner: invokes reference-LLM panel against contracts/eval/humaneval-vox/*.spec.toml, scores compile + test-pass, emits report. | Corpus eng + Lang platform lead | 3 weeks | CR-L1 harness |
P2.5 Build vox audit mens-on-distribution runner: samples MENS emissions against the full lint/audit/retirement suite, emits rate. | Corpus eng + Lang platform lead | 1 week (reuses P2.4) | CR-L2 harness |
P2.6 Build vox audit plan-fidelity runner: drives orchestrator plan-mode against contracts/eval/plan-fidelity/*.plan.toml, scores success per fixture’s stated criteria. | Agent infra lead | 3 weeks | CR-L4 harness |
P2.7 Build vox audit repair-corpus runner: drives vox repair . against contracts/eval/repair-corpus/<project>/, scores final-state pass/fail per fixture’s test suite. | Runtime/repair lead | 3 weeks | CR-L3 harness |
| P2.8 Reproducibility policy: pin temperature 0.0 + seed for all measurement runs; CR-L3 counts majority-success over 5 attempts per fixture. Documented in eval README. | Council liaison | 2 days | Closes G6. |
P2 gate. Five vox audit subcommands implemented and producing reports against placeholder fixtures. Approximately 7 calendar weeks (overlaps P1).
§1.4 P3 — Corpus engineering (weeks 8–20, parallel with P2/P4)
Section titled “§1.4 P3 — Corpus engineering (weeks 8–20, parallel with P2/P4)”Goal. Land the fixture corpora at sufficient quality and size for CR-L bars to be measured meaningfully.
| Task | Owner role | Effort | Closes |
|---|---|---|---|
P3.1 Mine examples/golden/** for HumanEval-Vox candidates. Spec out 164 problems (anchored to HumanEval-Python’s count for direct comparability — see G18). Each problem = name.spec.toml (input prompt + reference solution + test cases). | Corpus eng | 6 weeks | CR-L1 fixtures |
P3.2 Held-out subset of humaneval-vox/: explicitly mark 30 problems as training_eligible: false and verify MENS training pipeline excludes them. CI check in vox-corpus. | Corpus eng + MENS lead | 1 week (overlaps P3.1) | Closes omission #4 (corpus contamination). |
P3.3 Build 50 repair-corpus/ projects: each is a multi-file Vox project with deliberately introduced bugs (compile errors / type errors / test-failing logic / effect-row violations). Author with mechanical mutators where possible; hand-curate quality bar. | Corpus eng | 6 weeks | CR-L3 fixtures (50% — minimum viable) |
P3.4 Build 50 plan-fidelity/ fixtures: each a multi-step plan in plan.toml with success criteria (e.g., “produces a PR that satisfies test X”). Source from real orchestrator session transcripts where possible. | Agent infra lead + Corpus eng | 4 weeks | CR-L4 fixtures (50%) |
P3.5 Build Marquee app fixtures: 3 apps living in apps/marquee/{todo-crud, chat, dashboard}/ with manifest registration. Each compiles, deploys, and runs in CI today. | DevOps lead + DX | 4 weeks | CR-L7 fixtures + closes G3 content |
P3.6 Build 10 spec-to-app/ fixtures: English spec → expected app shape. Spec format: {name, prompt, success_criteria, max_cost_usd, max_iterations}. Range from “todo CRUD” through “dashboard with auth.” | Corpus eng + Agent infra lead | 4 weeks | CR-L0 fixtures |
P3.7 Eval-data versioning policy: each contracts/eval/<thing>/manifest.v1.yaml carries a content hash; rate measurements are pinned to the hash. Re-measurement on hash change is mandatory and logged. | Corpus eng | 1 week | Closes G18 (sub-issue: versioning). |
P3 gate. 350+ fixtures landed across 5 corpora at v1-ready quality. Approximately 12 calendar weeks.
§1.5 P4 — Hard CR-L work (weeks 12–28, parallel with P3 tail)
Section titled “§1.5 P4 — Hard CR-L work (weeks 12–28, parallel with P3 tail)”Goal. Achieve the bars. Most P4 tasks start once P2 measurement infra is up and P3 fixtures are at minimum-viable size (50%).
| Task | Owner role | Effort | Closes |
|---|---|---|---|
P4.1 Extend vox-cli/src/commands/repair.rs to project scope: walk the project, collect cross-file diagnostics, propose a coordinated fix set, apply with per-file budget, re-check at project level. Reuse single-file vox repair as the inner loop. | Runtime/repair lead | 8 weeks | CR-L3 (multi-file capability) |
P4.2 Iterate vox repair . against repair-corpus/ until ≥ 70% project pass + ≥ 90% single-file aim. Twiddle prompt, attempt budget, fix-application order. | Runtime/repair lead | 4 weeks (overlap with P4.1) | CR-L3 bar |
P4.3 Iterate plan-mode (plan_loop.rs) against plan-fidelity/ until ≥ 85% success. Most changes in chat_tools/plan_loop.rs and the underlying orchestrator dispatch. | Agent infra lead | 4 weeks | CR-L4 bar |
P4.4 Ship vox new (extends phase1-build-targets-spec-2026.md vox init --kind work). | DevOps lead + DX | 3 weeks | CR-L7 part 1 |
P4.5 Ship vox deploy with structured JSON output, vox.deploy.* telemetry, OCI publish path. Reference: vox-deploy-codegen crate is the codegen surface; needs CLI wiring + a default deployment target story (probably Fly.io or Railway for v1.0). | DevOps lead | 6 weeks | CR-L7 part 2 |
P4.6 Ship vox doctor — top-level health check that runs vox check, vox audit retirement, vox audit on-distribution if a model is configured, validates installed runtime + CLI consistency. JSON output. | DevOps lead + Lang platform | 2 weeks | CR-L7 part 3 |
P4.7 CI integration test: vox new web → vox deploy → vox doctor on each Marquee app fixture, under the 120-second [CR-P3] budget. | DevOps lead | 2 weeks | CR-L7 gate |
P4.8 Build the CR-L0 agent loop driver: a vox audit spec-to-app subcommand that, per spec, spawns an autonomous agent (MCP-driven) with a fixed system prompt and lets it iterate against the reference-LLM panel. Cost-meters every LLM call. Reports per-spec outcome (final-state pass/fail + cost + iteration count). | Agent infra lead + Runtime/repair lead | 8 weeks | CR-L0 harness |
P4.9 Iterate CR-L0 loop against spec-to-app/ until ≥ 60% pass at ≤ $5/spec median cost. This iteration touches everything — orchestrator prompts, repair loop, plan-mode, retirement guards, generated-code quality. CR-L0 is the integration test; tuning it shakes out latent gaps in all other CR-L’s. | All leads | Continuous through P4 | CR-L0 bar |
| P4.10 Measure CR-L1 / CR-L2 against the final corpora. These should “fall out” once P2.4/P2.5 are running and P3.1/P3.2 are at full size. | Corpus eng | 1 week per measurement run | CR-L1, CR-L2 bars |
P4 gate. CR-L0..CR-L8 each have a measured number recorded in contracts/reports/<thing>/2026-Q4.json. Approximately 16 calendar weeks.
§1.6 P5 — Hardening (weeks 24–32)
Section titled “§1.6 P5 — Hardening (weeks 24–32)”Goal. Bar-or-demote decisions; release.
| Task | Owner role | Effort | |
|---|---|---|---|
P5.1 Run all eight vox audit subcommands quarterly + on every release-candidate tag. Lock the report shape; ensure release notes auto-cite the report. | Lang platform lead | 1 week | |
| P5.2 Per-CR-L bar-or-demote decision per §6 policy below. Council reviews actual numbers. | Council | 2 weeks | |
P5.3 Update AGENTS.md §Retired Surfaces, docs/agents/vox-language-surface.v1.json (to be created in P5.3), .well-known/llms.txt (to be created in P5.3), and cli-command-surface.generated.md to reflect v1.0 reality. | Lang platform lead | 1 week | |
| P5.4 Publish v1.0 with full CR table including measured numbers (not just bars), CR-L0 result called out prominently. | Release manager | — |
P5 gate. v1.0 GA.
§2 Dependency DAG
Section titled “§2 Dependency DAG” P0 (prep) / \ v v CR-L5 CR-L6 ← P1 quick wins (independent)
┌─ vox-language-rules │ Phase 2 detector framework v ┌─ CR-L6 detectors (P1.2–P1.4) v AGENTS.md retirement contract │ v ┌── vox-language-rules Phase 4 Task 7 │ (telemetry export) vCR-L8 telemetry pipeline (P2.1–P2.3) │ ├──────────────────┐ v vCR-L1 humaneval CR-L2 on-distributionharness (P2.4) harness (P2.5) │ │ v vCR-L1 fixtures CR-L2 measurement run164 problems (depends on CR-L1(P3.1, P3.2) fixtures) │ └──> CR-L1 bar (P4.10)
┌─ plan_loop.rs (existing) v CR-L4 harness (P2.6) │ v plan-fidelity fixtures (P3.4) │ v CR-L4 bar (P4.3)
┌─ repair.rs (existing single-file) v CR-L3 harness (P2.7) │ v repair-corpus fixtures (P3.3) │ v project-scope repair (P4.1, P4.2) → CR-L3 bar
┌─ phase1-build-targets (vox init) │ v vox new (P4.4) │ v vox deploy (P4.5) ──┐ │ │ v │ vox doctor (P4.6) ──┤ v Marquee fixtures (P3.5) │ v CR-L7 CI integration test (P4.7)
┌── all of the above ──┐ v v spec-to-app fixtures agent loop driver (P3.6) (P4.8) \ / \ / v v CR-L0 bar (P4.9) ★ Integration testCritical-path insight. CR-L0 sits at the bottom of the DAG. Every other CR-L feeds into it. CR-L0 cannot be tuned to its 60% bar until CR-L1..CR-L8 each contribute. Conversely, CR-L0 results reveal which sub-CR-L is the weakest link — making CR-L0 the most efficient debugging surface during P4.
Parallelism analysis.
- P0 ⊥ everything (sequential prereq).
- P1.1 ⊥ P1.2 ⊥ P1.3 (independent quick wins).
- P2 ⊥ P3 (measurement infra vs corpus engineering) — fully parallel.
- P4.1+P4.2 (CR-L3) ⊥ P4.3 (CR-L4) ⊥ P4.4-P4.7 (CR-L7) — three parallel tracks.
- P4.8+P4.9 (CR-L0) depends on all three above being at least at minimum-viable.
Minimum staffing. Three working tracks:
- Track A (Lang platform): CR-L1, CR-L2, CR-L6, CR-L8 infra
- Track B (Runtime/repair + Agent infra): CR-L3, CR-L4, CR-L0
- Track C (DevOps + DX): CR-L7, Marquee fixtures
Plus a Corpus engineer floating across A/B/C.
§3 Fixture-Corpus Budget
Section titled “§3 Fixture-Corpus Budget”Total: ~340 high-quality fixtures across 5 corpora, ~16 person-weeks of corpus engineering. This is the single largest invisible cost in the audit doc and the most likely scope-slip vector.
| Corpus | Count | Per-fixture cost (hours) | Total (hours) | Mining shortcuts |
|---|---|---|---|---|
humaneval-vox/ | 164 | 2 (mostly mechanical from examples/golden/) | 328 | High — existing @example blocks + mechanical mutators |
humaneval-vox/ held-out | (30 of the 164) | +1 hour to verify exclusion from MENS training | 30 | Just a flag flip + CI check |
repair-corpus/ | 50 | 8 (each is a multi-file project with introduced bugs that compile-or-test-fail) | 400 | Medium — ast_mutator from vox-corpus can introduce bugs at scale; quality bar still requires hand review |
plan-fidelity/ | 50 | 4 (each is a plan + success criteria + reference outcome) | 200 | Medium — mine real orchestrator transcripts |
marquee/ apps | 3–5 | 80 (each is a working app with tests + deploy config) | 320 | Low — these are real apps |
spec-to-app/ | 10 | 16 (each is a curated English spec with success criteria) | 160 | Low — depends on Marquee apps existing first |
| Total | ~280 | ~1438 hours |
Conversion. At ~30 productive hours/week per engineer, ~48 person-weeks. Across 1 corpus engineer + 0.25 FTE help from other leads, ~12 calendar weeks if started 2026-06-01.
Council ratification 2026-05-15 (D11=a, “corpus absorbed into team”). No dedicated corpus engineer hired or contracted. Corpus work absorbed into both project tracks at ~1 day/week per track (~16 hours/week combined). At ~16 hours/week against ~1438 hours, P3 stretches from ~12 to ~24 calendar weeks (≈ 6 months). The council explicitly accepts a ~2-month slip risk on P3; honest accounting puts it closer to 4 months relative to the original 12-week plan. v1.0 GA target accordingly revises from end-2026 to Q1-2027.
Owner (post-ratification). Both tracks share corpus responsibility. Compiler/Lang track owns humaneval-vox/ and repair-corpus/ fixtures; Agent/Runtime/CLI/Corpus track owns plan-fidelity/, spec-to-app/, and the Marquee app fixtures.
Held-out subset rationale. 30 of the 164 HumanEval-Vox problems must be training_eligible: false and verified as never having been ingested by the MENS training pipeline. This is the corpus-contamination guard (audit doc omission #4). Without it, the 80% CR-L1 number is leaked-evaluation marketing.
Versioning. Each contracts/eval/<thing>/manifest.v1.yaml carries a corpus_hash: <blake3> derived from the sorted content hashes of its fixtures. CR-L reports always cite the corpus hash they measured against. Adding/removing/editing a fixture requires a new hash and a re-measurement run with both old and new hash reported.
§4 CI Contract — vox audit <thing>
Section titled “§4 CI Contract — vox audit <thing>”Single contract for all CR-L measurement subcommands.
§4.1 Surface
Section titled “§4.1 Surface”vox audit <thing> [--json | --markdown | --html] [--baseline=<path>] [--threshold=<num>] [--corpus=<path>] [--llm-panel=<spec>]§4.2 Behavior
Section titled “§4.2 Behavior”- Default output.
--json. Single JSON object printed to stdout. - Report file. Always written to
contracts/reports/<thing>/<date>.json, regardless of stdout format. Atomic write. - Exit codes.
0— measurement complete, bar met (if--thresholdwas set), or no threshold set.1— measurement complete, bar not met.2— infrastructure error (corpus missing, LLM panel unreachable, etc.). Does not block CI on its own; logs to telemetry and skips.
- Telemetry. Every run emits
vox.audit.<thing>event with: corpus hash, llm panel version, duration, outcome, cost (if applicable). - Baseline comparison.
--baseline=<path>reads a prior report and emits delta in the result.
§4.3 Report JSON shape (canonical)
Section titled “§4.3 Report JSON shape (canonical)”{ "thing": "humaneval-vox", "schema_version": 1, "measured_at": "2026-Q4-2026-12-15T12:00:00Z", "corpus_hash": "blake3:abc123...", "corpus_size": 164, "llm_panel": [ {"id": "mens-current", "version": "v0.6.1"}, {"id": "claude-sonnet", "version": "claude-sonnet-4-7-20260801"} ], "reproducibility": {"temperature": 0.0, "seed": 42, "attempts_per_fixture": 5}, "results": { "overall_pass_rate": 0.84, "median_pass_rate": 0.83, "per_llm": [ {"id": "mens-current", "pass_rate": 0.81, "median_cost_usd": 0.02}, {"id": "claude-sonnet", "pass_rate": 0.88, "median_cost_usd": 0.18} ] }, "threshold": {"target": 0.80, "met": true}, "delta_vs_baseline": {"baseline_hash": "blake3:def456...", "absolute": 0.03, "relative_pct": 3.7}}§4.4 Subcommand inventory
Section titled “§4.4 Subcommand inventory”| Subcommand | CR-L | Notes |
|---|---|---|
vox audit spec-to-app | CR-L0 | Cost-metered. Required for v1.0 GA. |
vox audit humaneval | CR-L1 | Median-of-panel scoring. |
vox audit mens-on-distribution | CR-L2 | Single-model (MENS-current). |
vox audit repair-corpus | CR-L3 | Majority-of-5 attempts. |
vox audit plan-fidelity | CR-L4 | Reuses orchestrator MCP. |
vox audit aci-default | CR-L5 | Boolean check; returns immediately. |
vox audit retirement | CR-L6 | Reads contracts/retirement/retired-surfaces.v1.yaml and verifies wiring. |
vox audit deploy | CR-L7 | Drives vox new → vox deploy → vox doctor on Marquee fixtures. |
vox audit corpus-feedback | CR-L8 | Checks artifact age + shape. |
Implementation footprint. A single vox-audit crate with a Subcommand trait. Each CR-L registers an impl. CI runs vox audit all which fans out and emits a combined report.
§5 Risk Register (CR-L Specific)
Section titled “§5 Risk Register (CR-L Specific)”| # | Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|---|
| R1 | Corpus contamination. MENS training ingested examples/golden/; HumanEval-Vox built from same source = leaked eval. | High | Severe — measurement becomes marketing | Held-out subset of 30 problems with training_eligible: false; CI gate verifies exclusion (P3.2). | Corpus eng + MENS lead |
| R2 | Reference-LLM rate limits / cost during eval runs. Running CR-L0 against 10 specs × 5 attempts × panel-of-3 = 150 sessions per release-candidate. | Medium | Major — CI breaks or budget overruns | Cache identical-input responses; rate-limit retry budget per spec; alert if monthly LLM cost exceeds ceiling. | DevOps lead |
| R3 | Doc-drift recurrence (AGENTS.md vs pipeline.rs again). | Medium | Major — re-introduces the LLM-target footgun | CR-L6’s vox ci retirement-audit gate (P1.3). Plus a quarterly manual audit of AGENTS.md vs codebase per docs-reality-audit-program. | Council liaison |
| R4 | Council rejects CR-D ↔ CR-L reconciliation. | Low | Major — full plan must be re-shaped | P0.5 council review block; resolve before P1 starts. | Council |
| R5 | Marquee-app definition disagreement. Three apps too few? Five too many? Wrong feature mix? | Medium | Major — CR-L7 + CR-P1 unverifiable until resolved | P0.1 publishes manifest for review; council approves before P3.5 starts. | DevOps lead + Council |
| R6 | CR-L0 measured at 40–60%. Either we ship at the lower bar (and dilute the v1.0 claim) or we slip GA. | High | Severe — central marketing claim depends on it | Sub-bar at 40% (block GA per §6 below); reserve 4 weeks of buffer in P4 for CR-L0 iteration; treat CR-L0 as continuous-tuning during P4. | All leads |
| R7 | ACI default-on (CR-L5) breaks existing IDE consumers (Cursor / Antigravity / others reading old shape). | Medium | Moderate — release-note migration cost | Deprecation warning for one minor version before enforce; coordinate with downstream IDE vendors via the agent-feature-matrix channel. | Agent infra lead |
| R8 | vox audit subcommand sprawl. Each CR-L’s runner is built by a different lead → inconsistent shapes. | Medium | Moderate — debugging cost amplifies | P0.4 + P1.5 enforce single contract; PR template requires conforming to §4 shape; CI gate validates report JSON against schema. | Lang platform lead |
| R9 | Repair-corpus rot. Real-world bugs don’t match the 50 hand-crafted fixtures; CR-L3 70% on the corpus ≠ 70% on real broken projects. | Medium | Moderate — measurement validity | Quarterly refresh of 5 fixtures with bug patterns mined from vox.repair.* telemetry (CR-L8 feedback closes this loop). | Runtime/repair lead |
| R10 | Latency/cost ceilings not budgeted into CI compute. | High | Major — CI either breaks or runs only manually | CR-L0 cost-ceiling ($5/spec) bounds per-run cost; CI compute budget per release-candidate quantified in P0.3 LLM panel doc. | DevOps lead |
| R11 | Fixture-corpus engineering slips. No owner named → invisible work → 2+ month slip. | High | Severe — entire v1.0 timeline | Name owner in P0.7; treat as blocking-prereq; weekly progress check during P3. | Council + DevOps lead |
| R12 | MENS reaches v0.6 with poor on-distribution rate (CR-L2 < 95%). CR-L0 caps at MENS’s CR-L2. | Medium | Major — CR-L0 hits ceiling unrelated to language design | Run CR-L0 measurement against full panel; if MENS is the weakest panelist, demote MENS to “candidate” tier and report panel-median as primary number. | MENS lead |
| R13 | Council bandwidth. P0.5 + P5.2 + open-question resolution all require council. | Medium | Moderate — phase gates slip | Bundle council asks into 2 fixed reviews (start of P0, end of P4). Async-decide via the Foundation’s existing rubric. | Council liaison |
§6 Rollback / Demotion Policy
Section titled “§6 Rollback / Demotion Policy”For each CR-L, define what happens if the bar isn’t met by GA.
| CR-L | Bar | Sub-bar (block GA) | If measured between bar and sub-bar | If measured below sub-bar |
|---|---|---|---|---|
| CR-L0 | ≥ 60% pass at ≤ $5/spec | < 40% pass or > $10/spec median | Ship at observed rate; flag prominently in release notes; immediate v1.1 plan to close gap | Block GA. This is the integration test for the v1.0 claim. |
| CR-L1 | ≥ 80% HumanEval-Vox | < 60% | Ship at observed rate; release note explains panel median; quarterly re-measurement gate | Demote to v1.1; ship v1.0 without the claim |
| CR-L2 | ≥ 95% on-distribution | < 85% | Ship at observed rate; corpus-curation plan filed | Demote to v1.1 |
| CR-L3 | ≥ 70% project / ≥ 90% single-file | < 50% project | Ship at observed; CR-D2 90% number stays as the aspirational reference | Demote project-scope to v1.1; ship single-file only |
| CR-L4 | ≥ 85% plan fidelity | < 65% | Ship at observed; iterate plan-mode prompts in v1.0.x patches | Demote CR-D1 measurement to v1.1 |
| CR-L5 | Binary: default-on | n/a | Must land. No demotion. | Block GA if not landed |
| CR-L6 | Binary: parity gate passing | n/a | Must land. No demotion. | Block GA if not landed |
| CR-L7 | Binary: all three commands + CI integration test passing on Marquee fixtures | n/a | Must land all three. No demotion. | Block GA if any one missing |
| CR-L8 | Binary: pipeline running, < 90 days stale | n/a | Must land. No demotion. | Block GA if pipeline not running |
Summary. Three criteria are binary v1.0 gates (CR-L5, CR-L6, CR-L8 + CR-L7 sub-bar). Two are integration must-haves (CR-L0 sub-bar, CR-L3 sub-bar). The rest can demote with public release-note honesty.
The demotion policy itself is a v1.0 claim. We are saying: “If we miss a number, we say so plainly.” This is half the credibility of the LLM-target positioning. A council that quietly removes a CR-L when it doesn’t measure well destroys the entire framework.
Demotion publishing form (D18, ratified 2026-05-15). Per council decision:
- Release-note line is the minimum form for every demoted CR-L. One paragraph: criterion name, target bar, measured number, decision (demote / partial / re-baseline), rationale, follow-on plan reference.
- Full post-mortem is required when any CR-L demotes by more than 10 percentage points below its bar. Post-mortem lives at
docs/news/<release-date>-postmortem-<crl-id>.mdand includes timeline, what we tried, what we learned, what we’ll change. - Quarterly re-measurement post-v1.0 (D19, ratified 2026-05-15). All eight
vox auditsubcommands re-run quarterly against the latest panel pin. Numbers published in v1.0.x patch release notes; if any number drifts > 10pp below the v1.0 GA measurement, the relevant CR-L gets a quarterly post-mortem regardless of its v1.0 demotion status.
§7 New Criteria Proposals
Section titled “§7 New Criteria Proposals”§7.1 CR-L0 (adopted 2026-05-15)
Section titled “§7.1 CR-L0 (adopted 2026-05-15)”Already added to v1-release-criteria.md §5. Full text reproduced for convenience:
[CR-L0] End-to-End Agent Authorship Loop: Given a canonical English spec from
contracts/eval/spec-to-app/(10–20 specs of increasing complexity), an autonomous agent loop driving Vox (via MCP) must produce a passing application —vox checkclean, tests pass,vox deploysucceeds,vox doctorgreen — at ≥ 60% success rate with a per-spec token-cost ceiling of ≤ $5.00 against the panel reference LLMs. This is the integration test for the v1.0 LLM-target claim; CR-L1..CR-L8 are unit tests of its sub-loops. Sub-bar (block GA): observed rate < 40%.
Rationale for the numbers. A naive product of the other CR-L bars (85% planning × 95% on-distribution × 70% project repair × 95% deploy ≈ 54%) suggests 60% is mildly ambitious but reachable. $5/spec at 2026-05 panel rates allows ~50K input + 10K output tokens — enough headroom for 5–10 LLM round trips per spec.
Why this is the most important CR-L. It is the only one that exercises the interaction between Vox’s primitives and a real agent. Every other CR-L can pass while CR-L0 fails — that would mean Vox is locally good but doesn’t compose into an agentic workflow. CR-L0’s measurement reveals exactly where the composition breaks.
§7.2 Deferred-but-tracked (CR-L9..CR-L12, not adopted for v1.0)
Section titled “§7.2 Deferred-but-tracked (CR-L9..CR-L12, not adopted for v1.0)”Per council scoping decision 2026-05-15. Each is a real gap; each is tracked under an existing follow-on plan rather than expanded into v1.0 criteria:
| Proposed | Topic | Tracked under |
|---|---|---|
| CR-L9 | Endpoint auth coverage (@auth or @public on every @endpoint) | Should become a vox-code-audit rule + CI denial in v0.6; not v1.0-gated |
| CR-L10 | LSP ↔ CLI diagnostic parity | vox-lsp-capabilities-ssot-2026.md follow-on |
| CR-L11 | Emit-correctness gate (Vox→TS→React smoke vs rolling upstream) | vox-react-backend-interop-audit-2026.md follow-on |
| CR-L12 | Latency budget (vox check p99 / vox repair median) + per-repair cost ceiling | Sub-bullet under CR-E in a future revision of v1-release-criteria.md; partial coverage via CR-L0’s $5/spec ceiling |
These are deliberately not v1.0 criteria. Adding them would either inflate scope past end-2026 or dilute the bar height of the criteria that are kept.
§8 Open Questions Distilled
Section titled “§8 Open Questions Distilled”Status: 22 of 25 ratified by council 2026-05-15. Ratification log in §8.1 below. Items 3 carried open (mostly because they only become actionable later in the timeline).
Carried forward from vox-as-llm-target-audit-and-plan-2026.md §7 plus new ones surfaced by this plan:
- OQ-1 (carried). Is the bar height right at “realistic v1.0”? Numbers in this plan assume yes. Council confirmation needed at P0.5.
- OQ-2 (carried). CR-L items append-only vs replace CR-D1/D2/D3? This plan reconciles via “point to CR-L for measurement, keep CR-D as policy lineage.” Council confirmation at P0.5.
- OQ-3 (carried). Does v1.0 include the mesh? This plan does not gate on mesh. Confirmation needed: is that the right framing?
- OQ-4 (carried, partially closed). AGENTS.md §Grammar Unification corrective edit — P0.6 closes this.
- OQ-5 (carried). “Humans become orchestrators, not authors” as v1.0 marketing or v1.x stretch? This plan treats it as v1.0 marketing (CR-L0 is the integration test) but not as a measurable “0% human-edit ratio” gate. Council to confirm.
- OQ-6 (carried, closed). Eval corpus location — confirmed
contracts/eval/. - OQ-7 (new). Reference-LLM panel composition and version-pin policy. P0.3 surfaces this for council approval.
- OQ-8 (new). Corpus engineer hiring / staffing decision. P0.7 surfaces this; if no full-time corpus engineer can be staffed, P3 timeline doubles.
- OQ-9 (new). Marquee-app composition (which 3–5 apps?). P0.1 publishes manifest for council approval.
- OQ-10 (new). CR-L0 cost ceiling ($5/spec) — does it reflect 2026-Q4 panel rates, or do we re-baseline at GA?
- OQ-11 (new). What happens when MENS reaches v1.0 quality after v1.0 ships? Re-measure all CR-L’s quarterly? Update bars?
- OQ-12 (new). Demotion policy in §6 is council-controllable; do we publish demotions as a release-note line or a full post-mortem? The former is honest minimum; the latter builds credibility.
§8.1 Ratification Log (Council, 2026-05-15)
Section titled “§8.1 Ratification Log (Council, 2026-05-15)”Twenty-five decisions surfaced across this plan + sibling manifests; council ratified 22 in a single batch and noted 3 as “decide when actionable.” This log is the canonical record. Where a decision contradicts an earlier draft in this doc or in a manifest, this log governs and the upstream file has been updated to point here.
Tier 1 — Operational immediately
Section titled “Tier 1 — Operational immediately”| # | Decision | Ratification |
|---|---|---|
| D1 | CR-D ↔ CR-L reconciliation | CR-D1/D2/D3 are policy lineage; CR-L is measurement source of truth. No duplicate CI gates. Reflected in v1-release-criteria.md §4 amended paragraphs. |
| D2 | Bar height — realistic v1.0 vs aspirational | Realistic bars hold: 60/80/95/70/85 (CR-L0..L4). Aspirational pushes v1.0 to 2027+ and risks measurement-bending no-ship pressure. |
| D3 | Owner role assignments | Collapse to two tracks: (a) Compiler/Lang/Detector and (b) Agent/Runtime/CLI/Corpus. Corpus staffing path picked separately in D11. |
Tier 2 — Blocks P3 corpus engineering
Section titled “Tier 2 — Blocks P3 corpus engineering”| # | Decision | Ratification |
|---|---|---|
| D4 | Marquee slot 2 | Todo-CRUD with @auth (archetype marquee-todo-auth). Reflected in contracts/marquee/manifest.v1.yaml. |
| D5 | Marquee slot 3 | Realtime chat with actor primitives (archetype marquee-chat). |
| D6 | Marquee slot count | 3 apps (CR-P1 minimum). No expansion. |
| D7 | LLM panel composition | 3-member panel as drafted (MENS-current + Claude-Sonnet + GPT-frontier). Reflected in contracts/eval/llm-panel.v1.yaml. |
| D8 | Panel rebaseline cadence | Rebaseline at release-candidate tag. |
| D9 | MENS self-reference | Report separately for CR-L0; include in panel median for CR-L1/L2/L4. |
| D10 | HumanEval-Vox count anchor | 164 problems (matches HumanEval-Python). |
| D11 | Corpus engineering staffing | (a) Absorb into both tracks at ~1 day/week each. Accepts ~2-month slip risk on P3; honest accounting closer to 4 months. v1.0 GA target revises end-2026 → Q1-2027. |
Tier 3 — Affects later phases or framing
Section titled “Tier 3 — Affects later phases or framing”| # | Decision | Ratification |
|---|---|---|
| D12 | CR-L0 cost ceiling rebaseline at GA | Yes, rebaseline at RC against panel pricing. |
| D13 | Repair-corpus iteration & cost budget | 10 outer attempts × $1.50/project median. |
| D14 | Plan-fidelity iteration budget | 5 plan iterations × $0.75/plan. |
| D15 | Spec-to-app iteration budget | 25 iterations × $5/spec. |
| D16 | Mesh in v1.0? | Mesh Phase 2 LAN demoted from v1.0 to v1.1. v1.0 = single-machine + LLM-target only. v0.6/v0.7 targets unchanged. Reflected in mesh-and-language-distribution-ssot-2026.md banner. |
| D17 | ”Humans as orchestrators” framing | v1.0 marketing, NOT a measurable percentage gate. CR-L0 is the closest measurable proxy. |
| D18 | Demotion publishing form | Release-note line minimum; full post-mortem if >10pp below bar. §6 amended. |
| D19 | Post-v1.0 re-measurement cadence | Quarterly. §6 amended. |
| D20 | ACI default-on migration | Flip default to true in v0.6 with one-minor deprecation warning, NOT in v0.5.x. Coordinate via docs/agents/ai-ide-feature-matrix-2026.json channel. |
Tier 4 — Operational / process
Section titled “Tier 4 — Operational / process”| # | Decision | Ratification |
|---|---|---|
| D21 | vox-audit home crate | New top-level vox-audit crate (not a vox-cli submodule). Reflected in contracts/ci/vox-audit-contract.v1.yaml. |
| D22 | Exit-code-2 semantics | Infrastructure error logs telemetry, does NOT block CI. |
| D23 | Canonical baseline | Latest tagged release; manual-pin reserved for emergency rollback. |
| D24 | Plan-fidelity Wave 1/2/3 taxonomy | Accepted as drafted (Wave 1 single-file, Wave 2 multi-file with decorator discipline, Wave 3 cross-cutting). |
| D25 | HumanEval-Vox mutation policy | AST-mutations count toward held-out budget only if source was hand-authored AND held-out. Mutations of public examples remain training-eligible. Reflected in contracts/eval/humaneval-vox/manifest.v1.yaml. |
Items intentionally carried forward (decide when actionable)
Section titled “Items intentionally carried forward (decide when actionable)”- OQ-3 mesh framing detail — resolved at high level via D16; specific phase-by-phase v1.1 sequencing deferred to a future mesh SSOT revision.
- OQ-11 post-v1.0 cadence specifics — resolved at “quarterly” via D19; specific publication channel and bar-drift thresholds tuned during P5 hardening.
- OQ-12 demotion form — resolved at high level via D18; per-criterion templates for post-mortem authoring will be added as the first demotion (if any) occurs.
§9 Quick reference: per-CR-L summary
Section titled “§9 Quick reference: per-CR-L summary”| CR-L | Bar | Owner role | Phase | Status (2026-05-15) | Critical dep |
|---|---|---|---|---|---|
| CR-L0 | ≥ 60% pass + ≤ $5/spec | Agent infra + Runtime/repair | P4 | Proposed | All other CR-L’s at minimum-viable |
| CR-L1 | ≥ 80% HumanEval-Vox | Lang platform + Corpus eng | P2 harness, P3 corpus, P4 measure | Proposed | 164-problem corpus + held-out 30 |
| CR-L2 | ≥ 95% on-distribution | Lang platform + MENS lead | P2/P4 | Proposed | Reuses CR-L1 corpus; MENS quality |
| CR-L3 | ≥ 70% multi-file / ≥ 90% single-file | Runtime/repair | P2/P3/P4 | Proposed | 50-project repair-corpus |
| CR-L4 | ≥ 85% plan-mode fidelity | Agent infra | P2/P3/P4 | Proposed | 50-plan fixtures |
| CR-L5 | Default-on ACI envelope | Agent infra | P1 (week 3–4) | Proposed | None — single config flip |
| CR-L6 | Retirement-guard parity with AGENTS.md | Lang platform | P1 (weeks 3–6) | Proposed | 10 retired-pattern detectors |
| CR-L7 | vox new + deploy + doctor E2E | DevOps + DX | P4 | Proposed | Marquee app fixtures |
| CR-L8 | Telemetry export pipeline running | Runtime/repair | P2 (weeks 5–9) | Proposed | Phase 4 Task 7 of vox-language-rules |
§10 Where to look next
Section titled “§10 Where to look next”- CR-L0 spec source:
v1-release-criteria.md§5 (adopted 2026-05-15). - CR-L1..CR-L8 audit context:
vox-as-llm-target-audit-and-plan-2026.md— §2 evidence, §3 gaps, §5 criteria detail, §9 self-critique. - Phase plans this implementation plan pegs to:
vox-language-rules-phase2-lint-extension-2026.md— CR-L1@exampledecorator, CR-L6 detector frameworkvox-language-rules-phase4-runtime-monitors-2026.md— CR-L8 telemetry export (Task 7)phase1-build-targets-spec-2026.md— CR-L7 vox new lineageagentos-ssot-2026.md— CR-L5 envelope contractmens-training-ssot.md— held-out corpus guard (R1 mitigation)
- External orientations:
- HumanEval (Chen et al., 2021) — 164 problems; CR-L1 anchored here
- GRPO (Shao et al., 2024) — not gated for v1.0 but referenced in CR-L8 as the future closed-loop direction
Implementation plan dated 2026-05-15. Next review: end of P0 (2026-05-29) for council go/no-go.