v1-release-criteria
Vox v1.0 Release Criteria (Hardened)
Section titled “Vox v1.0 Release Criteria (Hardened)”To reach a stable v1.0, the Vox foundation must satisfy the following machine-verified and human-audited criteria.
Marquee app set (added 2026-05-15). “Marquee app” references throughout this document resolve to the canonical fixture set at
contracts/marquee/manifest.v1.yaml. Without that fixture, criteria [CR-P1], [CR-P3], [CR-E2], [CR-L0], and [CR-L7] are unverifiable. Defining the manifest is a prerequisite for implementation of any Marquee-gated criterion.
1. Production Validation
Section titled “1. Production Validation”- [CR-P1] At least 3 “Marquee” applications must be deployed and live on OCI-compliant infrastructure with zero manual configuration.
- [CR-P2] 99.9% uptime for the
vox-ml-cliinference endpoint over a 7-day soak test. - [CR-P3] Full “Zero-DX” deployment loop:
vox new web → vox deploymust take under 120 seconds end-to-end.
2. Architectural Integrity
Section titled “2. Architectural Integrity”- [CR-A1] K-Complexity Freeze: The core compiler (
vox-compiler) must maintain a cyclomatic complexity threshold under 15 for all primary lowering paths. - [CR-A2] Non-Null Boundary: 100% of internal FFI and IPC interfaces must use non-null, machine-verified schemas (VoxProto v1).
- [CR-A3] Crate Decoupling: The workspace must maintain zero circular dependencies across the 10 core crates defined in
crates/_frozen.md. - [CR-A4] Lifecycle Metadata Parity: All orchestration contracts that affect model routing/providers must declare lifecycle metadata (
experimental/stable/deprecated) and a migration window, with CI parity checks.
3. Performance & Efficiency
Section titled “3. Performance & Efficiency”- [CR-E1] Cold Start:
vox run --interpmust initialize and execute a “Hello World” script in under 50ms on standard x86/ARM hardware. - [CR-E2] Bundle Size: The standard “Marquee” application bundle (React + TanStack) must not exceed 800KB (gzip).
- [CR-E3] Training Parity: The native
vox-populitraining pipeline must achieve loss parity with reference PyTorch/LoRA implementations for thevox-langcorpus.
4. Agentic DX (Developer Experience)
Section titled “4. Agentic DX (Developer Experience)”- [CR-D1] Planning Mode Fidelity: AI agents must be able to execute a multi-step “Wave 2” plan with at least 85% success rate without human intervention. Measurement harness and Wave-2 fixture set defined by [CR-L4] — see
vox-as-llm-target-audit-and-plan-2026.md. - [CR-D2] Self-Healing:
vox repairmust successfully resolve 90% of syntactically valid but logically broken Vox programs identified during the v1 audit. Multi-file project-scope variant and measurement defined by [CR-L3]; the 90% number is the single-file aim and the 70% number is the project-scope gate. - [CR-D3] Documentation Coverage: 100% of
vox-clisubcommands must have machine-readable help and associated.voxexample scripts in the training corpus. New commands landing under [CR-L7] (vox new,vox deploy,vox doctor) inherit this requirement at their landing release.
5. LLM-Target Fidelity
Section titled “5. LLM-Target Fidelity”These criteria operationalize the load-bearing claim that recurs across the marquee/research corpus — that Vox is shaped so AI agents can author code reliably, and the compiler+lint+repair pipeline can heal what they produce. Full audit, evidence, sequencing, and open questions: vox-as-llm-target-audit-and-plan-2026.md. Implementation plan with phasing, owners, fixture-corpus budget, and risk register: v1-llm-target-implementation-plan-2026.md. Bar height: realistic v1.0 — measurable foundation gates.
- [CR-L0] End-to-End Agent Authorship Loop: Given a canonical English spec from
contracts/eval/spec-to-app/(10–20 specs of increasing complexity), an autonomous agent loop driving Vox (via MCP) must produce a passing application —vox checkclean, tests pass,vox deploysucceeds,vox doctorgreen — at ≥ 60% success rate with a per-spec token-cost ceiling of ≤ $5.00 against the panel reference LLMs. This is the integration test for the v1.0 LLM-target claim; CR-L1..CR-L8 are unit tests of its sub-loops. Sub-bar (block GA): observed rate < 40%. - [CR-L1] HumanEval-Vox: A canonical 200-program benchmark suite (
contracts/eval/humaneval-vox/) must reach ≥ 80% compile + test-pass rate when prompted to MENS or a reference LLM. Reported quarterly viavox audit humaneval. - [CR-L2] On-Distribution Rate: ≥ 95% of MENS-emitted Vox programs must clear
vox check --strict+ the 47-rule vox-code-audit + retirement-guard with zero errors and zero high-confidence warnings. Reported quarterly viavox audit mens-on-distribution. - [CR-L3] Project-Scope Self-Healing:
vox repair .must reach ≥ 70% success rate on a defined 50–100 multi-file broken-project corpus (contracts/eval/repair-corpus/). Single-filevox repairshould aim ≥ 90% as a sub-metric. - [CR-L4] Plan-Mode Fidelity Measurement: The “Wave 2” benchmark set referenced by [CR-D1] must exist as fixtures at
contracts/eval/plan-fidelity/with an automated harness that produces the 85% measurement (closes [CR-D1]‘s underspecification). - [CR-L5] ACI Envelope Default-On:
OrchestratorConfig::agentos_aci_envelope_enableddefaults totruestarting in v0.6; guardrail kernel rejects unclassified mutations at v1.0. - [CR-L6] Retirement-Guard Parity: Every row in
AGENTS.md§Retired Surfaces has either a parse-time / typeck detector or avox-arch-checkrule. CI gate (vox ci retirement-audit) fails on drift between policy doc and enforcement. - [CR-L7] Deploy CLI Completeness:
vox new,vox deploy, andvox doctorship with structured JSON output,vox.deploy.*/vox.doctor.*telemetry, and a CI integration test drivingvox new web → vox deploy → vox doctoron a Marquee app fixture inside the [CR-P3] 120-second budget. - [CR-L8] Diagnostic→Repair→Corpus Feedback Loop: A quarterly pipeline export from
vox.lint.*+vox.repair.*telemetry into vox-corpus runs in CI, emittingcontracts/reports/corpus-feedback/<quarter>.json(top-50 firing diagnostics, autofix accept/reject rates, repair outcome histogram). CI fails if the artifact is older than 90 days.
Approved by Vox Foundation Council — April 2026. §5 (CR-L0..CR-L8) and the Marquee-app manifest reference ratified by council 2026-05-15 (council batch D1–D25 in v1-llm-target-implementation-plan-2026.md §8 / 2026-05-15 ratification log). Per ratification D1: CR-L is the measurement source of truth; CR-D1/D2/D3 remain as policy lineage and reference the corresponding CR-L measurement harness. Per D2: realistic-v1.0 bars hold. Per D16: mesh Phase 2 LAN is demoted from v1.0 acceptance contract to v1.1.