
Mens measurement gap analysis

This document defines the measurement groundwork needed to judge whether VoxMens is getting closer to the real product goal:

Emit the most accurate .vox code possible, with the lowest error rate, at the highest practical speed.

The current codebase measures many useful things, but it does not yet measure that full objective coherently.

Today, VoxMens has three broad measurement layers:

  1. training telemetry
  2. corpus/data quality telemetry
  3. generation/evaluation telemetry

All three matter, but they are not equivalent.

The main problem is that the system still treats some upstream proxies as if they were downstream product truth.

Examples:

  • training loss is treated as if it were close to code correctness,
  • corpus parse rate is treated as if it were close to generation quality,
  • benchmark strictness heuristics are treated as if they were canonical output guarantees.

Those are useful signals. They are not the top-line KPI.

Primary sources:

What these surfaces currently measure well:

  • train loss,
  • validation loss,
  • step progress,
  • checkpoint progress,
  • some skip/error categories during training,
  • wall-clock training progress.

What they do not directly measure:

  • whether the resulting model emits valid .vox,
  • whether emitted .vox is canonical,
  • whether repair loops are shrinking,
  • whether serving is getting faster,
  • whether task outcomes are semantically improving.

Primary source:

What this layer measures well:

  • training-data parseability,
  • construct coverage,
  • format validity of corpus artifacts,
  • some safety/quality proxies for the corpus.

What it does not measure:

  • model output quality,
  • model repair burden,
  • inference throughput,
  • semantic success of generated programs.

Primary sources:

What this layer measures reasonably well already:

  • pass@1 / pass@k for held-out eval-local benches,
  • first-pass compileability,
  • compileability after retries,
  • repair depth,
  • latency (partially),
  • a first approximation of strictness.

What it still misses:

  • tokenizer-true token counts and throughput,
  • stable error taxonomy at aggregate level,
  • semantic correctness beyond parse/typecheck,
  • HIR-level structure comparison or canonical IR comparison,
  • a unified “time-to-first-valid-Vox” KPI,
  • a single benchmark artifact contract used by all surfaces.

One of the most important findings is that producer and consumer surfaces still disagree about field names and ownership.

Relevant files:

Observed drift:

  • gate code looks for metrics.jsonl,
  • training now centers on telemetry.jsonl,
  • gate expects tokens_per_sec,
  • training prominently emits steps_per_sec_ema,
  • gate looks for supervised_ratio_pct,
  • training paths do not consistently publish the fields needed to compute that ratio in a durable way.

This means the gate can be logically correct but practically underfed.
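A minimal sketch of how this drift could be caught mechanically: scan the training telemetry records and report which gate-side fields never appear. The field names come from the drift list above; the record layout, the sample values, and the helper name `missing_gate_fields` are assumptions for illustration, not the actual gate code.

```python
import json

# Field names the gate reads, taken from the drift list above.
GATE_EXPECTED_FIELDS = {"tokens_per_sec", "supervised_ratio_pct"}


def missing_gate_fields(jsonl_text, expected=GATE_EXPECTED_FIELDS):
    """Return the expected fields that never appear in any telemetry record."""
    seen = set()
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the JSONL stream
        seen.update(json.loads(line).keys())
    return set(expected) - seen


# Hypothetical telemetry.jsonl content; the real record layout is not shown here.
sample = "\n".join([
    json.dumps({"step": 10, "train_loss": 2.31, "steps_per_sec_ema": 1.8}),
    json.dumps({"step": 20, "train_loss": 2.07, "steps_per_sec_ema": 1.9}),
])

print(sorted(missing_gate_fields(sample)))
```

Run against a real telemetry file, a non-empty result would indicate that the gate is being underfed rather than that the run failed the gate.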

Drift: benchmark artifacts vs strategic decision artifact


Relevant files:

Observed drift:

  • eval_local writes one style of report,
  • mens_scorecard writes another,
  • strategic decisions now need both,
  • there is not yet one stable summary contract that joins them.

Drift: repair-loop evidence across CLI and MCP


Relevant files:

Observed drift:

  • both now do diagnostics-informed retries,
  • only one path returns richer structured repair metadata,
  • strictness and canonicalization accounting are still not normalized into one shared analytics schema.

The second pass should treat the following as the required top-line KPIs for code-generation success.

These are the metrics that should decide whether VoxMens is materially better.

| KPI | Meaning | Why it matters |
|---|---|---|
| CompilePass@1 | valid .vox on first attempt | Best direct measure of raw model correctness |
| CompilePass@N | valid .vox within bounded repair budget | Measures practical recoverability |
| CanonicalPass@1 | output canonicalizes and still validates | Measures whether output matches strict serializer goals |
| TaskSuccess | generated program satisfies task-level expected behavior | Prevents overfitting to syntax-only wins |
| TimeToFirstValidMs | wall-clock latency to first valid .vox | Combines model speed with repair cost |
| ServeTokensPerSec | inference throughput using real tokenizer counts | Needed for deployment tradeoffs |
| RepairStallRate | percent of tasks where retries stop making progress | Important operational pain signal |
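As a rough sketch, three of these KPIs can be derived from one per-task record of attempt outcomes. The `TaskResult` shape and its field names are hypothetical, and treating N as a cap on total attempts (rather than on retries) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    # Hypothetical per-task record; field names are illustrative only.
    attempts_valid: list      # compile result of each attempt, in order
    stalled: bool = False     # True if retries stopped making progress


def compile_pass_at_1(results):
    """Fraction of tasks whose first attempt was valid."""
    hits = sum(1 for r in results if r.attempts_valid and r.attempts_valid[0])
    return hits / len(results)


def compile_pass_at_n(results, n):
    """Fraction of tasks with a valid output within the first n attempts."""
    hits = sum(1 for r in results if any(r.attempts_valid[:n]))
    return hits / len(results)


def repair_stall_rate(results):
    """Fraction of tasks flagged as stalled."""
    return sum(1 for r in results if r.stalled) / len(results)


# Made-up outcomes for four tasks.
results = [
    TaskResult([True]),
    TaskResult([False, True]),
    TaskResult([False, False, False], stalled=True),
    TaskResult([False, False, True]),
]
print(compile_pass_at_1(results))     # 0.25
print(compile_pass_at_n(results, 3))  # 0.75
print(repair_stall_rate(results))     # 0.25
```

Deriving all three from the same records keeps the denominators identical, which makes the numbers directly comparable across runs.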

These are needed to explain changes in Tier 1, not to replace them.

| KPI | Meaning |
|---|---|
| RepairDepthMean | mean retries among tasks that eventually pass |
| DiagnosticCategoryHistogram | distribution of error categories |
| StrictnessFailureRate | prose wrappers / markdown fences / extra narration |
| ValLossLastEpoch | training-side model fitness proxy |
| NoSupervisedSkipRate | training-data supervision efficiency |
| TruncationFraction | lost supervision due to context cap |
These help interpret experiments but should not drive the main decision gate by themselves.

| Metric | Why it is contextual only |
|---|---|
| train loss | useful but indirect |
| validation loss | useful but indirect |
| corpus parse rate | data quality, not model quality |
| construct coverage | diversity signal, not product success |
| whitespace token counts | weak proxy for real token economics |

The following are currently worth keeping, but they should be explicitly demoted from decision-driving metrics:

Corpus parse rate belongs to corpus/data QA, not to model quality. It should not be read as a direct measure of model improvement.

Construct coverage is important for understanding data breadth, but it is not enough to indicate that the model can correctly use those constructs under prompt conditions.

Strictness without compiler validation or canonicalization is not enough. The target is not “looks like code.” The target is “canonical valid Vox.”

Loss curves can help rank training runs, but they should not be used as the final justification for shipping or for deciding whether a custom model is needed.

What we are not measuring but need to measure


Time to first valid .vox (the TimeToFirstValidMs KPI above) is arguably the most important missing operational metric.

Why:

  • a slower model that succeeds first-pass can beat a faster model that needs three repair rounds,
  • raw latency and repair depth need to be composed into one observable.

Where to instrument:

  • MCP generation path,
  • CLI generation path,
  • scorecard benchmark output.
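The composition of latency and repair depth described above can be sketched as a cumulative sum over attempts that stops at the first valid output. The attempt tuple shape and the example latencies are made up for illustration.

```python
from typing import Optional, Sequence, Tuple


def time_to_first_valid_ms(attempts: Sequence[Tuple[float, bool]]) -> Optional[float]:
    """Cumulative wall-clock time up to and including the first valid attempt.

    Each attempt is (latency_ms, is_valid), in the order attempts were made.
    Returns None when no attempt produced valid output.
    """
    elapsed = 0.0
    for latency_ms, is_valid in attempts:
        elapsed += latency_ms
        if is_valid:
            return elapsed
    return None


# Hypothetical numbers: a slower model that is right first time
# versus a faster model that needs two repair rounds.
slow_first_pass = [(900.0, True)]
fast_with_repairs = [(400.0, False), (400.0, False), (400.0, True)]

print(time_to_first_valid_ms(slow_first_pass))    # 900.0
print(time_to_first_valid_ms(fast_with_repairs))  # 1200.0
```

In this made-up case the model with the higher per-attempt latency still reaches valid output sooner, which is exactly what raw latency alone would hide.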

2. Semantic success beyond compiler validity


Parse/typecheck success is necessary. It is not sufficient.

Needed next:

  • golden behavioral checks for a curated subset,
  • expected-shape verification at the HIR or route/component/workflow level,
  • later, executable or snapshot-based validation for selected tasks.

3. Diagnostic taxonomy as a first-class metric


Current counts tell us that something failed. They do not tell us which failure classes dominate:

  • syntax punctuation,
  • indentation/layout confusion,
  • type mismatches,
  • invalid imports,
  • route/schema mismatches,
  • actor/workflow misuse.

Without that histogram, targeted data or decoding improvements remain guesswork.
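A minimal sketch of such a histogram, assuming each failed attempt has already been assigned one category label. The slug names below mirror the list above but are hypothetical, as is the mapping from compiler diagnostics to categories, which is not shown.

```python
from collections import Counter


def diagnostic_histogram(categories):
    """Map each category to (count, share of all failures), most frequent first."""
    counts = Counter(categories)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: (n, n / total) for cat, n in counts.most_common()}


# Made-up failure labels for four failed attempts.
failures = ["type_mismatch", "syntax_punctuation", "type_mismatch", "invalid_import"]

for category, (count, share) in diagnostic_histogram(failures).items():
    print(f"{category:20s} {count:3d} {share:.0%}")
```

Reporting shares alongside counts makes histograms comparable between benchmark runs of different sizes.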

We need true tokenizer-backed token counts and throughput rather than whitespace approximations.

Otherwise, model comparisons can be directionally wrong.
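The gap between the two counting methods can be illustrated by making the tokenizer a parameter of the throughput calculation. The regex tokenizer below is only a stand-in, not the model's real tokenizer, and the snippet is an arbitrary code-like string rather than verified Vox syntax.

```python
import re


def tokens_per_sec(text, elapsed_s, encode):
    """Throughput using whatever tokenizer `encode` implements."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return len(encode(text)) / elapsed_s


def whitespace_encode(text):
    """The weak proxy: split on whitespace only."""
    return text.split()


def toy_subword_encode(text):
    """Stand-in tokenizer that splits words and punctuation apart.

    A real measurement would pass the serving model's own tokenizer instead.
    """
    return re.findall(r"\w+|[^\w\s]", text)


snippet = "fn add(a: int, b: int) -> int { a + b }"  # arbitrary code-like text

print(tokens_per_sec(snippet, 2.0, whitespace_encode))
print(tokens_per_sec(snippet, 2.0, toy_subword_encode))
```

The same text and the same elapsed time yield different throughput figures depending on the tokenizer, which is why comparisons based on whitespace counts can mislead.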

If VoxMens is going to become multi-lane, we need to measure when one lane degrades another.

Examples:

  • prose leakage into code-only lane,
  • code-only compactness loss after docs/chat blending,
  • repair-loop burden increase after introducing more general conversational data.
flowchart TD
training[TrainingTelemetry] --> summary[RunSummaryContract]
corpus[CorpusQualitySignals] --> summary
evalLocal[HeldOutEvalLocal] --> benchmark[BenchmarkSummaryContract]
scorecard[MensScorecard] --> benchmark
mcpGen[McpGenerationMetrics] --> runtime[RuntimeMetricsContract]
cliGen[CliGenerationMetrics] --> runtime
summary --> decision[DecisionGate]
benchmark --> decision
runtime --> decision

Minimal durable contracts needed in second pass


The second pass should not try to measure everything at once. It should create three stable contracts:

  1. Run summary contract

    • training-oriented,
    • one artifact per run,
    • includes pointers to telemetry and benchmark outputs.
  2. Benchmark summary contract

    • model-vs-model comparable,
    • includes compile, canonical, task, repair, speed, strictness.
  3. Runtime generation metrics contract

    • per-request or aggregated,
    • used by both CLI and MCP,
    • records time-to-first-valid and stall behavior,
    • initial schema path: contracts/eval/runtime-generation-kpi.schema.json.
    • vox_mens_scorecard_summary_v1 artifacts may include optional kpi_contract_alignment, which pins the same vox_runtime_generation_kpi_v1 schema id alongside the mens scorecard event schema $id for downstream eval joins.
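To make the third contract concrete, here is a sketch of one per-request record that both CLI and MCP could emit. Every field name is hypothetical: the actual fields are defined by contracts/eval/runtime-generation-kpi.schema.json, whose contents are not reproduced here.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RuntimeGenerationRecord:
    # All field names below are illustrative assumptions, not the real schema.
    surface: str                              # "cli" or "mcp"
    attempts: int                             # total generation attempts made
    first_valid_attempt: Optional[int]        # 1-based, None if never valid
    time_to_first_valid_ms: Optional[float]   # None if never valid
    stalled: bool                             # retries stopped making progress
    output_tokens: int                        # tokenizer-backed count

    def to_json(self) -> str:
        """Serialize with sorted keys so CLI and MCP output diffs cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


record = RuntimeGenerationRecord(
    surface="mcp",
    attempts=2,
    first_valid_attempt=2,
    time_to_first_valid_ms=1350.0,
    stalled=False,
    output_tokens=212,
)
print(record.to_json())
```

Sharing one record type between both surfaces is the point of the contract: the repair-loop drift described earlier arises when each path reports its own shape.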

  1. align training telemetry with gate readers,
  2. add TimeToFirstValidMs,
  3. add true token accounting to runtime generation,
  4. add structured repair outcome aggregation,
  5. create one benchmark summary schema.

  1. add diagnostic taxonomy histograms,
  2. add semantic golden checks for a curated subset,
  3. demote weak proxies in docs and dashboards.

  1. expand category/context breakdowns,
  2. add richer per-lane contamination monitoring once lanes are split cleanly.

The current system already measures enough to know that VoxMens is moving in the right direction.

It does not yet measure enough to answer the bigger strategic question with confidence:

Is QLoRA sufficient, or are the remaining failures structural enough that Vox needs a more custom model path?

To answer that question, the next pass must stop treating upstream proxies as final truth and instead build one end-to-end KPI chain around:

  • valid .vox,
  • canonical .vox,
  • task success,
  • repair burden,
  • real runtime cost.