
Native ML Training Pipeline

Vox “dogfoods” itself: the language, compiler, and documentation all feed a native machine learning loop that trains the Mens code assistant model.

For the end-to-end map from .vox sources through goldens and corpus extraction to model inputs, see Vox source → Mens pipeline SSOT. For the training pair contract, see Mens training data contract.

Canonical operator fine-tuning: vox mens train with Candle + qlora-rs on Hugging Face weights. --backend qlora and --tokenizer hf are the defaults; no Python training loop. SSOT: Mens native training. PopuliTrainBackend::BurnLora is rejected at runtime in this dispatch — the supported trainer is CandleQlora.

Legacy / side paths: A Burn + wgpu scratch LoRA stack still lives in vox-tensor (vox training native, small VoxTokenizer model) — no Python, optional CUDA only if you build GPU features for other subsystems. Use it for experimentation, not as a substitute for Mens HF QLoRA. Burn also matters for vox mens merge-weights and vox mens serve on merged .bin checkpoints. Objectives and artifacts differ from Candle QLoRA — see Burn vs QLoRA.

GPUs: For QLoRA on an NVIDIA workstation, build mens-candle-cuda and use vox mens train --device cuda. For Burn scratch training, wgpu (Vulkan / DX12 / Metal) is the default GPU path. Use CPU when drivers or CI forbid GPU.


┌──────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES                                                             │
│ golden/**/*.vox + examples.ssot.v1.yaml ──┐                              │
│ docs … golden .vox ───────────────────────┤──► vox mens corpus extract   │
│   (+ prose per mix policy)                │              │               │
│ vox-cli generate-data ────────────────────┘              │               │
└──────────────────────────────────────────────────────────│───────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ CORPUS PIPELINE                                                          │
│ mens/data/validated.jsonl (raw Vox → instruction pairs)                  │
│             │                                                            │
│             ▼                                                            │
│ vox mens corpus validate (filter malformed pairs)                        │
│             │                                                            │
│             ▼                                                            │
│ mens/data/train.jsonl (rated + filtered pairs)                           │
└─────────────│────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ TRAINING (Mens, canonical)                                               │
│                                                                          │
│ `vox mens train`: Candle + qlora-rs QLoRA (default)                      │
│   `--backend qlora` + `--tokenizer hf` + HF safetensors                  │
│   Optional CUDA (`mens-candle-cuda`) / Metal                             │
│   SSOT: `reference/mens-training.md`                                     │
│                                                                          │
│ Legacy / other: `vox training native`, Burn scratch LoRA                 │
│   (`VoxTokenizer` JSONL, wgpu/CPU). Not `vox mens` dispatch.             │
│ `vox train` (mens-dei): local bails → `vox mens train …`                 │
└──────────────────────────────────────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────────────────┐
│ EVAL + BENCHMARK GATES                                                   │
│ vox mens corpus eval … → eval_results.json                               │
│ VOX_BENCHMARK=1 → spawns vox mens eval-local (held-out)                  │
│ Targets: vox_parse_rate ≥70%, coverage ≥50% (CI);                        │
│   VOX_EVAL_STRICT=1 fails promotion                                      │
│ Held-out: VOX_BENCHMARK=1, VOX_BENCHMARK_MIN_PASS_RATE (default 0)       │
└──────────────────────────────────────────────────────────────────────────┘

All training pairs follow this JSONL schema (must match across all tools):

{
  "prompt": "Write a minimal Vox program that prints hello",
  "response": "fn main() {\n print(\"hello\")\n}\n",
  "category": "function",
  "rating": 5,
  "schema_version": "vox_dogfood_v1"
}

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| prompt | string | yes | The instruction/question (serde also accepts instruction) |
| response | string | yes | Valid Vox code (serde also accepts output) |
| category | string | recommended | Construct type (function, actor, etc.) |
| rating | u8 (1-5) | recommended | Quality rating; 5 = ground-truth docs |
| schema_version | string | optional | Version for migration tracking |
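
The field rules above can be sketched as a small normalization step. This is an illustration only, not the Mens loader: `RawPair`, `TrainingPair`, and `normalize` are hypothetical names, the JSON line is assumed to be parsed already, and the alias handling imitates what the serde aliases are described as doing.

```rust
/// One JSONL record after parsing, with every field optional.
/// Hypothetical type for illustration; not part of any Vox crate.
#[derive(Debug, Default, Clone)]
struct RawPair {
    prompt: Option<String>,
    instruction: Option<String>, // accepted alias for `prompt`
    response: Option<String>,
    output: Option<String>, // accepted alias for `response`
    category: Option<String>,
    rating: Option<u8>,
    schema_version: Option<String>,
}

/// A record that satisfies the schema table.
#[derive(Debug, PartialEq)]
struct TrainingPair {
    prompt: String,
    response: String,
    category: Option<String>,
    rating: Option<u8>,
    schema_version: Option<String>,
}

/// Resolve aliases and enforce the required fields and the 1-5 rating range.
fn normalize(raw: RawPair) -> Result<TrainingPair, String> {
    let prompt = raw
        .prompt
        .or(raw.instruction)
        .ok_or_else(|| "missing prompt (or instruction)".to_string())?;
    let response = raw
        .response
        .or(raw.output)
        .ok_or_else(|| "missing response (or output)".to_string())?;
    if let Some(r) = raw.rating {
        if !(1..=5).contains(&r) {
            return Err(format!("rating {} is outside 1-5", r));
        }
    }
    Ok(TrainingPair {
        prompt,
        response,
        category: raw.category,
        rating: raw.rating,
        schema_version: raw.schema_version,
    })
}
```

A record that uses `instruction` and `output` normalizes to the same `TrainingPair` as one that uses `prompt` and `response`; a record with no response, or with a rating of 6, is rejected.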

Compile path: source text is lexed by vox-compiler (logos Token enum)—this is unrelated to Mens model vocabulary. See Vox source → Mens pipeline SSOT.

Mens QLoRA path (default): supervised strings are tokenized with the Hugging Face tokenizer for the chosen --model (tens of thousands of BPE tokens). See Mens native training § Tokenization SSOT.

Lab / Burn scratch: vox-tensor exposes a deterministic small VoxTokenizer (not a mirror of the Vox lexer keyword set):

  • 95 printable ASCII characters (IDs 3-97)
  • 35 Vox compound tokens (workflow, actor, fn, component, etc.)
  • 3 control tokens: [PAD]=0, [UNK]=1, [EOS]=2
  • Total vocab: 133 tokens

// vox:skip
// Vox example, tokenized natively using VoxTokenizer
fn greet(name: str) to str {
    return "Hello, " + name
}

Encoding uses greedy longest-match on compound tokens before falling back to single chars.
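
A minimal sketch of greedy longest-match encoding under the ID layout listed above. It is not the vox-tensor implementation: `COMPOUNDS` is a four-item illustrative subset of the 35 compound tokens, placing compound IDs at 98 and up (in list order) is an assumption inferred from the vocabulary arithmetic, and mapping characters outside printable ASCII to [UNK] is likewise assumed.

```rust
#[allow(dead_code)]
const PAD: u32 = 0;
const UNK: u32 = 1;
#[allow(dead_code)]
const EOS: u32 = 2;
/// IDs 3-97 cover the 95 printable ASCII characters (0x20 through 0x7E).
const ASCII_BASE: u32 = 3;
/// Assumption: compound tokens are numbered right after the ASCII range.
const COMPOUND_BASE: u32 = 98;

/// Illustrative subset of the compound tokens; the order is hypothetical.
const COMPOUNDS: [&str; 4] = ["workflow", "actor", "fn", "component"];

/// Encode text by trying the longest matching compound token first,
/// then falling back to one ID per character.
fn encode(text: &str) -> Vec<u32> {
    let mut ids = Vec::new();
    let mut rest = text;
    while let Some(ch) = rest.chars().next() {
        // Longest compound token that is a prefix of the remaining input.
        let best = COMPOUNDS
            .iter()
            .enumerate()
            .filter(|(_, tok)| rest.starts_with(**tok))
            .max_by_key(|(_, tok)| tok.len());
        if let Some((idx, tok)) = best {
            ids.push(COMPOUND_BASE + idx as u32);
            rest = &rest[tok.len()..];
            continue;
        }
        // Single-character fallback: printable ASCII or [UNK].
        let code = ch as u32;
        if (0x20..=0x7E).contains(&code) {
            ids.push(ASCII_BASE + (code - 0x20));
        } else {
            ids.push(UNK);
        }
        rest = &rest[ch.len_utf8()..];
    }
    ids
}
```

With these assumed IDs, `encode("fn x")` yields `[100, 3, 91]`: the compound `fn`, a space, then the character `x`. This sketch has no word-boundary rule, so `fn` would also match inside a longer identifier.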


VoxTransformer Architecture (Burn scratch path)

The Burn-backed scratch transformer (crates/vox-tensor/src/vox_nn.rs, gpu feature) is used with VoxTokenizer JSONL and is distinct from the HF QLoRA weights:

| Parameter | Value | Notes |
| --- | --- | --- |
| Layers | 12 | Transformer encoder blocks |
| Attention heads | 8 | Multi-head self-attention |
| Model dimension | 512 | Embedding size |
| FFN dimension | 2048 | Feed-forward inner size |
| Dropout | 0.1 | Applied in attention + FFN |
| Max sequence length | 512 | Tokens per training example |
| Vocab size | 133 | VoxTokenizer vocabulary |
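
To get a feel for the model size these hyperparameters imply, the arithmetic can be sketched as follows. This is a rough estimate, not a readout of vox_nn.rs. It assumes a standard transformer block (four d_model × d_model attention projections with biases, a two-layer FFN with biases, two LayerNorms), learned positional embeddings, and no separate output head. `ScratchConfig` and both functions are hypothetical names.

```rust
/// Hyperparameters copied from the table above.
struct ScratchConfig {
    layers: u64,
    heads: u64,
    d_model: u64,
    d_ff: u64,
    max_seq: u64,
    vocab: u64,
}

const VOX_SCRATCH: ScratchConfig = ScratchConfig {
    layers: 12,
    heads: 8,
    d_model: 512,
    d_ff: 2048,
    max_seq: 512,
    vocab: 133,
};

/// Width of each attention head.
fn head_dim(c: &ScratchConfig) -> u64 {
    c.d_model / c.heads
}

/// Rough parameter estimate under the assumptions stated in the lead-in.
fn estimate_params(c: &ScratchConfig) -> u64 {
    // Token embeddings plus learned positional embeddings (assumed).
    let embeddings = (c.vocab + c.max_seq) * c.d_model;
    // Q, K, V and output projections, each d_model x d_model with a bias.
    let attention = 4 * (c.d_model * c.d_model + c.d_model);
    // Two linear layers with biases: d_model -> d_ff -> d_model.
    let ffn = (c.d_model * c.d_ff + c.d_ff) + (c.d_ff * c.d_model + c.d_model);
    // Two LayerNorms per block, each with a scale and a shift vector.
    let norms = 2 * 2 * c.d_model;
    embeddings + c.layers * (attention + ffn + norms)
}
```

Under these assumptions each head is 64 wide and the estimate comes to 38,158,848 parameters, almost all of them in the 12 blocks. The 133-token vocabulary contributes only 68,096 embedding weights.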

1. Generate training pairs with generate-data

vox generate-data --limit 500 --output mens/data/train.jsonl

2. Extract corpus from real Vox files (canonical flow, PowerShell)

.\target\release\vox.exe mens corpus extract examples/golden/ -o mens/data/validated.jsonl
.\target\release\vox.exe mens corpus extract docs/ -o mens/data/validated.jsonl 2>$null
.\target\release\vox.exe mens corpus validate mens/data/validated.jsonl --no-recheck -o mens/data/validated.jsonl
.\target\release\vox.exe mens corpus pairs mens/data/validated.jsonl -o target/dogfood/train.jsonl --docs docs/src/ --docs docs/src/research/ --docs docs/src/adr/
# Rustdoc merge skipped: response is Rust prose, not Vox code

3. Start Mens fine-tuning (canonical — Candle QLoRA, native Rust)

# Build with CUDA for RTX-class GPUs (see mens-training SSOT / AGENTS.md)
# Then minimal path:
.\target\release\vox.exe mens train --device cuda --data-dir target/dogfood --output-dir target/dogfood/run

Legacy Burn scratch (small VoxTokenizer model, wgpu — not HF QLoRA):

$env:VOX_BACKEND="cpu"; .\target\release\vox.exe train --data-dir target/dogfood --output-dir mens/runs/v1
# GPU: omit VOX_BACKEND=cpu when wgpu is available

Run the corpus eval and write eval_results.json:

.\target\release\vox.exe mens corpus eval target/dogfood/validated_mixed.jsonl -o mens/runs/latest/eval_results.json

Every documentation page with training_eligible: true in its frontmatter and a ```vox code block automatically contributes training pairs via vox mens corpus pairs --docs docs/src/.

This creates a closed feedback loop: better docs → more training data → better model → better completions → easier to write docs.

Frontmatter format for training-eligible docs:

---
title: "My Guide"
category: "How-To Guides"
constructs: [function, workflow]
training_eligible: true
difficulty: intermediate
---
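
The eligibility rule (a `training_eligible: true` frontmatter flag plus at least one vox code block) can be sketched as a line scan. This is only an illustration of the rule as stated on this page, not the logic inside vox mens corpus pairs. `is_training_eligible` is a hypothetical name, and the frontmatter handling is a naive key/value match rather than a YAML parser.

```rust
/// Opening marker of a Vox code block in a Markdown page.
const VOX_FENCE: &str = "```vox";

/// Return true when the page has `training_eligible: true` in its
/// frontmatter and at least one vox code block in its body.
fn is_training_eligible(page: &str) -> bool {
    let mut lines = page.lines();
    // Frontmatter must open on the very first line.
    if lines.next().map(str::trim) != Some("---") {
        return false;
    }
    let mut flagged = false;
    let mut closed = false;
    // Scan key/value lines until the closing `---`.
    for line in lines.by_ref() {
        let line = line.trim();
        if line == "---" {
            closed = true;
            break;
        }
        if let Some((key, value)) = line.split_once(':') {
            if key.trim() == "training_eligible" && value.trim() == "true" {
                flagged = true;
            }
        }
    }
    // The remaining lines are the body; look for a vox fence opener.
    let has_vox_block = lines.any(|l| l.trim_start().starts_with(VOX_FENCE));
    closed && flagged && has_vox_block
}
```

A page with the frontmatter shown above but no vox block returns false, as does a page with a vox block whose flag is missing or set to false.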

The ML pipeline runs automatically via .github/workflows/ml_data_extraction.yml:

  • Nightly: Full corpus re-extraction at 4 AM UTC
  • On push: Triggered when *.vox, compiler crates, or docs/src/** change
  • Manual: workflow_dispatch with force_train or native_train option
  • Grammar drift: Fingerprint check forces full re-extraction when syntax changes

The train job runs on a self-hosted GPU runner when corpus changes or when manually triggered:

  • Native path (default): Prefer vox mens train with VOX_BACKEND=cpu for CI compatibility. Older workflows may still invoke vox train; --provider local now bails with the canonical Candle QLoRA command (no Python train_qlora script).
  • workflow_dispatch with native_train: false: If the workflow is still wired to vox train --provider local, expect the bail message directing operators to vox mens train --backend qlora. Use vox mens train directly in updated automation.
  • Eval strict mode: VOX_EVAL_STRICT=1 — training fails when eval gate thresholds are not met.
  • Benchmark gate: VOX_BENCHMARK=1 — runs held-out benchmark from mens/data/heldout_bench/; VOX_BENCHMARK_MIN_PASS_RATE (e.g. 0.80) fails promotion when pass rate is below threshold.
  • Artifact retention: LoRA adapter target/dogfood/run/ uploaded as lora-adapter-$VCS_SHA, retained 90 days. Eval results eval_results.json / eval_gate_failed.json retained 30 days.
  • Logging: Training pair count and eval gate result (parse rate, coverage) are printed; eval gate failure writes eval_gate_failed.json and emits a warning.
# CI uses VOX_BACKEND=cpu by default (no GPU drivers required)
VOX_BACKEND=cpu vox mens train --data-dir target/dogfood --output-dir target/dogfood/run
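
The eval and benchmark gates described above reduce to two threshold checks, sketched here. It is a reading of the documented behaviour, not the CI code: the function and type names are hypothetical, and the caller is assumed to have read the metrics and the VOX_EVAL_STRICT and VOX_BENCHMARK_MIN_PASS_RATE values itself.

```rust
/// Eval gate targets quoted on this page.
const MIN_PARSE_RATE: f64 = 0.70;
const MIN_COVERAGE: f64 = 0.50;

#[derive(Debug, PartialEq)]
enum GateOutcome {
    /// Both thresholds met.
    Pass,
    /// Thresholds missed outside strict mode: warn and continue.
    Warn(String),
    /// Thresholds missed in strict mode: training fails.
    Fail(String),
}

/// Eval gate: compare parse rate and coverage against the targets.
fn eval_gate(parse_rate: f64, coverage: f64, strict: bool) -> GateOutcome {
    let mut misses = Vec::new();
    if parse_rate < MIN_PARSE_RATE {
        misses.push(format!("vox_parse_rate {:.2} < {:.2}", parse_rate, MIN_PARSE_RATE));
    }
    if coverage < MIN_COVERAGE {
        misses.push(format!("coverage {:.2} < {:.2}", coverage, MIN_COVERAGE));
    }
    if misses.is_empty() {
        GateOutcome::Pass
    } else if strict {
        GateOutcome::Fail(misses.join("; "))
    } else {
        GateOutcome::Warn(misses.join("; "))
    }
}

/// Benchmark gate: promotion is allowed when the held-out pass rate
/// reaches the configured minimum. With the default minimum of 0 it
/// never blocks promotion.
fn benchmark_gate(pass_rate: f64, min_pass_rate: f64) -> bool {
    pass_rate >= min_pass_rate
}
```

For example, a parse rate of 0.65 with coverage 0.60 produces a warning by default and a failure when strict mode is on. A held-out pass rate of 0.79 blocks promotion only when the minimum is raised, for instance to 0.80.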

A corpus evol step is not wired on the current slim vox binary. Use external tooling or scripts until a corpus evol subcommand lands.

# Intended future shape (not implemented):
# EVOL_GATE=1 vox mens corpus evol …

Use vox mens corpus mix with mens/config/mix.yaml, or merge JSONL with your own tooling. There is no vox corpus merge subcommand today.
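
If you merge JSONL with your own tooling, the core step is small. The sketch below is one possible approach and is unrelated to vox mens corpus mix: `merge_jsonl` is a hypothetical helper that takes file contents as strings, keeps first-seen order, and drops blank lines and exact duplicate lines. It does not parse JSON, so two records that differ only in key order or whitespace are both kept.

```rust
use std::collections::HashSet;

/// Concatenate JSONL inputs in order, skipping blank lines and
/// exact duplicate lines.
fn merge_jsonl(inputs: &[&str]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut merged = Vec::new();
    for input in inputs {
        for line in input.lines() {
            let line = line.trim();
            if line.is_empty() {
                continue;
            }
            // `insert` returns false when the line was already present.
            if seen.insert(line.to_string()) {
                merged.push(line.to_string());
            }
        }
    }
    merged
}
```

Reading each source with `std::fs::read_to_string` and writing the result back with one record per line is enough to turn this into a standalone merge step.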

| Mode | Command | When to use |
| --- | --- | --- |
| Mens Candle QLoRA (primary) | vox mens train --device cuda (defaults: --backend qlora, --tokenizer hf; optional --model <hf_repo>) | Native qlora-rs + HF weights; CUDA/Metal feature builds; see mens-training.md |
| Qwen3.5-4B (4080 16GB) | cargo build -p vox-cli --release --features gpu,mens-candle-cuda then vox mens train --preset qwen_4080_16g --device cuda … | Preset path; full proxy stack defaults on CUDA unless --qlora-allow-partial-proxy-stack |
| Burn scratch LoRA | vox train --data-dir … / VOX_BACKEND=cpu | Not vox mens QLoRA: small VoxTokenizer model + wgpu/CPU in vox-tensor |
| vox mens train --backend lora | Rejected at runtime | Use --backend qlora for Mens dispatch (SSOT) |
| Legacy vox train (mens-dei) | vox train … | --provider local → bail message → vox mens train --backend qlora; Together remote; --native Burn-only scratch |
| CI strict | VOX_EVAL_STRICT=1 | Fail promotion on eval gate failure |
| CI benchmark | VOX_BENCHMARK=1 | Run held-out benchmark before promotion |

Artifact layout: target/dogfood/train.jsonl (canonical input), target/dogfood/run/ (output). Version naming: lora-adapter-$VCS_SHA, eval-gate-$VCS_SHA.