Mesh, Dashboard & Distributed Compute — Research (2026-05-09)
Mesh, Dashboard & Distributed Compute — Research
Section titled “Mesh, Dashboard & Distributed Compute — Research”Status. Research synthesis. Not an implementation plan. Eventually decomposes into one or more plans-of-record (lock-leader durability, language-level distribution annotation, dashboard “Add a Node” wizard, etc.). Per feedback: scope-check before any of those plans gets implemented.
0. Reading guide
Section titled “0. Reading guide”This document spans four concentric horizons:
- What we have today (§1) — the truth about mesh, dashboard, durable workflows, and multi-agent VCS as of 2026-05-09. File-cited; not glossed.
- What’s missing (§2) — killer features, ranked. Mesh transport, dashboard control, language primitives.
- Security and trust (§3) — the threat model, the prior-art lessons, and the critical path from trusted-LAN to safely-internet-facing.
- The grand VOX network (§4) — what would make a global volunteer compute mesh tractable: language-level durability, idempotency, attestation, multi-agent VCS over mesh.
Companion specs (already exist; this doc cross-references them rather than restating):
- populi-mesh-north-star-2026.md — three-slice (S1/S2/S3) capability roadmap and seven workstreams (W1–W7).
- populi-mesh-improvement-backlog-2026.md — flat MESH-001..MESH-210 backlog.
- populi-mesh-a2a-durability-spec-2026.md — durable A2A store (W6).
- populi-mesh-config-baseline-spec-2026.md —
[mesh]config schema (W7). - populi-mesh-local-observability-spec-2026.md —
vox.mesh.*span attributes (W5). - populi-mesh-probe-correctness-spec-2026.md / -plan-2026.md — hardware probe trait (W2).
- vox-dashboard-design-brief-2026.md — surfaces, navigation, anti-patterns.
- dashboard-migration-research-2026.md — current Axum + React shape.
- external-frontend-interop-plan-2026.md — auth/transport interop.
- durability-runtime-audit-2026.md — workflow runtime gaps.
- vox-language-rules-and-enforcement-plan-2026.md — five-phase language enforcement plan.
- multi-agent-vcs-replication-research-2026.md / -spec-2026.md — convergence sets, op-fragments.
- agentic-version-control-automation-research-2026.md — capability tokens, banned commands.
- git-concurrency-policy.md — git-exec banned-command policy.
- cryptography-ssot-2026.md — sole crypto SSOT.
- ludus-identity-github-integration-research-2026.md — GitHub-attested identity model.
- ludus-security-and-anti-cheat-research-2026.md — DevRank, tiered trust.
- scientia-mesh-integration-research-2026.md — discovery feedback loop.
- nextgen-orchestrator-research-2026.md — synthesis of orchestrator failure modes.
1. Current state — audited
Section titled “1. Current state — audited”1.1 Mesh transport & control plane (today)
Section titled “1.1 Mesh transport & control plane (today)”What works. vox-populi ships single-node-correct: HTTP control plane with bearer auth (Mesh / Worker / Submitter / Admin roles, optional JWT-HS256), node registry persisted as JSON, A2A inbox with idempotency keys and lease envelopes, mens training in-process via Burn/Candle. Hardware probes (NVML / wgpu / DRM / Metal) advertise capacity. Bearer comparison uses subtle::ConstantTimeEq.
Architecture. A2A messages (A2ADeliverRequest, vox-mesh-types/src/a2a.rs:5) carry task spec, optional JWE-wrapped secrets, idempotency key, and W3C traceparent. Orchestrator forwards through vox-orchestrator/src/a2a/dispatch/mesh.rs; receiver polls inbox via vox-orchestrator/src/a2a/remote_worker.rs. The transport plugin vox-plugin-populi-mesh is the cdylib seam between vox-mesh-types (L0) and the HTTP handlers; orchestrator and CLI talk through a MeshDriver trait.
Five footguns the audit surfaced (severity-ordered).
- Local-first execution shadows the mesh. ADR-017 declares leases authoritative;
a2a/dispatch/mesh.rsstill local-falls-back on any send failure without checking whether the remote node has already claimed. Result: silent duplicate execution under packet loss; “the mesh” is closer to a debug visualization than a runtime. North-star §6.1 already calls this out. - JWE secrets are encrypted at send and never decrypted on the worker.
vox-orchestrator/src/a2a/jwe.rsdefinesdecrypt_jwe_compact(); the worker hot path validates well-formedness and walks away (remote_worker.rs:117). Cross-node secret sharing is theatre until W3 lands. - Hardware probes are unvalidated. No mock harness, no probe-replay test (W2 fixes this); operator labels override probe output, so routing decisions depending on probe results are unreliable.
- No resource accounting wired up.
vox-mesh-types/src/kudos.rsdefinesKudosPrimitiveandCreditJobRequest; nothing computesduration_ms, credits a node, or queries a contribution ledger. The “shared compute volunteer network” depends on this and the plumbing does not exist. - Bare HTTP, no TLS. All-local testing fine; cross-node deployments send
TaskSpec, signatures, and JWE-wrapped secrets over plaintext. Bearer tokens are in cleartextAuthorizationheaders. Scope-id is membership, not authentication.
1.2 Dashboard mesh-control surface (today)
Section titled “1.2 Dashboard mesh-control surface (today)”What ships. vox-dashboard is Phase-1: an Axum server hosting a React 19 SPA compiled from Vox view-language. Transport is WebSocket (/v1/ws) for events + HTTP POST /v1/tools/call for commands (transport.ts). Bearer token injects via meta tag. Localhost auto-binds with allow_unauthenticated = true.
What’s stub-only. Mesh routes return fixture JSON acks. The audit table:
| Surface | Route / file | Status |
|---|---|---|
| List nodes | GET /api/v2/mesh/nodes (api/mesh.rs:82) | Fixture stub. Comment: “Phase 2 replaces this stub with a live read from the orchestrator mesh registry.” |
| Add a node | — | Missing. No provisioning route. |
| Remove a node | — | Missing. |
| Configure node role | — | Missing. |
| Dispatch a job | generated/TaskDispatch.tsx:3 | UI-only stub (toggles local state). |
| View job status / logs | GET /api/v2/runs[/{id}] (api/runs.rs) | Fixture stub. |
| Kill / pause / replay | POST /api/v2/mesh/nodes/{id}/{kill,pause,replay} (api/mesh.rs:165) | Acks; no orchestrator wiring. |
| Topology view | generated/NetworkTab.tsx | Empty placeholder. |
| Models surface | GET /api/v2/models/usage_24h | Fixture stub. |
FFScript is unrelated. The FFScript mutation API (see ffscript-mutation-api-spec-2026.md, -panel-schema-spec, -linter-design) is a document-mutation API for FableForge — not dashboard panels. Vox dashboard panels are authored in Vox view-language that lowers to TSX. Don’t conflate them.
Gap for personal-mesh spin-up. The “I just opened the dashboard for the first time → my friend’s GPU is running my jobs” journey has no hand-off pairing flow. Stages 4–7 of that journey (wizard, pairing, donation-policy editor, live topology with handshake feedback) do not exist. Stages 8–10 (dispatch a job, watch tokens stream, see the run land in Runs) have stubs but no live wiring.
1.3 Durable workflows & language layer (today)
Section titled “1.3 Durable workflows & language layer (today)”Syntax-complete, execution-incomplete. Vox parses three durability keywords (vox-compiler/src/ast/decl/fundecl.rs), lowers them to a DurabilityKind enum (vox-compiler/src/hir/nodes/durability.rs), and emits identical async Rust for all three. Codegen ignores schedule_interval and durability metadata. Effect annotations (Net | Db | Fs | Env | Clock | Random | Spawn | Mcp(name) | Nothing — effect.rs) parse and propagate via typeck/effect_check.rs but don’t reject time.now() / random.* inside a workflow body.
Workflow runtime (vox-workflow-runtime). Linear interpreter with a JSON journal in vox-db. Idempotency is opt-in user-supplied with { activity_id: "…" } strings. Replay supports literal loops + deterministic conditionals; match and complex branching unsupported. Persistent actor state fields parse but do not compile (see durability-runtime-audit-2026.md §rationale).
Distribution today. Coupled to a mesh_* naming convention plus a with { required_labels, is_detached } decorator. No @remote annotation, no typed Worker value, no language awareness of distribution.
Concrete gaps vs. Temporal/Restate/Cadence: workflow.ID + execution.ID; structured signals/child-workflows; deterministic activity-ID derivation; saga compensation DSL; side-effect determinism guard at compile time. Phase 5 of vox-language-rules-and-enforcement-plan-2026.md commits to closing these — but as warnings, not errors, in v0.5.
1.4 Multi-agent VCS (today)
Section titled “1.4 Multi-agent VCS (today)”Worktree-per-agent under one daemon. Each agent gets .claude/worktrees/<branch>/. vox-orchestrator-d is single-instance per machine. File-level locks (Exclusive / SharedRead) live in an in-memory Arc<RwLock<HashMap<PathBuf, LockEntry>>> (vox-orchestrator-queue/src/locks/mod.rs).
Capability tokens. WorkingTreeWrite and BranchCreate (vox-orchestrator-types/src/vcs_capability.rs). Soft-private mints (#[doc(hidden)] pub fn mint); Phase 4 hardens to pub(crate) + sealed traits. No revocation today.
Banned-command enforcement. git-concurrency-policy.md: git stash, git reset --hard, git clean -f*, git restore ., git checkout . all rejected at the GitExec layer before spawning git. Agentic commits get three trailers (Co-authored-by, Vox-Model-Id, Vox-Workspace). Pre-commit secret scan; repo-local pre-commit hook bypassed.
Oplog (vox-orchestrator-queue/src/oplog/mod.rs). OperationEntry has id, agent_id, timestamp_ms, kind (13+ variants), snapshot_before/after, change_id, model_id, and a SHA-3-256 predecessor_hash chain. In-memory only. Lost on daemon restart. No replication.
The single-machine assumption is the load-bearing constraint. Two daemons on different mesh nodes coordinating the same repo would race on locks, oplog causality, file-affinity assignments, and worktree creation. Phase 3 of multi-agent-vcs-replication-spec-2026.md plans op-fragment gossip via Populi mesh; today is local-only.
2. Killer features missing
Section titled “2. Killer features missing”2.1 Mesh transport & control plane
Section titled “2.1 Mesh transport & control plane”Ranked by impact ÷ effort, drawing on the audit and on the populi-mesh north-star/backlog:
| Rank | Feature | Why it’s a killer | Crate home |
|---|---|---|---|
| 1 | Authoritative leases (W1, ADR-017) wired through dispatch | Closes the local-first shadow; turns the mesh from advisory into runtime | vox-orchestrator/src/a2a/dispatch + vox-orchestrator-queue (lease persistence) |
| 2 | JWE decrypt on the worker (W3) | Secrets actually become end-to-end | vox-orchestrator/src/a2a/remote_worker.rs + vox-crypto |
| 3 | TLS or WireGuard sidecar on the populi HTTP plane | Closes passive eavesdropping; prerequisite for everything else internet-facing | vox-populi/src/transport/ (option) or sidecar |
| 4 | Resource accounting end-to-end — populate KudosPrimitive::GpuComputeMs from TaskResult.duration_ms, persist a contribution ledger | Volunteer mesh requires this; types are already there | vox-populi/src/quota/ (new) + vox-mesh-types::Kudos |
| 5 | Per-key quota + reputation EMA in vox-mesh-types::NodeRecord sidecar | Abuse fuse before opening to strangers | vox-populi/src/quota/ |
| 6 | GitHub-attestation gate at pairing time (Ludus device-flow) | Sybil resistance without a Vox-owned account system | vox-identity + ludus-identity-* integration |
| 7 | Per-node sandbox boundary — move in-process executor behind SkillRuntime trait | Prerequisite for accepting work from non-paired peers | vox-skill-runtime (trait already exists) |
| 8 | Result attestation via signed deterministic replay with submitter-side spot-checking | Trust-but-verify scales without TEEs | vox-mesh-types::TaskResult.attestation (already a field) |
| 9 | Cross-node trace continuation (W5/S2) — receiver continues traceparent | Debugging multi-node failures becomes possible | vox-populi/src/transport/ + vox-orchestrator/src/a2a/ |
| 10 | Mesh-wide model inventory — refresh + aggregate so the planner sees what LoRAs live where | Ends “have to retry on local because we forgot the remote has the weights” | vox-populi/src/registry/ + vox-orchestrator::models |
2.2 Dashboard mesh-control GUI
Section titled “2.2 Dashboard mesh-control GUI”Ranked top-18; the table maps each to its crate home and to a complexity tier (S/M/L). Sources: prior art (Tailscale, Coolify, Ray, Grafana, Slurm-web, Modal/Replicate, Temporal UI) — see §3.5 for citations.
| # | Feature | Value (one line) | Crate | Cx |
|---|---|---|---|---|
| 1 | Live topology canvas with health colors | One glance → which node is on fire | vox-dashboard (frontend) + orchestrator EventBus | M |
| 2 | ”Add a Node” wizard (one-shot install + QR-code as coequal) | The on-ramp; without this, mesh is a bookshelf | vox-dashboard/src/api/mesh + vox-crypto + vox-identity | L |
| 3 | Per-node spend gauge + mesh-wide budget bar | Stops bill-shock; matches existing budget.* settings | vox-dashboard + vox-telemetry | S |
| 4 | Donation-policy editor (slots, kinds, NSFW filter, per-peer overrides) | Volunteers gate what their hardware does | new vox-mesh-policy (L2) consumed by orchestrator admission | M |
| 5 | ”Replay this run on another node” (stub route already exists) | Ergonomics win on top of an existing seam | vox-dashboard + vox-orchestrator-queue | S |
| 6 | Privacy-class indicator on every job + every span | Ties dashboard to vox.mesh.privacy_class from S1 obs spec | vox-dashboard + vox-mesh-types | S |
| 7 | Audit-log scrubber (timeline slider over oplog → state at instant) | The Temporal-replay equivalent for Vox | vox-orchestrator-queue (oplog reader) + vox-dashboard | M |
| 8 | Mesh-wide model registry view (“which LoRA / Ollama tag lives where”) | Answers “who can run llama-70b?” before dispatch | vox-dashboard + new vox-mesh-models query | M |
| 9 | Workflow visual debugger (timeline of activity calls) | Pairs Forge with the durable workflow runtime | vox-actor-runtime event hooks + vox-dashboard | L |
| 10 | Dry-run / cost-preview before dispatch | ”$0.18 on opus-4.7 at sonnet’s p50 latency” | vox-orchestrator planner cost estimator + vox-dashboard | M |
| 11 | ”What is this worker computing now?” (privacy-tier-gated live span) | Mesh-debugging-as-a-feature | vox-populi + vox-orchestrator-queue + vox-dashboard | M |
| 12 | Onboarding wizard for joining someone else’s mesh | Inverse of #2; paste an invite → become a worker | vox-dashboard + vox-populi | M |
| 13 | GitHub-Identity-bound role assignment (mesh-mate “trust ledger”) | “Allow @aurelia code-edit jobs; everyone else read-only” | vox-identity trust ledger + vox-mesh-policy | M |
| 14 | Mesh-aware command palette (⌘K) (“kill on node X”, “drain Y”, “send latest to friend-gpu”) | Operator power tool; already in design brief | vox-dashboard (cmdk.vox) | S |
| 15 | Per-node real-time terminal (Coolify-style) over WS, gated by privacy/role | Drops into the peer without SSH | vox-dashboard + new vox-mesh-shell-bridge | L |
| 16 | ”Drag-to-assign-role” on the topology canvas | Direct manipulation beats menus | vox-dashboard + orchestrator affinity API | M |
| 17 | Run-row drawer with full event tree + trace_id deep-link | Wires vox.mesh.trace_id through the UI | vox-dashboard + vox-db llm_interactions.trace_id | S |
| 18 | Node-health probe inline sparklines (CPU/RAM/GPU VRAM/queue) | Quick triage; populi probes already emit them | vox-populi + vox-dashboard | S |
Anti-features. Force-graph that re-layouts every event; toast-confetti per run; YAML-as-the-primary-editor; right-click context menus as the only path to actions; topology that pretends to be physically accurate; mesh-peer chat (becomes Slack); fan-out HTTP polling instead of using the WS bus.
Five things the dashboard should NOT do. Become a code editor. Gate or replace the CLI (every action must have a vox … equivalent and the dashboard should display it). Require login on 127.0.0.1. Run as a hosted multi-tenant SaaS. Become an end-user chat product.
2.3 Language-level features
Section titled “2.3 Language-level features”Top-7 ranked by ROI (impact ÷ effort) for a Vox→mesh→durable pipeline:
| Rank | Feature | Why now |
|---|---|---|
| 1 | Workflow determinism check (Phase 5 Task 3 brought to error-not-warning) | Turns ADR-019 from advisory to law; one rule, one diagnostic; extends existing effect_check.rs |
| 2 | Auto-derived activity_id = blake3(workflow_id ‖ call-site-span ‖ structural-arg-hash ‖ replay-counter) + @with_id override | Removes manual idempotency keys — the #1 distributed footgun |
| 3 | @remote annotation + serializable trait bound | Distribution becomes a typed concept; replaces mesh_* naming convention |
| 4 | DurablePromise[T] as the single awaitable primitive (Restate-shape) | One canonical shape (C4 in LANGUAGE_DESIGN_PRIORITIES.md); kills Future/Promise/Activity divergence |
| 5 | Content-addressed workflow versioning via existing vox-package | Eliminates version-skew bug class structurally; almost-free given existing CAS |
| 6 | par_map / pipeline / event_sourced syntax (desugars over existing primitives) | Three high-traffic patterns become one-liners; reduces decision count for LLM authors |
| 7 | vox workflow preview <fn>(args) — type-check + effect-infer + dry-run journal projector | Converts runtime questions to compile-time; LLM can self-verify before dispatch |
Notable omissions (lower ROI right now): supervision-tree restart strategies (port from OTP); typed Mailbox[T]; signal name within timeout sugar; CRDT integration for replicated state.
3. Security & trust for an open volunteer mesh
Section titled “3. Security & trust for an open volunteer mesh”3.1 Threat model
Section titled “3.1 Threat model”The mesh’s collapse condition today: any peer with VOX_MESH_TOKEN is fully trusted. Acceptable for power-user-LAN (current ambition per populi-mesh-north-star-2026.md §6.2); unsafe for anything wider.
| Adversary | Capability today | Component impacted |
|---|---|---|
| Malicious worker | Forge results; steal in-flight prompts; sell donated time; lie about hardware (operator labels override probes); tasks run in-process → full RCE on submitter side if exfil | vox-orchestrator/src/a2a/dispatch/mesh.rs, executor, no sandbox |
| Malicious submitter | Send malware-shaped TaskKinds; exhaust GPU memory; ship VoxScript with exfil; spam expensive jobs (no per-key quota) | vox-orchestrator/src/a2a/inbox, vox-skills, vox-wasm-engine |
| Malicious mesh peer | Replay old A2A envelopes; forge HS256-signed messages (symmetric secret); squat lease IDs | vox-populi/src/transport/handlers.rs, JWT auth |
| Sybil attacker | Spin up N fake nodes; vote-stuff redundancy checks; harvest credit/reputation; flood inbox | vox-mesh-types::NodeRecord, future trust ledger |
| Network attacker | Bare HTTP → passive read of prompts/results/headers; active MITM swaps results | All of vox-populi::transport, no TLS |
| Byzantine collator (post-leases) | Lease holder withholds completion to grief originator; gateway selectively drops dispatches | vox-orchestrator/src/a2a/dispatch/, ADR-017 lease path |
3.2 Lessons from prior art
Section titled “3.2 Lessons from prior art”- BOINC — Steal: redundant execution + majority voting; homogeneous redundancy classes for floating-point determinism; adaptive replication (only re-verify untrusted hosts) cuts overhead. Avoid: hand-tuned per-project credit economy; weak Sybil resistance. (Anderson 2019 paper, BOINC HR wiki)
- Folding@home — Steal: server-signed work units; validate returned WUs against expected energy/checksum bounds. Avoid: top-down trust; closed core.
- Akash / Bacalhau / Golem — Steal: Akash 2026’s Kata-Containers micro-VMs + composite CPU/GPU TEE attestation as the eventual sandbox tier. Avoid: blockchain payment rails — wrong cost curve for power-user dogfood. (Akash roadmap 2026, Akash TEE discussion)
- Tor / I2P — Steal: assume a hostile network. Avoid: onion routing — wrong threat model (we want auditable identity, not anonymity); latency cost is fatal.
- Tailscale + Headscale — Steal: identity-bound overlay; control-plane / data-plane separation; WireGuard transport; MagicDNS-like simple naming. Avoid: SaaS control plane lock-in.
- Nostr / ATProto — Steal: identity = keypair; signatures travel with messages; ATProto’s
mutable handle + immutable DIDpattern. Avoid: relay-as-only-Sybil-defense. - libp2p / Iroh / Hyperswarm — Steal: PeerID = BLAKE3(Ed25519 pubkey); pubkey-based dialing through NAT; bounded Circuit Relay V2. Avoid: full DHT-as-SSOT (ADR-020 settles this — HTTP populi remains SSOT).
3.3 Recommended architecture
Section titled “3.3 Recommended architecture”Layered identity (extends existing vox-identity + Ludus design):
- Long-term Ed25519 node identity (already present). PeerID = BLAKE3(pubkey). Persist; never rotate without explicit revocation.
- GitHub OAuth attestation for Sybil resistance (per ludus-identity-github-integration-research-2026.md). Bind
node_pubkey ↔ github_numeric_idvia signed self-attestation hosted as a verifiable gist/repo file. DevRank multipliers (ludus-security-and-anti-cheat-research-2026.md) gate trust tiers. - Per-pairing X25519 keys (W3, already speced).
- Per-job ephemeral Ed25519 subkey signed by long-term node key, scoped to single
task_id, lifetime = lease TTL.
Hybrid trust topology (not pure web-of-trust, not central registry):
- Local pairing graph is SSOT for “who can I dispatch to.”
- Optional public attestation registry = a signed JSON manifest in a known git repo; mirrors ATProto DID-doc.
- Trust ledger lives in vox-db on each node; replicated by gossip only between paired peers — no global consensus.
Revocation: short-lived attestations (≤ 30 days) + manual vox populi revoke <peer-id> propagating as a tombstone in the pairing-store. Refuses dispatch even if cert hasn’t expired. Deliberately not OCSP-shaped infra.
3.4 Sandbox escalation ladder
Section titled “3.4 Sandbox escalation ladder”Vox already has three executors at increasing strength (in-process Rust → WASM via vox-wasm-engine → containers via vox-container). Add a fourth tier (micro-VM via firecracker/kata) for the “stranger” scenario.
| TaskKind | Tier | Reason |
|---|---|---|
| VoxScript | T2 WASM (default), T3 container if file/net I/O declared | Vox lang has capability annotations; WASM enforces fuel + memory cheaply |
| Embed | T1 in-process | Pure compute on trusted model binary; untrusted input only |
| TextInfer | T1 trusted models, T3 container for arbitrary HF checkpoints | trust_remote_code=True makes weights code-equivalent |
| ImageGen / SpeechTranscribe | T3 container | Same model-as-code risk; large weights amortize container cost |
| TrainQLoRA | T3 container with GPU passthrough | Long-running, writes checkpoints, easy to abuse for crypto-mining |
| Anything from a non-paired peer | T4 micro-VM (kata/firecracker) | Same posture Akash 2026 lands on. Pre-wire the trait; ship later |
Hard rule: the orchestrator process itself never executes untrusted task payloads. Move the in-process executor behind the existing vox-skill-runtime trait so the boundary is consistent.
3.5 Result attestation & verification
Section titled “3.5 Result attestation & verification”| Approach | Cost | Determinism needed | Right for |
|---|---|---|---|
| Redundant exec + majority voting (BOINC) | 2–3× compute | Yes — homogeneous-redundancy lesson: vote only across “numerically equivalent” hosts | TrainQLoRA quality gates, deterministic Embed, batch ImageGen. Not sampled TextInfer |
| TEE attestation (SGX / SEV-SNP / H100 confidential compute) | Hardware-tied; Akash-style is real but heavyweight | No | Confidential inference where worker mustn’t read the prompt. Defer; no H100s in a personal mesh |
| Signed deterministic replay (proof-of-effort) | 1× compute + signed trace | Yes for replay | VoxScript, Embed, batch jobs. Worker signs (task_id, input_hash, output_hash, gpu_seconds, blake3 of intermediate states) with per-job ephemeral key; submitter spot-checks ~1% by re-running |
Mapping per TaskKind: VoxScript / Embed → signed deterministic replay (1×). TextInfer (greedy / temp=0) → signed replay; sampled → trust + spot-redundancy on seed matching. TrainQLoRA → redundant exec on 2 nodes for the first epoch (shape, loss-curve match), trust thereafter. ImageGen / SpeechTranscribe → 1–5% submitter-side replay sample.
3.6 Critical path to safe-to-deploy-publicly
Section titled “3.6 Critical path to safe-to-deploy-publicly”Ordered by what unblocks what:
- TLS or WireGuard sidecar on the populi HTTP plane. Closes passive eavesdropping; prerequisite for everything else.
- Replace JWT-HS256 with Ed25519-signed envelope (worker-signature handler path that already exists in
vox-identitybut is unused). Fixes “any token-holder can forge.” - Land W3 pairing decrypt so JWE round-trip stops being theatre.
- Per-node sandbox boundary — move in-process executor behind
SkillRuntime, default to container for non-vetted TaskKinds. - Per-key quota + reputation persistence — abuse fuse before opening to strangers.
- GitHub-attestation gate at pairing time.
- Result attestation (signed deterministic replay) on the worker, with submitter spot-check sampler.
- (Only after 1–7) TEE path for confidential inference; redundant-execution voting for batch jobs.
Steps 1–4 take Vox from LAN-only → vetted-public-peers. Steps 5–7 from vetted → open-with-bounded-blast-radius. Step 8 is the long tail.
3.7 Five things to explicitly NOT build
Section titled “3.7 Five things to explicitly NOT build”- Custom crypto. No new AEAD, no novel signature scheme, no homegrown ZK. Stick to
vox-cryptoper cryptography-ssot-2026.md. - Blockchain / token economy. Cost curve wrong for personal mesh; introduces governance burden.
- TEE-first architecture. SGX deprecated on consumer Intel; SEV-SNP server-only; H100 CC real but no consumer access. Build the interface; stub the implementation.
- Onion routing / Tor-style anonymity. Wrong threat model; we want auditable identity, not unlinkability.
- Web-of-trust transitive scoring. Math is right, UX is fatal. Use paired peers + GitHub attestation as binary gates; reputation as a signal, not a transitive capability.
4. The grand VOX network — what makes it tractable
Section titled “4. The grand VOX network — what makes it tractable”4.1 Language-level distributable durability
Section titled “4.1 Language-level distributable durability”The user’s load-bearing ask: durable Vox workflows distribute across the mesh without the usual footguns (non-determinism, retries, partial failure, idempotency, version skew, leaked secrets), especially for AI-managed code where an LLM is the author.
4.1.1 Effects-as-types: minimum credible enforcement
Section titled “4.1.1 Effects-as-types: minimum credible enforcement”Vox already does subset-checked propagation in effect_check.rs. Smallest credible step: close the holes, not redesign.
- Bottom-up inference over the call graph. A function’s inferred set = union of body’s builtin effects ∪ callees’ inferred sets. Phase 5 Task 1.
- Concrete row, not row-polymorphic for v1:
Net | Fs(read: GlobSet, write: GlobSet) | Time | Random | Secret | Spawn | Mcp(name) | Mailbox(actor_ty). workflowbody restriction: inferred set ⊆{Mailbox(_), Spawn(activity)}.Time | Random | Netrejected unless wrapped in anactivitycallee.- Error-message shape (Koka-style symmetric pairs):
vox/effect/missing-net-decl: function `fetch_quote` calls `std.http.get`requiring `net`, but is declared `@uses(fs)`.└─ called via: fetch_quote → resolve_url → std.http.get [span]suggested fix: replace `@uses(fs)` with `@uses(fs, net)`.
4.1.2 Determinism boundary
Section titled “4.1.2 Determinism boundary”Forbidden inside workflow body: time.now, time.sleep, clock.*, random.*, std.http.*, db.*, fs.*, env.*, calls to non-activity/non-workflow fns with effect ≠ Mailbox, direct mailbox recv() from arbitrary actors (signals only).
Allowed: pure computation, await activity_call(...), await child_workflow(...), await signal(name) within timeout, select { ... } over awaitables.
Escape hatch: side_effect { expr } desugars to an auto-generated single-shot activity whose activity_id is (call-site span hash, replay-counter). Temporal SideEffect semantics — the only sanctioned non-determinism inside a workflow.
4.1.3 Distribution as a typed concept
Section titled “4.1.3 Distribution as a typed concept”Three options, scored against ergonomics, type safety, and effect interaction:
| Option | Verdict |
|---|---|
@remote fn foo() annotation (Ray-shape) | Default winner. Adds Spawn and Net to the fn’s effect row. Familiar to LLMs. Return type = DurablePromise[T]. |
Worker value + worker.spawn(fn, args) (Erlang-shape) | Keep as power-user escape. Excellent type safety; verbose at call site. |
| Runtime-only (today) | Reject. Loses every compile-time guarantee. The bug class “I called a function that will be remote and broke at deploy” is structurally preventable with an annotation. |
4.1.4 Idempotency primitives
Section titled “4.1.4 Idempotency primitives”Borrow Restate’s “durable promises”: every awaitable side-effecting op gets an ID-bearing journal entry; replay finds the entry and returns its value rather than re-running.
- Auto-derived
activity_id=blake3(workflow_id ‖ call_site_span_hash ‖ structural_arg_hash ‖ replay_counter). Stable across renames; broken by code motion (acceptable — see versioning §4.1.6). @with_id(expr)override for business identity (order_id,payment_intent_id).@activity(dedup = "7d")window.- Compile-time warning
vox/activity/nondeterministic-id-argsif effect inference shows args containtime.now()/random.*. - Activity calls return
DurablePromise[T]notFuture[T]— unifies awakeable / side_effect / activity into one primitive.
4.1.5 Three durable parallel patterns as one-liners (sketches)
Section titled “4.1.5 Three durable parallel patterns as one-liners (sketches)”// vox:skip// Map-reduce — fan out N independent activities, await all, foldlet receipts = orders |> par_map fn(o) { @activity charge(o) } |> fold(Receipt::empty, Receipt::merge)// vox:skip// Pipeline with backpressure — stage1 → stage2 → stage3pipeline { source: read_orders() // stream emit stage: validate buffered(64) // bounded channel stage: @activity charge buffered(16) // each is journaled sink: @activity persist}// vox:skip// Event sourcing — append-only log + projectionevent_sourced actor InventoryStream { state: Inventory = Inventory::empty() on Restock(sku, qty) => state.add(sku, qty) on Reserve(sku, qty) => state.try_take(sku, qty) projection daily = state |> group_by_day // auto-materialised view}Each desugars to existing primitives — no new keywords beyond pipeline and event_sourced.
4.1.6 Versioning & code mobility
Section titled “4.1.6 Versioning & code mobility”Three options ranked for Vox specifically:
- Content-addressed code (Unison-shape) — strongly recommended.
vox-packagealready has CAS; compiler emits stable@generated-hashheaders (Phase 1 Task 9). A workflow run is(workflow_fn_hash, args_hash). Replay loads code by hash; the wrong-version problem is structurally absent. New versions have different hashes — they coexist until drained. - Patch markers (
workflow.version("change-1", 0, 2)). Necessary escape hatch when the author legitimately needs to alter logic for in-flight runs (Temporal proves this). Lint nudges toward content-addressed deploys. - Submit-time pinning. Useful for short workflows; deployment policy, not language feature.
Operational tool: vox workflow drain --version <hash> retires old code.
4.1.7 AI-agent ergonomics — features that reduce LLM footguns
Section titled “4.1.7 AI-agent ergonomics — features that reduce LLM footguns”- Stable diagnostic IDs in
vox/<category>/<kebab>namespace (Phase 2). Models trained on 0.5 still recognize 0.7 errors. vox check --for-llmmode emitting minimal repro per diagnostic.vox workflow preview <fn>(args)— type-check + effect-infer + dry-run journal projector; the schedule of activities that would run, without doing them. Highest-value LLM tool because it converts “does this distributed program do what I think” from runtime to compile-time.- Stable
activity_idacross refactors — auto-derive from span+arg-hash; LLMs that rename functions don’t accidentally invalidate dedup history. - Required
@uses(...)on allpub fn— effect annotations are prompt scaffolding the model already uses; making them mandatory means LLM-generated code carries its own audit trail. - Symmetric error pairs (
missing-X↔unjustified-X): LLMs learn the inverse rule for free. - One canonical shape per concept (C4 in LANGUAGE_DESIGN_PRIORITIES.md): no
Future[T]andPromise[T]andDurablePromise[T]— pick one, kill the rest. - Distinctive surface keywords for distributed concepts (P2):
workflownotasync fn,activitynot@retry,mailboxnotchan— resists drift to JavaScript/Python idioms.
4.2 Multi-agent collaboration via mesh — what it takes
Section titled “4.2 Multi-agent collaboration via mesh — what it takes”4.2.1 Distributed lock arbitration
Section titled “4.2.1 Distributed lock arbitration”| Architecture | 2–10 nodes | 100+ nodes | Vox fit |
|---|---|---|---|
| Single coordinator-d (today) | Works with hot-standby + WAL replication. Ceiling ≈ low thousands of locks/s | Bottleneck; SPoF unless elected | Phase-1 winner. Promote one daemon as lock-leader; others forward via existing A2A envelope |
| Lease-based Raft (etcd-shape) | 3-/5-node quorum; tolerates (N-1)/2 failures | Native habitat | Phase-3 target. Pair with capability-token-mint Raft group |
| Optimistic + retry (CRDT) | Best when contention is rare; vector-clock LWW per file | Scales as long as conflicts stay sparse | Already the model for non-overlapping byte-range edits. Keep for affinity hints; not for hard locks |
| Token-passing ring (Erlang) | Latency = ring × node count; bad failure handling | Falls over | Skip |
Lightest 2–10 node design: single coordinator-d with hot-standby + heartbeat-promoted leader. Existing in-memory Exclusive/SharedRead map becomes the durable Raft state-machine value when scaling to 100+; Phase-1 needs only durable WAL of lock grants so a leader crash doesn’t lose a held lock.
4.2.2 Op-log gossip & replication
Section titled “4.2.2 Op-log gossip & replication”Given predecessor_hash already gives causal ordering:
- Gossip pattern. Push-rumor-mongering for fresh ops + periodic anti-entropy (Demers et al. epidemic algorithms). Each peer maintains a Bloom filter / Merkle tree of known op-IDs; sync neighbors on a 30 s tick. Hot path O(new ops); catch-up O(log N × repo size). The
OpFragmentEnvelopein multi-agent-vcs-replication-spec-2026.md §Wire-protocol is already shaped for this. - Unknown-parent fragment. Hold + request backfill (spec’s stated behavior).
parking-lotkeyed by missing parent op-ID; on inbound op insert, drain dependents. Bound the holding queue (1024 entries × 64 KiB) and DLQ overflow to avox-dbtable for human triage. - Retention. Tiered: hot (last 10 K ops or 7 days) in
VecDeque, warm invox-dbindexed by(convergence_set, op_id, parent_op_ids), cold compacts by collapsing snapshot ops older than the latest coherent base into a singleCheckpointop (jj-shape squash). Prune when no liveAgentChangereferences. - Durability home. New table
convergence_op_log(op_id PK, set_id, parent_op_ids JSON, payload BLOB, signature, agent_id, produced_at)plus(set_id, produced_at)index invox-db. Same SQLite file as the capability ledger; separate tables.
4.2.3 File-affinity replication
Section titled “4.2.3 File-affinity replication”Today affinity is per-daemon. To go mesh-wide:
- LWW per file → cheap, double-claim race for ~RTT. OK for hint affinity; not authoritative.
- Vector-clock per file → right shape; widen
(path → daemon_id)to(path → (daemon_id, lamport)). Still needs anti-entropy gossip. - Quorum via consensus → strong but pays Raft latency on every file-write decision.
Recommendation: vector-clock LWW for hint; ExclusiveLock (single-coordinator §4.2.1) for the hard guarantee. Affinity = soft routing; lock token = hard contract.
4.2.4 Conflict tiering
Section titled “4.2.4 Conflict tiering”Four-tier funnel ordered by latency:
- Patch-commutativity. Today’s
MergePolicyV1byte-range overlap. Auto-merges the easy ~70%. - Lock-first. Hot paths (e.g.,
Cargo.toml, schema files) →Exclusive. Queue policy = FIFO with model-tier weighting (higher-trust tie-breaker). Max wait = 60 s soft → re-plan or requeue. - Three-way merge with LLM arbiter. “Socrates arbitration” (spec §Phase 4) — two semantically related edits; arbiter scores hallucination + author trust; if Δ > threshold pick a side, else fall through.
- Manual escalation.
vox vcs conflicts listmaterializes navigable artifact (spec Phase 2); human ack required for path-policy class (e.g.,vox-secrets/src/spec/mod.rs).
Add a LockWait outcome between AutoMerge and SurfaceConflict so tier-2 doesn’t masquerade as tier-3.
4.2.5 Branch & PR coordination
Section titled “4.2.5 Branch & PR coordination”- Per-task branches, not per-agent. The agentic-vcs research already binds branch to
AgentWorkspace; per-agent branches conflate identity with scope. - Push timing. When the task’s convergence set marks the change ready (tests pass, secret-scan clean,
vox_pr_openprecondition). Don’t push WIP across the mesh — keep them in the convergence set.vox_pushis Phase-2 capability per the agentic plan. - Convergence vs. N-PR. The
ConvergenceSetis the answer. Shared set → one PR; separate sets → separate PRs. Set membership = explicit user/orchestrator action. - Cross-node branch discovery. Reuse
ConvergenceSetAnnouncementenvelope. Each daemon advertises sets it owns; orchestrator joins listen + merge into a global set-registry. No forge round-trip for in-flight work.
4.2.6 Capability-token federation
Section titled “4.2.6 Capability-token federation”Three options for the Phase-1 → Phase-4 ramp:
- Mint locally + sign + validate-on-remote. Each daemon has a vox-secrets-issued keypair; tokens carry an Ed25519 signature; remote daemons verify against the federation’s root-of-trust. Lowest coordination cost; cleanest evolution. Matches Phase 4 of the agentic plan (“agent identities for op signing”).
- Mint via coordinator only. Forces every BranchCreate through the lock-leader. Tightly couples mint to lock-server availability — bad for partitioned operation.
- Per-task ephemeral tokens scoped to a single mesh dispatch. MENS dispatch envelope already has
idempotency_key; bind a capability set to the dispatch and let it expire on ack/fail. Excellent posture; layers on top of either of the above.
4.2.7 Code mobility
Section titled “4.2.7 Code mobility”| Need | Mechanism | Latency |
|---|---|---|
| Worker spawning, near-current snapshot | vox-package content-addressed bundle (already a primitive). Daemon ships CAS bundle of (working tree + recent op-log tail) | One round-trip; tens of MB |
| Long-running worker catch-up | Stream patches via mesh op-log gossip | Continuous; smallest payloads |
| Cold start / fallback | Pull from upstream forge (vox-forge) with explicit git fetch | Slowest; bypasses mesh trust |
Minimum-staleness combination: ship CAS bundle as seed (single round-trip), then subscribe to op-log gossip for the relevant convergence set. Forge pull only as backstop when the bundle’s base is older than the forge’s main.
4.2.8 Smallest credible MVP — two daemons, multi-agent, same repo, no data loss
Section titled “4.2.8 Smallest credible MVP — two daemons, multi-agent, same repo, no data loss”- Persist the lock map. Move the in-memory map into
vox-dbtablevcs_lock(path, kind, holder, expires_at, lease_id); WAL replay on daemon start. - Single lock-leader election.
lock_leaderrow in same DB with heartbeat; loser proxies via existing A2A envelope. Worktree-creation and branch-name minting flow through. Coordination overhead = one A2A round-trip per lock op. - Sign every capability mint and op-fragment with vox-secrets keys;
verify_signatureon the receiver. Extends existing JWE envelope. - Op-log durability + bounded gossip topic. Persist
OpFragmenttoconvergence_op_log; add theOpFragmenttopic to A2A dispatch with Bloom-filter anti-entropy every 30 s. MVP doesn’t need backfill ladders or epidemic-push tuning. - Code-mobility seed via vox-package bundle. Daemon B accepts a job from A → A ships CAS bundle (working tree + last 100 ops) → B subscribes to op-log topic. No forge involvement.
Steps 1–2 alone close the highest-likelihood-highest-severity risk (silent two-daemon lock race) and are the first credible “two daemons, no data loss” milestone. Steps 3–5 widen safety/freshness.
4.2.9 Top-5 risks for “100 agents, 10 nodes, one repo”
Section titled “4.2.9 Top-5 risks for “100 agents, 10 nodes, one repo””| # | Risk | L | S | Smallest fix |
|---|---|---|---|---|
| 1 | Two daemons race the same file lock — wrong-branch commits stack | High | High | Persist locks to vox-db + elect single leader before any second daemon comes online |
| 2 | Op-fragment with unknown parent stalls forever (producer offline) | Med | High | Bound wait queue (1024 / 64 KiB), DLQ overflow to vox-db, surface in dashboard |
| 3 | Capability token forged on a remote node (mint paths still pub) | Med | Very High | Sign every minted token with the local daemon’s vox-secrets key when a second daemon joins (Phase-3 brought forward) |
| 4 | Local op-log diverges from gossip → agent commits a force-push-ish op via banned-list bypass | Low | Very High | Move banned-list check from git_exec to a vox-bounded-fs process-spawn deny-list (already in agentic-vcs Phase 4 plan) |
| 5 | Affinity LWW oscillation — two daemons keep stealing the same file | Med | Med | Hold-down timer (60 s) + vector-clock comparison so the older claim wins ties |
5. Open questions / Wave-3 research
Section titled “5. Open questions / Wave-3 research”These weren’t answerable from the codebase alone and need either prototyping or another round of design:
- Sandbox for VoxScript with mesh-effects. WASM-based capability propagation is clean for in-process; what’s the right shape when the WASM module dispatches a remote activity? Does the ephemeral per-job key live in the host’s heap or the guest’s?
- Workflow signal typing. Restate’s awakeables are stringly-typed externally. Should Vox’s
signal name within timeoutcarry aSignal[T]type or stay string-keyed for cross-language interop? - Durable promise shape vs. existing
Future[T]. Single canonical primitive (C4) demands removingFuture[T]from std — what’s the compat story for non-distributed callers? - Per-region attestation revocation latency. Tombstones via gossip — what’s the worst-case revocation propagation time? Acceptable bound for personal-mesh; need numbers for public-internet.
- Lock-leader split-brain on partition heal. When two leaders briefly co-exist (network partition), how do their op-logs reconcile? Same answer as op-fragment unknown-parent, or is there a leader-specific case?
- Dashboard-as-orchestrator-extension. If the dashboard surfaces “drag-to-assign-role” with capability-token semantics, does the mint API need to gain a
via_dashboard: boolaudit field, or does the existing trailer chain suffice? - Mesh-aware
vox auditumbrella. tooling-convergence-findings-2026.md proposes avox auditumbrella; should mesh-health (probe correctness, lock backlog, oplog lag, attestation freshness) join the same surface, or stay in a separatevox populi audit? - Federation envelope shape. Should the op-fragment envelope adopt ForgeFed/ActivityPub’s Activity-object pattern for forge-event federation, or stay pure-binary? ActivityPub is too verbose as transport; the shape may still be useful.
6. Sources
Section titled “6. Sources”External (Wave-2 prior-art research):
- BOINC: A Platform for Volunteer Computing (Anderson, 2019) · BOINC HR wiki
- Akash Network 2026 Roadmap · Akash TEE discussion
- Why Tailscale · Tailscale Auth Keys
- Nostr (Wikipedia) · AT Protocol (Wikipedia)
- P2P Networking: WebRTC vs libp2p vs Iroh
- SoK: Analysis of Accelerator TEE Designs (NDSS ‘26)
- Acurast: Decentralized Serverless Cloud (2026)
- Confidential Computing Blueprint for AI Workloads
- Coolify · Ray Summit 2025 · Ray Clusters
- Grafana Loki OSS · Loki Live Tailing
- k9scli.io · Kubernetes Dashboard vs Lens vs k9s
- Slurm-web · HuggingFace Inference Endpoints
- Temporal Workflow Replay Debugger · Temporal Events & History
- Pijul Theory · Jujutsu · GitButler Workspace Branch
- Automerge · Yjs · Replicache Consistency Model
- Fossil Sync Protocol · Plastic SCM (Unity Version Control)
- etcd Raft library · Raft consensus
- ForgeFed protocol · Forgejo FAQ
- Demers et al.: Epidemic algorithms for replicated database maintenance
Internal (cross-references):
- Mesh: populi-mesh-north-star-2026.md, -improvement-backlog, -a2a-durability-spec, -config-baseline-spec, -local-observability-spec, -probe-correctness-spec, -probe-correctness-plan, scientia-mesh-integration-research.
- Dashboard: vox-dashboard-design-brief-2026.md, dashboard-migration-research-2026.md, external-frontend-interop-plan-2026.md, ffscript-mutation-api-spec-2026.md (FableForge, not dashboard panels).
- Durability / language: durability-runtime-audit-2026.md, vox-language-rules-and-enforcement-plan-2026.md, vox-language-rules-phase1-ssot-collapse-2026.md, feature-growth-boundaries.md, v0.5-core-ssot.md, nextgen-orchestrator-research-2026.md.
- Multi-agent VCS: multi-agent-vcs-replication-research-2026.md, -spec-2026.md, -impl-plan-phase1, agentic-version-control-automation-research-2026.md, agentic-vcs-automation-impl-plan-phase1, git-concurrency-policy.md.
- Security / identity: cryptography-ssot-2026.md, share-policy-2026.md, ludus-identity-github-integration-research-2026.md, ludus-security-and-anti-cheat-research-2026.md, telemetry-trust-ssot.md.
- Tooling / architecture: where-things-live.md, layers.toml, tooling-convergence-findings-2026.md, phase-numbering-index.md, v1-release-criteria.md.