Populi overlay personal cluster runbook
Populi overlay personal cluster runbook
Section titled “Populi overlay personal cluster runbook”Scope: Phase 6 personal clusters that use an overlay (for example WireGuard, Tailscale, ZeroTier) so Populi nodes behave like one fleet across the WAN. This is not a hosted public GPU pool and not default long-haul distributed training. See work-type placement matrix and ADR 017.
Preconditions
Section titled “Preconditions”- Every process that should share membership uses a consistent
VOX_MESH_SCOPE_IDwhen the control plane enforces scope (mens SSOT). - Bearer / JWT roles are configured via Clavis-backed secrets; never commit tokens to Compose files checked into git.
- TLS termination sits in front of
vox populi serveper ADR 008 when exposed beyond loopback.
Enrollment (high level)
Section titled “Enrollment (high level)”- Bring up the overlay so each node has stable virtual IPs or DNS names; verify MTU and UDP reachability for the overlay product you use.
- Deploy the control plane on a host that overlay peers can reach; bind to the overlay interface or a reverse proxy that listens there.
- Point workers at
VOX_MESH_CONTROL_ADDR/VOX_ORCHESTRATOR_MESH_CONTROL_URLusing the overlay URL, not a public LAN IP that disappears off-site. - Join + heartbeat: use the same intervals as LAN (see mens SSOT); add exponential backoff on 429/503 as for local clusters.
- Bootstrap tokens: prefer
VOX_MESH_BOOTSTRAP_TOKENexchange for one-shot join on new nodes instead of copying long-lived mesh tokens into chat or email.
Security posture
Section titled “Security posture”- Treat
GET /healthas the only intentionally unauthenticated route; everything under/v1/populi/*must see Bearer/JWT when the server is configured with secrets. - Split tokens: use worker vs submitter roles so compromise of a deliver-only client cannot reconfigure nodes.
- Scope id is a tenancy boundary: do not reuse one scope id across unrelated users “for convenience.”
- Quarantine (
POST /v1/populi/admin/quarantine) is the fast stop serving new mesh work lever for a suspect node while you investigate.
WAN boundaries and expectations
Section titled “WAN boundaries and expectations”| Topic | Expectation |
|---|---|
| Control plane RTT | Higher and more variable than LAN; heartbeats and lease renewals must use conservative timeouts in pilot configs. |
| Bulk artifacts / checkpoints | Do not assume large files ride the same path as HTTP join/heartbeat; use object storage, rsync over overlay, or another data plane you control. |
| Inference / interactive agents | Usable with lease-gated remote execution when implemented; expect latency and jitter to dominate UX on consumer links. |
| Long GPU training | Not default over overlay WAN in the matrix; pilot-only with checkpointing, explicit opt-in, and rollout checklist. |
| Distributed collectives | Out of scope by default across WAN; requires dedicated topology and ADR-level approval if promoted. |
Failure modes
Section titled “Failure modes”- Partition: nodes may appear stale in
GET /v1/populi/nodes; comparelast_seen_unix_msand applyVOX_MESH_MAX_STALE_MSclient-side filtering. - Asymmetric routing: verify both directions on the overlay before debugging Populi; traceroute/ping inside the tunnel first.
- Double execution: until ADR 017 is implemented for your task class, assume experimental relay does not provide ownership guarantees—local queues remain authoritative.
Related documentation
Section titled “Related documentation”- Deployment compose SSOT — image profiles and env blocks.
- Protocol convergence research 2026 — broader transport synthesis.
- Mens SSOT — current API and env reference.