Mens Coordination & Database Write Safety
Mens Coordination & Database Write Safety
Section titled “Mens Coordination & Database Write Safety”Single Source of Truth for how Vox mens nodes coordinate on Turso/libSQL, prevent simultaneous write conflicts, and deliver agent-to-agent messages reliably across process and machine boundaries.
[!IMPORTANT] All orchestrator coordination state (locks, op-log, A2A messages, heartbeats) persists to Turso when
VOX_MESH_ENABLED=1. On a single machine without mens these remain in-process only for zero-overhead local development.
Mental model: “Distributed” here means many orchestrator processes (e.g. two vox-mcp hosts) sharing durable Turso rows and HTTP A2A — not a single long-lived orchestrator singleton in one OS process. File routing and per-process structures still exist in each process; cross-node arbitration uses coordination tables (distributed_locks, etc.). The shared bootstrap factory lives in vox_orchestrator::bootstrap.
1. Architecture Overview
Section titled “1. Architecture Overview”┌────────────────────────────────────┐ ┌────────────────────────────────────┐│ Mens Node A (Device 1) │ │ Mens Node B (Device 2) ││ │ │ ││ Orchestrator A │ │ Orchestrator B ││ ├─ FileLockManager (in-process) │ │ ├─ FileLockManager (in-process) ││ ├─ MessageBus → DB-backed │ │ ├─ MessageBus → DB-backed ││ ├─ OpLog → persist to Turso │ │ ├─ OpLog → persist to Turso ││ └─ HeartbeatMonitor → Turso │ │ └─ HeartbeatMonitor → Turso ││ │ │ ││ EmbeddedReplica (local.db) ──────┼──┼──▶ Turso Cloud Primary │└────────────────────────────────────┘ └────────────────────────────────────┘ ▲ ▲ └──────── A2A HTTP relay ──────┘ /v1/a2a/deliver2. Turso Coordination Tables (Codex schema domain: coordination)
Section titled “2. Turso Coordination Tables (Codex schema domain: coordination)”All tables are added via the coordination Arca schema domain and created with
IF NOT EXISTS — safe for multi-node concurrent schema bootstrapping.
distributed_locks
Section titled “distributed_locks”Per-resource advisory fencing lock. Uses SQLite row atomicity (INSERT OR IGNORE)
as the CAS primitive — no external lock manager required.
| Column | Type | Purpose |
|---|---|---|
lock_key | TEXT PK | Logical resource path (e.g. "file:src/lib.rs") |
holder_node | TEXT | VOX_MESH_NODE_ID of lock owner |
holder_agent | TEXT | Agent session or task ID |
fence_token | INTEGER | Monotone counter; prevents ABA re-use |
acquired_at | TEXT | ISO8601 timestamp |
expires_at | TEXT | TTL-based expiry; sweep_expired_distributed_locks cleans stale rows |
repository_id | TEXT | Scope to git repository |
Lock acquisition protocol:
-- Attempt atomic acquisition (no-op if row exists and not expired)INSERT INTO distributed_locks (lock_key, holder_node, holder_agent, fence_token, expires_at, repository_id)VALUES (?, ?, ?, ?, datetime('now', '+30 seconds'), ?)ON CONFLICT(lock_key, repository_id) DO NOTHING;
-- Check if we wonSELECT fence_token FROM distributed_locksWHERE lock_key = ? AND repository_id = ? AND holder_node = ? AND expires_at > datetime('now');agent_oplog
Section titled “agent_oplog”Persisted mirror of the in-memory OpLog SHA-3 chain. Enables crash recovery
and cross-node auditability. Append-only; no OCC guard needed.
a2a_messages
Section titled “a2a_messages”Durable inbox for agent-to-agent messages. Cross-node delivery via the mens HTTP
relay endpoint POST /v1/a2a/deliver; fallback is DB polling.
mesh_heartbeats
Section titled “mesh_heartbeats”Cross-node heartbeat table. Updated by each node’s background tick. Any node can
query live_nodes_from_db(stale_threshold_ms) to see the full mens membership.
3. Conflict Resolution Strategy
Section titled “3. Conflict Resolution Strategy”Default: Last-Push-Wins (Turso sync)
Section titled “Default: Last-Push-Wins (Turso sync)”Turso applies last-push-wins at the row level during embedded replica sync. This
is acceptable for append-only tables (agent_oplog, a2a_messages) where
the AUTOINCREMENT primary key ensures no row is ever overwritten.
Opt-in: OCC for Contested Rows
Section titled “Opt-in: OCC for Contested Rows”For mutating tables (e.g. memories, agent_sessions) the occ module in
vox-orchestrator provides an application-layer guard:
SELECT written_atbefore writing.- Compare remote vs local ISO timestamp lexicographically.
- If remote is newer: apply
ConflictResolutionstrategy. - Default strategy:
TakeRight(remote wins; local write skipped). - On
DeferToAgent: creates aConflictManagerentry for human review.
Not Used: Turso MVCC (BEGIN CONCURRENT)
Section titled “Not Used: Turso MVCC (BEGIN CONCURRENT)”Turso’s experimental MVCC implementation has had acknowledged data-loss incidents
and is not stable as of 2026-03. We do not use BEGIN CONCURRENT.
Revisit when Turso marks it stable.
4. EmbeddedReplica for Mens Nodes
Section titled “4. EmbeddedReplica for Mens Nodes”When VOX_MESH_ENABLED=1 + VOX_DB_URL + VOX_DB_TOKEN are all set, VoxDb
automatically opens an EmbeddedReplica instead of a plain local file:
VOX_MESH_ENABLED=1VOX_DB_URL=libsql://my-db.turso.ioVOX_DB_TOKEN=<token>VOX_DB_PATH=/path/to/local-replica.db (optional; defaults to .vox/cache/db/local.db)Reads are sub-millisecond from the local file. Writes go to the primary and
replicate back. After shared-table writes, VoxDb::sync() is called
asynchronously to flush.
5. A2A Cross-Node Message Delivery
Section titled “5. A2A Cross-Node Message Delivery”Node A: MessageBus::send_routed(receiver, route=Remote { node_url }) │ ├─▶ Writes row to local a2a_messages (DB) │ └─▶ POST {node_url}/v1/a2a/deliver (JSON A2AMessage) │ ▼ Node B: inserts into its local a2a_messages Node B: MessageBus::poll_inbox_from_db() returns messageRetry on HTTP failure: 3 attempts with exponential backoff (500ms, 1s, 2s). After all retries fail: message remains in the DB inbox; receiver polls on next heartbeat cycle (≤60 s latency fallback).
6. Network Resilience
Section titled “6. Network Resilience”Connection Retries (Turso)
Section titled “Connection Retries (Turso)”attempt 1 → 500msattempt 2 → 1000ms + jitter(0..500ms)attempt 3 → 2000ms + jitter(0..500ms)...capped at 30sFormula: base_ms * 2^attempt + rand(0..jitter_ms), capped at max_ms=30_000.
Circuit Breaker (VOX_DB_CIRCUIT_BREAKER=1)
Section titled “Circuit Breaker (VOX_DB_CIRCUIT_BREAKER=1)”| State | Condition | Behavior |
|---|---|---|
| Closed | < N failures | Normal operation |
| Open | ≥ N consecutive failures | Returns StoreError::CircuitOpen immediately |
| Half-Open | After reset_timeout (30s) | One probe request allowed |
Default: N=5, reset_timeout=30s.
When Open: write callers buffer to AgentQueue for retry on recovery.
Mens HTTP Client Retries
Section titled “Mens HTTP Client Retries”PopuliHttpClient applies the same exponential backoff formula for join, heartbeat,
and A2A relay calls. Previously it had no retry logic at all.
7. Stale Lock Sweep
Section titled “7. Stale Lock Sweep”A background task (spawned by orchestrator at startup when DB is present) sweeps
expired rows from distributed_locks every 60 seconds:
DELETE FROM distributed_locks WHERE expires_at < datetime('now');This prevents phantom locks from crashed nodes that never released their rows. Lock TTL defaults: 30s for file edits, 5m for long-running tasks.
8. Environment Variables Reference
Section titled “8. Environment Variables Reference”| Variable | Default | Purpose |
|---|---|---|
VOX_MESH_ENABLED | false | Activate mens coordination |
VOX_MESH_NODE_ID | auto-generated | Stable node identity |
VOX_MESH_CONTROL_ADDR | unset | HTTP control plane URL |
VOX_MESH_SCOPE_ID | unset | Cluster tenancy ID |
VOX_DB_URL | unset | Turso remote URL |
VOX_DB_TOKEN | unset | Turso auth token |
VOX_DB_PATH | .vox/cache/db/local.db | Local replica path |
VOX_DB_CIRCUIT_BREAKER | false | Enable DB circuit breaker |
VOX_MESH_TOKEN | unset | Bearer token for mens HTTP routes |
9. Gaps & Future Work
Section titled “9. Gaps & Future Work”| Gap | Status | When |
|---|---|---|
Turso transform hook for server-side conflict resolution | Not available in Rust SDK | When Turso Go SDK ports to Rust |
| NATS JetStream for durable A2A at scale | Not needed at current mens size | When >100 concurrent agents |
Turso MVCC BEGIN CONCURRENT | Unstable | When Turso marks stable |
CRDT-based memory merging (cr-sqlite) | Research phase | When memory conflicts become common |
Related Documents
Section titled “Related Documents”docs/src/adr/004-codex-arca-turso-ssot.md— Turso naming conventionsdocs/src/reference/orchestration-unified.md— Orchestrator internalsdocs/src/reference/external-repositories.md— Repo discoverycrates/vox-orchestrator/src/locks.rs— In-process + distributed advisory lockscrates/vox-orchestrator/src/a2a.rs— A2A message buscrates/vox-orchestrator/src/occ.rs— OCC write guardscrates/vox-db/src/circuit_breaker.rs— DB circuit breakercrates/vox-db/src/schema/domains/sql/coordination.sql— coordination DDL (Arca fragment; merged ingamification_coordination.rs)