Lifecycle & Controller
The controller is the thin, stateless supervisor that cold-starts an engine on first connection, scales it to zero when idle, mints instant copy-on-write branches over shared immutable layers, and enforces single-writer-per-database fencing through the commit log's compare-and-swap token — turning a stateless engine plus an object-storage durability floor into an elastic, branchable database service.
Purpose & scope
This page expands source §2.E (Lifecycle / control) and the scale-to-zero and branching mechanisms described in §3 into a buildable specification for the controller/ component. The controller owns the lifecycle of engine instances (start / warm / idle / stop), the branch namespace over the versioned page store, and the single-writer fence that guarantees at most one writer per database.
The controller is deliberately not on the data path. It supervises engine processes and manipulates pointers and tokens; it never proxies queries, never holds durable state of its own, and never participates in commit acknowledgement. All durability lives in Slot C (the log service and page store on object storage); the controller is restartable, replaceable, and itself scale-to-zero-capable.
Note
"Engine instance" here means a running engine process (Slot B) bound to one database via a storage URL, holding a warm local cache. "Branch" means a named LSN pointer over shared immutable layers in the page store. The controller manages both; they are orthogonal — many branches can exist with zero running instances.
Responsibilities & non-goals
Responsibilities
- MUST cold-start an engine instance on first connection to an idle database, including process start and the initial cache warm.
- MUST detect idleness and tear down the engine (scale-to-zero) so only object-storage bytes bill at rest.
- MUST manage the branch namespace: create-from-base@LSN, list, delete, and track divergence against the versioned page store (see 04) and storage branch ops (see 03).
- MUST enforce single-writer fencing per database via the commit log's CAS token: acquire on becoming writer, renew via lease/heartbeat, and guarantee a stale writer's append is rejected.
- SHOULD implement keep-warm and warm-pool policies to bound cold-start tail latency.
- SHOULD survive its own crash and restart without losing any acked write or leaking a fence.
- MAY co-schedule N concurrent cold starts under a thundering-herd admission control.
Non-goals
- MUST NOT sit on the query data path or proxy SQL — that is the wire listener's job (see 07) or direct FFI (see 08).
- MUST NOT hold durable state outside Slot C; controller state is reconstructable from the log service and page-store manifest.
- MUST NOT attempt multi-writer concurrency within a single database — single-writer-per-DB is a deliberate architectural ceiling (source §4).
- MUST NOT resolve hot-row contention; that is serialized correctly by the single writer lane and handled per-tool (see 10).
Scale-to-zero model
Scale-to-zero rests on one invariant from source §3: compute is stateless — all durable state lives in Slot C. An engine instance is a cache plus a CPU; destroying it loses nothing durable. Therefore the controller is free to stop the engine whenever a database is idle and reconstruct an equivalent instance on the next connection.
What costs what
| State | Engine process | Local cache | What bills |
|---|---|---|---|
| Active | running | warm | compute + object-storage ops + idle storage bytes |
| Idle (pre-stop) | running | warm | compute (wasted) + storage bytes |
| Stopped (scaled to zero) | none | discarded | storage bytes only |
Cold-start budget
Cold-start is exactly process_start + cache_warm. The two terms have very different shapes and must be measured separately (forward to 09 Experiment 5).
- process_start
- Fork/exec or container spin-up of the engine binary, storage trait wiring, and a read of the page-store manifest / latest durable LSN. Bounded, mostly CPU-and-image-size dependent.
- cache_warm
- First reads after a cold start miss the local cache and fall through to object storage at hundreds-of-ms latency (see 05). Unbounded in the worst case (random working set); the dominant tail term. R2's zero egress (see 11) makes these cold reads cheap but not faster.
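A minimal sketch of keeping the two terms separate when recording cold-start latency. The `ColdStartSample` type and the `percentile` helper are hypothetical names for illustration, not part of the controller crate, and the nearest-rank percentile is one assumed choice among several.

```rust
use std::time::Duration;

/// One cold-start observation with the two terms kept apart.
/// (Hypothetical type, for illustration only.)
#[derive(Clone, Copy, Debug)]
pub struct ColdStartSample {
    pub process_start: Duration,
    pub cache_warm: Duration,
}

impl ColdStartSample {
    /// cold_start = process_start + cache_warm
    pub fn total(&self) -> Duration {
        self.process_start + self.cache_warm
    }
}

/// Nearest-rank percentile over a set of durations (p in 0.0..=100.0).
/// Returns None for an empty input.
pub fn percentile(mut xs: Vec<Duration>, p: f64) -> Option<Duration> {
    if xs.is_empty() {
        return None;
    }
    xs.sort();
    let rank = ((p / 100.0) * xs.len() as f64).ceil() as usize;
    Some(xs[rank.clamp(1, xs.len()) - 1])
}
```

Reporting p50 and p99 of `cache_warm` separately from `process_start` keeps the bounded term from masking the tail term.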
Warning
Scale-to-zero must never compromise durability. Stopping an engine is only safe after the writer lease is cleanly released (or expired and fenced) and the in-flight WAL is durably acked. The controller MUST NOT stop an instance with an unacked commit in flight; it stops the engine, never a commit.
Engine instance lifecycle state machine
Each engine instance moves through five states; Cold appears twice in the table below because the cycle returns to it. The controller owns all transitions; the engine reports readiness and idleness, and the storage layer reports lease status.
Cold → Warming → Active → Idle → Stopping → Cold. Idle returns to Active on a new connection; Stopping always lands back in Cold.
| State | Meaning | Enters on | Leaves to | Timeout / trigger |
|---|---|---|---|---|
| Cold | No process. Only durable state on object storage. | init; or from Stopping | Warming | first connection request |
| Warming | Process started; manifest read; cache filling. | Cold + connect | Active (on ready) / Cold (on crash) | warm_deadline (default 10 s) → fail/retry |
| Active | Serving queries; cache warm; writer lease held if writing. | Warming ready; Idle + connect | Idle | — |
| Idle | Running, zero in-flight work, lease draining. | Active + no work | Active (connect) / Stopping (timeout) | idle_timeout (default 30 s) |
| Stopping | Releasing lease, flushing, exiting process. | Idle + idle_timeout | Cold | drain_deadline (default 5 s) |
| Cold (terminal loop) | Process gone; back to rest. | Stopping complete | Warming | — |
Transition rules
- MUST only admit connections in Active; connections arriving in Warming queue behind the warm; connections in Idle pull the instance back to Active.
- MUST treat a crash in Warming or Active as an immediate transition to Cold, after fence reclamation (see failure modes).
- MUST complete the writer-lease release (or let it expire) before entering Cold from Stopping.
- SHOULD cancel a pending Stopping transition if a connection arrives during the drain_deadline window, returning to Active without a full cold-start.
- MAY hold a Warming/Active instance in a warm pool past idle_timeout when warm_pool_size > 0.
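The table and transition rules can be sketched as a pure transition function. The `Event` enum and its variant names are assumptions made for this sketch; pairs the text does not specify (for example a crash while Idle) deliberately return `None` rather than guess.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LifecycleState { Cold, Warming, Active, Idle, Stopping }

/// Hypothetical event names; the spec describes triggers, not an event type.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Event {
    Connect,      // a client connection request arrives
    Ready,        // engine reports the warm is complete
    NoWork,       // zero in-flight work
    IdleTimeout,  // idle_timeout elapsed in Idle
    WarmDeadline, // warm_deadline elapsed in Warming
    Drained,      // lease released (or expired) and process exited
    Crash,        // process gone; caller reclaims the fence first
}

/// Returns the next state, or None if the spec defines no such transition.
pub fn next(state: LifecycleState, event: Event) -> Option<LifecycleState> {
    use Event::*;
    use LifecycleState::*;
    match (state, event) {
        (Cold, Connect) => Some(Warming),
        (Warming, Connect) => Some(Warming), // queues behind the single warm
        (Warming, Ready) => Some(Active),
        (Warming, WarmDeadline) | (Warming, Crash) => Some(Cold),
        (Active, Connect) => Some(Active),
        (Active, NoWork) => Some(Idle),
        (Active, Crash) => Some(Cold),
        (Idle, Connect) => Some(Active),
        (Idle, IdleTimeout) => Some(Stopping),
        (Stopping, Connect) => Some(Active), // fast resume (SHOULD rule)
        (Stopping, Drained) => Some(Cold),
        _ => None,
    }
}
```

Keeping the function pure makes the full Cold → Warming → Active → Idle → Stopping → Cold loop easy to assert in a unit test, separately from process supervision.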
Branching model
Source §3 defines a branch as a new LSN pointer over shared immutable layers. Because the page store is versioned by LSN and its layers (delta + image) are immutable, a branch is created by recording a parent and a base LSN — no data is copied. This yields instant clones with near-zero marginal storage until divergence.
page store: immutable layers, versioned by LSN

  image@0 ── delta@40 ── delta@120 ── delta@180   (main, head=180)
                             │
                             └─ base = main@120
                                  branch "tool-x" head=120 ──> new deltas (CoW)
                                  branch "tool-y" head=120 ──> new deltas (CoW)

  shared bytes:  image@0, delta@40, delta@120 (read by main + both branches)
  private bytes: only deltas written AFTER each branch diverges
A branch is a pointer at base@LSN; reads below the divergence point hit shared immutable layers. Storage cost grows only with post-branch writes.
Branch operations API
The controller exposes branch ops that map onto the storage branch primitives in 03 and the versioned page store in 04. Library signatures (Rust controller crate):
/// A branch is a named LSN pointer over shared immutable layers.
pub struct BranchRef { pub db: DbId, pub name: String, pub head: Lsn }

pub trait BranchControl {
    /// Instant clone: record (parent, base_lsn); copies no pages.
    fn create_branch(&self, db: DbId, name: &str,
                     from: BranchRef, at: Lsn) -> Result<BranchRef>;
    /// Enumerate branches with parent, base LSN, head LSN, divergence bytes.
    fn list_branches(&self, db: DbId) -> Result<Vec<BranchInfo>>;
    /// Drop a branch pointer; GC reclaims layers no live branch references.
    fn delete_branch(&self, db: DbId, name: &str) -> Result<()>;
    /// Bytes written since divergence (drives marginal-storage accounting).
    fn divergence(&self, db: DbId, name: &str) -> Result<DivergeStats>;
}

pub struct BranchInfo {
    pub r#ref: BranchRef, pub parent: Option<String>,
    pub base_lsn: Lsn, pub created_at: Timestamp, pub writers: u32,
}

pub struct DivergeStats { pub private_bytes: u64, pub shared_bytes: u64, pub diverged_at: Lsn }
Divergence semantics
- MUST create a branch in O(1) — record parent + base LSN, copy no page data.
- MUST serve reads on a branch from shared immutable layers for any LSN at or below diverged_at, and from branch-private deltas above it.
- MUST treat each branch as an independent single-writer database: a branch gets its own commit log / CAS token, so a write to tool-x never fences tool-y or main.
- MUST NOT mutate a parent's layers when a child branch writes; copy-on-write produces new layers only.
- SHOULD refuse delete_branch on a branch that has live children unless cascade is explicit, to avoid orphaning shared layers.
- MAY support fast-forward or rebase of a branch's base LSN; merge of divergent writes is explicitly out of scope (no automatic conflict resolution).
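A minimal in-memory sketch of these semantics, assuming a branch catalog keyed by name. `Catalog`, `owner_of`, `record_write`, and the `u64` alias for `Lsn` are hypothetical stand-ins, not the controller's types. It shows that creating a branch only records a pointer, that reads at or below the base LSN resolve to the parent, and that deleting a branch with children is refused unless cascade is requested.

```rust
use std::collections::HashMap;

pub type Lsn = u64; // assumption: the real Lsn type may differ

#[derive(Clone, Debug)]
pub struct Branch {
    pub parent: Option<String>,
    pub base_lsn: Lsn, // divergence point: LSNs <= base_lsn are shared
    pub head: Lsn,
}

pub struct Catalog { branches: HashMap<String, Branch> }

impl Catalog {
    pub fn with_root(name: &str, head: Lsn) -> Self {
        let mut branches = HashMap::new();
        branches.insert(name.to_string(), Branch { parent: None, base_lsn: 0, head });
        Catalog { branches }
    }

    /// O(1): records (parent, base LSN); no page data is touched.
    pub fn create_branch(&mut self, name: &str, parent: &str, at: Lsn) -> Result<(), String> {
        if self.branches.contains_key(name) {
            return Err(format!("branch {} already exists", name));
        }
        let parent_head = self.branches.get(parent)
            .ok_or_else(|| format!("unknown parent {}", parent))?.head;
        if at > parent_head {
            return Err(format!("LSN {} is past {}'s head {}", at, parent, parent_head));
        }
        let b = Branch { parent: Some(parent.to_string()), base_lsn: at, head: at };
        self.branches.insert(name.to_string(), b);
        Ok(())
    }

    /// A write on a branch advances only that branch's head.
    pub fn record_write(&mut self, name: &str, lsn: Lsn) -> Result<(), String> {
        let b = self.branches.get_mut(name).ok_or_else(|| format!("unknown branch {}", name))?;
        if lsn <= b.head {
            return Err(format!("LSN {} is not past head {}", lsn, b.head));
        }
        b.head = lsn;
        Ok(())
    }

    /// Which branch's layers serve a read of `name` at `lsn`?
    pub fn owner_of(&self, name: &str, lsn: Lsn) -> Option<String> {
        let mut cur = name.to_string();
        if lsn > self.branches.get(&cur)?.head {
            return None; // nothing written at that LSN on this branch
        }
        loop {
            let b = self.branches.get(&cur)?;
            match &b.parent {
                Some(p) if lsn <= b.base_lsn => cur = p.clone(), // shared layer
                _ => return Some(cur),                           // private delta
            }
        }
    }

    /// Refuses to delete a branch with live children unless `cascade` is set.
    pub fn delete_branch(&mut self, name: &str, cascade: bool) -> Result<(), String> {
        let children: Vec<String> = self.branches.iter()
            .filter(|(_, b)| b.parent.as_deref() == Some(name))
            .map(|(k, _)| k.clone())
            .collect();
        if !children.is_empty() && !cascade {
            return Err(format!("branch {} has live children", name));
        }
        for c in children {
            self.delete_branch(&c, true)?;
        }
        self.branches.remove(name).map(|_| ()).ok_or_else(|| format!("unknown branch {}", name))
    }
}
```

The sketch omits layer storage and GC entirely; it only models the pointer bookkeeping that makes branch creation independent of database size.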
Tip — the dozens-of-tools pattern
The intended use (source §3): many internal tools each get a branch off one base. Provision one seed database, branch it once per tool, and each tool diverges only as it writes. Storage stays near the seed size until tools actually mutate data — branching is the primary lever that makes "many small databases" affordable (see also 10, where per-DB boundaries recover lock granularity).
Single-writer fencing
Single-writer-per-database (source §4, §9) is enforced by the commit log's compare-and-swap (CAS) token — the same S3-conditional-write mechanism that gives the log atomic ordered appends without a separate Raft/Paxos cluster (source §2.C). The controller turns that primitive into a lease: exactly one engine instance holds the right to append to a given database's log at a time. A stale writer is fenced — its CAS append is rejected because the token has moved on.
Fence token & lease
- fence_token
- Monotonic value advanced on every successful CAS append to the commit log (see 04). The latest committed token is the fence; an append carrying an older expected token fails the conditional write.
- writer_epoch
- Coarse generation bumped each time a new writer acquires the lane. Embedded in the token so a returning stale writer is detectable even across log truncation.
- lease
- Time-bounded right to be the writer, with TTL lease_ttl. Held by acquiring the lane and renewed by heartbeat. Expiry makes the lane reclaimable.
Acquire / renew / handoff sequence
W1 (current writer)                   commit log (S3-CAS)                 W2 (new writer)
 | CAS append (epoch=7) ───────────> | ok, token=7.N                    |
 | heartbeat / renew lease ────────> | ok, lease extended               |
 | ...                               |                                  |
 | (W1 stalls: GC pause / net split) |                                  |
 |                                   | <──── acquire (lease expired) ── | W2 bumps epoch 7→8
 |                                   | ok, epoch=8                      |
 | CAS append (epoch=7) ───────────> | REJECTED (stale: epoch<8)        | <-- W1 fenced
 | observes rejection → step down    |                                  |
 |                                   | <──── CAS append (epoch=8) ───── | ok, token=8.M
A stalled writer cannot corrupt the log: its conditional append is rejected the moment a newer epoch exists. Fencing is enforced by storage, not by trust.
- MUST acquire the writer lease (advancing writer_epoch via CAS) before any engine instance issues its first append_wal for a database/branch.
- MUST renew the lease via heartbeat at an interval < lease_ttl / 3; on renewal failure the writer MUST stop accepting new writes immediately.
- MUST reject any CAS append carrying an expected token older than the current fence; the stale writer is fenced (this is enforced by the log service, see 04).
- MUST treat a fenced rejection as fatal for that writer: step down, fail in-flight uncommitted transactions, and never retry the same epoch.
- MUST NOT ack a commit to a client before the CAS append succeeds under the current valid token (durability rule, source §8).
- SHOULD release the lease cleanly on graceful Stopping so the next writer skips the full lease_ttl wait.
- MAY expose lease ownership in status so a pooler / router (see 07) can direct writes to the current holder.
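The handoff can be sketched with an in-memory stand-in for the commit log. `CommitLog`, `FenceToken`, and `AppendError` are hypothetical names, and a real log would perform the comparison as a conditional write against object storage rather than on a local struct; lease TTL and heartbeats are not modelled here.

```rust
/// Token = (writer_epoch, sequence within the log). Hypothetical layout.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct FenceToken { pub epoch: u64, pub seq: u64 }

#[derive(Debug, PartialEq, Eq)]
pub enum AppendError {
    /// The expected token is stale; the caller must step down.
    Fenced { current: FenceToken },
}

/// In-memory stand-in for the CAS-backed commit log.
pub struct CommitLog { fence: FenceToken, records: Vec<Vec<u8>> }

impl CommitLog {
    pub fn new() -> Self {
        CommitLog { fence: FenceToken { epoch: 0, seq: 0 }, records: Vec::new() }
    }

    /// A new writer takes the lane by bumping the epoch.
    /// (Assumes the caller has already established that the lease is free.)
    pub fn acquire(&mut self) -> FenceToken {
        self.fence = FenceToken { epoch: self.fence.epoch + 1, seq: self.fence.seq };
        self.fence
    }

    /// Compare-and-swap append: succeeds only if `expected` is the current fence.
    pub fn append(&mut self, expected: FenceToken, rec: &[u8]) -> Result<FenceToken, AppendError> {
        if expected != self.fence {
            return Err(AppendError::Fenced { current: self.fence });
        }
        self.records.push(rec.to_vec());
        self.fence.seq += 1;
        Ok(self.fence)
    }

    pub fn record_count(&self) -> usize { self.records.len() }
}
```

A writer that receives `Fenced` would stop, fail its uncommitted transactions, and not retry under the same epoch; the rejected record never reaches the log.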
Danger — split-brain is impossible by construction, not by convention
Two engines may briefly believe they are the writer (e.g. a network partition before W1 notices its lease expired). Correctness does not depend on them agreeing: only one CAS epoch wins each append. W1's writes silently lose the race and are rejected; no acked write is ever lost or duplicated. This is why fencing rides on the commit log's CAS rather than a separate lock service.
Thundering herd & N concurrent cold starts
Scale-to-zero produces a cold cache after every idle period (source §3, §8 Exp 5). The dangerous shape is many simultaneous first-connections — either N clients hitting one cold database, or N distinct databases waking at once (the §10 "scale to ~1,000 at peak" case, where 1,000 simultaneous cold starts is the named risk). This forwards to 09 as an extension of Experiment 5: "N concurrent cold starts → find the spin-up saturation point."
One cold DB, many clients
- MUST dedupe: the first connection triggers exactly one Warming transition; concurrent connections for the same database wait on that single warm, they do not each spawn a process.
- SHOULD share the warming instance's cache fill so the (N−1) waiters land on an already-warm cache.
Many cold DBs at once
- SHOULD bound concurrent Warming transitions with an admission limit (max_concurrent_warms) to avoid saturating CPU and the object store's request budget.
- SHOULD queue excess cold-starts FIFO and surface queue depth in status.
- MAY pre-warm a configurable warm pool ahead of a known burst (deploy, cron, traffic spike).
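The dedupe and admission rules for both cases can be sketched as a small single-threaded gate. `WarmGate`, `Admission`, and the `u32` alias for `DbId` are assumptions made for illustration; a real controller would guard this state with a lock and wake the parked connections itself.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

pub type DbId = u32; // assumption: the real DbId type may differ

#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    StartWarm,     // caller spawns the one warming process for this db
    JoinWarm,      // a warm is already running; wait on it
    Queued(usize), // admission limit reached; 1-based FIFO position
}

pub struct WarmGate {
    max_concurrent_warms: usize,
    warming: HashSet<DbId>,
    queue: VecDeque<DbId>,         // databases waiting for a warm slot
    waiters: HashMap<DbId, usize>, // parked connections per database
}

impl WarmGate {
    pub fn new(max_concurrent_warms: usize) -> Self {
        WarmGate {
            max_concurrent_warms,
            warming: HashSet::new(),
            queue: VecDeque::new(),
            waiters: HashMap::new(),
        }
    }

    /// Called for every connection that finds its database not Active.
    pub fn on_connect(&mut self, db: DbId) -> Admission {
        *self.waiters.entry(db).or_insert(0) += 1;
        if self.warming.contains(&db) {
            return Admission::JoinWarm;
        }
        if let Some(pos) = self.queue.iter().position(|d| *d == db) {
            return Admission::Queued(pos + 1);
        }
        if self.warming.len() < self.max_concurrent_warms {
            self.warming.insert(db);
            Admission::StartWarm
        } else {
            self.queue.push_back(db);
            Admission::Queued(self.queue.len())
        }
    }

    /// Warm finished (ready or failed). Returns the number of connections
    /// released and the next queued database promoted into a warm slot.
    pub fn on_warm_done(&mut self, db: DbId) -> (usize, Option<DbId>) {
        if !self.warming.remove(&db) {
            return (0, None);
        }
        let released = self.waiters.remove(&db).unwrap_or(0);
        let next = self.queue.pop_front();
        if let Some(n) = next {
            self.warming.insert(n);
        }
        (released, next)
    }

    pub fn queue_depth(&self) -> usize { self.queue.len() }
}
```

With `max_concurrent_warms = 1`, three connections to one cold database produce a single `StartWarm`, and a second database waits in FIFO order until the first warm completes.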
Idle-timeout & keep-warm tuning
The single knob with the most leverage is idle_timeout: too short pays cold-start tax repeatedly on bursty-but-recurring traffic; too long wastes compute and defeats scale-to-zero. Tune it against the inter-arrival distribution of each tool, informed by Exp 5's warm-vs-cold read CDFs.
| Workload shape | idle_timeout | warm_pool | keep-warm ping |
|---|---|---|---|
| Bursty, recurring (every few min) | longer (e.g. 5 min) | 0 | optional |
| Latency-critical, low traffic | moderate | 1 | yes |
| Truly rare / archival | short (default) | 0 | no |
| Predictable spike (deploy/cron) | default | pre-warm N | scheduled |
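One way to compare candidate idle_timeout values is to replay a tool's observed inter-arrival gaps against each candidate. The sketch below is a simplified model under stated assumptions: it ignores request service time, drain_deadline, and the idle tail after the last request, and `replay` and `IdleOutcome` are hypothetical names.

```rust
use std::time::Duration;

pub struct IdleOutcome {
    pub cold_starts: usize,
    pub idle_compute: Duration, // time spent running with no work
}

/// Replay the gaps between consecutive requests against one idle_timeout.
pub fn replay(gaps: &[Duration], idle_timeout: Duration) -> IdleOutcome {
    let mut cold_starts = 1; // the first request always finds the database Cold
    let mut idle_compute = Duration::ZERO;
    for &gap in gaps {
        if gap > idle_timeout {
            // Ran idle for the full timeout, stopped, then cold-started again.
            cold_starts += 1;
            idle_compute += idle_timeout;
        } else {
            // Stayed warm across the whole gap.
            idle_compute += gap;
        }
    }
    IdleOutcome { cold_starts, idle_compute }
}
```

For gaps of 10 s, 120 s, 20 s, and 120 s, a 30 s timeout gives 3 cold starts and 90 s of idle compute, while a 300 s timeout gives 1 cold start and 270 s of idle compute, which is the trade-off the table summarises.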
Controller API surface
The controller is consumable as an in-process library (embedded mode) and over a control plane (server mode). Both expose the same operations: instance lifecycle, status, and branch management. The control surface is administrative and MUST be separate from the SQL data path.
Library / RPC contract
pub trait Controller {
    // ---- lifecycle ----
    /// Ensure an Active instance exists for db; warms if Cold. Idempotent.
    fn start(&self, db: DbId, branch: &str) -> Result<InstanceHandle>;
    /// Request graceful scale-to-zero (drain → Stopping → Cold).
    fn stop(&self, db: DbId, branch: &str) -> Result<()>;
    /// Lifecycle state + lease holder + queue depth + cache warmth.
    fn status(&self, db: DbId, branch: &str) -> Result<InstanceStatus>;

    // ---- branch ops (see BranchControl) ----
    fn create_branch(&self, db: DbId, name: &str, from: BranchRef, at: Lsn) -> Result<BranchRef>;
    fn list_branches(&self, db: DbId) -> Result<Vec<BranchInfo>>;
    fn delete_branch(&self, db: DbId, name: &str) -> Result<()>;
}

pub struct InstanceStatus {
    pub state: LifecycleState,    // Cold|Warming|Active|Idle|Stopping
    pub lease: Option<LeaseInfo>, // holder epoch, ttl remaining
    pub head_lsn: Lsn,
    pub cache_warm_pct: f32,
    pub warm_queue_depth: u32,
}
HTTP control endpoints (server mode)
| Method & path | Action | Notes |
|---|---|---|
| POST /v1/db/{db}/{branch}/start | start / warm | idempotent; returns instance handle |
| POST /v1/db/{db}/{branch}/stop | graceful scale-to-zero | drains in-flight first |
| GET /v1/db/{db}/{branch}/status | lifecycle + lease + warmth | cheap; for routers/dashboards |
| POST /v1/db/{db}/branches | create branch from base@LSN | body: {name, from, at} |
| GET /v1/db/{db}/branches | list branches + divergence | — |
| DELETE /v1/db/{db}/branches/{name} | delete branch pointer | GC reclaims unreferenced layers |
Configuration
- idle_timeout
- Duration in Idle before transitioning to Stopping. Default 30s. The primary scale-to-zero / keep-warm knob.
- warm_deadline
- Max time allowed in Warming before declaring a failed cold-start. Default 10s.
- drain_deadline
- Max time to drain in-flight work during Stopping before forced exit. Default 5s.
- lease_ttl
- Writer-lease lifetime; expiry makes the lane reclaimable. Default 10s. MUST exceed heartbeat_interval × 3.
- heartbeat_interval
- Lease renewal cadence. Default lease_ttl / 4 (e.g. 2.5s).
- warm_pool_size
- Instances kept warm past idle_timeout per database/template. Default 0 (pure scale-to-zero).
- max_concurrent_warms
- Admission cap on simultaneous Warming transitions (thundering-herd guard). Default sized to host CPU.
- keep_warm_ping
- Optional self-issued no-op to hold an instance Active. Off by default; use for latency-critical low-traffic tools.
Note — invariant between knobs
heartbeat_interval < lease_ttl / 3 MUST hold, or a single missed heartbeat risks an unnecessary fence. The controller MUST reject a configuration that violates this at start-up.
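A start-up check for this invariant might look like the following sketch. `ControllerConfig` and `validate` are hypothetical names; the defaults are the ones listed in the configuration section, and only a subset of the knobs is shown.

```rust
use std::time::Duration;

/// Hypothetical config struct carrying a subset of the documented knobs.
#[derive(Clone, Debug)]
pub struct ControllerConfig {
    pub idle_timeout: Duration,
    pub warm_deadline: Duration,
    pub drain_deadline: Duration,
    pub lease_ttl: Duration,
    pub heartbeat_interval: Duration,
    pub warm_pool_size: u32,
}

impl Default for ControllerConfig {
    fn default() -> Self {
        let lease_ttl = Duration::from_secs(10);
        ControllerConfig {
            idle_timeout: Duration::from_secs(30),
            warm_deadline: Duration::from_secs(10),
            drain_deadline: Duration::from_secs(5),
            lease_ttl,
            heartbeat_interval: lease_ttl / 4, // 2.5s with the default TTL
            warm_pool_size: 0,
        }
    }
}

impl ControllerConfig {
    /// Rejects configurations where heartbeat_interval < lease_ttl / 3 fails.
    /// Compared as 3 * heartbeat < ttl to avoid division rounding.
    pub fn validate(&self) -> Result<(), String> {
        if self.heartbeat_interval * 3 >= self.lease_ttl {
            return Err(format!(
                "heartbeat_interval {:?} must be < lease_ttl / 3 (lease_ttl = {:?})",
                self.heartbeat_interval, self.lease_ttl
            ));
        }
        Ok(())
    }
}
```

The inequality is strict, so a 3 s heartbeat against a 9 s lease is rejected just like a 4 s heartbeat against a 10 s lease.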
Failure modes & edge cases
| Failure | Detection | Required behaviour |
|---|---|---|
| Lost lease mid-write (GC pause / partition) | heartbeat renewal fails, or CAS append rejected (fenced) | Writer steps down, fails uncommitted txns, never retries old epoch. No acked write lost (CAS guarantees it). Client sees a retryable error; the new writer continues the lane. |
| Crash during Warming | warm_deadline elapses / process gone | Transition to Cold; if a lease was acquired, let it expire / reclaim via epoch bump. Next connection re-warms. No durable effect (nothing was committed). |
| Crash during Active with in-flight commit | process gone after CAS issued | On restart, follow §8 Exp 4 crash-safety: every CAS-acked commit MUST be present; an append issued-but-not-acked is resolved by the log's atomic CAS (all-or-nothing). No torn state, no acked-write loss. |
| Controller itself crashes | supervisor / health check | Reconstruct all state from the page-store manifest + commit log; controller holds no unique durable state. In-flight leases expire naturally and are reclaimed by epoch on next acquire. |
| Stop requested with work in flight | drain check finds active txns | Delay Stopping until drain or drain_deadline; never stop a commit, only an idle engine. |
| Connection arrives during Stopping | incoming connect in drain window | Cancel Stopping, return to Active without full cold-start (fast resume). |
| Two writers race (split-brain) | second CAS append rejected | Loser is fenced silently; exactly one epoch wins each append. Correctness independent of agreement (see Danger callout). |
| Delete branch with live children | list_branches shows children | Reject unless cascade explicit; never orphan shared layers; GC only reclaims layers no live branch references. |
Dependencies & existing pieces to start from
- MUST build on the commit log's S3 conditional-write (CAS) append primitive for fencing — it supplies the fence token without a separate consensus cluster (source §2.C, see 04).
- MUST use the versioned, LSN-stamped page store with immutable layers as the substrate for branching (source §2.C, §3, see 04).
- MUST drive instance start/stop through the engine's storage trait wiring (storage URL selects backend; see 03 and 02).
- SHOULD coordinate with the local cache for cache-warm reporting and warm-pool retention (see 05).
- MAY reuse existing process/container supervision (Cloud Run, a VM supervisor, or Worker isolate lifecycle) rather than building a scheduler (see 11).
Acceptance criteria / definition of done
- MUST demonstrate full lifecycle: Cold → Warming → Active → Idle → Stopping → Cold, with no compute billed in Cold and only storage bytes at rest.
- MUST pass a fencing test: a stalled writer's CAS append is rejected after a new writer bumps the epoch; no acked write is lost or duplicated.
- MUST create a branch in O(1) (no page copy) and prove reads below divergence hit shared layers while writes produce only branch-private deltas.
- MUST pass crash-safety equivalent to §8 Experiment 4 at lifecycle boundaries (crash in Warming, crash with in-flight commit, controller crash).
- MUST dedupe N concurrent connections to one cold database into exactly one Warming transition.
- SHOULD publish warm-vs-cold cold-start latency distributions (p50/p99) feeding 09 Exp 5, including the N-concurrent-cold-starts extension.
- SHOULD reconstruct controller state purely from Slot C after a controller restart, holding no unique durable state.
Open questions & risks
- MAY Should branch base-LSN rebase / fast-forward be supported, and if so, how to bound GC of layers a rebased branch abandons? Merge of divergent branches is explicitly excluded — confirm no tool needs it.
- MAY What is the right default idle_timeout per deployment target, given R2 zero-egress (11) makes cold reads cheap but not faster?
- MAY How aggressively can the warm pool pre-warm before a known burst without re-introducing always-on compute cost?
- MAY Lease TTL vs cold-start trade-off: a long TTL slows writer handoff after a crash; a short TTL risks fencing healthy-but-paused writers. Tune against measured GC-pause tails.
- MAY On WASM targets (Workers), the isolate lifecycle is controlled by the platform, not the controller — how much of this state machine maps onto Worker isolates vs being delegated (11)?
- MAY Thundering-herd saturation point is unknown until measured; the §10 "1,000 simultaneous cold starts" risk needs the Exp 5 extension before committing to peak-1,000 scenarios.