Purpose & scope

This page expands source §2.E (Lifecycle / control) and the scale-to-zero and branching mechanisms described in §3 into a buildable specification for the controller/ component. The controller owns the lifecycle of engine instances (start / warm / idle / stop), the branch namespace over the versioned page store, and the single-writer fence that guarantees at most one writer per database.

The controller is deliberately not on the data path. It supervises engine processes and manipulates pointers and tokens; it never proxies queries, never holds durable state of its own, and never participates in commit acknowledgement. All durability lives in Slot C (the log service and page store on object storage); the controller is restartable, replaceable, and itself scale-to-zero-capable.

Note

"Engine instance" here means a running engine process (Slot B) bound to one database via a storage URL, holding a warm local cache. "Branch" means a named LSN pointer over shared immutable layers in the page store. The controller manages both; they are orthogonal — many branches can exist with zero running instances.

Responsibilities & non-goals

Responsibilities

  • MUST cold-start an engine instance on first connection to an idle database, including process start and the initial cache warm.
  • MUST detect idleness and tear down the engine (scale-to-zero) so only object-storage bytes bill at rest.
  • MUST manage the branch namespace: create-from-base@LSN, list, delete, and track divergence against the versioned page store (see 04) and storage branch ops (see 03).
  • MUST enforce single-writer fencing per database via the commit log's CAS token: acquire on becoming writer, renew via lease/heartbeat, and guarantee a stale writer's append is rejected.
  • SHOULD implement keep-warm and warm-pool policies to bound cold-start tail latency.
  • SHOULD survive its own crash and restart without losing any acked write or leaking a fence.
  • MAY co-schedule N concurrent cold starts under a thundering-herd admission control.

Non-goals

  • MUST NOT sit on the query data path or proxy SQL — that is the wire listener's job (see 07) or direct FFI (see 08).
  • MUST NOT hold durable state outside Slot C; controller state is reconstructable from the log service and page-store manifest.
  • MUST NOT attempt multi-writer concurrency within a single database — single-writer-per-DB is a deliberate architectural ceiling (source §4).
  • MUST NOT resolve hot-row contention; that is serialized correctly by the single writer lane and handled per-tool (see 10).

Scale-to-zero model

Scale-to-zero rests on one invariant from source §3: compute is stateless — all durable state lives in Slot C. An engine instance is a cache plus a CPU; destroying it loses nothing durable. Therefore the controller is free to stop the engine whenever a database is idle and reconstruct an equivalent instance on the next connection.

  • $0: compute cost at rest
  • bytes: only object-storage bills idle
  • start + warm: cold-start cost
  • 1: writer lane per DB

What costs what

State | Engine process | Local cache | What bills
--- | --- | --- | ---
Active | running | warm | compute + object-storage ops + idle storage bytes
Idle (pre-stop) | running | warm | compute (wasted) + storage bytes
Stopped (scaled to zero) | none | discarded | storage bytes only

Cold-start budget

Cold-start latency is exactly process_start + cache_warm. The two terms have very different shapes and must be measured separately (see 09, Experiment 5).

process_start
Fork/exec or container spin-up of the engine binary, storage trait wiring, and a read of the page-store manifest / latest durable LSN. Bounded, mostly CPU-and-image-size dependent.
cache_warm
First reads after a cold start miss the local cache and fall through to object storage at hundreds-of-ms latency (see 05). Unbounded in the worst case (random working set); the dominant tail term. R2's zero egress (see 11) makes these cache-fill reads cheap but not faster.
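
Keeping the two terms separate can be sketched as follows. This is a minimal illustration, not the measurement harness from 09: `ColdStartSample` and `percentile` are hypothetical names, and nearest-rank is an assumed percentile method.

```rust
use std::time::Duration;

/// One measured cold start, with the two terms kept separate.
/// (Hypothetical type; not part of the controller API.)
#[derive(Clone, Copy, Debug)]
pub struct ColdStartSample {
    pub process_start: Duration,
    pub cache_warm: Duration,
}

impl ColdStartSample {
    /// Cold-start cost is the sum of the two terms.
    pub fn total(&self) -> Duration {
        self.process_start + self.cache_warm
    }
}

/// Nearest-rank percentile (p in 1..=100) over a set of durations.
/// Returns None for an empty input or an out-of-range p.
pub fn percentile(mut xs: Vec<Duration>, p: u32) -> Option<Duration> {
    if xs.is_empty() || p == 0 || p > 100 {
        return None;
    }
    xs.sort();
    // Nearest rank is ceil(p/100 * n), 1-indexed.
    let rank = (p as usize * xs.len() + 99) / 100;
    Some(xs[rank - 1])
}
```

Reporting p50 and p99 per term, rather than only for the total, shows whether a slow tail comes from the bounded process_start or from the unbounded cache_warm.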

Warning

Scale-to-zero must never compromise durability. Stopping an engine is only safe after the writer lease is cleanly released (or expired and fenced) and the in-flight WAL is durably acked. The controller MUST NOT stop an instance with an unacked commit in flight; it stops the engine, never a commit.
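
The stop-safety rule above reduces to a small predicate. The sketch below is illustrative only: `LeaseState`, `StopCheck`, and `safe_to_exit` are hypothetical names, and the real lease status would come from the storage layer.

```rust
/// Hypothetical view of the writer lease at stop time.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LeaseState {
    /// The instance still holds a live writer lease.
    Held,
    /// The lease was cleanly released.
    Released,
    /// The lease expired and a newer epoch has fenced this writer.
    ExpiredAndFenced,
}

/// Inputs to the stop decision (hypothetical type).
pub struct StopCheck {
    /// Commits issued to the log but not yet durably acked.
    pub unacked_commits: u32,
    pub lease: LeaseState,
}

/// True only when the process may exit without risking durability:
/// no unacked commit is in flight and the lease is no longer held.
pub fn safe_to_exit(check: &StopCheck) -> bool {
    check.unacked_commits == 0 && check.lease != LeaseState::Held
}
```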

Engine instance lifecycle state machine

Each engine instance moves through five states (Cold, Warming, Active, Idle, Stopping) and loops back to Cold. The controller owns all transitions; the engine reports readiness and idleness, and the storage layer reports lease status.

   Cold ──connect──> Warming ──ready──> Active ──no work──> Idle ──timeout──> Stopping ──> (Cold)
   Idle ──connect──> Active

Cold → Warming → Active → Idle → Stopping → Cold. Idle returns to Active on a new connection; Stopping lands back in Cold unless a connection arriving during the drain window cancels the stop and returns the instance to Active.

State | Meaning | Enters on | Leaves to | Timeout / trigger
--- | --- | --- | --- | ---
Cold | No process. Only durable state on object storage. | init; or from Stopping | Warming | first connection request
Warming | Process started; manifest read; cache filling. | Cold + connect | Active (on ready) / Cold (on crash) | warm_deadline (default 10 s) → fail/retry
Active | Serving queries; cache warm; writer lease held if writing. | Warming ready; Idle + connect | Idle |
Idle | Running, zero in-flight work, lease draining. | Active + no work | Active (connect) / Stopping (timeout) | idle_timeout (default 30 s)
Stopping | Releasing lease, flushing, exiting process. | Idle + idle_timeout | Cold | drain_deadline (default 5 s)
Cold (terminal loop) | Process gone; back to rest. | Stopping complete | Warming |

Transition rules

  • MUST only admit connections in Active; connections arriving in Warming queue behind the warm; connections in Idle pull the instance back to Active.
  • MUST treat a crash in Warming or Active as an immediate transition to Cold, after fence reclamation (see failure modes).
  • MUST complete the writer-lease release (or let it expire) before entering Cold from Stopping.
  • SHOULD cancel a pending Stopping transition if a connection arrives during the drain_deadline window, returning to Active without a full cold-start.
  • MAY hold a Warming/Active instance in a warm pool past idle_timeout when warm_pool_size > 0.
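
The transitions can be written as one pure function. This is a sketch under assumptions: the `Event` enum and the `next` function are hypothetical, `LifecycleState` mirrors the states named in this section, and transitions the text does not name return `None`. Fence reclamation on crash and warm-pool retention are out of scope here.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LifecycleState { Cold, Warming, Active, Idle, Stopping }

/// Hypothetical event set derived from the triggers in the state table.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Event {
    Connect,      // a client connection arrives
    Ready,        // engine reports warm and ready
    NoWork,       // zero in-flight work
    IdleTimeout,  // idle_timeout elapsed
    WarmDeadline, // warm_deadline elapsed before Ready
    Drained,      // lease released or expired, process exited
    Crash,        // process gone (fence reclamation happens elsewhere)
}

/// Returns the next state, or None if the pair is not a named transition.
pub fn next(state: LifecycleState, event: Event) -> Option<LifecycleState> {
    use Event::*;
    use LifecycleState::*;
    match (state, event) {
        (Cold, Connect) => Some(Warming),
        // Connections during the warm queue behind it; no second process.
        (Warming, Connect) => Some(Warming),
        (Warming, Ready) => Some(Active),
        (Warming, WarmDeadline) | (Warming, Crash) => Some(Cold),
        (Active, Connect) => Some(Active),
        (Active, NoWork) => Some(Idle),
        (Active, Crash) => Some(Cold),
        (Idle, Connect) => Some(Active),
        (Idle, IdleTimeout) => Some(Stopping),
        // SHOULD rule: a connect inside the drain window cancels the stop.
        (Stopping, Connect) => Some(Active),
        (Stopping, Drained) => Some(Cold),
        _ => None,
    }
}
```

Keeping the table as a pure function makes every transition unit-testable without starting an engine process.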

Branching model

Source §3 defines a branch as a new LSN pointer over shared immutable layers. Because the page store is versioned by LSN and its layers (delta + image) are immutable, a branch is created by recording a parent and a base LSN — no data is copied. This yields instant clones with near-zero marginal storage until divergence.

                 page store: immutable layers, versioned by LSN
   image@0 ── delta@40 ── delta@120 ── delta@180  (main, head=180)
                              │
                              └─ base = main@120
                                   branch "tool-x"  head=120 ──> new deltas (CoW)
                                   branch "tool-y"  head=120 ──> new deltas (CoW)

   shared bytes: image@0, delta@40, delta@120     (read by main + both branches)
   private bytes: only deltas written AFTER each branch diverges

A branch is a pointer at base@LSN; reads below the divergence point hit shared immutable layers. Storage cost grows only with post-branch writes.

Branch operations API

The controller exposes branch ops that map onto the storage branch primitives in 03 and the versioned page store in 04. Library signatures (Rust controller crate):

/// A branch is a named LSN pointer over shared immutable layers.
pub struct BranchRef { pub db: DbId, pub name: String, pub head: Lsn }

pub trait BranchControl {
    /// Instant clone: record (parent, base_lsn); copies no pages.
    fn create_branch(&self, db: DbId, name: &str,
                     from: BranchRef, at: Lsn) -> Result<BranchRef>;

    /// Enumerate branches with parent, base LSN, head LSN, divergence bytes.
    fn list_branches(&self, db: DbId) -> Result<Vec<BranchInfo>>;

    /// Drop a branch pointer; GC reclaims layers no live branch references.
    fn delete_branch(&self, db: DbId, name: &str) -> Result<()>;

    /// Bytes written since divergence (drives marginal-storage accounting).
    fn divergence(&self, db: DbId, name: &str) -> Result<DivergeStats>;
}

pub struct BranchInfo {
    pub r#ref: BranchRef, pub parent: Option<String>,
    pub base_lsn: Lsn, pub created_at: Timestamp, pub writers: u32,
}
pub struct DivergeStats { pub private_bytes: u64, pub shared_bytes: u64, pub diverged_at: Lsn }

Divergence semantics

  • MUST create a branch in O(1) — record parent + base LSN, copy no page data.
  • MUST serve reads on a branch from shared immutable layers for any LSN at or below diverged_at, and from branch-private deltas above it.
  • MUST treat each branch as an independent single-writer database — a branch gets its own commit log / CAS token, so a write to tool-x never fences tool-y or main.
  • MUST NOT mutate a parent's layers when a child branch writes; copy-on-write produces new layers only.
  • SHOULD refuse delete_branch on a branch that has live children unless cascade is explicit, to avoid orphaning shared layers.
  • MAY support fast-forward or rebase of a branch's base LSN; merge of divergent writes is explicitly out of scope (no automatic conflict resolution).
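
The divergence rules can be modelled in memory as below. This is a toy stand-in for the page store, not its design: `Namespace`, `Branch`, the `(page, lsn)` key layout, and the per-branch LSN counter are all assumptions made for illustration.

```rust
use std::collections::{BTreeMap, HashMap};

type Lsn = u64;
type PageId = u32;

/// Hypothetical in-memory stand-in for one branch.
struct Branch {
    parent: Option<String>,
    base_lsn: Lsn,
    head: Lsn,
    /// Branch-private deltas keyed by (page, lsn); never written by children.
    deltas: BTreeMap<(PageId, Lsn), Vec<u8>>,
}

pub struct Namespace {
    branches: HashMap<String, Branch>,
}

impl Namespace {
    /// Starts with a root branch "main" at LSN 0.
    pub fn new() -> Self {
        let mut branches = HashMap::new();
        let root = Branch { parent: None, base_lsn: 0, head: 0, deltas: BTreeMap::new() };
        branches.insert("main".to_string(), root);
        Namespace { branches }
    }

    /// Records (parent, base_lsn) only; no page data is copied.
    pub fn create_branch(&mut self, name: &str, from: &str, at: Lsn) -> Result<(), String> {
        let parent = self.branches.get(from).ok_or("unknown parent")?;
        if at > parent.head {
            return Err("base LSN is beyond the parent's head".to_string());
        }
        if self.branches.contains_key(name) {
            return Err("branch already exists".to_string());
        }
        let child = Branch {
            parent: Some(from.to_string()),
            base_lsn: at,
            head: at,
            deltas: BTreeMap::new(),
        };
        self.branches.insert(name.to_string(), child);
        Ok(())
    }

    /// Copy-on-write: the new version lands only in this branch's deltas.
    pub fn write(&mut self, branch: &str, page: PageId, bytes: &[u8]) -> Result<Lsn, String> {
        let b = self.branches.get_mut(branch).ok_or("unknown branch")?;
        b.head += 1;
        b.deltas.insert((page, b.head), bytes.to_vec());
        Ok(b.head)
    }

    /// Newest version of `page` visible on `branch` at LSN `at`: private
    /// deltas first, then the parent chain capped at each base LSN.
    pub fn read(&self, branch: &str, page: PageId, at: Lsn) -> Option<&[u8]> {
        let mut name = branch;
        let mut cap = at;
        loop {
            let b = self.branches.get(name)?;
            if let Some((_, bytes)) = b.deltas.range((page, 0)..=(page, cap)).next_back() {
                return Some(bytes.as_slice());
            }
            cap = cap.min(b.base_lsn);
            name = b.parent.as_deref()?;
        }
    }

    /// Bytes written on this branch since it diverged.
    pub fn private_bytes(&self, branch: &str) -> Option<u64> {
        let b = self.branches.get(branch)?;
        Some(b.deltas.values().map(|v| v.len() as u64).sum())
    }

    /// Refuses to drop a branch that still has children (no cascade here).
    pub fn delete_branch(&mut self, name: &str) -> Result<(), String> {
        if self.branches.values().any(|b| b.parent.as_deref() == Some(name)) {
            return Err("branch has live children".to_string());
        }
        self.branches.remove(name).map(|_| ()).ok_or_else(|| "unknown branch".to_string())
    }
}
```

A branch created at base LSN 1 never observes a parent write at LSN 2, because the read cap is clamped to the base LSN before the walk moves up to the parent.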

Tip — the dozens-of-tools pattern

The intended use (source §3): many internal tools each get a branch off one base. Provision one seed database, branch it once per tool, and each tool diverges only as it writes. Storage stays near the seed size until tools actually mutate data — branching is the primary lever that makes "many small databases" affordable (see also 10, where per-DB boundaries recover lock granularity).

Single-writer fencing

Single-writer-per-database (source §4, §9) is enforced by the commit log's compare-and-swap (CAS) token — the same S3-conditional-write mechanism that gives the log atomic ordered appends without a separate Raft/Paxos cluster (source §2.C). The controller turns that primitive into a lease: exactly one engine instance holds the right to append to a given database's log at a time. A stale writer is fenced — its CAS append is rejected because the token has moved on.

Fence token & lease

fence_token
Monotonic value advanced on every successful CAS append to the commit log (see 04). The latest committed token is the fence; an append carrying an older expected token fails the conditional write.
writer_epoch
Coarse generation bumped each time a new writer acquires the lane. Embedded in the token so a returning stale writer is detectable even across log truncation.
lease
Time-bounded right to be the writer, with TTL lease_ttl. Held by acquiring the lane and renewed by heartbeat. Expiry makes the lane reclaimable.

Acquire / renew / handoff sequence

W1 (current writer)            commit log (S3-CAS)            W2 (new writer)
 |  CAS append (epoch=7) ───────────> |  ok, token=7.N                   |
 |  heartbeat / renew lease ────────> |  ok, lease extended              |
 |  ...                                |                                  |
 |  (W1 stalls: GC pause / net split)  |                                  |
 |                                     | <──── acquire (lease expired) ── |  W2 bumps epoch 7→8
 |                                     |  ok, epoch=8                     |
 |  CAS append (epoch=7) ───────────> |  REJECTED (stale: epoch<8)       |   <-- W1 fenced
 |  observes rejection → step down     |                                  |
 |                                     | <──── CAS append (epoch=8) ───── |  ok, token=8.M

A stalled writer cannot corrupt the log: its conditional append is rejected the moment a newer epoch exists. Fencing is enforced by storage, not by trust.

  • MUST acquire the writer lease (advancing writer_epoch via CAS) before any engine instance issues its first append_wal for a database/branch.
  • MUST renew the lease via heartbeat at an interval < lease_ttl / 3; on renewal failure the writer MUST stop accepting new writes immediately.
  • MUST reject any CAS append carrying an expected token older than the current fence — the stale writer is fenced (this is enforced by the log service, see 04).
  • MUST treat a fenced rejection as fatal for that writer: step down, fail in-flight uncommitted transactions, and never retry the same epoch.
  • MUST NOT ack a commit to a client before the CAS append succeeds under the current valid token (durability rule, source §8).
  • SHOULD release the lease cleanly on graceful Stopping so the next writer skips the full lease_ttl wait.
  • MAY expose lease ownership in status so a pooler / router (see 07) can direct writes to the current holder.
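
The acquire / renew / append rules can be exercised against an in-memory model. This sketch assumes a lot: `CommitLog` stands in for the S3-CAS log service, time is passed in as plain milliseconds, the token is modelled as an `(epoch, seq)` pair, and none of these names come from the log service's real interface.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum LogError {
    /// Another writer's lease has not expired yet.
    LeaseHeld,
    /// The caller's epoch is stale; it must step down.
    Fenced,
}

/// Hypothetical token shape: writer epoch plus a per-epoch sequence.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct FenceToken {
    pub epoch: u64,
    pub seq: u64,
}

/// In-memory stand-in for one database's commit log.
pub struct CommitLog {
    fence: FenceToken,
    lease_expires_ms: u64,
    records: Vec<(FenceToken, Vec<u8>)>,
}

impl CommitLog {
    pub fn new() -> Self {
        CommitLog { fence: FenceToken { epoch: 0, seq: 0 }, lease_expires_ms: 0, records: Vec::new() }
    }

    /// Become writer: allowed only once the previous lease is released or
    /// expired. Bumps the epoch, which fences every earlier writer.
    pub fn acquire(&mut self, now_ms: u64, lease_ttl_ms: u64) -> Result<u64, LogError> {
        if now_ms < self.lease_expires_ms {
            return Err(LogError::LeaseHeld);
        }
        self.fence = FenceToken { epoch: self.fence.epoch + 1, seq: 0 };
        self.lease_expires_ms = now_ms + lease_ttl_ms;
        Ok(self.fence.epoch)
    }

    /// Heartbeat: extends the lease only for the current epoch.
    pub fn renew(&mut self, epoch: u64, now_ms: u64, lease_ttl_ms: u64) -> Result<(), LogError> {
        if epoch != self.fence.epoch {
            return Err(LogError::Fenced);
        }
        self.lease_expires_ms = now_ms + lease_ttl_ms;
        Ok(())
    }

    /// Conditional append: rejected when the caller's epoch is stale.
    pub fn append(&mut self, epoch: u64, payload: &[u8]) -> Result<FenceToken, LogError> {
        if epoch != self.fence.epoch {
            return Err(LogError::Fenced);
        }
        self.fence.seq += 1;
        self.records.push((self.fence, payload.to_vec()));
        Ok(self.fence)
    }

    /// Graceful release so the next writer skips the lease_ttl wait.
    pub fn release(&mut self, epoch: u64) {
        if epoch == self.fence.epoch {
            self.lease_expires_ms = 0;
        }
    }

    pub fn record_count(&self) -> usize {
        self.records.len()
    }
}
```

In this model, as in the sequence diagram, the rejection depends only on the epoch comparison inside `append`, so a stalled writer with a slow or skewed clock still cannot land a record.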

Danger — split-brain is impossible by construction, not by convention

Two engines may briefly believe they are the writer (e.g. a network partition before W1 notices its lease expired). Correctness does not depend on them agreeing: only one CAS epoch wins each append. W1's writes silently lose the race and are rejected; no acked write is ever lost or duplicated. This is why fencing rides on the commit log's CAS rather than a separate lock service.

Thundering herd & N concurrent cold starts

Scale-to-zero produces a cold cache after every idle period (source §3, §8 Exp 5). The dangerous shape is many simultaneous first-connections — either N clients hitting one cold database, or N distinct databases waking at once (the §10 "scale to ~1,000 at peak" case, where 1,000 simultaneous cold starts is the named risk). This forwards to 09 as an extension of Experiment 5: "N concurrent cold starts → find the spin-up saturation point."

One cold DB, many clients

  • MUST dedupe: the first connection triggers exactly one Warming transition; concurrent connections for the same database wait on that single warm, they do not each spawn a process.
  • SHOULD share the warming instance's cache fill so the (N−1) waiters land on an already-warm cache.

Many cold DBs at once

  • SHOULD bound concurrent Warming transitions with an admission limit (max_concurrent_warms) to avoid saturating CPU and the object store's request budget.
  • SHOULD queue excess cold-starts FIFO and surface queue depth in status.
  • MAY pre-warm a configurable warm pool ahead of a known burst (deploy, cron, traffic spike).
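
Dedupe plus the admission limit might look like the following single-threaded sketch. `WarmAdmission`, `Admission`, and the method names are hypothetical; a real controller would need synchronization around this state, and the pre-warm pool is not modelled.

```rust
use std::collections::{HashSet, VecDeque};

#[derive(Debug, PartialEq, Eq)]
pub enum Admission {
    /// Caller should start the one Warming transition for this database.
    StartWarm,
    /// A warm is already running; wait on it instead of spawning another.
    JoinExisting,
    /// Admission limit reached; value is the 1-based FIFO queue position.
    Queued(usize),
}

pub struct WarmAdmission {
    max_concurrent_warms: usize,
    warming: HashSet<String>,
    queue: VecDeque<String>,
}

impl WarmAdmission {
    pub fn new(max_concurrent_warms: usize) -> Self {
        WarmAdmission {
            // A limit of 0 would queue forever, so clamp to at least 1.
            max_concurrent_warms: max_concurrent_warms.max(1),
            warming: HashSet::new(),
            queue: VecDeque::new(),
        }
    }

    /// Called on each first-connection to a cold database.
    pub fn on_connect(&mut self, db: &str) -> Admission {
        if self.warming.contains(db) {
            return Admission::JoinExisting;
        }
        // Already queued: report the existing position, do not enqueue twice.
        if let Some(pos) = self.queue.iter().position(|q| q.as_str() == db) {
            return Admission::Queued(pos + 1);
        }
        if self.warming.len() < self.max_concurrent_warms {
            self.warming.insert(db.to_string());
            Admission::StartWarm
        } else {
            self.queue.push_back(db.to_string());
            Admission::Queued(self.queue.len())
        }
    }

    /// Called when a warm finishes (ready or failed). Frees the slot and
    /// returns the next queued database to start warming, if any.
    pub fn on_warm_done(&mut self, db: &str) -> Option<String> {
        if !self.warming.remove(db) {
            return None;
        }
        let next = self.queue.pop_front()?;
        self.warming.insert(next.clone());
        Some(next)
    }

    /// Queue depth, suitable for surfacing in status.
    pub fn queue_depth(&self) -> usize {
        self.queue.len()
    }
}
```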

Idle-timeout & keep-warm tuning

The single knob with the most leverage is idle_timeout: too short pays cold-start tax repeatedly on bursty-but-recurring traffic; too long wastes compute and defeats scale-to-zero. Tune it against the inter-arrival distribution of each tool, informed by Exp 5's warm-vs-cold read CDFs.

Workload shape | idle_timeout | warm_pool | keep-warm ping
--- | --- | --- | ---
Bursty, recurring (every few min) | longer (e.g. 5 min) | 0 | optional
Latency-critical, low traffic | moderate | 1 | yes
Truly rare / archival | short (default) | 0 | no
Predictable spike (deploy/cron) | default | pre-warm N | scheduled
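
The trade-off behind idle_timeout can be estimated from observed gaps between bursts. The function below is a rough model under stated assumptions: `simulate` and `IdleTimeoutOutcome` are hypothetical, a gap counts as a cold start only when it is strictly longer than idle_timeout, and drain time, warm pools, and keep-warm pings are ignored.

```rust
use std::time::Duration;

/// Result of replaying a gap trace against one idle_timeout value.
#[derive(Debug, PartialEq, Eq)]
pub struct IdleTimeoutOutcome {
    /// Gaps long enough that the next connection paid a cold start.
    pub cold_starts: usize,
    /// Compute time spent running with no work before stopping.
    pub idle_compute: Duration,
}

/// `gaps` holds the idle time between the end of one burst of work and
/// the next connection, in arrival order.
pub fn simulate(gaps: &[Duration], idle_timeout: Duration) -> IdleTimeoutOutcome {
    let mut cold_starts = 0;
    let mut idle_compute = Duration::ZERO;
    for &gap in gaps {
        if gap > idle_timeout {
            // The instance idled for the full timeout, then was stopped.
            cold_starts += 1;
            idle_compute += idle_timeout;
        } else {
            // The instance stayed up and idled for the whole gap.
            idle_compute += gap;
        }
    }
    IdleTimeoutOutcome { cold_starts, idle_compute }
}
```

Replaying one tool's trace at several candidate timeouts gives the cold-start count versus wasted-compute curve that the table summarizes qualitatively.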

Controller API surface

The controller is consumable as an in-process library (embedded mode) and over a control plane (server mode). Both expose the same operations: instance lifecycle, status, and branch management. The control surface is administrative and MUST be separate from the SQL data path.

Library / RPC contract

pub trait Controller {
    // ---- lifecycle ----
    /// Ensure an Active instance exists for db; warms if Cold. Idempotent.
    fn start(&self, db: DbId, branch: &str) -> Result<InstanceHandle>;
    /// Request graceful scale-to-zero (drain → Stopping → Cold).
    fn stop(&self, db: DbId, branch: &str) -> Result<()>;
    /// Lifecycle state + lease holder + queue depth + cache warmth.
    fn status(&self, db: DbId, branch: &str) -> Result<InstanceStatus>;

    // ---- branch ops (see BranchControl) ----
    fn create_branch(&self, db: DbId, name: &str, from: BranchRef, at: Lsn) -> Result<BranchRef>;
    fn list_branches(&self, db: DbId) -> Result<Vec<BranchInfo>>;
    fn delete_branch(&self, db: DbId, name: &str) -> Result<()>;
}

pub struct InstanceStatus {
    pub state: LifecycleState,   // Cold|Warming|Active|Idle|Stopping
    pub lease: Option<LeaseInfo>,// holder epoch, ttl remaining
    pub head_lsn: Lsn,
    pub cache_warm_pct: f32,
    pub warm_queue_depth: u32,
}

HTTP control endpoints (server mode)

Method & path | Action | Notes
--- | --- | ---
POST /v1/db/{db}/{branch}/start | start / warm | idempotent; returns instance handle
POST /v1/db/{db}/{branch}/stop | graceful scale-to-zero | drains in-flight first
GET /v1/db/{db}/{branch}/status | lifecycle + lease + warmth | cheap; for routers/dashboards
POST /v1/db/{db}/branches | create branch from base@LSN | body: {name, from, at}
GET /v1/db/{db}/branches | list branches + divergence |
DELETE /v1/db/{db}/branches/{name} | delete branch pointer | GC reclaims unreferenced layers
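
Mapping these endpoints onto controller operations could be sketched as a small router. `ControlOp` and `route` are hypothetical names; request bodies, authentication, and the HTTP server itself are deliberately left out.

```rust
/// Hypothetical parsed form of a control-plane request.
#[derive(Debug, PartialEq, Eq)]
pub enum ControlOp<'a> {
    Start { db: &'a str, branch: &'a str },
    Stop { db: &'a str, branch: &'a str },
    Status { db: &'a str, branch: &'a str },
    CreateBranch { db: &'a str },
    ListBranches { db: &'a str },
    DeleteBranch { db: &'a str, name: &'a str },
}

/// Matches (method, path) against the endpoint table; None means no route.
pub fn route<'a>(method: &str, path: &'a str) -> Option<ControlOp<'a>> {
    let parts: Vec<&'a str> = path.trim_matches('/').split('/').collect();
    match (method, parts.as_slice()) {
        ("POST", ["v1", "db", db, "branches"]) => Some(ControlOp::CreateBranch { db: *db }),
        ("GET", ["v1", "db", db, "branches"]) => Some(ControlOp::ListBranches { db: *db }),
        ("DELETE", ["v1", "db", db, "branches", name]) => {
            Some(ControlOp::DeleteBranch { db: *db, name: *name })
        }
        ("POST", ["v1", "db", db, branch, "start"]) => {
            Some(ControlOp::Start { db: *db, branch: *branch })
        }
        ("POST", ["v1", "db", db, branch, "stop"]) => {
            Some(ControlOp::Stop { db: *db, branch: *branch })
        }
        ("GET", ["v1", "db", db, branch, "status"]) => {
            Some(ControlOp::Status { db: *db, branch: *branch })
        }
        _ => None,
    }
}
```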

Configuration

idle_timeout
Duration in Idle before transitioning to Stopping. Default 30s. The primary scale-to-zero / keep-warm knob.
warm_deadline
Max time allowed in Warming before declaring a failed cold-start. Default 10s.
drain_deadline
Max time to drain in-flight work during Stopping before forced exit. Default 5s.
lease_ttl
Writer-lease lifetime; expiry makes the lane reclaimable. Default 10s. MUST exceed 3 × heartbeat_interval.
heartbeat_interval
Lease renewal cadence. Default lease_ttl / 4 (e.g. 2.5s).
warm_pool_size
Instances kept warm past idle_timeout per database/template. Default 0 (pure scale-to-zero).
max_concurrent_warms
Admission cap on simultaneous Warming transitions (thundering-herd guard). Default sized to host CPU.
keep_warm_ping
Optional self-issued no-op to hold an instance Active. Off by default; use for latency-critical low-traffic tools.

Note — invariant between knobs

heartbeat_interval < lease_ttl / 3 MUST hold, or a single missed heartbeat risks an unnecessary fence. The controller MUST reject a configuration that violates this at start-up.
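
A start-up check for this invariant could look like the sketch below. `ControllerConfig` and `validate` are hypothetical names; the defaults mirror the values listed in this section, and reading "sized to host CPU" as one warm per available core is an assumption.

```rust
use std::time::Duration;

/// Hypothetical config struct carrying the knobs from this section.
#[derive(Clone, Debug)]
pub struct ControllerConfig {
    pub idle_timeout: Duration,
    pub warm_deadline: Duration,
    pub drain_deadline: Duration,
    pub lease_ttl: Duration,
    pub heartbeat_interval: Duration,
    pub warm_pool_size: usize,
    pub max_concurrent_warms: usize,
    pub keep_warm_ping: bool,
}

impl Default for ControllerConfig {
    fn default() -> Self {
        let lease_ttl = Duration::from_secs(10);
        ControllerConfig {
            idle_timeout: Duration::from_secs(30),
            warm_deadline: Duration::from_secs(10),
            drain_deadline: Duration::from_secs(5),
            lease_ttl,
            heartbeat_interval: lease_ttl / 4, // 2.5 s with the default TTL
            warm_pool_size: 0,
            // Assumption: one concurrent warm per available core.
            max_concurrent_warms: std::thread::available_parallelism()
                .map(|n| n.get())
                .unwrap_or(1),
            keep_warm_ping: false,
        }
    }
}

impl ControllerConfig {
    /// Rejects configs where heartbeat_interval is not below lease_ttl / 3.
    /// Compares 3 * heartbeat against the TTL to avoid division rounding.
    pub fn validate(&self) -> Result<(), String> {
        match self.heartbeat_interval.checked_mul(3) {
            Some(tripled) if tripled < self.lease_ttl => Ok(()),
            _ => Err(format!(
                "heartbeat_interval {:?} must be < lease_ttl {:?} / 3",
                self.heartbeat_interval, self.lease_ttl
            )),
        }
    }
}
```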

Failure modes & edge cases

Failure | Detection | Required behaviour
--- | --- | ---
Lost lease mid-write (GC pause / partition) | heartbeat renewal fails, or CAS append rejected (fenced) | Writer steps down, fails uncommitted txns, never retries old epoch. No acked write lost (CAS guarantees it). Client sees a retryable error; the new writer continues the lane.
Crash during Warming | warm_deadline elapses / process gone | Transition to Cold; if a lease was acquired, let it expire / reclaim via epoch bump. Next connection re-warms. No durable effect (nothing was committed).
Crash during Active with in-flight commit | process gone after CAS issued | On restart, follow §8 Exp 4 crash-safety: every CAS-acked commit MUST be present; an append issued-but-not-acked is resolved by the log's atomic CAS (all-or-nothing). No torn state, no acked-write loss.
Controller itself crashes | supervisor / health check | Reconstruct all state from the page-store manifest + commit log; controller holds no unique durable state. In-flight leases expire naturally and are reclaimed by epoch on next acquire.
Stop requested with work in flight | drain check finds active txns | Delay Stopping until drain or drain_deadline; never stop a commit, only an idle engine.
Connection arrives during Stopping | incoming connect in drain window | Cancel Stopping, return to Active without full cold-start (fast resume).
Two writers race (split-brain) | second CAS append rejected | Loser is fenced silently; exactly one epoch wins each append. Correctness independent of agreement (see Danger callout).
Delete branch with live children | list_branches shows children | Reject unless cascade explicit; never orphan shared layers; GC only reclaims layers no live branch references.

Dependencies & existing pieces to start from

  • MUST build on the commit log's S3 conditional-write (CAS) append primitive for fencing — it supplies the fence token without a separate consensus cluster (source §2.C, see 04).
  • MUST use the versioned, LSN-stamped page store with immutable layers as the substrate for branching (source §2.C, §3, see 04).
  • MUST drive instance start/stop through the engine's storage trait wiring (storage URL selects backend; see 03 and 02).
  • SHOULD coordinate with the local cache for cache-warm reporting and warm-pool retention (see 05).
  • MAY reuse existing process/container supervision (Cloud Run, a VM supervisor, or Worker isolate lifecycle) rather than building a scheduler (see 11).

Acceptance criteria / definition of done

  • MUST demonstrate full lifecycle: Cold → Warming → Active → Idle → Stopping → Cold, with no compute billed in Cold and only storage bytes at rest.
  • MUST pass a fencing test: a stalled writer's CAS append is rejected after a new writer bumps the epoch; no acked write is lost or duplicated.
  • MUST create a branch in O(1) (no page copy) and prove reads below divergence hit shared layers while writes produce only branch-private deltas.
  • MUST pass crash-safety equivalent to §8 Experiment 4 at lifecycle boundaries (crash in Warming, crash with in-flight commit, controller crash).
  • MUST dedupe N concurrent connections to one cold database into exactly one Warming transition.
  • SHOULD publish warm-vs-cold cold-start latency distributions (p50/p99) feeding 09 Exp 5, including the N-concurrent-cold-starts extension.
  • SHOULD reconstruct controller state purely from Slot C after a controller restart, holding no unique durable state.

Open questions & risks

  • Should branch base-LSN rebase / fast-forward be supported, and if so, how should GC of layers that a rebased branch abandons be bounded? Merge of divergent branches is explicitly excluded; confirm no tool needs it.
  • What is the right default idle_timeout per deployment target, given R2 zero-egress (11) makes cold reads cheap but not faster?
  • How aggressively can the warm pool pre-warm before a known burst without re-introducing always-on compute cost?
  • Lease TTL vs cold-start trade-off: a long TTL slows writer handoff after a crash; a short TTL risks fencing healthy-but-paused writers. Tune against measured GC-pause tails.
  • On WASM targets (Workers), the isolate lifecycle is controlled by the platform, not the controller. How much of this state machine maps onto Worker isolates rather than being delegated (11)?
  • The thundering-herd saturation point is unknown until measured; the §10 "1,000 simultaneous cold starts" risk needs the Exp 5 extension before committing to peak-1,000 scenarios.

Related specifications

Serverless OLTP Engine — internal development specification. Draft, 2026-06-20. · Author