Tradeoffs & Risk Register · Serverless OLTP Engine Spec

Purpose & scope

This page is the consolidation point for honest costs and operational risk across the whole specification. It does two jobs:

Expands the four honest tradeoffs (source §4) into impact-bearing subsections: each names the mechanism, who it bites, and the concrete mitigation, with a pointer to the spec that owns the deeper treatment.
Pulls the risks scattered through specs 01–13 into one risk register with stable IDs, severity, likelihood, mitigation, residual risk, and the related spec — so nothing lives only in a footnote.

It is normative for the acceptance gates that decide whether the engine is allowed to hold real data, and for the decision boundary that places each tool on the correct backend. It is not a substitute for the benchmark plan (09) or the contention strategy (10); those own the measurements and the levers. This page tracks them.

Reading order

If you read only one thing here, read “The durability non-negotiable” and the risk register. R-02 (acked-write loss) is disqualifying and gates everything else; the rest is tuning and placement.

Responsibilities & non-goals

Responsibilities

State every architectural cost in plain terms, without marketing softening.
Maintain the canonical risk register and keep risk IDs stable across spec revisions.
Define the go/no-go decision boundary and the “when NOT to use this engine” guidance.
Pin the durability invariant and the sign-off criteria that gate production data.

Non-goals

Re-deriving the W1/W2 mechanics or benchmark methodology — that is 09 and 10.
Choosing a deployment substrate — that is 11.
Sequencing the build — that is 13.
Inventing new risks not grounded in the design note; this page consolidates, it does not speculate beyond the documented architecture.

Normative requirements

MUST NOT acknowledge a commit to the client before its WAL records are durably stored (CAS-append acked) on the commit log. Caching may hide read latency completely; it MUST NOT hide commit latency.
MUST pass Experiment 4 (crash safety of the commit path, 09) unconditionally before any real data is admitted to a database. A fast commit path that loses an acked write under crash is disqualifying regardless of latency numbers.
MUST run Experiments 1–3 (09) and record p50/p99/p999 for a tool before that tool is placed on the disaggregated backend.
MUST route any tool whose writes are same-row, same-DB, high-rate, and latency-critical to coupled Postgres rather than this engine (10).
MUST enforce single-writer fencing per database via the commit log’s CAS token (06); a writer that loses its lease MUST be fenced before another may append.
SHOULD mitigate cold start with a keep-warm ping or a longer idle timeout for latency-sensitive tools, tuned from Experiment 5 (06, 09).
SHOULD co-locate compute and object store in the same region/provider to avoid cross-cloud egress and latency surprises (11).
SHOULD keep the HNSW working-set graph resident in the local cache; vector search leans hardest of anything on the cache (05, 12).
MAY reserve a dedicated primitive (e.g. Redis) for a tool that is both extremely contended and latency-critical, accepting the added ops/consistency cost (10).
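
The fencing requirement above can be illustrated with a minimal sketch. It assumes the CAS token behaves like a monotonically increasing integer held by an in-memory stand-in for the commit log; `CommitLog`, `acquire`, `cas_append`, and `FencedError` are hypothetical names for illustration, not the interface defined in 06.

```python
from dataclasses import dataclass, field


class FencedError(Exception):
    """Raised when an append carries a stale CAS token."""


@dataclass
class CommitLog:
    """Hypothetical in-memory stand-in for the per-database commit log."""
    token: int = 0                                # current CAS token (assumed integer)
    records: list = field(default_factory=list)

    def acquire(self) -> int:
        """A new writer takes the lease by bumping the token; older tokens go stale."""
        self.token += 1
        return self.token

    def cas_append(self, expected_token: int, record: bytes) -> int:
        """Append only if the caller still holds the current token."""
        if expected_token != self.token:
            raise FencedError(f"token {expected_token} is stale (current {self.token})")
        self.records.append(record)
        return len(self.records) - 1              # log position of the record


log = CommitLog()
old = log.acquire()                 # writer A holds the lease
log.cas_append(old, b"a1")
new = log.acquire()                 # writer B takes over; A's token is now stale
try:
    log.cas_append(old, b"a2")      # A still believes it holds the lease
    stale_rejected = False
except FencedError:
    stale_rejected = True           # A is fenced before B appends anything
log.cas_append(new, b"b1")
```

In this model the stale writer's append is rejected rather than interleaved, so the log ends up holding only `a1` and `b1`.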

Honest tradeoff T1 — Write latency

Mechanism. A commit pays a network/S3 round-trip (or quorum/CAS ack) instead of a local fsync. The durability floor moved off the local disk and onto object storage, so the acknowledged-commit path crosses the network by construction. This is the price of disaggregation, not a bug to be cached away.

Impact: Per-commit latency rises from ~µs (local fsync) to ~ms (networked CAS ack). The tail (p99/p999) is what S3 jitter inflates, not the mean.
Who it bites: Sustained write firehoses — tools committing at high sequential rate where each commit blocks the next. Read-heavy and bursty workloads barely notice; their commits are infrequent and the read path is served from cache.
Mitigation: Group commit / queue + batch amortizes the round-trip across many concurrent commits (defeats W1). The mitigation that does not work: caching — it hides read latency but cannot hide commit latency without breaking durability. Any write-heavy tool MUST be benchmarked explicitly (Exp 1 + Exp 2) before commitment.
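
A minimal sketch of the group-commit idea, assuming a synchronous, single-threaded batcher and a fake log where each `append` call stands for one networked round-trip. `FakeLog` and `GroupCommitter` are hypothetical names; a real implementation would flush on a timer or size threshold and release acks to waiting clients.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLog:
    """Hypothetical durable log; each append() models one networked round-trip."""
    round_trips: int = 0
    durable: list = field(default_factory=list)

    def append(self, batch: list) -> None:
        self.round_trips += 1
        self.durable.extend(batch)


@dataclass
class GroupCommitter:
    log: FakeLog
    pending: list = field(default_factory=list)
    acked: list = field(default_factory=list)

    def submit(self, record: str) -> None:
        """Queue a commit. It is not acknowledged yet."""
        self.pending.append(record)

    def flush(self) -> list:
        """Pay one round-trip for the whole batch; ack only after append returns."""
        if not self.pending:
            return []
        batch, self.pending = self.pending, []
        self.log.append(batch)        # durability point
        self.acked.extend(batch)      # acks are released strictly after durability
        return batch


log = FakeLog()
committer = GroupCommitter(log)
for i in range(5):
    committer.submit(f"txn-{i}")
acked_before_flush = list(committer.acked)   # nothing is acked from memory
committer.flush()
```

Five commits share one round-trip here, and no commit is acknowledged before the batch is appended, which is the property that separates group commit from caching the commit path.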

Benchmark, don’t assume

“Bursty barely notices, firehose must measure” is a guideline, not a guarantee. The sequential-commit ceiling is ~1/latency until group commit kicks in. See Experiment 1 (latency floor) and Experiment 2 (group-commit throughput curve) in 09.
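
The ceiling can be made concrete with a small worked example. The 10 ms figure matches the "~10ms networked handoff" cited later on this page; the batch size of 50 is an illustrative assumption, not a measured value, and real numbers come from Exp 1 and Exp 2.

```python
def sequential_ceiling(latency_s: float) -> float:
    """Commits per second when each commit blocks the next (no batching)."""
    return 1.0 / latency_s


def group_commit_ceiling(latency_s: float, batch_size: int) -> float:
    """Commits per second when batch_size commits share one round-trip."""
    return batch_size / latency_s


LATENCY_S = 0.010     # assumed commit round-trip of 10 ms
BATCH_SIZE = 50       # assumed; purely illustrative

sequential = sequential_ceiling(LATENCY_S)              # about 100 commits/s
batched = group_commit_ceiling(LATENCY_S, BATCH_SIZE)   # about 5,000 commits/s
print(f"sequential: {sequential:.0f}/s, group commit: {batched:.0f}/s")
```

The per-commit latency a client observes does not improve under batching; only aggregate throughput does.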

Honest tradeoff T2 — Single writer per database

Mechanism. Without an added coordination layer, the engine permits one writer at a time per database — the SQLite-lineage ceiling. The commit log’s CAS token fences the single writer; there is no built-in multi-writer consensus.

Impact: Write concurrency within a single database does not scale by adding writers; it scales by adding databases (lanes). A naive per-DB lock means every writer in the DB serializes, even on unrelated rows.
Who it bites: One giant, high-concurrency single-database OLTP workload. It does not bite the target shape — many small databases / dozens of tools, each its own lane.
Mitigation: Sharding: many small DBs = many independent lanes (defeats W2). Database boundaries substitute for row-level lock granularity (10). The mitigation that does not work: batching — it makes one lane’s commits cheap but adds zero lanes.

Correct-but-slow by default

The single lane is correct for the hot-row case — every contended write sees the prior committed value, exactly as every serious database serializes contended writers. The cost is performance on the outlier, never correctness. See 10 for the layered strategy (serialize universally, shard where you control the SQL, route the irreducible outlier to Postgres).
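
A rough model of why lanes, not batching, address W2. It assumes one round-trip per commit with no group commit, that commits to the same database serialize, and that different databases proceed fully in parallel. `makespan_s` and the database IDs are hypothetical.

```python
from collections import Counter


def makespan_s(writes: list, latency_s: float) -> float:
    """Time to drain a set of commits, given one database ID per commit.

    The busiest lane determines the total, because lanes run in parallel
    and commits within a lane run one after another.
    """
    per_lane = Counter(writes)
    return max(per_lane.values(), default=0) * latency_s


LATENCY_S = 0.010                                   # assumed 10 ms per commit

one_db = ["db-0"] * 100                             # 100 commits, one lane
ten_dbs = [f"db-{i % 10}" for i in range(100)]      # 100 commits, ten lanes

single_lane = makespan_s(one_db, LATENCY_S)         # about 1.0 s
ten_lanes = makespan_s(ten_dbs, LATENCY_S)          # about 0.1 s
```

A hot row is the degenerate case: every contended write lands in the same lane, so adding databases changes nothing for that tool.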

Honest tradeoff T3 — Cold start

Mechanism. Compute is stateless: the controller stops the engine when idle (scale-to-zero), so the first request after an idle period pays process-start plus cache-warm. Caches start cold after every idle period by design.

Impact: The first request after idle has elevated latency: process/container/isolate spin-up + a burst of cache-miss reads against object storage. Steady-state requests are unaffected.
Who it bites: Latency-sensitive tools with spiky, intermittent traffic — and, at scale, the thundering-herd case of many simultaneous cold starts (e.g. 1,000 at once) saturating spin-up.
Mitigation: A keep-warm ping or a slightly longer idle timeout, tuned from Experiment 5’s warm-vs-cold distributions and its “N concurrent cold starts” extension (06, 09). The trade is residual idle billing against tail latency.
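
The idle-timeout trade can be explored with a toy model before Exp 5 data exists. This sketch assumes requests are instantaneous, the engine stops exactly `idle_timeout_s` after the last request, and idle time after the final request is ignored. `cold_start_profile` and the arrival times are hypothetical.

```python
def cold_start_profile(arrivals_s: list, idle_timeout_s: float) -> tuple:
    """Return (cold_starts, billed_idle_seconds) for sorted arrival times."""
    cold_starts = 0
    billed_idle = 0.0
    previous = None
    for t in arrivals_s:
        if previous is None or t - previous > idle_timeout_s:
            cold_starts += 1                      # engine was stopped: pay spin-up
            if previous is not None:
                billed_idle += idle_timeout_s     # ran idle for the full timeout first
        else:
            billed_idle += t - previous           # stayed warm across the gap
        previous = t
    return cold_starts, billed_idle


arrivals = [0, 30, 400, 420, 2000]                # seconds; spiky, intermittent traffic

short_timeout = cold_start_profile(arrivals, 60)  # more cold starts, less idle billing
long_timeout = cold_start_profile(arrivals, 600)  # fewer cold starts, more idle billing
```

With these arrivals the 60 s timeout gives 3 cold starts and 170 s of billed idle time, while the 600 s timeout gives 2 cold starts and 1,020 s, which is the shape of the trade the mitigation describes.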

Honest tradeoff T4 — Maturity

Mechanism. Object-storage-native OLTP is the active frontier, not a settled field. The load-bearing pieces — SlateDB (LSM on object storage), S3-CAS commit-log designs, the libSQL rewrite — are fast-moving and have fewer battle-tested guarantees than a coupled Postgres.

Impact: Fewer hardened guarantees, thinner operational track record, APIs and durability semantics still evolving. Bugs in dependencies become our correctness bugs.
Who it bites: Anyone treating the engine as a drop-in Postgres replacement for mission-critical data on day one, and the team carrying the integration as dependencies churn.
Mitigation: Keep the Storage trait the fixed point so backends are swappable (03); pin and vet dependency versions; lean on the loom/simulation and crash-injection testing (Exp 4) rather than trusting upstream maturity; start tools on this engine that tolerate the risk profile, and reserve coupled Postgres for the ones that don’t.

The two weak axes — recap

The architecture has exactly two weak axes. They are routinely conflated; they are different bottlenecks with different levers. Conflating them leads to applying the wrong mitigation and concluding the architecture “doesn’t work” when the real fault was the lever choice.

Axis	Mechanism	Correct lever	Lever that does NOT help	Owns it
W1 — commit latency	each commit = network/S3 round-trip (~ms) instead of local fsync (~µs)	group commit / queue + batch (and read cache for reads)	caching cannot hide commit latency without breaking durability	09
W2 — write serialization	single writer per database (SQLite lineage)	many small DBs = many lanes (sharding)	batching adds no lanes; it only makes one lane’s commits cheap	10

The one correction to internalize

The queue defeats W1 (latency); the sharded many-DB design defeats W2 (lanes). Combined, group commit batches the expensive handoffs and many-small-DBs supplies independent lanes, making the coarse networked engine behave close to a fine-grained local one for most workloads. The only genuinely unsolvable case is many concurrent writers contending on the same row in the same database — that tool belongs on coupled Postgres.

Consolidated risk register

Every risk scattered through the spec, gathered here with a stable ID. Impact and Likelihood are coarse (Low/Med/High), with Critical reserved for the impact of R-02. Residual is the risk that remains after the listed mitigation is in place. IDs are stable across revisions; add new risks with new IDs, never renumber.

ID	Risk	Category	Impact	Likelihood	Mitigation	Residual	Related
R-01	Commit-latency tail — S3/CAS jitter inflates p99/p999 commit latency beyond a tool’s tolerance	Performance	High	High	Group commit / queue + batch; measure p99/p999 (Exp 1+2) per tool; same-region object store	Med — tail never fully removed; sequential firehose still latency-bound	09
R-02	Acked-write loss under crash: process killed between CAS-append and client ack, or after client ack but before page materialization, and an acknowledged commit is lost on restart	Correctness	Critical	Low	Durability invariant (never ack before WAL durable); fencing; gate via Experiment 4 (adversarial kill -9 + restart verification), loom/simulation	Low, but disqualifying if it ever occurs; zero tolerance	09, 02
R-03	Hot-row red-quadrant tool placed on the wrong backend — same-row, high-rate, latency-critical writes assigned to the disaggregated engine	Performance	High	Med	Exp 3 contention wall as red-quadrant detector; route outlier to coupled Postgres; offer shard-counter / event-log patterns where SQL is controlled	Low — caught at benchmark gate; correctness never at risk, only speed	10, 09
R-04	Thundering-herd cold starts — many databases spin up simultaneously after idle, saturating spin-up capacity	Operational	Med	Med	Keep-warm ping / longer idle timeout; tune from Exp 5 “N concurrent cold starts”; admission/queueing at controller	Med — tail under correlated spikes; residual idle billing if kept warm	06, 11
R-05	Cross-cloud egress surprise — compute and bucket in different clouds/regions bill cache-miss reads as egress at a premium	Operational	Med	Med	Co-locate compute + object store; prefer R2 (zero egress) for read-on-miss workloads; budget egress from miss-rate × cold-start frequency	Low — once co-located; per-operation fees remain	11
R-06	WASM port effort/cost for Workers — Cloudflare Workers run only JS/WASM, so the engine is a port (not a recompile); WASM isolates start slower and run larger	Maturity	Med	High (if edge target chosen)	Treat WASM as a scoped port; route storage through the Worker R2 binding (not raw S3 SDK); accept single-threaded/no-FS constraints; default to native (FFI) targets unless edge is required	Med — ongoing maintenance of two build forms; WASM perf gap	11, 08
R-07	HNSW cold-traversal latency — vector index is a graph; a cold traversal is many sequential cache-miss round-trips over object storage (cache-miss-is-egress at its worst)	Performance	High	Med	Keep working-set graph resident in local cache; S3 as cold floor only; usearch/hnswlib lineage adapted to paged storage; warm before serving	Med — first-touch / post-eviction traversals stay slow	12, 05
R-08	Dependency immaturity — SlateDB / S3-CAS log / libSQL rewrite are fast-moving frontier components with thinner guarantees than coupled Postgres	Maturity	High	Med	Keep `Storage` trait the swappable seam; pin + vet versions; crash-injection + loom rather than trusting upstream; reserve Postgres for data that can’t absorb the risk	Med — frontier churn persists; correctness rests on our own tests	03, 04
R-09	Lost-lease split-brain writer — a writer believes it still holds the lease after losing it, two writers append to one DB	Correctness	High	Low	Single-writer fencing via commit-log CAS token; the stale writer’s CAS fails and is fenced before any second appender succeeds	Low — fencing makes split-brain a fenced rejection, not corruption	06, 04
R-10	Cache / NVMe exhaustion — shared-buffer + local file cache (LFC) outgrows local NVMe; eviction thrash collapses to all-network reads	Operational	Med	Med	Size LFC to working set; eviction policy + spill bounds; monitor miss rate and NVMe headroom; scale instance or shard DBs before saturation	Med — pathological working sets still degrade to network latency	05

R-02 is the gate

R-02 (acked-write loss) is the highest-severity entry in the register and the only one with zero tolerance. It is gated by Experiment 4 and must pass before any real data is admitted. All other risks are managed; R-02 is a hard blocker.

Risk categories & severity model

The register uses four categories and a coarse severity model so risks can be triaged consistently.

Correctness

data integrity — highest severity, zero tolerance for acked-write loss

Performance

latency/throughput on the wrong side of the boundary

Operational

cost, capacity, cold-start, egress — managed, not blocking

Maturity

frontier dependencies / port effort — hedged via the trait seam

Severity ordering. Correctness > Performance > Operational > Maturity for blocking decisions: a Correctness risk that fails its gate stops the release; Performance risks move a tool to a different backend; Operational and Maturity risks are tracked and budgeted. Mitigation reduces Likelihood; Impact is intrinsic to the mechanism and generally does not change with mitigation.
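
The ordering can be encoded directly so triage is mechanical. This is a sketch only; `Category`, `ACTION`, and `triage` are hypothetical names, and the action strings paraphrase the paragraph above.

```python
from enum import IntEnum


class Category(IntEnum):
    """Higher value = more blocking."""
    MATURITY = 1
    OPERATIONAL = 2
    PERFORMANCE = 3
    CORRECTNESS = 4


ACTION = {
    Category.CORRECTNESS: "stop the release",
    Category.PERFORMANCE: "move the tool to a different backend",
    Category.OPERATIONAL: "track and budget",
    Category.MATURITY: "track and budget",
}


def triage(failed: list) -> list:
    """Order (risk_id, category) pairs most-blocking first."""
    return sorted(failed, key=lambda risk: risk[1], reverse=True)


failed = [
    ("R-05", Category.OPERATIONAL),
    ("R-02", Category.CORRECTNESS),
    ("R-01", Category.PERFORMANCE),
]
ordered = triage(failed)
first_action = ACTION[ordered[0][1]]
```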

When NOT to use this engine

The platform is not all-or-nothing — benchmark the boundary once, then place each tool on the correct side. Two shapes belong on coupled Postgres, not on this engine:

MUST NOT host one giant, high-concurrency, single-database OLTP workload on this engine. The single-writer-per-DB ceiling (W2) cannot be sharded away when the workload is inherently one database, and per-DB locking serializes unrelated writers.
MUST NOT host the irreducible outlier — same-row, same-DB, high-rate, latency-critical writes. This is the red quadrant: contention is genuinely sequential, batching can’t add lanes, and the ~10ms networked handoff is paid per contended write. Route it to coupled Postgres.

Where it lands well

Most small / bursty / read-heavy internal tools land green: many small databases (independent lanes), reads served from cache, commits infrequent or batchable. The occasional write-heavy outlier gets a different backend. The architecture follows the workload — it does not demand the workload conform to it.

Decision boundary summary

The per-tool go/no-go rule (from 09), restated as the canonical decision boundary:

                       run Exp 1 + Exp 2  (latency floor + group-commit curve)
                                  │
            p99 single-commit acceptable for the tool's write frequency?
            AND group-commit throughput > the tool's aggregate write rate?
                          ┌───────┴───────┐
                        yes               no ──────────────┐
                          │                                │
                  run Exp 3 (contention wall)              │
                          │                                │
        heavy CONTENDED writes (same rows, sequential,     │
        can't batch)?  → RED QUADRANT                      │
                  ┌───────┴───────┐                        │
                 no              yes ──► coupled Postgres ◄─┘
                  │                       (that one tool)
                  ▼
        FITS — ship on disaggregated engine
          (Exp 4 MUST already have passed — durability gate)

Per-tool placement: latency & throughput decide fit; contention detects the red quadrant; Exp 4 is the unconditional durability gate behind all of it.

Quadrant	Write shape	Backend	Why
Green	read-heavy / bursty / independent-row writes across many small DBs	this engine	cache serves reads; group commit + many lanes handle writes
Amber	sustained independent-row write firehose, single DB	this engine — benchmark first	group commit may suffice; Exp 1+2 decide
Red	same-row, same-DB, high-rate, latency-critical	coupled Postgres	contention is sequential; networked handoff too costly per write
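
The decision boundary can be written as a small function that mirrors the diagram. The boolean inputs are assumed to be summaries of the Exp 1 to Exp 4 results recorded for a tool; `place_tool` and its parameter names are hypothetical.

```python
def place_tool(p99_ok: bool, throughput_ok: bool,
               contended: bool, exp4_passed: bool) -> str:
    """Return the backend for one tool, following the go/no-go diagram.

    p99_ok        -- Exp 1: p99 single-commit latency acceptable for the tool
    throughput_ok -- Exp 2: group-commit throughput exceeds the tool's write rate
    contended     -- Exp 3: heavy same-row, sequential, unbatchable writes
    exp4_passed   -- Exp 4: the unconditional durability gate
    """
    if not (p99_ok and throughput_ok):
        return "coupled Postgres"            # fails the latency/throughput check
    if contended:
        return "coupled Postgres"            # red quadrant
    if not exp4_passed:
        return "blocked: durability gate (Exp 4) has not passed"
    return "disaggregated engine"
```

For example, `place_tool(True, True, False, True)` returns `"disaggregated engine"`, while setting `contended=True` routes that one tool to coupled Postgres.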

The durability non-negotiable

One invariant overrides every performance consideration in this spec:

Durability rule

Never acknowledge a commit from an in-memory buffer before its WAL is durably stored. Caching hides read latency completely; it must never hide commit latency — doing so produces acked-write loss (R-02 / Experiment 4). A fast commit path that loses an acked write under crash is disqualifying, regardless of latency numbers.

Concretely, the commit path crosses two adversarial points that Experiment 4 attacks with kill -9:

(a) After CAS-append issued, before client ack. On restart, the commit is either fully durable (and MUST be replayed/visible) or absent and never acked. There is no torn or half state, and no client believes a lost write succeeded.
(b) After client ack, before page materialization. On restart, every acked commit MUST be present and recoverable from the durable log; page materialization is a derivable, retryable step, not a durability boundary.

Crash injection MUST be deterministic (seeded) and integrate with the loom/simulation testing. This gate is unconditional and runs before any latency optimization is accepted.
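
A minimal sketch of seeded crash injection at the two points, using a plain Python model rather than the loom/simulation harness. It assumes a list that survives restart stands in for the durable commit log; `Engine`, `recover`, `run_trial`, and the `ack_first` flag (a deliberately broken ordering, included for contrast) are hypothetical.

```python
import random


class Crash(Exception):
    """Simulated kill -9."""


class Engine:
    def __init__(self, durable_log: list, ack_first: bool = False):
        self.log = durable_log      # survives restart (stands in for the commit log)
        self.pages = {}             # volatile; lost on crash
        self.ack_first = ack_first  # deliberately broken ordering, for contrast
        self.crash_at = None        # "a", "b", or None

    def commit(self, key: str, value: int, acked: list) -> None:
        if self.ack_first:
            acked.append((key, value))        # BUG: ack from an in-memory buffer
            if self.crash_at:
                raise Crash()
            self.log.append((key, value))
        else:
            self.log.append((key, value))     # durability point
            if self.crash_at == "a":
                raise Crash()                 # (a) appended, not yet acked
            acked.append((key, value))
            if self.crash_at == "b":
                raise Crash()                 # (b) acked, not yet materialized
        self.pages[key] = value               # derivable from the log


def recover(durable_log: list) -> dict:
    """Rebuild pages by replaying the durable log."""
    pages = {}
    for key, value in durable_log:
        pages[key] = value
    return pages


def run_trial(seed: int, commits: int = 20, ack_first: bool = False) -> bool:
    """Crash once at a seeded commit and point; True if no acked write was lost."""
    rng = random.Random(seed)
    crash_index = rng.randrange(commits)
    crash_point = rng.choice(["a", "b"])
    durable_log, acked = [], []
    engine = Engine(durable_log, ack_first)
    try:
        for i in range(commits):
            engine.crash_at = crash_point if i == crash_index else None
            engine.commit(f"k{i}", i, acked)
    except Crash:
        pass
    recovered = recover(durable_log)
    return all(recovered.get(k) == v for k, v in acked)


correct_ok = all(run_trial(seed) for seed in range(100))
broken_ok = all(run_trial(seed, ack_first=True) for seed in range(100))
```

Because each trial is driven by its seed, a failing seed replays the same crash. In this model the append-then-ack ordering passes every trial, while the ack-first variant loses an acknowledged write on every trial.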

Acceptance criteria & sign-off

Definition of done for the risk posture — the engine may hold real data only when all of the following hold:

MUST — Experiment 4 passes: zero acked-write loss, no torn/half state across both adversarial crash points, verified on restart (R-02, R-09).
MUST — Single-writer fencing demonstrated: a writer that loses its lease is fenced (its CAS fails) before any second writer can append (R-09).
MUST — Per-tool placement recorded: Exp 1+2 results on file, and Exp 3 run for any tool with plausible contention, before that tool goes live (R-01, R-03).
MUST — No red-quadrant tool on the disaggregated backend; the red-quadrant detector (Exp 3) reviewed for each onboarded tool (R-03).
SHOULD — Cold-start budget set from Exp 5 (warm/cold distributions + N-concurrent extension); keep-warm/idle-timeout configured for latency-sensitive tools (R-04).
SHOULD — Egress model validated: compute and object store co-located, or R2 zero-egress confirmed; per-operation fees budgeted (R-05).
SHOULD — Cache sizing validated against working set; NVMe headroom and miss-rate monitoring in place (R-07, R-10).
MUST — Every register risk has an owner and a current Residual rating; R-02 reviewed at every release.

Sign-off

Sign-off requires: (1) the durability gate green, (2) the decision boundary applied to every onboarded tool, and (3) the risk register reviewed with no open Critical/High residual lacking a mitigation owner. Maturity (T4 / R-08) is acknowledged, not closed — it is carried as an accepted, monitored risk hedged by the swappable storage seam.

Open questions

What numeric p99/p999 commit-latency threshold separates Amber from Red per tool class — a single platform default, or per-tool budgets? (depends on Exp 1 data)
At what concurrency does the thundering-herd spin-up saturate, and is admission control needed at the controller? (Exp 5 “N concurrent cold starts”)
Is the WASM port’s cold-start and size penalty acceptable for the edge target, or does it confine Workers+R2 to read-mostly tools? (R-06)
What is the eviction policy and LFC sizing heuristic that keeps the HNSW working set resident without starving B-tree pages? (R-07, R-10)
How are dependency-immaturity regressions (SlateDB / CAS-log upstream) detected before they reach the durability gate — continuous crash-injection in CI? (R-08)

Related specifications

Benchmark & Validation Plan: owns Exp 1–5; the durability gate (R-02) and decision boundary live here
Hot-Row Contention Strategy: W2 lever and the red-quadrant routing rule (R-03)
Lifecycle & Controller: scale-to-zero/cold start (R-04) and single-writer fencing (R-09)
Deployment Targets: egress economics (R-05) and the WASM port fork (R-06)
Local Cache: cache/NVMe exhaustion (R-10) and HNSW cold traversal (R-07)
Storage Interface: the swappable seam that hedges dependency immaturity (R-08)
Object-Storage Backend: CAS commit log behind durability (R-02) and fencing (R-09)
Capabilities: Build-in vs Compose: HNSW vector index and its cold-traversal caveat (R-07)
Engine Core: WAL/MVCC commit path the durability invariant constrains
Roadmap & Build Sequence: when each mitigation lands across the build sequence