Tradeoffs & Risk Register
Object-storage-native OLTP buys scale-to-zero, true embeddability, and instant branching by paying for them in commit latency, write serialization, cold starts, and ecosystem maturity — this page makes every one of those costs explicit, names who it bites, and consolidates them into a single tracked risk register with mitigations and acceptance gates.
Purpose & scope
This page is the consolidation point for honest costs and operational risk across the whole specification. It does two jobs:
- Expands the four honest tradeoffs (source §4) into impact-bearing subsections: each names the mechanism, who it bites, and the concrete mitigation, with a pointer to the spec that owns the deeper treatment.
- Pulls the risks scattered through specs 01–13 into one risk register with stable IDs, severity, likelihood, mitigation, residual risk, and the related spec — so nothing lives only in a footnote.
It is normative for the acceptance gates that decide whether the engine is allowed to hold real data, and for the decision boundary that places each tool on the correct backend. It is not a substitute for the benchmark plan (09) or the contention strategy (10); those own the measurements and the levers. This page tracks them.
Reading order
If you read only one thing here, read “The durability non-negotiable” and the risk register. R-02 (acked-write loss) is disqualifying and gates everything else; the rest is tuning and placement.
Responsibilities & non-goals
Responsibilities
- State every architectural cost in plain terms, without marketing softening.
- Maintain the canonical risk register and keep risk IDs stable across spec revisions.
- Define the go/no-go decision boundary and the “when NOT to use this engine” guidance.
- Pin the durability invariant and the sign-off criteria that gate production data.
Non-goals
Normative requirements
- MUST NOT acknowledge a commit to the client before its WAL records are durably stored (CAS-append acked) on the commit log. Caching may hide read latency completely; it MUST NOT hide commit latency.
- MUST pass Experiment 4 (crash safety of the commit path, 09) unconditionally before any real data is admitted to a database. A fast commit path that loses an acked write under crash is disqualifying regardless of latency numbers.
- MUST run Experiments 1–3 (09) and record p50/p99/p999 for a tool before that tool is placed on the disaggregated backend.
- MUST route any tool whose writes are same-row, same-DB, high-rate, and latency-critical to coupled Postgres rather than this engine (10).
- MUST enforce single-writer fencing per database via the commit log’s CAS token (06); a writer that loses its lease MUST be fenced before another may append.
- SHOULD mitigate cold start with a keep-warm ping or a longer idle timeout for latency-sensitive tools, tuned from Experiment 5 (06, 09).
- SHOULD co-locate compute and object store in the same region/provider to avoid cross-cloud egress and latency surprises (11).
- SHOULD keep the HNSW working-set graph resident in the local cache; vector search leans hardest of anything on the cache (05, 12).
- MAY reserve a dedicated primitive (e.g. Redis) for a tool that is both extremely contended and latency-critical, accepting the added ops/consistency cost (10).
Honest tradeoff T1 — Write latency
Mechanism. A commit pays a network/S3 round-trip (or quorum/CAS ack) instead of a local fsync. The durability floor moved off the local disk and onto object storage, so the acknowledged-commit path crosses the network by construction. This is the price of disaggregation, not a bug to be cached away.
- Impact. Per-commit latency rises from ~µs (local fsync) to ~ms (networked CAS ack). The tail (p99/p999) is what S3 jitter inflates, not the mean.
- Who it bites. Sustained write firehoses: tools committing at high sequential rate where each commit blocks the next. Read-heavy and bursty workloads barely notice; their commits are infrequent and the read path is served from cache.
- Mitigation. Group commit / queue + batch amortizes the round-trip across many concurrent commits (defeats W1). The mitigation that does not work is caching: it hides read latency but cannot hide commit latency without breaking durability. Any write-heavy tool MUST be benchmarked explicitly (Exp 1 + Exp 2) before commitment.
Benchmark, don’t assume
“Bursty barely notices, firehose must measure” is a guideline, not a guarantee. The sequential-commit ceiling is ~1/latency until group commit kicks in. See Experiment 1 (latency floor) and Experiment 2 (group-commit throughput curve) in 09.
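The ceiling arithmetic can be sketched as a back-of-envelope model. This is only an illustration of the ~1/latency rule, not a substitute for Experiments 1 and 2: the 10 ms latency and the batch sizes are assumed values, the function names are hypothetical, and the model ignores the fact that larger batches may lengthen the round-trip itself.

```python
def sequential_ceiling(commit_latency_s: float) -> float:
    """Max commits/s when each commit blocks the next (no group commit)."""
    return 1.0 / commit_latency_s


def group_commit_ceiling(commit_latency_s: float, batch_size: int) -> float:
    """Max commits/s when one round-trip durably acks `batch_size` commits."""
    return batch_size / commit_latency_s


# Assumed 10 ms networked CAS ack; the real figure comes from Exp 1.
latency = 0.010
print("sequential:", sequential_ceiling(latency), "commits/s")
for batch in (1, 8, 64):
    print("batch", batch, "->", group_commit_ceiling(latency, batch), "commits/s")
```

Under these assumptions a lone sequential writer tops out near 100 commits/s, and the ceiling grows linearly with the batch size that Experiment 2 shows to be achievable.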
Honest tradeoff T2 — Single writer per database
Mechanism. Without an added coordination layer, the engine permits one writer at a time per database — the SQLite-lineage ceiling. The commit log’s CAS token fences the single writer; there is no built-in multi-writer consensus.
- Impact. Write concurrency within a single database does not scale by adding writers; it scales by adding databases (lanes). A naive per-DB lock means every writer in the DB serializes, even on unrelated rows.
- Who it bites. One giant, high-concurrency single-database OLTP workload. It does not bite the target shape: many small databases / dozens of tools, each its own lane.
- Mitigation. Sharding: many small DBs = many independent lanes (defeats W2). Database boundaries substitute for row-level lock granularity (10). The mitigation that does not work is batching: it makes one lane’s commits cheap but adds zero lanes.
Correct-but-slow by default
The single lane is correct for the hot-row case — every contended write sees the prior committed value, exactly as every serious database serializes contended writers. The cost is performance on the outlier, never correctness. See 10 for the layered strategy (serialize universally, shard where you control the SQL, route the irreducible outlier to Postgres).
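The fencing behaviour described in the mechanism above can be illustrated with a toy, single-process model. Everything here is a hypothetical stand-in: `CommitLog`, `acquire_lease`, and `cas_append` are illustrative names, not the engine's API, and real atomicity would come from the commit log's conditional write rather than from Python code.

```python
class CommitLog:
    """In-memory stand-in for a commit log whose append is a compare-and-swap."""

    def __init__(self):
        self.records = []
        self.token = 0  # fencing token; bumped whenever the lease changes hands

    def acquire_lease(self):
        """Grant the write lease to a new writer, invalidating older tokens."""
        self.token += 1
        return self.token

    def cas_append(self, writer_token, record):
        """Append only if the caller still holds the current token."""
        if writer_token != self.token:
            return False  # stale writer: fenced, nothing is appended
        self.records.append(record)
        return True


log = CommitLog()
writer_a = log.acquire_lease()
assert log.cas_append(writer_a, b"a1")      # A holds the lease and appends

writer_b = log.acquire_lease()              # lease moves to B
assert not log.cas_append(writer_a, b"a2")  # A is fenced; its append is rejected
assert log.cas_append(writer_b, b"b1")      # B is now the single writer
print(log.records)
```

The point of the sketch is the ordering: the stale writer's append fails before the new writer's first append succeeds, so two writers never interleave records in one lane.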
Honest tradeoff T3 — Cold start
Mechanism. Compute is stateless: the controller stops the engine when idle (scale-to-zero), so the first request after an idle period pays process-start plus cache-warm. Caches start cold after every idle period by design.
- Impact. The first request after idle has elevated latency: process/container/isolate spin-up + a burst of cache-miss reads against object storage. Steady-state requests are unaffected.
- Who it bites. Latency-sensitive tools with spiky, intermittent traffic, and, at scale, the thundering-herd case of many simultaneous cold starts (e.g. 1,000 at once) saturating spin-up.
- Mitigation. A keep-warm ping or a slightly longer idle timeout, tuned from Experiment 5’s warm-vs-cold distributions and its “N concurrent cold starts” extension (06, 09). The trade is residual idle billing against tail latency.
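The trade between idle billing and cold starts can be explored with a small model before Experiment 5 data exists. This is a sketch under stated assumptions: the arrival timestamps are invented, `cold_start_profile` is a hypothetical helper, request service time is treated as zero, and the idle tail after the final request is not counted.

```python
def cold_start_profile(arrivals_s, idle_timeout_s):
    """Count cold starts and billed idle seconds for one tool.

    arrivals_s: sorted request timestamps in seconds.
    idle_timeout_s: how long the controller keeps the engine up after a request.
    """
    cold_starts = 1 if arrivals_s else 0  # the first request always starts cold
    idle_billed = 0.0
    for prev, cur in zip(arrivals_s, arrivals_s[1:]):
        gap = cur - prev
        if gap > idle_timeout_s:
            cold_starts += 1               # engine was stopped before this request
            idle_billed += idle_timeout_s  # it idled for the full timeout first
        else:
            idle_billed += gap             # it stayed warm across the whole gap
    return cold_starts, idle_billed


arrivals = [0, 5, 70, 75, 400]  # invented spiky traffic
for timeout in (10, 60, 120):
    print(timeout, cold_start_profile(arrivals, timeout))
```

For this invented trace, raising the timeout from 10 s to 60 s buys no reduction in cold starts and only adds billed idle time, while 120 s removes one cold start at a further idle cost. Real tuning should use the measured warm and cold latency distributions rather than counts alone.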
Honest tradeoff T4 — Maturity
Mechanism. Object-storage-native OLTP is the active frontier, not a settled field. The load-bearing pieces — SlateDB (LSM on object storage), S3-CAS commit-log designs, the libSQL rewrite — are fast-moving and have fewer battle-tested guarantees than a coupled Postgres.
- Impact. Fewer hardened guarantees, thinner operational track record, APIs and durability semantics still evolving. Bugs in dependencies become our correctness bugs.
- Who it bites. Anyone treating the engine as a drop-in Postgres replacement for mission-critical data on day one, and the team carrying the integration as dependencies churn.
- Mitigation. Keep the Storage trait the fixed point so backends are swappable (03); pin and vet dependency versions; lean on the loom/simulation and crash-injection testing (Exp 4) rather than trusting upstream maturity; start tools on this engine that tolerate the risk profile, and reserve coupled Postgres for the ones that don’t.
The two weak axes — recap
The architecture has exactly two weak axes. They are routinely conflated; they are different bottlenecks with different levers. Conflating them leads to applying the wrong mitigation and concluding the architecture “doesn’t work” when the real fault was the lever choice.
| Axis | Mechanism | Correct lever | Lever that does NOT help | Owns it |
|---|---|---|---|---|
| W1 — commit latency | each commit = network/S3 round-trip (~ms) instead of local fsync (~µs) | group commit / queue + batch (and read cache for reads) | caching cannot hide commit latency without breaking durability | 09 |
| W2 — write serialization | single writer per database (SQLite lineage) | many small DBs = many lanes (sharding) | batching adds no lanes; it only makes one lane’s commits cheap | 10 |
The one correction to internalize
The queue defeats W1 (latency); the sharded many-DB design defeats W2 (lanes). Combined, group commit batches the expensive handoffs and many-small-DBs supplies independent lanes, making the coarse networked engine behave close to a fine-grained local one for most workloads. The only genuinely unsolvable case is many concurrent writers contending on the same row in the same database — that tool belongs on coupled Postgres.
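Before routing a contended tool away, the shard-counter pattern that the register lists under R-03 is worth considering where the SQL is under our control. The sketch below is a hypothetical in-memory illustration of the idea only: `ShardedCounter` and its methods are invented names, and each list slot stands in for an independent row or database lane.

```python
import zlib


class ShardedCounter:
    """One logical counter stored as `shards` independent lanes."""

    def __init__(self, shards):
        self.lanes = [0] * shards

    def lane_for(self, writer_id):
        # Stable hash, so a given writer always lands on the same lane.
        return zlib.crc32(writer_id.encode()) % len(self.lanes)

    def increment(self, writer_id, by=1):
        # Writers on different lanes no longer serialize behind one hot row.
        self.lanes[self.lane_for(writer_id)] += by

    def value(self):
        # Reads pay for the split: they must sum every lane.
        return sum(self.lanes)


counter = ShardedCounter(shards=8)
for n in range(100):
    counter.increment(f"writer-{n}")
print(counter.value(), counter.lanes)
```

The pattern only applies when the write is commutative (an increment, an appended event). A write that must read the latest value of one row before updating it stays in the red quadrant.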
Consolidated risk register
Every risk scattered through the spec, gathered here with a stable ID. Impact and Likelihood are coarse (Low/Med/High), with Critical reserved for the disqualifying R-02. Residual is the risk that remains after the listed mitigation is in place. IDs are stable across revisions; add new risks with new IDs, never renumber.
| ID | Risk | Category | Impact | Likelihood | Mitigation | Residual | Related |
|---|---|---|---|---|---|---|---|
| R-01 | Commit-latency tail — S3/CAS jitter inflates p99/p999 commit latency beyond a tool’s tolerance | Performance | High | High | Group commit / queue + batch; measure p99/p999 (Exp 1+2) per tool; same-region object store | Med — tail never fully removed; sequential firehose still latency-bound | 09 |
| R-02 | Acked-write loss under crash — process killed after CAS-append/ack but before durability/materialization, losing an acknowledged commit | Correctness | Critical | Low | Durability invariant (never ack before WAL durable); fencing; gate via Experiment 4 (adversarial kill-9 + restart verification), loom/simulation | Low — but disqualifying if it ever recurs; zero tolerance | 09, 02 |
| R-03 | Hot-row red-quadrant tool placed on the wrong backend — same-row, high-rate, latency-critical writes assigned to the disaggregated engine | Performance | High | Med | Exp 3 contention wall as red-quadrant detector; route outlier to coupled Postgres; offer shard-counter / event-log patterns where SQL is controlled | Low — caught at benchmark gate; correctness never at risk, only speed | 10, 09 |
| R-04 | Thundering-herd cold starts — many databases spin up simultaneously after idle, saturating spin-up capacity | Operational | Med | Med | Keep-warm ping / longer idle timeout; tune from Exp 5 “N concurrent cold starts”; admission/queueing at controller | Med — tail under correlated spikes; residual idle billing if kept warm | 06, 11 |
| R-05 | Cross-cloud egress surprise — compute and bucket in different clouds/regions bill cache-miss reads as egress at a premium | Operational | Med | Med | Co-locate compute + object store; prefer R2 (zero egress) for read-on-miss workloads; budget egress from miss-rate × cold-start frequency | Low — once co-located; per-operation fees remain | 11 |
| R-06 | WASM port effort/cost for Workers — Cloudflare Workers run only JS/WASM, so the engine is a port (not a recompile); WASM isolates start slower and run larger | Maturity | Med | High (if edge target chosen) | Treat WASM as a scoped port; route storage through the Worker R2 binding (not raw S3 SDK); accept single-threaded/no-FS constraints; default to native (FFI) targets unless edge is required | Med — ongoing maintenance of two build forms; WASM perf gap | 11, 08 |
| R-07 | HNSW cold-traversal latency — vector index is a graph; a cold traversal is many sequential cache-miss round-trips over object storage (cache-miss-is-egress at its worst) | Performance | High | Med | Keep working-set graph resident in local cache; S3 as cold floor only; usearch/hnswlib lineage adapted to paged storage; warm before serving | Med — first-touch / post-eviction traversals stay slow | 12, 05 |
| R-08 | Dependency immaturity — SlateDB / S3-CAS log / libSQL rewrite are fast-moving frontier components with thinner guarantees than coupled Postgres | Maturity | High | Med | Keep Storage trait the swappable seam; pin + vet versions; crash-injection + loom rather than trusting upstream; reserve Postgres for data that can’t absorb the risk | Med — frontier churn persists; correctness rests on our own tests | 03, 04 |
| R-09 | Lost-lease split-brain writer — a writer believes it still holds the lease after losing it, two writers append to one DB | Correctness | High | Low | Single-writer fencing via commit-log CAS token; the stale writer’s CAS fails and is fenced before any second appender succeeds | Low — fencing makes split-brain a fenced rejection, not corruption | 06, 04 |
| R-10 | Cache / NVMe exhaustion — shared-buffer + local file cache (LFC) outgrows local NVMe; eviction thrash collapses to all-network reads | Operational | Med | Med | Size LFC to working set; eviction policy + spill bounds; monitor miss rate and NVMe headroom; scale instance or shard DBs before saturation | Med — pathological working sets still degrade to network latency | 05 |
R-02 is the gate
R-02 (acked-write loss) is the highest-severity entry in the register and the only one with zero tolerance. It is gated by Experiment 4 and must pass before any real data is admitted. All other risks are managed; R-02 is a hard blocker.
Risk categories & severity model
The register uses four categories (Correctness, Performance, Operational, Maturity) and a coarse severity model so risks can be triaged consistently.
Severity ordering. Correctness > Performance > Operational > Maturity for blocking decisions: a Correctness risk that fails its gate stops the release; Performance risks move a tool to a different backend; Operational and Maturity risks are tracked and budgeted. Mitigation reduces Likelihood; Impact is intrinsic to the mechanism and generally does not change with mitigation.
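The ordering can be applied mechanically when reviewing the register. The sketch below sorts a handful of rows copied from the register; the tie-break within a category (impact first, then likelihood) is an assumption of this example, not something the spec defines, and `triage` is a hypothetical helper.

```python
# Blocking order from the severity model: lower number is triaged first.
BLOCKING_ORDER = {"Correctness": 0, "Performance": 1, "Operational": 2, "Maturity": 3}
LEVEL = {"Low": 1, "Med": 2, "High": 3, "Critical": 4}

# (id, category, impact, likelihood), copied from a few register rows.
RISKS = [
    ("R-01", "Performance", "High", "High"),
    ("R-02", "Correctness", "Critical", "Low"),
    ("R-04", "Operational", "Med", "Med"),
    ("R-08", "Maturity", "High", "Med"),
    ("R-09", "Correctness", "High", "Low"),
]


def triage(risks):
    """Sort by blocking category, then by impact and likelihood, highest first."""
    return sorted(
        risks,
        key=lambda r: (BLOCKING_ORDER[r[1]], -LEVEL[r[2]], -LEVEL[r[3]]),
    )


for risk_id, category, impact, likelihood in triage(RISKS):
    print(risk_id, category, impact, likelihood)
```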
When NOT to use this engine
The platform is not all-or-nothing — benchmark the boundary once, then place each tool on the correct side. Two shapes belong on coupled Postgres, not on this engine:
- MUST NOT host one giant, high-concurrency, single-database OLTP workload on this engine. The single-writer-per-DB ceiling (W2) cannot be sharded away when the workload is inherently one database, and per-DB locking serializes unrelated writers.
- MUST NOT host the irreducible outlier — same-row, same-DB, high-rate, latency-critical writes. This is the red quadrant: contention is genuinely sequential, batching can’t add lanes, and the ~10ms networked handoff is paid per contended write. Route it to coupled Postgres.
Where it lands well
Most small / bursty / read-heavy internal tools land green: many small databases (independent lanes), reads served from cache, commits infrequent or batchable. The occasional write-heavy outlier gets a different backend. The architecture follows the workload — it does not demand the workload conform to it.
Decision boundary summary
The per-tool go/no-go rule (from 09), restated as the canonical decision boundary:
       run Exp 1 + Exp 2 (latency floor + group-commit curve)
                                  │
    p99 single-commit acceptable for the tool's write frequency?
    AND group-commit throughput > the tool's aggregate write rate?
                          ┌───────┴───────┐
                         yes             no ────────────────┐
                          │                                 │
             run Exp 3 (contention wall)                    │
                          │                                 │
    heavy CONTENDED writes (same rows, sequential,          │
            can't batch)? → RED QUADRANT                    │
                  ┌───────┴───────┐                         │
                 no              yes ──► coupled Postgres ◄─┘
                  │                      (that one tool)
                  ▼
    FITS: ship on disaggregated engine
    (Exp 4 MUST already have passed: durability gate)
Per-tool placement: latency & throughput decide fit; contention detects the red quadrant; Exp 4 is the unconditional durability gate behind all of it.
| Quadrant | Write shape | Backend | Why |
|---|---|---|---|
| Green | read-heavy / bursty / independent-row writes across many small DBs | this engine | cache serves reads; group commit + many lanes handle writes |
| Amber | sustained independent-row write firehose, single DB | this engine — benchmark first | group commit may suffice; Exp 1+2 decide |
| Red | same-row, same-DB, high-rate, latency-critical | coupled Postgres | contention is sequential; networked handoff too costly per write |
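The flow above can be written down as a small placement function. This is a sketch of the decision boundary, not tooling the spec defines: `ToolBenchmark`, its field names, and `place` are hypothetical, and the boolean inputs are assumed to have been derived from the Experiment 1 to 4 results by a human reviewer.

```python
from dataclasses import dataclass


@dataclass
class ToolBenchmark:
    p99_commit_ok: bool           # Exp 1: p99 single-commit acceptable for the write frequency
    group_commit_tps: float       # Exp 2: measured group-commit throughput
    aggregate_write_tps: float    # the tool's aggregate write rate
    heavy_contended_writes: bool  # Exp 3: same rows, sequential, can't batch
    exp4_passed: bool             # unconditional durability gate


def place(tool: ToolBenchmark) -> str:
    """Return the backend for one tool, following the decision boundary."""
    if not tool.exp4_passed:
        raise RuntimeError("Exp 4 has not passed: no real data on this engine")
    fits = tool.p99_commit_ok and tool.group_commit_tps > tool.aggregate_write_tps
    if not fits:
        return "coupled Postgres"
    if tool.heavy_contended_writes:
        return "coupled Postgres"  # red quadrant
    return "disaggregated engine"


bursty = ToolBenchmark(True, 5000.0, 40.0, False, True)
hot_row = ToolBenchmark(True, 5000.0, 40.0, True, True)
print(place(bursty), "|", place(hot_row))
```

Raising on a failed Exp 4 rather than returning a backend mirrors the rule that the durability gate sits behind every placement, not beside it.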
The durability non-negotiable
One invariant overrides every performance consideration in this spec:
Durability rule
Never acknowledge a commit from an in-memory buffer before its WAL is durably stored. Caching hides read latency completely; it must never hide commit latency — doing so produces acked-write loss (R-02 / Experiment 4). A fast commit path that loses an acked write under crash is disqualifying, regardless of latency numbers.
Concretely, the commit path crosses two adversarial points that Experiment 4 attacks with kill -9:
- (a) after CAS-append issued, before client ack. On restart: the commit is either fully durable (and MUST be replayed/visible) or never acked; no torn or half state, no client believing a lost write succeeded.
- (b) after client ack, before page materialization. On restart: every acked commit MUST be present and recoverable from the durable log; page materialization is a derivable, retryable step, not a durability boundary.
Crash injection MUST be deterministic (seeded) and integrate with the loom/simulation testing. This gate is unconditional and runs before any latency optimization is accepted.
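A toy version of the seeded crash-injection check might look like the following. It is a single-threaded model meant only to show the invariant and the two crash points; it is not the loom/simulation harness, and `Engine`, `Crash`, and `run` are hypothetical names. The Python list stands in for the durable commit log and the dict for volatile materialized pages.

```python
import random


class Crash(Exception):
    """Simulated kill -9."""


class Engine:
    def __init__(self, durable_log):
        self.log = durable_log  # survives restarts (stands in for the commit log)
        self.pages = {}         # volatile; rebuilt from the log on every start
        for key, value in self.log:
            self.pages[key] = value

    def commit(self, key, value, acked, crash_at=None, append_landed=True):
        if crash_at == "a" and not append_landed:
            raise Crash()              # append issued but never became durable
        self.log.append((key, value))  # durable CAS-append
        if crash_at == "a":
            raise Crash()              # durable, but the client never saw an ack
        acked[key] = value             # ack happens only after durability
        if crash_at == "b":
            raise Crash()              # acked, pages not yet materialized
        self.pages[key] = value


def run(seed, commits=200):
    rng = random.Random(seed)          # seeded, so every failure is replayable
    log, acked = [], {}
    engine = Engine(log)
    for i in range(commits):
        crash_at = rng.choice([None, None, "a", "b"])
        landed = rng.random() < 0.5
        try:
            engine.commit(f"k{i}", i, acked, crash_at, append_landed=landed)
        except Crash:
            engine = Engine(log)       # restart: replay the durable log
        # Invariant: every acked commit is visible after any restart.
        assert all(engine.pages.get(k) == v for k, v in acked.items())
    return len(acked), len(log)


print(run(seed=1))
```

Moving the `acked[key] = value` line above the append is the bug the gate exists to catch: with that change a crash at point (a) leaves an acked commit that is absent from the log, and the invariant assertion fails.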
Acceptance criteria & sign-off
Definition of done for the risk posture — the engine may hold real data only when all of the following hold:
- MUST — Experiment 4 passes: zero acked-write loss, no torn/half state across both adversarial crash points, verified on restart (R-02, R-09).
- MUST — Single-writer fencing demonstrated: a writer that loses its lease is fenced (its CAS fails) before any second writer can append (R-09).
- MUST — Per-tool placement recorded: Exp 1+2 results on file, and Exp 3 run for any tool with plausible contention, before that tool goes live (R-01, R-03).
- MUST — No red-quadrant tool on the disaggregated backend; the red-quadrant detector (Exp 3) reviewed for each onboarded tool (R-03).
- SHOULD — Cold-start budget set from Exp 5 (warm/cold distributions + N-concurrent extension); keep-warm/idle-timeout configured for latency-sensitive tools (R-04).
- SHOULD — Egress model validated: compute and object store co-located, or R2 zero-egress confirmed; per-operation fees budgeted (R-05).
- SHOULD — Cache sizing validated against working set; NVMe headroom and miss-rate monitoring in place (R-07, R-10).
- MUST — Every register risk has an owner and a current Residual rating; R-02 reviewed at every release.
Sign-off
Sign-off requires: (1) the durability gate green, (2) the decision boundary applied to every onboarded tool, and (3) the risk register reviewed with no open Critical/High residual lacking a mitigation owner. Maturity (T4 / R-08) is acknowledged, not closed — it is carried as an accepted, monitored risk hedged by the swappable storage seam.
Open questions
- What numeric p99/p999 commit-latency threshold separates Amber from Red per tool class — a single platform default, or per-tool budgets? (depends on Exp 1 data)
- At what concurrency does the thundering-herd spin-up saturate, and is admission control needed at the controller? (Exp 5 “N concurrent cold starts”)
- Is the WASM port’s cold-start and size penalty acceptable for the edge target, or does it confine Workers+R2 to read-mostly tools? (R-06)
- What is the eviction policy and LFC sizing heuristic that keeps the HNSW working set resident without starving B-tree pages? (R-07, R-10)
- How are dependency-immaturity regressions (SlateDB / CAS-log upstream) detected before they reach the durability gate — continuous crash-injection in CI? (R-08)