Serverless OLTP Engine — Development Specification

Abstract

The design target is one OLTP engine that satisfies three properties usually treated as mutually exclusive:

MUST be embeddable — linked in-process as a native library, with function-call latency on the hot path and no socket between the application and the engine.
MUST be runnable as a remote server over the same engine code, exposing a wire protocol for multi-client access.
MUST ground durable state on disaggregated storage — durability bottoms out on object storage (S3 / Cloudflare R2 / MinIO), reached only on cache miss.

From those three properties, the design optimizes simultaneously for scale-to-zero (stateless compute that idles to nothing, so only storage bytes bill at rest), true embeddability (the hot path stays in-process function calls, never a network round-trip), and branching (a database fork is a cheap LSN pointer over shared immutable layers).

Reference shape

"SQLite that forgot where its file was and found it on S3." The engine keeps SQLite's in-process, link-it-anywhere ergonomics, but its storage backend is pluggable and pointed at the network instead of a local file. The closest shipping analog is libSQL / Turso; this specification describes what you would assemble to build your own, owned end-to-end.

engine, two delivery modes
in-proc: embedded hot path
→0: idle compute cost
O(1): branch creation

The tension and its resolution

Two of the requirements pull in opposite directions. Embeddability wants the engine to live inside the application's address space; separated storage wants durable state to live over the network so compute can be stateless. Classic systems pick one side:

Requirement	Implies	Classic system that has it
Embeddable	engine = library linked in-process; function-call latency; no protocol between app and engine	SQLite
Separate storage	durable state lives over the network; compute is stateless and disposable	Neon / Aurora

Resolution — move the seam inward

Keep the engine a library, but make its storage backend pluggable and point it at the network instead of a local file. The seam is a storage interface inside the library, not a server boundary. "Embeddable" and "disaggregated" stop being contradictory the moment the storage seam is a pluggable backend rather than a server you connect to.
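
As a rough illustration of the seam described above, the sketch below shows an engine that is generic over a storage trait. The trait shape, the method names `append` and `get_page_at`, and the in-memory `MemStorage` backend are assumptions made for this example; the authoritative definition belongs to the Storage Interface spec.

```rust
use std::collections::BTreeMap;

/// Log sequence number: a position in the commit log.
pub type Lsn = u64;
/// Page identifier.
pub type PageId = u32;

/// Hypothetical storage seam: the engine only ever talks to this trait.
pub trait Storage {
    /// Append a change record for `page` and return the LSN assigned to it.
    fn append(&mut self, page: PageId, bytes: Vec<u8>) -> Lsn;
    /// Return the newest version of `page` whose LSN is <= `lsn`.
    fn get_page_at(&self, page: PageId, lsn: Lsn) -> Option<Vec<u8>>;
}

/// In-memory stand-in for a real backend (local file or object storage).
#[derive(Default)]
pub struct MemStorage {
    next_lsn: Lsn,
    versions: BTreeMap<(PageId, Lsn), Vec<u8>>,
}

impl Storage for MemStorage {
    fn append(&mut self, page: PageId, bytes: Vec<u8>) -> Lsn {
        self.next_lsn += 1;
        self.versions.insert((page, self.next_lsn), bytes);
        self.next_lsn
    }

    fn get_page_at(&self, page: PageId, lsn: Lsn) -> Option<Vec<u8>> {
        // Keys are ordered by (page, lsn), so the last entry in this range
        // is the newest version of `page` at or before `lsn`.
        self.versions
            .range((page, 0)..=(page, lsn))
            .next_back()
            .map(|(_, bytes)| bytes.clone())
    }
}

/// The engine is generic over the seam, so swapping backends needs no engine change.
pub struct Engine<S: Storage> {
    storage: S,
}

impl<S: Storage> Engine<S> {
    pub fn new(storage: S) -> Self {
        Engine { storage }
    }

    pub fn write(&mut self, page: PageId, bytes: &[u8]) -> Lsn {
        self.storage.append(page, bytes.to_vec())
    }

    pub fn read(&self, page: PageId, lsn: Lsn) -> Option<Vec<u8>> {
        self.storage.get_page_at(page, lsn)
    }
}
```

Constructing `Engine::new(MemStorage::default())` and later substituting a different `Storage` implementation changes nothing in the engine code, which is the point of placing the seam inside the library.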

Server mode falls out for free

Server mode is then just "the same library, wrapped in a wire-protocol listener." The engine, the MVCC manager, the cache, and the storage trait are identical in both modes; the only difference is whether the caller reaches the engine through an FFI function call or through a network socket. There is no second engine to maintain.

One engine, two front doors: FFI for embedded callers, a wire listener for remote clients. The storage trait is the only new internal protocol.
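
The "two front doors" idea can be sketched with a toy engine whose single `execute` entry point is called directly by an embedded caller and by a line-oriented listener loop. The `SET`/`GET` command format and the `serve` function are hypothetical stand-ins; the real server mode speaks pgwire or HTTP, not this toy protocol.

```rust
use std::collections::HashMap;
use std::io::{BufRead, BufReader, Read, Write};

/// Toy engine: a key/value map standing in for the SQL engine.
#[derive(Default)]
pub struct Engine {
    rows: HashMap<String, String>,
}

impl Engine {
    /// The single entry point both front doors share.
    /// Commands (hypothetical): "SET key value" and "GET key".
    pub fn execute(&mut self, command: &str) -> String {
        let parts: Vec<&str> = command.trim().splitn(3, ' ').collect();
        match parts.as_slice() {
            ["SET", key, value] => {
                self.rows.insert(key.to_string(), value.to_string());
                "OK".to_string()
            }
            ["GET", key] => self
                .rows
                .get(*key)
                .cloned()
                .unwrap_or_else(|| "NULL".to_string()),
            _ => "ERR".to_string(),
        }
    }
}

/// Server front door: read one command per line from any byte stream
/// (a TcpStream in practice) and write one reply per line.
/// It adds framing only; all behaviour still lives in `Engine::execute`.
pub fn serve<R: Read, W: Write>(
    engine: &mut Engine,
    input: R,
    mut output: W,
) -> std::io::Result<()> {
    for line in BufReader::new(input).lines() {
        let reply = engine.execute(&line?);
        writeln!(output, "{reply}")?;
    }
    Ok(())
}
```

An embedded caller invokes `engine.execute(...)` as a plain function call, while a remote client reaches the same method through `serve`. There is no second code path to keep in sync.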

The three balanced requirements

The architecture is judged on satisfying these three at the same time, not one at a time. Each is delivered by a specific mechanism, detailed in the linked component specs.

Requirement	Mechanism	Owned by
Scale-to-zero	Compute is stateless — all durable state lives in the storage tier. The controller stops the engine when idle; only object-storage bytes bill at rest. Cold start = process start + cache warm.	CTL
True embeddability	A `LocalFileStorage` backend plus the in-process local cache keep the hot path as function calls, never network calls. The same binary embeds or serves; storage choice is configuration (a connection string), not a rebuild.	CACHE · STOR
Branching	The page store is versioned by LSN and its layers are immutable, so a branch is a cheap pointer (copy-on-write). Many tools each take a branch off one base with near-zero marginal storage until they diverge.	CTL · OBJ
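
To make "storage choice is configuration, not a rebuild" concrete, here is a minimal sketch of selecting a backend from a connection string. The `file://` and `s3://` schemes and the `BackendConfig` type are assumptions for illustration; the spec does not define the connection-string grammar on this page.

```rust
/// Which backend a connection string selects. The variant names mirror the
/// spec's LocalFileStorage / ObjectStorage; the URL schemes are assumed.
#[derive(Debug, PartialEq)]
pub enum BackendConfig {
    LocalFile { path: String },
    Object { bucket: String, prefix: String },
}

/// Parse a connection string into a backend choice at startup.
pub fn parse_connection_string(conn: &str) -> Result<BackendConfig, String> {
    if let Some(path) = conn.strip_prefix("file://") {
        if path.is_empty() {
            return Err("empty file path".to_string());
        }
        return Ok(BackendConfig::LocalFile { path: path.to_string() });
    }
    if let Some(rest) = conn.strip_prefix("s3://") {
        // Everything before the first '/' is the bucket; the rest is a key prefix.
        let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
        if bucket.is_empty() {
            return Err("empty bucket name".to_string());
        }
        return Ok(BackendConfig::Object {
            bucket: bucket.to_string(),
            prefix: prefix.to_string(),
        });
    }
    Err(format!("unsupported scheme in {conn:?}"))
}
```

The same binary would call this once at startup and construct the matching backend behind the storage trait.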

Scale-to-zero

Because compute holds no durable state, the lifecycle controller may terminate the engine process the moment a database goes idle, with no checkpoint to flush beyond what the WAL already made durable. At rest a database costs only its object-storage bytes — no idle compute bill. The price is paid on the first request after idle: process start plus cache warm. Tuning the idle timeout and an optional keep-warm ping trades that cold-start tail against idle cost; see Lifecycle & Controller and the cold-read measurement in the benchmark plan.
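
The idle-timeout trade-off can be sketched as a small state machine. The `IdleController` type, its method names, and the tick-driven design are hypothetical; the Lifecycle & Controller spec is authoritative for the real controller.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Clone, Copy)]
pub enum ComputeState {
    Running,
    Stopped,
}

/// Hypothetical lifecycle controller for one database.
pub struct IdleController {
    idle_timeout: Duration,
    last_activity: Instant,
    state: ComputeState,
    /// Number of requests that had to pay a cold start.
    pub cold_starts: u32,
}

impl IdleController {
    pub fn new(idle_timeout: Duration, now: Instant) -> Self {
        IdleController {
            idle_timeout,
            last_activity: now,
            state: ComputeState::Running,
            cold_starts: 0,
        }
    }

    /// Record a request at `now`. Returns true if it paid a cold start.
    pub fn on_request(&mut self, now: Instant) -> bool {
        let cold = self.state == ComputeState::Stopped;
        if cold {
            // Process start plus cache warm would happen here.
            self.state = ComputeState::Running;
            self.cold_starts += 1;
        }
        self.last_activity = now;
        cold
    }

    /// Periodic check: stop compute once it has been idle for the full timeout.
    /// Nothing is flushed, because compute holds no durable state.
    pub fn tick(&mut self, now: Instant) -> ComputeState {
        if self.state == ComputeState::Running
            && now.duration_since(self.last_activity) >= self.idle_timeout
        {
            self.state = ComputeState::Stopped;
        }
        self.state
    }
}
```

A keep-warm ping is then just a synthetic `on_request` call issued before the timeout elapses, trading idle cost for a shorter cold-start tail.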

True embeddability

Embeddability depends entirely on the local cache. Without it, every read would be an object-storage round-trip costing tens to hundreds of milliseconds, and the "in-process" claim would be hollow. The shared-buffer page cache plus a larger local file cache (LFC) spilling to NVMe keep the working set resident, so reads resolve in-process and only cache misses descend to object storage. The pure-embedded path uses LocalFileStorage for zero network at all (dev, offline, edge); the disaggregated path uses ObjectStorage. Both run the same code and differ only in the connection string.
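
The read path described above can be sketched as a three-tier lookup. Every tier is an in-memory map here, and the `TieredReader` type, its fields, and the promote-on-read policy are assumptions for illustration, not the cache design specified in the CACHE spec.

```rust
use std::collections::HashMap;

pub type PageId = u32;

/// Which tier finally served a read.
#[derive(Debug, PartialEq)]
pub enum Source {
    SharedBuffer,
    LocalFileCache,
    ObjectStorage,
}

pub struct TieredReader {
    shared_buffer: HashMap<PageId, Vec<u8>>,    // in-memory tier
    local_file_cache: HashMap<PageId, Vec<u8>>, // stands in for the NVMe-backed LFC
    object_storage: HashMap<PageId, Vec<u8>>,   // stands in for S3 / R2
    /// How many reads had to descend to object storage.
    pub object_reads: u64,
}

impl TieredReader {
    pub fn new(object_storage: HashMap<PageId, Vec<u8>>) -> Self {
        TieredReader {
            shared_buffer: HashMap::new(),
            local_file_cache: HashMap::new(),
            object_storage,
            object_reads: 0,
        }
    }

    /// Look in memory first, then the LFC, and only then object storage.
    /// Pages are promoted into the faster tiers on the way back up.
    pub fn read(&mut self, page: PageId) -> Option<(Vec<u8>, Source)> {
        if let Some(bytes) = self.shared_buffer.get(&page) {
            return Some((bytes.clone(), Source::SharedBuffer));
        }
        if let Some(bytes) = self.local_file_cache.get(&page).cloned() {
            self.shared_buffer.insert(page, bytes.clone());
            return Some((bytes, Source::LocalFileCache));
        }
        let bytes = self.object_storage.get(&page)?.clone();
        self.object_reads += 1;
        self.local_file_cache.insert(page, bytes.clone());
        self.shared_buffer.insert(page, bytes.clone());
        Some((bytes, Source::ObjectStorage))
    }

    /// Simulate memory pressure: drop a page from memory but keep it in the LFC.
    pub fn evict_from_memory(&mut self, page: PageId) {
        self.shared_buffer.remove(&page);
    }
}
```

Only the first read of a page touches object storage; a page evicted from memory is still served from the LFC, which is how the working set stays off the network.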

Branching

The page store materializes pages by (PageId, Lsn) over immutable delta and image layers. A branch is created by allocating a new LSN pointer that shares all existing layers; writes on the branch append new layers without touching the parent. Storage cost grows only with divergence. This is what makes "dozens of internal tools, each on its own branch off one base" affordable — and unlike Postgres's pgvector, an index built into the engine (e.g. HNSW) branches with the database because it is just more pages through the same trait (see Capabilities).
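
A minimal sketch of branch-as-pointer follows. It stores whole page versions rather than separate delta and image layers, and uses one global LSN counter across branches; both are simplifications, and the `PageStore` and `Branch` types are hypothetical.

```rust
use std::collections::BTreeMap;

pub type Lsn = u64;
pub type PageId = u32;
pub type BranchId = usize;

struct Branch {
    /// Parent branch and the LSN at which this branch forked from it.
    parent: Option<(BranchId, Lsn)>,
    /// Versions written on this branch only; entries are never modified.
    versions: BTreeMap<(PageId, Lsn), Vec<u8>>,
}

pub struct PageStore {
    branches: Vec<Branch>,
    next_lsn: Lsn,
}

impl PageStore {
    /// Create a store with a single root branch and return its id.
    pub fn new() -> (Self, BranchId) {
        let root = Branch { parent: None, versions: BTreeMap::new() };
        (PageStore { branches: vec![root], next_lsn: 0 }, 0)
    }

    pub fn write(&mut self, branch: BranchId, page: PageId, bytes: &[u8]) -> Lsn {
        self.next_lsn += 1;
        let lsn = self.next_lsn;
        self.branches[branch].versions.insert((page, lsn), bytes.to_vec());
        lsn
    }

    /// Branch creation records a (parent, LSN) pointer and copies no page data.
    pub fn branch_at(&mut self, parent: BranchId, lsn: Lsn) -> BranchId {
        self.branches.push(Branch { parent: Some((parent, lsn)), versions: BTreeMap::new() });
        self.branches.len() - 1
    }

    /// Number of page versions physically stored on this branch alone.
    pub fn versions_on(&self, branch: BranchId) -> usize {
        self.branches[branch].versions.len()
    }

    /// Newest version of `page` visible on `branch` at `lsn`, falling back to
    /// ancestors but never looking past the LSN at which each fork happened.
    pub fn get_page_at(&self, branch: BranchId, page: PageId, lsn: Lsn) -> Option<&[u8]> {
        let (mut current, mut upper) = (branch, lsn);
        loop {
            let b = &self.branches[current];
            if let Some((_, bytes)) = b.versions.range((page, 0)..=(page, upper)).next_back() {
                return Some(bytes.as_slice());
            }
            let (parent, fork_lsn) = b.parent?;
            current = parent;
            upper = upper.min(fork_lsn);
        }
    }
}
```

A freshly created branch stores zero versions of its own and reads through to its parent, so storage grows only as the branch diverges.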

Layer map — the three slots

The system is three horizontal slots. The pluggable storage interface sits at the bottom of Slot B and is the only genuinely new protocol; everything below it speaks standard object-storage APIs.

SLOT A — Interface
  Embedded:  FFI / native addon  → engine runs inside the host process (Bun, etc.)
  Server:    Postgres wire protocol (pgwire) or HTTP/WS → same engine, network listener
  Pooler (server mode only): PgBouncer / pgcat, transaction mode

SLOT B — Engine (the library)
  SQL parser · planner · executor · MVCC/txn manager
  Local cache (hot pages stay in-process — this is what preserves embeddability)
  >>> Pluggable Storage Interface <<<  (the seam)

SLOT C — Storage (disaggregated)
  Log service:  durable ordered commit log; commit = quorum/CAS ack
  Page store:   LSM layers, versioned by LSN; materializes pages on demand
  Durability floor: Object storage (S3 / R2), reached only on cache miss

Three slots. The storage interface is the seam; the only new protocol (Engine ↔ Storage) is internal.

Boundary	Protocol	Notes
App ↔ Engine (embedded)	direct function calls	no protocol on the hot path
App ↔ Engine (server)	Postgres wire protocol or HTTP	see Server Mode
Engine ↔ Storage	custom RPC: `Append(WAL)` for writes, `GetPage@LSN` for reads	the only new protocol; internal; see Storage Interface
Storage ↔ Durability	S3 API (GET / PUT, conditional writes)	conditional write = CAS; see Object-Storage Backend

Document map

The fourteen component specifications are grouped by concern. Each builds against the seam and the requirements above; follow the cross-links rather than reading start-to-finish unless you want the whole picture.

Conventions

Requirement keywords (RFC 2119)

Normative requirements use the RFC 2119 keywords, rendered as coloured tags in requirement lists.

MUST: an absolute requirement of the specification; equivalently REQUIRED or SHALL. A conforming implementation cannot omit it.
SHOULD: a strong recommendation; equivalently RECOMMENDED. May be ignored only with a fully understood, documented reason.
MAY: a truly optional item; equivalently OPTIONAL. Implementations interoperate whether or not it is provided.
MUST NOT: an absolute prohibition; equivalently SHALL NOT.

Status legend

Every page carries a status badge in its header. The lifecycle is:

Draft v0.1 — under active authoring; content may change without notice and may contain open questions.
In review — content-complete and circulating for engineering sign-off; comments open.
Approved — ratified; changes require a version bump and a changelog entry.

All pages are currently Draft v0.1, dated 2026-06-20.

Inline status pills

Within prose and tables, the deployment quadrant of a workload is flagged with pills: green fits the disaggregated engine, amber needs tuning or measurement, red belongs on coupled Postgres, blue is informational, and build marks a build-it-yourself artifact.

Cross-links and navigation

The left sidebar, the right-hand "On this page" table of contents, heading anchors, code-copy buttons, the theme toggle, and the previous/next links at the foot of each page are all generated by ../assets/app.js from a single canonical manifest (../assets/specs-section.js) — page order lives in exactly one place. Hover any h2 or h3 to reveal its # anchor for deep-linking. Cross-references between specs use the SPEC id (e.g. STOR, OBJ) and appear as the cards in the document map above and in each page's "Related specifications" section. This page is the hub and has no previous link, which is expected.

Glossary

Terms used throughout the specifications. Where a term is the subject of a component spec, the linked page is authoritative.

OLTP: Online Transaction Processing — short, frequent read/write transactions on current data (the workload this engine targets), as opposed to OLAP analytical scans.
MVCC: Multi-Version Concurrency Control — readers see a consistent snapshot without blocking writers, implemented here as snapshot isolation over LSN-stamped row versions.
LSN: Log Sequence Number — a monotonically increasing position in the commit log that stamps every version and identifies the point-in-time a page is materialized at (GetPage@LSN).
WAL: Write-Ahead Log — the ordered record of changes written and made durable before a commit is acknowledged; the engine's write path appends WAL and never acks from an in-memory buffer.
LSM: Log-Structured Merge tree — the storage organization of the page store: buffer writes in memory, flush to immutable on-disk/on-S3 layers, merge via compaction.
delta layer: An immutable LSM layer holding incremental (key, LSN) change records since the last image; replayed over an image to reconstruct a page at a target LSN.
image layer: An immutable LSM layer holding full page snapshots at a given LSN, serving as the base that delta layers are applied on top of.
CAS: Compare-And-Swap — an atomic conditional write (here, an S3 conditional PUT) that succeeds only if the current value matches expectation; the primitive that gives the commit log atomic ordered appends and single-writer fencing without a separate consensus cluster.
PITR: Point-In-Time Recovery — restoring or reading the database as of a chosen LSN/timestamp; the retention window bounds how far back layers are kept before compaction and GC reclaim them.
fencing token: A monotonically increasing token (issued by the commit log's CAS) that lets storage reject writes from a stale or evicted writer, enforcing single-writer-per-database.
FFI: Foreign Function Interface — calling the engine's C ABI directly from a host runtime so it runs in-process; Bun's bun:ffi binds engine.h with no socket.
NAPI: Node-API — a stable native-addon ABI; an alternative engine binding that works in both Bun and Node from one package.
pgwire: The Postgres wire protocol — what engine-server speaks in server mode, so ordinary Postgres clients and tools (and Bun.sql) connect with zero extra dependencies.
pooler: A connection pooler (PgBouncer / pgcat in transaction mode) placed in front of the server to absorb serverless connection bursts.
HNSW: Hierarchical Navigable Small World — the graph-based approximate-nearest-neighbour index used for built-in vector search; its random-access traversal leans hardest of anything on the local cache.
HTAP: Hybrid Transactional/Analytical Processing — served here as a system (the row engine plus DuckDB over shared storage), not as one engine doing both.
LFC: Local File Cache — a large second-tier cache spilling hot pages to local NVMe beneath the in-memory shared-buffer cache, extending the working set that stays off the S3 hot path.
scale-to-zero: Stopping all compute for an idle database so only object-storage bytes bill at rest; the engine restarts on the next request, paying a cold start.
branch: A copy-on-write fork of a database created as a new LSN pointer over shared immutable layers; near-zero marginal storage until the branch diverges.
disaggregated storage: Separating durable storage from compute so state lives over the network on object storage and compute is stateless and disposable.
group commit: Batching several pending commits into one durable storage round-trip to amortize per-commit network/S3 latency (the lever against W1); it does not add write lanes (W2).
egress: Data transferred out of object storage, billed per GB on S3/GCS; cache misses on a disaggregated engine are egress, so a cold cache costs more — which is why R2's zero egress matters structurally here.
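
The CAS and fencing token entries above can be illustrated with an in-memory stand-in for the log head. A real implementation would use an S3 conditional PUT; the `LogHead` type, the epoch-as-token scheme, and the error variants are assumptions made for this sketch.

```rust
/// In-memory stand-in for a log head object with conditional-write semantics.
pub struct LogHead {
    /// Bumped on every successful append; plays the role of the compared value.
    version: u64,
    /// Current writer epoch; plays the role of the fencing token.
    epoch: u64,
    records: Vec<Vec<u8>>,
}

#[derive(Debug, PartialEq)]
pub enum AppendError {
    /// The log moved on since the caller last read it.
    Conflict { current_version: u64 },
    /// The caller's token is stale: another writer has taken over.
    Fenced { current_epoch: u64 },
}

impl LogHead {
    pub fn new() -> Self {
        LogHead { version: 0, epoch: 0, records: Vec::new() }
    }

    /// A new writer takes over by bumping the epoch, which invalidates
    /// every previously issued token.
    pub fn acquire_writer(&mut self) -> u64 {
        self.epoch += 1;
        self.epoch
    }

    /// Append succeeds only if the token is current AND the caller's expected
    /// version matches. On success the new version is returned.
    pub fn append_if(
        &mut self,
        token: u64,
        expected_version: u64,
        record: Vec<u8>,
    ) -> Result<u64, AppendError> {
        if token != self.epoch {
            return Err(AppendError::Fenced { current_epoch: self.epoch });
        }
        if expected_version != self.version {
            return Err(AppendError::Conflict { current_version: self.version });
        }
        self.records.push(record);
        self.version += 1;
        Ok(self.version)
    }

    pub fn len(&self) -> usize {
        self.records.len()
    }
}
```

The version check gives atomic ordered appends, and the epoch check rejects a stale or evicted writer, which together enforce single-writer-per-database without a separate consensus cluster.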

One-line summary

The whole system in one sentence

Build a Rust engine that exposes a narrow Storage trait, ship it as both a linkable library and a pgwire server, back the trait with an LSM-on-S3 page store plus an S3-CAS commit log, and bind it into Bun via bun:ffi for embedded use or Bun.sql for server use — embeddable and disaggregated stop being contradictory the moment the storage seam is a pluggable backend instead of a server you connect to.

Related specifications

Start with the architecture overview, then follow the rollout order.

ARCH	Architecture Overview	The full picture of slots, seam, and shared-engine dual mode.
STOR	Storage Interface	The seam that makes everything else possible.
ROAD	Roadmap & Build Sequence	The four-step build order, expanded with milestones.
RISK	Tradeoffs & Risk Register	Read this before committing real workloads.