Serverless OLTP Engine — Development Specification
A single OLTP engine that links in-process as an embeddable library and runs as a remote server over the same code, backed by disaggregated storage on S3-compatible object storage, and engineered for scale-to-zero, true embeddability, and instant branching at once. This page is the hub: it frames the design and indexes the fourteen component specs that build it.
Abstract
The design target is one OLTP engine that satisfies three properties usually treated as mutually exclusive:
- MUST be embeddable — linked in-process as a native library, with function-call latency on the hot path and no socket between the application and the engine.
- MUST be runnable as a remote server over the same engine code, exposing a wire protocol for multi-client access.
- MUST ground durable state on disaggregated storage — durability bottoms out on object storage (S3 / Cloudflare R2 / MinIO), reached only on cache miss.
From those it is optimized simultaneously for scale-to-zero (stateless compute that idles to nothing, only storage bytes billing at rest), true embeddability (the hot path stays in-process function calls, never a network round-trip), and branching (a database fork is a cheap LSN pointer over shared immutable layers).
Reference shape
"SQLite that forgot where its file was and found it on S3." The engine keeps SQLite's in-process, link-it-anywhere ergonomics, but its storage backend is pluggable and pointed at the network instead of a local file. The closest shipping analog is libSQL / Turso; this specification describes what you would assemble to build your own, owned end-to-end.
The tension and its resolution
Two of the requirements pull in opposite directions. Embeddability wants the engine to live inside the application's address space; disaggregated storage wants durable state to live over the network so compute can be stateless. Classic systems pick one side:
| Requirement | Implies | Classic system that has it |
|---|---|---|
| Embeddable | engine = library linked in-process; function-call latency; no protocol between app and engine | SQLite |
| Separate storage | durable state lives over the network; compute is stateless and disposable | Neon / Aurora |
Resolution — move the seam inward
Keep the engine a library, but make its storage backend pluggable and point it at the network instead of a local file. The seam is a storage interface inside the library, not a server boundary. "Embeddable" and "disaggregated" stop being contradictory the moment the storage seam is a pluggable backend rather than a server you connect to.
Server mode falls out for free
Server mode is then just "the same library, wrapped in a wire-protocol listener." The engine, the MVCC manager, the cache, and the storage trait are identical in both modes; the only difference is whether the caller reaches the engine through an FFI function call or through a network socket. There is no second engine to maintain.
One engine, two front doors: FFI for embedded callers, a wire listener for remote clients. The storage trait is the only new internal protocol.
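The "one engine, two front doors" idea can be sketched as follows. This is a minimal illustration, not the project's API: the `Engine` type is a toy key-value stand-in for the SQL engine, and the line format accepted by `handle_wire_line` (`PUT k v` / `GET k`) is invented for the sketch and is not pgwire. The point is only that the server path decodes a request and then calls the same methods an embedded caller invokes directly.

```rust
use std::collections::HashMap;

/// Hypothetical toy engine: a key-value map standing in for the SQL engine.
#[derive(Default)]
pub struct Engine {
    rows: HashMap<String, String>,
}

impl Engine {
    /// Front door 1 (embedded): callers invoke these methods directly.
    pub fn put(&mut self, key: &str, value: &str) {
        self.rows.insert(key.to_string(), value.to_string());
    }

    pub fn get(&self, key: &str) -> Option<&str> {
        self.rows.get(key).map(String::as_str)
    }
}

/// Front door 2 (server): decode one request line, then call the same methods.
/// The line format is an assumption made for this sketch only.
pub fn handle_wire_line(engine: &mut Engine, line: &str) -> String {
    let mut parts = line.trim().splitn(3, ' ');
    match (parts.next(), parts.next(), parts.next()) {
        (Some("PUT"), Some(k), Some(v)) => {
            engine.put(k, v);
            "OK".to_string()
        }
        (Some("GET"), Some(k), None) => match engine.get(k) {
            Some(v) => format!("VALUE {}", v),
            None => "NOT_FOUND".to_string(),
        },
        _ => "ERR".to_string(),
    }
}
```

An embedded host would call `engine.put(...)` and `engine.get(...)` in-process, while a listener would feed each received line to `handle_wire_line`; both paths end in the same engine code.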
The three balanced requirements
The architecture is judged on satisfying these three at the same time, not one at a time. Each is delivered by a specific mechanism, detailed in the linked component specs.
| Requirement | Mechanism | Owned by |
|---|---|---|
| Scale-to-zero | Compute is stateless — all durable state lives in the storage tier. The controller stops the engine when idle; only object-storage bytes bill at rest. Cold start = process start + cache warm. | CTL |
| True embeddability | A LocalFileStorage backend plus the in-process local cache keep the hot path as function calls, never network calls. The same binary embeds or serves; storage choice is configuration (a connection string), not a rebuild. | CACHE · STOR |
| Branching | The page store is versioned by LSN and its layers are immutable, so a branch is a cheap pointer (copy-on-write). Many tools each take a branch off one base with near-zero marginal storage until they diverge. | CTL · OBJ |
Scale-to-zero
Because compute holds no durable state, the lifecycle controller may terminate the engine process the moment a database goes idle, with no checkpoint to flush beyond what the WAL already made durable. At rest a database costs only its object-storage bytes — no idle compute bill. The price is paid on the first request after idle: process start plus cache warm. Tuning the idle timeout and an optional keep-warm ping trades that cold-start tail against idle cost; see Lifecycle & Controller and the cold-read measurement in the benchmark plan.
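The idle-stop decision described above might look roughly like the sketch below. The names `IdlePolicy`, `Action`, and `decide` are hypothetical, and the rule that open transactions block a stop is an assumption of this sketch rather than something the spec states here.

```rust
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq)]
pub enum Action {
    KeepRunning,
    Stop,
}

/// Hypothetical controller policy: stop the engine after `idle_timeout` of inactivity.
pub struct IdlePolicy {
    pub idle_timeout: Duration,
}

impl IdlePolicy {
    /// Decide whether the engine process may be stopped at time `now`.
    /// Assumption: a database with open transactions is never considered idle.
    pub fn decide(&self, last_activity: Instant, now: Instant, open_transactions: usize) -> Action {
        if open_transactions > 0 {
            return Action::KeepRunning;
        }
        // Saturates to zero if `now` is earlier than `last_activity`.
        let idle_for = now.saturating_duration_since(last_activity);
        if idle_for >= self.idle_timeout {
            Action::Stop
        } else {
            Action::KeepRunning
        }
    }
}
```

Under this model a keep-warm ping needs no special case: it simply counts as activity and resets `last_activity`, so a shorter ping interval than `idle_timeout` keeps the process alive at the price of idle compute.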
True embeddability
Embeddability depends entirely on the local cache. Without it, every read would be an object-storage round-trip, typically tens to hundreds of milliseconds, and the "in-process" claim would be hollow. The shared-buffer page cache plus a larger local file cache (LFC) spilling to NVMe keep the working set resident, so reads resolve in-process and only cache misses descend to object storage. The pure-embedded path uses LocalFileStorage for no network access at all (dev, offline, edge); the disaggregated path uses ObjectStorage: same code, different connection string.
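The read path through the two cache tiers can be sketched as below. This is an illustrative model only: `TieredCache`, `Tier`, and the use of in-memory `HashMap`s for all three tiers (including the stand-in for object storage) are assumptions of the sketch, and eviction policy is reduced to a single manual helper.

```rust
use std::collections::HashMap;

pub type PageId = u64;

/// Which tier satisfied a read.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Tier {
    SharedBuffer,
    LocalFileCache,
    ObjectStorage,
}

/// Hypothetical three-tier read path; every tier is an in-memory map in this sketch.
pub struct TieredCache {
    shared_buffer: HashMap<PageId, Vec<u8>>,
    lfc: HashMap<PageId, Vec<u8>>,
    object_store: HashMap<PageId, Vec<u8>>, // stand-in for S3 / R2
    pub object_store_reads: usize,
}

impl TieredCache {
    pub fn new(object_store: HashMap<PageId, Vec<u8>>) -> Self {
        TieredCache {
            shared_buffer: HashMap::new(),
            lfc: HashMap::new(),
            object_store,
            object_store_reads: 0,
        }
    }

    /// Look in memory first, then the LFC, and only then go to object storage.
    /// Pages found in a lower tier are promoted into the tiers above it.
    pub fn read(&mut self, page: PageId) -> Option<(Vec<u8>, Tier)> {
        if let Some(bytes) = self.shared_buffer.get(&page) {
            return Some((bytes.clone(), Tier::SharedBuffer));
        }
        if let Some(bytes) = self.lfc.get(&page).cloned() {
            self.shared_buffer.insert(page, bytes.clone());
            return Some((bytes, Tier::LocalFileCache));
        }
        let bytes = self.object_store.get(&page)?.clone();
        self.object_store_reads += 1;
        self.lfc.insert(page, bytes.clone());
        self.shared_buffer.insert(page, bytes.clone());
        Some((bytes, Tier::ObjectStorage))
    }

    /// Simulate memory pressure: drop a page from the shared buffer only.
    pub fn evict_from_memory(&mut self, page: PageId) {
        self.shared_buffer.remove(&page);
    }
}
```

In this model only the first read of a page reaches the object-store stand-in; a page evicted from memory is still served from the LFC, which is the behaviour the paragraph above relies on.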
Branching
The page store materializes pages by (PageId, Lsn) over immutable delta and image layers. A branch is created by allocating a new LSN pointer that shares all existing layers; writes on the branch append new layers without touching the parent. Storage cost grows only with divergence. This is what makes "dozens of internal tools, each on its own branch off one base" affordable — and unlike Postgres's pgvector, an index built into the engine (e.g. HNSW) branches with the database because it is just more pages through the same trait (see Capabilities).
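A simplified model of branching over shared immutable layers is sketched below. It is not the page store's design: layers here hold full page images only (no delta replay), and `Branch`, `Layer`, `fork`, and the per-layer visibility cap are names and mechanisms assumed for illustration. What it does show is that a fork copies `Arc` pointers rather than pages, and that writes after the fork stay private to each side.

```rust
use std::collections::BTreeMap;
use std::sync::Arc;

pub type PageId = u64;
pub type Lsn = u64;

/// An immutable layer once frozen: page versions keyed by (PageId, Lsn).
/// Simplification: every version is a full image.
pub type Layer = BTreeMap<(PageId, Lsn), Vec<u8>>;

/// Newest version of `page` in `layer` with LSN <= `max_lsn`.
fn newest(layer: &Layer, page: PageId, max_lsn: Lsn) -> Option<(Lsn, &Vec<u8>)> {
    layer
        .range((page, 0)..=(page, max_lsn))
        .next_back()
        .map(|(&(_, lsn), bytes)| (lsn, bytes))
}

pub struct Branch {
    /// Frozen layers shared by pointer, each with the highest LSN visible to this branch.
    shared: Vec<(Arc<Layer>, Lsn)>,
    /// Versions written on this branch since it was created or last forked.
    own: Layer,
}

impl Branch {
    pub fn root() -> Self {
        Branch { shared: Vec::new(), own: Layer::new() }
    }

    pub fn write(&mut self, page: PageId, lsn: Lsn, bytes: Vec<u8>) {
        self.own.insert((page, lsn), bytes);
    }

    /// Freeze this branch's writes into a shared layer and create a child that
    /// sees shared history only up to `at`. No page data is copied.
    pub fn fork(&mut self, at: Lsn) -> Branch {
        let frozen = Arc::new(std::mem::take(&mut self.own));
        self.shared.push((frozen, Lsn::MAX));
        let shared = self
            .shared
            .iter()
            .map(|(layer, cap)| (Arc::clone(layer), (*cap).min(at)))
            .collect();
        Branch { shared, own: Layer::new() }
    }

    /// Materialize `page` as of `lsn` from this branch's own writes and shared layers.
    pub fn read(&self, page: PageId, lsn: Lsn) -> Option<Vec<u8>> {
        let own = newest(&self.own, page, lsn);
        let shared = self
            .shared
            .iter()
            .filter_map(|(layer, cap)| newest(layer, page, lsn.min(*cap)));
        own.into_iter()
            .chain(shared)
            .max_by_key(|(version_lsn, _)| *version_lsn)
            .map(|(_, bytes)| bytes.clone())
    }
}
```

Forking at LSN 10 hides any parent version above 10 from the child, and later writes on either side land in that side's private `own` layer, so storage grows only with divergence in this model.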
Layer map — the three slots
The system is three horizontal slots. The pluggable storage interface sits at the bottom of Slot B and is the only genuinely new protocol; everything below it speaks standard object-storage APIs.
SLOT A: Interface
  Embedded: FFI / native addon → engine runs inside the host process (Bun, etc.)
  Server: Postgres wire protocol (pgwire) or HTTP/WS → same engine, network listener
  Pooler (server mode only): PgBouncer / pgcat, transaction mode
SLOT B: Engine (the library)
  SQL parser · planner · executor · MVCC/txn manager
  Local cache (hot pages stay in-process; this is what preserves embeddability)
  >>> Pluggable Storage Interface <<< (the seam)
SLOT C: Storage (disaggregated)
  Log service: durable ordered commit log; commit = quorum/CAS ack
  Page store: LSM layers, versioned by LSN; materializes pages on demand
  Durability floor: Object storage (S3 / R2), reached only on cache miss
Three slots. The storage interface is the seam; the only new protocol (Engine ↔ Storage) is internal.
| Boundary | Protocol | Notes |
|---|---|---|
| App ↔ Engine (embedded) | direct function calls | no protocol on the hot path |
| App ↔ Engine (server) | Postgres wire protocol or HTTP | see Server Mode |
| Engine ↔ Storage | custom RPC: Append(WAL) for writes, GetPage@LSN for reads | the only new protocol; internal; see Storage Interface |
| Storage ↔ Durability | S3 API (GET / PUT, conditional writes) | conditional write = CAS; see Object-Storage Backend |
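The Engine ↔ Storage boundary in the table names two operations, Append(WAL) and GetPage@LSN. One way such a trait could look in Rust is sketched below; the method names, signatures, and the `MemStorage` test backend are assumptions for illustration, and the authoritative definition belongs to the Storage Interface spec. For brevity each appended record is treated as a full page image rather than a delta.

```rust
use std::collections::BTreeMap;

// Hypothetical identifier types; the spec only names the two operations.
pub type PageId = u64;
pub type Lsn = u64;

/// The seam: the only interface the engine uses to reach durable state.
pub trait Storage {
    /// Append one WAL record for `page` and return the LSN assigned to it.
    fn append_wal(&mut self, page: PageId, record: Vec<u8>) -> Lsn;

    /// Return the newest version of `page` with LSN <= `lsn`, if any.
    fn get_page_at_lsn(&self, page: PageId, lsn: Lsn) -> Option<Vec<u8>>;
}

/// In-memory stand-in backend. A file-backed or object-storage backend
/// would implement the same trait.
#[derive(Default)]
pub struct MemStorage {
    next_lsn: Lsn,
    versions: BTreeMap<(PageId, Lsn), Vec<u8>>,
}

impl Storage for MemStorage {
    fn append_wal(&mut self, page: PageId, record: Vec<u8>) -> Lsn {
        self.next_lsn += 1; // LSNs start at 1 and only increase
        self.versions.insert((page, self.next_lsn), record);
        self.next_lsn
    }

    fn get_page_at_lsn(&self, page: PageId, lsn: Lsn) -> Option<Vec<u8>> {
        self.versions
            .range((page, 0)..=(page, lsn))
            .next_back()
            .map(|(_, bytes)| bytes.clone())
    }
}
```

Because the engine would depend only on the trait, swapping `MemStorage` for another backend changes no engine code, which is the property the seam is meant to provide.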
Document map
The fourteen component specifications, grouped by concern. Each builds against the seam and the requirements above; follow the cross-links rather than reading start-to-finish unless you want the whole picture.
Architecture — the engine and its storage
Storage trait — the single most important artifact and the system's seam.
OBJ Object-Storage Backend: LSM page store on S3/R2, delta/image layers, and the S3-CAS commit log.
CACHE Local Cache: Shared-buffer cache + LFC spilling to NVMe, the thing that keeps embeddability real.
CTL Lifecycle & Controller: Cold start, idle stop (scale-to-zero), branch-on-LSN, and single-writer fencing.
Interfaces — how callers reach the engine
engine-server binary: same library wrapped in a pgwire / HTTP listener, plus pooler.
BUN Bun Integration: Embedded via bun:ffi over engine.h (or NAPI); server via built-in Bun.sql.
Validation — proving the boundary holds
Operations & planning
Recommended reading order & rollout
The build (and the recommended reading order) follows a four-step rollout. Each step is independently shippable and adds exactly one capability; you can stop after any step and have a working system.
- Ship the embedded library first. bun:ffi binding over the C ABI, with the LocalFileStorage backend. Fastest path to a working demo, zero infrastructure. Read Engine Core, Storage Interface, and Bun Integration.
- Add the ObjectStorage backend. Flip the connection string from file:// to s3://. Now it is disaggregated and scale-to-zero capable, still embedded, on the same code path. Read Object-Storage Backend and Local Cache.
- Add engine-server + pgwire. Remote / server mode for multi-client access and tools that expect Postgres. Read Server Mode & Wire Protocol.
- Add the controller. Idle stop and branch-on-LSN deliver scale-to-zero and instant clones for the dozens-of-tools goal. Read Lifecycle & Controller, then validate with the benchmark plan and place each tool with the hot-row strategy.
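Step two of the rollout switches backends by changing only the connection string. A minimal sketch of that selection follows; the `Backend` enum, the `parse_backend` function, and the bucket/prefix split for `s3://` strings are hypothetical, and a real connection string would likely carry more options (endpoint, credentials, region).

```rust
/// Hypothetical backend selection derived from a connection string.
#[derive(Debug, PartialEq)]
pub enum Backend {
    LocalFile { path: String },
    Object { bucket: String, prefix: String },
}

/// Map `file://...` to the local-file backend and `s3://bucket/prefix` to the
/// object-storage backend. Anything else is rejected.
pub fn parse_backend(conn: &str) -> Result<Backend, String> {
    if let Some(path) = conn.strip_prefix("file://") {
        if path.is_empty() {
            return Err("file:// connection string has no path".to_string());
        }
        return Ok(Backend::LocalFile { path: path.to_string() });
    }
    if let Some(rest) = conn.strip_prefix("s3://") {
        // Everything before the first '/' is the bucket; the remainder is a key prefix.
        let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
        if bucket.is_empty() {
            return Err("s3:// connection string has no bucket".to_string());
        }
        return Ok(Backend::Object {
            bucket: bucket.to_string(),
            prefix: prefix.to_string(),
        });
    }
    Err(format!("unsupported scheme in connection string: {}", conn))
}
```

With a function like this at startup, the rest of the engine receives a backend value and never inspects the scheme again, so moving from step one to step two is a configuration change.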
First-time reader path
If you are new to the project, read Architecture Overview next for the full picture, then follow the four rollout steps above in order. The remaining specs (Deployment, Capabilities, Roadmap, Risks) are reference material consulted as decisions arise.
Conventions
Requirement keywords (RFC 2119)
Normative requirements use the RFC 2119 keywords, rendered as coloured tags in requirement lists.
- MUST: an absolute requirement of the specification; equivalently REQUIRED or SHALL. A conforming implementation cannot omit it.
- SHOULD: a strong recommendation; equivalently RECOMMENDED. May be ignored only with a fully understood, documented reason.
- MAY: a truly optional item; equivalently OPTIONAL. Implementations interoperate whether or not it is provided.
- MUST NOT: an absolute prohibition; equivalently SHALL NOT.
Status legend
Every page carries a status badge in its header. The lifecycle is:
- Draft v0.1 — under active authoring; content may change without notice and may contain open questions.
- In review — content-complete and circulating for engineering sign-off; comments open.
- Approved — ratified; changes require a version bump and a changelog entry.
All pages are currently Draft v0.1, dated 2026-06-20.
Inline status pills
Within prose and tables, the deployment quadrant of a workload is flagged with pills: green fits the disaggregated engine, amber needs tuning or measurement, red belongs on coupled Postgres, blue informational, and build marks a build-it-yourself artifact.
Cross-links and navigation
The left sidebar, the right-hand "On this page" table of contents, heading anchors, code-copy buttons, the theme toggle, and the previous/next links at the foot of each page are all generated by ../assets/app.js from a single canonical manifest (../assets/specs-section.js) — page order lives in exactly one place. Hover any h2 or h3 to reveal its # anchor for deep-linking. Cross-references between specs use the SPEC id (e.g. STOR, OBJ) and appear as the cards in the document map above and in each page's "Related specifications" section. This page is the hub and has no previous link, which is expected.
Glossary
Terms used throughout the specifications. Where a term is the subject of a component spec, the linked page is authoritative.
- OLTP: Online Transaction Processing. Short, frequent read/write transactions on current data (the workload this engine targets), as opposed to OLAP analytical scans.
- MVCC: Multi-Version Concurrency Control. Readers see a consistent snapshot without blocking writers, implemented here as snapshot isolation over LSN-stamped row versions.
- LSN: Log Sequence Number. A monotonically increasing position in the commit log that stamps every version and identifies the point in time a page is materialized at (GetPage@LSN).
- WAL: Write-Ahead Log. The ordered record of changes written and made durable before a commit is acknowledged; the engine's write path appends WAL and never acks from an in-memory buffer.
- LSM: Log-Structured Merge tree. The storage organization of the page store: buffer writes in memory, flush to immutable on-disk/on-S3 layers, merge via compaction.
- delta layer: An immutable LSM layer holding incremental (key, LSN) change records since the last image; replayed over an image to reconstruct a page at a target LSN.
- image layer: An immutable LSM layer holding full page snapshots at a given LSN, serving as the base that delta layers are applied on top of.
- CAS: Compare-And-Swap. An atomic conditional write (here, an S3 conditional PUT) that succeeds only if the current value matches expectation; the primitive that gives the commit log atomic ordered appends and single-writer fencing without a separate consensus cluster.
- PITR: Point-In-Time Recovery. Restoring or reading the database as of a chosen LSN/timestamp; the retention window bounds how far back layers are kept before compaction and GC reclaim them.
- fencing token: A monotonically increasing token (issued by the commit log's CAS) that lets storage reject writes from a stale or evicted writer, enforcing single-writer-per-database.
- FFI: Foreign Function Interface. Calling the engine's C ABI directly from a host runtime so it runs in-process; Bun's bun:ffi binds engine.h with no socket.
- NAPI: Node-API. A stable native-addon ABI; an alternative engine binding that works in both Bun and Node from one package.
- pgwire: The Postgres wire protocol. It is what engine-server speaks in server mode, so ordinary Postgres clients and tools (and Bun.sql) connect with zero extra dependencies.
- pooler: A connection pooler (PgBouncer / pgcat in transaction mode) placed in front of the server to absorb serverless connection bursts.
- HNSW: Hierarchical Navigable Small World. The graph-based approximate-nearest-neighbour index used for built-in vector search; its random-access traversal leans hardest of anything on the local cache.
- HTAP: Hybrid Transactional/Analytical Processing. Served here as a system (the row engine plus DuckDB over shared storage), not as one engine doing both.
- LFC: Local File Cache. A large second-tier cache spilling hot pages to local NVMe beneath the in-memory shared-buffer cache, extending the working set that stays off the S3 hot path.
- scale-to-zero: Stopping all compute for an idle database so only object-storage bytes bill at rest; the engine restarts on the next request, paying a cold start.
- branch: A copy-on-write fork of a database created as a new LSN pointer over shared immutable layers; near-zero marginal storage until the branch diverges.
- disaggregated storage: Separating durable storage from compute so state lives over the network on object storage and compute is stateless and disposable.
- group commit: Batching several pending commits into one durable storage round-trip to amortize per-commit network/S3 latency (the lever against W1); it does not add write lanes (W2).
- egress: Data transferred out of object storage, billed per GB on S3/GCS; cache misses on a disaggregated engine are egress, so a cold cache costs more, which is why R2's zero egress matters structurally here.
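The CAS and fencing-token entries above can be made concrete with a small in-memory model. This is a sketch under stated assumptions: `CommitLog`, `acquire`, and `append` are hypothetical names, the log lives in a `Vec` rather than in object storage, and the real mechanism (conditional PUTs against S3 or R2) is only imitated by the compare-before-write checks.

```rust
#[derive(Debug, PartialEq)]
pub enum LogError {
    /// The caller's expected value did not match the current one (CAS failed).
    Conflict,
    /// The caller's fencing token is older than the current writer's.
    Fenced,
}

/// Hypothetical in-memory commit log imitating CAS-style conditional writes.
#[derive(Default)]
pub struct CommitLog {
    epoch: u64,            // current fencing token; 0 means no writer yet
    records: Vec<Vec<u8>>, // record i was committed at LSN i + 1
}

impl CommitLog {
    /// Become the single writer. Succeeds only if the caller observed the
    /// current epoch, and returns the new, larger fencing token.
    pub fn acquire(&mut self, seen_epoch: u64) -> Result<u64, LogError> {
        if seen_epoch != self.epoch {
            return Err(LogError::Conflict);
        }
        self.epoch += 1;
        Ok(self.epoch)
    }

    /// Append one record. Requires a current token and a matching view of the
    /// log tail (`expected_lsn` is the LSN of the last record the caller saw).
    pub fn append(&mut self, token: u64, expected_lsn: u64, record: Vec<u8>) -> Result<u64, LogError> {
        if token != self.epoch {
            return Err(LogError::Fenced);
        }
        if expected_lsn != self.records.len() as u64 {
            return Err(LogError::Conflict);
        }
        self.records.push(record);
        Ok(self.records.len() as u64)
    }
}
```

Once a second writer acquires the log, the first writer's token is stale and every later append from it is rejected, which is the single-writer guarantee the glossary describes; the tail check gives ordered appends without a separate consensus service in this model.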
One-line summary
The whole system in one sentence
Build a Rust engine that exposes a narrow Storage trait, ship it as both a linkable library and a pgwire server, back the trait with an LSM-on-S3 page store plus an S3-CAS commit log, and bind it into Bun via bun:ffi for embedded use or Bun.sql for server use — embeddable and disaggregated stop being contradictory the moment the storage seam is a pluggable backend instead of a server you connect to.
Related specifications
Start with the architecture overview, then follow the rollout order.