Abstract

The design target is one OLTP engine that satisfies three properties usually treated as mutually exclusive:

  • MUST be embeddable — linked in-process as a native library, with function-call latency on the hot path and no socket between the application and the engine.
  • MUST be runnable as a remote server over the same engine code, exposing a wire protocol for multi-client access.
  • MUST ground durable state on disaggregated storage — durability bottoms out on object storage (S3 / Cloudflare R2 / MinIO), reached only on cache miss.

Given those three constraints, the design is optimized simultaneously for scale-to-zero (stateless compute that idles to nothing, with only storage bytes billed at rest), true embeddability (the hot path stays in-process function calls, never a network round-trip), and branching (a database fork is a cheap LSN pointer over shared immutable layers).

Reference shape

"SQLite that forgot where its file was and found it on S3." The engine keeps SQLite's in-process, link-it-anywhere ergonomics, but its storage backend is pluggable and pointed at the network instead of a local file. The closest shipping analog is libSQL / Turso; this specification describes what you would assemble to build your own, owned end-to-end.

  • 1: engine, two delivery modes
  • in-proc: embedded hot path
  • →0: idle compute cost
  • O(1): branch creation

The tension and its resolution

Two of the requirements pull in opposite directions. Embeddability wants the engine to live in the application's address space; disaggregated storage wants durable state to live over the network so compute can be stateless. Classic systems pick one side:

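Requirement      | Implies                                                                                        | Classic system that has it
-----------------|------------------------------------------------------------------------------------------------|---------------------------
Embeddable       | engine = library linked in-process; function-call latency; no protocol between app and engine  | SQLite
Separate storage | durable state lives over the network; compute is stateless and disposable                      | Neon / Aurora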

Resolution — move the seam inward

Keep the engine a library, but make its storage backend pluggable and point it at the network instead of a local file. The seam is a storage interface inside the library, not a server boundary. "Embeddable" and "disaggregated" stop being contradictory the moment the storage seam is a pluggable backend rather than a server you connect to.
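
A minimal Rust sketch of what such a seam might look like. The trait name `Storage`, its two method signatures, and the in-memory backend `MemStorage` are illustrative assumptions, not the project's actual interface; the only parts taken from this overview are the two operations (append WAL, get a page at an LSN). Real backends would report errors rather than return plain values.

```rust
use std::collections::BTreeMap;

/// Hypothetical aliases; the real specification may define richer types.
pub type PageId = u64;
pub type Lsn = u64;

/// Illustrative storage seam: the engine calls only this trait.
pub trait Storage {
    /// Append a record that rewrites `page`; returns the LSN assigned to it.
    fn append(&mut self, page: PageId, bytes: Vec<u8>) -> Lsn;
    /// Return the newest version of `page` whose LSN is <= `lsn`, if any.
    fn get_page(&self, page: PageId, lsn: Lsn) -> Option<Vec<u8>>;
}

/// Toy in-memory backend standing in for LocalFileStorage or ObjectStorage.
#[derive(Default)]
pub struct MemStorage {
    next_lsn: Lsn,
    versions: BTreeMap<(PageId, Lsn), Vec<u8>>,
}

impl Storage for MemStorage {
    fn append(&mut self, page: PageId, bytes: Vec<u8>) -> Lsn {
        self.next_lsn += 1;
        self.versions.insert((page, self.next_lsn), bytes);
        self.next_lsn
    }

    fn get_page(&self, page: PageId, lsn: Lsn) -> Option<Vec<u8>> {
        // Scan (page, 0)..=(page, lsn) and take the highest-LSN entry.
        self.versions
            .range((page, 0)..=(page, lsn))
            .next_back()
            .map(|(_, bytes)| bytes.clone())
    }
}

/// Engine code is written against the trait, never a concrete backend.
pub fn read_latest(storage: &dyn Storage, page: PageId) -> Option<Vec<u8>> {
    storage.get_page(page, Lsn::MAX)
}
```

Because `read_latest` takes `&dyn Storage`, swapping the backend does not touch engine code, which is the property the seam is meant to provide.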

Server mode falls out for free

Server mode is then just "the same library, wrapped in a wire-protocol listener." The engine, the MVCC manager, the cache, and the storage trait are identical in both modes; the only difference is whether the caller reaches the engine through an FFI function call or through a network socket. There is no second engine to maintain.

[Figure: a Bun app (embedded, FFI) and remote clients (server, pgwire, via the wire listener) both reach the ENGINE (library: parser · planner · MVCC · cache), which sits on the Storage trait (seam) over object storage.]

One engine, two front doors: FFI for embedded callers, a wire listener for remote clients. The storage trait is the only new internal protocol.
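
The "two front doors" idea can be sketched in a few lines of Rust. Everything here is hypothetical: the toy `Engine`, its `PUT`/`COUNT` commands, and the newline-terminated text frame are stand-ins chosen for brevity (the frame is not pgwire). The point is only that the server path decodes a request and then calls the same function the embedded caller invokes directly.

```rust
/// Toy engine: the single implementation both front doors share.
pub struct Engine {
    rows: Vec<String>,
}

impl Engine {
    pub fn new() -> Self {
        Engine { rows: Vec::new() }
    }

    /// Embedded front door: a plain function call on the hot path.
    pub fn execute(&mut self, command: &str) -> String {
        match command.split_once(' ') {
            Some(("PUT", value)) => {
                self.rows.push(value.to_string());
                format!("OK {}", self.rows.len())
            }
            _ if command == "COUNT" => self.rows.len().to_string(),
            _ => "ERR unknown command".to_string(),
        }
    }
}

/// Server front door: decode a request frame, then call the same `execute`.
/// A real listener would read frames from a socket; this takes bytes directly.
pub fn handle_frame(engine: &mut Engine, frame: &[u8]) -> Vec<u8> {
    let reply = match std::str::from_utf8(frame) {
        Ok(text) => engine.execute(text.trim_end()),
        Err(_) => "ERR invalid utf-8".to_string(),
    };
    let mut out = reply.into_bytes();
    out.push(b'\n');
    out
}
```

Calling `engine.execute("PUT a")` and calling `handle_frame(&mut engine, b"PUT b\n")` mutate the same state through the same code; only the framing differs.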

The three balanced requirements

The architecture is judged on satisfying these three at the same time, not one at a time. Each is delivered by a specific mechanism, detailed in the linked component specs.

Requirement        | Mechanism | Owned by
-------------------|-----------|---------
Scale-to-zero      | Compute is stateless: all durable state lives in the storage tier. The controller stops the engine when idle; only object-storage bytes bill at rest. Cold start = process start + cache warm. | CTL
True embeddability | A LocalFileStorage backend plus the in-process local cache keep the hot path as function calls, never network calls. The same binary embeds or serves; storage choice is configuration (a connection string), not a rebuild. | CACHE · STOR
Branching          | The page store is versioned by LSN and its layers are immutable, so a branch is a cheap pointer (copy-on-write). Many tools each take a branch off one base with near-zero marginal storage until they diverge. | CTL · OBJ

Scale-to-zero

Because compute holds no durable state, the lifecycle controller may terminate the engine process the moment a database goes idle, with no checkpoint to flush beyond what the WAL already made durable. At rest a database costs only its object-storage bytes — no idle compute bill. The price is paid on the first request after idle: process start plus cache warm. Tuning the idle timeout and an optional keep-warm ping trades that cold-start tail against idle cost; see Lifecycle & Controller and the cold-read measurement in the benchmark plan.
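
The idle-timeout and keep-warm trade-off described above can be sketched as a small policy function. The type `IdlePolicy`, its fields, and the rule that a ping interval shorter than the timeout keeps the engine alive are assumptions made for illustration; the Lifecycle & Controller spec is authoritative for the real behaviour.

```rust
use std::time::Duration;

/// Hypothetical controller policy; names and semantics are illustrative.
pub struct IdlePolicy {
    pub idle_timeout: Duration,
    /// If set, a keep-warm ping arrives at this interval and counts as activity.
    pub keep_warm_every: Option<Duration>,
}

#[derive(Debug, PartialEq)]
pub enum Decision {
    KeepRunning,
    Stop,
}

impl IdlePolicy {
    /// Decide from how long ago the last real request was seen.
    pub fn decide(&self, since_last_request: Duration) -> Decision {
        // A ping more frequent than the timeout means the engine never looks idle.
        if let Some(ping) = self.keep_warm_every {
            if ping < self.idle_timeout {
                return Decision::KeepRunning;
            }
        }
        if since_last_request >= self.idle_timeout {
            Decision::Stop
        } else {
            Decision::KeepRunning
        }
    }
}

/// Cold start as this overview defines it: process start plus cache warm.
pub fn cold_start(process_start: Duration, cache_warm: Duration) -> Duration {
    process_start + cache_warm
}
```

With a keep-warm ping configured, the database pays idle compute instead of the cold-start tail; without one, it pays `cold_start(..)` on the first request after `idle_timeout` elapses.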

True embeddability

Embeddability depends entirely on the local cache. Without it, every read would be a hundreds-of-milliseconds object-storage round-trip and the "in-process" claim would be hollow. The shared-buffer page cache plus a larger local file cache (LFC) spilling to NVMe keep the working set resident, so reads resolve in-process and only cache misses descend to object storage. The pure-embedded path uses LocalFileStorage for zero network at all (dev, offline, edge); the disaggregated path uses ObjectStorage — same code, different connection string.
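
A toy model of that tiered read path, with hash maps standing in for the shared-buffer cache, the LFC, and object storage. The `ReadPath` type, the `Tier` enum, and the promote-on-miss behaviour are assumptions for illustration; eviction policy, page versions, and real I/O are omitted.

```rust
use std::collections::HashMap;

pub type PageId = u64;

/// Which tier satisfied a read. Names are illustrative.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Tier {
    SharedBuffers,
    LocalFileCache,
    ObjectStorage,
}

/// Toy read path: memory first, then the local file cache, then object storage.
pub struct ReadPath {
    shared_buffers: HashMap<PageId, Vec<u8>>,
    local_file_cache: HashMap<PageId, Vec<u8>>, // stands in for NVMe
    object_storage: HashMap<PageId, Vec<u8>>,   // stands in for S3 / R2
    pub object_storage_reads: u64,
}

impl ReadPath {
    pub fn new(object_storage: HashMap<PageId, Vec<u8>>) -> Self {
        ReadPath {
            shared_buffers: HashMap::new(),
            local_file_cache: HashMap::new(),
            object_storage,
            object_storage_reads: 0,
        }
    }

    /// Resolve a page, promoting it into the faster tiers on a miss.
    pub fn read(&mut self, page: PageId) -> Option<(Tier, Vec<u8>)> {
        if let Some(bytes) = self.shared_buffers.get(&page) {
            return Some((Tier::SharedBuffers, bytes.clone()));
        }
        if let Some(bytes) = self.local_file_cache.get(&page).cloned() {
            self.shared_buffers.insert(page, bytes.clone());
            return Some((Tier::LocalFileCache, bytes));
        }
        // Only a miss in both local tiers descends to object storage.
        let bytes = self.object_storage.get(&page)?.clone();
        self.object_storage_reads += 1;
        self.local_file_cache.insert(page, bytes.clone());
        self.shared_buffers.insert(page, bytes.clone());
        Some((Tier::ObjectStorage, bytes))
    }

    /// Simulate memory pressure: drop the page from the in-memory tier only.
    pub fn evict_from_memory(&mut self, page: PageId) {
        self.shared_buffers.remove(&page);
    }
}
```

In this model a page is fetched from object storage at most once; later reads are served from memory, or from the local file cache after an in-memory eviction.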

Branching

The page store materializes pages by (PageId, Lsn) over immutable delta and image layers. A branch is created by allocating a new LSN pointer that shares all existing layers; writes on the branch append new layers without touching the parent. Storage cost grows only with divergence. This is what makes "dozens of internal tools, each on its own branch off one base" affordable — and unlike Postgres's pgvector, an index built into the engine (e.g. HNSW) branches with the database because it is just more pages through the same trait (see Capabilities).
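
A simplified sketch of branch-as-pointer. It deliberately stores full page images only and skips delta replay, layer files, and compaction; `PageStore`, `Branch`, and the method names are hypothetical. What it does show is that creating a branch records a (parent, LSN) pair and copies no page data, and that reads fall back to the parent at or below the fork LSN.

```rust
use std::collections::BTreeMap;

pub type PageId = u64;
pub type Lsn = u64;
pub type BranchId = usize;

struct Branch {
    /// (parent branch, LSN at which this branch forked); None for the root.
    parent: Option<(BranchId, Lsn)>,
    /// Page versions written on this branch only; never modified once inserted.
    versions: BTreeMap<(PageId, Lsn), Vec<u8>>,
    last_lsn: Lsn,
}

#[derive(Default)]
pub struct PageStore {
    branches: Vec<Branch>,
}

impl PageStore {
    pub fn create_root(&mut self) -> BranchId {
        self.branches.push(Branch { parent: None, versions: BTreeMap::new(), last_lsn: 0 });
        self.branches.len() - 1
    }

    /// Fork `parent` at `at_lsn`: records a pointer, copies no page data.
    pub fn branch(&mut self, parent: BranchId, at_lsn: Lsn) -> BranchId {
        self.branches.push(Branch {
            parent: Some((parent, at_lsn)),
            versions: BTreeMap::new(),
            last_lsn: at_lsn,
        });
        self.branches.len() - 1
    }

    /// Append a new page version on one branch; other branches are untouched.
    pub fn write(&mut self, branch: BranchId, page: PageId, bytes: Vec<u8>) -> Lsn {
        let b = &mut self.branches[branch];
        b.last_lsn += 1;
        b.versions.insert((page, b.last_lsn), bytes);
        b.last_lsn
    }

    /// Newest version of `page` visible on `branch` at or below `lsn`.
    pub fn get_page(&self, branch: BranchId, page: PageId, lsn: Lsn) -> Option<&[u8]> {
        let b = &self.branches[branch];
        if let Some((_, bytes)) = b.versions.range((page, 0)..=(page, lsn)).next_back() {
            return Some(bytes.as_slice());
        }
        // Not written on this branch: read the parent as of the fork point.
        let (parent, fork_lsn) = b.parent?;
        self.get_page(parent, page, lsn.min(fork_lsn))
    }

    /// Number of page versions this branch owns (its marginal storage).
    pub fn own_versions(&self, branch: BranchId) -> usize {
        self.branches[branch].versions.len()
    }
}
```

A freshly created branch reports `own_versions == 0` yet reads every parent page as of the fork LSN; its storage grows only as it diverges.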

Layer map — the three slots

The system is three horizontal slots. The pluggable storage interface sits at the bottom of Slot B and is the only genuinely new protocol; everything below it speaks standard object-storage APIs.

SLOT A — Interface
  Embedded:  FFI / native addon  → engine runs inside the host process (Bun, etc.)
  Server:    Postgres wire protocol (pgwire) or HTTP/WS → same engine, network listener
  Pooler (server mode only): PgBouncer / pgcat, transaction mode

SLOT B — Engine (the library)
  SQL parser · planner · executor · MVCC/txn manager
  Local cache (hot pages stay in-process — this is what preserves embeddability)
  >>> Pluggable Storage Interface <<<  (the seam)

SLOT C — Storage (disaggregated)
  Log service:  durable ordered commit log; commit = quorum/CAS ack
  Page store:   LSM layers, versioned by LSN; materializes pages on demand
  Durability floor: Object storage (S3 / R2), reached only on cache miss

Three slots. The storage interface is the seam; the only new protocol (Engine ↔ Storage) is internal.

Boundary                | Protocol                                                    | Notes
------------------------|-------------------------------------------------------------|------
App ↔ Engine (embedded) | direct function calls                                       | no protocol on the hot path
App ↔ Engine (server)   | Postgres wire protocol or HTTP                              | see Server Mode
Engine ↔ Storage        | custom RPC: Append(WAL) for writes, GetPage@LSN for reads   | the only new protocol; internal; see Storage Interface
Storage ↔ Durability    | S3 API (GET / PUT, conditional writes)                      | conditional write = CAS; see Object-Storage Backend
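
To make "conditional write = CAS" concrete, here is a toy in-memory model of a conditional PUT. It does not call any real S3 client; `ObjectStore`, `put_if_match`, and the integer version (standing in for an ETag-style precondition) are all assumptions for illustration. The key `log/head` is likewise hypothetical.

```rust
use std::collections::HashMap;

/// Toy object store with per-key versions, standing in for S3 conditional writes.
#[derive(Default)]
pub struct ObjectStore {
    objects: HashMap<String, (u64, Vec<u8>)>,
}

#[derive(Debug, PartialEq)]
pub enum PutError {
    /// The stored version did not match what the writer expected.
    PreconditionFailed { current: Option<u64> },
}

impl ObjectStore {
    /// Current version of `key`, or None if the key is absent.
    pub fn version(&self, key: &str) -> Option<u64> {
        self.objects.get(key).map(|(v, _)| *v)
    }

    /// Write only if the stored version equals `expected` (None = "must not exist").
    pub fn put_if_match(
        &mut self,
        key: &str,
        expected: Option<u64>,
        value: Vec<u8>,
    ) -> Result<u64, PutError> {
        let current = self.version(key);
        if current != expected {
            return Err(PutError::PreconditionFailed { current });
        }
        let next = current.map_or(1, |v| v + 1);
        self.objects.insert(key.to_string(), (next, value));
        Ok(next)
    }
}
```

If two writers both observe the same version of `log/head` and then both attempt a write, exactly one `put_if_match` succeeds; the loser receives `PreconditionFailed` with the new version and must re-read before retrying. That is the ordering property the commit log relies on.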

Document map

The fourteen component specifications, grouped by concern. Each builds against the seam and the requirements above; follow the cross-links rather than reading start-to-finish unless you want the whole picture.

Architecture — the engine and its storage

Interfaces — how callers reach the engine

Validation — proving the boundary holds

Operations & planning

Recommended reading order & rollout

The build (and the recommended reading order) follows a four-step rollout. Each step is independently shippable and adds exactly one capability; you can stop after any step and have a working system.

  1. Ship the embedded library first. bun:ffi binding over the C ABI, with the LocalFileStorage backend. Fastest path to a working demo, zero infrastructure. Read Engine Core, Storage Interface, and Bun Integration.
  2. Add the ObjectStorage backend. Flip the connection string from file:// to s3://. Now it is disaggregated and scale-to-zero capable, still embedded — same code path. Read Object-Storage Backend and Local Cache.
  3. Add engine-server + pgwire. Remote / server mode for multi-client access and tools that expect Postgres. Read Server Mode & Wire Protocol.
  4. Add the controller. Idle stop and branch-on-LSN deliver scale-to-zero and instant clones for the dozens-of-tools goal. Read Lifecycle & Controller, then validate with the benchmark plan and place each tool with the hot-row strategy.
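
Step 2's "flip the connection string" could be implemented along these lines. The `file://` and `s3://` schemes come from the rollout above; everything else (the `Backend` enum, `parse_backend`, and the `bucket/prefix` layout after `s3://`) is an assumed shape for illustration, not the project's actual configuration format.

```rust
/// Hypothetical result of parsing a storage connection string.
#[derive(Debug, PartialEq)]
pub enum Backend {
    LocalFile { path: String },
    Object { bucket: String, prefix: String },
}

/// Pick a storage backend from the scheme of `conn`.
pub fn parse_backend(conn: &str) -> Result<Backend, String> {
    if let Some(path) = conn.strip_prefix("file://") {
        if path.is_empty() {
            return Err("file:// needs a path".into());
        }
        return Ok(Backend::LocalFile { path: path.to_string() });
    }
    if let Some(rest) = conn.strip_prefix("s3://") {
        // Everything up to the first '/' is the bucket; the remainder is a key prefix.
        let (bucket, prefix) = rest.split_once('/').unwrap_or((rest, ""));
        if bucket.is_empty() {
            return Err("s3:// needs a bucket".into());
        }
        return Ok(Backend::Object {
            bucket: bucket.to_string(),
            prefix: prefix.to_string(),
        });
    }
    Err(format!("unsupported storage scheme in {conn:?}"))
}
```

The engine would construct the matching storage implementation from the returned `Backend`, so moving from step 1 to step 2 changes configuration rather than code.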

First-time reader path

If you are new to the project, read Architecture Overview next for the full picture, then follow the four rollout steps above in order. The remaining specs (Deployment, Capabilities, Roadmap, Risks) are reference material consulted as decisions arise.

Conventions

Requirement keywords (RFC 2119)

Normative requirements use the RFC 2119 keywords, rendered as coloured tags in requirement lists.

  • MUST: an absolute requirement of the specification; equivalently REQUIRED or SHALL. A conforming implementation cannot omit it.
  • SHOULD: a strong recommendation; equivalently RECOMMENDED. May be ignored only with a fully understood, documented reason.
  • MAY: a truly optional item; equivalently OPTIONAL. Implementations interoperate whether or not it is provided.
  • MUST NOT: an absolute prohibition; equivalently SHALL NOT.

Status legend

Every page carries a status badge in its header. The lifecycle is:

  • Draft v0.1 — under active authoring; content may change without notice and may contain open questions.
  • In review — content-complete and circulating for engineering sign-off; comments open.
  • Approved — ratified; changes require a version bump and a changelog entry.

All pages are currently Draft v0.1, dated 2026-06-20.

Inline status pills

Within prose and tables, the deployment quadrant of a workload is flagged with pills: green fits the disaggregated engine, amber needs tuning or measurement, red belongs on coupled Postgres, blue is informational, and build marks a build-it-yourself artifact.

Cross-links and navigation

The left sidebar, the right-hand "On this page" table of contents, heading anchors, code-copy buttons, the theme toggle, and the previous/next links at the foot of each page are all generated by ../assets/app.js from a single canonical manifest (../assets/specs-section.js) — page order lives in exactly one place. Hover any h2 or h3 to reveal its # anchor for deep-linking. Cross-references between specs use the SPEC id (e.g. STOR, OBJ) and appear as the cards in the document map above and in each page's "Related specifications" section. This page is the hub and has no previous link, which is expected.

Glossary

Terms used throughout the specifications. Where a term is the subject of a component spec, the linked page is authoritative.

OLTP
Online Transaction Processing — short, frequent read/write transactions on current data (the workload this engine targets), as opposed to OLAP analytical scans.
MVCC
Multi-Version Concurrency Control — readers see a consistent snapshot without blocking writers, implemented here as snapshot isolation over LSN-stamped row versions.
LSN
Log Sequence Number — a monotonically increasing position in the commit log that stamps every version and identifies the point-in-time a page is materialized at (GetPage@LSN).
WAL
Write-Ahead Log — the ordered record of changes written and made durable before a commit is acknowledged; the engine's write path appends WAL and never acks from an in-memory buffer.
LSM
Log-Structured Merge tree — the storage organization of the page store: buffer writes in memory, flush to immutable on-disk/on-S3 layers, merge via compaction.
delta layer
An immutable LSM layer holding incremental (key, LSN) change records since the last image; replayed over an image to reconstruct a page at a target LSN.
image layer
An immutable LSM layer holding full page snapshots at a given LSN, serving as the base that delta layers are applied on top of.
CAS
Compare-And-Swap — an atomic conditional write (here, an S3 conditional PUT) that succeeds only if the current value matches expectation; the primitive that gives the commit log atomic ordered appends and single-writer fencing without a separate consensus cluster.
PITR
Point-In-Time Recovery — restoring or reading the database as of a chosen LSN/timestamp; the retention window bounds how far back layers are kept before compaction and GC reclaim them.
fencing token
A monotonically increasing token (issued by the commit log's CAS) that lets storage reject writes from a stale or evicted writer, enforcing single-writer-per-database.
FFI
Foreign Function Interface — calling the engine's C ABI directly from a host runtime so it runs in-process; Bun's bun:ffi binds engine.h with no socket.
NAPI
Node-API — a stable native-addon ABI; an alternative engine binding that works in both Bun and Node from one package.
pgwire
The Postgres wire protocol — what engine-server speaks in server mode, so ordinary Postgres clients and tools (and Bun.sql) connect with zero extra dependencies.
pooler
A connection pooler (PgBouncer / pgcat in transaction mode) placed in front of the server to absorb serverless connection bursts.
HNSW
Hierarchical Navigable Small World — the graph-based approximate-nearest-neighbour index used for built-in vector search; its random-access traversal leans hardest of anything on the local cache.
HTAP
Hybrid Transactional/Analytical Processing — served here as a system (the row engine plus DuckDB over shared storage), not as one engine doing both.
LFC
Local File Cache — a large second-tier cache spilling hot pages to local NVMe beneath the in-memory shared-buffer cache, extending the working set that stays off the S3 hot path.
scale-to-zero
Stopping all compute for an idle database so only object-storage bytes bill at rest; the engine restarts on the next request, paying a cold start.
branch
A copy-on-write fork of a database created as a new LSN pointer over shared immutable layers; near-zero marginal storage until the branch diverges.
disaggregated storage
Separating durable storage from compute so state lives over the network on object storage and compute is stateless and disposable.
group commit
Batching several pending commits into one durable storage round-trip to amortize per-commit network/S3 latency (the lever against W1); it does not add write lanes (W2).
egress
Data transferred out of object storage, billed per GB on S3/GCS when it leaves the provider's region or network; cache misses from compute running outside that boundary are egress, so a cold cache costs more, which is why R2's zero egress matters structurally here.
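
As an illustration of the fencing token entry above, the following sketch shows a storage-side check that rejects a stale writer. `FencedLog` and `StaleWriter` are hypothetical names, and how tokens are issued (by the commit log's CAS, per the glossary) is outside this sketch; tokens here are plain integers supplied by the caller.

```rust
/// Toy storage-side guard; the name and shape are illustrative.
#[derive(Default)]
pub struct FencedLog {
    highest_token: u64,
    records: Vec<(u64, Vec<u8>)>,
}

/// Returned when a writer presents a token older than the newest one seen.
#[derive(Debug, PartialEq)]
pub struct StaleWriter {
    pub presented: u64,
    pub current: u64,
}

impl FencedLog {
    /// Accept the record only if `token` is at least the highest token seen so far.
    /// Returns the number of records stored after the append.
    pub fn append(&mut self, token: u64, record: Vec<u8>) -> Result<usize, StaleWriter> {
        if token < self.highest_token {
            return Err(StaleWriter { presented: token, current: self.highest_token });
        }
        self.highest_token = token;
        self.records.push((token, record));
        Ok(self.records.len())
    }
}
```

Once a writer holding token 2 has appended, a delayed write from the evicted writer holding token 1 is refused, which is how single-writer-per-database is enforced in this model.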

One-line summary

The whole system in one sentence

Build a Rust engine that exposes a narrow Storage trait, ship it as both a linkable library and a pgwire server, back the trait with an LSM-on-S3 page store plus an S3-CAS commit log, and bind it into Bun via bun:ffi for embedded use or Bun.sql for server use — embeddable and disaggregated stop being contradictory the moment the storage seam is a pluggable backend instead of a server you connect to.

Related specifications

Start with the architecture overview, then follow the rollout order.

Serverless OLTP Engine — internal development specification. Draft, 2026-06-20. · Author