Architecture Overview
One Rust engine, three slots, one new internal protocol: an OLTP engine that links into a host process as a library or stands up as a wire-protocol server, while its durable state lives over the network on object storage — embeddable and storage-disaggregated at once because the seam between them is a pluggable Storage interface, not a server boundary.
Purpose & scope
This document is the top-level map of the system. It fixes the three architectural slots, the responsibilities of each, the protocols that cross between them, and the read/write data flows that traverse all three. Every later spec (02–08) owns the interior of exactly one slot or one inter-slot boundary; this page is the contract that holds them together and the place to resolve disagreements about where a responsibility lives.
The governing design tension, and its resolution, is the reason the architecture has the shape it does:
| Requirement | Implies | Classic system |
|---|---|---|
| Embeddable | engine = library linked in-process; function-call latency on the hot path | SQLite |
| Separate storage | durable state lives over the network; compute is stateless | Neon / Aurora |
Resolution. Keep the engine a library, but make its storage backend pluggable and point it at the network instead of at a local file. The seam is a storage interface inside the library, not a server boundary. Server mode is then merely "the same library, wrapped in a wire-protocol listener." Reference shape: "SQLite that forgot where its file was and found it on S3." Closest shipping analog: libSQL/Turso.
Note
This spec describes the architecture, not an implementation status. Everything here is still to be built unless a row explicitly names an existing component to start from (e.g. SlateDB, pgwire, libSQL). The one genuinely new protocol introduced by this design is the internal Engine↔Storage RPC; everything else reuses an existing standard.
Responsibilities & non-goals
In scope for the architecture as a whole
- Define the three slots (Interface, Engine, Storage) and the seam between Engine and Storage.
- Define the four protocols at the slot boundaries and which ones are new.
- Define the canonical read path (GetPage@LSN) and write path (Append(WAL) → durable ack → materialization).
- Define process and trust boundaries for embedded vs server deployment of the same binary.
- Fix the cross-cutting invariants every slot must uphold (durability floor, single-writer-per-DB, embeddability-depends-on-cache).
Explicit non-goals (owned elsewhere or out of band)
- SQL surface, planner, executor and MVCC internals — Engine Core.
- The exact Storage trait signatures and versioning: Storage Interface (this page shows the canonical shape only).
- LSM layer format, compaction, GC, and the S3-CAS commit log: Object-Storage Backend.
- Multi-writer coordination beyond single-writer-per-DB. The architecture caps at one writer per database by design (SQLite-lineage ceiling); a coordination layer is out of scope. See Hot-Row Contention.
- Analytical/columnar execution (HTAP). The row engine stays small; OLAP is composed over shared storage — Capabilities.
System context & the three slots
The system is a vertical stack of three slots. An application sits above Slot A; durable bytes bottom out below Slot C. The hot path lives in Slot B (the local cache); the network is only crossed on a cache miss (reads) or on commit (writes).
The three-slot stack. The dashed box is the seam: the engine calls the Storage interface instead of touching disk. Storage choice (file vs object) is config, not a rebuild.
Slot A — Interface
- Embedded
- FFI / native addon. The engine runs inside the host process (Bun, Node via NAPI, any C-ABI caller). No socket, no wire protocol; the application calls the engine's C ABI directly. See Bun Integration.
- Server
- The same engine wrapped in a network listener speaking the Postgres wire protocol (pgwire) or HTTP/WS. See Server Mode & Wire Protocol.
- Pooler
- Server mode only. PgBouncer / pgcat in transaction mode, fronting the listener to absorb serverless connection bursts. Absent entirely in embedded deployments.
Slot B — Engine (the library)
The engine owns: the SQL parser; logical and physical planning; the executor; the MVCC / transaction manager (snapshot isolation via LSN-stamped versions); WAL generation; the local cache that keeps hot pages in-process; and the pluggable Storage interface — the single most important artifact in the system. The cache is not an optimization here; it is load-bearing for embeddability (see Cross-cutting NFRs). Detail in Engine Core and Local Cache.
Slot C — Storage (disaggregated)
- Log service
- Durable, totally-ordered commit log. A commit is acknowledged only on a quorum/CAS ack. Durability bottoms out on S3 conditional writes (compare-and-swap), giving atomic ordered appends and single-writer fencing without a separate Raft/Paxos cluster.
- Page store
- LSM layers versioned by LSN; materializes pages on demand. In-memory layer → delta layers (key, LSN) → image layers (full page snapshots), with compaction and GC past the PITR window. SlateDB is the model; RocksDB if a local cache tier is wanted underneath.
- Durability floor
- Any S3-compatible object store (AWS S3, Cloudflare R2, MinIO). Reached only on a cache miss for reads, and on commit for writes.
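The page store's "image layer plus delta replay" rule can be illustrated with a minimal in-memory sketch. Everything named here (`PageHistory`, `Delta`, the single-byte delta format) is a hypothetical simplification for illustration, not the SlateDB layer format or the format this design will use.

```rust
use std::collections::BTreeMap;
use std::ops::Bound;

// Hypothetical, simplified types for illustration only.
pub type Lsn = u64;
pub type Page = Vec<u8>;

/// A delta overwrites one byte of the page (a stand-in for a real redo record).
#[derive(Clone, Copy)]
pub struct Delta {
    pub offset: usize,
    pub value: u8,
}

/// Versioned history of one page: full images and deltas, both keyed by LSN.
#[derive(Default)]
pub struct PageHistory {
    images: BTreeMap<Lsn, Page>,
    deltas: BTreeMap<Lsn, Delta>,
}

impl PageHistory {
    pub fn put_image(&mut self, lsn: Lsn, page: Page) {
        self.images.insert(lsn, page);
    }

    pub fn put_delta(&mut self, lsn: Lsn, delta: Delta) {
        self.deltas.insert(lsn, delta);
    }

    /// Materialize the page as of `lsn`: take the newest image at or below
    /// `lsn`, then replay every delta in (image_lsn, lsn] in LSN order.
    pub fn materialize(&self, lsn: Lsn) -> Option<Page> {
        let (&base_lsn, base) = self.images.range(..=lsn).next_back()?;
        let mut page = base.clone();
        let newer = (Bound::Excluded(base_lsn), Bound::Included(lsn));
        for (_, d) in self.deltas.range(newer) {
            // Deltas pointing outside the page are ignored in this sketch.
            if let Some(slot) = page.get_mut(d.offset) {
                *slot = d.value;
            }
        }
        Some(page)
    }
}
```

A request below the oldest image returns `None`, which in this sketch stands in for "version already garbage-collected past the PITR window".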
Protocol matrix
Four boundaries, each with one crossing mechanism. Exactly one of those mechanisms is a new protocol, and it is internal. The rest are direct function calls or existing standards.
| Boundary | Direction | Protocol | Operations | New? |
|---|---|---|---|---|
| App ↔ Engine (embedded) | in-process | Direct function calls over the C ABI (no protocol) | engine_open / engine_query / engine_close | no (reuses C ABI) |
| App ↔ Engine (server) | network | Postgres wire protocol (pgwire) or HTTP/JSON | Simple + extended query, COPY, prepared stmts | no (reuses pgwire) |
| Engine ↔ Storage | internal RPC | Custom internal RPC, the only new protocol | Append(WAL) (writes), GetPage@LSN (reads), Flush | yes (internal) |
| Storage ↔ Durability | network | S3 API | GET / PUT + conditional writes (CAS) | no (reuses S3 API) |
Why this matters
The architecture invents one protocol, and it sits entirely behind the engine's library boundary. Applications never see it; operators never configure it. That containment is what lets the same engine be embedded (App↔Engine is a function call) or served (App↔Engine is pgwire) without changing anything below Slot A.
Engine ↔ Storage — canonical interface shape
The engine calls this narrow trait instead of touching disk. The full, versioned signatures live in Storage Interface; the canonical shape is fixed here so every reader shares one mental model:
/// The seam. The engine knows only this; it never names a file or a bucket.
pub trait Storage: Send + Sync {
    /// Read path. Return the page as of (≤) the given snapshot LSN.
    fn get_page(&self, page_id: PageId, lsn: Lsn) -> Result<Page, StorageError>;

    /// Write path. Durably append WAL records; return the assigned commit LSN.
    /// MUST NOT return Ok until the records are durable (see durability rule).
    fn append_wal(&self, records: &[WalRecord]) -> Result<Lsn, StorageError>;

    /// Force any buffered state to the durability floor.
    fn flush(&self) -> Result<(), StorageError>;
}
- PageId
- Opaque page identifier (database + relation + page number). Stable across versions.
- Lsn
- Monotonic log sequence number. Totally orders all commits in a database; also the version stamp for MVCC reads and the branch pointer for clones.
- WalRecord
- A single redo record produced by the executor for a committing transaction. The unit the log service appends and durably orders.
- Page
- A fixed-size page image materialized from the page store at or before the requested LSN.
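To make the trait's semantics concrete, here is a minimal in-memory test double. The type definitions are stand-ins chosen for this sketch (the real ones belong to the Storage Interface spec), and each `WalRecord` carries a full page image rather than a redo delta purely to keep the example short. "Durable" here only means "stored in the map".

```rust
use std::collections::BTreeMap;
use std::sync::Mutex;

// Stand-in types; assumptions for this sketch only.
pub type PageId = u64;
pub type Lsn = u64;
pub type Page = Vec<u8>;

#[derive(Debug, PartialEq)]
pub enum StorageError {
    PageNotFound,
}

/// Simplification: a record is a full page image, not a redo delta.
pub struct WalRecord {
    pub page_id: PageId,
    pub image: Page,
}

pub trait Storage: Send + Sync {
    fn get_page(&self, page_id: PageId, lsn: Lsn) -> Result<Page, StorageError>;
    fn append_wal(&self, records: &[WalRecord]) -> Result<Lsn, StorageError>;
    fn flush(&self) -> Result<(), StorageError>;
}

#[derive(Default)]
struct Inner {
    last_lsn: Lsn,
    versions: BTreeMap<(PageId, Lsn), Page>,
}

/// In-memory implementation, useful only for exercising the contract.
#[derive(Default)]
pub struct MemStorage {
    inner: Mutex<Inner>,
}

impl Storage for MemStorage {
    fn get_page(&self, page_id: PageId, lsn: Lsn) -> Result<Page, StorageError> {
        let inner = self.inner.lock().unwrap();
        // Newest version of this page at or below the snapshot LSN.
        let found = inner
            .versions
            .range((page_id, 0)..=(page_id, lsn))
            .next_back()
            .map(|(_, page)| page.clone());
        found.ok_or(StorageError::PageNotFound)
    }

    fn append_wal(&self, records: &[WalRecord]) -> Result<Lsn, StorageError> {
        let mut inner = self.inner.lock().unwrap();
        let lsn = inner.last_lsn + 1; // one commit LSN for the whole batch
        for r in records {
            inner.versions.insert((r.page_id, lsn), r.image.clone());
        }
        inner.last_lsn = lsn;
        Ok(lsn)
    }

    fn flush(&self) -> Result<(), StorageError> {
        Ok(()) // nothing is buffered in this sketch
    }
}
```

Reading at an LSN above the latest commit returns the latest version, and reading below a page's first version is an error. Both follow from the "≤ lsn" rule.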
Two reference implementations sit behind this trait and are selected by connection string, not by recompilation:
- LocalFileStorage: a single .db file. Pure embedded, zero network. Dev, offline, edge. (file://./local.db)
- ObjectStorage: the disaggregated backend of Slot C. (s3://bucket/mydb)
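Selecting the backend from the connection string could look like the sketch below. The `Backend` and `ConfigError` enums and the function name are hypothetical; only the `file://` and `s3://<bucket>/<db>` shapes come from this document.

```rust
/// Hypothetical backend selector; variant and field names are illustrative.
#[derive(Debug, PartialEq)]
pub enum Backend {
    LocalFile { path: String },
    Object { bucket: String, db: String },
}

#[derive(Debug, PartialEq)]
pub enum ConfigError {
    UnknownScheme,
    MissingPath,
    MissingBucketOrDb,
}

/// Map a connection string to a backend choice at runtime, with no rebuild.
pub fn parse_connection_string(s: &str) -> Result<Backend, ConfigError> {
    if let Some(path) = s.strip_prefix("file://") {
        if path.is_empty() {
            return Err(ConfigError::MissingPath);
        }
        return Ok(Backend::LocalFile { path: path.to_string() });
    }
    if let Some(rest) = s.strip_prefix("s3://") {
        // Everything before the first '/' is the bucket; the remainder is the db key.
        return match rest.split_once('/') {
            Some((bucket, db)) if !bucket.is_empty() && !db.is_empty() => Ok(Backend::Object {
                bucket: bucket.to_string(),
                db: db.to_string(),
            }),
            _ => Err(ConfigError::MissingBucketOrDb),
        };
    }
    Err(ConfigError::UnknownScheme)
}
```

The engine would then construct the matching `Storage` implementation from the returned value, so nothing above the seam learns which backend is in use.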
Read path data flow
Reads resolve a page as of a snapshot LSN through a cache hierarchy. The network (object storage) is touched only on a full miss. This hierarchy is what keeps S3's hundreds-of-milliseconds latency off the hot path and is the reason the engine stays embeddable.
executor needs GetPage(page_id, lsn)
│
▼
┌───────────────┐ hit ┌──────────────────────────┐
│ shared buffer │ ──────▶ │ return page (in-process) │ ◀── hot path, ~µs
│ (in memory) │ └──────────────────────────┘
└──────┬────────┘
│ miss
▼
┌───────────────┐ hit ┌──────────────────────────┐
│ local file │ ──────▶ │ promote → shared buffer │ ◀── warm, local NVMe
│ cache (LFC) │ └──────────────────────────┘
└──────┬────────┘
│ miss
▼
┌───────────────────────────────────────────────────┐
│ Slot C page store: pick image layer ≤ lsn, │
│ replay delta layers (key, LSN) up to lsn │
└──────┬────────────────────────────────────────────┘
│ layer(s) not resident
▼
┌───────────────────────────────────────────────────┐
│ object storage GET (S3 / R2) ◀── cold, ~100s of ms │
│ materialize page, populate LFC + shared buffer │
└───────────────────────────────────────────────────┘
Read path: shared buffer → local file cache → page-store layers → object storage on miss. Each tier populates the tier above on the way back.
- Executor requests get_page(page_id, lsn), where lsn is the transaction's snapshot.
- Shared buffer (in-memory page cache): a hit returns in-process at function-call latency.
- Local file cache (LFC, local NVMe): a hit promotes the page to the shared buffer.
- Page store resolves the version: select the image layer at or below lsn, then replay delta layers up to lsn.
- On a miss at the layer level, issue an object-storage GET, materialize the page, and populate both caches on the way back.
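The tiering and promotion in the steps above can be sketched as follows. `TieredReader`, `Tier`, and the use of plain hash maps for every tier are assumptions of this example. The snapshot LSN is deliberately left out so the sketch stays focused on which tier answers and what gets populated on the way back.

```rust
use std::collections::HashMap;

pub type PageId = u64;
pub type Page = Vec<u8>;

/// Which tier served the read; useful for hit-rate accounting.
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Tier {
    SharedBuffer,
    LocalFileCache,
    ObjectStorage,
}

/// Hypothetical three-tier reader; each map stands in for a real tier.
#[derive(Default)]
pub struct TieredReader {
    shared_buffer: HashMap<PageId, Page>, // in-memory page cache
    lfc: HashMap<PageId, Page>,           // stands in for the local NVMe cache
    object_store: HashMap<PageId, Page>,  // stands in for S3 / R2
    pub cold_reads: u64,
}

impl TieredReader {
    pub fn seed_object_store(&mut self, id: PageId, page: Page) {
        self.object_store.insert(id, page);
    }

    pub fn evict_shared_buffer(&mut self, id: PageId) {
        self.shared_buffer.remove(&id);
    }

    pub fn get_page(&mut self, id: PageId) -> Option<(Page, Tier)> {
        // Hot path: in-process hit.
        if let Some(p) = self.shared_buffer.get(&id) {
            return Some((p.clone(), Tier::SharedBuffer));
        }
        // Warm path: promote from the local file cache.
        if let Some(p) = self.lfc.get(&id).cloned() {
            self.shared_buffer.insert(id, p.clone());
            return Some((p, Tier::LocalFileCache));
        }
        // Cold path: fetch remotely, then populate both caches.
        let p = self.object_store.get(&id).cloned()?;
        self.cold_reads += 1;
        self.lfc.insert(id, p.clone());
        self.shared_buffer.insert(id, p.clone());
        Some((p, Tier::ObjectStorage))
    }
}
```

Counting `cold_reads` separately is a small design choice that makes the cache-miss rate, and therefore the remote-read volume, directly observable.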
Cache miss = egress (on S3/GCS)
Every cold read is billed egress on S3/GCS, so the cache-miss rate is coupled to the egress bill, and scale-to-zero guarantees a cold cache after every idle period. Cloudflare R2 charges no egress fees, which removes that coupling structurally. See Deployment Targets.
Write path data flow
Writes append WAL records to the durable, ordered commit log; the commit is acked only after those records are durable; page materialization happens later, off the commit path. This ordering is the durability rule made concrete.
COMMIT (txn produces WAL records)
│
▼
┌───────────────────────────────────────────────┐
│ Engine: append_wal(records) │
└──────┬────────────────────────────────────────┘
▼
┌───────────────────────────────────────────────┐
│ Slot C log service: append to ordered commit │
│ log; durability via S3 conditional write (CAS) │ ◀── network round-trip
│ → assign commit LSN, single-writer fencing │
└──────┬────────────────────────────────────────┘
│ durable ack (and only then)
▼
┌───────────────────────────────────────────────┐
│ Engine: return commit LSN to caller ── ACK ──▶ │ ◀── client sees success
└──────┬────────────────────────────────────────┘
│ (asynchronous, off the commit path)
▼
┌───────────────────────────────────────────────┐
│ Page store: apply WAL → delta layers → image │
│ layers; populate cache; GC past PITR window │
└───────────────────────────────────────────────┘
Write path: Append(WAL) → durable CAS ack → client ack → later page materialization. The ack edge is the durability boundary; nothing crosses it before the WAL is durable.
- The committing transaction's redo records are handed to append_wal(records).
- The log service appends them to the totally-ordered commit log; durability bottoms out on an S3 conditional write (CAS), which also assigns the commit LSN and enforces single-writer fencing.
- Only after the durable ack does the engine return the commit LSN to the caller. The client never sees success before the WAL is durable.
- Page materialization (apply WAL → delta layers → image layers, populate caches, GC) runs asynchronously, off the commit path.
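The ordering in the steps above (durable append first, ack second, materialization later) can be expressed as a small sketch. `DurableLog`, `Engine`, and `FakeLog` are hypothetical names. The "asynchronous" step is modeled as an explicit queue that a background task would drain, which keeps the example deterministic.

```rust
use std::collections::VecDeque;

pub type Lsn = u64;
pub type WalRecord = Vec<u8>;

#[derive(Debug, PartialEq)]
pub struct AppendFailed;

/// Hypothetical stand-in for the Slot C log service.
pub trait DurableLog {
    /// Returns the commit LSN only once the records are durable.
    fn append(&mut self, records: &[WalRecord]) -> Result<Lsn, AppendFailed>;
}

pub struct Engine<L: DurableLog> {
    log: L,
    /// Commits already acked to the caller but not yet applied to the page store.
    pub pending_materialization: VecDeque<(Lsn, Vec<WalRecord>)>,
}

impl<L: DurableLog> Engine<L> {
    pub fn new(log: L) -> Self {
        Engine { log, pending_materialization: VecDeque::new() }
    }

    /// The only synchronous step is the durable append. If it fails, the
    /// caller gets an error and nothing is queued or acknowledged.
    pub fn commit(&mut self, records: Vec<WalRecord>) -> Result<Lsn, AppendFailed> {
        let lsn = self.log.append(&records)?;
        self.pending_materialization.push_back((lsn, records));
        Ok(lsn) // this return is the client-visible ack
    }

    /// Background step: apply one acked commit to the page store (elided).
    pub fn materialize_one(&mut self) -> Option<Lsn> {
        self.pending_materialization.pop_front().map(|(lsn, _)| lsn)
    }
}

/// Test double whose durability can be switched off.
pub struct FakeLog {
    pub next_lsn: Lsn,
    pub healthy: bool,
}

impl DurableLog for FakeLog {
    fn append(&mut self, _records: &[WalRecord]) -> Result<Lsn, AppendFailed> {
        if !self.healthy {
            return Err(AppendFailed);
        }
        self.next_lsn += 1;
        Ok(self.next_lsn)
    }
}
```

Because the queue is filled only after a successful append, losing the queue in a crash loses no acked data: it can be rebuilt by replaying the durable log.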
Commit latency is not cacheable
Caching hides read latency completely. It MUST NOT hide commit latency — acking from an in-memory buffer before the WAL is durable produces acked-write loss on crash. Commit pays a network/CAS round-trip (~ms) where a local engine pays an fsync (~µs); the lever for that cost is group commit (batching), not caching. See Benchmark Plan (W1) and Hot-Row Contention.
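Group commit amortizes the round-trip without weakening the durability rule. A minimal single-threaded sketch of the idea follows, with hypothetical names (`CountingLog`, `GroupCommitter`). A real implementation would add a timer or size threshold to decide when to flush, and would wake each waiting committer with its LSN.

```rust
pub type Lsn = u64;
pub type WalRecord = Vec<u8>;

/// Counts durable-append round-trips; a stand-in for the log service.
#[derive(Default)]
pub struct CountingLog {
    pub round_trips: u64,
    pub last_lsn: Lsn,
}

impl CountingLog {
    /// One call models one network/CAS round-trip and assigns one LSN per
    /// transaction in the batch, in queue order.
    pub fn append_batch(&mut self, batch: &[Vec<WalRecord>]) -> Vec<Lsn> {
        self.round_trips += 1;
        let mut lsns = Vec::with_capacity(batch.len());
        for _txn in batch {
            self.last_lsn += 1;
            lsns.push(self.last_lsn);
        }
        lsns
    }
}

#[derive(Default)]
pub struct GroupCommitter {
    queue: Vec<Vec<WalRecord>>,
}

impl GroupCommitter {
    /// Queue a transaction's records; returns its position in the next batch.
    pub fn enqueue(&mut self, records: Vec<WalRecord>) -> usize {
        self.queue.push(records);
        self.queue.len() - 1
    }

    /// Flush every queued transaction with a single durable append. Callers
    /// are acked only with the LSNs returned here, after the append completed.
    pub fn flush(&mut self, log: &mut CountingLog) -> Vec<Lsn> {
        if self.queue.is_empty() {
            return Vec::new(); // no round-trip for an empty batch
        }
        let batch = std::mem::take(&mut self.queue);
        log.append_batch(&batch)
    }
}
```

Three queued transactions cost one round-trip instead of three. Each individual commit still waits for durability, so latency per commit does not drop, only cost per commit.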
Process & trust boundaries
The defining property: embedded and server are the same binary. Storage choice is config (a connection string), not a rebuild. The difference between the two deployments is only whether a network listener exists in front of the engine.
| Aspect | Embedded | Server |
|---|---|---|
| Engine location | inside the host process (Bun, Node, any C-ABI caller) | inside engine-server, a standalone process |
| App ↔ Engine boundary | function call (no socket, no protocol) | TCP socket; pgwire or HTTP/WS |
| Trust boundary | none between app and engine — same address space, same trust domain | network listener; authn/authz at the wire layer; pooler in front |
| Concurrency surface | host-process threads call the C ABI | connections → pooler (txn mode) → engine |
| Failure blast radius | engine fault can take down the host process | engine fault is isolated to the server process |
| Storage backend | file:// or s3:// by connection string | file:// or s3:// by connection string |
Same engine, two boundaries. Embedded has no socket and no trust boundary inside the process; server puts a network listener and pooler in front. Both reach the same storage; the backend is selected by connection string.
Component responsibility matrix
Each slot and boundary is owned by exactly one downstream spec. This page owns the contract between them; it does not own any slot's interior.
| Slot / boundary | Owns | Owning spec |
|---|---|---|
| Slot A — embedded interface | C ABI (engine.h), FFI/NAPI binding ergonomics | 08 · Bun Integration |
| Slot A — server interface | pgwire/HTTP listener, pooler, authn/authz | 07 · Server Mode |
| Slot B — engine core | parser, planner, executor, MVCC/txn, WAL generation | 02 · Engine Core |
| Slot B — local cache | shared buffer + LFC, eviction, warm/cold behavior | 05 · Local Cache |
| Engine ↔ Storage seam | the Storage trait, versioning, error model | 03 · Storage Interface |
| Slot C — page store + log | LSM layers, commit log, S3-CAS, compaction/GC/PITR | 04 · Object-Storage Backend |
| Lifecycle & control | cold-start, scale-to-zero, branch-on-LSN, fencing | 06 · Lifecycle & Controller |
Cross-cutting non-functional requirements
Three invariants bind every slot. They are restated here because no single downstream spec can enforce them alone — they are properties of the whole.
Durability rule
A commit MUST NOT be acknowledged before its WAL records are durably stored. Caching may hide read latency completely; it must never hide commit latency. Violating this yields acked-write loss on crash, which is disqualifying regardless of latency numbers (validated by Experiment 4).
Single-writer-per-DB
Without an added coordination layer, each database admits one writer at a time (a SQLite-lineage ceiling), fenced by the commit log's CAS token. This is excellent for "many small databases / dozens of tools" and wrong for one giant high-concurrency database. The lever for write parallelism is many small DBs (independent lanes), not multi-writer in one DB. See Hot-Row Contention.
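A minimal sketch of how a conditional write can double as a fence, assuming put-if-absent semantics on the object store. The in-memory `ObjectStore`, the `log/<sequence>` key scheme, and `LogWriter` are all hypothetical; the actual commit-log layout is owned by the Object-Storage Backend spec.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for an object store with put-if-absent semantics.
#[derive(Default)]
pub struct ObjectStore {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ObjectStore {
    /// Returns true if the object was created, false if the key already existed.
    pub fn put_if_absent(&mut self, key: &str, body: Vec<u8>) -> bool {
        if self.objects.contains_key(key) {
            return false;
        }
        self.objects.insert(key.to_string(), body);
        true
    }
}

#[derive(Debug, PartialEq)]
pub enum AppendError {
    /// Another writer claimed this log position; this writer must stop.
    Fenced,
}

pub struct LogWriter {
    next_seq: u64,
    fenced: bool,
}

impl LogWriter {
    pub fn open_at(next_seq: u64) -> Self {
        LogWriter { next_seq, fenced: false }
    }

    /// Append one log segment at the next sequence number. Losing the
    /// conditional write means another writer owns the log, so this writer
    /// latches into the fenced state and refuses all further appends.
    pub fn append(&mut self, store: &mut ObjectStore, body: Vec<u8>) -> Result<u64, AppendError> {
        if self.fenced {
            return Err(AppendError::Fenced);
        }
        let key = format!("log/{:020}", self.next_seq);
        if store.put_if_absent(&key, body) {
            let seq = self.next_seq;
            self.next_seq += 1;
            Ok(seq)
        } else {
            self.fenced = true;
            Err(AppendError::Fenced)
        }
    }
}
```

The latch matters: a fenced writer that simply retried at the next sequence number could interleave its records with the new owner's.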
Embeddability depends on the local cache
Embeddability is not a property of the API alone — it depends on the local cache. Without it, every read is a network call and the "in-process function-call latency" promise collapses. The cache is therefore a correctness-adjacent component for the embeddability NFR, not merely a performance tweak. See Local Cache.
Configuration
Architecture-level knobs that select deployment shape and storage. Per-slot knobs live in their own specs; these are the ones that change which slots exist.
- connection_string
- Selects the storage backend. file://<path> → LocalFileStorage; s3://<bucket>/<db> → ObjectStorage. Same code path; no rebuild.
- mode
- embedded (no listener; engine in host process) or server (network listener + optional pooler). Same binary.
- listener
- Server mode only. pgwire | http | ws, plus bind address/port. Ignored in embedded mode.
- idle_timeout
- Lifecycle: how long an idle engine stays up before scale-to-zero teardown. Trades cold-start frequency against at-rest compute cost. Owned by 06.
- group_commit
- Enable/queue parameters for batching the durable-append round-trip (the W1 lever). Owned by 04 / measured in 09.
// Embedded, pure-local: zero network
const local = db.engine_open(Buffer.from("file://./local.db\0"));
// Embedded, disaggregated: same code path, flip the string
const remote = db.engine_open(Buffer.from("s3://bucket/mydb\0"));
// Server: same binary, listener wraps the lib (see 07)
// engine-server --mode=server --listener=pgwire --bind=0.0.0.0:5432 \
//   --connection=s3://bucket/mydb
Failure modes & edge cases
| Failure | Where | Required behaviour |
|---|---|---|
| Crash after CAS-append issued, before client ack | write path | On restart the commit is present and visible (it was durable before any ack would have fired). No torn state. (Exp 4a) |
| Crash after client ack, before page materialization | write path | The acked commit survives; materialization re-derives pages from the durable WAL. No acked-write loss. (Exp 4b) |
| CAS conflict on append (lost the fence) | log service | This process is no longer the single writer; it MUST stop writing and surface a fencing error rather than overwrite. Owned by 06. |
| Object-storage GET miss / transient error on read | read path | Retry with backoff; reads never affect durability. A persistent miss for a live page is a corruption signal, not a normal path. |
| Object-storage latency spike (tail) | read & write | Reads degrade to cold latency; commits pay the tail. Surfaces as p99/p999, not mean. Mitigation is group commit (writes) and cache (reads), never acking early. |
| Cold start after idle (scale-to-zero) | lifecycle | First request pays process-start + cache-warm. Caches start cold after every idle period. Mitigate via keep-warm ping or longer idle_timeout. (Exp 5) |
| Many concurrent writers, same row, same DB | concurrency | Serialized correctly through the single lane — correct, just slow. The irreducible red-quadrant case routes to coupled Postgres. See 10. |
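The "retry with backoff" behaviour required for transient read errors could be sketched like this. The `ReadError` split and the function signature are assumptions of this example. The sleep function is injected so the logic can be exercised without real delays, and a production version would likely add jitter.

```rust
use std::time::Duration;

/// Hypothetical error split: transient errors are retried, others are not.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    Transient,
    NotFound,
}

/// Call `op` until it succeeds, fails with a non-transient error, or has been
/// attempted `max_attempts` times. The delay doubles after each transient
/// failure. `op` is always called at least once.
pub fn retry_with_backoff<T>(
    max_attempts: u32,
    base_delay: Duration,
    mut op: impl FnMut() -> Result<T, ReadError>,
    mut sleep: impl FnMut(Duration),
) -> Result<T, ReadError> {
    let mut delay = base_delay;
    let mut attempt = 1;
    loop {
        match op() {
            Err(ReadError::Transient) if attempt < max_attempts => {
                sleep(delay);
                delay = delay.saturating_mul(2);
                attempt += 1;
            }
            // Success, a non-retryable error, or retries exhausted.
            other => return other,
        }
    }
}
```

`NotFound` is returned immediately rather than retried, matching the rule that a persistent miss for a live page is a corruption signal and not a normal path.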
Behaviour & algorithms — the boundary invariants
The architecture's behaviour reduces to a small set of ordering and versioning invariants that every slot must preserve.
- LSN total order. Every commit in a database receives a strictly monotonic LSN from the log service. The LSN is simultaneously the commit order, the MVCC snapshot stamp (get_page resolves "≤ lsn"), and the branch pointer for clones.
- Durable-before-ack. The only synchronous step on the commit path is the durable append; everything downstream (page materialization) is asynchronous and re-derivable from the WAL.
- Immutable layers. Page-store layers are append-only and versioned by LSN. Because they never mutate in place, a branch is a new LSN pointer over the same shared immutable layers (copy-on-write) — instant clones with near-zero marginal storage until divergence.
- Fencing via CAS. Single-writer-per-DB is enforced by the commit log's compare-and-swap token, not by a separate consensus cluster. Losing the CAS means losing the fence; the loser must stop.
- Cache is a hint, never a source of truth. The shared buffer and LFC may be dropped at any time (scale-to-zero, eviction) without affecting correctness; they only affect latency.
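The immutable-layers invariant is what makes a branch cheap. A minimal sketch, assuming each layer is tagged with the highest LSN it contains and shared by reference counting. `Layer`, `Branch`, and `max_lsn` are hypothetical names, and a real layer would cover an LSN range rather than a single tag.

```rust
use std::sync::Arc;

pub type Lsn = u64;

/// An immutable layer, shared by reference between branches.
pub struct Layer {
    /// Simplification: the highest LSN whose changes this layer contains.
    pub max_lsn: Lsn,
    pub bytes: Vec<u8>,
}

pub struct Branch {
    /// Layers at or below this LSN are inherited from the parent.
    pub branch_point: Lsn,
    pub layers: Vec<Arc<Layer>>,
}

impl Branch {
    pub fn root() -> Self {
        Branch { branch_point: 0, layers: Vec::new() }
    }

    /// Append a new immutable layer; existing layers are never modified.
    pub fn append_layer(&mut self, max_lsn: Lsn, bytes: Vec<u8>) {
        self.layers.push(Arc::new(Layer { max_lsn, bytes }));
    }

    /// Clone at `lsn`: copy pointers to the layers at or below `lsn`.
    /// No layer bytes are copied, so marginal storage is near zero.
    pub fn branch_at(&self, lsn: Lsn) -> Branch {
        Branch {
            branch_point: lsn,
            layers: self
                .layers
                .iter()
                .filter(|l| l.max_lsn <= lsn)
                .cloned()
                .collect(),
        }
    }
}
```

After branching, writes on either side append new layers to that side only; the shared prefix stays untouched, which is the copy-on-write behaviour described above.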
Normative requirements
- MUST The engine MUST reach durable storage only through the Storage interface; no slot above Slot C may name a file path or bucket directly.
- MUST A commit MUST NOT be acknowledged to the caller before its WAL records are durable at the durability floor.
- MUST The same compiled binary MUST support both embedded and server modes; the storage backend MUST be selectable by connection string without a rebuild.
- MUST The Engine↔Storage RPC (Append(WAL), GetPage@LSN, Flush) MUST remain internal and MUST NOT be exposed to applications.
- MUST Writes to a database MUST be serialized to a single writer, fenced by the commit log's CAS token.
- MUST GetPage@LSN MUST return the page version at or before the requested LSN (snapshot-consistent reads).
- MUST NOT Caching MUST NOT be used to hide or shortcut commit latency; an in-memory buffer MUST NOT be the basis of a commit ack.
- SHOULD The embedded hot path SHOULD resolve cache-hit reads as in-process function calls without crossing a socket.
- SHOULD Server mode SHOULD sit behind a transaction-mode pooler to absorb serverless connection bursts.
- SHOULD The write path SHOULD support group commit to amortize the durable-append round-trip across concurrent committers.
- MAY A deployment MAY use LocalFileStorage for fully offline/edge operation with zero network on the data path.
- MAY A deployment MAY ship the engine as a NAPI addon to share one package across Bun and Node.
Acceptance criteria / definition of done
The architecture as a whole is "done" when:
- MUST One binary embeds via FFI and serves via pgwire, with the backend chosen only by connection string (demonstrated end-to-end on both file:// and s3://).
- MUST Crash-injection at both adversarial points (after CAS-append/before ack; after ack/before materialization) shows every acked commit present and no torn state on restart; Experiment 4 passes unconditionally.
- MUST A branch is created as a new LSN pointer over shared immutable layers, with measured marginal storage ≈ 0 until divergence.
- SHOULD Single-commit latency floor (Experiment 1) and the group-commit throughput curve (Experiment 2) are characterized with p50/p99/p999, same-region and cross-region.
- SHOULD Scale-to-zero round-trip (idle teardown → cold start → warm) is measured (Experiment 5) and feeds idle_timeout tuning.
- SHOULD The contention wall (Experiment 3) confirms N separate databases scale as N independent write lanes.
Dependencies / existing pieces to start from
| Slot | Build from scratch? | Start from |
|---|---|---|
| Engine core | no | libSQL (SQLite-compat, pluggable); LeanStore/Umbra lineage (research OLTP) |
| Storage trait | yes (thin) | your own — the key artifact |
| Page store on S3 | no | SlateDB (LSM on object storage); RocksDB for a local tier |
| Commit log | mostly no | S3 conditional-write (CAS) append-log designs; WarpStream-style log-on-S3 lineage |
| Durability | no | S3 / Cloudflare R2 / MinIO |
| Pooler (server) | no | PgBouncer, pgcat |
| Wire protocol | no | pgwire (Rust), jackc/pgproto3 (Go) |
| Bun binding | yes (thin) | bun:ffi over engine.h, or a NAPI addon |
Build outputs that realize the slots: libengine.a / .so / .dylib (embeddable library), engine.h (the stable C ABI boundary), and engine-server (the listener wrapper). Language is Rust — FFI-friendly, links into anything, emits both cdylib and staticlib.
Open questions & risks
- Maturity. Object-storage-native OLTP is the active frontier (SlateDB, S3-CAS log designs, the libSQL rewrite). Fewer battle-tested guarantees than coupled Postgres; expect the page-store and commit-log internals to move.
- Write firehose. Sustained high-rate writes pay the durable-append round-trip per group. Read-heavy/bursty workloads barely notice; sustained writers must be benchmarked explicitly (Exp 1/2) before commitment.
- WASM fork. A WASM target (Cloudflare Workers) is a port, not a recompile: no native fs, single-threaded isolate, storage via the Worker's binding rather than a raw S3 SDK. The engine-form decision precedes everything else for that substrate — see 11.
- Cold-start tail under thundering herd. Many simultaneous cold starts (e.g. 1,000 at once) may saturate spin-up; the per-cold-start measurement (Exp 5) needs an N-concurrent extension.
- Storage-trait stability. The trait is the one fixed point the whole system depends on; getting get_page / append_wal semantics and versioning right early is the highest-leverage decision. Detailed in 03.