Purpose & scope

The engine core is Slot B of the layer map: the library that turns SQL text into committed state changes. It owns the SQL frontend (parser, planner, executor), the transaction and concurrency machinery (MVCC under snapshot isolation, single-writer serialization), and WAL generation. It never touches a disk directly — all durable reads and writes go through the Storage trait defined in Storage Interface, which is the architectural seam that lets the same binary run embedded or disaggregated.

One artifact serves both deployment shapes. Embedded, the host process (Bun, etc.) links the library and calls it through the C ABI — function-call latency, no socket. As a server, the identical library is wrapped in a wire-protocol listener (Server Mode). The choice of storage backend is a connection-string scheme, not a recompile.

Note

"Embeddable and disaggregated" stop being contradictory the moment the storage seam is a pluggable backend inside the library rather than a server you connect to. The engine core is the in-process compute; everything network-shaped lives behind the trait.

Language & build outputs

The engine is written in Rust: FFI-friendly, links into anything, no GC pauses on the hot path, and a memory model that lets us reason about the FFI boundary precisely. It compiles to both a dynamic and a static library so hosts can choose linking strategy.

Build outputs

| Artifact | Form | Consumer |
| --- | --- | --- |
| libengine.a | staticlib | Static linking into a host binary / NAPI addon |
| libengine.so / libengine.dylib / libengine.dll | cdylib | Dynamic loading via dlopen / bun:ffi |
| engine.h | C ABI header | The stable boundary every runtime binds to |
| engine-server | binary | The library + a wire-protocol listener (see Server Mode) |

engine.h is the contract. Every binding — bun:ffi, a NAPI addon, the server's own internal calls — goes through the exact same exported symbols. Cargo configuration:

[lib]
crate-type = ["cdylib", "staticlib", "rlib"]
# rlib so engine-server can depend on the engine as a normal Rust crate
# cdylib + staticlib for the FFI consumers

Non-goals

  • MUST NOT implement OLAP / columnar execution in the engine core. The row engine has a row-oriented layout; analytical aggregation belongs to a second engine (DuckDB) reading the shared object store — see Capabilities.
  • MUST NOT implement the storage backend (LSM layers, S3 CAS log, page materialization) here. That lives behind the trait in Storage Interface and Object-Storage Backend.
  • MUST NOT embed the connection pooler, the lifecycle controller, or the wire-protocol parsing logic. Those are Lifecycle and Server Mode concerns layered on top.
  • SHOULD NOT own the page cache implementation internally beyond the buffer-pool interface; the cache hierarchy is specified in Local Cache, though the engine consumes it on the read path.

The C ABI surface

The exported C ABI is the stable boundary. It is deliberately narrow, opaque-handle based, and error-code driven: no Rust types, no panics, and no uncontrolled allocation cross the line. The connection string passed to engine_open selects the storage backend by scheme (file:// → LocalFileStorage, s3:// → ObjectStorage); the engine itself is identical either way.

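/* engine.h: stable C ABI. Functions may be called concurrently on distinct
   handles; a single handle must not be used from more than one thread at a
   time (see the threading model). Opaque handles; callers never see Rust layout. */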

typedef struct EngineHandle EngineHandle;   /* a connection */
typedef struct EngineResult EngineResult;   /* a buffered query result */
typedef struct EngineStmt   EngineStmt;     /* a prepared statement */

typedef enum {
    ENGINE_OK            = 0,
    ENGINE_ERR_SQL       = 1,   /* parse / plan / type error */
    ENGINE_ERR_CONSTRAINT= 2,   /* unique / fk / check violation */
    ENGINE_ERR_CONFLICT  = 3,   /* serialization / write conflict */
    ENGINE_ERR_STORAGE   = 4,   /* backend I/O, CAS rejected, S3 fault */
    ENGINE_ERR_TXN       = 5,   /* illegal state-machine transition */
    ENGINE_ERR_MISUSE    = 6,   /* null handle, use-after-free, bad arg */
    ENGINE_ERR_INTERNAL  = 7    /* bug; engine remains defined, never UB */
} EngineStatus;

/* ---- lifecycle ------------------------------------------------------ */
/* url: "file://./local.db" or "s3://bucket/mydb?region=..." (NUL-terminated).
   Returns NULL on failure; caller then has no handle to query for error. */
EngineHandle* engine_open(const char* url);
void          engine_close(EngineHandle* h);          /* idempotent on NULL */

/* ---- one-shot execution -------------------------------------------- */
/* DDL/DML with no result set. Row count via engine_changes(). */
EngineStatus  engine_exec(EngineHandle* h, const char* sql);

/* Buffered query. *out receives an owned EngineResult on ENGINE_OK,
   else NULL. Result is a JSON document by default; a streaming
   row-iterator is available via the cursor API below. */
EngineStatus  engine_query(EngineHandle* h, const char* sql, EngineResult** out);

/* ---- prepared statements ------------------------------------------- */
EngineStatus  engine_prepare(EngineHandle* h, const char* sql, EngineStmt** out);
/* bind by 1-based positional index; value is a NUL-terminated, typed
   literal encoding (see "Parameter encoding"). */
EngineStatus  engine_bind(EngineStmt* s, int idx, const char* value);
/* step: ENGINE_OK = a row is available; end of rows is signalled via *done
   (there is no separate ENGINE_DONE status code). */
EngineStatus  engine_step(EngineStmt* s, int* done);
EngineStatus  engine_finalize(EngineStmt* s);   /* frees the statement */
EngineStatus  engine_reset(EngineStmt* s);      /* re-execute, keep bindings */

/* ---- transactions -------------------------------------------------- */
EngineStatus  engine_begin(EngineHandle* h);
EngineStatus  engine_commit(EngineHandle* h);    /* blocks until WAL durable */
EngineStatus  engine_rollback(EngineHandle* h);

/* ---- branching ----------------------------------------------------- */
/* Open a copy-on-write branch at the current LSN of h's database.
   Returns a NEW handle on a NEW logical database; NULL on failure. */
EngineHandle* engine_branch(EngineHandle* h, const char* name);

/* ---- result / row access ------------------------------------------- */
int           engine_result_rows(const EngineResult* r);
int           engine_result_cols(const EngineResult* r);
const char*   engine_result_colname(const EngineResult* r, int col);
/* Borrowed pointer into r; valid until engine_result_free(r). */
const char*   engine_result_value(const EngineResult* r, int row, int col);
/* Same for the streaming statement cursor at the current step. */
const char*   engine_column_value(const EngineStmt* s, int col);

/* ---- errors -------------------------------------------------------- */
/* Borrowed, thread-local-per-handle C string describing the LAST error
   on h. Valid until the next call on h. Empty string if none. */
const char*   engine_last_error(EngineHandle* h);
long long     engine_changes(EngineHandle* h);   /* rows affected, last stmt */
long long     engine_last_lsn(EngineHandle* h);  /* commit LSN of last commit */

/* ---- freeing ------------------------------------------------------- */
void          engine_result_free(EngineResult* r);   /* idempotent on NULL */
/* statements are freed by engine_finalize; handles by engine_close.   */

Ownership & lifetime of returned pointers

EngineHandle*
Owned by the caller. Created by engine_open / engine_branch, destroyed only by engine_close. The engine never frees a handle on the caller's behalf.
EngineResult*
Owned by the caller. Must be released exactly once with engine_result_free. All const char* values and column names obtained from it are borrowed — they point into the result's arena and become dangling the instant the result is freed.
EngineStmt*
Owned by the caller; released by engine_finalize. Borrowed column values from engine_column_value are valid only until the next engine_step / engine_reset / engine_finalize on that statement.
const char* (errors, values, colnames)
Always borrowed, never freed by the caller. Each has the lifetime stated above. Callers MUST copy out before the owning object advances or is freed.

Error model — no panics across FFI

  • MUST return an EngineStatus from every fallible function and populate engine_last_error on the handle; functions that return a pointer return NULL on failure.
  • MUST NOT let a Rust panic unwind across the FFI boundary. Every exported function is wrapped in catch_unwind; a caught panic maps to ENGINE_ERR_INTERNAL and leaves the handle in a defined, queryable state — never undefined behaviour.
  • MUST detect misuse (null handle, use-after-free of a finalized statement, double free) and return ENGINE_ERR_MISUSE rather than dereferencing.
  • SHOULD distinguish retryable (ENGINE_ERR_CONFLICT, transient ENGINE_ERR_STORAGE) from terminal errors so bindings can implement retry policy without string-matching messages.
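The panic rule above can be sketched in Rust as follows. This is a minimal illustration, not the engine's actual wrapper: ffi_guard and engine_demo_call are hypothetical names, and the Rust enum simply mirrors the EngineStatus values from engine.h. A real export would also carry the no_mangle attribute, which is omitted here to keep the sketch edition-neutral.

```rust
use std::panic::{self, AssertUnwindSafe};

/// Status codes mirroring the EngineStatus enum in engine.h.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EngineStatus {
    Ok = 0,
    ErrSql = 1,
    ErrConstraint = 2,
    ErrConflict = 3,
    ErrStorage = 4,
    ErrTxn = 5,
    ErrMisuse = 6,
    ErrInternal = 7,
}

/// Hypothetical helper: run `body` and convert any panic into ErrInternal,
/// so that no unwind ever crosses the extern "C" boundary.
pub fn ffi_guard<F>(body: F) -> EngineStatus
where
    F: FnOnce() -> EngineStatus,
{
    match panic::catch_unwind(AssertUnwindSafe(body)) {
        Ok(status) => status,
        Err(_) => EngineStatus::ErrInternal,
    }
}

/// Hypothetical exported function showing the shape of a guarded entry point.
/// A null handle is reported as misuse instead of being dereferenced.
pub extern "C" fn engine_demo_call(handle: *const u8, should_panic: i32) -> EngineStatus {
    ffi_guard(|| {
        if handle.is_null() {
            return EngineStatus::ErrMisuse;
        }
        if should_panic != 0 {
            panic!("simulated internal bug");
        }
        EngineStatus::Ok
    })
}
```

The sketch assumes the default unwinding panic strategy; under panic = "abort" there is nothing to catch and the process terminates instead.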

Parameter encoding (proposed)

To keep the ABI string-only and stable, bound values use a one-character type tag prefix: i int8, f float8, s text, b base64-bytes, n NULL (value ignored), v vector (comma list). Example: i42, shello, n. A future revision MAY add a binary-typed bind path; the string path remains for FFI ergonomics.
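A decoder for the proposed tag encoding might look like the sketch below. The Param type and decode_param function are hypothetical names; mapping int8 to i64 and float8 to f64, parsing vector elements as f64, and leaving base64 text undecoded are all assumptions made for illustration.

```rust
/// A bound value decoded from the proposed one-character tag encoding.
#[derive(Debug, Clone, PartialEq)]
pub enum Param {
    Int(i64),         // tag 'i' (int8, assumed 64-bit)
    Float(f64),       // tag 'f' (float8)
    Text(String),     // tag 's'
    Bytes(String),    // tag 'b'; base64 text kept undecoded in this sketch
    Null,             // tag 'n'; any trailing value is ignored
    Vector(Vec<f64>), // tag 'v'; comma list, element type is an assumption
}

/// Decode one tagged literal such as "i42", "shello", or "n".
pub fn decode_param(raw: &str) -> Result<Param, String> {
    let mut chars = raw.chars();
    let tag = chars.next().ok_or_else(|| "empty parameter".to_string())?;
    let rest = chars.as_str(); // everything after the tag character
    match tag {
        'i' => rest.parse::<i64>().map(Param::Int).map_err(|e| e.to_string()),
        'f' => rest.parse::<f64>().map(Param::Float).map_err(|e| e.to_string()),
        's' => Ok(Param::Text(rest.to_string())),
        'b' => Ok(Param::Bytes(rest.to_string())),
        'n' => Ok(Param::Null),
        'v' => rest
            .split(',')
            .map(|x| x.trim().parse::<f64>().map_err(|e| e.to_string()))
            .collect::<Result<Vec<f64>, String>>()
            .map(Param::Vector),
        other => Err(format!("unknown type tag '{other}'")),
    }
}
```

With these assumptions, the examples from the text decode as Int(42), Text("hello"), and Null. An unknown tag or an empty string yields an error, which a binding would presumably surface as ENGINE_ERR_MISUSE or ENGINE_ERR_SQL.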

Warning

The connection-string scheme is load-bearing for security and behaviour: file:// selects a purely local backend with no network egress; s3:// (and r2://) select the disaggregated backend whose commit pays a network round-trip. The engine MUST reject unknown schemes at engine_open rather than silently defaulting.
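The reject-unknown-schemes rule could be sketched as below. Backend and select_backend are hypothetical names, and matching the scheme case-sensitively is an assumption of this sketch rather than a stated requirement.

```rust
/// Which storage backend a connection string selects.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Backend {
    LocalFile,   // file://          -> LocalFileStorage, no network egress
    ObjectStore, // s3:// or r2://   -> ObjectStorage, commit pays a round-trip
}

/// Map a connection string to a backend. Returns None for a missing or
/// unknown scheme, so the caller (engine_open) can fail instead of defaulting.
pub fn select_backend(url: &str) -> Option<Backend> {
    let (scheme, _rest) = url.split_once("://")?;
    match scheme {
        "file" => Some(Backend::LocalFile),
        "s3" | "r2" => Some(Backend::ObjectStore),
        _ => None, // unknown scheme: reject, never fall back silently
    }
}
```

Returning None rather than a default variant keeps the security property visible in the type: there is no code path on which an unrecognized URL ends up on a backend.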

Execution pipeline

A statement flows through four stages. Each has a single, testable responsibility.

SQL text → parser → logical plan → physical plan → executor.

| Stage | Input → Output | Responsibility |
| --- | --- | --- |
| Parser | SQL text → AST | Lex and parse to a syntax tree; reject malformed SQL with ENGINE_ERR_SQL. No catalog lookups. |
| Logical plan | AST → logical operators | Bind names against the catalog, resolve types, apply rewrite rules (predicate pushdown, constant folding, subquery flattening). Backend-agnostic. |
| Physical plan | logical → physical operators | Choose access methods (B-tree seek vs scan, HNSW probe), join algorithms, and order using catalog statistics. Emits operators that read pages via the buffer pool / Storage trait. |
| Executor | physical → rows / mutations | Volcano/vectorized pull execution; acquires the MVCC snapshot for reads, emits WalRecords for writes, drives commit through the transaction manager. |

  • MUST resolve all object names and types in the logical-plan stage; the executor never re-binds.
  • SHOULD cache prepared-statement physical plans keyed by SQL text + parameter types to amortize parse/plan cost across engine_reset cycles.
  • MAY use vectorized batch execution for read operators; row mutations stay row-at-a-time to keep WAL generation simple.

Concurrency & transactions

The engine provides snapshot isolation through MVCC. Every row version is stamped with the LSN of the transaction that created it (and, on update/delete, the LSN that superseded it). Reads see a consistent snapshot LSN; they never block writers and are never blocked by them.

Readers — snapshot LSN

  • MUST capture a snapshot LSN at the start of each read transaction (or each statement in autocommit) and serve every page at that LSN via get_page(page_id, lsn).
  • MUST make a row version visible to a snapshot iff its creation LSN ≤ snapshot LSN and its supersede LSN (if any) > snapshot LSN.
  • MAY run unlimited concurrent readers; readers do not contend with each other or with the writer.
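The visibility rule above reduces to a small predicate. The sketch below is illustrative only: RowVersion, its field names, and the u64 representation of an LSN are assumptions, not the engine's actual types.

```rust
/// Log sequence number; u64 is an assumption for this sketch.
pub type Lsn = u64;

/// The two MVCC stamps carried by a row version.
pub struct RowVersion {
    pub created: Lsn,            // LSN of the transaction that created it
    pub superseded: Option<Lsn>, // LSN of the update/delete that replaced it
}

/// A version is visible iff it was created at or before the snapshot
/// and has not been superseded at or before the snapshot.
pub fn is_visible(v: &RowVersion, snapshot: Lsn) -> bool {
    v.created <= snapshot && v.superseded.map_or(true, |s| s > snapshot)
}
```

For a version created at LSN 10 and superseded at LSN 20, snapshots 10 through 19 see it; snapshot 9 is too early and snapshot 20 already sees the successor.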

Writers — one per database

There is one writer per database. Write transactions are serialized through a single write lane; this is the SQLite-lineage ceiling, deliberately accepted. It is correct — concurrent writes to the same row must be serialized in any database — and the architecture recovers parallelism through many small databases rather than many writers per database.

  • MUST serialize all write transactions for a database through a single lane; only one transaction holds write intent at a time.
  • MUST assign commit LSNs monotonically; the WAL order is the commit order.
  • SHOULD support group commit — batching independent transactions' WAL records into one durable append to amortize the network handoff (defeats W1, commit latency).
  • MUST NOT rely on batching to add write lanes; lanes come only from sharding into more databases (defeats W2, write serialization).
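One way to picture group commit is as a batching function over the queue of transactions waiting to commit. The sketch below is a simplification under stated assumptions: PendingTxn and next_batch are hypothetical names, arrival times are plain millisecond counters, and a real writer would be driven by a timer rather than by timestamps already in the queue. The two limits correspond to the group_commit_window_ms and group_commit_max_txns knobs.

```rust
use std::collections::VecDeque;

/// A transaction waiting for its WAL records to be appended.
pub struct PendingTxn {
    pub txn_id: u64,
    pub arrived_ms: u64, // when it reached the commit queue
}

/// Take the next batch from the front of `queue`: at most `max_txns`
/// transactions, all of which arrived within `window_ms` of the first one.
/// The whole batch would then go to storage as ONE durable append.
pub fn next_batch(
    queue: &mut VecDeque<PendingTxn>,
    window_ms: u64,
    max_txns: usize,
) -> Vec<PendingTxn> {
    let deadline = match queue.front() {
        Some(first) => first.arrived_ms + window_ms,
        None => return Vec::new(),
    };
    let mut batch = Vec::new();
    while batch.len() < max_txns {
        let ready = queue.front().map_or(false, |t| t.arrived_ms <= deadline);
        if !ready {
            break;
        }
        if let Some(t) = queue.pop_front() {
            batch.push(t);
        }
    }
    batch
}
```

Batching changes only how many commits share one durable append. The queue is still drained in order by a single writer, so it adds no write lanes.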

Note

W1 (commit latency) and W2 (write serialization) are different bottlenecks with different levers: the group-commit queue defeats W1; the many-small-databases design defeats W2. The single irreducible case — many writers contending on the same row in the same database at high rate — is routed to coupled Postgres. See Hot-Row Contention Strategy.

WAL generation

The executor emits write-ahead-log records and hands them to the storage backend; it does not write them itself. Storage::append_wal returns the commit LSN once the records are durably stored, and only then does the engine acknowledge the commit. The engine generates records; storage makes them durable.

WalRecord — conceptual format

/// One logged change. The engine emits these; Storage::append_wal
/// makes them durable and returns the commit LSN of the batch.
struct WalRecord {
    txn_id: TxnId,          // owning transaction
    lsn:    Lsn,            // position in the global log order (assigned on append)
    page_id: PageId,        // page this change targets
    kind:   ChangeKind,     // see below
    payload: ChangePayload, // physical or logical change
}

enum ChangeKind {
    Insert,                 // new row version
    Update,                 // supersede old version, write new
    Delete,                 // tombstone a version
    Commit,                 // commit marker — durability point for the txn
    Abort,                  // explicit rollback marker
}

enum ChangePayload {
    /// Physical redo/undo: byte image of the page slot before & after.
    Physical { before: Option<Bytes>, after: Option<Bytes> },
    /// Logical change: enough to replay against the page store
    /// (e.g. keyed row delta) without a full byte image.
    Logical  { key: RowKey, delta: Bytes },
    /// Marker records (Commit / Abort) carry no page payload.
    Marker,
}

A committing transaction emits its data records followed by a single Commit marker. The marker's durable append is the atomic commit point: until append_wal returns the commit LSN, the transaction is not committed; after it returns, the transaction is durable even across a crash.

Durability rule (restated)

  • MUST obtain the commit LSN from Storage::append_wal — meaning the WAL (including the Commit marker) is durably stored — before acknowledging the commit to the caller.
  • MUST NOT acknowledge a commit from an in-memory buffer. Caching hides read latency completely; it must never hide commit latency, or acked writes can be lost on crash.
  • MUST treat page materialization as asynchronous and subordinate to the WAL: a commit is durable once its WAL record is stored, regardless of whether the page store has yet materialized the new page.
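The ordering these rules impose on the commit path can be sketched as follows. The Storage trait shown here has an assumed signature for append_wal (the real one is defined in Storage Interface), Record is a simplified stand-in for WalRecord, and MockStorage is a test double that is not durable.

```rust
pub type Lsn = u64;

/// Simplified stand-in for WalRecord: data changes plus the Commit marker.
#[derive(Debug, Clone, PartialEq)]
pub enum Record {
    Data(Vec<u8>),
    Commit,
}

#[derive(Debug, PartialEq)]
pub struct StorageError;

/// Assumed shape of the relevant Storage method: it returns the commit LSN
/// only once the records are durably stored.
pub trait Storage {
    fn append_wal(&mut self, records: &[Record]) -> Result<Lsn, StorageError>;
}

/// Commit path: data records, then a single Commit marker, then wait for
/// storage. The caller is acknowledged only with an LSN obtained from storage.
pub fn commit<S: Storage>(storage: &mut S, mut buffered: Vec<Record>) -> Result<Lsn, StorageError> {
    buffered.push(Record::Commit);
    let lsn = storage.append_wal(&buffered)?; // on error: no LSN, no ack
    Ok(lsn)
}

/// In-memory test double; it only simulates success or failure.
pub struct MockStorage {
    pub log: Vec<Record>,
    pub fail: bool,
}

impl Storage for MockStorage {
    fn append_wal(&mut self, records: &[Record]) -> Result<Lsn, StorageError> {
        if self.fail {
            return Err(StorageError);
        }
        self.log.extend_from_slice(records);
        Ok(self.log.len() as Lsn)
    }
}
```

Because commit can only produce an Lsn by going through append_wal, there is no way in this shape to acknowledge from an in-memory buffer; page materialization does not appear on the path at all.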

Danger

A fast commit path that loses an acked write under crash is disqualifying, regardless of latency numbers. This is verified by Experiment 4 in the Benchmark Plan (kill -9 after CAS-append-issued-before-ack, and after-ack-before-materialization). Durability is non-negotiable.

Transaction lifecycle state machine

Each transaction moves through a small, explicit state machine. Illegal transitions return ENGINE_ERR_TXN.

                 begin                first write
   ┌────────┐ ───────────▶ ┌────────┐ ──────────▶ (still Active, holds write lane)
   │  Idle  │              │ Active │
   └────────┘ ◀─────────── └────────┘
        ▲     rollback/abort    │ commit
        │                       ▼
        │                 ┌────────────┐  append_wal returns commit LSN
        │   abort on      │ Committing │ ───────────────────────────────┐
        │   storage fail  └────────────┘                                 ▼
        │                       │                                 ┌────────────┐
        └───────────────────────┴──────────────────────────────▶ │ Committed  │
                                │  failure                        └────────────┘
                                ▼
                          ┌──────────┐
                          │ Aborted  │
                          └──────────┘

Idle → Active → Committing → Committed | Aborted.

| Transition | Trigger | Requires |
| --- | --- | --- |
| Idle → Active | engine_begin (or first statement in autocommit) | Acquire a snapshot LSN; allocate a TxnId. No write lane yet. |
| Active (read) → Active (writer) | first mutating statement | Acquire the single write lane for the database; buffer WalRecords. |
| Active → Committing | engine_commit | Emit the Commit marker; call Storage::append_wal. Transaction is not yet durable. |
| Committing → Committed | append_wal returns a commit LSN | WAL durable. Publish row versions at the commit LSN; release the write lane; ack the caller. |
| Committing → Aborted | storage failure / CAS rejected | No commit LSN obtained; discard buffered records; release the write lane; surface ENGINE_ERR_STORAGE / ENGINE_ERR_CONFLICT. |
| Active → Aborted | engine_rollback, constraint failure, or panic | Discard buffered records; release any held write lane; return to Idle-equivalent (handle reusable). |

  • MUST reject engine_commit / engine_rollback with ENGINE_ERR_TXN when no transaction is Active.
  • MUST release the write lane on entry to Committed or Aborted — never leak it on the error path.
  • MUST NOT publish any row version before the Committing → Committed transition; a snapshot taken mid-commit MUST NOT observe partial state.
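The transitions above can be encoded as a total function that rejects everything not listed. This is a sketch under assumptions: TxnState, TxnEvent, and transition are hypothetical names, the read-to-writer promotion inside Active and write-lane bookkeeping are left out, and Committed / Aborted are treated as terminal for one transaction (the next transaction on the handle starts again from Idle).

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TxnState {
    Idle,
    Active,
    Committing,
    Committed,
    Aborted,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TxnEvent {
    Begin,          // engine_begin, or first statement in autocommit
    Commit,         // engine_commit
    Rollback,       // engine_rollback, constraint failure, or caught panic
    WalDurable,     // append_wal returned a commit LSN
    StorageFailure, // append_wal failed or the CAS was rejected
}

/// Returned for any transition not in the table; maps to ENGINE_ERR_TXN.
#[derive(Debug, PartialEq, Eq)]
pub struct IllegalTransition;

pub fn transition(state: TxnState, event: TxnEvent) -> Result<TxnState, IllegalTransition> {
    use TxnEvent::*;
    use TxnState::*;
    match (state, event) {
        (Idle, Begin) => Ok(Active),
        (Active, Commit) => Ok(Committing),
        (Active, Rollback) => Ok(Aborted),
        (Committing, WalDurable) => Ok(Committed),
        (Committing, StorageFailure) => Ok(Aborted),
        _ => Err(IllegalTransition),
    }
}
```

Keeping the legal transitions in one match makes the first MUST mechanical: engine_commit or engine_rollback from Idle falls through to the catch-all arm.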

Memory & threading model across the FFI boundary

The engine is the only party that allocates and frees engine-owned objects. Callers obtain opaque handles and borrowed pointers; ownership rules above govern who frees what. Across the FFI line, all memory crossing is either (a) a NUL-terminated UTF-8 C string the caller owns and the engine copies, or (b) an opaque handle / borrowed pointer the engine owns.

Thread-safety of a handle

EngineHandle (connection)
Represents a single logical connection with at most one in-flight statement and at most one open transaction. It is not safe to use concurrently from multiple threads. A caller wanting parallelism opens multiple handles.
Multiple handles, same database
Safe and expected. Readers share the snapshot machinery; writers serialize through the one write lane. The engine's internal shared state (catalog, buffer pool, lane) is Send + Sync and synchronized internally.
Branch handles
A handle returned by engine_branch is a fully independent connection on a new logical database; it has its own write lane and may be driven from its own thread.
  • Callers MUST NOT invoke functions concurrently on a single EngineHandle; doing so is misuse, reported as ENGINE_ERR_MISUSE at best. One handle = one thread of execution at a time.
  • MUST make engine-internal shared state (across handles to the same DB) safe for concurrent access without caller-visible global locks beyond the documented single write lane.
  • SHOULD keep engine_last_error scoped per handle so concurrent handles do not clobber each other's error strings.
  • MAY run executor read operators on an internal thread pool, provided no caller-visible handle is touched from more than one thread.

Tip

The single-threaded-per-handle rule maps cleanly onto host runtimes: a Bun process opens one handle per concurrent unit of work, and the WASM target (Cloudflare Workers, single-threaded isolate) gets exactly one handle per request — no shared mutable state to reason about. See Bun Integration and Deployment Targets.

Build-from-existing guidance

Do not start from a blank parser. The SQL frontend and the B-tree access method are enormous, well-trodden surfaces; reuse a proven core and spend the novel effort on the storage seam and the commit path.

| Starting point | Why | Cost |
| --- | --- | --- |
| libSQL (SQLite-compatible, pluggable) | Mature SQL frontend, B-tree, WAL concepts; already designed for a pluggable storage layer; SQLite-lineage single-writer semantics match this engine's model. | Adapt the storage layer to our Storage trait; add LSN-stamped MVCC if not present. |
| LeanStore / Umbra lineage (research OLTP) | Modern buffer-management and MVCC designs aimed at out-of-memory working sets, close to the "object storage is the cold floor" model. | Less batteries-included; more integration and SQL-surface work. |
| DuckDB pattern (reference only) | Proof of a clean embeddable engine + pluggable extension shape, but columnar/OLAP, so a pattern reference, not a base for the row engine. | Not a starting codebase for OLTP; cited for its embedding pattern. OLAP is composed alongside, not built in (see Capabilities). |

  • SHOULD begin from a SQLite-compatible core (libSQL) for fastest time-to-working-demo, given the matching single-writer model.
  • MUST preserve the architectural invariant whichever base is chosen: all durable I/O routes through the Storage trait, never directly to a file or the network.

Configuration

Configuration is supplied via the connection-string query parameters and a small set of open-time options; the engine has no separate config file in embedded mode.

| Knob | Default | Effect |
| --- | --- | --- |
| scheme (URL) | none | Selects backend: file:// local, s3:// / r2:// disaggregated. |
| group_commit_window_ms | 2 | Max time the writer batches commits before forcing a durable append (W1 lever). |
| group_commit_max_txns | 64 | Max transactions per durable WAL batch. |
| buffer_pool_mb | 256 | In-process page cache size; the embeddability floor (see Local Cache). |
| statement_timeout_ms | 0 (off) | Abort a statement exceeding the budget with ENGINE_ERR_TXN. |
| read_snapshot | statement | statement or transaction scoped snapshot LSN capture. |
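Reading these knobs out of the connection string could look like the sketch below. EngineConfig and parse_config are hypothetical names; the defaults are the ones listed in the table. The sketch does no percent-decoding and silently ignores keys it does not recognize (for example backend options such as region), both of which are simplifying assumptions.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct EngineConfig {
    pub group_commit_window_ms: u64,
    pub group_commit_max_txns: u64,
    pub buffer_pool_mb: u64,
    pub statement_timeout_ms: u64, // 0 means off
    pub read_snapshot: String,     // "statement" or "transaction"
}

impl Default for EngineConfig {
    fn default() -> Self {
        EngineConfig {
            group_commit_window_ms: 2,
            group_commit_max_txns: 64,
            buffer_pool_mb: 256,
            statement_timeout_ms: 0,
            read_snapshot: "statement".to_string(),
        }
    }
}

/// Overlay query parameters from the connection string onto the defaults.
pub fn parse_config(url: &str) -> Result<EngineConfig, String> {
    let mut cfg = EngineConfig::default();
    let query = match url.split_once('?') {
        Some((_, q)) => q,
        None => return Ok(cfg), // no query string: defaults only
    };
    for pair in query.split('&').filter(|p| !p.is_empty()) {
        let (key, value) = pair
            .split_once('=')
            .ok_or_else(|| format!("malformed parameter '{pair}'"))?;
        let num = || value.parse::<u64>().map_err(|e| format!("{key}: {e}"));
        match key {
            "group_commit_window_ms" => cfg.group_commit_window_ms = num()?,
            "group_commit_max_txns" => cfg.group_commit_max_txns = num()?,
            "buffer_pool_mb" => cfg.buffer_pool_mb = num()?,
            "statement_timeout_ms" => cfg.statement_timeout_ms = num()?,
            "read_snapshot" => match value {
                "statement" | "transaction" => cfg.read_snapshot = value.to_string(),
                _ => return Err(format!("read_snapshot: invalid value '{value}'")),
            },
            _ => {} // not an engine knob; ignored in this sketch
        }
    }
    Ok(cfg)
}
```

A URL with no query string yields the table's defaults unchanged; a malformed numeric value is an error rather than a silent fallback.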

Failure modes & edge cases

| Condition | Engine behaviour |
| --- | --- |
| Panic inside an exported call | Caught by catch_unwind → ENGINE_ERR_INTERNAL; handle stays defined and queryable. |
| append_wal fails / CAS rejected mid-commit | Committing → Aborted; no commit LSN; buffered records discarded; ENGINE_ERR_STORAGE or ENGINE_ERR_CONFLICT (fencing). |
| Crash after CAS-append issued, before client ack | On restart, WAL replay determines outcome; if the marker is durable the commit is present, else it never happened; no torn state (Exp 4a). |
| Crash after ack, before page materialization | Commit is durable in the WAL; page store materializes lazily on next read; acked write is never lost (Exp 4b). |
| Use-after-free of a result/statement | ENGINE_ERR_MISUSE; no dereference of freed memory. |
| Concurrent calls on one handle | Misuse; not supported. Open multiple handles. |
| Cold read after scale-to-zero | Buffer pool empty; reads hit the backend and pay object-store latency until warm (informs idle-timeout tuning, Exp 5). |
| Unknown URL scheme | engine_open returns NULL; no silent default backend. |
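The two crash rows come down to one replay rule: a transaction's changes count only if its Commit marker is in the durable log. A minimal sketch of that rule follows; Entry and replay are hypothetical names, and changes are plain strings purely for illustration.

```rust
use std::collections::HashSet;

/// Simplified durable log entry, tagged with its owning transaction.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Entry {
    Data { txn: u64, change: String },
    Commit { txn: u64 },
    Abort { txn: u64 },
}

/// Return, in log order, the changes of transactions whose Commit marker
/// is present. Transactions with no marker (crash before the durable append
/// finished) or with an Abort marker contribute nothing.
pub fn replay(log: &[Entry]) -> Vec<String> {
    let committed: HashSet<u64> = log
        .iter()
        .filter_map(|e| match e {
            Entry::Commit { txn } => Some(*txn),
            _ => None,
        })
        .collect();
    log.iter()
        .filter_map(|e| match e {
            Entry::Data { txn, change } if committed.contains(txn) => Some(change.clone()),
            _ => None,
        })
        .collect()
}
```

Whether the client ever received the ack plays no part in the outcome, which is why a crash between the append and the ack cannot produce a torn state.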

Acceptance criteria / definition of done

  • MUST build all four artifacts (libengine.a, the cdylib triplet, engine.h, engine-server) from one Cargo workspace on macOS, Linux, and Windows.
  • MUST open, exec, query, prepare/bind/step/finalize, and begin/commit/rollback against both file:// and s3:// backends with byte-identical SQL behaviour.
  • MUST pass Experiment 4 (crash safety of the commit path) unconditionally — every acked commit present, no torn or half state, no acked-write loss — before any real data.
  • MUST demonstrate snapshot isolation: a long-running reader observes a stable snapshot LSN while a concurrent writer commits new versions.
  • MUST survive a fuzz/loom suite over the FFI boundary with zero panics escaping and zero use-after-free under sanitizers.
  • SHOULD sustain group-commit throughput at least 10× the single-commit ceiling (Exp 1 vs Exp 2) on independent rows.
  • SHOULD create a branch via engine_branch in O(1) — a new LSN pointer, near-zero marginal storage — and write to it without affecting the base.

Open questions & risks

  • MAY the result default be JSON, or should a binary columnar row-iterator be the primary path for large results to avoid serialization overhead across FFI? (Affects Bun Integration ergonomics.)
  • MAY the engine expose explicit cursors over engine_query for streaming large result sets, rather than always buffering an EngineResult?
  • SHOULD resolve whether MVCC garbage collection of old row versions is driven by the engine or coordinated with the page store's compaction + PITR-window GC (see Object-Storage Backend).
  • SHOULD decide how much of the buffer pool the engine owns vs. delegates to Local Cache for the LFC spill-to-NVMe tier.
  • MAY the WAL payload be physical, logical, or hybrid? Physical is simplest for replay; logical is smaller and friendlier to the LSM page store — open for the storage-trait co-design.
  • SHOULD evaluate the WASM build constraints early (single-threaded, no native FS, object store via Worker binding) since they bound the threading model — see Deployment Targets.

Warning

Object-storage-native OLTP is the active frontier (SlateDB, S3-CAS log designs, the libSQL rewrite). The engine core leans on fast-moving foundations with fewer battle-tested guarantees than a coupled Postgres. Track maturity as a first-class risk — see Tradeoffs & Risks.

Serverless OLTP Engine — internal development specification. Draft, 2026-06-20. · Author