Pluggable Storage Interface
A single narrow Rust trait that the engine calls instead of touching disk — the one seam that makes embeddable and storage-disaggregated stop contradicting each other, because the choice between a local file and object storage becomes a backend swap rather than a server boundary.
Purpose & scope
This page specifies trait Storage: the contract every durability backend implements and the only interface the engine core (02) ever uses to read pages or commit writes. It is the seam from Slot B → Slot C of the layer map. Everything above it (parser, planner, executor, MVCC, cache) is backend-agnostic; everything below it (a local file, an LSM-on-S3 page store, an S3-CAS commit log) is reachable only through these methods.
The source design note calls this "the single most important artifact" and "the key artifact" to build from scratch (source §2.B, §6). The decoupling argument rests entirely on this trait being narrow, stable, and rigorously contracted: keep the engine a library, but make its storage backend pluggable and point it at the network instead of a local file. The seam is a storage interface inside the library, not a server boundary. Server mode (07) is then just the same library wrapped in a wire-protocol listener — the storage trait is identical in both forms.
Why this seam resolves the tension
Embeddable implies engine-as-library with function-call latency (SQLite). Separate storage implies durable state over the network with stateless compute (Neon/Aurora). Both hold simultaneously when the storage backend is a trait object: LocalFileStorage gives pure in-process embeddability; ObjectStorage gives disaggregation; the engine above does not know or care which is wired in.
In scope
- MUST The exact method signatures, associated types, and per-method contracts of trait Storage.
- MUST Durability and ordering semantics for the write path (no ack-before-durable).
- MUST Snapshot-read semantics for the read path (version visible at-or-before a requested LSN).
- MUST Single-writer fencing primitives and the URL-scheme backend dispatch.
- MUST A backend-agnostic conformance test suite every implementation passes.
Non-goals
- MUST NOT Specify ObjectStorage internals (LSM layering, compaction, S3-CAS log mechanics) — those live in 04.
- MUST NOT Specify the in-process page cache / LFC — that is 05. The trait is the cache's miss path, not the cache itself.
- MUST NOT Define SQL, MVCC visibility rules, or WAL record encoding beyond the opaque WalRecord handed across the seam (02).
- MUST NOT Add SQL execution, query planning, or transaction-manager logic into a backend. Backends are dumb durable stores.
Responsibilities & the contract surface
A Storage implementation owns exactly five durability concerns and nothing above them:
- Durable ordered log
- Accept WAL records, assign them a monotonic LSN, and make them durable before returning a commit LSN.
- Versioned page materialization
- Return the version of any page as it was visible at-or-before a requested LSN (MVCC read floor).
- Durable commit point
- Expose the latest LSN known to be durable, so the engine can advance its visibility horizon and recover after a crash.
- Single-writer fencing
- Issue, renew, and revoke one writer token via compare-and-swap, so a stale writer cannot corrupt shared state.
- Snapshot / branch pointers
- Create a copy-on-write branch as a new LSN pointer over shared immutable history, and resolve a branch to its base LSN.
The engine is responsible for everything else: it decides what to log, when to commit, and which LSN a transaction reads at. The backend never interprets WAL payloads, never makes visibility decisions, and never holds SQL state.
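The compare-and-swap fencing concern can be pictured with a minimal in-memory sketch. `EpochCell`, `AcquireError`, and the method names are hypothetical and exist only for illustration; a real backend would run the CAS against durable state (a file or an object-store conditional write), not an in-process atomic.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical in-memory stand-in for the durable fence epoch.
pub struct EpochCell {
    current: AtomicU64,
}

#[derive(Debug, PartialEq)]
pub enum AcquireError {
    /// Another writer advanced the epoch first.
    Contended,
}

impl EpochCell {
    pub fn new() -> Self {
        EpochCell { current: AtomicU64::new(0) }
    }

    /// Try to move the epoch from `observed` to `observed + 1`.
    /// Succeeds only if nobody else advanced it in between.
    pub fn try_acquire(&self, observed: u64) -> Result<u64, AcquireError> {
        self.current
            .compare_exchange(observed, observed + 1, Ordering::SeqCst, Ordering::SeqCst)
            .map(|_| observed + 1)
            .map_err(|_| AcquireError::Contended)
    }

    /// A token is valid only while its epoch equals the current one.
    pub fn is_current(&self, epoch: u64) -> bool {
        self.current.load(Ordering::SeqCst) == epoch
    }
}
```

Two writers that both observed epoch 0 cannot both win: the second `try_acquire(0)` fails, and once a later acquire succeeds the earlier epoch is no longer current.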
The source trait, then the realistic v1
The design note (source §2.B) gives the minimal seam — three methods. It is correct as a north star but under-specified for a buildable engine. We reproduce it verbatim, then extend to a v1 trait, justifying every addition. Nothing in v1 contradicts the source; each addition makes an implicit requirement explicit.
Source trait (verbatim, source §2.B)
trait Storage {
    fn get_page(&self, page_id: PageId, lsn: Lsn) -> Page;   // read path
    fn append_wal(&self, records: &[WalRecord]) -> Lsn;      // write path, returns commit LSN
    fn flush(&self) -> Result<()>;
}
What the source trait omits (and why v1 must add it)
| Addition | Why it is forced by the rest of the spec |
|---|---|
| get_pages (batch read) | On ObjectStorage each cache miss is a network round-trip (hundreds of ms). A planner that touches N pages and issues N serial single-page reads pays N round-trips. Batch read lets the backend coalesce, parallelize, and amortize, which is the only way executor reads on a cold cache are affordable (source §2.C, §2.D). |
| durable_lsn() / get_commit_lsn() | The source forbids acking before durable (source §8 durability rule). To recover after a crash and to advance MVCC visibility, the engine must read the latest durable commit point; otherwise it cannot know which acked commits survived (source Experiment 4). |
| Fencing: acquire / renew / release token (CAS) | The architecture is single-writer-per-DB, fenced via the commit log's CAS token (source §2.C, §2.E, §4). A returning zombie writer after a controller restart must be rejected; that requires an explicit fence token on the write path. |
| Branch pointer ops: create_branch(base_lsn) / resolve_branch | Branch = a new LSN pointer over shared immutable layers, copy-on-write → instant clones (source §2.E, §3). The controller (06) needs trait-level branch creation; the read path needs to resolve a branch to a base LSN. |
| GC / PITR hooks | Compaction + GC past the PITR window is a backend duty (source §2.C). The engine must declare the retention floor (oldest LSN any live reader or branch needs) so the backend can safely reclaim older versions without breaking snapshot reads. |
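The round-trip argument for batch reads can be made concrete with a small counting model. This is a sketch under stated assumptions: `CountingStore`, `ReadError`, and the `round_trips` counter are hypothetical, and "one round-trip per batch" is the idealized case; a real backend might split a large batch into several requests.

```rust
use std::cell::Cell;
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub enum ReadError {
    NotFound(u64),
}

/// Hypothetical store that counts simulated network round-trips.
pub struct CountingStore {
    pages: HashMap<u64, Vec<u8>>,
    round_trips: Cell<usize>,
}

impl CountingStore {
    pub fn new(pages: HashMap<u64, Vec<u8>>) -> Self {
        CountingStore { pages, round_trips: Cell::new(0) }
    }

    pub fn round_trips(&self) -> usize {
        self.round_trips.get()
    }

    fn lookup(&self, id: u64) -> Result<Vec<u8>, ReadError> {
        self.pages.get(&id).cloned().ok_or(ReadError::NotFound(id))
    }

    /// Single-page read: one round-trip per call.
    pub fn get_page(&self, id: u64) -> Result<Vec<u8>, ReadError> {
        self.round_trips.set(self.round_trips.get() + 1);
        self.lookup(id)
    }

    /// Batch read: one round-trip for the whole batch (idealized).
    /// Results align 1:1 with `ids`; a missing page fails only its slot.
    pub fn get_pages(&self, ids: &[u64]) -> Vec<Result<Vec<u8>, ReadError>> {
        self.round_trips.set(self.round_trips.get() + 1);
        ids.iter().map(|&id| self.lookup(id)).collect()
    }
}
```

In this model three serial reads cost three round-trips while the equivalent batch costs one, and the two paths return element-wise identical results.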
v1 trait (the buildable contract)
use async_trait::async_trait;

/// The one seam. The engine calls this; it never touches disk directly.
/// All methods are `async` because a backend may be network-bound (S3/R2).
/// `LocalFileStorage` resolves them synchronously under the hood.
#[async_trait]
pub trait Storage: Send + Sync + 'static {
    // ---- Read path -------------------------------------------------------

    /// Page version visible at-or-before `lsn`. MVCC snapshot read floor.
    async fn get_page(&self, page_id: PageId, lsn: Lsn) -> Result<Page, StorageError>;

    /// Batch read: same semantics as `get_page` per id, but the backend MAY
    /// coalesce / parallelize. Result order corresponds to `ids` order.
    async fn get_pages(
        &self,
        ids: &[PageId],
        lsn: Lsn,
    ) -> Result<Vec<Result<Page, StorageError>>, StorageError>;

    // ---- Write path ------------------------------------------------------

    /// Durably append WAL records under a valid fence token. Returns the
    /// commit LSN ONLY after the records are durable. No ack-before-durable.
    async fn append_wal(
        &self,
        token: &FenceToken,
        records: &[WalRecord],
    ) -> Result<Lsn, StorageError>;

    /// Force any buffered durable state fully settled. Idempotent.
    async fn flush(&self) -> Result<(), StorageError>;

    // ---- Durable commit point -------------------------------------------

    /// Latest LSN known to be durable on this backend (recovery + visibility).
    async fn durable_lsn(&self) -> Result<Lsn, StorageError>;

    /// Latest LSN that is durable AND a committed transaction boundary.
    /// Equals the high-water mark a fresh reader may safely read at.
    async fn get_commit_lsn(&self) -> Result<Lsn, StorageError>;

    // ---- Single-writer fencing (CAS token) ------------------------------

    /// Acquire the single-writer token. CAS over the previous epoch; a new
    /// holder strictly increases `epoch`, invalidating all prior tokens.
    async fn acquire_fence(&self, owner: WriterId) -> Result<FenceToken, StorageError>;

    /// Renew (lease extend) an existing token. Fails if superseded.
    async fn renew_fence(&self, token: &FenceToken) -> Result<FenceToken, StorageError>;

    /// Voluntarily relinquish the token so the next writer can acquire fast.
    async fn release_fence(&self, token: FenceToken) -> Result<(), StorageError>;

    // ---- Snapshot / branch (copy-on-write) ------------------------------

    /// Create a branch as a new LSN pointer over shared immutable layers.
    async fn create_branch(
        &self,
        name: &str,
        base_lsn: Lsn,
    ) -> Result<BranchId, StorageError>;

    /// Resolve a branch to (base_lsn, head_lsn). `head == base` for a fresh
    /// branch; head advances as the branch takes its own writes.
    async fn resolve_branch(&self, branch: BranchId) -> Result<BranchRef, StorageError>;

    // ---- GC / PITR hooks -------------------------------------------------

    /// Declare the retention floor: the oldest LSN any live reader or branch
    /// still needs. The backend MAY reclaim versions strictly older than this.
    async fn set_retention_floor(&self, lsn: Lsn) -> Result<(), StorageError>;

    /// Oldest LSN still recoverable (start of the PITR window).
    async fn pitr_floor(&self) -> Result<Lsn, StorageError>;
}
The fence token is on the write path, not ambient
append_wal takes the token as an explicit &FenceToken argument instead of reading it from hidden state. This is deliberate: the signature forces every write to present a token, so a stale writer's append fails through an explicit error path (Fenced) rather than depending on ambient session state. A backend that accepts a WAL append without checking the token against the current epoch is non-conformant (see the conformance suite, §"fencing rejects stale writer").
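A minimal synchronous sketch of the token check on the write path follows. `MemLog`, `Token`, and `AppendError` are simplified, hypothetical stand-ins for the real types, and nothing here is durable; the point is only that the epoch comparison happens before any record is stored.

```rust
#[derive(Debug, PartialEq)]
pub enum AppendError {
    Fenced { held: u64, current: u64 },
}

/// Simplified token: only the epoch matters for this sketch.
pub struct Token {
    pub epoch: u64,
}

/// Hypothetical in-memory log: current fence epoch plus the next LSN.
pub struct MemLog {
    current_epoch: u64,
    next_lsn: u64,
    records: Vec<Vec<u8>>,
}

impl MemLog {
    pub fn new() -> Self {
        MemLog { current_epoch: 0, next_lsn: 1, records: Vec::new() }
    }

    /// A new holder strictly increases the epoch, invalidating prior tokens.
    pub fn acquire(&mut self) -> Token {
        self.current_epoch += 1;
        Token { epoch: self.current_epoch }
    }

    /// The token is an explicit argument: every append is checked against
    /// the current epoch before anything is stored.
    pub fn append(&mut self, token: &Token, records: &[Vec<u8>]) -> Result<u64, AppendError> {
        if token.epoch != self.current_epoch {
            return Err(AppendError::Fenced {
                held: token.epoch,
                current: self.current_epoch,
            });
        }
        self.records.extend_from_slice(records);
        self.next_lsn += records.len() as u64;
        Ok(self.next_lsn - 1) // LSN of the last record in this call
    }
}
```

Once a second writer acquires, the first writer's token is rejected with `Fenced` and the log is left unchanged.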
Associated types
The trait has no generic parameters and no Rust-level associated types; every signature uses only the concrete types below. They are part of the stable ABI surface; changing their wire/byte representation is a breaking change governed by the versioning policy.
use std::time::Instant;

/// Fixed page size in bytes (default; see Configuration).
pub const PAGE_SIZE: usize = 4096;

/// Page identity. Stable for the life of the database. `u64` is ample for
/// addressable pages at the fixed page size; never reused after free.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct PageId(pub u64);

/// Log Sequence Number. STRICTLY MONOTONIC, gap-free per database, never
/// reused. Totally orders every durable mutation and every read snapshot.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Lsn(pub u64);

/// A fixed-size page buffer plus the LSN that produced this version.
/// `PAGE_SIZE` is a compile-time constant (default 4096; see Configuration).
pub struct Page {
    pub id: PageId,
    pub lsn: Lsn, // version stamp: the LSN this image reflects
    pub bytes: Box<[u8; PAGE_SIZE]>,
}

/// One opaque WAL record. The backend does NOT interpret it; it only stores
/// it durably and orders it. Encoding/semantics are owned by Engine Core (02).
pub struct WalRecord {
    pub bytes: bytes::Bytes, // pre-serialized by the engine
}

/// Single-writer fence token. Monotonic `epoch` is the CAS key; any acquire
/// strictly increases it, fencing every prior holder. `lease_until` bounds
/// validity so a crashed holder's token expires.
#[derive(Clone)]
pub struct FenceToken {
    pub epoch: u64,
    pub owner: WriterId,
    pub lease_until: Instant,
}

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct WriterId(pub u128); // unique per engine instance/process

#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct BranchId(pub u64);

pub struct BranchRef {
    pub id: BranchId,
    pub base_lsn: Lsn, // fork point on the parent
    pub head_lsn: Lsn, // current head of this branch
}

/// Error taxonomy. Distinguishes RETRYABLE transients from FATAL faults and,
/// critically, the FENCED case so the engine can step down a stale writer.
#[derive(Debug, thiserror::Error)]
pub enum StorageError {
    /// Page/branch/LSN does not exist (e.g. requested LSN below PITR floor).
    #[error("not found: {0}")]
    NotFound(String),

    /// This writer has been fenced; a newer epoch holds the token. The engine
    /// MUST stop writing immediately and not retry under the same token.
    #[error("fenced: current epoch {current} > held {held}")]
    Fenced { held: u64, current: u64 },

    /// CAS lost a race for the fence (another writer won). Caller MAY back off.
    #[error("contended: fence acquire lost the CAS race")]
    Contended,

    /// Transient backend failure (S3 5xx, throttling, timeout). RETRYABLE
    /// with backoff; carries no durability claim either way (see contracts).
    #[error("transient: {0}")]
    Transient(String),

    /// Durability could NOT be confirmed. The commit MUST be treated as failed;
    /// the engine MUST NOT ack. Never silently coerced to success.
    #[error("durability unconfirmed: {0}")]
    DurabilityUnconfirmed(String),

    /// Detected corruption / checksum mismatch on a materialized page.
    #[error("corruption: {0}")]
    Corruption(String),

    /// Configuration/usage error (bad URL scheme, wrong page size). FATAL.
    #[error("invalid: {0}")]
    Invalid(String),
}
Why the error taxonomy is load-bearing
Fenced vs Contended vs DurabilityUnconfirmed are not cosmetic. Fenced means "stop writing forever under this token"; Contended means "you may try to acquire again"; DurabilityUnconfirmed means "do not ack this commit." Collapsing them into a generic error is what produces acked-write loss (source Experiment 4) — the one disqualifying failure.
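One way to keep the three cases from collapsing is to route each variant to a distinct engine action. The sketch below uses a trimmed local copy of the error enum and a hypothetical `CommitAction` type; the spec does not define how the engine names these actions, so treat the mapping as an illustration of the distinctions above, not as the engine's actual commit path.

```rust
/// Trimmed local copy of the error taxonomy, for illustration only.
#[derive(Debug)]
pub enum StorageError {
    Fenced { held: u64, current: u64 },
    Contended,
    Transient(String),
    DurabilityUnconfirmed(String),
}

/// Hypothetical engine-side reactions to a failed storage call.
#[derive(Debug, PartialEq)]
pub enum CommitAction {
    /// Stop writing under this token for good.
    StepDown,
    /// Back off, then try to acquire the fence again.
    RetryAcquire,
    /// Outcome unknown: do not ack; retry relying on dedup.
    RetryWithDedup,
    /// Do not ack; report the commit as failed.
    FailCommit,
}

/// Map each error variant to exactly one action, with no catch-all arm,
/// so adding a variant forces a decision at compile time.
pub fn on_storage_error(err: &StorageError) -> CommitAction {
    match err {
        StorageError::Fenced { .. } => CommitAction::StepDown,
        StorageError::Contended => CommitAction::RetryAcquire,
        StorageError::Transient(_) => CommitAction::RetryWithDedup,
        StorageError::DurabilityUnconfirmed(_) => CommitAction::FailCommit,
    }
}
```

None of the four arms produces an acknowledgement, which is the property that prevents acked-write loss.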
Per-method contracts
Every method below states preconditions, postconditions, ordering, idempotency, durability, and error cases. These are the normative behaviours the conformance suite checks.
| Method | Pre / Post | Ordering & idempotency | Durability | Key error cases |
|---|---|---|---|---|
| get_page(id, lsn) | Pre: lsn ≥ pitr_floor(). Post: returns the page version with the greatest version-LSN that is ≤ lsn; Page.lsn ≤ lsn always holds. | Read-only; fully idempotent: the same (id, lsn) always yields the same bytes. No ordering effect. | None (read). | NotFound (lsn below PITR floor / unknown page), Corruption, Transient. |
| get_pages(ids, lsn) | As get_page applied per id. Post: result vector aligns 1:1 with ids; per-id errors are reported per element, not as a whole-call failure. | Read-only, idempotent. Backend MAY reorder I/O internally but MUST NOT reorder results. | None (read). | Outer Transient (whole batch); inner per-id NotFound / Corruption. |
| append_wal(token, records) | Pre: token is the current fence epoch. Post: on Ok(lsn), every record is durable and assigned consecutive LSNs ending at the returned commit LSN; durable_lsn() ≥ lsn thereafter. | Strictly serialized per database. Returned LSNs are monotonically increasing across calls. NOT idempotent by default: a retried append after an unknown outcome MAY double-apply; the engine MUST dedup via WAL record identity (see Failure modes). | MUST be durable before returning. No ack-before-durable, ever (source §8). | Fenced (stale token → engine steps down), DurabilityUnconfirmed (do not ack), Transient (retryable, outcome unknown). |
| flush() | Post: all previously-acked durable state is fully settled to the durability floor. | Idempotent. A no-op if nothing is pending. Safe to call repeatedly. | Strengthens, never weakens, durability. | Transient, DurabilityUnconfirmed. |
| durable_lsn() | Post: returns an LSN L such that everything at-or-below L is durable. Monotonically non-decreasing across calls (never goes backward). | Read-only, idempotent. | Reports durability; does not change it. | Transient. |
| get_commit_lsn() | Post: greatest durable LSN that is also a committed transaction boundary; the safe read snapshot for a fresh reader. | Read-only, idempotent, monotonic non-decreasing. | Reports durability. | Transient. |
| acquire_fence(owner) | Post: on Ok, returns a token whose epoch is strictly greater than every previously issued epoch; all prior tokens are now invalid. | CAS over previous epoch; total order on epochs. Not idempotent (each success burns an epoch). | Token state itself is durable (CAS on the floor). | Contended (lost CAS; caller may retry), Transient. |
| renew_fence(token) | Pre: token still current. Post: extended lease_until, same epoch. | Idempotent within a lease window. | Durable lease extension. | Fenced (superseded), Transient. |
| release_fence(token) | Post: token retired; next acquire_fence succeeds without waiting for lease expiry. | Idempotent (releasing an already-retired token is Ok). | Durable. | Transient. |
| create_branch(name, base_lsn) | Pre: base_lsn ≥ pitr_floor(). Post: new BranchId with head_lsn == base_lsn, sharing all immutable layers (copy-on-write, near-zero marginal storage). | Not idempotent (each call mints a branch). Creation MUST NOT mutate the base. | Branch metadata durable on success. | NotFound (base below PITR floor), Transient. |
| resolve_branch(branch) | Post: returns {base_lsn, head_lsn} for the branch. | Read-only, idempotent. | None. | NotFound (unknown branch). |
| set_retention_floor(lsn) | Pre: no live reader or branch needs an LSN below lsn. Post: backend MAY GC versions strictly older than lsn; pitr_floor() may rise toward lsn. | Idempotent (advisory floor). MUST only move forward; a lower value than the current floor is rejected. | Eventually frees durable bytes; never affects acked-and-still-needed data. | Invalid (floor moved backward). |
| pitr_floor() | Post: oldest still-recoverable LSN; reads at-or-above it are guaranteed valid. | Read-only, idempotent, monotonic non-decreasing. | Reports. | Transient. |
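The interaction between set_retention_floor and snapshot reads has one subtlety worth sketching: a read exactly at the floor still needs the newest version at or below the floor, even when that version's own LSN is older than the floor. The sketch below assumes a backend that keeps that version; `Versions` and `gc_below_floor` are hypothetical names, and the per-page `BTreeMap` is only a model of versioned storage, not the layer format.

```rust
use std::collections::BTreeMap;

/// Versions of one page, keyed by the LSN that produced each image.
pub type Versions = BTreeMap<u64, Vec<u8>>;

/// Drop versions that no snapshot at or above `floor` can observe.
/// Keeps the newest version at or below `floor` (it is still the answer
/// for a read at `floor`) and everything newer. Returns how many versions
/// were reclaimed.
pub fn gc_below_floor(versions: &mut Versions, floor: u64) -> usize {
    // Newest version LSN that a read at `floor` would return, if any.
    let keep_from = match versions.range(..=floor).next_back() {
        Some((&lsn, _)) => lsn,
        None => return 0, // nothing at or below the floor: nothing to reclaim
    };
    let before = versions.len();
    // `split_off` returns the entries with key >= keep_from; what remains
    // in the original map is exactly the reclaimable garbage.
    let kept = versions.split_off(&keep_from);
    *versions = kept;
    before - versions.len()
}
```

With versions at LSNs 10, 20, and 30 and a floor of 25, only the LSN 10 image is reclaimed; the LSN 20 image survives because a read at 25 resolves to it.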
The two invariants that override everything
- MUST append_wal MUST be durable before it returns the commit LSN. Never ack a commit from an in-memory buffer before the WAL is durably stored (source §8 durability rule). A fast commit path that loses an acked write under crash is disqualifying regardless of latency (source Experiment 4 gate).
- MUST get_page MUST return the version visible at-or-before the requested LSN. The returned Page.lsn ≤ requested lsn, and it is the greatest such version. This is the MVCC read floor; violating it breaks snapshot isolation.
- MUST Caching may hide read latency completely; it MUST NEVER hide commit latency. The cache (05) sits above this trait on the read path only.
Read returns the at-or-before-LSN version; the commit ack crosses back only after the WAL is durable on the floor.
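The at-or-before rule reduces to "greatest key less than or equal to the requested LSN", which an ordered map expresses directly. The sketch below is an in-memory model under that assumption; `PageVersions` and its methods are hypothetical names, not part of the trait.

```rust
use std::collections::BTreeMap;

/// Hypothetical per-page version store keyed by version LSN.
pub struct PageVersions {
    versions: BTreeMap<u64, Vec<u8>>,
}

impl PageVersions {
    pub fn new() -> Self {
        PageVersions { versions: BTreeMap::new() }
    }

    /// Record the page image produced at `lsn`.
    pub fn write(&mut self, lsn: u64, bytes: Vec<u8>) {
        self.versions.insert(lsn, bytes);
    }

    /// Greatest version with version-LSN <= `lsn`, returned together with
    /// its version stamp. `None` if the page had no version yet at `lsn`.
    pub fn read_at(&self, lsn: u64) -> Option<(u64, &[u8])> {
        self.versions
            .range(..=lsn)
            .next_back()
            .map(|(&version_lsn, bytes)| (version_lsn, bytes.as_slice()))
    }
}
```

A read at an LSN that falls between two versions returns the lower one, and the returned stamp never exceeds the requested LSN.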
Backend selection by URL scheme
The backend is chosen by the connection string passed at open time — the same engine code path in both cases (source §2.B, §5). The host runtime never changes; only the URL changes.
use url::Url; // assumes `Url` is the `url` crate's type

/// Dispatch a storage URL to a concrete backend. Called once per open.
pub fn open_storage(url: &str) -> Result<Box<dyn Storage>, StorageError> {
    match Url::parse(url).map_err(|e| StorageError::Invalid(e.to_string()))?.scheme() {
        // Pure embedded: a single .db file. Zero network. Dev / offline / edge.
        "file" => Ok(Box::new(LocalFileStorage::open(url)?)),
        // Disaggregated: LSM-on-S3 page store + S3-CAS commit log (spec 04).
        "s3" | "r2" | "gs" => Ok(Box::new(ObjectStorage::open(url)?)),
        other => Err(StorageError::Invalid(format!("unknown scheme: {other}"))),
    }
}
| Scheme | Backend | Shape | When |
|---|---|---|---|
| file:// | LocalFileStorage | Pure embedded, zero network, in-process function calls | Dev, offline, edge-local, fastest demo (source §5 rollout step 1) |
| s3:// · r2:// · gs:// | ObjectStorage | Disaggregated: LSM layers + S3-CAS log on object storage | Scale-to-zero, branching, multi-substrate (source §2.C, §10) |
The Bun open call is one string
engine_open("file://./local.db") versus engine_open("s3://bucket/mydb") is the entire difference between pure-embedded and disaggregated (source §5). The trait makes "storage choice is config, not a rebuild" literally true. See 08.
Reference implementations
LocalFileStorage — a single .db file
The pure-embedded backend. Specified here in full because it has no separate page (its internals are simple); ObjectStorage internals are deferred to 04.
- MUST Persist all state in a single .db file (source §2.B). Pages, WAL, fence epoch, branch metadata, and PITR floor live in that one file.
- MUST Make append_wal durable via fsync (or F_FULLFSYNC on macOS) before returning. A local flush typically costs microseconds to a few milliseconds depending on the device, far less than the network round-trip ObjectStorage pays.
- MUST Implement fencing with an in-file epoch CAS (advisory lock + epoch counter) so it passes the identical conformance suite, even though a single local process is the common case.
- SHOULD Implement branches as additional LSN-pointer roots within the same file, sharing immutable page versions (copy-on-write within the file).
- MAY Use a libSQL/SQLite-compatible on-disk file layout to reuse a battle-tested format (source §2.A, §6).
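The "branches as additional LSN-pointer roots" idea can be sketched as a table of pointer pairs. `BranchTable` and `advance_head` are hypothetical names and the state is in memory only; the sketch shows that creating a branch writes metadata and copies no page data.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BranchRef {
    pub base_lsn: u64, // fork point on the parent
    pub head_lsn: u64, // current head of this branch
}

/// Hypothetical branch table: a branch is only a pair of LSN pointers.
pub struct BranchTable {
    next_id: u64,
    branches: HashMap<u64, BranchRef>,
}

impl BranchTable {
    pub fn new() -> Self {
        BranchTable { next_id: 1, branches: HashMap::new() }
    }

    /// Creating a branch records a pointer; no page data is copied.
    pub fn create_branch(&mut self, base_lsn: u64) -> u64 {
        let id = self.next_id;
        self.next_id += 1;
        self.branches.insert(id, BranchRef { base_lsn, head_lsn: base_lsn });
        id
    }

    pub fn resolve_branch(&self, id: u64) -> Option<BranchRef> {
        self.branches.get(&id).copied()
    }

    /// A write on the branch advances only that branch's head.
    /// Returns false for an unknown branch or a non-advancing head.
    pub fn advance_head(&mut self, id: u64, new_head: u64) -> bool {
        if let Some(branch) = self.branches.get_mut(&id) {
            if new_head > branch.head_lsn {
                branch.head_lsn = new_head;
                return true;
            }
        }
        false
    }
}
```

Two branches forked at the same base stay independent: advancing one head leaves the other branch and the shared base untouched.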
ObjectStorage — the disaggregated backend (brief)
Backs the trait with an LSM page store written to object storage plus an ordered commit log bottoming out on S3 conditional writes (CAS), giving atomic ordered appends and single-writer fencing without a separate Raft/Paxos cluster (source §2.C). It is the realization of acquire_fence (the CAS token) and create_branch (LSN pointer over immutable layers). Full internals — layer format, compaction, GC past the PITR window, the S3-CAS log mechanics, R2 zero-egress reads — are specified in 04. From the engine's side it is just another Storage.
Conformance test suite
A backend-agnostic test battery. Any Storage implementation MUST pass the entire suite to be considered conformant. This is what makes a new backend trustworthy: it is run identically against LocalFileStorage, ObjectStorage (incl. MinIO/R2 targets), and any future backend. The suite is parametric over a backend factory, a closure that returns a future resolving to a Box<dyn Storage>.
/// Run the full conformance battery against any backend factory.
/// Every assertion is normative; a failing backend is non-conformant.
pub async fn run_conformance(make: impl Fn() -> BoxFut<Box<dyn Storage>>) {
    durability_after_ack(&make).await;          // C1
    monotonic_lsn(&make).await;                 // C2
    snapshot_read_correctness(&make).await;     // C3
    fencing_rejects_stale_writer(&make).await;  // C4
    crash_consistency_hooks(&make).await;       // C5
    batch_read_equivalence(&make).await;        // C6
    branch_isolation(&make).await;              // C7
    retention_safety(&make).await;              // C8
}
| Test | What it proves | Assertion |
|---|---|---|
| C1 · durability after ack | No ack-before-durable. | After append_wal returns Ok(lsn), drop and re-open the backend; the records at lsn MUST be present and readable. Inject a crash after CAS-append but before ack and after ack but before page materialization (source Experiment 4 (a)/(b)) — every acked commit survives; no torn/half state; no acked-write loss. |
| C2 · monotonic LSN | LSN total order. | Across many concurrent and sequential append_wal calls, returned LSNs are strictly increasing and gap-free; durable_lsn() and get_commit_lsn() never decrease. |
| C3 · snapshot read correctness | MVCC read floor. | Write versions of a page at L1 < L2 < L3; get_page(id, L2) returns the L2 image (not L3); get_page(id, L1) returns the L1 image; a read at an LSN between two versions returns the lower one. Page.lsn ≤ requested always. |
| C4 · fencing rejects stale writer | Single-writer safety. | Writer A acquires (epoch e). Writer B acquires (epoch e+1, fencing A). A's append_wal with its old token MUST fail with Fenced — not silently succeed, not corrupt state. B's append succeeds. |
| C5 · crash-consistency hooks | Deterministic recovery. | Driven by seeded crash injection (and loom / simulation testing, source Experiment 4): kill at adversarial points; on restart, recovered durable_lsn() ≥ every acked commit LSN, and replay is deterministic. |
| C6 · batch read equivalence | get_pages == N × get_page. | For any ids, get_pages(ids, lsn) returns element-wise identical results to calling get_page per id, in order — batching only changes I/O scheduling, never results. |
| C7 · branch isolation | Copy-on-write branches. | Branch B off base L0; writes on B advance B's head only; the parent at L0 is unchanged; reading B at its base reflects parent state at L0. Creating a branch does not mutate the base. |
| C8 · retention safety | GC never breaks live reads. | With a live reader/branch needing L_old, set_retention_floor(L > L_old) either is rejected or does not reclaim L_old; reads at-or-above pitr_floor() always succeed. |
C1 and C4 are unconditional gates
A backend that fails C1 (durability after ack) or C4 (fencing) is rejected outright, no matter how fast it is. Durability is non-negotiable (source Experiment 4 gate / §8 decision rule 3). Hook these into the benchmark plan's fault-injection harness (09).
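As an illustration of how small an individual assertion can be, here is a sketch of the check at the heart of C2. It assumes every append in the test carries exactly one record, so consecutive commit LSNs must differ by exactly one; `check_gap_free` is a hypothetical helper name, not part of the suite as specified.

```rust
/// Verify that a sequence of commit LSNs is strictly increasing and
/// gap-free, assuming each append carried exactly one record.
pub fn check_gap_free(lsns: &[u64]) -> Result<(), String> {
    for pair in lsns.windows(2) {
        let (prev, next) = (pair[0], pair[1]);
        if next <= prev {
            return Err(format!("not strictly increasing: {prev} then {next}"));
        }
        if next != prev + 1 {
            return Err(format!("gap: {prev} then {next}"));
        }
    }
    Ok(())
}
```

A test would collect the LSNs returned by successive append_wal calls and pass them to this check; multi-record appends would need the per-call record counts as well.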
Configuration
| Knob | Type | Default | Meaning |
|---|---|---|---|
| PAGE_SIZE | const usize | 4096 | Fixed page size; compile-time constant. Backends MUST agree; a mismatch on open is Invalid. |
| storage_url | string | (required) | Selects the backend by scheme (§Backend selection). |
| fence_lease_ms | u64 | 10000 | Writer lease duration; must be renewed before expiry or the token lapses. |
| fence_renew_ms | u64 | 3000 | Renewal interval (< lease, with margin for a network round-trip). |
| pitr_window | duration | 7d | How far back point-in-time recovery / branches may fork; drives GC floor. |
| group_commit_window_ms | u64 | 2 | Hint: how long append_wal may batch concurrent commits to amortize the durable round-trip (W1 mitigation, source §8). Batching never changes durability semantics. |
| batch_read_max | usize | 256 | Max page ids per get_pages coalesced request. |
Group commit belongs above the durability line
group_commit_window_ms batches the expensive durable handoffs (defeats W1 latency), but it MUST NOT ack any commit before its batch is durable. Group commit defeats commit latency; it never adds write lanes (W2) — that is the many-small-DBs lever (source §8, 10).
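A synchronous sketch of the ordering rule: commits are queued, one durable write covers the batch, and commit LSNs are handed back only after that write succeeds. `DurableSink`, `GroupCommitter`, and `CountingSink` are hypothetical; the time window itself is omitted, and a failed write here models a definite failure (an unknown outcome needs the dedup discussed under failure modes).

```rust
/// Hypothetical durable sink: returns Ok only once the batch is durable.
pub trait DurableSink {
    fn write_durable(&mut self, batch: &[Vec<u8>]) -> Result<(), String>;
}

pub struct GroupCommitter<S: DurableSink> {
    pub sink: S,
    pending: Vec<Vec<u8>>,
    next_lsn: u64,
}

impl<S: DurableSink> GroupCommitter<S> {
    pub fn new(sink: S) -> Self {
        GroupCommitter { sink, pending: Vec::new(), next_lsn: 1 }
    }

    /// Queue a commit. Nothing is acknowledged here.
    pub fn submit(&mut self, record: Vec<u8>) {
        self.pending.push(record);
    }

    /// One durable round-trip for the whole batch. LSNs are issued only
    /// after the sink confirms durability; on failure the batch stays
    /// pending and no LSN is handed out.
    pub fn flush_batch(&mut self) -> Result<Vec<u64>, String> {
        if self.pending.is_empty() {
            return Ok(Vec::new());
        }
        self.sink.write_durable(&self.pending)?;
        let first = self.next_lsn;
        let count = self.pending.len() as u64;
        self.pending.clear();
        self.next_lsn += count;
        Ok((first..first + count).collect())
    }
}

/// Test sink that counts durable writes and can inject a failure.
pub struct CountingSink {
    pub writes: usize,
    pub fail: bool,
}

impl DurableSink for CountingSink {
    fn write_durable(&mut self, _batch: &[Vec<u8>]) -> Result<(), String> {
        if self.fail {
            return Err("injected failure".to_string());
        }
        self.writes += 1;
        Ok(())
    }
}
```

Three queued commits cost a single durable write, and a failed write acknowledges none of them.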
Failure modes & edge cases
| Scenario | Backend behaviour | Engine response |
|---|---|---|
| Crash after CAS-append, before ack returns | Record may be durable but unacknowledged. | On restart, recovery reconciles via durable_lsn(); an unacked commit either fully present or fully absent — never torn (C1, source Exp 4a). |
| Crash after ack, before page materialization | WAL is durable; page image may be unmaterialized. | Acked commit MUST be recoverable by replay; page is re-materialized on next read (C1, source Exp 4b). |
| append_wal returns Transient (S3 5xx / timeout) | Outcome unknown: the records may or may not be durable. | Engine MUST NOT ack; MAY retry. Retry safety relies on WAL record identity dedup (append is not idempotent). |
| Stale writer returns after controller restart | Its token's epoch is below current; backend rejects with Fenced. | Engine steps down immediately; does not retry under the same token (C4). |
| Read at an LSN below pitr_floor() | NotFound. | Engine surfaces a snapshot-too-old error; cannot fabricate a reclaimed version. |
| Two writers race acquire_fence | One wins CAS, other gets Contended. | Loser backs off and retries or yields, so single-writer-per-DB is preserved (source §4). |
| Page checksum mismatch on read | Corruption. | Engine fails the read loudly; never serves suspect bytes. |
| Cold cache after scale-to-zero | Every read is a backend round-trip (no torn semantics, just slow). | Mitigated by the cache (05) and batch reads; not a correctness issue (source Exp 5). |
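The retry row depends on deduplication by WAL record identity, which the spec leaves to the engine. A minimal sketch of that idea follows, assuming each record carries an engine-assigned `u128` identity; that identity scheme, `DedupLog`, and `apply` are all hypothetical, since the source does not define what record identity is.

```rust
use std::collections::HashSet;

/// Hypothetical replay target that applies each record identity at most once.
pub struct DedupLog {
    seen: HashSet<u128>,
    applied: Vec<(u128, Vec<u8>)>,
}

impl DedupLog {
    pub fn new() -> Self {
        DedupLog { seen: HashSet::new(), applied: Vec::new() }
    }

    /// Apply a record unless its identity was already applied.
    /// Returns true if the record was newly applied, false if it was a
    /// duplicate left behind by a retried append.
    pub fn apply(&mut self, id: u128, payload: Vec<u8>) -> bool {
        if !self.seen.insert(id) {
            return false;
        }
        self.applied.push((id, payload));
        true
    }

    pub fn len(&self) -> usize {
        self.applied.len()
    }
}
```

If a Transient append actually landed and the retry landed too, the second copy is skipped on replay, so the retry cannot double-apply.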
Trait stability & versioning policy
The trait is the stable internal boundary that both reference backends and the engine compile against (and, transitively, the C ABI surface of engine.h for embedded hosts, source §2.A). It is versioned with a STORAGE_TRAIT_VERSION integer.
- MUST Bump STORAGE_TRAIT_VERSION (major) for any signature change, associated-type byte-layout change, or contract weakening. The engine MUST refuse to open a backend reporting an incompatible major version.
- SHOULD Add new capability via new methods with default implementations (e.g. a future get_page_range) rather than altering existing signatures, so existing backends remain conformant. This is a minor bump.
- MUST Treat the durability and snapshot-read invariants (§Per-method contracts) as frozen; they MUST NOT be relaxed across any version.
- SHOULD Gate every published trait version on the full conformance suite for all in-tree backends before release.
- MAY Mark associated types #[non_exhaustive] (e.g. StorageError) so new error variants are a minor, not major, change, because callers already handle a default arm.
- MUST NOT Leak backend-specific concepts (S3 keys, file offsets, LSM layer ids) into the trait surface; doing so would couple the engine to one backend and break the seam.
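The refusal rule can be sketched as a comparison on the major component. The page describes STORAGE_TRAIT_VERSION as an integer with major and minor bumps but does not fix its encoding, so the `TraitVersion` struct and `check_compatible` function below are assumptions made for illustration.

```rust
/// Hypothetical split of the trait version into major and minor parts.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct TraitVersion {
    pub major: u32,
    pub minor: u32,
}

/// Refuse to open a backend whose major version differs from the engine's.
/// Minor differences are accepted, on the assumption that minor bumps only
/// add methods with default implementations.
pub fn check_compatible(engine: TraitVersion, backend: TraitVersion) -> Result<(), String> {
    if engine.major != backend.major {
        return Err(format!(
            "incompatible storage trait version: engine {}.{}, backend {}.{}",
            engine.major, engine.minor, backend.major, backend.minor
        ));
    }
    Ok(())
}
```

An engine at 1.0 would open a backend at 1.3 and refuse one at 2.0.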
Normative requirements
- MUST The engine accesses durable state ONLY through trait Storage; no engine code path touches a file, socket, or S3 SDK directly.
- MUST append_wal returns a commit LSN only after the records are durable on the backend's durability floor.
- MUST get_page(id, lsn) returns the greatest page version with version-LSN ≤ lsn.
- MUST LSNs are strictly monotonic, gap-free, and never reused per database.
- MUST append_wal validates the supplied FenceToken against the current epoch and rejects a stale token with Fenced.
- MUST Every backend pass the full conformance suite, with C1 and C4 as unconditional gates.
- MUST set_retention_floor never reclaim a version still needed by a live reader or branch.
- SHOULD get_pages coalesce backend I/O to amortize the per-miss network round-trip.
- SHOULD Backends distinguish Transient (retryable, outcome unknown) from DurabilityUnconfirmed (do not ack) from Fenced (stop writing).
- MAY LocalFileStorage reuse a SQLite/libSQL-compatible on-disk layout.
- MUST NOT Any backend interpret WalRecord contents or make MVCC visibility decisions.
- MUST NOT Group commit, batching, or caching ack a commit before it is durable.
Acceptance criteria — definition of done
- MUST trait Storage (v1) and all associated types compile and are documented with their contracts.
- MUST LocalFileStorage implements every method and passes the full conformance suite (C1–C8).
- MUST ObjectStorage passes the same suite against at least one real object store (MinIO acceptable for CI; R2/S3 for release); see 04.
- MUST open_storage dispatches file:// and s3:// / r2:// to the correct backend, exercised by the embedded path (08).
- MUST Crash-injection tests (C1/C5) run under the benchmark/validation harness (09) and show zero acked-write loss.
- SHOULD A swap from file:// to s3:// changes only the URL: no engine recompile and identical query behaviour.
- SHOULD STORAGE_TRAIT_VERSION and an incompatibility-refusal path exist and are tested.
Open questions & risks
- SHOULD Resolve: should append_wal become idempotent via a client-supplied request id (exactly-once on retry), or stay non-idempotent with engine-side WAL dedup? Leaning engine-side dedup to keep the trait narrow.
- SHOULD Decide whether branch writes (head_lsn advancing) require their own fence token per branch, or share the parent database's single-writer token.
- MAY Investigate a get_page_range / prefetch hint for scan-heavy plans to further reduce round-trips on a cold cache.
- MAY Consider exposing read-replica / stale-read semantics (read at a slightly older durable LSN) for read scaling; likely a v2 addition.
- SHOULD Constrain the WASM target (11): the async trait must drive object storage through a Worker binding, not a raw S3 SDK; verify the trait surface is binding-agnostic.
- SHOULD Maturity risk: object-storage-native OLTP (SlateDB, S3-CAS log designs) is the active frontier (source §4); the conformance suite is the primary defense, so expand it as new failure modes surface.
Dependencies & pieces to start from
- Engine Core (02)
- Produces WalRecords, chooses read LSNs, and is the sole caller of this trait.
- ObjectStorage backend (04)
- The disaggregated implementation: LSM-on-S3 + S3-CAS commit log; realizes fencing and branching.
- SlateDB / RocksDB lineage
- Reference for the LSM page store behind ObjectStorage (source §2.C, §6).
- libSQL / SQLite layout
- Optional on-disk format reuse for LocalFileStorage (source §2.A, §6).
- loom / simulation testing
- Powers the crash-consistency conformance tests C1/C5 (source Experiment 4).