Object-Storage Backend
The s3:// implementation of the Storage trait: two cooperating sub-systems — an LSM page store and a CAS-fenced commit log — both bottoming out on an S3-compatible object store as the only durability floor, with no Raft or Paxos cluster anywhere in the design.
Purpose & scope
This spec expands source §2.C (object-storage backend) and the relevant rows of the §6 shopping list into a buildable design. It defines the disaggregated ObjectStorage backend that satisfies the Storage trait defined in 03 — Storage Interface. Where LocalFileStorage targets a single .db file for dev/edge/offline, ObjectStorage is the production backend that makes the engine storage-disaggregated and scale-to-zero.
The backend is split into exactly two cooperating sub-systems, each of which terminates on the same S3-compatible durability tier:
- Page store
- An LSM tree whose layers live as immutable objects in the bucket, versioned by LSN. Serves the read path, get_page(page_id, lsn), by materializing the page version visible at or before a snapshot LSN.
- Commit log
- An ordered append-only WAL whose durability bottoms out on S3 conditional writes (compare-and-swap). Serves the write path, append_wal(records), and simultaneously provides single-writer fencing without an external consensus cluster.
Both are described below in full, including layer formats, key schema, algorithms, failure modes, configuration, and acceptance criteria.
Responsibilities & non-goals
Responsibilities
- MUST implement the Storage trait (get_page, append_wal, flush) over an S3-compatible object store, with no dependency on a local POSIX filesystem for durability.
- MUST make all durable objects immutable once written; mutation happens only by writing new LSN-versioned objects and advancing a manifest pointer.
- MUST provide atomic ordered append and single-writer fencing through S3 compare-and-swap, with no Raft/Paxos/ZooKeeper dependency.
- MUST never acknowledge a commit before the corresponding WAL is durably committed to the log (durability rule from §8 of the source).
- SHOULD keep object-storage latency off the read hot path by deferring to the local cache layer (05 — Local Cache); an object-store miss is the egress/latency cost this backend exposes.
- SHOULD support point-in-time recovery (PITR) by retaining log segments and image layers across a configurable retention window.
- MAY run an optional local caching tier (RocksDB-style) underneath the page store for hot-layer reuse.
Non-goals
- MUST NOT implement multi-writer concurrency control for a single database — single-writer-per-DB is the deliberate ceiling (source §4); concurrency is recovered by the many-small-DBs sharding lever, not here.
- MUST NOT own SQL parsing, planning, MVCC visibility, or WAL generation; those live in the engine core (02). This backend consumes already-formed WalRecords and serves Pages.
- MUST NOT implement the in-process page cache or the local file cache (LFC); those are 05 — Local Cache.
- Columnar/OLAP layout is out of scope — analytics compose around shared storage via DuckDB (source §11), not inside this backend.
Overview & architecture
The backend sits below the storage seam. The engine never touches S3 directly; it calls the trait, the trait routes through the local cache, and only a cache miss (read) or a commit (write) reaches the two sub-systems below.
engine core (MVCC, executor)
│ Storage trait
▼
┌───────────────────────┐
│ local cache (05) │ hot pages, in-process
└───────────┬───────────┘
miss │ commit
┌────────────┴────────────┐
▼ ▼
┌────────────────┐ ┌──────────────────┐
│ PAGE STORE │ │ COMMIT LOG │
│ LSM on S3 │ │ CAS append log │
│ (read path) │ │ (write path) │
└───────┬────────┘ └────────┬─────────┘
│ immutable objects │ CAS PUT-if-absent
▼ ▼
┌─────────────────────────────────┐
│ DURABILITY FLOOR (S3 / R2 / │
│ MinIO) — one bucket per DB │
└─────────────────────────────────┘
The two sub-systems are independent on the write/read axis but share one immutable, LSN-versioned object namespace in a single bucket per database.
The write path appends WAL records to the commit log (CAS-acked = durable = commit). A background apply loop replays committed WAL into the page store's in-memory layer, which flushes to delta layers and compacts into image layers. The read path resolves a page version from the page store's layer hierarchy. The two paths are decoupled in time: a commit is durable the instant the log CAS succeeds, even before the page store has materialized the new page version.
Page store — LSM on object storage
The page store is an LSM tree whose every persistent layer is an immutable object in the bucket. Model it on SlateDB (LSM designed natively for S3/R2); use RocksDB only if a local caching tier is wanted underneath. The page store is keyed by (page_id, lsn) and returns the page version visible at or before a query LSN.
Layer hierarchy
Three layer kinds, newest to oldest:
- In-memory layer (memtable)
- A sorted, in-RAM map of (page_id, lsn) → page bytes for recently applied WAL. Not durable on its own: the commit log is the durable record until the memtable flushes. Read-visible immediately after apply.
- Delta layers
- Immutable objects holding (page_id, lsn) → bytes records over a bounded LSN range. Each delta layer covers a contiguous LSN span and a key range. Produced by flushing a full memtable. Holds diffs: only the pages touched in that LSN span.
- Image layers
- Immutable objects holding full page snapshots (one materialized page per page_id) as of a single image LSN. Produced by compaction. Bound read amplification: a read need never scan delta layers older than the most recent covering image layer.
newest ┌─────────────────────────────┐
▲ │ in-memory layer (memtable) │ RAM, (page_id,lsn)→bytes
│ ├─────────────────────────────┤
│ │ delta layer L300..L399 │ diffs, immutable object
read │ delta layer L200..L299 │ diffs, immutable object
scan │ delta layer L100..L199 │ diffs, immutable object
│ ├─────────────────────────────┤
│ │ IMAGE layer @ L100 │ full snapshot of all pages
▼ └─────────────────────────────┘
oldest (compaction folds deltas → image;
GC drops layers below PITR window)
Read scans newest→oldest and stops at the first version ≤ query LSN; the image layer is the floor, so no scan crosses below it.
Write buffering, flush, compaction, GC
- Buffer. Committed WAL records are applied into the in-memory layer, producing new (page_id, lsn) entries. The memtable is the only mutable structure in the page store.
- Flush. When the memtable reaches flush_threshold_bytes (or a flush interval elapses), it is serialized to a new immutable delta layer object and the memtable is reset. Flush is a single multipart upload; the object name encodes its LSN range.
- Compact. Periodically, a set of delta layers (plus the prior image layer) is merged into a new image layer at a chosen image LSN. Compaction reduces the number of layers a read must scan, bounding read amplification. The inputs remain readable until the manifest is advanced to reference the new image; after that they become GC candidates.
- GC. Layers fully below the PITR window (i.e. no longer needed to reconstruct any LSN inside pitr_window) are deleted from the bucket. GC MUST NOT delete a layer still referenced by the live manifest or by any retained branch pointer.
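The GC rule can be sketched as a pure function over layer descriptors. This is a minimal illustration under stated assumptions, not the backend's actual policy: the Layer enum, the horizon_lsn parameter (the oldest LSN that must stay reconstructible, derived elsewhere from pitr_window), and the pinned list (layers referenced by the live manifest or a branch) are all hypothetical names. It assumes the newest image at or below the horizon must be kept as the reconstruction floor, and that anything folded into that image is reclaimable unless pinned.

```rust
/// Hypothetical descriptor for a persistent layer, as parsed from its object name.
#[derive(Debug, Clone, PartialEq)]
enum Layer {
    Image { lsn: u64 },
    Delta { lo: u64, hi: u64 },
}

/// Returns the layers that are safe to delete.
/// `horizon_lsn`: oldest LSN that must remain reconstructible (assumed input).
/// `pinned`: layers still referenced by the manifest or a branch (assumed input).
fn gc_candidates(layers: &[Layer], horizon_lsn: u64, pinned: &[Layer]) -> Vec<Layer> {
    // The newest image at or below the horizon is the floor needed to
    // reconstruct the horizon itself, so it must be retained.
    let floor = layers
        .iter()
        .filter_map(|l| match l {
            Layer::Image { lsn } if *lsn <= horizon_lsn => Some(*lsn),
            _ => None,
        })
        .max();
    let floor = match floor {
        Some(lsn) => lsn,
        None => return Vec::new(), // no floor image yet: delete nothing
    };
    layers
        .iter()
        .filter(|l| match l {
            // Older images are superseded by the floor image.
            Layer::Image { lsn } => *lsn < floor,
            // Deltas ending at or below the floor are already folded into it.
            Layer::Delta { hi, .. } => *hi <= floor,
        })
        .filter(|l| !pinned.contains(*l))
        .cloned()
        .collect()
}
```

Returning candidates rather than deleting in place keeps the decision testable and lets the caller re-check references immediately before issuing DELETEs.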
Note — immutability is the safety property
Every persistent layer is write-once. A reader that opened an object before a compaction completes keeps reading a valid object; there is no in-place mutation, so there are no torn reads from concurrent compaction. Visibility flips atomically when the manifest pointer advances, not when bytes change.
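To illustrate how visibility can flip atomically on a manifest advance, the sketch below models one mutable object guarded by an If-Match style compare-and-swap. MockObject, put_if_match, and advance_manifest are hypothetical names, and the ETag is modelled as a counter for brevity; real stores return an opaque string.

```rust
/// Models HTTP 412 from a conditional PUT.
#[derive(Debug, PartialEq)]
enum CasError {
    PreconditionFailed,
}

/// Hypothetical stand-in for manifest.json: a body plus a version tag.
struct MockObject {
    body: String,
    etag: u64, // assumption: counter instead of an opaque ETag string
}

impl MockObject {
    fn get(&self) -> (String, u64) {
        (self.body.clone(), self.etag)
    }

    /// Succeeds only if the caller saw the current version (If-Match).
    fn put_if_match(&mut self, expected_etag: u64, body: String) -> Result<u64, CasError> {
        if self.etag != expected_etag {
            return Err(CasError::PreconditionFailed);
        }
        self.body = body;
        self.etag += 1;
        Ok(self.etag)
    }
}

/// Read-modify-write loop: re-read and retry when another advance won the race.
fn advance_manifest(
    obj: &mut MockObject,
    update: impl Fn(&str) -> String,
    max_retries: u32,
) -> Result<u64, CasError> {
    for _ in 0..=max_retries {
        let (body, etag) = obj.get();
        match obj.put_if_match(etag, update(&body)) {
            Ok(new_etag) => return Ok(new_etag),
            Err(CasError::PreconditionFailed) => continue,
        }
    }
    Err(CasError::PreconditionFailed)
}
```

If two compactors read the same version, only the first conditional PUT lands; the second sees a precondition failure and must re-read before trying again, so readers never observe a half-applied manifest.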
Read path — resolving get_page(page_id, lsn)
A read resolves the version of page_id visible at or before the snapshot lsn by scanning the layer hierarchy newest→oldest and returning the first match whose record LSN is ≤ the query LSN.
resolve get_page(page_id, query_lsn):
1. look up page_id in the in-memory layer;
return the entry with the greatest lsn ≤ query_lsn, if any.
2. else, for each delta layer whose LSN span starts at or below query_lsn and lies above the covering image layer, newest→oldest:
if it holds (page_id, lsn') with lsn' ≤ query_lsn, return the entry with the greatest such lsn'.
3. else, read the full page from the covering image layer
(the most recent image layer whose image_lsn ≤ query_lsn).
4. the image layer is the floor — it always has a version,
so the scan terminates.
Steps 1 and 2 are typically served from the local cache; step 3 (and any uncached delta layer) is the point where a read reaches object storage.
Egress / latency cost lives here
An object-storage miss on the read path is the one cost this architecture exposes: S3/GCS GETs are billed as egress and carry hundreds-of-ms tail latency. Keeping the working set resident is the job of 05 — Local Cache; on R2 the egress component is zero (see Durability tier and 11 — Deployment Targets), leaving only per-operation fees and latency.
- MUST return the version with the greatest record LSN that is ≤ the query LSN; never a version newer than the snapshot.
- MUST terminate the scan at the covering image layer, which always holds a version.
- SHOULD issue uncached layer fetches concurrently when multiple candidate layers overlap, to amortize S3 latency.
- SHOULD populate the local cache with every layer block fetched on a miss.
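The resolution order above can be sketched with in-memory stand-ins for the three layer kinds. This is an illustration only: PageStore, DeltaLayer, ImageLayer, and newest_at_or_before are hypothetical names, layers are assumed to be held newest first with non-overlapping LSN spans, and object fetches and caching are omitted.

```rust
use std::collections::BTreeMap;

type PageId = u64;
type Lsn = u64;

/// Hypothetical in-memory stand-in for a delta layer object.
struct DeltaLayer {
    lo: Lsn,
    hi: Lsn,
    records: BTreeMap<(PageId, Lsn), Vec<u8>>,
}

/// Hypothetical in-memory stand-in for an image layer object.
struct ImageLayer {
    image_lsn: Lsn,
    pages: BTreeMap<PageId, Vec<u8>>,
}

struct PageStore {
    memtable: BTreeMap<(PageId, Lsn), Vec<u8>>,
    deltas: Vec<DeltaLayer>, // assumption: sorted newest first
    images: Vec<ImageLayer>, // assumption: sorted newest first
}

/// Entry for `page_id` with the greatest LSN that is <= `query_lsn`, if any.
fn newest_at_or_before(
    map: &BTreeMap<(PageId, Lsn), Vec<u8>>,
    page_id: PageId,
    query_lsn: Lsn,
) -> Option<&Vec<u8>> {
    map.range((page_id, 0)..=(page_id, query_lsn))
        .next_back()
        .map(|(_, bytes)| bytes)
}

impl PageStore {
    fn get_page(&self, page_id: PageId, query_lsn: Lsn) -> Option<&Vec<u8>> {
        // Covering image: the most recent image with image_lsn <= query_lsn.
        let image = self.images.iter().find(|img| img.image_lsn <= query_lsn);
        let floor = image.map_or(0, |img| img.image_lsn);

        // Step 1: in-memory layer.
        if let Some(bytes) = newest_at_or_before(&self.memtable, page_id, query_lsn) {
            return Some(bytes);
        }
        // Step 2: delta layers, newest to oldest, pruned by their LSN span.
        for delta in self.deltas.iter().filter(|d| d.lo <= query_lsn && d.hi > floor) {
            if let Some(bytes) = newest_at_or_before(&delta.records, page_id, query_lsn) {
                return Some(bytes);
            }
        }
        // Step 3: fall through to the covering image layer, the floor.
        image.and_then(|img| img.pages.get(&page_id))
    }
}
```

Pruning by the span encoded in each layer's name (lo, hi) means a delta layer that cannot contain a visible version is skipped without being opened.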
Commit log — ordered append on S3 CAS
The commit log is an ordered, append-only WAL. Its durability bottoms out on S3 conditional writes (compare-and-swap) — there is no separate consensus cluster. A commit is durable the moment the CAS write that places its WAL segment succeeds.
What CAS buys us (the 2026 unlock)
A single primitive — conditional PUT (put-if-absent / put-if-match-ETag) — gives two distinct guarantees at once:
- (a) Atomic ordered append
- Each log segment is written with a put-if-absent at a deterministic key log/<seq>, where <seq> is the next sequence number. If the object already exists, the CAS fails and the writer knows another append claimed that slot. The total order is the sequence of keys; there are no gaps and no two records share a slot.
- (b) Single-writer fencing
- Because only one CAS at a given sequence can succeed, the winner of the next-segment CAS is the current writer. A stale writer that lost the lease will fail its CAS (the slot is already taken) and fence itself off. The CAS token doubles as the fencing token — no lock service required.
Why this is the 2026-specific unlock
Atomic ordered append and single-writer fencing have historically required a Raft/Paxos cluster (etcd, ZooKeeper) or a managed log. Once object stores shipped strong compare-and-swap (S3 added conditional writes; R2 and MinIO support conditional If-None-Match/If-Match), both guarantees collapse into the durability tier itself. This is the WarpStream-style "log on S3" lineage (source §6): no brokers, no quorum process, the bucket is the consensus surface.
Append protocol
append_wal(records) -> commit_lsn:
1. serialize records into a log segment payload.
2. seq = last_known_seq + 1
key = "db/<db_id>/log/" + zeropad(seq)
3. PUT key with header If-None-Match: * (put-if-absent)
on 200/201 -> we own slot seq; commit_lsn = max LSN in payload
on 412 -> slot taken by another writer:
a) refresh last_known_seq from the bucket,
b) if we still hold the writer lease, retry at last_known_seq + 1,
c) else FENCED -> surrender writer role.
4. only after the conditional PUT succeeds is the commit durable;
ack the client now, never before.
- MUST treat the successful conditional PUT as the commit point; the engine MUST NOT ack a transaction before it returns success.
- MUST use a conditional precondition (If-None-Match: * for new slots, or If-Match: <etag> for a head/manifest object) on every write that establishes order or ownership.
- MUST interpret a precondition-failed (HTTP 412) at the writer's own slot as a possible fencing event and re-validate writer identity before retrying.
- SHOULD batch many transactions' WAL into one segment (group commit) to amortize the per-CAS round-trip — this defeats W1 (commit latency), not W2; see 09 and 10.
- MUST NOT rely on object-store clocks or timestamps for ordering; order is defined solely by the CAS-won sequence.
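The append protocol can be sketched against an in-memory bucket that models put-if-absent. Everything here is illustrative: MockBucket, Writer, AppendError, and the still_holds_lease callback are hypothetical names, the function returns the claimed slot sequence rather than a commit LSN, and backoff between retries is left out.

```rust
use std::collections::BTreeMap;

/// Models HTTP 412 from a PUT with If-None-Match: *.
#[derive(Debug, PartialEq)]
enum PutError {
    PreconditionFailed,
}

/// Hypothetical in-memory stand-in for a bucket with put-if-absent.
#[derive(Default)]
struct MockBucket {
    objects: BTreeMap<String, Vec<u8>>,
}

impl MockBucket {
    fn put_if_absent(&mut self, key: &str, body: &[u8]) -> Result<(), PutError> {
        if self.objects.contains_key(key) {
            return Err(PutError::PreconditionFailed);
        }
        self.objects.insert(key.to_string(), body.to_vec());
        Ok(())
    }

    /// Highest claimed log sequence under `prefix` (0 if the log is empty).
    fn last_seq(&self, prefix: &str) -> u64 {
        self.objects
            .keys()
            .filter_map(|k| k.strip_prefix(prefix))
            .filter_map(|s| s.parse::<u64>().ok())
            .max()
            .unwrap_or(0)
    }
}

#[derive(Debug, PartialEq)]
enum AppendError {
    Fenced,
    RetriesExhausted,
}

struct Writer {
    db_id: String,
    last_known_seq: u64,
    max_retries: u32,
}

impl Writer {
    /// Returns the slot sequence won by this append. The caller may ack the
    /// client only after this returns Ok.
    fn append_wal(
        &mut self,
        bucket: &mut MockBucket,
        payload: &[u8],
        still_holds_lease: impl Fn() -> bool, // assumption: lease check is injected
    ) -> Result<u64, AppendError> {
        let prefix = format!("db/{}/log/", self.db_id);
        for _ in 0..=self.max_retries {
            let seq = self.last_known_seq + 1;
            let key = format!("{}{:020}", prefix, seq);
            match bucket.put_if_absent(&key, payload) {
                Ok(()) => {
                    self.last_known_seq = seq;
                    return Ok(seq); // commit point
                }
                Err(PutError::PreconditionFailed) => {
                    // Slot taken: refresh our view, then re-validate identity.
                    self.last_known_seq = bucket.last_seq(&prefix);
                    if !still_holds_lease() {
                        return Err(AppendError::Fenced);
                    }
                }
            }
        }
        Err(AppendError::RetriesExhausted)
    }
}
```

A writer with a stale last_known_seq either catches up and lands in the next free slot or, if it no longer holds the lease, stops writing. In both cases no existing segment is overwritten.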
Object layout / key schema
One bucket (or one prefix) per database. Everything is immutable except the manifest and the writer-lease head, which advance via If-Match CAS. All payload objects are versioned by LSN; nothing is overwritten in place.
s3://<bucket>/db/<db_id>/
│
├── manifest.json # mutable via If-Match CAS; the root pointer.
│ # references current image + live delta/log set,
│ # current_lsn, writer-lease epoch, branch list.
│
├── lease.json # mutable via If-Match CAS; current writer epoch
│ # + fencing token. Loser of CAS is fenced.
│
├── log/ # commit log — ordered, append-only, CAS slots
│ ├── 00000000000000000001 # segment seq 1 (put-if-absent at this key)
│ ├── 00000000000000000002
│ └── ... # gap-free; key order == commit order
│
├── delta/ # delta layers — (page_id, lsn) diff records
│ ├── L0000000100-L0000000199.delta # covers LSN span [100,199]
│ ├── L0000000200-L0000000299.delta
│ └── ...
│
├── image/ # image layers — full page snapshots @ image_lsn
│ ├── img-L0000000100.image # snapshot of all pages as of LSN 100
│ ├── img-L0000000500.image
│ └── ...
│
└── branches/ # branch pointers — copy-on-write over shared layers
├── main.json # { base_lsn: 873, image: img-L..., ... }
├── feature-x.json # { base_lsn: 540, parent: "main", ... }
└── pr-1234.json
- manifest.json
- The single mutable root. Advanced by an If-Match: <etag> PUT so two writers cannot both win an advance. Lists the live image layer, the set of live delta layers, the highest committed log seq, current_lsn, and the branch index.
- lease.json
- The writer-lease head. Holds the current writer epoch + fencing token; a new writer claims it via CAS, fencing the prior holder. Optional if fencing is folded into the log-slot CAS, but recommended for explicit lease handoff.
- log/<zeropad seq>
- Append-only WAL segments. Fixed-width zero-padded sequence keeps lexical order equal to commit order. Written put-if-absent.
- delta/L<lo>-L<hi>.delta
- Immutable diff layer over the inclusive LSN span. Name encodes the span so the read path can prune by query LSN without opening the object.
- image/img-L<lsn>.image
- Immutable full-page snapshot at image_lsn. The read-path floor.
- branches/<name>.json
- A branch is a cheap pointer: a base LSN over the shared immutable layers (copy-on-write). New writes to a branch produce branch-scoped delta/log objects; unchanged layers are shared with the parent. See 06 — Lifecycle & Controller.
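The key schema can be expressed as a few formatting helpers. This is a sketch that assumes the widths shown in the layout above (20 digits for log sequences, 10 digits for LSNs in layer names); the function names are hypothetical.

```rust
/// Log segment key: fixed-width sequence so lexical order equals commit order.
fn log_key(db_id: &str, seq: u64) -> String {
    format!("db/{}/log/{:020}", db_id, seq)
}

/// Delta layer key encoding the inclusive LSN span [lo, hi].
fn delta_key(db_id: &str, lo: u64, hi: u64) -> String {
    format!("db/{}/delta/L{:010}-L{:010}.delta", db_id, lo, hi)
}

/// Image layer key encoding the image LSN.
fn image_key(db_id: &str, image_lsn: u64) -> String {
    format!("db/{}/image/img-L{:010}.image", db_id, image_lsn)
}

/// Recovers the LSN span from a delta key without opening the object.
fn parse_delta_span(key: &str) -> Option<(u64, u64)> {
    let name = key.rsplit('/').next()?;
    let body = name.strip_suffix(".delta")?;
    let (lo, hi) = body.split_once('-')?;
    let lo = lo.strip_prefix('L')?.parse::<u64>().ok()?;
    let hi = hi.strip_prefix('L')?.parse::<u64>().ok()?;
    Some((lo, hi))
}
```

Because the span is recoverable from the name alone, a LIST of the delta/ prefix is enough for the read path to decide which layers are worth fetching for a given query LSN.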
Durability tier
The durability floor is any S3-compatible object store. The backend depends only on: GET, PUT (multipart), DELETE, LIST, and conditional writes (If-None-Match / If-Match).
| Store | Conditional writes | Egress on cache-miss read | Notes |
|---|---|---|---|
| AWS S3 | Yes (conditional PUT) | Billed per GB out | Reference target; cross-region read variant matters for the latency floor (§8 Exp 1). |
| Cloudflare R2 | Yes (If-None-Match/If-Match) | $0 (zero egress) | Removes the cost that scales with miss rate; Class A/B per-op fees remain. See 11 — Deployment Targets. |
| MinIO | Yes | Self-hosted / LAN | Self-managed floor for the generic-VM target; full control, you own ops. |
Why R2 zero-egress matters for this engine specifically
A disaggregated engine reads pages back from object storage on every cache miss, and scale-to-zero produces a cold cache after every idle period — so miss rate is structurally coupled to the egress bill on S3/GCS. R2 charges zero egress at any volume, decoupling miss rate from bandwidth cost entirely; only the per-operation fee remains. The tradeoff: WASM build required for the Workers+R2 target (source §10).
- MUST require strong CAS semantics from the store; a store without conditional writes cannot back the commit log.
- SHOULD co-locate compute and bucket in the same region to minimize the read latency floor and avoid cross-cloud egress premiums.
- MAY abstract the store behind a thin object-client trait so AWS S3, R2, and MinIO are swapped by config, not rebuild.
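One possible shape for the thin object-client trait is sketched below, with an in-memory implementation for tests. The trait name ObjectClient, its method names, and the numeric ETag are assumptions for illustration; multipart upload is omitted, and a production implementation would wrap a real S3-compatible SDK.

```rust
use std::collections::BTreeMap;

#[derive(Debug, PartialEq)]
enum StoreError {
    NotFound,
    PreconditionFailed, // models HTTP 412
}

/// Assumption: a counter stands in for the store's opaque ETag string.
type ETag = u64;

/// Hypothetical minimal surface the backend needs from any store.
trait ObjectClient {
    fn get(&self, key: &str) -> Result<(Vec<u8>, ETag), StoreError>;
    fn put_if_absent(&mut self, key: &str, body: &[u8]) -> Result<ETag, StoreError>;
    fn put_if_match(&mut self, key: &str, expected: ETag, body: &[u8]) -> Result<ETag, StoreError>;
    fn delete(&mut self, key: &str);
    fn list(&self, prefix: &str) -> Vec<String>;
}

#[derive(Default)]
struct MemoryStore {
    objects: BTreeMap<String, (Vec<u8>, ETag)>,
    next_etag: ETag,
}

impl MemoryStore {
    fn store(&mut self, key: &str, body: &[u8]) -> ETag {
        self.next_etag += 1;
        self.objects.insert(key.to_string(), (body.to_vec(), self.next_etag));
        self.next_etag
    }
}

impl ObjectClient for MemoryStore {
    fn get(&self, key: &str) -> Result<(Vec<u8>, ETag), StoreError> {
        self.objects.get(key).cloned().ok_or(StoreError::NotFound)
    }
    fn put_if_absent(&mut self, key: &str, body: &[u8]) -> Result<ETag, StoreError> {
        if self.objects.contains_key(key) {
            return Err(StoreError::PreconditionFailed);
        }
        Ok(self.store(key, body))
    }
    fn put_if_match(&mut self, key: &str, expected: ETag, body: &[u8]) -> Result<ETag, StoreError> {
        let current = self.objects.get(key).map(|(_, etag)| *etag);
        match current {
            Some(etag) if etag == expected => Ok(self.store(key, body)),
            Some(_) => Err(StoreError::PreconditionFailed),
            None => Err(StoreError::NotFound),
        }
    }
    fn delete(&mut self, key: &str) {
        self.objects.remove(key);
    }
    fn list(&self, prefix: &str) -> Vec<String> {
        self.objects.keys().filter(|k| k.starts_with(prefix)).cloned().collect()
    }
}

/// Code written against the trait does not care which store is configured.
fn claim_log_slot<C: ObjectClient>(client: &mut C, key: &str, segment: &[u8]) -> bool {
    client.put_if_absent(key, segment).is_ok()
}
```

Keeping the trait this narrow makes the per-store behavior matrix small: each backend only has to prove the two conditional writes behave identically.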
Failure modes & edge cases
| Failure | Mechanism | Handling |
|---|---|---|
| CAS contention | Two writers race the same log slot; loser gets HTTP 412. | Loser refreshes last_known_seq, re-validates lease; if still the writer, retry at next slot with bounded backoff; else fence and surrender. Expected and benign for a single legitimate writer. |
| Lost-lease / split brain | A paused-then-resumed old writer tries to append. | Its CAS fails at the now-claimed slot (412) → it is fenced; it MUST NOT retry as writer. Single-writer-per-DB is preserved with no external lock. |
| Partial multipart | Process dies mid multipart-PUT of a delta/image layer. | Incomplete multipart uploads never become visible objects; the manifest is never advanced to reference a partial layer. Abandoned parts are cleaned by lifecycle policy / abort-incomplete-multipart. No torn layer is ever read. |
| Crash after CAS, before ack | Log segment is durable but client never saw the ack. | On restart the segment is replayed from the log; the commit is present and applied. Client may safely retry (idempotent on the commit LSN). No acked-write loss; matches §8 Experiment 4(a). |
| Crash after ack, before page materialization | Commit durable in log; page store has not yet applied it. | Apply loop replays the log forward from the last applied LSN into the memtable. The page version is reconstructed; the acked write is honored. §8 Experiment 4(b). |
| Torn read during compaction | Reader fetches a layer being replaced. | Impossible by construction — layers are immutable; visibility flips only on manifest advance. The reader's object stays valid; GC defers deletion until no live reference remains. |
| Clock skew between nodes | — | Non-issue: ordering is clock-free, defined entirely by the CAS-won log sequence, never by wall-clock timestamps. |
Durability rule (non-negotiable)
Never ack a commit before its WAL is durably in the log. Caching may hide read latency completely; it MUST NOT hide commit latency, or an acked write can be lost on crash. A fast commit path that loses an acked write under crash is disqualifying regardless of its latency numbers (source §8, Experiment 4 gate).
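The two crash cases in the table both reduce to replaying the log forward from the last applied LSN. A minimal sketch, assuming a hypothetical WalRecord shape (the real record format belongs to the engine core) and a log already loaded as segments keyed by sequence number:

```rust
use std::collections::BTreeMap;

type PageId = u64;
type Lsn = u64;

/// Hypothetical record shape, for illustration only.
struct WalRecord {
    lsn: Lsn,
    page_id: PageId,
    bytes: Vec<u8>,
}

/// Replays committed segments in sequence order into the memtable, skipping
/// records at or below `last_applied_lsn`. Returns the new applied LSN.
fn replay(
    log: &BTreeMap<u64, Vec<WalRecord>>,
    last_applied_lsn: Lsn,
    memtable: &mut BTreeMap<(PageId, Lsn), Vec<u8>>,
) -> Lsn {
    let mut applied = last_applied_lsn;
    // BTreeMap iterates in key order, which is commit order for log slots.
    for records in log.values() {
        for record in records {
            if record.lsn <= last_applied_lsn {
                continue; // already materialized in a flushed layer
            }
            // Keyed by (page_id, lsn), so applying a record twice is a no-op.
            memtable.insert((record.page_id, record.lsn), record.bytes.clone());
            applied = applied.max(record.lsn);
        }
    }
    applied
}
```

Because entries are keyed by (page_id, lsn), running the replay again after a second crash produces the same memtable, which is what makes recovery safe to repeat.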
Configuration
| Knob | Default (proposed) | Effect |
|---|---|---|
| flush_threshold_bytes | 64 MiB | Memtable size that triggers a flush to a delta layer. Larger = fewer, bigger delta objects; more RAM and longer crash-replay tail. |
| flush_interval | 5 s | Time-based flush even below the size threshold; bounds memtable loss window before flush (commit log is still durable). |
| delta_layer_target_bytes | 64 MiB | Target serialized size of a delta layer object. |
| image_layer_target_bytes | 128 MiB | Target size of an image layer segment (large page sets may split image layers by key range). |
| compaction_cadence | every N delta layers or T minutes | When to fold deltas into a new image. Tighter cadence = lower read amplification, higher write amplification / op cost. |
| pitr_window | 7 days | Retention horizon for log segments + image layers; GC may not drop layers needed to reconstruct any LSN inside this window. |
| cache_tier | none | Optional local caching tier under the page store (e.g. RocksDB). none = SlateDB-style pure-S3 model. |
| group_commit_window | 2 ms / 256 txns | Batching window for the commit log CAS append (amortizes W1). Coordinated with the engine commit path. |
| log_segment_max_bytes | 4 MiB | Max payload per log segment object before rolling to the next CAS slot. |
| cas_max_retries / cas_backoff | 8 / exp 1–100 ms | Bounded retry on benign CAS contention before declaring a fencing event. |
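The knobs above could be carried in a plain configuration struct with the proposed defaults. This is a sketch: the struct and field names beyond the knob names are assumptions, compaction_cadence is left out because its N and T are unspecified, and the backoff is assumed to double per attempt between the 1 ms and 100 ms bounds.

```rust
use std::time::Duration;

const MIB: u64 = 1024 * 1024;

#[derive(Debug, Clone, PartialEq)]
enum CacheTier {
    None,
    RocksDb,
}

/// Hypothetical config carrier; defaults mirror the table's proposed values.
#[derive(Debug, Clone)]
struct ObjectStorageConfig {
    flush_threshold_bytes: u64,
    flush_interval: Duration,
    delta_layer_target_bytes: u64,
    image_layer_target_bytes: u64,
    pitr_window: Duration,
    cache_tier: CacheTier,
    group_commit_window: Duration,
    group_commit_max_txns: u32,
    log_segment_max_bytes: u64,
    cas_max_retries: u32,
    cas_backoff_min: Duration,
    cas_backoff_max: Duration,
}

impl Default for ObjectStorageConfig {
    fn default() -> Self {
        Self {
            flush_threshold_bytes: 64 * MIB,
            flush_interval: Duration::from_secs(5),
            delta_layer_target_bytes: 64 * MIB,
            image_layer_target_bytes: 128 * MIB,
            pitr_window: Duration::from_secs(7 * 24 * 60 * 60),
            cache_tier: CacheTier::None,
            group_commit_window: Duration::from_millis(2),
            group_commit_max_txns: 256,
            log_segment_max_bytes: 4 * MIB,
            cas_max_retries: 8,
            cas_backoff_min: Duration::from_millis(1),
            cas_backoff_max: Duration::from_millis(100),
        }
    }
}

/// Delay before retry number `attempt` (0-based): doubles each time, capped.
fn cas_backoff(cfg: &ObjectStorageConfig, attempt: u32) -> Duration {
    let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
    cfg.cas_backoff_min
        .saturating_mul(factor)
        .min(cfg.cas_backoff_max)
}
```

With these defaults the delays run 1, 2, 4, 8, 16, 32, 64 ms and then stay at the 100 ms cap, so eight retries cost well under a second before a fencing event is declared.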
Dependencies & existing pieces to start from
- Page store
- SlateDB — LSM designed natively for object storage (S3/R2); the closest existing model for the layer/flush/compaction machinery. RocksDB only if a local caching tier is wanted underneath.
- Commit log
- S3 conditional-write (CAS) append-log designs; WarpStream-style log-on-S3 lineage for the broker-free ordered-append pattern (source §6).
- Durability
- AWS S3 / Cloudflare R2 / MinIO — any S3-compatible store with conditional writes.
- Object client
- An S3-compatible SDK exposing conditional PUT; for the Workers+R2 target, the R2 binding rather than a raw S3 SDK (source §10 / 11).
Acceptance criteria / definition of done
- MUST pass §8 Experiment 4 unconditionally: kill -9 at (a) after CAS-append before ack and (b) after ack before page materialization. Every acked commit is present on restart, with no torn state and no acked-write loss.
- MUST demonstrate two concurrent writers against one DB resolve to exactly one survivor via CAS fencing, with the loser fenced and no split-brain log corruption.
- MUST serve get_page(page_id, lsn) returning the correct at-or-before version across memtable, delta, and image layers, verified against a known LSN history.
- MUST show flush → compaction → GC reduces live layer count over time and that GC never deletes a layer inside the PITR window or referenced by a live branch.
- SHOULD report single-commit latency p50/p99/p999 (§8 Exp 1) and a group-commit throughput curve (§8 Exp 2) for a same-region object store.
- SHOULD run the full backend against AWS S3, R2, and MinIO with no code changes (config-only swap).
- SHOULD be exercised under simulation/fault-injection (e.g. loom) for the CAS append + apply loop.
Open questions & risks
- Image-layer split policy. For large page sets, do image layers split by key range, and how does the read path index which split covers a page_id without a LIST per read?
- Group-commit / fencing interaction. When batching many txns into one CAS segment, how is a mid-batch fencing event surfaced to the engine so only durable txns are acked?
- Branch GC. Reference counting across branch pointers so GC can drop a layer only when no branch retains it — exact bookkeeping in the manifest vs. a separate refcount object.
- Conditional-write portability. Subtle differences between S3 conditional PUT, R2 If-None-Match, and MinIO semantics; pin a tested behavior matrix per store.
- Maturity. Object-storage-native OLTP (SlateDB, S3-CAS log designs, libSQL rewrite) is the active 2026 frontier, with fewer battle-tested guarantees than coupled Postgres (source §4); pin specific versions and re-validate against upstream changes.
- WASM constraints. Single-threaded, no native FS on Workers — does the apply loop / compaction model survive the WASM port, or move out-of-band? (11.)