Purpose & scope

The durability floor of this engine is object storage (S3 / R2 / GCS / MinIO), reached through the object-storage backend via get_page(page_id, lsn). A raw object-store GET costs tens-to-hundreds of milliseconds at the p99 tail. If every read paid that cost, the engine would not be embeddable in any meaningful sense — engine_query() over bun:ffi would block for the duration of a network call on the hot path, defeating the entire premise of an in-process library.

The local cache exists to keep that latency off the hot path. It serves the overwhelming majority of reads from process-local memory or local NVMe, and falls through to object storage only on a genuine miss. The source design note states the dependency plainly, and this page treats it as the load-bearing claim:

Embeddability depends on this component

Without a local cache, every read is a network call. The "hot path is in-process function calls" property that makes the engine embeddable (source §3, True embeddability) is produced entirely by this cache. The LocalFileStorage backend avoids the network by construction; the ObjectStorage backend only avoids it because of the cache described here. Treat a regression in hit ratio as a regression in embeddability, not a tuning nuisance.

Scope of this spec: the in-process page cache, the local file cache (LFC) on NVMe, the tiering and admission/eviction policy between them, version-correct serving against MVCC snapshots, the durability boundary the cache must respect, cold-start warming, configuration, failure modes, and acceptance criteria. It is the expansion of source §2.D (Local cache) and the embeddability point in source §3.

Responsibilities & non-goals

Responsibilities

  • MUST serve reads for resident pages from memory (tier 1) or local NVMe (tier 2) without contacting object storage.
  • MUST return the page version consistent with the requesting reader's snapshot LSN (see Engine Core MVCC).
  • MUST treat cached page versions as immutable: a new committed version appends/invalidates, it never mutates a cached page in place.
  • MUST emit per-tier hit-ratio and miss-to-object-storage counters for capacity planning and the benchmark plan.
  • SHOULD warm proactively on cold start to bound the post-idle p99 (source §4, Cold start; Experiment 5).
  • SHOULD spill cleanly from memory to NVMe so working sets larger than RAM stay off the network.

Non-goals

  • MUST NOT participate in the commit-acknowledgement path. The cache is a read accelerator only; durability is owned by the WAL / commit log (source §8, durability rule).
  • MUST NOT be the source of truth for any page. The object-store floor is authoritative; the cache is reconstructable and disposable.
  • MAY be empty at any time (after eviction, restart, or scale-to-zero) without affecting correctness — only performance.
  • Replication, cross-node coherence, and distributed cache invalidation are out of scope: a single engine process owns its caches, and single-writer-per-DB (source §4) means there is no second writer to invalidate against.

Two-tier design

The cache is two tiers backed by one durability floor. Tier 1 is the in-process shared-buffer-style page cache; tier 2 is the local file cache (LFC) on NVMe. Tier 1 absorbs the genuinely hot working set at memory latency; the LFC extends the resident set far past RAM at NVMe latency, so the engine still avoids object storage for warm-but-not-hot pages.

Tier 1: in-process page cache (shared-buffer style)
  RAM · ~ns–µs · pinned hot pages, MVCC version chains, dirty-on-WAL-flush only
Tier 2: local file cache (LFC) on NVMe
  local SSD · ~tens of µs · larger warm set, admission-gated, survives evictions from tier 1
Durability floor: object storage (S3 / R2 / GCS / MinIO)
  ~tens–hundreds of ms · authoritative · reached only on a true miss · cache-miss = egress (except R2)

Read fall-through: tier 1 → tier 2 (LFC) → object-storage floor. Only the bottom hop crosses the network and bills egress.

Tier 1 — in-process page cache (shared buffers)

A fixed-size, in-process array of page frames, modelled on PostgreSQL shared buffers / a buffer pool. Pages live in the engine's address space, so a hit is a pointer dereference plus a version-chain walk — no syscall, no copy across a process boundary, no network. This is precisely the property that preserves embeddability: an FFI engine_query() that hits tier 1 never yields the calling thread to I/O.

  • MUST store pages keyed by (page_id, version), where version is the page's producing LSN (see layer format).
  • MUST support pinning during read/execute so an in-use frame cannot be evicted mid-operation.
  • SHOULD hold the hot MVCC version chain for a page, not just the latest version, so concurrent snapshot readers at different LSNs hit tier 1.
  • SHOULD be sized as a fraction of process RAM (typ. 25–50%), leaving headroom for executor working memory and the host runtime (Bun).

Tier 2 — local file cache (LFC) on NVMe

A larger, on-disk cache backed by a file (or a directory of slab files) on local NVMe. The LFC holds pages that have aged out of tier 1 but are still warm enough to be worth keeping off the network. A tier-2 hit is a local NVMe read (tens of microseconds) — three to four orders of magnitude faster than an object-store GET, and it bills no egress. The LFC is what lets a working set far larger than RAM still avoid the durability floor.

  • MUST be local, ephemeral storage (instance NVMe / scratch disk), never network-attached, never the durability floor.
  • MUST store the same versioned pages as tier 1 and verify integrity on read (checksum) before serving.
  • SHOULD be sized in the GB–tens-of-GB range, far larger than tier 1, bounded by available local disk.
  • MAY be disabled entirely (lfc.enabled = false) for deployments with no usable local disk (e.g. Cloudflare Workers, see Deployment Targets), in which case tier 1 falls through directly to object storage.

Tiering policy between the tiers

The two tiers form an inclusive-by-default hierarchy with controlled promotion and demotion:

Transition | Trigger | Policy
floor → tier 1 | miss serviced by object storage | fetched page is installed in tier 1 and admitted to the LFC if it passes admission (below).
tier 1 → tier 2 (demote) | eviction from tier 1 | evicted page is written to the LFC if not already present and admission permits; otherwise dropped.
tier 2 → tier 1 (promote) | tier-1 miss, tier-2 hit | page is read from NVMe into a tier-1 frame and pinned for the read; reference counters updated.
tier 2 → drop | LFC eviction or full | page is discarded; next access falls to the floor. No write-back (pages are immutable and the floor is authoritative).

Because every cached page version is immutable and reconstructable from the floor, there is never a dirty write-back from the LFC: demotion is a copy, eviction is a delete. The only "dirty" data in the system lives in the WAL until it is durable — and that data is governed by the durability boundary, not the cache.

Eviction & admission policy

Tier 1 — replacement

Tier 1 replacement must be low-overhead and should resist scans. CLOCK (second-chance) is the default candidate for its near-zero per-access cost, but it approximates LRU and therefore shares LRU's weakness to large scans. 2Q or a CLOCK-Pro variant is the upgrade path if scan workloads (e.g. large sequential reads, the HNSW cold traversal in Capabilities) pollute it.

Candidate | Strength | Cost / risk | Use
LRU | simple, good recency | per-access list churn; one big scan evicts the hot set | baseline / reference only
CLOCK | O(1)-ish, lock-light, approximates LRU | weaker than LRU under mixed recency/frequency | default tier 1
2Q | scan-resistant (separates recent vs frequent) | two queues to tune | upgrade if scans pollute
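
The CLOCK default, combined with the pinning requirement from the tier-1 section, can be sketched as follows. This is a minimal illustration, not the engine's buffer pool: `ClockPool` and `Frame` are hypothetical names, keys are plain `(page_id, lsn)` tuples, and the linear search in `touch` stands in for the hash index a real pool would use.

```rust
/// One tier-1 frame: which page version it holds, plus CLOCK and pin state.
#[derive(Debug, Clone)]
struct Frame {
    key: Option<(u64, u64)>, // (page_id, lsn); None means the frame is free
    referenced: bool,        // CLOCK reference bit, set on every hit
    pins: u32,               // > 0 means a reader is using the frame
}

/// Fixed-size frame array swept by a CLOCK hand (hypothetical type).
struct ClockPool {
    frames: Vec<Frame>,
    hand: usize,
}

impl ClockPool {
    fn new(capacity: usize) -> Self {
        let free = Frame { key: None, referenced: false, pins: 0 };
        ClockPool { frames: vec![free; capacity], hand: 0 }
    }

    /// Record a hit on `key` by setting its reference bit.
    /// Returns the frame index, or None if the key is not resident.
    fn touch(&mut self, key: (u64, u64)) -> Option<usize> {
        let idx = self.frames.iter().position(|f| f.key == Some(key))?;
        self.frames[idx].referenced = true;
        Some(idx)
    }

    /// Install `key`, evicting a victim if needed. Returns the frame index
    /// and the evicted key (if any), or None when every frame is pinned.
    fn install(&mut self, key: (u64, u64)) -> Option<(usize, Option<(u64, u64)>)> {
        // Two full sweeps suffice: the first clears reference bits, the
        // second finds an unpinned frame whose bit stayed clear.
        for _ in 0..2 * self.frames.len() {
            let idx = self.hand;
            self.hand = (self.hand + 1) % self.frames.len();
            let f = &mut self.frames[idx];
            if f.pins > 0 {
                continue; // pinned frames are never victims
            }
            if f.referenced {
                f.referenced = false; // second chance
                continue;
            }
            let evicted = f.key.replace(key);
            f.referenced = true;
            return Some((idx, evicted));
        }
        None
    }

    fn pin(&mut self, idx: usize) {
        self.frames[idx].pins += 1;
    }

    fn unpin(&mut self, idx: usize) {
        self.frames[idx].pins = self.frames[idx].pins.saturating_sub(1);
    }
}
```

An evicted key returned by `install` is the natural hand-off point to the demotion transition: the caller would offer it to the LFC admission filter rather than discard it outright.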

LFC — admission

The LFC is admission-gated, not write-everything. Writing every evicted or fetched page to NVMe burns write endurance and lets one-shot pages (full-table scans, the analytics export path) evict the warm set. Admission policy decides whether a page earns a slot:

  • SHOULD admit on second access (a one-touch page is not yet proven warm); track a lightweight frequency sketch / ghost queue to detect re-reference.
  • SHOULD apply a frequency-based admission filter (TinyLFU-style) so a high-frequency incoming page can displace a low-frequency resident one, and a one-shot scan page is rejected.
  • MAY bypass admission for pages explicitly fetched by a warming pass (they are admitted because warming chose them).
  • MUST NOT let an admission decision block or slow a read: admission is asynchronous to the read path.
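
The admit-on-second-access rule can be illustrated with a bounded ghost queue. The sketch below is one possible shape under stated assumptions: `SecondTouchGate` is a hypothetical name, keys are `(page_id, lsn)` tuples, and the frequency-sketch (TinyLFU-style) filter is deliberately left out. In the engine this decision would run off the read path, as the last bullet requires.

```rust
use std::collections::{HashSet, VecDeque};

/// Second-touch admission gate (hypothetical type): a page is admitted to the
/// LFC only if it was seen before and is still remembered in the ghost queue.
struct SecondTouchGate {
    ghost: VecDeque<(u64, u64)>, // recently seen keys, oldest first
    seen: HashSet<(u64, u64)>,   // same keys, for O(1) membership checks
    capacity: usize,             // how many first-touch keys to remember
}

impl SecondTouchGate {
    fn new(capacity: usize) -> Self {
        SecondTouchGate { ghost: VecDeque::new(), seen: HashSet::new(), capacity }
    }

    /// Returns true if `key` should be admitted to the LFC.
    fn should_admit(&mut self, key: (u64, u64)) -> bool {
        if self.seen.remove(&key) {
            // Second touch: the page is proven warm. Forget the ghost entry.
            self.ghost.retain(|k| *k != key);
            return true;
        }
        // First touch: remember the key only; no NVMe write is spent on it.
        if self.capacity == 0 {
            return false;
        }
        if self.ghost.len() == self.capacity {
            if let Some(oldest) = self.ghost.pop_front() {
                self.seen.remove(&oldest);
            }
        }
        self.ghost.push_back(key);
        self.seen.insert(key);
        false
    }
}
```

A one-shot scan touches each page once, so every scan page is rejected and merely cycles through the ghost queue; the queue capacity bounds how long a first touch is remembered.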

Metrics (mandatory)

These counters are not optional telemetry; they are the inputs to capacity planning and to benchmark Experiment 5 (cold vs warm reads).

cache.t1.hit_ratio
tier-1 hits / total reads, rolling window. The headline embeddability indicator.
cache.t2.hit_ratio
LFC hits / (reads that missed tier 1). Measures the LFC's contribution past RAM.
cache.overall_hit_ratio
(t1 + t2 hits) / total reads. 1 − this = fraction of reads that became object-store GETs.
cache.miss.object_reads
count of reads that fell through to object storage (each is a network round-trip and, except on R2, billable egress — see Deployment Targets).
cache.miss.object_read_latency
p50/p99/p999 of the floor read, captured as a histogram (never mean-only).
cache.t1.evictions / cache.t2.evictions
eviction rate per tier; a rising t1 eviction rate with flat hit ratio signals undersized tier 1.
cache.t2.admit / cache.t2.reject
admission accept/reject counts; high reject is healthy (scan resistance working).
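
The three ratios differ only in their denominators, which is easy to get wrong. The sketch below derives them from raw counters; `ReadCounters` is a hypothetical name, the rolling window is omitted, and each ratio is `None` when its denominator is zero so an idle cache does not report a misleading 0 or 1.

```rust
/// Raw read counters for one measurement window (hypothetical type).
#[derive(Debug, Default, Clone, Copy)]
struct ReadCounters {
    t1_hits: u64,
    t2_hits: u64,
    misses: u64, // reads that fell through to object storage
}

fn ratio(numerator: u64, denominator: u64) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(numerator as f64 / denominator as f64)
    }
}

impl ReadCounters {
    fn total_reads(&self) -> u64 {
        self.t1_hits + self.t2_hits + self.misses
    }

    /// cache.t1.hit_ratio: tier-1 hits over all reads.
    fn t1_hit_ratio(&self) -> Option<f64> {
        ratio(self.t1_hits, self.total_reads())
    }

    /// cache.t2.hit_ratio: LFC hits over reads that missed tier 1.
    fn t2_hit_ratio(&self) -> Option<f64> {
        ratio(self.t2_hits, self.t2_hits + self.misses)
    }

    /// cache.overall_hit_ratio: hits in either tier over all reads.
    fn overall_hit_ratio(&self) -> Option<f64> {
        ratio(self.t1_hits + self.t2_hits, self.total_reads())
    }

    /// Fraction of reads that became object-store GETs (1 - overall).
    fn object_read_fraction(&self) -> Option<f64> {
        self.overall_hit_ratio().map(|hit| 1.0 - hit)
    }
}
```

For example, 900 tier-1 hits, 80 LFC hits and 20 misses give a tier-1 ratio of 0.90, an LFC ratio of 0.80 (80 of the 100 reads that missed tier 1) and an overall ratio of 0.98.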

Version correctness

The cache is not a key→bytes map; it is a versioned store, and correctness under MVCC is non-negotiable. Pages are versioned by LSN. A reader executes against a fixed snapshot LSN (see Engine Core, snapshot isolation via LSN-stamped versions). When that reader asks the cache for a page, the cache MUST return the version of that page that is visible at the reader's snapshot LSN — i.e. the newest version whose producing LSN is ≤ the snapshot LSN — not merely "the latest version present."

  • MUST resolve get_page(page_id, snapshot_lsn) to the page version with the greatest producing LSN ≤ snapshot_lsn.
  • MUST treat each cached page version as immutable. A newly committed version is appended to the page's version chain (or installed as a new keyed entry); it MUST NOT overwrite the bytes of an existing cached version.
  • MUST NOT serve a version with producing LSN > snapshot_lsn to that reader (that would expose a future write — a snapshot-isolation violation).
  • MUST, on a chain miss (no resident version satisfies the snapshot), fall through to object storage with the snapshot LSN and let the layer format reconstruct the correct version.
  • SHOULD retain older versions still referenced by live snapshots and release a version only once its successor's LSN is ≤ the oldest live snapshot LSN, so that no live snapshot can still select it (cache version GC tracks the engine's oldest live snapshot, mirroring MVCC vacuum).

This "append, never mutate" rule is the cache-level reflection of the storage format itself: the page store is built from immutable delta and image layers keyed by (key, LSN). The cache mirrors that immutability so a cache hit and a floor read are indistinguishable in result — same version selection, same bytes.
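
The selection rule (greatest producing LSN ≤ snapshot LSN) and the append-only rule map naturally onto an ordered map per page. The following is a minimal sketch under assumptions: `VersionChain` is a hypothetical name, page bytes are a plain `Vec<u8>`, and the chain is assumed to hold every committed version newer than the one it returns (how the cache guarantees that after evictions is outside this sketch).

```rust
use std::collections::BTreeMap;

type Lsn = u64;

/// Version chain for one page: producing LSN -> immutable page bytes.
#[derive(Default)]
struct VersionChain {
    versions: BTreeMap<Lsn, Vec<u8>>,
}

impl VersionChain {
    /// Append a committed version. An existing version at the same LSN is
    /// never overwritten; the call returns false and leaves it untouched.
    fn append(&mut self, lsn: Lsn, bytes: Vec<u8>) -> bool {
        if self.versions.contains_key(&lsn) {
            return false;
        }
        self.versions.insert(lsn, bytes);
        true
    }

    /// The version visible at `snapshot_lsn`: the entry with the greatest
    /// producing LSN <= snapshot_lsn, or None on a chain miss.
    fn visible_at(&self, snapshot_lsn: Lsn) -> Option<(Lsn, &[u8])> {
        self.versions
            .range(..=snapshot_lsn)
            .next_back()
            .map(|(lsn, bytes)| (*lsn, bytes.as_slice()))
    }
}
```

A reader at snapshot 15 over versions {10, 20} gets version 10, a reader at 20 gets version 20, and a reader at 5 gets a chain miss that must fall through to the floor.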

Invalidation on commit

Because there is a single writer per database (source §4), invalidation is local and simple: when the writer commits at LSN L, the pages it modified gain a new version L. The cache appends those versions; it does not evict the prior versions while live snapshots still reference them. There is no cross-process coherence problem to solve — no second writer can produce a version this cache does not already know about.

Why append-not-mutate matters for correctness

A long-running reader holding snapshot LSN S and a writer committing at LSN L > S coexist. If the cache mutated the page in place at commit, the reader would suddenly see L's bytes through its S snapshot: a torn, non-repeatable read. Appending a new version and selecting by snapshot LSN keeps both the reader at S and any later reader at or above L correct from the same cache.

Interfaces / data formats

The cache sits between the engine's buffer-access layer and the Storage trait. It implements the read path of that trait's get_page by consulting tiers before delegating the miss to the underlying ObjectStorage implementation.

/// Identity of a cached page version. The (page_id, lsn) pair is the cache key;
/// `lsn` is the LSN at which this version of the page was produced.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct PageKey { pub page_id: PageId, pub lsn: Lsn }

pub trait PageCache: Send + Sync {
    /// Return the page version visible at `snapshot_lsn`: the resident version
    /// with the greatest producing LSN <= snapshot_lsn. On miss, fetch from the
    /// underlying Storage at snapshot_lsn, install, and return. Pins until dropped.
    fn get(&self, page_id: PageId, snapshot_lsn: Lsn) -> Result<PinnedPage>;

    /// Append a freshly committed version (producing LSN = `lsn`) into tier 1.
    /// MUST NOT mutate any existing cached version of `page_id`.
    fn put_committed(&self, page_id: PageId, lsn: Lsn, page: Page);

    /// Best-effort proactive load of `keys` into the cache (warming). Non-blocking
    /// w.r.t. the read path; admission is bypassed (warming is the admission signal).
    fn warm(&self, keys: &[PageKey]);

    /// Release versions no live snapshot can observe (oldest_live_snapshot driven).
    fn gc_versions(&self, oldest_live_snapshot: Lsn);

    /// Snapshot of per-tier counters for metrics export.
    fn stats(&self) -> CacheStats;
}

pub struct CacheStats {
    pub t1_hits: u64,  pub t2_hits: u64,  pub misses: u64,   // misses == object-store reads
    pub t1_evictions: u64, pub t2_evictions: u64,
    pub t2_admits: u64,    pub t2_rejects: u64,
    pub object_read_latency_us: Histogram,                    // p50/p99/p999
}

The on-NVMe LFC entry format is fixed-layout so an entry can be validated before it is trusted:

LFC slab entry
┌────────────┬──────────────┬──────────┬─────────────┬────────────┐
│ magic (4B) │ page_id (8B) │ lsn (8B) │ crc32c (4B) │ page bytes │
└────────────┴──────────────┴──────────┴─────────────┴────────────┘
- magic: fixed constant
- crc32c: computed over the page bytes
- page bytes: PAGE_SIZE (e.g. 8 KiB)
- magic mismatch  -> treat slot as empty
- crc32c mismatch -> corruption: drop slot, count as miss, re-fetch from floor
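
The validate-before-trust rule for a slab entry can be sketched as an encode/decode pair. Several details here are assumptions that the layout above does not specify: the `MAGIC` value, little-endian field encoding, and the names `encode_entry`, `decode_entry` and `SlotRead`. The checksum is a plain bitwise CRC-32C (Castagnoli); a production build would more likely use a hardware-accelerated implementation.

```rust
const MAGIC: u32 = 0x4C46_4331; // hypothetical magic value, not from the spec
const HEADER_LEN: usize = 4 + 8 + 8 + 4; // magic + page_id + lsn + crc32c

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

fn read_u32(bytes: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(&bytes[..4]);
    u32::from_le_bytes(a)
}

fn read_u64(bytes: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&bytes[..8]);
    u64::from_le_bytes(a)
}

#[derive(Debug, PartialEq)]
enum SlotRead {
    Empty,   // magic mismatch (or too short): treat the slot as empty
    Corrupt, // crc32c mismatch: drop the slot, count a miss, re-fetch
    Valid { page_id: u64, lsn: u64, page: Vec<u8> },
}

/// Serialize one entry in the field order of the layout (little-endian assumed).
fn encode_entry(page_id: u64, lsn: u64, page: &[u8]) -> Vec<u8> {
    let mut buf = Vec::with_capacity(HEADER_LEN + page.len());
    buf.extend_from_slice(&MAGIC.to_le_bytes());
    buf.extend_from_slice(&page_id.to_le_bytes());
    buf.extend_from_slice(&lsn.to_le_bytes());
    buf.extend_from_slice(&crc32c(page).to_le_bytes());
    buf.extend_from_slice(page);
    buf
}

/// Validate a slot before trusting any of its bytes.
fn decode_entry(slot: &[u8]) -> SlotRead {
    if slot.len() < HEADER_LEN || read_u32(&slot[0..4]) != MAGIC {
        return SlotRead::Empty;
    }
    let page_id = read_u64(&slot[4..12]);
    let lsn = read_u64(&slot[12..20]);
    let stored_crc = read_u32(&slot[20..24]);
    let page = &slot[HEADER_LEN..];
    if crc32c(page) != stored_crc {
        return SlotRead::Corrupt;
    }
    SlotRead::Valid { page_id, lsn, page: page.to_vec() }
}
```

Because the checksum covers only the page bytes, as in the layout, a damaged `page_id` or `lsn` field would go undetected by this sketch; whether the header needs its own protection is left open here.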

Behaviour & algorithms

Read path (the hot path)

get(page_id, snapshot_lsn):
  1. TIER 1: walk version chain for page_id.
       resident version v with max(v.lsn) <= snapshot_lsn ?
         -> HIT: pin frame, bump CLOCK ref bit, t1_hits++, RETURN          (memory latency)
  2. TIER 2 (LFC): look up satisfying (page_id, lsn) slot, verify crc32c.
       valid ?
         -> HIT: read NVMe -> install into tier 1 frame, pin, t2_hits++, RETURN  (NVMe latency)
  3. MISS: storage.get_page(page_id, snapshot_lsn)   // network round-trip to floor
       install version into tier 1; async-admit to LFC if admission passes
       misses++ ; record object_read_latency ; RETURN                     (object-store latency)

Three-stop fall-through. Steps 1–2 are local; only step 3 crosses the network and bills egress.

Commit / invalidation path (NOT the hot path)

On a successful commit at LSN L — and only after the WAL is durable (see below) — the writer calls put_committed(page_id, L, page) for each modified page. The cache appends version L to tier 1's chain. Prior versions remain for live snapshots; gc_versions(oldest_live_snapshot) reclaims them later. No NVMe write-back occurs at commit: the LFC learns the new version lazily, on demotion.

Version GC

The engine reports its oldest live snapshot LSN (the same value MVCC vacuum uses). The cache drops any version v of a page where a newer resident version v′ exists with v′.lsn ≤ oldest_live_snapshot — i.e. no live reader can ever select v again. This keeps version chains bounded under a long-running-snapshot workload.
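
Applied to a single page's chain, the rule reduces to: find the newest version at or below the oldest live snapshot, keep it and everything newer, drop everything older. A minimal sketch, assuming the chain is an ordered map from producing LSN to bytes and using the hypothetical name `gc_chain`:

```rust
use std::collections::BTreeMap;

/// Drop every version that no live snapshot can select any more.
/// Returns the number of versions released.
fn gc_chain(chain: &mut BTreeMap<u64, Vec<u8>>, oldest_live_snapshot: u64) -> usize {
    // The newest version with lsn <= oldest_live_snapshot is the one the
    // oldest reader selects; it and all newer versions must stay.
    let keep_from = match chain.range(..=oldest_live_snapshot).next_back() {
        Some((lsn, _)) => *lsn,
        None => return 0, // nothing at or below the oldest snapshot
    };
    let before = chain.len();
    // split_off returns the entries with key >= keep_from; the entries left
    // behind are exactly the superseded versions, which are discarded.
    let kept = chain.split_off(&keep_from);
    *chain = kept;
    before - chain.len()
}
```

With versions {10, 20, 30, 40} and an oldest live snapshot of 35, versions 10 and 20 are released: the reader at 35 selects 30, and no reader at or above 35 can reach anything older.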

The durability boundary (critical callout)

Cache READ latency, never COMMIT latency

The cache hides read latency completely. It MUST NEVER hide commit latency. A commit MUST NOT be acknowledged to the client until its WAL records are durably stored on the commit-log floor (the S3 conditional-write / CAS append, source §2.C). Acknowledging a commit from an in-memory buffer before the WAL is durable produces acked-write loss: the client is told "committed," the process crashes, and the write is gone. This is disqualifying regardless of how good the latency numbers look (source §8, durability rule; Experiment 4).

This is the single sharpest line in the whole component, and it follows directly from the W1/W2 distinction in the source:

  • MUST NOT source a commit acknowledgement from any cache tier. The ack is owned exclusively by the WAL/commit-log durability signal.
  • MUST NOT reorder or shortcut the WAL-durable → ack → put_committed sequence. The page enters the cache after durability, never before.
  • MUST tolerate a crash between WAL-durable and put_committed: on restart the page is simply re-fetched/re-materialized from the floor; the committed data is safe because it is in the durable WAL, not because it was in the cache.
  • MUST keep the cache strictly off the W1 (commit-latency) axis — caching is a W1 non-mitigation. Source §8 is explicit: "cache cannot hide commit latency without breaking durability." Group commit beats W1; the cache only beats reads.
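
The required ordering can be made concrete with a small sketch. Everything named here is illustrative rather than the engine's API: `CommitLog`, `MockLog`, `Step` and `commit` are hypothetical, the cache is reduced to a list of `(page_id, lsn)` pairs, and the `trace` vector exists only so the order of events can be inspected.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Step {
    WalDurable,
    Ack,
    PutCommitted,
}

/// Stand-in for the commit log (hypothetical trait).
trait CommitLog {
    /// Returns Ok only once the record is durably stored on the floor.
    fn append_durable(&mut self, lsn: u64, record: &[u8]) -> Result<(), String>;
}

/// In-memory log used only to exercise the ordering.
struct MockLog {
    fail: bool,
    durable_lsns: Vec<u64>,
}

impl CommitLog for MockLog {
    fn append_durable(&mut self, lsn: u64, _record: &[u8]) -> Result<(), String> {
        if self.fail {
            return Err("commit log unavailable".to_string());
        }
        self.durable_lsns.push(lsn);
        Ok(())
    }
}

/// Commit one page version in the order WAL-durable -> ack -> put_committed.
fn commit<L: CommitLog>(
    log: &mut L,
    cached: &mut Vec<(u64, u64)>,
    trace: &mut Vec<Step>,
    page_id: u64,
    lsn: u64,
    record: &[u8],
) -> Result<(), String> {
    // 1. Durability first. On failure nothing is acknowledged and nothing
    //    enters the cache: the early return skips both later steps.
    log.append_durable(lsn, record)?;
    trace.push(Step::WalDurable);
    // 2. The acknowledgement is derived from the durability signal alone.
    trace.push(Step::Ack);
    // 3. Only now does the cache learn the new version. A crash before this
    //    line loses nothing: the version is re-materialized from the floor.
    cached.push((page_id, lsn));
    trace.push(Step::PutCommitted);
    Ok(())
}
```

The point of the shape is that no code path reaches the ack or the cache insert without passing through a successful `append_durable`.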

Restate the three levers, so nobody confuses them

The cache addresses read latency. Group commit addresses commit latency (W1). Many-small-DBs / sharding addresses write serialization (W2). They are three different levers for three different problems; the cache is exactly one of them and stays in its lane.

Cold start & warming

Scale-to-zero (source §3) means the engine process is torn down when idle, so caches start cold after every idle period. The first request after idle pays process-start plus cache-warm, and a cold read goes all the way to the floor at object-storage latency (and, off R2, egress cost). The lifecycle controller drives the cold/idle transitions; Deployment Targets sets the substrate-specific cold-start cost (container spin-up vs Worker isolate).

Warming strategies

  • SHOULD warm on start (warm.on_start = true) by prefetching the root/metadata pages and a recorded hot-set manifest before serving the first query, bounding the post-idle p99.
  • SHOULD persist a small hot-set manifest (the most-referenced PageKeys) to the floor on graceful shutdown and replay it via warm() on next start.
  • MAY preserve a warm LFC across restarts when local NVMe is durable across the process lifecycle (e.g. a long-lived VM that restarts the engine without losing the scratch disk); on ephemeral substrates the LFC is cold like everything else.
  • MAY support keep-warm: the controller issues a periodic lightweight ping to defer scale-to-zero and keep caches resident for latency-sensitive tools (source §4, "keep-warm ping").
  • MUST NOT let warming block correctness or commits — warming is best-effort and asynchronous; a query that races warming simply takes the normal miss path.
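
Building the hot-set manifest amounts to a top-N selection over per-page reference counts. The sketch below shows one way to do it; `hot_set_manifest` is a hypothetical name, keys are `(page_id, lsn)` tuples, and the manifest's serialized form and the source of the counts are not specified by this document.

```rust
use std::collections::HashMap;

/// Pick the `limit` most-referenced page keys, most referenced first.
/// Ties are broken by key so the manifest is deterministic.
fn hot_set_manifest(
    ref_counts: &HashMap<(u64, u64), u64>,
    limit: usize,
) -> Vec<(u64, u64)> {
    let mut entries: Vec<((u64, u64), u64)> =
        ref_counts.iter().map(|(key, count)| (*key, *count)).collect();
    // Descending by count, then ascending by key for stable output.
    entries.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    entries.into_iter().take(limit).map(|(key, _)| key).collect()
}
```

On the next start the resulting keys would be handed to warm(), which bypasses LFC admission because the manifest is itself the admission signal.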

Tuning hook

The cold-vs-warm p99 gap measured by Experiment 5 is the input to the idle-timeout / keep-warm decision. A large gap argues for a longer idle timeout or keep-warm on hot tools; a small gap (e.g. on R2 where misses are cheap) argues for aggressive scale-to-zero.

Configuration

Key | Type / default | Meaning
cache.t1.size | bytes · default 25% RAM | tier-1 page-cache capacity. Larger = higher t1 hit ratio, less executor headroom.
cache.t1.policy | clock | 2q | lru · default clock | tier-1 replacement policy.
lfc.enabled | bool · default true | enable the NVMe LFC. Set false on substrates with no local disk (Workers).
lfc.path | path · e.g. /var/cache/engine/lfc | NVMe directory for LFC slab files. MUST be local, ephemeral storage.
lfc.size | bytes · default 8 GiB | LFC capacity, bounded by free space at lfc.path.
lfc.admission | tinylfu | second_touch | always · default tinylfu | LFC admission policy.
cache.page_size | bytes · default 8192 | page/frame size; MUST match the engine + layer-format page size.
warm.on_start | bool · default true | prefetch metadata + hot-set manifest before serving first query.
warm.manifest_path | floor object key | where the hot-set manifest is persisted/loaded.
cache.target_hit_ratio | ratio · default 0.95 | overall hit-ratio SLO; breaching it raises an alert (capacity signal).
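
As an illustration of how these keys and defaults might be carried in code, here is a hedged sketch. The struct, enum and field names are hypothetical; `cache.t1.size` is modelled as a fraction of RAM because its default is stated that way; `lfc_path` uses the example path from the table rather than a documented default; and `warm.manifest_path` is omitted because no default is given.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum T1Policy { Clock, TwoQ, Lru }

#[derive(Debug, Clone, Copy, PartialEq)]
enum LfcAdmission { TinyLfu, SecondTouch, Always }

/// Cache configuration mirroring the keys above (hypothetical struct).
#[derive(Debug, Clone, PartialEq)]
struct CacheConfig {
    t1_ram_fraction: f64,        // cache.t1.size, expressed as a share of RAM
    t1_policy: T1Policy,         // cache.t1.policy
    lfc_enabled: bool,           // lfc.enabled
    lfc_path: String,            // lfc.path (example value, not a default)
    lfc_size_bytes: u64,         // lfc.size
    lfc_admission: LfcAdmission, // lfc.admission
    page_size: u32,              // cache.page_size
    warm_on_start: bool,         // warm.on_start
    target_hit_ratio: f64,       // cache.target_hit_ratio
}

impl Default for CacheConfig {
    fn default() -> Self {
        CacheConfig {
            t1_ram_fraction: 0.25,
            t1_policy: T1Policy::Clock,
            lfc_enabled: true,
            lfc_path: "/var/cache/engine/lfc".to_string(),
            lfc_size_bytes: 8 * 1024 * 1024 * 1024, // 8 GiB
            lfc_admission: LfcAdmission::TinyLfu,
            page_size: 8192,
            warm_on_start: true,
            target_hit_ratio: 0.95,
        }
    }
}

impl CacheConfig {
    /// Check the constraints the table states; `engine_page_size` is the
    /// page size used by the engine and layer format.
    fn validate(&self, engine_page_size: u32) -> Result<(), String> {
        if self.page_size != engine_page_size {
            return Err(format!(
                "cache.page_size {} does not match engine page size {}",
                self.page_size, engine_page_size
            ));
        }
        if !(self.t1_ram_fraction > 0.0 && self.t1_ram_fraction < 1.0) {
            return Err("cache.t1.size must be a fraction of RAM in (0, 1)".to_string());
        }
        if !(self.target_hit_ratio > 0.0 && self.target_hit_ratio <= 1.0) {
            return Err("cache.target_hit_ratio must be in (0, 1]".to_string());
        }
        if self.lfc_enabled && self.lfc_path.is_empty() {
            return Err("lfc.path is required when lfc.enabled is true".to_string());
        }
        Ok(())
    }
}
```

A Workers-style deployment would override only `lfc_enabled` (and leave the path empty), which this validation accepts.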

Failure modes & edge cases

Failure | Effect | Required handling
NVMe full (LFC out of space) | cannot admit new pages to tier 2 | evict LFC LRU to make room; if still full, skip admission and serve the page from tier 1 / floor. (degraded, never incorrect)
LFC slab corruption (crc32c mismatch) | a tier-2 entry is unreadable | drop the slot, count as a miss, re-fetch from the floor. Corruption in a disposable cache is always recoverable from the durability floor.
NVMe path unwritable / disk failure | LFC cannot operate | disable LFC for the process lifetime, fall through tier 1 → floor, raise an alert. Engine stays correct, just colder.
tier-1 OOM pressure | host runtime starved | respect cache.t1.size as a hard cap; shrink on memory pressure signal before the host OOM-kills the process.
floor read fails on miss | page cannot be served | surface the error to the query (retry policy lives in the object-storage backend); the cache MUST NOT fabricate or stale-serve a page it does not hold.
crash between WAL-durable and put_committed | committed page not yet cached | none needed; re-materialize from floor on next read. The commit is safe in the durable WAL.
cold start after scale-to-zero | both tiers empty | warm-on-start + manifest replay; first reads take the miss path. (expected, not a fault)
scan / one-shot read flood | risk of evicting hot set | tier-1 policy with a scan-resistant upgrade path (CLOCK by default, 2Q if scans pollute) plus the LFC admission filter, which rejects one-touch pages.

Invariant across every failure: the cache is disposable. Any corruption, eviction, or loss is recoverable by re-fetching from the durability floor, which is authoritative. No failure of the cache can ever cause data loss — only added latency.

Metric cards

Target | Metric
≥ 0.95 | overall hit ratio (target SLO)
≥ 0.90 | tier-1 hit ratio (embeddability indicator)
< 1 ms | warm read p99 (tier 1 / LFC)
~100s of ms | cold read p99 (floor, post scale-to-zero)
1 − hit | fraction of reads that become object-store GETs (= egress, off R2)
0 | acked writes ever sourced from cache (durability boundary)

Dependencies / existing pieces to start from

  • SHOULD reuse a proven buffer-pool/CLOCK implementation from the engine-core lineage (libSQL / LeanStore-Umbra buffer manager) rather than writing tier 1 from scratch (source §2.A, §6).
  • SHOULD adopt the LFC concept directly from the Neon local-file-cache design (NVMe spill below shared buffers) — the closest shipping analog to this two-tier shape.
  • MAY lean on RocksDB's local-tier caching underneath the page store if the object-storage backend uses it instead of SlateDB (source §2.C notes RocksDB "if you want local-tier caching underneath").
  • SHOULD use a TinyLFU/W-TinyLFU admission sketch (Caffeine lineage) for the LFC.

Acceptance criteria / definition of done

  • MUST serve a tier-1 hit with zero syscalls and zero network I/O (verified by syscall trace on a warm-loop read).
  • MUST return snapshot-correct versions: a property test with interleaved long readers and a writer shows every reader sees exactly its snapshot's version, never a future one.
  • MUST pass Experiment 4 crash injection: no acked write is ever lost when the process is killed between WAL-durable and put_committed.
  • MUST recover from LFC corruption: a deliberately flipped crc32c byte yields a miss + floor re-fetch, never a served bad page (fault-injection test).
  • MUST export every metric in the metrics list with per-tier granularity.
  • SHOULD meet cache.target_hit_ratio ≥ 0.95 on the representative read-heavy/bursty internal-tool workload after warm-up.
  • SHOULD reduce post-idle warm p99 measurably with warm.on_start = true vs off, quantified by Experiment 5 (warm vs cold CDFs).
  • MUST keep working (correct, slower) with lfc.enabled = false for the WASM/Workers target.

Open questions & risks

  • MAY need a CLOCK-Pro / W-TinyLFU upgrade for tier 1 if HNSW vector traversal (the "nastiest cache case", source §11) pollutes CLOCK — its random pointer-chasing is the adversarial pattern. Open: should the vector index get a dedicated cache partition?
  • MAY want version-chain length limits per page to bound memory under pathologically long-lived snapshots; trade-off between chain GC aggressiveness and forcing live readers to the floor.
  • Open: optimal default split of RAM between tier 1 and host-runtime/executor working memory across substrates (Cloud Run container vs Worker isolate) — needs per-target measurement.
  • Open: should the LFC persist a warm hot-set across graceful restarts on durable-NVMe VMs, and is the complexity worth it vs always warming from the manifest?
  • Risk: on egress-billed stores (S3/GCS), a low hit ratio couples directly to the egress bill — the cache's SLO is partly a cost SLO. R2's zero egress (source §10) structurally de-risks this; non-R2 deployments must watch cache.miss.object_reads as a spend signal.
  • Risk: warming aggressiveness vs cold-start time — over-eager warm-on-start can lengthen the very cold start it aims to soften. Needs the Experiment 5 curve to tune.

Serverless OLTP Engine: internal development specification. Draft, 2026-06-20.