Local Cache · Serverless OLTP Engine Spec

Purpose & scope

The durability floor of this engine is object storage (S3 / R2 / GCS / MinIO), reached through the object-storage backend via get_page(page_id, lsn). A raw object-store GET costs tens-to-hundreds of milliseconds at the p99 tail. If every read paid that cost, the engine would not be embeddable in any meaningful sense — engine_query() over bun:ffi would block for the duration of a network call on the hot path, defeating the entire premise of an in-process library.

The local cache exists to keep that latency off the hot path. It serves the overwhelming majority of reads from process-local memory or local NVMe, and falls through to object storage only on a genuine miss. The source design note states the dependency plainly, and this page treats it as the load-bearing claim:

Embeddability depends on this component

Without a local cache, every read is a network call. The "hot path is in-process function calls" property that makes the engine embeddable (source §3, True embeddability) is produced entirely by this cache. The LocalFileStorage backend avoids the network by construction; the ObjectStorage backend only avoids it because of the cache described here. Treat a regression in hit ratio as a regression in embeddability, not a tuning nuisance.

Scope of this spec: the in-process page cache, the local file cache (LFC) on NVMe, the tiering and admission/eviction policy between them, version-correct serving against MVCC snapshots, the durability boundary the cache must respect, cold-start warming, configuration, failure modes, and acceptance criteria. It is the expansion of source §2.D (Local cache) and the embeddability point in source §3.

Responsibilities & non-goals

Responsibilities

MUST serve reads for resident pages from memory (tier 1) or local NVMe (tier 2) without contacting object storage.
MUST return the page version consistent with the requesting reader's snapshot LSN (see Engine Core MVCC).
MUST treat cached page versions as immutable: a newly committed version is appended as a new entry; it never mutates a cached page in place.
MUST emit per-tier hit-ratio and miss-to-object-storage counters for capacity planning and the benchmark plan.
SHOULD warm proactively on cold start to bound the post-idle p99 (source §4, Cold start; Experiment 5).
SHOULD spill cleanly from memory to NVMe so working sets larger than RAM stay off the network.

Non-goals

MUST NOT participate in the commit-acknowledgement path. The cache is a read accelerator only; durability is owned by the WAL / commit log (source §8, durability rule).
MUST NOT be the source of truth for any page. The object-store floor is authoritative; the cache is reconstructable and disposable.
MAY be empty at any time (after eviction, restart, or scale-to-zero) without affecting correctness — only performance.
Replication, cross-node coherence, and distributed cache invalidation are out of scope: a single engine process owns its caches, and single-writer-per-DB (source §4) means there is no second writer to invalidate against.

Two-tier design

The cache is two tiers backed by one durability floor. Tier 1 is the in-process shared-buffer-style page cache; tier 2 is the local file cache (LFC) on NVMe. Tier 1 absorbs the genuinely hot working set at memory latency; the LFC extends the resident set far past RAM at NVMe latency, so the engine still avoids object storage for warm-but-not-hot pages.

Read fall-through: tier 1 → tier 2 (LFC) → object-storage floor. Only the bottom hop crosses the network and bills egress.

Tier 1 — in-process page cache (shared buffers)

A fixed-size, in-process array of page frames, modelled on PostgreSQL shared buffers / a buffer pool. Pages live in the engine's address space, so a hit is a pointer dereference plus a version-chain walk — no syscall, no copy across a process boundary, no network. This is precisely the property that preserves embeddability: an FFI engine_query() that hits tier 1 never yields the calling thread to I/O.

MUST store pages keyed by (page_id, version), where version is the page's producing LSN (see layer format).
MUST support pinning during read/execute so an in-use frame cannot be evicted mid-operation.
SHOULD hold the hot MVCC version chain for a page, not just the latest version, so concurrent snapshot readers at different LSNs hit tier 1.
SHOULD be sized as a fraction of process RAM (typ. 25–50%), leaving headroom for executor working memory and the host runtime (Bun).

Tier 2 — local file cache (LFC) on NVMe

A larger, on-disk cache backed by a file (or a directory of slab files) on local NVMe. The LFC holds pages that have aged out of tier 1 but are still warm enough to be worth keeping off the network. A tier-2 hit is a local NVMe read (tens of microseconds) — three to four orders of magnitude faster than an object-store GET, and it bills no egress. The LFC is what lets a working set far larger than RAM still avoid the durability floor.

MUST be local, ephemeral storage (instance NVMe / scratch disk), never network-attached, never the durability floor.
MUST store the same versioned pages as tier 1 and verify integrity on read (checksum) before serving.
SHOULD be sized in the GB–tens-of-GB range, far larger than tier 1, bounded by available local disk.
MAY be disabled entirely (lfc.enabled = false) for deployments with no usable local disk (e.g. Cloudflare Workers, see Deployment Targets), in which case tier 1 falls through directly to object storage.

Tiering policy between the tiers

The two tiers form an inclusive-by-default hierarchy with controlled promotion and demotion:

Transition	Trigger	Policy
floor → tier 1	miss serviced by object storage	fetched page is installed in tier 1 and admitted to the LFC if it passes admission (below).
tier 1 → tier 2 (demote)	eviction from tier 1	evicted page is written to the LFC if not already present and admission permits; otherwise dropped.
tier 2 → tier 1 (promote)	tier-1 miss, tier-2 hit	page is read from NVMe into a tier-1 frame and pinned for the read; reference counters updated.
tier 2 → drop	LFC eviction or full	page is discarded; next access falls to the floor. No write-back (pages are immutable and the floor is authoritative).

Because every cached page version is immutable and reconstructable from the floor, there is never a dirty write-back from the LFC: demotion is a copy, eviction is a delete. The only "dirty" data in the system lives in the WAL until it is durable — and that data is governed by the durability boundary, not the cache.

Eviction & admission policy

Tier 1 — replacement

Tier 1 replacement must be low-overhead and should resist scans. CLOCK (second-chance) is the default candidate for its near-zero per-access cost, but like the LRU it approximates, it is not scan-resistant on its own; 2Q or a CLOCK-Pro variant is the upgrade path if scan workloads (e.g. large sequential reads, the HNSW cold traversal in Capabilities) pollute it.

Candidate	Strength	Cost / risk	Use
LRU	simple, good recency	per-access list churn; one big scan evicts the hot set	baseline / reference only
CLOCK	O(1)-ish, lock-light, approximates LRU	weaker than LRU under mixed recency/frequency	default tier 1
2Q	scan-resistant (separates recent vs frequent)	two queues to tune	upgrade if scans pollute
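
A minimal sketch of the CLOCK (second-chance) candidate, assuming a page is identified by a bare u64 and that pinned frames are simply skipped by the sweep. The names ClockCache, Frame and AllPinned are hypothetical and not part of the engine; unpinning and version chains are left out to keep the sweep visible.

```rust
/// Returned when every frame is pinned and no victim can be chosen.
#[derive(Debug, PartialEq)]
struct AllPinned;

struct Frame {
    key: u64,         // hypothetical page identity
    referenced: bool, // CLOCK reference bit, set on a hit
    pins: u32,        // frames with pins > 0 are never evicted
}

struct ClockCache {
    frames: Vec<Option<Frame>>,
    hand: usize,
}

impl ClockCache {
    fn new(capacity: usize) -> Self {
        ClockCache { frames: (0..capacity).map(|_| None).collect(), hand: 0 }
    }

    /// Record a hit: set the reference bit so the frame survives one sweep.
    fn touch(&mut self, key: u64) -> bool {
        for frame in self.frames.iter_mut().flatten() {
            if frame.key == key {
                frame.referenced = true;
                return true;
            }
        }
        false
    }

    /// Pin a resident frame so the sweep skips it.
    fn pin(&mut self, key: u64) -> bool {
        for frame in self.frames.iter_mut().flatten() {
            if frame.key == key {
                frame.pins += 1;
                return true;
            }
        }
        false
    }

    /// Install `key`, returning the evicted key if a victim was needed.
    fn install(&mut self, key: u64) -> Result<Option<u64>, AllPinned> {
        let n = self.frames.len();
        // Two sweeps suffice: the first may only clear reference bits.
        for _ in 0..(2 * n) {
            let slot = self.hand;
            self.hand = (self.hand + 1) % n;
            let evicted = match &mut self.frames[slot] {
                None => None,
                Some(f) if f.pins > 0 => continue,
                Some(f) if f.referenced => {
                    f.referenced = false; // second chance
                    continue;
                }
                Some(f) => Some(f.key),
            };
            self.frames[slot] = Some(Frame { key, referenced: false, pins: 0 });
            return Ok(evicted);
        }
        Err(AllPinned)
    }
}
```

A frame that was touched since the last sweep loses its reference bit instead of its slot, which is where the per-access cost stays near zero: a hit only sets one bit.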

LFC — admission

The LFC is admission-gated, not write-everything. Writing every evicted or fetched page to NVMe burns write endurance and lets one-shot pages (full-table scans, the analytics export path) evict the warm set. Admission policy decides whether a page earns a slot:

SHOULD admit on second access (a one-touch page is not yet proven warm); track a lightweight frequency sketch / ghost queue to detect re-reference.
SHOULD apply a frequency-based admission filter (TinyLFU-style) so a high-frequency incoming page can displace a low-frequency resident one, and a one-shot scan page is rejected.
MAY bypass admission for pages explicitly fetched by a warming pass (they are admitted because warming chose them).
MUST NOT let an admission decision block or slow a read: admission is asynchronous to the read path.
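
The second-access rule above can be sketched with a bounded ghost queue. This illustrates the simpler second_touch policy, not TinyLFU; the type SecondTouchAdmission, the u64 page identity and the ghost capacity are assumptions made for the example.

```rust
use std::collections::{HashSet, VecDeque};

/// Remembers recently seen page ids without storing their bytes.
struct SecondTouchAdmission {
    ghost: HashSet<u64>,
    order: VecDeque<u64>, // oldest ghost entry at the front
    capacity: usize,
}

impl SecondTouchAdmission {
    fn new(capacity: usize) -> Self {
        SecondTouchAdmission { ghost: HashSet::new(), order: VecDeque::new(), capacity }
    }

    /// Returns true when the page has been seen before and should get an LFC slot.
    fn should_admit(&mut self, page_id: u64) -> bool {
        if self.ghost.remove(&page_id) {
            // Re-referenced while still remembered: proven warm.
            self.order.retain(|p| *p != page_id);
            return true;
        }
        if self.capacity == 0 {
            return false;
        }
        if self.order.len() == self.capacity {
            // Forget the oldest one-touch page to stay bounded.
            if let Some(old) = self.order.pop_front() {
                self.ghost.remove(&old);
            }
        }
        self.order.push_back(page_id);
        self.ghost.insert(page_id);
        false
    }
}
```

A one-shot scan only ever produces first touches, so it fills and cycles the ghost queue without writing a single page to NVMe. In a real read path this decision would run off the calling thread, as the last rule in the list requires.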

Metrics (mandatory)

These counters are not optional telemetry; they are the inputs to capacity planning and to benchmark Experiment 5 (cold vs warm reads).

cache.t1.hit_ratio: tier-1 hits / total reads, rolling window. The headline embeddability indicator.
cache.t2.hit_ratio: LFC hits / (reads that missed tier 1). Measures the LFC's contribution past RAM.
cache.overall_hit_ratio: (t1 + t2 hits) / total reads. 1 − this = fraction of reads that became object-store GETs.
cache.miss.object_reads: count of reads that fell through to object storage (each is a network round-trip and, except on R2, billable egress — see Deployment Targets).
cache.miss.object_read_latency: p50/p99/p999 of the floor read, captured as a histogram (never mean-only).
cache.t1.evictions / cache.t2.evictions: eviction rate per tier; a rising t1 eviction rate with flat hit ratio signals undersized tier 1.
cache.t2.admit / cache.t2.reject: admission accept/reject counts; high reject is healthy (scan resistance working).
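
As a worked example of the three ratio definitions, the sketch below derives them from raw counters. It assumes cumulative counters rather than the rolling window the spec calls for, and the Counters and HitRatios types are hypothetical.

```rust
#[derive(Debug, Clone, Copy)]
struct Counters {
    t1_hits: u64,
    t2_hits: u64,
    misses: u64, // reads that reached object storage
}

#[derive(Debug, Clone, Copy)]
struct HitRatios {
    t1: f64,      // tier-1 hits / total reads
    t2: f64,      // LFC hits / reads that missed tier 1
    overall: f64, // (t1 + t2 hits) / total reads
}

/// Returns None when no reads have been recorded yet.
fn hit_ratios(c: Counters) -> Option<HitRatios> {
    let total = c.t1_hits + c.t2_hits + c.misses;
    if total == 0 {
        return None;
    }
    let past_t1 = c.t2_hits + c.misses;
    let t2 = if past_t1 == 0 { 0.0 } else { c.t2_hits as f64 / past_t1 as f64 };
    Some(HitRatios {
        t1: c.t1_hits as f64 / total as f64,
        t2,
        overall: (c.t1_hits + c.t2_hits) as f64 / total as f64,
    })
}
```

With 900 tier-1 hits, 80 LFC hits and 20 misses, the ratios are 0.90, 0.80 and 0.98, so 2% of reads became object-store GETs.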

Version correctness

The cache is not a key→bytes map; it is a versioned store, and correctness under MVCC is non-negotiable. Pages are versioned by LSN. A reader executes against a fixed snapshot LSN (see Engine Core, snapshot isolation via LSN-stamped versions). When that reader asks the cache for a page, the cache MUST return the version of that page that is visible at the reader's snapshot LSN — i.e. the newest version whose producing LSN is ≤ the snapshot LSN — not merely "the latest version present."

MUST resolve get_page(page_id, snapshot_lsn) to the page version with the greatest producing LSN ≤ snapshot_lsn.
MUST treat each cached page version as immutable. A newly committed version is appended to the page's version chain (or installed as a new keyed entry); it MUST NOT overwrite the bytes of an existing cached version.
MUST NOT serve a version with producing LSN > snapshot_lsn to that reader (that would expose a future write — a snapshot-isolation violation).
MUST, on a chain miss (no resident version satisfies the snapshot), fall through to object storage with the snapshot LSN and let the layer format reconstruct the correct version.
SHOULD retain older versions still referenced by live snapshots and release a version once no live snapshot is older than its successor's LSN (cache version GC tracks the engine's oldest live snapshot, mirroring MVCC vacuum).

This "append, never mutate" rule is the cache-level reflection of the storage format itself: the page store is built from immutable delta and image layers keyed by (key, LSN). The cache mirrors that immutability so a cache hit and a floor read are indistinguishable in result — same version selection, same bytes.
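
The selection rule (greatest producing LSN ≤ snapshot LSN) and the append-only rule can be sketched with an ordered map per page. VersionChain is a hypothetical type for illustration; the real cache keys frames by (page_id, lsn) and this sketch models only one page's chain, with Lsn assumed to be a u64.

```rust
use std::collections::BTreeMap;

type Lsn = u64;

/// Version chain for a single page: producing LSN -> immutable page bytes.
struct VersionChain {
    versions: BTreeMap<Lsn, Vec<u8>>,
}

impl VersionChain {
    fn new() -> Self {
        VersionChain { versions: BTreeMap::new() }
    }

    /// Append a committed version. Returns false, leaving the existing
    /// bytes untouched, if a version with this LSN is already cached.
    fn append(&mut self, lsn: Lsn, bytes: Vec<u8>) -> bool {
        if self.versions.contains_key(&lsn) {
            return false;
        }
        self.versions.insert(lsn, bytes);
        true
    }

    /// Newest resident version whose producing LSN is <= snapshot_lsn.
    fn visible_at(&self, snapshot_lsn: Lsn) -> Option<(Lsn, &[u8])> {
        self.versions
            .range(..=snapshot_lsn)
            .next_back()
            .map(|(lsn, bytes)| (*lsn, bytes.as_slice()))
    }
}
```

A reader at snapshot 15 against versions 10 and 20 gets version 10; a reader at 20 or later gets version 20; a reader at 5 gets nothing from the chain and falls through to the floor.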

Invalidation on commit

Because there is a single writer per database (source §4), invalidation is local and simple: when the writer commits at LSN L, the pages it modified gain a new version L. The cache appends those versions; it does not evict the prior versions while live snapshots still reference them. There is no cross-process coherence problem to solve — no second writer can produce a version this cache does not already know about.

Why append-not-mutate matters for correctness

A long-running reader holding snapshot LSN S and a writer committing at LSN L > S coexist. If the cache mutated the page in place at commit, the reader would suddenly see L's bytes through its S snapshot: a non-repeatable read that violates snapshot isolation. Appending a new version and selecting by snapshot LSN keeps the reader at S, and any later reader at or above L, correct from the same cache.

Interfaces / data formats

The cache sits between the engine's buffer-access layer and the Storage trait. It implements the read path of that trait's get_page by consulting tiers before delegating the miss to the underlying ObjectStorage implementation.

/// Identity of a cached page version. The (page_id, lsn) pair is the cache key;
/// `lsn` is the LSN at which this version of the page was produced.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct PageKey { pub page_id: PageId, pub lsn: Lsn }

pub trait PageCache: Send + Sync {
    /// Return the page version visible at `snapshot_lsn`: the resident version
    /// with the greatest producing LSN <= snapshot_lsn. On miss, fetch from the
    /// underlying Storage at snapshot_lsn, install, and return. Pins until dropped.
    fn get(&self, page_id: PageId, snapshot_lsn: Lsn) -> Result<PinnedPage>;

    /// Append a freshly committed version (producing LSN = `lsn`) into tier 1.
    /// MUST NOT mutate any existing cached version of `page_id`.
    fn put_committed(&self, page_id: PageId, lsn: Lsn, page: Page);

    /// Best-effort proactive load of `keys` into the cache (warming). Non-blocking
    /// w.r.t. the read path; admission is bypassed (warming is the admission signal).
    fn warm(&self, keys: &[PageKey]);

    /// Release versions no live snapshot can observe (oldest_live_snapshot driven).
    fn gc_versions(&self, oldest_live_snapshot: Lsn);

    /// Snapshot of per-tier counters for metrics export.
    fn stats(&self) -> CacheStats;
}

pub struct CacheStats {
    pub t1_hits: u64,  pub t2_hits: u64,  pub misses: u64,   // misses == object-store reads
    pub t1_evictions: u64, pub t2_evictions: u64,
    pub t2_admits: u64,    pub t2_rejects: u64,
    pub object_read_latency_us: Histogram,                    // p50/p99/p999
}

The on-NVMe LFC entry format is fixed-layout so an entry can be validated before it is trusted:

LFC slab entry
┌────────────┬──────────────┬──────────┬─────────────┬────────────────────────┐
│ magic (4B) │ page_id (8B) │ lsn (8B) │ crc32c (4B) │ page bytes             │
└────────────┴──────────────┴──────────┴─────────────┴────────────────────────┘
  fixed                                 of page bytes  PAGE_SIZE (e.g. 8 KiB)
- magic mismatch  -> treat slot as empty
- crc32c mismatch -> corruption: drop slot, count as miss, re-fetch from floor
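
A sketch of encoding and validating one slab entry in the layout above. The magic value, the little-endian field encoding and the names encode_entry, decode_entry and SlotRead are assumptions for illustration; the spec fixes only the field order and widths. The CRC-32C here is a plain bitwise implementation, where a production build would more likely use a hardware-accelerated one.

```rust
const MAGIC: u32 = 0x4C46_4331; // hypothetical marker value
const HEADER_LEN: usize = 24;   // 4 (magic) + 8 (page_id) + 8 (lsn) + 4 (crc32c)

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc: u32 = !0;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

#[derive(Debug, PartialEq)]
enum SlotRead {
    Empty,   // magic mismatch: treat the slot as unused
    Corrupt, // checksum mismatch: drop the slot and re-fetch from the floor
    Valid { page_id: u64, lsn: u64, page: Vec<u8> },
}

fn encode_entry(page_id: u64, lsn: u64, page: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + page.len());
    out.extend_from_slice(&MAGIC.to_le_bytes());
    out.extend_from_slice(&page_id.to_le_bytes());
    out.extend_from_slice(&lsn.to_le_bytes());
    out.extend_from_slice(&crc32c(page).to_le_bytes());
    out.extend_from_slice(page);
    out
}

fn read_u32(buf: &[u8], at: usize) -> u32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    u32::from_le_bytes(b)
}

fn read_u64(buf: &[u8], at: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(b)
}

fn decode_entry(slot: &[u8]) -> SlotRead {
    if slot.len() < HEADER_LEN || read_u32(slot, 0) != MAGIC {
        return SlotRead::Empty;
    }
    let page = &slot[HEADER_LEN..];
    if crc32c(page) != read_u32(slot, 20) {
        return SlotRead::Corrupt;
    }
    SlotRead::Valid { page_id: read_u64(slot, 4), lsn: read_u64(slot, 12), page: page.to_vec() }
}
```

As drawn, the checksum covers only the page bytes, so a damaged page_id or lsn field would not be caught by it; whether the header needs its own protection is left to the format owner.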

Behaviour & algorithms

Read path (the hot path)

get(page_id, snapshot_lsn):
  1. TIER 1: walk version chain for page_id.
       resident version v with max(v.lsn) <= snapshot_lsn ?
         -> HIT: pin frame, bump CLOCK ref bit, t1_hits++, RETURN          (memory latency)
  2. TIER 2 (LFC): look up satisfying (page_id, lsn) slot, verify crc32c.
       valid ?
         -> HIT: read NVMe -> install into tier 1 frame, pin, t2_hits++, RETURN  (NVMe latency)
  3. MISS: storage.get_page(page_id, snapshot_lsn)   // network round-trip to floor
       install version into tier 1; async-admit to LFC if admission passes
       misses++ ; record object_read_latency ; RETURN                     (object-store latency)

Three-stop fall-through. Steps 1–2 are local; only step 3 crosses the network and bills egress.
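
The fall-through can be sketched with both tiers modelled as in-memory maps and the floor as a closure. TwoTier, Source and newest_visible are hypothetical names; pinning, checksums and LFC admission are omitted. The sketch also assumes, as the pseudocode does, that a resident version satisfying the snapshot is the correct one, i.e. that no newer version at or below the snapshot is missing from the resident chain.

```rust
use std::collections::{BTreeMap, HashMap};

type PageId = u64;
type Lsn = u64;
type Versions = HashMap<PageId, BTreeMap<Lsn, Vec<u8>>>;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Source {
    Tier1,
    Tier2,
    Floor,
}

#[derive(Default)]
struct TwoTier {
    t1: Versions,
    t2: Versions,
    t1_hits: u64,
    t2_hits: u64,
    misses: u64, // counts every read that reached the floor
}

/// Newest version in `tier` with producing LSN <= snapshot.
fn newest_visible(tier: &Versions, page_id: PageId, snapshot: Lsn) -> Option<(Lsn, Vec<u8>)> {
    tier.get(&page_id)?
        .range(..=snapshot)
        .next_back()
        .map(|(lsn, bytes)| (*lsn, bytes.clone()))
}

impl TwoTier {
    fn get<F>(&mut self, page_id: PageId, snapshot: Lsn, floor: F) -> Option<(Source, Lsn, Vec<u8>)>
    where
        F: FnOnce(PageId, Lsn) -> Option<(Lsn, Vec<u8>)>,
    {
        // 1. Tier 1: memory.
        if let Some((lsn, page)) = newest_visible(&self.t1, page_id, snapshot) {
            self.t1_hits += 1;
            return Some((Source::Tier1, lsn, page));
        }
        // 2. Tier 2: promote into tier 1 on a hit.
        if let Some((lsn, page)) = newest_visible(&self.t2, page_id, snapshot) {
            self.t2_hits += 1;
            self.t1.entry(page_id).or_default().insert(lsn, page.clone());
            return Some((Source::Tier2, lsn, page));
        }
        // 3. Floor: the only step that would cross the network.
        self.misses += 1;
        let (lsn, page) = floor(page_id, snapshot)?;
        self.t1.entry(page_id).or_default().insert(lsn, page.clone());
        Some((Source::Floor, lsn, page))
    }
}
```

A second read of the same page at the same snapshot is served from tier 1 regardless of which tier served the first one.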

Commit / invalidation path (NOT the hot path)

On a successful commit at LSN L — and only after the WAL is durable (see below) — the writer calls put_committed(page_id, L, page) for each modified page. The cache appends version L to tier 1's chain. Prior versions remain for live snapshots; gc_versions(oldest_live_snapshot) reclaims them later. No NVMe write-back occurs at commit: the LFC learns the new version lazily, on demotion.

Version GC

The engine reports its oldest live snapshot LSN (the same value MVCC vacuum uses). The cache drops any version v of a page where a newer resident version v′ exists with v′.lsn ≤ oldest_live_snapshot — i.e. no live reader can ever select v again. This keeps version chains bounded under a long-running-snapshot workload.
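
The GC rule reduces to: keep the newest version at or below the oldest live snapshot, plus everything newer, and drop the rest. A minimal sketch for one page's chain, assuming the chain is an ordered map from producing LSN to bytes (gc_chain is a hypothetical helper, not the gc_versions trait method itself):

```rust
use std::collections::BTreeMap;

type Lsn = u64;

/// Drops every version that no live snapshot can select and returns how
/// many were released. A version v is dead once a newer version v' exists
/// with v'.lsn <= oldest_live_snapshot.
fn gc_chain(chain: &mut BTreeMap<Lsn, Vec<u8>>, oldest_live_snapshot: Lsn) -> usize {
    // The newest version visible to the oldest snapshot must survive.
    let keep_from = match chain.range(..=oldest_live_snapshot).next_back() {
        Some((lsn, _)) => *lsn,
        None => return 0, // nothing at or below the snapshot: nothing is dead
    };
    // split_off keeps keys < keep_from in `chain` and returns keys >= keep_from.
    let kept = chain.split_off(&keep_from);
    let dropped = chain.len();
    *chain = kept;
    dropped
}
```

For versions at LSNs 10, 20, 30 and 40 with the oldest live snapshot at 35, versions 10 and 20 are released; 30 stays because the snapshot at 35 still selects it.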

The durability boundary (critical callout)

Cache READ latency, never COMMIT latency

The cache hides read latency completely. It MUST NEVER hide commit latency. A commit MUST NOT be acknowledged to the client until its WAL records are durably stored on the commit-log floor (the S3 conditional-write / CAS append, source §2.C). Acknowledging a commit from an in-memory buffer before the WAL is durable produces acked-write loss: the client is told "committed," the process crashes, and the write is gone. This is disqualifying regardless of how good the latency numbers look (source §8, durability rule; Experiment 4).

This is the single sharpest line in the whole component, and it follows directly from the W1/W2 distinction in the source:

MUST NOT source a commit acknowledgement from any cache tier. The ack is owned exclusively by the WAL/commit-log durability signal.
MUST NOT reorder or shortcut the WAL-durable → ack → put_committed sequence. The page enters the cache after durability, never before.
MUST tolerate a crash between WAL-durable and put_committed: on restart the page is simply re-fetched/re-materialized from the floor; the committed data is safe because it is in the durable WAL, not because it was in the cache.
MUST keep the cache strictly off the W1 (commit-latency) axis — caching is a W1 non-mitigation. Source §8 is explicit: "cache cannot hide commit latency without breaking durability." Group commit beats W1; the cache only beats reads.
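
The required ordering can be illustrated with a toy commit function that records what happened and in which order. FakeLog stands in for the commit-log append and is purely hypothetical, as are Event and commit; the point is only that the acknowledgement is derived from the log's result and the cache is populated afterwards.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Event {
    WalDurable,
    Acked,
    Cached,
}

/// Stand-in for the durable commit log; `fail` simulates a rejected append.
struct FakeLog {
    fail: bool,
}

impl FakeLog {
    fn append_durable(&self, _lsn: u64) -> Result<(), &'static str> {
        if self.fail { Err("commit-log append failed") } else { Ok(()) }
    }
}

/// Commits `pages` at `lsn`. The cache (a list of (page_id, lsn) pairs here)
/// is touched only after the log append has succeeded and the ack was issued.
fn commit(
    log: &FakeLog,
    lsn: u64,
    pages: &[u64],
    cache: &mut Vec<(u64, u64)>,
    trace: &mut Vec<Event>,
) -> Result<(), &'static str> {
    log.append_durable(lsn)?; // on failure: no ack, no cache entry
    trace.push(Event::WalDurable);
    trace.push(Event::Acked); // ack sourced from the log result only
    for &page_id in pages {
        cache.push((page_id, lsn)); // put_committed equivalent
    }
    trace.push(Event::Cached);
    Ok(())
}
```

If the process dies after Acked and before Cached, the cache is merely missing a version it can re-fetch; nothing acknowledged depends on it.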

Restate the three levers, so nobody confuses them

The cache addresses read latency. Group commit addresses commit latency (W1). Many-small-DBs / sharding addresses write serialization (W2). They are three different levers for three different problems; the cache is exactly one of them and stays in its lane.

Cold start & warming

Scale-to-zero (source §3) means the engine process is torn down when idle, so caches start cold after every idle period. The first request after idle pays process-start plus cache-warm, and a cold read goes all the way to the floor at object-storage latency (and, off R2, egress cost). The lifecycle controller drives the cold/idle transitions; Deployment Targets sets the substrate-specific cold-start cost (container spin-up vs Worker isolate).

Warming strategies

SHOULD warm on start (warm.on_start = true) by prefetching the root/metadata pages and a recorded hot-set manifest before serving the first query, bounding the post-idle p99.
SHOULD persist a small hot-set manifest (the most-referenced PageKeys) to the floor on graceful shutdown and replay it via warm() on next start.
MAY keep a warm LFC across restarts when the local NVMe outlives the engine process (e.g. a long-lived VM that restarts the engine without losing the scratch disk); on ephemeral substrates the LFC starts cold like everything else.
MAY support keep-warm: the controller issues a periodic lightweight ping to defer scale-to-zero and keep caches resident for latency-sensitive tools (source §4, "keep-warm ping").
MUST NOT let warming block correctness or commits — warming is best-effort and asynchronous; a query that races warming simply takes the normal miss path.
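
One way to build the hot-set manifest is to keep a reference count per page version and persist the top N at shutdown. The sketch below assumes such counts are available; build_manifest and the tuple form of PageKey are hypothetical, and serialising the result to the floor is left out.

```rust
use std::collections::HashMap;

/// (page_id, lsn), mirroring the PageKey cache key.
type PageKey = (u64, u64);

/// Returns up to `limit` keys, most-referenced first. Ties are broken by
/// key so the manifest is deterministic across runs.
fn build_manifest(ref_counts: &HashMap<PageKey, u64>, limit: usize) -> Vec<PageKey> {
    let mut entries: Vec<(PageKey, u64)> =
        ref_counts.iter().map(|(key, count)| (*key, *count)).collect();
    entries.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(&b.0)));
    entries.into_iter().take(limit).map(|(key, _)| key).collect()
}
```

On the next start the resulting list would be handed to warm(); keeping the limit small bounds both the manifest size and the time warming adds to cold start.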

Tuning hook

The cold-vs-warm p99 gap measured by Experiment 5 is the input to the idle-timeout / keep-warm decision. A large gap argues for a longer idle timeout or keep-warm on hot tools; a small gap (e.g. on R2 where misses are cheap) argues for aggressive scale-to-zero.

Configuration

Key	Type / default	Meaning
`cache.t1.size`	bytes · default 25% RAM	tier-1 page-cache capacity. Larger = higher t1 hit ratio, less executor headroom.
`cache.t1.policy`	`clock` \| `2q` \| `lru` · default `clock`	tier-1 replacement policy.
`lfc.enabled`	bool · default `true`	enable the NVMe LFC. Set false on substrates with no local disk (Workers).
`lfc.path`	path · e.g. `/var/cache/engine/lfc`	NVMe directory for LFC slab files. MUST be local, ephemeral storage.
`lfc.size`	bytes · default 8 GiB	LFC capacity, bounded by free space at `lfc.path`.
`lfc.admission`	`tinylfu` \| `second_touch` \| `always` · default `tinylfu`	LFC admission policy.
`cache.page_size`	bytes · default 8192	page/frame size; MUST match the engine + layer-format page size.
`warm.on_start`	bool · default `true`	prefetch metadata + hot-set manifest before serving first query.
`warm.manifest_path`	floor object key	where the hot-set manifest is persisted/loaded.
`cache.target_hit_ratio`	ratio · default 0.95	overall hit-ratio SLO; breaching it raises an alert (capacity signal).
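
The defaults in the table could be carried in a config struct along these lines. This is a sketch only: CacheConfig, its field names and defaults_for are hypothetical, and lfc.path and warm.manifest_path are omitted because the table gives no default for them.

```rust
const GIB: u64 = 1 << 30;

#[derive(Debug, Clone, Copy, PartialEq)]
enum T1Policy { Clock, TwoQ, Lru }

#[derive(Debug, Clone, Copy, PartialEq)]
enum LfcAdmission { TinyLfu, SecondTouch, Always }

#[derive(Debug, Clone, PartialEq)]
struct CacheConfig {
    t1_size: u64,              // cache.t1.size, bytes
    t1_policy: T1Policy,       // cache.t1.policy
    lfc_enabled: bool,         // lfc.enabled
    lfc_size: u64,             // lfc.size, bytes
    lfc_admission: LfcAdmission,
    page_size: u64,            // cache.page_size, bytes
    warm_on_start: bool,       // warm.on_start
    target_hit_ratio: f64,     // cache.target_hit_ratio
}

impl CacheConfig {
    /// Table defaults; tier 1 takes 25% of the RAM reported by the caller.
    fn defaults_for(total_ram_bytes: u64) -> Self {
        CacheConfig {
            t1_size: total_ram_bytes / 4,
            t1_policy: T1Policy::Clock,
            lfc_enabled: true,
            lfc_size: 8 * GIB,
            lfc_admission: LfcAdmission::TinyLfu,
            page_size: 8192,
            warm_on_start: true,
            target_hit_ratio: 0.95,
        }
    }

    /// Rejects a page size that differs from the engine's, and a ratio outside [0, 1].
    fn validate(&self, engine_page_size: u64) -> Result<(), String> {
        if self.page_size != engine_page_size {
            return Err(format!(
                "cache.page_size {} != engine page size {}",
                self.page_size, engine_page_size
            ));
        }
        if !(0.0..=1.0).contains(&self.target_hit_ratio) {
            return Err("cache.target_hit_ratio must be within [0, 1]".to_string());
        }
        Ok(())
    }
}
```

On a 16 GiB host this yields a 4 GiB tier 1 and an 8 GiB LFC; a Workers-style deployment would flip lfc_enabled to false after taking the defaults.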

Failure modes & edge cases

Failure	Effect	Required handling
NVMe full (LFC out of space)	cannot admit new pages to tier 2	evict LFC LRU to make room; if still full, skip admission and serve the page from tier 1 / floor. Degraded, never incorrect.
LFC slab corruption (crc32c mismatch)	a tier-2 entry is unreadable	drop the slot, count as a miss, re-fetch from the floor. Corruption in a disposable cache is always recoverable from the durability floor.
NVMe path unwritable / disk failure	LFC cannot operate	disable LFC for the process lifetime, fall through tier 1 → floor, raise an alert. Engine stays correct, just colder.
tier-1 OOM pressure	host runtime starved	respect `cache.t1.size` as a hard cap; shrink on memory pressure signal before the host OOM-kills the process.
floor read fails on miss	page cannot be served	surface the error to the query (retry policy lives in the object-storage backend); the cache MUST NOT fabricate or stale-serve a page it does not hold.
crash between WAL-durable and `put_committed`	committed page not yet cached	none needed — re-materialize from floor on next read; the commit is safe in the durable WAL.
cold start after scale-to-zero	both tiers empty	warm-on-start + manifest replay; first reads take the miss path. Expected, not a fault.
scan / one-shot read flood	risk of evicting hot set	scan-resistant tier-1 policy (2Q / CLOCK-Pro where plain CLOCK is polluted) plus the LFC admission filter, which rejects one-touch pages.

Invariant across every failure: the cache is disposable. Any corruption, eviction, or loss is recoverable by re-fetching from the durability floor, which is authoritative. No failure of the cache can ever cause data loss — only added latency.

Metric cards

Target	Metric
≥ 0.95	overall hit ratio (target SLO)
≥ 0.90	tier-1 hit ratio (embeddability indicator)
< 1 ms	warm read p99 (tier 1 / LFC)
~100s of ms	cold read p99 (floor, post scale-to-zero)
1 − hit	fraction of reads that become object-store GETs (= egress, off R2)
0	acked writes ever sourced from cache (durability boundary)

Dependencies / existing pieces to start from

SHOULD reuse a proven buffer-pool/CLOCK implementation from the engine-core lineage (libSQL / LeanStore-Umbra buffer manager) rather than writing tier 1 from scratch (source §2.A, §6).
SHOULD adopt the LFC concept directly from the Neon local-file-cache design (NVMe spill below shared buffers) — the closest shipping analog to this two-tier shape.
MAY lean on RocksDB's local-tier caching underneath the page store if the object-storage backend uses it instead of SlateDB (source §2.C notes RocksDB "if you want local-tier caching underneath").
SHOULD use a TinyLFU/W-TinyLFU admission sketch (Caffeine lineage) for the LFC.

Acceptance criteria / definition of done

MUST serve a tier-1 hit with zero syscalls and zero network I/O (verified by syscall trace on a warm-loop read).
MUST return snapshot-correct versions: a property test with interleaved long readers and a writer shows every reader sees exactly its snapshot's version, never a future one.
MUST pass Experiment 4 crash injection: no acked write is ever lost when the process is killed between WAL-durable and put_committed.
MUST recover from LFC corruption: a deliberately flipped crc32c byte yields a miss + floor re-fetch, never a served bad page (fault-injection test).
MUST export every metric in the metrics list with per-tier granularity.
SHOULD meet cache.target_hit_ratio ≥ 0.95 on the representative read-heavy/bursty internal-tool workload after warm-up.
SHOULD reduce post-idle warm p99 measurably with warm.on_start = true vs off, quantified by Experiment 5 (warm vs cold CDFs).
MUST keep working (correct, slower) with lfc.enabled = false for the WASM/Workers target.

Open questions & risks

MAY need a CLOCK-Pro / W-TinyLFU upgrade for tier 1 if HNSW vector traversal (the "nastiest cache case", source §11) pollutes CLOCK — its random pointer-chasing is the adversarial pattern. Open: should the vector index get a dedicated cache partition?
MAY want version-chain length limits per page to bound memory under pathologically long-lived snapshots; trade-off between chain GC aggressiveness and forcing live readers to the floor.
Open: optimal default split of RAM between tier 1 and host-runtime/executor working memory across substrates (Cloud Run container vs Worker isolate) — needs per-target measurement.
Open: should the LFC persist a warm hot-set across graceful restarts on durable-NVMe VMs, and is the complexity worth it vs always warming from the manifest?
Risk: on egress-billed stores (S3/GCS), a low hit ratio couples directly to the egress bill — the cache's SLO is partly a cost SLO. R2's zero egress (source §10) structurally de-risks this; non-R2 deployments must watch cache.miss.object_reads as a spend signal.
Risk: warming aggressiveness vs cold-start time — over-eager warm-on-start can lengthen the very cold start it aims to soften. Needs the Experiment 5 curve to tune.

Related specifications

Code	Specification	Relevance
ENG	Engine Core	MVCC snapshot LSNs and version visibility the cache must honour.
STOR	Storage Interface	the trait whose read path the cache fronts before the floor.
OBJ	Object-Storage Backend	immutable LSN-versioned layer format the cache mirrors and falls through to.
CTL	Lifecycle & Controller	scale-to-zero and keep-warm that drive cold/warm cache transitions.
BENCH	Benchmark & Validation Plan	Experiment 4 (acked-write durability) and Experiment 5 (cold vs warm reads).
DEPLOY	Deployment Targets	substrate cold-start cost and why R2 zero-egress de-risks misses.
CAP	Capabilities: Build-in vs Compose	HNSW vector search, the hardest cache case, leaning hardest on tier 1.