Local Cache
The two-tier local cache is the component that makes an object-storage-backed engine feel embedded: it keeps hot pages in-process so the common read is a memory access, not a hundreds-of-millisecond network round-trip — and it is forbidden, on pain of acked-write loss, from ever extending that same speedup to the commit path.
Purpose & scope
The durability floor of this engine is object storage (S3 / R2 / GCS / MinIO), reached through the object-storage backend via get_page(page_id, lsn). A raw object-store GET costs tens-to-hundreds of milliseconds at the p99 tail. If every read paid that cost, the engine would not be embeddable in any meaningful sense — engine_query() over bun:ffi would block for the duration of a network call on the hot path, defeating the entire premise of an in-process library.
The local cache exists to keep that latency off the hot path. It serves the overwhelming majority of reads from process-local memory or local NVMe, and falls through to object storage only on a genuine miss. The source design note states the dependency plainly, and this page treats it as the load-bearing claim:
Embeddability depends on this component
Without a local cache, every read is a network call. The "hot path is in-process function calls" property that makes the engine embeddable (source §3, True embeddability) is produced entirely by this cache. The LocalFileStorage backend avoids the network by construction; the ObjectStorage backend only avoids it because of the cache described here. Treat a regression in hit ratio as a regression in embeddability, not a tuning nuisance.
Scope of this spec: the in-process page cache, the local file cache (LFC) on NVMe, the tiering and admission/eviction policy between them, version-correct serving against MVCC snapshots, the durability boundary the cache must respect, cold-start warming, configuration, failure modes, and acceptance criteria. It is the expansion of source §2.D (Local cache) and the embeddability point in source §3.
Responsibilities & non-goals
Responsibilities
- MUST serve reads for resident pages from memory (tier 1) or local NVMe (tier 2) without contacting object storage.
- MUST return the page version consistent with the requesting reader's snapshot LSN (see Engine Core MVCC).
- MUST treat cached page versions as immutable: a new committed version appends/invalidates, it never mutates a cached page in place.
- MUST emit per-tier hit-ratio and miss-to-object-storage counters for capacity planning and the benchmark plan.
- SHOULD warm proactively on cold start to bound the post-idle p99 (source §4, Cold start; Experiment 5).
- SHOULD spill cleanly from memory to NVMe so working sets larger than RAM stay off the network.
Non-goals
- MUST NOT participate in the commit-acknowledgement path. The cache is a read accelerator only; durability is owned by the WAL / commit log (source §8, durability rule).
- MUST NOT be the source of truth for any page. The object-store floor is authoritative; the cache is reconstructable and disposable.
- MAY be empty at any time (after eviction, restart, or scale-to-zero) without affecting correctness — only performance.
- Replication, cross-node coherence, and distributed cache invalidation are out of scope: a single engine process owns its caches, and single-writer-per-DB (source §4) means there is no second writer to invalidate against.
Two-tier design
The cache is two tiers backed by one durability floor. Tier 1 is the in-process shared-buffer-style page cache; tier 2 is the local file cache (LFC) on NVMe. Tier 1 absorbs the genuinely hot working set at memory latency; the LFC extends the resident set far past RAM at NVMe latency, so the engine still avoids object storage for warm-but-not-hot pages.
Read fall-through: tier 1 → tier 2 (LFC) → object-storage floor. Only the bottom hop crosses the network and bills egress.
Tier 1 — in-process page cache (shared buffers)
A fixed-size, in-process array of page frames, modelled on PostgreSQL shared buffers / a buffer pool. Pages live in the engine's address space, so a hit is a pointer dereference plus a version-chain walk — no syscall, no copy across a process boundary, no network. This is precisely the property that preserves embeddability: an FFI engine_query() that hits tier 1 never yields the calling thread to I/O.
- MUST store pages keyed by (page_id, version), where version is the page's producing LSN (see layer format).
- MUST support pinning during read/execute so an in-use frame cannot be evicted mid-operation.
- SHOULD hold the hot MVCC version chain for a page, not just the latest version, so concurrent snapshot readers at different LSNs hit tier 1.
- SHOULD be sized as a fraction of process RAM (typ. 25–50%), leaving headroom for executor working memory and the host runtime (Bun).
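The keying and pinning rules above can be sketched as follows. This is a minimal illustration, not the engine's buffer manager: `Frame`, `Tier1`, `pin`, and `try_evict` are hypothetical names, `PageId` and `Lsn` are assumed to be plain `u64`, and eviction takes `&mut self` so the pin-versus-evict race of a concurrent pool is sidestepped rather than solved.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;

type PageId = u64; // hypothetical stand-ins for the engine's own types
type Lsn = u64;

/// One tier-1 frame: immutable page bytes plus a pin count.
struct Frame {
    bytes: Vec<u8>,
    pins: AtomicU32,
}

/// RAII guard: the frame stays pinned until this value is dropped.
struct PinnedPage {
    frame: Arc<Frame>,
}

impl PinnedPage {
    fn bytes(&self) -> &[u8] {
        &self.frame.bytes
    }
}

impl Drop for PinnedPage {
    fn drop(&mut self) {
        self.frame.pins.fetch_sub(1, Ordering::Release);
    }
}

#[derive(Default)]
struct Tier1 {
    frames: HashMap<(PageId, Lsn), Arc<Frame>>,
}

impl Tier1 {
    /// Install a version; an existing (page_id, lsn) entry is never overwritten.
    fn insert(&mut self, page_id: PageId, lsn: Lsn, bytes: Vec<u8>) {
        self.frames
            .entry((page_id, lsn))
            .or_insert_with(|| Arc::new(Frame { bytes, pins: AtomicU32::new(0) }));
    }

    /// Pin a resident version for the duration of a read.
    fn pin(&self, page_id: PageId, lsn: Lsn) -> Option<PinnedPage> {
        let frame = Arc::clone(self.frames.get(&(page_id, lsn))?);
        frame.pins.fetch_add(1, Ordering::Acquire);
        Some(PinnedPage { frame })
    }

    /// Evict only if no reader holds a pin; returns true if the frame was removed.
    fn try_evict(&mut self, page_id: PageId, lsn: Lsn) -> bool {
        let key = (page_id, lsn);
        let unpinned = self
            .frames
            .get(&key)
            .map_or(false, |f| f.pins.load(Ordering::Acquire) == 0);
        if unpinned {
            self.frames.remove(&key);
        }
        unpinned
    }
}
```

Holding the pin in a guard means a reader cannot forget to release it: dropping the `PinnedPage` is the unpin.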
Tier 2 — local file cache (LFC) on NVMe
A larger, on-disk cache backed by a file (or a directory of slab files) on local NVMe. The LFC holds pages that have aged out of tier 1 but are still warm enough to be worth keeping off the network. A tier-2 hit is a local NVMe read (tens of microseconds) — three to four orders of magnitude faster than an object-store GET, and it bills no egress. The LFC is what lets a working set far larger than RAM still avoid the durability floor.
- MUST be local, ephemeral storage (instance NVMe / scratch disk), never network-attached, never the durability floor.
- MUST store the same versioned pages as tier 1 and verify integrity on read (checksum) before serving.
- SHOULD be sized in the GB–tens-of-GB range, far larger than tier 1, bounded by available local disk.
- MAY be disabled entirely (lfc.enabled = false) for deployments with no usable local disk (e.g. Cloudflare Workers, see Deployment Targets), in which case tier 1 falls through directly to object storage.
Tiering policy between the tiers
The two tiers form an inclusive-by-default hierarchy with controlled promotion and demotion:
| Transition | Trigger | Policy |
|---|---|---|
| floor → tier 1 | miss serviced by object storage | fetched page is installed in tier 1 and admitted to the LFC if it passes admission (below). |
| tier 1 → tier 2 (demote) | eviction from tier 1 | evicted page is written to the LFC if not already present and admission permits; otherwise dropped. |
| tier 2 → tier 1 (promote) | tier-1 miss, tier-2 hit | page is read from NVMe into a tier-1 frame and pinned for the read; reference counters updated. |
| tier 2 → drop | LFC eviction or full | page is discarded; next access falls to the floor. No write-back (pages are immutable and the floor is authoritative). |
Because every cached page version is immutable and reconstructable from the floor, there is never a dirty write-back from the LFC: demotion is a copy, eviction is a delete. The only "dirty" data in the system lives in the WAL until it is durable — and that data is governed by the durability boundary, not the cache.
Eviction & admission policy
Tier 1 — replacement
Tier 1 replacement must be low-overhead and, where the workload demands it, scan-resistant. CLOCK (second-chance) is the default candidate for its near-zero per-access cost; 2Q or a CLOCK-Pro variant is the upgrade path if scan workloads (e.g. large sequential reads, the HNSW cold traversal in Capabilities) pollute plain CLOCK.
| Candidate | Strength | Cost / risk | Use |
|---|---|---|---|
| LRU | simple, good recency | per-access list churn; one big scan evicts the hot set | baseline / reference only |
| CLOCK | O(1)-ish, lock-light, approximates LRU | weaker than LRU under mixed recency/frequency | default tier 1 |
| 2Q | scan-resistant (separates recent vs frequent) | two queues to tune | upgrade if scans pollute |
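For reference, the CLOCK (second-chance) candidate can be sketched in a few lines. This is an illustrative toy under stated assumptions: keys are plain `u64`, the hit lookup is a linear scan (a real pool would use a hash index), and newly installed pages start with the reference bit clear, which is one of several common variants.

```rust
/// Minimal CLOCK (second-chance) replacement over a fixed number of frames.
struct Clock {
    keys: Vec<Option<u64>>, // page key held by each frame, None if free
    referenced: Vec<bool>,  // reference bit, set on every hit
    hand: usize,            // next frame the sweep will inspect
}

impl Clock {
    fn new(frames: usize) -> Self {
        assert!(frames > 0, "CLOCK needs at least one frame");
        Clock {
            keys: vec![None; frames],
            referenced: vec![false; frames],
            hand: 0,
        }
    }

    /// Record an access to `key`. Returns the key evicted to make room, if any.
    fn access(&mut self, key: u64) -> Option<u64> {
        // Hit: just set the reference bit. No list reordering, unlike LRU.
        if let Some(i) = self.keys.iter().position(|k| *k == Some(key)) {
            self.referenced[i] = true;
            return None;
        }
        // Miss: sweep the hand. A set bit buys the frame one more lap.
        loop {
            let i = self.hand;
            self.hand = (self.hand + 1) % self.keys.len();
            if self.keys[i].is_some() && self.referenced[i] {
                self.referenced[i] = false; // second chance
                continue;
            }
            // Free frame, or an occupied frame whose bit is already clear.
            return self.keys[i].replace(key);
        }
    }
}
```

With three frames holding pages 1, 2, 3 and a re-reference to page 1, the next miss skips page 1 and evicts page 2: the re-referenced page survives one extra sweep.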
LFC — admission
The LFC is admission-gated, not write-everything. Writing every evicted or fetched page to NVMe burns write endurance and lets one-shot pages (full-table scans, the analytics export path) evict the warm set. Admission policy decides whether a page earns a slot:
- SHOULD admit on second access (a one-touch page is not yet proven warm); track a lightweight frequency sketch / ghost queue to detect re-reference.
- SHOULD apply a frequency-based admission filter (TinyLFU-style) so a high-frequency incoming page can displace a low-frequency resident one, and a one-shot scan page is rejected.
- MAY bypass admission for pages explicitly fetched by a warming pass (they are admitted because warming chose them).
- MUST NOT let an admission decision block or slow a read: admission is asynchronous to the read path.
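The "admit on second access" rule can be sketched with a bounded ghost queue that remembers keys only, never page bytes. This is a hedged illustration of the second-touch variant, not a TinyLFU filter; `SecondTouchAdmission` and `should_admit` are hypothetical names, and in the real component the call would run off the read path.

```rust
use std::collections::{HashSet, VecDeque};

/// Second-touch admission: a key is admitted only if it was seen recently.
struct SecondTouchAdmission {
    ghosts: VecDeque<u64>, // recently seen keys, oldest at the front
    seen: HashSet<u64>,    // same keys, for O(1) membership tests
    capacity: usize,       // how many one-touch keys to remember
}

impl SecondTouchAdmission {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "ghost queue needs room for at least one key");
        SecondTouchAdmission {
            ghosts: VecDeque::new(),
            seen: HashSet::new(),
            capacity,
        }
    }

    /// Returns true if `key` has earned an LFC slot.
    fn should_admit(&mut self, key: u64) -> bool {
        if self.seen.remove(&key) {
            // Second touch inside the ghost window: proven warm.
            self.ghosts.retain(|k| *k != key);
            return true;
        }
        // First touch: remember the key, reject the page.
        if self.ghosts.len() == self.capacity {
            if let Some(old) = self.ghosts.pop_front() {
                self.seen.remove(&old);
            }
        }
        self.ghosts.push_back(key);
        self.seen.insert(key);
        false
    }
}
```

A one-shot scan touches each page once, so every page in it is rejected and only ghost entries churn; the resident warm set on NVMe is untouched.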
Metrics (mandatory)
These counters are not optional telemetry; they are the inputs to capacity planning and to benchmark Experiment 5 (cold vs warm reads).
- cache.t1.hit_ratio: tier-1 hits / total reads, rolling window. The headline embeddability indicator.
- cache.t2.hit_ratio: LFC hits / (reads that missed tier 1). Measures the LFC's contribution past RAM.
- cache.overall_hit_ratio: (t1 + t2 hits) / total reads. 1 − this = fraction of reads that became object-store GETs.
- cache.miss.object_reads: count of reads that fell through to object storage (each is a network round-trip and, except on R2, billable egress; see Deployment Targets).
- cache.miss.object_read_latency: p50/p99/p999 of the floor read, captured as a histogram (never mean-only).
- cache.t1.evictions / cache.t2.evictions: eviction rate per tier; a rising t1 eviction rate with flat hit ratio signals undersized tier 1.
- cache.t2.admit / cache.t2.reject: admission accept/reject counts; high reject is healthy (scan resistance working).
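The three ratio definitions differ only in their denominators, which is easy to get wrong. A small sketch, assuming cumulative counters rather than the rolling window the spec calls for (`Counters`, `HitRatios`, and `hit_ratios` are hypothetical names):

```rust
/// Raw per-tier counters; every read increments exactly one of them.
#[derive(Default, Clone, Copy)]
struct Counters {
    t1_hits: u64,
    t2_hits: u64,
    misses: u64, // reads that reached object storage
}

struct HitRatios {
    t1: f64,      // tier-1 hits / total reads
    t2: f64,      // LFC hits / reads that missed tier 1
    overall: f64, // (t1 + t2 hits) / total reads
}

/// Returns None when no reads have been recorded (ratios are undefined).
fn hit_ratios(c: Counters) -> Option<HitRatios> {
    let total = c.t1_hits + c.t2_hits + c.misses;
    if total == 0 {
        return None;
    }
    // Only reads that missed tier 1 ever consult the LFC.
    let reached_t2 = c.t2_hits + c.misses;
    let t2 = if reached_t2 == 0 {
        0.0
    } else {
        c.t2_hits as f64 / reached_t2 as f64
    };
    Some(HitRatios {
        t1: c.t1_hits as f64 / total as f64,
        t2,
        overall: (c.t1_hits + c.t2_hits) as f64 / total as f64,
    })
}
```

For example, 900 tier-1 hits, 80 LFC hits, and 20 floor reads give t1 = 0.90, t2 = 0.80, and overall = 0.98, so 2% of reads became object-store GETs.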
Version correctness
The cache is not a key→bytes map; it is a versioned store, and correctness under MVCC is non-negotiable. Pages are versioned by LSN. A reader executes against a fixed snapshot LSN (see Engine Core, snapshot isolation via LSN-stamped versions). When that reader asks the cache for a page, the cache MUST return the version of that page that is visible at the reader's snapshot LSN — i.e. the newest version whose producing LSN is ≤ the snapshot LSN — not merely "the latest version present."
- MUST resolve get_page(page_id, snapshot_lsn) to the page version with the greatest producing LSN ≤ snapshot_lsn.
- MUST treat each cached page version as immutable. A newly committed version is appended to the page's version chain (or installed as a new keyed entry); it MUST NOT overwrite the bytes of an existing cached version.
- MUST NOT serve a version with producing LSN > snapshot_lsn to that reader (that would expose a future write, which is a snapshot-isolation violation).
- MUST, on a chain miss (no resident version satisfies the snapshot), fall through to object storage with the snapshot LSN and let the layer format reconstruct the correct version.
- SHOULD retain older versions still referenced by live snapshots and release them once no live snapshot older than their successor's LSN remains (cache version GC tracks the engine's oldest live snapshot, mirroring MVCC vacuum).
This "append, never mutate" rule is the cache-level reflection of the storage format itself: the page store is built from immutable delta and image layers keyed by (key, LSN). The cache mirrors that immutability so a cache hit and a floor read are indistinguishable in result — same version selection, same bytes.
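Version selection by snapshot LSN maps naturally onto an ordered map. The sketch below is illustrative only: `VersionChain`, `append`, and `visible_at` are hypothetical names, `Lsn` is assumed to be `u64`, and the chain is assumed to hold every version between the one returned and the snapshot (a chain with gaps would need the floor to arbitrate).

```rust
use std::collections::BTreeMap;

type Lsn = u64; // hypothetical stand-in for the engine's LSN type

/// All resident versions of one page, ordered by producing LSN.
#[derive(Default)]
struct VersionChain {
    versions: BTreeMap<Lsn, Vec<u8>>,
}

impl VersionChain {
    /// Append a version. Existing versions are never overwritten;
    /// returns false if a version with this LSN is already present.
    fn append(&mut self, lsn: Lsn, bytes: Vec<u8>) -> bool {
        if self.versions.contains_key(&lsn) {
            return false;
        }
        self.versions.insert(lsn, bytes);
        true
    }

    /// The version visible at `snapshot_lsn`: greatest producing LSN <= snapshot.
    fn visible_at(&self, snapshot_lsn: Lsn) -> Option<(Lsn, &[u8])> {
        self.versions
            .range(..=snapshot_lsn)
            .next_back()
            .map(|(lsn, bytes)| (*lsn, bytes.as_slice()))
    }
}
```

With versions at LSN 10 and 20, a reader at snapshot 15 gets the LSN-10 bytes, a reader at 20 or later gets the LSN-20 bytes, and a reader at 5 gets a chain miss.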
Invalidation on commit
Because there is a single writer per database (source §4), invalidation is local and simple: when the writer commits at LSN L, the pages it modified gain a new version L. The cache appends those versions; it does not evict the prior versions while live snapshots still reference them. There is no cross-process coherence problem to solve — no second writer can produce a version this cache does not already know about.
Why append-not-mutate matters for correctness
A long-running reader holding snapshot LSN S and a writer committing at LSN L > S coexist. If the cache mutated the page in place at commit, the reader would suddenly see L's bytes through its S snapshot: a torn, non-repeatable read. Appending a new version and selecting by snapshot LSN keeps the reader at S and any reader at or after L correct from the same cache.
Interfaces / data formats
The cache sits between the engine's buffer-access layer and the Storage trait. It implements the read path of that trait's get_page by consulting tiers before delegating the miss to the underlying ObjectStorage implementation.
/// Identity of a cached page version. The (page_id, lsn) pair is the cache key;
/// `lsn` is the LSN at which this version of the page was produced.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct PageKey { pub page_id: PageId, pub lsn: Lsn }

pub trait PageCache: Send + Sync {
    /// Return the page version visible at `snapshot_lsn`: the resident version
    /// with the greatest producing LSN <= snapshot_lsn. On miss, fetch from the
    /// underlying Storage at snapshot_lsn, install, and return. Pins until dropped.
    fn get(&self, page_id: PageId, snapshot_lsn: Lsn) -> Result<PinnedPage>;

    /// Append a freshly committed version (producing LSN = `lsn`) into tier 1.
    /// MUST NOT mutate any existing cached version of `page_id`.
    fn put_committed(&self, page_id: PageId, lsn: Lsn, page: Page);

    /// Best-effort proactive load of `keys` into the cache (warming). Non-blocking
    /// w.r.t. the read path; admission is bypassed (warming is the admission signal).
    fn warm(&self, keys: &[PageKey]);

    /// Release versions no live snapshot can observe (oldest_live_snapshot driven).
    fn gc_versions(&self, oldest_live_snapshot: Lsn);

    /// Snapshot of per-tier counters for metrics export.
    fn stats(&self) -> CacheStats;
}

pub struct CacheStats {
    pub t1_hits: u64, pub t2_hits: u64, pub misses: u64, // misses == object-store reads
    pub t1_evictions: u64, pub t2_evictions: u64,
    pub t2_admits: u64, pub t2_rejects: u64,
    pub object_read_latency_us: Histogram, // p50/p99/p999
}
The on-NVMe LFC entry format is fixed-layout so an entry can be validated before it is trusted:
LFC slab entry
┌────────────┬──────────────┬──────────┬───────────────┬────────────────────────┐
│ magic (4B) │ page_id (8B) │ lsn (8B) │ crc32c (4B)   │ page bytes             │
├────────────┼──────────────┼──────────┼───────────────┼────────────────────────┤
│ fixed      │              │          │ of page bytes │ PAGE_SIZE (e.g. 8 KiB) │
└────────────┴──────────────┴──────────┴───────────────┴────────────────────────┘
- magic mismatch  -> treat slot as empty
- crc32c mismatch -> corruption: drop slot, count as miss, re-fetch from floor
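A minimal encode/validate sketch for an entry of this shape follows. Several details are assumptions, not taken from the spec: the `MAGIC` value, little-endian field encoding, the checksum covering only the page bytes, and treating a wrong-sized slot as empty. The CRC-32C routine is a plain bitwise implementation for illustration; a production build would normally use a hardware-accelerated one.

```rust
use std::convert::TryInto;

const MAGIC: u32 = 0x4C46_4331; // hypothetical magic value ("LFC1")
const PAGE_SIZE: usize = 8192;
const HEADER: usize = 4 + 8 + 8 + 4; // magic + page_id + lsn + crc32c

/// Bitwise CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            crc = if crc & 1 != 0 { (crc >> 1) ^ 0x82F6_3B78 } else { crc >> 1 };
        }
    }
    !crc
}

/// Serialize one slab entry: fixed header followed by the page bytes.
fn encode(page_id: u64, lsn: u64, page: &[u8; PAGE_SIZE]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER + PAGE_SIZE);
    out.extend_from_slice(&MAGIC.to_le_bytes());
    out.extend_from_slice(&page_id.to_le_bytes());
    out.extend_from_slice(&lsn.to_le_bytes());
    out.extend_from_slice(&crc32c(page).to_le_bytes());
    out.extend_from_slice(page);
    out
}

#[derive(Debug, PartialEq)]
enum SlotState<'a> {
    Empty,   // wrong size or magic mismatch: nothing stored here
    Corrupt, // checksum mismatch: drop the slot, count a miss, re-fetch
    Valid { page_id: u64, lsn: u64, page: &'a [u8] },
}

/// Validate a slot before trusting any of its contents.
fn decode(slot: &[u8]) -> SlotState<'_> {
    if slot.len() != HEADER + PAGE_SIZE {
        return SlotState::Empty;
    }
    if u32::from_le_bytes(slot[0..4].try_into().unwrap()) != MAGIC {
        return SlotState::Empty;
    }
    let page_id = u64::from_le_bytes(slot[4..12].try_into().unwrap());
    let lsn = u64::from_le_bytes(slot[12..20].try_into().unwrap());
    let stored = u32::from_le_bytes(slot[20..24].try_into().unwrap());
    let page = &slot[HEADER..];
    if crc32c(page) != stored {
        return SlotState::Corrupt;
    }
    SlotState::Valid { page_id, lsn, page }
}
```

Because the checksum in this sketch covers only the page bytes, a bit flip in `page_id` or `lsn` would go undetected; extending the checksum over the header fields is an obvious hardening if the real format allows it.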
Behaviour & algorithms
Read path (the hot path)
get(page_id, snapshot_lsn):
  1. TIER 1: walk version chain for page_id.
       resident version v with max(v.lsn) <= snapshot_lsn ?
         -> HIT: pin frame, bump CLOCK ref bit, t1_hits++, RETURN (memory latency)
  2. TIER 2 (LFC): look up satisfying (page_id, lsn) slot, verify crc32c.
       valid ?
         -> HIT: read NVMe -> install into tier 1 frame, pin, t2_hits++, RETURN (NVMe latency)
  3. MISS: storage.get_page(page_id, snapshot_lsn)   // network round-trip to floor
       install version into tier 1; async-admit to LFC if admission passes
       misses++ ; record object_read_latency ; RETURN (object-store latency)
Three-stop fall-through. Steps 1–2 are local; only step 3 crosses the network and bills egress.
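The fall-through can be sketched end to end with in-memory maps standing in for both tiers. This is a simplified model under stated assumptions: the `Floor` trait, `MapFloor`, and `TwoTierCache` are hypothetical; the floor is assumed to return the producing LSN alongside the bytes; pages are cloned rather than pinned; and checksum verification, eviction, and LFC admission are left out.

```rust
use std::collections::{BTreeMap, HashMap};

type PageId = u64;
type Lsn = u64;
type Page = Vec<u8>;
type Chains = HashMap<PageId, BTreeMap<Lsn, Page>>;

#[derive(Debug, PartialEq, Clone, Copy)]
enum Source { Tier1, Tier2, Floor }

/// Stand-in for the object-storage backend (assumed to report the producing LSN).
trait Floor {
    fn get_page(&self, page_id: PageId, snapshot_lsn: Lsn) -> Option<(Lsn, Page)>;
}

/// Newest version of `page_id` with producing LSN <= snapshot_lsn.
fn lookup(chains: &Chains, page_id: PageId, snapshot_lsn: Lsn) -> Option<(Lsn, Page)> {
    chains
        .get(&page_id)?
        .range(..=snapshot_lsn)
        .next_back()
        .map(|(lsn, page)| (*lsn, page.clone()))
}

/// In-memory floor, used only for illustration and tests.
struct MapFloor(Chains);

impl Floor for MapFloor {
    fn get_page(&self, page_id: PageId, snapshot_lsn: Lsn) -> Option<(Lsn, Page)> {
        lookup(&self.0, page_id, snapshot_lsn)
    }
}

#[derive(Default)]
struct TwoTierCache {
    t1: Chains,
    t2: Chains,
    t1_hits: u64,
    t2_hits: u64,
    misses: u64,
}

impl TwoTierCache {
    fn get(&mut self, floor: &dyn Floor, page_id: PageId, snapshot_lsn: Lsn) -> Option<(Page, Source)> {
        // 1. Tier 1: memory latency.
        if let Some((_, page)) = lookup(&self.t1, page_id, snapshot_lsn) {
            self.t1_hits += 1;
            return Some((page, Source::Tier1));
        }
        // 2. Tier 2: NVMe latency; promote the version into tier 1.
        if let Some((lsn, page)) = lookup(&self.t2, page_id, snapshot_lsn) {
            self.t1.entry(page_id).or_default().insert(lsn, page.clone());
            self.t2_hits += 1;
            return Some((page, Source::Tier2));
        }
        // 3. Miss: the only step that crosses the network.
        let (lsn, page) = floor.get_page(page_id, snapshot_lsn)?;
        self.t1.entry(page_id).or_default().insert(lsn, page.clone());
        self.misses += 1;
        Some((page, Source::Floor))
    }
}
```

A second read of the same page at the same snapshot is served from tier 1, which is the behaviour the hit-ratio counters are meant to confirm.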
Commit / invalidation path (NOT the hot path)
On a successful commit at LSN L — and only after the WAL is durable (see below) — the writer calls put_committed(page_id, L, page) for each modified page. The cache appends version L to tier 1's chain. Prior versions remain for live snapshots; gc_versions(oldest_live_snapshot) reclaims them later. No NVMe write-back occurs at commit: the LFC learns the new version lazily, on demotion.
Version GC
The engine reports its oldest live snapshot LSN (the same value MVCC vacuum uses). The cache drops any version v of a page where a newer resident version v′ exists with v′.lsn ≤ oldest_live_snapshot — i.e. no live reader can ever select v again. This keeps version chains bounded under a long-running-snapshot workload.
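The GC rule reduces to "keep the newest version at or below the oldest live snapshot, and everything newer". A sketch over one page's chain, assuming `Lsn` is `u64` and using the hypothetical name `gc_chain`:

```rust
use std::collections::BTreeMap;

type Lsn = u64; // hypothetical stand-in for the engine's LSN type

/// Drop every version no live snapshot can select. Returns how many were dropped.
fn gc_chain(chain: &mut BTreeMap<Lsn, Vec<u8>>, oldest_live_snapshot: Lsn) -> usize {
    // The newest version with lsn <= oldest_live_snapshot is what the oldest
    // reader selects, so it must stay. Everything older is shadowed by it.
    let keep_from = match chain.range(..=oldest_live_snapshot).next_back() {
        Some((lsn, _)) => *lsn,
        None => return 0, // every resident version is newer than the oldest reader
    };
    let before = chain.len();
    // split_off returns the entries with key >= keep_from; keep only those.
    let kept = chain.split_off(&keep_from);
    *chain = kept;
    before - chain.len()
}
```

With versions at LSN 10, 20, 30, 40 and an oldest live snapshot of 35, versions 10 and 20 are dropped: the reader at 35 selects 30, and no reader can reach back past it.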
The durability boundary (critical callout)
Cache READ latency, never COMMIT latency
The cache hides read latency completely. It MUST NEVER hide commit latency. A commit MUST NOT be acknowledged to the client until its WAL records are durably stored on the commit-log floor (the S3 conditional-write / CAS append, source §2.C). Acknowledging a commit from an in-memory buffer before the WAL is durable produces acked-write loss: the client is told "committed," the process crashes, and the write is gone. This is disqualifying regardless of how good the latency numbers look (source §8, durability rule; Experiment 4).
This is the single sharpest line in the whole component, and it follows directly from the W1/W2 distinction in the source:
- MUST NOT source a commit acknowledgement from any cache tier. The ack is owned exclusively by the WAL/commit-log durability signal.
- MUST NOT reorder or shortcut the WAL-durable → ack → put_committed sequence. The page enters the cache after durability, never before.
- MUST tolerate a crash between WAL-durable and put_committed: on restart the page is simply re-fetched/re-materialized from the floor; the committed data is safe because it is in the durable WAL, not because it was in the cache.
- MUST keep the cache strictly off the W1 (commit-latency) axis: caching is a W1 non-mitigation. Source §8 is explicit: "cache cannot hide commit latency without breaking durability." Group commit beats W1; the cache only beats reads.
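The ordering constraint can be made concrete with a small sketch that records each step. Everything here is illustrative: the `Wal` trait, `MemWal`, `Committer`, and `Step` are hypothetical names, the ack is modelled as a trace entry rather than a client response, and a `Vec` stands in for tier 1.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Step { WalDurable, Acked, Cached }

/// Stand-in for the commit log: returns Ok only once the record is durable.
trait Wal {
    fn append_durable(&mut self, lsn: u64, record: &[u8]) -> Result<(), String>;
}

/// Test double that can be told to fail every append.
struct MemWal {
    fail: bool,
    records: Vec<(u64, Vec<u8>)>,
}

impl Wal for MemWal {
    fn append_durable(&mut self, lsn: u64, record: &[u8]) -> Result<(), String> {
        if self.fail {
            return Err("floor append failed".to_string());
        }
        self.records.push((lsn, record.to_vec()));
        Ok(())
    }
}

struct Committer<W: Wal> {
    wal: W,
    cached: Vec<(u64, Vec<u8>)>, // stand-in for tier 1
    trace: Vec<Step>,            // observable order of steps
}

impl<W: Wal> Committer<W> {
    fn new(wal: W) -> Self {
        Committer { wal, cached: Vec::new(), trace: Vec::new() }
    }

    fn commit(&mut self, lsn: u64, page: &[u8]) -> Result<(), String> {
        // 1. Durability first. On failure nothing is acked and nothing is cached.
        self.wal.append_durable(lsn, page)?;
        self.trace.push(Step::WalDurable);
        // 2. The ack is derived from the WAL result alone, never from a cache tier.
        self.trace.push(Step::Acked);
        // 3. Only now does the new version enter the cache (put_committed).
        self.cached.push((lsn, page.to_vec()));
        self.trace.push(Step::Cached);
        Ok(())
    }
}
```

A crash between steps 1 and 3 leaves `cached` without the page but loses nothing, since the record is already in the WAL; a failed append leaves both the trace and the cache empty.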
Restate the three levers, so nobody confuses them
The cache addresses read latency. Group commit addresses commit latency (W1). Many-small-DBs / sharding addresses write serialization (W2). They are three different levers for three different problems; the cache is exactly one of them and stays in its lane.
Cold start & warming
Scale-to-zero (source §3) means the engine process is torn down when idle, so caches start cold after every idle period. The first request after idle pays process-start plus cache-warm, and a cold read goes all the way to the floor at object-storage latency (and, off R2, egress cost). The lifecycle controller drives the cold/idle transitions; Deployment Targets sets the substrate-specific cold-start cost (container spin-up vs Worker isolate).
Warming strategies
- SHOULD warm on start (warm.on_start = true) by prefetching the root/metadata pages and a recorded hot-set manifest before serving the first query, bounding the warm-path p99.
- SHOULD persist a small hot-set manifest (the most-referenced PageKeys) to the floor on graceful shutdown and replay it via warm() on next start.
- MAY preserve a warm LFC across restarts when local NVMe is durable across the process lifecycle (e.g. a long-lived VM that restarts the engine without losing the scratch disk); on ephemeral substrates the LFC is cold like everything else.
- MAY support keep-warm: the controller issues a periodic lightweight ping to defer scale-to-zero and keep caches resident for latency-sensitive tools (source §4, "keep-warm ping").
- MUST NOT let warming block correctness or commits: warming is best-effort and asynchronous; a query that races warming simply takes the normal miss path.
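Building the hot-set manifest is essentially a top-N selection over reference counts. The sketch below assumes the cache tracks a per-key reference count, which the spec does not state; `hot_set` is a hypothetical helper, and ties are broken by key order purely to make the output deterministic.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
struct PageKey {
    page_id: u64,
    lsn: u64,
}

/// The `n` most-referenced keys, hottest first (ties broken by key order).
fn hot_set(ref_counts: &HashMap<PageKey, u64>, n: usize) -> Vec<PageKey> {
    let mut entries: Vec<(&PageKey, &u64)> = ref_counts.iter().collect();
    // Sort by descending count, then ascending key for a stable manifest.
    entries.sort_by(|a, b| b.1.cmp(a.1).then(a.0.cmp(b.0)));
    entries.into_iter().take(n).map(|(key, _)| *key).collect()
}
```

The resulting slice is what would be serialized to the manifest on graceful shutdown and passed to `warm()` on the next start; keeping `n` small bounds both the manifest size and the warm-up time.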
Tuning hook
The cold-vs-warm p99 gap measured by Experiment 5 is the input to the idle-timeout / keep-warm decision. A large gap argues for a longer idle timeout or keep-warm on hot tools; a small gap (e.g. on R2 where misses are cheap) argues for aggressive scale-to-zero.
Configuration
| Key | Type / default | Meaning |
|---|---|---|
| cache.t1.size | bytes · default 25% RAM | tier-1 page-cache capacity. Larger = higher t1 hit ratio, less executor headroom. |
| cache.t1.policy | clock / 2q / lru · default clock | tier-1 replacement policy. |
| lfc.enabled | bool · default true | enable the NVMe LFC. Set false on substrates with no local disk (Workers). |
| lfc.path | path · e.g. /var/cache/engine/lfc | NVMe directory for LFC slab files. MUST be local, ephemeral storage. |
| lfc.size | bytes · default 8 GiB | LFC capacity, bounded by free space at lfc.path. |
| lfc.admission | tinylfu / second_touch / always · default tinylfu | LFC admission policy. |
| cache.page_size | bytes · default 8192 | page/frame size; MUST match the engine + layer-format page size. |
| warm.on_start | bool · default true | prefetch metadata + hot-set manifest before serving first query. |
| warm.manifest_path | floor object key | where the hot-set manifest is persisted/loaded. |
| cache.target_hit_ratio | ratio · default 0.95 | overall hit-ratio SLO; breaching it raises an alert (capacity signal). |
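One way to carry these keys in code is a typed config with the documented defaults and a validation step. This is a sketch, not the engine's config loader: the struct and field names are hypothetical, `lfc.path` and `warm.manifest_path` are omitted because they have no default, and only the page-size check comes from the table; the other two checks are assumed sanity rules.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum T1Policy { Clock, TwoQ, Lru }

#[derive(Debug, Clone, Copy, PartialEq)]
enum LfcAdmission { TinyLfu, SecondTouch, Always }

#[derive(Debug, Clone, PartialEq)]
struct CacheConfig {
    t1_size: u64,              // cache.t1.size, bytes
    t1_policy: T1Policy,       // cache.t1.policy
    lfc_enabled: bool,         // lfc.enabled
    lfc_size: u64,             // lfc.size, bytes
    lfc_admission: LfcAdmission, // lfc.admission
    page_size: u64,            // cache.page_size, bytes
    warm_on_start: bool,       // warm.on_start
    target_hit_ratio: f64,     // cache.target_hit_ratio
}

impl CacheConfig {
    /// Defaults from the configuration table; tier 1 gets 25% of RAM.
    fn with_defaults(total_ram_bytes: u64) -> Self {
        CacheConfig {
            t1_size: total_ram_bytes / 4,
            t1_policy: T1Policy::Clock,
            lfc_enabled: true,
            lfc_size: 8 * 1024 * 1024 * 1024, // 8 GiB
            lfc_admission: LfcAdmission::TinyLfu,
            page_size: 8192,
            warm_on_start: true,
            target_hit_ratio: 0.95,
        }
    }

    fn validate(&self, engine_page_size: u64) -> Result<(), String> {
        // From the table: page size MUST match the engine / layer format.
        if self.page_size != engine_page_size {
            return Err(format!(
                "cache.page_size {} != engine page size {}",
                self.page_size, engine_page_size
            ));
        }
        // Assumed sanity rule: tier 1 must hold at least one frame.
        if self.t1_size < self.page_size {
            return Err("cache.t1.size is smaller than one page".to_string());
        }
        // Assumed sanity rule: the SLO is a ratio in (0, 1].
        if !(self.target_hit_ratio > 0.0 && self.target_hit_ratio <= 1.0) {
            return Err("cache.target_hit_ratio must be in (0, 1]".to_string());
        }
        Ok(())
    }
}
```

On a 16 GiB host, `with_defaults` yields a 4 GiB tier 1 and an 8 GiB LFC; `validate(4096)` fails because the cache and engine would disagree on frame size.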
Failure modes & edge cases
| Failure | Effect | Required handling |
|---|---|---|
| NVMe full (LFC out of space) | cannot admit new pages to tier 2 | evict the LFC's least-recently-used entries to make room; if still full, skip admission and serve the page from tier 1 / floor. Degraded, never incorrect. |
| LFC slab corruption (crc32c mismatch) | a tier-2 entry is unreadable | drop the slot, count as a miss, re-fetch from the floor. Corruption in a disposable cache is always recoverable from the durability floor. |
| NVMe path unwritable / disk failure | LFC cannot operate | disable LFC for the process lifetime, fall through tier 1 → floor, raise an alert. Engine stays correct, just colder. |
| tier-1 OOM pressure | host runtime starved | respect cache.t1.size as a hard cap; shrink on memory pressure signal before the host OOM-kills the process. |
| floor read fails on miss | page cannot be served | surface the error to the query (retry policy lives in the object-storage backend); the cache MUST NOT fabricate or stale-serve a page it does not hold. |
| crash between WAL-durable and put_committed | committed page not yet cached | none needed: re-materialize from floor on next read; the commit is safe in the durable WAL. |
| cold start after scale-to-zero | both tiers empty | warm-on-start + manifest replay; first reads take the miss path. Expected, not a fault. |
| scan / one-shot read flood | risk of evicting hot set | scan-resistant tier-1 policy (2Q or CLOCK-Pro where plain CLOCK is polluted) + LFC admission filter rejecting one-touch pages. |
Invariant across every failure: the cache is disposable. Any corruption, eviction, or loss is recoverable by re-fetching from the durability floor, which is authoritative. No failure of the cache can ever cause data loss — only added latency.
Dependencies / existing pieces to start from
- SHOULD reuse a proven buffer-pool/CLOCK implementation from the engine-core lineage (libSQL / LeanStore-Umbra buffer manager) rather than writing tier 1 from scratch (source §2.A, §6).
- SHOULD adopt the LFC concept directly from the Neon local-file-cache design (NVMe spill below shared buffers) — the closest shipping analog to this two-tier shape.
- MAY lean on RocksDB's local-tier caching underneath the page store if the object-storage backend uses it instead of SlateDB (source §2.C notes RocksDB "if you want local-tier caching underneath").
- SHOULD use a TinyLFU/W-TinyLFU admission sketch (Caffeine lineage) for the LFC.
Acceptance criteria / definition of done
- MUST serve a tier-1 hit with zero syscalls and zero network I/O (verified by syscall trace on a warm-loop read).
- MUST return snapshot-correct versions: a property test with interleaved long readers and a writer shows every reader sees exactly its snapshot's version, never a future one.
- MUST pass Experiment 4 crash injection: no acked write is ever lost when the process is killed between WAL-durable and put_committed.
- MUST recover from LFC corruption: a deliberately flipped crc32c byte yields a miss + floor re-fetch, never a served bad page (fault-injection test).
- MUST export every metric in the metrics list with per-tier granularity.
- SHOULD meet cache.target_hit_ratio ≥ 0.95 on the representative read-heavy/bursty internal-tool workload after warm-up.
- SHOULD reduce post-idle warm p99 measurably with warm.on_start = true vs off, quantified by Experiment 5 (warm vs cold CDFs).
- MUST keep working (correct, slower) with lfc.enabled = false for the WASM/Workers target.
Open questions & risks
- MAY need a CLOCK-Pro / W-TinyLFU upgrade for tier 1 if HNSW vector traversal (the "nastiest cache case", source §11) pollutes CLOCK — its random pointer-chasing is the adversarial pattern. Open: should the vector index get a dedicated cache partition?
- MAY want version-chain length limits per page to bound memory under pathologically long-lived snapshots; trade-off between chain GC aggressiveness and forcing live readers to the floor.
- Open: optimal default split of RAM between tier 1 and host-runtime/executor working memory across substrates (Cloud Run container vs Worker isolate) — needs per-target measurement.
- Open: should the LFC persist a warm hot-set across graceful restarts on durable-NVMe VMs, and is the complexity worth it vs always warming from the manifest?
- Risk: on egress-billed stores (S3/GCS), a low hit ratio couples directly to the egress bill, so the cache's SLO is partly a cost SLO. R2's zero egress (source §10) structurally de-risks this; non-R2 deployments must watch cache.miss.object_reads as a spend signal.
- Risk: warming aggressiveness vs cold-start time. Over-eager warm-on-start can lengthen the very cold start it aims to soften. Needs the Experiment 5 curve to tune.