Benchmark & Validation Plan · Serverless OLTP Engine Spec

Purpose & scope

The architecture has two weak axes and exactly two corresponding levers. This plan exists to measure each axis independently, prove that the intended mitigation actually moves the right number, and produce a per-tool go/no-go verdict that an engineer can defend with data rather than intuition.

The plan is deliberately adversarial. It does not try to show the engine in a good light; it tries to find the workloads that break it, locate the breaking point precisely, and hand those workloads to a coupled Postgres while keeping the disaggregated engine for everything else. The platform is not all-or-nothing: benchmark the boundary once, then route per workload.

What this plan governs

Five experiments (latency floor, group-commit curve, contention wall, crash safety, cold read), a fixed measurement methodology (percentiles via HDR histogram, never means), and a decision rule that converts experiment outputs into a ship / route / block verdict. It applies to both the embedded (FFI) and server (pgwire) paths.

Responsibilities & non-goals

Responsibilities

MUST measure the single-commit latency floor against the real ObjectStorage backend, reported as a distribution (p50/p99/p999), not an average.
MUST quantify what group commit buys (the W1 lever) as a throughput-vs-concurrency curve.
MUST isolate single-writer serialization cost (the W2 axis) and prove that the many-DB sharding lever recovers linear scaling.
MUST prove crash safety of the commit path under adversarial kill -9 as a hard correctness gate.
SHOULD characterize cold-read latency after scale-to-zero to inform idle-timeout and keep-warm tuning.

Non-goals

MUST NOT report mean-only latency for any commit-path experiment; the mean hides the tail that decides this architecture.
MUST NOT conflate W1 (latency) and W2 (lanes); they are different bottlenecks with different levers and are measured by different experiments.
MAY defer OLAP / analytics benchmarking; that is a separate columnar concern (see Capabilities), not part of OLTP boundary validation.
MUST NOT use the benchmark numbers to justify shipping a workload that fails the Experiment 4 durability gate, regardless of how good its latency looks.

The two bottlenecks (do not conflate them)

Before any experiment, internalize the distinction. The engine has two independent weak axes. Mixing them is the single most common analysis error, because both feel like "the writes are slow" from the outside while having completely different causes and cures.

Weakness	Mechanism	Mitigation that works	Mitigation that does NOT
W1 — commit latency	each commit = network / S3 round-trip (~ms) instead of a local fsync (~µs)	group commit / queue + batch; cache serves reads	cache cannot hide commit latency without breaking durability
W2 — write serialization	single writer per database (SQLite-lineage ceiling)	many small DBs = many lanes (sharding)	batching does NOT add lanes; it only makes one lane's commits cheap

Key correction to internalize

The queue defeats W1 (latency); the sharded many-DB design defeats W2 (lanes). They are different levers for different bottlenecks. Batching a single lane never adds lanes; sharding into more lanes never makes one lane's commit cheaper. The only genuinely unsolvable case is many concurrent writers contending on the same row in the same database — that tool belongs on coupled Postgres (see Hot-Row Contention).

The two axes are orthogonal. Experiments 1–2 probe the W1 axis; Experiment 3 probes the W2 axis; the red quadrant (same-row, same-DB, high-rate) is where neither lever applies.

The durability rule

Non-negotiable

Never ack a commit from an in-memory buffer before the WAL is durably stored. Caching hides read latency completely; it MUST never hide commit latency, or you get acked-write loss — exactly the failure Experiment 4 is designed to catch. A fast commit path that loses an acked write is disqualifying regardless of its latency numbers.

Normative requirements

MUST capture all latency as percentiles (p50, p99, p999) via an HDR histogram; mean-only reporting is prohibited.
MUST run Experiment 1 against the real network object store, not a local file backend, since W1 is defined by the round-trip.
MUST run Experiment 1 (latency floor) and Experiment 2 (group-commit curve) before any throughput claim is made about a tool.
MUST run Experiment 3 (contention wall) for any tool whose write profile includes repeated updates to a shared row.
MUST pass Experiment 4 (crash safety) unconditionally before any tool stores real data.
SHOULD run Experiment 5 (cold read) for any tool that relies on scale-to-zero, and extend it to N concurrent cold starts for high-fan-out deployments.
SHOULD pin the object-store region, instance type, and engine commit (git SHA + storage backend SHA) in every results record for reproducibility.
SHOULD repeat each measurement run at least three times and report the median run plus run-to-run spread, to separate engine behavior from cloud noise.
MAY use TPC-C (via go-tpc or BenchBase) in place of synthetic workloads where a realistic OLTP mix is more informative than an isolated micro-benchmark.
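
The repeat-and-report requirement above can be sketched as follows. This is a minimal illustration, not part of the harness: the function name `median_and_spread` and the choice of (max - min) / median as the spread statistic are assumptions, since the plan does not prescribe a particular spread measure.

```rust
/// Summarize repeated runs of one metric (for example p99 commit latency in µs).
/// Returns (median, relative spread), where spread = (max - min) / median.
/// Returns None for an empty slice or a zero median.
/// NOTE: the spread definition is an assumption made for this sketch.
fn median_and_spread(runs: &[f64]) -> Option<(f64, f64)> {
    if runs.is_empty() {
        return None;
    }
    let mut sorted = runs.to_vec();
    sorted.sort_by(|a, b| a.partial_cmp(b).expect("metric must not be NaN"));
    let n = sorted.len();
    // Median of an odd-length list is the middle element; for even length,
    // average the two middle elements.
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    if median == 0.0 {
        return None;
    }
    let spread = (sorted[n - 1] - sorted[0]) / median;
    Some((median, spread))
}
```

A large spread relative to the effect being measured suggests cloud noise rather than engine behavior, and is a reason to add runs before drawing conclusions.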

Experiment 1 — single-commit latency floor

Highest priority. This is the foundational measurement. Every other throughput number is read relative to the ceiling this experiment establishes.

What: One transaction, BEGIN; INSERT 1 row; COMMIT, against the ObjectStorage backend. No concurrency, no batching — a single sequential commit measured in isolation.
Report: The latency distribution — p50, p99, p999. Not the average: the tail is what S3 jitter does to you, and the average launders it away.
Variants: Same-region vs cross-region object store, to isolate the network component from engine overhead. Optionally first-write (cold connection) vs steady-state.
Sets: The worst-case ceiling for the sequential-commit quadrant: roughly 1 / latency commits per second per lane. If p99 is 10 ms, the sequential ceiling is ~100 commits/s per lane.
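
The ceiling arithmetic in the last item can be written out as a small helper. This is only a worked form of the 1 / latency rule stated above; the function names are illustrative, and pairing p50 with p99 to get a typical and a conservative figure is an assumption of this sketch.

```rust
/// Sequential commits per second for one lane, given a per-commit latency
/// in milliseconds: one commit must finish before the next one starts.
fn sequential_ceiling(latency_ms: f64) -> f64 {
    assert!(latency_ms > 0.0, "latency must be positive");
    1000.0 / latency_ms
}

/// Returns (typical, conservative) ceilings derived from p50 and p99.
/// Using p99 gives the pessimistic figure quoted in the plan.
fn ceilings(p50_ms: f64, p99_ms: f64) -> (f64, f64) {
    (sequential_ceiling(p50_ms), sequential_ceiling(p99_ms))
}
```

With a p99 of 10 ms, `sequential_ceiling(10.0)` gives 100 commits/s per lane, matching the figure in the text.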

How to run

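Server mode (pgwire). pgbench's own summary reports only mean and standard-deviation latency, so log every transaction with -l and derive the percentiles from that log:

# -c 1: one client (sequential) · -T 60: 60s · -P 1: per-second progress · -r: per-statement mean latency
# -l: per-transaction log (elapsed time in µs), the input for p50/p99/p999 · -n: skip vacuum
pgbench -c 1 -j 1 -T 60 -P 1 -r -n -l --log-prefix=exp1 -f single_commit.sql "postgres://user@host:5432/benchdb"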

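-- single_commit.sql
-- :client_id and :scale are constant for a single client, so a random component keeps keys distinct
\set rowid random(1, 1000000000)
BEGIN;
INSERT INTO ledger (k, v) VALUES (:client_id || '-' || :rowid, 1);
COMMIT;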
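
Because pgbench's summary prints mean latency rather than percentiles, the per-transaction log written by its -l option is the usable input for a p50/p99/p999 report. The sketch below post-processes such a log with a plain sort and a nearest-rank percentile. It is not an HDR histogram, only a small stand-in for illustration, and the function names are assumptions.

```rust
/// Extract per-transaction latencies (µs) from the text of a pgbench `-l` log.
/// Each line has the form:
///   client_id transaction_no time script_no time_epoch time_us [schedule_lag]
/// where the third field is the elapsed transaction time in microseconds.
/// Lines that do not parse are skipped.
fn latencies_from_pgbench_log(log: &str) -> Vec<u64> {
    log.lines()
        .filter_map(|line| line.split_whitespace().nth(2))
        .filter_map(|field| field.parse::<u64>().ok())
        .collect()
}

/// Nearest-rank percentile over an ascending-sorted slice:
/// the smallest sample with at least fraction `q` of the data at or below it.
fn percentile(sorted: &[u64], q: f64) -> Option<u64> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = (q * sorted.len() as f64).ceil() as usize; // 1-based rank
    Some(sorted[rank.max(1) - 1])
}

/// Returns (p50, p99, p999) in microseconds, or None if the log has no samples.
fn commit_percentiles(log: &str) -> Option<(u64, u64, u64)> {
    let mut lat = latencies_from_pgbench_log(log);
    lat.sort_unstable();
    Some((
        percentile(&lat, 0.50)?,
        percentile(&lat, 0.99)?,
        percentile(&lat, 0.999)?,
    ))
}
```

A 60-second sequential run yields only a few thousand samples at millisecond latencies, so p999 rests on a handful of observations; longer runs tighten it.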

Embedded mode (FFI), a direct micro-loop feeding an HDR histogram so the engine overhead and the S3 round-trip are both inside the timed region:

use hdrhistogram::Histogram;
use std::time::{Duration, Instant};
let mut hist = Histogram::<u64>::new_with_bounds(1, 60_000_000, 3).unwrap(); // 1µs..60s, 3 sig digits
let deadline = Instant::now() + Duration::from_secs(60);
let mut i: u64 = 0;
while Instant::now() < deadline {
    let t0 = Instant::now();
    engine.execute("BEGIN")?;
    engine.execute(&format!("INSERT INTO ledger(k,v) VALUES ('{i}', 1)"))?;
    engine.execute("COMMIT")?;            // returns only after WAL durable
    hist.record(t0.elapsed().as_micros() as u64).unwrap();
    i += 1;
}
println!("p50={}µs p99={}µs p999={}µs",
    hist.value_at_quantile(0.50), hist.value_at_quantile(0.99), hist.value_at_quantile(0.999));

Acceptance for this experiment

A clean, unimodal distribution whose p999 is a small multiple of p50 (network jitter, not engine stalls). A p999 that is orders of magnitude above p50 signals a stall in the commit path (e.g. a synchronous compaction or a CAS retry storm) that must be investigated before trusting any downstream curve.

Experiment 2 — group-commit throughput curve

Proves W1 is beaten. This experiment demonstrates that the queue lever works: batching independent commits lifts aggregate throughput far above the per-lane Experiment-1 ceiling.

What: Concurrent writers at 1, 2, 4, 8, 16, 32, 64, with group commit enabled. Each writer inserts independent rows (no row contention).
Report: Throughput (commits/sec) vs writer count. Plot the curve and find the plateau — the point where adding writers stops adding throughput.
Read as: The gap between Experiment-1's implied ceiling (1 / latency) and this plateau is what batching buys. If group commit works, the plateau is 10–100× the Exp-1 ceiling.
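
Finding the plateau and the batching gain from a measured curve can be sketched as below. The plateau criterion used here (the next sweep step adds less than `min_gain` relative throughput) is an assumption of this sketch, as are the type and function names; the plan only says to find the point where adding writers stops adding throughput.

```rust
/// One point on the throughput-vs-concurrency curve.
#[derive(Debug, Clone, Copy, PartialEq)]
struct CurvePoint {
    writers: u32,
    commits_per_sec: f64,
}

/// Returns the first point after which the next sweep step adds less than
/// `min_gain` (0.10 means 10%) throughput. If the curve is still climbing at
/// the end of the sweep, the last point is returned, which means the sweep
/// should be extended before calling it a plateau.
fn find_plateau(curve: &[CurvePoint], min_gain: f64) -> Option<CurvePoint> {
    for pair in curve.windows(2) {
        let (cur, next) = (pair[0], pair[1]);
        if next.commits_per_sec < cur.commits_per_sec * (1.0 + min_gain) {
            return Some(cur);
        }
    }
    curve.last().copied()
}

/// How many times the plateau exceeds the Experiment-1 sequential ceiling.
fn batching_gain(plateau_cps: f64, exp1_ceiling_cps: f64) -> f64 {
    plateau_cps / exp1_ceiling_cps
}
```

The curve must be ordered by ascending writer count, matching the 1, 2, 4, ... 64 sweep.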

How to run

# sweep N over the writer set; -c N clients, -j N threads, independent rows
for N in 1 2 4 8 16 32 64; do
  pgbench -c "$N" -j "$N" -T 60 -r -n -f insert_independent.sql \
    "postgres://user@host:5432/benchdb" | tee "exp2_N${N}.log"
done

-- insert_independent.sql — each client writes its own keyspace, zero contention
\set rowid random(1, 1000000000)
BEGIN;
INSERT INTO events (writer, rowid, payload)
  VALUES (:client_id, :rowid, repeat('x', 64));
COMMIT;

For a realistic OLTP mix rather than a synthetic insert, drive TPC-C instead:

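# -d postgres selects the PostgreSQL driver (go-tpc defaults to MySQL); load data first with the same flags and `prepare`
go-tpc tpcc --warehouses 16 --threads 32 --time 5m run -d postgres -U user -H host -P 5432 -D benchdb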
# or BenchBase: java -jar benchbase.jar -b tpcc -c config/tpcc_engine.xml --execute=true

Interpreting the plateau

The plateau is the engine's sustained group-commit throughput for one database. If the plateau is comfortably above the tool's aggregate write rate, W1 is not a problem for that tool. If the plateau is barely above the Exp-1 ceiling, group commit is not engaging — check that commits are actually being coalesced and not serialized one round-trip at a time.
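
Why the curve should rise and then flatten can be seen in an idealized model. This is an illustration under loud assumptions, not a prediction of the engine's behavior: each flush is assumed to cost one commit window plus one object-store round trip, each writer is assumed to have exactly one commit in flight, and a flush carries at most `max_batch` commits. The parameter names mirror the group_commit_window and group_commit_max_batch knobs, but the formula itself is not taken from the engine.

```rust
/// Idealized group-commit throughput in commits per second.
/// ASSUMPTIONS (for illustration only):
///  - one flush takes `window_ms + rtt_ms` milliseconds end to end
///  - each of `writers` clients has one commit in flight per flush
///  - a flush carries at most `max_batch` commits
fn modeled_throughput(writers: u32, max_batch: u32, window_ms: f64, rtt_ms: f64) -> f64 {
    let flush_ms = window_ms + rtt_ms;
    assert!(flush_ms > 0.0, "flush time must be positive");
    let commits_per_flush = writers.min(max_batch) as f64;
    commits_per_flush * 1000.0 / flush_ms
}
```

In this model the plateau sits at max_batch × 1000 / (window + round trip). A measured plateau far below that figure is a hint that commits are not being coalesced; a measured curve that matches the single-writer value at every N is the "not engaging" case described above.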

Experiment 3 — write contention wall

Proves W2 is the other axis. The setup is identical to Experiment 2, except that all writers hammer the same row. This is the red-quadrant detector.

What: Repeat Experiment 2's writer sweep, but every writer issues the same contended update: UPDATE counter SET n = n + 1 WHERE id = 1.
Report: Throughput vs writer count — expect it to flatten or drop far earlier than Experiment 2, because every writer must serialize behind the previous one's committed value.
Read as: The delta between Experiment 2 (independent) and Experiment 3 (contended) isolates the single-writer serialization cost that batching cannot fix. Batching makes one lane's commits cheap; it does not let two writers occupy the same row's lane at once.

How to run

for N in 1 2 4 8 16 32 64; do
  pgbench -c "$N" -j "$N" -T 60 -r -n -f update_same_row.sql \
    "postgres://user@host:5432/benchdb" | tee "exp3_N${N}.log"
done

-- update_same_row.sql — every client contends on id=1
BEGIN;
UPDATE counter SET n = n + 1 WHERE id = 1;
COMMIT;

Then: validate the sharding lever

Repeat the contended workload across N separate databases (one contended counter each) to confirm that N independent lanes scale linearly. This is the experiment that validates the many-small-DBs design from Architecture and Hot-Row Contention: per-DB contention stops meaning "everything waits" and starts meaning "only related writes wait."

# N databases, each with one writer hammering its own counter; the aggregate should scale ~linearly
for N in 1 2 4 8 16 32; do
  for db in $(seq 1 "$N"); do
    pgbench -c 1 -j 1 -T 60 -r -n -f update_same_row.sql \
      "postgres://user@host:5432/shard_${db}" > "exp3_shards${N}_db${db}.log" &
  done
  wait
done   # metric: commits/sec summed across the N per-shard logs of each sweep step

Red-quadrant signal

If Experiment 3 (single DB, same row) is flat while the N-database variant scales linearly, the diagnosis is unambiguous: the workload is in the red quadrant only when forced into one DB. Sharding rescues it. If a tool genuinely cannot be sharded — one logical counter that must be globally serialized at high rate — that tool goes to coupled Postgres per the decision rule.
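
The "scales linearly" judgment can be made mechanical with a scaling-efficiency ratio. The sketch below is one way to do it; the function names and the idea of a single `min_efficiency` threshold are assumptions, since the plan asks only for near-linear scaling without fixing a number.

```rust
/// Scaling efficiency of the N-database variant: measured aggregate
/// throughput divided by the ideal (shard count x single-shard throughput).
/// 1.0 is perfectly linear.
fn scaling_efficiency(single_shard_cps: f64, shards: u32, aggregate_cps: f64) -> f64 {
    aggregate_cps / (single_shard_cps * shards as f64)
}

/// True if every step of the sweep stays at or above `min_efficiency`.
/// `sweep` holds (shard_count, aggregate commits/s) and must start with the
/// 1-shard run, which serves as the baseline.
fn scales_linearly(sweep: &[(u32, f64)], min_efficiency: f64) -> bool {
    let base = match sweep.first() {
        Some(&(1, cps)) if cps > 0.0 => cps,
        _ => return false, // no usable 1-shard baseline
    };
    sweep
        .iter()
        .all(|&(shards, cps)| scaling_efficiency(base, shards, cps) >= min_efficiency)
}
```

Efficiency that decays steadily as the shard count grows points at a shared resource serializing across databases rather than at the workload.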

Experiment 4 — crash safety of the commit path

Correctness gate, not speed. This experiment is a hard gate. It does not produce a number to optimize; it produces a pass/fail that blocks shipping.

What: kill -9 the process at two adversarial points in the commit path: (a) after the S3 CAS-append is issued but before the client ack; (b) after the client ack but before page materialization.
Verify on restart: Every acked commit is present; no torn or half state; no acked-write loss. An ack is a promise — recovery MUST honor every promise made.
Gate: A fast commit path that loses an acked write under crash is disqualifying, regardless of its latency numbers.

The two adversarial windows

  client          engine                 S3-CAS log            page store
    |   COMMIT       |                        |                     |
    |-------------->|                        |                     |
    |               |  CAS-append(WAL) ---->|  (durable)          |
    |               |                  <-- ack                     |
    |               |                        |                     |
    |               |   [crash point a] kill -9 here              |
    |   ack         |                        |                     |
    |<--------------|   [crash point b] kill -9 here              |
    |               |   materialize page ------------------------>|
    |               |                        |                     |
  invariant a: a half-issued CAS that was NOT acked may be absent
  invariant b: any commit the client SAW acked MUST survive restart

Crash point (a) tests that an un-acked, in-flight commit resolves atomically: after restart it may be wholly present (the CAS-append landed before the crash) or wholly absent, but never partial. Crash point (b) tests that an acked commit survives even though page materialization never ran; recovery must replay the WAL.

How to run

MUST use deterministic, seeded crash injection so a failing schedule is reproducible from its seed.
SHOULD implement a fault-injection wrapper around the object-store client that can fail or pause any GET/PUT/CAS at a chosen sequence number.
SHOULD tie into the loom / deterministic-simulation testing referenced by the Engine Core quality plan, exhaustively exploring interleavings around the commit/ack/materialize boundary.
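
Deterministic, seeded injection means the crash point is a pure function of the seed. A minimal sketch follows, using the SplitMix64 generator to turn a seed into an operation index. The `CrashSchedule` type, `schedule_from_seed`, and the `max_ops` bound are hypothetical names for illustration and are not part of the engine's harness.

```rust
/// One step of SplitMix64, a small deterministic pseudo-random generator.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// A crash schedule: abort at the `fire_at`-th object-store operation (0-based).
/// HYPOTHETICAL type, introduced for this sketch only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct CrashSchedule {
    seed: u64,
    fire_at: u64,
}

/// Derive the schedule from the seed alone, so a failing run can be replayed
/// exactly by re-running with the same seed. `max_ops` bounds the crash point
/// to the number of operations the workload is expected to issue.
fn schedule_from_seed(seed: u64, max_ops: u64) -> CrashSchedule {
    assert!(max_ops > 0, "workload must issue at least one operation");
    let mut state = seed;
    let fire_at = splitmix64(&mut state) % max_ops;
    CrashSchedule { seed, fire_at }
}
```

No wall-clock time, thread scheduling, or global random state enters the derivation, which is what makes a failing seed reproducible.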

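// fault-injection wrapper: deterministically crash after the Nth CAS-append
use std::sync::atomic::{AtomicU64, Ordering};

// Generic over the wrapped backend, since ObjectStore is a trait; CrashMode must derive PartialEq.
struct CrashAfterCas<S: ObjectStore> { inner: S, fire_at: u64, seq: AtomicU64, mode: CrashMode }

impl<S: ObjectStore> ObjectStore for CrashAfterCas<S> {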
    fn cas_append(&self, key: &str, expect: Etag, body: &[u8]) -> Result<Etag> {
        let n = self.seq.fetch_add(1, Ordering::SeqCst);
        let r = self.inner.cas_append(key, expect, body)?;   // durable in S3
        if n == self.fire_at && self.mode == CrashMode::AfterCasBeforeAck {
            std::process::abort();   // crash point (a): durable but not yet acked
        }
        Ok(r)
    }
}
// Recovery oracle, post-restart:
//   for every commit whose ack the client observed -> row MUST be readable
//   no row may exist in a torn/partial state
//   replaying the WAL from the last durable CAS LSN MUST be idempotent
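
The first oracle rule (every observed ack must be readable) reduces to set arithmetic over commit identifiers. The sketch below assumes the driver tags each commit with a unique id and that the oracle can list the ids present after restart. The types and the extra "phantom write" check are assumptions of this sketch, and torn-row detection is deliberately out of scope because it needs row-level inspection.

```rust
use std::collections::BTreeSet;

/// Outcome of comparing the recovered database with what the driver observed.
#[derive(Debug, PartialEq, Eq)]
enum OracleVerdict {
    Pass,
    /// Commit ids whose ack the client saw but which are missing after restart.
    AckedWriteLost(Vec<u64>),
    /// Commit ids present after restart that the driver never issued
    /// (an additional sanity check, not one of the plan's stated rules).
    PhantomWrite(Vec<u64>),
}

/// `attempted`: every commit id the driver issued.
/// `acked`: the subset whose ack the driver observed before the crash.
/// `recovered`: commit ids readable after restart.
/// Attempted-but-unacked commits may be present or absent; both outcomes pass.
fn check_recovery(
    attempted: &BTreeSet<u64>,
    acked: &BTreeSet<u64>,
    recovered: &BTreeSet<u64>,
) -> OracleVerdict {
    let lost: Vec<u64> = acked.difference(recovered).copied().collect();
    if !lost.is_empty() {
        return OracleVerdict::AckedWriteLost(lost);
    }
    let phantom: Vec<u64> = recovered.difference(attempted).copied().collect();
    if !phantom.is_empty() {
        return OracleVerdict::PhantomWrite(phantom);
    }
    OracleVerdict::Pass
}
```

The asymmetry is the point: an acked id missing from `recovered` is a failure, while an un-acked id in `recovered` is allowed, which matches the two invariants in the diagram.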

# harness loop: inject, crash, restart, run oracle, assert
# shown for crash point (a); repeat with the crash mode for point (b) so both windows are covered
for seed in $(seq 1 500); do
  CRASH_SEED="$seed" CRASH_MODE="after_cas_before_ack" ./engine-server --db crashdb &
  ./driver --seed "$seed" --workload commit_storm   # records every ack it sees
  wait   # the server aborts mid-run; reap it before restarting for recovery
  ./engine-server --db crashdb --recover-only
  ./oracle --seed "$seed" --assert no-acked-loss --assert no-torn-state || exit 1
done

Definition of pass

Across every injected crash point and seed: zero acked-write losses and zero torn states. One failure anywhere is a fail. This gate runs before any tool stores real data — durability is non-negotiable and is checked first, before latency is even discussed.

Experiment 5 — cold read after scale-to-zero

Informs tuning. Scale-to-zero means caches start cold after every idle period. This experiment quantifies that penalty so idle-timeout and keep-warm can be set with data.

What: Warm-cache read vs forced cold read. The cold variant evicts the cache (or starts fresh compute) so the read must hit S3 through the full cache miss path.
Report: Both distributions — p50 and p99 for warm and for cold — as latency CDFs. The cold p99 is the number a user feels on the first request after idle.
Informs: Idle-timeout (how long to keep compute alive) and keep-warm cadence (how often to ping to avoid the cold path). See Lifecycle & Controller.

How to run

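# warm: prime the cache in an unmeasured pass, then measure with per-transaction logging (-l)
pgbench -c 1 -T 30 -n -f point_read.sql "$DSN" > /dev/null                      # warm-up, discarded
pgbench -c 1 -T 30 -r -n -l --log-prefix=exp5_warm -f point_read.sql "$DSN"     # measured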

# cold: force eviction / fresh compute before each measured read
for i in $(seq 1 200); do
  ./controller --scale-to-zero --db colddb   # tear down, evict cache
  ./driver --one-read --hdr cold_reads.hdr "$DSN"   # first read after wake = cold
done

-- point_read.sql
\set id random(1, 1000000)
SELECT v FROM ledger WHERE k = :id;

Extension: N concurrent cold starts (thundering herd)

Required for high-fan-out deployments

Extend Experiment 5 to N simultaneous cold starts to find the spin-up saturation point, for example 1,000 clients arriving at once after a global idle period. This is the thundering-herd case called out in both Deployment Targets and Lifecycle & Controller. Report the knee of the curve: the lowest concurrency at which cold-start latency begins to degrade non-linearly.
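
One simple way to report the knee is the lowest concurrency whose cold-start p99 exceeds a multiple of the single-client baseline. This threshold rule is an assumption made for this sketch, a proxy for "degrades non-linearly" rather than a definition from the plan, and the function name is illustrative.

```rust
/// `curve` holds (concurrent cold starts, cold-start p99 in ms), ordered by
/// ascending concurrency; the first entry is treated as the baseline.
/// Returns the lowest concurrency whose p99 exceeds `factor` x baseline,
/// or None if no point in the sweep crosses that line.
/// NOTE: the threshold-vs-baseline criterion is an assumption of this sketch.
fn cold_start_knee(curve: &[(u32, f64)], factor: f64) -> Option<u32> {
    let &(_, baseline) = curve.first()?;
    curve
        .iter()
        .find(|&&(_, p99)| p99 > baseline * factor)
        .map(|&(n, _)| n)
}
```

A result of None means the sweep never reached saturation, so the knee lies above the largest N tested and the sweep should be extended.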

Harness & tooling

Concern	Tool	Used for
Server-mode driver	`twill-bench --transport pgwire` · `pgbench`	Exp 1, 2, 3 over the pgwire path with HDR percentiles; the in-crate driver needs no external Postgres tooling (and is CI-gated), `pgbench` for real-host runs
Realistic OLTP mix	TPC-C via `go-tpc` / BenchBase	Exp 2 (and 3) with a representative transaction profile, not just synthetic inserts
Embedded-mode driver	`twill-bench` (direct FFI micro-loop)	Exp 1, 2, 3 on the embedded path, no wire overhead — the same harness, `--transport embedded` (default)
Latency capture	HDR histogram	p50/p99/p999 for every experiment; never mean-only
Crash injection	seeded fault-injection wrapper + `loom`	Exp 4 deterministic crash schedules and interleaving exploration

Plots to produce

MUST produce throughput-vs-concurrency curves for Experiments 2 and 3 (the plateau and the wall, overlaid for the delta).
MUST produce latency CDFs for Experiments 1 and 5 (the commit floor and the cold/warm split).
SHOULD overlay same-region vs cross-region CDFs on the Experiment 1 plot to make the network component visible.

Mean-only is banned

The mean hides the tail that decides this architecture. Every latency figure is a percentile drawn from an HDR histogram. A report that quotes an average commit latency is rejected on sight.

Configuration knobs under test

These are the engine and harness knobs that materially move the experiment outputs. Sweep or pin each explicitly; record the value in every results row.

group_commit_window: Max time the commit queue waits to coalesce independent commits before flushing. Drives the Exp-2 plateau height vs the Exp-1 tail. Larger window = higher throughput, higher per-commit p99.
group_commit_max_batch: Max records per group-commit flush. Caps the batch so a single CAS-append stays bounded.
object_store_region: Same-region vs cross-region; the dominant term in Exp-1 latency. Pin per variant.
cache_size / lfc_size: Shared-buffer and local-file-cache sizing (Local Cache); governs the warm/cold split in Exp-5 and the miss rate.
idle_timeout: How long compute stays alive before scale-to-zero; the variable Exp-5 is meant to inform.
cas_retry_backoff: Backoff policy on CAS contention at the commit log; affects Exp-3 tail behavior under high writer counts.
writers / clients (-c): The swept axis for Exp-2 and Exp-3 (1…64).
shard_count: Number of separate databases in the Exp-3 sharding variant; the lane count for the W2 lever.

Failure modes & what each experiment catches

Symptom in results	Likely cause	Caught by	Action
p999 ≫ p50 in single commits	commit-path stall (sync compaction, CAS retry storm)	Exp 1	investigate before trusting any throughput curve
Exp-2 plateau ≈ Exp-1 ceiling	group commit not coalescing	Exp 2	verify batching is engaged, not 1 round-trip per commit
Throughput flat from N=1	workload hitting the same row (W2)	Exp 3	shard or route to Postgres
N-DB variant does NOT scale linearly	shared resource serializing across DBs (e.g. one log lane)	Exp 3 sharding	architectural bug — lanes are not independent
Acked write missing after crash	ack before WAL durable (durability-rule violation)	Exp 4	block ship — disqualifying
Torn / half row after crash	non-idempotent recovery / partial materialization	Exp 4	block ship — disqualifying
Cold p99 unacceptable	cache cold after idle; spin-up cost	Exp 5	tune idle_timeout / keep-warm
Cold latency knee at low N	thundering-herd spin-up saturation	Exp 5 (N concurrent)	cap concurrency / pre-warm pool

Decision rule (go / no-go, per tool)

The output of this plan is a per-tool verdict, applied in order. The platform is not all-or-nothing; each tool is placed on the correct side of the boundary.

STEP 1 Run Experiments 1 + 2 first. If p99 single-commit latency is acceptable for the tool's write frequency, AND group-commit throughput (the Exp-2 plateau) exceeds the tool's aggregate write rate → fits — ship it.
STEP 2 Run Experiment 3. If the tool has heavy contended writes (same rows, sequential, cannot batch, cannot shard) → it is in the red quadrant → give that one tool coupled Postgres, keep the disaggregated engine for the rest.
STEP 3 Experiment 4 MUST pass unconditionally before any tool stores real data → durability is non-negotiable. This gate is checked first in practice even though it is stated last; a tool that fails it never ships, no matter its latency.
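
The three steps can be read as a small decision function. The sketch below is one possible encoding, with hypothetical type and field names. It checks the durability gate first, as the rule says happens in practice. Routing a tool that misses the Step 1 targets to coupled Postgres is an assumption of this sketch, inferred from the ship / route / block verdict set, since the steps do not spell out that case.

```rust
/// Per-tool verdict produced by the decision rule.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    Ship,
    RouteToPostgres,
    Block,
}

/// Inputs gathered from a tool's own runs. All field names are illustrative.
struct ToolProfile {
    crash_gate_passed: bool,   // Experiment 4: zero acked loss, zero torn state
    commit_p99_ms: f64,        // Experiment 1 result
    p99_budget_ms: f64,        // what the tool's write frequency tolerates
    plateau_cps: f64,          // Experiment 2 plateau, commits/s
    required_cps: f64,         // the tool's aggregate write rate, commits/s
    unshardable_hot_row: bool, // Experiment 3: contended writes that cannot batch or shard
}

fn decide(tool: &ToolProfile) -> Verdict {
    // Step 3, checked first: a failed durability gate blocks regardless of speed.
    if !tool.crash_gate_passed {
        return Verdict::Block;
    }
    // Step 2: red-quadrant workloads go to coupled Postgres.
    if tool.unshardable_hot_row {
        return Verdict::RouteToPostgres;
    }
    // Step 1: latency and throughput must both fit.
    let latency_ok = tool.commit_p99_ms <= tool.p99_budget_ms;
    let throughput_ok = tool.plateau_cps >= tool.required_cps;
    if latency_ok && throughput_ok {
        Verdict::Ship
    } else {
        // ASSUMPTION: a tool that does not fit is routed rather than blocked.
        Verdict::RouteToPostgres
    }
}
```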

The boundary is measured once, then reused

Most small / bursty / read-heavy internal tools land green on Steps 1–2. The occasional write-heavy outlier gets a different backend at Step 2. Step 3 gates everyone. Benchmark the boundary once; then route each new tool by comparing its write profile against the curves this plan produced.

Results-reporting template

Every benchmark run records its findings in this table. Target/Gate is set per tool before the run; Verdict follows mechanically from result vs target.

Experiment	Metric	Target / Gate	Result	Verdict
Exp 1 — latency floor	commit p99 (same-region)	≤ tool write-interval budget	<fill> ms	pending
Exp 1 — latency floor	commit p999 / p50 ratio	small multiple (no stalls)	<fill>×	pending
Exp 2 — group-commit curve	plateau commits/sec	≥ tool aggregate write rate	<fill> /s	pending
Exp 2 — group-commit curve	plateau / Exp-1 ceiling	10–100× (batching works)	<fill>×	pending
Exp 3 — contention wall	contended throughput vs independent	understood & acceptable, else shard/route	<fill>	pending
Exp 3 — sharding	N-DB aggregate scaling	~linear in N	<fill>	pending
Exp 4 — crash safety	acked-write loss	GATE: zero	<fill>	pending
Exp 4 — crash safety	torn / half state	GATE: zero	<fill>	pending
Exp 5 — cold read	cold-read p99	≤ first-request budget	<fill> ms	pending
Exp 5 — herd	cold-start knee (N)	≥ expected peak fan-out	<fill>	pending

Dependencies / existing pieces to start from

MUST have a working ObjectStorage backend (S3-CAS commit log + LSM page store) before Exp 1 — the network round-trip is the thing under test.
MUST have group commit implemented in the Engine Core commit path before Exp 2.
SHOULD have the pgwire server running for pgbench / TPC-C; embedded experiments use the FFI path instead.
SHOULD have the fault-injection wrapper around the object-store client and the loom harness from the Engine Core quality plan available for Exp 4.
SHOULD have the Lifecycle Controller able to scale-to-zero / evict on demand for Exp 5.
MAY reuse off-the-shelf drivers (pgbench, go-tpc, BenchBase, hdrhistogram) rather than building harnesses — only the crash oracle is bespoke.

Acceptance criteria / definition of done

MUST pass Experiment 4 across all adversarial crash points and all seeds with zero acked-write loss and zero torn state. This is the blocking gate.
MUST publish distributions (p50/p99/p999) and curves for Experiments 1–3, with same-region vs cross-region variants for Exp 1.
MUST demonstrate near-linear scaling in the Exp-3 N-database variant, validating the sharding lever for W2.
SHOULD publish warm and cold CDFs for Experiment 5, plus an N-concurrent-cold-start knee.
SHOULD make every run reproducible: pinned engine SHA, storage backend SHA, region, instance type, and (for Exp 4) crash seed recorded in the results template.
MUST document a go/no-go verdict for each tool slated to use the engine, derived by the decision rule from its own runs.

Open questions & risks

SHOULD resolve: what group-commit window best trades Exp-2 plateau against Exp-1 tail for the platform's typical write frequency? Sweep it; it is the central tuning knob.
SHOULD resolve: does the S3-CAS commit log itself become a cross-DB serialization point in the Exp-3 sharding variant? If lanes do not scale linearly, the bottleneck is architectural, not workload.
MAY investigate: how much does cross-cloud egress / latency (compute and bucket on different providers, per Deployment Targets) shift the Exp-1 and Exp-5 distributions?
MAY investigate: the cold-traversal cost of an HNSW vector index over object storage (Capabilities) — the nastiest cache-miss case — likely warrants a dedicated Exp-5 variant.
SHOULD track maturity risk: object-storage-native OLTP is an active frontier; re-run this plan against each engine/storage version bump, since the numbers move (see Tradeoffs & Risks).

Related specifications

Code	Specification	Relevance
OBJ	Object-Storage Backend	S3-CAS commit log + LSM page store; the system under test for Exp 1–4.
CORE	Engine Core	Group commit, WAL durability rule, and the loom/simulation harness Exp 4 ties into.
HOT	Hot-Row Contention	The red-quadrant strategy that Exp 3 detects and routes around.
TBENCH	Twill Bench CLI	The ergonomic CLI that operationalizes these experiments, with scenarios, correctness profiles, and reporting.
CACHE	Local Cache	Warm/cold split measured by Exp 5; the read path that hides S3 latency.
LIFE	Lifecycle & Controller	Scale-to-zero, idle-timeout, and the thundering-herd cold start Exp 5 informs.
SRV	Server Mode	The pgwire listener that pgbench and TPC-C drive against.
DEPLOY	Deployment Targets	Same-region vs cross-region and egress factors that shift the distributions.
RISK	Tradeoffs & Risks	The honest tradeoffs (write latency, single writer, cold start) this plan quantifies.