15 · TBENCH · Draft v0.1 · Updated 2026-06-21

Twill Bench CLI

The official benchmarking, correctness, and serverless-efficiency CLI for Twill DB — a single driver that knows the engine's internals, so it measures not just request latency but commit durability, lifecycle behaviour, and data correctness under load. This page captures the product vision and verifies it against what ships today, separating what already exists from what is aligned-but-unbuilt and what would strain the architecture's rules.

Purpose & relationship to the validation plan

Unlike a generic load generator (k6, Locust, Vegeta), Twill Bench understands the engine: it can attribute latency to the commit round-trip, watch a database scale to zero and back, and assert ACID invariants over the result set. It exists to answer three questions on every run — is the database fast, is the data still correct under stress, and how efficiently did the serverless architecture use compute and storage?

15 is the CLI; 09 is the falsifiable plan it operationalizes

The Benchmark & Validation Plan defines the five adversarial experiments (latency floor, group-commit curve, contention wall, crash safety, cold read), the measurement methodology (percentiles via HDR histogram, never means), and the ship/route/block decision rule. This page does not replace those — it wraps them in an ergonomic CLI and adds scenario-oriented workloads, correctness profiles, and reporting on top. Where the two meet, spec 09 is authoritative on methodology; any number Twill Bench reports for a placement decision MUST still satisfy 09 (real object store for W1, the Experiment-4 durability gate before real data).

Goals & design principles

MUST capture all latency as a distribution (p50/p90/p95/p99/p999 via an HDR-style histogram); mean-only reporting is prohibited (inherited from spec 09).
MUST validate data correctness alongside performance — a fast run that loses an acked write or an update is a failure, not a fast success.
MUST emit both human-friendly terminal output and a machine-readable JSON record for archiving, plotting, and CI gating.
SHOULD be zero-config for the common scenarios and reproducible (pinned engine + storage SHA, region, instance type, seed — per spec 09).
SHOULD be safe to run against production for read-only scenarios, and refuse to mutate unless explicitly opted in.
SHOULD be extensible via benchmark profiles without recompiling the driver.
MUST NOT pull heavyweight or non-auditable dependencies into the engine core to satisfy a reporting feature; CLI-only concerns (YAML profiles, Prometheus export) stay in the twill-bench crate, feature-gated, and never cross the storage seam or the thread-free engine boundary.

Command structure

The brief proposes a twill bench <scenario> [options] surface. Today the binary is invoked as twill-bench <expN> --url … [flags]. Both are sub-command driven; the work below is to grow the scenario vocabulary and a profile loader, not to re-architect the entry point.

twill bench read-heavy
twill bench write-heavy
twill bench burst
twill bench scale-to-zero
twill bench bank-transfer        # correctness workload profile
twill bench compare --baseline v0.5.0 --candidate v0.5.1
twill bench custom --profile workload.yaml
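As a rough illustration of growing the scenario vocabulary without touching the entry point, the sub-command names above could be parsed into a single enum. This is a minimal sketch, not the shipped parser: the `Scenario` type and its variants are hypothetical names, and only the scenario strings come from the list above.

```rust
use std::str::FromStr;

/// Scenario names copied from the proposed command surface.
/// The enum itself is a hypothetical illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scenario {
    ReadHeavy,
    WriteHeavy,
    Burst,
    ScaleToZero,
    BankTransfer,
    Compare,
    Custom,
}

impl FromStr for Scenario {
    type Err = String;

    /// Maps a sub-command string to a scenario, or returns a message that a
    /// caller could report as a configuration error.
    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "read-heavy" => Ok(Scenario::ReadHeavy),
            "write-heavy" => Ok(Scenario::WriteHeavy),
            "burst" => Ok(Scenario::Burst),
            "scale-to-zero" => Ok(Scenario::ScaleToZero),
            "bank-transfer" => Ok(Scenario::BankTransfer),
            "compare" => Ok(Scenario::Compare),
            "custom" => Ok(Scenario::Custom),
            other => Err(format!("unknown scenario: {}", other)),
        }
    }
}
```

With this shape, `"burst".parse::<Scenario>()` yields `Ok(Scenario::Burst)`, and an unknown name surfaces as an `Err` rather than a panic.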

Benchmark scenarios

Scenarios are named workload shapes. The three implemented experiments map onto the request-mix scenarios; the lifecycle scenarios (burst, scale-to-zero, long-run) are new and depend on the controller being in the loop.

Scenario	Shape	Status	Notes
read-heavy	90% SELECT / 10% INSERT (analytical-leaning mix)	PLANNED	mix engine exists (writer loop); needs read path + ratio control
write-heavy	20% SELECT / 80% INSERT (ingestion)	PLANNED	close to today's `exp2` independent-row insert loop
mixed-oltp	70% SELECT / 20% INSERT / 8% UPDATE / 2% DELETE (SaaS)	PLANNED	needs UPDATE/DELETE/SELECT in the driver loop
burst	idle → 500 → 5k → 20k rps → idle, repeat; measures cold/warm starts + scaling latency	NEEDS DESIGN	requires the controller's lifecycle signals, not just the engine
scale-to-zero	query → idle 10m → query; measures cold boot, compute reuse, cache restore	NEEDS DESIGN	this is spec 09 Exp 5 (cold read), a SHOULD; not yet a subcommand
long-run	hours/days; detects memory/resource/connection leaks, scheduler drift	NEEDS DESIGN	soak harness + resource sampling
custom	user-supplied YAML profile (duration, connections, mix)	NEEDS DESIGN	profile loader; keep the YAML dep CLI-only + feature-gated
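The "ratio control" the request-mix scenarios need can be sketched as a weighted pick driven by a seeded generator, so that a pinned seed reproduces the same operation sequence. This is an assumption-laden sketch rather than the driver's design: `Op`, `Mix`, and `SplitMix64` are hypothetical names, and SplitMix64 is simply a small, well-known PRNG chosen here to avoid any dependency.

```rust
/// Operation kinds named in the scenario table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Select,
    Insert,
    Update,
    Delete,
}

/// SplitMix64: a tiny deterministic PRNG, so one seed yields one op sequence.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn new(seed: u64) -> Self {
        SplitMix64(seed)
    }

    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// A request mix as (operation, percent) pairs that must sum to 100.
pub struct Mix {
    weights: Vec<(Op, u32)>,
}

impl Mix {
    /// Rejects mixes that do not sum to 100%, which a CLI could treat as a
    /// configuration error.
    pub fn new(weights: Vec<(Op, u32)>) -> Result<Self, String> {
        let total: u32 = weights.iter().map(|(_, w)| *w).sum();
        if total != 100 {
            return Err(format!("mix sums to {}%, expected 100%", total));
        }
        Ok(Mix { weights })
    }

    /// The mixed-oltp shape from the table: 70/20/8/2.
    pub fn mixed_oltp() -> Self {
        Mix {
            weights: vec![
                (Op::Select, 70),
                (Op::Insert, 20),
                (Op::Update, 8),
                (Op::Delete, 2),
            ],
        }
    }

    /// Picks the next operation. The modulo bias of `% 100` over a 64-bit
    /// value is negligible for this purpose.
    pub fn pick(&self, rng: &mut SplitMix64) -> Op {
        let mut roll = (rng.next_u64() % 100) as u32;
        for (op, weight) in &self.weights {
            if roll < *weight {
                return *op;
            }
            roll -= *weight;
        }
        unreachable!("weights sum to 100, so some bucket always matches")
    }
}
```

Keeping the generator inside the bench crate also fits the dependency rule above: nothing here needs to reach the engine.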

Lifecycle scenarios need the controller, not the embedded engine

Burst, scale-to-zero, and long-run measure cold start, worker reuse, and scheduling — signals the embedded library deliberately does not own (the engine core is thread-free; lifecycle lives in twill-controller). These scenarios are meaningful over the server / pgwire path against a controller-driven deployment, and degrade to a single warm process when run purely embedded.
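The client-side half of the burst scenario is only a rate schedule; the cold/warm signals still have to come from the controller. A minimal sketch of that schedule follows, assuming equal hold times per phase (the brief gives the rates but not the durations, so `hold` is a hypothetical parameter, as are the `Phase`, `burst_cycle`, and `target_rps_at` names).

```rust
use std::time::Duration;

/// One step of the burst shape: a target request rate held for some time.
pub struct Phase {
    pub target_rps: u32,
    pub hold: Duration,
}

/// The idle -> 500 -> 5k -> 20k cycle from the scenario table. Repeating the
/// cycle returns to idle, which gives the trailing "idle, repeat".
pub fn burst_cycle(hold: Duration) -> Vec<Phase> {
    [0u32, 500, 5_000, 20_000]
        .iter()
        .map(|&target_rps| Phase { target_rps, hold })
        .collect()
}

/// Target rate at `elapsed` time since the run started, wrapping around the
/// cycle indefinitely.
pub fn target_rps_at(cycle: &[Phase], elapsed: Duration) -> u32 {
    let total: Duration = cycle.iter().map(|p| p.hold).sum();
    if total.is_zero() {
        return 0;
    }
    let mut offset = Duration::from_nanos((elapsed.as_nanos() % total.as_nanos()) as u64);
    for phase in cycle {
        if offset < phase.hold {
            return phase.target_rps;
        }
        offset -= phase.hold;
    }
    0
}
```

Run purely embedded, this schedule still drives load, but as noted above it exercises a single warm process rather than real scaling.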

Metrics

The brief groups metrics into five families. The query family ships today; the compute and scheduler families are the new, lifecycle-dependent surface and must be sourced from the controller/server, never invented inside the engine.

Query metrics IMPLEMENTED

Total / successful / failed requests, requests/sec, and the latency distribution (median, p90, p95, p99, max). Today's driver records per-commit latency into an HDR-style histogram and prints p50/p99/p999; widening to p90/p95 and a success/failure split is additive.
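To make "HDR-style histogram" concrete, here is a small log-linear histogram with 32 sub-buckets per power of two, which bounds the relative error of any reported percentile at roughly 3%. It is an illustrative sketch and not the driver's actual implementation; `LatencyHistogram` and its methods are hypothetical names, and values are assumed to be microseconds.

```rust
const SUB_BITS: u32 = 5;
const SUB: u64 = 1 << SUB_BITS; // 32 sub-buckets per power of two
const BUCKETS: usize = ((64 - SUB_BITS + 1) as usize) * (SUB as usize);

/// Log-linear latency histogram over u64 values.
pub struct LatencyHistogram {
    counts: Vec<u64>,
    total: u64,
}

impl LatencyHistogram {
    pub fn new() -> Self {
        LatencyHistogram { counts: vec![0; BUCKETS], total: 0 }
    }

    /// Values below 32 get exact buckets; larger values keep their top
    /// six significant bits, so bucket width grows with magnitude.
    fn index(value: u64) -> usize {
        if value < SUB {
            return value as usize;
        }
        let msb = 63 - value.leading_zeros();
        let shift = msb - SUB_BITS;
        let mantissa = value >> shift; // always in [32, 64)
        ((shift as u64 + 1) * SUB + (mantissa - SUB)) as usize
    }

    /// Largest value that maps to `index`; reported so percentiles err high.
    fn upper_bound(index: usize) -> u64 {
        let i = index as u64;
        if i < SUB {
            return i;
        }
        let shift = i / SUB - 1;
        let lower = (SUB + i % SUB) << shift;
        lower + ((1u64 << shift) - 1)
    }

    pub fn record(&mut self, micros: u64) {
        self.counts[Self::index(micros)] += 1;
        self.total += 1;
    }

    /// Value at quantile `q` in [0, 1], or None if nothing was recorded.
    pub fn value_at(&self, q: f64) -> Option<u64> {
        if self.total == 0 {
            return None;
        }
        let rank = ((q.clamp(0.0, 1.0) * self.total as f64).ceil() as u64).max(1);
        let mut seen: u64 = 0;
        for (i, count) in self.counts.iter().enumerate() {
            seen += *count;
            if seen >= rank {
                return Some(Self::upper_bound(i));
            }
        }
        None
    }
}
```

Because the histogram stores counts rather than samples, widening the printed set from p50/p99/p999 to include p90 and p95 is just two more `value_at` calls over the same data.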

Storage metrics PARTIAL

Commit latency is measured today (it is the experiment). Snapshot/WAL read and write counts, storage-fetch latency, and cache hit/miss ratio require the engine and storage layers to expose counters through a read-only stats surface. That work is additive, but it introduces a new introspection path that must respect the storage seam: no backend internals may leak into the Storage trait.

Compute & scheduler metrics NEEDS DESIGN

Cold/warm starts, average start time, worker-reuse rate, compute active/idle duration, scale-to-zero events, peak workers; queue depth, scheduling delay, allocation/placement time. These come from twill-controller and the server, and overlap directly with a future observability/OTLP export. Twill Bench consumes them; it does not generate them.

Network metrics NEEDS DESIGN

Client RTT, TLS handshake, request/response transfer — meaningful only on the pgwire transport; measured client-side in the bench driver.

Data-correctness validation

Performance is only valid if correctness is preserved. Twill Bench asserts, over the workload it just drove, that none of the classic anomalies occurred.

MUST detect: missing rows, duplicate rows, lost updates, dirty reads, non-repeatable reads, phantom reads, serializability violations, and transaction-consistency breaks.
MUST exit non-zero (code 2) when any invariant is violated, regardless of latency.

Today's exp3 already counts first-committer-wins conflicts and retries them, proving no-lost-update on a contended counter (over pgwire the conflict surfaces as SQLSTATE 40001); the conformance and group-commit suites prove durability and isolation at the crate level. The new work is to lift those assertions into named, result-checking workload profiles.
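The first-committer-wins retry pattern described above can be illustrated with an in-memory stand-in for the contended row. This sketch does not talk to Twill DB: `VersionedCounter`, `Conflict`, and `run_counter` are hypothetical, and `Conflict` merely plays the role that SQLSTATE 40001 plays over pgwire.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// In-memory stand-in for a contended row, stored as (version, value).
#[derive(Default)]
pub struct VersionedCounter {
    inner: Mutex<(u64, u64)>,
}

/// Stands in for a serialization failure reported to the client.
#[derive(Debug, PartialEq, Eq)]
pub struct Conflict;

impl VersionedCounter {
    /// Snapshot read of (version, value).
    pub fn read(&self) -> (u64, u64) {
        *self.inner.lock().unwrap()
    }

    /// First committer wins: the write lands only if nobody else committed
    /// since the snapshot was taken.
    pub fn commit(&self, snapshot_version: u64, new_value: u64) -> Result<(), Conflict> {
        let mut row = self.inner.lock().unwrap();
        if row.0 != snapshot_version {
            return Err(Conflict);
        }
        *row = (snapshot_version + 1, new_value);
        Ok(())
    }
}

/// Runs `workers` threads that each apply `increments_each` increments,
/// retrying on conflict. Returns (final value, total retries).
pub fn run_counter(workers: u64, increments_each: u64) -> (u64, u64) {
    let counter = Arc::new(VersionedCounter::default());
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                let mut retries = 0u64;
                for _ in 0..increments_each {
                    loop {
                        let (version, value) = counter.read();
                        match counter.commit(version, value + 1) {
                            Ok(()) => break,
                            Err(Conflict) => retries += 1,
                        }
                    }
                }
                retries
            })
        })
        .collect();
    let retries: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (counter.read().1, retries)
}
```

The no-lost-update assertion is then simply that the final value equals `workers * increments_each`, whatever the retry count turned out to be.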

Workload profiles

Profile	Drives	Asserts
bank-transfer	concurrent transfers between two accounts	ACID, balance invariant (sum conserved), transaction correctness
inventory	concurrent stock decrements	no negative inventory, optimistic-lock conflict handling
counter	thousands of concurrent increments	atomicity, zero lost updates (generalizes `exp3`)
document-editing	concurrent updates to one document	conflict rate, merge/retry latency, retry behaviour
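For the bank-transfer profile, the post-run assertion could look like the checker below: replay every acked transfer over the initial balances and compare against what the database reports. It is a sketch under stated assumptions, namely that account ids are valid indices into the balance slices and that the driver keeps a log of acked transfers; `Transfer`, `Violation`, and `check_bank_transfer` are hypothetical names.

```rust
/// One acked transfer, with accounts identified by index.
pub struct Transfer {
    pub from: usize,
    pub to: usize,
    pub amount: i64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    /// The total across all accounts changed during the run.
    SumNotConserved { expected: i64, observed: i64 },
    /// An account does not reflect exactly the acked transfers.
    BalanceMismatch { account: usize, expected: i64, observed: i64 },
}

/// Replays acked transfers and reports every invariant break. Replay order
/// does not matter because transfers are additive and therefore commute.
pub fn check_bank_transfer(
    initial: &[i64],
    acked: &[Transfer],
    observed: &[i64],
) -> Vec<Violation> {
    let mut expected = initial.to_vec();
    for t in acked {
        expected[t.from] -= t.amount;
        expected[t.to] += t.amount;
    }

    let mut violations = Vec::new();
    let (want, got): (i64, i64) = (initial.iter().sum(), observed.iter().sum());
    if want != got {
        violations.push(Violation::SumNotConserved { expected: want, observed: got });
    }
    for (account, (e, o)) in expected.iter().zip(observed).enumerate() {
        if e != o {
            violations.push(Violation::BalanceMismatch {
                account,
                expected: *e,
                observed: *o,
            });
        }
    }
    violations
}
```

A non-empty result is what would map to exit code 2, independent of how fast the run was.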

Reporting

Latency breakdown NEEDS DESIGN

Rather than a single total, attribute time across the request path — receive, auth, scheduler, compute spin-up, metadata load, storage read, execution, commit, response. The spin-up / metadata / storage segments require the same controller + storage introspection as the compute/storage metrics above; until those land, the breakdown is partial (client-side + commit only).
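A partial breakdown can be represented honestly by making every segment optional and reporting the remainder as unattributed, rather than guessing at segments that are not yet measurable. The sketch below assumes that shape; `Breakdown` and the snake_case segment labels are hypothetical, with the segment order taken from the paragraph above.

```rust
use std::time::Duration;

/// Request-path segments in the order listed above.
pub const SEGMENTS: [&str; 9] = [
    "receive",
    "auth",
    "scheduler",
    "compute_spin_up",
    "metadata_load",
    "storage_read",
    "execution",
    "commit",
    "response",
];

/// End-to-end latency plus whichever segments could actually be measured.
pub struct Breakdown {
    pub total: Duration,
    pub segments: [Option<Duration>; 9],
}

impl Breakdown {
    /// Sum of the segments that were measured.
    pub fn attributed(&self) -> Duration {
        self.segments.iter().flatten().copied().sum()
    }

    /// Time not explained by any measured segment.
    pub fn unattributed(&self) -> Duration {
        self.total.saturating_sub(self.attributed())
    }

    /// Names of segments with no measurement yet.
    pub fn missing(&self) -> Vec<&'static str> {
        SEGMENTS
            .iter()
            .zip(self.segments.iter())
            .filter(|(_, d)| d.is_none())
            .map(|(name, _)| *name)
            .collect()
    }
}
```

Reporting `missing()` alongside the numbers keeps a client-side-plus-commit breakdown from being mistaken for a complete one.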

Serverless-efficiency report NEEDS DESIGN

Unique to Twill DB: compute-active vs idle time, scale-to-zero count, average worker lifetime, compute-seconds/query, storage-reads/query, average compute utilization. This reframes a run around operational cost, not just latency — the most differentiated piece of the brief, and entirely controller-sourced.
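Once the controller supplies the raw totals, the derived efficiency figures are simple ratios. The sketch below assumes utilization means active time over active plus idle time, which the brief does not define explicitly; `RunTotals`, `Efficiency`, and `efficiency` are hypothetical names.

```rust
/// Raw totals for one run, assumed to be sourced from the controller.
pub struct RunTotals {
    pub compute_active_secs: f64,
    pub compute_idle_secs: f64,
    pub queries: u64,
    pub storage_reads: u64,
}

/// Derived per-query and utilization figures.
#[derive(Debug, PartialEq)]
pub struct Efficiency {
    pub compute_secs_per_query: f64,
    pub storage_reads_per_query: f64,
    /// Assumed definition: active / (active + idle).
    pub utilization: f64,
}

/// Returns None when the ratios would be undefined (no queries, or no
/// compute time recorded at all).
pub fn efficiency(totals: &RunTotals) -> Option<Efficiency> {
    let wall = totals.compute_active_secs + totals.compute_idle_secs;
    if totals.queries == 0 || wall <= 0.0 {
        return None;
    }
    let queries = totals.queries as f64;
    Some(Efficiency {
        compute_secs_per_query: totals.compute_active_secs / queries,
        storage_reads_per_query: totals.storage_reads as f64 / queries,
        utilization: totals.compute_active_secs / wall,
    })
}
```

Returning `None` instead of zero or infinity keeps an empty run from producing a misleadingly perfect efficiency number.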

Release comparison PLANNED

twill bench compare --baseline vX --candidate vY diffs two runs (read/write latency, cold start, memory, throughput) into a PASS/regression verdict. The JSON record already carries the git SHA and all percentiles; comparison is a post-processing layer over two archived records, naturally CI-friendly.
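As a post-processing layer, the comparison reduces to checking each archived metric against a tolerance. The following is a minimal sketch, assuming a single relative tolerance for all metrics (the brief does not specify thresholds) and a reduced record with hypothetical field names; `RunRecord`, `Verdict`, and `compare` are illustrative only.

```rust
/// A reduced view of an archived run; field names are hypothetical.
pub struct RunRecord {
    pub sha: String,
    pub p50_us: u64,
    pub p99_us: u64,
    pub throughput_rps: f64,
}

#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass,
    /// One human-readable line per regressed metric.
    Regression(Vec<String>),
}

/// Flags a latency metric that grew, or throughput that shrank, by more than
/// `tolerance` (for example 0.05 for 5%) relative to the baseline.
pub fn compare(baseline: &RunRecord, candidate: &RunRecord, tolerance: f64) -> Verdict {
    let mut regressions = Vec::new();

    let latencies = [
        ("p50_us", baseline.p50_us, candidate.p50_us),
        ("p99_us", baseline.p99_us, candidate.p99_us),
    ];
    for (name, base, cand) in latencies {
        if (cand as f64) > (base as f64) * (1.0 + tolerance) {
            regressions.push(format!("{}: {} -> {}", name, base, cand));
        }
    }

    if candidate.throughput_rps < baseline.throughput_rps * (1.0 - tolerance) {
        regressions.push(format!(
            "throughput_rps: {:.1} -> {:.1}",
            baseline.throughput_rps, candidate.throughput_rps
        ));
    }

    if regressions.is_empty() {
        Verdict::Pass
    } else {
        Verdict::Regression(regressions)
    }
}
```

In CI, a `Regression` verdict would most naturally map to the "benchmark failed" exit code, though that mapping is itself an open choice.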

Output formats

IMPLEMENTED — terminal summary + one-line JSON record (experiment, transport, backend, SHA, throughput, percentiles).
PLANNED — --json as the sole stdout payload for scripting (today the JSON line accompanies the human summary).
NEEDS DESIGN — --prometheus export for Grafana/CI. Keep the exporter dependency CLI-only and feature-gated; it must never reach the engine.
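For reference, a one-line JSON record carrying the fields listed above can be produced without any serialization dependency. This is a sketch only: the key names (`throughput_rps`, `p50_us`, and so on) are assumptions rather than the shipped schema, and the throughput value is assumed to be finite so that the output stays valid JSON.

```rust
/// Fields the one-line record is described as carrying; names are hypothetical.
pub struct RunRecord {
    pub experiment: String,
    pub transport: String,
    pub backend: String,
    pub sha: String,
    pub throughput_rps: f64,
    pub p50_us: u64,
    pub p99_us: u64,
    pub p999_us: u64,
}

/// Escapes the characters JSON requires inside string literals.
pub fn escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Renders the record as a single line, suitable for appending to an archive
/// file or piping into a CI gate.
pub fn to_json_line(r: &RunRecord) -> String {
    format!(
        "{{\"experiment\":\"{}\",\"transport\":\"{}\",\"backend\":\"{}\",\"sha\":\"{}\",\"throughput_rps\":{:.1},\"p50_us\":{},\"p99_us\":{},\"p999_us\":{}}}",
        escape(&r.experiment),
        escape(&r.transport),
        escape(&r.backend),
        escape(&r.sha),
        r.throughput_rps,
        r.p50_us,
        r.p99_us,
        r.p999_us
    )
}
```

A `--json` mode would then print only this line to stdout and route the human summary elsewhere or suppress it.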

Exit codes

Code	Meaning
0	success
1	benchmark failed
2	correctness validation failed
3	configuration error
4	connection error

Today the binary uses 2 for a usage/parse error and 1 for a run failure. Adopting the brief's table (notably 2 = correctness failure, 3 = config) is a small, behaviour-defining change worth pinning in a test.
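Pinning the table in code makes the change testable. The sketch below encodes the five codes from the table and one precedence rule taken from the correctness section (a violated invariant yields code 2 regardless of how the run otherwise went); `Exit` and `classify` are hypothetical names, not the binary's current types.

```rust
use std::process::ExitCode;

/// Exit codes from the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
#[repr(u8)]
pub enum Exit {
    Success = 0,
    BenchmarkFailed = 1,
    CorrectnessFailed = 2,
    ConfigError = 3,
    ConnectionError = 4,
}

impl From<Exit> for ExitCode {
    /// Lets `main` return the enum directly as a process exit code.
    fn from(exit: Exit) -> ExitCode {
        ExitCode::from(exit as u8)
    }
}

/// Correctness takes precedence: any violation is code 2 even if the run
/// itself completed, or also failed for another reason.
pub fn classify(run_ok: bool, violations: usize) -> Exit {
    if violations > 0 {
        Exit::CorrectnessFailed
    } else if run_ok {
        Exit::Success
    } else {
        Exit::BenchmarkFailed
    }
}
```

A test that asserts the numeric value of each variant is enough to catch an accidental return to the current 2 = usage error behaviour.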

Verification summary — brief vs. what ships

Where each brief capability stands against the current crates/bench driver and the project's architecture rules.

Capability	Status	Notes
p50/p99/p999 via HDR histogram, JSON record, git SHA	IMPLEMENTED	`exp1/exp2/exp3`, embedded + pgwire, `file://` + `s3://`
Group-commit + contention experiments	IMPLEMENTED	maps to spec 09 Exp 1–3; conflict retry proven
Crash-safety gate	IMPLEMENTED	spec 09 Exp 4, in `crates/storage` (CI gate), not yet a bench subcommand
Request-mix scenarios (read/write/mixed)	PLANNED	needs SELECT/UPDATE/DELETE in the driver loop + ratio control
Correctness workload profiles	PLANNED	generalize `exp3`; assert invariants, exit code 2 on violation
Release comparison	PLANNED	post-process two archived JSON records
Cold-read / scale-to-zero scenario (Exp 5)	NEEDS DESIGN	spec 09 SHOULD; needs controller in the loop
Compute / scheduler / serverless-efficiency metrics	NEEDS DESIGN	controller-sourced; overlaps a future OTLP export — bench consumes, engine never generates
Storage counters, latency breakdown	NEEDS DESIGN	read-only introspection surface that respects the storage seam
YAML custom profiles, Prometheus export	NEEDS DESIGN	CLI-only, feature-gated deps; keep out of engine core

Constraints & open questions

MUST NOT let any reporting feature (YAML, Prometheus, OTLP) add a dependency to crates/engine or move a backend concept into the Storage trait — these live in the bench/server/controller crates only.
MUST NOT generate lifecycle/compute metrics inside the thread-free engine; consume them from twill-controller and the server.
SHOULD resolve: do the new compute/efficiency metrics share one signal vocabulary with a future observability/OTLP exporter? They should — settle the metric set in the bench first, then emit the same set live.
SHOULD resolve: what is the safe-against-production guard? Read-only scenarios run anywhere; any mutating scenario requires an explicit opt-in flag and a non-production URL by default.
MAY defer (future enhancements per the brief): multi-region runs, fault injection, network-latency simulation, storage-backend comparison, flamegraphs, continuous mode, a historical benchmark DB, HTML reports, and an interactive TUI.
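One possible shape for the production guard raised above, offered only as a starting point for that open question: read-only scenarios always pass, and mutating scenarios need both an explicit opt-in and a non-production target. How a target is recognised as production is deliberately left to the caller, since the brief does not say; `GuardError` and `check_guard` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum GuardError {
    /// The scenario writes data but the user did not opt in.
    MutationNotOptedIn,
    /// The scenario writes data and the target was classified as production.
    ProductionTarget,
}

/// Decides whether a scenario may run against the given target.
///
/// `target_is_production` is an input because the classification rule is
/// still an open question; this function only encodes the policy.
pub fn check_guard(
    mutates: bool,
    allow_mutation: bool,
    target_is_production: bool,
) -> Result<(), GuardError> {
    if !mutates {
        // Read-only scenarios run anywhere.
        return Ok(());
    }
    if !allow_mutation {
        return Err(GuardError::MutationNotOptedIn);
    }
    if target_is_production {
        return Err(GuardError::ProductionTarget);
    }
    Ok(())
}
```

Either error would plausibly be reported as a configuration error, so that a refused run is distinguishable from a failed one.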

Related specifications

BENCH · Benchmark & Validation Plan: The five falsifiable experiments + methodology this CLI operationalizes; authoritative on placement.
HOT · Hot-Row Contention: The red-quadrant the counter/contention profiles probe and route around.
CTL · Lifecycle & Controller: Source of cold-start, worker-reuse, and scale-to-zero signals the lifecycle scenarios report.
SRV · Server Mode: The pgwire path the network metrics and deployed-server runs drive against.
STOR · Storage Interface: The seam reporting features must not cross; storage counters need a seam-respecting stats surface.
RISK · Tradeoffs & Risks: The honest costs (write latency, single writer, cold start) these runs quantify.