Purpose & relationship to the validation plan

Unlike a generic load generator (k6, Locust, Vegeta), Twill Bench understands the engine: it can attribute latency to the commit round-trip, watch a database scale to zero and back, and assert ACID invariants over the result set. It exists to answer three questions on every run — is the database fast, is the data still correct under stress, and how efficiently did the serverless architecture use compute and storage?

Spec 15 is the CLI; spec 09 is the falsifiable plan it operationalizes

The Benchmark & Validation Plan (spec 09) defines the five adversarial experiments (latency floor, group-commit curve, contention wall, crash safety, cold read), the measurement methodology (percentiles via HDR histogram, never means), and the ship/route/block decision rule. This page does not replace those; it wraps them in an ergonomic CLI and adds scenario-oriented workloads, correctness profiles, and reporting on top. Where the two meet, spec 09 is authoritative on methodology; any number Twill Bench reports for a placement decision MUST still satisfy spec 09 (real object store for W1, the Experiment-4 durability gate before real data).

Goals & design principles

  • MUST capture all latency as a distribution (p50/p90/p95/p99/p999 via an HDR-style histogram); mean-only reporting is prohibited (inherited from spec 09).
  • MUST validate data correctness alongside performance — a fast run that loses an acked write or an update is a failure, not a fast success.
  • MUST emit both human-friendly terminal output and a machine-readable JSON record for archiving, plotting, and CI gating.
  • SHOULD be zero-config for the common scenarios and reproducible (pinned engine + storage SHA, region, instance type, seed — per spec 09).
  • SHOULD be safe to run against production for read-only scenarios, and refuse to mutate unless explicitly opted in.
  • SHOULD be extensible via benchmark profiles without recompiling the driver.
  • MUST NOT pull heavyweight or non-auditable dependencies into the engine core to satisfy a reporting feature; CLI-only concerns (YAML profiles, Prometheus export) stay in the twill-bench crate, feature-gated, and never cross the storage seam or the thread-free engine boundary.

Command structure

The brief proposes a twill bench <scenario> [options] surface. Today the binary is invoked as twill-bench <expN> --url … [flags]. Both are sub-command driven; the work below is to grow the scenario vocabulary and a profile loader, not to re-architect the entry point.

twill bench read-heavy
twill bench write-heavy
twill bench burst
twill bench scale-to-zero
twill bench bank-transfer        # correctness workload profile
twill bench compare --baseline v0.5.0 --candidate v0.5.1
twill bench custom --profile workload.yaml

Benchmark scenarios

Scenarios are named workload shapes. The three implemented experiments map onto the request-mix scenarios; the lifecycle scenarios (burst, scale-to-zero, long-run) are new and depend on the controller being in the loop.

Scenario | Shape | Verification status
read-heavy | 90% SELECT / 10% INSERT (analytical-leaning mix) | PLANNED: mix engine exists (writer loop); needs read path + ratio control
write-heavy | 20% SELECT / 80% INSERT (ingestion) | PLANNED: close to today's exp2 independent-row insert loop
mixed-oltp | 70% SELECT / 20% INSERT / 8% UPDATE / 2% DELETE (SaaS) | PLANNED: needs UPDATE/DELETE/SELECT in the driver loop
burst | idle → 500 → 5k → 20k rps → idle, repeat; measures cold/warm starts + scaling latency | NEEDS DESIGN: requires the controller's lifecycle signals, not just the engine
scale-to-zero | query → idle 10m → query; measures cold boot, compute reuse, cache restore | NEEDS DESIGN: this is spec 09 Exp 5 (cold read), a SHOULD; not yet a subcommand
long-run | hours/days; detects memory/resource/connection leaks, scheduler drift | NEEDS DESIGN: soak harness + resource sampling
custom | user-supplied YAML profile (duration, connections, mix) | NEEDS DESIGN: profile loader; keep the YAML dep CLI-only + feature-gated
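
The ratio control that the request-mix scenarios still need can be sketched as a seeded, weighted operation sampler. This is an illustration only, not the driver's implementation: the `MIXES` table and the `op_stream` function are hypothetical names, and the weights simply restate the percentages listed for each scenario.

```python
import random
from collections import Counter

# Hypothetical scenario table; weights restate the percentages in the spec.
MIXES = {
    "read-heavy": {"SELECT": 90, "INSERT": 10},
    "write-heavy": {"SELECT": 20, "INSERT": 80},
    "mixed-oltp": {"SELECT": 70, "INSERT": 20, "UPDATE": 8, "DELETE": 2},
}


def op_stream(scenario: str, n: int, seed: int) -> list[str]:
    """Return n operation names drawn by weight; the same seed gives the same stream."""
    mix = MIXES[scenario]
    if sum(mix.values()) != 100:
        raise ValueError(f"{scenario}: weights must sum to 100")
    rng = random.Random(seed)  # private RNG, so a run is reproducible from its seed
    return rng.choices(list(mix), weights=list(mix.values()), k=n)


ops = op_stream("mixed-oltp", 10_000, seed=42)
counts = Counter(ops)  # observed mix, to compare against the requested ratios
```

Drawing from a private `random.Random(seed)` rather than the module-level generator keeps the stream reproducible, which matches the reproducibility goal (pinned seed) stated earlier.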

Lifecycle scenarios need the controller, not the embedded engine

Burst, scale-to-zero, and long-run measure cold start, worker reuse, and scheduling — signals the embedded library deliberately does not own (the engine core is thread-free; lifecycle lives in twill-controller). These scenarios are meaningful over the server / pgwire path against a controller-driven deployment, and degrade to a single warm process when run purely embedded.
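
As a rough illustration of the burst shape (idle → 500 → 5k → 20k rps → idle, repeated), the schedule can be expanded into a flat list of phases that a driver would walk through. The `Phase` type, the function names, and the phase durations are all assumptions; the brief fixes the rate steps but not how long each step lasts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    target_rps: int  # 0 means idle
    seconds: int     # hypothetical duration; not specified by the brief


def burst_plan(cycles: int, phase_seconds: int, idle_seconds: int) -> list[Phase]:
    """Expand idle -> 500 -> 5k -> 20k -> idle, repeated `cycles` times."""
    plan = [Phase(0, idle_seconds)]
    for _ in range(cycles):
        for rps in (500, 5_000, 20_000):
            plan.append(Phase(rps, phase_seconds))
        plan.append(Phase(0, idle_seconds))
    return plan


def planned_requests(plan: list[Phase]) -> int:
    """Total requests the plan would issue if every target rate were met."""
    return sum(p.target_rps * p.seconds for p in plan)


plan = burst_plan(cycles=2, phase_seconds=60, idle_seconds=600)
```

A plan like this only describes offered load. Whether the idle phases actually trigger scale-to-zero depends on the controller, which is why the scenario is only meaningful against a controller-driven deployment.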

Metrics

The brief groups metrics into five families. The query family ships today; the compute and scheduler families are the new, lifecycle-dependent surface and must be sourced from the controller/server, never invented inside the engine.

Query metrics IMPLEMENTED

Total / successful / failed requests, requests/sec, and the latency distribution (median, p90, p95, p99, max). Today's driver records per-commit latency into an HDR-style histogram and prints p50/p99/p999; widening to p90/p95 and a success/failure split is additive.
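
To make the "distribution, not mean" rule concrete, here is a minimal log-bucketed histogram in the spirit of an HDR histogram. It is a simplified stand-in, not the driver's actual data structure: the class name, the 1% bucket width, and the microsecond unit are assumptions. It also carries the success/failure split described above as an additive change.

```python
import math


class LatencyHistogram:
    """Log-bucketed histogram: every bucket spans the same relative width."""

    def __init__(self, precision: float = 0.01) -> None:
        self.base = 1.0 + precision          # assumed 1% relative bucket width
        self.log_base = math.log(self.base)
        self.buckets: dict[int, int] = {}
        self.ok = 0
        self.failed = 0

    @property
    def count(self) -> int:
        return self.ok + self.failed

    def record(self, micros: float, success: bool = True) -> None:
        # Values below 1 microsecond are clamped into the first bucket.
        idx = math.floor(math.log(max(micros, 1.0)) / self.log_base)
        self.buckets[idx] = self.buckets.get(idx, 0) + 1
        if success:
            self.ok += 1
        else:
            self.failed += 1

    def percentile(self, p: float) -> float:
        """Upper edge of the bucket that holds the p-th percentile sample."""
        if self.count == 0:
            raise ValueError("no samples recorded")
        rank = max(1, math.ceil(self.count * p / 100.0))
        seen = 0
        for idx in sorted(self.buckets):
            seen += self.buckets[idx]
            if seen >= rank:
                return self.base ** (idx + 1)
        return self.base ** (max(self.buckets) + 1)


hist = LatencyHistogram()
for micros in range(1, 1001):
    hist.record(micros, success=(micros % 100 != 0))

summary = {
    name: hist.percentile(p)
    for name, p in [("p50", 50), ("p90", 90), ("p95", 95), ("p99", 99), ("p999", 99.9)]
}
```

Because each bucket has a fixed relative width, the reported percentile overstates the true sample by at most that width (about 1% here), and memory stays bounded regardless of how many samples are recorded.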

Storage metrics PARTIAL

Commit latency is measured today (it is the experiment). Snapshot/WAL read & write counts, storage-fetch latency, and cache hit/miss ratio require the engine/storage to expose counters through a read-only stats surface — additive, but a new seam-respecting introspection path (no backend internals leak into the Storage trait).

Compute & scheduler metrics NEEDS DESIGN

Cold/warm starts, average start time, worker-reuse rate, compute active/idle duration, scale-to-zero events, peak workers; queue depth, scheduling delay, allocation/placement time. These come from twill-controller and the server, and overlap directly with a future observability/OTLP export. Twill Bench consumes them; it does not generate them.

Network metrics NEEDS DESIGN

Client RTT, TLS handshake, request/response transfer — meaningful only on the pgwire transport; measured client-side in the bench driver.

Data-correctness validation

Performance is only valid if correctness is preserved. Twill Bench asserts, over the workload it just drove, that none of the classic anomalies occurred.

  • MUST detect: missing rows, duplicate rows, lost updates, dirty reads, non-repeatable reads, phantom reads, serializability violations, and transaction-consistency breaks.
  • MUST exit non-zero (code 2) when any invariant is violated, regardless of latency.

Today's exp3 already counts first-committer-wins conflicts and retries them, proving no-lost-update on a contended counter (over pgwire the conflict surfaces as SQLSTATE 40001); the conformance and group-commit suites prove durability and isolation at the crate level. The new work is to lift those assertions into named, result-checking workload profiles.
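
The first-committer-wins behaviour that exp3 exercises can be modelled in a few lines. The sketch below is a deterministic, single-process simulation and does not talk to the engine: `VersionedCounter`, `Conflict`, and `run_increments` are hypothetical names, and the round-based interleaving is an assumed worst case in which every pending client reads the same snapshot.

```python
class Conflict(Exception):
    """Stands in for a serialization failure (SQLSTATE 40001 over pgwire)."""


class VersionedCounter:
    def __init__(self) -> None:
        self.value = 0
        self.version = 0

    def read(self) -> tuple[int, int]:
        return self.value, self.version

    def commit(self, new_value: int, read_version: int) -> None:
        # First committer wins: a commit based on a stale version is rejected.
        if read_version != self.version:
            raise Conflict()
        self.value = new_value
        self.version += 1


def run_increments(clients: int) -> tuple[int, int]:
    """Each client increments once, retrying on conflict. Returns (value, conflicts)."""
    counter = VersionedCounter()
    pending = list(range(clients))
    conflicts = 0
    while pending:
        # Every pending client reads the same snapshot, then they race to commit.
        snapshots = {c: counter.read() for c in pending}
        retry = []
        for c in pending:
            value, version = snapshots[c]
            try:
                counter.commit(value + 1, version)
            except Conflict:
                conflicts += 1
                retry.append(c)
        pending = retry
    return counter.value, conflicts


final_value, conflicts = run_increments(50)
lost_updates = 50 - final_value  # a non-zero result would map to exit code 2
```

In this model exactly one client wins per round, so 50 clients produce 49 + 48 + ... + 1 conflicts while the final value still equals the number of increments. That is the no-lost-update property a `counter` profile would assert.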

Workload profiles

Profile | Drives | Asserts
bank-transfer | concurrent transfers between two accounts | ACID, balance invariant (sum conserved), transaction correctness
inventory | concurrent stock decrements | no negative inventory, optimistic-lock conflict handling
counter | thousands of concurrent increments | atomicity, zero lost updates (generalizes exp3)
document-editing | concurrent updates to one document | conflict rate, merge/retry latency, retry behaviour
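
For the bank-transfer profile, the balance invariant reduces to a check that the sum over all accounts is the same before and after the run. The following sketch shows that check against an in-memory ledger; the function names are hypothetical, and a real profile would read the before and after balances from the database rather than from a dict.

```python
import random


def apply_transfers(balances: dict[str, int], n: int, seed: int) -> dict[str, int]:
    """Apply n random transfers to a copy of the ledger (stand-in for the workload)."""
    rng = random.Random(seed)
    ledger = dict(balances)
    for _ in range(n):
        src, dst = rng.sample(sorted(ledger), 2)
        amount = rng.randint(1, 100)
        if ledger[src] >= amount:  # skip transfers that would overdraw
            ledger[src] -= amount
            ledger[dst] += amount
    return ledger


def check_balance_invariant(before: dict[str, int], after: dict[str, int]) -> list[str]:
    """Return a list of violations; an empty list means the invariant held."""
    violations = []
    if set(before) != set(after):
        violations.append("account set changed (missing or duplicate rows)")
    total_before, total_after = sum(before.values()), sum(after.values())
    if total_before != total_after:
        violations.append(f"sum not conserved: {total_before} -> {total_after}")
    return violations


before = {"a": 1_000, "b": 1_000}
after = apply_transfers(before, n=500, seed=1)
violations = check_balance_invariant(before, after)

# Simulate a lost update to show the check firing.
corrupted = dict(after)
corrupted["a"] -= 5
corrupted_violations = check_balance_invariant(before, corrupted)
```

Any non-empty violation list would translate to exit code 2, independent of how fast the run was.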

Reporting

Latency breakdown NEEDS DESIGN

Rather than a single total, attribute time across the request path — receive, auth, scheduler, compute spin-up, metadata load, storage read, execution, commit, response. The spin-up / metadata / storage segments require the same controller + storage introspection as the compute/storage metrics above; until those land, the breakdown is partial (client-side + commit only).

Serverless-efficiency report NEEDS DESIGN

Unique to Twill DB: compute-active vs idle time, scale-to-zero count, average worker lifetime, compute-seconds/query, storage-reads/query, average compute utilization. This reframes a run around operational cost, not just latency — the most differentiated piece of the brief, and entirely controller-sourced.
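
The arithmetic behind such a report is simple once the controller exposes worker lifetimes. The sketch below assumes a hypothetical input shape (a list of `(start, end)` worker intervals in seconds within an observation window) and assumed definitions: utilization is the fraction of the window with at least one worker active, and a scale-to-zero event is any point before the end of the window where the active worker count drops to zero.

```python
def merge(intervals: list[tuple[float, float]]) -> list[list[float]]:
    """Union of possibly overlapping intervals, sorted by start time."""
    merged: list[list[float]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def efficiency_report(worker_intervals, window_s, queries, storage_reads):
    # Compute-seconds counts every worker; active time counts wall-clock coverage.
    compute_s = sum(end - start for start, end in worker_intervals)
    busy = merge(worker_intervals)
    active_s = sum(end - start for start, end in busy)
    return {
        "compute_active_s": active_s,
        "compute_idle_s": window_s - active_s,
        "scale_to_zero_events": sum(1 for _, end in busy if end < window_s),
        "avg_worker_lifetime_s": compute_s / len(worker_intervals),
        "compute_seconds_per_query": compute_s / queries,
        "storage_reads_per_query": storage_reads / queries,
        "compute_utilization": active_s / window_s,
    }


report = efficiency_report(
    worker_intervals=[(0, 30), (20, 50), (700, 730)],
    window_s=800,
    queries=200,
    storage_reads=500,
)
```

In this made-up example two workers overlap for ten seconds, so compute-seconds (90) exceeds active wall-clock time (80). Keeping those two quantities separate is what lets the report speak to cost rather than only to latency.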

Release comparison PLANNED

twill bench compare --baseline vX --candidate vY diffs two runs (read/write latency, cold start, memory, throughput) into a PASS/regression verdict. The JSON record already carries the git SHA and all percentiles; comparison is a post-processing layer over two archived records, naturally CI-friendly.
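
Since comparison is post-processing over two archived records, it can be sketched as a pure function. The record field names (`sha`, `latency_us`, `throughput_rps`) and the 5% tolerance below are assumptions for illustration; the actual JSON schema and the regression threshold are not defined on this page.

```python
def compare(baseline: dict, candidate: dict, tolerance: float = 0.05) -> dict:
    """Diff two run records into a PASS/REGRESSION verdict."""
    regressions = []
    # Latency: lower is better, so flag any percentile that grew past the tolerance.
    for name, base_value in baseline["latency_us"].items():
        cand_value = candidate["latency_us"][name]
        if cand_value > base_value * (1 + tolerance):
            regressions.append(f"{name}: {base_value} -> {cand_value} us")
    # Throughput: higher is better.
    if candidate["throughput_rps"] < baseline["throughput_rps"] * (1 - tolerance):
        regressions.append(
            f"throughput: {baseline['throughput_rps']} -> {candidate['throughput_rps']} rps"
        )
    return {
        "baseline": baseline["sha"],
        "candidate": candidate["sha"],
        "verdict": "REGRESSION" if regressions else "PASS",
        "regressions": regressions,
    }


# Made-up records, shaped like the assumed schema.
baseline = {
    "sha": "aaaa111",
    "throughput_rps": 12_000,
    "latency_us": {"p50": 900, "p99": 4_000, "p999": 9_000},
}
candidate = {
    "sha": "bbbb222",
    "throughput_rps": 11_900,
    "latency_us": {"p50": 910, "p99": 5_200, "p999": 9_100},
}
result = compare(baseline, candidate)
```

Comparing percentile by percentile, rather than on one aggregate number, keeps a tail regression (p99 here) from being hidden by a stable median.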

Output formats

  • IMPLEMENTED: terminal summary + one-line JSON record (experiment, transport, backend, SHA, throughput, percentiles).
  • PLANNED: --json as the sole stdout payload for scripting (today the JSON line accompanies the human summary).
  • NEEDS DESIGN: --prometheus export for Grafana/CI. Keep the exporter dependency CLI-only and feature-gated; it must never reach the engine.
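
A one-line JSON record of the kind described above might look like the output of the sketch below. The key names are assumptions derived from the listed contents (experiment, transport, backend, SHA, throughput, percentiles); the real record's schema may differ.

```python
import json


def json_record(experiment: str, transport: str, backend: str, sha: str,
                throughput_rps: float, percentiles_us: dict[str, float]) -> str:
    """Serialize one run as a single JSON line suitable for archiving or CI gating."""
    record = {
        "experiment": experiment,
        "transport": transport,
        "backend": backend,
        "sha": sha,
        "throughput_rps": throughput_rps,
        "latency_us": percentiles_us,
    }
    # Compact separators and sorted keys keep the line stable and diff-friendly.
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


line = json_record(
    experiment="exp2",
    transport="pgwire",
    backend="s3",
    sha="abc1234",  # placeholder value, not a real commit
    throughput_rps=11850.0,
    percentiles_us={"p50": 910.0, "p99": 5200.0, "p999": 9100.0},
)
```

Emitting exactly one line per run means archived results can be concatenated into a JSON Lines file and filtered with ordinary text tools.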

Exit codes

Code | Meaning
0 | success
1 | benchmark failed
2 | correctness validation failed
3 | configuration error
4 | connection error

Today the binary uses 2 for a usage/parse error and 1 for a run failure. Adopting the brief's table (notably 2 = correctness failure, 3 = config) is a small, behaviour-defining change worth pinning in a test.
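
One way to pin the brief's table in a test is to route every outcome through a single function. The sketch below mirrors the table; the enum member names and the precedence order are assumptions, except that a correctness violation outranks a plain run failure, which follows from the rule that invariant violations exit with code 2 regardless of latency.

```python
from enum import IntEnum


class ExitCode(IntEnum):
    SUCCESS = 0
    BENCHMARK_FAILED = 1
    CORRECTNESS_FAILED = 2
    CONFIG_ERROR = 3
    CONNECTION_ERROR = 4


def exit_code(config_ok: bool, connected: bool, run_ok: bool, invariants_ok: bool) -> ExitCode:
    """Map a run outcome to a process exit code (assumed precedence, top to bottom)."""
    if not config_ok:
        return ExitCode.CONFIG_ERROR
    if not connected:
        return ExitCode.CONNECTION_ERROR
    if not invariants_ok:
        # Correctness outranks a failed or slow run.
        return ExitCode.CORRECTNESS_FAILED
    if not run_ok:
        return ExitCode.BENCHMARK_FAILED
    return ExitCode.SUCCESS


code = exit_code(config_ok=True, connected=True, run_ok=False, invariants_ok=False)
```

Because `ExitCode` is an `IntEnum`, the result can be passed directly to `sys.exit`.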

Verification summary — brief vs. what ships

Where each brief capability stands against the current crates/bench driver and the project's architecture rules.

Capability | Status | Notes
p50/p99/p999 via HDR histogram, JSON record, git SHA | IMPLEMENTED | exp1/exp2/exp3, embedded + pgwire, file:// + s3://
Group-commit + contention experiments | IMPLEMENTED | maps to spec 09 Exp 1–3; conflict retry proven
Crash-safety gate | IMPLEMENTED | spec 09 Exp 4, in crates/storage (CI gate), not yet a bench subcommand
Request-mix scenarios (read/write/mixed) | PLANNED | needs SELECT/UPDATE/DELETE in the driver loop + ratio control
Correctness workload profiles | PLANNED | generalize exp3; assert invariants, exit code 2 on violation
Release comparison | PLANNED | post-process two archived JSON records
Cold-read / scale-to-zero scenario (Exp 5) | NEEDS DESIGN | spec 09 SHOULD; needs controller in the loop
Compute / scheduler / serverless-efficiency metrics | NEEDS DESIGN | controller-sourced; overlaps a future OTLP export; bench consumes, engine never generates
Storage counters, latency breakdown | NEEDS DESIGN | read-only introspection surface that respects the storage seam
YAML custom profiles, Prometheus export | NEEDS DESIGN | CLI-only, feature-gated deps; keep out of engine core

Constraints & open questions

  • MUST NOT let any reporting feature (YAML, Prometheus, OTLP) add a dependency to crates/engine or move a backend concept into the Storage trait — these live in the bench/server/controller crates only.
  • MUST NOT generate lifecycle/compute metrics inside the thread-free engine; consume them from twill-controller and the server.
  • SHOULD resolve: do the new compute/efficiency metrics share one signal vocabulary with a future observability/OTLP exporter? They should — settle the metric set in the bench first, then emit the same set live.
  • SHOULD resolve: what is the safe-against-production guard? Read-only scenarios run anywhere; any mutating scenario requires an explicit opt-in flag and a non-production URL by default.
  • MAY defer (future enhancements per the brief): multi-region runs, fault injection, network-latency simulation, storage-backend comparison, flamegraphs, continuous mode, a historical benchmark DB, HTML reports, and an interactive TUI.
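
The production-guard question above is still open, but one possible shape is a pre-flight check that returns a refusal reason. Everything in this sketch is hypothetical: the `allow_mutation` opt-in, the idea of a configured set of production hosts, and the example host names. It only illustrates the stated intent that read-only scenarios run anywhere while mutating scenarios need an explicit opt-in and a non-production URL.

```python
from typing import Optional
from urllib.parse import urlparse


def guard(mutates: bool, url: str, allow_mutation: bool,
          production_hosts: frozenset[str]) -> Optional[str]:
    """Return a refusal reason, or None when the run may proceed."""
    if not mutates:
        return None  # read-only scenarios may run anywhere
    if not allow_mutation:
        return "mutating scenario requires an explicit opt-in"
    host = urlparse(url).hostname
    if host in production_hosts:
        return f"refusing to mutate production host {host}"
    return None


PROD = frozenset({"db.prod.example.com"})  # hypothetical production host list

refused_no_opt_in = guard(True, "postgres://db.staging.example.com:5432/app", False, PROD)
refused_prod = guard(True, "postgres://db.prod.example.com:5432/app", True, PROD)
allowed = guard(True, "postgres://db.staging.example.com:5432/app", True, PROD)
read_only = guard(False, "postgres://db.prod.example.com:5432/app", False, PROD)
```

A refusal at this stage would most naturally map to the configuration-error exit code, though that mapping is also an open design choice.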

Related specifications

Serverless OLTP Engine — internal development specification. Draft, 2026-06-21. · Author