Twill Bench CLI
The official benchmarking, correctness, and serverless-efficiency CLI for Twill DB — a single driver that knows the engine's internals, so it measures not just request latency but commit durability, lifecycle behaviour, and data correctness under load. This page captures the product vision and verifies it against what ships today, separating what already exists from what is aligned-but-unbuilt and what would strain the architecture's rules.
Purpose & relationship to the validation plan
Unlike a generic load generator (k6, Locust, Vegeta), Twill Bench understands the engine: it can attribute latency to the commit round-trip, watch a database scale to zero and back, and assert ACID invariants over the result set. It exists to answer three questions on every run — is the database fast, is the data still correct under stress, and how efficiently did the serverless architecture use compute and storage?
Spec 15 is the CLI; spec 09 is the falsifiable plan it operationalizes
The Benchmark & Validation Plan defines the five adversarial experiments (latency floor, group-commit curve, contention wall, crash safety, cold read), the measurement methodology (percentiles via HDR histogram, never means), and the ship/route/block decision rule. This page does not replace those — it wraps them in an ergonomic CLI and adds scenario-oriented workloads, correctness profiles, and reporting on top. Where the two meet, spec 09 is authoritative on methodology; any number Twill Bench reports for a placement decision MUST still satisfy 09 (real object store for W1, the Experiment-4 durability gate before real data).
Goals & design principles
- MUST capture all latency as a distribution (p50/p90/p95/p99/p999 via an HDR-style histogram); mean-only reporting is prohibited (inherited from spec 09).
- MUST validate data correctness alongside performance — a fast run that loses an acked write or an update is a failure, not a fast success.
- MUST emit both human-friendly terminal output and a machine-readable JSON record for archiving, plotting, and CI gating.
- SHOULD be zero-config for the common scenarios and reproducible (pinned engine + storage SHA, region, instance type, seed — per spec 09).
- SHOULD be safe to run against production for read-only scenarios, and refuse to mutate unless explicitly opted in.
- SHOULD be extensible via benchmark profiles without recompiling the driver.
- MUST NOT pull heavyweight or non-auditable dependencies into the engine core to satisfy a reporting feature; CLI-only concerns (YAML profiles, Prometheus export) stay in the twill-bench crate, feature-gated, and never cross the storage seam or the thread-free engine boundary.
Command structure
The brief proposes a twill bench <scenario> [options] surface. Today the binary is invoked as twill-bench <expN> --url … [flags]. Both are sub-command driven; the work below is to grow the scenario vocabulary and a profile loader, not to re-architect the entry point.
twill bench read-heavy
twill bench write-heavy
twill bench burst
twill bench scale-to-zero
twill bench bank-transfer # correctness workload profile
twill bench compare --baseline v0.5.0 --candidate v0.5.1
twill bench custom --profile workload.yaml
Benchmark scenarios
Scenarios are named workload shapes. The three implemented experiments map onto the request-mix scenarios; the lifecycle scenarios (burst, scale-to-zero, long-run) are new and depend on the controller being in the loop.
| Scenario | Shape | Verification status |
|---|---|---|
| read-heavy | 90% SELECT / 10% INSERT (analytical-leaning mix) | PLANNED: mix engine exists (writer loop); needs read path + ratio control |
| write-heavy | 20% SELECT / 80% INSERT (ingestion) | PLANNED: close to today's exp2 independent-row insert loop |
| mixed-oltp | 70% SELECT / 20% INSERT / 8% UPDATE / 2% DELETE (SaaS) | PLANNED: needs UPDATE/DELETE/SELECT in the driver loop |
| burst | idle → 500 → 5k → 20k rps → idle, repeat; measures cold/warm starts + scaling latency | NEEDS DESIGN: requires the controller's lifecycle signals, not just the engine |
| scale-to-zero | query → idle 10m → query; measures cold boot, compute reuse, cache restore | NEEDS DESIGN: this is spec 09 Exp 5 (cold read), a SHOULD that is not yet a subcommand |
| long-run | hours/days; detects memory/resource/connection leaks, scheduler drift | NEEDS DESIGN: soak harness + resource sampling |
| custom | user-supplied YAML profile (duration, connections, mix) | NEEDS DESIGN: profile loader; keep the YAML dep CLI-only + feature-gated |
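The request-mix rows above reduce to a weighted choice per request. The following is a minimal sketch of that ratio control, not the driver's actual code: `Op`, `Mix`, `SeededRng`, and `mixed_oltp` are hypothetical names, and the small SplitMix64-style generator is an assumption standing in for whatever seeded source the driver uses to keep runs reproducible.

```rust
/// Statement kinds a request-mix scenario can issue (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Select,
    Insert,
    Update,
    Delete,
}

/// A request mix as (operation, weight in percent) pairs.
pub struct Mix {
    weights: Vec<(Op, u32)>,
}

impl Mix {
    /// Builds a mix, rejecting weights that do not sum to exactly 100.
    pub fn new(weights: Vec<(Op, u32)>) -> Result<Self, String> {
        let total: u32 = weights.iter().map(|(_, w)| *w).sum();
        if total != 100 {
            return Err(format!("mix weights sum to {total}, expected 100"));
        }
        Ok(Mix { weights })
    }

    /// Maps a roll in 0..100 to an operation by walking cumulative weights.
    pub fn pick(&self, roll: u32) -> Op {
        let mut upper = 0u32;
        for (op, weight) in &self.weights {
            upper += *weight;
            if roll < upper {
                return *op;
            }
        }
        // Only reached when roll >= 100; fall back to the last entry.
        self.weights[self.weights.len() - 1].0
    }
}

/// Deterministic SplitMix64-style generator so a seed reproduces a run.
pub struct SeededRng(u64);

impl SeededRng {
    pub fn new(seed: u64) -> Self {
        SeededRng(seed)
    }

    /// Returns a value in 0..100 suitable for `Mix::pick`.
    pub fn next_percent(&mut self) -> u32 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^= z >> 31;
        (z % 100) as u32
    }
}

/// The mixed-oltp shape from the table: 70/20/8/2.
pub fn mixed_oltp() -> Mix {
    Mix::new(vec![
        (Op::Select, 70),
        (Op::Insert, 20),
        (Op::Update, 8),
        (Op::Delete, 2),
    ])
    .expect("weights sum to 100")
}
```

A driver loop under these assumptions would call `mix.pick(rng.next_percent())` once per request. Rejecting mixes that do not sum to 100 at construction turns a mistyped profile into a configuration error rather than a silently skewed run.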
Lifecycle scenarios need the controller, not the embedded engine
Burst, scale-to-zero, and long-run measure cold start, worker reuse, and scheduling — signals the embedded library deliberately does not own (the engine core is thread-free; lifecycle lives in twill-controller). These scenarios are meaningful over the server / pgwire path against a controller-driven deployment, and degrade to a single warm process when run purely embedded.
Metrics
The brief groups metrics into five families. The query family ships today; the compute and scheduler families are the new, lifecycle-dependent surface and must be sourced from the controller/server, never invented inside the engine.
Query metrics IMPLEMENTED
Total / successful / failed requests, requests/sec, and the latency distribution (median, p90, p95, p99, max). Today's driver records per-commit latency into an HDR-style histogram and prints p50/p99/p999; widening to p90/p95 and a success/failure split is additive.
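To make the percentile set concrete, here is a minimal sketch of nearest-rank percentiles over recorded latencies. It is not the driver's HDR-style histogram: it keeps every sample exactly, which is simpler but uses more memory. `LatencySamples` and `per_mille` are hypothetical names; the quantile is expressed in parts per thousand (p50 = 500, p999 = 999) so the rank arithmetic stays in integers.

```rust
/// Latency samples in microseconds, kept exactly and sorted.
/// A stand-in for a bucketed HDR-style histogram (assumption).
pub struct LatencySamples {
    sorted_us: Vec<u64>,
}

impl LatencySamples {
    /// Takes ownership of the samples and sorts them once.
    pub fn from_unsorted(mut samples_us: Vec<u64>) -> Self {
        samples_us.sort_unstable();
        LatencySamples { sorted_us: samples_us }
    }

    /// Nearest-rank percentile, with the quantile given in parts per
    /// thousand (500 = p50, 990 = p99, 999 = p999).
    /// Returns None for an empty sample set or a quantile above 1000.
    pub fn per_mille(&self, q: u32) -> Option<u64> {
        let n = self.sorted_us.len();
        if n == 0 || q > 1000 {
            return None;
        }
        // rank = ceil(q * n / 1000), computed without floating point.
        let rank = (q as usize * n + 999) / 1000;
        let idx = rank.clamp(1, n) - 1;
        Some(self.sorted_us[idx])
    }
}
```

Under these assumptions, widening the printed set to p90/p95 is two more calls (`per_mille(900)`, `per_mille(950)`) on the same recorded data.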
Storage metrics PARTIAL
Commit latency is measured today (it is the experiment). Snapshot/WAL read & write counts, storage-fetch latency, and cache hit/miss ratio require the engine/storage to expose counters through a read-only stats surface — additive, but a new seam-respecting introspection path (no backend internals leak into the Storage trait).
Compute & scheduler metrics NEEDS DESIGN
Cold/warm starts, average start time, worker-reuse rate, compute active/idle duration, scale-to-zero events, peak workers; queue depth, scheduling delay, allocation/placement time. These come from twill-controller and the server, and overlap directly with a future observability/OTLP export. Twill Bench consumes them; it does not generate them.
Network metrics NEEDS DESIGN
Client RTT, TLS handshake, request/response transfer — meaningful only on the pgwire transport; measured client-side in the bench driver.
Data-correctness validation
Performance is only valid if correctness is preserved. Twill Bench asserts, over the workload it just drove, that none of the classic anomalies occurred.
- MUST detect: missing rows, duplicate rows, lost updates, dirty reads, non-repeatable reads, phantom reads, serializability violations, and transaction-consistency breaks.
- MUST exit non-zero (code 2) when any invariant is violated, regardless of latency.
Today's exp3 already counts first-committer-wins conflicts and retries them, proving no-lost-update on a contended counter (over pgwire the conflict surfaces as SQLSTATE 40001); the conformance and group-commit suites prove durability and isolation at the crate level. The new work is to lift those assertions into named, result-checking workload profiles.
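The conflict-and-retry behaviour described above can be sketched as a small loop that counts conflicts and retries only on a serialization failure. This is an illustration under assumptions, not exp3's code: `CommitError`, `RetryStats`, `classify`, and `commit_with_retry` are hypothetical names, and the attempt itself is passed in as a closure so the sketch stays independent of any transport.

```rust
/// SQLSTATE for a serialization failure, as surfaced over pgwire.
pub const SERIALIZATION_FAILURE: &str = "40001";

/// Outcome of a failed commit attempt as the driver sees it.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CommitError {
    /// First-committer-wins conflict: safe to retry.
    Conflict,
    /// Anything else: surfaced to the caller, never retried.
    Fatal(String),
}

/// Classifies an error by SQLSTATE; only 40001 counts as a conflict.
pub fn classify(sqlstate: &str, message: &str) -> CommitError {
    if sqlstate == SERIALIZATION_FAILURE {
        CommitError::Conflict
    } else {
        CommitError::Fatal(format!("{sqlstate}: {message}"))
    }
}

/// Counters the run would report alongside latency.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct RetryStats {
    pub commits: u64,
    pub conflicts: u64,
}

/// Runs `attempt` until it commits, hits a fatal error, or uses up
/// `max_attempts`. Every conflict is counted, including the last one.
pub fn commit_with_retry<F>(
    mut attempt: F,
    max_attempts: u32,
    stats: &mut RetryStats,
) -> Result<(), CommitError>
where
    F: FnMut() -> Result<(), CommitError>,
{
    for _ in 0..max_attempts {
        match attempt() {
            Ok(()) => {
                stats.commits += 1;
                return Ok(());
            }
            Err(CommitError::Conflict) => stats.conflicts += 1,
            Err(fatal) => return Err(fatal),
        }
    }
    Err(CommitError::Conflict)
}
```

Keeping the conflict count separate from the commit count lets a profile report conflict rate without treating a retried transaction as a failure.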
Workload profiles
| Profile | Drives | Asserts |
|---|---|---|
| bank-transfer | concurrent transfers between two accounts | ACID, balance invariant (sum conserved), transaction correctness |
| inventory | concurrent stock decrements | no negative inventory, optimistic-lock conflict handling |
| counter | thousands of concurrent increments | atomicity, zero lost updates (generalizes exp3) |
| document-editing | concurrent updates to one document | conflict rate, merge/retry latency, retry behaviour |
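As an illustration of what "asserts" could mean for the bank-transfer profile, the sketch below checks the rows read back after a run for missing accounts, duplicate accounts, and a non-conserved total. The names `Violation` and `check_bank_transfer`, and the `(account id, balance)` row shape, are assumptions made for this example; isolation anomalies such as dirty or phantom reads need in-flight checks and are out of its scope.

```rust
use std::collections::BTreeSet;

/// Invariant breaks the checker can report (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    MissingRow { account: u32 },
    DuplicateRow { account: u32 },
    SumNotConserved { expected: i64, actual: i64 },
}

/// Checks the final account rows against the state the run started from.
/// `rows` holds (account id, balance) pairs as read back after the run.
pub fn check_bank_transfer(
    expected_accounts: &[u32],
    initial_total: i64,
    rows: &[(u32, i64)],
) -> Vec<Violation> {
    let mut violations = Vec::new();
    let mut seen = BTreeSet::new();

    // An account id appearing more than once is a duplicate row.
    for (account, _) in rows {
        if !seen.insert(*account) {
            violations.push(Violation::DuplicateRow { account: *account });
        }
    }

    // Every account that existed before the run must still exist.
    for account in expected_accounts {
        if !seen.contains(account) {
            violations.push(Violation::MissingRow { account: *account });
        }
    }

    // Transfers only move money, so the total must be unchanged.
    let actual: i64 = rows.iter().map(|(_, balance)| *balance).sum();
    if actual != initial_total {
        violations.push(Violation::SumNotConserved {
            expected: initial_total,
            actual,
        });
    }
    violations
}
```

A non-empty result is what would map to exit code 2 under the proposed table, however fast the run was.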
Reporting
Latency breakdown NEEDS DESIGN
Rather than a single total, attribute time across the request path — receive, auth, scheduler, compute spin-up, metadata load, storage read, execution, commit, response. The spin-up / metadata / storage segments require the same controller + storage introspection as the compute/storage metrics above; until those land, the breakdown is partial (client-side + commit only).
Serverless-efficiency report NEEDS DESIGN
Unique to Twill DB: compute-active vs idle time, scale-to-zero count, average worker lifetime, compute-seconds/query, storage-reads/query, average compute utilization. This reframes a run around operational cost, not just latency — the most differentiated piece of the brief, and entirely controller-sourced.
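The ratios in this report are simple arithmetic once the controller supplies the totals. The sketch below shows one possible set of definitions and is an assumption throughout: the field names are hypothetical, and "utilization" is taken here to mean active time divided by active plus idle time, which the brief does not pin down.

```rust
/// Totals for one run, as they might arrive from the controller
/// (hypothetical shape; the bench only consumes these values).
pub struct RunTotals {
    pub compute_active_secs: f64,
    pub compute_idle_secs: f64,
    pub queries: u64,
    pub storage_reads: u64,
}

/// Derived per-query cost figures.
#[derive(Debug, PartialEq)]
pub struct Efficiency {
    pub compute_secs_per_query: f64,
    pub storage_reads_per_query: f64,
    /// Active time as a fraction of active + idle time (assumed definition).
    pub utilization: f64,
}

/// Returns None when there is nothing to divide by: no queries, or no
/// positive compute time recorded.
pub fn efficiency(t: &RunTotals) -> Option<Efficiency> {
    let observed = t.compute_active_secs + t.compute_idle_secs;
    if t.queries == 0 || !(observed > 0.0) {
        return None;
    }
    let queries = t.queries as f64;
    Some(Efficiency {
        compute_secs_per_query: t.compute_active_secs / queries,
        storage_reads_per_query: t.storage_reads as f64 / queries,
        utilization: t.compute_active_secs / observed,
    })
}
```

For example, 30 s active and 90 s idle across 60 queries with 120 storage reads gives 0.5 compute-seconds/query, 2 storage-reads/query, and 25% utilization.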
Release comparison PLANNED
twill bench compare --baseline vX --candidate vY diffs two runs (read/write latency, cold start, memory, throughput) into a PASS/regression verdict. The JSON record already carries the git SHA and all percentiles; comparison is a post-processing layer over two archived records, naturally CI-friendly.
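As a sketch of that post-processing layer, the comparison can be a pure function over two already-parsed records. Everything here is an assumption for illustration: the `RunRecord` fields are a hypothetical subset of the archived JSON record, and the single percentage tolerance is an invented policy, not a rule from the brief or spec 09.

```rust
/// The subset of an archived run record the comparison needs
/// (hypothetical field names).
pub struct RunRecord {
    pub git_sha: String,
    pub p50_us: u64,
    pub p99_us: u64,
    pub throughput_rps: f64,
}

/// PASS, or the names of the metrics that regressed.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Pass,
    Regression(Vec<&'static str>),
}

/// True when the candidate latency exceeds baseline by more than
/// `tolerance_pct` percent. Integer arithmetic avoids rounding surprises.
fn latency_regressed(baseline: u64, candidate: u64, tolerance_pct: u64) -> bool {
    (candidate as u128) * 100 > (baseline as u128) * (100 + tolerance_pct as u128)
}

/// Compares two runs: higher latency or lower throughput beyond the
/// tolerance counts as a regression.
pub fn compare(baseline: &RunRecord, candidate: &RunRecord, tolerance_pct: u64) -> Verdict {
    let mut regressed = Vec::new();
    if latency_regressed(baseline.p50_us, candidate.p50_us, tolerance_pct) {
        regressed.push("p50_us");
    }
    if latency_regressed(baseline.p99_us, candidate.p99_us, tolerance_pct) {
        regressed.push("p99_us");
    }
    let floor = baseline.throughput_rps * (1.0 - tolerance_pct as f64 / 100.0);
    if candidate.throughput_rps < floor {
        regressed.push("throughput_rps");
    }
    if regressed.is_empty() {
        Verdict::Pass
    } else {
        Verdict::Regression(regressed)
    }
}
```

Because the function takes only records, it can run in CI against archived JSON without a database or the engine present.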
Output formats
- IMPLEMENTED: terminal summary + one-line JSON record (experiment, transport, backend, SHA, throughput, percentiles).
- PLANNED: --json as the sole stdout payload for scripting (today the JSON line accompanies the human summary).
- NEEDS DESIGN: --prometheus export for Grafana/CI. Keep the exporter dependency CLI-only and feature-gated; it must never reach the engine.
Exit codes
| Code | Meaning |
|---|---|
| 0 | success |
| 1 | benchmark failed |
| 2 | correctness validation failed |
| 3 | configuration error |
| 4 | connection error |
Today the binary uses 2 for a usage/parse error and 1 for a run failure. Adopting the brief's table (notably 2 = correctness failure, 3 = config) is a small, behaviour-defining change worth pinning in a test.
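One way to pin the proposed table is to route every exit through a single enum so a test can assert each code. The sketch below is a suggestion rather than the binary's current structure: `Outcome` and `run_outcome` are hypothetical names, and giving correctness failures precedence over a plain run failure is an assumption drawn from the "regardless of latency" rule.

```rust
/// Terminal outcomes of a bench invocation, one per row of the table.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Success,
    BenchmarkFailed,
    CorrectnessFailed,
    ConfigError,
    ConnectionError,
}

impl Outcome {
    /// The process exit code for this outcome, per the proposed table.
    pub fn exit_code(self) -> i32 {
        match self {
            Outcome::Success => 0,
            Outcome::BenchmarkFailed => 1,
            Outcome::CorrectnessFailed => 2,
            Outcome::ConfigError => 3,
            Outcome::ConnectionError => 4,
        }
    }
}

/// Decides the outcome of a completed run. A correctness violation wins
/// over a plain run failure (assumed precedence).
pub fn run_outcome(invariant_violations: usize, run_failed: bool) -> Outcome {
    if invariant_violations > 0 {
        Outcome::CorrectnessFailed
    } else if run_failed {
        Outcome::BenchmarkFailed
    } else {
        Outcome::Success
    }
}
```

The binary's entry point would then end with `std::process::exit(outcome.exit_code())`, and a unit test over `exit_code` guards the mapping against accidental renumbering.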
Verification summary — brief vs. what ships
Where each brief capability stands against the current crates/bench driver and the project's architecture rules.
| Capability | Status | Notes |
|---|---|---|
| p50/p99/p999 via HDR histogram, JSON record, git SHA | IMPLEMENTED | exp1/exp2/exp3, embedded + pgwire, file:// + s3:// |
| Group-commit + contention experiments | IMPLEMENTED | maps to spec 09 Exp 1–3; conflict retry proven |
| Crash-safety gate | IMPLEMENTED | spec 09 Exp 4, in crates/storage (CI gate), not yet a bench subcommand |
| Request-mix scenarios (read/write/mixed) | PLANNED | needs SELECT/UPDATE/DELETE in the driver loop + ratio control |
| Correctness workload profiles | PLANNED | generalize exp3; assert invariants, exit code 2 on violation |
| Release comparison | PLANNED | post-process two archived JSON records |
| Cold-read / scale-to-zero scenario (Exp 5) | NEEDS DESIGN | spec 09 SHOULD; needs controller in the loop |
| Compute / scheduler / serverless-efficiency metrics | NEEDS DESIGN | controller-sourced; overlaps a future OTLP export — bench consumes, engine never generates |
| Storage counters, latency breakdown | NEEDS DESIGN | read-only introspection surface that respects the storage seam |
| YAML custom profiles, Prometheus export | NEEDS DESIGN | CLI-only, feature-gated deps; keep out of engine core |
Constraints & open questions
- MUST NOT let any reporting feature (YAML, Prometheus, OTLP) add a dependency to crates/engine or move a backend concept into the Storage trait; these live in the bench/server/controller crates only.
- MUST NOT generate lifecycle/compute metrics inside the thread-free engine; consume them from twill-controller and the server.
- SHOULD resolve: do the new compute/efficiency metrics share one signal vocabulary with a future observability/OTLP exporter? They should: settle the metric set in the bench first, then emit the same set live.
- SHOULD resolve: what is the safe-against-production guard? Read-only scenarios run anywhere; any mutating scenario requires an explicit opt-in flag and a non-production URL by default.
- MAY defer (future enhancements per the brief): multi-region runs, fault injection, network-latency simulation, storage-backend comparison, flamegraphs, continuous mode, a historical benchmark DB, HTML reports, and an interactive TUI.
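For the safe-against-production question in the list above, one possible shape of the guard is sketched here. It is an assumption, not a settled design: `GuardError` and `check_guard` are hypothetical names, how a target is recognized as production is left to the caller, and any override of the "non-production by default" rule is deliberately not modeled.

```rust
/// Reasons the guard refuses to start a scenario (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum GuardError {
    /// The scenario writes, but the explicit opt-in flag was not given.
    MutationNotOptedIn,
    /// The scenario writes and the target looks like production.
    ProductionTarget,
}

/// Decides whether a scenario may run against a target.
/// Read-only scenarios run anywhere; mutating ones need the opt-in
/// and a non-production target.
pub fn check_guard(
    scenario_mutates: bool,
    allow_mutations: bool,
    target_is_production: bool,
) -> Result<(), GuardError> {
    if !scenario_mutates {
        return Ok(());
    }
    if !allow_mutations {
        return Err(GuardError::MutationNotOptedIn);
    }
    if target_is_production {
        return Err(GuardError::ProductionTarget);
    }
    Ok(())
}
```

Either refusal would naturally surface as a configuration error (exit code 3 in the proposed table) before any connection is opened.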