Hot-Row Contention Strategy
Concurrent writes to the same row must be serialized — that is correct database semantics, not a defect — so this spec separates the irreducible cost (serialization) from the manufactured cost (paying a network handoff for conflicts coarse locking invented), and lays out a layered, correct-but-slow-by-default policy that ships safe under any developer SQL and escalates only the rare irreducibly-contended outlier to coupled Postgres.
Purpose & scope
This page is the engineering policy for hot-row write contention: the case where many concurrent transactions update the same logical value stored in one row — a counter, an inventory level, a balance. It expands §9 of the design note into a buildable strategy: what the engine guarantees by default, what guidance and helpers it ships, what gets flagged, and what gets routed off the disaggregated engine entirely.
The scope is deliberately narrow. Contention is a correctness-vs-throughput property of a workload, not a defect of the storage architecture. The disaggregated engine's two weak axes are characterised in Benchmark & Validation Plan as W1 — commit latency (each commit is a network/S3 round-trip, not a local fsync) and W2 — write serialization (single writer per database). Hot-row contention is the place where W1 and W2 compound, so it is the page where the platform commits to a position.
Note
The headline: correctness is handled universally (the engine serializes same-row writers like every serious database does), performance is handled per-tool (shard where we control the SQL; route the irreducible outlier to Postgres). No workload is ever at risk of incorrectness — only the rare contended outlier is at risk of being slow, and that one has a named escape hatch.
Premise — the correct semantics
Concurrent writes to the same row MUST be serialized. This is correct behaviour, not a limitation. Every serious database serializes contended writers — Postgres takes a row lock and resolves visibility through MVCC; none of them parallelizes two writes to the same value, because the problem is inherently sequential: each write must see the prior committed value before it can compute its own result.
correct (what every DB does) impossible (what no DB does)
W1 ─ read n=5 ─ write n=6 ─ commit W1 ─ read n=5 ─ write n=6 ─┐
│ W2 ─ read n=5 ─ write n=6 ─┘ ← both read 5,
W2 ──────────── read n=6 ─ write n=7 lost update: n=6, not 7
│
serialized: "parallel" same-row writes
n ends at 7 silently corrupt the value
Same-row writes have a read-modify-write dependency. Serializing them is the only correct outcome; "parallelizing" them is the lost-update bug.
The only two engineering variables are therefore:
- lock granularity
- How much else gets blocked while one writer holds the row. Per-row blocks only same-row writers; per-database blocks every writer in the DB. The engine's single-writer-per-DB model (see Engine Core) is per-DB granularity.
- handoff cost
- What it costs to pass the right-to-write from one serialized writer to the next. Local Postgres pays ~one fsync (~100µs); the disaggregated engine pays a durable network round-trip + ack (~10ms). This is W1 made concrete.
Everything downstream is a consequence of moving these two dials. Nothing in this spec attempts to make same-row writes parallel — that would be a correctness bug, not an optimization.
Where it happens — the hot-row catalog
A hot row is almost always a shared number that must stay correct under concurrent updates. The recurring patterns:
| Pattern | The shared value | Why every write must see the previous |
|---|---|---|
| View / like / play counter | one count on a popular item | n = n + 1 — increment is read-modify-write |
| Inventory / seat / ticket decrement | units remaining | MUST NOT oversell; the check-and-decrement must be atomic against the prior value |
| Rate-limit / quota counter | requests used per user/window | the limit decision depends on the running total |
| Shared account balance | one money figure | debit/credit must apply to the committed balance, never a stale one |
Hot status / updated_at on a parent | a single field many children touch | last-writer-wins still serializes; many children funnel onto one row |
Signature: one row that is semantically a single shared value, where correctness requires every write to see the previous one.
Contrast — the case that parallelizes freely: many writers each inserting a different row (logs, events, append-only orders) have no read-modify-write dependency between them. They can commit in any order. They are the green case.
Contention, not volume, is the problem
A million inserts a second into a million distinct rows is not a hot-row problem — it is a throughput problem solved by lanes (W2 / sharding) and batching (W1). A thousand updates a second to one row is a hot-row problem no matter how low the absolute volume. Diagnose by dependency between writes, never by request rate.
Per-row vs per-database lock cost over the network
The cost model is the load-bearing argument of this page. It reproduces the §9 comparison: a local per-row lock (Postgres), a hypothetical per-row lock over the network, and this engine's naive per-DB lock over the network.
| Axis | Per-row, local (Postgres) | Per-row, networked | Per-DB, networked (this engine, naive) |
|---|---|---|---|
| What blocks | same-row writers only | same-row writers only | every writer in the DB |
| Handoff cost | ~fsync, ~100µs | round-trip + ack, ~10ms | round-trip + ack, ~10ms |
| Same-row contenders | serialize, drain fast | slow but bounded | same as per-row (contenders overlap anyway) |
| Different-row writers | full parallel | full parallel | all serialize — catastrophic |
Two facts follow directly from the table:
- The network makes every handoff ~100× costlier (10ms vs 100µs). This is the fixed cost of a durable-over-network commit — W1. It applies to every serialized handoff, contended or not.
- Per-DB granularity multiplies how often you pay it. It forces serialization on writes that have no real conflict, turning free parallel writes into expensive queued ones. Coarse locking manufactures contention out of independent writes.
The expensive part is the manufactured contention, not the lock
Look at the columns: same-row contention costs effectively the same under per-row or per-DB granularity — the contenders overlap on one row regardless. The granularity gap only bites on different-row writes (the common case), where per-row stays fully parallel and naive per-DB forces a 10ms serialized handoff between writes that never conflicted. The cost you most want to avoid is paying the expensive network handoff for contention you manufactured by locking too coarsely.
Many-small-DBs recovers per-row-like granularity
The fix for manufactured contention on a per-DB-lock engine is not a finer lock inside one database — it is more databases. If each tool, tenant, or independent write stream is its own database, "per-DB lock" stops meaning "everything waits" and starts meaning "only related writes wait."
ONE big DB, per-DB lock MANY small DBs, per-DB lock each
(manufactured contention) (granularity recovered)
W_a (row A) ─┐ DB1: W_a (row A) ── own lane ──►
W_b (row B) ─┤ all queue on DB2: W_b (row B) ── own lane ──►
W_c (row C) ─┤ ONE per-DB lock DB3: W_c (row C) ── own lane ──►
W_d (row D) ─┘ even though no ↑ no shared lock between
two conflict unrelated streams = parallel
Database boundaries substitute for row-level lock granularity. Splitting unrelated write streams into their own DBs converts a single serialized queue into N independent lanes (W2).
Database boundaries substitute for row-level lock granularity. This is the deep reason the architecture leans on sharding into many small databases — not merely billing or operational isolation, but making the coarse networked lock affordable. The single-writer-per-DB ceiling stops being a throughput cap the moment "per DB" maps to "per independent write stream." This is exactly the W2 lever validated by Experiment 3's "N separate databases scale linearly" check in Benchmark & Validation Plan.
What sharding does and does not solve
Sharding recovers granularity for writes to different rows by giving them different lanes (W2). It does nothing for two writes that genuinely target the same row in the same DB — those still serialize, correctly. Sharding moves the per-DB lock closer to a per-row lock; it cannot abolish the read-modify-write dependency on a truly shared value.
Recommendation — layered policy
The platform cannot control how each tool writes SQL. A developer will write a naive UPDATE counter SET n = n + 1 and the engine must do something correct with it. The policy is therefore correct-but-slow by default, with escalation for flagged outliers, in three layers of decreasing automaticity.
Layer 1 — Option 2 (route-to-Postgres) is the mandatory safety net
This layer is non-optional and requires nothing from the developer. The disaggregated engine already serializes the hot-row case correctly through its single lane — correct, just slow. A naive increment never breaks; it merely runs slowly on the one tool that contends. Experiment 3 (the red-quadrant detector in Benchmark & Validation Plan) flags that tool, and it moves to coupled Postgres. Correctness is never at risk; only performance, only for the outlier.
Why this is the floor
Because correctness is guaranteed by the engine's serialization regardless of SQL quality, the worst outcome for any tool is "slow," never "wrong." That property — safe under arbitrary developer SQL — is what lets the platform ship Option 1 as guidance rather than enforcement.
Layer 2 — Option 1 (avoid the hot row) is offered as guidance, not enforced
Ship the patterns and helpers; use them where the platform controls the tool's schema; never depend on every developer applying them. Both helpers convert "same row" back into "different rows" — moving the workload from the red quadrant into the green one.
Helper A — sharded counter
Split one counter into N sub-counter rows. Each writer increments a random shard (different rows = parallel); reads sum across shards.
-- N sub-counters instead of one hot row
CREATE TABLE counter_shards (
counter_id text NOT NULL,
shard int NOT NULL, -- 0 .. N-1
n bigint NOT NULL DEFAULT 0,
PRIMARY KEY (counter_id, shard)
);
-- WRITE: pick a shard at random -> spreads writers across N different rows
UPDATE counter_shards
SET n = n + 1
WHERE counter_id = 'video:42'
AND shard = floor(random() * 16); -- N = 16
-- READ: aggregate the shards into the logical total
SELECT sum(n) AS total
FROM counter_shards
WHERE counter_id = 'video:42';
Helper B — event log + aggregate on read
Never update the count at all. Append one immutable event per increment (different rows, always parallel) and compute the total on read.
-- Append-only: every write is a NEW row, zero contention
CREATE TABLE counter_events (
id bigserial PRIMARY KEY,
counter_id text NOT NULL,
delta int NOT NULL DEFAULT 1,
ts timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX ON counter_events (counter_id);
-- WRITE: pure INSERT, never contends
INSERT INTO counter_events (counter_id, delta) VALUES ('video:42', 1);
-- READ: fold the events (optionally roll up periodically into a snapshot)
SELECT coalesce(sum(delta), 0) AS total
FROM counter_events
WHERE counter_id = 'video:42';
Layer 3 — Option 3 (Redis / dedicated primitive) is NOT a default
A dedicated atomic-counter system adds a second store, a second consistency model, and a second operational surface — which violates the single-account / no-extra-ops posture of the platform. Reserve it for a tool that is both extremely contended and latency-critical, where neither sharding nor the Postgres route is acceptable. It is an explicit, justified exception, never a reach-for default.
Combined effect — W1 and W2 together
The two levers attack different bottlenecks and compose. Neither substitutes for the other (per the §9 correction: the queue defeats W1, the sharded many-DB design defeats W2).
Group commit (see Engine Core and Experiment 2) amortizes the ~10ms network handoff across the batch of commits waiting in one DB's single lane — it makes the same-row queue cheap per commit. Many-small-DBs gives unrelated write streams their own lanes so they never enter that queue in the first place. Applied together, the only workload that survives both mitigations is the irreducible one:
The one irreducible case
Same-row, same-DB, high-rate, latency-critical. Sharding cannot split a genuinely shared value; group commit batches throughput but cannot lower a single contended write's tail latency; the dependency is sequential by definition. This is exactly the Layer-1 Option-2 Postgres outlier — and it is rare.
The rule
One line
Serialize same-row writes (correct, universal); shard to recover granularity (cheap, where you control it); route the irreducibly-contended outlier to Postgres (rare). Correctness handled universally; performance handled per-tool.
Normative requirements
- MUST serialize concurrent writes to the same row; the engine MUST NOT under any configuration allow two transactions to apply a read-modify-write to the same row without one observing the other's committed value.
- MUST guarantee that a naive contended
UPDATE(e.g.UPDATE counter SET n = n + 1 WHERE id = ?) returns the correct result on the disaggregated engine — slowness under contention is acceptable, incorrectness is not. - MUST provide the route-to-Postgres escape hatch (Option 2) for any tool flagged as irreducibly contended; this is the mandatory safety net, not an optional add-on.
- MUST never acknowledge a commit before its WAL record is durably stored, even under group commit; batching MUST NOT trade durability for latency (see Benchmark & Validation Plan Experiment 4).
- SHOULD ship the sharded-counter and event-log helpers (Option 1) as documented patterns plus copy-paste SQL, applied wherever the platform controls the tool's schema.
- SHOULD map each independent write stream to its own database so per-DB locking approximates per-row granularity (the W2 lever).
- SHOULD surface a contention signal (per-row serialized-handoff rate / wait time) so Experiment 3 outliers are detectable in production, not only in benchmarks.
- MAY introduce a dedicated atomic primitive (Option 3, e.g. Redis) for a tool that is both extremely contended and latency-critical, only after Options 1 and 2 are shown insufficient.
- MUST NOT default to a second datastore for counters; adding a consistency model and ops surface MUST be a justified per-tool exception.
- MUST NOT claim that group commit or sharding solves same-row contention — they solve latency (W1) and lanes (W2) respectively, never the read-modify-write dependency itself.
Behaviour & decision algorithm
Per-tool placement follows the §9 go/no-go rule, specialised for contention:
classify_tool(tool):
# 1. Default everything to the correct-but-slow engine.
backend = DISAGGREGATED_ENGINE
# 2. Recover granularity for free where we own the schema.
if writes_are_to_distinct_rows(tool) or can_shard(tool):
apply(Option1.shard_counter or Option1.event_log)
place tool in its own database # W2 lane
return backend # green quadrant
# 3. Measure the contended case (Benchmark Exp 3).
p99, plateau = run_exp3(tool) # same-row writers, swept
if p99 within tool.write_latency_budget and
plateau >= tool.aggregate_write_rate:
return backend # batching (W1) suffices
# 4. Irreducible: same-row, same-DB, high-rate, latency-critical.
if tool.latency_critical:
return COUPLED_POSTGRES # Option 2 safety net
else:
return backend # slow but acceptable
# Option 3 (dedicated primitive) is a manual override here,
# never returned automatically.
The decision is reversible and per-tool: the platform benchmarks the boundary once, then places each tool on the correct side. There is no global on/off switch — correctness is global, placement is local.
Configuration knobs
- counter_shard_count (N)
- Number of sub-counter rows in the Option-1 sharded-counter helper. Higher N = more parallelism on writes, more rows to sum on read. Default 16; tune to expected concurrent writers.
- group_commit_window
- Max wait before flushing a batch of commits in a single DB's lane (W1 mitigation). Larger window = better amortization of the ~10ms handoff, higher single-commit tail latency. Owned by Engine Core; surfaced here because it sets the same-row queue's cost-per-commit.
- group_commit_max_batch
- Max commits coalesced per durable append, bounding batch latency under bursts.
- contention_flag_threshold
- Per-row serialized-handoff rate or mean wait time above which a tool is flagged as an Exp-3 outlier and surfaced for Option-2 routing.
- db_per_write_stream
- Policy toggle: whether independent write streams are auto-placed in separate databases (the W2 lane lever) or co-located.
Failure modes & edge cases
| Mode | Cause | Outcome / mitigation |
|---|---|---|
| Lost update | two same-row writes both read the stale value (parallelized incorrectly) | disqualifying — engine MUST serialize; this MUST NOT be reachable |
| Manufactured contention | unrelated rows in one big DB serialize on the per-DB lock | shard into many DBs (W2); the granularity gap closes |
| Oversell / negative inventory | decrement not checked against the committed prior value | same-row serialization makes check-and-decrement atomic; do not bypass it with caching |
| Acked-write loss under batching | group commit acks before WAL is durable | disqualifying — see Exp 4; never ack from an in-memory buffer |
| Sharded-counter skew | random shard selection clusters writes onto few shards | raise N; use a well-distributed shard key; reads still correct, only throughput affected |
| Event-log read amplification | SUM over an unbounded event table grows slow | periodic roll-up into a snapshot row + truncate folded events; read = snapshot + recent tail |
| Option-3 split-brain | counter lives in Redis, source of truth diverges from the SQL DB | only adopt with an explicit reconciliation/consistency story; this is why it is not a default |
| Misdiagnosis by volume | high-throughput distinct-row workload mistaken for hot-row | diagnose by write dependency, not request rate; distinct-row volume is a lane/batch problem |
Dependencies & existing pieces to start from
- MUST build on the engine's MVCC + single-writer-per-DB serialization in Engine Core — the correctness guarantee this entire strategy rests on.
- MUST use the group-commit machinery (W1 lever) and the durable-append guarantee from Engine Core and the commit log in Object-Storage Backend (S3 conditional-write / CAS append, single-writer fencing).
- MUST coordinate single-writer fencing with the CAS token / lease in Lifecycle & Controller so the serializing writer is unambiguous.
- SHOULD drive per-tool placement from Benchmark & Validation Plan Experiment 3 (contention wall) and Experiment 2 (group-commit plateau).
- SHOULD reuse the many-small-DBs / sharding posture established in Architecture Overview as the W2 lane mechanism.
Acceptance criteria — definition of done
- MUST demonstrate that a naive contended increment produces the correct total under concurrency (no lost updates) on the disaggregated engine.
- MUST pass Experiment 4 unconditionally: every acked commit survives crash injection, including under group commit, with no torn or half state.
- MUST show Experiment 3 flattening/dropping on same-row writers and the N-separate-databases variant scaling linearly (validates W2).
- SHOULD ship documented Option-1 helpers (sharded counter, event-log) with working SQL and a roll-up recipe for the event-log read path.
- SHOULD expose a contention metric and a documented Option-2 routing runbook so a flagged tool can move to coupled Postgres without code changes to other tools.
- SHOULD demonstrate the sharded-counter helper converting a red-quadrant workload to green (Exp-3 throughput recovering toward Exp-2 levels).
Open questions & risks
- MAY need a built-in serial/sequence-style primitive: should the engine ship a first-class counter type that auto-shards, so developers get the green path without writing the helper SQL?
- MAY want automatic contention detection that suggests sharding in-product, since the platform cannot enforce developer SQL — how aggressive should auto-flagging be before it becomes noise?
- SHOULD define the event-log roll-up cadence and locking: the roll-up itself touches a snapshot row and could reintroduce a (smaller) hot row — needs its own contention budget.
- MAY revisit whether cross-DB transactions are ever needed; the many-small-DBs lever assumes write streams are independent — workloads that span DBs transactionally break the lane model.
- MUST keep the Option-2 boundary measurable in production, not only in benchmarks, so the "rare outlier" assumption stays falsifiable as workloads evolve.