10 Draft v0.1 Updated 2026-06-20

Hot-Row Contention Strategy

Concurrent writes to the same row must be serialized — that is correct database semantics, not a defect — so this spec separates the irreducible cost (serialization) from the manufactured cost (paying a network handoff for conflicts coarse locking invented), and lays out a layered, correct-but-slow-by-default policy that ships safe under any developer SQL and escalates only the rare irreducibly-contended outlier to coupled Postgres.

Purpose & scope

This page is the engineering policy for hot-row write contention: the case where many concurrent transactions update the same logical value stored in one row — a counter, an inventory level, a balance. It expands §9 of the design note into a buildable strategy: what the engine guarantees by default, what guidance and helpers it ships, what gets flagged, and what gets routed off the disaggregated engine entirely.

The scope is deliberately narrow. Contention is a correctness-vs-throughput property of a workload, not a defect of the storage architecture. The disaggregated engine's two weak axes are characterised in Benchmark & Validation Plan as W1 — commit latency (each commit is a network/S3 round-trip, not a local fsync) and W2 — write serialization (single writer per database). Hot-row contention is the place where W1 and W2 compound, so it is the page where the platform commits to a position.

Note

The headline: correctness is handled universally (the engine serializes same-row writers like every serious database does), performance is handled per-tool (shard where we control the SQL; route the irreducible outlier to Postgres). No workload is ever at risk of incorrectness — only the rare contended outlier is at risk of being slow, and that one has a named escape hatch.

Premise — the correct semantics

Concurrent writes to the same row MUST be serialized. This is correct behaviour, not a limitation. Every serious database serializes contended writers — Postgres takes a row lock and resolves visibility through MVCC; none of them parallelizes two writes to the same value, because the problem is inherently sequential: each write must see the prior committed value before it can compute its own result.

   correct (what every DB does)              impossible (what no DB does)

   W1 ─ read n=5 ─ write n=6 ─ commit        W1 ─ read n=5 ─ write n=6 ─┐
                                  │           W2 ─ read n=5 ─ write n=6 ─┘  ← both read 5,
   W2 ──────────── read n=6 ─ write n=7         lost update: n=6, not 7
                                  │
                              serialized:           "parallel" same-row writes
                              n ends at 7           silently corrupt the value

Same-row writes have a read-modify-write dependency. Serializing them is the only correct outcome; "parallelizing" them is the lost-update bug.

The only two engineering variables are therefore:

lock granularity: How much else gets blocked while one writer holds the row. Per-row blocks only same-row writers; per-database blocks every writer in the DB. The engine's single-writer-per-DB model (see Engine Core) is per-DB granularity.
handoff cost: What it costs to pass the right-to-write from one serialized writer to the next. Local Postgres pays ~one fsync (~100µs); the disaggregated engine pays a durable network round-trip + ack (~10ms). This is W1 made concrete.

Everything downstream is a consequence of moving these two dials. Nothing in this spec attempts to make same-row writes parallel — that would be a correctness bug, not an optimization.

Where it happens — the hot-row catalog

A hot row is almost always a shared number that must stay correct under concurrent updates. The recurring patterns:

Pattern	The shared value	Why every write must see the previous
View / like / play counter	one count on a popular item	`n = n + 1` — increment is read-modify-write
Inventory / seat / ticket decrement	units remaining	MUST NOT oversell; the check-and-decrement must be atomic against the prior value
Rate-limit / quota counter	requests used per user/window	the limit decision depends on the running total
Shared account balance	one money figure	debit/credit must apply to the committed balance, never a stale one
Hot status / `updated_at` on a parent	a single field many children touch	last-writer-wins still serializes; many children funnel onto one row

Signature: one row that is semantically a single shared value, where correctness requires every write to see the previous one.

Contrast — the case that parallelizes freely: many writers each inserting a different row (logs, events, append-only orders) have no read-modify-write dependency between them. They can commit in any order. They are the green case.

Contention, not volume, is the problem

A million inserts a second into a million distinct rows is not a hot-row problem — it is a throughput problem solved by lanes (W2 / sharding) and batching (W1). A thousand updates a second to one row is a hot-row problem no matter how low the absolute volume. Diagnose by dependency between writes, never by request rate.

Per-row vs per-database lock cost over the network

The cost model is the load-bearing argument of this page. It reproduces the §9 comparison: a local per-row lock (Postgres), a hypothetical per-row lock over the network, and this engine's naive per-DB lock over the network.

Axis	Per-row, local (Postgres)	Per-row, networked	Per-DB, networked (this engine, naive)
What blocks	same-row writers only	same-row writers only	every writer in the DB
Handoff cost	~fsync, ~100µs	round-trip + ack, ~10ms	round-trip + ack, ~10ms
Same-row contenders	serialize, drain fast	slow but bounded	same as per-row (contenders overlap anyway)
Different-row writers	full parallel	full parallel	all serialize — catastrophic

Two facts follow directly from the table:

The network makes every handoff ~100× costlier (10ms vs 100µs). This is the fixed cost of a durable-over-network commit — W1. It applies to every serialized handoff, contended or not.
Per-DB granularity multiplies how often you pay it. It forces serialization on writes that have no real conflict, turning free parallel writes into expensive queued ones. Coarse locking manufactures contention out of independent writes.

The expensive part is the manufactured contention, not the lock

Look at the columns: same-row contention costs effectively the same under per-row or per-DB granularity — the contenders overlap on one row regardless. The granularity gap only bites on different-row writes (the common case), where per-row stays fully parallel and naive per-DB forces a 10ms serialized handoff between writes that never conflicted. The cost you most want to avoid is paying the expensive network handoff for contention you manufactured by locking too coarsely.

Many-small-DBs recovers per-row-like granularity

The fix for manufactured contention on a per-DB-lock engine is not a finer lock inside one database — it is more databases. If each tool, tenant, or independent write stream is its own database, "per-DB lock" stops meaning "everything waits" and starts meaning "only related writes wait."

  ONE big DB, per-DB lock                MANY small DBs, per-DB lock each
  (manufactured contention)              (granularity recovered)

  W_a (row A) ─┐                         DB1: W_a (row A) ── own lane ──►
  W_b (row B) ─┤ all queue on            DB2: W_b (row B) ── own lane ──►
  W_c (row C) ─┤ ONE per-DB lock         DB3: W_c (row C) ── own lane ──►
  W_d (row D) ─┘ even though no            ↑ no shared lock between
                 two conflict               unrelated streams = parallel

Database boundaries substitute for row-level lock granularity. Splitting unrelated write streams into their own DBs converts a single serialized queue into N independent lanes (W2).

Database boundaries substitute for row-level lock granularity. This is the deep reason the architecture leans on sharding into many small databases — not merely billing or operational isolation, but making the coarse networked lock affordable. The single-writer-per-DB ceiling stops being a throughput cap the moment "per DB" maps to "per independent write stream." This is exactly the W2 lever validated by Experiment 3's "N separate databases scale linearly" check in Benchmark & Validation Plan.

What sharding does and does not solve

Sharding recovers granularity for writes to different rows by giving them different lanes (W2). It does nothing for two writes that genuinely target the same row in the same DB — those still serialize, correctly. Sharding moves the per-DB lock closer to a per-row lock; it cannot abolish the read-modify-write dependency on a truly shared value.

Recommendation — layered policy

The platform cannot control how each tool writes SQL. A developer will write a naive UPDATE counter SET n = n + 1 and the engine must do something correct with it. The policy is therefore correct-but-slow by default, with escalation for flagged outliers, in three layers of decreasing automaticity.

Layer 1 — Option 2 (route-to-Postgres) is the mandatory safety net

This layer is non-optional and requires nothing from the developer. The disaggregated engine already serializes the hot-row case correctly through its single lane — correct, just slow. A naive increment never breaks; it merely runs slowly on the one tool that contends. Experiment 3 (the red-quadrant detector in Benchmark & Validation Plan) flags that tool, and it moves to coupled Postgres. Correctness is never at risk; only performance, only for the outlier.

Why this is the floor

Because correctness is guaranteed by the engine's serialization regardless of SQL quality, the worst outcome for any tool is "slow," never "wrong." That property — safe under arbitrary developer SQL — is what lets the platform ship Option 1 as guidance rather than enforcement.

Layer 2 — Option 1 (avoid the hot row) is offered as guidance, not enforced

Ship the patterns and helpers; use them where the platform controls the tool's schema; never depend on every developer applying them. Both helpers convert "same row" back into "different rows" — moving the workload from the red quadrant into the green one.

Helper A — sharded counter

Split one counter into N sub-counter rows. Each writer increments a random shard (different rows = parallel); reads sum across shards.

-- N sub-counters instead of one hot row
CREATE TABLE counter_shards (
    counter_id  text    NOT NULL,
    shard       int     NOT NULL,            -- 0 .. N-1
    n           bigint  NOT NULL DEFAULT 0,
    PRIMARY KEY (counter_id, shard)
);

-- WRITE: pick a shard at random -> spreads writers across N different rows
UPDATE counter_shards
   SET n = n + 1
 WHERE counter_id = 'video:42'
   AND shard = floor(random() * 16);        -- N = 16

-- READ: aggregate the shards into the logical total
SELECT sum(n) AS total
  FROM counter_shards
 WHERE counter_id = 'video:42';

Helper B — event log + aggregate on read

Never update the count at all. Append one immutable event per increment (different rows, always parallel) and compute the total on read.

-- Append-only: every write is a NEW row, zero contention
CREATE TABLE counter_events (
    id          bigserial PRIMARY KEY,
    counter_id  text      NOT NULL,
    delta       int       NOT NULL DEFAULT 1,
    ts          timestamptz NOT NULL DEFAULT now()
);
CREATE INDEX ON counter_events (counter_id);

-- WRITE: pure INSERT, never contends
INSERT INTO counter_events (counter_id, delta) VALUES ('video:42', 1);

-- READ: fold the events (optionally roll up periodically into a snapshot)
SELECT coalesce(sum(delta), 0) AS total
  FROM counter_events
 WHERE counter_id = 'video:42';

Layer 3 — Option 3 (Redis / dedicated primitive) is NOT a default

A dedicated atomic-counter system adds a second store, a second consistency model, and a second operational surface — which violates the single-account / no-extra-ops posture of the platform. Reserve it for a tool that is both extremely contended and latency-critical, where neither sharding nor the Postgres route is acceptable. It is an explicit, justified exception, never a reach-for default.

Combined effect — W1 and W2 together

The two levers attack different bottlenecks and compose. Neither substitutes for the other (per the §9 correction: the queue defeats W1, the sharded many-DB design defeats W2).

group commit batches the expensive handoffs for the same-row queue

many-small-DBs gives independent lanes for different-row writes

≈ local

together: the coarse networked engine behaves close to a fine-grained local one

Group commit (see Engine Core and Experiment 2) amortizes the ~10ms network handoff across the batch of commits waiting in one DB's single lane — it makes the same-row queue cheap per commit. Many-small-DBs gives unrelated write streams their own lanes so they never enter that queue in the first place. Applied together, the only workload that survives both mitigations is the irreducible one:

The one irreducible case

Same-row, same-DB, high-rate, latency-critical. Sharding cannot split a genuinely shared value; group commit batches throughput but cannot lower a single contended write's tail latency; the dependency is sequential by definition. This is exactly the Layer-1 Option-2 Postgres outlier — and it is rare.

The rule

One line

Serialize same-row writes (correct, universal); shard to recover granularity (cheap, where you control it); route the irreducibly-contended outlier to Postgres (rare). Correctness handled universally; performance handled per-tool.

Normative requirements

MUST serialize concurrent writes to the same row; the engine MUST NOT under any configuration allow two transactions to apply a read-modify-write to the same row without one observing the other's committed value.
MUST guarantee that a naive contended UPDATE (e.g. UPDATE counter SET n = n + 1 WHERE id = ?) returns the correct result on the disaggregated engine — slowness under contention is acceptable, incorrectness is not.
MUST provide the route-to-Postgres escape hatch (Option 2) for any tool flagged as irreducibly contended; this is the mandatory safety net, not an optional add-on.
MUST never acknowledge a commit before its WAL record is durably stored, even under group commit; batching MUST NOT trade durability for latency (see Benchmark & Validation Plan Experiment 4).
SHOULD ship the sharded-counter and event-log helpers (Option 1) as documented patterns plus copy-paste SQL, applied wherever the platform controls the tool's schema.
SHOULD map each independent write stream to its own database so per-DB locking approximates per-row granularity (the W2 lever).
SHOULD surface a contention signal (per-row serialized-handoff rate / wait time) so Experiment 3 outliers are detectable in production, not only in benchmarks.
MAY introduce a dedicated atomic primitive (Option 3, e.g. Redis) for a tool that is both extremely contended and latency-critical, only after Options 1 and 2 are shown insufficient.
MUST NOT default to a second datastore for counters; adding a consistency model and ops surface MUST be a justified per-tool exception.
MUST NOT claim that group commit or sharding solves same-row contention — they solve latency (W1) and lanes (W2) respectively, never the read-modify-write dependency itself.

Behaviour & decision algorithm

Per-tool placement follows the §9 go/no-go rule, specialised for contention:

classify_tool(tool):
    # 1. Default everything to the correct-but-slow engine.
    backend = DISAGGREGATED_ENGINE

    # 2. Recover granularity for free where we own the schema.
    if writes_are_to_distinct_rows(tool) or can_shard(tool):
        apply(Option1.shard_counter or Option1.event_log)
        place tool in its own database          # W2 lane
        return backend                          # green quadrant

    # 3. Measure the contended case (Benchmark Exp 3).
    p99, plateau = run_exp3(tool)               # same-row writers, swept

    if p99 within tool.write_latency_budget and
       plateau >= tool.aggregate_write_rate:
        return backend                          # batching (W1) suffices

    # 4. Irreducible: same-row, same-DB, high-rate, latency-critical.
    if tool.latency_critical:
        return COUPLED_POSTGRES                  # Option 2 safety net
    else:
        return backend                          # slow but acceptable

    # Option 3 (dedicated primitive) is a manual override here,
    # never returned automatically.

The decision is reversible and per-tool: the platform benchmarks the boundary once, then places each tool on the correct side. There is no global on/off switch — correctness is global, placement is local.

Configuration knobs

counter_shard_count (N): Number of sub-counter rows in the Option-1 sharded-counter helper. Higher N = more parallelism on writes, more rows to sum on read. Default 16; tune to expected concurrent writers.
group_commit_window: Max wait before flushing a batch of commits in a single DB's lane (W1 mitigation). Larger window = better amortization of the ~10ms handoff, higher single-commit tail latency. Owned by Engine Core; surfaced here because it sets the same-row queue's cost-per-commit.
group_commit_max_batch: Max commits coalesced per durable append, bounding batch latency under bursts.
contention_flag_threshold: Per-row serialized-handoff rate or mean wait time above which a tool is flagged as an Exp-3 outlier and surfaced for Option-2 routing.
db_per_write_stream: Policy toggle: whether independent write streams are auto-placed in separate databases (the W2 lane lever) or co-located.

Failure modes & edge cases

Mode	Cause	Outcome / mitigation
Lost update	two same-row writes both read the stale value (parallelized incorrectly)	disqualifying — engine MUST serialize; this MUST NOT be reachable
Manufactured contention	unrelated rows in one big DB serialize on the per-DB lock	shard into many DBs (W2); the granularity gap closes
Oversell / negative inventory	decrement not checked against the committed prior value	same-row serialization makes check-and-decrement atomic; do not bypass it with caching
Acked-write loss under batching	group commit acks before WAL is durable	disqualifying — see Exp 4; never ack from an in-memory buffer
Sharded-counter skew	random shard selection clusters writes onto few shards	raise N; use a well-distributed shard key; reads still correct, only throughput affected
Event-log read amplification	`SUM` over an unbounded event table grows slow	periodic roll-up into a snapshot row + truncate folded events; read = snapshot + recent tail
Option-3 split-brain	counter lives in Redis, source of truth diverges from the SQL DB	only adopt with an explicit reconciliation/consistency story; this is why it is not a default
Misdiagnosis by volume	high-throughput distinct-row workload mistaken for hot-row	diagnose by write dependency, not request rate; distinct-row volume is a lane/batch problem

Dependencies & existing pieces to start from

MUST build on the engine's MVCC + single-writer-per-DB serialization in Engine Core — the correctness guarantee this entire strategy rests on.
MUST use the group-commit machinery (W1 lever) and the durable-append guarantee from Engine Core and the commit log in Object-Storage Backend (S3 conditional-write / CAS append, single-writer fencing).
MUST coordinate single-writer fencing with the CAS token / lease in Lifecycle & Controller so the serializing writer is unambiguous.
SHOULD drive per-tool placement from Benchmark & Validation Plan Experiment 3 (contention wall) and Experiment 2 (group-commit plateau).
SHOULD reuse the many-small-DBs / sharding posture established in Architecture Overview as the W2 lane mechanism.

Acceptance criteria — definition of done

MUST demonstrate that a naive contended increment produces the correct total under concurrency (no lost updates) on the disaggregated engine.
MUST pass Experiment 4 unconditionally: every acked commit survives crash injection, including under group commit, with no torn or half state.
MUST show Experiment 3 flattening/dropping on same-row writers and the N-separate-databases variant scaling linearly (validates W2).
SHOULD ship documented Option-1 helpers (sharded counter, event-log) with working SQL and a roll-up recipe for the event-log read path.
SHOULD expose a contention metric and a documented Option-2 routing runbook so a flagged tool can move to coupled Postgres without code changes to other tools.
SHOULD demonstrate the sharded-counter helper converting a red-quadrant workload to green (Exp-3 throughput recovering toward Exp-2 levels).

Open questions & risks

MAY need a built-in serial/sequence-style primitive: should the engine ship a first-class counter type that auto-shards, so developers get the green path without writing the helper SQL?
MAY want automatic contention detection that suggests sharding in-product, since the platform cannot enforce developer SQL — how aggressive should auto-flagging be before it becomes noise?
SHOULD define the event-log roll-up cadence and locking: the roll-up itself touches a snapshot row and could reintroduce a (smaller) hot row — needs its own contention budget.
MAY revisit whether cross-DB transactions are ever needed; the many-small-DBs lever assumes write streams are independent — workloads that span DBs transactionally break the lane model.
MUST keep the Option-2 boundary measurable in production, not only in benchmarks, so the "rare outlier" assumption stays falsifiable as workloads evolve.

Related specifications

Benchmark & Validation PlanExperiment 3 is the red-quadrant detector that flags hot-row outliers; Exp 2 proves the W1 group-commit plateau. Lifecycle & Controllersingle-writer fencing via the commit-log CAS token — the mechanism that makes "the serializing writer" unambiguous. Engine CoreMVCC snapshot isolation, single-writer-per-DB serialization, and group commit — the correctness floor this strategy stands on. Object-Storage Backendthe S3 CAS commit log whose ~10ms durable handoff sets the per-commit cost the policy works around. Architecture Overviewthe many-small-DBs / sharding posture that is the W2 lane lever for recovering lock granularity.