Tradeoffs: Polling vs Event-Driven (Kafka + Outbox)¶
Feature: Event-Driven Job Dispatch via Kafka + Transaction Outbox (Opt-In) Purpose: An honest comparison of the two dispatch modes this feature makes available. Readers should come away understanding when each mode wins and what they are signing up for if they opt into event-driven mode.
Side-by-side comparison¶
| Dimension | Polling (SELECT … FOR UPDATE SKIP LOCKED) |
Event-Driven (Kafka + Transaction Outbox) |
|---|---|---|
| Infrastructure dependencies | PostgreSQL only | PostgreSQL + Kafka cluster |
| Source of truth | One (the job table row) |
Two (outbox row + broker offsets), kept consistent by the outbox pattern |
| Delivery semantics | Strong — the row lock is the single truth | At-least-once; idempotency is the worker's responsibility |
| Dispatch latency (p50) | Bounded below by half the poll interval (≈200–500 ms typical) | Milliseconds — push-based end-to-end |
| Dispatch throughput ceiling | Caps at the point where PG row-lock contention dominates | Caps at broker throughput — orders of magnitude higher in practice |
| Engine replica scaling | Adding replicas increases PG connection pressure without adding dispatch throughput | Adding replicas scales throughput approximately linearly until broker/worker capacity bounds |
| Ordering guarantee | Effectively FIFO per job_type across the deployment |
FIFO per workflow instance (partition key = instance_id); not globally FIFO per job type |
| Backpressure visibility | Hidden — observable only indirectly via lock-wait metrics | Explicit — outbox backlog gauge, consumer lag per group |
| Replay / DLQ / retention | Must be built on top of the DB by hand | First-class broker primitives |
| Multi-region / DR | Hard — replicating a work-queue table reliably is non-trivial | Easier — broker-level replication (MirrorMaker 2, Redpanda replicas) is a well-worn path |
| Debugging "where's my job?" | One SQL query | Four-to-five checkpoints: outbox row → producer ack → broker topic → consumer group → worker log |
| Schema evolution (on-wire events) | N/A — workers pull structured DB rows | Must respect Protobuf field-number rules; a single wrong renumber is a production hazard |
| Baseline infrastructure cost at small scale | Near zero | Non-trivial — even a minimal HA Kafka setup costs money and ops time |
| Team cognitive load | Low (SQL state machine) | Higher (partitions, consumer groups, offsets, rebalances, leader election) |
| Blast radius if infra is down | PG down = whole engine down anyway | Kafka down = dispatch stops, but engine write path still works (outbox keeps accepting rows) |
| Testing surface | PostgreSQL testcontainer | PostgreSQL testcontainer + Redpanda testcontainer + broker-outage / rebalance simulation |
| New failure modes introduced | — | Broker outage, consumer rebalance storm, partition imbalance, relay crash mid-batch, split-brain if advisory lock is misconfigured |
Advantages of event-driven mode that polling genuinely cannot match¶
These are structural wins — you cannot get them by tuning the polling path harder.
- No row-lock contention on dispatch. The core motivation. At a certain volume (tens of thousands of pending jobs with many competing worker fleets), PostgreSQL's row-lock wait dominates and polling cannot scale further on the same hardware. Event-driven dispatch takes the hot path off the database entirely.
- Engine replica scaling becomes approximately linear. In polling mode, every replica competes for the same rows — adding replicas increases contention. In event-driven mode, replicas are independent producers; the broker handles fan-out. Operators who care about horizontal scale for job-creation throughput get real leverage here.
- Dispatch latency drops from poll-interval-floor to milliseconds. For workflows where a job created at
t=0must reach a worker immediately rather than within the next poll cycle, this is a qualitative difference, not a quantitative one. - Backpressure becomes observable. Consumer lag and outbox backlog are directly measurable and directly alertable. Polling mode hides the same information inside lock-wait metrics that are harder to threshold and harder to explain to a non-DBA.
- Worker fleets become isolated from the database. Workers talk to a broker, not to PG. A misbehaving worker fleet (hot-loop connections, leaked prepared statements, whatever) can hurt the broker but cannot take down PG — which is also where the engine's definitions, instances, and audit log live.
- The broker becomes a reusable spine for other events. If the system later needs to emit
instance_started/instance_completedevents to other services, the broker is already there. Polling mode would require a second ad-hoc distribution channel for that use case.
Disadvantages of event-driven mode that cannot be fixed by documentation¶
These are costs you pay forever once you flip the switch. Operators should read them as the price list.
- Dual-source state. The outbox row and the broker's internal offsets are both state. The outbox pattern guarantees durability of the handoff, but the cognitive load of reasoning about "where the job currently is" does not go away. A new contributor has to learn both halves.
- At-least-once delivery becomes a worker contract. Workers will occasionally see the same job twice (producer retries, relay restarts, broker replays). The engine's completion path is idempotent. The handler's side-effects (charging a card, sending an email, mutating an external system) must also be idempotent. The most expensive bugs in event-driven systems are non-idempotent handlers written by developers who forgot to read the contract.
- Failure-mode count multiplies. Broker connectivity, consumer rebalance, partition imbalance, relay lag, split-brain advisory lock, topic deletion, ACL misconfiguration — each is a new class of incident with its own dashboard, alert, and runbook. Polling mode has one failure class (the DB is down).
- Ordering semantics change shape. Polling gives effective global FIFO per
job_type. Event-driven gives per-instance FIFO. If any workflow silently depends on cross-instance order-of-arrival, it will break subtly when the mode is flipped. This is why the spec now explicitly calls it out as an operator review item (seespec.mdAssumptions). - Ops cost is non-zero even at low traffic. A production Kafka cluster is three brokers minimum for HA, plus monitoring, plus schema discipline. Redpanda lowers this floor but does not eliminate it. At low traffic, event-driven mode is strictly more expensive to run than polling mode.
- Testing is more complex. Simulating broker outage, rebalance storms, large-payload rejection, and relay-crash-mid-batch all require real broker test infrastructure. This feature's test plan (see
tasks.md) includes ~10 integration tests just for the new path. - Schema evolution requires discipline. Protobuf field-number rules are a real contract. One PR that renumbers a field, or reuses a removed number, is a production outage waiting for its trigger. This is a new review discipline the team must adopt.
- Minimum dispatch latency at low load may be slightly higher. Polling can be extremely fast for a single job in an idle system (one
UPDATE … RETURNING). Event-driven has to gooutbox-commit → relay batch → broker ack → consumer fetch. The p50 at high load is much lower for event-driven, but the best-case single-job latency on an idle system can be lower for polling.
Decision rubric: when to flip the switch¶
Polling mode is the correct default. Event-driven mode is the correct choice only when both of the following are true:
- The workload has hit, or will measurably hit within 6–12 months, the polling path's row-lock contention ceiling. The concrete signal is sustained
wait_event_type='Lock'on thejobtable during load, or observable engine throughput plateauing regardless of added replicas. - The operating team either has Kafka operational expertise, or has a credible plan to invest in it. Kafka is not a "set it and forget it" dependency; treating it as one is how production incidents are made.
If either condition fails, polling mode remains the better engineering choice — even if event-driven looks more modern. Simpler wins at the scale it fits.
This rubric is also why the feature ships as opt-in rather than as a replacement. There is no "right" dispatch mode in isolation; there is only a right mode for this workload at this team's maturity.