Configuration Reference¶

All engine configuration is read from environment variables. An optional YAML file at /etc/workflow/engine.yaml provides a secondary source (env vars always win).

Variable	Default	Description
`WE_POSTGRES_DSN`	(required)	libpq connection string, e.g. `postgres://user:pass@host:5432/db?sslmode=disable`
`WE_DISPATCH_MODE`	`polling`	`polling` or `kafka_outbox`. Opts into event-driven dispatch.
`WE_KAFKA_TRANSPORT`	`plaintext`	`plaintext` or `sasl_scram_tls`. Required if `WE_DISPATCH_MODE=kafka_outbox`.
`WE_KAFKA_SEED_BROKERS`	`""`	Comma-separated list of brokers, e.g. `localhost:9092`. Required if `kafka_outbox`.
`WE_KAFKA_SASL_MECHANISM`	`SCRAM-SHA-512`	`SCRAM-SHA-256` or `SCRAM-SHA-512`. Required for `sasl_scram_tls`.
`WE_KAFKA_SASL_USERNAME`	`""`	Required for `sasl_scram_tls`.
`WE_KAFKA_SASL_PASSWORD`	`""`	ENV ONLY, NEVER ON DISK. Required for `sasl_scram_tls`.
`WE_KAFKA_TLS_CA_PATH`	`""`	Optional CA path for self-signed brokers in `sasl_scram_tls`.
`WE_KAFKA_TLS_SERVER_NAME`	`""`	Optional TLS ServerName override.
`WE_OUTBOX_BATCH_SIZE`	`200`	Relay drain batch limit.
`WE_REST_PORT`	`8080`	HTTP/REST listener port
`WE_GRPC_PORT`	`9090`	gRPC server port
`WE_METRICS_PORT`	`9091`	Prometheus `/metrics` endpoint port
`WE_LOG_LEVEL`	`info`	Minimum log level: `debug`, `info`, `warn`, `error`
`WE_AUDIT_LOG_ENABLED`	`true`	Record engine actions to `audit_log` table
`DB_MAX_CONNS`	`runtime.NumCPU() * 4` (floor 4)	Maximum pgxpool connections. Tune per deployment to bound Postgres `max_connections` usage across replicas.
`DB_MIN_CONNS`	`0`	Minimum idle pgxpool connections held open. `0` preserves pgxpool's on-demand behaviour. Must be `≤ DB_MAX_CONNS`.

Kafka Partition Assignment Strategy¶

The Workflow Engine and all SDKs (Go, Java, Node.js, Python) utilize the CooperativeStickyAssignor (or cooperative-sticky in librdkafka-based clients) by default.

This strategy enables incremental rebalancing, allowing consumers to keep their assigned partitions during a rebalance if they are not being moved to another member. This avoids "stop-the-world" pauses and is highly recommended for stable operations in Kubernetes environments.

While the default is standardized for stability, users can override this in the SDKs by providing custom Kafka properties during initialization if absolutely necessary.

Engine performance metrics¶

The engine exposes the following Prometheus series (in addition to workflow_*, job_*, http_*, and grpc_*):

Metric	Type	Purpose
`engine_db_transaction_duration_seconds`	Histogram (`tx_type`)	Wall-clock Begin → Commit/Rollback per logical engine transaction
`engine_db_lock_wait_duration_seconds`	Histogram (`operation`)	Pre-acquire wait for `FOR UPDATE` / `FOR UPDATE SKIP LOCKED`
`engine_job_timeout_total`	Counter	Jobs whose lease expired and were recovered by the lease sweeper
`engine_job_pickup_latency_seconds`	Histogram	End-to-end `time.Since(job.created_at)` at successful worker claim. Primary signal for the < 50ms-p95 target.

Multi-replica coordination¶

Sweepers (job lease expiry and boundary_event_schedule timer firing) are gated behind distinct PostgreSQL advisory locks (pg_try_advisory_lock) so that across N replicas only one replica sweeps per interval. No cluster configuration is required — every engine replica tries to acquire the lock each tick; losers skip silently until the current leader disconnects.

Partial JSONB updates¶

Updates to workflow_instance.variables emit chained jsonb_set calls per dirty top-level key instead of rewriting the entire JSONB blob. For ≥ 256 KB payloads this reduces WAL volume by ≥ 40% on single-key updates.