WAL Canary Runbook
This runbook defines a safe rollout process for enabling WAL on production indexes.
Scope:
- WAL is opt-in per index (withWal(...)).
- Default remains disabled (Wal.EMPTY).
- One WAL lives inside each index directory (<index>/wal).
Goals
- Enable WAL on a small subset of indexes first.
- Detect durability/performance regressions early.
- Roll back quickly to Wal.EMPTY when risk signals appear.
Preconditions
- Current backup/restore flow is verified for the target index set.
- The wal-tools distribution is available for verification.
- Monitoring is collecting SegmentIndex.metricsSnapshot() WAL fields.
- Target indexes for canary are chosen (low business criticality first).
Canary Plan
Phase 0 - Baseline (WAL disabled)
Duration: 24h minimum on target indexes.
Collect baseline:
- write latency
- getWalPendingSyncBytes() (should be 0 when disabled)
- getWalSyncAvgNanos() (should be 0 when disabled)
- index throughput
Phase 1 - Enable WAL on canary indexes only
Use explicit WAL config:
```java
Wal wal = Wal.builder()
        .withDurabilityMode(WalDurabilityMode.GROUP_SYNC)
        .withSegmentSizeBytes(64L * 1024L * 1024L) // 64 MiB per WAL segment
        .withGroupSyncDelayMillis(5L) // 5 ms group-sync delay
        .withGroupSyncMaxBatchBytes(1L * 1024L * 1024L) // 1 MiB per sync batch
        .withMaxBytesBeforeForcedCheckpoint(512L * 1024L * 1024L) // 512 MiB
        .withCorruptionPolicy(WalCorruptionPolicy.TRUNCATE_INVALID_TAIL)
        .build();

IndexConfiguration<String, String> conf = IndexConfiguration
        .<String, String>builder()
        .withKeyClass(String.class)
        .withValueClass(String.class)
        .withName("orders-canary")
        .withWal(wal) // opt this index in to WAL
        .build();
```
Open/create the canary index with this config.
Phase 2 - Verify WAL health during rollout
Run WAL verification during rollout windows:
```
/tmp/wal-tools-<version>/bin/wal_verify /path/to/index/wal
/tmp/wal-tools-<version>/bin/wal_verify /path/to/index/wal --json
```
If verification fails (exit code 2), stop rollout and execute rollback.
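The stop-on-failure rule can be wired into rollout automation. The following is a minimal sketch, not part of wal-tools: the class WalVerifyGate, its Action enum, and both method names are hypothetical. Only the meaning of exit code 2 comes from this runbook; treating 0 as success and any other code as a reason to investigate is an assumption.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical helper that maps a wal_verify exit code to a rollout action. */
final class WalVerifyGate {

    /** Illustrative rollout decisions. */
    enum Action { CONTINUE, ROLLBACK, INVESTIGATE }

    private WalVerifyGate() {
    }

    /**
     * Exit code 2 means verification failed (per this runbook).
     * Assumption: 0 means success, anything else needs a human look.
     */
    static Action decide(final int exitCode) {
        if (exitCode == 2) {
            return Action.ROLLBACK;
        }
        return exitCode == 0 ? Action.CONTINUE : Action.INVESTIGATE;
    }

    /** Runs the verifier against one WAL directory and returns its exit code. */
    static int runVerify(final String walVerifyBinary, final String walDir)
            throws IOException, InterruptedException {
        final Process process = new ProcessBuilder(List.of(walVerifyBinary, walDir))
                .inheritIO() // forward verifier output to the caller's console
                .start();
        return process.waitFor();
    }
}
```

A caller could pass the wal_verify path and the index's wal directory to runVerify, then feed the result to decide before moving to the next index.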
Use wal_dump for diagnostics.
Phase 3 - Expand rollout
Expand only if canary passes acceptance criteria for at least 24h.
Recommended expansion:
- 5% of indexes
- 25% of indexes
- 100% of eligible indexes
Pause one full observation window between stages.
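To turn the stage percentages into concrete index counts, a small planner can help. This is a sketch under stated assumptions: WalRolloutStages and cumulativeCounts are hypothetical names, the percentages are read as cumulative shares of the eligible indexes, and rounding up (so a small fleet still gets at least one index in the first stage) is a choice made here, not something this runbook prescribes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical planner for the staged expansion. */
final class WalRolloutStages {

    /** Stage percentages from the recommended expansion list. */
    static final int[] STAGE_PERCENTAGES = {5, 25, 100};

    private WalRolloutStages() {
    }

    /**
     * Returns how many eligible indexes should have WAL enabled at each stage.
     * Assumption: percentages are cumulative and fractional results round up.
     */
    static List<Integer> cumulativeCounts(final int eligibleIndexes) {
        if (eligibleIndexes < 0) {
            throw new IllegalArgumentException("eligibleIndexes must be >= 0");
        }
        final List<Integer> counts = new ArrayList<>();
        for (final int pct : STAGE_PERCENTAGES) {
            // Integer ceiling of eligibleIndexes * pct / 100.
            counts.add((int) (((long) eligibleIndexes * pct + 99) / 100));
        }
        return counts;
    }
}
```

For example, with 200 eligible indexes the stages cover 10, 50, and 200 indexes.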
Alert Thresholds
Use these as initial operational thresholds (tune by workload).
| Signal | Warning | Rollback Trigger |
|---|---|---|
| getWalSyncFailureCount() | any increase | immediate rollback |
| getWalCorruptionCount() | any increase | immediate rollback |
| getWalTruncationCount() | any increase outside planned restart | immediate rollback |
| getWalRetainedBytes() | > 80% of maxBytesBeforeForcedCheckpoint for 10m | > 100% for 10m |
| getWalCheckpointLagLsn() | continuously increasing for 10m | increasing for 30m with no stabilization |
| getWalPendingSyncBytes() | sustained growth for 10m | sustained growth for 30m |
| getWalSyncAvgNanos() | > 2x baseline for 15m | > 4x baseline for 15m |
Critical signals (sync failure, corruption, unexpected truncation) are fail-fast.
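Some of these thresholds can be expressed as simple checks on metric readings. The sketch below is illustrative only: WalAlertEvaluator, Level, and the method names are hypothetical, and it classifies single readings, leaving the "for 10m" / "for 15m" duration windows to the monitoring system. It also assumes the sync-latency baseline comes from a WAL-enabled reference window, since getWalSyncAvgNanos() is expected to be 0 while WAL is disabled.

```java
/** Hypothetical evaluator for some rows of the threshold table. */
final class WalAlertEvaluator {

    /** Illustrative severity levels. */
    enum Level { OK, WARNING, ROLLBACK }

    private WalAlertEvaluator() {
    }

    /** Fail-fast counters (sync failures, corruption): any increase means rollback. */
    static Level counterLevel(final long previous, final long current) {
        return current > previous ? Level.ROLLBACK : Level.OK;
    }

    /** Compares getWalRetainedBytes() with maxBytesBeforeForcedCheckpoint. */
    static Level retainedBytesLevel(final long retainedBytes,
            final long maxBytesBeforeForcedCheckpoint) {
        if (retainedBytes > maxBytesBeforeForcedCheckpoint) {
            return Level.ROLLBACK; // above 100% of the limit
        }
        return retainedBytes > 0.8 * maxBytesBeforeForcedCheckpoint
                ? Level.WARNING // above 80% of the limit
                : Level.OK;
    }

    /** Compares getWalSyncAvgNanos() with a positive baseline (assumed WAL-enabled). */
    static Level syncLatencyLevel(final long baselineNanos, final long currentNanos) {
        if (baselineNanos <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        if (currentNanos > 4.0 * baselineNanos) {
            return Level.ROLLBACK;
        }
        return currentNanos > 2.0 * baselineNanos ? Level.WARNING : Level.OK;
    }
}
```

With the Phase 1 setting of 512 MiB for maxBytesBeforeForcedCheckpoint, retainedBytesLevel would report a warning once retained bytes exceed roughly 409.6 MiB.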
Rollback Procedure (to Wal.EMPTY)
- Stop traffic to affected canary indexes (or switch to read-only).
- Take a filesystem backup/snapshot of affected index directories.
- Reopen indexes with WAL disabled override:
```java
IndexConfiguration<String, String> rollbackConf = IndexConfiguration
        .<String, String>builder()
        .withKeyClass(String.class)
        .withValueClass(String.class)
        .withName("orders-canary")
        .withWal(Wal.EMPTY)
        .build();

try (SegmentIndex<String, String> index = SegmentIndex.open(directory, rollbackConf)) {
    index.flushAndWait();
}
```
- Run integrity checks:
  - index.checkAndRepairConsistency()
  - point-read spot checks on business keys
- Keep wal/ files for incident forensics until postmortem is complete.
- Resume traffic only after checks pass.
Canary Acceptance Criteria
Promote to next stage only when all are true:
- No increase in getWalSyncFailureCount().
- No increase in getWalCorruptionCount().
- No unexpected getWalTruncationCount() increments.
- getWalRetainedBytes() remains below forced-checkpoint threshold with headroom.
- Write latency SLO remains within agreed variance from baseline.
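The criteria above can be combined into a single promotion check. This is a minimal sketch, assuming the caller has already captured counter readings at the start and end of the observation window: WalCanaryGate, Counters, and canPromote are hypothetical names, and headroomFraction (for example 0.2 to stay below 80% of the limit) is an assumed way to quantify "with headroom", which this runbook does not define.

```java
/** Hypothetical promotion gate for the canary acceptance criteria. */
final class WalCanaryGate {

    /** Counter readings captured at one point in time. */
    static final class Counters {
        final long syncFailures;
        final long corruptions;
        final long unexpectedTruncations;

        Counters(final long syncFailures, final long corruptions,
                final long unexpectedTruncations) {
            this.syncFailures = syncFailures;
            this.corruptions = corruptions;
            this.unexpectedTruncations = unexpectedTruncations;
        }
    }

    private WalCanaryGate() {
    }

    /** Returns true only when every acceptance criterion holds. */
    static boolean canPromote(final Counters start, final Counters end,
            final long peakRetainedBytes, final long maxBytesBeforeForcedCheckpoint,
            final double headroomFraction, final boolean writeLatencySloMet) {
        final boolean countersStable = end.syncFailures <= start.syncFailures
                && end.corruptions <= start.corruptions
                && end.unexpectedTruncations <= start.unexpectedTruncations;
        // Assumed headroom rule: peak must stay within (1 - headroom) of the limit.
        final boolean retainedOk = peakRetainedBytes
                <= (1.0 - headroomFraction) * maxBytesBeforeForcedCheckpoint;
        return countersStable && retainedOk && writeLatencySloMet;
    }
}
```

Whether the write latency SLO is met is passed in as a boolean because the agreed variance is workload-specific.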
Incident Data to Capture
For any rollback-triggering event, capture:
- wal_verify --json output
- wal_dump --json output around the failing segment
- index runtime metrics snapshot around the event window
- precise software version and commit hash