WAL Replication and Fencing Design (Phase 6 Prep)
This document defines the design contract for Phase 6:
- WAL log shipping (leader -> replicas)
- epoch/term-based fencing
- replicated durability acknowledgement modes
It is a preparation artifact: implementation can proceed without reopening core design decisions.
Status
- Current WAL v1 scope is local durability only.
- This document defines the extension path to distributed durability.
Goals
- Replicate WAL records by LSN range to follower nodes.
- Prevent stale leaders/writers from appending after leadership changes.
- Provide explicit durability levels: local-only vs replicated acknowledgements.
- Keep backward compatibility for existing local-WAL deployments.
Non-Goals
- Building a new consensus algorithm inside HestiaStore.
- Global multi-index distributed transactions.
- Cross-region conflict resolution in v1 replication.
Leader election is assumed to be provided by external cluster coordination.
Locked Assumptions
- One WAL per index directory (`<index>/wal`).
- WAL remains opt-in (`withWal(...)`), default `Wal.EMPTY`.
- Local durability behavior must remain unchanged when replication is disabled.
- WAL v1 record framing/checksum rules stay unchanged; replication metadata uses extension fields.
Replication Model
For one index:
- Exactly one leader is allowed to append at a time.
- Followers accept leader WAL records in LSN order.
- Followers verify record checksum/structure before append.
- Followers expose `lastReplicatedLsn` and `lastDurableLsn`.
WAL Header Extension Contract
Use the reserved header extension area (already planned via `epochSupport`):
- `epoch` (or `term`): monotonically increasing leadership epoch.
- `sourceNodeId`: stable identifier of the leader node.
- `flags` bit for replication metadata presence.
Rules:
- `epochSupport=false` keeps the extension disabled and local behavior unchanged.
- When replication is enabled, `epochSupport=true` is mandatory.
- Records with missing required replication metadata are rejected in replicated mode.
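As a concrete illustration of the extension contract, the sketch below encodes and decodes the replication metadata fields named above. The byte layout, the flag value, and the field widths are assumptions for illustration; only the field names (`epoch`, `sourceNodeId`, `flags`) come from this document, and WAL v1 framing/checksum rules are untouched.

```java
import java.nio.ByteBuffer;

// Hypothetical layout for the reserved header extension area.
// Assumed layout: 1 flags byte, 8-byte epoch, 4-byte sourceNodeId.
final class WalHeaderExtension {
    // Assumed flag bit signaling that replication metadata is present.
    static final byte FLAG_REPLICATION = 0x01;

    final long epoch;        // monotonically increasing leadership epoch
    final int sourceNodeId;  // stable identifier of the leader node

    WalHeaderExtension(long epoch, int sourceNodeId) {
        this.epoch = epoch;
        this.sourceNodeId = sourceNodeId;
    }

    byte[] encode() {
        ByteBuffer buf = ByteBuffer.allocate(1 + Long.BYTES + Integer.BYTES);
        buf.put(FLAG_REPLICATION).putLong(epoch).putInt(sourceNodeId);
        return buf.array();
    }

    // Returns null when the replication flag is absent
    // (i.e. an epochSupport=false record with the extension disabled).
    static WalHeaderExtension decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        byte flags = buf.get();
        if ((flags & FLAG_REPLICATION) == 0) {
            return null;
        }
        return new WalHeaderExtension(buf.getLong(), buf.getInt());
    }
}
```

In replicated mode, a `decode(...)` result of `null` corresponds to the "missing required replication metadata" rejection rule above.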
Fencing Contract
- Every append request must include current leader epoch.
- Runtime stores `currentEpoch` durably.
- Append is rejected when request epoch < stored epoch.
- On epoch bump:
- new epoch is persisted before accepting writes
- previous leader/writer sessions are invalidated
Failure behavior:
- If epoch persistence fails, index enters error state and rejects new writes.
- No write acknowledgement is allowed after failed epoch transition.
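The fencing rules above can be sketched as follows. `EpochStore` is a hypothetical durable-store interface; real epoch persistence and writer-session invalidation are out of scope here. This is a sketch of the contract, not the implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the fencing contract: persist-before-accept on epoch bump,
// stale-epoch rejection, and an error state after a failed transition.
final class WalFencing {
    interface EpochStore { void persist(long epoch) throws Exception; }

    private final EpochStore store;
    private final AtomicLong currentEpoch = new AtomicLong(0);
    private volatile boolean failed = false; // set after failed epoch transition

    WalFencing(EpochStore store) { this.store = store; }

    // Append admission: reject stale epochs, and reject everything
    // once an epoch transition has failed.
    boolean admitAppend(long requestEpoch) {
        return !failed && requestEpoch >= currentEpoch.get();
    }

    // Epoch bump: the new epoch is persisted before any write under it
    // is accepted.
    void bumpEpoch(long newEpoch) {
        if (newEpoch <= currentEpoch.get()) {
            throw new IllegalArgumentException("epoch must increase");
        }
        try {
            store.persist(newEpoch);   // durable before visible
            currentEpoch.set(newEpoch);
        } catch (Exception e) {
            failed = true;             // index enters error state
            throw new IllegalStateException("epoch transition failed", e);
        }
    }
}
```

Note that `admitAppend` only checks admission; acknowledging the write is still gated by the durability mode in effect.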
Log Shipping Protocol (Minimal v1)
Logical RPC-level contract:
- `Fetch(fromLsn, maxBytes)` -> stream/list of WAL records.
- `AppendReplicated(records, epoch, sourceNodeId)` -> follower append result.
- `Ack(lastDurableLsn, epoch)` -> follower durable progress.
Follower validation on AppendReplicated:
- Epoch must match expected leader epoch.
- First record LSN must be contiguous with follower WAL tail (or recovery point).
- Each record checksum and framing must be valid.
- Invalid tail handling follows configured corruption policy.
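The follower-side checks can be sketched like this. `WalRecord` and the CRC32 checksum are illustrative stand-ins for the actual WAL v1 record framing; LSN contiguity is modeled as strictly sequential (+1) for simplicity.

```java
import java.util.List;
import java.util.zip.CRC32;

// Sketch of follower validation for AppendReplicated: epoch match,
// LSN contiguity with the follower tail, and per-record checksums.
final class FollowerValidator {
    record WalRecord(long lsn, long crc, byte[] payload) {}

    static long crcOf(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return crc.getValue();
    }

    // Returns true when the batch may be appended; each check mirrors
    // one rule in the validation list above.
    static boolean validate(List<WalRecord> records, long requestEpoch,
                            long expectedEpoch, long followerTailLsn) {
        if (requestEpoch != expectedEpoch) return false;   // epoch match
        long expectedLsn = followerTailLsn + 1;            // contiguity
        for (WalRecord r : records) {
            if (r.lsn() != expectedLsn) return false;
            if (crcOf(r.payload()) != r.crc()) return false; // checksum/framing
            expectedLsn++;
        }
        return true;
    }
}
```

A rejected batch would then be routed to the configured corruption/retry policy rather than appended.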
Durability Acknowledgement Modes (Replication-Aware)
Keep existing local modes and add replicated interpretation:
- `LOCAL_ASYNC`: local append only.
- `LOCAL_GROUP_SYNC`: local group fsync only.
- `LOCAL_SYNC`: local sync per write.
- `REPLICATED_QUORUM`: acknowledge after local durability + replica quorum durable ack.
- `REPLICATED_ALL`: acknowledge after all in-sync replicas durable ack.
Initial rollout recommendation:
- Enable replication with `LOCAL_GROUP_SYNC` first (observe),
- then promote selected indexes to `REPLICATED_QUORUM`.
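The `REPLICATED_QUORUM` gate can be sketched as follows: each follower's `Ack(lastDurableLsn, epoch)` updates a per-replica watermark, and a write at LSN n is acknowledgeable once a majority of replicas reports a durable LSN >= n. The majority definition and the separation of the leader's local durability are assumptions here.

```java
import java.util.Arrays;

// Sketch of replica ack tracking for REPLICATED_QUORUM gating.
// Leader-local durability is assumed to be gated separately.
final class QuorumAckTracker {
    private final long[] replicaDurableLsn; // lastDurableLsn per replica

    QuorumAckTracker(int replicaCount) {
        replicaDurableLsn = new long[replicaCount];
        Arrays.fill(replicaDurableLsn, -1L); // nothing acked yet
    }

    // Called on each Ack(lastDurableLsn, epoch) from a follower;
    // watermarks only move forward.
    void onAck(int replicaId, long lastDurableLsn) {
        replicaDurableLsn[replicaId] =
                Math.max(replicaDurableLsn[replicaId], lastDurableLsn);
    }

    // Quorum here means a strict majority of replicas (assumption).
    boolean quorumDurable(long lsn) {
        int needed = replicaDurableLsn.length / 2 + 1;
        int have = 0;
        for (long durable : replicaDurableLsn) {
            if (durable >= lsn) have++;
        }
        return have >= needed;
    }
}
```

`REPLICATED_ALL` would be the same structure with `needed` equal to the in-sync replica count.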
Recovery and Failover Rules
- New leader must not serve writes until:
- epoch bump persisted
- log position selected from acknowledged durable LSN boundary
- Followers that are ahead of chosen leader point must truncate to leader boundary.
- Safe-tail truncation rules remain the same as local WAL recovery.
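A minimal sketch of the boundary selection and follower truncation rules above, with a list of LSNs standing in for the actual WAL tail. Taking the boundary as the minimum of the new leader's durable LSN and the quorum-acknowledged LSN is an assumption consistent with "log position selected from acknowledged durable LSN boundary".

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of failover boundary selection and divergent-tail truncation.
final class FailoverBoundary {
    // Boundary: no higher than what the new leader holds durably,
    // and no higher than what the ack quorum confirmed (assumption).
    static long selectBoundary(long leaderDurableLsn, long quorumAckedLsn) {
        return Math.min(leaderDurableLsn, quorumAckedLsn);
    }

    // Follower ahead of the boundary drops records beyond it.
    static List<Long> truncateTail(List<Long> followerLsns, long boundary) {
        List<Long> kept = new ArrayList<>();
        for (long lsn : followerLsns) {
            if (lsn <= boundary) kept.add(lsn);
        }
        return kept;
    }
}
```

Truncation below the boundary then falls back to the existing local safe-tail recovery rules.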
Compatibility and Migration
- Existing indexes with `wal.enabled=false` remain unaffected.
- Existing WAL-enabled local indexes can stay local (`epochSupport=false`).
- Replication enablement is an explicit configuration migration:
  - set replication settings
  - set `epochSupport=true`
- Downgrade path:
  - disable replication mode
  - keep local WAL enabled
Observability Requirements
Add replication metrics:
- `walReplicationSentRecords`, `walReplicationSentBytes`
- `walReplicationLagLsn` (leader durable LSN - follower durable LSN)
- `walReplicationAckLatencyNanos` (histogram)
- `walFencingRejectCount`
- `walEpoch`
- `walReplicaInSyncCount`
Add structured events:
- `event=wal_epoch_bump`
- `event=wal_fencing_reject`
- `event=wal_replication_append_reject`
- `event=wal_replication_lag_threshold_exceeded`
Security Requirements
- Replication transport must support TLS.
- Node identity must be authenticated before accepting replicated append.
- Epoch transition operations must be auditable.
- Do not disable checksum validation for replicated traffic.
Phase 6 Implementation Plan (Execution Order)
P6.1 Metadata and configuration
- Finalize config schema for replication and epoch controls.
- Add validation rules (`epochSupport=true` required when replication enabled).
- Add manifest read/write compatibility tests.
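The P6.1 validation rule can be sketched as a single invariant check at configuration load time. The method and parameter names are illustrative, not the actual config schema.

```java
// Sketch of the P6.1 config invariant: replication requires epochSupport=true.
final class ReplicationConfigValidator {
    static void validate(boolean replicationEnabled, boolean epochSupport) {
        if (replicationEnabled && !epochSupport) {
            throw new IllegalArgumentException(
                "epochSupport=true is required when replication is enabled");
        }
    }
}
```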
P6.2 Fencing core
- Persist and enforce epoch in WAL runtime.
- Reject stale epoch writes.
- Add unit tests for epoch transitions and stale writer rejection.
P6.3 Log shipping channel
- Implement WAL record fetch by LSN range.
- Implement follower append endpoint with strict validation.
- Add integration tests for contiguous/invalid stream behavior.
P6.4 Ack and durability policy
- Implement replica ack tracking.
- Add
REPLICATED_QUORUMack gating. - Add tests for ack loss and timeout/fallback behavior.
P6.5 Failover correctness
- Leader handoff with epoch bump and boundary selection.
- Deterministic truncation of divergent tails on followers.
- Add failover simulation tests (leader crash, partition, rejoin).
P6.6 Operational hardening
- Metrics, logs, and alerts for lag/fencing.
- Runbook for replication incident handling.
- Canary rollout with guarded enablement.
Acceptance Criteria for Phase 6
- No stale leader writes accepted after epoch change.
- Deterministic failover replay from acknowledged durable boundary.
- Replicated durability modes match documented acknowledgement semantics.
- WAL corruption/tail-repair behavior remains safe under replication.