WAL
HestiaStore supports opt-in write-ahead logging (WAL) per index. Enable WAL when you need local crash recovery stronger than the guarantees that flush/close boundaries alone provide.
For staged rollout, use the WAL Canary Runbook. For longer-term replication design work, see the WAL Replication and Fencing Design.
Enable WAL
```java
Wal wal = Wal.builder()
        .withDurabilityMode(WalDurabilityMode.GROUP_SYNC)
        .build();

IndexConfiguration<String, String> conf = IndexConfiguration
        .<String, String>builder()
        .withKeyClass(String.class)
        .withValueClass(String.class)
        .withName("orders")
        .withWal(wal)
        .build();
```
WAL is disabled by default through `Wal.EMPTY`.
Choose a durability mode
- `ASYNC`: lowest write latency, weakest durability guarantee
- `GROUP_SYNC`: batched fsync behavior with a balanced latency/durability trade-off
- `SYNC`: fsync on each write, strongest durability and highest write overhead
Choose the mode from your durability target first, then tune performance around that choice.
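The trade-off can be made concrete with a small sketch. This is not HestiaStore's internals; only the mode names match the enum above, and the batching behavior is an illustrative assumption. It counts how many fsync calls a run of appends would issue under each mode.

```java
// Illustrative only: models how often each durability mode would fsync.
public final class SyncPolicySketch {

    public enum Mode { ASYNC, GROUP_SYNC, SYNC }

    /**
     * Returns the number of fsync calls issued for `appends` writes,
     * assuming GROUP_SYNC batches `batchSize` appends per fsync.
     */
    public static int fsyncCount(Mode mode, int appends, int batchSize) {
        switch (mode) {
            case SYNC:
                return appends; // fsync on every write
            case GROUP_SYNC:
                return (appends + batchSize - 1) / batchSize; // one fsync per batch
            case ASYNC:
            default:
                return 0; // durability deferred to a background flush
        }
    }
}
```

For 10 appends with a batch size of 4, `SYNC` issues 10 fsyncs, `GROUP_SYNC` issues 3, and `ASYNC` issues none on the write path, which is why the latency ordering in the list above holds.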
Corruption and recovery policy
- `TRUNCATE_INVALID_TAIL`: startup truncates a broken tail and continues
- `FAIL_FAST`: startup stops when corruption is detected
- Recovery validates global LSN monotonicity across WAL segments
- Invalid `wal/checkpoint.meta` content aborts recovery
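A minimal sketch of the two policies, assuming a hypothetical decoded-record shape (this is not the engine's actual recovery code): `TRUNCATE_INVALID_TAIL` keeps the longest checksum-valid, LSN-monotonic prefix, while `FAIL_FAST` rejects the log on the first problem.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: the record shape and method names are hypothetical.
public final class RecoveryPolicySketch {

    public record WalRecord(long lsn, boolean checksumOk) {}

    /** TRUNCATE_INVALID_TAIL: keep the longest valid prefix with strictly increasing LSNs. */
    public static List<WalRecord> truncateInvalidTail(List<WalRecord> records) {
        List<WalRecord> valid = new ArrayList<>();
        long lastLsn = -1;
        for (WalRecord r : records) {
            if (!r.checksumOk() || r.lsn() <= lastLsn) {
                break; // the broken tail starts here; drop the rest
            }
            valid.add(r);
            lastLsn = r.lsn();
        }
        return valid;
    }

    /** FAIL_FAST: any corruption or LSN regression aborts recovery. */
    public static List<WalRecord> failFast(List<WalRecord> records) {
        List<WalRecord> valid = truncateInvalidTail(records);
        if (valid.size() != records.size()) {
            throw new IllegalStateException("WAL corruption detected; refusing to start");
        }
        return valid;
    }
}
```

Pick `FAIL_FAST` when silently losing tail records is worse than downtime; pick `TRUNCATE_INVALID_TAIL` when availability matters more than the last partial writes.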
WAL directory layout
Inside the index directory:
- `wal/format.meta`
- `wal/checkpoint.meta`
- `wal/*.wal`

WAL segment files are named `<20-digit-base-lsn>.wal`.
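Zero-padding to a fixed 20 digits means segment files sort lexicographically in LSN order, so a plain directory listing yields replay order. A one-line sketch of the format (the helper name is ours, not the library's):

```java
public final class WalSegmentNames {
    /** Zero-pads the segment's base LSN to 20 digits, e.g. 42 -> "00000000000000000042.wal". */
    public static String segmentFileName(long baseLsn) {
        return String.format("%020d.wal", baseLsn);
    }
}
```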
Tooling
`WalTool` supports:

- `verify` for integrity checks
- `dump` for record-level diagnostics
Run it from compiled classes:
```shell
java -cp engine/target/classes org.hestiastore.index.segmentindex.wal.WalTool verify /path/to/index/wal
java -cp engine/target/classes org.hestiastore.index.segmentindex.wal.WalTool dump /path/to/index/wal
```
Or use the packaged CLI:
```shell
mvn -pl wal-tools -am package
unzip wal-tools/target/wal-tools-<version>.zip -d /tmp
/tmp/wal-tools-<version>/bin/wal_verify /path/to/index/wal
/tmp/wal-tools-<version>/bin/wal_dump /path/to/index/wal
```
JSON output is available through `--json` for both commands.
Exit codes:
- `0`: success
- `1`: usage or runtime failure
- `2`: `verify` found WAL issues
Operating signals
Monitor these first:
- `getWalSyncFailureCount()`
- `getWalCorruptionCount()`
- `getWalTruncationCount()`
- `getWalRetainedBytes()`
- `getWalCheckpointLagLsn()`
- `getWalPendingSyncBytes()`
- `getWalSyncAvgNanos()`
When retained WAL exceeds `maxBytesBeforeForcedCheckpoint`, the write path forces a checkpoint and applies backpressure until the retained WAL size drops.
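That retention rule can be sketched as follows. This is illustrative only: the method names are ours, and the assumption that each checkpoint releases a fixed number of bytes is a simplification.

```java
// Illustrative only: models the forced-checkpoint / backpressure decision.
public final class RetentionPressureSketch {

    /** True when retained WAL exceeds the configured limit. */
    public static boolean underPressure(long retainedBytes, long maxBytes) {
        return retainedBytes > maxBytes;
    }

    /**
     * Simulates the write path: while under pressure, force checkpoints
     * (each assumed to release `checkpointedBytes`) before admitting the write.
     */
    public static int forcedCheckpointsBeforeWrite(long retainedBytes,
                                                   long maxBytes,
                                                   long checkpointedBytes) {
        int checkpoints = 0;
        while (underPressure(retainedBytes, maxBytes)) {
            retainedBytes -= checkpointedBytes; // checkpoint lets old segments be deleted
            checkpoints++;
        }
        return checkpoints;
    }
}
```

Watching `getWalRetainedBytes()` against the limit tells you how close writers are to being throttled.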
Structured log events include:
- `event=wal_recovery_start`
- `event=wal_recovery_invalid_tail`
- `event=wal_recovery_tail_repair`
- `event=wal_recovery_drop_newer_segments`
- `event=wal_recovery_checkpoint_clamp`
- `event=wal_recovery_complete`
- `event=wal_checkpoint_cleanup`
- `event=wal_retention_pressure_start`
- `event=wal_retention_pressure_cleared`
- `event=wal_sync_failure`
- `event=wal_sync_failure_transition`
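A trivial way to pull these events out of a captured log, assuming plain `event=...` key-value formatting on each line (the helper is ours, not part of the library):

```java
import java.util.List;
import java.util.stream.Collectors;

public final class WalEventFilter {
    /**
     * Keeps only the log lines carrying the given WAL event marker.
     * Note: a plain substring match on "wal_sync_failure" also matches
     * "wal_sync_failure_transition" lines; match the full token if that matters.
     */
    public static List<String> linesWithEvent(List<String> logLines, String event) {
        String needle = "event=" + event;
        return logLines.stream()
                .filter(line -> line.contains(needle))
                .collect(Collectors.toList());
    }
}
```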
Metrics exposed by `metricsSnapshot()`
- throughput: `getWalAppendCount()`, `getWalAppendBytes()`
- durability: `getWalSyncCount()`, `getWalSyncFailureCount()`, `getWalDurableLsn()`
- corruption and recovery: `getWalCorruptionCount()`, `getWalTruncationCount()`
- retention and checkpointing: `getWalRetainedBytes()`, `getWalSegmentCount()`, `getWalCheckpointLsn()`, `getWalCheckpointLagLsn()`
- pending work: `getWalPendingSyncBytes()`, `getWalAppliedLsn()`
- sync latency and batch sizing: `getWalSyncTotalNanos()`, `getWalSyncMaxNanos()`, `getWalSyncAvgNanos()`, `getWalSyncBatchBytesTotal()`, `getWalSyncBatchBytesMax()`, `getWalSyncAvgBatchBytes()`
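The `Avg` metrics are presumably the totals divided by the sync count; the source does not state this, so treat the following as an assumption. A zero-guarded sketch of that relationship (our helper, not a library API):

```java
public final class WalMetricsMath {
    // ASSUMPTION: avg metrics are derived as total / sync count.

    /** Average sync latency in nanos, or 0 when no syncs have happened yet. */
    public static long avgSyncNanos(long syncTotalNanos, long syncCount) {
        return syncCount == 0 ? 0 : syncTotalNanos / syncCount;
    }

    /** Average bytes made durable per fsync batch. */
    public static long avgBatchBytes(long batchBytesTotal, long syncCount) {
        return syncCount == 0 ? 0 : batchBytesTotal / syncCount;
    }
}
```

Rising average batch bytes alongside stable average sync nanos usually indicates healthy group-commit batching; rising sync nanos with flat batch sizes points at the disk.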