Monitoring
This document describes runtime monitoring for HestiaStore indexes, with special focus on WAL-enabled deployments.
For rollout and rollback procedures, see WAL Canary Runbook.
Metrics Source
Use SegmentIndex.metricsSnapshot() as the canonical in-process source.
Export these values into your monitoring stack (Micrometer, Prometheus, etc.) at a fixed scrape interval.
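A fixed scrape interval matters because count-style fields are most useful as rates. The sketch below shows one way to turn two consecutive samples of a counter into a per-second rate. It assumes the counters are cumulative since the index was opened, which this document does not state explicitly. `CounterRate` is a hypothetical helper and not part of HestiaStore.

```java
import java.time.Duration;

/** Hypothetical helper: derives a per-second rate from two samples of a cumulative counter. */
final class CounterRate {
    private CounterRate() {}

    /**
     * @param previous counter value read at the previous scrape
     * @param current  counter value read at the current scrape
     * @param interval time between the two scrapes; must be positive
     * @return events per second, or 0.0 if the counter went backwards
     */
    static double perSecond(long previous, long current, Duration interval) {
        if (interval.isZero() || interval.isNegative()) {
            throw new IllegalArgumentException("interval must be positive");
        }
        long delta = current - previous;
        if (delta < 0) {
            // Assumption: a decrease means the counter was reset (for example, the
            // index was reopened), so skip the sample instead of reporting a negative rate.
            return 0.0;
        }
        double seconds = interval.toNanos() / 1_000_000_000.0;
        return delta / seconds;
    }
}
```

For example, a counter that moves from 100 to 400 across a 15-second scrape interval corresponds to 20 operations per second. Most metrics backends can compute this server-side, so a helper like this is mainly useful for in-process checks.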
Core Index Signals
- Throughput: getOperationCount, putOperationCount, deleteOperationCount
- Cache behavior: registryCacheHitCount, registryCacheMissCount, registryCacheEvictionCount
- Latency: readLatencyP50/P95/P99Micros, writeLatencyP50/P95/P99Micros
- State: state (OPENING, READY, CLOSING, ERROR, CLOSED)
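The cache counters above are usually easier to read on a dashboard as a hit ratio. A minimal sketch, assuming registryCacheHitCount and registryCacheMissCount are sampled over the same window; `CacheHitRatio` is a hypothetical helper name:

```java
/** Hypothetical helper: hit ratio derived from registry cache hit and miss counts. */
final class CacheHitRatio {
    private CacheHitRatio() {}

    /**
     * @param hits   value (or delta) of registryCacheHitCount
     * @param misses value (or delta) of registryCacheMissCount
     * @return a ratio in [0.0, 1.0]; 0.0 when no lookups were recorded
     */
    static double of(long hits, long misses) {
        long total = hits + misses;
        if (total == 0) {
            return 0.0; // avoid dividing by zero before any lookup has happened
        }
        return (double) hits / total;
    }
}
```

Passing deltas between two scrapes rather than lifetime totals gives a ratio that reflects recent behavior.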
Partition Overlay Signals
For the range-partitioned ingest runtime, treat these as the primary backpressure and drain indicators:
- Buffered overlay pressure:
  - getPartitionBufferedKeyCount()
  - getImmutableRunCount()
  - getDrainingPartitionCount()
- Capacity and routing shape:
  - getPartitionCount()
  - getActivePartitionCount()
  - getMaxNumberOfKeysInActivePartition()
  - getMaxNumberOfKeysInPartitionBuffer()
  - getMaxNumberOfKeysInIndexBuffer()
  - getMaxNumberOfImmutableRunsPerPartition()
- Throttling:
  - getLocalThrottleCount()
  - getGlobalThrottleCount()
- Drain activity:
  - getDrainScheduleCount()
  - getDrainInFlightCount()
  - getDrainLatencyP95Micros()
Compatibility note:
- splitScheduleCount, splitInFlightCount, maintenanceQueueSize, and related legacy queue fields are still emitted for older clients.
- New dashboards and alerts should prefer the explicit partition fields above.
WAL Signals
Use these fields whenever isWalEnabled() is true:
- Append throughput:
  - getWalAppendCount()
  - getWalAppendBytes()
- Sync health:
  - getWalSyncCount()
  - getWalSyncFailureCount()
  - getWalSyncAvgNanos()
  - getWalSyncMaxNanos()
  - getWalSyncAvgBatchBytes()
  - getWalSyncBatchBytesMax()
- Recovery/corruption:
  - getWalCorruptionCount()
  - getWalTruncationCount()
- Retention/checkpoint:
  - getWalRetainedBytes()
  - getWalSegmentCount()
  - getWalCheckpointLsn()
  - getWalAppliedLsn()
  - getWalCheckpointLagLsn()
- Backlog:
  - getWalPendingSyncBytes()
Suggested Alerts
Start with these baseline alerts and tune per workload:
- wal sync failures:
  - condition: getWalSyncFailureCount() increases
  - severity: critical
- wal corruption detected:
  - condition: getWalCorruptionCount() increases
  - severity: critical
- unexpected wal truncation:
  - condition: getWalTruncationCount() increases outside controlled recovery
  - severity: high
- wal retention pressure:
  - condition: getWalRetainedBytes() exceeds 80% of configured maxBytesBeforeForcedCheckpoint for 10 minutes
  - severity: warning
- wal checkpoint lag growth:
  - condition: getWalCheckpointLagLsn() grows continuously for 10+ minutes
  - severity: warning
- wal pending sync growth:
  - condition: getWalPendingSyncBytes() grows without recovery for 10+ minutes
  - severity: warning
- partition overlay backlog growth:
  - condition: getPartitionBufferedKeyCount() and getImmutableRunCount() grow continuously without returning to baseline
  - severity: warning
- partition drain latency spike:
  - condition: getDrainLatencyP95Micros() remains elevated above workload baseline for 10+ minutes
  - severity: warning
- partition throttling:
  - condition: getLocalThrottleCount() or getGlobalThrottleCount() increases steadily
  - severity: warning
- index stuck closing:
  - condition: state == CLOSING for longer than the expected shutdown window
  - severity: warning
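Several of the alerts above fire only when a condition holds for a sustained period (for example, retained WAL bytes above 80% of the limit for 10 minutes). Alerting systems normally provide this natively. If you need an in-process equivalent, the sketch below shows one way to track it. `SustainedCondition` is a hypothetical class, and the caller is assumed to supply a monotonic timestamp in milliseconds at each scrape.

```java
import java.time.Duration;

/** Hypothetical helper: reports true only after a condition has held continuously for a given duration. */
final class SustainedCondition {
    private final long holdMillis;
    private boolean breached;          // whether the condition held at the previous update
    private long breachedSinceMillis;  // timestamp of the first consecutive breach

    SustainedCondition(Duration hold) {
        this.holdMillis = hold.toMillis();
    }

    /**
     * @param conditionHolds whether the raw condition is true at this scrape
     * @param nowMillis      timestamp of this scrape in milliseconds
     * @return true once the condition has held for at least the configured duration
     */
    boolean update(boolean conditionHolds, long nowMillis) {
        if (!conditionHolds) {
            breached = false; // any recovery restarts the clock
            return false;
        }
        if (!breached) {
            breached = true;
            breachedSinceMillis = nowMillis;
        }
        return nowMillis - breachedSinceMillis >= holdMillis;
    }
}
```

For the retention-pressure alert, the raw condition at each scrape would be something like `retainedBytes > 0.8 * maxBytesBeforeForcedCheckpoint`, evaluated against a hold duration of `Duration.ofMinutes(10)`.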
Structured Logs
Parse and index these WAL events:
- Recovery and repair:
  - event=wal_recovery_start
  - event=wal_recovery_invalid_tail
  - event=wal_recovery_tail_repair
  - event=wal_recovery_drop_newer_segments
  - event=wal_recovery_checkpoint_clamp
  - event=wal_recovery_complete
- Checkpoint and retention:
  - event=wal_checkpoint_cleanup
  - event=wal_retention_pressure_start
  - event=wal_retention_pressure_cleared
- Sync failures:
  - event=wal_sync_failure
  - event=wal_sync_failure_transition
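The events above can be extracted with a simple key=value parser when your log pipeline does not already do so. This is a rough sketch only. It assumes each log line carries whitespace-separated key=value tokens without quoted values, which this document does not specify. `WalLogLine` and the extra fields in the usage note are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: splits a structured log line into key=value fields. */
final class WalLogLine {
    private WalLogLine() {}

    /** Parses whitespace-separated key=value tokens; tokens without a key are ignored. */
    static Map<String, String> parse(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) {
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    /** Returns true when the line carries an event field whose value starts with "wal_". */
    static boolean isWalEvent(Map<String, String> fields) {
        String event = fields.get("event");
        return event != null && event.startsWith("wal_");
    }
}
```

For a line such as `event=wal_sync_failure index=orders attempt=3` (the `index` and `attempt` fields are invented for illustration), `parse` returns three fields and `isWalEvent` returns true.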
Dashboard Minimum
At minimum, create one dashboard per WAL-enabled index with:
- Write latency (P50/P95/P99) and throughput.
- WalSyncAvgNanos, WalSyncMaxNanos, WalSyncCount.
- WalRetainedBytes, WalSegmentCount, WalCheckpointLagLsn.
- WalPendingSyncBytes.
- Counters for WalSyncFailureCount, WalCorruptionCount, WalTruncationCount.
- PartitionBufferedKeyCount, ImmutableRunCount, DrainingPartitionCount, DrainInFlightCount.
- Index state timeline (OPENING/READY/CLOSING/ERROR/CLOSED).