Monitoring

This document describes runtime monitoring for HestiaStore indexes, with special focus on WAL-enabled deployments.

For rollout and rollback procedures, see the WAL Canary Runbook.

Metrics Source

Use SegmentIndex.metricsSnapshot() as the canonical in-process source of metrics. Export its values to your monitoring stack (for example Micrometer or Prometheus) at a fixed scrape interval.
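Fixed-interval export can be sketched as follows. This is a minimal illustration, not HestiaStore code: SnapshotPoller is a hypothetical helper, the snapshot type is left generic because its concrete class is not described here, and in practice the supplier would wrap SegmentIndex.metricsSnapshot() while the sink would update your metrics registry.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Supplier;

// Hypothetical helper, not part of HestiaStore: polls a snapshot source at a
// fixed interval and forwards each snapshot to an exporter callback.
final class SnapshotPoller<S> implements AutoCloseable {
    private final Supplier<S> source;   // assumed: () -> index.metricsSnapshot()
    private final Consumer<S> sink;     // assumed: code that updates gauges/counters
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "metrics-poller");
                t.setDaemon(true);      // do not keep the JVM alive for polling
                return t;
            });

    SnapshotPoller(Supplier<S> source, Consumer<S> sink) {
        this.source = source;
        this.sink = sink;
    }

    // Takes one snapshot and exports it. Runtime failures are contained so a
    // single bad sample does not cancel the periodic schedule.
    boolean pollOnce() {
        try {
            sink.accept(source.get());
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    // Starts polling immediately, then once per interval.
    void start(long interval, TimeUnit unit) {
        scheduler.scheduleAtFixedRate(this::pollOnce, 0, interval, unit);
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Catching the exception inside pollOnce matters because scheduleAtFixedRate suppresses all later runs once a task throws.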

Core Index Signals

Throughput:
getOperationCount, putOperationCount, deleteOperationCount
Cache behavior:
registryCacheHitCount, registryCacheMissCount, registryCacheEvictionCount
Latency:
readLatencyP50Micros, readLatencyP95Micros, readLatencyP99Micros
writeLatencyP50Micros, writeLatencyP95Micros, writeLatencyP99Micros
State:
state (OPENING, READY, CLOSING, ERROR, CLOSED)
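The cache counters are most useful as a ratio over one scrape interval. The sketch below assumes registryCacheHitCount and registryCacheMissCount are cumulative counters (an assumption; this page does not state it). CacheHitRatio is a hypothetical helper name.

```java
// Hypothetical helper: derives a registry-cache hit ratio for one scrape
// interval from two samples of the (assumed cumulative) hit and miss counters.
final class CacheHitRatio {
    private CacheHitRatio() {}

    // Returns hits / (hits + misses) for the interval, or NaN when there were
    // no lookups or a counter went backwards (for example after a restart).
    static double overInterval(long prevHits, long prevMisses,
                               long currHits, long currMisses) {
        long hits = currHits - prevHits;
        long misses = currMisses - prevMisses;
        if (hits < 0 || misses < 0 || hits + misses == 0) {
            return Double.NaN;
        }
        return (double) hits / (hits + misses);
    }
}
```

Computing the ratio from deltas rather than lifetime totals keeps a long-running index from masking a recent drop in hit rate.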

Partition Overlay Signals

For the range-partitioned ingest runtime, treat these as the primary backpressure and drain indicators:

Buffered overlay pressure:
getPartitionBufferedKeyCount()
getImmutableRunCount()
getDrainingPartitionCount()
Capacity and routing shape:
getPartitionCount()
getActivePartitionCount()
getMaxNumberOfKeysInActivePartition()
getMaxNumberOfKeysInPartitionBuffer()
getMaxNumberOfKeysInIndexBuffer()
getMaxNumberOfImmutableRunsPerPartition()
Throttling:
getLocalThrottleCount()
getGlobalThrottleCount()
Drain activity:
getDrainScheduleCount()
getDrainInFlightCount()
getDrainLatencyP95Micros()

Compatibility note:

splitScheduleCount, splitInFlightCount, maintenanceQueueSize, and related legacy queue fields are still emitted for older clients.
New dashboards and alerts should prefer the explicit partition fields above.
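One way to make the buffered-key signal comparable across indexes is to express it as a fraction of a configured limit. The sketch below assumes that getPartitionBufferedKeyCount() is an index-wide total and that getMaxNumberOfKeysInIndexBuffer() is the matching index-wide limit; this page does not confirm that pairing, so verify it before alerting on the result. OverlayPressure is a hypothetical helper name.

```java
// Hypothetical helper: expresses buffered-key pressure as a fraction of a
// configured limit so one threshold can be reused across indexes.
final class OverlayPressure {
    private OverlayPressure() {}

    // Returns bufferedKeys / maxBufferedKeys, never below 0.
    // Returns 0 when no positive limit is configured.
    // Values above 1.0 are kept so overshoot stays visible on a dashboard.
    static double utilization(long bufferedKeys, long maxBufferedKeys) {
        if (maxBufferedKeys <= 0) {
            return 0.0;
        }
        return Math.max(0.0, (double) bufferedKeys / maxBufferedKeys);
    }
}
```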

WAL Signals

Use these fields whenever isWalEnabled() returns true:

Append throughput:
getWalAppendCount()
getWalAppendBytes()
Sync health:
getWalSyncCount()
getWalSyncFailureCount()
getWalSyncAvgNanos()
getWalSyncMaxNanos()
getWalSyncAvgBatchBytes()
getWalSyncBatchBytesMax()
Recovery/corruption:
getWalCorruptionCount()
getWalTruncationCount()
Retention/checkpoint:
getWalRetainedBytes()
getWalSegmentCount()
getWalCheckpointLsn()
getWalAppliedLsn()
getWalCheckpointLagLsn()
Backlog:
getWalPendingSyncBytes()
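Append throughput is easier to read as a per-second rate than as a raw total. The sketch below assumes getWalAppendCount() and getWalAppendBytes() are cumulative counters that restart from zero when the index is reopened (an assumption, not stated on this page). CounterRate is a hypothetical helper name.

```java
// Hypothetical helper: converts two samples of a cumulative counter into a
// per-second rate for the interval between them.
final class CounterRate {
    private CounterRate() {}

    // prev and curr are consecutive samples; elapsedMillis is the time between
    // them. If the counter went backwards, it is treated as having restarted
    // from zero, so the current value is used as the delta.
    static double perSecond(long prev, long curr, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            return 0.0;
        }
        long delta = curr >= prev ? curr - prev : curr;
        return delta * 1000.0 / elapsedMillis;
    }
}
```

If the counters are exported to Prometheus as counters, the server-side rate() function performs the equivalent calculation, including reset handling, and this step is not needed in-process.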

Suggested Alerts

Start with these baseline alerts and tune per workload:

wal sync failures:
condition: getWalSyncFailureCount() increases
severity: critical
wal corruption detected:
condition: getWalCorruptionCount() increases
severity: critical
unexpected wal truncation:
condition: getWalTruncationCount() increases outside a controlled recovery
severity: high
wal retention pressure:
condition: getWalRetainedBytes() stays above 80% of the configured maxBytesBeforeForcedCheckpoint for 10 minutes
severity: warning
wal checkpoint lag growth:
condition: getWalCheckpointLagLsn() grows continuously for 10+ minutes
severity: warning
wal pending sync growth:
condition: getWalPendingSyncBytes() grows without recovery for 10+ minutes
severity: warning
partition overlay backlog growth:
condition: getPartitionBufferedKeyCount() and getImmutableRunCount() grow continuously without returning to baseline
severity: warning
partition drain latency spike:
condition: getDrainLatencyP95Micros() remains elevated above workload baseline for 10+ minutes
severity: warning
partition throttling:
condition: getLocalThrottleCount() or getGlobalThrottleCount() increases steadily
severity: warning
index stuck closing:
condition: state == CLOSING for longer than the expected shutdown window
severity: warning
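Several of the conditions above fire only when a state persists ("for 10 minutes", "for 10+ minutes"). If your alerting backend has no built-in hold duration, the logic can be sketched in-process as below. SustainedCondition is a hypothetical helper, and retentionPressure simply restates the 80% rule from the list; neither is part of HestiaStore.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper: fires only after a condition has held without
// interruption for a minimum duration, as in "above 80% for 10 minutes".
final class SustainedCondition {
    private final Duration holdFor;
    private Instant breachedSince;      // null while the condition is not met

    SustainedCondition(Duration holdFor) {
        this.holdFor = holdFor;
    }

    // Feed one evaluation per scrape; returns true when the alert should fire.
    boolean update(Instant now, boolean conditionMet) {
        if (!conditionMet) {
            breachedSince = null;       // any recovery restarts the clock
            return false;
        }
        if (breachedSince == null) {
            breachedSince = now;
        }
        return Duration.between(breachedSince, now).compareTo(holdFor) >= 0;
    }

    // The retention pressure check: retained bytes above 80% of the
    // configured forced-checkpoint limit.
    static boolean retentionPressure(long retainedBytes,
                                     long maxBytesBeforeForcedCheckpoint) {
        return retainedBytes > 0.8 * maxBytesBeforeForcedCheckpoint;
    }
}
```

A typical use is to keep one SustainedCondition per alert and call update(Instant.now(), retentionPressure(retained, limit)) on every scrape. Passing the timestamp in explicitly keeps the class testable without sleeping.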

Structured Logs

Parse and index these WAL events:

Recovery and repair:
event=wal_recovery_start
event=wal_recovery_invalid_tail
event=wal_recovery_tail_repair
event=wal_recovery_drop_newer_segments
event=wal_recovery_checkpoint_clamp
event=wal_recovery_complete
Checkpoint and retention:
event=wal_checkpoint_cleanup
event=wal_retention_pressure_start
event=wal_retention_pressure_cleared
Sync failures:
event=wal_sync_failure
event=wal_sync_failure_transition
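Extracting the event name is usually the first step before indexing. The sketch below assumes each log line carries a whitespace-delimited event=<name> token; only the event names themselves come from the list above, and the rest of the line layout is an assumption. WalLogEvents and its category labels are hypothetical.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the WAL event name out of a structured log line
// and maps it to one of the three groups listed above.
final class WalLogEvents {
    // Matches "event=wal_..." at the start of the line or after whitespace.
    private static final Pattern EVENT =
            Pattern.compile("(?:^|\\s)event=(wal_[a-z_]+)");

    private WalLogEvents() {}

    // Returns the event name, or empty if the line has no WAL event token.
    static Optional<String> eventName(String line) {
        Matcher m = EVENT.matcher(line);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    // Groups an event name by prefix, mirroring the list above.
    static String category(String event) {
        if (event.startsWith("wal_recovery_")) {
            return "recovery";
        }
        if (event.startsWith("wal_sync_failure")) {
            return "sync_failure";
        }
        if (event.startsWith("wal_checkpoint_") || event.startsWith("wal_retention_")) {
            return "checkpoint_retention";
        }
        return "other";
    }
}
```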

Dashboard Minimum

At minimum, create one dashboard per WAL-enabled index with:

Write latency (P50/P95/P99) and throughput.
getWalSyncAvgNanos(), getWalSyncMaxNanos(), getWalSyncCount().
getWalRetainedBytes(), getWalSegmentCount(), getWalCheckpointLagLsn().
getWalPendingSyncBytes().
Counters for getWalSyncFailureCount(), getWalCorruptionCount(), getWalTruncationCount().
getPartitionBufferedKeyCount(), getImmutableRunCount(), getDrainingPartitionCount(), getDrainInFlightCount().
Index state timeline (OPENING / READY / CLOSING / ERROR / CLOSED).