Segment API: Flush, Compact, and Split
Scope and assumptions
This document describes how to use the Segment API for maintenance operations.
The public Segment interface exposes flush() and compact(). Split is
orchestrated by the segment-index layer (SegmentSplitCoordinator +
SegmentSplitter) using Segment.openIterator(SegmentIteratorIsolation.FULL_ISOLATION).
Split behavior is documented here because it affects locking and iterator
semantics.
Segment is thread-safe by contract. The legacy lock-based adapter was removed;
its notes remain below for historical context. This document describes the
lock-free model used today; for the SegmentImpl state-machine rules, see
docs/architecture/segment/segment-concurrency.md.
Legacy (removed lock-based adapter)
Lock model (historical)
This section describes the removed adapter for context only.
Per-segment ReentrantReadWriteLock (via SegmentImplSynchronizationAdapter):
- Read lock: get(), and openIterator(FAIL_FAST) per hasNext()/next() call.
- Write lock: put(), flush(), compact(), invalidateIterators(), close().
  openIterator(FULL_ISOLATION) holds the write lock for its entire lifetime.
Optimistic iterator invalidation is driven by VersionController and used by
openIterator(FAIL_FAST).
A) Flush
What it does today (SegmentImpl):
- Snapshot the write cache via SegmentCache.freezeWriteCache().
- Write entries to a delta cache file (SegmentDeltaCacheWriter).
- Merge the frozen write cache into the delta cache and clear the frozen snapshot.
- Bump the version (via openDeltaCacheWriter()), invalidating fail-fast iterators.
- No-op when the write cache is empty.
When a new write arrives while flush() is running:
- The new write blocks on the write lock.
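The flush sequence above can be sketched with in-memory maps standing in for SegmentCache and the delta cache file. The class below is illustrative, not the real SegmentImpl API; synchronized models the exclusive write lock, which is why a concurrent put() blocks until flush() returns.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Minimal sketch of the flush steps: freeze, write/merge, bump version.
class FlushSketch {
    private Map<String, String> writeCache = new HashMap<>();
    private final Map<String, String> deltaCache = new TreeMap<>();
    private long version = 0;

    synchronized void put(String key, String value) {
        writeCache.put(key, value);
    }

    synchronized void flush() {
        if (writeCache.isEmpty()) {
            return;                              // no-op on an empty write cache
        }
        Map<String, String> frozen = writeCache; // freezeWriteCache(): snapshot
        writeCache = new HashMap<>();            // new writes land in a fresh cache
        deltaCache.putAll(frozen);               // write + merge into the delta cache
        version++;                               // bump: invalidates fail-fast iterators
    }

    synchronized long version() { return version; }
    synchronized int deltaCacheSize() { return deltaCache.size(); }
}
```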
B) Compact
What it does today (SegmentCompacter):
- Resets searchers and bumps the version (invalidates fail-fast iterators).
- Rewrites the full segment using a merged view (index + delta + write cache).
- SegmentFullWriterTx clears the delta cache and write cache on commit.
When a new write arrives while compact() is running:
- The new write blocks on the write lock.
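The merged-view rewrite can be sketched the same way; the precedence order (write cache over delta cache over index) and the map-based model below are assumptions for illustration, not the SegmentCompacter implementation.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of compaction: rewrite the segment from a merged view of
// index + delta cache + write cache, then clear both caches on commit.
class CompactSketch {
    final Map<String, String> index = new TreeMap<>();
    final Map<String, String> deltaCache = new HashMap<>();
    final Map<String, String> writeCache = new HashMap<>();
    long version = 0;

    synchronized void compact() {
        version++;                 // reset searchers, invalidate fail-fast iterators
        Map<String, String> merged = new TreeMap<>(index);
        merged.putAll(deltaCache); // delta entries override the index
        merged.putAll(writeCache); // write cache overrides both
        index.clear();
        index.putAll(merged);      // full rewrite of the segment
        deltaCache.clear();        // both caches cleared on commit
        writeCache.clear();
    }
}
```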
C) Split (SegmentIndex)
What it does today (SegmentSplitCoordinator + SegmentSplitter):
- Triggered when getNumberOfKeysInCache() exceeds maxNumberOfKeysInSegmentCache
  (or the coordinator threshold).
- May run segment.compact() first when the policy says the segment is small or
  likely full of tombstones, then re-evaluates the plan.
- Acquires the segment write lock for the entire split when the segment is a
  SegmentImplSynchronizationAdapter.
- Calls segment.invalidateIterators() to terminate fail-fast iterators.
- Opens openIterator(FULL_ISOLATION) and streams a merged view (index + delta
  cache + write cache). Tombstones are filtered out by the iterator
  (MergeDeltaCacheWithIndexIterator).
- Writes the first half into a new lower segment. Remaining entries go into a
  new upper segment. If there are no remaining entries, the result is a
  compaction: the lower segment replaces the current segment.
- Replaces on-disk files and updates KeyToSegmentMap, then closes the old
  segment instance.
When a new write arrives while split is running:
- The new write blocks on the write lock for the full split duration.
When get() or iterators run during split:
- get() and fail-fast iterators block on the write lock while split holds it.
- Fail-fast iterators are explicitly invalidated at split start.
- FULL_ISOLATION iterators hold the write lock, so split waits until they close.
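The split plan can be sketched as: filter tombstones out of the merged stream, then cut the live entries in half by key order. Two assumptions in this sketch: a null value stands in for a tombstone, and the simple half-split rule stands in for SegmentSplitter's estimate-based plan.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the split pipeline over a merged, tombstone-filtered view.
class SplitSketch {
    static Map<String, Map<String, String>> split(Map<String, String> merged) {
        Map<String, String> live = new TreeMap<>();
        merged.forEach((k, v) -> {
            if (v != null) {
                live.put(k, v);                 // tombstones are filtered out
            }
        });
        int lowerCount = (live.size() + 1) / 2; // first half goes to the lower segment
        Map<String, String> lower = new TreeMap<>();
        Map<String, String> upper = new TreeMap<>();
        int i = 0;
        for (Map.Entry<String, String> e : live.entrySet()) {
            (i++ < lowerCount ? lower : upper).put(e.getKey(), e.getValue());
        }
        // an empty upper map means the split degraded to a compaction
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        result.put("lower", lower);
        result.put("upper", upper);
        return result;
    }
}
```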
Backpressure / overload
The segment itself does not enforce a hard size cap. UniqueCache is
unbounded, so if writes outpace maintenance, memory can grow.
Current index-level handling (SegmentIndexImpl):
- When getNumberOfKeysInWriteCache() reaches maxNumberOfKeysInSegmentWriteCache,
  the index triggers flush().
- When getNumberOfKeysInCache() exceeds maxNumberOfKeysInSegmentCache, the index
  tries split(), and falls back to flush() if splitting is not possible.
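These trigger rules reduce to a small decision function. The class below is a sketch of that policy only; field names and the decide() helper are illustrative, not the SegmentIndexImpl API.

```java
// Sketch of the index-level maintenance triggers described above.
class MaintenancePolicy {
    final int maxKeysInWriteCache; // maxNumberOfKeysInSegmentWriteCache
    final int maxKeysInCache;      // maxNumberOfKeysInSegmentCache

    MaintenancePolicy(int maxKeysInWriteCache, int maxKeysInCache) {
        this.maxKeysInWriteCache = maxKeysInWriteCache;
        this.maxKeysInCache = maxKeysInCache;
    }

    // Returns the maintenance action the index would trigger.
    String decide(int keysInWriteCache, int keysInCache, boolean splitPossible) {
        if (keysInCache > maxKeysInCache) {
            return splitPossible ? "split" : "flush"; // split, falling back to flush
        }
        if (keysInWriteCache >= maxKeysInWriteCache) {
            return "flush";
        }
        return "none";
    }
}
```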
If write load still exceeds flush/compact throughput, pick one:
- Apply backpressure at the index level (for example, wrap SegmentIndex with a bounded executor/queue).
- Lower thresholds to flush/split earlier.
- Add explicit write-cache caps or throttling around Segment.put().
Parallel calls (flush vs compact vs split)
Same segment:
- flush(), compact(), and split are exclusive. The write lock serializes them,
  so parallel calls run one after another.
- Split uses FULL_ISOLATION iterators and is wrapped in the write lock by
  SegmentSplitCoordinator when the segment is a SegmentImplSynchronizationAdapter.
Different segments:
- Maintenance can run in parallel (each segment has its own lock).
- Split takes a registry lock only during file rename; it does not block other segments otherwise.
Reads, get(), and iterators
- Multiple get() calls can run concurrently.
- openIterator(FAIL_FAST) can run concurrently with get() and other fail-fast
  iterators, but any mutation (put, flush, compact, split) invalidates it via
  the version counter.
- openIterator(FULL_ISOLATION) holds the write lock until closed, blocking
  writes, flush, compact, and split on that segment.
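The fail-fast mechanism can be sketched as an iterator that remembers the version it was opened at. VersionedSegment is an illustrative stand-in, not the real Segment interface, and modeling invalidation as a ConcurrentModificationException is an assumption; the real API may signal it differently.

```java
import java.util.ConcurrentModificationException;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Sketch of FAIL_FAST semantics driven by an optimistic version counter.
class VersionedSegment {
    private final Map<String, String> data = new TreeMap<>();
    private long version = 0;

    void put(String key, String value) {
        data.put(key, value);
        version++; // any mutation invalidates open fail-fast iterators
    }

    Iterator<String> openFailFastIterator() {
        final long openedAt = version; // version captured at open time
        final Iterator<String> inner = new TreeMap<>(data).keySet().iterator();
        return new Iterator<String>() {
            public boolean hasNext() {
                check();
                return inner.hasNext();
            }
            public String next() {
                check();
                return inner.next();
            }
            private void check() { // per hasNext()/next() call
                if (version != openedAt) {
                    throw new ConcurrentModificationException("segment version changed");
                }
            }
        };
    }
}
```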
Corner cases to call out
- Fail-fast iterators must be expected to end early after any mutation.
- FULL_ISOLATION iterators must always be closed; otherwise writers can stall.
- flush() is a no-op with an empty write cache.
- compact() clears both the delta cache and the write cache; callers should not
  expect pending writes to remain after compaction.
- Split can degrade to compaction when no entries remain for the upper segment.
- Split throws if the plan is not feasible (estimated keys too low).
- Split can be skipped when hasLiveEntries() finds no live entries (empty or
  tombstone-only segment).
- Split closes and replaces the current segment instance; stale references to
  the old segment must not be reused.
- Calls on a closed segment return SegmentResultStatus.CLOSED; callers can
  check getState() or handle the status directly.
- Version overflow throws in VersionController; long-running services should
  monitor for it.
Legacy lock summary (removed)
- Removed: per-segment read/write lock for API calls.
- One optimistic version counter for fail-fast iterator invalidation
  (VersionController).
- The segment registry exposes executeWithRegistryLock() for split file
  replacement, but the current SegmentRegistry does not implement a real
  registry lock.
Lock-free model (current)
Goals
- Keep FREEZE exclusive phases short and deterministic (swap + version bump).
- Run IO-heavy work (sorting, building files) outside FREEZE.
- Allow writes and most reads to continue during MAINTENANCE_RUNNING.
- Bound memory growth under sustained write load.
Coordination model
- Short FREEZE phase for cache rotation, version capture, and file/reference swap.
- Background maintenance uses frozen snapshots; no long-held FULL_ISOLATION
  iterators for heavy work.
- A per-segment maintenance state serializes flush/compact/split without
  blocking normal writes.
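The per-segment maintenance state can be sketched as a single compare-and-set guard: exactly one of flush/compact/split is admitted at a time, without taking any lock that writers would contend on. The enum and method names below follow the FREEZE/MAINTENANCE_RUNNING phases described here but are otherwise illustrative.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a lock-free per-segment maintenance guard.
class MaintenanceGuard {
    enum State { IDLE, FREEZE, MAINTENANCE_RUNNING }

    private final AtomicReference<State> state = new AtomicReference<>(State.IDLE);

    // Only one maintenance task wins the CAS; others coalesce or wait.
    boolean tryStartFreeze() {
        return state.compareAndSet(State.IDLE, State.FREEZE);
    }

    void releaseToMaintenance() {          // heavy IO runs in this phase
        state.set(State.MAINTENANCE_RUNNING);
    }

    boolean tryReenterFreeze() {           // short exclusive phase for the final swap
        return state.compareAndSet(State.MAINTENANCE_RUNNING, State.FREEZE);
    }

    void finish() {
        state.set(State.IDLE);
    }
}
```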
A) Flush
- Under FREEZE: swap (freeze) the write cache and capture a version.
- Release to MAINTENANCE_RUNNING: sort and write the frozen snapshot to the
  delta cache file.
- Under FREEZE: merge metadata and drop the frozen snapshot handle.
- Writes continue into the new write cache throughout the flush.
B) Compact
- Under FREEZE: freeze the merged in-memory view (delta + write cache), capture
  a version, and redirect writes to a fresh cache.
- Release to MAINTENANCE_RUNNING: rebuild new on-disk files from the snapshot
  plus the current index.
- Optionally replay post-freeze writes before the swap (or keep them in the
  active write cache) to avoid data loss.
- Under FREEZE: validate the version and swap files; clear obsolete delta files
  and reset searchers.
C) Split
- Use the freeze + redirect + replay pattern (see
  docs/development/segment-splitting-lock-minimization.md).
- Under FREEZE: freeze the write cache, record a split version, and redirect
  writes to a fresh cache.
- Release to MAINTENANCE_RUNNING: run the split pipeline against the frozen
  snapshot.
- Replay post-freeze writes into the new lower/upper segments.
- Under FREEZE: verify the version and swap references; on conflict, retry.
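The freeze + redirect + replay pattern can be sketched with two maps: maintenance works on a frozen snapshot while new writes land in a fresh cache, and those post-freeze writes are replayed on top of the rebuilt result before the final swap. All names below are illustrative, not the split pipeline's API.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of freeze + redirect + replay for lock-free maintenance.
class FreezeRedirectReplay {
    private Map<String, String> writeCache = new HashMap<>();
    private long version = 0;

    // FREEZE: swap the active cache for a fresh one and capture a version.
    synchronized Map<String, String> freeze() {
        Map<String, String> frozen = writeCache;
        writeCache = new HashMap<>(); // redirect: new writes land here
        version++;
        return frozen;
    }

    synchronized void put(String key, String value) {
        writeCache.put(key, value);   // never blocked by MAINTENANCE_RUNNING
    }

    // After the heavy work on the frozen snapshot, post-freeze writes are
    // replayed on top of the rebuilt result so nothing is lost.
    synchronized Map<String, String> replayInto(Map<String, String> rebuilt) {
        rebuilt.putAll(writeCache);   // post-freeze writes win
        return rebuilt;
    }
}
```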
Backpressure / overload
- Bound the number/size of frozen snapshots per segment.
- If backlog exceeds limits, apply throttling or spill to disk.
- Run maintenance on a background executor and prioritize older snapshots.
Parallel calls
- One maintenance task per segment at a time; other requests coalesce or wait.
- Flush requests during split should attach to replay or run after split.
- Maintenance can still run in parallel across different segments.
Reads, get(), and iterators
- get() should read from current caches and files without blocking on long
  maintenance tasks.
- Fail-fast iterators continue to invalidate on version changes.
- Prefer snapshot-based iterators for long scans instead of holding the write
  lock via FULL_ISOLATION.
Corner cases
- Version mismatch at swap time should trigger retry or fallback.
- Large post-freeze write backlog should trigger backpressure or replay caps.
- Segment replacement must be atomic in the registry to avoid stale reads.
- Iterators opened before a swap should either complete on their snapshot or fail fast.
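The version check at swap time can be sketched as an optimistic publish: maintenance captures the version at freeze and only swaps if no concurrent mutation bumped it, otherwise the caller retries or falls back. The class below is a sketch under that assumption; names and the String stand-in for file references are illustrative.

```java
// Sketch of the optimistic version check behind "on conflict, retry".
class OptimisticSwap {
    private long version = 0;
    private String files = "files-v0"; // stand-in for on-disk file references

    synchronized long captureVersion() { return version; }
    synchronized void mutate() { version++; } // any write bumps the counter

    synchronized boolean trySwap(long capturedVersion, String newFiles) {
        if (version != capturedVersion) {
            return false;  // conflict: a mutation landed after the freeze
        }
        files = newFiles;  // atomic replacement of the published references
        version++;         // the swap itself invalidates fail-fast iterators
        return true;
    }

    synchronized String files() { return files; }
}
```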
Coordination summary
- Per-segment FREEZE for short cache/file swaps.
- Per-segment maintenance state to serialize flush/compact/split work.
- Optimistic version counter for iterator invalidation.
- Registry lock for file replacement if added in the future.