Range-Partitioned Ingest Implementation Notes
This page records the implementation contract for the partitioned ingest runtime layered above stable segment storage.
Current Implementation Scope
- user writes enter an in-memory partition overlay first
- stable segments remain the durable publish target
- KeyToSegmentMap remains the persisted routing source of truth
- WAL replay restores unpublished writes back into the partition overlay on open
The current implementation changes the user write path first. It still uses
the existing stable-segment split execution primitives, but the runtime no
longer depends on the historical SegmentSplitCoordinator wrapper.
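The overlay-first write path and overlay-first read order can be sketched roughly as follows. This is a minimal illustration, not the runtime's actual classes; OverlayIngestSketch and its fields are hypothetical stand-ins, and tombstone handling for delete() is elided.

```java
import java.util.*;

// Hypothetical sketch of the overlay-first ingest path; names are illustrative.
class OverlayIngestSketch {
    final List<String> wal = new ArrayList<>();          // stand-in for the WAL
    final Map<String, String> overlay = new HashMap<>(); // in-memory partition overlay
    final Map<String, String> stable = new HashMap<>();  // durable stable segments

    void put(String key, String value) {
        wal.add("PUT " + key);   // 1. append to WAL first
        overlay.put(key, value); // 2. then update the routed partition overlay
    }

    String get(String key) {
        // overlay first, stable segment storage second
        String v = overlay.get(key);
        return v != null ? v : stable.get(key);
    }

    void drain() {
        // background drain publishes overlay data into stable storage
        stable.putAll(overlay);
        overlay.clear();
    }
}
```

Because put() updates the overlay before returning, a subsequent get() observes the write whether or not a drain has run yet.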
Transitional Split Behavior
- live-segment split is no longer triggered directly from put()
- the current transition slice evaluates partition-aware stable split scheduling from a coalesced background policy scan over currently routed partitions
- when the background split policy decides a routed partition should split, child stable segments are materialized from the parent stable snapshot before the final route remap is published
- if buffered overlay data still exists for the parent route at split-apply time, it is reassigned to the produced child routes as part of the same partition-aware split apply step instead of being left on the retired parent segment id
- background overlay drain only requests another policy scan; it does not run split work inline on the hot write path
- point get() now runs under the same short split-apply read gate as put(), so remap plus overlay reassignment cannot expose a stale point lookup during the split-apply window
- explicit maintenance split no longer holds that gate for the whole stable child build; the gate only wraps the final split-apply remap window
- while a split is building child stable segments, drain back into the parent route is temporarily suspended so newly buffered data stays in overlay and is reassigned to child routes if the split applies
- there is no longer a runtime-only pending split fallback chain after route apply; by the time the new route becomes visible, the child stable data already exist on disk
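The split-apply step above can be sketched as follows: the heavy child materialization happens off-gate, and only the final route remap plus overlay reassignment runs under the short gate. SplitApplySketch and its method signatures are hypothetical illustrations, not the real coordinator.

```java
import java.util.*;
import java.util.concurrent.locks.*;

// Hypothetical sketch of the split-apply step; names are illustrative.
class SplitApplySketch {
    final ReentrantReadWriteLock gate = new ReentrantReadWriteLock();
    final NavigableMap<String, String> routes = new TreeMap<>();        // startKey -> segmentId
    final Map<String, Map<String, String>> overlays = new HashMap<>();  // segmentId -> buffered rows

    void applySplit(String parentId, String leftId, String rightId,
                    String midKey, String parentStartKey) {
        // The heavy phase -- materializing child stable segments from the
        // parent stable snapshot -- would already have completed here,
        // outside the gate.
        gate.writeLock().lock();
        try {
            // publish the route remap: parent range becomes two child routes
            routes.put(parentStartKey, leftId);
            routes.put(midKey, rightId);
            // reassign still-buffered overlay data to the child routes in
            // the same apply step, so nothing is stranded on the parent id
            Map<String, String> parked = overlays.remove(parentId);
            if (parked != null) {
                for (Map.Entry<String, String> e : parked.entrySet()) {
                    String child = e.getKey().compareTo(midKey) < 0 ? leftId : rightId;
                    overlays.computeIfAbsent(child, id -> new HashMap<>())
                            .put(e.getKey(), e.getValue());
                }
            }
        } finally {
            gate.writeLock().unlock();
        }
    }
}
```

Readers and writers that take the gate's read lock during this window therefore never observe the remap without the matching overlay reassignment.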
Read and Write Semantics
- put() and delete() append to WAL, then update the routed partition overlay
- get() reads overlay first and stable segment storage second
- a successful put() is therefore visible to get() before any drain completes
- FULL_ISOLATION index streaming now opens against a split-safe route snapshot and retries if the segment map changes underneath the open; it no longer falls back to the historical split-idle barrier on the read path
- point operations no longer wait explicitly for background live-segment split completion before retrying a BUSY path
- flushAndWait() seals active partition data, drains immutable runs into stable segment storage, waits for any partition-aware stable splits already scheduled by background drain, drains again if split apply reassigned overlay data to child routes, waits again for any second-wave split scheduled by that follow-up drain, flushes stable segments, and checkpoints WAL
- compactAndWait() likewise waits for any split already scheduled by background drain before compacting stable segments, so compaction does not overlap with split materialization of the same routed range; the same second drain plus wait cycle is repeated before compaction begins
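The flushAndWait() phase ordering is easier to see laid out as a sequence. The sketch below only records phase names in order; the step names and the splitReassignedOverlay flag are illustrative, not the runtime's actual API.

```java
import java.util.*;

// Hypothetical sketch of the flushAndWait() phase ordering; each step is a
// stub that records its name so the sequencing is visible.
class FlushSequenceSketch {
    final List<String> phases = new ArrayList<>();
    boolean splitReassignedOverlay = true; // pretend a split apply moved overlay data

    void flushAndWait() {
        phases.add("seal-active-partitions");
        phases.add("drain-immutable-runs");
        phases.add("wait-for-scheduled-splits");
        if (splitReassignedOverlay) {
            // split apply moved overlay data to child routes, so drain again
            phases.add("drain-reassigned-overlay");
            phases.add("wait-for-second-wave-splits");
        }
        phases.add("flush-stable-segments");
        phases.add("checkpoint-wal");
    }
}
```

The key invariant is that the WAL checkpoint is last: it only happens after every drain wave and every awaited split has landed in stable storage.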
Drain Contract
- active mutable partition data rotates into immutable runs
- immutable runs remain readable until they are successfully drained and flushed into stable storage
- drain work is scheduled on index maintenance executors
- partition-aware stable split build work is scheduled on a dedicated split-maintenance executor so the heavy child materialization phase no longer runs on the explicit maintenance caller thread
- a periodic autonomous background split policy loop runs while the index is open; it keeps reevaluating routed stable ranges even when no new writes, reopen, or runtime-threshold patch occurs
- the autonomous loop is driven by a dedicated split-policy scheduler owned by the index executor registry, so close() shuts down both split execution and future policy ticks together with the rest of the index maintenance infrastructure
- that autonomous loop only performs a full routed scan when overlay, drain, and split backlog are idle; hot write periods still rely on targeted post-drain split scheduling for the specific routed partition that just drained
- split scheduling keeps a per-segment cooldown and retry-growth hysteresis window; if a borderline split candidate fails or aborts, the background loop does not immediately thrash on the same stable range again unless either enough time passes or the routed segment grows materially
- that cooldown is adaptive rather than fixed: longer split attempts stretch the next retry window, while short split attempts decay it back toward the baseline
- additional immediate background split policy scans are still triggered on open, after consistency repair, and after runtime split-threshold changes
- explicit flushAndWait()/compactAndWait() no longer initiate split scanning themselves; they only wait for in-flight split work that was already triggered by the background policy path
- while explicit stable flush/compaction is running, new autonomous split candidates are temporarily ignored; a fresh idle scan is requested right after explicit maintenance completes so split materialization does not race against that maintenance on the same routed range
- if the overlay exceeds local or global limits, writes receive bounded backpressure instead of waiting on a live segment split
- during the brief split-apply window itself, write admission is still gated so route remap and overlay reassignment stay atomic, but explicit flushAndWait()/compactAndWait() no longer need a whole-operation global write gate
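The cooldown-with-hysteresis behavior above can be sketched as a tiny state machine. The doubling/halving policy and the constants below are illustrative assumptions; the notes only state that long attempts stretch the retry window, short attempts decay it, and material growth bypasses the cooldown.

```java
// Hypothetical sketch of the adaptive split-retry cooldown; constants and the
// doubling/halving policy are illustrative, not the real tuning.
class SplitCooldownSketch {
    static final long BASE_COOLDOWN_MS = 1_000;
    static final long MAX_COOLDOWN_MS = 60_000;
    long cooldownMs = BASE_COOLDOWN_MS;

    void onAttemptFinished(long attemptDurationMs) {
        if (attemptDurationMs > cooldownMs) {
            // long attempt: stretch the next retry window
            cooldownMs = Math.min(MAX_COOLDOWN_MS, cooldownMs * 2);
        } else {
            // short attempt: decay back toward the baseline
            cooldownMs = Math.max(BASE_COOLDOWN_MS, cooldownMs / 2);
        }
    }

    boolean mayRetry(long sinceLastAttemptMs, boolean segmentGrewMaterially) {
        // hysteresis: a failed borderline candidate is not retried until
        // enough time passes, unless the routed segment grew materially
        return segmentGrewMaterially || sinceLastAttemptMs >= cooldownMs;
    }
}
```

This keeps a borderline range from being rescanned in a tight loop after an aborted split, while still reacting promptly when the range actually grows.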
Recovery Contract
- unpublished partition overlay state is transient
- startup reconstructs routing from persisted index metadata
- WAL replay restores acknowledged writes that were not yet published
- startup and explicit consistency checks delete orphaned segment directories that are not referenced from the persisted routing metadata, which covers abandoned split children after interrupted maintenance
- durability after flushAndWait() is defined by successful stable-segment flush plus WAL checkpoint
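Two of the recovery steps are mechanical enough to sketch: WAL replay rebuilding the overlay, and orphan-directory cleanup against the persisted routing. RecoverySketch and its record layout are hypothetical; tombstone semantics for replayed deletes are elided.

```java
import java.util.*;

// Hypothetical sketch of two recovery steps; names and record layout are
// illustrative, and delete tombstones are elided for brevity.
class RecoverySketch {
    // segment directories on disk that the persisted routing does not
    // reference are orphans, e.g. abandoned children of an interrupted split
    static Set<String> findOrphans(Set<String> segmentDirsOnDisk,
                                   Collection<String> routedSegmentIds) {
        Set<String> orphans = new HashSet<>(segmentDirsOnDisk);
        orphans.removeAll(routedSegmentIds);
        return orphans; // these directories get deleted
    }

    // replay acknowledged-but-unpublished writes back into the overlay
    static Map<String, String> replayWal(List<String[]> walRecords) {
        Map<String, String> overlay = new LinkedHashMap<>();
        for (String[] rec : walRecords) { // rec = {op, key, value}
            if ("PUT".equals(rec[0])) overlay.put(rec[1], rec[2]);
            else if ("DELETE".equals(rec[0])) overlay.remove(rec[1]);
        }
        return overlay;
    }
}
```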
Configuration Migration
New partition-oriented settings:
- maxNumberOfKeysInActivePartition
- maxNumberOfImmutableRunsPerPartition
- maxNumberOfKeysInPartitionBuffer
- maxNumberOfKeysInIndexBuffer
- maxNumberOfKeysInPartitionBeforeSplit
Legacy persisted settings are still accepted during load and are migrated to the new keys:
- maxNumberOfKeysInSegmentWriteCache -> maxNumberOfKeysInActivePartition
- maxNumberOfKeysInSegmentWriteCacheDuringMaintenance -> maxNumberOfKeysInPartitionBuffer
- maxNumberOfKeysInSegment -> maxNumberOfKeysInPartitionBeforeSplit
New manifests are written with the partition-oriented names only.
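The load-time migration amounts to a key-rename table applied to whatever was persisted. A minimal sketch, assuming settings are held as a string map (SettingsMigrationSketch is a hypothetical name; only the key mappings come from the notes above):

```java
import java.util.*;

// Hypothetical sketch of the legacy settings migration on load; the class
// name is illustrative, the key mappings are the documented ones.
class SettingsMigrationSketch {
    static final Map<String, String> LEGACY_TO_NEW = Map.of(
        "maxNumberOfKeysInSegmentWriteCache", "maxNumberOfKeysInActivePartition",
        "maxNumberOfKeysInSegmentWriteCacheDuringMaintenance", "maxNumberOfKeysInPartitionBuffer",
        "maxNumberOfKeysInSegment", "maxNumberOfKeysInPartitionBeforeSplit");

    static Map<String, String> migrate(Map<String, String> persisted) {
        Map<String, String> out = new LinkedHashMap<>();
        // legacy keys are renamed; already-new keys pass through unchanged
        persisted.forEach((k, v) -> out.put(LEGACY_TO_NEW.getOrDefault(k, k), v));
        return out; // written back with partition-oriented names only
    }
}
```

Since new manifests carry only the new names, the migration runs at most once per legacy manifest.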