Health Supervision Architecture
Overview
The health supervision mechanism provides a structured and reliable approach to evaluating and publishing the overall health condition of the system.
It is based on a snapshot-driven and debounced evaluation pipeline, designed to ensure consistency, stability, and traceability of health transitions across CSP.LMC devices.
The architecture separates:
Event collection
Supervision timing
Health classification
Diagnostic aggregation and publication
Architectural Layers
The health supervision system is composed of three conceptual layers:
- Event Collection Layer
Subsystem events such as health changes, operational state updates, and administrative mode transitions are collected asynchronously. These updates are stored in a thread-safe structure that guarantees consistency and prevents race conditions.
- Supervision Layer
A timing mechanism applies debounce and maximum-latency rules. This prevents transient fluctuations from triggering unnecessary evaluations while ensuring that evaluation is never indefinitely delayed.
This layer builds on the generic observation supervisor's debounce and max-latency infrastructure, originally introduced for observation state management.
- Evaluation and Aggregation Layer
A dedicated evaluation stage processes a consistent snapshot of the system and determines the aggregated HealthState. Diagnostic messages (HealthInfo) are derived from the evaluation result and forwarded subsystem payloads.
The aggregation and publication of healthInfo is delegated to the HealthInfoManager, which:
merges local diagnostics with forwarded subsystem diagnostics;
deduplicates messages per FQDN while preserving order;
removes stale subsystem entries;
performs change detection before publication.
The supervisor coordinates when evaluation occurs, while the HealthInfoManager is responsible for producing and emitting the final aggregated healthInfo payload.
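The four responsibilities listed above can be sketched in a few lines. This is only an illustration under stated assumptions, not the actual HealthInfoManager: the class name `HealthInfoAggregator`, the `update` signature, the `active_sources` argument, and the payload shape (a dictionary of message lists keyed by FQDN) are all hypothetical.

```python
class HealthInfoAggregator:
    """Hypothetical stand-in for the aggregation step (not the real HealthInfoManager)."""

    def __init__(self, publish):
        self._publish = publish        # callback that emits the attribute
        self._last_published = None    # remembered for change detection

    def update(self, local, forwarded, active_sources):
        """Merge diagnostics and publish only when the result changed.

        local:          {fqdn: [messages]} from the local evaluation
        forwarded:      {fqdn: [messages]} latest payload per subsystem
        active_sources: FQDNs of subsystems that are still supervised
        """
        merged = {fqdn: list(msgs) for fqdn, msgs in local.items()}
        for fqdn, msgs in forwarded.items():
            if fqdn in active_sources:           # skip stale subsystem entries
                merged.setdefault(fqdn, []).extend(msgs)
        # dict.fromkeys keeps the first occurrence of each message, in order
        aggregated = {
            fqdn: list(dict.fromkeys(msgs))
            for fqdn, msgs in merged.items()
            if msgs
        }
        if aggregated == self._last_published:
            return False                         # unchanged: no publication
        self._last_published = aggregated
        self._publish(aggregated)
        return True
```

Calling `update` twice with the same inputs publishes once; the second call returns `False` because change detection finds nothing new.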
Evaluation Flow
Subsystem events are received and stored (state/health/admin) and subsystem HealthInfo updates are buffered (latest per source).
The supervision mechanism waits for stability (debounce period) or forces evaluation after a maximum latency threshold.
A consistent snapshot of the current system state is taken when a snapshot-based evaluation is required.
The snapshot is evaluated to determine the aggregated health.
Diagnostic information is derived from the evaluation result.
Forwarded subsystem HealthInfo buffers are passed to the HealthInfoManager together with the locally evaluated diagnostics. The manager recomputes the aggregated structure and, if the outcome differs from the previously published values, emits the updated attribute.
This mechanism ensures that health evaluation remains stable, predictable, and resistant to short-lived oscillations.
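The store, snapshot, and evaluate steps of this flow can be sketched as follows. The names `StateStore` and `evaluate` are hypothetical, the `HealthState` enum is a local stand-in, and the classification rule is a placeholder chosen for illustration; the actual CSP.LMC classification rules are not described in this section.

```python
import threading
from enum import Enum


class HealthState(Enum):
    # Local stand-in for the health values named in this document
    OK = "OK"
    DEGRADED = "DEGRADED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


class StateStore:
    """Hypothetical thread-safe store of the latest health per subsystem."""

    def __init__(self):
        self._lock = threading.Lock()
        self._health = {}

    def update(self, fqdn, health):
        with self._lock:
            self._health[fqdn] = health

    def snapshot(self):
        # Copy under the lock so evaluation works on one consistent view
        with self._lock:
            return dict(self._health)


def evaluate(snapshot):
    """Placeholder rule (an assumption): report the first of FAILED,
    DEGRADED, UNKNOWN found among the subsystems, otherwise OK."""
    states = set(snapshot.values())
    result = HealthState.OK
    for candidate in (HealthState.FAILED, HealthState.DEGRADED, HealthState.UNKNOWN):
        if candidate in states:
            result = candidate
            break
    messages = [
        f"{fqdn} is {health.name}"
        for fqdn, health in sorted(snapshot.items())
        if health != HealthState.OK
    ]
    return result, messages
```

Because `snapshot()` returns a copy, events that arrive while an evaluation is running change the store but not the view being evaluated.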
Overall Flow
The following diagram illustrates the high-level data flow of the health supervision architecture.
        Subsystem events
  (state / health / healthInfo)
                |
                v
+------------------------------+
| CspHealthSupervisor          |
|------------------------------|
| - HealthStateStore           |
| - LatestBySourceBuffer       |
| - debounce / max-latency     |
+---------------+--------------+
                |
                | Debounced evaluation trigger
                v
      +---------+------------------+
      | Snapshot-based evaluation  |
      |----------------------------|
      | 1) Evaluate HealthState    |
      | 2) Generate diagnostics    |
      +---------+------------------+
                |
                | Aggregated data
                v
      +---------+------------------+
      | HealthInfoManager          |
      |----------------------------|
      | - Merge local + forwarded  |
      | - Deduplicate messages     |
      | - Change detection         |
      +---------+------------------+
                |
                v
      +---------+------------------+
      | Attribute Publication      |
      |----------------------------|
      | - publish HealthState      |
      | - publish healthInfo       |
      +----------------------------+
HealthState and HealthInfo
The HealthState attribute represents the aggregated health
condition of the system (e.g. OK, DEGRADED, FAILED, UNKNOWN).
The HealthInfo attribute complements HealthState by
providing contextual diagnostic information.
While HealthState answers:
“Is the system healthy?”
HealthInfo answers:
“Why is the system in this condition?”
When the system enters a degraded or failed state,
HealthInfo contains human-readable messages describing the
underlying causes. These may include:
Fault conditions in critical components
Communication problems
Disabled subsystems
Operational state inconsistencies
Command execution failures
This allows operators to understand the origin of a problem without manually inspecting individual subsystem states.
When the system returns to a healthy condition,
HealthInfo is cleared.
HealthInfo Propagation
CSP.LMC devices subscribe to the HealthInfo attributes of
their subordinate subsystems.
When a subsystem updates its diagnostic information, the
corresponding parent device incorporates that information into
its own HealthInfo output, according to the aggregation and
diagnostic rules defined by the supervision architecture.
This hierarchical propagation ensures that higher-level devices expose relevant diagnostic context originating from the components they supervise.
Note
The CSP.LMC Controller does not subscribe to the HealthInfo attributes
of CSP.LMC Subarray devices.
This is an intentional design choice. The CSP.LMC Controller derives its diagnostic output directly from the underlying subsystem snapshot rather than aggregating Subarray-level diagnostics.
This prevents duplication of messages across hierarchy levels and
ensures that the CSP.LMC Controller HealthInfo remains concise,
unambiguous, and semantically aligned with its system-wide role.
Forwarded HealthInfo buffering
Forwarded subsystem HealthInfo change events are buffered and
coalesced by the CSP.LMC supervision loop.
For each subscribed subsystem source, only the latest received
HealthInfo payload is retained until the next debounced evaluation
cycle. During that cycle, forwarded updates are merged together with the
locally evaluated diagnostics and a single HealthInfo attribute
publication is performed (subject to change detection).
This prevents bursts of consecutive HealthInfo events on the parent
device when multiple subsystems update their diagnostics in quick
succession.
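A latest-per-source buffer of this kind can be sketched as below. The class name `LatestPerSource` and its `put`/`drain` methods are hypothetical, and clearing the buffer on each drain is an assumption about how a cycle consumes the pending payloads.

```python
import threading


class LatestPerSource:
    """Hypothetical coalescing buffer: keeps only the newest payload per source."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = {}

    def put(self, source, payload):
        with self._lock:
            self._pending[source] = payload   # overwrites any older payload

    def drain(self):
        # Hand over everything buffered so far and start a fresh cycle
        with self._lock:
            drained, self._pending = self._pending, {}
        return drained
```

However many events a subsystem sends between two evaluation cycles, the cycle sees at most one payload per source.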
Health Transition Triggers
The aggregated HealthState (and consequently HealthInfo)
may change through two mechanisms.
Subsystem Event Propagation
Health and operational state changes reported by subordinate subsystems contribute to the aggregated system health. These updates follow the supervised, snapshot-based evaluation flow.
Since both health and operational changes may affect diagnostics, rapid sequences of subsystem updates could otherwise lead to multiple consecutive evaluations and publications.
The supervision layer mitigates this behaviour through debounce and latency control.
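The interaction of the two timing rules can be sketched as a deadline computation: evaluation is due one debounce period after the most recent event, but never later than the maximum latency after the first event of the burst. The class `DebounceTimer` and its parameters are hypothetical and are not taken from the CSP.LMC code base.

```python
class DebounceTimer:
    """Hypothetical timing helper; times are seconds from a monotonic clock."""

    def __init__(self, debounce, max_latency):
        self.debounce = debounce
        self.max_latency = max_latency
        self._first = None   # time of the first event in the pending burst
        self._last = None    # time of the most recent event

    def on_event(self, now):
        if self._first is None:
            self._first = now
        self._last = now

    def deadline(self):
        """Time at which evaluation should run, or None if nothing is pending."""
        if self._first is None:
            return None
        return min(self._last + self.debounce, self._first + self.max_latency)

    def should_evaluate(self, now):
        due = self.deadline()
        if due is None or now < due:
            return False
        self._first = self._last = None   # burst consumed, start over
        return True
```

With `debounce=0.5` and `max_latency=2.0`, a steady stream of events every 0.25 s keeps pushing the debounce deadline forward, yet the evaluation still runs 2.0 s after the first event.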
Command-Driven Forcing
In addition to event-driven updates, command execution outcomes can directly influence the system health.
When a command executed on a CSP.LMC device fails, the system may
explicitly force the HealthState into a DEGRADED or FAILED
condition, depending on the severity of the failure.
In this case:
The transition does not originate from subsystem updates.
The transition reflects an operational failure.
HealthInfo is updated to include the reason.
This dual mechanism ensures that the reported health condition reflects both subsystem status and runtime operational errors.
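As a rough illustration of command-driven forcing, the sketch below maps a failed command to a forced health value and a diagnostic message. The function `force_health_on_failure`, the boolean `critical` severity flag, and the message format are assumptions made for this example; the document does not specify how severity is determined.

```python
from enum import Enum


class HealthState(Enum):
    # Local stand-in for the health values named in this document
    OK = "OK"
    DEGRADED = "DEGRADED"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


def force_health_on_failure(command, error, critical):
    """Return the forced (HealthState, message) pair for a failed command.

    `critical` is an illustrative severity input, not a documented one.
    """
    state = HealthState.FAILED if critical else HealthState.DEGRADED
    return state, f"Command {command} failed: {error}"
```

The returned message is the kind of reason that would be added to HealthInfo alongside the forced HealthState.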
Design Principles
The health supervision architecture is designed to:
Prevent transient oscillations in reported health.
Guarantee bounded evaluation latency.
Ensure atomic and consistent system snapshots.
Minimize redundant attribute publications.
Provide clear diagnostic information alongside health transitions.
Maintain separation between timing, evaluation, and explanation logic.