Health Supervision Architecture

Overview

The health supervision mechanism provides a structured and reliable approach to evaluating and publishing the overall health condition of the system.

It is based on a snapshot-driven and debounced evaluation pipeline, designed to ensure consistency, stability, and traceability of health transitions across CSP.LMC devices.

The architecture separates:

  • Event collection

  • Supervision timing

  • Health classification

  • Diagnostic aggregation and publication

Architectural Layers

The health supervision system is composed of three conceptual layers:

Event Collection Layer

Subsystem events such as health changes, operational state updates, and administrative mode transitions are collected asynchronously. These updates are stored in a thread-safe structure that guarantees consistency and prevents race conditions.
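As a rough illustration of the storage idea described above, the following minimal sketch keeps the last reported values per subsystem behind a lock and hands out copies as snapshots. It is not the actual CSP.LMC implementation: the class name is borrowed from the data flow diagram in this document, while the record fields, method names, and FQDN strings are assumptions made for the example.

```python
import threading
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class SubsystemRecord:
    """Last values reported by one subsystem (hypothetical shape)."""

    health: Optional[str] = None
    op_state: Optional[str] = None
    admin_mode: Optional[str] = None


class HealthStateStore:
    """Illustrative thread-safe store keyed by subsystem FQDN."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._records: Dict[str, SubsystemRecord] = {}

    def update(self, fqdn: str, **changes: str) -> None:
        # Called from event callbacks; only the reported fields are replaced.
        with self._lock:
            current = self._records.get(fqdn, SubsystemRecord())
            self._records[fqdn] = replace(current, **changes)

    def snapshot(self) -> Dict[str, SubsystemRecord]:
        # Records are immutable, so a shallow copy is a consistent snapshot
        # that later updates cannot alter.
        with self._lock:
            return dict(self._records)
```

Because every record is frozen and each update swaps in a new record, a snapshot taken before an update keeps showing the older values, which is the property the evaluation stage relies on.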

Supervision Layer

A timing mechanism applies debounce and maximum-latency rules. This prevents transient fluctuations from triggering unnecessary evaluations while ensuring that evaluation is never indefinitely delayed.

This layer builds on the debounce and max-latency infrastructure of the Generic observation supervisor, originally introduced for observation state management.
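The interaction between the two timing rules can be sketched as a small decision helper. This is a simplified illustration, not the supervisor's real code: the class name, the parameter names, and the use of explicit timestamps (instead of a background thread or timer) are assumptions chosen to keep the example deterministic.

```python
from typing import Optional


class DebounceTimer:
    """Decides when a pending evaluation is due (illustrative only)."""

    def __init__(self, debounce_s: float, max_latency_s: float) -> None:
        self.debounce_s = debounce_s
        self.max_latency_s = max_latency_s
        self._first_event: Optional[float] = None  # start of the pending burst
        self._last_event: Optional[float] = None   # most recent event in the burst

    def notify(self, now: float) -> None:
        # Every event restarts the quiet period; the burst start is kept.
        if self._first_event is None:
            self._first_event = now
        self._last_event = now

    def due(self, now: float) -> bool:
        if self._first_event is None or self._last_event is None:
            return False  # nothing pending
        quiet = now - self._last_event >= self.debounce_s
        overdue = now - self._first_event >= self.max_latency_s
        return quiet or overdue

    def reset(self) -> None:
        # Called once the evaluation has run.
        self._first_event = None
        self._last_event = None
```

With a debounce of 0.5 s and a maximum latency of 2 s, a stream of events arriving every 0.4 s never satisfies the quiet-period rule, yet evaluation still becomes due 2 s after the first event of the burst.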

Evaluation and Aggregation Layer

A dedicated evaluation stage processes a consistent snapshot of the system and determines the aggregated HealthState.

Diagnostic messages (HealthInfo) are derived both from the evaluation result and from the forwarded subsystem payloads.
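The evaluation stage can be pictured as a pure function from a snapshot to a health value plus diagnostics. The sketch below is hypothetical: the worst-of aggregation rule, the severity ranking, the message wording, and the simplified snapshot shape (FQDN mapped to a health string) are all assumptions for illustration, since the actual classification rules are not described in this section.

```python
from typing import Dict, List, Tuple

# Assumed severity ranking, used only by this example.
SEVERITY = {"OK": 0, "UNKNOWN": 1, "DEGRADED": 2, "FAILED": 3}


def evaluate(snapshot: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (aggregated health, diagnostics) using a worst-of rule."""
    if not snapshot:
        return "UNKNOWN", ["no subsystem has reported its health"]
    # The aggregated value is the most severe health found in the snapshot.
    worst = max(snapshot.values(), key=SEVERITY.__getitem__)
    # One message per subsystem that is not OK, in a stable order.
    diagnostics = [
        f"{fqdn}: health is {health}"
        for fqdn, health in sorted(snapshot.items())
        if health != "OK"
    ]
    return worst, diagnostics
```

Keeping the function free of side effects means the same snapshot always yields the same result, which makes health transitions easy to trace and to test.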

The aggregation and publication of healthInfo is delegated to the HealthInfoManager, which:

  • merges local diagnostics with forwarded subsystem diagnostics;

  • deduplicates messages per FQDN while preserving order;

  • removes stale subsystem entries;

  • performs change detection before publication.

The supervisor coordinates when evaluation occurs, while the HealthInfoManager is responsible for producing and emitting the final aggregated healthInfo payload.
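The four responsibilities listed above can be sketched in a few lines. This is an assumption-laden illustration rather than the real HealthInfoManager: the payload shape (a mapping from FQDN to a list of messages), the update signature, the publish callback, and the definition of "stale" as "no longer in the set of supervised sources" are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set

Payload = Dict[str, List[str]]


class HealthInfoManager:
    """Illustrative aggregator; not the actual CSP.LMC class."""

    def __init__(self, own_fqdn: str, publish: Callable[[Payload], None]) -> None:
        self._own_fqdn = own_fqdn
        self._publish = publish
        self._last: Optional[Payload] = None  # last published payload

    def update(
        self,
        local: Iterable[str],
        forwarded: Payload,
        active_sources: Set[str],
    ) -> bool:
        merged: Payload = {}
        # dict.fromkeys drops duplicates while preserving first-seen order.
        own = list(dict.fromkeys(local))
        if own:
            merged[self._own_fqdn] = own
        for fqdn, messages in forwarded.items():
            if fqdn not in active_sources:
                continue  # stale entry (assumed meaning): source not supervised
            unique = list(dict.fromkeys(messages))
            if unique:
                merged[fqdn] = unique
        # Change detection: publish only when the aggregate differs.
        if merged == self._last:
            return False
        self._last = merged
        self._publish(merged)
        return True
```

A call such as `HealthInfoManager("ctrl/a/1", print).update(["x", "x"], {"sub/a/1": ["m"]}, {"sub/a/1"})` publishes once; repeating the same call on the same instance publishes nothing, because the merged result is unchanged.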

Evaluation Flow

  1. Subsystem events (operational state, health, administrative mode) are received and stored, and subsystem HealthInfo updates are buffered, keeping only the latest payload per source.

  2. The supervision mechanism waits for stability (debounce period) or forces evaluation after a maximum latency threshold.

  3. A consistent snapshot of the current system state is taken when a snapshot-based evaluation is required.

  4. The snapshot is evaluated to determine the aggregated health.

  5. Diagnostic information is derived from the evaluation result.

  6. Forwarded subsystem HealthInfo buffers are passed to the HealthInfoManager together with the locally evaluated diagnostics. The manager recomputes the aggregated structure and, if the outcome differs from the previously published values, emits the updated attribute.

This mechanism ensures that health evaluation remains stable, predictable, and resistant to short-lived oscillations.

Overall Flow

The following diagram illustrates the high-level data flow of the health supervision architecture.

Subsystem events
(state / health / healthInfo)
         |
         v
+------------------------------+
| CspHealthSupervisor          |
|------------------------------|
| - HealthStateStore           |
| - LatestBySourceBuffer       |
| - debounce / max-latency     |
+---------------+--------------+
                |
                | Debounced evaluation trigger
                v
      +---------+------------------+
      | Snapshot-based evaluation  |
      |----------------------------|
      | 1) Evaluate HealthState    |
      | 2) Generate diagnostics    |
      +---------+------------------+
                |
                | Aggregated data
                v
      +---------+------------------+
      | HealthInfoManager          |
      |----------------------------|
      | - Merge local + forwarded  |
      | - Deduplicate messages     |
      | - Change detection         |
      +---------+------------------+
                |
                v
      +---------+------------------+
      | Attribute Publication      |
      |----------------------------|
      | - publish HealthState      |
      | - publish healthInfo       |
      +----------------------------+

HealthState and HealthInfo

The HealthState attribute represents the aggregated health condition of the system (e.g. OK, DEGRADED, FAILED, UNKNOWN).

The HealthInfo attribute complements HealthState by providing contextual diagnostic information.

While HealthState answers:

“Is the system healthy?”

HealthInfo answers:

“Why is the system in this condition?”

When the system enters a degraded or failed state, HealthInfo contains human-readable messages describing the underlying causes. These may include:

  • Fault conditions in critical components

  • Communication problems

  • Disabled subsystems

  • Operational state inconsistencies

  • Command execution failures

This allows operators to understand the origin of a problem without manually inspecting individual subsystem states.

When the system returns to a healthy condition, HealthInfo is cleared.

HealthInfo Propagation

CSP.LMC devices subscribe to the HealthInfo attributes of their subordinate subsystems.

When a subsystem updates its diagnostic information, the corresponding parent device incorporates that information into its own HealthInfo output, according to the aggregation and diagnostic rules defined by the supervision architecture.

This hierarchical propagation ensures that higher-level devices expose relevant diagnostic context originating from the components they supervise.

Note

The CSP.LMC Controller does not subscribe to the HealthInfo attributes of CSP.LMC Subarray devices.

This is an intentional design choice. The CSP.LMC Controller derives its diagnostic output directly from the underlying subsystem snapshot rather than aggregating Subarray-level diagnostics.

This prevents duplication of messages across hierarchy levels and ensures that the CSP.LMC Controller HealthInfo remains concise, unambiguous, and semantically aligned with its system-wide role.

Forwarded HealthInfo buffering

Forwarded subsystem HealthInfo change events are buffered and coalesced by the CSP.LMC supervision loop.

For each subscribed subsystem source, only the latest received HealthInfo payload is retained until the next debounced evaluation cycle. During that cycle, the forwarded updates are merged with the locally evaluated diagnostics and a single HealthInfo attribute publication is performed (subject to change detection).

This prevents bursts of consecutive HealthInfo events on the parent device when multiple subsystems update their diagnostics in quick succession.
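The coalescing behaviour can be sketched as a latest-value-wins buffer. The class name is taken from the data flow diagram in this document, but the methods, the payload type (a list of message strings), and the FQDN strings are assumptions for illustration only.

```python
import threading
from typing import Dict, List


class LatestBySourceBuffer:
    """Keeps only the newest payload per source (illustrative sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._latest: Dict[str, List[str]] = {}

    def put(self, source: str, payload: List[str]) -> None:
        # A newer payload from the same source overwrites the older one.
        with self._lock:
            self._latest[source] = list(payload)

    def drain(self) -> Dict[str, List[str]]:
        # Hand over everything received since the previous cycle and
        # start the next cycle with an empty buffer.
        with self._lock:
            latest, self._latest = self._latest, {}
            return latest
```

If a source sends three updates between two evaluation cycles, `drain()` returns only the third, so the parent device has at most one change per source to merge. How payloads from sources that did not update are retained between cycles is outside the scope of this sketch.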

Health Transition Triggers

The aggregated HealthState (and consequently HealthInfo) may change through two mechanisms.

Subsystem Event Propagation

Health and operational state changes reported by subordinate subsystems contribute to the aggregated system health. These updates follow the supervised, snapshot-based evaluation flow.

Because both health and operational state changes may affect diagnostics, a rapid sequence of subsystem updates could lead to multiple consecutive evaluations and publications.

The supervision layer mitigates this behaviour through debounce and maximum-latency control.

Command-Driven Forcing

In addition to event-driven updates, command execution outcomes can directly influence the system health.

When a command executed on a CSP.LMC device fails, the system may explicitly force the HealthState into a DEGRADED or FAILED condition, depending on the severity of the failure.

In this case:

  • The transition does not originate from subsystem updates.

  • The transition reflects an operational failure.

  • HealthInfo is updated to include the reason.

This dual mechanism ensures that the reported health condition reflects both subsystem status and runtime operational errors.
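A minimal sketch of the command-driven path is shown below. The function name, its parameters, the boolean severity flag, and the message format are hypothetical; the only behaviour taken from the description above is that a failed command forces DEGRADED or FAILED depending on severity and that the reason is added to HealthInfo.

```python
from typing import List, Tuple


def force_health_on_failure(
    command: str,
    error: str,
    critical: bool,
    health_info: List[str],
) -> Tuple[str, List[str]]:
    """Return the forced health and the HealthInfo extended with the reason."""
    # Assumed severity mapping: critical failures force FAILED,
    # all other failures force DEGRADED.
    forced = "FAILED" if critical else "DEGRADED"
    reason = f"command {command} failed: {error}"
    # Build a new list so the caller's messages are left untouched.
    return forced, [*health_info, reason]
```

The existing messages are preserved and the failure reason is appended, so an operator sees both the subsystem-derived diagnostics and the operational error in one place.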

Design Principles

The health supervision architecture is designed to:

  • Prevent transient oscillations in reported health.

  • Guarantee bounded evaluation latency.

  • Ensure atomic and consistent system snapshots.

  • Minimize redundant attribute publications.

  • Provide clear diagnostic information alongside health transitions.

  • Maintain separation between timing, evaluation, and explanation logic.