Low CBF Health Monitoring

Philosophy

  • Health means a device’s ability to perform its function.

  • The healthState attribute is used to report the health state of Low CBF Controller, Subarray, Processor and Connector devices.

    • Controller and Subarray aggregate the health of other Tango devices.

    • Processor and Connector report the health of their underlying hardware as well as their firmware/programming and control software.

  • The information used by each device to evaluate its health state will be exposed as Tango attributes for use in GUIs or as alarms.

Purpose

Why do all SKA Tango devices have a healthState attribute?

To pinpoint faulty pieces of the telescope.

Health State Definitions

  • 0 - OK: fully functional

  • 1 - Degraded: partial (not full) function available [optional state]

  • 2 - Failed: unable to perform core function

  • 3 - Unknown: initial state / we can’t determine an answer

See also the documentation for ska_control_model.HealthState.
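As a minimal sketch, the four states above can be mirrored by an integer enumeration. The class below is a local stand-in written for illustration only (the real definition lives in ska_control_model.HealthState); its numeric values are copied from the list above.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Local stand-in for ska_control_model.HealthState (illustration only)."""

    OK = 0        # fully functional
    DEGRADED = 1  # partial (not full) function available
    FAILED = 2    # unable to perform core function
    UNKNOWN = 3   # initial state / cannot determine an answer


# Members convert to and from their integer codes and names.
state = HealthState(2)
print(state.name, int(state))  # FAILED 2
```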

Evaluation is based on the component itself, not the previous/next piece of signal chain or neighbours. To use an analogy: we evaluate “our car”, not the road, bridge, or traffic lights.

Alarms vs Health

Alarms and Health are different things.

An alarm is defined in IEC 62682 as:

An audible and/or visible means of indicating to the operator of an equipment malfunction, process deviation, or abnormal condition requiring a response.

We define health as:

A device’s ability to perform its function.

_images/alarms_vs_health.svg

One is not a subset of the other, although there is some overlap. Some parameters that trigger alarms may lead to a change in health state, others may not. Some parameters that are used to calculate health state will have alarms associated with them, some may not. A failed health state in some devices may be alarmed, others may not.

Low CBF devices will provide many diagnostic attributes which may or may not be used as alarms in the operating telescope. Alarm annunciation and configuration will be via an instance of the Elettra AlarmHandler Tango device. Its configuration includes a “formula” that is evaluated to trigger an alarm, alarm priority, group, etc.

Note

An attribute that has an AttrQuality status of ALARM does not necessarily result in an alarm being announced to the operator! All operator alarms will be configured and announced via an AlarmHandler Tango device.

Health Aggregation

The Low CBF Controller and Subarray devices will aggregate the health status of their underlying hardware devices.

This means:

  • The health of a Subarray device will be an aggregation of the health of the Processor and Connector devices that are participating in the subarray.

  • The health of the Controller device will be an aggregation of the health of all Low CBF Processors and Connectors.

In principle, these aggregated health states should be:

  • OK if the underlying hardware is sufficiently healthy to perform full functionality (there may be a FAILED state in some redundant piece of hardware)

  • DEGRADED if there are degradations or failures in underlying hardware that reduce overall functionality (perhaps beyond some threshold of acceptable degradation)

  • FAILED if no functionality is possible (either due to a single point of failure or to a combination of multiple failures)

In practice, this is very difficult to codify! There will be pathological combinations of degraded hardware that can result in an overall failure. Rather than aim for a perfection that we fail to achieve, we will implement the following “close enough” algorithm:

for each type of hardware (Connectors, Processors):

    if number of devices OK >= number required for full function:
      device_type_health = OK
    else if all devices FAILED:
      device_type_health = FAILED
    else:
      device_type_health = DEGRADED

aggregated health = worst case of all device_type_health values
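The pseudocode above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the Low CBF implementation: HealthState is a local stand-in for ska_control_model.HealthState, the function and argument names are hypothetical, and "worst case" is taken to mean the largest enumeration value.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Local stand-in for ska_control_model.HealthState (illustration only)."""

    OK = 0
    DEGRADED = 1
    FAILED = 2
    UNKNOWN = 3


def device_type_health(states, required_ok):
    """Health of one hardware type, following the pseudocode above."""
    n_ok = sum(1 for s in states if s == HealthState.OK)
    if n_ok >= required_ok:
        return HealthState.OK
    # Note: an empty list counts as "all FAILED" here (an assumption).
    if all(s == HealthState.FAILED for s in states):
        return HealthState.FAILED
    return HealthState.DEGRADED


def aggregate_health(states_by_type, required_by_type):
    """Worst case (largest enum value) across all hardware types."""
    return max(
        device_type_health(states, required_by_type[name])
        for name, states in states_by_type.items()
    )


# Hypothetical example: all connectors OK, one of three processors FAILED.
states = {
    "connectors": [HealthState.OK, HealthState.OK],
    "processors": [HealthState.OK, HealthState.FAILED, HealthState.OK],
}
required = {"connectors": 2, "processors": 3}
print(aggregate_health(states, required).name)  # DEGRADED
```

If only two processors were required for full function, the same inputs would aggregate to OK, which illustrates the redundancy case described earlier.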

If the ENGINEERING_MODE_IGNORE_HEALTH environment variable is defined and set to True (case ignored), the Subarray will ignore health state of external devices (switches, processors) when its adminMode is ENGINEERING. This would prevent propagating the healthState to LowCbfController and possibly confusing the operator.
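A minimal sketch of how such a check might look is given below. The function name is hypothetical, and adminMode is represented by its name as a plain string to keep the example self-contained; the real device may structure this differently.

```python
import os


def ignore_external_health(admin_mode, environ=None):
    """Return True when external device health should be ignored.

    admin_mode is the adminMode name as a string (an assumption made for
    this sketch); environ defaults to the process environment.
    """
    env = os.environ if environ is None else environ
    flag = env.get("ENGINEERING_MODE_IGNORE_HEALTH", "")
    # The variable must be set to "True", compared case-insensitively,
    # and the override only applies in ENGINEERING adminMode.
    return admin_mode == "ENGINEERING" and flag.lower() == "true"


print(ignore_external_health("ENGINEERING", {"ENGINEERING_MODE_IGNORE_HEALTH": "TRUE"}))
```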

Controller Health Aggregation

As mentioned above, the healthState attribute of the Low CBF Controller will reflect the aggregated health of all Low CBF hardware devices.

The Controller device searches the Tango database for all Processor and Connector devices when its adminMode is switched to ONLINE. Any devices not defined in the database at this time will not be detected. Beware that the way we deploy Tango devices for development and testing involves dynamically reconfiguring the Tango database, so there is a chance we may not discover some devices (i.e. any that are late to inject themselves into the database). We hope that the Tango database used for the real operational telescope will have a fixed definition, thus avoiding this risk.

The health state of all Processor & Connector devices will be available at the Controller via Tango array attributes health_processors & health_connectors.

Health of Low CBF Hardware Devices

As well as the healthState attribute, Low CBF hardware-related devices (Connector & Processor) will expose three health “category” attributes. These categories are intended to help with triaging faults (e.g. a hardware fault likely needs a technician on site, but a processing fault might be remedied by restarting the scan). Like healthState, these health category attributes also use the ska_control_model.HealthState enumeration data type. The three attributes are:

  • health_hardware to summarise the health of the hardware layer. For example: QSFPs, power supplies, temperatures.

  • health_function to summarise the health of the ‘functional’ layer. This includes things like driver interfaces, loading firmware, or routing rules.

  • health_process to summarise the health of the ‘processing’ layer, i.e. dynamic conditions, including connections to other Tango devices. For example: FPGA error registers, P4 switch queue overflow.

The healthState attribute will report the worst case of the three categories.

The individual parameters that contribute to these three summary attributes will also be exposed as individual Tango attributes. In other words, all the pieces of information that are used to assess health will be available as separate Tango attributes.

  • Each parameter will be exposed as a single attribute (numeric or boolean). Aggregations (e.g. using JSON structures) or Tango array types will not be used, as these cannot be unpacked by an AlarmHandler.

    • We expose these individual attributes to help troubleshooters pinpoint an active failure mode, to display on user interfaces, and to allow individual alarms to be configured using an AlarmHandler if desired. Using an AlarmHandler, an alarm could be added to a particular parameter for early warning of impending failure, or logic could be used to look at the same parameter across multiple devices - whatever is useful to operations & maintenance.

  • Each attribute will be associated with its category via a naming convention:

    • hardware_<parameter_name> for those contributing to health_hardware

    • function_<parameter_name> for those contributing to health_function

    • process_<parameter_name> for those contributing to health_process

  • When evaluation of an individual parameter is not possible (e.g. FPGA uptime cannot be evaluated when the FPGA is un-programmed), its attribute will report INVALID quality using the standard Tango AttrQuality mechanism.

    • Invalid individual parameters will not contribute to their health category summary.

    • Invalid health category attributes will not contribute to the overall healthState.
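The roll-up rules above can be sketched as a worst-case reduction that skips invalid entries. This is only an illustration: HealthState is a local stand-in for ska_control_model.HealthState, the helper name worst_valid is hypothetical, and None is used here to represent INVALID quality. Returning None when every input is invalid is an assumption for the overall healthState case.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Local stand-in for ska_control_model.HealthState (illustration only)."""

    OK = 0
    DEGRADED = 1
    FAILED = 2
    UNKNOWN = 3


def worst_valid(states):
    """Worst case of the valid entries; None stands for INVALID."""
    valid = [s for s in states if s is not None]
    return max(valid) if valid else None


# Illustrative per-parameter assessments (None = INVALID quality).
health_hardware = worst_valid([HealthState.OK, HealthState.OK, None])
health_function = worst_valid([HealthState.FAILED, None, HealthState.OK])
health_process = worst_valid([None, None])

# Invalid categories are skipped in the same way.
health_state = worst_valid([health_hardware, health_function, health_process])
print(health_state.name)  # FAILED
```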

Below is an example of the health evaluation scheme in action, represented as a table where each cell aggregates its neighbours on the right. Tango AttrQuality is shown in parentheses.

Overall Health | Health Category | Individual Parameters
---------------+-----------------+-------------------------------------
healthState    | health_hardware | hardware_12v 11.9 (VALID)
FAILED (VALID) | OK (VALID)      | hardware_12v_aux 12.1 (VALID)
               |                 | hardware_qsfp_temperature (INVALID)
               +-----------------+-------------------------------------
               | health_function | function_driver_ok False (ALARM)
               | FAILED (VALID)  | function_firmware_loaded (INVALID)
               |                 | function_rules_valid True (VALID)
               +-----------------+-------------------------------------
               | health_process  | process_overflow_error (INVALID)
               | (INVALID)       | process_subscription_ok (INVALID)

Implementation Details

A YAML configuration file will provide settings that control how each attribute contributes to health state (this is an indicative concept, not a prescriptive specification):

# top level keys are names of attributes that contribute to health assessment
hardware_numeric_example:
  # trigger FAILED when outside the interval (-20,100)
  # i.e. "not -20 < value < 100"
  fail_limits: [-20, 100]
  # set DEGRADED when outside this interval (but not the fail interval)
  degrade_limits: [0, 50]
  # if any "limits" value is 'null' (=> None in Python), that limit does not apply
  # if we are not outside any fail/degrade limit then this parameter is OK

hardware_boolean_example:
  # make health_hardware go to FAILED state if attribute value is True
  fail_state: true

hardware_boolean_example_two:
  # set health_hardware to DEGRADED if our value is False
  degrade_state: false

Using this, we configure the AttributeInfoEx Tango structure for each attribute, so the settings are visible to (and modifiable by) clients. Tango will then automatically drive the WARNING and ALARM status (AttrQuality).
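Independently of Tango, the threshold semantics described in the YAML comments above could be sketched as follows. The function names are hypothetical, HealthState is a local stand-in for ska_control_model.HealthState, and the settings are written as a Python dictionary mirroring the YAML so that no YAML parser is needed. Checking fail conditions before degrade conditions is an assumption based on the "but not the fail interval" comment.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Local stand-in for ska_control_model.HealthState (illustration only)."""

    OK = 0
    DEGRADED = 1
    FAILED = 2
    UNKNOWN = 3


def outside(value, limits):
    """True unless low < value < high; a None bound does not apply."""
    low, high = limits
    return (low is not None and value <= low) or (
        high is not None and value >= high
    )


def evaluate(value, settings):
    """Assess one parameter against its settings (fail checks first)."""
    if "fail_limits" in settings and outside(value, settings["fail_limits"]):
        return HealthState.FAILED
    if "fail_state" in settings and value == settings["fail_state"]:
        return HealthState.FAILED
    if "degrade_limits" in settings and outside(value, settings["degrade_limits"]):
        return HealthState.DEGRADED
    if "degrade_state" in settings and value == settings["degrade_state"]:
        return HealthState.DEGRADED
    return HealthState.OK


# Python equivalent of the YAML example above.
config = {
    "hardware_numeric_example": {"fail_limits": [-20, 100], "degrade_limits": [0, 50]},
    "hardware_boolean_example": {"fail_state": True},
    "hardware_boolean_example_two": {"degrade_state": False},
}

print(evaluate(75, config["hardware_numeric_example"]).name)  # DEGRADED
```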

health_hardware is determined by seeing if any hardware_* attribute is in WARNING/ALARM state (which implies that it’s past its degrade/fail threshold). Likewise for health_function & health_process.
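A rough sketch of this quality-based evaluation is shown below. Tango AttrQuality names are written as plain strings so the example runs without PyTango; HealthState is a local stand-in for ska_control_model.HealthState and category_health is a hypothetical name. Skipping every quality other than VALID, WARNING and ALARM is an assumption of this sketch.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Local stand-in for ska_control_model.HealthState (illustration only)."""

    OK = 0
    DEGRADED = 1
    FAILED = 2
    UNKNOWN = 3


# WARNING implies past the degrade threshold, ALARM past the fail threshold.
QUALITY_TO_HEALTH = {
    "ATTR_VALID": HealthState.OK,
    "ATTR_WARNING": HealthState.DEGRADED,
    "ATTR_ALARM": HealthState.FAILED,
}


def category_health(qualities):
    """Worst case over one category's attribute qualities.

    Qualities not in the mapping (such as ATTR_INVALID) are skipped;
    None is returned when nothing can be evaluated.
    """
    states = [QUALITY_TO_HEALTH[q] for q in qualities if q in QUALITY_TO_HEALTH]
    return max(states) if states else None


print(category_health(["ATTR_VALID", "ATTR_WARNING", "ATTR_INVALID"]).name)  # DEGRADED
```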

healthState evaluation uses a very simple algorithm:

max(health_hardware, health_function, health_process)

Implemented health attributes

As of Jul-2024 the following health-related LowCbfProcessor Tango attributes are implemented (subject to change upon review):

Category        | Tango Attributes                  | Attribute’s AttrQuality value                       | Description
----------------+-----------------------------------+-----------------------------------------------------+---------------------------------------------------------
health_hardware | hardware_fpga_temperature         |                                                     | FPGA core temperature in degrees C
                | hardware_fpga_power               |                                                     | FPGA power consumption in Watts
                | hardware_hbm_temperature          |                                                     | FPGA memory temperature in degrees C
                | hardware_power_supply_12v_voltage |                                                     | Auxiliary power supply voltage in Volts
                | hardware_power_supply_12v_current |                                                     | Auxiliary power supply current in Amperes
                | hardware_pcie_12v_voltage         |                                                     | PCIe bus power supply voltage in Volts
                | hardware_pcie_12v_current         |                                                     | PCIe bus power supply current in Amperes
health_function | function_firmware_loaded          |                                                     | FPGA firmware loaded indicator (boolean)
                | function_driver_ok                | ATTR_INVALID when function_firmware_loaded == False | FPGA device driver is operational (boolean)
health_process  | process_delay_poly_valid          | ATTR_INVALID when subarray is not scanning          | Delay polynomials valid indicator (boolean)
                | process_delay_subscription_ok     | ATTR_INVALID when subarray is not scanning          | Delay polynomials subscription valid indicator (boolean)
                | process_spead_packets_ok          | ATTR_INVALID when subarray is not scanning          | SPS SPEAD packets are arriving at FPGA input (boolean)

Overrides for Testing

We think that overriding of the healthState attribute and the other attributes that contribute to health evaluation will be useful for testing (e.g. to test Low CBF health aggregation logic, or CSP LMC health aggregation logic).

The override will be controlled by the testMode attribute (see ska_control_model.TestMode). Any override configuration will be cleared when testMode is changed to NONE (i.e. test mode off), to minimise the chance of accidentally overriding something.

The override configuration will be set via the test_mode_overrides attribute, using a (JSON encoded) dictionary with attribute names as keys and their desired state as values.

Any attribute not listed in the overrides dictionary will operate as normal.

Examples:

To force the healthState attribute to FAILED

{"healthState": "FAILED"}

To force the hardware_12v value to 13.8, as well as health_hardware to OK

{"health_hardware": "OK", "hardware_12v": 13.8}
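To illustrate the intended behaviour, the sketch below builds a JSON-encoded overrides dictionary and applies it to a plain dictionary of attribute values. In a deployment the JSON string would be written to the test_mode_overrides attribute; the apply_overrides helper and the measured values here are hypothetical and exist only to show that unlisted attributes are left untouched.

```python
import json


def apply_overrides(values, overrides_json):
    """Return a copy of values with any overridden attributes replaced."""
    overrides = json.loads(overrides_json) if overrides_json else {}
    result = dict(values)
    result.update(overrides)
    return result


# Hypothetical "real" attribute values before any override.
measured = {"healthState": "OK", "hardware_12v": 11.9, "hardware_12v_aux": 12.1}

# JSON-encoded dictionary: attribute names as keys, desired state as values.
payload = json.dumps({"healthState": "FAILED", "hardware_12v": 13.8})

overridden = apply_overrides(measured, payload)
print(overridden)

# An empty configuration (e.g. after testMode returns to NONE) changes nothing.
print(apply_overrides(measured, ""))
```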

Note

All content below here was written for an older health scheme and needs revision!

reporting hierarchy block diagram

Attribute Subscription

The Controller Tango device subscribes to changes in the healthState attribute of all constituent Subarrays; it uses the Tango database to retrieve the list of Subarrays:

UML sequence diagram

Subarray Tango devices subscribe to changes in the healthState attribute of all Connector devices and all Processors allocated to the Subarray. The list of Connector Tango devices is retrieved from the Tango database. The list of Processors assigned to the Subarray is reported by the Allocator.

UML sequence diagram

Processor health

If the NO_HEALTH_ROLLUP environment variable is defined, the Subarray will not include the health of external devices (switches, processors) in its roll-up. This allows for tests in which switch or processor devices are not present.