Low CBF Connector Health
========================

Philosophy
**********

Refer to the :external+ska-low-cbf:doc:`Low CBF Health documentation ` for
details of the overall Low CBF philosophy on health monitoring.
The Connector device adopts this philosophy, summarised as:

* Health means the Connector device's ability to perform its function.
* The ``healthState`` attribute is used to report the health state of the Connector device.
* The Connector reports the health of its underlying hardware as well as its programming and control software.
* The information used to evaluate its health state will be exposed as Tango attributes for use in GUIs or as alarms.

Health State Definitions
************************

See :external+ska-low-cbf:doc:`Low CBF Health documentation ` and
:py:class:`ska_control_model.HealthState`.

Health Attributes
*****************

In line with :external+ska-low-cbf:doc:`Low CBF Health documentation `, Low CBF Connector
devices have three health “category” attributes. The ``healthState`` attribute
will report the worst case of the three categories.
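
The worst-case rule can be illustrated with a short Python sketch. This is
only an illustration under stated assumptions: the local ``HealthState`` enum
is a stand-in for :py:class:`ska_control_model.HealthState`, while the
``worst_health`` helper and the severity ranking (in particular, where
``UNKNOWN`` sits relative to ``DEGRADED``) are hypothetical and not taken from
the Connector implementation.

.. code-block:: python

    from enum import IntEnum


    class HealthState(IntEnum):
        """Local stand-in for ska_control_model.HealthState."""

        OK = 0
        DEGRADED = 1
        FAILED = 2
        UNKNOWN = 3


    # Assumed severity ranking, lowest to highest. Placing UNKNOWN between
    # OK and DEGRADED is an assumption made for this sketch only.
    _SEVERITY = {
        HealthState.OK: 0,
        HealthState.UNKNOWN: 1,
        HealthState.DEGRADED: 2,
        HealthState.FAILED: 3,
    }


    def worst_health(hardware, function, process):
        """Return the most severe of the three category health states."""
        return max((hardware, function, process), key=lambda s: _SEVERITY[s])


    # Example: one degraded category degrades the overall state.
    overall = worst_health(HealthState.OK, HealthState.DEGRADED, HealthState.OK)
    print(overall.name)

Ranking by an explicit severity table, rather than by the raw enum value,
keeps the ordering independent of the numeric values assigned to each state.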

``health_hardware``
-------------------

Summarises the health of the hardware by aggregating these attributes:

========================================== ================================================
Attribute                                  Description
========================================== ================================================
``hardware_motherboard_temperature``       Motherboard average temperature
``hardware_tofino_temperature``            Tofino average temperature
``hardware_port_<number>_temperature``     QSFP temperature, one attribute per port
``hardware_port_<number>_tx_power``        QSFP transmit power, one attribute per port
``hardware_port_<number>_rx_power``        QSFP received power, one attribute per port
========================================== ================================================

``health_function``
-------------------

Summarises the health of the 'functional' layer by aggregating these attributes:

========================================== ================================================
Attribute                                  Description
========================================== ================================================
``function_p4_code_version_present``       P4 code version is present
``function_sdp_port_up``                   Percentage of SDP ports up
``function_alveo_port_up``                 Percentage of Alveo ports up
``function_sps_port_up``                   Percentage of SPS ports up
``function_pss_port_up``                   Percentage of PSS ports up
``function_pst_port_up``                   Percentage of PST ports up
``function_valid_routing_port``            Routes are valid (using active ports)
========================================== ================================================

``health_process``
------------------

Summarises the health of the 'processing' layer by aggregating these attributes:

========================================== ================================================
Attribute                                  Description
========================================== ================================================
``process_port_<number>_tx_throughput``    Transmit throughput, one attribute per port
``process_port_<number>_rx_throughput``    Receive throughput, one attribute per port
``process_port_<number>_queue_occupancy``  Queue occupancy, one attribute per port
========================================== ================================================

.. note::

   Queue occupancy is planned but not yet implemented.

Implementation Details
----------------------

A YAML configuration file will provide settings that control how each attribute
contributes to health state:

.. code-block:: yaml

    # motherboard average temperature
    hardware_motherboard_temperature:
      # trigger FAILED when outside the interval (-20,70)
      fail_limits: [-20, 70]
      # set DEGRADED when outside this interval (but not the fail interval)
      degrade_limits: [0, 55]
    # tofino average temperature
    hardware_tofino_temperature:
      fail_limits: [-20, 80]
      degrade_limits: [0, 65]
    # QSFP temperature
    hardware_port_number_temperature:
      fail_limits: [-20, 80]
      degrade_limits: [0, 65]
    # QSFP Tx Power
    hardware_port_number_tx_power:
      fail_limits: [-40, 10]
      degrade_limits: [-40, 5]
    # QSFP Rx Power
    hardware_port_number_rx_power:
      fail_limits: [-40, 10]
      degrade_limits: [-40, 5]
    # Tofino code installed
    function_p4_code_version_present:
      fail_state: false
    # percentage of SDP ports up
    function_sdp_port_up:
      fail_limits: [80, 100]
      degrade_limits: [90, 100]
    # percentage of Alveo ports up
    function_alveo_port_up:
      fail_limits: [80, 100]
      degrade_limits: [90, 100]
    # percentage of SPS ports up
    function_sps_port_up:
      fail_limits: [80, 100]
      degrade_limits: [90, 100]
    # percentage of PSS ports up
    function_pss_port_up:
      fail_limits: [80, 100]
      degrade_limits: [90, 100]
    # percentage of PST ports up
    function_pst_port_up:
      fail_limits: [80, 100]
      degrade_limits: [90, 100]
    # Routing to non-valid port
    function_valid_routing_port:
      fail_state: false
    # Tx throughput per port
    process_port_number_tx_throughput:
      fail_limits: [0, 95]
      degrade_limits: [0, 75]
    # Rx throughput per port
    process_port_number_rx_throughput:
      fail_limits: [0, 95]
      degrade_limits: [0, 75]
    # Queue percentage occupancy per port
    process_port_number_queue_occupancy:
      fail_limits: [0, 95]
      degrade_limits: [0, 75]

Alarm Attributes
****************

As detailed in the :external+ska-low-cbf:doc:`Low CBF Alarm documentation `,
alarms will be generated and configured via the Elettra AlarmHandler Tango device.

The Connector device features a range of Tango attributes for various purposes,
including internal control parameters, health monitoring, telescope operation
monitoring, diagnostics, and fault troubleshooting. This document presents a
generic configuration of the alarm handler based on the currently available
attributes. Specifically, we propose attaching alarms to two main sets of
attributes: health attributes and port-related attributes.

Alarms for health
-----------------

The first set of generic alarms covers the health attributes described above.
We propose raising an alarm whenever any of these attributes enters the
``FAILED`` state.

===================== =============================== ===============
Attribute             Description                     Alarm Trigger
===================== =============================== ===============
``healthState``       General Connector Health State  State is FAILED
``health_hardware``   Connector Hardware Health State State is FAILED
``health_function``   Connector Function Health State State is FAILED
``health_process``    Connector Process Health State  State is FAILED
===================== =============================== ===============

Alarms for port status
----------------------

In addition, we recommend raising per-port alarms on the following attributes:

========================================== ================================================ ==============================
Attribute                                  Description                                      Alarm Trigger
========================================== ================================================ ==============================
``process_port_<number>_tx_throughput``    Transmit throughput, one attribute per port      If > 90% of capacity
``process_port_<number>_rx_throughput``    Receive throughput, one attribute per port       If > 90% of capacity
``diagnostics_port_<number>_up``           Status of the port, up or down                   Port is down
``hardware_port_<number>_temperature``     QSFP temperature, one attribute per port         If > 60 °C
========================================== ================================================ ==============================

Implementation details
----------------------

Using the Elettra alarm handler we can configure the described alarms as follows:

.. code-block::

    tag=health_state_connector_0;formula=(low-cbf/connector/0/healthState == 2);priority=log;group=none;message="General Health State of Connector is FAILED"
    tag=health_hardware_connector_0;formula=(low-cbf/connector/0/health_hardware == 2);priority=log;group=none;message="Hardware Health State of Connector is FAILED"
    tag=health_function_connector_0;formula=(low-cbf/connector/0/health_function == 2);priority=log;group=none;message="Function Health State of Connector is FAILED"
    tag=health_process_connector_0;formula=(low-cbf/connector/0/health_process == 2);priority=log;group=none;message="Process Health State of Connector is FAILED"
    tag=ptp_port_monitor;formula=(low-cbf/connector/0/diagnostics_port_1_up == 0);priority=log;group=none;message="PTP port is down"
    tag=sps_port_2_monitor;formula=(low-cbf/connector/0/diagnostics_port_2_up == 0);priority=log;group=none;message="SPS port 2 is down"
    tag=sps_port_5_monitor;formula=(low-cbf/connector/0/diagnostics_port_5_up == 0);priority=log;group=none;message="SPS port 5 is down"
    tag=alveo_port_7_monitor;formula=(low-cbf/connector/0/diagnostics_port_7_up == 0);priority=log;group=none;message="Alveo port 7 is down"
    tag=alveo_port_9_monitor;formula=(low-cbf/connector/0/diagnostics_port_9_up == 0);priority=log;group=none;message="Alveo port 9 is down"
    tag=alveo_port_11_monitor;formula=(low-cbf/connector/0/diagnostics_port_11_up == 0);priority=log;group=none;message="Alveo port 11 is down"
    tag=alveo_port_13_monitor;formula=(low-cbf/connector/0/diagnostics_port_13_up == 0);priority=log;group=none;message="Alveo port 13 is down"
    tag=alveo_port_15_monitor;formula=(low-cbf/connector/0/diagnostics_port_15_up == 0);priority=log;group=none;message="Alveo port 15 is down"
    tag=alveo_port_17_monitor;formula=(low-cbf/connector/0/diagnostics_port_17_up == 0);priority=log;group=none;message="Alveo port 17 is down"
    tag=pst_port_monitor;formula=(low-cbf/connector/0/diagnostics_port_28_up == 0);priority=log;group=none;message="PST port is down"
    tag=pss_port_monitor;formula=(low-cbf/connector/0/diagnostics_port_29_up == 0);priority=log;group=none;message="PSS port is down"
    tag=sdp_port_monitor;formula=(low-cbf/connector/0/diagnostics_port_32_up == 0);priority=log;group=none;message="SDP port is down"
    tag=ptp_port_temperature;formula=(low-cbf/connector/0/hardware_port_1_temperature > 60);priority=log;group=none;message="PTP port is hot"
    tag=sps_port_2_temperature;formula=(low-cbf/connector/0/hardware_port_2_temperature > 60);priority=log;group=none;message="SPS port 2 is hot"
    tag=sps_port_5_temperature;formula=(low-cbf/connector/0/hardware_port_5_temperature > 60);priority=log;group=none;message="SPS port 5 is hot"
    tag=alveo_port_7_temperature;formula=(low-cbf/connector/0/hardware_port_7_temperature > 60);priority=log;group=none;message="Alveo port 7 is hot"
    tag=alveo_port_9_temperature;formula=(low-cbf/connector/0/hardware_port_9_temperature > 60);priority=log;group=none;message="Alveo port 9 is hot"
    tag=alveo_port_11_temperature;formula=(low-cbf/connector/0/hardware_port_11_temperature > 60);priority=log;group=none;message="Alveo port 11 is hot"
    tag=alveo_port_13_temperature;formula=(low-cbf/connector/0/hardware_port_13_temperature > 60);priority=log;group=none;message="Alveo port 13 is hot"
    tag=alveo_port_15_temperature;formula=(low-cbf/connector/0/hardware_port_15_temperature > 60);priority=log;group=none;message="Alveo port 15 is hot"
    tag=alveo_port_17_temperature;formula=(low-cbf/connector/0/hardware_port_17_temperature > 60);priority=log;group=none;message="Alveo port 17 is hot"
    tag=pst_port_temperature;formula=(low-cbf/connector/0/hardware_port_28_temperature > 60);priority=log;group=none;message="PST port is hot"
    tag=pss_port_temperature;formula=(low-cbf/connector/0/hardware_port_29_temperature > 60);priority=log;group=none;message="PSS port is hot"
    tag=sdp_port_temperature;formula=(low-cbf/connector/0/hardware_port_32_temperature > 60);priority=log;group=none;message="SDP port is hot"
    tag=ptp_port_txthroughput;formula=(low-cbf/connector/0/process_port_1_tx_throughput > 9000000000);priority=log;group=none;message="PTP port is transmitting too much"
    tag=sps_port_2_txthroughput;formula=(low-cbf/connector/0/process_port_2_tx_throughput > 35000000000);priority=log;group=none;message="SPS port 2 is transmitting too much"
    tag=sps_port_5_txthroughput;formula=(low-cbf/connector/0/process_port_5_tx_throughput > 35000000000);priority=log;group=none;message="SPS port 5 is transmitting too much"
    tag=alveo_port_7_txthroughput;formula=(low-cbf/connector/0/process_port_7_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 7 is transmitting too much"
    tag=alveo_port_9_txthroughput;formula=(low-cbf/connector/0/process_port_9_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 9 is transmitting too much"
    tag=alveo_port_11_txthroughput;formula=(low-cbf/connector/0/process_port_11_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 11 is transmitting too much"
    tag=alveo_port_13_txthroughput;formula=(low-cbf/connector/0/process_port_13_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 13 is transmitting too much"
    tag=alveo_port_15_txthroughput;formula=(low-cbf/connector/0/process_port_15_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 15 is transmitting too much"
    tag=alveo_port_17_txthroughput;formula=(low-cbf/connector/0/process_port_17_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 17 is transmitting too much"
    tag=pst_port_txthroughput;formula=(low-cbf/connector/0/process_port_28_tx_throughput > 95000000000);priority=log;group=none;message="PST port is transmitting too much"
    tag=pss_port_txthroughput;formula=(low-cbf/connector/0/process_port_29_tx_throughput > 95000000000);priority=log;group=none;message="PSS port is transmitting too much"
    tag=sdp_port_txthroughput;formula=(low-cbf/connector/0/process_port_32_tx_throughput > 95000000000);priority=log;group=none;message="SDP port is transmitting too much"
    tag=ptp_port_rxthroughput;formula=(low-cbf/connector/0/process_port_1_rx_throughput > 9000000000);priority=log;group=none;message="PTP port is receiving too much"
    tag=sps_port_2_rxthroughput;formula=(low-cbf/connector/0/process_port_2_rx_throughput > 35000000000);priority=log;group=none;message="SPS port 2 is receiving too much"
    tag=sps_port_5_rxthroughput;formula=(low-cbf/connector/0/process_port_5_rx_throughput > 35000000000);priority=log;group=none;message="SPS port 5 is receiving too much"
    tag=alveo_port_7_rxthroughput;formula=(low-cbf/connector/0/process_port_7_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 7 is receiving too much"
    tag=alveo_port_9_rxthroughput;formula=(low-cbf/connector/0/process_port_9_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 9 is receiving too much"
    tag=alveo_port_11_rxthroughput;formula=(low-cbf/connector/0/process_port_11_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 11 is receiving too much"
    tag=alveo_port_13_rxthroughput;formula=(low-cbf/connector/0/process_port_13_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 13 is receiving too much"
    tag=alveo_port_15_rxthroughput;formula=(low-cbf/connector/0/process_port_15_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 15 is receiving too much"
    tag=alveo_port_17_rxthroughput;formula=(low-cbf/connector/0/process_port_17_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 17 is receiving too much"
    tag=pst_port_rxthroughput;formula=(low-cbf/connector/0/process_port_28_rx_throughput > 95000000000);priority=log;group=none;message="PST port is receiving too much"
    tag=pss_port_rxthroughput;formula=(low-cbf/connector/0/process_port_29_rx_throughput > 95000000000);priority=log;group=none;message="PSS port is receiving too much"
    tag=sdp_port_rxthroughput;formula=(low-cbf/connector/0/process_port_32_rx_throughput > 95000000000);priority=log;group=none;message="SDP port is receiving too much"
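
The port-status and port-temperature entries above follow a regular pattern,
so they could be generated rather than written by hand. The sketch below is
one hypothetical way to do that in Python. The ``PORTS`` map, the
``alarm_line`` and ``port_alarms`` helpers, and their parameters are
illustrative names introduced here; only the port numbers, tags, messages and
the ``low-cbf/connector/0`` device name are taken from the listing above.

.. code-block:: python

    # Hypothetical port map: port number -> (tag prefix, human-readable label).
    PORTS = {
        1: ("ptp_port", "PTP port"),
        2: ("sps_port_2", "SPS port 2"),
        5: ("sps_port_5", "SPS port 5"),
        7: ("alveo_port_7", "Alveo port 7"),
        9: ("alveo_port_9", "Alveo port 9"),
        11: ("alveo_port_11", "Alveo port 11"),
        13: ("alveo_port_13", "Alveo port 13"),
        15: ("alveo_port_15", "Alveo port 15"),
        17: ("alveo_port_17", "Alveo port 17"),
        28: ("pst_port", "PST port"),
        29: ("pss_port", "PSS port"),
        32: ("sdp_port", "SDP port"),
    }


    def alarm_line(tag, formula, message):
        """Format one alarm entry in the style used in the listing above."""
        return (
            f"tag={tag};formula=({formula});"
            f'priority=log;group=none;message="{message}"'
        )


    def port_alarms(ports, device="low-cbf/connector/0", max_temperature=60):
        """Build the 'port is down' entries, then the 'port is hot' entries."""
        lines = []
        for number, (prefix, label) in ports.items():
            formula = f"{device}/diagnostics_port_{number}_up == 0"
            lines.append(alarm_line(f"{prefix}_monitor", formula, f"{label} is down"))
        for number, (prefix, label) in ports.items():
            formula = f"{device}/hardware_port_{number}_temperature > {max_temperature}"
            lines.append(alarm_line(f"{prefix}_temperature", formula, f"{label} is hot"))
        return lines


    if __name__ == "__main__":
        print("\n".join(port_alarms(PORTS)))

Keeping the port map in one place means that a change in port assignment only
needs to be made once, rather than in every alarm entry that mentions the port.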