Low CBF Connector Health

Philosophy

Refer to the Low CBF Health documentation for details of the overall Low CBF philosophy on health monitoring.

The Connector device adopts this philosophy, summarised as:

  • Health means the Connector device’s ability to perform its function.

  • The healthState attribute is used to report the health state of the Connector device.

    • The Connector reports the health of its underlying hardware as well as its programming and control software.

  • The information used to evaluate its health state will be exposed as Tango attributes for use in GUIs or as alarms.

Health State Definitions

See the Low CBF Health documentation and ska_control_model.HealthState.

Health Attributes

In line with the Low CBF Health documentation, Low CBF Connector devices have three health “category” attributes.

The healthState attribute will report the worst case of the three categories.
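
The "worst case" rule can be sketched in Python as below. This is a minimal illustration, not the Connector's actual code: the local HealthState enum is a stand-in for ska_control_model.HealthState (with FAILED equal to 2, as used in the alarm formulas in this document), and both the worst_health function and the position of UNKNOWN in the severity order are assumptions.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Local stand-in for ska_control_model.HealthState (assumed values)."""

    OK = 0
    DEGRADED = 1
    FAILED = 2
    UNKNOWN = 3


# Assumed severity order, least to most severe.
# Where UNKNOWN ranks relative to DEGRADED is a guess made for this sketch.
SEVERITY = [
    HealthState.OK,
    HealthState.UNKNOWN,
    HealthState.DEGRADED,
    HealthState.FAILED,
]


def worst_health(hardware, function, process):
    """Return the most severe of the three category health states."""
    return max((hardware, function, process), key=SEVERITY.index)
```

With this sketch, a DEGRADED hardware category combined with two OK categories gives an overall DEGRADED, and any FAILED category gives an overall FAILED.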

health_hardware

Summarises the health of the hardware by aggregating these attributes:

| Attribute | Description |
|---|---|
| hardware_motherboard_temperature | Motherboard average temperature |
| hardware_tofino_temperature | Tofino average temperature |
| hardware_port_<n>_temperature | QSFP temperature, one attribute per port |
| hardware_port_<n>_tx_power | QSFP transmit power, one attribute per port |
| hardware_port_<n>_rx_power | QSFP received power, one attribute per port |

health_function

Summarises the health of the ‘functional’ layer by aggregating these attributes:

| Attribute | Description |
|---|---|
| function_p4_code_version_present | P4 code version is present |
| function_sdp_port_up | Percentage of SDP ports up |
| function_alveo_port_up | Percentage of Alveo ports up |
| function_sps_port_up | Percentage of SPS ports up |
| function_pss_port_up | Percentage of PSS ports up |
| function_pst_port_up | Percentage of PST ports up |
| function_valid_routing_port | Routes are valid (using active ports) |
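
As an illustration of how a "percentage of ports up" value could be derived for one group of ports, here is a minimal Python sketch. The function name percent_up, the dictionary input and the handling of an empty group are assumptions made for this example and do not describe the Connector's actual implementation.

```python
def percent_up(port_states):
    """Return the percentage (0 to 100) of ports that are up.

    `port_states` maps a port number to True (up) or False (down).
    """
    if not port_states:
        # Assumption: a group with no ports has nothing that can be down.
        return 100.0
    return 100.0 * sum(port_states.values()) / len(port_states)
```

For example, a group of two ports with one down yields 50.0, and a group of six ports with one down yields roughly 83.3.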

health_process

Summarises the health of the ‘processing’ layer by aggregating these attributes:

| Attribute | Description |
|---|---|
| process_port_<n>_tx_throughput | Transmit throughput, one attribute per port |
| process_port_<n>_rx_throughput | Receive throughput, one attribute per port |
| process_port_<n>_queue_occupancy | Queue occupancy, one attribute per port |

Note: Queue occupancy is planned but not yet implemented.

Implementation Details

A YAML configuration file will provide settings that control how each attribute contributes to health state:

# motherboard average temperature
hardware_motherboard_temperature:
  # trigger FAILED when outside the interval (-20,70)
  fail_limits: [-20, 70]
  # set DEGRADED when outside this interval (but not the fail interval)
  degrade_limits: [0, 55]

# tofino average temperature
hardware_tofino_temperature:
  fail_limits: [-20, 80]
  degrade_limits: [0, 65]

# QSFP temperature
hardware_port_number_temperature:
  fail_limits: [-20, 80]
  degrade_limits: [0, 65]

# QSFP Tx Power
hardware_port_number_tx_power:
  fail_limits: [-40, 10]
  degrade_limits: [-40, 5]

# QSFP Rx Power
hardware_port_number_rx_power:
  fail_limits: [-40, 10]
  degrade_limits: [-40, 5]

# Tofino code installed
function_p4_code_version_present:
  fail_state: false

# percentage of SDP ports up
function_sdp_port_up:
  fail_limits: [80, 100]
  degrade_limits: [90, 100]

# percentage of Alveo ports up
function_alveo_port_up:
  fail_limits: [80, 100]
  degrade_limits: [90, 100]

# percentage of SPS ports up
function_sps_port_up:
  fail_limits: [80, 100]
  degrade_limits: [90, 100]

# percentage of PSS ports up
function_pss_port_up:
  fail_limits: [80, 100]
  degrade_limits: [90, 100]

# percentage of PST ports up
function_pst_port_up:
  fail_limits: [80, 100]
  degrade_limits: [90, 100]

# Routes are valid (using active ports)
function_valid_routing_port:
  fail_state: false

# Tx throughput per port
process_port_number_tx_throughput:
  fail_limits: [0, 95]
  degrade_limits: [0, 75]

# Rx throughput per port
process_port_number_rx_throughput:
  fail_limits: [0, 95]
  degrade_limits: [0, 75]

# Queue percentage occupancy per port
process_port_number_queue_occupancy:
  fail_limits: [0, 95]
  degrade_limits: [0, 75]
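
The way these settings map an attribute value to a health state can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Connector's actual code: the evaluate function is hypothetical, it takes one entry of the YAML file already parsed into a dict, values exactly on a limit are assumed to count as inside the interval, and fail_state is assumed to mean "FAILED when the attribute equals this value".

```python
def evaluate(value, settings):
    """Map one attribute value to "OK", "DEGRADED" or "FAILED".

    `settings` is one entry of the YAML file, already parsed into a dict.
    """
    if "fail_state" in settings:
        # Boolean attributes (assumption): FAILED when the value equals fail_state.
        return "FAILED" if value == settings["fail_state"] else "OK"

    fail_low, fail_high = settings["fail_limits"]
    if not fail_low <= value <= fail_high:
        # Outside the fail interval.
        return "FAILED"

    # If no degrade interval is given, fall back to the fail interval.
    degrade_low, degrade_high = settings.get("degrade_limits", (fail_low, fail_high))
    if not degrade_low <= value <= degrade_high:
        # Inside the fail interval but outside the degrade interval.
        return "DEGRADED"
    return "OK"
```

Under these assumptions, a motherboard temperature of 60 with fail_limits [-20, 70] and degrade_limits [0, 55] evaluates to DEGRADED, and a port-up percentage of 75 with fail_limits [80, 100] evaluates to FAILED.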

Alarm Attributes

As detailed in the Low CBF Alarm documentation, alarms will be generated and configured via the Elettra AlarmHandler Tango device.

The connector device features a range of Tango attributes for various purposes, including internal control parameters, health monitoring, telescope operation monitoring, diagnostics, and fault troubleshooting.

This document presents a generic configuration of the alarm handler based on the currently available attributes. Specifically, we propose attaching alarms to two main sets of attributes: health attributes and port-related attributes.

Alarms for health

The first set of generic alarms covers the health attributes described above. Specifically, we propose raising an alarm whenever any of them reports FAILED.

| Attribute | Description | Alarm Trigger |
|---|---|---|
| healthState | General Connector Health State | State is FAILED |
| health_hardware | Connector Hardware Health State | State is FAILED |
| health_function | Connector Function Health State | State is FAILED |
| health_process | Connector Process Health State | State is FAILED |

Alarms for port status

In addition, we recommend raising the following per-port alarms:

| Attribute | Description | Alarm Trigger |
|---|---|---|
| process_port_<n>_tx_throughput | Transmit throughput, one attribute per port | If > 90% of capacity |
| process_port_<n>_rx_throughput | Receive throughput, one attribute per port | If > 90% of capacity |
| diagnostics_port_<n>_up | Status of the port, up or down | Port is down |
| hardware_port_<n>_temperature | QSFP temperature, one attribute per port | If > 60 °C |

Implementation details

Using the Elettra alarm handler we can configure the described alarms as follows:

tag=health_state_connector_0;formula=(low-cbf/connector/0/healthState == 2);priority=log;group=none;message="General Health State of Connector is FAILED"
tag=health_hardware_connector_0;formula=(low-cbf/connector/0/health_hardware == 2);priority=log;group=none;message="Hardware Health State of Connector is FAILED"
tag=health_function_connector_0;formula=(low-cbf/connector/0/health_function == 2);priority=log;group=none;message="Function Health State of Connector is FAILED"
tag=health_process_connector_0;formula=(low-cbf/connector/0/health_process == 2);priority=log;group=none;message="Process Health State of Connector is FAILED"
tag=ptp_port_monitor;formula=(low-cbf/connector/0/diagnostics_port_1_up == 0);priority=log;group=none;message="PTP port is down"
tag=sps_port_2_monitor;formula=(low-cbf/connector/0/diagnostics_port_2_up == 0);priority=log;group=none;message="SPS port 2 is down"
tag=sps_port_5_monitor;formula=(low-cbf/connector/0/diagnostics_port_5_up == 0);priority=log;group=none;message="SPS port 5 is down"
tag=alveo_port_7_monitor;formula=(low-cbf/connector/0/diagnostics_port_7_up == 0);priority=log;group=none;message="Alveo port 7 is down"
tag=alveo_port_9_monitor;formula=(low-cbf/connector/0/diagnostics_port_9_up == 0);priority=log;group=none;message="Alveo port 9 is down"
tag=alveo_port_11_monitor;formula=(low-cbf/connector/0/diagnostics_port_11_up == 0);priority=log;group=none;message="Alveo port 11 is down"
tag=alveo_port_13_monitor;formula=(low-cbf/connector/0/diagnostics_port_13_up == 0);priority=log;group=none;message="Alveo port 13 is down"
tag=alveo_port_15_monitor;formula=(low-cbf/connector/0/diagnostics_port_15_up == 0);priority=log;group=none;message="Alveo port 15 is down"
tag=alveo_port_17_monitor;formula=(low-cbf/connector/0/diagnostics_port_17_up == 0);priority=log;group=none;message="Alveo port 17 is down"
tag=pst_port_monitor;formula=(low-cbf/connector/0/diagnostics_port_28_up == 0);priority=log;group=none;message="PST port is down"
tag=pss_port_monitor;formula=(low-cbf/connector/0/diagnostics_port_29_up == 0);priority=log;group=none;message="PSS port is down"
tag=sdp_port_monitor;formula=(low-cbf/connector/0/diagnostics_port_32_up == 0);priority=log;group=none;message="SDP port is down"
tag=ptp_port_temperature;formula=(low-cbf/connector/0/hardware_port_1_temperature > 60);priority=log;group=none;message="PTP port is hot"
tag=sps_port_2_temperature;formula=(low-cbf/connector/0/hardware_port_2_temperature > 60);priority=log;group=none;message="SPS port 2 is hot"
tag=sps_port_5_temperature;formula=(low-cbf/connector/0/hardware_port_5_temperature > 60);priority=log;group=none;message="SPS port 5 is hot"
tag=alveo_port_7_temperature;formula=(low-cbf/connector/0/hardware_port_7_temperature > 60);priority=log;group=none;message="Alveo port 7 is hot"
tag=alveo_port_9_temperature;formula=(low-cbf/connector/0/hardware_port_9_temperature > 60);priority=log;group=none;message="Alveo port 9 is hot"
tag=alveo_port_11_temperature;formula=(low-cbf/connector/0/hardware_port_11_temperature > 60);priority=log;group=none;message="Alveo port 11 is hot"
tag=alveo_port_13_temperature;formula=(low-cbf/connector/0/hardware_port_13_temperature > 60);priority=log;group=none;message="Alveo port 13 is hot"
tag=alveo_port_15_temperature;formula=(low-cbf/connector/0/hardware_port_15_temperature > 60);priority=log;group=none;message="Alveo port 15 is hot"
tag=alveo_port_17_temperature;formula=(low-cbf/connector/0/hardware_port_17_temperature > 60);priority=log;group=none;message="Alveo port 17 is hot"
tag=pst_port_temperature;formula=(low-cbf/connector/0/hardware_port_28_temperature > 60);priority=log;group=none;message="PST port is hot"
tag=pss_port_temperature;formula=(low-cbf/connector/0/hardware_port_29_temperature > 60);priority=log;group=none;message="PSS port is hot"
tag=sdp_port_temperature;formula=(low-cbf/connector/0/hardware_port_32_temperature > 60);priority=log;group=none;message="SDP port is hot"
tag=ptp_port_txthroughput;formula=(low-cbf/connector/0/process_port_1_tx_throughput > 9000000000);priority=log;group=none;message="PTP port is transmitting too much"
tag=sps_port_2_txthroughput;formula=(low-cbf/connector/0/process_port_2_tx_throughput > 35000000000);priority=log;group=none;message="SPS port 2 is transmitting too much"
tag=sps_port_5_txthroughput;formula=(low-cbf/connector/0/process_port_5_tx_throughput > 35000000000);priority=log;group=none;message="SPS port 5 is transmitting too much"
tag=alveo_port_7_txthroughput;formula=(low-cbf/connector/0/process_port_7_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 7 is transmitting too much"
tag=alveo_port_9_txthroughput;formula=(low-cbf/connector/0/process_port_9_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 9 is transmitting too much"
tag=alveo_port_11_txthroughput;formula=(low-cbf/connector/0/process_port_11_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 11 is transmitting too much"
tag=alveo_port_13_txthroughput;formula=(low-cbf/connector/0/process_port_13_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 13 is transmitting too much"
tag=alveo_port_15_txthroughput;formula=(low-cbf/connector/0/process_port_15_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 15 is transmitting too much"
tag=alveo_port_17_txthroughput;formula=(low-cbf/connector/0/process_port_17_tx_throughput > 95000000000);priority=log;group=none;message="Alveo port 17 is transmitting too much"
tag=pst_port_txthroughput;formula=(low-cbf/connector/0/process_port_28_tx_throughput > 95000000000);priority=log;group=none;message="PST port is transmitting too much"
tag=pss_port_txthroughput;formula=(low-cbf/connector/0/process_port_29_tx_throughput > 95000000000);priority=log;group=none;message="PSS port is transmitting too much"
tag=sdp_port_txthroughput;formula=(low-cbf/connector/0/process_port_32_tx_throughput > 95000000000);priority=log;group=none;message="SDP port is transmitting too much"
tag=ptp_port_rxthroughput;formula=(low-cbf/connector/0/process_port_1_rx_throughput > 9000000000);priority=log;group=none;message="PTP port is receiving too much"
tag=sps_port_2_rxthroughput;formula=(low-cbf/connector/0/process_port_2_rx_throughput > 35000000000);priority=log;group=none;message="SPS port 2 is receiving too much"
tag=sps_port_5_rxthroughput;formula=(low-cbf/connector/0/process_port_5_rx_throughput > 35000000000);priority=log;group=none;message="SPS port 5 is receiving too much"
tag=alveo_port_7_rxthroughput;formula=(low-cbf/connector/0/process_port_7_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 7 is receiving too much"
tag=alveo_port_9_rxthroughput;formula=(low-cbf/connector/0/process_port_9_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 9 is receiving too much"
tag=alveo_port_11_rxthroughput;formula=(low-cbf/connector/0/process_port_11_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 11 is receiving too much"
tag=alveo_port_13_rxthroughput;formula=(low-cbf/connector/0/process_port_13_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 13 is receiving too much"
tag=alveo_port_15_rxthroughput;formula=(low-cbf/connector/0/process_port_15_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 15 is receiving too much"
tag=alveo_port_17_rxthroughput;formula=(low-cbf/connector/0/process_port_17_rx_throughput > 95000000000);priority=log;group=none;message="Alveo port 17 is receiving too much"
tag=pst_port_rxthroughput;formula=(low-cbf/connector/0/process_port_28_rx_throughput > 95000000000);priority=log;group=none;message="PST port is receiving too much"
tag=pss_port_rxthroughput;formula=(low-cbf/connector/0/process_port_29_rx_throughput > 95000000000);priority=log;group=none;message="PSS port is receiving too much"
tag=sdp_port_rxthroughput;formula=(low-cbf/connector/0/process_port_32_rx_throughput > 95000000000);priority=log;group=none;message="SDP port is receiving too much"
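
Because these entries are highly repetitive, they could be generated rather than written by hand. The following Python sketch reproduces the port-down and over-temperature entries in the layout shown above. The helper names alarm_line and port_alarms, the PORTS list and the max_temperature parameter are hypothetical and exist only for illustration; only the string layout, device name and port numbers are taken from the entries above.

```python
DEVICE = "low-cbf/connector/0"

# (tag prefix, port number, label used in the message), copied from the entries above.
PORTS = [
    ("ptp_port", 1, "PTP port"),
    ("sps_port_2", 2, "SPS port 2"),
    ("alveo_port_7", 7, "Alveo port 7"),
]


def alarm_line(tag, formula, message):
    """Format one alarm entry in the layout used above."""
    return f'tag={tag};formula=({formula});priority=log;group=none;message="{message}"'


def port_alarms(prefix, port, label, max_temperature=60):
    """Return the port-down and over-temperature alarm entries for one port."""
    return [
        alarm_line(
            f"{prefix}_monitor",
            f"{DEVICE}/diagnostics_port_{port}_up == 0",
            f"{label} is down",
        ),
        alarm_line(
            f"{prefix}_temperature",
            f"{DEVICE}/hardware_port_{port}_temperature > {max_temperature}",
            f"{label} is hot",
        ),
    ]


if __name__ == "__main__":
    for prefix, port, label in PORTS:
        for line in port_alarms(prefix, port, label):
            print(line)
```

Keeping the port list in one place means a change of port assignment or temperature threshold only needs to be made once before regenerating the entries.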