Health Thresholds

This tutorial explains how to configure and modify device health thresholds in the SKA Low MCCS system, including the meaning of different threshold values and their behaviour.

Overview

The system uses a health rollup mechanism that aggregates the health of multiple subdevices (such as antennas, stations and beams) to determine the parent device’s overall health state.

Threshold Structure

Health thresholds are defined as tuples of three integer values:

(failed_threshold, failed_degraded_threshold, degraded_threshold)

Where:

  1. failed_threshold: Number of FAILED (or UNKNOWN) subdevices that cause the overall health to become FAILED

  2. failed_degraded_threshold: Number of FAILED (or UNKNOWN) subdevices that cause the overall health to become DEGRADED

  3. degraded_threshold: Number of DEGRADED subdevices that cause the overall health to become DEGRADED

The thresholds are applied in order: if the first condition is met, the second and third conditions are not evaluated.
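The ordering rule can be sketched as follows. This is a minimal illustration rather than the MCCS implementation: the function name `rollup_health` and its arguments are hypothetical, and it only handles positive thresholds (zero and negative values have the special meanings described in the following sections).

```python
def rollup_health(thresholds, n_failed, n_degraded):
    """Roll up subdevice counts into one health state (positive thresholds only).

    n_failed counts subdevices that are FAILED or UNKNOWN.
    """
    failed_threshold, failed_degraded_threshold, degraded_threshold = thresholds
    # The first condition that matches wins; later ones are not evaluated.
    if n_failed >= failed_threshold:
        return "FAILED"
    if n_failed >= failed_degraded_threshold:
        return "DEGRADED"
    if n_degraded >= degraded_threshold:
        return "DEGRADED"
    return "OK"


print(rollup_health((5, 2, 3), n_failed=6, n_degraded=0))  # FAILED
print(rollup_health((5, 2, 3), n_failed=2, n_degraded=0))  # DEGRADED
print(rollup_health((5, 2, 3), n_failed=0, n_degraded=1))  # OK
```

Because the FAILED check comes first, a count that satisfies both the first and second thresholds always reports FAILED.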

Threshold Value Meanings

Positive Values

Positive integers represent absolute counts of subdevices:

  • (5, 2, 3) means:

      - 5 or more failed subdevices → overall health becomes FAILED
      - 2 or more failed subdevices → overall health becomes DEGRADED
      - 3 or more degraded subdevices → overall health becomes DEGRADED

Zero Values

Zero has a special meaning that depends on the context. It is interpreted as the maximum number of non-FAILED sources (i.e. OK or DEGRADED) for which health should still roll up to FAILED:

  • (0, 1, 1) means:

      - 0 failed subdevices cause FAILED state (effectively “never” - all subdevices must fail)
      - 1 or more failed subdevices → overall health becomes DEGRADED
      - 1 or more degraded subdevices → overall health becomes DEGRADED

  • (1, 0, 1) means:

      - 1 or more failed subdevices → overall health becomes FAILED
      - 0 failed subdevices cause DEGRADED state (effectively “never” from failed devices)
      - 1 or more degraded subdevices → overall health becomes DEGRADED

  • (1, 1, 0) means:

      - 1 or more failed subdevices → overall health becomes FAILED
      - 1 or more failed subdevices → overall health becomes DEGRADED
      - 0 degraded subdevices cause DEGRADED state (effectively “never” from degraded devices)

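Note: with the last example, (1, 1, 0), a single failed subdevice results in FAILED rather than DEGRADED, because the thresholds are applied in order and the first condition is met first.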

Negative Values

Negative values are interpreted as “count from the total number of subdevices”. The system calculates the actual threshold by subtracting the absolute value of the negative number from the total number of subdevices.

Formula: actual_threshold = total_subdevices - abs(negative_value)

Examples with 10 subdevices:

  • (-1, -2, -3) becomes (9, 8, 7):

      - 9 or more failed subdevices (10 - 1) → overall health becomes FAILED
      - 8 or more failed subdevices (10 - 2) → overall health becomes DEGRADED
      - 7 or more degraded subdevices (10 - 3) → overall health becomes DEGRADED

  • (-3, -5, -4) becomes (7, 5, 6):

      - 7 or more failed subdevices (10 - 3) → overall health becomes FAILED
      - 5 or more failed subdevices (10 - 5) → overall health becomes DEGRADED
      - 6 or more degraded subdevices (10 - 4) → overall health becomes DEGRADED
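The formula can be checked with a short sketch. The helper name `resolve_thresholds` is hypothetical and is not part of the MCCS code; passing positive values through unchanged is an assumption made here for illustration.

```python
def resolve_thresholds(thresholds, total_subdevices):
    """Convert negative threshold values into absolute subdevice counts."""
    return tuple(
        # actual_threshold = total_subdevices - abs(negative_value)
        total_subdevices - abs(value) if value < 0 else value
        for value in thresholds
    )


print(resolve_thresholds((-1, -2, -3), 10))  # (9, 8, 7)
print(resolve_thresholds((-3, -5, -4), 10))  # (7, 5, 6)
```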

Alternative interpretation: a negative value states how many subdevices remain healthy at the point where the threshold triggers. For example, with 10 subdevices, -2 triggers once 8 have failed, that is, when only 2 remain healthy.

This approach is particularly useful for:

  • Percentage-based thresholds: Automatically adapts as the number of subdevices changes

  • Minimum operational requirements: Ensures a minimum number of devices remain functional

  • Scalable deployments: Thresholds scale with system size without manual reconfiguration

Practical example: In a station with 256 antennas, setting (-26, -13, -26) resolves to (230, 243, 230), which means:

      - FAILED when ≥230 antennas have failed (≤26 remain healthy), a ~90% failure rate
      - DEGRADED when ≥243 antennas have failed (≤13 remain healthy), a ~95% failure rate
      - DEGRADED when ≥230 antennas are degraded, a ~90% degradation rate

Device-Specific Examples

MccsController

The controller device manages multiple types of subdevices with different threshold strategies:

self._health_thresholds = {
    "subarrays": (0, 1, 1),        # Any failed → degraded, any degraded → degraded
    "stations": (
        min(np.ceil(len(self.MccsStations) / 4), 1),  # ~25% failed (max 1) → failed
        1,                                             # Any failed → degraded
        min(len(self.MccsStations), 2),               # 2+ degraded → degraded
    ),
    "subarraybeams": (0, 1, 1),    # Any failed → degraded, any degraded → degraded
    "stationbeams": (0, 1, 1),     # Any failed → degraded, any degraded → degraded
}

MccsStation

The station device manages antennas and other station components:

self._health_thresholds = {
    "antennas": (
        min(np.ceil(len(self.AntennaTrls) * 0.1), 25),  # 10% failed (max 25) → failed
        min(np.ceil(len(self.AntennaTrls) * 0.05), 12), # 5% failed (max 12) → degraded
        min(np.ceil(len(self.AntennaTrls) * 0.1), 25),  # 10% degraded (max 25) → degraded
    ),
    "fieldstation": (1, 1, 1),     # Any failed/degraded → corresponding state
    "spsstation": (1, 1, 1),       # Any failed/degraded → corresponding state
}
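To see what the antenna expressions evaluate to, the arithmetic can be reproduced outside the device. This is a sketch only: `default_antenna_thresholds` is a hypothetical helper, and it uses `math.ceil` from the standard library in place of `np.ceil` so that it runs without NumPy.

```python
import math


def default_antenna_thresholds(n_antennas):
    """Evaluate the antenna threshold expressions for a given antenna count."""
    return (
        min(math.ceil(n_antennas * 0.1), 25),   # 10% failed (max 25) → failed
        min(math.ceil(n_antennas * 0.05), 12),  # 5% failed (max 12) → degraded
        min(math.ceil(n_antennas * 0.1), 25),   # 10% degraded (max 25) → degraded
    )


print(default_antenna_thresholds(256))  # (25, 12, 25)
print(default_antenna_thresholds(16))   # (2, 1, 2)
```

With 256 antennas the caps apply (10% rounds up to 26 and 5% rounds up to 13), giving the same triple as the example output under Reading Active Thresholds.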

Setting and Modifying Thresholds

Reading Active Thresholds

To read the active health thresholds from a device:

import json
import tango

# Using Tango client
device = tango.DeviceProxy("low-mccs/station/ci-1")
active_thresholds = device.healthThresholds
print(json.loads(active_thresholds))

Example output:

{
    "antennas": [25, 12, 25],
    "fieldstation": [1, 1, 1],
    "spsstation": [1, 1, 1]
}

Modifying Thresholds

To modify health thresholds, provide a JSON string with the new values:

import json
import tango

# Connect to the device
device = tango.DeviceProxy("low-mccs/station/ci-1")

# Define new thresholds (note: JSON uses lists, but they're converted to tuples internally)
new_thresholds = {
    "antennas": [20, 10, 15],      # 20 failed → failed, 10 failed → degraded, 15 degraded → degraded
    "fieldstation": [1, 1, 1],     # Keep existing values
    "spsstation": [1, 1, 1]        # Keep existing values
}

# Apply the new thresholds
device.healthThresholds = json.dumps(new_thresholds)

# Verify the change
updated_thresholds = device.healthThresholds
print(json.loads(updated_thresholds))

Example output after modification:

{
    "antennas": [20, 10, 15],
    "fieldstation": [1, 1, 1],
    "spsstation": [1, 1, 1]
}

Note: You can modify individual threshold categories without affecting others. Only specify the categories you want to change.
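A partial update might look like the sketch below. The helper `build_threshold_update` is hypothetical and only builds and sanity-checks the JSON payload on the client side; the final write uses the `healthThresholds` attribute shown above and is left commented out because it needs a running device.

```python
import json


def build_threshold_update(**categories):
    """Return a JSON string containing only the categories to change."""
    for name, values in categories.items():
        # Each category takes exactly three integer thresholds.
        if len(values) != 3 or not all(isinstance(v, int) for v in values):
            raise ValueError(f"{name}: expected three integers, got {values!r}")
    return json.dumps({name: list(values) for name, values in categories.items()})


payload = build_threshold_update(antennas=[20, 10, 15])
print(payload)  # {"antennas": [20, 10, 15]}

# With a connected DeviceProxy (see the example above):
# device.healthThresholds = payload
```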

Validation and Error Handling

The system validates threshold keys and logs a message for each invalid key:

# "invalid_key" will be ignored and logged; "antennas" will still be applied
invalid_thresholds = {
    "invalid_key": [1, 1, 1],      # Invalid - not a recognised subdevice category
    "antennas": [5, 3, 4]          # Valid - will be applied
}

device.healthThresholds = json.dumps(invalid_thresholds)

Example log output for invalid keys:

INFO - Invalid Key Supplied: invalid_key. Allowed keys: dict_keys(['antennas', 'fieldstation', 'spsstation'])

The valid threshold will still be applied, while invalid keys are ignored. You can check the device logs to see which keys were rejected.
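If you prefer to catch rejected keys before writing, a client-side pre-check along these lines is one option. It is a sketch, not the device's own validation: `split_threshold_keys` is a hypothetical helper, and the allowed keys are hard-coded here from the log output above (they could instead be taken from the keys of the current `healthThresholds` value).

```python
def split_threshold_keys(requested, allowed_keys):
    """Separate requested thresholds into accepted entries and rejected key names."""
    valid = {key: value for key, value in requested.items() if key in allowed_keys}
    rejected = sorted(set(requested) - set(valid))
    return valid, rejected


allowed = {"antennas", "fieldstation", "spsstation"}
valid, rejected = split_threshold_keys(
    {"invalid_key": [1, 1, 1], "antennas": [5, 3, 4]},
    allowed,
)
print(valid)     # {'antennas': [5, 3, 4]}
print(rejected)  # ['invalid_key']
```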

Common Scenarios

High Availability Setup

For systems requiring high availability, use conservative thresholds:

# Very sensitive to any failures (category names are illustrative)
high_availability_thresholds = {
    "critical_devices": [1, 1, 1],    # Any issue → immediate state change
    "redundant_arrays": [2, 1, 2]     # Minimal tolerance for failures
}

Development/Testing Environment

For development environments, you might want more tolerant thresholds:

# More tolerant for development
development_thresholds = {
    "antennas": [50, 20, 30],         # Allow many failures before state change
    "stations": [5, 2, 3]             # Moderate tolerance
}

Scalable Deployments with Negative Thresholds

Use negative thresholds when you want behaviour that automatically adapts to system size:

# Size-relative thresholds using negative values
scalable_thresholds = {
    "antennas": [-26, -13, -26],      # With 256 antennas: 230 failed → failed, 243 failed → degraded, 230 degraded → degraded
    "stations": [-2, -1, -2]          # ≤2 stations remaining → failed, ≤1 remaining → degraded
}
}

Example with 100 antennas:

antenna_thresholds = [-10, -5, -15]  # Becomes [90, 95, 85] actual thresholds
# FAILED when ≥90 antennas fail (≤10 remain healthy) - 90% failure rate
# DEGRADED when ≥95 antennas fail (≤5 remain healthy) - 95% failure rate
# DEGRADED when ≥85 antennas degraded (≤15 remain healthy) - 85% degradation rate

Example with 20 antennas (same configuration):

antenna_thresholds = [-10, -5, -15]  # Becomes [10, 15, 5] actual thresholds
# FAILED when ≥10 antennas fail (≤10 remain healthy) - 50% failure rate
# DEGRADED when ≥15 antennas fail (≤5 remain healthy) - 75% failure rate
# DEGRADED when ≥5 antennas degraded (≤15 remain healthy) - 25% degradation rate

Gradual Degradation Detection

To detect gradual system degradation early:

# Sensitive to degraded states ("devices" is an illustrative category name)
early_warning_thresholds = {
    "devices": [10, 3, 2]             # Very low degraded threshold for early warning
}

Debugging Health States

Use the health report to understand why a device is in a particular state:

import json
import tango

device = tango.DeviceProxy("low-mccs/station/ci-1")
health_report = device.healthReport
print(json.loads(health_report))

This will show the health of all subdevices and help identify which ones are causing the overall health state.