=================
Health Thresholds
=================

This tutorial explains how to configure and modify device health thresholds in
the SKA Low MCCS system, including the meaning of the different threshold values
and how they behave.

Overview
========

The system uses a health rollup mechanism that aggregates the health of multiple
subdevices (such as antennas, stations and beams) to determine the parent
device's overall health state.

Threshold Structure
===================

Health thresholds are defined as tuples of three integer values::

    (failed_threshold, failed_degraded_threshold, degraded_threshold)

Where:

1. **failed_threshold**: Number of FAILED (or UNKNOWN) subdevices that cause the
   overall health to become FAILED
2. **failed_degraded_threshold**: Number of FAILED (or UNKNOWN) subdevices that
   cause the overall health to become DEGRADED
3. **degraded_threshold**: Number of DEGRADED subdevices that cause the overall
   health to become DEGRADED

The thresholds are applied in order: once a condition is met, the remaining
conditions are not evaluated.

Threshold Value Meanings
========================

Positive Values
---------------

Positive integers represent absolute counts of subdevices:

- ``(5, 2, 3)`` means:

  - 5 or more failed subdevices → overall health becomes FAILED
  - 2 or more failed subdevices → overall health becomes DEGRADED
  - 3 or more degraded subdevices → overall health becomes DEGRADED

Zero Values
-----------

Zero has a special meaning that depends on the context. It is interpreted as the
maximum number of non-FAILED sources (i.e. OK or DEGRADED) for which health
should still roll up to FAILED:

- ``(0, 1, 1)`` means:

  - 0 failed subdevices cause FAILED state (effectively "never": all subdevices
    must fail)
  - 1 or more failed subdevices → overall health becomes DEGRADED
  - 1 or more degraded subdevices → overall health becomes DEGRADED

- ``(1, 0, 1)`` means:

  - 1 or more failed subdevices → overall health becomes FAILED
  - 0 failed subdevices cause DEGRADED state (effectively "never" from failed
    devices)
  - 1 or more degraded subdevices → overall health becomes DEGRADED

- ``(1, 1, 0)`` means:

  - 1 or more failed subdevices → overall health becomes FAILED
  - 1 or more failed subdevices → overall health becomes DEGRADED
  - 0 degraded subdevices cause DEGRADED state (effectively "never" from
    degraded devices)

Note: the last example will default to FAILED, as explained previously.

Negative Values
---------------

Negative values are interpreted as "count from the total number of subdevices".
The system calculates the actual threshold by subtracting the absolute value of
the negative number from the total number of subdevices.

**Formula**: ``actual_threshold = total_subdevices - abs(negative_value)``

Examples with 10 subdevices:

- ``(-1, -2, -3)`` becomes ``(9, 8, 7)``:

  - 9 or more failed subdevices (10 - 1) → overall health becomes FAILED
  - 8 or more failed subdevices (10 - 2) → overall health becomes DEGRADED
  - 7 or more degraded subdevices (10 - 3) → overall health becomes DEGRADED

- ``(-3, -5, -4)`` becomes ``(7, 5, 6)``:

  - 7 or more failed subdevices (10 - 3) → overall health becomes FAILED
  - 5 or more failed subdevices (10 - 5) → overall health becomes DEGRADED
  - 6 or more degraded subdevices (10 - 4) → overall health becomes DEGRADED

**Alternative interpretation**: Negative values represent "how many devices must
remain healthy". For example, ``-2`` means "at least 2 devices must remain
healthy", so when 8 out of 10 fail, only 2 remain healthy, triggering the
threshold.
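The negative-value formula and the in-order evaluation described under Threshold
Structure can be illustrated with a minimal sketch. This is not the MCCS
implementation: the function names ``resolve_thresholds`` and ``rollup`` are
hypothetical, and the special handling of zero values is deliberately left out.

```python
def resolve_thresholds(thresholds, total_subdevices):
    """Convert negative thresholds to absolute counts (hypothetical helper).

    Applies actual_threshold = total_subdevices - abs(negative_value);
    non-negative values are returned unchanged.
    """
    return tuple(
        total_subdevices - abs(value) if value < 0 else value
        for value in thresholds
    )


def rollup(failed, degraded, thresholds, total_subdevices):
    """Evaluate the three conditions in order (hypothetical helper).

    `failed` counts FAILED or UNKNOWN subdevices, `degraded` counts
    DEGRADED subdevices.
    """
    resolved = resolve_thresholds(thresholds, total_subdevices)
    if any(value <= 0 for value in resolved):
        # Zero has special semantics that this sketch does not model.
        raise ValueError("zero or out-of-range thresholds are not modelled")
    failed_t, failed_degraded_t, degraded_t = resolved
    if failed >= failed_t:
        return "FAILED"
    if failed >= failed_degraded_t:
        return "DEGRADED"
    if degraded >= degraded_t:
        return "DEGRADED"
    return "OK"


print(resolve_thresholds((-1, -2, -3), 10))  # (9, 8, 7)
print(rollup(8, 0, (-1, -2, -3), 10))        # DEGRADED
print(rollup(9, 0, (-1, -2, -3), 10))        # FAILED
```

Resolving the tuple first keeps the comparison logic identical for positive and
negative thresholds.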
Negative thresholds are particularly useful for:

- **Size-relative thresholds**: The absolute count adapts automatically as the
  number of subdevices changes
- **Minimum operational requirements**: Ensures a minimum number of devices
  remain functional
- **Scalable deployments**: Thresholds scale with system size without manual
  reconfiguration

**Practical example**: In a station with 256 antennas, setting
``(-26, -13, -26)`` resolves to ``(230, 243, 230)``, which means:

- FAILED when ≥230 antennas have failed (≤26 remain healthy), a ~90% failure rate
- DEGRADED when ≥243 antennas have failed (≤13 remain healthy), a ~95% failure
  rate
- DEGRADED when ≥230 antennas are degraded (≤26 remain healthy), a ~90%
  degradation rate

Because the thresholds are applied in order, the second condition is never
reached with these values: FAILED already applies at 230 failures.

Device-Specific Examples
========================

MccsController
--------------

The controller device manages multiple types of subdevices with different
threshold strategies::

    self._health_thresholds = {
        "subarrays": (0, 1, 1),  # Any failed → degraded, any degraded → degraded
        "stations": (
            min(np.ceil(len(self.MccsStations) / 4), 1),  # ~25% failed (max 1) → failed
            1,  # Any failed → degraded
            min(len(self.MccsStations), 2),  # 2+ degraded → degraded
        ),
        "subarraybeams": (0, 1, 1),  # Any failed → degraded, any degraded → degraded
        "stationbeams": (0, 1, 1),  # Any failed → degraded, any degraded → degraded
    }

MccsStation
-----------

The station device manages antennas and other station components::

    self._health_thresholds = {
        "antennas": (
            min(np.ceil(len(self.AntennaTrls) * 0.1), 25),  # 10% failed (max 25) → failed
            min(np.ceil(len(self.AntennaTrls) * 0.05), 12),  # 5% failed (max 12) → degraded
            min(np.ceil(len(self.AntennaTrls) * 0.1), 25),  # 10% degraded (max 25) → degraded
        ),
        "fieldstation": (1, 1, 1),  # Any failed/degraded → corresponding state
        "spsstation": (1, 1, 1),  # Any failed/degraded → corresponding state
    }

Setting and Modifying Thresholds
=================================

Reading Active Thresholds
-------------------------

To read the active health thresholds from a device::

    import json
    import tango

    # Using Tango client
    device = tango.DeviceProxy("low-mccs/station/ci-1")
    active_thresholds = device.healthThresholds
    print(json.loads(active_thresholds))

Example output::

    {
        "antennas": [25, 12, 25],
        "fieldstation": [1, 1, 1],
        "spsstation": [1, 1, 1]
    }

Modifying Thresholds
--------------------

To modify health thresholds, provide a JSON string with the new values::

    import json
    import tango

    # Connect to the device
    device = tango.DeviceProxy("low-mccs/station/ci-1")

    # Define new thresholds (note: JSON uses lists, but they're converted to tuples internally)
    new_thresholds = {
        "antennas": [20, 10, 15],   # 20 failed → failed, 10 failed → degraded, 15 degraded → degraded
        "fieldstation": [1, 1, 1],  # Keep existing values
        "spsstation": [1, 1, 1]     # Keep existing values
    }

    # Apply the new thresholds
    device.healthThresholds = json.dumps(new_thresholds)

    # Verify the change
    updated_thresholds = device.healthThresholds
    print(json.loads(updated_thresholds))

Example output after modification::

    {
        "antennas": [20, 10, 15],
        "fieldstation": [1, 1, 1],
        "spsstation": [1, 1, 1]
    }

**Note**: You can modify individual threshold categories without affecting
others. Only specify the categories you want to change.

Validation and Error Handling
------------------------------

The system validates threshold keys and logs any invalid keys::

    # The invalid key is ignored and logged; the valid key is still applied
    invalid_thresholds = {
        "invalid_key": [1, 1, 1],  # Invalid - not a recognised subdevice category
        "antennas": [5, 3, 4]      # Valid - will be applied
    }

    device.healthThresholds = json.dumps(invalid_thresholds)

Example log output for invalid keys::

    INFO - Invalid Key Supplied: invalid_key. Allowed keys: dict_keys(['antennas', 'fieldstation', 'spsstation'])

The valid threshold will still be applied, while invalid keys are ignored. You
can check the device logs to see which keys were rejected.
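Since rejected keys only show up in the device logs, a client may prefer to
check its keys before writing. The sketch below is one way to do that, assuming
the keys returned when reading ``healthThresholds`` are the allowed categories
for that device. The helper name ``filter_threshold_keys`` is hypothetical, and
the JSON string stands in for a value read from a real device.

```python
import json


def filter_threshold_keys(requested, active_thresholds_json):
    """Split requested thresholds into valid entries and rejected keys.

    `active_thresholds_json` is the JSON string read from the device's
    healthThresholds attribute; its keys are assumed to be the allowed ones.
    """
    allowed = set(json.loads(active_thresholds_json))
    valid = {key: value for key, value in requested.items() if key in allowed}
    rejected = sorted(set(requested) - allowed)
    return valid, rejected


# Stand-in for `device.healthThresholds` read from a station device.
active = '{"antennas": [25, 12, 25], "fieldstation": [1, 1, 1], "spsstation": [1, 1, 1]}'

requested = {"invalid_key": [1, 1, 1], "antennas": [5, 3, 4]}
valid, rejected = filter_threshold_keys(requested, active)
print(valid)     # {'antennas': [5, 3, 4]}
print(rejected)  # ['invalid_key']
```

The ``valid`` dictionary can then be passed to ``json.dumps`` and written to the
attribute, while ``rejected`` can be reported to the user directly.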
Common Scenarios
================

High Availability Setup
-----------------------

For systems requiring high availability, use conservative thresholds (the
category names below are illustrative; use the keys reported by your device)::

    # Very sensitive to any failures
    high_availability_thresholds = {
        "critical_devices": [1, 1, 1],  # Any issue → immediate state change
        "redundant_arrays": [2, 1, 2]   # Minimal tolerance for failures
    }

Development/Testing Environment
-------------------------------

For development environments, you might want more tolerant thresholds::

    # More tolerant for development
    development_thresholds = {
        "antennas": [50, 20, 30],  # Allow many failures before state change
        "stations": [5, 2, 3]      # Moderate tolerance
    }

Scalable Deployments with Negative Thresholds
----------------------------------------------

Use negative thresholds when you want behaviour that automatically adapts to
system size::

    # Size-relative thresholds using negative values
    scalable_thresholds = {
        # With 256 antennas: ~90% failed → failed, ~95% failed → degraded, ~90% degraded → degraded
        "antennas": [-26, -13, -26],
        "stations": [-2, -1, -2]  # Keep at least 2 healthy, 1 healthy triggers degraded
    }

**Example with 100 antennas**::

    antenna_thresholds = [-10, -5, -15]
    # Becomes [90, 95, 85] actual thresholds
    # FAILED when ≥90 antennas fail (≤10 remain healthy) - 90% failure rate
    # DEGRADED when ≥95 antennas fail (≤5 remain healthy) - 95% failure rate
    # DEGRADED when ≥85 antennas degraded (≤15 remain healthy) - 85% degradation rate

**Example with 20 antennas** (same configuration)::

    antenna_thresholds = [-10, -5, -15]
    # Becomes [10, 15, 5] actual thresholds
    # FAILED when ≥10 antennas fail (≤10 remain healthy) - 50% failure rate
    # DEGRADED when ≥15 antennas fail (≤5 remain healthy) - 75% failure rate
    # DEGRADED when ≥5 antennas degraded (≤15 remain healthy) - 25% degradation rate

Gradual Degradation Detection
-----------------------------

To detect gradual system degradation early::

    # Sensitive to degraded states
    early_warning_thresholds = {
        "devices": [10, 3, 2]  # Very low degraded threshold for early warning
    }

Debugging Health States
-----------------------

Use the health report to understand why a device is in a particular state::

    import json
    import tango

    device = tango.DeviceProxy("low-mccs/station/ci-1")
    health_report = device.healthReport
    print(json.loads(health_report))

This will show the health of all subdevices and help identify which ones are
causing the overall health state.

Related Attributes
==================

- ``healthThresholds``: Read/write active health thresholds
- ``healthReport``: Read-only detailed health information
- ``healthState``: Read-only overall health state
- ``healthModelParams``: Legacy health model parameters (if old health model is
  active)

For more information about the health model architecture, see the API
documentation for the specific device classes.