Troubleshooting
This page documents solutions to issues that arose while deploying PST in an integrated environment.
PST LMC and Core pods stuck in “Pending”
During the CSP / CBF / PST integration work in the Low PSI it was found that sometimes PST LMC and Core pods would stay in a “Pending” state as shown in the image below:
Performing
kubectl describe pod low-pst-beam-01-0
Returned
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 18s (x6 over 6m19s) default-scheduler 0/10 nodes are available: 1 node(s) didn't match pod affinity rules, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: true}, 1 node(s) had untolerated taint {skao.int/dedicated: fpga-dev03}, 1 node(s) had untolerated taint {skao.int/dedicated: low-cbf-p4}, 1 node(s) had untolerated taint {skao.int/dedicated: perentie-old}, 1 node(s) had untolerated taint {skao.int/dedicated: perentie}, 1 node(s) had untolerated taint {skao.int/dedicated: pst}, 3 node(s) had volume node affinity conflict. preemption: 0/10 nodes are available: 10 Preemption is not helpful for scheduling.
When checking the PST Core pod we got the following:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 3m25s (x6 over 9m28s) default-scheduler 0/10 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: true}, 1 node(s) had untolerated taint {skao.int/dedicated: fpga-dev03}, 1 node(s) had untolerated taint {skao.int/dedicated: low-cbf-p4}, 1 node(s) had untolerated taint {skao.int/dedicated: perentie-old}, 1 node(s) had untolerated taint {skao.int/dedicated: perentie}, 1 node(s) had untolerated taint {skao.int/dedicated: pst}, 3 node(s) didn't match Pod's node affinity/selector. preemption: 0/10 nodes are available: 1 No preemption victims found for incoming pod, 9 Preemption is not helpful for scheduling.
The 1 node(s) didn’t have free ports for the requested pod ports part provided a clue for what was happening; this issue
only appears when using hostNetwork: true because the pod is trying to open a port that is already open on the server.
When using hostNetwork: false (the default value) this issue never arises. However, in an integration environment that
needs to accept data from an external Correlator Beam Former (CBF) the PST core pod needs to have hostNetwork: true.
When multiple instances of ska-pst` with hostNetwork: true` are deployed on the same server, in this case the
pst-beam2 in the psi-low environment, they will by default attempt to use the same port values,
causing the didn’t have free ports failure.
To resolve this a unique values file must be used for each deployment, and each file must specify unique host network port overrides. The following code snippet demonstrates how to override the PST Core ports; in this example, each value is offset from the default value by 5.
ska-pst:
core:
applications:
recv:
ports:
mgmt:
port: 18085
targetPort: 18085
datastream:
port: 32085
targetPort: 32085
smrb:
ports:
mgmt:
port: 18086
targetPort: 18086
dsp_disk:
ports:
mgmt:
port: 18087
targetPort: 18087
stat:
ports:
mgmt:
port: 18088
targetPort: 18088
dsp_ft:
ports:
mgmt:
port: 18089
targetPort: 18089
After applying this change PST pods got to a Running state within Kubernetes.
PST LMC Device fails to initialise
During the device init process, the PST LMC device tries to connect to the PST core signal processing applications via gRPC requests. We have seen this fail in the CSP / CBF / PST environment.
The following is a snippet of the log of the PST LMC pod:
Tango NamedDevFailedList exception
Exception for object simulationMode
Index of object in call (starting at 0) = 0
Severity = ERROR
Error reason = PyDs_PythonError
Desc : grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:130.155.198.233:8888: HTTP proxy returned response code 403"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:130.155.198.233:8888: HTTP proxy returned response code 403", grpc_status:14, created_time:"2024-07-03T07:09:42.788604737+00:00"}"
>
Origin : Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/tango/server.py", line 159, in write_attr
return get_worker().execute(write_method, self, value)
File "/usr/local/lib/python3.10/dist-packages/tango/green.py", line 113, in execute
return fn(*args, **kwargs)
File "/app/python/src/ska_pst/lmc/component/pst_device.py", line 377, in simulationMode
self.component_manager.simulation_mode = value
File "/app/python/src/ska_pst/lmc/component/component_manager.py", line 177, in simulation_mode
self._simulation_mode_changed()
File "/app/python/src/ska_pst/lmc/beam/beam_component_manager.py", line 572, in _simulation_mode_changed
self._smrb_subcomponent.simulation_mode = self.simulation_mode
File "/app/python/src/ska_pst/lmc/component/subcomponent_manager.py", line 111, in simulation_mode
self._simulation_mode_changed()
File "/app/python/src/ska_pst/lmc/component/subcomponent_manager.py", line 317, in _simulation_mode_changed
self._update_api()
File "/app/python/src/ska_pst/lmc/component/subcomponent_manager.py", line 536, in _update_api
self._api.connect()
File "/app/python/src/ska_pst/lmc/component/process_api.py", line 491, in connect
_connect()
File "/usr/local/lib/python3.10/dist-packages/backoff/_sync.py", line 105, in retry
ret = target(*args, **kwargs)
File "/app/python/src/ska_pst/lmc/component/process_api.py", line 489, in _connect
self._connected = self._grpc_client.connect()
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 237, in connect
self._service.connect(request)
File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 1176, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 1005, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses; last error: UNKNOWN: ipv4:130.155.198.233:8888: HTTP proxy returned response code 403"
debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"failed to connect to all addresses; last error: UNKNOWN: ipv4:130.155.198.233:8888: HTTP proxy returned response code 403", grpc_status:14, created_time:"2024-07-03T07:09:42.788604737+00:00"}"
>
This error suggested that it was going through a HTTP proxy to get to the service
test-ska-pst-core.low-csp.svc.cluster.local:28081. The root cause of this was that the deployment had specified
a HTTP/HTTPS proxy without specifying a no_proxy value, this meant that the gRPC was attempting to go through
the HTTP proxy when the service was within the Kubernetes cluster and namespace. The following is a snippet of
the Helm values file that caused the issue.
global:
environment_variables:
- name: https_proxy
value: "http://delphoenix.atnf.csiro.au:8888"
- name: http_proxy
value: "http://delphoenix.atnf.csiro.au:8888"
# other values
The following code snippet shows how this was resolved:
global:
environment_variables:
- name: https_proxy
value: "http://delphoenix.atnf.csiro.au:8888"
- name: http_proxy
value: "http://delphoenix.atnf.csiro.au:8888"
- name: no_proxy
value: "cluster.local"
The no_proxy value should match the global variable cluster_domain which defaults to cluster.local. The
following shows how to set the proxy with a no_proxy when the cluster_domain is set.
global:
cluster_domain: psi-low.k8s.skao.int
environment_variables:
- name: https_proxy
value: "http://delphoenix.atnf.csiro.au:8888"
- name: http_proxy
value: "http://delphoenix.atnf.csiro.au:8888"
- name: no_proxy
value: "psi-low.k8s.skao.int"
PST errors during ConfigureScan
Generally the error for this should be a validation error. However, it is possible that the required bandwidth of the scan could cause an OutOfMemory limit error in Kubernetes which then results in RECV.CORE application being terminated by the Linux kernel.
When this happens the health check between PST LMC and the RECV.CORE application will fail and an exception will be logged in the LMC logs.
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 211, in _wrapper
return func(client, *args, timeout=timeout, **kwargs)
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 361, in configure_beam
self._service.configure_beam(request, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 1181, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2025-06-11T07:50:53.031132144+00:00", grpc_status:14, grpc_message:"Socket closed"}"
>
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/python/src/ska_pst/lmc/component/grpc_process_api.py", line 214, in configure_beam
self._grpc_client.configure_beam(request=request, timeout=timeout)
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 213, in _wrapper
_handle_grpc_error(e, timeout=timeout)
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 266, in _handle_grpc_error
_handle_server_error(error, timeout)
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 234, in _handle_server_error
raise ServiceUnavailable(error_code=error_code, message=message) from error
ska_pst.lmc.component.grpc_lmc_client.ServiceUnavailable
1|2025-06-11T07:50:53.040Z|WARNING|ParallelTaskThread_1|go_to_fault|grpc_process_api.py#411|tango-device:low-pst/beam/01|Error in trying to put remote service 'low-pst/beam/01/recv' in FAULT state.
Traceback (most recent call last):
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 211, in _wrapper
return func(client, *args, timeout=timeout, **kwargs)
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 361, in configure_beam
self._service.configure_beam(request, timeout=timeout)
File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 1181, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.10/dist-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "Socket closed"
debug_error_string = "UNKNOWN:Error received from peer {created_time:"2025-06-11T07:50:53.031132144+00:00", grpc_status:14, grpc_message:"Socket closed"}"
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/app/python/src/ska_pst/lmc/util/streaming_task.py", line 93, in start
for v in self._item_generator(self._abort_event):
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 715, in perform_health_check
_handle_grpc_error(e, timeout=timeout)
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 266, in _handle_grpc_error
_handle_server_error(error, timeout)
File "/app/python/src/ska_pst/lmc/component/grpc_lmc_client.py", line 234, in _handle_server_error
raise ServiceUnavailable(error_code=error_code, message=message) from error
ska_pst.lmc.component.grpc_lmc_client.ServiceUnavailable
1|2025-06-11T07:50:53.845Z|ERROR|Thread-23 (_run)|_handle_health_check_exception|api_subcomponent_manager.py#473|tango-device:low-pst/beam/01|Health check for RECV has raised an exception: . Restarting health check
1|2025-06-11T07:50:53.845Z|WARNING|Thread-30 (_run)|handle_health_state_change|beam_component_manager.py#1493|tango-device:low-pst/beam/01|RECV health state is in FAILED state. Putting 1 into FAILED state
During the development of v1.1.0 of PST the default memory limit was reduced but this affected
the deployments to environments run by TOPIC but this now has been reverted back to the default
value used in previous versions. The issue can be resolved by increasing the memory limits
for the core applications within the Helm values.yaml file when doing a Helm deployment. The following
code snippet shows what value to override. For more information about the Helm parameters check the
SKA PST Helm Chart Parameters page.
ska-pst:
core:
resources:
limits:
cpu: 1600m
memory: 6400Mi # <-- increase this value
requests:
cpu: 1600m
memory: 2000Mi