Low PSI Hardware Audit
======================

The Low PSI network topology (FPGA serials, PCIe BDFs, MAC/SPEAD IDs, P4
switch ports, trunk ports, timing offsets) lives in a single source-of-truth
YAML file:

    ``src/ska_low_cbf_integration/data/psi-net.yaml``

It is consumed at runtime by ``ska_low_cbf_integration.low_psi_net`` (and
re-exported by ``ska_low_cbf_integration.low_psi`` for backwards
compatibility) and by four stand-alone tooling scripts:

* ``scripts/psi_net_check_diagram.py``: cross-checks the YAML against the
  ``low-psi-data-links.drawio.xml`` diagram. Reports any device or cable that
  appears in one but not the other. Runs in CI as part of the lint checks.
* ``scripts/psi_net_check_helm.py``: cross-checks the YAML against the Helm
  chart at ``charts/psi-low.values.yaml``. Ensures every FPGA's ``serial`` /
  ``p4_port`` pair in the YAML matches the ``alveo=`` / ``port=`` entries in
  ``hardware_connections``, and vice versa. Runs in CI as part of the lint
  checks.
* ``scripts/psi_net_check_lsalveo.py``: cross-checks the YAML against the
  hand-maintained ``bdf_to_sn_port`` dict in ``scripts/lsalveo``. Parses the
  dict via ``ast`` (no execution, which avoids lsalveo's ``kubernetes``
  runtime dependency) and verifies that every ``(host, BDF, serial, port)``
  quadruple matches the YAML, and that every YAML FPGA on a host that lsalveo
  tracks is also present.
* ``scripts/psi_fpga_audit.py``: cross-checks the YAML against live FPGA
  hardware over SSH. Manual; documented below.

``psi_fpga_audit.py``
---------------------

For a given host, the audit:

1. SSHes into the host.
2. Enumerates the FPGAs:

   * V80 cards via ``ami_tool overview`` / ``ami_tool mfg_info -d``
   * U55C cards via ``xbutil examine`` (for BDF + MAC) plus the xmc sysfs
     node ``/sys/bus/pci/devices/0000:XX:00.1/xmc*/serial_num`` (for the
     serial, which xbutil's user-PF does not expose).

3. Cross-references each card against the YAML and reports whether:

   * every PCIe BDF reported by the tool exists in the YAML for this host
     (and every YAML entry on this host was reported by the tool);
   * the Serial Number matches the YAML ``serial`` field;
   * the lower four bytes of MAC Address 1 match the YAML ``spead_hwid``
     field (the SPEAD hardware ID emitted on the wire is the bottom four
     bytes of the card's MAC).

Exits non-zero if any check fails.

Running it
~~~~~~~~~~

The script is a self-contained file with a single dependency (``pyyaml``).
From the repo root:

.. code-block:: bash

    poetry run python scripts/psi_fpga_audit.py psi-perentie1   # 10x U55C
    poetry run python scripts/psi_fpga_audit.py psi-perentie2   # 6x V80
    poetry run python scripts/psi_fpga_audit.py seren-08        # 2x V80

You need SSH access to the host (key-based auth; the script uses
``BatchMode=yes`` and will not prompt for a password). The remote host needs
``ami_tool`` (V80) or ``xbutil`` (U55C) installed.

Example output
~~~~~~~~~~~~~~

.. code-block:: text

    Host: psi-perentie1
    YAML expects 10 FPGA(s): Alveo U55C
    PASS 0000:4f:00.1 XFL1E35JVJTQ 00:0a:35:0b:1a:08 (psi-perentie1/u55c-10)
    PASS 0000:52:00.1 XFL1XCRTUC22 00:0a:35:0b:19:10 (psi-perentie1/u55c-9)
    PASS 0000:53:00.1 XFL1VCYSXCL0 00:0a:35:0b:18:e0 (psi-perentie1/u55c-6)
    PASS 0000:56:00.1 XFL1ZIN0F4RO 00:0a:35:0b:19:b8 (psi-perentie1/u55c-7)
    ...
    ────────────────────────────────────────────────────────────
    0 failure(s)

When to run it
~~~~~~~~~~~~~~

This is a manual audit, not part of CI. Run it:

* After any physical card swap, to confirm the YAML reflects what is now
  installed.
* When tests fail in ways that suggest the YAML may be stale (wrong serial on
  a port, unexpected SPEAD hwid in a capture, etc.).

Interpreting failures
~~~~~~~~~~~~~~~~~~~~~

* **BDF not in YAML for host X**: a card is physically present that the YAML
  doesn't know about. Add the entry to ``psi-net.yaml``.
* **NOT seen by tool**: the YAML expects a card at a BDF but the on-host tool
  didn't report it. The card has either been removed, moved to a different
  BDF, or is in a bad state.
* **serial: expected X, got Y**: the wrong card is at this BDF. Either update
  the YAML (if a swap was intentional and undocumented) or investigate the
  card identity.
* **spead_hwid: expected 0xXXXX, MAC1 (…) lower-4 is 0xYYYY**: the YAML's
  ``spead_hwid`` does not match the card's actual MAC. This is usually a YAML
  transcription error; fix the YAML to match the MAC.
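
The checks behind these messages can be sketched roughly as follows. This is
an illustration only, not the code in ``scripts/psi_fpga_audit.py``: the
function names ``mac_lower4`` and ``audit`` and the plain-dict layout of
``expected`` and ``seen`` are assumptions made for the example, and the real
``psi-net.yaml`` schema may differ. The serials and MACs are taken from the
example output above; one ``spead_hwid`` is deliberately wrong to trigger a
failure.

.. code-block:: python

    def mac_lower4(mac: str) -> int:
        """Return the lower four bytes of a colon-separated MAC as an int."""
        octets = [int(part, 16) for part in mac.split(":")]
        if len(octets) != 6:
            raise ValueError(f"expected 6 octets in MAC, got {mac!r}")
        value = 0
        for octet in octets[2:]:  # skip the top two bytes
            value = (value << 8) | octet
        return value


    def audit(expected: dict, seen: dict) -> list:
        """Compare YAML expectations with what the on-host tool reported.

        Hypothetical layout, keyed by PCIe BDF:
          expected[bdf] = {"serial": str, "spead_hwid": int}
          seen[bdf]     = {"serial": str, "mac": str}
        """
        failures = []
        for bdf in sorted(seen.keys() - expected.keys()):
            failures.append(f"{bdf}: BDF not in YAML")
        for bdf in sorted(expected.keys() - seen.keys()):
            failures.append(f"{bdf}: NOT seen by tool")
        for bdf in sorted(expected.keys() & seen.keys()):
            want, got = expected[bdf], seen[bdf]
            if got["serial"] != want["serial"]:
                failures.append(
                    f"{bdf}: serial: expected {want['serial']}, got {got['serial']}"
                )
            lower4 = mac_lower4(got["mac"])
            if lower4 != want["spead_hwid"]:
                failures.append(
                    f"{bdf}: spead_hwid: expected 0x{want['spead_hwid']:08x}, "
                    f"MAC1 ({got['mac']}) lower-4 is 0x{lower4:08x}"
                )
        return failures


    expected = {
        "0000:4f:00.1": {"serial": "XFL1E35JVJTQ", "spead_hwid": 0x350B1A08},
        # Deliberately wrong hwid, to illustrate a transcription error.
        "0000:52:00.1": {"serial": "XFL1XCRTUC22", "spead_hwid": 0x350B1911},
    }
    seen = {
        "0000:4f:00.1": {"serial": "XFL1E35JVJTQ", "mac": "00:0a:35:0b:1a:08"},
        "0000:52:00.1": {"serial": "XFL1XCRTUC22", "mac": "00:0a:35:0b:19:10"},
    }

    for line in audit(expected, seen):
        print("FAIL", line)

Keying both sides by BDF keeps the two "missing card" cases as simple set
differences, and the serial and ``spead_hwid`` comparisons only run for BDFs
present on both sides.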