SP-1857 Testing the SDP Receive Workflow

YAN-772 Resilience to packet loss and corruption

Introduction

To test the resilience of the receiver to packet loss we chose the Linux “traffic control” (“tc”) tool, which is available as part of the iproute2 Linux package. Installing every component of the emulator and receive workflow just to test something that should be testable in isolation seemed excessive, so we decided to containerise all aspects of the tests. Originally we were going to use a Jupyter notebook to launch the tests, but it proved simpler to put all the steps into a single bash script. We have broken the steps into 3.

Step 1 Make your images

The tests require Docker images, which are built from the repository. We build one image to “send”, one to “recv”, and one for testing in case you don't have “TAQL” installed. We’ve automated all of this, so:

git clone https://gitlab.com/ska-telescope/ska-sdp-cbf-emulator.git
cd ska-sdp-cbf-emulator
./build-tests.sh local

The above will build all the images required for the tests. Because they are built from inside the repo, the context for the build is quite large. The scripts also copy the current directory tree into the image, so they install whichever version of the emulator/receiver you have checked out. This is handy for development, but can be a trap: if your working tree contains an edit that breaks something, the installation inside the image will be broken too. After this you will have the following images available:

ska-sdp-cbf-emulator:local -- base image
ska-sdp-cbf-emulator:send -- also the base image, but created in case you need something special in the sender
ska-sdp-cbf-emulator:recv -- base image, but with ports exposed for the receiver in case you want to run this as a separate container
casacore-tools:local -- A containerised version of the casacore tools for measurement set interrogation (specifically TAQL)
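
A quick way to confirm that the build produced all four images is to compare the list above against what Docker reports. The following is a minimal sketch and not part of the repository: the function name missing_images is hypothetical, and it assumes the tags are exactly those listed above.

```shell
# Sketch only: missing_images is a hypothetical helper, not part of the repository.
# It reads "repository:tag" lines on stdin and prints every expected image that is absent.
missing_images() {
    available="$(cat)"
    for expected in \
        ska-sdp-cbf-emulator:local \
        ska-sdp-cbf-emulator:send \
        ska-sdp-cbf-emulator:recv \
        casacore-tools:local
    do
        # -x matches the whole line, -F treats the image name as a fixed string
        if ! printf '%s\n' "$available" | grep -qxF "$expected"
        then
            echo "$expected"
        fi
    done
}
```

Running docker image ls --format '{{.Repository}}:{{.Tag}}' | missing_images should print nothing when every image is present, and otherwise the names of the images that still need building.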

Step 2 Run the tests

The only extra thing you need now is a test measurement set. There are some in the repo (they may be stored in LFS, so be careful), but you will have to unpack one before use. The “sim-vis” set is the simplest and is all that is required for this first test:

cd data
tar xf sim-vis.ms.tar.gz
cd ..
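
If the repository was cloned without Git LFS, the tarball may be a small text pointer rather than the real archive, and tar will fail on it. A minimal sketch for detecting that case follows. The helper name is_lfs_pointer is hypothetical; the check relies on the fact that Git LFS pointer files begin with the line "version https://git-lfs.github.com/spec/v1".

```shell
# Sketch only: is_lfs_pointer is a hypothetical helper, not part of the repository.
# Returns success (0) when the file starts with the Git LFS pointer header.
is_lfs_pointer() {
    # Only the first few bytes are needed; a real tarball starts with binary gzip data.
    head -c 64 "$1" 2>/dev/null | grep -q '^version https://git-lfs.github.com/spec/v1'
}
```

For example, is_lfs_pointer data/sim-vis.ms.tar.gz && git lfs pull would fetch the real archive only when a pointer is found.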

Actually running the test is now as simple as:

cd tests/scripts
./SP-1857-NETWORK-TEST ../../data/sim-vis.ms

So what is this test doing? The script is reproduced in full further down and should be pretty self-explanatory. In outline it performs these steps:

Set up the shared volumes and move the required models and data into them
Create a config file with the parameters for the test and move that into the area to be cross mounted
Start the container with elevated privileges so we can tweak the internal interface
Install iproute2 for traffic shaping
Start the plasma store inside the running container
Start the receiver inside the running container
Start the measurement set writer inside the running container
Apply the tc netem rule to corrupt traffic on the loopback interface
Start the sender
Shut down

The parameters for the test by default are pretty harsh. The traffic is corrupted on the send interface by “docker exec -t emu-test /sbin/tc qdisc add dev lo root netem corrupt 2%”, which corrupts 2% of the sent packets by introducing a single bit error into the packet contents. In the results section of this document we will demonstrate what happens when packets are lost, corrupted and delayed. We will also demonstrate how the level of packet loss affects the quality of the data actually written to memory / disk. The full script is:

#!/bin/bash

# This is a script that will run a test to enable the testing of resilience to packet loss.
# as part of SP-1857

TMP_DIR="/tmp/emu"
CONFIG="/tmp/emu/SP-1857.in"
OUTPUT="/tmp/emu/received.ms"
TOSEND="/tmp/emu/tosend.ms"
MODEL="/tmp/emu/model.ms"

# Remove anything left over from a previous run.
# CONFIG is a regular file, so test with -e (exists) rather than -d (is a directory).
for LEFTOVER in "$OUTPUT" "$TOSEND" "$MODEL" "$CONFIG"
do
        if [ -e "$LEFTOVER" ]
        then
                rm -r "$LEFTOVER"
        fi
done



if [ -z "$1" ]
then
        echo "No file supplied please supply a measurement set to send."
        exit 1
else
        SEND_FILE="$1"
fi

# Quote the paths so that measurement sets with spaces in their names still work
mkdir -p "$TMP_DIR"
cp -R "$SEND_FILE" "$TOSEND"
cp -R "$SEND_FILE" "$MODEL"

cat > $CONFIG << EOF
[reception]
method = spead2_receivers
receiver_port_start = 41001
datamodel = ${MODEL}
ring_heaps = 128
outputfilename = ${OUTPUT}
consumer = plasma_writer
plasma_path = /tmp/plasma_socket

[transmission]
method = spead2_transmitters
target_host = 127.0.0.1
target_port_start = 41001
channels_per_stream = 1
rate = 1000000000
time_interval = 0

[reader]
num_repeats = 1
num_timestamps = 20
num_channels = 0

EOF

echo "Starting the test container"

# Stop any container left over from a previous run; ignore the error if there is none
docker stop emu-test 2>/dev/null
docker run -td --rm --name emu-test --cap-add=NET_ADMIN -v ${TMP_DIR}:${TMP_DIR} ska-sdp-cbf-emulator:recv

echo "Install iproute2 for traffic shaping"

docker exec -t emu-test apt install -y iproute2

echo "Start the plasma store"

docker exec -td emu-test plasma_store -m 100000000 -s /tmp/plasma_socket

echo "Running Receiver in Background"

docker exec -td emu-test emu-recv -v -c $CONFIG

echo "Start the plasma writer"

docker exec -td emu-test plasma-mswriter -v -s  /tmp/plasma_socket --max_payloads 10 $OUTPUT

echo "Corrupting the UDP packet stream to the level of approximately 2%"

docker exec -t emu-test /sbin/tc qdisc add dev lo root netem corrupt 2%

echo "Running Sender in Foreground"

docker exec -t emu-test emu-send -vv -c "$CONFIG" "$TOSEND"

echo "Giving the receiver and plasma processes a couple of seconds to wrap up"

sleep 2

echo "Stopping test container"

docker stop emu-test
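
The script above applies only the corruption case, whereas the Summary refers to three tests (corruption, loss and jitter). A sketch of how the netem rule could be swapped for each case is given below. The helper name netem_rule and the loss and delay values are assumptions for illustration; only the corrupt rule is taken from the test script.

```shell
# Sketch only: netem_rule is a hypothetical helper; the loss and delay values are illustrative.
# Prints the tc command for one test mode so it can be inspected before being run.
netem_rule() {
    case "$1" in
        corrupt) echo "/sbin/tc qdisc add dev lo root netem corrupt 2%" ;;      # as in the script above
        loss)    echo "/sbin/tc qdisc add dev lo root netem loss 1%" ;;         # assumed loss rate
        jitter)  echo "/sbin/tc qdisc add dev lo root netem delay 100ms 50ms" ;; # assumed: 100ms delay, 50ms variation
        clear)   echo "/sbin/tc qdisc del dev lo root" ;;                       # remove the current rule
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

A rule would then be applied with, for example, docker exec -t emu-test $(netem_rule loss). Because "qdisc add" fails when a root rule is already installed, run the "clear" command first when switching between modes in the same container.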

Summary

The first thing to note is that although we are only affecting 1% of the packets, we are losing ~3% of the HEAPs. I am not completely convinced that the tc tool is accurately reporting its percentage of packets lost, so I am not overly concerned about that at this stage.

Test 1 (corruption) and Test 2 (loss) result in comparable loss in the final MS, presumably because the interface evaluates the checksum and rejects the corrupted packet. So although only a tiny fraction of the data is corrupted (1 bit in ~1000 in ~1% of the packets, so roughly 0.001% of the data), the result is >1% data loss on reception. It may be worth investigating whether switching off the CRC checking on the switches / interfaces should be supported. After all, we are only recording noise, and if there is some flagging in the SDP we could catch the aberrant samples and leave the rest of the HEAP.

Test 3 (jitter): there is no noticeable effect on the output MS from jittering the input by delays of beyond a sample time.

With the current reception of SPEAD2 packets over UDP, and with checksum-based packet rejection enabled on the switches or the interfaces, the loss of data at reception is comparable to the loss of data during transmission: ~1% of packets corrupted or lost results in ~1% of data lost in the received measurement set.

TODO: This is because, in this case, a whole HEAP fits into a single UDP packet. AA0.5 simply does not have a lot of baselines. Lost UDP packets will be a bigger issue for larger HEAPs. If we are losing 1% of our packets and there are 100 packets per HEAP, then on average every HEAP would lose a packet and most HEAPs would arrive incomplete … I would suggest we revisit this for the large array releases.
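
To put a number on this, assume (as a simplification) that packets are lost independently with probability p and that a HEAP is discarded if any one of its n packets is missing. The fraction of HEAPs lost is then 1 - (1 - p)^n. The sketch below evaluates that expression. The helper name heap_loss_fraction is hypothetical, and both assumptions are ours rather than measured properties of the link or the receiver.

```shell
# Sketch only: assumes independent packet loss and that one missing packet loses the whole HEAP.
# Usage: heap_loss_fraction <packet_loss_probability> <packets_per_heap>
heap_loss_fraction() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - p) ^ n }'
}
```

Under these assumptions heap_loss_fraction 0.01 1 gives 0.0100, which is the single-packet case tested here, while heap_loss_fraction 0.01 100 gives 0.6340, so roughly 63% of HEAPs would be missing at least one packet.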