Example unpacking of a DSPSR scale and offset DADA file

[1]:
%load_ext autoreload
%autoreload 2
[2]:
import pathlib
import tempfile

import numpy as np

from ska_pydada import AsciiHeader, DadaFile, DspsrScalesOffsetsUnpacker

Create an example DADA file for testing

Need to create a file that meets the specification of the DSPSR scale and offset DADA file.

The file needs a header with the following keys set: * NCHAN - the number of channels * NPOL - the number of polarisations * NBIT - set to -32 as the core data is float32 * BLOCK_HEADER_BYTES - set to 8 as the value is a unsigned 64-bit int (i.e. np.uint64) * BLOCK_DATA_BYTES - NCHAN * NPOl * 2 * 4, here 2 is that we have both scale and offset and the 4 is the size, in bytes, of float32

The cells below show how to create a DADA file with some random data. This example will create 2 chunks of data but the rescale operation could have been constant and was calculated with the first sample as being 0

Create ASCII Header

[3]:
header_txt = """HDR_SIZE            4092
HDR_VERSION         1.0
NCHAN               432
NBIT                -32
NDIM                2
NPOL                2
BLOCK_HEADER_BYTES  16
BLOCK_DATA_BYTES    6912
RESOLUTION          6928
"""
[4]:
header = AsciiHeader.from_str(header_txt)
[5]:
assert header.header_size == 4092
assert int(header["NCHAN"]) == 432
assert int(header["NBIT"]) == -32
assert int(header["NDIM"]) == 2
assert int(header["BLOCK_HEADER_BYTES"]) == 16
assert int(header["BLOCK_DATA_BYTES"]) == 6912
assert header.resolution == 6928

Generate some data

For this notebook, the data will be 4 chuncks of data with nchan=432, npol=2 and each chunk was valid for 8196 time samples.

We need to use a Numpy structured array, the following sets up the correct dtype for that array

[6]:
dtype = np.dtype(
    [("sample_offset", np.uint64), ("num_samples", np.uint64), ("data", np.float32, (432, 2, 2))]
)
[7]:
assert dtype.fields is not None

sample_offset_field = dtype.fields["sample_offset"]
assert sample_offset_field is not None

num_samples_field = dtype.fields["num_samples"]
assert num_samples_field is not None

data_field = dtype.fields["data"]
assert data_field is not None

# this is the dtype of the sample_offset field
sample_offset_field[0] == np.uint64
# this is the offset of the sample_offset field
sample_offset_field[1] == 0

# this is the dtype of the num_samples field
num_samples_field[0] == np.uint64
# this is the offset of the num_samples field
num_samples_field[1] == 0

# this is the dtype of the data field
data_field[0] == np.float32
# this is the offset of the data field - this is the BLOCK_HEADER_BYTES
data_field[1] == 8

# this asserts that the dtype overall size is that of the RESOLUTION header
assert dtype.itemsize == header.resolution

Generate the structured data array

[8]:
data = np.array(
    [
        (np.uint64(idx * 8196), np.uint64(8196), np.random.rand(432, 2, 2).astype(np.float32))
        for idx in range(4)
    ],
    dtype=dtype,
)

An instance of a DadaFile can be created using the constructor that takes an optional AsciiHeader and an optional by array of data.

However, the following show how to create a DadaFile and then set the data afterwards.

[9]:
dada_file = DadaFile(header=header)
dada_file.set_data(data)

Once an instance of a DadaFile has been created, it can be saved as a file. This can be then be read back later

[10]:
assert dada_file.data_size == len(data.tobytes())
assert dada_file.data_size % header.resolution == 0
dada_file.data_size
[10]:
27712
[11]:
tmpdir = tempfile.gettempdir()
outfile = pathlib.Path(tmpdir) / "example_dada_file.dada"

dada_file.dump(outfile)
[12]:
%ls -lh $outfile
-rw-rw-r-- 1 wgauvin wgauvin 32K Aug 28 09:39 /tmp/example_dada_file.dada
[13]:
!head -c4096 $outfile
HDR_SIZE            4092
HDR_VERSION         1.0
NCHAN               432
NBIT                -32
NDIM                2
NPOL                2
BLOCK_HEADER_BYTES  16
BLOCK_DATA_BYTES    6912
RESOLUTION          6928

Loading and reading DADA files

While the above shows how to create DADA files that is normally used for testing and is not the main focus of the PyDADA library or DadaFile itself. The power of the DadaFile is that it can read DADA files that conform to the DADA spec of having a header that has HDR_SIZE value of at least 4096 bytes and that the binary data is afterwards. As it is a flexible file format the data may be packed in different ways and this API provides general data access methods to get the data.

The following cell will read the file generated above and print out the header.

[14]:
dada_file2 = DadaFile.load_from_file(outfile)
print(dada_file2.header)
HDR_SIZE            4092
HDR_VERSION         1.0
NCHAN               432
NBIT                -32
NDIM                2
NPOL                2
BLOCK_HEADER_BYTES  16
BLOCK_DATA_BYTES    6912
RESOLUTION          6928

Need to unpack this as a DSPSR scale and offset file

[15]:
unpacked_data = dada_file2.unpack_tfp(unpacker=DspsrScalesOffsetsUnpacker())
unpacked_data
[15]:
array([(    0, 8196, [[[8.98453236e-01, 5.93182266e-01], [4.97357100e-01, 9.95412096e-02]], [[4.65899438e-01, 3.36131990e-01], [8.74583542e-01, 1.59067690e-01]], [[1.79393604e-01, 2.19341904e-01], [4.49888915e-01, 8.68471384e-01]], ..., [[5.11326432e-01, 5.75285494e-01], [3.47158760e-01, 1.65534511e-01]], [[7.52979755e-01, 3.84463370e-01], [8.31367970e-01, 7.64196455e-01]], [[5.64966202e-02, 6.68804944e-01], [6.45671546e-01, 8.93202901e-01]]]),
       ( 8196, 8196, [[[5.40404022e-01, 1.29237622e-01], [8.30671430e-01, 5.93838573e-01]], [[1.86558083e-01, 1.18146122e-01], [4.63513145e-03, 7.77777791e-01]], [[6.26434863e-01, 7.33709037e-01], [6.86623931e-01, 9.26795721e-01]], ..., [[8.40768158e-01, 9.90110099e-01], [8.81555602e-02, 2.59665072e-01]], [[5.48751131e-02, 5.62433243e-01], [3.44117641e-01, 2.03142002e-01]], [[1.46632016e-01, 4.77732420e-01], [2.03769162e-01, 5.59436142e-01]]]),
       (16392, 8196, [[[7.68687427e-01, 4.67736900e-01], [8.10915045e-03, 5.82884192e-01]], [[3.24003607e-01, 7.67787397e-01], [7.37285495e-01, 4.90556180e-01]], [[9.00350332e-01, 2.10212782e-01], [7.38968790e-01, 6.26155496e-01]], ..., [[5.79601228e-01, 6.84320569e-01], [1.47350684e-01, 6.99548841e-01]], [[1.61499120e-02, 6.31651223e-01], [3.87406707e-01, 8.87936801e-02]], [[6.94819614e-02, 7.83718407e-01], [3.67956668e-01, 8.82480562e-01]]]),
       (24588, 8196, [[[8.76429856e-01, 3.59158903e-01], [1.63453102e-01, 4.31162082e-02]], [[5.55841029e-01, 1.15831293e-01], [1.27441481e-01, 2.25621432e-01]], [[3.45352948e-01, 9.67215419e-01], [9.30034518e-01, 9.46359217e-01]], ..., [[5.15811920e-01, 1.81550696e-01], [8.21421325e-01, 8.99467885e-01]], [[3.44732791e-01, 7.45153427e-01], [1.91752493e-01, 7.96722829e-01]], [[4.97192323e-01, 5.16910374e-01], [2.33638480e-01, 2.46460348e-01]]])],
      dtype=[('sample_offset', '<u8'), ('num_samples', '<u8'), ('data', '<f4', (432, 2, 2))])
[16]:
np.testing.assert_allclose(data["sample_offset"], unpacked_data["sample_offset"])
np.testing.assert_allclose(data["num_samples"], unpacked_data["num_samples"])
np.testing.assert_allclose(data["data"], unpacked_data["data"])

The raw data can be retrieved from the file by using:

[17]:
raw_data = dada_file2.raw_data
len(raw_data), raw_data[:20]
[17]:
(27712,
 b'\x00\x00\x00\x00\x00\x00\x00\x00\x04 \x00\x00\x00\x00\x00\x00\x08\x01f?')

The above array is correct but is not in a friendly format. The array can be converted to a Panda’s dataframe doing the following:

[18]:
df = DspsrScalesOffsetsUnpacker.convert_to_dataframe(unpacked_data)
df
[18]:
sample_offset num_samples channel polarisation scale offset
0 0 8196 0 0 0.898453 0.593182
1 0 8196 0 1 0.497357 0.099541
2 0 8196 1 0 0.465899 0.336132
3 0 8196 1 1 0.874584 0.159068
4 0 8196 2 0 0.179394 0.219342
... ... ... ... ... ... ...
859 24588 8196 429 1 0.821421 0.899468
860 24588 8196 430 0 0.344733 0.745153
861 24588 8196 430 1 0.191752 0.796723
862 24588 8196 431 0 0.497192 0.516910
863 24588 8196 431 1 0.233638 0.246460

3456 rows × 6 columns

Show that the scale and offset are the in the Pandas data frame and the original data array

To make is easier to selected a given sample, change the index of the data frame to be a multi-index of sample_offset, channel and polarisation

[19]:
df2 = df.set_index(["sample_offset", "channel", "polarisation"])
[20]:
original_scale, original_offset = data["data"][0, 42, 1]
original_scale, original_offset
[20]:
(0.16168822, 0.81613666)
[21]:
# check values in data frame - note that Pandas using Python typing and float is actually float64
output_scale, output_offset = df2[["scale", "offset"]].loc[0, 42, 1]  # type: ignore
output_scale, output_offset
[21]:
(0.16168822348117828, 0.8161366581916809)
[22]:
np.testing.assert_array_almost_equal(original_scale, np.float32(output_scale))  # type: ignore
[23]:
np.testing.assert_array_almost_equal(original_offset, np.float32(output_offset))  # type: ignore

Clean up

[24]:
if outfile.exists():
    outfile.unlink()