Example unpacking of a DSPSR scale and offset DADA file
[1]:
%load_ext autoreload
%autoreload 2
[2]:
import pathlib
import tempfile
import numpy as np
from ska_pydada import AsciiHeader, DadaFile, DspsrScalesOffsetsUnpacker
Create an example DADA file for testing
Need to create a file that meets the specification of the DSPSR scale and offset DADA file.
The file needs a header with the following keys set: * NCHAN - the number of channels * NPOL - the number of polarisations * NBIT - set to -32 as the core data is float32 * BLOCK_HEADER_BYTES - set to 8 as the value is a unsigned 64-bit int (i.e. np.uint64) * BLOCK_DATA_BYTES - NCHAN * NPOl * 2 * 4, here 2 is that we have both scale and offset and the 4 is the size, in bytes, of float32
The cells below show how to create a DADA file with some random data. This example will create 2 chunks of data but the rescale operation could have been constant and was calculated with the first sample as being 0
Create ASCII Header
[3]:
header_txt = """HDR_SIZE 4092
HDR_VERSION 1.0
NCHAN 432
NBIT -32
NDIM 2
NPOL 2
BLOCK_HEADER_BYTES 16
BLOCK_DATA_BYTES 6912
RESOLUTION 6928
"""
[4]:
header = AsciiHeader.from_str(header_txt)
[5]:
assert header.header_size == 4092
assert int(header["NCHAN"]) == 432
assert int(header["NBIT"]) == -32
assert int(header["NDIM"]) == 2
assert int(header["BLOCK_HEADER_BYTES"]) == 16
assert int(header["BLOCK_DATA_BYTES"]) == 6912
assert header.resolution == 6928
Generate some data
For this notebook, the data will be 4 chuncks of data with nchan=432, npol=2 and each chunk was valid for 8196 time samples.
We need to use a Numpy structured array, the following sets up the correct dtype for that array
[6]:
dtype = np.dtype(
[("sample_offset", np.uint64), ("num_samples", np.uint64), ("data", np.float32, (432, 2, 2))]
)
[7]:
assert dtype.fields is not None
sample_offset_field = dtype.fields["sample_offset"]
assert sample_offset_field is not None
num_samples_field = dtype.fields["num_samples"]
assert num_samples_field is not None
data_field = dtype.fields["data"]
assert data_field is not None
# this is the dtype of the sample_offset field
sample_offset_field[0] == np.uint64
# this is the offset of the sample_offset field
sample_offset_field[1] == 0
# this is the dtype of the num_samples field
num_samples_field[0] == np.uint64
# this is the offset of the num_samples field
num_samples_field[1] == 0
# this is the dtype of the data field
data_field[0] == np.float32
# this is the offset of the data field - this is the BLOCK_HEADER_BYTES
data_field[1] == 8
# this asserts that the dtype overall size is that of the RESOLUTION header
assert dtype.itemsize == header.resolution
Generate the structured data array
[8]:
data = np.array(
[
(np.uint64(idx * 8196), np.uint64(8196), np.random.rand(432, 2, 2).astype(np.float32))
for idx in range(4)
],
dtype=dtype,
)
An instance of a DadaFile can be created using the constructor that takes an optional AsciiHeader and an optional by array of data.
However, the following show how to create a DadaFile and then set the data afterwards.
[9]:
dada_file = DadaFile(header=header)
dada_file.set_data(data)
Once an instance of a DadaFile has been created, it can be saved as a file. This can be then be read back later
[10]:
assert dada_file.data_size == len(data.tobytes())
assert dada_file.data_size % header.resolution == 0
dada_file.data_size
[10]:
27712
[11]:
tmpdir = tempfile.gettempdir()
outfile = pathlib.Path(tmpdir) / "example_dada_file.dada"
dada_file.dump(outfile)
[12]:
%ls -lh $outfile
-rw-rw-r-- 1 wgauvin wgauvin 32K Aug 28 09:39 /tmp/example_dada_file.dada
[13]:
!head -c4096 $outfile
HDR_SIZE 4092
HDR_VERSION 1.0
NCHAN 432
NBIT -32
NDIM 2
NPOL 2
BLOCK_HEADER_BYTES 16
BLOCK_DATA_BYTES 6912
RESOLUTION 6928
Loading and reading DADA files
While the above shows how to create DADA files that is normally used for testing and is not the main focus of the PyDADA library or DadaFile itself. The power of the DadaFile is that it can read DADA files that conform to the DADA spec of having a header that has HDR_SIZE value of at least 4096 bytes and that the binary data is afterwards. As it is a flexible file format the data may be packed in different ways and this API provides general data access methods to get the data.
The following cell will read the file generated above and print out the header.
[14]:
dada_file2 = DadaFile.load_from_file(outfile)
print(dada_file2.header)
HDR_SIZE 4092
HDR_VERSION 1.0
NCHAN 432
NBIT -32
NDIM 2
NPOL 2
BLOCK_HEADER_BYTES 16
BLOCK_DATA_BYTES 6912
RESOLUTION 6928
Need to unpack this as a DSPSR scale and offset file
[15]:
unpacked_data = dada_file2.unpack_tfp(unpacker=DspsrScalesOffsetsUnpacker())
unpacked_data
[15]:
array([( 0, 8196, [[[8.98453236e-01, 5.93182266e-01], [4.97357100e-01, 9.95412096e-02]], [[4.65899438e-01, 3.36131990e-01], [8.74583542e-01, 1.59067690e-01]], [[1.79393604e-01, 2.19341904e-01], [4.49888915e-01, 8.68471384e-01]], ..., [[5.11326432e-01, 5.75285494e-01], [3.47158760e-01, 1.65534511e-01]], [[7.52979755e-01, 3.84463370e-01], [8.31367970e-01, 7.64196455e-01]], [[5.64966202e-02, 6.68804944e-01], [6.45671546e-01, 8.93202901e-01]]]),
( 8196, 8196, [[[5.40404022e-01, 1.29237622e-01], [8.30671430e-01, 5.93838573e-01]], [[1.86558083e-01, 1.18146122e-01], [4.63513145e-03, 7.77777791e-01]], [[6.26434863e-01, 7.33709037e-01], [6.86623931e-01, 9.26795721e-01]], ..., [[8.40768158e-01, 9.90110099e-01], [8.81555602e-02, 2.59665072e-01]], [[5.48751131e-02, 5.62433243e-01], [3.44117641e-01, 2.03142002e-01]], [[1.46632016e-01, 4.77732420e-01], [2.03769162e-01, 5.59436142e-01]]]),
(16392, 8196, [[[7.68687427e-01, 4.67736900e-01], [8.10915045e-03, 5.82884192e-01]], [[3.24003607e-01, 7.67787397e-01], [7.37285495e-01, 4.90556180e-01]], [[9.00350332e-01, 2.10212782e-01], [7.38968790e-01, 6.26155496e-01]], ..., [[5.79601228e-01, 6.84320569e-01], [1.47350684e-01, 6.99548841e-01]], [[1.61499120e-02, 6.31651223e-01], [3.87406707e-01, 8.87936801e-02]], [[6.94819614e-02, 7.83718407e-01], [3.67956668e-01, 8.82480562e-01]]]),
(24588, 8196, [[[8.76429856e-01, 3.59158903e-01], [1.63453102e-01, 4.31162082e-02]], [[5.55841029e-01, 1.15831293e-01], [1.27441481e-01, 2.25621432e-01]], [[3.45352948e-01, 9.67215419e-01], [9.30034518e-01, 9.46359217e-01]], ..., [[5.15811920e-01, 1.81550696e-01], [8.21421325e-01, 8.99467885e-01]], [[3.44732791e-01, 7.45153427e-01], [1.91752493e-01, 7.96722829e-01]], [[4.97192323e-01, 5.16910374e-01], [2.33638480e-01, 2.46460348e-01]]])],
dtype=[('sample_offset', '<u8'), ('num_samples', '<u8'), ('data', '<f4', (432, 2, 2))])
[16]:
np.testing.assert_allclose(data["sample_offset"], unpacked_data["sample_offset"])
np.testing.assert_allclose(data["num_samples"], unpacked_data["num_samples"])
np.testing.assert_allclose(data["data"], unpacked_data["data"])
The raw data can be retrieved from the file by using:
[17]:
raw_data = dada_file2.raw_data
len(raw_data), raw_data[:20]
[17]:
(27712,
b'\x00\x00\x00\x00\x00\x00\x00\x00\x04 \x00\x00\x00\x00\x00\x00\x08\x01f?')
The above array is correct but is not in a friendly format. The array can be converted to a Panda’s dataframe doing the following:
[18]:
df = DspsrScalesOffsetsUnpacker.convert_to_dataframe(unpacked_data)
df
[18]:
| sample_offset | num_samples | channel | polarisation | scale | offset | |
|---|---|---|---|---|---|---|
| 0 | 0 | 8196 | 0 | 0 | 0.898453 | 0.593182 |
| 1 | 0 | 8196 | 0 | 1 | 0.497357 | 0.099541 |
| 2 | 0 | 8196 | 1 | 0 | 0.465899 | 0.336132 |
| 3 | 0 | 8196 | 1 | 1 | 0.874584 | 0.159068 |
| 4 | 0 | 8196 | 2 | 0 | 0.179394 | 0.219342 |
| ... | ... | ... | ... | ... | ... | ... |
| 859 | 24588 | 8196 | 429 | 1 | 0.821421 | 0.899468 |
| 860 | 24588 | 8196 | 430 | 0 | 0.344733 | 0.745153 |
| 861 | 24588 | 8196 | 430 | 1 | 0.191752 | 0.796723 |
| 862 | 24588 | 8196 | 431 | 0 | 0.497192 | 0.516910 |
| 863 | 24588 | 8196 | 431 | 1 | 0.233638 | 0.246460 |
3456 rows × 6 columns
Show that the scale and offset are the in the Pandas data frame and the original data array
To make is easier to selected a given sample, change the index of the data frame to be a multi-index of sample_offset, channel and polarisation
[19]:
df2 = df.set_index(["sample_offset", "channel", "polarisation"])
[20]:
original_scale, original_offset = data["data"][0, 42, 1]
original_scale, original_offset
[20]:
(0.16168822, 0.81613666)
[21]:
# check values in data frame - note that Pandas using Python typing and float is actually float64
output_scale, output_offset = df2[["scale", "offset"]].loc[0, 42, 1] # type: ignore
output_scale, output_offset
[21]:
(0.16168822348117828, 0.8161366581916809)
[22]:
np.testing.assert_array_almost_equal(original_scale, np.float32(output_scale)) # type: ignore
[23]:
np.testing.assert_array_almost_equal(original_offset, np.float32(output_offset)) # type: ignore
Clean up
[24]:
if outfile.exists():
outfile.unlink()