ska-sdp-plasmastman
Installation
These instructions are sufficient to build and install PlasmaStMan
,
making it readily available for usage by third-party applications.
While these instructions are enough to get started, attention needs to be paid when planning to use this storage manager from a python environment. For details see Python quirks.
Dependencies
This project depends on:
Any C++14 compiler
casacore > 3.3.0 (with 64-bit table support)
arrow >= 1.0.0-SNAPSHOT with plasma support
Compiling
This is a cmake
-based project,
so it can be built as any standard cmake
project:
$ git clone https://gitlab.com/ska-telescope/ska-sdp-plasmastman
$ cd plasma-storage-manager
$ cmake . -B build
# cmake --build build
Some of the most relevant cmake
variables
(passed on the first cmake
invocation via -Dvariable=value
)
used for compiling are:
CASACORE_ROOT_DIR
: Root of arbitrary casacore installations in case one is usedArrow_DIR
: directory containing thecmake
configuration exported by Apache Arrow (usually underlib/cmake/arrow
in the arrow installation area).Plasma_DIR
: directory containing thecmake
configuration exported by Apache Plasma (usually underlib/cmake/arrow
in the arrow installation area).CMAKE_CXX_COMPILER
: The C++ compiler to use.CMAKE_CXX_FLAGS
: Extra C++ compilation flags.CMAKE_BUILD_TYPE
: The type of build to produce, one ofDebug
,Release
andRelWithDebInfo
.BUILD_TESTING
: Whether to build unit tests or not, defaults toON
.
Testing
A set of unit tests is included and built by default. To execute them do:
$ cmake --build build --target test
The unit tests require the plasma-store-server
executable
(part of a standard C++ Arrow Plasma installation)
to be visible in the path.
If you want further control on ctest
’s command line flags
you can do:
$ cmake --build build --target test -- ARGS="<ctest command line flags>"
or alternatively:
$ cd build/
$ ctest <ctest command line flags>
Python quirks
When using PlasmaStMan
from python,
special attention needs to be paid to how
the python-casacore
and pyarrow
python packages,
if needed by your python code, are installed
to avoid some otherwise difficult to debug errors.
python-casacore
TL;DR:
Don’t install the pre-built binary wheels from PyPI.
If you can, use the kernsuite repositories to install the
casacore
libraries andpython-casacore
python package from pre-built apt packages.If installing from kernsuite is not an option, then ensure
python-casacore
is built against the samecasacore
installationPlasmaStMan
was built against.
Starting from version 3.4.0, the python-casacore
package
offers pre-built binary wheels
for some major OS and python version combinations.
These binary wheels come bundled
with a copy of the underlying casacore libraries
(libcasa_casa.so
, libcasa_tables.so
, etc)
and their dependencies.
Each of these bundled libraries actually have
a specific SONAME
s and matching filesname
(e.g. libcasa_tables-734048a7.so.6
),
thus avoiding interfering
with any system-wide installation.
On the other hand,
the plug-in mechanism used to register
third-party storage managers with casacore
involves first loading the storage manager shared library into memory,
then invoking a registration function in the library
that registers itself into a static casacore
-owned registration map,
and finally checking that the registration was successful.
This usually looks like this:
+-------------+ 1. dlopen() +----------------+
| casacore.so | -------------------> | plasmastman.so |
+-------------+ +----------------+
^ | ^ | ^ |
| | | | 2. register_plasmastman() | |
| | | \-------------------------------/ |
| | | |
| | | 3. DataMan::registerCtor() |
| | \-------------------------------------/
| |
\--/ 4. check_registration() // all good :)
However when using the binary wheels from PyPI,
and because of the difference in SONAME
between the bundled libraries
and the libraries used to compile the storage manager,
two different copies of casacore.so
are loaded into memory,
and the interaction looks like this:
+----------------------+ 1. dlopen() +----------------+ 1.1 dlopen() +-------------+
| casacore-734048a7.so | ----------------> | plasmastman.so | -------------> | casacore.so |
+----------------------+ +----------------+ +-------------+
^ | | ^ | ^
| | | 2. register_plasmastman() | | |
| | \---------------------------/ | |
| | | |
| | | 3. DataMan::registerCtor() |
| | \---------------------------------/
| |
\--/ 4. check_registration() // fails, registration cannot be found :(
In particular, the error message will look something like:
RuntimeError: Table DataManager error: Data Manager class PlasmaData is not registered
This situation is specific to the binary wheels
distributed via PyPI.
To avoid this issue one must ensure
that the python-casacore
package
uses the same libraries
the storage manager was compiled against.
This could be done either by installing python-casacore
from source and pointing it
to an existing casacore
installation
(which itself might be installed from source or not),
or by using pre-compiled packages
that don’t incur into this duplication of libraries,
like the apt packages provided by the Kernsuite project.
pyarrow
TL;DR:
Pre-built binary wheels from PyPI are incompatible with pre-built Arrow apt packages provided by Apache.
You can install a different version of
pyarrow
alongside the pre-built Arrow apt packages, but this might break in the future.You can install
pyarrow
from sources, building them against the same Arrow/Plasma installationPlasmaStMan
was built against.
Apache Arrow makes available binary wheels in PyPI
for users to install the pyarrow
python package
without needing a compiler or any other external libraries.
Like in the case of python-casacore
,
these binary wheels are bundled
with their own copy of the Arrow shared libraries
(libarrow.so
, libplasma.so
and so on).
For a given version of Arrow,
these libraries share the same SONAME
with those installed via the Arrow apt repositories.
However, the PyPI pyarroww
binary wheels
are compiled using a version of gcc
prior to the introduction
that didn’t offer
a dual ABI mechanism
(read the link for a more detailed explanation).
The effect this has is that
the arrow libraries generated by newer versions of gcc
define differently named symbols
than those generated by older versions of gcc
,
and therefore they cannot be mixed freely
(e.g., linked or dynamically loaded).
This problem has been reported,
but other than acknowledging the issue
and providing some suggestions on how to proceed,
the final response was
that this use case is not officially supported
by the Arrow published artifacts.
Because of this situation,
problems occur if the python
process
loads the storage manager,
which has been compiled against
the apt-installed Arrow libraries,
after importing the PyPI-installed pyarrow
.
In such cases the following situation occurs:
+-------------+
| | 1.1 no dlopen(), library with same SONAME already loaded
| | 1.2 check_required_symbols() // fails, symbol not found
| libarrow.so | <--------------------------\
| | |
+-------------+ 1. dlopen() +----------------+
| casacore.so | -------------------> | plasmastman.so |
+-------------+ +----------------+
In particular, the error message will look something like:
RuntimeError: Shared library plasmastman not found in CASACORE_LDPATH or (DY)LD_LIBRARY_PATH
libcasa_plasmastman.so.4: cannot open shared object file: No such file or directory
libcasa_plasmastman.so: cannot open shared object file: No such file or directory
libplasmastman.so.4: cannot open shared object file: No such file or directory
/usr/local/lib/libplasmastman.so: undefined symbol: _ZN5arrow5fieldENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESt10shared_ptrINS_8DataTypeEEbS6_IKNS_16KeyValueMetadataEE
Note that if pyarrow
has not yet been imported
at the time the storage manager library is loaded
then no error occurs:
+-------------+ 1. dlopen() +----------------+ 1.1 dlopen() +-------------+
| casacore.so | ------------> | plasmastman.so | -------------> | libarrow.so |
+-------------+ +----------------+ +-------------+
| ^
| |
\---------------------------------/
1.2. check_required_symbols() // fine
The situation above is a bit brittle
as it depends on pyarrow
not being loaded at the time.
Moreover, loading it later
might also lead to the same missing symbol error.
A possibility, somewhat fragile,
is to install a version of pyarrow
from PyPI
different to that installed via apt
so the SONAME
of both libraries don’t collide.
That way, plasmastman.so
is forced into loading a different copy of the arrow library
into memory.
This results in the following:
+---------------+
| |
| |
| libarrow.3.so |
| |
+---------------+ 1. dlopen() +----------------+ 1.1 dlopen() +---------------+
| casacore.so | ------------> | plasmastman.so | -------------> | libarrow.4.so |
+---------------+ +----------------+ +---------------+
| ^
| |
\---------------------------------/
2. check_required_symbols() // all good :)
This obviously results in two copies of different versions of the Arrow library loaded into memory. Although we haven’t noticed any side-effects, this might not always be the case.
The ultimate solution is of course
to avoid the problem with bundled libraries altogether
and install pyarrow
from source,
compiling against the same installation of Arrow/Plasma
the PlasmaStMan
was compiled against.
This results on a clean environment,
but has a higher setup cost:
+-------------+
| | 1.1 no dlopen(), library with same SONAME already loaded
| | 1.2 check_required_symbols() // all good :)
| libarrow.so | <--------------------------\
| | |
+-------------+ 1. dlopen() +----------------+
| casacore.so | -------------------> | plasmastman.so |
+-------------+ +----------------+
Usage
PlasmaStMan
maps Apache Arrow Tensors and Tables
(i.e., their Object IDs in the Plasma store)
to individual columns within a casacore Table.
Arrow Tensors map directly to casacore Columns one to one.
The mapping then consists on a pair of strings
indicating the Object ID of the Tensor in the Plasma store
and the name of the casacore Table column it provides data to.
Checks are in place to ensure that a Tensor’s shape and type
match those of the corresponding column of the casacore Table.
All casacore data types are supported by this mapping
with the exception of Strings
.
Arrow Tables on the other hand contain one or more Fields,
which individually map to casacore Columns.
The mapping then consists on a pair of strings
indicating the ObjectID of the Table in the Plasma store
and the name of the Field that should be considered,
which should match the name of the casacore Table column
it provides data to.
Like in the case of Tensors,
a Field’s shape (length) and type are checked
against those of the corresponding column of the casacore Table.
Columns in an Arrow Table have only a single dimension,
so they are currently only supported as scalar columns.
Additionally, Complex values are not supported natively by Arrow Tables,
and therefore Complex
and DComplex
values
are supported as Arrow Struct
objects with r
and i
fields.
Configuration
PlasmaStMan
always needs to connect to a Plasma store.
This happens through a Unix socket in the filesystem.
The location of this socket defaults to /tmp/plasma
,
but its value can be overriden
by setting the PLASMA_SOCKET
environment variable.
Either when reading or writing,
certain aspects of PlasmaStMan
can be configured at runtime via Storage manager properties
(arbitrary key-value pairs).
PlasmaStMan
supports the following properties:
PLASMACONNECTRETRIES
: the number of times the Plasma client should try to connect to the Plasma store before giving up. Defaults to50
.
PLASMAGETTIMEOUT
: the timeout in milliseconds to use when getting an object from the Plasma store that is not immediately available. Defaults to10000
.
Reading
When reading data from a Table
backed by a PlasmaStMan
storage manager
users need to ensured that the libplasmastman
shared library
is visible in the dynamic linker’s path
(e.g., adding the directory containing the library
to the LD_LIBRARY_PATH
environment variable in Linux).
Other than this, existing casacore-based applications do not require any modification or recompilation.
Writing
Note
At the moment PlasmaStMan
does not support writing data to plasma.
Writing is a trickier business.
Even though the data itself cannot be written through PlasmaStMan
,
what can currently be done is creating a casacore table
that points to existing data in Plasma.
To achieve this one must inform the storage manager
about the mapping between Object IDs and columns.
This can be done in two different ways:
If writing a program in C++, one can use the
PlasmaStMan
class to create the storage manager object and bind it to tables. The main constructor of this class accepts twostd::map
objects to provide the mapping from Object ID to column name for Tensors and Tables.Storage managers allow specifications to be given at creation time. This includes the properties specified above, along with the following additional keys:
PLASMASOCKET
: the Unix socket used to connect to Plasma, override thePLASMA_SOCKET
environment variable.
TENSOROBJECTIDS
: a casacoreRecord
object (i.e., a mapping) where keys are Tensor Object IDs and values are column names.
TABLEOBJECTIDS
: a casacoreRecord
object (i.e., a mapping) where keys are Table Object IDs and values are column names.Because this is a generic mechanism, these specifications can be given through different interfaces. For example, the
TaQL
language supports the creation of tables with a given Data Manager specification (see section 8.2, Data manager specification). Thepython-casacore
python bindings also allow the creation of tables with specific Data Manager inforation (seedminfo
argument).
Example
Note
This example needs pyarrow installed.
Included in the ska-sdp-plasmastman
repository
is a python-based script that demonstrates
how to create a casacore Table pointing to Plasma-stored Tensors and Tables.
This can be used to test PlasmaStMan
from external programs:
# Start a plasma store and store tensor and table data with arbitrary values
# and create a table pointing to this new data (using taql).
# Use -h to see a bit more of information on how to use it
$> python scripts/plasma_writer.py -o <table_name> -t <tensor1> -t <tensor2> -T <table1> ... &
# Make the new storage manager visible to third-party apps
$> export LD_LIBRARY_PATH=your-build-directory/src/ska/plasma
# Read the table metadata with casacore's showtableinfo
$> showtableinfo in=<table_name>
# Read the table data back with casacore's taql
$> taql 'select * FROM <table_name>'
Changelog
Migrated to use new cpp-build-base image and new $SKA_CPP_DOCKER_BUILDER_IMAGE variable.
1.3
Added support for runtime properties on the plasma storage manager. Two properties are supported,
PLASMACONNECTRETRIES
andPLASMAGETTIMEOUT
, making it possible to configure plasma-related aspects of the storage manager at runtime.Added validation for user-provided Plasma Object IDs.
1.2
Added support for Arrow Table mapping. Individual Fields/Columns from an Arrow Table can be mapped to the equally named casacore Table. The mapping can be given via the new
TABLEOBJECTIDS
Data Manager specification property.Changed
OBJECTIDS
Data Manager specification property name toTENSOROBJECTIDS
to explicitly state what type of objects do they refer to.Added public C++ API documentation where missing.
1.1
Added support for generic configuration of the plasma storage manager via Data Manager Specification (casacore
Record
) objects. This makes it possible to create casacore Tables with correctly configured plasma storage managers without executables built for that specific purpose. Most unit tests indeed now create Tables usingtaql
, which supports this generic configuration mechanism.
1.0.1
Removed memcheck tests from GitLab CI pipeline.
1.0
First version of the plasma storage manager.
A single column is backed up by a single Tensor stored on a single Plasma store; multiple columns require multiple Tensors stored on a single Plasma store.
Read-only operations are supported for both scalar and array columns.
Shape and type are checked to ensure a Tensor can be used for a given column.
Zero-copy is supported for operations where this can be accomplished, namely: full-column reads (array and scalar columns), single, continuous row range reads (array and scalar columns), and single cell reads (array columns).
Existing programs can use this storage manager without modifications, as demonstrated by tests with
taql
andshowtableinfo
.Table creation is a manual process. A
table_writer
utility is included to help with this.
API
Casacore classes
ska::plasma::PlasmaStMan
and ska::plasma::PlasmaStManColumn
are the two main classes
implementing the Storage Manager API
as mandated by casacore.
-
class PlasmaStMan : public DataManager
The Plasma-based storage manager
This is implemented using a pimpl idiom to hide the particulars of the implementation and hide it from users.
Public Functions
-
PlasmaStMan(std::string plasma_socket = "", const std::map<std::string, ObjectID> &tensor_object_ids = {}, const std::map<std::string, ObjectID> &table_object_ids = {})
Creates a new instance of the Plasma Storage Manager connected to the given socket, and mapping columns to Arrow Tensors and Tables as indicated in the given mappings.
- Parameters
plasma_socket – The UNIX socket where the Plasma store listens for connections. If not given, or empty, it defaults to
/tmp/plasma
, unless thePLASMA_SOCKET
environment variable is set, in which case its value takes precedence.tensor_object_ids – A mapping from column names to Object IDs in the Plasma store where Arrow Tensors with the data for the respective column can be found.
table_object_ids – A mapping from column names to Object IDs in the Plasma store where Arrow Tables with the data for the respective column can be found (the name of the column being mapped must be the same as the column name in the Arrow Table).
-
~PlasmaStMan()
Destructor declaration because of the pimpl idiom, otherwise its implementation is defaulted.
-
void ping_plasma()
See also
-
void set_plasma_get_timeout(std::int64_t timeout)
See also
-
void set_plasma_connect_retries(int connect_retries)
See also
Public Static Functions
-
static casacore::DataManager *makeObject(const casacore::String &aDataManType, const casacore::Record &spec)
Factory function invoked by casacore to create an instance of PlasmaStMan from a given DataManager specification.
See also
- Parameters
aDataManType – The name of the data manager.
spec – The specification of the data manager.
- Returns
A new PlasmaStMan object.
-
class impl
The Plasma-based storage manager implementation
This class fully implements the plasma-based storage manager, while PlasmaStMan only exposes this implementation, while hiding its dependencies.
Public Functions
-
impl(std::string plasma_socket = "", std::map<std::string, ObjectID> tensor_object_ids = {}, std::map<std::string, ObjectID> table_object_ids = {})
See also
-
~impl()
Destructor declaration because of incomplete PlasmaStManColumn type usage in one of our members; otherwise its implementation is defaulted.
-
void ping_plasma()
See also
-
void set_plasma_get_timeout(std::int64_t timeout)
See also
-
void set_plasma_connect_retries(int connect_retries)
-
DataManager *clone() const
See also
PlasmaStMan::clone
-
String dataManagerType() const
See also
PlasmaStMan::dataManagerType
-
String dataManagerName() const
See also
PlasmaStMan::dataManagerName
-
void create64(rownr_t aNrRows)
See also
PlasmaStMan::create64
-
rownr_t open64(rownr_t aRowNr, AipsIO &ios)
See also
PlasmaStMan::open64
-
rownr_t resync64(rownr_t aRowNr)
See also
PlasmaStMan::resync64
-
Bool flush(AipsIO&, Bool doFsync)
See also
PlasmaStMan::flush
-
DataManagerColumn *makeScalarColumn(const String &aName, int aDataType, const String &aDataTypeID)
See also
PlasmaStMan::makeScalarColumn
-
DataManagerColumn *makeDirArrColumn(const String &aName, int aDataType, const String &aDataTypeID)
See also
PlasmaStMan::makeDirArrColumn
-
DataManagerColumn *makeIndArrColumn(const String &aName, int aDataType, const String &aDataTypeID)
See also
PlasmaStMan::makeIndArrColumn
-
void deleteManager()
See also
PlasmaStMan::deleteManager
-
void addRow64(rownr_t aNrRows)
See also
PlasmaStMan::addRow64
-
Record dataManagerSpec() const
See also
PlasmaStMan::dataManagerSpec
-
Record getProperties() const
See also
PlasmaStMan::getProperties
-
void setProperties(const Record &props)
See also
PlasmaStMan::setProperties
-
inline rownr_t nrows() const
Return the number of rows used by all columns managed by this storage manager
- Returns
The number of rows used by all columns managed by this storage manager
Public Static Functions
-
static DataManager *makeObject(const String &aDataManType, const Record &spec)
See also
-
impl(std::string plasma_socket = "", std::map<std::string, ObjectID> tensor_object_ids = {}, std::map<std::string, ObjectID> table_object_ids = {})
-
PlasmaStMan(std::string plasma_socket = "", const std::map<std::string, ObjectID> &tensor_object_ids = {}, const std::map<std::string, ObjectID> &table_object_ids = {})
-
class PlasmaStManColumn : public StManColumnBase
A single column of the Plasma Storage Manager
A PlasmaStManColumn manages a single column on a casacore Table, which will be backed up by an Arrow object stored in Plasma. The actual handling of the underlying Arrow object is done via an ArrowReader instace, which hides the differences between the different types of Arrow objects that can hold data. At the moment the only supported reader is TensorReader (and thus this class still silently assumes that), but more will come. When the Tensor is retrieved from Plasma this class will create the corresponding TensorReader instance, which will ensure the data types are compatible. Also, upon data access (again, through the reader), the tensor’s shape is compared against the column’s cell shape to ensure the tensor and the column define the same dimensionality.
While casacore is column-major, Arrow is by default row-major. On the other hand, the dimensions that this column receives via setShapeColumn are those of individual cells, while Arrow Tensors will contain the full column data. Thus:
The first dimension of the Tensor should always be the number of rows of the column
For the rest of the dimensions, they should match the column cell’s shape in reverse order.
In principle support for non-row-major Tensors should be possible to add, but that is left as a future improvement.
Public Functions
-
PlasmaStManColumn(const std::string &name, PlasmaClient &client, PlasmaStMan::impl &storage_manager, const ArrowObjectInfo &object_info, int dataType)
Create a new PlasmaStManColumn with the given name and data type. Upon construction it connects to Plasma and retrieves the underlying Arrow object, if known at this stage; otherwise a call to initialize_reader needs to be issued later before attempting to read anything.
- Parameters
name – The name of this column.
client – The Plasma client object used to read Arrow objects off Plasma.
storage_manager – A reference to the owning storage manager, used to retrieve the number of rows after table creation.
object_info – Structure containing the Object ID and type of Arrow object to read from Plasma. If the type is ArrowObjectType::UNKNOWN then no reading occurs.
dataType – The data type of this column.
-
void initialize_reader(const ArrowObjectInfo &object_info)
Initializes the underlying reader object with the provided information.
- Parameters
object_info – Structure containing the Object ID and type of Arrow object to read from Plasma. If the type is ArrowObjectType::UNKNOWN then no initialization occurs.
-
bool reader_initialized() const
- Returns
Whether the underlying reader is initialized or not.
Plasma access
-
class PlasmaClient
A class encapsulating access to a Plasma Store.
This class encapsulates access to a Plasma Store. Although it’s a very thin wrapper around ::plasma::PlasmaClient, it adds configuration capabilities around certain aspects, like timeouts, the socket to connect to, retries and others.
Public Functions
-
PlasmaClient(std::string socket)
Create a new PlasmaClient that will connect to the given socket.
- Parameters
socket – The Plasma socket to connect to.
-
void ping()
Ensure communication between the client and the server works.
-
inline void set_get_timeout(std::int64_t timeout)
Set the timeout for the Plasma Get operation, in milliseconds.
- Parameters
timeout – The timeout for the Plasma Get operation, in milliseconds.
-
inline std::int64_t get_timeout() const
- Returns
The timeout for the Plasma Get operation, in milliseconds.
-
inline void set_connect_retries(int connect_retries)
Set the number of attempts to connect to the Plasma socket before failing.
- Parameters
connect_retries – the number of attempts to connect to the Plasma socket before failing.
-
inline int connect_retries() const
- Returns
The number of attempts to connect to the Plasma socket before failing.
-
::plasma::ObjectBuffer get(const ObjectID &object_id)
Read an object from the Plasma store. A plasma_error exception is thrown if no such object is found within the timeout.
- Parameters
object_id – The ID of the object to read.
- Returns
A Plasma Object Buffer pointing to the object in the Plasma Store.
-
inline std::string socket() const
- Returns
The socket where this Plasma client connects to.
-
PlasmaClient(std::string socket)
Data reading
Internally, data reading is organised in a hierarchy of the Reader classes, each taking care of reading different Arrow objects.
-
class ArrowReader
Base class for Arrow data readers used by the PlasmaStManColumn class.
Arrow offers different storage types, like Tensors and Tables. This base class offers a common interface for accessing data from these different storage types.
Subclassed by ska::plasma::TableReader, ska::plasma::TensorReader
Public Functions
-
inline ArrowReader(const std::string &column_name, casacore::DataType data_type)
Constructs a reader for the given data type.
- Parameters
column_name – The casacore column backed by this reader.
data_type – The casacore data type of the column backed by this reader.
-
virtual ~ArrowReader() = default
Virtual destructor required by virtual base class.
-
inline void check_conformance(const Shape &column_shape)
Checks that the data type and the shape of the underlying Arrow object match those of the casacore column this reader backs up. The column data type is known at construction time, and the column shape is given here.
- Parameters
column_shape – The shape of the casacore column this reader backs up.
-
virtual void read_scalar(rownr_t rownr, void *dataPtr) = 0
Read a single scalar value from the underlying Arrow object. The scalar value is that corresponding to the cell in row rownr.
- Parameters
rownr – The (casacore) row number of the cell for which the scalar is being read.
dataPtr – The address where the scalar should be written to.
-
virtual void read_array(ArrayBase &array, std::size_t offset) = 0
Read an array from the underlying Arrow object starting at the given offset. The array’s shape determines how much data is effectively read, and might or might not be able to be created with zero-copy.
- Parameters
array – The array where the data should be read into.
offset – The offset in the underlying Arrow object at which reading will start.
-
inline ArrowReader(const std::string &column_name, casacore::DataType data_type)
-
class TensorReader : public ska::plasma::ArrowReader
An ArrowReader that reads data off an Arrow Tensor.
TODO: The current implementation contains two private templated methods to handle all data types. This means we need to continuously do a runtime check for the casacore data type to choose the correct template instance. This could be avoided by offering a TensorReaderBase class that handles all common aspects, then a TensorReader class templated on the casacore data type, and finally a factory function that is called once from PlasmaStManColumn to create the correct reader for the given casacore data type.
Public Functions
-
TensorReader(const std::string &column_name, casacore::DataType data_type, arrow::io::InputStream *input_stream)
Constructs a TensorReader for the given casacore data type and column from an input stream.
- Parameters
column_name – The casacore column backed by this reader.
data_type – The casacore data type of the column backed by this
input_stream – The input stream from where the Tensor will be read. This is possibly created from an object read from Plasma.
-
virtual void read_scalar(rownr_t rownr, void *dataPtr) override
Read a single scalar value from the underlying Arrow object. The scalar value is that corresponding to the cell in row rownr.
- Parameters
rownr – The (casacore) row number of the cell for which the scalar is being read.
dataPtr – The address where the scalar should be written to.
-
virtual void read_array(ArrayBase &array, std::size_t offset) override
Read an array from the underlying Arrow object starting at the given offset. The array’s shape determines how much data is effectively read, and might or might not be able to be created with zero-copy.
- Parameters
array – The array where the data should be read into.
offset – The offset in the underlying Arrow object at which reading will start.
-
TensorReader(const std::string &column_name, casacore::DataType data_type, arrow::io::InputStream *input_stream)
-
class TableReader : public ska::plasma::ArrowReader
An ArrowReader that reads data off an Arrow Table.
Tables can contain multiple “fields” or “columns”. The column read by this reader is the one with the same name of the casacore Table column backed up by this reader. If no such field/column is found in the Arrow Table then an error is raised. Only Tables written as a single BatchRecord are currently supported.
Public Functions
-
TableReader(const std::string &column_name, casacore::DataType data_type, arrow::io::InputStream *input_stream)
Constructs a TableReader for the given casacore data type and column from an input stream. The column name in casacore must be the same as the column in the Arrow Table that will be read.
- Parameters
column_name – The casacore column backed by this reader. Should be the same as the column in the Arrow Table.
data_type – The casacore data type of the column backed by this
input_stream – The input stream from where the Table will be read. This is possibly created from an object read from Plasma.
-
virtual void read_scalar(rownr_t rownr, void *dataPtr) override
Read a single scalar value from the underlying Arrow object. The scalar value is that corresponding to the cell in row rownr.
- Parameters
rownr – The (casacore) row number of the cell for which the scalar is being read.
dataPtr – The address where the scalar should be written to.
-
virtual void read_array(ArrayBase &array, std::size_t offset) override
Read an array from the underlying Arrow object starting at the given offset. The array’s shape determines how much data is effectively read, and might or might not be able to be created with zero-copy.
- Parameters
array – The array where the data should be read into.
offset – The offset in the underlying Arrow object at which reading will start.
-
TableReader(const std::string &column_name, casacore::DataType data_type, arrow::io::InputStream *input_stream)
Misc
-
class ObjectID
Simple, immutable class containing an Object ID.
This is a simpler version of plasma’s own Object ID class, but without carrying all its dependencies, allowing us to have a specific type to represent Object IDs (other than std::string) without permeating the codebase with plasma dependencies.
Public Functions
-
ObjectID(const std::string &object_id)
Constructs an Object ID for the given string, which must be a valid plasma Object ID.
- Parameters
object_id – The contents of the Object ID
-
ObjectID(const char *object_id)
Constructs an Object ID for the given null-terminated C string, which must be a valid plasma Object ID.
- Parameters
object_id – The contents of the Object ID
-
inline const std::string &string() const
Returns the underlying string.
- Returns
The underlying string
-
inline bool valid() const
Returns whether this is a valid Object ID or not.
- Returns
true if this Object ID is valid
-
ObjectID(const std::string &object_id)