API Usage Overview
Interaction follows standard REST conventions: clients send HTTP requests to the API endpoints and receive JSON responses.
The API can be accessed from a web browser, a command-line tool such as curl or wget, an API platform such as Postman, or from within a Python script.
The Swagger UI documentation provides interactive options to test the API — see Interactive API for more information. For command-line examples using HTTP methods directly, see Access via HTTP Method. For integrating the API in Python, see Python Usage.
Note
This API is typically deployed behind a secure layer that encrypts communication (TLS/SSL) and likely requires user authentication through a separate system. When accessing the API through a browser, both the encryption and the authentication are handled by the browser, but direct access to the API with scripts or notebooks from outside the cluster is currently not supported. To use the API directly, access it from within the cluster where it is hosted.
Note
If a data product has been assigned a context.access_group, it will not be available or listed when accessing the API directly with scripts or notebooks. This is because the access token of an authenticated user, which is required to see such products, is not available in this mode of operation.
Data Product Indexing
Startup Indexing
When the API starts, it performs a background indexing operation:
The API remains responsive during indexing - you can immediately start using it
Indexing runs in the background and may take several minutes for large volumes
Data products become available progressively as they are discovered
The /status endpoint provides indexing progress information
In multi-pod deployments (e.g. horizontally-scaled Kubernetes), only one pod scans
the Persistent Volume at a time. The leader is elected automatically: each pod
writes to a shared dpd_indexing_state coordination table in PostgreSQL on startup.
The first pod to claim a slot for its volume becomes the leader and performs the scan;
all other pods detect the active leader and skip the scan, serving data directly from
the shared table the leader is building. If the leader crashes, its slot is
automatically reclaimed once the heartbeat goes stale (default: 300 seconds, configurable
via INDEXING_HEARTBEAT_STALE_SECONDS).
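The claim-or-skip rule described above can be illustrated with a minimal in-memory sketch. This is not the API's implementation: the real coordination happens in the dpd_indexing_state table in PostgreSQL, where the claim would need to be atomic. The `Slot` class, the `try_claim` function, and the dictionary standing in for the table are hypothetical names introduced only for illustration.

```python
import os
from dataclasses import dataclass

# Documented variable name and default (300 seconds).
STALE_SECONDS = int(os.environ.get("INDEXING_HEARTBEAT_STALE_SECONDS", "300"))


@dataclass
class Slot:
    """Hypothetical stand-in for one row of the coordination table."""
    pod_id: str
    last_heartbeat: float  # seconds, on whatever clock the caller uses


def try_claim(slots, volume, pod_id, now, stale_seconds=STALE_SECONDS):
    """Return True if pod_id is, or becomes, the leader for this volume."""
    slot = slots.get(volume)
    if slot is None or now - slot.last_heartbeat > stale_seconds:
        # No leader yet, or the previous leader's heartbeat went stale:
        # this pod takes over the slot and performs the scan.
        slots[volume] = Slot(pod_id, now)
        return True
    if slot.pod_id == pod_id:
        # The current leader refreshes its own heartbeat.
        slot.last_heartbeat = now
        return True
    # Another pod holds a live lease: skip the scan.
    return False
```

A pod that receives `False` would serve data from the shared table instead of scanning, and could retry later in case the leader's heartbeat has gone stale.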
Indexing Timeout Protection
To prevent indefinitely long startup times, the indexing process has a configurable timeout:
Default timeout: 21600 seconds (6 hours)
Configurable via: the INDEXING_TIMEOUT_SECONDS environment variable
If the timeout is exceeded, indexing stops gracefully with partial results
The API continues operating with the partially indexed data
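As a rough illustration of the timeout behaviour described above, the sketch below stops a scan once a deadline passes and returns whatever has been indexed so far. The function name `index_with_timeout`, the `index_one` callback, and the injectable `clock` are assumptions made for this example; only the environment variable name and its default come from this page.

```python
import os
import time

DEFAULT_TIMEOUT_SECONDS = 21600  # 6 hours, the documented default


def index_with_timeout(paths, index_one, timeout_seconds=None, clock=time.monotonic):
    """Index paths until done or until the timeout is exceeded.

    Returns (indexed_items, timed_out). On timeout the partial results are
    kept, so the caller can continue operating with them.
    """
    if timeout_seconds is None:
        timeout_seconds = float(
            os.environ.get("INDEXING_TIMEOUT_SECONDS", DEFAULT_TIMEOUT_SECONDS)
        )
    deadline = clock() + timeout_seconds
    indexed = []
    for path in paths:
        if clock() >= deadline:
            # Stop gracefully: keep the partial results.
            return indexed, True
        indexed.append(index_one(path))
    return indexed, False
```

Checking the deadline between items, rather than interrupting an item mid-way, is one simple way to keep the partial index consistent.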
You can monitor the indexing status through the /status endpoint, which includes:
indexing.in_progress: Whether indexing is currently running
indexing.progress.indexing_step: Current indexing stage
indexing.coordination_state: Which pod holds the lease, when it last completed, and how many files it indexed
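A minimal Python sketch of reading these fields follows. It assumes the /status response is a JSON object nested exactly as the dotted field names above suggest; the base URL, the helper names, and the summary format are hypothetical, and the shape of coordination_state is not specified here, so it is passed through unchanged.

```python
import json
from urllib.request import urlopen


def fetch_status(base_url):
    """Fetch and decode the /status response (base_url is deployment-specific)."""
    with urlopen(f"{base_url}/status") as response:
        return json.load(response)


def summarise_indexing(status):
    """Build a one-line summary from the documented indexing fields."""
    indexing = status.get("indexing", {})
    state = "running" if indexing.get("in_progress", False) else "idle"
    step = indexing.get("progress", {}).get("indexing_step", "unknown")
    coordination = indexing.get("coordination_state")
    return f"indexing {state} (step: {step}, coordination: {coordination})"
```

From inside the cluster, `summarise_indexing(fetch_status(base_url))` could be called periodically until indexing is reported as idle.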
If indexing times out or you need to refresh the index, trigger a reindex via the
/reindexdataproducts endpoint — see Access via HTTP Method for details.
Connecting to an SDP Configuration database
The API can connect to an SDP Configuration database (etcd) to retrieve flow state information for active processing blocks. This requires one etcd host to be provided:
SDP_ETCD_HOST=<host-name>
This defaults to localhost. If the backend cannot connect to the host, the SDP
flow enrichment is disabled and data products are still served without flow information.
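The fallback described above can be sketched as follows. The `connect` callable is a hypothetical stand-in for whatever etcd client the backend actually uses, and treating a failed connection as an `OSError` is an assumption; only the SDP_ETCD_HOST variable and its localhost default come from this page.

```python
import os


def init_sdp_flow_enrichment(connect, environ=os.environ):
    """Try to connect to the SDP Configuration database.

    Returns (host, client). client is None when the connection fails, which
    means flow enrichment is disabled and products are served without it.
    """
    host = environ.get("SDP_ETCD_HOST", "localhost")
    try:
        client = connect(host)  # hypothetical: returns a client or raises
    except OSError:
        return host, None
    return host, client
```

Returning `None` instead of raising keeps startup from failing when the configuration database is unreachable, which matches the behaviour described above.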
When a connection is established, the API reads flow state for each processing block and stores it in a dedicated flows table (keyed by execution block). Both PV and DLM data products receive flow information via a LATERAL join on their execution block reference — no separate write path is needed per product type.
If the SDP Config DB reports an active processing block for which no data product has been indexed yet, a temporary placeholder row is shown on the dashboard so that in-progress work is visible. The placeholder is automatically replaced by the real data product once it is indexed from the persistent volume.
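The placeholder rule can be illustrated with a small sketch. The dictionary keys `processing_block` and `placeholder` and the function name are invented for this example and do not reflect the API's actual schema; the point is only that a placeholder exists while no indexed product refers to the active processing block.

```python
def dashboard_rows(indexed_products, active_processing_blocks):
    """Combine indexed products with placeholders for in-progress work."""
    rows = list(indexed_products)
    indexed_blocks = {product["processing_block"] for product in indexed_products}
    for block in active_processing_blocks:
        if block not in indexed_blocks:
            # Nothing indexed yet for this block: show a temporary row.
            rows.append({"processing_block": block, "placeholder": True})
    return rows
```

Because the rows are recomputed from the current index, a placeholder disappears without any extra step once the real product has been indexed.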
Note: only one SDP configuration database connection is allowed.
Connecting to the Data Lifecycle Manager
The API can optionally surface data products managed by the SKA Data Lifecycle Manager (DLM) alongside products indexed from the persistent volume. This integration is disabled by default and requires a shared PostgreSQL database that both the API and the DLM can access.
Enable the integration with:
SKA_DATAPRODUCT_DLM_INTERFACE_ENABLED=True
When enabled, the API reads from the DLM data_item table and enriches each
item with storage and location details by joining the storage and
location tables. The following fields are exposed from those tables:
| Field | Source |
|---|---|
Graceful degradation
The API probes table access once at startup using two separate checks:
data_item access: if the database user does not have SELECT permission on dlm.data_item, DLM queries are disabled entirely for the lifetime of the instance.
storage / location access: if only data_item is accessible, DLM items are still returned but without storage and location enrichment.
Access check results are reported in the /status endpoint under
metadata_store_status.dlm_interface_status.
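The two startup checks amount to a three-way decision, sketched below. The `DlmMode` enum and the function name are hypothetical; the real probe results are the ones reported under metadata_store_status.dlm_interface_status.

```python
from enum import Enum


class DlmMode(Enum):
    DISABLED = "disabled"      # no DLM queries for the instance lifetime
    ITEMS_ONLY = "items_only"  # items returned without storage/location
    ENRICHED = "enriched"      # items joined with storage and location


def resolve_dlm_mode(enabled, can_select_data_item, can_select_storage_location):
    """Map the integration flag and the two access probes to an operating mode."""
    if not enabled or not can_select_data_item:
        return DlmMode.DISABLED
    if not can_select_storage_location:
        return DlmMode.ITEMS_ONLY
    return DlmMode.ENRICHED
```

Since the probes run once at startup, the resulting mode would stay fixed until the instance restarts.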