API Usage Overview

Interaction follows standard REST conventions: clients send HTTP requests to the API endpoints and receive JSON responses. The API can be accessed from a web browser, a command-line tool such as curl or wget, an API platform such as Postman, or from within a Python script.

The Swagger UI documentation provides interactive options for testing the API; see Interactive API for more information. For command-line examples using HTTP methods directly, see Access via HTTP Method. For integrating the API in Python, see Python Usage.

Note

This API is typically deployed behind a secure layer that encrypts communication (TLS/SSL) and likely requires user authentication through a separate system. When the API is accessed through a browser, the browser handles both the encryption and the authentication. Direct access to the API from scripts or notebooks outside the cluster is currently not supported; to use the API directly, the user needs to access it from within the cluster where it is hosted.

Note

If a data product has been assigned a context.access_group, it is neither listed nor available when the API is accessed directly from scripts or notebooks. This is because the access token of an authenticated user, which is required for such data products, is not available in this mode of operation.

Data Product Indexing

Startup Indexing

When the API starts, it performs a background indexing operation:

The API remains responsive during indexing, so you can start using it immediately
Indexing runs in the background and may take several minutes for large volumes
Data products become available progressively as they are discovered
The /status endpoint provides indexing progress information

In multi-pod deployments (e.g. horizontally scaled Kubernetes), only one pod scans the Persistent Volume at a time. The leader is elected automatically: on startup, each pod writes to a shared dpd_indexing_state coordination table in PostgreSQL. The first pod to claim a slot for its volume becomes the leader and performs the scan; all other pods detect the active leader and skip the scan, serving data directly from the shared table the leader is building. If the leader crashes, its slot is automatically reclaimed once the heartbeat goes stale (default: 300 seconds, configurable via INDEXING_HEARTBEAT_STALE_SECONDS).
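
The claim-or-skip decision described above can be illustrated with a small in-memory sketch. This is not the API's implementation: the `Lease` record, the `try_claim` function and the dictionary standing in for the dpd_indexing_state table are all hypothetical, and a real deployment would need the claim to be an atomic database operation.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional

# Mirrors the documented default of INDEXING_HEARTBEAT_STALE_SECONDS.
HEARTBEAT_STALE_SECONDS = 300.0


@dataclass
class Lease:
    """One row of a hypothetical coordination table (not the real schema)."""
    holder: str
    heartbeat: float  # time of the holder's last heartbeat, in seconds


def try_claim(table: Dict[str, Lease], volume: str, pod: str,
              now: Optional[float] = None,
              stale_after: float = HEARTBEAT_STALE_SECONDS) -> bool:
    """Return True if `pod` is (or becomes) the leader for `volume`."""
    now = time.time() if now is None else now
    lease = table.get(volume)
    if lease is None or now - lease.heartbeat > stale_after:
        # No leader yet, or the previous leader stopped heartbeating.
        table[volume] = Lease(holder=pod, heartbeat=now)
        return True
    if lease.holder == pod:
        lease.heartbeat = now  # refresh our own lease
        return True
    return False  # another pod is actively scanning, so skip the scan
```

In this sketch a pod that loses the claim simply serves data from the shared table, and it can take over only after the leader's heartbeat is older than the stale threshold.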

Indexing Timeout Protection

To prevent indefinitely long startup times, the indexing process has a configurable timeout:

Default timeout: 21600 seconds (6 hours)
Configurable via: INDEXING_TIMEOUT_SECONDS environment variable
If the timeout is exceeded, indexing stops gracefully with partial results
The API continues operating with the partially indexed data
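
As a rough illustration of the timeout behaviour listed above, the following sketch reads the timeout from the environment and stops a loop once a deadline passes, keeping whatever was indexed so far. The function names and the per-file loop are assumptions made for this example; only the variable name INDEXING_TIMEOUT_SECONDS and its default come from the text.

```python
import os
import time
from typing import Callable, Iterable, List, Mapping, Tuple

DEFAULT_TIMEOUT_SECONDS = 21600  # 6 hours, the documented default


def read_timeout(env: Mapping[str, str] = os.environ) -> float:
    """Read INDEXING_TIMEOUT_SECONDS, falling back to the documented default."""
    return float(env.get("INDEXING_TIMEOUT_SECONDS", DEFAULT_TIMEOUT_SECONDS))


def index_with_deadline(paths: Iterable[str], timeout: float,
                        clock: Callable[[], float] = time.monotonic
                        ) -> Tuple[List[str], bool]:
    """Index paths until finished or until the deadline passes.

    Returns (indexed_paths, timed_out).
    """
    deadline = clock() + timeout
    indexed: List[str] = []
    for path in paths:
        if clock() >= deadline:
            return indexed, True  # stop gracefully and keep partial results
        indexed.append(path)  # stand-in for the real per-file indexing work
    return indexed, False
```

Using a monotonic clock avoids surprises if the system clock is adjusted while a long scan is running.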

You can monitor the indexing status through the /status endpoint, which includes:

indexing.in_progress: Whether indexing is currently running
indexing.progress.indexing_step: Current indexing stage
indexing.coordination_state: Which pod holds the lease, when it last completed, and how many files it indexed
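
A minimal sketch of reading these fields from Python, using only the standard library. It assumes that the dotted names above correspond to nested JSON objects in the /status response and that the caller supplies the base URL of the deployment; the helper names are hypothetical.

```python
import json
from typing import Any, Dict
from urllib.request import urlopen


def fetch_status(base_url: str, timeout: float = 10.0) -> Dict[str, Any]:
    """GET <base_url>/status and decode the JSON body."""
    with urlopen(f"{base_url.rstrip('/')}/status", timeout=timeout) as resp:
        return json.load(resp)


def summarize_indexing(status: Dict[str, Any]) -> str:
    """Build a one-line summary from the documented indexing fields."""
    indexing = status.get("indexing", {})
    # Assumed nesting: indexing -> progress -> indexing_step.
    step = indexing.get("progress", {}).get("indexing_step", "unknown")
    if indexing.get("in_progress"):
        return f"indexing in progress (step: {step})"
    return f"indexing idle (last step: {step})"
```

Calling `summarize_indexing(fetch_status(base_url))` in a loop with a short sleep would give a simple progress monitor while the startup scan runs.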

If indexing times out or you need to refresh the index, trigger a reindex via the /reindexdataproducts endpoint. See Access via HTTP Method for details.

Connecting to an SDP Configuration database

The API can connect to an SDP Configuration database (etcd) to retrieve flow state information for active processing blocks. This requires one etcd host to be provided:

SDP_ETCD_HOST=<host-name>

This defaults to localhost. If the backend cannot connect to the host, SDP flow enrichment is disabled, but data products are still served without flow information.
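
The fallback described above can be sketched as follows. The reachability check is an assumption made for illustration: it only tests a TCP connection on etcd's default client port (2379), whereas the API may detect connectivity differently. The function names are hypothetical.

```python
import os
import socket
from typing import Mapping

ETCD_CLIENT_PORT = 2379  # etcd's default client port, assumed here


def etcd_host(env: Mapping[str, str] = os.environ) -> str:
    """Return SDP_ETCD_HOST, defaulting to localhost as documented."""
    return env.get("SDP_ETCD_HOST", "localhost")


def flow_enrichment_available(host: str, port: int = ETCD_CLIENT_PORT,
                              timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to the etcd host succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Unreachable host: carry on without flow information.
        return False
```

The point of the sketch is that an unreachable host results in a disabled feature rather than a startup failure.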

When a connection is established, the API reads flow state for each processing block and stores it in a dedicated flows table (keyed by execution block). Both PV and DLM data products receive flow information via a LATERAL join on their execution block reference, so no separate write path is needed per product type.

If the SDP Config DB reports an active processing block for which no data product has been indexed yet, a temporary placeholder row is shown on the dashboard so that in-progress work is visible. The placeholder is automatically replaced by the real data product once it is indexed from the persistent volume.
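
The enrichment and placeholder behaviour can be pictured with a small in-memory sketch. The API performs this with a database join; the dictionaries, the field names (`execution_block`, `flow`, `placeholder`) and the function below are hypothetical stand-ins, and the sketch simplifies by keying both products and active work by execution block.

```python
from typing import Dict, List


def enrich_with_flows(products: List[dict], flows: Dict[str, dict]) -> List[dict]:
    """Attach flow state to each product and add placeholder rows for
    execution blocks that have flow state but no indexed product yet."""
    rows: List[dict] = []
    seen = set()
    for product in products:
        eb = product["execution_block"]
        seen.add(eb)
        # flows.get returns None when no flow state exists for this block.
        rows.append({**product, "flow": flows.get(eb), "placeholder": False})
    for eb, flow in flows.items():
        if eb not in seen:
            rows.append({"execution_block": eb, "flow": flow, "placeholder": True})
    return rows
```

Once a product for the same execution block is indexed, rerunning the function yields the real row and no placeholder, which matches the replacement behaviour described above.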

Note: only one SDP configuration database connection is allowed.

Connecting to the Data Lifecycle Manager

The API can optionally surface data products managed by the SKA Data Lifecycle Manager (DLM) alongside products indexed from the persistent volume. This integration is disabled by default and requires a shared PostgreSQL database that both the API and the DLM can access.

Enable the integration with:

SKA_DATAPRODUCT_DLM_INTERFACE_ENABLED=True

When enabled, the API reads from the DLM data_item table and enriches each item with storage and location details by joining the storage and location tables. The following fields are exposed from those tables:

| Field | Source |
|---|---|
| `storage_name` | `dlm.storage.storage_name` |
| `root_directory` | `dlm.storage.root_directory` |
| `location_name` | `dlm.location.location_name` |
| `location_type` | `dlm.location.location_type` |
| `location_country` | `dlm.location.location_country` |
| `location_city` | `dlm.location.location_city` |
| `location_facility` | `dlm.location.location_facility` |

Graceful degradation

The API probes table access once at startup using two separate checks:

data_item access: if the database user does not have SELECT permission on dlm.data_item, DLM queries are disabled entirely for the lifetime of the instance.
storage / location access: if only data_item is accessible, DLM items are still returned but without storage and location enrichment.

Access check results are reported in the /status endpoint under metadata_store_status.dlm_interface_status.
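
The two startup checks amount to a three-way decision, sketched below. The `DlmMode` enum and `resolve_dlm_mode` function are hypothetical names used for illustration only; how the API represents these states internally is not described here.

```python
from enum import Enum


class DlmMode(Enum):
    DISABLED = "disabled"      # no DLM queries at all
    ITEMS_ONLY = "items_only"  # DLM items without storage/location details
    ENRICHED = "enriched"      # DLM items with storage/location details


def resolve_dlm_mode(enabled: bool, can_read_data_item: bool,
                     can_read_storage_location: bool) -> DlmMode:
    """Map the feature flag and the two access checks to an operating mode."""
    if not enabled or not can_read_data_item:
        return DlmMode.DISABLED
    if not can_read_storage_location:
        return DlmMode.ITEMS_ONLY
    return DlmMode.ENRICHED
```

Because the checks run once at startup, the resolved mode in this sketch would be computed a single time and reused for the lifetime of the instance.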