Developer Guide
Package Structure
The SKA Data Product API is organized into a modular package structure:
src/ska_dataproduct_api/
├── api/
│   └── main.py → FastAPI application & endpoints
├── components/
│   ├── annotation.py → DataProductAnnotation model
│   ├── authorisation.py → Access control & user groups
│   ├── metadata.py → DataProductMetadata model
│   ├── mui_datagrid.py → MUI DataGrid configuration
│   ├── pv_interface.py → Persistent Volume interface
│   ├── sdp_config_db.py → SDP Configuration Database interface
│   ├── search/
│   │   └── in_memory_search.py → In-memory search implementation
│   └── store/
│       ├── in_memory.py → In-memory metadata store
│       ├── store_factory.py → Store factory pattern
│       └── postgresql/
│           ├── __init__.py → Exports all PostgreSQL classes
│           ├── _constants.py → SQL column expression constants
│           ├── connector.py → PostgresConnector (connection management)
│           ├── helpers.py → ColumnTracker (dynamic column schema)
│           ├── indexing_state.py → IndexingStateManager (multi-pod coordination)
│           ├── metadata_store.py → PGMetadataStore (CRUD operations)
│           ├── annotations_store.py → PGAnnotationsStore (annotations)
│           ├── search_store.py → PGSearchStore (search orchestration)
│           └── query_builder.py → PostgreSQLQueryBuilder (SQL construction)
├── configuration/
│   └── settings.py → Application settings & environment config
└── utilities/
    ├── exceptions.py → Custom exception classes
    ├── helperfunctions.py → Utility functions
    ├── column_headers.py → Column labels & descriptions (/en/humanreadable)
    └── volume_identity.py → Volume identity & table-prefix management
tests/
├── conftest.py → Pytest fixtures
├── mock_postgressql.py → PostgreSQL mocking utilities
Key Modules
- API Layer (api/main.py): FastAPI application with REST endpoints for metadata retrieval, search, file downloads, and status checks.
- Components
  - Metadata Models: DataProductMetadata and DataProductAnnotation define the data structures
  - Authorisation: Access control using user group lists and DLM integration
  - Storage Backends: In-memory and PostgreSQL implementations with factory pattern
  - Search: MUI DataGrid filtering and pagination support
- PostgreSQL Backend (store/postgresql/): Modular PostgreSQL implementation with the following key classes:
  - PostgresConnector: Database connection management
  - PGMetadataStore: CRUD operations for metadata
  - PGAnnotationsStore: Annotation-specific operations
  - PGSearchStore: High-level search orchestration
  - PostgreSQLQueryBuilder: SQL query construction
- Configuration (configuration/settings.py): Environment-based settings using Pydantic for validation and type safety.
- Utilities
  - Exceptions: Custom exception hierarchy for error handling
  - Helper Functions: Common utilities for file operations, JSON processing, etc.
Tooling Pre-requisites
The following tools are required to work with the data product API:
- Python 3.10 or later: https://www.python.org/downloads/
- Poetry 2.0 or later: https://python-poetry.org/docs/#installation
- GNU make 4.2 or later: https://www.gnu.org/software/make/
Development setup
Clone the repository and its submodules:
git clone --recurse-submodules https://gitlab.com/ska-telescope/ska-dataproduct-api.git
A local PostgreSQL database is required for persistent metadata storage, annotation support, and optional DLM integration (see DLM Integration Configuration below). Without it the API runs in in-memory mode.
A development PostgreSQL instance can be started in Docker with:
Note
You will be required to set a developer password for your database instance.
Supply it via SKA_DATAPRODUCT_API_POSTGRESQL_PASSWORD (see Running the application directly).
make create-dev-postgres
Remember to start the Docker daemon before running the command above. You can skip this step if you do not need annotations or DLM integration.
Data Product Indexing Configuration
The API indexes data products at startup to make them searchable. You can configure the indexing behavior:
Environment Variables:
# Timeout for indexing operation (default: 21600 seconds = 6 hours)
INDEXING_TIMEOUT_SECONDS=21600
# Number of data products to process in each indexing batch (default: 100)
INDEXING_BATCH_SIZE=100
Indexing Behavior:
- Runs automatically at startup in the background.
- Times out after INDEXING_TIMEOUT_SECONDS if not complete.
- On timeout, returns partial results and continues operation.
- Does not automatically retry; a manual reindex is required via the /reindexdataproducts endpoint.
For very large persistent volumes, increase INDEXING_TIMEOUT_SECONDS to allow sufficient time.
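The interaction between the batch size and the timeout can be sketched as follows. This is a minimal illustration, not the API's actual indexing code: `index_in_batches`, `index_one`, and the injectable `clock` are hypothetical names, and checking the deadline once per batch is an assumption.

```python
import time
from typing import Callable, Iterable, List, Tuple


def index_in_batches(
    paths: Iterable[str],
    index_one: Callable[[str], None],
    batch_size: int = 100,          # mirrors INDEXING_BATCH_SIZE
    timeout_seconds: float = 21600,  # mirrors INDEXING_TIMEOUT_SECONDS
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], bool]:
    """Index paths batch by batch and stop once the deadline has passed.

    Returns (indexed_paths, timed_out). Hypothetical helper for illustration.
    """
    deadline = clock() + timeout_seconds
    indexed: List[str] = []
    batch: List[str] = []

    def flush() -> None:
        # Process every path collected so far, then empty the batch.
        for path in batch:
            index_one(path)
            indexed.append(path)
        batch.clear()

    for path in paths:
        batch.append(path)
        if len(batch) >= batch_size:
            flush()
            if clock() >= deadline:
                # Partial results are kept; nothing retries automatically.
                return indexed, True
    flush()  # final, possibly short, batch
    return indexed, False
```

With a larger batch size the deadline is checked less often, so a run may overshoot the timeout by up to one batch in this sketch.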
Per-Volume PostgreSQL Table Names
To support multiple independent Persistent Volumes sharing a single PostgreSQL instance, each PV is assigned a stable identity that drives all of its table names.
Volume identity file
On the first startup against a fresh PV, the API writes a UUID v4 into a small file at the root of the storage path:
<PERSISTENT_STORAGE_PATH>/.dpd-volume-id
The first 12 hex characters of that UUID become the table-name prefix, e.g.
dpd_a3f1c2d4e5f6. All subsequent pod restarts and re-deployments against the same
PV read the same file and therefore use the same table names, so data is never lost on
restart.
The metadata and annotations tables are named after this prefix:
<schema>.dpd_a3f1c2d4e5f6_metadata_table
<schema>.dpd_a3f1c2d4e5f6_annotations_table
The identity filename is configurable:
# Name of the UUID identity file (default: .dpd-volume-id)
SKA_DATAPRODUCT_API_DPD_VOLUME_ID_FILE=.dpd-volume-id
If the storage path is not writable or not accessible at startup (e.g. no PV mounted),
the prefix falls back to None and the API uses fixed legacy table names. A warning
is emitted in the application logs.
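The identity-file behaviour described above can be sketched roughly as below. `resolve_table_prefix` is a hypothetical name, the on-disk format (the UUID's canonical string form) is an assumption, and treating a corrupt file the same as an unwritable path is also an assumption rather than documented behaviour.

```python
import uuid
from pathlib import Path
from typing import Optional


def resolve_table_prefix(
    storage_path: str, id_filename: str = ".dpd-volume-id"
) -> Optional[str]:
    """Return 'dpd_<first 12 hex chars of the volume UUID>', or None on failure."""
    id_file = Path(storage_path) / id_filename
    try:
        if id_file.exists():
            # Existing volume: reuse the stored identity.
            volume_id = uuid.UUID(id_file.read_text(encoding="utf-8").strip())
        else:
            # Fresh volume: generate a UUID v4 and persist it.
            volume_id = uuid.uuid4()
            id_file.write_text(str(volume_id), encoding="utf-8")
    except (OSError, ValueError):
        # Path missing or unwritable (OSError), or unreadable content
        # (ValueError): the caller falls back to legacy table names.
        return None
    return f"dpd_{volume_id.hex[:12]}"
```

Calling the function twice against the same directory yields the same prefix, which is what keeps table names stable across restarts.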
Multi-Pod Leader Election
When multiple API pods share the same PV (e.g. in a horizontally-scaled Kubernetes
deployment), only one pod should scan the PV and rebuild the metadata tables.
A shared PostgreSQL table called dpd_indexing_state coordinates this:
| Column | Description |
|---|---|
| | Primary key; the |
| | Hostname (pod name) of the instance that claimed leadership. |
| | |
| | Timestamp when the current leader claimed its slot. |
| | Updated periodically by the leader; used to detect crashed pods. |
| | Timestamp of the last successful completion. |
| | Number of files processed in the last run. |
Claim logic at startup:
- Each pod attempts to INSERT … ON CONFLICT DO UPDATE into dpd_indexing_state for its volume prefix.
- The upsert succeeds (pod becomes leader) when any of the following are true:
  - No row exists yet for this prefix.
  - The existing row has indexing_running = FALSE (previous run completed or crashed cleanly).
  - The heartbeat is stale (> 300 seconds old), indicating the previous leader crashed.
  - The row already belongs to this pod (pod is restarting mid-index).
When another pod holds a fresh claim, the startup pod skips the PV scan entirely and serves data from the shared tables that the leader is building.
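The four claim conditions can be expressed as a small predicate. This is only a sketch of the decision logic: in the API the check happens atomically inside the PostgreSQL upsert, and the names `IndexingState`, `leader_hostname`, `last_heartbeat`, and `can_claim` are hypothetical (only `indexing_running` and the 300-second threshold come from the text).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

STALE_AFTER = timedelta(seconds=300)  # heartbeat staleness threshold


@dataclass
class IndexingState:
    """One row of dpd_indexing_state (field names partly hypothetical)."""

    leader_hostname: str
    indexing_running: bool
    last_heartbeat: datetime


def can_claim(row: Optional[IndexingState], hostname: str, now: datetime) -> bool:
    """Return True when this pod may become (or remain) the indexing leader."""
    if row is None:
        return True  # no row exists yet for this prefix
    if not row.indexing_running:
        return True  # previous run is no longer marked as running
    if now - row.last_heartbeat > STALE_AFTER:
        return True  # previous leader stopped sending heartbeats
    return row.leader_hostname == hostname  # this pod is restarting mid-index
```

A pod for which `can_claim` is false would skip the PV scan and read from the shared tables instead.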
The dpd_indexing_state table is not prefixed by volume — it is a single shared
table in the DPD schema that covers all volumes.
The current coordination state is visible in the /status endpoint under
indexing.coordination_state.
Note
dpd_indexing_state is created with CREATE TABLE IF NOT EXISTS, so it is safe
to start multiple pods simultaneously. All other DPD tables are also created with
IF NOT EXISTS — metadata is preserved across pod restarts and updated via upserts
rather than a destructive DROP + CREATE.
DLM Integration Configuration
The API can optionally read data items from a shared SKA Data Lifecycle Manager PostgreSQL database. This requires:
- The same PostgreSQL instance to be accessible to both the DPD API and the DLM.
- The DPD database user to have SELECT on at minimum dlm.data_item. Granting SELECT on dlm.storage and dlm.location as well enables storage/location enrichment.
Environment Variables:
# Enable the DLM integration (default: False)
SKA_DATAPRODUCT_DLM_INTERFACE_ENABLED=True
# PostgreSQL schema that holds the DLM tables (default: dlm)
SKA_DATAPRODUCT_API_POSTGRESQL_DLM_SCHEMA=dlm
# DLM data item table name within the schema above (default: data_item)
SKA_DATAPRODUCT_API_POSTGRESQL_DLM_METADATA_TABLE_NAME=data_item
Access probing at startup:
The API performs two lightweight SELECT 1 probes at startup:
- dlm.data_item: if this fails, the DLM query path is disabled entirely.
- dlm.storage and dlm.location: if this fails, DLM items are still returned but without storage/location enrichment.
Probe results are reported in GET /status under
metadata_store_status.dlm_interface_status.
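The two-stage probe can be sketched as a function that takes the actual database check as a parameter. `probe_dlm_access`, the `can_select` callable, and the result keys are hypothetical; the real status payload may use different names.

```python
from typing import Callable, Dict


def probe_dlm_access(
    can_select: Callable[[str], bool],
    schema: str = "dlm",
    item_table: str = "data_item",
) -> Dict[str, bool]:
    """Run the two access probes described above.

    can_select(table) is a hypothetical callable that returns True when a
    lightweight `SELECT 1` against the given table succeeds.
    """
    items_ok = can_select(f"{schema}.{item_table}")
    # Enrichment is only meaningful when the primary table is readable.
    enrichment_ok = items_ok and all(
        can_select(f"{schema}.{name}") for name in ("storage", "location")
    )
    return {"dlm_query_enabled": items_ok, "enrichment_enabled": enrichment_ok}
```

Passing the probe as a callable keeps the decision logic testable without a live database connection.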
Running the application directly
All settings have sensible defaults, so for a basic local run no .env file is
required. The only value that is never defaulted is the PostgreSQL password — if you
want to use PostgreSQL you must supply it.
What you typically need to set for local development
Create a .env file at the repo root and override only what differs from the
defaults shown below:
# Path scanned for data products.
# Default: ./tests/test_files/product (the built-in test fixtures)
PERSISTENT_STORAGE_PATH=./tests/test_files/product
# PostgreSQL connection — leave HOST empty to run without PostgreSQL
# (the API will use an in-memory store instead).
SKA_DATAPRODUCT_API_POSTGRESQL_HOST=localhost
SKA_DATAPRODUCT_API_POSTGRESQL_USER=postgres
# CORS: the URL/port of the dashboard frontend (used to build the allowed-origins list).
# Only needs changing if your local dashboard runs on a non-standard port.
REACT_APP_SKA_DATAPRODUCT_DASHBOARD_URL=http://localhost
REACT_APP_SKA_DATAPRODUCT_DASHBOARD_PORT=8100
Store the PostgreSQL password in a separate .secrets file (never commit this file):
SKA_DATAPRODUCT_API_POSTGRESQL_PASSWORD=<your postgres password>
Note
Without a PostgreSQL host configured the API starts in in-memory mode: data products are indexed and searchable, but annotations are not persisted across restarts.
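How the host and password settings might drive the choice of store can be sketched as below. `select_store_backend` and its return values are hypothetical, and raising an error when the password is missing is an assumption; the actual store factory may behave differently.

```python
import os
from typing import Mapping


def select_store_backend(env: Mapping[str, str] = os.environ) -> str:
    """Pick a metadata store from environment-style settings (illustrative)."""
    host = env.get("SKA_DATAPRODUCT_API_POSTGRESQL_HOST", "").strip()
    if not host:
        # Empty or unset host: run with the in-memory store.
        return "in_memory"
    if not env.get("SKA_DATAPRODUCT_API_POSTGRESQL_PASSWORD"):
        # Assumption: the password has no default, so fail early.
        raise ValueError(
            "SKA_DATAPRODUCT_API_POSTGRESQL_PASSWORD must be set "
            "when a PostgreSQL host is configured"
        )
    return "postgresql"
```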
To start the application:
make run-dev
This installs the development environment for the project and starts the application. To check that the application started successfully, open http://localhost:8001/status in your browser; you should see a JSON response with the status of the API.
Running the application via Docker
To run the application inside a docker container on your host machine:
Note
When running the application in a docker container, the <PERSISTENT_STORAGE_PATH> needs to be accessible from within the container. You can mount the test folder into this location as done below:
docker build -t ska-dataproduct-api .
docker run -p 8000:8000 -v <YOUR_PROJECT_DIR>/ska-dataproduct-api/tests:/usr/src/ska_dataproduct_api/tests ska-dataproduct-api
Uvicorn will then be running on http://127.0.0.1:8000.
Column Header Labels and Descriptions
This section documents the pipeline that produces the human-readable column labels and
tooltip descriptions served by GET /en/humanreadable, and explains how to extend it.
How it works
Column header text and tooltip descriptions are assembled from four schema sources at
runtime by functions in utilities/column_headers.py, merged, and returned as a
single JSON response by GET /en/humanreadable.
schema/columns.json ─╮
ska_sdp_dataproduct_metadata ├─► column_headers.py ─► GET /en/humanreadable
ska_sdp_config (Pydantic) ─╯ │
▼
{ "execution_block": "Execution Block",
...
"description": {
"execution_block": "Unique identifier...",
...
}
}
The response has two layers:
- Top-level keys: flat mapping of field_name → display label. Consumed by the frontend via tColumns(item.field).
- "description" sub-object: nested mapping of field_name → tooltip text. Consumed by the frontend via tColumns(`description.${item.field}`). Fields without a description are absent from this object; the frontend shows no tooltip for them.
Sources of truth
| Field group | Source file | Extractor function |
|---|---|---|
| DPD-specific fields ( | | |
| DLM-specific fields ( | | |
| Obscore / metadata fields ( | | |
| SDP flow fields ( | | |
columns.json schema format
Each entry in schema/columns.json is an object with label and description keys:
{
"dpd_columns": {
"execution_block": {
"label": "Execution block",
"description": "Unique identifier for the execution block that produced this data product."
},
"date_created": {
"label": "Date created",
"description": "Date and time at which this data product was created by the pipeline."
}
},
"dlm_columns": {
"item_name": {
"label": "Item Name",
"description": "Human-readable name of the data lifecycle management item."
}
}
}
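To make the relationship between this schema and the two-layer response concrete, the sketch below flattens a columns.json-style dictionary into top-level labels plus a "description" sub-object. `build_humanreadable` is a hypothetical name, and the sketch covers only the columns.json source, not the merge with the upstream metadata and SDP schemas.

```python
from typing import Any, Dict


def build_humanreadable(
    schema: Dict[str, Dict[str, Dict[str, str]]]
) -> Dict[str, Any]:
    """Flatten columns.json sections into labels plus a description map."""
    response: Dict[str, Any] = {}
    descriptions: Dict[str, str] = {}
    for section in schema.values():  # e.g. "dpd_columns", "dlm_columns"
        for field, entry in section.items():
            response[field] = entry["label"]
            # Fields without a description are simply left out.
            if entry.get("description"):
                descriptions[field] = entry["description"]
    response["description"] = descriptions
    return response
```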
How to add a new column
1. Backend: add a 3-tuple (field, width, type) to _column_specs in components/mui_datagrid.py. The column will automatically be:
   - visible if field is listed in DEFAULT_COLUMNS (i.e. hide = False), hidden otherwise.
   - sorted into the correct position by _sort_columns.
2. Schema: add the label and description to the appropriate section of schema/columns.json (for DPD- or DLM-owned fields), or add a title/description to the upstream schema (metadata.json for obscore/config/context fields, or a Pydantic Field(description=...) for SDP flow fields).
3. Default visibility: add the field name to DEFAULT_COLUMNS in configuration/settings.py if it should be visible by default.
How to add a description for an existing column
- DPD / DLM fields: edit the description value of the matching entry in schema/columns.json.
- Obscore / config / context fields: add or update the description key in the property definition inside ska_sdp_dataproduct_metadata’s metadata.json.
- SDP flow fields: add description="..." to the Pydantic Field() call in the relevant model in ska_sdp_config’s flow.py.
Default Visible Columns
DEFAULT_COLUMNS in configuration/settings.py is the single source of truth
for which columns are visible by default and what order they appear in the DataGrid.
Pipeline
DEFAULT_COLUMNS in settings.py
│
├─► col.hide = field not in DEFAULT_COLUMNS (MuiDataGridConfig.__init__)
│ computed for every static column in _column_specs
│ also applied to dynamic columns in update_columns()
│
├─► column order = sorted by DEFAULT_COLUMNS index (_sort_columns)
│ DEFAULT_COLUMNS fields appear first, in list order
│ remaining fields follow in stable existing order
│
└─► GET /muidatagridconfig → col.hide sent to frontend
The frontend reads col.hide from /muidatagridconfig and uses it to build the
initial columnVisibilityModel passed to the MUI DataGrid.
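The hide and ordering rules in the pipeline can be sketched in a few lines. `build_columns` and the dictionary layout are hypothetical stand-ins for what MuiDataGridConfig and _sort_columns do; only the (field, width, type) tuple shape and the two rules come from the text.

```python
from typing import Dict, List, Sequence, Tuple

ColumnSpec = Tuple[str, int, str]  # (field, width, type), as in _column_specs


def build_columns(
    specs: Sequence[ColumnSpec], default_columns: Sequence[str]
) -> List[Dict[str, object]]:
    """Compute hide flags and order columns by their DEFAULT_COLUMNS index."""
    order = {field: index for index, field in enumerate(default_columns)}
    columns = [
        {"field": field, "width": width, "type": col_type, "hide": field not in order}
        for field, width, col_type in specs
    ]
    # sorted() is stable, so columns outside DEFAULT_COLUMNS keep their
    # existing relative order after the default ones.
    return sorted(columns, key=lambda col: order.get(col["field"], len(order)))
```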
Configuring default columns per deployment
Override DEFAULT_COLUMNS without code changes by adding the environment variable to
your .env file (local development) or Helm/Kubernetes values (deployment):
SKA_DATAPRODUCT_API_DEFAULT_COLUMNS='["execution_block","date_created","obscore.s_ra"]'
The value must be a valid JSON array of field-name strings. If the variable is absent
or contains invalid JSON, the built-in default list in settings.py is used.
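The fallback behaviour can be sketched as a small parser. `parse_default_columns` is a hypothetical name; falling back when the JSON is valid but is not an array of strings is an assumption that goes slightly beyond what the text states.

```python
import json
from typing import List, Optional


def parse_default_columns(raw: Optional[str], fallback: List[str]) -> List[str]:
    """Parse SKA_DATAPRODUCT_API_DEFAULT_COLUMNS, or return the built-in list."""
    if not raw:
        return list(fallback)  # variable absent or empty
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return list(fallback)  # invalid JSON
    if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
        return list(fallback)  # assumption: wrong shape is treated as invalid
    return value
```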
Adding translations for a new language
The /en/humanreadable endpoint serves English labels and descriptions. To add a
translation for another language without any backend changes:
1. Create public/locales/{lng}/humanreadable.json in the dashboard repository with entries for whichever fields you want to override:
   {
     "execution_block": "Bloc d'exécution",
     "description": {
       "execution_block": "Identifiant unique du bloc d'exécution..."
     }
   }
2. Update constructLoadPath in src/i18n.jsx to fall back to the local file for non-English languages.
The i18next library will merge the local file with the English API response, so only the keys present in the local file are overridden; all others fall back to English.
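The override semantics can be illustrated with a recursive merge. This is a Python sketch of the behaviour described above, not i18next's own implementation, and `merge_translations` is a hypothetical name.

```python
from typing import Any, Dict


def merge_translations(english: Dict[str, Any], local: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay local translations on the English response, key by key."""
    merged = dict(english)
    for key, value in local.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects such as "description" are merged recursively.
            merged[key] = merge_translations(merged[key], value)
        else:
            merged[key] = value
    return merged
```

Any field missing from the local file, at either layer, keeps its English text.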