Developer Guide

Package Structure

The SKA Data Product API is organized into a modular package structure:

src/ska_dataproduct_api/
├── api/
│   └── main.py                    → FastAPI application & endpoints
├── components/
│   ├── annotation.py              → DataProductAnnotation model
│   ├── authorisation.py           → Access control & user groups
│   ├── metadata.py                → DataProductMetadata model
│   ├── mui_datagrid.py            → MUI DataGrid configuration
│   ├── pv_interface.py            → Persistent Volume interface
│   ├── sdp_config_db.py           → SDP Configuration Database interface
│   ├── search/
│   │   └── in_memory_search.py    → In-memory search implementation
│   └── store/
│       ├── in_memory.py           → In-memory metadata store
│       ├── store_factory.py       → Store factory pattern
│       └── postgresql/
│           ├── __init__.py        → Exports all PostgreSQL classes
│           ├── _constants.py      → SQL column expression constants
│           ├── connector.py       → PostgresConnector (connection management)
│           ├── helpers.py         → ColumnTracker (dynamic column schema)
│           ├── indexing_state.py  → IndexingStateManager (multi-pod coordination)
│           ├── metadata_store.py  → PGMetadataStore (CRUD operations)
│           ├── annotations_store.py → PGAnnotationsStore (annotations)
│           ├── search_store.py    → PGSearchStore (search orchestration)
│           └── query_builder.py   → PostgreSQLQueryBuilder (SQL construction)
├── configuration/
│   └── settings.py                → Application settings & environment config
└── utilities/
    ├── exceptions.py              → Custom exception classes
    ├── helperfunctions.py         → Utility functions
    ├── column_headers.py          → Column labels & descriptions (/en/humanreadable)
    └── volume_identity.py         → Volume identity & table-prefix management
tests/
├── conftest.py                    → Pytest fixtures
├── mock_postgressql.py            → PostgreSQL mocking utilities

Key Modules

API Layer (api/main.py)

FastAPI application with REST endpoints for metadata retrieval, search, file downloads, and status checks.

Components
  • Metadata Models: DataProductMetadata and DataProductAnnotation define the data structures

  • Authorisation: Access control using user group lists and DLM integration

  • Storage Backends: In-memory and PostgreSQL implementations with factory pattern

  • Search: MUI DataGrid filtering and pagination support

PostgreSQL Backend (store/postgresql/)

Modular PostgreSQL implementation with the following key classes:

  • PostgresConnector: Database connection management

  • PGMetadataStore: CRUD operations for metadata

  • PGAnnotationsStore: Annotation-specific operations

  • PGSearchStore: High-level search orchestration

  • PostgreSQLQueryBuilder: SQL query construction

Configuration (configuration/settings.py)

Environment-based settings using Pydantic for validation and type safety.

Utilities
  • Exceptions: Custom exception hierarchy for error handling

  • Helper Functions: Common utilities for file operations, JSON processing, etc.

Tooling Pre-requisites

Below are some tools that will be required to work with the data product API:

Development setup

Clone the repository and its submodules:

git clone --recurse-submodules https://gitlab.com/ska-telescope/ska-dataproduct-api.git

A local PostgreSQL database is required for persistent metadata storage, annotation support, and optional DLM integration (see DLM Integration Configuration below). Without it the API runs in in-memory mode.

A development PostgreSQL instance can be started in Docker with:

make create-dev-postgres

Remember to start the Docker daemon before running the command above. You can skip this step if you do not need annotations or DLM integration.

Note

You will be required to set a developer password for your database instance. Supply it via SKA_DATAPRODUCT_API_POSTGRESQL_PASSWORD (see Running the application directly).

Data Product Indexing Configuration

The API indexes data products at startup to make them searchable. You can configure the indexing behavior:

Environment Variables:

# Timeout for indexing operation (default: 21600 seconds = 6 hours)
INDEXING_TIMEOUT_SECONDS=21600

# Number of data products to process in each indexing batch (default: 100)
INDEXING_BATCH_SIZE=100

Indexing Behavior:

  • Runs automatically at startup in the background

  • Times out after INDEXING_TIMEOUT_SECONDS if not complete

  • On timeout, returns partial results and continues operation

  • Does not automatically retry - manual reindex required via /reindexdataproducts endpoint

For very large persistent volumes, increase INDEXING_TIMEOUT_SECONDS to allow sufficient time.
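
As a rough illustration of how the two settings interact, the sketch below processes paths in batches and stops once a deadline has passed, keeping whatever was indexed so far. The function name index_in_batches and the injectable clock are assumptions made for this example; they are not taken from the API's code.

```python
import os
import time
from typing import Callable, Iterable, List, Tuple

# Defaults mirror the documented values for the two environment variables.
INDEXING_TIMEOUT_SECONDS = float(os.environ.get("INDEXING_TIMEOUT_SECONDS", "21600"))
INDEXING_BATCH_SIZE = int(os.environ.get("INDEXING_BATCH_SIZE", "100"))


def index_in_batches(
    paths: Iterable[str],
    batch_size: int = INDEXING_BATCH_SIZE,
    timeout_seconds: float = INDEXING_TIMEOUT_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], bool]:
    """Hypothetical helper: return (indexed_paths, timed_out)."""
    deadline = clock() + timeout_seconds
    indexed: List[str] = []
    batch: List[str] = []
    for path in paths:
        batch.append(path)
        if len(batch) == batch_size:
            indexed.extend(batch)  # stand-in for the real per-batch work
            batch = []
            if clock() >= deadline:
                # Partial result: the caller keeps serving what was indexed.
                return indexed, True
    indexed.extend(batch)  # flush the final, possibly short, batch
    return indexed, False
```

In this sketch the deadline is only checked between batches, so a smaller batch size gives a more responsive timeout at the cost of more round trips.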

Per-Volume PostgreSQL Table Names

To support multiple independent Persistent Volumes sharing a single PostgreSQL instance, each PV is assigned a stable identity that drives all of its table names.

Volume identity file

On the first startup against a fresh PV, the API writes a UUID v4 into a small file at the root of the storage path:

<PERSISTENT_STORAGE_PATH>/.dpd-volume-id

The first 12 hex characters of that UUID become the table-name prefix, e.g. dpd_a3f1c2d4e5f6. All subsequent pod restarts and re-deployments against the same PV read the same file and therefore use the same table names, so data is never lost on restart.

The metadata and annotations tables are named after this prefix:

<schema>.dpd_a3f1c2d4e5f6_metadata_table
<schema>.dpd_a3f1c2d4e5f6_annotations_table

The identity filename is configurable:

# Name of the UUID identity file (default: .dpd-volume-id)
SKA_DATAPRODUCT_API_DPD_VOLUME_ID_FILE=.dpd-volume-id

If the storage path is not writable or not accessible at startup (e.g. no PV mounted), the prefix falls back to None and the API uses fixed legacy table names. A warning is emitted in the application logs.
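
The identity-file behaviour described above can be sketched as follows. This is a minimal illustration, not the code in utilities/volume_identity.py: the function names table_prefix and table_names are hypothetical, and treating a corrupt identity file the same way as an unwritable path is an assumption.

```python
import uuid
from pathlib import Path
from typing import Optional, Tuple


def table_prefix(storage_path: str, id_file: str = ".dpd-volume-id") -> Optional[str]:
    """Hypothetical helper: return 'dpd_<12 hex>' or None if the path is unusable."""
    identity = Path(storage_path) / id_file
    try:
        if identity.exists():
            # Later startups reuse the stored UUID, so the prefix is stable.
            volume_id = uuid.UUID(identity.read_text().strip())
        else:
            # First startup against a fresh PV: create and persist a UUID v4.
            volume_id = uuid.uuid4()
            identity.write_text(str(volume_id))
    except (OSError, ValueError):
        # Unreadable, unwritable, or corrupt identity file: no prefix.
        return None
    return f"dpd_{volume_id.hex[:12]}"


def table_names(schema: str, prefix: str) -> Tuple[str, str]:
    """Metadata and annotations table names for a given prefix."""
    return (
        f"{schema}.{prefix}_metadata_table",
        f"{schema}.{prefix}_annotations_table",
    )
```

Using the UUID's hex form avoids counting the hyphens of the textual representation when taking the first 12 characters.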

Multi-Pod Leader Election

When multiple API pods share the same PV (e.g. in a horizontally-scaled Kubernetes deployment), only one pod should scan the PV and rebuild the metadata tables. A shared PostgreSQL table called dpd_indexing_state coordinates this:

Column | Description
volume_prefix | Primary key; the dpd_<hex> prefix for the volume being indexed.
indexed_by | Hostname (pod name) of the instance that claimed leadership.
indexing_running | TRUE while the leader is actively scanning the PV.
started_at | Timestamp when the current leader claimed its slot.
last_heartbeat | Updated periodically by the leader; used to detect crashed pods.
completed_at | Timestamp of the last successful completion.
files_indexed | Number of files processed in the last run.

Claim logic at startup:

  1. Each pod attempts an upsert (INSERT ... ON CONFLICT DO UPDATE) into dpd_indexing_state for its volume prefix.

  2. The upsert succeeds (pod becomes leader) when any of the following are true:

    • No row exists yet for this prefix.

    • The existing row has indexing_running = FALSE (previous run completed or crashed cleanly).

    • The heartbeat is stale (> 300 seconds old), indicating the previous leader crashed.

    • The row already belongs to this pod (pod is restarting mid-index).

  3. When another pod holds a fresh claim, the startup pod skips the PV scan entirely and serves data from the shared tables that the leader is building.
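
The four claim conditions can be written out as a small predicate. This is only an illustration in Python: in the API the conditions are evaluated as part of the upsert itself, and the names can_claim and StateRow are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional, TypedDict

STALE_AFTER = timedelta(seconds=300)  # heartbeat threshold from the list above


class StateRow(TypedDict):
    """The subset of dpd_indexing_state columns the claim depends on."""

    indexed_by: str
    indexing_running: bool
    last_heartbeat: datetime


def can_claim(row: Optional[StateRow], pod_name: str, now: datetime) -> bool:
    """Hypothetical predicate mirroring the four claim conditions."""
    if row is None:  # no row exists yet for this prefix
        return True
    if not row["indexing_running"]:  # previous run is no longer active
        return True
    if now - row["last_heartbeat"] > STALE_AFTER:  # leader presumed crashed
        return True
    return row["indexed_by"] == pod_name  # this pod is restarting mid-index
```

A pod for which the predicate is False corresponds to step 3: it skips the scan and serves from the shared tables.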

The dpd_indexing_state table is not prefixed by volume — it is a single shared table in the DPD schema that covers all volumes.

The current coordination state is visible in the /status endpoint under indexing.coordination_state.

Note

dpd_indexing_state is created with CREATE TABLE IF NOT EXISTS, so it is safe to start multiple pods simultaneously. All other DPD tables are also created with IF NOT EXISTS — metadata is preserved across pod restarts and updated via upserts rather than a destructive DROP + CREATE.

DLM Integration Configuration

The API can optionally read data items from a shared SKA Data Lifecycle Manager PostgreSQL database. This requires:

  • The same PostgreSQL instance to be accessible to both the DPD API and the DLM.

  • The DPD database user to have SELECT on at minimum dlm.data_item. Granting SELECT on dlm.storage and dlm.location as well enables storage/location enrichment.

Environment Variables:

# Enable the DLM integration (default: False)
SKA_DATAPRODUCT_DLM_INTERFACE_ENABLED=True

# PostgreSQL schema that holds the DLM tables (default: dlm)
SKA_DATAPRODUCT_API_POSTGRESQL_DLM_SCHEMA=dlm

# DLM data item table name within the schema above (default: data_item)
SKA_DATAPRODUCT_API_POSTGRESQL_DLM_METADATA_TABLE_NAME=data_item

Access probing at startup:

The API performs two lightweight SELECT 1 probes at startup:

  1. dlm.data_item — if this fails the DLM query path is disabled entirely.

  2. dlm.storage and dlm.location — if this fails, DLM items are still returned but without storage/location enrichment.

Probe results are reported in GET /status under metadata_store_status.dlm_interface_status.
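
The effect of the two probes can be sketched as below. The can_select callable stands in for running SELECT 1 against a table, and the function name and result keys are hypothetical; they are not the field names reported by /status.

```python
from typing import Callable, Dict


def probe_dlm(
    can_select: Callable[[str], bool],
    schema: str = "dlm",
    item_table: str = "data_item",
) -> Dict[str, bool]:
    """Hypothetical helper: can_select(table) stands in for a SELECT 1 probe."""
    # Probe 1: without the data item table the DLM query path is disabled.
    items_ok = can_select(f"{schema}.{item_table}")
    # Probe 2: enrichment needs both auxiliary tables, and only matters
    # when the data items themselves are readable.
    enrichment_ok = items_ok and all(
        can_select(f"{schema}.{name}") for name in ("storage", "location")
    )
    return {"dlm_enabled": items_ok, "enrichment_enabled": enrichment_ok}
```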

Running the application directly

All settings have sensible defaults, so for a basic local run no .env file is required. The only value that is never defaulted is the PostgreSQL password — if you want to use PostgreSQL you must supply it.

What you typically need to set for local development

Create a .env file at the repo root and override only what differs from the defaults shown below:

# Path scanned for data products.
# Default: ./tests/test_files/product  (the built-in test fixtures)
PERSISTENT_STORAGE_PATH=./tests/test_files/product

# PostgreSQL connection — leave HOST empty to run without PostgreSQL
# (the API will use an in-memory store instead).
SKA_DATAPRODUCT_API_POSTGRESQL_HOST=localhost
SKA_DATAPRODUCT_API_POSTGRESQL_USER=postgres

# CORS: the URL/port of the dashboard frontend (used to build the allowed-origins list).
# Only needs changing if your local dashboard runs on a non-standard port.
REACT_APP_SKA_DATAPRODUCT_DASHBOARD_URL=http://localhost
REACT_APP_SKA_DATAPRODUCT_DASHBOARD_PORT=8100

Store the PostgreSQL password in a separate .secrets file (never commit this file):

SKA_DATAPRODUCT_API_POSTGRESQL_PASSWORD=<your postgres password>

Note

Without a PostgreSQL host configured the API starts in in-memory mode: data products are indexed and searchable, but annotations are not persisted across restarts.

To start the application:

make run-dev

This installs the development environment for the project and starts the application. To confirm that it started, open http://localhost:8001/status in your browser; you should see a JSON response with the status of the API.

Running the application via Docker

To run the application inside a docker container on your host machine:

Note

When running the application in a docker container, the <PERSISTENT_STORAGE_PATH> needs to be accessible from within the container. You can mount the test folder into this location as done below:

docker build -t ska-dataproduct-api .
docker run -p 8000:8000 -v <YOUR_PROJECT_DIR>/ska-dataproduct-api/tests:/usr/src/ska_dataproduct_api/tests ska-dataproduct-api

Uvicorn will then be running on http://127.0.0.1:8000.

Column Header Labels and Descriptions

This section documents the pipeline that produces the human-readable column labels and tooltip descriptions served by GET /en/humanreadable, and explains how to extend it.

How it works

Column header text and tooltip descriptions are assembled from four schema sources at runtime by functions in utilities/column_headers.py, merged, and returned as a single JSON response by GET /en/humanreadable.

schema/columns.json           ─╮
ska_sdp_dataproduct_metadata   ├─► column_headers.py ─► GET /en/humanreadable
ska_sdp_config (Pydantic)     ─╯         │
                                         ▼
                                { "execution_block": "Execution Block",
                                  ...
                                  "description": {
                                    "execution_block": "Unique identifier...",
                                    ...
                                  }
                                }

The response has two layers:

  • Top-level keys: a flat mapping of field_name → display label. Consumed by the frontend via tColumns(item.field).

  • "description" sub-object: a nested mapping of field_name → tooltip text. Consumed by the frontend via tColumns(`description.${item.field}`). Fields without a description are absent from this object; the frontend shows no tooltip for them.

Sources of truth

Field group | Source file | Extractor function
DPD-specific fields (execution_block, date_created, …) | schema/columns.json, dpd_columns section | dpd_descriptions(), dpd_column_descriptions()
DLM-specific fields (item_name, item_state, …) | schema/columns.json, dlm_columns section | dlm_descriptions(), dlm_column_descriptions()
Obscore / metadata fields (obscore.*, config.*, context.*) | ska_sdp_dataproduct_metadata metadata.json | metadata_descriptions(), metadata_column_descriptions()
SDP flow fields (sdp_flows.*) | ska_sdp_config Pydantic models (Flow, FlowSource, Dependency) | flows_descriptions(), flows_column_descriptions()

columns.json schema format

Each entry in schema/columns.json is an object with label and description keys:

{
  "dpd_columns": {
    "execution_block": {
      "label": "Execution block",
      "description": "Unique identifier for the execution block that produced this data product."
    },
    "date_created": {
      "label": "Date created",
      "description": "Date and time at which this data product was created by the pipeline."
    }
  },
  "dlm_columns": {
    "item_name": {
      "label": "Item Name",
      "description": "Human-readable name of the data lifecycle management item."
    }
  }
}
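
To show how entries in this format could map onto the two-layer response described earlier, here is a minimal sketch. The functions flatten_columns and label_and_tooltip are hypothetical and do not reproduce the extractor functions in utilities/column_headers.py. The sample is a trimmed copy of the example above, with the description removed from item_name to show the no-tooltip case.

```python
import json
from typing import Any, Dict, Optional, Tuple

# Trimmed sample; item_name deliberately has no description here.
COLUMNS_JSON = """
{
  "dpd_columns": {
    "execution_block": {
      "label": "Execution block",
      "description": "Unique identifier for the execution block that produced this data product."
    }
  },
  "dlm_columns": {
    "item_name": {"label": "Item Name"}
  }
}
"""


def flatten_columns(columns: Dict[str, Dict[str, Dict[str, str]]]) -> Dict[str, Any]:
    """Hypothetical merge: top-level labels plus a nested 'description' object."""
    response: Dict[str, Any] = {}
    descriptions: Dict[str, str] = {}
    for section in columns.values():  # dpd_columns, dlm_columns
        for field, entry in section.items():
            response[field] = entry["label"]
            if entry.get("description"):  # omit fields without a description
                descriptions[field] = entry["description"]
    response["description"] = descriptions
    return response


def label_and_tooltip(response: Dict[str, Any], field: str) -> Tuple[str, Optional[str]]:
    """Look up the label and (optional) tooltip for one field."""
    return response[field], response["description"].get(field)


response = flatten_columns(json.loads(COLUMNS_JSON))
```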

How to add a new column

  1. Backend — add a 3-tuple (field, width, type) to _column_specs in components/mui_datagrid.py. The column will automatically be:

    • visible if field is listed in DEFAULT_COLUMNS (i.e. hide = False), hidden otherwise.

    • sorted into the correct position by _sort_columns.

  2. Schema — add the label and description to the appropriate section of schema/columns.json (for DPD- or DLM-owned fields), or add a title / description to the upstream schema (metadata.json for obscore/config/context fields, or a Pydantic Field(description=...) for SDP flow fields).

  3. Default visibility — add the field name to DEFAULT_COLUMNS in configuration/settings.py if it should be visible by default.

How to add a description for an existing column

  • DPD / DLM fields: edit the description value of the matching entry in schema/columns.json.

  • Obscore / config / context fields: add or update the description key in the property definition inside ska_sdp_dataproduct_metadata’s metadata.json.

  • SDP flow fields: add description="..." to the Pydantic Field() call in the relevant model in ska_sdp_config’s flow.py.

Default Visible Columns

DEFAULT_COLUMNS in configuration/settings.py is the single source of truth for which columns are visible by default and what order they appear in the DataGrid.

Pipeline

DEFAULT_COLUMNS in settings.py
       │
       ├─► col.hide = field not in DEFAULT_COLUMNS   (MuiDataGridConfig.__init__)
       │         computed for every static column in _column_specs
       │         also applied to dynamic columns in update_columns()
       │
       ├─► column order = sorted by DEFAULT_COLUMNS index (_sort_columns)
       │         DEFAULT_COLUMNS fields appear first, in list order
       │         remaining fields follow in stable existing order
       │
       └─► GET /muidatagridconfig  →  col.hide sent to frontend

The frontend reads col.hide from /muidatagridconfig and uses it to build the initial columnVisibilityModel passed to the MUI DataGrid.
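
The hide and ordering rules in the pipeline above can be illustrated with a short sketch. The function configure_columns is a hypothetical stand-in for the behaviour of MuiDataGridConfig and _sort_columns, written from the description in this section rather than from the source.

```python
from typing import Dict, List


def configure_columns(fields: List[str], default_columns: List[str]) -> List[Dict[str, object]]:
    """Hypothetical stand-in for the hide flag and default-first ordering."""
    rank = {name: index for index, name in enumerate(default_columns)}
    # sorted() is stable, so fields outside DEFAULT_COLUMNS keep their
    # existing relative order after the default ones.
    ordered = sorted(fields, key=lambda name: rank.get(name, len(rank)))
    return [{"field": name, "hide": name not in rank} for name in ordered]
```

For example, with fields ["uid", "date_created", "size", "execution_block"] and defaults ["execution_block", "date_created"], the two default fields come first and are visible, followed by uid and size, both hidden.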

Configuring default columns per deployment

Override DEFAULT_COLUMNS without code changes by adding the environment variable to your .env file (local development) or Helm/Kubernetes values (deployment):

SKA_DATAPRODUCT_API_DEFAULT_COLUMNS='["execution_block","date_created","obscore.s_ra"]'

The value must be a valid JSON array of field-name strings. If the variable is absent or contains invalid JSON, the built-in default list in settings.py is used.
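
A minimal sketch of this fallback behaviour, using only the standard library, is shown below. The function load_default_columns and the placeholder BUILT_IN_DEFAULTS list are hypothetical, and falling back when the JSON is valid but not an array of strings is an assumption that goes slightly beyond the rule stated above.

```python
import json
import os
from typing import List, Mapping, Sequence

# Placeholder only; the real built-in list lives in configuration/settings.py.
BUILT_IN_DEFAULTS = ["execution_block", "date_created"]


def load_default_columns(
    env: Mapping[str, str] = os.environ,
    fallback: Sequence[str] = BUILT_IN_DEFAULTS,
) -> List[str]:
    """Hypothetical loader for SKA_DATAPRODUCT_API_DEFAULT_COLUMNS."""
    raw = env.get("SKA_DATAPRODUCT_API_DEFAULT_COLUMNS")
    if raw is None:
        return list(fallback)  # variable absent
    try:
        value = json.loads(raw)
    except json.JSONDecodeError:
        return list(fallback)  # invalid JSON
    if not isinstance(value, list) or not all(isinstance(item, str) for item in value):
        return list(fallback)  # assumption: wrong shape also falls back
    return value
```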

Adding translations for a new language

The /en/humanreadable endpoint serves English labels and descriptions. To add a translation for another language without any backend changes:

  1. Create public/locales/{lng}/humanreadable.json in the dashboard repository with entries for whichever fields you want to override:

    {
      "execution_block": "Bloc d'exécution",
      "description": {
        "execution_block": "Identifiant unique du bloc d'exécution..."
      }
    }
    
  2. Update constructLoadPath in src/i18n.jsx to fall back to the local file for non-English languages.

The i18next library will merge the local file with the English API response, so only the keys present in the local file are overridden; all others fall back to English.
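
The override-with-fallback outcome described above can be illustrated in Python. This sketch only mirrors the end result for the two-layer structure; it is not how i18next is implemented, and merge_translations is a hypothetical name.

```python
from typing import Any, Dict


def merge_translations(english: Dict[str, Any], local: Dict[str, Any]) -> Dict[str, Any]:
    """Return english with keys from local overriding it, recursing into sub-objects."""
    merged = dict(english)  # copy so the English response is left untouched
    for key, value in local.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects such as "description" are merged key by key.
            merged[key] = merge_translations(merged[key], value)
        else:
            merged[key] = value
    return merged
```

With the French file shown in step 1, execution_block and its description would be replaced, while every other label and description would remain in English.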