Common Utils

Common utilities for SKA-Low science-ops data analysis.

Provides helpers to locate station lists and shared data directories, sort observation metadata tables, discover test execution directories for frequency sweeps and solar drift scans, and load HDF5 file metadata with per-file summaries suitable for reporting.

ska_sci_ops_data_analysis.common_utils.get_test_execution_dirs_in_dates(dates: list[str], freq_sweep: bool = False, solar_drift: bool = False, **args: Any) → dict[str, dict[str, dict[str, object]]]

Discover test-execution run directories within a date window.

Supported data types:

Frequency Sweep (LCO-12), under FrequencySweepMultiple/
Solar Drift Scan (LCO-66), under AcquireBeamformed/

Each run directory is expected to be named multiple_<YYYY-MM-DD>_<HHMMSS> and to contain per-station subdirectories. The station list is loaded and used to filter these subdirectories by station name.

Parameters:

dates – Inclusive date window as [start, end] in YYYY-MM-DD format.
freq_sweep – If True, include frequency sweep runs.
solar_drift – If True, include solar drift scan runs.
args – Reserved for future options; ignored.

Returns:

Mapping with optional keys "freq_sweep_dirs" and/or "solar_drift_dirs". Each maps a run directory name to a record with keys: main_path (absolute path), dir_list (list[Path] of station subdirectories), Date (YYYY-MM-DD), and time_start (HH:MM).
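The run-directory naming convention above can be illustrated with a small sketch. The helper name parse_run_dir_name and the regular expression are hypothetical and not part of this package; the sketch only shows one way the Date and time_start fields could be derived from a name of the form multiple_<YYYY-MM-DD>_<HHMMSS>.

```python
import re
from datetime import datetime
from typing import Optional

# Hypothetical pattern for names such as "multiple_2025-03-14_093015".
RUN_DIR_PATTERN = re.compile(r"^multiple_(\d{4}-\d{2}-\d{2})_(\d{6})$")


def parse_run_dir_name(name: str) -> Optional[dict[str, str]]:
    """Return Date and time_start for a run directory name, or None if it does not match."""
    match = RUN_DIR_PATTERN.match(name)
    if match is None:
        return None
    date_part, time_part = match.groups()
    try:
        stamp = datetime.strptime(f"{date_part} {time_part}", "%Y-%m-%d %H%M%S")
    except ValueError:
        # The name has the right shape but is not a real date or time.
        return None
    return {
        "Date": stamp.strftime("%Y-%m-%d"),
        "time_start": stamp.strftime("%H:%M"),
    }
```

Names that do not follow the convention, or that encode an impossible date, yield None in this sketch; whether the real function skips or reports such directories is not stated here.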

ska_sci_ops_data_analysis.common_utils.is_date_in_range(date: datetime.date, date_range: tuple[str, str]) → bool

Check whether the given date falls within date_range.

Parameters:

date – Date to evaluate.
date_range – 2-tuple of ISO-format (YYYY-MM-DD) date strings giving the two ends of the range.

Returns:

True if date is between the two dates in date_range, False otherwise.
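A minimal sketch of such a range check follows. The function name is_date_in_range_sketch is hypothetical, and treating both ends of the range as inclusive is an assumption; the description above does not say how the boundaries are handled.

```python
from datetime import date


def is_date_in_range_sketch(day: date, date_range: tuple[str, str]) -> bool:
    """Return True if day lies between the two ISO-format dates in date_range."""
    start = date.fromisoformat(date_range[0])
    end = date.fromisoformat(date_range[1])
    # Assumption: both boundaries count as inside the range.
    return start <= day <= end
```

For example, is_date_in_range_sketch(date(2025, 3, 14), ("2025-03-01", "2025-03-31")) evaluates to True under these assumptions.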

ska_sci_ops_data_analysis.common_utils.load_stations(allow_comments: bool = True) → list[str]

Load station IDs from <repo_root>/data/station_list.txt.

Reads one station name per line from the project’s data directory. When allow_comments is True, lines beginning with # are ignored.

Expected layout:

<repo_root>/
  ├── src/ska_sci_ops_data_analysis/...
  └── data/station_list.txt

Parameters:

allow_comments – If True, ignore lines beginning with #.

Returns:

Station identifiers in file order.

Raises:

FileNotFoundError – If the file is missing or exists but contains no usable station names.
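The reading rules described for load_stations can be sketched as below. The helper read_station_list and its explicit path argument are hypothetical (the documented function resolves the path itself), and skipping blank lines is an assumption.

```python
from pathlib import Path


def read_station_list(path: Path, allow_comments: bool = True) -> list[str]:
    """Read one station name per line, optionally ignoring lines that start with '#'."""
    if not path.is_file():
        raise FileNotFoundError(f"Station list not found: {path}")
    stations = []
    for line in path.read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if not name:
            continue  # assumption: blank lines carry no station name
        if allow_comments and name.startswith("#"):
            continue  # comment line
        stations.append(name)
    if not stations:
        # Mirrors the documented behaviour: an empty list is reported like a missing file.
        raise FileNotFoundError(f"No usable station names in: {path}")
    return stations
```

With allow_comments set to False, a line such as "# header" would be returned as if it were a station name, which is why the default is usually the safer choice.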

ska_sci_ops_data_analysis.common_utils.obs_hdf5_info_loader(input_directory: str | Path, compute_missing_channels: bool = False, compute_gaussian: bool = False, channels: int | Sequence[int] | None = None, plot_channels: list[int] | None = None, median_windows: Any | None = None) → DataFrame

Load observation HDF5 files and build a table with one row per file plus a summary row.

For each .hdf5 file found in input_directory, the function reads start and end timestamps, derives a per-file “useful time” from the sample timestamp array, and can optionally compute:

Missing channel ranges using power arrays (polarization_0**2 + polarization_1**2).
Mean Gaussian fit quality (R²) per file via an external checker (gaussian_check_station), when enabled.

The final row named "TOTAL" aggregates: earliest start, latest end, total useful time, optionally the union of missing channels, the mean Gaussian R², the number of files, and a boolean Data flag.

Parameters:

input_directory – Directory containing HDF5 files to scan.
compute_missing_channels – If True, compute missing channels per file and on the summary row (requires channels).
compute_gaussian – If True, compute the mean Gaussian R² per file using an external function available on the import path.
channels – Number of channels used when computing missing channels (e.g., 384). Required when compute_missing_channels is True.
plot_channels – Optional channel indices forwarded to the Gaussian checker.
median_windows – Optional median filter configuration.

Returns:

One row per file plus a final "TOTAL" row with the aggregate metrics. Columns include: file_name, ts_start_unix, ts_end_unix, ts_start_AWST, ts_end_AWST, ts_array, useful_time_s, useful_time_str, Missing channels, Gaussian R2, Number of files (TOTAL row), and Data (TOTAL row).

Raises:

ValueError – If compute_missing_channels is True but channels is None.
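As an illustration of the missing-channel step, the sketch below computes per-channel power as polarization_0**2 + polarization_1**2 and collapses the affected channel indices into ranges. Both helper names are hypothetical, and treating a channel as missing when its power is exactly zero is an assumption; the criterion the package actually applies is not documented here.

```python
from typing import Sequence


def channel_power(pol0: Sequence[float], pol1: Sequence[float]) -> list[float]:
    """Per-channel power as polarization_0**2 + polarization_1**2."""
    return [x * x + y * y for x, y in zip(pol0, pol1)]


def missing_channel_ranges(power: Sequence[float]) -> list[tuple[int, int]]:
    """Collapse consecutive zero-power channel indices into inclusive (first, last) ranges."""
    ranges = []
    start = None
    for index, value in enumerate(power):
        if value == 0:  # assumption: zero power marks a missing channel
            if start is None:
                start = index
        elif start is not None:
            ranges.append((start, index - 1))
            start = None
    if start is not None:
        # The last run of missing channels extends to the final index.
        ranges.append((start, len(power) - 1))
    return ranges
```

In practice the power arrays would come from the HDF5 files and span the number of channels passed as channels (e.g., 384); plain Python sequences are used here only to keep the sketch self-contained.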

ska_sci_ops_data_analysis.common_utils.sort_df_by_date_time_station(df: DataFrame, date_col: str = 'Date', time_col: str = 'time_start', station_col: str = 'Station') → DataFrame

Sort observations by date, time, and alphanumeric station ID.

Station IDs like s10-3 are split into components (spiral arm letter, cluster integer, and in-cluster integer) to achieve a natural alphanumeric order. Original string formats of date_col and time_col are preserved.

Parameters:

df – Input table containing at least the date, time, and station columns.
date_col – Name of the date column (string-formatted dates).
time_col – Name of the time column (string-formatted time/slot).
station_col – Name of the station identifier column (e.g. "s10-3").

Returns:

A new DataFrame sorted by date, time, and station order. Temporary sort-key columns are removed.
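The natural ordering of station IDs can be sketched with a plain sort key. The names station_sort_key and STATION_PATTERN are hypothetical, the pattern assumes IDs of the form letter(s), cluster number, hyphen, in-cluster number (as in s10-3), and placing unparseable IDs last is an assumption rather than documented behaviour.

```python
import re

# Hypothetical pattern: arm letter(s), cluster number, "-", in-cluster number.
STATION_PATTERN = re.compile(r"^([A-Za-z]+)(\d+)-(\d+)$")


def station_sort_key(station: str) -> tuple[int, str, int, int]:
    """Split an ID such as 's10-3' into ('s', 10, 3) so numbers compare numerically."""
    match = STATION_PATTERN.match(station.strip())
    if match is None:
        # Assumption: IDs that do not fit the pattern sort after all valid ones.
        return (1, station, 0, 0)
    arm, cluster, index = match.groups()
    return (0, arm, int(cluster), int(index))


def sort_rows(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Sort records by date string, time string, then natural station order."""
    # ISO dates (YYYY-MM-DD) and HH:MM times already sort correctly as strings.
    return sorted(
        rows,
        key=lambda r: (r["Date"], r["time_start"], station_sort_key(r["Station"])),
    )
```

A purely lexicographic sort would place s10-3 before s9-1; comparing the cluster and in-cluster parts as integers avoids that.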