SKA SDP Benchmark Monitor – InfluxDB + Grafana Integration

Overview

Benchmon ships with a lightweight monitoring stack based on InfluxDB 3 and Grafana 12. Helper scripts in exec/ install, configure, and launch both services, deploy dashboards, and keep runtime metadata so that benchmon can stream metrics in real time.

benchmon-run ──▶ InfluxDB 3 ──▶ Grafana ──▶ Dashboards
      │                │              │
      └── CSV traces   └── Time-series └── Visualization

Components

benchmon-install-grafana – Downloads Grafana 12.1.1 and InfluxDB3 3.4.2, prepares data/log directories, enables anonymous Grafana access, and copies packaged dashboards.
benchmon-start-grafana – Launches influxdb3 serve and grafana-server, waits for readiness, configures the datasource, and deploys dashboards.
benchmon-stop-grafana – Shuts down both services using stored PID files.
benchmon-run – Collects metrics, streams them to InfluxDB (optional), and writes CSV artifacts.

Requirements

Python 3.9 or newer.
HTTPS access to dl.grafana.com and dl.influxdata.com during installation.
Disk space for the installation directory and trace outputs.

Install the Monitoring Stack

Run the installer from inside the cloned repository so that it can locate the packaged dashboards via a relative path. A safe sequence is:

$ git clone <repo-url>
$ cd ska-sdp-benchmark-monitor
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install .
$ benchmon-install-grafana            # run from repository root

Installation directory selection (what the script actually does):

No argument provided → defaults to ${HOME}/benchmon-stack.
Provide a path argument without a flag → uses that path, e.g. benchmon-install-grafana /opt/benchmon-stack.
Provide --install-dir <path> → uses the supplied path, e.g. benchmon-install-grafana --install-dir ~/bm-stack.
Any other --prefixed option is rejected with an error.
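The selection rules above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not the installer's actual code; the function name resolve_install_dir and its home parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional, Sequence


def resolve_install_dir(argv: Sequence[str], home: Optional[Path] = None) -> Path:
    """Pick the install directory following the four documented rules.

    Hypothetical helper: the real installer may parse its arguments differently.
    """
    home = home or Path.home()
    args = list(argv)
    if not args:
        # No argument: use the documented default under the home directory.
        return home / "benchmon-stack"
    if args[0] == "--install-dir":
        if len(args) < 2:
            raise ValueError("--install-dir requires a path")
        return Path(args[1]).expanduser()
    if args[0].startswith("--"):
        # Any other --prefixed option is rejected with an error.
        raise ValueError(f"unknown option: {args[0]}")
    # A bare positional argument is taken as the install path.
    return Path(args[0]).expanduser()
```

For example, resolve_install_dir(["--install-dir", "/custom/path"]) and resolve_install_dir(["/custom/path"]) both yield the same directory, which mirrors the two equivalent invocations listed above.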

Resulting layout when no --install-dir is given (default ${HOME}/benchmon-stack):

${HOME}/benchmon-stack/grafana – Grafana binaries, dashboards copied from grafana/dashboards, conf/custom.ini, plus created data/ and logs/ folders.
${HOME}/benchmon-stack/influxdb3 – InfluxDB3 binaries (ports configured at runtime by benchmon-start-grafana).

Resulting layout when --install-dir /custom/path (or positional /custom/path) is given:

/custom/path/grafana – Same content as above, but rooted in your chosen directory.
/custom/path/influxdb3 – Same content as above.

Optional environment helpers (set if you pick a custom install dir and want to avoid retyping paths):

export BENCHMON_GRAFANA_PATH=/custom/path/grafana
export BENCHMON_INFLUXDB_PATH=/custom/path/influxdb3

Persist them in your shell profile if desired.
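To make the role of these variables concrete, here is a minimal sketch of how a launcher could resolve the two directories: use the environment variable when it is set, otherwise fall back to the default layout. The fallback order and the function name resolve_stack_paths are assumptions; the actual lookup in benchmon-start-grafana is not shown in this document.

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple


def resolve_stack_paths(env: Optional[Mapping[str, str]] = None) -> Tuple[Path, Path]:
    """Return (grafana_dir, influxdb_dir).

    Assumed behaviour: BENCHMON_* variables win, else ${HOME}/benchmon-stack.
    """
    env = os.environ if env is None else env
    default_root = Path(env.get("HOME", str(Path.home()))) / "benchmon-stack"
    grafana = Path(env.get("BENCHMON_GRAFANA_PATH", str(default_root / "grafana")))
    influxdb = Path(env.get("BENCHMON_INFLUXDB_PATH", str(default_root / "influxdb3")))
    return grafana, influxdb
```

Passing an explicit mapping instead of os.environ makes the lookup easy to check without touching the real shell environment.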

Start Grafana and InfluxDB

Note: If you installed the monitoring stack to a custom directory (using --install-dir), you need to define the BENCHMON_GRAFANA_PATH and BENCHMON_INFLUXDB_PATH environment variables pointing to the respective installation directories. You may also need to explicitly provide the --dashboard-dir argument if the automatic detection fails.

$ benchmon-start-grafana \
  --save-dir /tmp/benchmon-demo \
  --influxdb-port 8181 \
  --influxdb-query-file-limit 2000 \
  --grafana-port 3000

If a requested port is busy, the script automatically increments it until a free port is found and reports the chosen values.
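The auto-increment behaviour can be illustrated with a small probe loop. This is a sketch under the assumption that "busy" means a TCP bind on the port fails; the function name find_free_port and the max_tries limit are hypothetical, and the script may detect busy ports differently.

```python
import socket


def find_free_port(preferred: int, host: str = "127.0.0.1", max_tries: int = 100) -> int:
    """Return the first port >= preferred that can be bound on host."""
    last = min(preferred + max_tries, 65536)  # never probe past the valid port range
    for port in range(preferred, last):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # port is in use (or not permitted), try the next one
            return port
    raise RuntimeError(f"no free port found starting at {preferred}")
```

Note that a port found this way can still be taken by another process between the probe and the actual service start, so the reported values are what matter, not the requested ones.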

Use --influxdb-query-file-limit when you want to allow broader queries against larger databases. The value is a count of parquet files that a single InfluxDB query may scan. Larger values make it possible to run full-database benchmon-visu --influxdb queries without adding an explicit time window when one database corresponds to one experiment. A value of 1000 can still be too small for large measurements; 2000 is a reasonable next step when the backend reports Query would exceed file limit ... parquet files. This setting is applied only when the stack starts, so changing it requires restarting benchmon-start-grafana. It affects query/read limits only and does not change importer write batching.

The script:

Creates /tmp/benchmon-demo/grafana-data/… (logs, PIDs, connection info).
Starts InfluxDB with influxdb3 serve … and Grafana with grafana-server ….
Waits for Grafana readiness and configures the datasource targeting http://<hostname>:8181.
Deploys packaged dashboards.
Records metadata in pids.json and connection.json.
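The "wait for readiness" step can be pictured as a polling loop with a deadline. The sketch below assumes readiness is checked with an HTTP GET; Grafana does expose a health endpoint at /api/health, but whether benchmon-start-grafana uses it is an assumption, and the names http_ok and wait_until_ready are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if a GET on url answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, refused, or error status


def wait_until_ready(probe: Callable[[], bool],
                     timeout: float = 30.0,
                     interval: float = 0.5) -> bool:
    """Call probe repeatedly until it returns True or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A possible call is wait_until_ready(lambda: http_ok("http://localhost:3000/api/health")), with the port replaced by whatever value the start script reported.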

Access Grafana

Open the printed URL, for example:

http://<hostname>:3000

Anonymous access is enabled. The administrator credentials are admin / admin123.

Run Benchmon with Grafana Integration

# Recommended capture (CSV + Grafana)
benchmon-run --system --csv --grafana \
             --save-dir /tmp/benchmon-demo/run-001

# Grafana-only streaming
benchmon-run --system --grafana --no-csv

# Tuning the streaming pipeline
benchmon-run --system --grafana \
             --grafana-batch-size 10000 \
             --grafana-sample-interval 0.1

Metrics appear in InfluxDB immediately and dashboards refresh in real time.
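One way to picture what --grafana-batch-size controls is a buffer that is flushed once it holds that many points. The class below is a generic illustration, not benchmon's sender; PointBatcher and the send callback are hypothetical names, and the point strings are placeholders.

```python
from typing import Callable, List


class PointBatcher:
    """Collect points and hand them to send() in groups of batch_size."""

    def __init__(self, batch_size: int, send: Callable[[List[str]], None]) -> None:
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self.batch_size = batch_size
        self._send = send
        self._pending: List[str] = []

    def add(self, point: str) -> None:
        """Queue one point and flush automatically when the batch is full."""
        self._pending.append(point)
        if len(self._pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is queued, even if the batch is not full."""
        if self._pending:
            batch, self._pending = self._pending, []
            self._send(batch)
```

The trade-off this exposes is general: a larger batch means fewer write requests, but points wait longer in the buffer before they become visible, so a final flush() at the end of a run is needed to avoid losing the tail.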

End-to-End Example: Monitoring a Compute Workload

This walkthrough demonstrates the full lifecycle of monitoring a computationally intensive task.

1. Prepare the Environment

First, ensure the monitoring stack is running.

# Start InfluxDB and Grafana in the background
benchmon-start-grafana --save-dir /tmp/benchmon-demo --influxdb-query-file-limit 2000

# or using default directory
benchmon-start-grafana --influxdb-query-file-limit 2000

Use a higher --influxdb-query-file-limit when you plan to generate figures from the whole database without --start-time and --end-time. If you change the value, restart the stack so InfluxDB is relaunched with the new scan budget.

2. Generate Load & Monitor

Use benchmon-run to start monitoring. This example simulates a CPU-intensive task with stress-ng (or a simple shell command if stress-ng isn’t available) and enables both system monitoring (--system) and Grafana streaming (--grafana).

# Example: Monitor a 60-second CPU load
benchmon-run --system --grafana \
             --save-dir /tmp/benchmon-demo/run-stress-test

Open a new terminal and run stress-ng:

stress-ng --cpu 4 --timeout 60s
# Note: If you don't have stress-ng, any command works:
# sleep 60

3. Visualise in Real-time

While the command above is running:

You can open the browser directly on the node where benchmon-run is running. If you instead reach that node via ssh and open the browser on your own machine, you need an additional ssh -L connection to forward the Grafana port.

Open your browser to http://<hostname>:3000.
Navigate to Dashboards > System Monitoring.
You will see the CPU Usage graph spike corresponding to the load generated in step 2. High-frequency metrics like CPU frequency will also reflect the processor’s boost behavior.

4. Cleanup

Once the run is complete, stop the background services to free up resources.

benchmon-stop-grafana --save-dir /tmp/benchmon-demo

Stop the Stack

# Graceful stop
$ benchmon-stop-grafana --save-dir /tmp/benchmon-demo

# Manual fallback
$ pkill -F /tmp/benchmon-demo/grafana-data/pids.json

Logs remain under /tmp/benchmon-demo/grafana-data/logs/.
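If neither command above works on your system, the recorded PIDs can also be read and signalled by hand. The sketch below assumes pids.json is a flat JSON object mapping a service name to a PID; that layout, the key names, and the helper names read_pids and terminate_all are all assumptions, so inspect the file first.

```python
import json
import os
import signal
from pathlib import Path
from typing import Dict


def read_pids(pids_file: Path) -> Dict[str, int]:
    """Load a flat {"service": pid} mapping (assumed layout of pids.json)."""
    data = json.loads(pids_file.read_text(encoding="utf-8"))
    return {name: int(pid) for name, pid in data.items()}


def terminate_all(pids: Dict[str, int]) -> None:
    """Send SIGTERM to every recorded PID, ignoring processes already gone."""
    for name, pid in pids.items():
        try:
            os.kill(pid, signal.SIGTERM)
            print(f"sent SIGTERM to {name} (pid {pid})")
        except ProcessLookupError:
            print(f"{name} (pid {pid}) is not running")
```

A possible call is terminate_all(read_pids(Path("/tmp/benchmon-demo/grafana-data/pids.json"))). Check that the PIDs still belong to Grafana and InfluxDB before signalling them, since PIDs can be reused after a reboot.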

Run InfluxDB and Grafana Remotely (AWS)

Log in to the head node and request an interactive compute node:

srun -N 1 -n 1 -c 16 -p c7i-metal-24xl-noht-ond --pty bash

# Record the IP address of the node
ifconfig

# Activate the virtual environment
source <path/to/venv>/bin/activate

cd ska-sdp-benchmark-monitor/
pip uninstall ska-sdp-benchmark-monitor
pip install .

# Install Grafana and InfluxDB 3 with default values
benchmon-install-grafana

# Start InfluxDB and Grafana
benchmon-start-grafana --influxdb-query-file-limit 1000

# Run Benchmark Monitoring (C++ version)
rt-monitor --sampling-frequency 5 --batch-size 10000 --cpu --grafana "http://localhost:8181?db=metrics" --log-level debug


# On your local computer (for example, a laptop), forward the Grafana port over SSH
# Note: 10.192.34.110 is an example IP address retrieved with the `ifconfig` command
ssh -L 3000:10.192.34.110:3000 dp-hpc-headnode -N

Open a browser and visit http://localhost:3000

Command Line Options

Command	Key flags	Description
`benchmon-install-grafana`	`--install-dir <path>`	Target installation directory (default `~/benchmon-stack`).
`benchmon-start-grafana`	`--save-dir <path>`	Runtime directory for logs/PIDs.
	`--influxdb-port <port>`	Preferred InfluxDB port (auto-increments if occupied).
	`--influxdb-query-file-limit <int>`	Optional parquet-file scan limit per query, applied at stack startup.
	`--grafana-port <port>`	Preferred Grafana port (auto-increments if occupied).
	`--dashboard-dir <path>`	Dashboard source override.
`benchmon-run`	`--save-dir <path>`	Output directory for run artifacts.
	`--system` / `--csv`	Enable system monitoring and CSV dumps.
	`--grafana`	Stream metrics to the Grafana/InfluxDB stack.
	`--no-csv`	Disable CSV generation.
	`--grafana-url <url>`	Custom Grafana/InfluxDB endpoint (default `http://localhost:3000`).
	`--grafana-token <token>`	Authentication token (blank by default).
	`--grafana-job-name <name>`	Logical job name attached to metrics (default `benchmon`).
	`--grafana-batch-size <int>`	Batch size for uploads (default `50`).
	`--grafana-sample-interval <seconds>`	Sampling interval for uploads (default `1.0`).
`benchmon-stop-grafana`	`--save-dir <path>`	Directory containing `pids.json` (must match start).

Dashboards

Packaged dashboards live in <install-dir>/grafana/dashboards. benchmon-start-grafana deploys every JSON file in that directory automatically, providing CPU, memory, network, disk, and InfiniBand (when available) views with real-time refresh.
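Because every JSON file in the directory is deployed, a quick pre-flight check can help when dashboards appear to be missing. The following sketch lists the files that parse as JSON; it does not validate Grafana's dashboard schema, and the function name find_dashboards is hypothetical.

```python
import json
from pathlib import Path
from typing import List


def find_dashboards(dashboard_dir: Path) -> List[Path]:
    """Return the *.json files in dashboard_dir that contain valid JSON."""
    valid: List[Path] = []
    for path in sorted(dashboard_dir.glob("*.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError):
            continue  # skip files that are not parseable JSON
        valid.append(path)
    return valid
```

An empty result for <install-dir>/grafana/dashboards suggests the installer did not copy the packaged dashboards, which matches the "reinstall if needed" advice in the troubleshooting table.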

Create Standard PNG/SVG Benchmon Plots for Offline Access

Benchmon can run in offline mode (without --grafana) and record metrics to local CSV files. You can later import these CSV traces into InfluxDB and generate the standard benchmon PNG/SVG plots for offline access, archiving, and report sharing.

The importer utility is located at benchmon/run/csv_importer.py.

Import Offline CSV Traces into InfluxDB

Basic Usage

If your trace directory includes grafana-data/connection.json, run:

python3 benchmon/run/csv_importer.py --dir /path/to/trace_folder/benchmon_traces_hostname

Manual Connection Settings

If connection.json is missing, or you want another InfluxDB target:

python3 benchmon/run/csv_importer.py \
  --dir /path/to/traces \
  --grafana-influxdb-url "http://localhost:8181" \
  --workers 8

CSV Importer Options

Argument	Description	Default
`--dir`	(Required) Path to folder containing CSV files such as `cpu_report.csv`.	-
`--grafana-influxdb-url`	InfluxDB 3 URL. Overrides `connection.json`.	http://localhost:8181
`--grafana-token`	InfluxDB token. Overrides `connection.json`.	(Empty)
`--database`	Target bucket/database name.	`metrics`
`--org`	InfluxDB organization (optional).	(Empty)
`--batch-size`	Number of points per write request.	5000
`--workers`	Number of concurrent write threads.	4
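The interaction between --batch-size and --workers can be sketched as follows: points are cut into batches, and a thread pool issues one write request per batch. This is an illustration of the general pattern, not the code in csv_importer.py; chunked, write_all, and the write_batch callback are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterator, List, Sequence


def chunked(points: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size points."""
    for start in range(0, len(points), batch_size):
        yield list(points[start:start + batch_size])


def write_all(points: Sequence[str],
              write_batch: Callable[[List[str]], None],
              batch_size: int = 5000,
              workers: int = 4) -> int:
    """Send all points through write_batch using a thread pool.

    Returns the number of batches written. An exception raised by any
    write_batch call propagates to the caller.
    """
    batches = list(chunked(points, batch_size))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() forces completion and re-raises the first worker error.
        list(pool.map(write_batch, batches))
    return len(batches)
```

With the defaults from the table, 12,001 points would be sent as three requests (5000, 5000, and 2001 points) spread over up to four threads. Raising --workers only helps while the server can absorb the concurrent writes.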

Generate Benchmon Figures from Imported Data

After the CSV import, generate benchmon-style figures directly from InfluxDB.

In the standard CSV-based benchmon-visu usage, the positional argument is the trace directory. InfluxDB mode needs extra arguments so that benchmon queries the database instead of reading local files: --influxdb and --influxdb-url are required, and --influxdb-database is also required when your database is not the default metrics bucket. If --start-time and --end-time are omitted, benchmon queries all data in the selected database, which is a good fit when one database represents one experiment and the InfluxDB scan budget is large enough.

benchmon-visu ./benchmon_influx_figures \
  --influxdb \
  --influxdb-url http://localhost:8181 \
  --influxdb-database metrics \
  --start-time 2026-02-04T21:48:20 \
  --end-time 2026-02-04T22:03:20 \
  --sys \
  --recursive \
  --fig-fmt png \
  --fig-name benchmon_influx_overview

Notes:

The positional argument is the output directory for generated figures and logs.
If --influxdb-hostname is omitted, benchmon discovers all hostnames in the selected time range and generates one figure set per host.
--recursive also creates multi-node_sync.<fmt> for a synchronized multi-node view.
--resolution auto chooses a coarser time bucket for long windows. You can force a fixed resolution such as --resolution 1m.
--start-time and --end-time use local wall-clock time in the format YYYY-MM-DDTHH:MM:SS.
If some requested measurements are not present in the selected database, benchmon renders the plots backed by the available tables and skips the missing plot types.
If a full-database query hits the InfluxDB backend file-limit, do not assume the measurement is missing. Restart the stack with a larger --influxdb-query-file-limit such as 2000, or rerun with both --start-time and --end-time to narrow the request.
InfluxDB visualization currently supports system plots only: --cpu, --cpu-all, --cpu-freq, --mem, --net, --disk, --ib, and --sys.
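When narrowing the request is preferable to raising the file limit, a long experiment can be cut into consecutive windows, each passed as a --start-time / --end-time pair. The helper below is a sketch of that idea using the timestamp format from the notes above; split_window is a hypothetical name and the window length is for you to choose.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

TIME_FORMAT = "%Y-%m-%dT%H:%M:%S"  # YYYY-MM-DDTHH:MM:SS, local wall-clock time


def split_window(start: str, end: str, step: timedelta) -> List[Tuple[str, str]]:
    """Split [start, end] into consecutive windows of at most step each."""
    begin = datetime.strptime(start, TIME_FORMAT)
    finish = datetime.strptime(end, TIME_FORMAT)
    if finish <= begin:
        raise ValueError("end must be later than start")
    if step <= timedelta(0):
        raise ValueError("step must be positive")
    windows: List[Tuple[str, str]] = []
    cursor = begin
    while cursor < finish:
        upper = min(cursor + step, finish)  # last window may be shorter
        windows.append((cursor.strftime(TIME_FORMAT), upper.strftime(TIME_FORMAT)))
        cursor = upper
    return windows
```

Splitting the 15-minute range from the example command with step=timedelta(minutes=5) gives three windows, each of which can be rendered with its own benchmon-visu call and output directory.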

Troubleshooting

Symptom	Resolution
Installer cannot download archives	Ensure internet access to Grafana/InfluxData and rerun.
`benchmon-start-grafana` cannot find binaries	Re-run the installer or export `BENCHMON_*` paths correctly.
Grafana dashboards missing	Verify `<install-dir>/grafana/dashboards`; reinstall if needed.
InfluxDB rejects writes	Confirm `admin_token.json` exists and port `8181` is free.
`benchmon-visu --influxdb` says the request is too large or exceeds the file limit	Restart the stack with a larger `--influxdb-query-file-limit` such as `2000`, or rerun with both `--start-time` and `--end-time`. Changing the flag requires a stack restart because it is passed to `influxdb3 serve` at startup.
Grafana unreachable	Check port usage and inspect logs under `<save-dir>/grafana-data/logs/`.

References

Dashboards: <install-dir>/grafana/dashboards
Runtime helpers: exec/benchmon-install-grafana, exec/benchmon-start-grafana, exec/benchmon-stop-grafana
Metric streaming: benchmon/run/influxdb_sender.py
Official docs: InfluxDB 3, Grafana