SKA SDP Benchmark Monitor – InfluxDB + Grafana Integration
Overview
Benchmon ships with a lightweight monitoring stack based on InfluxDB 3 and Grafana 12. Helper scripts in exec/ install, configure, and launch both services, deploy dashboards, and keep runtime metadata so that benchmon can stream metrics in real time.
benchmon-run ──▶ InfluxDB 3 ──▶ Grafana ──▶ Dashboards
│ │ │
└── CSV traces └── Time-series └── Visualization
Components
benchmon-install-grafana– Downloads Grafana12.1.1and InfluxDB33.4.2, prepares data/log directories, enables anonymous Grafana access, and copies packaged dashboards.benchmon-start-grafana– Launchesinfluxdb3 serveandgrafana-server, waits for readiness, configures the datasource, and deploys dashboards.benchmon-stop-grafana– Shuts down both services using stored PID files.benchmon-run– Collects metrics, streams them to InfluxDB (optional), and writes CSV artifacts.
Requirements
Python 3.9 or newer.
HTTPS access to
dl.grafana.comanddl.influxdata.comduring installation.Disk space for the installation directory and trace outputs.
Install the Monitoring Stack
After cloning the repository, you must run the installer from inside the cloned repository so it can locate the packaged dashboards via a relative path. A safe sequence is:
$ git clone <repo-url>
$ cd ska-sdp-benchmark-monitor
$ python -m venv .venv
$ source .venv/bin/activate
$ pip install .
$ benchmon-install-grafana # run from repository root
Installation directory selection (what the script actually does):
No argument provided → defaults to
${HOME}/benchmon-stack.Provide a path argument without a flag → uses that path, e.g.
benchmon-install-grafana /opt/benchmon-stack.Provide
--install-dir <path>→ uses the supplied path, e.g.benchmon-install-grafana --install-dir ~/bm-stack.Any other
--prefixed option is rejected with an error.
Resulting layout when no --install-dir is given (default ${HOME}/benchmon-stack):
${HOME}/benchmon-stack/grafana– Grafana binaries, dashboards copied fromgrafana/dashboards,conf/custom.ini, plus createddata/andlogs/folders.${HOME}/benchmon-stack/influxdb3– InfluxDB3 binaries (ports configured at runtime bybenchmon-start-grafana).
Resulting layout when --install-dir /custom/path (or positional /custom/path) is given:
/custom/path/grafana– Same content as above, but rooted in your chosen directory./custom/path/influxdb3– Same content as above.
Optional environment helpers (set if you pick a custom install dir and want to avoid retyping paths):
export BENCHMON_GRAFANA_PATH=/custom/path/grafana
export BENCHMON_INFLUXDB_PATH=/custom/path/influxdb3
Persist them in your shell profile if desired.
Start Grafana and InfluxDB
Note: If you installed the monitoring stack to a custom directory (using
--install-dir), you need to define theBENCHMON_GRAFANA_PATHandBENCHMON_INFLUXDB_PATHenvironment variables pointing to the respective installation directories. You may also need to explicitly provide the--dashboard-dirargument if the automatic detection fails.
$ benchmon-start-grafana \
--save-dir /tmp/benchmon-demo \
--influxdb-port 8181 \
--influxdb-query-file-limit 2000 \
--grafana-port 3000
If a requested port is busy, the script automatically increments it until a free port is found and reports the chosen values.
Use --influxdb-query-file-limit when you want to allow broader queries against larger databases. The value is a count of parquet files that a single InfluxDB query may scan. Larger values make it possible to run full-database benchmon-visu --influxdb queries without adding an explicit time window when one database corresponds to one experiment. A value of 1000 can still be too small for large measurements; 2000 is a reasonable next step when the backend reports Query would exceed file limit ... parquet files. This setting is applied only when the stack starts, so changing it requires restarting benchmon-start-grafana. It affects query/read limits only and does not change importer write batching.
The script:
Creates
/tmp/benchmon-demo/grafana-data/…(logs, PIDs, connection info).Starts InfluxDB with
influxdb3 serve …and Grafana withgrafana-server ….Waits for Grafana readiness and configures the datasource targeting
http://<hostname>:8181.Deploys packaged dashboards.
Records metadata in
pids.jsonandconnection.json.
Access Grafana
Open the printed URL, for example:
http://<hostname>:3000
Anonymous access is enabled. Administrator credentials use admin / admin123.
Run Benchmon with Grafana Integration
# Recommended capture (CSV + Grafana)
benchmon-run --system --csv --grafana \
--save-dir /tmp/benchmon-demo/run-001
# Grafana-only streaming
benchmon-run --system --grafana --no-csv
# Tuning the streaming pipeline
benchmon-run --system --grafana \
--grafana-batch-size 10000 \
--grafana-sample-interval 0.1
Metrics appear in InfluxDB immediately and dashboards refresh in real time.
End-to-End Example: Monitoring a Compute Workload
This walkthrough demonstrates the full lifecycle of monitoring a computationally intensive task.
1. Prepare the Environment
First, ensure the monitoring stack is running.
# Start InfluxDB and Grafana in the background
benchmon-start-grafana --save-dir /tmp/benchmon-demo --influxdb-query-file-limit 2000
# or using default directory
benchmon-start-grafana --influxdb-query-file-limit 2000
Use a higher --influxdb-query-file-limit when you plan to generate figures from the whole database without --start-time and --end-time. If you change the value, restart the stack so InfluxDB is relaunched with the new scan budget.
2. Generate Load & Monitor
We will use benchmon-run to start benchmarking monitoring. In this example, we’ll simulate a CPU-intensive task using stress-ng (or a simple shell loop if stress-ng isn’t available). We enabled both system monitoring (--system) and Grafana streaming (--grafana).
# Example: Monitor a 60-second CPU load
benchmon-run --system --grafana \
--save-dir /tmp/benchmon-demo/run-stress-test
Open a new terminal, run stree-ng
stress-ng --cpu 4 --timeout 60s
# Note: If you don't have stress-ng, any command works:
# sleep 60
3. Visualise in Real-time
While the command above is running:
The user opens the browser on the node where benchmon-run is running the user is accessing a server via ssh and then opening his browser on his own machine and an additional connection is need via ssh -L
Open your browser to
http://<hostname>:3000.Navigate to Dashboards > System Monitoring.
You will see the CPU Usage graph spike corresponding to the load generated in step 2. High-frequency metrics like CPU frequency will also reflect the processor’s boost behavior.
4. Cleanup
Once the run is complete, stop the background services to free up resources.
benchmon-stop-grafana --save-dir /tmp/benchmon-demo
Stop the Stack
# Graceful stop
$ benchmon-stop-grafana --save-dir /tmp/benchmon-demo
# Manual fallback
$ pkill -F /tmp/benchmon-demo/grafana-data/pids.json
Logs remain under /tmp/benchmon-demo/grafana-data/logs/.
Run Influxdb and Grafana Remotely (AWS)
Login into HEADNODE
srun -N 1 -n 1 -c 16 -p c7i-metal-24xl-noht-ond --pty bash
# Please record the IP address of the node
ifconfig
# Change to venv
source <path/to/venv/bin/activate>
cd ska-sdp-benchmark-monitor/
pip uninstall ska-sdp-benchmark-monitor
pip install .
# Install Grafana and InfluxBD3 with deault value
benchmon-install-grafana
# Start InfluxDB and Grafana
benchmon-start-grafana --influxdb-query-file-limit 1000
# Run Benchmark Monitoring (C++ version)
rt-monitor --sampling-frequency 5 --batch-size 10000 --cpu --grafana http://localhost:8181?db=metrics --log-level debug
# On the local notepad computer), forward SSH port
# Note: 10.192.34.110 is an example IP addrss retrived by `ifconfig` cmd
ssh -L 3000:10.192.34.110:3000 dp-hpc-headnode -N
Open browser and visit: http://localhost:3000
Command Line Options
| Command | Key flags | Description |
|---|---|---|
benchmon-install-grafana |
--install-dir <path> |
Target installation directory (default ~/benchmon-stack). |
benchmon-start-grafana |
--save-dir <path>--influxdb-port <port>--influxdb-query-file-limit <int>--grafana-port <port>--dashboard-dir <path> |
Runtime directory for logs/PIDs, preferred InfluxDB port (auto-increments if occupied), optional parquet-file scan limit per query applied at stack startup, preferred Grafana port (auto-increments if occupied), dashboard source override. |
benchmon-run |
--save-dir <path> |
Output directory for run artifacts. |
--system / --csv |
Enable system monitoring and CSV dumps. | |
--grafana |
Stream metrics to the Grafana/InfluxDB stack. | |
--no-csv |
Disable CSV generation. | |
--grafana-url <url> |
Custom Grafana/InfluxDB endpoint (default http://localhost:3000). |
|
--grafana-token <token> |
Authentication token (blank by default). | |
--grafana-job-name <name> |
Logical job name attached to metrics (default benchmon). |
|
--grafana-batch-size <int> |
Batch size for uploads (default 50). |
|
--grafana-sample-interval <seconds> |
Sampling interval for uploads (default 1.0). |
|
benchmon-stop-grafana |
--save-dir <path> |
Directory containing pids.json (must match start). |
Dashboards
Packaged dashboards live in <install-dir>/grafana/dashboards. benchmon-start-grafana deploys every JSON file in that directory automatically, providing CPU, memory, network, disk, and InfiniBand (when available) views with real-time refresh.
Create Standard PNG/SVG Benchmon Plots for Offline Access
Benchmon can run in offline mode (without --grafana) and record metrics to local CSV files. You can later import these CSV traces into InfluxDB and generate the standard benchmon PNG/SVG plots for offline access, archiving, and report sharing.
The importer utility is located at benchmon/run/csv_importer.py.
Import Offline CSV Traces into InfluxDB
Basic Usage
If your trace directory includes grafana-data/connection.json, run:
python3 benchmon/run/csv_importer.py --dir /path/to/trace_folder/benchmon_traces_hostname
Manual Connection Settings
If connection.json is missing, or you want another InfluxDB target:
python3 benchmon/run/csv_importer.py \
--dir /path/to/traces \
--grafana-influxdb-url "http://localhost:8181" \
--workers 8
CSV Importer Options
| Argument | Description | Default |
|---|---|---|
--dir |
(Required) Path to folder containing CSV files such as cpu_report.csv. |
- |
--grafana-influxdb-url |
InfluxDB URL v3. Overrides connection.json. |
http://localhost:8181 |
--grafana-token |
InfluxDB token. Overrides connection.json. |
(Empty) |
--database |
Target bucket/database name. | metrics |
--org |
InfluxDB organization (optional). | (Empty) |
--batch-size |
Number of points per write request. | 5000 |
--workers |
Number of concurrent write threads. | 4 |
Generate Benchmon Figures from Imported Data
After CSV import, generate benchmon-style figures directly from InfluxDB:
Compared to the standard CSV-based benchmon-visu usage, where the positional argument is the trace directory, InfluxDB mode needs extra arguments so benchmon knows to query the database instead of reading local files. In practice, the extra required arguments are --influxdb and --influxdb-url. If --start-time and --end-time are omitted, benchmon queries all data in the selected database. This is a good fit when one database represents one experiment and the InfluxDB scan budget is large enough. --influxdb-database is additionally required when your database is not the default metrics bucket.
benchmon-visu ./benchmon_influx_figures \
--influxdb \
--influxdb-url http://localhost:8181 \
--influxdb-database metrics \
--start-time 2026-02-04T21:48:20 \
--end-time 2026-02-04T22:03:20 \
--sys \
--recursive \
--fig-fmt png \
--fig-name benchmon_influx_overview
Notes:
The positional argument is the output directory for generated figures and logs.
If
--influxdb-hostnameis omitted, benchmon discovers all hostnames in the selected time range and generates one figure set per host.--recursivealso createsmulti-node_sync.<fmt>for synchronized multi-node view.--resolution autochooses a coarser time bucket for long windows. You can force fixed resolution such as--resolution 1m.--start-timeand--end-timeuse local wall-clock time in formatYYYY-MM-DDTHH:MM:SS.If some requested measurements are not present in the selected database, benchmon renders the plots backed by the available tables and skips the missing plot types.
If a full-database query hits the InfluxDB backend file-limit, do not assume the measurement is missing. Restart the stack with a larger
--influxdb-query-file-limitsuch as2000, or rerun with both--start-timeand--end-timeto narrow the request.InfluxDB visualization currently supports system plots only:
--cpu,--cpu-all,--cpu-freq,--mem,--net,--disk,--ib, and--sys.
Troubleshooting
| Symptom | Resolution |
|---|---|
| Installer cannot download archives | Ensure internet access to Grafana/InfluxData and rerun. |
benchmon-start-grafana cannot find binaries |
Re-run the installer or export BENCHMON_* paths correctly. |
| Grafana dashboards missing | Verify <install-dir>/grafana/dashboards; reinstall if needed. |
| InfluxDB rejects writes | Confirm admin_token.json exists and port 8181 is free. |
benchmon-visu --influxdb says the request is too large or exceeds the file limit |
Restart the stack with a larger --influxdb-query-file-limit such as 2000, or rerun with both --start-time and --end-time. Changing the flag requires a stack restart because it is passed to influxdb3 serve at startup. |
| Grafana unreachable | Check port usage and inspect logs under <save-dir>/grafana-data/logs/. |
References
Dashboards:
<install-dir>/grafana/dashboardsRuntime helpers:
exec/benchmon-install-grafana,exec/benchmon-start-grafana,exec/benchmon-stop-grafanaMetric streaming:
benchmon/run/influxdb_sender.pyOfficial docs: InfluxDB 3, Grafana
Metric streaming:
benchmon/run/influxdb_sender.pyOfficial docs: InfluxDB 3, Grafana