Perfmon Reference

Configuration

This file contains config related functions and classes

class perfmon.cfg.__init__.GlobalConfiguration(args)[source]

Global configuration with defaults

check_gpus()[source]

This method checks for presence of NVIDIA GPUs

create_config()[source]

Entry point of the class

make_dirs()[source]

This method creates the directory that holds all artefacts of the toolkit

populate_config()[source]

This method adds necessary common info to config dict

Common

This package contains modules related to creating dataframes

class perfmon.common.df.__init__.CreateDataFrame(metric, config)[source]

This class contains all methods to create a dataframe from JSON data

check_non_default_metrics(content)[source]

Check for non default metrics

create_dataframe(content)[source]

This method creates and returns dataframe from the metric data

go()[source]

Entry point to the class

initialise_header_names()[source]

This method initialises the names of headers for each metric

This package contains classes to export metric data

class perfmon.common.export.__init__.ExportData(config, df_dict)[source]

This class contains all methods to export dataframe into different data store types

get_lock_file()[source]

Get the lock file to update database exports

go()[source]

Entry point to the class

release_lock()[source]

Releases lock file

This file contains initialisation functions for logging

class perfmon.common.logging.__init__.HostnameFilter(name='')[source]

filter(record)[source]

Determine if the specified record is to be logged.

Is the specified record to be logged? Returns 0 for no, nonzero for yes. If deemed appropriate, the record may be modified in-place.
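The listing does not say what HostnameFilter does with a record. A common pattern for a filter with this name is to attach the machine's hostname to each record so that a formatter can print it. The sketch below assumes that behaviour; the names HostnameFilterSketch and build_logger and the hostname record attribute are illustrative and are not taken from perfmon.

```python
import logging
import socket


class HostnameFilterSketch(logging.Filter):
    """Hypothetical filter: attaches the local hostname to each record."""

    def filter(self, record):
        # Modify the record in place, then return True so it is still logged.
        record.hostname = socket.gethostname()
        return True


def build_logger(name="perfmon_sketch"):
    """Return a logger whose format string includes the injected hostname."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.addFilter(HostnameFilterSketch())
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(hostname)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

Attaching the filter to the handler rather than to the logger means the attribute is set before the formatter runs on that handler, which is what the format string needs.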

perfmon.common.logging.__init__.logger_config(global_config)[source]

Shortcut method for initializing logging

Parameters

global_config (dict) – Dict containing all the configuration info

Returns

Logger initiated based on config passed

Return type

logger object

This module contains functions that are related to perf stat metrics

perfmon.common.perf.__init__.check_perf_events(perf_events)[source]

This function checks whether all perf groups are actually working. Only the working counters are probed during monitoring

Parameters

perf_events (dict) – A dict of found perf events

Returns

A dict of working perf events

Return type

dict

perfmon.common.perf.__init__.derived_perf_event_list(perf_events)[source]

This function returns list of perf events implemented for a given processor and micro architecture

Parameters

perf_events (dict) – Dictionary of perf events

Returns

A dict with the name and event code of each perf event, and a dict with the derived perf metrics and their formulas

Return type

dict

perfmon.common.perf.__init__.get_mem_bw_event()[source]

This function returns the perf event to get memory bandwidth

Returns

A string to get memory bandwidth for perf stat command

Return type

str

perfmon.common.perf.__init__.get_working_perf_events()[source]

This function checks the micro architecture type and returns available perf events. Raises an exception if micro architecture is not implemented

Returns

A dict of perf events keyed by event name, and a dict of derived perf metrics computed from the event counters

Return type

dict

Raises

PerfEventsNotFoundError – An error occurred while looking for perf events

perfmon.common.perf.__init__.llc_cache_miss_perf_event(processor_vendor, micro_architecture)[source]

This function gives the event code and umask for LLC cache miss event for different architectures

Parameters
  • processor_vendor (str) – Vendor of the processor

  • micro_architecture (str) – Name of the micro architecture of the processor

Returns

String containing event code and umask

Return type

str

Raises

ProcessorVendorNotFoundError – An error occurred while looking for processor vendor.

perfmon.common.perf.__init__.perf_event_list(micro_architecture)[source]

This function returns list of perf events implemented for a given processor and micro architecture

Parameters

micro_architecture (str) – Name of the micro architecture

Returns

A dict with name and event code of perf events

Return type

dict

Raises

PerfEventListNotFoundError – If perf events yml file is not found

This module contains class for detecting process PIDs for various schedulers

class perfmon.common.pid.__init__.GetJobPid(config)[source]

Class to get the main job PID for different workload managers. Currently SLURM, PBS and OAR schedulers are supported

go()[source]

Driver method to find the job PID

This package contains functions to plot gathered metrics

class perfmon.common.plots.__init__.GenPlots(config, df_dict)[source]

This class contains all plotting methods (Only for CPU metrics)

apply_plot_settings(plot_type, metric_att, mean_max, ax)[source]

This method applies the common settings to the plots

check_non_default_metrics(df)[source]

Check if IB, mem. bandwidth and RAPL metrics are available in collected metrics

combined_plotting_engine(metric, metric_att, comb_ts_df, comb_metric_df)[source]

Plotting engine for combined metrics

static convert_ts_datetime(df)[source]

Convert timestamps in df to datetime format

static get_global_mean_max(mean_max_all)[source]

Get global mean max of metric from host data

go()[source]

Entry point for plotting

make_plots(df)[source]

This method plots both per host and combined metrics

plot_metric_data(df)[source]

Make plots for the cpu metric data

plotting_engine(host_name, metric, metric_att, ax, data)[source]

Main engine to create plots

static replace_neg_values(df)[source]

Replace negative values in df with the preceding positive values

This module contains functions related to processor specific info

perfmon.common.processor.__init__.get_cpu_spec()[source]

This function extracts the vendor and CPU micro architecture using the archspec module

Returns

Name of the vendor and name of the micro architecture

Return type

str

This module contains class to generate job report

class perfmon.common.report.__init__.GenReport(config)[source]

This class does all the post monitoring steps like making plots and generating reports

create_job_report(content)[source]

Create a job report using FPDF module

go()[source]

Entry point for creating report

initialise_plot_per_page()[source]

Initialises plot related parameters

Utility functions related to devices on the platform

perfmon.common.utils.devices.get_rapl_devices()[source]

This function gets all the package, core, uncore and DRAM devices available within the RAPL powercap interface

Returns

A dict with package names and paths

Return type

dict
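As an illustration of how such a lookup could work, the sketch below walks the Linux powercap sysfs tree and maps each RAPL domain to its energy counter file. It assumes the usual layout in which zones appear as intel-rapl:N and sub-zones as intel-rapl:N:M, each with a name and an energy_uj file. The function name find_rapl_devices and the name@zone key format are hypothetical and are not perfmon's actual return format.

```python
import glob
import os


def find_rapl_devices(base="/sys/class/powercap"):
    """Map each RAPL domain (hypothetical 'name@zone' key) to its energy file."""
    devices = {}
    # Zones look like intel-rapl:0; sub-zones look like intel-rapl:0:0.
    for zone in sorted(glob.glob(os.path.join(base, "intel-rapl:*"))):
        name_file = os.path.join(zone, "name")
        energy_file = os.path.join(zone, "energy_uj")
        if not (os.path.isfile(name_file) and os.path.isfile(energy_file)):
            continue  # skip entries that do not expose a counter
        with open(name_file) as handle:
            name = handle.read().strip()
        # Names such as "core" or "dram" repeat on every socket, so the
        # zone id is appended to keep the keys unique.
        zone_id = os.path.basename(zone).split(":", 1)[1]
        devices[f"{name}@{zone_id}"] = energy_file
    return devices
```

On a machine without the powercap interface the function simply returns an empty dict, which a caller could use to switch energy monitoring off.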

perfmon.common.utils.devices.ibstat_ports()[source]

This function returns Infiniband ports if present

Returns

A dict with IB port names and numbers

Return type

dict

Utility functions for command execution

perfmon.common.utils.execute_cmd.execute_cmd(cmd_str, handle_exception=True)[source]

Accepts a command string and returns its output.

Parameters
  • cmd_str (str) – Command string to be executed

  • handle_exception (bool) – Handle exception manually. If set to false, raises an exception to the caller function

Returns

Output of the command. If command execution fails, returns ‘not_available’

Return type

str

Raises

subprocess.CalledProcessError – An error occurred in execution of command iff handle_exception is set to False
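The contract described above (return the output, fall back to 'not_available' on failure, or re-raise when handle_exception is False) can be sketched with the standard subprocess module. This is a minimal illustration under those stated assumptions, not the perfmon implementation. The name run_command_sketch is hypothetical, and treating a missing executable the same way as a failed command is a choice made for this sketch.

```python
import shlex
import subprocess


def run_command_sketch(cmd_str, handle_exception=True):
    """Run cmd_str and return its stripped stdout, or 'not_available'."""
    try:
        result = subprocess.run(
            shlex.split(cmd_str),  # split into argv without invoking a shell
            check=True,            # non-zero exit raises CalledProcessError
            capture_output=True,
            text=True,
        )
        return result.stdout.strip()
    except (subprocess.CalledProcessError, OSError):
        # OSError covers the case where the executable does not exist.
        if handle_exception:
            return "not_available"
        raise
```

Returning a sentinel string keeps callers simple when a tool is optional, while handle_exception=False lets a caller distinguish a failed probe from real output.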

perfmon.common.utils.execute_cmd.execute_cmd_pipe(cmd_str)[source]

Accepts a command string, executes it using piping and returns the process object.

Parameters

cmd_str (str) – Command string to be executed

Returns

Process object

Return type

object

Utility functions for manipulating json files

perfmon.common.utils.json_wrappers.dump_json(content, filename)[source]

This function appends data to existing JSON content. It creates a new file if no existing file is found.

Parameters
  • content (dict) – Dict to write into JSON format

  • filename (str) – Name of the file to write to
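One way to read "append to existing JSON content" is: load what is already on disk, merge the new dict into it, and write the result back. The sketch below assumes exactly that and uses a shallow dict.update for the merge; the real function may merge nested data differently. The name dump_json_sketch is hypothetical.

```python
import json
import os


def dump_json_sketch(content, filename):
    """Merge content into the JSON object stored in filename."""
    existing = {}
    if os.path.exists(filename):
        with open(filename) as handle:
            existing = json.load(handle)
    # Assumption: top-level keys in content overwrite keys already on disk.
    existing.update(content)
    with open(filename, "w") as handle:
        json.dump(existing, handle, indent=2)
```

Because the whole file is read and rewritten on each call, concurrent writers would need some form of locking around it.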

perfmon.common.utils.json_wrappers.load_json(filename)[source]

This function loads a JSON file and returns its contents as a dict

Parameters

filename (str) – Name of the file to load

Returns

File contents as dict

Return type

dict

perfmon.common.utils.json_wrappers.write_json(content, filename)[source]

This function writes json content to a file

Parameters
  • content (dict) – Dict to write into JSON format

  • filename (str) – Name of the file to write to

Class to lock files

class perfmon.common.utils.locks.FileLock(protected_file_path, timeout=None, delay=1, lock_file_contents=None)[source]

A file locking mechanism that has context-manager support so you can use it in a with statement. This should be relatively cross compatible as it doesn’t rely on msvcrt or fcntl for the locking.

exception FileLockException[source]

Exception to the file lock object

acquire(blocking=True)[source]

Acquire the lock, if possible. If the lock is in use, and blocking is False, return False. Otherwise, check again every self.delay seconds until it either gets the lock or exceeds timeout number of seconds, in which case it raises an exception.

available()[source]

Returns True iff the file is currently available to be locked.

lock_exists()[source]

Returns True iff the external lockfile exists.

locked()[source]

Returns True iff the file is owned by THIS FileLock instance. (Even if this returns false, the file could be owned by another FileLock instance, possibly in a different thread or process).

purge()[source]

For debug purposes only. Removes the lock file from the hard disk.

release()[source]

Get rid of the lock by deleting the lockfile. When working in a with statement, this gets automatically called at the end.
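A lock that avoids msvcrt and fcntl is typically built on atomic file creation: whoever manages to create the lock file owns the lock. The sketch below illustrates that idea with os.open and O_EXCL. It is not the FileLock source: the class name SimpleFileLock, the ".lock" suffix and the use of the built-in TimeoutError in place of FileLockException are all assumptions made for the example.

```python
import os
import time


class SimpleFileLock:
    """Hypothetical lock-file based lock with context-manager support."""

    def __init__(self, protected_file_path, timeout=None, delay=1):
        self.lockfile = protected_file_path + ".lock"
        self.timeout = timeout
        self.delay = delay
        self.is_locked = False

    def acquire(self, blocking=True):
        start = time.monotonic()
        while True:
            try:
                # O_CREAT | O_EXCL fails if the file already exists, so
                # creation doubles as an atomic test-and-set.
                fd = os.open(self.lockfile, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                self.is_locked = True
                return True
            except FileExistsError:
                if not blocking:
                    return False
                waited = time.monotonic() - start
                if self.timeout is not None and waited >= self.timeout:
                    raise TimeoutError(f"could not lock {self.lockfile}")
                time.sleep(self.delay)

    def release(self):
        # Only the instance that created the lock file removes it.
        if self.is_locked:
            os.remove(self.lockfile)
            self.is_locked = False

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc, tb):
        self.release()
```

Used as `with SimpleFileLock("results.db"):`, the lock file is removed when the block exits, including when the body raises.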

Utility functions for parsing

class perfmon.common.utils.parsing.RawFormatter(prog, indent_increment=2, max_help_position=24, width=None)[source]

Formatter class that prints help messages without any formatting or unwanted line breaks; activated when the help text starts with R|

perfmon.common.utils.parsing.get_parser(cmd_output, reg='lscpu')[source]

Regex parser.

Parameters
  • cmd_output (str) – Output of the executed command

  • reg (str) – Regex pattern to be used

Returns

Function handle to parse the output
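The idea of returning a function handle bound to a command's output can be sketched as a closure. The example below assumes lscpu-style "Key: value" lines. The name make_parser, the pattern template and the key argument of the returned function are hypothetical, and they do not reflect how the reg argument is actually interpreted by get_parser.

```python
import re


def make_parser(cmd_output, pattern=r"^{key}:\s*(.*)$"):
    """Return a function that looks up one field in cmd_output."""

    def parser(key):
        # re.escape keeps characters in the key from acting as regex syntax.
        regex = pattern.format(key=re.escape(key))
        match = re.search(regex, cmd_output, re.MULTILINE)
        return match.group(1).strip() if match else None

    return parser
```

A caller would build the handle once from the captured output and then query it several times, for example `parse = make_parser(output)` followed by `parse("Model name")`.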

Class to create pdf file

class perfmon.common.utils.pdf.PDF(config)[source]

Custom PDF class that inherits from FPDF

footer()[source]

This method defines footer of the pdf

header()[source]

This method defines header of the pdf

page_body(images)[source]

This method defines body of the pdf

print_page(images)[source]

This method adds an empty page and populates it with images/text

Utility functions for psutil process finder

perfmon.common.utils.process.find_procs_by_name(name)[source]

Return a list of processes matching ‘name’

Parameters

name (str) – name of the process to find

Returns

List of psutil objects

Return type

list

perfmon.common.utils.process.get_proc_info(pid)[source]

Convenient wrapper around psutil.Process to catch exceptions

perfmon.common.utils.process.proc_if_running(procs)[source]

Checks whether the processes are running and returns False if all of them have terminated

Parameters

procs (list) – List of psutil process objects

Returns

Running status of the processes

Return type

bool

Utility functions

perfmon.common.utils.utilities.dump_yaml(config)[source]

Dump config files (for debugging)

perfmon.common.utils.utilities.get_project_root()[source]

Get root directory of the project

Returns

Full path of the root directory

Return type

str

perfmon.common.utils.utilities.get_value(input_dict, target)[source]

Find the value for a given target in dict

Parameters
  • input_dict (dict) – Dict to search for key

  • target (Any) – Key to search

Returns

List of values found in input_dict

Return type

list
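Since the function returns a list rather than a single value, it presumably collects every occurrence of the key, including those in nested dicts. The sketch below assumes that reading and also descends into lists; both points are assumptions, and find_values is a hypothetical name.

```python
def find_values(input_dict, target):
    """Collect the value of every occurrence of target, at any depth."""
    found = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key == target:
                    found.append(value)
                walk(value)  # keep searching inside the value as well
        elif isinstance(node, (list, tuple)):
            for item in node:
                walk(item)

    walk(input_dict)
    return found
```

Values are returned in traversal order, so matches nearer the top of the dict come first.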

perfmon.common.utils.utilities.merge_dicts(exst_dict, new_dict)[source]

Merge two dicts. exst_dict is updated with data from new_dict

Parameters
  • exst_dict (dict) – Existing data in the dict

  • new_dict (dict) – New data to be added to the dict

Returns

updated exst_dict with contents from new_dict

Return type

dict
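The description does not say how nested values are combined. The sketch below shows one plausible policy for metric data, stated here as an assumption: nested dicts are merged recursively, lists are extended, and any other value in new_dict replaces the existing one. The name merge_nested is hypothetical.

```python
def merge_nested(exst_dict, new_dict):
    """Update exst_dict in place with new_dict and return it."""
    for key, value in new_dict.items():
        current = exst_dict.get(key)
        if isinstance(current, dict) and isinstance(value, dict):
            merge_nested(current, value)   # recurse into nested dicts
        elif isinstance(current, list) and isinstance(value, list):
            current.extend(value)          # append new samples
        else:
            exst_dict[key] = value         # new key or scalar overwrite
    return exst_dict
```

Extending lists rather than replacing them suits time series that are dumped in chunks, where each new chunk continues the previous one.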

perfmon.common.utils.utilities.replace_negative(input_list)[source]

This function replaces the negative values in a numpy array with the mean of their neighbours. If a negative value is at either end of the array, it is replaced with the preceding or succeeding element

Parameters

input_list (list) – A list with positive and/or negative elements

Returns

A list with just positive elements

Return type

list
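The replacement rule can be sketched in plain Python as follows. The example assumes negative values are isolated (a negative value whose neighbour is also negative is not handled specially) and works on a list rather than a numpy array. The name replace_negative_sketch is hypothetical.

```python
def replace_negative_sketch(input_list):
    """Return a copy of input_list with isolated negative values patched."""
    values = list(input_list)  # work on a copy, leave the input untouched
    if len(values) < 2:
        return values          # no neighbour to borrow from
    last = len(values) - 1
    for i, value in enumerate(values):
        if value >= 0:
            continue
        if i == 0:
            values[i] = values[i + 1]   # first element: use the successor
        elif i == last:
            values[i] = values[i - 1]   # last element: use the predecessor
        else:
            values[i] = (values[i - 1] + values[i + 1]) / 2
    return values
```

For example, `replace_negative_sketch([1.0, -1.0, 3.0])` gives `[1.0, 2.0, 3.0]`.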

Core

This file contains class to launch monitoring process

class perfmon.core.metrics.__init__.MonitorPerformanceMetrics(config)[source]

Engine to extract performance metrics

get_job_pid()[source]

This method calls function to get job PID

start_collection()[source]

Start collecting CPU metrics. We use multiprocessing library to spawn different processes to monitor cpu and perf metrics

This file contains common functions that are needed to monitor metrics

perfmon.core.metrics.common.check_metric_data(data_struct)[source]

This method checks if all the metric data is consistent with number of timestamps

perfmon.core.metrics.common.dump_metrics_async(data, outfile)[source]

Dump metrics asynchronously

Parameters
  • data (dict) – Data to be dumped to disk

  • outfile (str) – Path of the outfile

perfmon.core.metrics.common.get_child_procs(user, procs)[source]

Get list of children processes in user namespace

Parameters
  • user (str) – User name

  • procs (object) – psutil proc iterator

Returns

List of children processes in user space

Return type

list

perfmon.core.metrics.common.get_cumulative_metric_value(metric_type, procs, data)[source]

This method gets the cumulative metric value, accounting for all children, for a given metric type

This file contains base class to monitor CPU metrics

class perfmon.core.metrics.cpu.MonitorCpuUsage(config)[source]

Engine to monitor cpu related metrics

add_ib_counters_to_dict()[source]

Add IB counters to base dict

add_mem_bw_to_dict()[source]

Add memory bandwidth to base dict

add_metrics_cpu_parameters()[source]

This method adds metrics key/value pair in cpu parameter dict

add_rapl_domains_to_dict()[source]

Add RAPL domain names to base dict

add_timestamp()[source]

This method adds timestamp to the data

check_availability_ib_rapl_membw()[source]

This method checks if infiniband and RAPL metrics are available

dump_metrics()[source]

Dump metrics to JSON file and re-initiate cpu_metrics dict

get_cpu_usage()[source]

This method gets all CPU usage statistics

get_energy_metrics()[source]

This method gets energy metrics from RAPL powercap interface

get_memory_usage()[source]

This method gets memory usage

get_metrics_data()[source]

Extract metrics data

get_misc_metrics()[source]

This method gets IO, file descriptors and thread count

get_network_traffic()[source]

Get network traffic from TCP and Infiniband (if supported)

initialise_cpu_metrics_params()[source]

This method initialises the CPU metric related parameters

run()[source]

This method extracts the cpu related metrics for a given pid

This file contains base class to monitor GPU metrics

class perfmon.core.metrics.gpu.MonitorNvidiaGpuMetrics(config)[source]

Engine to monitor gpu related metrics

add_timestamp()[source]

This method adds timestamp to the data

dump_metrics()[source]

Dump metrics to JSON file and re-initiate gpu_metrics dict

get_clock_info()[source]

This method gets different clock info metrics

get_ecc_metrics()[source]

This method gets ECC error counts

get_memory_usage()[source]

This method gets memory usage

get_metrics_data()[source]

Extract metrics data

get_misc_metrics()[source]

This method gets different misc metrics

get_new_host_name(gpu_dev_num)[source]

Append GPU number to host name

get_power_metrics()[source]

This method gets power metrics

get_utilization_rates()[source]

This method gets all utilization statistics

initialise_gpu_metrics_params()[source]

This method initialises the GPU metric related parameters

run()[source]

This method extracts the gpu related metrics for a given pid

This file contains base class to monitor perf stat metrics

class perfmon.core.metrics.perfcounters.MonitorPerfCounters(config)[source]

Engine to extract performance metrics

add_timestamp()[source]

This method adds timestamp to the data

compute_derived_metrics()[source]

This method computes all the derived metrics from parsed perf counters

dump_avail_perf_events()[source]

Dump the available perf event list for later use

dump_metrics()[source]

Dump metrics to JSON file and re-initiate perf_metrics dict

get_list_of_pids()[source]

This method gets the list of pids to monitor by adding children pids to parents

initialise_perf_metrics_data_dict()[source]

This method initialises the perf metric related parameters

make_perf_command()[source]

This method builds the perf command to run

static match_perf_line(pattern, cmd_out)[source]

This method builds the perf output pattern and gets the matching groups

parse_perf_cmd_out(cmd_out)[source]

This method parses the perf command output and populates the perf data dict with counter values

post_parsing_steps()[source]

Steps to be made after parsing all metrics

run()[source]

This method extracts perf metrics for a given pid

set_up_perf_events()[source]

This method checks for available perf events, tests them and initialises the data dict

setup_perf_monitor()[source]

Setup steps for monitoring perf metrics

Functions to monitor RAPL energy metrics

perfmon.core.metrics.cpumetrics.energy.rapl_energy_readings(rapl_devices, data)[source]

This method gets energy metrics from RAPL powercap interface

Functions to monitor memory related metrics

perfmon.core.metrics.cpumetrics.memory.get_memory_bandwidth(mem_bw_event, procs)[source]

This method returns memory bandwidth based on perf LLC load misses event
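A common approximation for turning an LLC miss count into memory traffic is to assume that every miss fetches one cache line from memory, so bandwidth is misses times line size divided by the sampling interval. The sketch below applies that approximation with a 64-byte line, which is an assumption about the target CPU. The function name and parameters are hypothetical, and perfmon may compute the figure differently.

```python
CACHE_LINE_BYTES = 64  # assumption: typical cache line size on x86-64


def estimate_memory_bandwidth(llc_misses, interval_seconds,
                              line_bytes=CACHE_LINE_BYTES):
    """Estimate memory bandwidth in bytes per second from LLC misses."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    # Each miss is assumed to transfer exactly one cache line.
    return llc_misses * line_bytes / interval_seconds
```

The estimate ignores hardware prefetching and write-back traffic, so it is better read as a lower-bound indicator than as a measured value.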

perfmon.core.metrics.cpumetrics.memory.memory_usage(mem_bw_event, procs, data)[source]

This method gets memory usage

Functions to monitor other metrics

perfmon.core.metrics.cpumetrics.misc.misc_metrics(procs, data)[source]

This method gets IO, file descriptors and thread count

Functions to monitor network related metrics

perfmon.core.metrics.cpumetrics.network.ib_io_counters(ib_ports, data)[source]

This method gets the IB port counters

perfmon.core.metrics.cpumetrics.network.network_io_counters(data)[source]

This method gets the system wide network IO counters

Functions to monitor CPU usage metrics

perfmon.core.metrics.cpumetrics.usage.get_cpu_percent(cpu_aggregation_interval, procs)[source]

This method gives the CPU percent of the parent and its children

perfmon.core.metrics.cpumetrics.usage.get_cpu_time(procs)[source]

This method gets the cumulative CPU time of the parent and its children

This module contains all NVIDIA GPU related metrics functions

perfmon.core.metrics.gpumetrics.nvidia.__init__.device_query(func, *args)[source]

Convenience wrapper to query different metrics for NVIDIA GPUs

Parameters

func (str) – Name of the API function

Returns

Metric value

Return type

list

Functions to monitor clock frequency info related metrics for NVIDIA GPUs

perfmon.core.metrics.gpumetrics.nvidia.clock.clock_info(data)[source]

This method gets NVIDIA GPU clock info for memory, graphics and SM

Functions to monitor ECC error counts for NVIDIA GPUs

perfmon.core.metrics.gpumetrics.nvidia.errors.ecc_error_counts(data)[source]

This method gets NVIDIA GPU ECC error counts for SP and DP

Functions to monitor memory related metrics for NVIDIA GPUs

perfmon.core.metrics.gpumetrics.nvidia.memory.memory_usage(data)[source]

This method gets NVIDIA GPU memory and BAR1 memory usage

Functions to monitor misc metrics like temperature, fan speed for NVIDIA GPUs

perfmon.core.metrics.gpumetrics.nvidia.misc.misc_metrics(data)[source]

This method gets misc NVIDIA GPU metrics

Functions to monitor power related metrics for NVIDIA GPUs

perfmon.core.metrics.gpumetrics.nvidia.power.power_usage(data)[source]

This method gets NVIDIA GPUs power usage metrics

perfmon.core.metrics.gpumetrics.nvidia.power.power_violation_report(data)[source]

This method gets NVIDIA GPUs throttling period due to constraints

Functions to get GPU utilization rates

perfmon.core.metrics.gpumetrics.nvidia.utilization.get_encoder_decoder_util_rates(data)[source]

This method gets encoder and decoder utilization rates

perfmon.core.metrics.gpumetrics.nvidia.utilization.get_gpu_mem_util_rates(data)[source]

This method gets GPU and memory utilization rates

Exceptions

This file contains the custom exceptions defined for monitoring tools.

exception perfmon.exceptions.__init__.ArchitectureNotFoundError[source]

Processor architecture not found

exception perfmon.exceptions.__init__.BatchSchedulerNotFound[source]

Batch scheduler not implemented or not recognised

exception perfmon.exceptions.__init__.CommandExecutionFailed[source]

Command execution exception

exception perfmon.exceptions.__init__.JobPIDNotFoundError[source]

Step job PID not found

exception perfmon.exceptions.__init__.KeyNotFoundError[source]

Key not found in the dict

exception perfmon.exceptions.__init__.MetricGroupNotImplementedError[source]

Requested metric group not implemented

exception perfmon.exceptions.__init__.PerfEventListNotFoundError[source]

Perf event list not implemented

exception perfmon.exceptions.__init__.ProcessorVendorNotFoundError[source]

Processor vendor not implemented

Perfevents

This package contains perf event lists for different architectures

Schemas

This package contains schemas for the perfmon toolkit

This is schema for dataframe

This is schema for metrics data

This is schema for plots