Perfmon Reference
Configuration
This file contains configuration-related functions and classes.
Common
This package contains modules related to creating dataframes.
class perfmon.common.df.__init__.CreateDataFrame(metric, config)
    This class contains all methods to create a dataframe from JSON data.
This package contains classes to export metric data.
class perfmon.common.export.__init__.ExportData(config, df_dict)
    This class contains all methods to export dataframes to different data store types.
This file contains initialisation functions for logging.
perfmon.common.logging.__init__.logger_config(global_config)
    Shortcut method for initializing logging.
    - Parameters
      global_config (dict) – Dict containing all the configuration info
    - Returns
      Logger initiated based on the config passed
    - Return type
      logger object
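As a rough illustration of what such a shortcut can look like, the sketch below builds a logger from a configuration dict using only the standard `logging` module. The function name `make_logger` and the keys `verbose` and `log_file` are hypothetical; the keys actually read by `logger_config` are not listed in this reference.

```python
import logging


def make_logger(global_config, name="perfmon_sketch"):
    """Return a logger configured from a dict (hypothetical keys)."""
    logger = logging.getLogger(name)
    # Assumed key: 'verbose' switches between DEBUG and INFO
    level = logging.DEBUG if global_config.get("verbose") else logging.INFO
    logger.setLevel(level)
    # Drop handlers left over from earlier calls so messages are not duplicated
    for old_handler in list(logger.handlers):
        logger.removeHandler(old_handler)
    # Assumed key: 'log_file'; fall back to the console when it is absent
    log_file = global_config.get("log_file")
    handler = logging.FileHandler(log_file) if log_file else logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


log = make_logger({"verbose": True})
log.debug("logging initialised")
```

Removing stale handlers first keeps repeated initialisation from printing every message twice, which is a common pitfall when a logger is configured more than once in the same process.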
This module contains functions related to perf stat metrics.
perfmon.common.perf.__init__.check_perf_events(perf_events)
    This function checks whether all perf groups actually work. Only the working counters are probed during monitoring.
perfmon.common.perf.__init__.derived_perf_event_list(perf_events)
    This function returns the list of perf events implemented for a given processor and micro architecture.
perfmon.common.perf.__init__.get_mem_bw_event()
    This function returns the perf event used to get memory bandwidth.
    - Returns
      A string to get memory bandwidth for the perf stat command
    - Return type
      str
perfmon.common.perf.__init__.get_working_perf_events()
    This function checks the micro architecture type and returns the available perf events. It raises an exception if the micro architecture is not implemented.
    - Returns
      Perf events with event names (dict) and derived perf metrics from event counters (dict)
    - Return type
      dict
    - Raises
      PerfEventsNotFoundError – An error occurred while looking for perf events
perfmon.common.perf.__init__.llc_cache_miss_perf_event(processor_vendor, micro_architecture)
    This function gives the event code and umask of the LLC cache miss event for different architectures.
    - Parameters
    - Returns
      String containing event code and umask
    - Return type
      str
    - Raises
      ProcessorVendorNotFoundError – An error occurred while looking for processor vendor.
perfmon.common.perf.__init__.perf_event_list(micro_architecture)
    This function returns the list of perf events implemented for a given processor and micro architecture.
    - Parameters
      micro_architecture (str) – Name of the micro architecture
    - Returns
      A dict with name and event code of perf events
    - Return type
      dict
    - Raises
      PerfEventListNotFoundError – If perf events yml file is not found
This module contains a class for detecting process PIDs for various schedulers.
class perfmon.common.pid.__init__.GetJobPid(config)
    Class to get the main job PID for different workload managers. Currently the SLURM, PBS and OAR schedulers are supported.
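A first step for a class like this is usually to work out which workload manager started the job. The sketch below does that by looking at the job ID environment variables that SLURM, PBS and OAR export inside a job. The function `detect_scheduler` and the local `BatchSchedulerNotFound` class are stand-ins written for this example; how `GetJobPid` itself detects the scheduler and then locates the main PID is not described in this reference.

```python
import os


class BatchSchedulerNotFound(Exception):
    """Local stand-in for the exception of the same name."""


# Job ID variables exported by each workload manager inside a job
SCHEDULER_ENV_VARS = {
    "SLURM": "SLURM_JOB_ID",
    "PBS": "PBS_JOBID",
    "OAR": "OAR_JOB_ID",
}


def detect_scheduler(environ=None):
    """Return (scheduler_name, job_id) based on environment variables."""
    environ = os.environ if environ is None else environ
    for scheduler, variable in SCHEDULER_ENV_VARS.items():
        job_id = environ.get(variable)
        if job_id:
            return scheduler, job_id
    raise BatchSchedulerNotFound("no supported workload manager detected")


print(detect_scheduler({"SLURM_JOB_ID": "4242"}))
```

Passing the environment as an argument keeps the function easy to exercise outside a real batch job.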
This package contains functions to plot gathered metrics.
class perfmon.common.plots.__init__.GenPlots(config, df_dict)
    This class contains all plotting methods (only for CPU metrics).
    apply_plot_settings(plot_type, metric_att, mean_max, ax)
        This method applies the common settings to the plots.
    check_non_default_metrics(df)
        Check if IB, memory bandwidth and RAPL metrics are available in the collected metrics.
This module contains functions related to processor-specific info.
perfmon.common.processor.__init__.get_cpu_spec()
    This function extracts the vendor and CPU micro architecture using the archspec module.
    - Returns
      Name of the vendor (str) and micro architecture (str)
    - Return type
      str
This module contains a class to generate the job report.
class perfmon.common.report.__init__.GenReport(config)
    This class does all the post-monitoring steps, such as making plots and generating reports.
Utility functions related to devices on the platform
perfmon.common.utils.devices.get_rapl_devices()
    This function gets all the package, core, uncore and DRAM devices available within the RAPL powercap interface.
    - Returns
      A dict with package names and paths
    - Return type
      dict
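To make the idea of RAPL device discovery concrete, the sketch below walks the Linux powercap sysfs tree, where each RAPL zone is a directory such as `intel-rapl:0` (a package) or `intel-rapl:0:0` (a sub-zone) holding a `name` file and an `energy_uj` counter. The function `discover_rapl_devices` and the `name@zone` key format are choices made for this example and are not necessarily what `get_rapl_devices` returns.

```python
from pathlib import Path


def discover_rapl_devices(root="/sys/class/powercap"):
    """Map RAPL domain names to the paths of their energy counter files."""
    devices = {}
    # Package zones look like intel-rapl:0, sub-zones like intel-rapl:0:1
    for zone in sorted(Path(root).glob("intel-rapl:*")):
        name_file = zone / "name"
        energy_file = zone / "energy_uj"
        if not (name_file.is_file() and energy_file.is_file()):
            continue
        name = name_file.read_text().strip()
        # Sub-zone names such as 'core' or 'dram' repeat across packages,
        # so the zone ID is kept in the key to avoid collisions
        zone_id = zone.name.split(":", 1)[1]
        devices[f"{name}@{zone_id}"] = str(energy_file)
    return devices


# Prints an empty dict on machines without the powercap interface
print(discover_rapl_devices())
```

Taking the sysfs root as a parameter lets the function be pointed at a fake directory tree on systems without RAPL support.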
perfmon.common.utils.devices.ibstat_ports()
    This function returns the Infiniband ports, if present.
    - Returns
      A dict with IB port names and numbers
    - Return type
      dict
Utility functions for command execution
perfmon.common.utils.execute_cmd.execute_cmd(cmd_str, handle_exception=True)
    Accepts a command string and returns its output.
    - Parameters
    - Returns
      Output of the command. If command execution fails, returns ‘not_available’
    - Return type
      str
    - Raises
      subprocess.CalledProcessError – An error occurred during execution of the command; raised only if handle_exception is set to False
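The behaviour described for `execute_cmd` (return the output, fall back to 'not_available' on failure, or re-raise when `handle_exception` is False) can be sketched with `subprocess.run`. The function name `run_command` is hypothetical, and running without a shell via `shlex.split` is an assumption; the real function may invoke commands differently.

```python
import shlex
import subprocess
import sys


def run_command(cmd_str, handle_exception=True):
    """Run cmd_str without a shell and return its stripped stdout."""
    try:
        completed = subprocess.run(
            shlex.split(cmd_str),
            check=True,  # a non-zero exit status raises CalledProcessError
            capture_output=True,
            text=True,
        )
    except subprocess.CalledProcessError:
        if handle_exception:
            return "not_available"
        raise
    return completed.stdout.strip()


# The current interpreter is used as a portable test command
python = shlex.quote(sys.executable)
print(run_command(f"{python} -c \"print('hello')\""))
print(run_command(f"{python} -c \"raise SystemExit(3)\""))
```

Returning a sentinel string keeps a monitoring loop alive when an optional tool is missing or fails, while `handle_exception=False` lets callers that need the failure details see the original exception.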
perfmon.common.utils.execute_cmd.execute_cmd_pipe(cmd_str)
    Accepts a command string, executes it using piping and returns the process object.
Utility functions for manipulating JSON files
perfmon.common.utils.json_wrappers.dump_json(content, filename)
    This function appends data to existing JSON content. It creates a new file if no existing file is found.
perfmon.common.utils.json_wrappers.load_json(filename)
    This function loads a JSON file and returns a dict.
perfmon.common.utils.json_wrappers.write_json(content, filename)
    This function writes JSON content to a file.
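The three helpers above follow a common load, merge and write pattern, which the sketch below reproduces with the standard `json` module. The names `write_json_file`, `load_json_file` and `append_json` are hypothetical, and merging with a shallow `dict.update` is an assumption; the reference does not say how `dump_json` combines new data with existing content.

```python
import json
import os
import tempfile


def write_json_file(content, filename):
    """Overwrite filename with content serialised as JSON."""
    with open(filename, "w") as handle:
        json.dump(content, handle, indent=2)


def load_json_file(filename):
    """Read filename and return the decoded JSON object."""
    with open(filename) as handle:
        return json.load(handle)


def append_json(content, filename):
    """Merge content into the file, creating the file when it is missing."""
    existing = load_json_file(filename) if os.path.isfile(filename) else {}
    # Assumption: top-level keys in content overwrite existing ones
    existing.update(content)
    write_json_file(existing, filename)


path = os.path.join(tempfile.mkdtemp(), "metrics.json")
append_json({"cpu": [1, 2]}, path)  # the file does not exist yet
append_json({"mem": [3]}, path)     # merged into the existing content
print(load_json_file(path))
```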
Class to lock files
class perfmon.common.utils.locks.FileLock(protected_file_path, timeout=None, delay=1, lock_file_contents=None)
    A file locking mechanism that has context-manager support, so you can use it in a with statement. This should be relatively cross-compatible as it doesn’t rely on msvcrt or fcntl for the locking.
    acquire(blocking=True)
        Acquire the lock, if possible. If the lock is in use and blocking is False, return False. Otherwise, check again every self.delay seconds until it either gets the lock or exceeds timeout seconds, in which case it raises an exception.
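One way to get a lock that avoids both `msvcrt` and `fcntl` is to rely on exclusive file creation: whoever manages to create the lock file owns the lock. The sketch below implements the acquire semantics described above on that basis. `SimpleFileLock`, the `.lock` suffix and the use of `TimeoutError` are assumptions made for this example and may differ from what `FileLock` actually does.

```python
import os
import tempfile
import time


class SimpleFileLock:
    """Lock based on exclusive creation of '<path>.lock' (illustrative)."""

    def __init__(self, protected_file_path, timeout=None, delay=1):
        self.lock_path = protected_file_path + ".lock"
        self.timeout = timeout
        self.delay = delay
        self.is_locked = False

    def acquire(self, blocking=True):
        start = time.monotonic()
        while True:
            try:
                # O_EXCL makes creation fail if the lock file already exists
                fd = os.open(self.lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                if not blocking:
                    return False
                waited = time.monotonic() - start
                if self.timeout is not None and waited >= self.timeout:
                    raise TimeoutError(f"could not lock {self.lock_path}")
                time.sleep(self.delay)
            else:
                os.close(fd)
                self.is_locked = True
                return True

    def release(self):
        if self.is_locked:
            os.remove(self.lock_path)
            self.is_locked = False

    def __enter__(self):
        self.acquire()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.release()


target = os.path.join(tempfile.mkdtemp(), "data.json")
with SimpleFileLock(target, timeout=5, delay=0.1) as lock:
    # A second, non-blocking attempt fails while the lock is held
    print(SimpleFileLock(target).acquire(blocking=False))
# The lock file is removed again when the with block exits
print(os.path.exists(lock.lock_path))
```

A limitation of this scheme is that a process killed while holding the lock leaves the lock file behind, so real implementations often add stale-lock detection.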
Utility functions for parsing
class perfmon.common.utils.parsing.RawFormatter(prog, indent_increment=2, max_help_position=24, width=None)
    This formatter class prints help messages without any reformatting or unwanted line breaks. It is activated when the help text starts with R|.
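A formatter with this behaviour is commonly written by overriding `_split_lines` of `argparse.HelpFormatter`, as in the sketch below. `RawPrefixFormatter` is a hypothetical name, and `_split_lines` is an internal method of the standard library, so this pattern works in practice but is not a documented extension point; whether `RawFormatter` is implemented this way is an assumption.

```python
import argparse


class RawPrefixFormatter(argparse.HelpFormatter):
    """Keep the line breaks of help strings that start with 'R|'."""

    def _split_lines(self, text, width):
        if text.startswith("R|"):
            # Strip the marker and keep the author's own line breaks
            return text[2:].splitlines()
        return super()._split_lines(text, width)


parser = argparse.ArgumentParser(prog="demo", formatter_class=RawPrefixFormatter)
parser.add_argument("--mode", help="R|fast: fewer samples\nfull: all samples")
parser.add_argument("--out", help="Help without the marker is wrapped as usual.")
help_text = parser.format_help()
print(help_text)
```

Only help strings carrying the `R|` prefix bypass re-wrapping, so ordinary options still adapt to the terminal width.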
Class to create PDF files
class perfmon.common.utils.pdf.PDF(config)
    Custom PDF class that inherits from FPDF.
    This method defines the footer of the PDF.
Utility functions for finding processes with psutil
perfmon.common.utils.process.find_procs_by_name(name)
    Return a list of processes matching ‘name’.
perfmon.common.utils.process.get_proc_info(pid)
    Convenient wrapper around psutil.Process to catch exceptions.
perfmon.common.utils.process.proc_if_running(procs)
    Checks whether the processes are running and returns False if all of them have terminated.
Utility functions
perfmon.common.utils.utilities.get_project_root()
    Get the root directory of the project.
    - Returns
      Full path of the root directory
    - Return type
perfmon.common.utils.utilities.get_value(input_dict, target)
    Find the value for a given target in a dict.
perfmon.common.utils.utilities.merge_dicts(exst_dict, new_dict)
    Merge two dicts. exst_dict is updated with data from new_dict.
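The two dict helpers above can be pictured with the following sketch. It assumes that the merge is recursive (nested dicts are merged rather than replaced) and that the lookup searches nested dicts depth first; neither detail is stated in this reference, and the names `merge_nested` and `find_value` are hypothetical.

```python
def merge_nested(existing, new):
    """Recursively update existing with values from new, in place."""
    for key, value in new.items():
        if isinstance(value, dict) and isinstance(existing.get(key), dict):
            merge_nested(existing[key], value)
        else:
            existing[key] = value
    return existing


def find_value(input_dict, target):
    """Depth-first search for the first occurrence of the key target."""
    if target in input_dict:
        return input_dict[target]
    for value in input_dict.values():
        if isinstance(value, dict):
            found = find_value(value, target)
            if found is not None:
                return found
    return None


config = {"job": {"id": 1, "nodes": 2}}
merge_nested(config, {"job": {"nodes": 4}, "user": "alice"})
print(config)
print(find_value(config, "nodes"))
```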
Core
This file contains the class that launches the monitoring process.
class perfmon.core.metrics.__init__.MonitorPerformanceMetrics(config)
    Engine to extract performance metrics.
This file contains common functions that are needed to monitor metrics.
perfmon.core.metrics.common.check_metric_data(data_struct)
    This method checks whether all the metric data is consistent with the number of timestamps.
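A consistency check of this kind can be as simple as comparing the length of every metric series with the length of the timestamp list. The sketch below assumes a flat dict of lists with a `time_stamps` key; that layout, the key name and the function `find_inconsistent_metrics` are hypothetical, since the structure of `data_struct` is not described here.

```python
def find_inconsistent_metrics(data_struct, time_key="time_stamps"):
    """Return the names of series whose length differs from the timestamps."""
    expected = len(data_struct[time_key])
    return sorted(
        name
        for name, series in data_struct.items()
        if isinstance(series, list) and len(series) != expected
    )


samples = {
    "time_stamps": [0, 1, 2],
    "cpu_percent": [10.0, 12.5, 11.0],
    "memory_rss": [100, 120],  # one sample is missing
}
print(find_inconsistent_metrics(samples))
```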
perfmon.core.metrics.common.get_child_procs(user, procs)
    Get the list of child processes in the user namespace.
perfmon.core.metrics.common.get_cumulative_metric_value(metric_type, procs, data)
    This method gets the cumulative metric value, accounting for all child processes, for a given metric type.
This file contains the base class to monitor CPU metrics.
class perfmon.core.metrics.cpu.MonitorCpuUsage(config)
    Engine to monitor CPU-related metrics.
This file contains the base class to monitor GPU metrics.
class perfmon.core.metrics.gpu.MonitorNvidiaGpuMetrics(config)
    Engine to monitor GPU-related metrics.
This file contains the base class to monitor perf stat metrics.
class perfmon.core.metrics.perfcounters.MonitorPerfCounters(config)
    Engine to extract performance metrics.
    compute_derived_metrics()
        This method computes all the derived metrics from the parsed perf counters.
    get_list_of_pids()
        This method gets the list of PIDs to monitor by adding the children's PIDs to those of the parents.
    initialise_perf_metrics_data_dict()
        This method initialises the perf metric related parameters.
    static match_perf_line(pattern, cmd_out)
        This method builds the perf output pattern and gets the matching groups.
    parse_perf_cmd_out(cmd_out)
        This method parses the perf command output and populates the perf data dict with counter values.
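To illustrate the parse-then-derive flow of these methods, the sketch below parses counter values from `perf stat` output and computes one derived metric, instructions per cycle. It assumes the counters were collected in CSV mode (`perf stat -x,`), where each line starts with the counter value, its unit and the event name. The pattern, the function `parse_perf_csv` and the sample output are written for this example; the command line and pattern actually used by `MonitorPerfCounters` are not shown in this reference.

```python
import re

# Leading CSV fields: value, unit, event name; the remaining fields are ignored
PERF_CSV_LINE = re.compile(r"^(?P<value>[^,]*),(?P<unit>[^,]*),(?P<event>[^,]+),")


def parse_perf_csv(cmd_out):
    """Return {event_name: count} for counters that produced a number."""
    counters = {}
    for line in cmd_out.splitlines():
        match = PERF_CSV_LINE.match(line.strip())
        if match is None:
            continue
        try:
            # perf prints '<not supported>' or '<not counted>' instead of
            # a number when an event could not be measured
            counters[match.group("event")] = float(match.group("value"))
        except ValueError:
            continue
    return counters


# Hand-written sample in the CSV layout, not captured from a real run
sample = """\
2000000,,instructions,1000233,100.00,,
1000000,,cycles,1000233,100.00,,
<not supported>,,LLC-load-misses,0,100.00,,
"""
counters = parse_perf_csv(sample)
# Derived metric: instructions per cycle
ipc = counters["instructions"] / counters["cycles"]
print(counters, ipc)
```

Skipping unmeasurable events at parse time means derived metrics only need to check whether their input counters are present.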
Functions to monitor RAPL energy metrics
perfmon.core.metrics.cpumetrics.energy.rapl_energy_readings(rapl_devices, data)
    This method gets energy metrics from the RAPL powercap interface.
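RAPL exposes energy as a monotonically increasing counter in microjoules (`energy_uj`) that wraps around at the value given in `max_energy_range_uj`, so consumers typically work with differences between two readings. The sketch below shows that arithmetic, including a single wrap-around; the function names are hypothetical and whether `rapl_energy_readings` handles wrap-around this way is an assumption.

```python
def energy_delta_uj(previous_uj, current_uj, max_range_uj):
    """Energy consumed between two readings, allowing one counter wrap."""
    if current_uj >= previous_uj:
        return current_uj - previous_uj
    # The counter wrapped: it counted up to max_range_uj and restarted at 0
    return (max_range_uj - previous_uj) + current_uj


def average_power_watts(previous_uj, current_uj, max_range_uj, interval_s):
    """Average power over the interval (1 W equals 1e6 uJ per second)."""
    delta_uj = energy_delta_uj(previous_uj, current_uj, max_range_uj)
    return delta_uj / 1e6 / interval_s


# 30 J consumed in 2 s corresponds to 15 W
print(average_power_watts(1_000_000, 31_000_000, 262_143_328_850, 2.0))
```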
Functions to monitor memory-related metrics
perfmon.core.metrics.cpumetrics.memory.get_memory_bandwidth(mem_bw_event, procs)
    This method returns the memory bandwidth based on the perf LLC load misses event.
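The usual way to turn LLC load misses into a bandwidth figure is to assume that every miss fetches one cache line from memory, giving bandwidth ≈ misses × line size / interval. The sketch below applies that approximation with a 64-byte cache line, which is typical on x86-64 but is an assumption here, as is the function name; the exact formula used by `get_memory_bandwidth` is not given in this reference.

```python
CACHE_LINE_BYTES = 64  # typical on x86-64; an assumption, not read from hardware


def estimate_memory_bandwidth(llc_load_misses, interval_s, line_bytes=CACHE_LINE_BYTES):
    """Estimate memory read bandwidth in bytes per second."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Each LLC load miss is assumed to transfer one cache line from memory
    return llc_load_misses * line_bytes / interval_s


# 50 million misses in 2 s, reported in GB/s
print(estimate_memory_bandwidth(50_000_000, 2.0) / 1e9)
```

This estimate only covers demand loads that miss the last-level cache; traffic from hardware prefetchers and write-backs is not counted, so it tends to understate the true bandwidth.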
perfmon.core.metrics.cpumetrics.memory.memory_usage(mem_bw_event, procs, data)
    This method gets the memory usage.
Functions to monitor other metrics
perfmon.core.metrics.cpumetrics.misc.misc_metrics(procs, data)
    This method gets the IO, file descriptor and thread counts.
Functions to monitor network-related metrics
perfmon.core.metrics.cpumetrics.network.ib_io_counters(ib_ports, data)
    This method gets the IB port counters.
perfmon.core.metrics.cpumetrics.network.network_io_counters(data)
    This method gets the system-wide network IO counters.
Functions to monitor CPU usage metrics
perfmon.core.metrics.cpumetrics.usage.get_cpu_percent(cpu_aggregation_interval, procs)
    This method gives the CPU percent of the parent and its children.
perfmon.core.metrics.cpumetrics.usage.get_cpu_time(procs)
    This method gets the cumulative CPU time of the parent and its children.
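The relation between the two functions above can be sketched with plain arithmetic: CPU time is summed over the parent and its children, and the CPU percent is the growth of that sum divided by the sampling interval. The helper names and the `(user, system)` tuple layout are assumptions made for this example; the real functions operate on process objects passed in `procs`.

```python
def cumulative_cpu_time(per_process_times):
    """Sum (user, system) CPU seconds over a parent and its children."""
    return sum(user + system for user, system in per_process_times)


def cpu_percent(previous_total, current_total, interval_s):
    """CPU usage in percent; 100 means one fully busy core."""
    return 100.0 * (current_total - previous_total) / interval_s


# Two samples taken 2 s apart for a parent and one child process
before = cumulative_cpu_time([(10.0, 2.0), (4.0, 1.0)])
after = cumulative_cpu_time([(11.5, 2.5), (5.5, 1.5)])
print(cpu_percent(before, after, interval_s=2.0))
```

With this definition, values above 100 are expected for multi-threaded or multi-process jobs, since each busy core contributes up to 100.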
This module contains all functions for NVIDIA GPU-related metrics.
perfmon.core.metrics.gpumetrics.nvidia.__init__.device_query(func, *args)
    Convenience wrapper to query different metrics for NVIDIA GPUs.
Functions to monitor clock frequency metrics for NVIDIA GPUs
perfmon.core.metrics.gpumetrics.nvidia.clock.clock_info(data)
    This method gets NVIDIA GPU clock info for memory, graphics and SM.
Functions to monitor ECC error counts for NVIDIA GPUs
perfmon.core.metrics.gpumetrics.nvidia.errors.ecc_error_counts(data)
    This method gets NVIDIA GPU ECC error counts for SP and DP.
Functions to monitor memory-related metrics for NVIDIA GPUs
perfmon.core.metrics.gpumetrics.nvidia.memory.memory_usage(data)
    This method gets NVIDIA GPU memory and BAR1 memory usage.
Functions to monitor misc metrics, such as temperature and fan speed, for NVIDIA GPUs
perfmon.core.metrics.gpumetrics.nvidia.misc.misc_metrics(data)
    This method gets misc NVIDIA GPU metrics.
Functions to monitor power-related metrics for NVIDIA GPUs
perfmon.core.metrics.gpumetrics.nvidia.power.power_usage(data)
    This method gets the power usage metrics of NVIDIA GPUs.
perfmon.core.metrics.gpumetrics.nvidia.power.power_violation_report(data)
    This method gets the throttling period of NVIDIA GPUs due to constraints.
Functions to get GPU utilization rates
Exceptions
This file contains the custom exceptions defined for monitoring tools.
exception perfmon.exceptions.__init__.ArchitectureNotFoundError
    Processor architecture not found.
exception perfmon.exceptions.__init__.BatchSchedulerNotFound
    Batch scheduler not implemented or not recognised.
exception perfmon.exceptions.__init__.MetricGroupNotImplementedError
    Requested metric group not implemented.
Perfevents
This package contains perf event lists for different architectures.
Schemas
This package contains schemas for the perfmon toolkit.
This is the schema for the dataframe.
This is the schema for metrics data.
This is the schema for plots.