| .TH "acct_gather.conf" "5" "Slurm Configuration File" "December 2023" "Slurm Configuration File" |
| |
| .SH "NAME" |
| \fBacct_gather.conf\fR \- Slurm configuration file for the acct_gather plugins |
| |
| .SH "DESCRIPTION" |
| |
| \fBacct_gather.conf\fP is a UTF8 formatted file which defines parameters used |
| by Slurm's acct_gather related plugins. |
| The file will always be located in the same directory as the \fBslurm.conf\fR. |
| .LP |
| Parameter names are case insensitive but parameter values are case sensitive. |
| Any text following a "#" in the configuration file is treated |
| as a comment through the end of that line. |
| The size of each line in the file is limited to 1024 characters. |
| .LP |
| Changes to the configuration file take effect upon restart of |
| the Slurm daemons. |
| |
| .LP |
| The following \fBacct_gather.conf\fR parameters are defined to control the |
| general behavior of various plugins in Slurm. |
| |
| .LP |
| The \fBacct_gather.conf\fR file is different from other Slurm .conf files. Each |
| plugin defines which options are available. Each plugin to be loaded must be |
| specified in the \fBslurm.conf\fR under the following configuration entries: |
| .LP |
| \(bu AcctGatherEnergyType (plugin type=\fIacct_gather_energy\fR) |
| .br |
| \(bu AcctGatherInterconnectType (plugin type=\fIacct_gather_interconnect\fR) |
| .br |
| \(bu AcctGatherFilesystemType (plugin type=\fIacct_gather_filesystem\fR) |
| .br |
| \(bu AcctGatherProfileType (plugin type=\fIacct_gather_profile\fR) |
| |
| .LP |
| If the respective plugin for an option is not loaded then that option will |
| be unknown to Slurm, causing the daemon to exit with a fatal error on |
| initialization. |
| If you decide to change plugin types in \fBslurm.conf\fR, also make sure to |
| change the related options in \fBacct_gather.conf\fR. |
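| .LP |
| As an illustrative sketch of this pairing (the frequency value is an arbitrary |
| placeholder, not a recommendation), a site selecting the ipmi energy plugin |
| would keep only the matching options in \fBacct_gather.conf\fR: |
| .RS |
| .nf |
| # slurm.conf |
| AcctGatherEnergyType=acct_gather_energy/ipmi |
| |
| # acct_gather.conf |
| EnergyIPMIFrequency=10 |
| .fi |
| .RE |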
| |
| .SH acct_gather_energy/gpu |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherEnergyType=acct_gather_energy/gpu |
| .fi |
| .RE |
| This plugin doesn't read any options from \fBacct_gather.conf\fR. |
| .br |
| Dataset provided by the plugin is: Energy. |
| .IP |
| |
| .SH acct_gather_energy/IPMI |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherEnergyType=acct_gather_energy/ipmi |
| .fi |
| .RE |
| |
| Options used for acct_gather_energy/ipmi are as follows: |
| |
| .RS |
| .TP 10 |
| \fBEnergyIPMIFrequency\fR=<number> |
| This parameter is the number of seconds between BMC access samples. |
| Ideally this interval should be less than or equal to |
| \fBJobAcctGatherFrequency\fR, otherwise the JobAcctGather plugin will get |
| repeated values in successive polls. |
| .IP |
| |
| .TP |
| \fBEnergyIPMICalcAdjustment\fR=<yes|no> |
| If set to "yes", the consumption between the last BMC access sample and |
| a step consumption update is approximated to get more accurate task consumption. |
| The adjustment is made at the step start and each time the |
| consumption is updated, including the step end. The approximations are not |
| accumulated, only the first and last adjustments are used to calculate the |
| consumption. The default is "no". |
| .IP |
| |
| .TP |
| \fBEnergyIPMIPowerSensors\fR=<key=values>\fR |
| Optionally specify the ids of the sensors to use. |
| Multiple <key=values> can be set with ";" separators. |
| The key "Node" is mandatory and is used to know the consumed energy for nodes |
| (scontrol show node) and jobs (sacct). |
| Other keys are optional and are named by the administrator. |
| These keys are useful only when profiling is activated for energy, to store |
| the power (in watts) of each key. |
| <values> are integers except when using DCMI. Multiple values can be set with |
| "," separators. The sum of the listed sensors is used for each key. |
| EnergyIPMIPowerSensors is optional, default value is "Node=<value>" where |
| "<value>" is the id of the first power sensor returned by ipmi\-sensors. |
| .br |
| For example: |
| .br |
| .na |
| EnergyIPMIPowerSensors=Node=16,19,23,26;Socket0=16,23;Socket1=19,26;SSUP=23,26;KNC=16,19 |
| .ad |
| .br |
| EnergyIPMIPowerSensors=Node=29,32;SSUP0=29;SSUP1=32 |
| .br |
| EnergyIPMIPowerSensors=Node=1280 |
| |
| Data Center Manageability Interface (DCMI): acct_gather_energy/ipmi supports gathering |
| power data through DCMI IPMI extension commands. When configured, the ipmi |
| plugin will query the DCMI using the "System Power mode" or the |
| "Enhanced System Power Statistics mode" flags depending on the configuration. |
| |
| To configure one or the other, the special sensor values DCMI or DCMI_ENHANCED |
| can be used, for example: |
| .br |
| .na |
| EnergyIPMIPowerSensors=Node=DCMI |
| .ad |
| .br |
| EnergyIPMIPowerSensors=Node=DCMI_ENHANCED |
| |
| .LP |
| The following \fBacct_gather.conf\fR parameters are defined to control the |
| IPMI config default values for libipmiconsole. |
| |
| .TP 10 |
| \fBEnergyIPMIUsername\fR=\fIUSERNAME\fR |
| Specify BMC Username. |
| .IP |
| |
| .TP |
| \fBEnergyIPMIPassword\fR=\fIPASSWORD\fR |
| Specify BMC Password. |
| .RE |
| Datasets provided by the plugin are named: <IPMI_SENSOR_LABEL>Power. |
| .IP |
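| .LP |
| A minimal sketch combining the options above. The sensor ids, user name and |
| password are hypothetical placeholders; the actual sensor ids must be taken |
| from the BMC of the node in question: |
| .RS |
| .nf |
| EnergyIPMIFrequency=10 |
| EnergyIPMICalcAdjustment=yes |
| EnergyIPMIPowerSensors=Node=16,19;Socket0=16;Socket1=19 |
| EnergyIPMIUsername=monitor |
| EnergyIPMIPassword=changeme |
| .fi |
| .RE |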
| |
| .SS |
| NOTES: |
| .LP |
| This plugin requires the freeipmi development files to be installed and linkable |
| at configure time. The plugin will not build otherwise. When building the RPM, |
| \fIrpmbuild ... --with freeipmi\fR can be specified to explicitly check for |
| these dependencies. |
| |
| .SH acct_gather_energy/rapl |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherEnergyType=acct_gather_energy/rapl |
| .fi |
| .RE |
| This plugin doesn't read any options from \fBacct_gather.conf\fR. |
| .br |
| Dataset provided by the plugin is: Power. |
| .IP |
| |
| .SH acct_gather_energy/XCC |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherEnergyType=acct_gather_energy/xcc |
| .fi |
| .RE |
| |
| Options used for acct_gather_energy/xcc include only in\-band communications |
| with XClarity Controller, thus a reduced set of configurations is supported: |
| |
| .RS |
| .TP 10 |
| \fBEnergyIPMIFrequency\fR=<number> |
| This parameter is the number of seconds between XCC access samples. |
| Default is 30 seconds. |
| .IP |
| |
| .TP |
| \fBEnergyIPMITimeout\fR=<number> |
| Timeout, in seconds, for initializing the IPMI XCC context for a new gathering |
| thread. Default is 10 seconds. |
| .RE |
| Datasets provided by the plugin are: Energy, CurrPower. |
| .IP |
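| .LP |
| For illustration, the following fragment selects the plugin and writes out |
| the default values stated above explicitly: |
| .RS |
| .nf |
| # slurm.conf |
| AcctGatherEnergyType=acct_gather_energy/xcc |
| |
| # acct_gather.conf |
| EnergyIPMIFrequency=30 |
| EnergyIPMITimeout=10 |
| .fi |
| .RE |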
| |
| .SH acct_gather_filesystem/lustre |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherFilesystemType=acct_gather_filesystem/lustre |
| .fi |
| .RE |
| This plugin doesn't read any options from \fBacct_gather.conf\fR. |
| .br |
| Datasets provided by the plugin are: Reads, ReadMB, Writes, WriteMB. |
| .IP |
| |
| .SH acct_gather_profile/HDF5 |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherProfileType=acct_gather_profile/hdf5 |
| .fi |
| .RE |
| |
| Options used for acct_gather_profile/hdf5 are as follows: |
| |
| .RS |
| .TP |
| \fBProfileHDF5Dir\fR=<path> |
| This parameter is the path to the shared folder into which the |
| acct_gather_profile plugin will write detailed data (usually as an HDF5 file). |
| The directory is assumed to be on a file system shared by the controller and |
| all compute nodes. This is a required parameter. |
| .IP |
| |
| .TP |
| \fBProfileHDF5Default\fR |
| A comma\-delimited list of data types to be collected for each job submission. |
| Allowed values are: |
| .RS |
| .TP 8 |
| \fBAll\fR |
| All data types are collected. (Cannot be combined with other values.) |
| .IP |
| |
| .TP |
| \fBNone\fR |
| No data types are collected. This is the default. |
| (Cannot be combined with other values.) |
| .IP |
| |
| .TP |
| \fBEnergy\fR |
| Energy data is collected. |
| .IP |
| |
| .TP |
| \fBFilesystem\fR |
| File system (Lustre) data is collected. |
| .IP |
| |
| .TP |
| \fBNetwork\fR |
| Network (InfiniBand) data is collected. |
| .IP |
| |
| .TP |
| \fBTask\fR |
| Task (I/O, Memory, ...) data is collected. |
| .IP |
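| .RE |
| .RE |
| .LP |
| As a sketch, the fragment below enables energy and task profiling by default. |
| The directory is only an example path and is assumed to be on a file system |
| shared by the controller and all compute nodes: |
| .RS |
| .nf |
| ProfileHDF5Dir=/app/slurm/profile_data |
| ProfileHDF5Default=Energy,Task |
| .fi |
| .RE |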
| |
| .SH acct_gather_profile/InfluxDB |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherProfileType=acct_gather_profile/influxdb |
| .fi |
| .RE |
| |
| The InfluxDB plugin provides the same information as the HDF5 plugin but will |
| instead send information to the configured InfluxDB server. |
| .P |
| The InfluxDB plugin is designed against the 1.x protocol of InfluxDB. Any site |
| running a v2.x InfluxDB server will need to configure a v1.x compatibility |
| endpoint along with the correct user and password authorization. Token |
| authentication is not currently supported. |
| .SS |
| Options: |
| .TP |
| \fBProfileInfluxDBDatabase\fR |
| Name of the InfluxDB v1.x database, or of the InfluxDB v2.x bucket, where |
| profiling information is to be written. |
| .IP |
| |
| .TP |
| \fBProfileInfluxDBDefault\fR |
| A comma\-delimited list of data types to be collected for each job submission. |
| Allowed values are: |
| .IP |
| .RS |
| .TP 10 |
| \fBAll\fR |
| All data types are collected. Cannot be combined with other values. |
| .IP |
| |
| .TP |
| \fBNone\fR |
| No data types are collected. This is the default. |
| Cannot be combined with other values. |
| .IP |
| |
| .TP |
| \fBEnergy\fR |
| Energy data is collected. |
| .IP |
| |
| .TP |
| \fBFilesystem\fR |
| File system (Lustre) data is collected. |
| .IP |
| |
| .TP |
| \fBNetwork\fR |
| Network (InfiniBand) data is collected. |
| .IP |
| |
| .TP |
| \fBTask\fR |
| Task (I/O, Memory, ...) data is collected. |
| .RE |
| .IP |
| |
| .TP |
| \fBProfileInfluxDBHost\fR=<hostname>:<port> |
| The hostname of the machine where the \fIInfluxDB\fR instance is executed and |
| the port used by the HTTP API. The port used by the HTTP API is the one |
| configured through the bind\-address influxdb.conf option in the [http] section. |
| .br |
| Example: |
| .nf |
| ProfileInfluxDBHost=myinfluxhost:8086 |
| .fi |
| .in -2 |
| .IP |
| |
| .TP |
| \fBProfileInfluxDBPass\fR |
| Password for username configured in ProfileInfluxDBUser. Required in v2.x and |
| optional in v1.x InfluxDB. |
| .IP |
| |
| .TP |
| \fBProfileInfluxDBRTPolicy\fR |
| The InfluxDB v1.x retention policy name for the database configured in |
| ProfileInfluxDBDatabase option. The InfluxDB v2.x retention policy bucket name |
| for the database configured in ProfileInfluxDBDatabase option. |
| .IP |
| |
| .TP |
| \fBProfileInfluxDBUser\fR |
| InfluxDB username that should be used to gain access to the database configured |
| in ProfileInfluxDBDatabase. Required in v2.x and optional in v1.x InfluxDB. |
| This is only needed if InfluxDB v1.x is configured with authentication enabled |
| in the [http] config section and a user has been granted at least WRITE access |
| to the database. See also \fBProfileInfluxDBPass\fR. |
| .IP |
| |
| .TP |
| \fBProfileInfluxDBTimeout\fR=<seconds> |
| The maximum time in seconds that an HTTP query to the InfluxDB server can take. |
| After this timeout the data is discarded. Be aware that a long timeout can drain |
| your nodes if the InfluxDB server is unresponsive and, when terminating the job, |
| the last dataset takes more than UnkillableStepTimeout to be sent. Internally, |
| this option sets the libcurl CURLOPT_TIMEOUT option. Default is 10 seconds. |
| .IP |
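| .LP |
| A hedged sketch of a complete option set for an InfluxDB v1.x server with |
| authentication enabled. The database name, user name and password are |
| hypothetical; "autogen" is assumed here because it is the name InfluxDB v1.x |
| gives to its default retention policy: |
| .RS |
| .nf |
| ProfileInfluxDBHost=myinfluxhost:8086 |
| ProfileInfluxDBDatabase=slurm_profile |
| ProfileInfluxDBRTPolicy=autogen |
| ProfileInfluxDBUser=slurm |
| ProfileInfluxDBPass=changeme |
| ProfileInfluxDBDefault=Energy,Task |
| ProfileInfluxDBTimeout=10 |
| .fi |
| .RE |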
| |
| .SS |
| NOTES: |
| .LP |
| This plugin requires the libcurl development files to be installed and linkable |
| at configure time. The plugin will not build otherwise. |
| .LP |
| Information on how to install and configure InfluxDB and manage databases, |
| retention policies and such is available on the official webpage. |
| .LP |
| Collected information is written from every compute node where a job runs to |
| the \fIInfluxDB\fR instance listening on the ProfileInfluxDBHost. In order to |
| avoid overloading the \fIInfluxDB\fR instance with incoming connection requests, |
| the plugin uses an internal buffer which is filled with samples. Once the buffer |
| is full, an HTTP API write request is performed and the buffer is emptied to hold |
| subsequent samples. A final request is also performed when a task ends even if |
| the buffer isn't full. |
| .LP |
| Failed HTTP API write requests are silently discarded. This means that collected |
| profile information in the plugin buffer is lost if it can't be written to the |
| \fIInfluxDB\fR database for any reason. |
| .LP |
| Plugin messages are logged along with the slurmstepd logs to SlurmdLogFile. In |
| order to troubleshoot any issues, it is recommended to temporarily increase |
| the slurmd debug level to debug3 and add Profile to the debug flags. This can |
| be accomplished by setting the slurm.conf SlurmdDebug and DebugFlags |
| respectively or dynamically through scontrol setdebug and setdebugflags. |
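| .LP |
| As a sketch, the static variant of this troubleshooting setup would look as |
| follows in slurm.conf (to be reverted once the issue has been diagnosed): |
| .RS |
| .nf |
| SlurmdDebug=debug3 |
| DebugFlags=Profile |
| .fi |
| .RE |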
| .LP |
| Grafana can be used to create charts based on the data held by InfluxDB. |
| This kind of tool permits one to create dashboards, tables and other graphics |
| using the stored time series. |
| |
| .SH acct_gather_interconnect/OFED |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherInterconnectType=acct_gather_interconnect/ofed |
| .fi |
| .RE |
| |
| Options used for acct_gather_interconnect/ofed are as follows: |
| |
| .RS |
| .TP 10 |
| \fBInfinibandOFEDPort\fR=<number> |
| This parameter is the port number of the local InfiniBand card to monitor. |
| The default port is 1. |
| .RE |
| Datasets provided by the plugin: PacketsIn, PacketsOut, InMB, OutMB |
| .RE |
| |
| .SH acct_gather_interconnect/sysfs |
| Required entry in slurm.conf: |
| .RS |
| .nf |
| AcctGatherInterconnectType=acct_gather_interconnect/sysfs |
| .fi |
| .RE |
| |
| Options used for acct_gather_interconnect/sysfs are as follows: |
| |
| .RS |
| .TP 10 |
| \fBSysfsInterfaces\fR=<interfaces> |
| Comma\-separated list of interface names to collect statistics from. Usage |
| from all listed interfaces will be summed together, and is not broken down |
| individually. |
| .RE |
| Datasets provided by the plugin: PacketsIn, PacketsOut, InMB, OutMB |
| .RE |
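| .LP |
| For example, to sum the traffic of two interfaces (the interface names are |
| hypothetical and depend on the node): |
| .RS |
| .nf |
| SysfsInterfaces=ib0,ib1 |
| .fi |
| .RE |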
| |
| .SH "EXAMPLE" |
| .nf |
| ### |
| # Slurm acct_gather configuration file |
| ### |
| # Parameters for acct_gather_energy/ipmi plugin |
| EnergyIPMIFrequency=10 |
| EnergyIPMICalcAdjustment=yes |
| # |
| # Parameters for acct_gather_profile/hdf5 plugin |
| ProfileHDF5Dir=/app/slurm/profile_data |
| # Parameters for acct_gather_interconnect/ofed plugin |
| InfinibandOFEDPort=1 |
| .fi |
| |
| .SH "COPYING" |
| Copyright (C) 2012\-2013 Bull. |
| Copyright (C) 2012\-2022 SchedMD LLC. |
| Produced at Bull (cf, DISCLAIMER). |
| .LP |
| This file is part of Slurm, a resource management program. |
| For details, see <https://slurm.schedmd.com/>. |
| .LP |
| Slurm is free software; you can redistribute it and/or modify it under |
| the terms of the GNU General Public License as published by the Free |
| Software Foundation; either version 2 of the License, or (at your option) |
| any later version. |
| .LP |
| Slurm is distributed in the hope that it will be useful, but WITHOUT ANY |
| WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS |
| FOR A PARTICULAR PURPOSE. See the GNU General Public License for more |
| details. |
| |
| .SH "SEE ALSO" |
| .LP |
| \fBslurm.conf\fR(5) |