| <!--#include virtual="header.txt"--> |
| |
| <h1>Metrics Guide</h1> |
| |
| <h2 id="contents">Contents |
| <a class="slurm_link" href="#contents"></a> |
| </h2> |
| |
| <ul> |
| <li><a href="#overview">Metrics Overview</a></li> |
| <li><a href="#configuration">Configuration</a></li> |
| <li><a href="#endpoints">HTTP Endpoints</a></li> |
| <li><a href="#openmetrics">OpenMetrics Plugin</a></li> |
| <li><a href="#categories">Metric Categories Provided by Slurm</a></li> |
| <li><a href="#security">Security Considerations</a></li> |
| <li><a href="#performance">Performance Impact</a></li> |
| <li><a href="#examples">Usage Examples</a></li> |
| </ul> |
| |
| <h2 id="overview">Metrics Overview |
| <a class="slurm_link" href="#overview"></a> |
| </h2> |
| |
| <p>Slurm 25.11 introduced a comprehensive system for collecting and exposing |
| metrics related to cluster resources, job states, and scheduler performance. |
| The metrics system exposes real-time data about various Slurm entities through |
| HTTP endpoints provided by the slurmctld daemon.</p> |
| |
| <p>The metrics feature enables integration with popular monitoring systems like |
| Prometheus, Grafana, and other observability tools.</p> |
| |
| <h2 id="configuration">Configuration |
| <a class="slurm_link" href="#configuration"></a> |
| </h2> |
| |
| <h3 id="prerequisites">Prerequisites |
| <a class="slurm_link" href="#prerequisites"></a> |
| </h3> |
| |
| <p>The metrics feature requires specific configuration in <i>slurm.conf</i>:</p> |
| |
| <ul> |
| <li><b>MetricsAuthUsers parameter</b>: Set |
| <a href="slurm.conf.html#OPT_MetricsAuthUsers">MetricsAuthUsers</a> to control |
| which users are allowed to query metrics plugin endpoints. <i>SlurmUser</i> and |
| <i>root</i> are always allowed. When set, <i>PrivateData</i> and |
| <i>MetricsParameters=ignore_private_data</i> are ignored, and only the listed |
| users may query the metrics endpoints. Setting this option enables JWT |
| authentication for querying metrics endpoints.</li> |
| <li><b>MetricsParameters parameter</b>: Set |
| <a href="slurm.conf.html#OPT_MetricsParameters">MetricsParameters</a> to |
| configure the behavior of metrics plugins. Multiple parameters may be |
| comma-separated. Currently supported parameters include: |
| <ul> |
| <li><b>ignore_private_data</b>: Set |
| <a href="slurm.conf.html#OPT_ignore_private_data"> |
| MetricsParameters=ignore_private_data</a> to make the metrics plugin ignore |
| <i>PrivateData</i>, and allow all users to query metrics endpoints without |
| authentication. This option will be ignored if <i>MetricsAuthUsers</i> is set. |
| </li> |
| </ul> |
| </li> |
| <li><b>MetricsType parameter</b>: Set the |
| <a href="slurm.conf.html#OPT_MetricsType">MetricsType</a> parameter to specify |
| which metrics plugin to use. Currently, only the OpenMetrics plugin is |
| supported: |
| <pre> |
| MetricsType=metrics/openmetrics |
| </pre> |
| </li> |
| </ul> |
| |
| <h3 id="plugin_loading">Plugin Loading |
| <a class="slurm_link" href="#plugin_loading"></a> |
| </h3> |
| |
| <p>The metrics plugin is automatically loaded by slurmctld when the |
| <i>MetricsType</i> parameter is configured.</p> |
| |
| <h2 id="endpoints">HTTP Endpoints |
| <a class="slurm_link" href="#endpoints"></a> |
| </h2> |
| |
| <p>Slurm exposes metrics through HTTP GET endpoints on the slurmctld daemon's |
| listening port (default 6817). The following endpoints are available:</p> |
| |
| <ul> |
| <li><b>GET /metrics</b> - Print available metric endpoints</li> |
| <li><b>GET /metrics/jobs</b> - Job-related metrics including counts by state, |
| resource allocation, and job statistics (<a href="#job_metrics">examples</a>) |
| </li> |
| <li><b>GET /metrics/jobs-users-accts</b> - User- and account-specific job |
| metrics (<a href="#ua_job_metrics">examples</a>)</li> |
| <li><b>GET /metrics/nodes</b> - Node-related metrics including resource counts, |
| states, and utilization (<a href="#node_metrics">examples</a>)</li> |
| <li><b>GET /metrics/partitions</b> - Partition-related metrics including job |
| counts per partition and resource allocation |
| (<a href="#partition_metrics">examples</a>)</li> |
| <li><b>GET /metrics/scheduler</b> - Scheduler performance metrics including |
| cycle times, backfill statistics, and queue lengths |
| (<a href="#scheduler_metrics">examples</a>)</li> |
| </ul> |
| |
| <p>All endpoints return data in UTF-8 text format, making them compatible with |
| Prometheus and other monitoring systems.</p> |
| |
| <h2 id="openmetrics">OpenMetrics Plugin |
| <a class="slurm_link" href="#openmetrics"></a> |
| </h2> |
| |
| <p>The OpenMetrics plugin implements the <a href="https://openmetrics.io/"> |
| OpenMetrics 1.0</a> specification, ensuring compatibility with Prometheus and |
| other monitoring systems that consume metrics in this format.</p> |
| |
| <h3 id="metric_format">Metric Format |
| <a class="slurm_link" href="#metric_format"></a> |
| </h3> |
| |
| <p>Each metric follows the OpenMetrics format with the following components:</p> |
| |
| <ul> |
| <li><b>Metric name</b>: A descriptive name prefixed with "slurm_"</li> |
| <li><b>Metric type</b>: Only "gauge" metrics are exposed</li> |
| <li><b>Metric value</b>: The actual numeric value</li> |
| <li><b>Labels</b>: Optional key-value pairs for additional context</li> |
| <li><b>Help text</b>: Human-readable description of the metric</li> |
| </ul> |
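<p>As an illustration of these components, the following minimal sketch parses
a response body into help/type metadata and samples. It handles only the simple
<code>name{labels} value</code> form shown on this page (no timestamps or
exemplars); the function name <code>parse_metrics</code>, the node name and the
value 16 in the sample text are assumptions made for the example.</p>

```python
import re

# One sample line has the form: name{label="value",...} number
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

# Illustrative input: the first three lines come from the examples on this
# page, the labeled sample uses a made-up node name and value.
SAMPLE_TEXT = """\
# HELP slurm_jobs Total number of jobs
# TYPE slurm_jobs gauge
slurm_jobs 42
slurm_node_cpus{node="node01"} 16
"""


def parse_metrics(text):
    """Return (meta, samples) parsed from an OpenMetrics-style text body.

    meta maps a metric name to {"help": ..., "type": ...};
    samples is a list of (name, labels_dict, float_value) tuples.
    """
    meta, samples = {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            parts = line.split(None, 3)  # "#", keyword, metric name, rest
            if len(parts) == 4 and parts[1] in ("HELP", "TYPE"):
                meta.setdefault(parts[2], {})[parts[1].lower()] = parts[3]
            continue
        match = SAMPLE_RE.match(line)
        if match:  # lines that do not look like samples are skipped
            labels = dict(LABEL_RE.findall(match.group("labels") or ""))
            samples.append(
                (match.group("name"), labels, float(match.group("value")))
            )
    return meta, samples


if __name__ == "__main__":
    meta, samples = parse_metrics(SAMPLE_TEXT)
    for name, labels, value in samples:
        print(name, labels, value, meta.get(name, {}).get("type"))
```

<p>For production use, an existing OpenMetrics or Prometheus client parser is
likely a better choice than a hand-written one.</p>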
| |
| <h2 id="categories">Metric Categories Provided by Slurm |
| <a class="slurm_link" href="#categories"></a> |
| </h2> |
| |
| <p>Each endpoint provides a set of metrics related to the same general category. |
| Because the metrics are numerous, not all of them are documented on this page. |
| A few examples are provided for each category in the following subsections.</p> |
| |
| <h3 id="job_metrics">Job Metrics |
| <a class="slurm_link" href="#job_metrics"></a> |
| </h3> |
| |
| <p>Job metrics provide information about job states, resource allocation, and |
| job counts. Examples include:</p> |
| |
| <ul> |
| <li><code>slurm_jobs</code> - Total number of jobs</li> |
| <li><code>slurm_jobs_running</code> - Number of running jobs</li> |
| <li><code>slurm_jobs_pending</code> - Number of pending jobs (see note)</li> |
| <li><code>slurm_jobs_cpus_alloc</code> - Total CPUs allocated to jobs</li> |
| <li><code>slurm_jobs_memory_alloc</code> - Total memory allocated to jobs</li> |
| </ul> |
| |
| <p><b>NOTE</b>: In Slurm, pending jobs include both jobs waiting for resources |
| and held jobs. Held jobs will not be scheduled until the hold is released.</p> |
| |
| <h3 id="ua_job_metrics">User- and Account-Specific Job Metrics |
| <a class="slurm_link" href="#ua_job_metrics"></a> |
| </h3> |
| |
| <p>Job metrics for users and accounts provide a count of jobs in each state for |
| each active user and account in the system. Each entity is identified by a |
| key-value label on the metric. Remember that every unique key-value pair |
| represents a new time series, which can dramatically increase the amount of |
| data stored. Examples include:</p> |
| |
| <ul> |
| <li><code>slurm_user_jobs_pending{username="john"}</code> |
| - Pending jobs for user "john"</li> |
| <li><code>slurm_account_jobs_pending{account="smith"}</code> |
| - Pending jobs for account "smith"</li> |
| </ul> |
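<p>To get a feel for how many time series such an endpoint would create, the
following sketch counts the distinct name-plus-label combinations per metric in
a response body. The function name <code>series_per_metric</code>, the second
user name and all values in the sample are assumptions made for the
example.</p>

```python
from collections import Counter

# Illustrative response body; user names and counts are made up.
SAMPLE_TEXT = """\
slurm_user_jobs_pending{username="john"} 3
slurm_user_jobs_pending{username="jane"} 1
slurm_account_jobs_pending{account="smith"} 4
"""


def series_per_metric(text):
    """Count distinct time series (metric name plus label set) per metric."""
    seen = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Everything before the last space identifies the series;
        # the part after it is the sample value.
        series, separator, _value = line.rpartition(" ")
        if separator:
            seen.add(series)
    return Counter(series.split("{", 1)[0] for series in seen)


if __name__ == "__main__":
    for name, count in sorted(series_per_metric(SAMPLE_TEXT).items()):
        print(f"{name}: {count} series")
```

<p>Each additional user or account adds one series per state metric, so the
totals grow with the number of active entities.</p>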
| |
| <h3 id="node_metrics">Node Metrics |
| <a class="slurm_link" href="#node_metrics"></a> |
| </h3> |
| |
| <p>Node metrics track resource availability, node states, and utilization. |
| Examples include:</p> |
| |
| <ul> |
| <li><code>slurm_nodes</code> - Total number of nodes</li> |
| <li><code>slurm_nodes_idle</code> - Number of idle nodes</li> |
| <li><code>slurm_nodes_alloc</code> - Number of allocated nodes</li> |
| <li><code>slurm_node_cpus{node="nodename"}</code> - CPUs on the specified node |
| </li> |
| <li><code>slurm_node_memory_bytes{node="nodename"}</code> |
| - Memory on the specified node (bytes)</li> |
| </ul> |
| |
| <h3 id="partition_metrics">Partition Metrics |
| <a class="slurm_link" href="#partition_metrics"></a> |
| </h3> |
| |
| <p>Partition metrics show job distribution and resource allocation across |
| partitions. Examples include:</p> |
| |
| <ul> |
| <li><code>slurm_partitions</code> - Total number of partitions</li> |
| <li><code>slurm_partition_jobs{partition="name"}</code> |
| - Jobs on the specified partition</li> |
| <li><code>slurm_partition_nodes{partition="name"}</code> |
| - Nodes on the specified partition</li> |
| </ul> |
| |
| <p>The following metrics might be useful for |
| <a href="https://slurm.schedmd.com/slinky.html">Slinky</a> or other systems |
| which have an auto-scale feature. Knowing the maximum number of nodes that a |
| pending job requested in a partition, such a system can consider extending the |
| partition by that many nodes. Jobs which are held are not included in these |
| metrics.</p> |
| |
| <ul> |
| <li><code>slurm_partition_jobs_max_job_nodes_nohold</code> |
| - Gives the maximum number of nodes requested by any job from all pending jobs |
| that are not held in the partition.</li> |
| <li><code>slurm_partition_jobs_min_job_nodes_nohold</code> |
| - Gives the largest minimum node count requested by any job, taken over all |
| pending jobs that are not held in the partition.</li> |
| </ul> |
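<p>One possible way an auto-scaler could use such a value is sketched below.
This is a hypothetical policy, not something Slurm or Slinky implements: the
function <code>nodes_to_add</code> and its parameters are assumptions, and the
caller is expected to obtain the idle and current node counts by its own
means.</p>

```python
def nodes_to_add(max_job_nodes_nohold, idle_nodes, current_nodes, node_limit):
    """Hypothetical scale-up rule for one partition.

    Add enough nodes for the largest pending, non-held job to fit on idle
    nodes, without growing the partition beyond node_limit nodes in total.
    """
    shortfall = max(0, max_job_nodes_nohold - idle_nodes)
    headroom = max(0, node_limit - current_nodes)
    return min(shortfall, headroom)


if __name__ == "__main__":
    # Largest pending job wants 8 nodes, 3 are idle, and the partition may
    # grow from 10 to at most 12 nodes.
    print(nodes_to_add(8, idle_nodes=3, current_nodes=10, node_limit=12))
```

<p>The cap on total nodes is a design choice that keeps a single very large
request from triggering unbounded growth.</p>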
| |
| <h3 id="scheduler_metrics">Scheduler Metrics |
| <a class="slurm_link" href="#scheduler_metrics"></a> |
| </h3> |
| |
| <p>Scheduler metrics provide insights into scheduling performance and behavior. |
| Examples include:</p> |
| |
| <ul> |
| <li><code>slurm_sched_cycle_cnt</code> - Scheduling cycle count</li> |
| <li><code>slurm_sched_cycle_last</code> - Last scheduling cycle time</li> |
| <li><code>slurm_bf_cycle_cnt</code> - Backfill cycle count</li> |
| <li><code>slurm_bf_active</code> |
| - Whether the backfill scheduler is currently running</li> |
| </ul> |
| |
| <h2 id="security">Security Considerations |
| <a class="slurm_link" href="#security"></a> |
| </h2> |
| |
| <p>The metrics system has several important security implications:</p> |
| |
| <ul> |
| <li><b>Authentication</b>: By default, metrics endpoints do not require |
| authentication. They are accessible either to everyone or to no one (with the |
| exceptions of <i>SlurmUser</i> and <i>root</i>), depending on whether |
| <i>PrivateData</i> is set. They can be made globally accessible when |
| <i>PrivateData</i> is set by using |
| <a href="slurm.conf.html#OPT_ignore_private_data"> |
| MetricsParameters=ignore_private_data</a>. |
| However, if <a href="slurm.conf.html#OPT_MetricsAuthUsers">MetricsAuthUsers</a> |
| is used, the values of both <i>PrivateData</i> and |
| <i>MetricsParameters=ignore_private_data</i> are ignored, and only the users in |
| this list (plus <i>SlurmUser</i> and <i>root</i>) are allowed to query metrics |
| endpoints. When global access is not enabled, the metrics endpoints require |
| authentication, and the user must provide a JWT token with each query.</li> |
| |
| <li><b>Network access</b>: Metrics are exposed through the slurmctld network |
| interface. Consider firewall rules and network segmentation to further control |
| access.</li> |
| |
| <li><b>Information disclosure</b>: Metrics may reveal information about cluster |
| utilization, job patterns, and user activity that could be considered sensitive |
| in some environments.</li> |
| </ul> |
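<p>The interaction between these options can be summarized with a small
decision function. This is a simplified model of the rules described above,
not Slurm's implementation: the function name <code>metrics_access</code>, its
return values, the default <i>SlurmUser</i> name "slurm" and the treatment of
<i>PrivateData</i> as a simple on/off flag are assumptions made for the
example.</p>

```python
def metrics_access(user, slurm_user="slurm", auth_users=None,
                   private_data_set=False, ignore_private_data=False):
    """Return "open", "token", or "denied" for a user querying metrics.

    "open"   - global access, no authentication needed
    "token"  - access allowed, a JWT token must be provided
    "denied" - the user may not query the metrics endpoints
    """
    privileged = user in ("root", slurm_user)
    if auth_users:
        # MetricsAuthUsers overrides PrivateData and ignore_private_data.
        return "token" if privileged or user in auth_users else "denied"
    if private_data_set and not ignore_private_data:
        # Only SlurmUser and root may query, and they must authenticate.
        return "token" if privileged else "denied"
    # No restriction configured: everyone may query without authentication.
    return "open"


if __name__ == "__main__":
    print(metrics_access("alice"))
    print(metrics_access("alice", private_data_set=True))
    print(metrics_access("alice", auth_users={"alice"}, ignore_private_data=True))
```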
| |
| <h2 id="performance">Performance Impact |
| <a class="slurm_link" href="#performance"></a> |
| </h2> |
| |
| <p>Metrics collection and exposition can impact slurmctld performance:</p> |
| |
| <ul> |
| <li><b>Lock contention</b>: Querying metrics requires acquiring various locks |
| within slurmctld, which can impact scheduler performance during high-frequency |
| queries.</li> |
| |
| <li><b>Data collection overhead</b>: Metrics are collected in real-time from |
| slurmctld's internal data structures, which adds computational overhead.</li> |
| |
| <li><b>Data processing overhead in the external monitoring system</b>: Some |
| endpoints expose metrics for unbounded entities such as users and accounts. |
| A monitoring system may treat each entity as a new time series, which can |
| dramatically increase the amount of data stored.</li> |
| |
| <li><b>Network I/O</b>: Frequent metric queries generate network traffic and |
| consume slurmctld's network I/O capacity, especially on systems with thousands |
| of jobs, users or accounts.</li> |
| </ul> |
| |
| <p>To minimize performance impact:</p> |
| |
| <ul> |
| <li>Configure appropriate scrape intervals in monitoring systems |
| (e.g., 60-120 seconds)</li> |
| <li>Use caching mechanisms in monitoring systems when possible</li> |
| <li>Monitor slurmctld performance when enabling metrics</li> |
| <li>Avoid storing data from unbounded metric endpoints like |
| /metrics/jobs-users-accts in your monitoring system</li> |
| </ul> |
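<p>For custom tooling that polls the endpoints directly, a small time-based
cache is one way to keep several consumers from each querying slurmctld. The
sketch below is a minimal example under that assumption; the class name
<code>CachedMetrics</code> and its parameters are hypothetical, and the fetch
function is supplied by the caller.</p>

```python
import time


class CachedMetrics:
    """Serve a cached response body and refetch only after ttl seconds."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch      # callable returning the metrics text
        self._ttl = ttl          # seconds a fetched body stays valid
        self._clock = clock      # injectable for testing
        self._body = None
        self._expires = 0.0

    def get(self):
        now = self._clock()
        if self._body is None or now >= self._expires:
            # Cache is empty or stale: query the source once and remember it.
            self._body = self._fetch()
            self._expires = now + self._ttl
        return self._body


if __name__ == "__main__":
    cache = CachedMetrics(lambda: "slurm_jobs 42\n", ttl=60.0)
    print(cache.get(), end="")  # first call fetches
    print(cache.get(), end="")  # second call is served from the cache
```

<p>A time-to-live equal to the scrape interval means slurmctld is queried at
most once per interval, regardless of how many readers call
<code>get()</code>.</p>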
| |
| <h2 id="examples">Usage Examples |
| <a class="slurm_link" href="#examples"></a> |
| </h2> |
| |
| <h3 id="basic_curl">Basic curl Examples |
| <a class="slurm_link" href="#basic_curl"></a> |
| </h3> |
| |
| <p>Query job metrics:</p> |
| |
| <pre> |
| $ curl http://slurmctld.example.com:6817/metrics/jobs |
| # HELP slurm_jobs Total number of jobs |
| # TYPE slurm_jobs gauge |
| slurm_jobs 42 |
| # HELP slurm_jobs_running Number of jobs in Running state |
| # TYPE slurm_jobs_running gauge |
| slurm_jobs_running 15 |
| # HELP slurm_jobs_pending Number of jobs in Pending state |
| # TYPE slurm_jobs_pending gauge |
| slurm_jobs_pending 27 |
| ... |
| </pre> |
| |
| <p>Query node metrics:</p> |
| |
| <pre> |
| $ curl http://slurmctld.example.com:6817/metrics/nodes |
| # HELP slurm_nodes Total number of nodes |
| # TYPE slurm_nodes gauge |
| slurm_nodes 100 |
| # HELP slurm_nodes_idle Number of nodes in Idle state |
| # TYPE slurm_nodes_idle gauge |
| slurm_nodes_idle 85 |
| # HELP slurm_nodes_alloc Number of nodes in Allocated state |
| # TYPE slurm_nodes_alloc gauge |
| slurm_nodes_alloc 15 |
| ... |
| </pre> |
| |
| <p>When <i>MetricsAuthUsers</i> is configured, or <i>PrivateData</i> is |
| configured without <i>MetricsParameters=ignore_private_data</i>, global access |
| is restricted. A JWT token must be provided in these cases:</p> |
| |
| <pre> |
| $ curl -H "X-SLURM-USER-TOKEN:$SLURM_JWT" \ |
| http://slurmctld.example.com:6817/metrics/<i>endpoint</i> |
| </pre> |
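<p>The same queries can be issued from a script. The following sketch uses only
the Python standard library and mirrors the curl commands above; the function
names <code>build_metrics_request</code> and <code>fetch_metrics</code> are
assumptions made for the example, and the host name is the placeholder used
elsewhere on this page.</p>

```python
import os
import urllib.request


def build_metrics_request(host, endpoint, token=None, port=6817):
    """Build a GET request for http://host:port/metrics/<endpoint>."""
    url = f"http://{host}:{port}/metrics/{endpoint}"
    request = urllib.request.Request(url, method="GET")
    if token:
        # Same header as in the curl example; HTTP header names are
        # case-insensitive, so urllib's own capitalization does not matter.
        request.add_header("X-SLURM-USER-TOKEN", token)
    return request


def fetch_metrics(host, endpoint, token=None, port=6817, timeout=10):
    """Send the request and return the response body as text."""
    request = build_metrics_request(host, endpoint, token, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # The token is optional; it is read from SLURM_JWT when present.
    body = fetch_metrics("slurmctld.example.com", "jobs",
                         token=os.environ.get("SLURM_JWT"))
    print(body)
```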
| |
| <h3 id="prometheus_config">Prometheus Configuration |
| <a class="slurm_link" href="#prometheus_config"></a> |
| </h3> |
| |
| <p>Configure Prometheus to scrape Slurm metrics by adding the following to your |
| <b>prometheus.yml</b>:</p> |
| |
| <pre> |
| scrape_configs: |
|   - job_name: 'slurm_jobs' |
|     static_configs: |
|       - targets: ['slurm.example.com:6817'] |
|     metrics_path: '/metrics/jobs' |
| |
|   - job_name: 'slurm_nodes' |
|     static_configs: |
|       - targets: ['slurm.example.com:6817'] |
|     metrics_path: '/metrics/nodes' |
| |
|   - job_name: 'slurm_partitions' |
|     static_configs: |
|       - targets: ['slurm.example.com:6817'] |
|     metrics_path: '/metrics/partitions' |
| |
|   - job_name: 'slurm_scheduler' |
|     static_configs: |
|       - targets: ['slurm.example.com:6817'] |
|     metrics_path: '/metrics/scheduler' |
| |
|   - job_name: 'slurm_useracct' |
|     static_configs: |
|       - targets: ['slurm.example.com:6817'] |
|     metrics_path: '/metrics/jobs-users-accts' |
| </pre> |
| |
| <p>If <i>MetricsAuthUsers</i> is configured, or <i>PrivateData</i> is configured |
| without <i>MetricsParameters=ignore_private_data</i>, global access is |
| restricted. Add a JWT token for a user that is allowed to query the metrics |
| endpoints. For example:</p> |
| |
| <pre> |
| scrape_configs: |
|   - job_name: 'slurm_jobs' |
|     static_configs: |
|       - targets: ['slurm.example.com:6817'] |
|     metrics_path: '/metrics/jobs' |
|     authorization: |
|       type: Bearer |
|       credentials: '&lt;JWT_TOKEN&gt;' |
| </pre> |
| |
| <!--#include virtual="footer.txt"--> |