<!--#include virtual="header.txt"-->
<h1>Metrics Guide</h1>
<h2 id="contents">Contents
<a class="slurm_link" href="#contents"></a>
</h2>
<ul>
<li><a href="#overview">Metrics Overview</a></li>
<li><a href="#configuration">Configuration</a></li>
<li><a href="#endpoints">HTTP Endpoints</a></li>
<li><a href="#openmetrics">OpenMetrics Plugin</a></li>
<li><a href="#categories">Metric Categories Provided by Slurm</a></li>
<li><a href="#security">Security Considerations</a></li>
<li><a href="#performance">Performance Impact</a></li>
<li><a href="#examples">Usage Examples</a></li>
</ul>
<h2 id="overview">Metrics Overview
<a class="slurm_link" href="#overview"></a>
</h2>
<p>Slurm 25.11 introduced a comprehensive system for collecting and exposing
metrics related to cluster resources, job states, and scheduler performance.
The metrics system exposes real-time data about various Slurm entities through
HTTP endpoints provided by the slurmctld daemon.</p>
<p>The metrics feature enables integration with popular monitoring systems like
Prometheus, Grafana, and other observability tools.</p>
<h2 id="configuration">Configuration
<a class="slurm_link" href="#configuration"></a>
</h2>
<h3 id="prerequisites">Prerequisites
<a class="slurm_link" href="#prerequisites"></a>
</h3>
<p>The metrics feature requires specific configuration in <i>slurm.conf</i>:</p>
<ul>
<li><b>MetricsAuthUsers parameter</b>: Set
<a href="slurm.conf.html#OPT_MetricsAuthUsers">MetricsAuthUsers</a> to control
which users are allowed to query metrics plugin endpoints. <i>SlurmUser</i> and
<i>root</i> are always allowed. When this parameter is set, <i>PrivateData</i>
and <i>MetricsParameters=ignore_private_data</i> are ignored, and the users in
the list are allowed to query the metrics endpoints. Setting this option
enables JWT authentication for querying metrics endpoints.</li>
<li><b>MetricsParameters parameter</b>: Set
<a href="slurm.conf.html#OPT_MetricsParameters">MetricsParameters</a> to
configure the behavior of metrics plugins. Multiple parameters may be comma
separated. Currently supported parameters include:
<ul>
<li><b>ignore_private_data</b>: Set
<a href="slurm.conf.html#OPT_ignore_private_data">
MetricsParameters=ignore_private_data</a> to make the metrics plugin ignore
<i>PrivateData</i>, and allow all users to query metrics endpoints without
authentication. This option will be ignored if <i>MetricsAuthUsers</i> is set.
</li>
</ul>
</li>
<li><b>MetricsType parameter</b>: Set the
<a href="slurm.conf.html#OPT_MetricsType">MetricsType</a> parameter to specify
which metrics plugin to use. Currently, only the OpenMetrics plugin is
supported:
<pre>
MetricsType=metrics/openmetrics
</pre>
</li>
</ul>
<h3 id="plugin_loading">Plugin Loading
<a class="slurm_link" href="#plugin_loading"></a>
</h3>
<p>The metrics plugin is automatically loaded by slurmctld when the
<i>MetricsType</i> parameter is configured.</p>
<h2 id="endpoints">HTTP Endpoints
<a class="slurm_link" href="#endpoints"></a>
</h2>
<p>Slurm exposes metrics through HTTP GET endpoints on the slurmctld daemon's
listening port (default 6817). The following endpoints are available:</p>
<ul>
<li><b>GET /metrics</b> - Print available metric endpoints</li>
<li><b>GET /metrics/jobs</b> - Job-related metrics including counts by state,
resource allocation, and job statistics (<a href="#job_metrics">examples</a>)
</li>
<li><b>GET /metrics/jobs-users-accts</b> - User- and account-specific job
metrics (<a href="#ua_job_metrics">examples</a>)</li>
<li><b>GET /metrics/nodes</b> - Node-related metrics including resource counts,
states, and utilization (<a href="#node_metrics">examples</a>)</li>
<li><b>GET /metrics/partitions</b> - Partition-related metrics including job
counts per partition and resource allocation
(<a href="#partition_metrics">examples</a>)</li>
<li><b>GET /metrics/scheduler</b> - Scheduler performance metrics including
cycle times, backfill statistics, and queue lengths
(<a href="#scheduler_metrics">examples</a>)</li>
</ul>
<p>All endpoints return data in UTF-8 text format, making them compatible with
Prometheus and other monitoring systems.</p>
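<p>As a minimal sketch of how a client might address these endpoints, the
following Python builds the URL for a category and issues a plain HTTP GET.
The function names <code>metrics_url</code> and <code>fetch_metrics</code> are
hypothetical helpers, not part of Slurm, and the sketch assumes unauthenticated
global access on the default port.</p>

```python
from urllib.parse import urlunsplit
from urllib.request import urlopen

# Endpoint categories listed above; "" stands for the /metrics index itself.
CATEGORIES = ("", "jobs", "jobs-users-accts", "nodes", "partitions", "scheduler")


def metrics_url(host, category="", port=6817):
    """Build the URL for one metrics endpoint on slurmctld (hypothetical helper)."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown metrics category: {category!r}")
    path = "/metrics" + (f"/{category}" if category else "")
    return urlunsplit(("http", f"{host}:{port}", path, "", ""))


def fetch_metrics(host, category="", port=6817, timeout=10):
    """Issue an HTTP GET and return the response body decoded as UTF-8 text."""
    with urlopen(metrics_url(host, category, port), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

<p>For example, <code>fetch_metrics("slurmctld.example.com", "jobs")</code>
would request the job metrics endpoint.</p>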
<h2 id="openmetrics">OpenMetrics Plugin
<a class="slurm_link" href="#openmetrics"></a>
</h2>
<p>The OpenMetrics plugin implements the <a href="https://openmetrics.io/">
OpenMetrics 1.0</a> specification, ensuring compatibility with Prometheus and
other monitoring systems that consume metrics in this format.</p>
<h3 id="metric_format">Metric Format
<a class="slurm_link" href="#metric_format"></a>
</h3>
<p>Each metric follows the OpenMetrics format with the following components:</p>
<ul>
<li><b>Metric name</b>: A descriptive name prefixed with "slurm_"</li>
<li><b>Metric type</b>: Only "gauge" metrics are exposed</li>
<li><b>Metric value</b>: The actual numeric value</li>
<li><b>Labels</b>: Optional key-value pairs for additional context</li>
<li><b>Help text</b>: Human-readable description of the metric</li>
</ul>
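<p>To illustrate how these components appear on the wire, here is a simplified
Python sketch that splits exposition text into samples. It is not a complete
OpenMetrics parser: it assumes one sample per line, ignores timestamps and
exemplars, and does not unescape label values. The name
<code>parse_metrics</code> is hypothetical.</p>

```python
import re

# Metric name, optional {labels}, then the value.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text):
    """Parse exposition text into a list of sample dicts (simplified sketch)."""
    help_text, types, samples = {}, {}, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "# EOF":
            continue
        if line.startswith("# HELP "):
            name, _, desc = line[len("# HELP "):].partition(" ")
            help_text[name] = desc
        elif line.startswith("# TYPE "):
            name, _, mtype = line[len("# TYPE "):].partition(" ")
            types[name] = mtype
        elif not line.startswith("#"):
            match = _SAMPLE.match(line)
            if match is None:
                continue  # skip lines this simplified parser does not understand
            name, raw_labels, value = match.groups()
            samples.append({
                "name": name,
                "labels": dict(_LABEL.findall(raw_labels or "")),
                "value": float(value),
                "type": types.get(name),
                "help": help_text.get(name),
            })
    return samples
```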
<h2 id="categories">Metric Categories Provided by Slurm
<a class="slurm_link" href="#categories"></a>
</h2>
<p>Each endpoint provides a set of metrics related to the same general category.
Numerous metrics are provided so they are not all documented on this page.
A few examples are provided for each category in the following subsections.</p>
<h3 id="job_metrics">Job Metrics
<a class="slurm_link" href="#job_metrics"></a>
</h3>
<p>Job metrics provide information about job states, resource allocation, and
job counts. Examples include:</p>
<ul>
<li><code>slurm_jobs</code> - Total number of jobs</li>
<li><code>slurm_jobs_running</code> - Number of running jobs</li>
<li><code>slurm_jobs_pending</code> - Number of pending jobs (see note)</li>
<li><code>slurm_jobs_cpus_alloc</code> - Total CPUs allocated to jobs</li>
<li><code>slurm_jobs_memory_alloc</code> - Total memory allocated to jobs</li>
</ul>
<p><b>NOTE</b>: In Slurm, pending jobs include both jobs waiting for resources
and held jobs. Held jobs will not be scheduled until the hold is released.</p>
<h3 id="ua_job_metrics">User- and Account-Specific Job Metrics
<a class="slurm_link" href="#ua_job_metrics"></a>
</h3>
<p>Job metrics for users and accounts provide a count of jobs in each state for
each active user and account in the system. Each entity is identified by a
key-value pair. Remember that every unique key-value pair represents a new time
series, which can dramatically increase the amount of data stored.
Examples include:</p>
<ul>
<li><code>slurm_user_jobs_pending{username="john"}</code>
- Pending jobs for user "john"</li>
<li><code>slurm_account_jobs_pending{account="smith"}</code>
- Pending jobs for account "smith"</li>
</ul>
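<p>One way to gauge how many time series a scrape would create is to count the
distinct label sets per metric name. The sketch below is an illustration only:
<code>series_per_metric</code> is a hypothetical helper, and it assumes each
sample line ends with a single value and carries no timestamp.</p>

```python
from collections import Counter


def series_per_metric(text):
    """Count distinct label sets per metric name in exposition text."""
    seen = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Everything before the last space identifies the series (name plus labels).
        series, _, _value = line.rpartition(" ")
        if series:
            seen.add(series)
    # Strip the label part to group the distinct series by metric name.
    return Counter(series.split("{", 1)[0] for series in seen)
```

<p>A count that grows with the number of users or accounts indicates a metric
whose storage cost in the monitoring system is unbounded.</p>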
<h3 id="node_metrics">Node Metrics
<a class="slurm_link" href="#node_metrics"></a>
</h3>
<p>Node metrics track resource availability, node states, and utilization.
Examples include:</p>
<ul>
<li><code>slurm_nodes</code> - Total number of nodes</li>
<li><code>slurm_nodes_idle</code> - Number of idle nodes</li>
<li><code>slurm_nodes_alloc</code> - Number of allocated nodes</li>
<li><code>slurm_node_cpus{node="nodename"}</code> - CPUs on the specified node
</li>
<li><code>slurm_node_memory_bytes{node="nodename"}</code>
- Memory on the specified node (bytes)</li>
</ul>
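<p>As a small illustration of deriving utilization from these gauges, the
following sketch computes the fraction of allocated nodes. It assumes the
values of <code>slurm_nodes</code> and <code>slurm_nodes_alloc</code> have
already been read into a mapping; <code>allocated_fraction</code> is a
hypothetical helper.</p>

```python
def allocated_fraction(values):
    """Return allocated nodes divided by total nodes, given {metric_name: value}."""
    total = values["slurm_nodes"]
    if total == 0:
        return 0.0  # avoid dividing by zero on an empty cluster
    return values["slurm_nodes_alloc"] / total
```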
<h3 id="partition_metrics">Partition Metrics
<a class="slurm_link" href="#partition_metrics"></a>
</h3>
<p>Partition metrics show job distribution and resource allocation across
partitions. Examples include:</p>
<ul>
<li><code>slurm_partitions</code> - Total number of partitions</li>
<li><code>slurm_partition_jobs{partition="name"}</code>
- Jobs on the specified partition</li>
<li><code>slurm_partition_nodes{partition="name"}</code>
- Nodes on the specified partition</li>
</ul>
<p>The following metrics might be useful for
<a href="https://slurm.schedmd.com/slinky.html">Slinky</a> or other systems
which have an auto-scale feature. By knowing the maximum number of nodes that a
job requested in a partition, the decision to extend the nodes of the partition
by this number can be considered. Jobs which are held are not included in these
metrics.</p>
<ul>
<li><code>slurm_partition_jobs_max_job_nodes_nohold</code>
- Gives the maximum number of nodes requested by any job from all pending jobs
that are not held in the partition.</li>
<li><code>slurm_partition_jobs_min_job_nodes_nohold</code>
- Gives the maximum of the minimum number of nodes requested by any job from all
pending jobs that are not held in the partition.</li>
</ul>
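<p>The auto-scale reasoning above can be sketched as a small policy function.
This is one possible policy, not the behavior of Slinky or of Slurm: it assumes
the caller already knows how many nodes are idle in the partition and imposes
its own growth limit. The name <code>nodes_to_add</code> and all of its
parameters are hypothetical.</p>

```python
def nodes_to_add(max_job_nodes_nohold, idle_nodes, max_growth):
    """Suggest how many nodes to add so the largest non-held pending job could fit."""
    # Nodes the largest pending job needs beyond what is currently idle.
    shortfall = max(0, max_job_nodes_nohold - idle_nodes)
    # Never grow by more than the site-defined limit.
    return min(shortfall, max_growth)
```

<p>For instance, with a largest pending request of 8 nodes, 3 idle nodes, and a
growth limit of 10, this policy would suggest adding 5 nodes.</p>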
<h3 id="scheduler_metrics">Scheduler Metrics
<a class="slurm_link" href="#scheduler_metrics"></a>
</h3>
<p>Scheduler metrics provide insights into scheduling performance and behavior.
Examples include:</p>
<ul>
<li><code>slurm_sched_cycle_cnt</code> - Scheduling cycle count</li>
<li><code>slurm_sched_cycle_last</code> - Last scheduling cycle time</li>
<li><code>slurm_bf_cycle_cnt</code> - Backfill cycle count</li>
<li><code>slurm_bf_active</code>
- Whether the backfill scheduler is currently running</li>
</ul>
<h2 id="security">Security Considerations
<a class="slurm_link" href="#security"></a>
</h2>
<p>The metrics system has several important security implications:</p>
<ul>
<li><b>Authentication</b>: By default, metrics endpoints do not require
authentication. They are either accessible to everyone or to no one (with the
exceptions of <i>SlurmUser</i> and <i>root</i>), depending on whether
<i>PrivateData</i> is set. The endpoints can be made globally accessible when
<i>PrivateData</i> is set by using
<a href="slurm.conf.html#OPT_ignore_private_data">
MetricsParameters=ignore_private_data</a>.
However, if <a href="slurm.conf.html#OPT_MetricsAuthUsers">MetricsAuthUsers</a>
is used, the values of both <i>PrivateData</i> and
<i>MetricsParameters=ignore_private_data</i> are ignored, and only the users in
this list (plus <i>SlurmUser</i> and <i>root</i>) are allowed to query metrics
endpoints. When global access is not enabled, the metrics endpoints require
authentication, and the user must provide a JWT token with each query.</li>
<li><b>Network access</b>: Metrics are exposed through the slurmctld network
interface. Consider firewall rules and network segmentation to further control
access.</li>
<li><b>Information disclosure</b>: Metrics may reveal information about cluster
utilization, job patterns, and user activity that could be considered sensitive
in some environments.</li>
</ul>
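<p>The access rules described above can be summarized as a small decision
function. This is only a model of the documented behavior, written for
illustration; it is not Slurm's implementation, and the function name and
parameters are hypothetical.</p>

```python
def may_query_metrics(user, *, slurm_user="slurm", metrics_auth_users=(),
                      private_data_set=False, ignore_private_data=False):
    """Return True if `user` may query the metrics endpoints (illustrative model).

    `user` is the authenticated user name, or None for an anonymous request.
    """
    if user in ("root", slurm_user):
        return True  # SlurmUser and root are always allowed
    if metrics_auth_users:
        # The list takes precedence over PrivateData and ignore_private_data.
        return user in metrics_auth_users
    # No list configured: access is global unless PrivateData restricts it.
    return (not private_data_set) or ignore_private_data
```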
<h2 id="performance">Performance Impact
<a class="slurm_link" href="#performance"></a>
</h2>
<p>Metrics collection and exposition can impact slurmctld performance:</p>
<ul>
<li><b>Lock contention</b>: Querying metrics requires acquiring various locks
within slurmctld, which can impact scheduler performance during high-frequency
queries.</li>
<li><b>Data collection overhead</b>: Metrics are collected in real-time from
slurmctld's internal data structures, which adds computational overhead.</li>
<li><b>Data processing overhead in the external monitoring system</b>: Some
endpoints report metrics for unbounded entities such as users and accounts.
A monitoring system may treat each entity as a new time series, which can
dramatically increase the amount of data stored.</li>
<li><b>Network I/O</b>: Frequent metric queries generate network traffic and
consume slurmctld's network I/O capacity, especially on systems with thousands
of jobs, users or accounts.</li>
</ul>
<p>To minimize performance impact:</p>
<ul>
<li>Configure appropriate scrape intervals in monitoring systems
(e.g., 60-120 seconds)</li>
<li>Use caching mechanisms in monitoring systems when possible</li>
<li>Monitor slurmctld performance when enabling metrics</li>
<li>Avoid storing data from unbounded metric endpoints like
/metrics/jobs-users-accts in your monitoring system</li>
</ul>
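<p>The caching recommendation can be sketched on the client side as a
time-to-live cache placed in front of whatever function queries slurmctld. The
class below is a hypothetical illustration, not a Slurm or Prometheus feature;
the 60 second default mirrors the lower end of the scrape interval suggested
above.</p>

```python
import time


class CachedFetcher:
    """Serve a cached result until it is older than `ttl` seconds."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch    # zero-argument callable that queries slurmctld
        self._ttl = ttl
        self._clock = clock    # injectable so the cache can be exercised in tests
        self._value = None
        self._expires = None   # None means nothing has been cached yet

    def get(self):
        """Return the cached value, refreshing it only when it has expired."""
        now = self._clock()
        if self._expires is None or now >= self._expires:
            self._value = self._fetch()
            self._expires = now + self._ttl
        return self._value
```

<p>With such a wrapper, several dashboards or scripts polling within the same
interval would cause at most one query to slurmctld.</p>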
<h2 id="examples">Usage Examples
<a class="slurm_link" href="#examples"></a>
</h2>
<h3 id="basic_curl">Basic curl Examples
<a class="slurm_link" href="#basic_curl"></a>
</h3>
<p>Query job metrics:</p>
<pre>
$ curl http://slurmctld.example.com:6817/metrics/jobs
# HELP slurm_jobs Total number of jobs
# TYPE slurm_jobs gauge
slurm_jobs 42
# HELP slurm_jobs_running Number of jobs in Running state
# TYPE slurm_jobs_running gauge
slurm_jobs_running 15
# HELP slurm_jobs_pending Number of jobs in Pending state
# TYPE slurm_jobs_pending gauge
slurm_jobs_pending 27
...
</pre>
<p>Query node metrics:</p>
<pre>
$ curl http://slurmctld.example.com:6817/metrics/nodes
# HELP slurm_nodes Total number of nodes
# TYPE slurm_nodes gauge
slurm_nodes 100
# HELP slurm_nodes_idle Number of nodes in Idle state
# TYPE slurm_nodes_idle gauge
slurm_nodes_idle 85
# HELP slurm_nodes_alloc Number of nodes in Allocated state
# TYPE slurm_nodes_alloc gauge
slurm_nodes_alloc 15
...
</pre>
<p>When <i>MetricsAuthUsers</i> is configured, or <i>PrivateData</i> is
configured without <i>MetricsParameters=ignore_private_data</i>, global access
is restricted. A JWT token must be provided in these cases:</p>
<pre>
$ curl -H "X-SLURM-USER-TOKEN:$SLURM_JWT" \
http://slurmctld.example.com:6817/metrics/<i>endpoint</i>
</pre>
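<p>The same authenticated query can be sketched in Python with the standard
library. The header name and the <i>SLURM_JWT</i> environment variable are
taken from the curl example above; the helper names are hypothetical.</p>

```python
import os
from urllib.request import Request, urlopen


def authenticated_request(url, token):
    """Build a GET request carrying the JWT in the X-SLURM-USER-TOKEN header."""
    return Request(url, headers={"X-SLURM-USER-TOKEN": token}, method="GET")


def fetch_with_token(url, token=None, timeout=10):
    """Fetch a metrics endpoint, reading the token from SLURM_JWT by default."""
    if token is None:
        token = os.environ["SLURM_JWT"]
    with urlopen(authenticated_request(url, token), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

<p>Note that <code>urllib</code> normalizes the capitalization of header names
before sending them; HTTP header names are case-insensitive, so this does not
change the meaning of the request.</p>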
<h3 id="prometheus_config">Prometheus Configuration
<a class="slurm_link" href="#prometheus_config"></a>
</h3>
<p>Configure Prometheus to scrape Slurm metrics by adding the following to your
<b>prometheus.yml</b>:</p>
<pre>
scrape_configs:
  - job_name: 'slurm_jobs'
    static_configs:
      - targets: ['slurm.example.com:6817']
    metrics_path: '/metrics/jobs'
  - job_name: 'slurm_nodes'
    static_configs:
      - targets: ['slurm.example.com:6817']
    metrics_path: '/metrics/nodes'
  - job_name: 'slurm_partitions'
    static_configs:
      - targets: ['slurm.example.com:6817']
    metrics_path: '/metrics/partitions'
  - job_name: 'slurm_scheduler'
    static_configs:
      - targets: ['slurm.example.com:6817']
    metrics_path: '/metrics/scheduler'
  - job_name: 'slurm_useracct'
    static_configs:
      - targets: ['slurm.example.com:6817']
    metrics_path: '/metrics/jobs-users-accts'
</pre>
<p>If <i>MetricsAuthUsers</i> is configured, or <i>PrivateData</i> is
configured without <i>MetricsParameters=ignore_private_data</i>, global access
is restricted. Add a JWT token for a user who is allowed to query the metrics
endpoints. For example:</p>
<pre>
scrape_configs:
  - job_name: 'slurm_jobs'
    static_configs:
      - targets: ['slurm.example.com:6817']
    metrics_path: '/metrics/jobs'
    authorization:
      type: Bearer
      credentials: '&lt;JWT_TOKEN&gt;'
</pre>
<!--#include virtual="footer.txt"-->