
 <!--#include virtual="header.txt"-->

 <h1>Metrics Guide</h1>

 <h2 id="contents">Contents
 <a class="slurm_link" href="#contents"></a>
 </h2>

 <ul>
 <li><a href="#overview">Metrics Overview</a></li>
 <li><a href="#configuration">Configuration</a></li>
 <li><a href="#endpoints">HTTP Endpoints</a></li>
 <li><a href="#openmetrics">OpenMetrics Plugin</a></li>
 <li><a href="#categories">Metric Categories Provided by Slurm</a></li>
 <li><a href="#security">Security Considerations</a></li>
 <li><a href="#performance">Performance Impact</a></li>
 <li><a href="#examples">Usage Examples</a></li>
 </ul>

 <h2 id="overview">Metrics Overview
 <a class="slurm_link" href="#overview"></a>
 </h2>

 <p>Slurm 25.11 introduced a comprehensive system for collecting and exposing
 metrics related to cluster resources, job states, and scheduler performance.
 The metrics system exposes real-time data about various Slurm entities through
 HTTP endpoints provided by the slurmctld daemon.</p>

 <p>The metrics feature enables integration with popular monitoring systems like
 Prometheus, Grafana, and other observability tools.</p>

 <h2 id="configuration">Configuration
 <a class="slurm_link" href="#configuration"></a>
 </h2>

 <h3 id="prerequisites">Prerequisites
 <a class="slurm_link" href="#prerequisites"></a>
 </h3>

 <p>The metrics feature requires specific configuration in <i>slurm.conf</i>:</p>

 <ul>
 <li><b>MetricsAuthUsers parameter</b>: Set
 <a href="slurm.conf.html#OPT_MetricsAuthUsers">MetricsAuthUsers</a> to control
 which users are allowed to query metrics plugin endpoints. <i>SlurmUser</i> and
 <i>root</i> are always allowed. When this option is set, <i>PrivateData</i> and
 <i>MetricsParameters=ignore_private_data</i> are both ignored, and only the
 listed users (plus <i>SlurmUser</i> and <i>root</i>) may query the metrics
 endpoints. Setting this option also enables JWT authentication for querying
 metrics endpoints.</li>
 <li><b>MetricsParameters parameter</b>: Set
 <a href="slurm.conf.html#OPT_MetricsParameters">MetricsParameters</a> to
 configure the behavior of metrics plugins. Multiple parameters may be comma
 separated. Currently supported parameters include:
 <ul>
 <li><b>ignore_private_data</b>: Set
 <a href="slurm.conf.html#OPT_ignore_private_data">
 MetricsParameters=ignore_private_data</a> to make the metrics plugin ignore
 <i>PrivateData</i>, and allow all users to query metrics endpoints without
 authentication. This option will be ignored if <i>MetricsAuthUsers</i> is set.
 </li>
 </ul>
 </li>
 <li><b>MetricsType parameter</b>: Set the
 <a href="slurm.conf.html#OPT_MetricsType">MetricsType</a> parameter to specify
 which metrics plugin to use. Currently, only the OpenMetrics plugin is
 supported:
 <pre>
 MetricsType=metrics/openmetrics
 </pre>
 </li>
 </ul>

 <h3 id="plugin_loading">Plugin Loading
 <a class="slurm_link" href="#plugin_loading"></a>
 </h3>

 <p>The metrics plugin is automatically loaded by slurmctld when the
 <i>MetricsType</i> parameter is configured.</p>

 <h2 id="endpoints">HTTP Endpoints
 <a class="slurm_link" href="#endpoints"></a>
 </h2>

 <p>Slurm exposes metrics through HTTP GET endpoints on the slurmctld daemon's
 listening port (default 6817). The following endpoints are available:</p>

 <ul>
 <li><b>GET /metrics</b> - Print available metric endpoints</li>
 <li><b>GET /metrics/jobs</b> - Job-related metrics including counts by state,
 resource allocation, and job statistics (<a href="#job_metrics">examples</a>)
 </li>
 <li><b>GET /metrics/jobs-users-accts</b> - User- and account-specific job
 metrics (<a href="#ua_job_metrics">examples</a>)</li>
 <li><b>GET /metrics/nodes</b> - Node-related metrics including resource counts,
 states, and utilization (<a href="#node_metrics">examples</a>)</li>
 <li><b>GET /metrics/partitions</b> - Partition-related metrics including job
 counts per partition and resource allocation
 (<a href="#partition_metrics">examples</a>)</li>
 <li><b>GET /metrics/scheduler</b> - Scheduler performance metrics including
 cycle times, backfill statistics, and queue lengths
 (<a href="#scheduler_metrics">examples</a>)</li>
 </ul>

 <p>All endpoints return data in UTF-8 text format, making them compatible with
 Prometheus and other monitoring systems.</p>

 <h2 id="openmetrics">OpenMetrics Plugin
 <a class="slurm_link" href="#openmetrics"></a>
 </h2>

 <p>The OpenMetrics plugin implements the <a href="https://openmetrics.io/">
 OpenMetrics 1.0</a> specification, ensuring compatibility with Prometheus and
 other monitoring systems that consume metrics in this format.</p>

 <h3 id="metric_format">Metric Format
 <a class="slurm_link" href="#metric_format"></a>
 </h3>

 <p>Each metric follows the OpenMetrics format with the following components:</p>

 <ul>
 <li><b>Metric name</b>: A descriptive name prefixed with "slurm_"</li>
 <li><b>Metric type</b>: Only "gauge" metrics are exposed</li>
 <li><b>Metric value</b>: The actual numeric value</li>
 <li><b>Labels</b>: Optional key-value pairs for additional context</li>
 <li><b>Help text</b>: Human-readable description of the metric</li>
 </ul>
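
 <p>As a rough illustration of how these components appear in a response, the
 following Python sketch parses exposition text such as the output shown in the
 usage examples. The name <code>parse_metrics</code> is a hypothetical helper,
 not part of Slurm, and the sketch assumes simple gauge lines only (no escaped
 quotes inside label values and no timestamps).</p>

```python
import re

# Matches a sample line: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Parse simple exposition text into a list of sample dicts.

    Hypothetical helper for illustration only; it does not handle
    escaped characters inside label values.
    """
    help_text = {}
    types = {}
    samples = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# HELP "):
            name, _, desc = line[len("# HELP "):].partition(" ")
            help_text[name] = desc
        elif line.startswith("# TYPE "):
            name, _, mtype = line[len("# TYPE "):].partition(" ")
            types[name] = mtype
        elif line.startswith("#"):
            continue  # any other comment line
        else:
            match = SAMPLE_RE.match(line)
            if match is None:
                continue  # skip lines this sketch does not understand
            name, label_str, value = match.groups()
            samples.append({
                "name": name,
                "labels": dict(LABEL_RE.findall(label_str or "")),
                "value": float(value),
                "type": types.get(name),
                "help": help_text.get(name),
            })
    return samples
```

 <p>Each returned dict carries the metric name, its labels, the numeric value,
 and the type and help text when the corresponding comment lines precede the
 sample.</p>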

 <h2 id="categories">Metric Categories Provided by Slurm
 <a class="slurm_link" href="#categories"></a>
 </h2>

 <p>Each endpoint provides a set of metrics related to the same general category.
 Because the metrics are numerous, not all of them are documented on this page.
 A few examples are provided for each category in the following subsections.</p>

 <h3 id="job_metrics">Job Metrics
 <a class="slurm_link" href="#job_metrics"></a>
 </h3>

 <p>Job metrics provide information about job states, resource allocation, and
 job counts. Examples include:</p>

 <ul>
 <li><code>slurm_jobs</code> - Total number of jobs</li>
 <li><code>slurm_jobs_running</code> - Number of running jobs</li>
 <li><code>slurm_jobs_pending</code> - Number of pending jobs (see note)</li>
 <li><code>slurm_jobs_cpus_alloc</code> - Total CPUs allocated to jobs</li>
 <li><code>slurm_jobs_memory_alloc</code> - Total memory allocated to jobs</li>
 </ul>

 <p><b>NOTE</b>: In Slurm, pending jobs include both jobs waiting for resources
 and held jobs. Held jobs will not be scheduled until the hold is released.</p>

 <h3 id="ua_job_metrics">User- and Account-Specific Job Metrics
 <a class="slurm_link" href="#ua_job_metrics"></a>
 </h3>

 <p>Job metrics for users and accounts provide a count of jobs in each state for
 each active user and account in the system. Each entity is identified by a
 key-value label. Remember that every unique key-value pair represents a new time
 series, which can dramatically increase the amount of data stored.
 Examples include:</p>

 <ul>
 <li><code>slurm_user_jobs_pending{username="john"}</code>
  - Pending jobs for user "john"</li>
 <li><code>slurm_account_jobs_pending{account="smith"}</code>
  - Pending jobs for account "smith"</li>
 </ul>
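
 <p>To get a feel for how many time series a response would create, a small
 Python sketch like the following can count the distinct label sets per metric
 name. The helper <code>series_per_metric</code> is hypothetical, and it assumes
 one sample per line in the form <code>name{labels} value</code> or
 <code>name value</code>, with no whitespace inside the label block.</p>

```python
from collections import Counter


def series_per_metric(text):
    """Count distinct series (name plus label set) per metric name.

    Hypothetical helper for illustration; it treats everything before
    the last whitespace-separated token of a line as the series identity.
    """
    seen = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        parts = line.rsplit(None, 1)
        if len(parts) != 2:
            continue  # not a "series value" line
        seen.add(parts[0])
    # The metric name is the part of the series before any "{".
    return Counter(series.partition("{")[0] for series in seen)
```

 <p>A metric whose count grows with the number of users or accounts is one whose
 storage cost in the monitoring system grows the same way.</p>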

 <h3 id="node_metrics">Node Metrics
 <a class="slurm_link" href="#node_metrics"></a>
 </h3>

 <p>Node metrics track resource availability, node states, and utilization.
 Examples include:</p>

 <ul>
 <li><code>slurm_nodes</code> - Total number of nodes</li>
 <li><code>slurm_nodes_idle</code> - Number of idle nodes</li>
 <li><code>slurm_nodes_alloc</code> - Number of allocated nodes</li>
 <li><code>slurm_node_cpus{node="nodename"}</code> - CPUs on the specified node
 </li>
 <li><code>slurm_node_memory_bytes{node="nodename"}</code>
  - Memory on the specified node (bytes)</li>
 </ul>

 <h3 id="partition_metrics">Partition Metrics
 <a class="slurm_link" href="#partition_metrics"></a>
 </h3>

 <p>Partition metrics show job distribution and resource allocation across
 partitions. Examples include:</p>

 <ul>
 <li><code>slurm_partitions</code> - Total number of partitions</li>
 <li><code>slurm_partition_jobs{partition="name"}</code>
  - Jobs on the specified partition</li>
 <li><code>slurm_partition_nodes{partition="name"}</code>
  - Nodes on the specified partition</li>
 </ul>

 <p>The following metrics might be useful for
 <a href="https://slurm.schedmd.com/slinky.html">Slinky</a> or other systems
 which have an auto-scale feature. Knowing the maximum number of nodes requested
 by a pending job in a partition, such a system can consider extending the
 partition by that number of nodes. Held jobs are not included in these
 metrics.</p>

 <ul>
 <li><code>slurm_partition_jobs_max_job_nodes_nohold</code>
  - The largest maximum node count requested by any pending job in the partition
 that is not held.</li>
 <li><code>slurm_partition_jobs_min_job_nodes_nohold</code>
  - The largest minimum node count requested by any pending job in the partition
 that is not held.</li>
 </ul>
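
 <p>As one possible way to use such a value, the Python sketch below turns it
 into a scale-up suggestion. Everything here is an assumption made for
 illustration: the function name, the idea of subtracting currently idle nodes,
 and the cap on added nodes are not defined by Slurm or Slinky.</p>

```python
def nodes_to_add(max_job_nodes_nohold, idle_nodes, max_extra_nodes):
    """Suggest how many nodes to add so the largest non-held pending job fits.

    Illustrative policy only: the three inputs and the capping rule are
    assumptions of this sketch, not behavior defined by Slurm or Slinky.
    """
    if min(max_job_nodes_nohold, idle_nodes, max_extra_nodes) < 0:
        raise ValueError("inputs must be non-negative")
    # Nodes still missing after using the ones that are already idle.
    shortfall = max(0, max_job_nodes_nohold - idle_nodes)
    # Never suggest more than the configured cap.
    return min(shortfall, max_extra_nodes)
```

 <p>A real autoscaler would likely weigh further inputs, such as how long the
 job has been pending or the cost of the additional nodes.</p>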

 <h3 id="scheduler_metrics">Scheduler Metrics
 <a class="slurm_link" href="#scheduler_metrics"></a>
 </h3>

 <p>Scheduler metrics provide insights into scheduling performance and behavior.
 Examples include:</p>

 <ul>
 <li><code>slurm_sched_cycle_cnt</code> - Scheduling cycle count</li>
 <li><code>slurm_sched_cycle_last</code> - Last scheduling cycle time</li>
 <li><code>slurm_bf_cycle_cnt</code> - Backfill cycle count</li>
 <li><code>slurm_bf_active</code>
  - Whether the backfill scheduler is currently running</li>
 </ul>

 <h2 id="security">Security Considerations
 <a class="slurm_link" href="#security"></a>
 </h2>

 <p>The metrics system has several important security implications:</p>

 <ul>
 <li><b>Authentication</b>: By default, metrics endpoints do not require
 authentication. Depending on whether <i>PrivateData</i> is set, they are
 accessible either to everyone (when it is not set) or to no one except
 <i>SlurmUser</i> and <i>root</i> (when it is set). When <i>PrivateData</i> is
 set, the endpoints can still be made globally accessible by using
 <a href="slurm.conf.html#OPT_ignore_private_data">
 MetricsParameters=ignore_private_data</a>.
 However, if <a href="slurm.conf.html#OPT_MetricsAuthUsers">MetricsAuthUsers</a>
 is set, both <i>PrivateData</i> and
 <i>MetricsParameters=ignore_private_data</i> are ignored, and only the users in
 that list (plus <i>SlurmUser</i> and <i>root</i>) are allowed to query metrics
 endpoints. Whenever global access is not enabled, the metrics endpoints
 require authentication, and the user must provide a JWT token with each
 query.</li>

 <li><b>Network access</b>: Metrics are exposed through the slurmctld network
 interface. Consider firewall rules and network segmentation to further control
 access.</li>

 <li><b>Information disclosure</b>: Metrics may reveal information about cluster
 utilization, job patterns, and user activity that could be considered sensitive
 in some environments.</li>
 </ul>
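
 <p>The access rules described under Authentication can be summarized as a small
 decision function. The Python sketch below only restates the rules from this
 guide; it is not Slurm's implementation, and the function name, its parameters,
 and the default <code>slurm_user="slurm"</code> are assumptions made for
 illustration.</p>

```python
def may_query_metrics(user, *, slurm_user="slurm", metrics_auth_users=(),
                      private_data_set=False, ignore_private_data=False):
    """Return True if `user` may query the metrics endpoints.

    `user` is None for an unauthenticated request. This restates the
    rules described in this guide; it is not Slurm's actual code.
    """
    if user is not None and user in ("root", slurm_user):
        return True  # SlurmUser and root are always allowed
    if metrics_auth_users:
        # MetricsAuthUsers overrides PrivateData and ignore_private_data.
        return user is not None and user in metrics_auth_users
    if private_data_set and not ignore_private_data:
        return False  # restricted to SlurmUser and root
    return True  # globally accessible, no authentication required
```

 <p>Writing the rules this way makes the precedence explicit:
 <i>MetricsAuthUsers</i> is checked before <i>PrivateData</i> and
 <i>ignore_private_data</i> are considered at all.</p>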

 <h2 id="performance">Performance Impact
 <a class="slurm_link" href="#performance"></a>
 </h2>

 <p>Metrics collection and exposition can impact slurmctld performance:</p>

 <ul>
 <li><b>Lock contention</b>: Querying metrics requires acquiring various locks
 within slurmctld, which can impact scheduler performance during high-frequency
 queries.</li>

 <li><b>Data collection overhead</b>: Metrics are collected in real-time from
 slurmctld's internal data structures, which adds computational overhead.</li>

 <li><b>Data processing overhead in the external monitoring system</b>: Some
 endpoints expose metrics for unbounded sets of entities, such as users and
 accounts. A monitoring system may treat each entity as a new time series, which
 can dramatically increase the amount of data stored.</li>

 <li><b>Network I/O</b>: Frequent metric queries generate network traffic and
 consume slurmctld's network I/O capacity, especially on systems with thousands
 of jobs, users or accounts.</li>
 </ul>

 <p>To minimize performance impact:</p>

 <ul>
 <li>Configure appropriate scrape intervals in monitoring systems
  (e.g., 60-120 seconds)</li>
 <li>Use caching mechanisms in monitoring systems when possible</li>
 <li>Monitor slurmctld performance when enabling metrics</li>
 <li>Avoid storing data from unbounded metric endpoints such as
 /metrics/jobs-users-accts in your monitoring system</li>
 </ul>

 <h2 id="examples">Usage Examples
 <a class="slurm_link" href="#examples"></a>
 </h2>

 <h3 id="basic_curl">Basic curl Examples
 <a class="slurm_link" href="#basic_curl"></a>
 </h3>

 <p>Query job metrics:</p>

 <pre>
 $ curl http://slurmctld.example.com:6817/metrics/jobs
 # HELP slurm_jobs Total number of jobs
 # TYPE slurm_jobs gauge
 slurm_jobs 42
 # HELP slurm_jobs_running Number of jobs in Running state
 # TYPE slurm_jobs_running gauge
 slurm_jobs_running 15
 # HELP slurm_jobs_pending Number of jobs in Pending state
 # TYPE slurm_jobs_pending gauge
 slurm_jobs_pending 27
 ...
 </pre>

 <p>Query node metrics:</p>

 <pre>
 $ curl http://slurmctld.example.com:6817/metrics/nodes
 # HELP slurm_nodes Total number of nodes
 # TYPE slurm_nodes gauge
 slurm_nodes 100
 # HELP slurm_nodes_idle Number of nodes in Idle state
 # TYPE slurm_nodes_idle gauge
 slurm_nodes_idle 85
 # HELP slurm_nodes_alloc Number of nodes in Allocated state
 # TYPE slurm_nodes_alloc gauge
 slurm_nodes_alloc 15
 ...
 </pre>

 <p>When <i>MetricsAuthUsers</i> is configured, or <i>PrivateData</i> is
 configured without <i>MetricsParameters=ignore_private_data</i>, global access
 is restricted. A JWT token must be provided in these cases:</p>

 <pre>
 $ curl -H "X-SLURM-USER-TOKEN:$SLURM_JWT" \
     http://slurmctld.example.com:6817/metrics/<i>endpoint</i>
 </pre>
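
 <p>The same query can be issued from a script. The following Python sketch uses
 only the standard library and sends the token in the same
 <code>X-SLURM-USER-TOKEN</code> header as the curl example above. The function
 names are hypothetical, and plain HTTP on the default port 6817 is assumed.</p>

```python
import urllib.request


def build_metrics_request(host, endpoint, token=None, port=6817):
    """Build (but do not send) a GET request for a metrics endpoint."""
    url = "http://{}:{}/metrics/{}".format(host, port, endpoint)
    request = urllib.request.Request(url, method="GET")
    if token:
        # Same header as the curl example; only needed when access is restricted.
        request.add_header("X-SLURM-USER-TOKEN", token)
    return request


def fetch_metrics(host, endpoint, token=None, port=6817, timeout=10):
    """Send the request and return the response body as text."""
    request = build_metrics_request(host, endpoint, token, port)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

 <p>For example, <code>fetch_metrics("slurmctld.example.com", "jobs",
 token=os.environ["SLURM_JWT"])</code> would mirror the curl command, assuming
 <code>os</code> is imported and the token is exported in that environment
 variable.</p>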

 <h3 id="prometheus_config">Prometheus Configuration
 <a class="slurm_link" href="#prometheus_config"></a>
 </h3>

 <p>Configure Prometheus to scrape Slurm metrics by adding the following to your
 <b>prometheus.yml</b>:</p>

 <pre>
 scrape_configs:
   - job_name: 'slurm_jobs'
     static_configs:
       - targets: ['slurm.example.com:6817']
     metrics_path: '/metrics/jobs'

   - job_name: 'slurm_nodes'
     static_configs:
       - targets: ['slurm.example.com:6817']
     metrics_path: '/metrics/nodes'

   - job_name: 'slurm_partitions'
     static_configs:
       - targets: ['slurm.example.com:6817']
     metrics_path: '/metrics/partitions'

   - job_name: 'slurm_scheduler'
     static_configs:
       - targets: ['slurm.example.com:6817']
     metrics_path: '/metrics/scheduler'

   - job_name: 'slurm_useracct'
     static_configs:
       - targets: ['slurm.example.com:6817']
     metrics_path: '/metrics/jobs-users-accts'
 </pre>

 <p>If <i>MetricsAuthUsers</i> is configured, or <i>PrivateData</i> is configured
 without <i>MetricsParameters=ignore_private_data</i>, global access is
 restricted. Add a JWT token for a user that is allowed to query the endpoint.
 For example:</p>

 <pre>
 scrape_configs:
   - job_name: 'slurm_jobs'
     static_configs:
       - targets: ['slurm.example.com:6817']
     metrics_path: '/metrics/jobs'
     authorization:
       type: Bearer
       credentials: '&lt;JWT_TOKEN&gt;'
 </pre>

 <!--#include virtual="footer.txt"-->