| .TH sdiag "1" "Slurm Commands" "May 2023" "Slurm Commands" |
| |
| .SH "NAME" |
| sdiag \- Scheduling diagnostic tool for Slurm |
| |
| .SH "SYNOPSIS" |
| sdiag |
| |
| .SH "DESCRIPTION" |
| sdiag shows information related to slurmctld execution about: threads, agents, |
| jobs, and scheduling algorithms. The goal is to obtain data from slurmctld |
| behavior helping to adjust configuration parameters or queues policies. The |
| main reason behind is to know Slurm behavior under systems with a high throughput. |
| .LP |
| It has two execution modes. The default mode \fB\-\-all\fR shows several counters |
| and statistics explained later, and there is another execution option |
| \fB\-\-reset\fR for resetting those values. |
| .LP |
| Values are reset at midnight UTC time by default. |
| .LP |
| The first block of information is related to global slurmctld execution: |
| |
| .TP |
| \fBServer thread count\fR |
| The number of current active slurmctld threads. A high number would mean a high |
| load processing events like job submissions, jobs dispatching, jobs completing, |
| etc. If this is often close to MAX_SERVER_THREADS it could point to a potential |
| bottleneck. |
| .IP |
| |
| .TP |
| \fBAgent queue size\fR |
| Slurm design has scalability in mind and sending messages to thousands of nodes |
| is not a trivial task. The agent mechanism helps to control communication |
| between slurmctld and the slurmd daemons for a best effort. This value denotes |
| the count of enqueued outgoing RPC requests in an internal retry list. |
| .IP |
| |
| .TP |
| \fBAgent count\fR |
| Number of agent threads. Each of these agent threads can create in turn a group |
| of up to 2 + AGENT_THREAD_COUNT active threads at a time. |
| .IP |
| |
| .TP |
| \fBAgent thread count\fR |
| Total count of active threads created by all the agent threads. |
| .IP |
| |
| .TP |
| \fBDBD Agent queue size\fR |
| Slurm queues up the messages intended for the SlurmDBD and processes them in a |
| separate thread. If the SlurmDBD, or database, is down then this number will |
| increase. |
| |
| The max queue size is configured in the slurm.conf with MaxDBDMsgs. If this number begins to grow more than half of the max queue size, the slurmdbd |
| and the database should be investigated immediately. |
| .IP |
| |
| .TP |
| \fBJobs submitted\fR |
| Number of jobs submitted since last reset |
| .IP |
| |
| .TP |
| \fBJobs started\fR |
| Number of jobs started since last reset. This includes backfilled jobs. |
| .IP |
| |
| .TP |
| \fBJobs completed\fR |
| Number of jobs completed since last reset. |
| .IP |
| |
| .TP |
| \fBJobs canceled\fR |
| Number of jobs canceled since last reset. |
| .IP |
| |
| .TP |
| \fBJobs failed\fR |
| Number of jobs failed due to slurmd or other internal issues since last reset. |
| .IP |
| |
| .TP |
| \fBJob states ts:\fR |
| Lists the timestamp of when the following job state counts were gathered. |
| .IP |
| |
| .TP |
| \fBJobs pending:\fR |
| Number of jobs pending at the given time of the time stamp above. |
| .IP |
| |
| .TP |
| \fBJobs running:\fR |
| Number of jobs running at the given time of the time stamp above. |
| .IP |
| |
| .LP |
| The next block of information is related to main scheduling algorithm based |
| on jobs priorities. A scheduling cycle implies to get the job_write_lock lock, |
| then trying to get resources for jobs pending, starting from the most priority |
| one and going in descending order. Once a job can not get the resources the |
| loop keeps going but just for jobs requesting other partitions. Jobs with |
| dependencies or affected by accounts limits are not processed. |
| |
| .TP |
| \fBLast cycle\fR |
| Time in microseconds for last scheduling cycle. |
| .IP |
| |
| .TP |
| \fBMax cycle\fR |
| Maximum time in microseconds for any scheduling cycle since last reset. |
| .IP |
| |
| .TP |
| \fBTotal cycles\fR |
| Total run time in microseconds for all scheduling cycles since last reset. |
| Scheduling is performed periodically and (depending upon configuration) |
| when a job is submitted or a job is completed. |
| .IP |
| |
| .TP |
| \fBMean cycle\fR |
| Mean time in microseconds for all scheduling cycles since last reset. |
| .IP |
| |
| .TP |
| \fBMean depth cycle\fR |
| Mean of cycle depth. Depth means number of jobs processed in a scheduling cycle. |
| .IP |
| |
| .TP |
| \fBCycles per minute\fR |
| Counter of scheduling executions per minute. |
| .IP |
| |
| .TP |
| \fBLast queue length\fR |
| Length of jobs pending queue. |
| .IP |
| |
| .LP |
| The next block of information is related to backfilling scheduling algorithm. |
| A backfilling scheduling cycle implies to get locks for jobs, nodes and |
| partitions objects then trying to get resources for jobs pending. Jobs are |
| processed based on priorities. If a job can not get resources the algorithm |
| calculates when it could get them obtaining a future start time for the job. |
| Then next job is processed and the algorithm tries to get resources for that |
| job but avoiding to affect the \fIprevious ones\fR, and again it calculates |
| the future start time if not current resources available. The backfilling |
| algorithm takes more time for each new job to process since more priority jobs |
| can not be affected. The algorithm itself takes measures for avoiding a long |
| execution cycle and for taking all the locks for too long. |
| |
| .TP |
| \fBTotal backfilled jobs (since last slurm start)\fR |
| Number of jobs started thanks to backfilling since last slurm start. |
| .IP |
| |
| .TP |
| \fBTotal backfilled jobs (since last stats cycle start)\fR |
| Number of jobs started thanks to backfilling since last time stats where reset. |
| By default these values are reset at midnight UTC time. |
| .IP |
| |
| .TP |
| \fBTotal backfilled heterogeneous job components\fR |
| Number of heterogeneous job components started thanks to backfilling since |
| last Slurm start. |
| .IP |
| |
| .TP |
| \fBTotal cycles\fR |
| Number of backfill scheduling cycles since last reset |
| .IP |
| |
| .TP |
| \fBLast cycle when\fR |
| Time when last backfill scheduling cycle happened in the format |
| "weekday Month MonthDay hour:minute.seconds year" |
| .IP |
| |
| .TP |
| \fBLast cycle\fR |
| Time in microseconds of last backfill scheduling cycle. |
| It counts only execution time, removing sleep time inside a scheduling cycle |
| when it executes for an extended period time. |
| Note that locks are released during the sleep time so that other work can |
| proceed. |
| .IP |
| |
| .TP |
| \fBMax cycle\fR |
| Time in microseconds of maximum backfill scheduling cycle execution since last reset. |
| It counts only execution time, removing sleep time inside a scheduling cycle |
| when it executes for an extended period time. |
| Note that locks are released during the sleep time so that other work can |
| proceed. |
| .IP |
| |
| .TP |
| \fBMean cycle\fR |
| Mean time in microseconds of backfilling scheduling cycles since last reset. |
| .IP |
| |
| .TP |
| \fBLast depth cycle\fR |
| Number of processed jobs during last backfilling scheduling cycle. It counts |
| every job even if that job can not be started due to dependencies or limits. |
| .IP |
| |
| .TP |
| \fBLast depth cycle (try sched)\fR |
| Number of processed jobs during last backfilling scheduling cycle. It counts |
| only jobs with a chance to start using available resources. These |
| jobs consume more scheduling time than jobs which are found can not be started |
| due to dependencies or limits. |
| .IP |
| |
| .TP |
| \fBDepth Mean\fR |
| Mean count of jobs processed during all backfilling scheduling cycles since last |
| reset. |
| Jobs which are found to be ineligible to run when examined by the backfill |
| scheduler are not counted (e.g. jobs submitted to multiple partitions and |
| already started, jobs which have reached a QOS or account limit such as |
| maximum running jobs for an account, etc). |
| .IP |
| |
| .TP |
| \fBDepth Mean (try sched)\fR |
| The subset of Depth Mean that the backfill scheduler attempted to schedule. |
| .IP |
| |
| .TP |
| \fBLast queue length\fR |
| Number of jobs pending to be processed by backfilling algorithm. |
| A job is counted once for each partition it is queued to use. |
| A pending job array will normally be counted as one job (tasks of a job array |
| which have already been started/requeued or individually modified will already |
| have individual job records and are each counted as a separate job). |
| .IP |
| |
| .TP |
| \fBQueue length Mean\fR |
| Mean count of jobs pending to be processed by backfilling algorithm. |
| A job is counted once for each partition it requested. |
| A pending job array will normally be counted as one job (tasks of a job array |
| which have already been started/requeued or individually modified will already |
| have individual job records and are each counted as a separate job). |
| .IP |
| |
| .TP |
| \fBLast table size\fR |
| Count of different time slots tested by the backfill scheduler in its last |
| iteration. |
| .IP |
| |
| .TP |
| \fBMean table size\fR |
| Mean count of different time slots tested by the backfill scheduler. |
| Larger counts increase the time required for the backfill operation. |
| The table size is influenced by many scheduling parameters, including: |
| bf_min_age_reserve, bf_min_prio_reserve, bf_resolution, and bf_window. |
| .IP |
| |
| .TP |
| \fBLatency for 1000 calls to gettimeofday()\fR |
| Latency of 1000 calls to the gettimeofday() syscall in microseconds, |
| as measured at controller startup. |
| .IP |
| |
| .LP |
| The next blocks of information report the most frequently issued |
| remote procedure calls (RPCs), calls made for the Slurmctld daemon to perform |
| some action. |
| The fourth block reports the RPCs issued by message type. |
| You will need to look up those RPC codes in the Slurm source code by looking |
| them up in the file src/common/slurm_protocol_defs.h. |
| The report includes the number of times each RPC is invoked, the total time |
| consumed by all of those RPCs plus the average time consumed by each RPC in |
| microseconds. |
| The fifth block reports the RPCs issued by user ID, the total number of RPCs |
| they have issued, the total time consumed by all of those RPCs plus the average |
| time consumed by each RPC in microseconds. |
| RPCs statistics are collected for the life of the slurmctld process unless |
| explicitly \fB\-\-reset\fR. |
| |
| .LP |
| The sixth block of information, labeled Pending RPC Statistics, shows |
| information about pending outgoing RPCs on the slurmctld agent queue. |
| The first section of this block shows types of RPCs on the queue and the |
| count of each. The second section shows up to the first 25 individual RPCs |
| pending on the agent queue, including the type and the destination host list. |
| This information is cached and only refreshed on 30 second intervals. |
| |
| .SH "OPTIONS" |
| |
| .TP |
| \fB\-a\fR, \fB\-\-all\fR |
| Get and report information. This is the default mode of operation. |
| .IP |
| |
| .TP |
| \fB\-M\fR, \fB\-\-cluster\fR=<\fIstring\fR> |
| The cluster to issue commands to. Only one cluster name may be specified. |
| Note that the \fBslurmdbd\fR must be up for this option to work properly, unless |
| running in a federation with \fBFederationParameters=fed_display\fR configured. |
| .IP |
| |
| .TP |
| \fB\-h\fR, \fB\-\-help\fR |
| Print description of options and exit. |
| .IP |
| |
| .TP |
| \f3\-\-json\fP, \f3\-\-json\fP=\fIlist\fR, \f3\-\-json\fP=<\fIdata_parser\fR> |
| Dump information as JSON using the default data_parser plugin or explicit |
| data_parser with parameters. Sorting and formatting arguments will be ignored. |
| .IP |
| |
| .TP |
| \fB\-r\fR, \fB\-\-reset\fR |
| Reset scheduler and RPC counters to 0. Only supported for Slurm operators and |
| administrators. |
| .IP |
| |
| .TP |
| \fB\-i\fR, \fB\-\-sort\-by\-id\fR |
| Sort Remote Procedure Call (RPC) data by message type ID and user ID. |
| .IP |
| |
| .TP |
| \fB\-t\fR, \fB\-\-sort\-by\-time\fR |
| Sort Remote Procedure Call (RPC) data by total run time. |
| .IP |
| |
| .TP |
| \fB\-T\fR, \fB\-\-sort\-by\-time2\fR |
| Sort Remote Procedure Call (RPC) data by average run time. |
| .IP |
| |
| .TP |
| \fB\-\-usage\fR |
| Print list of options and exit. |
| .IP |
| |
| .TP |
| \fB\-V\fR, \fB\-\-version\fR |
| Print current version number and exit. |
| .IP |
| |
| .TP |
| \f3\-\-yaml\fP, \f3\-\-yaml\fP=\fIlist\fR, \f3\-\-yaml\fP=<\fIdata_parser\fR> |
| Dump information as YAML using the default data_parser plugin or explicit |
| data_parser with parameters. Sorting and formatting arguments will be ignored. |
| .IP |
| |
| .SH "PERFORMANCE" |
| .PP |
| Executing \fBsdiag\fR sends a remote procedure call to \fBslurmctld\fR. If |
| enough calls from \fBsdiag\fR or other Slurm client commands that send remote |
| procedure calls to the \fBslurmctld\fR daemon come in at once, it can result in |
| a degradation of performance of the \fBslurmctld\fR daemon, possibly resulting |
| in a denial of service. |
| .PP |
| Do not run \fBsdiag\fR or other Slurm client commands that send remote procedure |
| calls to \fBslurmctld\fR from loops in shell scripts or other programs. Ensure |
| that programs limit calls to \fBsdiag\fR to the minimum necessary for the |
| information you are trying to gather. |
| |
| .SH "ENVIRONMENT VARIABLES" |
| .PP |
| Some \fBsdiag\fR options may be set via environment variables. These |
| environment variables, along with their corresponding options, are listed below. |
| (Note: Command line options will always override these settings.) |
| |
| .TP 20 |
| \fBSLURM_CLUSTERS\fR |
| Same as \fB\-\-cluster\fR |
| .IP |
| |
| .TP 20 |
| \fBSLURM_CONF\fR |
| The location of the Slurm configuration file. |
| |
| .TP |
| \fBSLURM_JSON\fR |
| Control JSON serialization: |
| .IP |
| .RS |
| .TP |
| \fBcompact\fR |
| Output JSON as compact as possible. |
| .IP |
| |
| .TP |
| \fBpretty\fR |
| Output JSON in pretty format to make it more readable. |
| .IP |
| .RE |
| |
| .TP |
| \fBSLURM_YAML\fR |
| Control YAML serialization: |
| .IP |
| .RS |
| .TP |
| \fBcompact\fR Output YAML as compact as possible. |
| .IP |
| |
| .TP |
| \fBpretty\fR Output YAML in pretty format to make it more readable. |
| .RE |
| .IP |
| |
| .SH "COPYING" |
| Copyright (C) 2010\-2011 Barcelona Supercomputing Center. |
| .br |
| Copyright (C) 2010\-2022 SchedMD LLC. |
| .LP |
| Slurm is free software; you can redistribute it and/or modify it under |
| the terms of the GNU General Public License as published by the Free |
| Software Foundation; either version 2 of the License, or (at your option) |
| any later version. |
| .LP |
| Slurm is distributed in the hope that it will be useful, but WITHOUT ANY |
| WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS |
| FOR A PARTICULAR PURPOSE. See the GNU General Public License for more |
| details. |
| |
| .SH "SEE ALSO" |
| .LP |
| sinfo(1), squeue(1), scontrol(1), slurm.conf(5), |