| <!--#include virtual="header.txt"--> |
| |
| <h1>Upgrade Guide</h1> |
| |
| <p>Slurm supports in-place upgrades between certain versions. This page provides |
| important details about the steps necessary to perform an upgrade and the |
| potential complications to prepare for.</p> |
| |
| <p>See also the <a href="quickstart_admin.html">Quick Start Administrator Guide</a>.</p> |
| |
| <h2 id="contents">Contents<a class="slurm_link" href="#contents"></a></h2> |
| <ul> |
| <li><a href="#release_cycle">Release Cycle</a> |
| <ul> |
| <li><a href="#compatibility_window">Compatibility Window</a></li> |
| <li><a href="#epel_repository">EPEL Repository</a></li> |
| <li><a href="#prerelease">Pre-Release Versions</a></li> |
| </ul></li> |
| <li><a href="#revert">Reverting an Upgrade</a></li> |
| <li><a href="#minor_upgrades">Minor Upgrades</a></li> |
| <li><a href="#procedure">Upgrade Procedure</a> |
| <ul> |
| <li><a href="#preparation">Preparation</a></li> |
| <li><a href="#backups">Create Backups</a></li> |
| <li><a href="#slurmdbd">slurmdbd (Accounting)</a> |
| <ul> |
| <li><a href="#db_server">Database Server</a></li> |
| </ul></li> |
| <li><a href="#slurmctld">slurmctld (Controller)</a></li> |
| <li><a href="#slurmd">slurmd (Compute Nodes)</a></li> |
| <li><a href="#other_commands">Other Slurm Commands</a></li> |
| <li><a href="#custom_plugins">Customized Slurm Plugins</a></li> |
| </ul></li> |
| <li><a href="#seamless_upgrades">Seamless Upgrades</a></li> |
| </ul> |
| |
| <h2 id="release_cycle">Release Cycle |
| <a class="slurm_link" href="#release_cycle"></a></h2> |
| |
| <p>The Slurm version number contains three period-separated numbers that |
| represent both the major Slurm release and maintenance release level. |
| For example, Slurm 23.11.4:</p> |
| |
| <ul> |
| <li><b>23.11</b> = major release |
| <ul> |
| <li>This matches the year and month of initial release (November 2023)</li> |
| <li>Major releases may contain changes to RPCs (remote procedure calls), |
| state files, configuration options, and core functionality</li> |
| </ul></li> |
| <li><b>.4</b> = maintenance version |
| <ul> |
| <li>Maintenance releases may contain bug fixes and performance improvements</li> |
| </ul></li> |
| </ul> |
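<p>As an illustration of the numbering scheme above, the following minimal
sketch splits a version string into its major release and maintenance level.
The <code>SlurmVersion</code> type and <code>parse_slurm_version</code> function
are hypothetical names used only for this example and are not part of Slurm.</p>

```python
import re
from typing import NamedTuple


class SlurmVersion(NamedTuple):
    major: str        # major release, e.g. "23.11" (year.month)
    maintenance: int  # maintenance level, e.g. 4
    suffix: str       # pre-release tag such as "rc1", otherwise ""


# Two digits for the year, two for the month, then the maintenance number,
# optionally followed by a pre-release suffix (as in "24.05.0rc1").
_VERSION_RE = re.compile(r"^(\d{2})\.(\d{2})\.(\d+)([A-Za-z0-9]*)$")


def parse_slurm_version(text: str) -> SlurmVersion:
    """Split a version string into major release and maintenance level."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a Slurm version string: {text!r}")
    year, month, maintenance, suffix = match.groups()
    return SlurmVersion(f"{year}.{month}", int(maintenance), suffix)
```

<p>Two versions belong to the same major release when their
<code>major</code> fields are equal, which is the case that the
<a href="#minor_upgrades">minor upgrade</a> guidance applies to.</p>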
| |
| <p>Prior to the 24.05 release, Slurm operated on a 9-month release cycle for |
| major versions. Slurm 24.05 represents the first release on the |
| <a href="https://www.schedmd.com/slurm-releases-move-to-a-six-month-cycle/"> |
| new 6-month cycle</a>.</p> |
| |
| <h3 id="compatibility_window">Compatibility Window |
| <a class="slurm_link" href="#compatibility_window"></a></h3> |
| |
| <p>Upgrades from the <b>previous two major releases</b> are compatible. For |
| example, slurmdbd 23.11.x is capable of accepting messages from slurmctld |
| daemons and commands with a version of 23.11.x, 23.02.x or 22.05.x. It is also |
| capable of updating the records in the database that were recorded by an |
| instance of slurmdbd running these versions.</p> |
| |
| <p>The Slurm 24.11 release introduces compatibility with three previous |
| major releases to provide a similar support duration with the more frequent |
| 6-month release cycle:</p> |
| |
| <table class="tlist centered"> |
| <tbody> |
| <tr> |
| <td><strong>Slurm Release</strong></td> |
| <td><strong>Revised End of Support</strong><br>(total length)</td> |
| <td><strong>Compatible Prior Version</strong></td> |
| </tr> |
| <tr> |
| <td>23.11</td> |
| <td>May 2025 (18 months)</td> |
| <td>23.02, 22.05</td> |
| </tr> |
| <tr> |
| <td>24.05</td> |
| <td>November 2025 (18 months)</td> |
| <td>23.11, 23.02</td> |
| </tr> |
| <tr> |
| <td>24.11</td> |
| <td>May 2026 (18 months)</td> |
| <td>24.05, 23.11, 23.02</td> |
| </tr> |
| <tr> |
| <td>25.05</td> |
| <td>November 2026 (18 months)</td> |
| <td>24.11, 24.05, 23.11</td> |
| </tr> |
| <tr> |
| <td>25.11</td> |
| <td>May 2027 (18 months)</td> |
| <td>25.05, 24.11, 24.05</td> |
| </tr> |
| <tr> |
| <td>26.05</td> |
| <td>November 2027 (18 months)</td> |
| <td>25.11, 25.05, 24.11</td> |
| </tr> |
| </tbody> |
| </table> |
| <br> |
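<p>The rules behind the table can be sketched in a few lines, assuming an
18-month support length and a window of two prior releases before 24.11 and
three from 24.11 onward. The release list is limited to versions named on this
page, and the function names are hypothetical.</p>

```python
# Major releases named on this page, oldest first.
RELEASES = ["22.05", "23.02", "23.11", "24.05", "24.11", "25.05", "25.11", "26.05"]


def compatible_prior(release):
    """Prior major releases that can upgrade directly to `release`, newest first."""
    index = RELEASES.index(release)
    # 24.11 and later accept three prior releases; earlier releases accept two.
    # Plain string comparison is enough for the fixed "YY.MM" format.
    window = 3 if release >= "24.11" else 2
    if index < window:
        raise ValueError(f"release list is too short to answer for {release}")
    return RELEASES[index - window:index][::-1]


def end_of_support(release, months=18):
    """Return (year, month) at which support ends, `months` after release."""
    year, month = (int(part) for part in release.split("."))
    total = (2000 + year) * 12 + (month - 1) + months
    return total // 12, total % 12 + 1
```

<p>For example, <code>compatible_prior("24.11")</code> yields 24.05, 23.11,
and 23.02, and <code>end_of_support("23.11")</code> yields May 2025.</p>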
| |
| <p>Upgrades from incompatible versions will fail immediately upon startup. |
| Upgrades from incompatible prior versions must be performed in steps, each |
| moving to a newer version that is compatible with the currently running |
| version. It may take several steps to upgrade to a current release of Slurm. |
| For example, instead of upgrading directly from Slurm 20.11 to 23.11, first |
| upgrade all systems to Slurm 22.05 and verify functionality, then proceed to |
| upgrade to 23.11. This ensures that each upgrade performed is tested and can be |
| supported by SchedMD. Compatibility requirements also apply to running jobs: |
| upgrading outside of their compatibility window will result in the jobs being |
| killed and job accounting being lost.</p> |
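<p>A stepwise upgrade can be planned with a simple greedy search: from the
running version, move to the newest release that still accepts it, and repeat
until the target is reached. The sketch below is only an illustration. The
release list (including 21.08, which is not mentioned elsewhere on this page)
and the function names are assumptions.</p>

```python
# Major releases, oldest first. Extend this list as needed for your site.
RELEASES = ["20.11", "21.08", "22.05", "23.02", "23.11",
            "24.05", "24.11", "25.05", "25.11"]


def window(release):
    """Number of prior major releases that `release` can be upgraded from."""
    return 3 if release >= "24.11" else 2


def can_upgrade(current, target):
    """True if `current` is within the compatibility window of `target`."""
    gap = RELEASES.index(target) - RELEASES.index(current)
    return 0 <= gap <= window(target)


def plan_upgrade_path(current, target):
    """List the releases to install, in order, to get from current to target."""
    if RELEASES.index(target) < RELEASES.index(current):
        raise ValueError("downgrades are not supported")
    path = []
    while current != target:
        start, stop = RELEASES.index(current), RELEASES.index(target)
        # The next release always accepts the current one, so this is never empty.
        candidates = [r for r in RELEASES[start + 1:stop + 1] if can_upgrade(current, r)]
        current = candidates[-1]  # greedy choice: the newest compatible release
        path.append(current)
    return path
```

<p>With these assumptions, <code>plan_upgrade_path("20.11", "23.11")</code>
returns the two steps from the example above: 22.05, then 23.11. Each step
should still be verified before starting the next one.</p>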
| |
| <h3 id="epel_repository">EPEL Repository |
| <a class="slurm_link" href="#epel_repository"></a></h3> |
| |
| <p>At the beginning of 2021, a version of Slurm was added to the |
| EPEL repository. This version is neither provided nor supported by SchedMD, and |
| is not currently recommended for customer use. Unfortunately, this inclusion could |
| cause Slurm to be updated to a newer version outside of a planned maintenance |
| period or result in conflicting packages. In order to prevent Slurm from being |
| changed and broken unintentionally, we recommend you modify the EPEL Repository |
| configuration to exclude all Slurm packages from automatic updates.</p> |
| |
| <p>Add the following under the <code>[epel]</code> |
| section of /etc/yum.repos.d/epel.repo: |
| <pre>exclude=slurm*</pre></p> |
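<p>To confirm that the exclusion is in place, the repository file can be read
as an INI-style file. The following sketch assumes the file parses with
Python's <code>configparser</code>. The function name and the sample package
names are hypothetical.</p>

```python
import configparser
import fnmatch


def epel_excludes_slurm(repo_text, packages=("slurm", "slurm-slurmd")):
    """Return True if the [epel] section excludes every sample package name."""
    # Disable interpolation so values such as "$basearch" or "%" are left alone.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(repo_text)
    if not parser.has_section("epel"):
        return False
    raw = parser.get("epel", "exclude", fallback="")
    # Treat the value as a list of glob patterns separated by spaces or commas.
    patterns = raw.replace(",", " ").split()
    return all(
        any(fnmatch.fnmatchcase(name, pattern) for pattern in patterns)
        for name in packages
    )
```

<p>In practice the text would come from reading
<code>/etc/yum.repos.d/epel.repo</code>. Passing it in as a string keeps the
check easy to try on a copy of the file first.</p>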
| |
| <h3 id="prerelease">Pre-Release Versions |
| <a class="slurm_link" href="#prerelease"></a></h3> |
| |
| <p>When installing pre-release versions (e.g., 24.05.0rc1 or |
| <a href="https://github.com/SchedMD/slurm">master branch</a>), you should prepare |
| for unexpected crashes, bugs, and loss of state information. SchedMD aims to |
| use the NEWS file to indicate cases in which state information will be lost with |
| pre-release versions. However, these pre-release versions receive <b>limited |
| testing</b> and are not intended for production clusters. Sites are encouraged |
| to actively run pre-release versions on test machines before each major release. |
| </p> |
| |
| <h2 id="revert">Reverting an Upgrade |
| <a class="slurm_link" href="#revert"></a></h2> |
| |
| <p>Reverting an upgrade (or downgrading) is <b>not supported</b> once any of the |
| Slurm daemons have been started. When starting up after an upgrade, the Slurm |
| daemons (slurmctld, slurmdbd, and slurmd) will update their relevant state |
| files and databases to the structure used in the new version. If you revert to |
| an older version, the relevant Slurm daemon will not recognize the new state |
| file or database, resulting in loss or corruption of state information or job |
| accounting. The Slurm daemons will likely refuse to start unless configured to |
| start with the risk of possible data loss.</p> |
| |
| <p>By using recovery tools, like comprehensive file backups, disk images, and |
| snapshots, it may be possible to revert components to the pre-upgrade state. |
| In particular, restoring the contents of <i>StateSaveLocation</i> (as defined in |
| <i>slurm.conf</i>) and (if configured) the accounting database will be required |
| if you wish to revert an upgrade. Reverting an upgrade will wipe out anything |
| that happened after the backups were created.</p> |
| |
| <h2 id="minor_upgrades">Minor Upgrades |
| <a class="slurm_link" href="#minor_upgrades"></a></h2> |
| |
| <p>When upgrading to a newer minor maintenance release (as |
| <a href="#release_cycle">defined above</a>), we recommend following the same |
| upgrade procedure as with major releases. You will find that the process takes |
| less time, and is more accommodating of mixed versions and in-place |
| downgrades. However, you should always have current backups to solidify your |
| recovery options.</p> |
| |
| <h2 id="procedure">Upgrade Procedure |
| <a class="slurm_link" href="#procedure"></a></h2> |
| |
| <p>The upgrade procedure can be summarized as follows. Note the specific order |
| in which the daemons should be upgraded:</p> |
| |
| <ol> |
| <li><a href="#preparation">Prepare cluster for the upgrade</a></li> |
| <li><a href="#backups">Create backups</a></li> |
| <li>Upgrade <a href="#slurmdbd">slurmdbd</a></li> |
| <li>Upgrade <a href="#slurmctld">slurmctld</a></li> |
| <li>Upgrade <a href="#slurmd">slurmd</a> (preferably with slurmctld)</li> |
| <li>Upgrade <a href="#other_commands">login nodes and client commands</a></li> |
| <li>Recompile/upgrade <a href="#custom_plugins">customized Slurm plugins</a></li> |
| <li>Test key functionality</li> |
| <li>Archive backup data</li> |
| </ol> |
| |
| <p>Before considering the upgrade complete, wait for all jobs that were already |
| running to finish. Any jobs started before the <b>slurmd</b> system was upgraded |
| will be running with the old version of <b>slurmstepd</b>, so starting another |
| upgrade or trying to use new features in the new version may cause problems.</p> |
| |
| <p><b>NOTE</b>: If RPM/DEB packages are used, all packages present on each |
| system must be upgraded together instead of piecewise. This is because the |
| packages that contain Slurm daemons and client commands depend on the general |
| <b>slurm</b> package. Avoid using low-level package managers like <code>rpm</code> |
| or <code>dpkg</code> as they may not properly enforce these dependencies. |
| After upgrading, daemons should be started in the order listed above.</p> |
| |
| <h3 id="preparation">Preparation |
| <a class="slurm_link" href="#preparation"></a></h3> |
| |
| <h4 id="release_notes">RELEASE_NOTES and CHANGELOG |
| <a class="slurm_link" href="#release_notes"></a></h4> |
| |
| <p>Review the relevant release notes in the <b>RELEASE_NOTES.md</b> file in the |
| root of the Slurm source directory for the target release and for any major |
| versions between the one you are currently running and the one you are |
| upgrading to. Pay particular attention to any entries in which items are |
| <b>removed</b> or <b>changed</b>. These are particularly likely to require |
| specific attention or changes during the upgrade. Also look for changes in |
| optional Slurm components that you are using. You may also notice new items |
| added to Slurm that you wish to start using after the upgrade.</p> |
| |
| <p>Release notes for the latest major version are |
| available <a href="release_notes.html">here</a> or on GitHub |
| (<a href="https://github.com/SchedMD/slurm/blob/master/RELEASE_NOTES.md">RELEASE_NOTES.md</a>). |
| Release notes for other versions can be found in the source, which can be viewed |
| on GitHub by selecting the branch or tag corresponding to the desired version. |
| (Refer to |
| <a href="https://github.com/SchedMD/slurm/blob/slurm-24.11/RELEASE_NOTES"> |
| RELEASE_NOTES</a> for Slurm 24.11 and older.) More detailed changes, including |
| minor release changes, can be found in the <b>CHANGELOG</b> directory (formerly |
| the NEWS file), but are usually not needed to prepare for upgrades.</p> |
| |
| <h4 id="config_changes">Configuration Changes |
| <a class="slurm_link" href="#config_changes"></a></h4> |
| |
| <p>Always prepare and test configuration changes in a test environment |
| before upgrading in production. Changes outlined in the release notes will need |
| to be looked up in the man pages (such as <a href="slurm.conf.html">slurm.conf |
| </a>) for details and new syntax. Certain options in your configuration files |
| may need to be changed as features and functionality are improved in every major |
| Slurm release. Typically, new naming and syntax conventions are introduced |
| several versions before the old ones are removed, so you may be able to make the |
| necessary changes before starting the upgrade process.</p> |
| |
| <h4 id="downtime">Plan for Downtime |
| <a class="slurm_link" href="#downtime"></a></h4> |
| |
| <p>Refer to the expected downtime guidance in the |
| following sections for each relevant Slurm daemon, particularly the |
| <a href="#slurmdbd">slurmdbd</a>. Notify affected users of the estimated |
| downtime for the relevant services and the potential impact on their jobs. |
| Whenever possible, try to plan upgrades during SchedMD's support hours. |
| If you encounter an issue outside of these hours there will be a delay before |
| assistance can be provided.</p> |
| |
| <h4 id="openapi_changes">OpenAPI Changes |
| <a class="slurm_link" href="#openapi_changes"></a></h4> |
| |
| <p>Sites using <code>--json</code> or <code>--yaml</code> arguments with any CLI |
| commands or running <code>slurmrestd</code> need to check for format |
| compatibility and data_parser plugin removals before upgrading. The formats for |
| the values parsed and dumped as JSON and YAML are handled by the data_parser |
| and openapi plugins. Changes to the formats are tracked in the |
| <a href="openapi_release_notes.html">OpenAPI release notes</a>.</p> |
| |
| <table class="tlist centered"> |
| <tbody> |
| <tr> |
| <td><strong>Release Notes</strong></td> |
| <td><strong>Added OpenAPI plugins</strong></td> |
| <td><strong>Added Data_Parser plugin</strong></td> |
| <td><strong>Removed in Release</strong></td> |
| </tr> |
| <tr> |
| <td><a href="openapi_release_notes.html#24050">24.05</a></td> |
| <td></td> |
| <td>v0.0.41</td> |
| <td>26.05</td> |
| </tr> |
| <tr> |
| <td><a href="openapi_release_notes.html#24110">24.11</a></td> |
| <td></td> |
| <td>v0.0.42</td> |
| <td>26.11</td> |
| </tr> |
| <tr> |
| <td><a href="openapi_release_notes.html#25050">25.05</a></td> |
| <td></td> |
| <td>v0.0.43</td> |
| <td>27.05</td> |
| </tr> |
| <tr> |
| <td><a href="openapi_release_notes.html#25110">25.11</a></td> |
| <td></td> |
| <td>v0.0.44</td> |
| <td>27.11</td> |
| </tr> |
| </tbody> |
| </table> |
| |
| <p><b>NOTE</b>: The unversioned openapi/slurmctld and openapi/slurmdbd plugins |
| have no planned removal release.</p> |
| |
| <p>Any scripts or clients making use of <code>--json</code> or |
| <code>--yaml</code> arguments with any CLI commands may need to pass the |
| data_parser version explicitly to avoid issues after an upgrade. The default |
| data_parser is the latest version, which may not have a format compatible |
| with prior versions. Sites can use the specification generation mode to |
| compare formatting differences. |
| <pre> |
| $CLI_COMMAND --json=v0.0.41+spec_only > /tmp/v41.json; |
| $CLI_COMMAND --json=v0.0.40+spec_only > /tmp/v40.json; |
| json_diff /tmp/v40.json /tmp/v41.json; |
| </pre></p> |
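<p>If no JSON diff tool is available, the two specification files can be
compared with a short script. This is a minimal sketch, assuming both files are
valid JSON. The helper names are hypothetical, and the paths in the usage note
match the example above.</p>

```python
import json


def load_spec(path):
    """Read one specification file produced with +spec_only."""
    with open(path) as handle:
        return json.load(handle)


def flatten(value, prefix=""):
    """Map every leaf of a JSON document to a path such as 'a.b[0]'."""
    items = {}
    if isinstance(value, dict):
        for key, child in value.items():
            items.update(flatten(child, f"{prefix}.{key}" if prefix else key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            items.update(flatten(child, f"{prefix}[{index}]"))
    else:
        items[prefix] = value
    return items


def diff_specs(old, new):
    """Report leaf paths that were removed, added, or changed between specs."""
    old_flat, new_flat = flatten(old), flatten(new)
    shared = old_flat.keys() & new_flat.keys()
    return {
        "removed": sorted(old_flat.keys() - new_flat.keys()),
        "added": sorted(new_flat.keys() - old_flat.keys()),
        "changed": sorted(key for key in shared if old_flat[key] != new_flat[key]),
    }
```

<p>For example, <code>diff_specs(load_spec("/tmp/v40.json"),
load_spec("/tmp/v41.json"))</code> returns three sorted lists of paths. Entries
under <code>removed</code> and <code>changed</code> are the ones most likely to
affect existing scripts.</p>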
| |
| <p>In the event of a format incompatibility, the preferred data_parser can be |
| requested explicitly starting with the v0.0.40 plugins in any release before |
| the plugin's removal. |
| <pre> |
| $CLI_COMMAND --json=v0.0.41 $OTHER_ARGS | $SITE_SCRIPT; |
| $CLI_COMMAND --json=v0.0.40 $OTHER_ARGS | $SITE_SCRIPT; |
| $CLI_COMMAND --yaml=v0.0.41 $OTHER_ARGS | $SITE_SCRIPT; |
| $CLI_COMMAND --yaml=v0.0.40 $OTHER_ARGS | $SITE_SCRIPT; |
| </pre></p> |
| |
| <p>Any <code>slurmrestd</code> web clients can determine the relevant plugin |
| being used by looking at the URL being queried. Example URLs: |
| <pre> |
| http://$HOST/slurmdb/v0.0.40/jobs |
| http://$HOST/slurm/v0.0.40/jobs |
| </pre></p> |
| |
| <p>The relevant data_parser plugin in the example URLs is "v0.0.40" which |
| matches the <code>data_parser/v0.0.40</code> plugin. Plugin naming follows the |
| naming schema of <code>vXX.XX.XX</code> where the XX are numbers. The naming |
| schema matches the internal naming schema for Slurm's packed binary RPC layer |
| but is not directly related. The URLs for a given data_parser plugin will |
| remain a valid query target until the plugin is removed, as part of SchedMD's |
| commitment to release-limited backwards compatibility. While it should |
| be possible to continue using any client from a prior release while the plugins |
| are still supported, <b>sites should always recompile any generated OpenAPI |
| clients and test thoroughly before upgrading.</b></p> |
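<p>When auditing client code before an upgrade, the plugin version can be
extracted from each queried URL and checked against the removal table above.
The sketch below copies the removal releases from that table. The function
names are hypothetical.</p>

```python
import re
from urllib.parse import urlparse

# data_parser plugin -> release in which it is removed (from the table above).
REMOVED_IN = {
    "v0.0.41": "26.05",
    "v0.0.42": "26.11",
    "v0.0.43": "27.05",
    "v0.0.44": "27.11",
}

# Matches paths such as /slurm/v0.0.40/jobs or /slurmdb/v0.0.40/jobs.
_PATH_RE = re.compile(r"^/(slurm|slurmdb)/(v\d+\.\d+\.\d+)/")


def plugin_from_url(url):
    """Return the plugin version embedded in a slurmrestd URL, or None."""
    match = _PATH_RE.match(urlparse(url).path)
    return match.group(2) if match else None


def available_in(plugin, target_release):
    """True/False if the table lists the plugin, None if it does not."""
    removed = REMOVED_IN.get(plugin)
    if removed is None:
        return None
    # "YY.MM" strings of equal length compare correctly as plain strings.
    return target_release < removed
```

<p>A result of <code>None</code> means the plugin is not listed in the table,
so the OpenAPI release notes for that version should be checked directly.</p>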
| |
| <h3 id="backups">Create Backups |
| <a class="slurm_link" href="#backups"></a></h3> |
| |
| <p><b>Always</b> create full backups to restore all parts of Slurm, including |
| the MySQL database, before upgrading in the event the upgrade must be reverted. |
| SchedMD aims to make supported upgrades a seamless process but it is possible |
| for unexpected issues to arise and <b>irreversibly corrupt</b> all of the data |
| kept by Slurm. If something like this happens, it will not be possible to |
| recover any corrupted data and you will be reliant on backed up data.</p> |
| |
| <p>It is recommended to prepare recovery options (file backups, disk images, |
| snapshots, database dumps) that will take you back to a known working cluster |
| state. How backups are taken is specific to how the systems integrator |
| designed and set up the cluster, so procedures are not provided here.</p> |
| |
| <p>At a minimum, back up the following: |
| <ul> |
| <li><b>StateSaveLocation</b> as defined in |
| <a href="slurm.conf.html#OPT_StateSaveLocation">slurm.conf</a>, or it can be |
| queried by calling <pre>scontrol show config | grep StateSaveLocation</pre></li> |
| <li><b>Entire slurm configuration directory</b>, as defined by |
| <code>configure --sysconfdir=DIR</code> during compilation. |
| This is usually located in <code>/etc/slurm/</code></li> |
| <li><b>MySQL database</b> (if slurmdbd is configured). Usually done by calling |
| <pre> |
| mysqldump --databases slurm_acct_db > /path/to/offline/storage/backup.sql |
| </pre> |
| This assumes that <b>slurmdbd</b> is not running while the dump is running. |
| <br>If you wish to back it up while <b>slurmdbd</b> is running, you may use the |
| <code>--single-transaction</code> flag with the <b>following limitations</b>: |
| <ol> |
| <li>Database operations may be slower while the dump is running</li> |
| <li>Restoring this dump will restore the database at the time the dump was |
| <b>started</b>, losing any changes made during or after the dump</li> |
| <li>Certain cluster operations may lead to an incorrect or failed dump: |
| <ul> |
| <li>Creating a new database</li> |
| <li>Upgrading an existing database</li> |
| <li>Adding or Removing a cluster in the slurmdbd</li> |
| <li><a href="https://slurm.schedmd.com/accounting.html#slurmdbd-archive-purge"> |
| Archiving or Purging</a> accounting data</li> |
| </ul> |
| </li> |
| </ol> |
| </li> |
| </ul></p> |
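<p>The backup inputs described above can be gathered with a small helper. This
sketch only parses <code>scontrol show config</code> output and builds the
<code>mysqldump</code> argument list; it does not run anything. The function
names are hypothetical, and the database name is the one used in the example
above.</p>

```python
import re


def state_save_location(scontrol_output):
    """Extract StateSaveLocation from `scontrol show config` output."""
    match = re.search(
        r"^StateSaveLocation\s*=\s*(\S+)", scontrol_output, re.MULTILINE
    )
    return match.group(1) if match else None


def mysqldump_command(database="slurm_acct_db", slurmdbd_running=False):
    """Build the mysqldump argument list for the accounting database."""
    command = ["mysqldump"]
    if slurmdbd_running:
        # Dumping while slurmdbd is running has the limitations listed above.
        command.append("--single-transaction")
    command += ["--databases", database]
    return command
```

<p>The returned list can be passed to <code>subprocess.run</code> with
<code>stdout</code> pointing at an open file on offline storage. Stopping
slurmdbd first remains the preferred approach.</p>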
| |
| <h3 id="slurmdbd">slurmdbd (Accounting) |
| <a class="slurm_link" href="#slurmdbd"></a></h3> |
| |
| <p>If <b>slurmdbd</b> is used in your environment, it must be at the same or |
| higher major release number as the slurmctld daemon(s), and at a close enough |
| version for <a href="#compatibility_window">compatibility</a>. Thus, when |
| performing upgrades, it should be upgraded first. When a backup slurmdbd host |
| is in use, it should be upgraded at the same time as the primary.</p> |
| |
| <p>Upgrades to the slurmdbd may require significant <b>downtime</b>. |
| With large accounting databases, the precautionary database dump will take some |
| time, and the upgraded daemon may be unresponsive for tens of minutes while it |
| updates the database to the new schema. Sites are encouraged to use the |
| <a href="slurmdbd.conf.html#OPT_PurgeJobAfter">purge functionality</a> if older |
| accounting data is not required for normal operations. Purging old records |
| before attempting to upgrade can significantly decrease outage time.</p> |
| |
| <p>The non-slurmdbd functionality of the cluster will continue to operate while |
| the upgrade is in process, provided the activity does not fill up the slurmdbd |
| Agent queue on the slurmctld node. While slurmdbd is offline, you should |
| monitor the memory usage of slurmctld, and the <b>DBD Agent queue size</b>, as |
| reported by <b>sdiag</b>, to ensure it does not exceed the configured |
| <b>MaxDBDMsgs</b> in <a href="slurm.conf.html#OPT_MaxDBDMsgs">slurm.conf</a>. |
| The CLI commands <a href="sacct.html">sacct</a> and <a href="sacctmgr.html"> |
| sacctmgr</a> will not work while slurmdbd is down. |
| <code>slurmrestd</code> queries that include slurmdb in |
| the URL path will fail while slurmdbd is down.</p> |
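<p>Monitoring the queue can be automated by parsing the <b>sdiag</b> output.
The sketch below assumes the output contains a line beginning with
<code>DBD Agent queue size:</code>. The function names and the 80% warning
threshold are hypothetical choices, not Slurm defaults.</p>

```python
import re


def dbd_agent_queue_size(sdiag_output):
    """Return the DBD Agent queue size reported by sdiag, or None."""
    match = re.search(
        r"^DBD Agent queue size:\s*(\d+)", sdiag_output, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def queue_status(sdiag_output, max_dbd_msgs, warn_fraction=0.8):
    """Classify the queue as 'ok', 'warning', 'full', or 'unknown'."""
    size = dbd_agent_queue_size(sdiag_output)
    if size is None:
        return "unknown"
    if size >= max_dbd_msgs:
        return "full"
    if size >= warn_fraction * max_dbd_msgs:
        return "warning"
    return "ok"
```

<p>Pass the value of <b>MaxDBDMsgs</b> from your configuration as
<code>max_dbd_msgs</code>, and run the check periodically while slurmdbd is
offline.</p>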
| |
| <p>It is preferred to create a backup of the database after shutting down the |
| <b>slurmdbd</b> daemon, when the MySQL database is no longer changing. If you |
| wish to take a backup with <b>mysqldump</b> while the slurmdbd is still |
| running, you can add <code>--single-transaction</code> to the mysqldump command. |
| Note that the slurmdbd will continue to execute operations that will not be |
| contained in the dump, which may cause complications if you need to restore |
| the database to this state.</p> |
| |
| <p>The suggested upgrade procedure is as follows:</p> |
| |
| <ol> |
| <li>Shut down the slurmdbd daemon(s) gracefully: |
| <pre>sacctmgr shutdown</pre>or via systemd: |
| <pre>systemctl stop slurmdbd</pre> Wait until slurmdbd is fully down before |
| proceeding or there may be data loss from data that was not fully saved. |
| <pre>systemctl status slurmdbd</pre> |
| </li> |
| <li><a href="#backups">Back up the Slurm database</a></li> |
| <li>Verify that the innodb_buffer_pool_size in my.cnf is greater than the |
| default. See the recommendation in the |
| <a href="accounting.html#slurm-accounting-configuration-before-build"> |
| accounting page</a>.</li> |
| <li>Upgrade the slurmdbd daemon binaries, libraries, and its systemd unit file |
| (if used). If using <a href="quickstart_admin.html#build_install"> |
| RPM/DEB packages</a>, the package manager will take care of these, |
| although systemd overrides may prevent the new unit from taking effect. |
| <br>Only upgrade the slurmdbd system(s) at this time; other Slurm |
| systems should remain on the old version.</li> |
| <li>Start the primary slurmdbd daemon. |
| <br><b>NOTE</b>: If you typically use systemd, it is recommended to |
| initially start the daemon directly as the configured SlurmUser: |
| <br><code>sudo -u slurm slurmdbd -D</code> |
| <br>When the daemon starts up for the first time after upgrading, it |
| will take some extra time to update existing records in the database. If |
| it is started with systemd and reaches the configured timeout value, it |
| may be killed prematurely, potentially causing data loss. After it |
| finishes starting up, you can use <code>Ctrl+C</code> to exit, then |
| start it normally with systemd.</li> |
| <li>Start the backup slurmdbd daemon (if applicable).</li> |
| <li>Validate accounting operation, such as retrieving data through |
| <code>sacct</code> or <code>sacctmgr</code>.</li> |
| </ol> |
| |
| <h4 id="db_server"><b>Database Server</b> |
| <a class="slurm_link" href="#db_server"></a></h4> |
| |
| <p>When upgrading the database server that is used by slurmdbd (e.g., MySQL or |
| MariaDB), usually no special procedures are required. It is recommended to use a |
| database server that is supported by the publisher (or that was at the time when |
| the chosen Slurm version was initially released). Database upgrades should be |
| performed while the slurmdbd is stopped and according to the recommended |
| procedure for the database used.</p> |
| |
| <p>When upgrading an existing accounting database to <b>MariaDB 10.2.1</b> or |
| later from an older version of MariaDB or any version of MySQL, ensure you are |
| running <b>slurmdbd 22.05.7</b> or later. These versions will gracefully handle |
| changes to MariaDB default values that can cause problems for slurmdbd.</p> |
| |
| <h3 id="slurmctld">slurmctld (Controller) |
| <a class="slurm_link" href="#slurmctld"></a></h3> |
| |
| <p>It is preferred to upgrade the slurmctld system(s) at the same time as slurmd |
| on the compute nodes and other Slurm commands on client machines and login nodes. |
| The effects of downtime on slurmctld and slurmd daemons are largely the same, |
| so upgrading them all together minimizes the total duration of these effects. |
| Rolling upgrades are also possible if the slurmctld is upgraded first. When |
| multiple slurmctld hosts are used, all should be upgraded simultaneously.</p> |
| |
| <p>Upgrading the slurmctld involves a brief period of <b>downtime</b> during |
| which job submissions are not accepted, queued jobs are not scheduled, and |
| information about completing jobs is held. These functions will resume once |
| the upgraded controller is started.</p> |
| |
| <p>The recommended upgrade procedure is below, including optional steps for a |
| simultaneous upgrade of slurmd systems:</p> |
| |
| <ol> |
| <li>Increase configured SlurmdTimeout and SlurmctldTimeout values and |
| execute <code>scontrol reconfig</code> for them to take effect. |
| <br>The new timeout should be long enough to perform the upgrade using |
| your preferred method. If the timeout is reached, nodes may be marked |
| DOWN and their jobs killed.</li> |
| <li>Shut down the slurmctld daemon(s).</li> |
| <li>(opt.) Shut down the slurmd daemons on the compute nodes.</li> |
| <li>Back up the contents of the configured StateSaveLocation.</li> |
| <li>Upgrade the slurmctld (and optionally slurmd) daemons and their systemd |
| service files (if used).</li> |
| <li>(opt.) Restart the slurmd daemons on the compute nodes.</li> |
| <li>Restart the slurmctld daemon(s).</li> |
| <li>Validate proper operation, such as communication with nodes and a job's |
| ability to successfully start and finish.</li> |
| <li>Restore the preferred SlurmdTimeout and SlurmctldTimeout values and |
| execute <code>scontrol reconfig</code> for them to take effect.</li> |
| </ol> |
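<p>The first and last steps above both edit timeout values in
<code>slurm.conf</code>. As an illustration, the following sketch rewrites a
single <code>Name=value</code> option in the text of a configuration file. It
assumes one option per line, and <code>set_option</code> is a hypothetical
helper, not a Slurm tool.</p>

```python
import re


def set_option(conf_text, name, value):
    """Set `name=value`, replacing an active entry or appending a new one."""
    # Match only uncommented lines; option names are matched case-insensitively.
    pattern = re.compile(
        rf"^([ \t]*){re.escape(name)}[ \t]*=[ \t]*\S+",
        re.IGNORECASE | re.MULTILINE,
    )
    new_text, count = pattern.subn(
        lambda match: f"{match.group(1)}{name}={value}", conf_text
    )
    if count == 0:
        # The option was absent or commented out, so add it at the end.
        new_text = conf_text.rstrip("\n") + f"\n{name}={value}\n"
    return new_text
```

<p>Keep a copy of the original file so that the preferred values can be
restored afterward, and run <code>scontrol reconfig</code> after each
change.</p>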
| |
| <h3 id="slurmd">slurmd (Compute Nodes) |
| <a class="slurm_link" href="#slurmd"></a></h3> |
| |
| <p>It is preferred to upgrade all slurmd nodes at the same time as the slurmctld. |
| It is also possible to perform a rolling upgrade by upgrading the slurmd nodes |
| later in any number of groups. Sites are encouraged to minimize the amount of |
| time during which mixed versions are used in a cluster.</p> |
| |
| <p>Upgrades will not interrupt running jobs as long as <b>SlurmdTimeout</b> |
| is not reached during the process. However, while the slurmd is down for |
| upgrades, new jobs will not be started and finishing jobs will wait to |
| report back to the controller until it comes back online.</p> |
| |
| <p>If you are upgrading the slurmd nodes separately from the controller, the |
| following procedure can be followed:</p> |
| |
| <ol> |
| <li>Increase the configured SlurmdTimeout value and execute |
| <code>scontrol reconfig</code> for it to take effect. |
| <br>The new timeout should be long enough to perform the upgrade using |
| your preferred method. If the timeout is reached, nodes may be marked |
| DOWN and their jobs killed.</li> |
| <li>Shut down the slurmd daemons on the compute nodes.</li> |
| <li>Back up the contents of the configured SlurmdSpoolDir.</li> |
| <li>Upgrade the slurmd daemons and their systemd unit files (if used).</li> |
| <li>Restart the slurmd daemons.</li> |
| <li>Validate proper operation, such as communication with the controller and a |
| job's ability to successfully start and finish.</li> |
| <li>Repeat for any other groups of nodes that need to be upgraded.</li> |
| <li>Restore the preferred SlurmdTimeout value and |
| execute <code>scontrol reconfig</code> for it to take effect.</li> |
| </ol> |
| |
| <h3 id="other_commands">Other Slurm Commands |
| <a class="slurm_link" href="#other_commands"></a></h3> |
| |
| <p>Other Slurm commands (including client commands) do not require special |
| attention when upgrading, except where specifically noted in the release notes. |
| You should also pay attention to any changes introduced in these additional |
| components. After core Slurm components have been upgraded, upgrade additional |
| components along with their systemd unit files (if used) and client commands |
| using the normal method for your system, then restart any affected daemons.</p> |
| |
| <h3 id="custom_plugins">Customized Slurm Plugins |
| <a class="slurm_link" href="#custom_plugins"></a></h3> |
| |
| <p>Slurm's main public API library (libslurm.so.X.0.0) increases its version |
| number with every major release, so any application linked against it should be |
| recompiled after an upgrade. This includes locally developed Slurm plugins.</p> |
| |
| <p>If you have built your own version of Slurm plugins, besides having to |
| recompile them, they will likely need modification to support the new version |
| of Slurm. It is common for plugins to add new functions and function arguments |
| during major updates. See the RELEASE_NOTES file for details about these |
| changes.</p> |
| |
| <p>Slurm's PMI-1 (libpmi.so.0.0.0) and PMI-2 (libpmi2.so.0.0.0) public API |
| libraries do not change between releases and are meant to be permanently |
| fixed. This means that linking against either of them will not require you |
| to recompile the application after a Slurm upgrade, except in the unlikely |
| event that one of them changes. It is unlikely because these libraries must |
| be compatible with any other PMI-1 and PMI-2 implementations. If there was a |
| change, it would be announced in the RELEASE_NOTES and would only happen on |
| a major release.</p> |
| |
| <p>As an example, MPI stacks like OpenMPI and MVAPICH2 link against Slurm's |
| PMI-1 and/or PMI-2 API, but not against our main public API. This means that at |
| the time of writing this documentation, you don't need to recompile these |
| stacks after a Slurm upgrade. One known exception is MPICH. When MPICH is |
| compiled with Slurm support and with the Hydra Process Manager, it will use |
| the Slurm API to obtain job information. This link means you will need to |
| recompile the MPICH stack after an upgrade.</p> |
| |
| <p>One easy way to know if an application requires a recompile is to inspect all |
| of its ELF files with 'ldd' and grep for 'slurm'. If you see a versioned |
| 'libslurm.so.x.y.z' reference, then the application will likely need to be |
| recompiled.</p> |
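<p>The same check can be scripted over captured <code>ldd</code> output. This
minimal sketch looks for versioned <code>libslurm.so</code> references and
ignores the PMI libraries. The function names are hypothetical, and the library
version numbers shown in the usage note are illustrative only.</p>

```python
import re

# Versioned references such as libslurm.so.40 or libslurm.so.40.0.0.
_VERSIONED_LIBSLURM = re.compile(r"\blibslurm\.so\.\d+(?:\.\d+)*")


def linked_libslurm(ldd_output):
    """Return the distinct versioned libslurm references found in ldd output."""
    return sorted(set(_VERSIONED_LIBSLURM.findall(ldd_output)))


def needs_recompile(ldd_output):
    """True if the binary links against a versioned libslurm."""
    return bool(linked_libslurm(ldd_output))
```

<p>For instance, output containing
<code>libslurm.so.40 =&gt; /usr/lib64/libslurm.so.40</code> would be flagged,
while output that only references <code>libpmi2.so.0</code> would not.</p>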
| |
| <h2 id="seamless_upgrades">Seamless Upgrades |
| <a class="slurm_link" href="#seamless_upgrades"></a></h2> |
| |
| <p>In environments where the Slurm build process is customized, it is possible |
| to install a new version of Slurm to a unique directory and use a symbolic link |
| to point the directory in your PATH to the version of Slurm you would like to |
| use. This allows you to install the new version before you are in a maintenance |
| period as well as easily switch between versions should you need to roll |
| back for any reason. It also avoids potential problems with library conflicts |
| that might arise from installing different versions to the same directory.</p> |
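<p>Switching the symbolic link can be done atomically, so that the path never
points at a missing or half-updated target. The sketch below assumes a POSIX
system. The directory layout in the usage note and the function name are
hypothetical.</p>

```python
import os


def switch_version(link_path, target_dir):
    """Point `link_path` at `target_dir` without a window where it is missing."""
    # Create the new link under a temporary name first.
    temp_link = f"{link_path}.tmp-{os.getpid()}"
    os.symlink(target_dir, temp_link)
    # Renaming over the old link replaces it in a single step on POSIX.
    os.replace(temp_link, link_path)
```

<p>For example, <code>switch_version("/opt/slurm/current",
"/opt/slurm/25.05")</code> would repoint a link that is referenced from PATH.
Daemons must still be stopped and started in the order described in the
<a href="#procedure">upgrade procedure</a>.</p>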
| |
| <p style="text-align:center;">Last modified 27 August 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |