| <!--#include virtual="header.txt"--> |
| |
| <h1>Network Configuration Guide</h1> |
| |
| <h2 id="contents">Contents<a class="slurm_link" href="#contents"></a></h2> |
| <ul> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#slurmctld">Communication for slurmctld</a></li> |
| <li><a href="#slurmdbd">Communication for slurmdbd</a></li> |
| <li><a href="#slurmd">Communication for slurmd</a></li> |
| <li><a href="#client">Communication for client commands</a></li> |
| <li><a href="#failover">Communication with multiple controllers</a></li> |
| <li><a href="#multi">Communication with multiple clusters</a></li> |
| <li><a href="#federation">Communication in a federation</a></li> |
| <li><a href="#ipv6">Communication with IPv6</a></li> |
| </ul> |
| |
| <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2> |
| |
| <p>A Slurm cluster has many components that need to communicate with |
| each other. Some sites have security requirements that prevent them from |
| opening all communication between machines, so they need to selectively |
| open only the ports that are necessary. This document describes what each |
| component needs in order to talk to the others.</p> |
| |
| <p>Below is a diagram of a fairly typical cluster, with <b>slurmctld</b> |
| and <b>slurmdbd</b> on separate machines. In smaller clusters, MySQL can run |
| on the same machine as the <b>slurmdbd</b>, but in most cases it is preferable |
| to have it run on a dedicated machine. <b>slurmd</b> runs on the |
| compute nodes and the client commands can be installed and run from machines |
| of your choosing.</p> |
| |
| <div class="figure"> |
| <img src="network_standard.gif" width="550"><br> |
| Typical configuration |
| </div> |
| |
| <h2 id="slurmctld">Communication for slurmctld |
| <a class="slurm_link" href="#slurmctld"></a> |
| </h2> |
| |
| <p>The default port used by <b>slurmctld</b> to listen for incoming requests |
| is <u>6817</u>. This port can be changed with the |
| <a href="slurm.conf.html#OPT_SlurmctldPort">SlurmctldPort</a> slurm.conf |
| parameter. Slurmctld listens for incoming requests on that port and responds |
| back on the same connection opened by the requester.</p> |
| |
| <p>The machine running <b>slurmctld</b> needs to be able to establish |
| outbound connections as well. It needs to communicate with <b>slurmdbd</b> |
| on port <u>6819</u> by default (see the <a href="#slurmdbd">slurmdbd</a> |
| section for information on how to change this). It also needs to communicate |
| with <b>slurmd</b> on the compute nodes on port <u>6818</u> by default (see the |
| <a href="#slurmd">slurmd</a> section for information on how to change |
| this).</p> |
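<p>As a minimal sketch, the defaults described above correspond to the
following slurm.conf fragment. The values are simply the default ports named
in this document; writing them out is optional and is shown only to make the
ports that a firewall must allow visible in one place.</p>

```conf
# slurm.conf (fragment): default ports written out explicitly
# Port on which slurmctld listens for incoming requests
SlurmctldPort=6817
# Port on which each slurmd listens for requests from slurmctld
SlurmdPort=6818
```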
| |
| <p>By default, the <b>slurmctld</b> will listen for IPv4 traffic. IPv6 |
| communication can be enabled by adding <u>EnableIPv6</u> to the |
| <a href="slurm.conf.html#OPT_CommunicationParameters"> |
| CommunicationParameters</a> in your slurm.conf. With IPv6 enabled, you can |
| disable IPv4 by adding <u>DisableIPv4</u> to the |
| <a href="slurm.conf.html#OPT_CommunicationParameters"> |
| CommunicationParameters</a>. These settings must match in both slurmdbd.conf |
| and slurm.conf (see the <a href="#slurmdbd">slurmdbd</a> section).</p> |
| |
| <h2 id="slurmdbd">Communication for slurmdbd |
| <a class="slurm_link" href="#slurmdbd"></a> |
| </h2> |
| |
| <p>The default port used by <b>slurmdbd</b> to listen for incoming requests |
| is <u>6819</u>. This port can be changed with the |
| <a href="slurmdbd.conf.html#OPT_DbdPort">DbdPort</a> slurmdbd.conf parameter. |
| Slurmdbd listens for incoming requests on that port and responds back |
| on the same connection opened by the requester.</p> |
| |
| <p>The machine running <b>slurmdbd</b> needs to be able to reach the |
| MySQL or MariaDB server on port <u>3306</u> by default. The port that |
| <b>slurmdbd</b> connects to can be changed with the |
| <a href="slurmdbd.conf.html#OPT_StoragePort">StoragePort</a> slurmdbd.conf |
| parameter and must match the port configured on the database server. |
| The machine also needs to be able to initiate |
| a connection to <b>slurmctld</b> on port <u>6817</u> by default (see the |
| <a href="#slurmctld">slurmctld</a> section for information on how to |
| change this).</p> |
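<p>For illustration, a slurmdbd.conf fragment that spells out the two ports
discussed in this section might look like the sketch below. Both values are
the defaults stated above, so a site that keeps the defaults does not need
these lines.</p>

```conf
# slurmdbd.conf (fragment): default ports written out explicitly
# Port on which slurmdbd listens for incoming requests
DbdPort=6819
# Port of the MySQL or MariaDB server that slurmdbd connects to
StoragePort=3306
```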
| |
| <p>By default, the <b>slurmdbd</b> will listen for IPv4 traffic. IPv6 |
| communication can be enabled by adding <u>EnableIPv6</u> to the |
| <a href="slurmdbd.conf.html#OPT_CommunicationParameters"> |
| CommunicationParameters</a> in your slurmdbd.conf. With IPv6 enabled, you can |
| disable IPv4 by adding <u>DisableIPv4</u> to the |
| <a href="slurmdbd.conf.html#OPT_CommunicationParameters"> |
| CommunicationParameters</a>. These settings must match in both slurmdbd.conf |
| and slurm.conf (see the <a href="#slurmctld">slurmctld</a> section).</p> |
| |
| <h2 id="slurmd">Communication for slurmd |
| <a class="slurm_link" href="#slurmd"></a> |
| </h2> |
| |
| <p>The default port used by <b>slurmd</b> to listen for incoming requests |
| from <b>slurmctld</b> is <u>6818</u>. This port can be changed with the |
| <a href="slurm.conf.html#OPT_SlurmdPort">SlurmdPort</a> slurm.conf |
| parameter.</p> |
| |
| <p>The machines running <b>srun</b> also use a range of ports to |
| communicate with <b>slurmstepd</b>. By default these ports are chosen |
| at random from the ephemeral port range, but you can use the |
| <a href="slurm.conf.html#OPT_SrunPortRange">SrunPortRange</a> slurm.conf |
| parameter to specify a range of ports from which they can be chosen. This is |
| necessary for login nodes that are behind a firewall.</p> |
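<p>A sketch of such a setting is shown below. The range itself is an arbitrary
example, not a recommendation; pick one that is large enough for the number of
concurrent <b>srun</b> processes on the login node. The same range would
then presumably need to be allowed through the login node's firewall for
connections coming from the compute nodes.</p>

```conf
# slurm.conf (fragment): restrict srun to a fixed port range
# The range below is an example value only
SrunPortRange=60001-63000
```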
| |
| <p>The machines running <b>slurmd</b> need to be able to establish |
| connections with <b>slurmctld</b> on port <u>6817</u> by default (see |
| the <a href="#slurmctld">slurmctld</a> section for information on how to |
| change this).</p> |
| |
| <p>By default, the <b>slurmd</b> communicates over IPv4. Please see the |
| <a href="#slurmctld">slurmctld</a> section for details on how to change this |
| as the slurm.conf parameter affects <b>slurmd</b> daemons as well.</p> |
| |
| <h2 id="client">Communication for client commands |
| <a class="slurm_link" href="#client"></a> |
| </h2> |
| |
| <p>The majority of the client commands will communicate with <b>slurmctld</b> |
| on port <u>6817</u> by default (see the <a href="#slurmctld">slurmctld</a> |
| section for information on how to change this) to get the information they |
| need. This includes the following commands:</p> |
| <dl> |
| <dd>salloc |
| <dd>sacctmgr |
| <dd>sbatch |
| <dd>sbcast |
| <dd>scancel |
| <dd>scontrol |
| <dd>sdiag |
| <dd>sinfo |
| <dd>sprio |
| <dd>squeue |
| <dd>sshare |
| <dd>sstat |
| <dd>strigger |
| <dd>sview |
| </dl> |
| |
| <p>There are also commands that communicate directly with <b>slurmdbd</b> on |
| port <u>6819</u> by default (see the <a href="#slurmdbd">slurmdbd</a> section |
| for information on how to change this). The following commands get information |
| from <b>slurmdbd</b>:</p> |
| <dl> |
| <dd>sacct |
| <dd>sacctmgr |
| <dd>sreport |
| </dl> |
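<p>One way to verify from a login node that these two paths are open is a
plain TCP connection test. The sketch below is not part of Slurm; the function
names are hypothetical, and the ports are the defaults named in this document.
It only checks that the port accepts a TCP connection, not that Slurm
authentication or the daemon itself is working.</p>

```python
import socket

# Default ports named in this document; adjust if your site overrides them.
DEFAULT_PORTS = {"slurmctld": 6817, "slurmd": 6818, "slurmdbd": 6819}


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


def check_client_paths(ctld_host, dbd_host):
    """Check the two paths that client commands need, using default ports."""
    return {
        "slurmctld": can_connect(ctld_host, DEFAULT_PORTS["slurmctld"]),
        "slurmdbd": can_connect(dbd_host, DEFAULT_PORTS["slurmdbd"]),
    }
```

<p>Calling <b>check_client_paths()</b> with the host names of your
<b>slurmctld</b> and <b>slurmdbd</b> machines returns a dictionary with one
True or False entry per daemon.</p>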
| |
| <p>When a user starts a job using <b>srun</b>, there has to be a communication |
| path from the machine where <b>srun</b> is called to the node(s) allocated |
| to the job. Communication follows the sequence outlined below:</p> |
| |
| <dl> |
| <dd>1a. srun sends job allocation request to slurmctld |
| <dd>1b. slurmctld grants allocation and returns details |
| <dd>2a. srun sends step create request to slurmctld |
| <dd>2b. slurmctld responds with step credential |
| <dd>3. srun opens sockets for I/O |
| <dd>4. srun forwards credential with task info to slurmd |
| <dd>5. slurmd forwards request as needed (per fanout) |
| <dd>6. slurmd forks/execs slurmstepd |
| <dd>7. slurmstepd connects I/O and launches tasks |
| <dd>8. On task termination, slurmstepd notifies srun |
| <dd>9. srun notifies slurmctld of job termination |
| <dd>10. slurmctld verifies termination of all processes via slurmd and |
| releases resources for next job |
| </dl> |
| |
| <div class="figure"> |
| <img src="network_srun.gif" width="550"><br> |
| srun communication |
| </div> |
| |
| <h2 id="failover">Communication with multiple controllers |
| <a class="slurm_link" href="#failover"></a> |
| </h2> |
| |
| <p>You can configure a secondary <b>slurmctld</b> and/or <b>slurmdbd</b> to |
| serve as a fallback if the primary should go down. The ports involved don't |
| change, but there are additional communication paths that need to be taken |
| into consideration. The client commands need to be able to reach both |
| machines running <b>slurmctld</b> as well as both machines running |
| <b>slurmdbd</b>. Both instances of <b>slurmctld</b> need to be able to |
| reach both instances of <b>slurmdbd</b> and each <b>slurmdbd</b> needs |
| to be able to reach the MySQL server.</p> |
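<p>The paths listed above can be enumerated programmatically when preparing
firewall rules. The following sketch is only an illustration: the function
name and the host names in the usage note are hypothetical, the ports are the
defaults named in this document, and the paths involving <b>slurmd</b> are
left out because this paragraph does not change them.</p>

```python
from itertools import product

# Default ports named in this document.
SLURMCTLD_PORT = 6817
SLURMDBD_PORT = 6819
MYSQL_PORT = 3306


def required_flows(clients, ctld_hosts, dbd_hosts, db_host):
    """Return (source, destination, port) tuples for the paths listed above."""
    flows = []
    # Client commands must reach every slurmctld and every slurmdbd.
    flows += [(c, h, SLURMCTLD_PORT) for c, h in product(clients, ctld_hosts)]
    flows += [(c, h, SLURMDBD_PORT) for c, h in product(clients, dbd_hosts)]
    # Every slurmctld must reach every slurmdbd.
    flows += [(s, d, SLURMDBD_PORT) for s, d in product(ctld_hosts, dbd_hosts)]
    # Every slurmdbd must reach the database server.
    flows += [(d, db_host, MYSQL_PORT) for d in dbd_hosts]
    return flows
```

<p>For example, <b>required_flows(["login1"], ["ctl1", "ctl2"],
["dbd1", "dbd2"], "db1")</b> yields ten entries: two from the client to the
controllers, two from the client to the <b>slurmdbd</b> hosts, four between
controllers and <b>slurmdbd</b> hosts, and two to the database.</p>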
| |
| <div class="figure"> |
| <img src="network_failover.gif" width="550"><br> |
| Fallback slurmctld and slurmdbd |
| </div> |
| |
| <h2 id="multi">Communication with multiple clusters |
| <a class="slurm_link" href="#multi"></a> |
| </h2> |
| |
| <p>In environments where multiple <b>slurmctld</b> instances share the same |
| <b>slurmdbd</b>, you can configure each cluster to stand on its own and allow |
| users to specify which cluster to submit their jobs to. The ports |
| used by the different daemons don't change, but all instances of |
| <b>slurmctld</b> need to be able to communicate with the same instance of |
| <b>slurmdbd</b>. You can read more about multi-cluster configurations in the |
| <a href="multi_cluster.html">Multi-Cluster Operation</a> |
| documentation.</p> |
| |
| <div class="figure"> |
| <img src="network_multi_cluster.gif" width="550"><br> |
| Multi-Cluster configuration |
| </div> |
| |
| <h2 id="federation">Communication in a federation |
| <a class="slurm_link" href="#federation"></a> |
| </h2> |
| |
| <p>Slurm also provides the ability to schedule jobs in a peer-to-peer fashion |
| between multiple clusters, allowing jobs to run on the cluster that has |
| available resources first. The difference in communication needs between this |
| and a multi-cluster configuration is that the two instances of <b>slurmctld</b> |
| need to be able to communicate with each other. More details are available |
| in the <a href="federation.html">Federation</a> |
| documentation.</p> |
| |
| <div class="figure"> |
| <img src="network_federation.gif" width="550"><br> |
| Federation configuration |
| </div> |
| |
| <h2 id="ipv6">Communication with IPv6 |
| <a class="slurm_link" href="#ipv6"></a> |
| </h2> |
| |
| <p>The <b>slurmctld</b>, <b>slurmdbd</b>, and <b>slurmd</b> daemons will, |
| by default, communicate using IPv4, but they can be configured to use IPv6. |
| This is handled by setting <b>CommunicationParameters=EnableIPv6</b> |
| in your slurm.conf and slurmdbd.conf, then restarting all of the daemons. |
| The <b>slurmd</b> may operate over IPv4 OR IPv6 in this mode. IPv4 can be |
| disabled by setting <b>CommunicationParameters=EnableIPv6,DisableIPv4</b>. |
| In this mode, everything must have a valid IPv6 address or the connection will |
| fail.</p> |
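<p>Written out as a configuration fragment, the setting described above looks
like this. The same line goes in both files, and the commented-out variant is
the IPv6-only form.</p>

```conf
# slurm.conf and slurmdbd.conf (same line in both files)
CommunicationParameters=EnableIPv6
# IPv6 only: enable IPv6 and disable IPv4
# CommunicationParameters=EnableIPv6,DisableIPv4
```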
| |
| <p>The <b>slurmctld</b> expects a node to map to a single IP address (which |
| will be the first address returned when looking up the IP of the node with |
| <b>getaddrinfo()</b>). If you enable IPv6 on an existing cluster and the |
| nodes have IPv6 addresses, you must restart the <b>slurmd</b> daemons for |
| communication over IPv6 to be established.</p> |
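<p>To see which address comes first for a given node name, you can query the
system resolver directly. The sketch below uses Python's
<b>socket.getaddrinfo()</b>, which wraps the C library call of the same name.
It is an approximation under stated assumptions: it should be run on the
<b>slurmctld</b> host, the function name is hypothetical, and Slurm may pass
different lookup hints, so the result is indicative rather than
authoritative.</p>

```python
import socket


def first_address(node_name, port=6818):
    """Return (family_name, ip) for the first getaddrinfo() result.

    The port defaults to 6818, the default slurmd port in this document;
    it does not affect which address is returned first.
    """
    results = socket.getaddrinfo(node_name, port, type=socket.SOCK_STREAM)
    family, _type, _proto, _canonname, sockaddr = results[0]
    return socket.AddressFamily(family).name, sockaddr[0]
```

<p>A result whose first element is <b>AF_INET</b> means the resolver returned
an IPv4 address first for that name, and <b>AF_INET6</b> means an IPv6
address came first.</p>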
| |
| <p>The presence of <span>precedence ::ffff:0:0/96 100</span> in /etc/gai.conf |
| will cause IPv4 addresses to be returned BEFORE an IPv6 address. This might |
| cause a situation where you have enabled IPv6 for Slurm, but are still seeing nodes |
| communicate with IPv4. If it is unclear which address is being used, |
| you can run <span>scontrol setdebugflags +NET</span> to enable network-related |
| debug logging in your slurmctld.log.</p> |
| |
| <p>If both IPv4 and IPv6 are enabled, the loopback interface may still resolve to |
| 127.0.0.1. This is not necessarily an indication of a problem.</p> |
| |
| <p style="text-align:center;">Last modified 25 November 2020</p> |
| |
| <!--#include virtual="footer.txt"--> |