| <!--#include virtual="header.txt"--> |
| |
| <h1>SLURM User and Administrator Guide for Cray Systems</h1> |
| |
| <ul> |
| <li><a href="#user_guide">User Guide</a></li> |
| <li><a href="#admin_guide">Administrator Guide</a></li> |
| <li><a href="http://www.cray.com">Cray</a></li> |
| </ul> |
| |
| <HR SIZE=4> |
| |
| <h2><a name="user_guide">User Guide</a></h2> |
| |
| <p>This document describes the unique features of SLURM on Cray computers. |
You should be familiar with SLURM's mode of operation on Linux clusters
| before studying the differences in Cray system operation described in this |
| document.</p> |
| |
| <p>Since version 2.3, SLURM is designed to operate as a job scheduler over |
| Cray's Application Level Placement Scheduler (ALPS). |
| Use SLURM's <i>sbatch</i> or <i>salloc</i> commands to create a resource |
| allocation in ALPS. |
| Then use ALPS' <i>aprun</i> command to launch parallel jobs within the resource |
| allocation. |
The resource allocation is terminated once the batch script or the
| <i>salloc</i> command terminates. |
| Alternately there is an <i>aprun</i> wrapper distributed with SLURM in |
| <i>contribs/cray/srun</i> which will translate <i>srun</i> options |
| into the equivalent <i>aprun</i> options. This wrapper will also execute |
| <i>salloc</i> as needed to create a job allocation in which to run the |
<i>aprun</i> command. The <i>srun</i> script adds two new options:
<i>--man</i>, which prints a summary of the options, including notes about
which <i>srun</i> options are not supported, and <i>--alps</i>, which can be
used to specify <i>aprun</i> options that lack an equivalent within <i>srun</i>.
For example, <i>srun --alps="-a xt" -n 4 a.out</i>.
| Since <i>aprun</i> is used to launch tasks (the equivalent of a SLURM |
| job step), the job steps will not be visible using SLURM commands. |
| Other than SLURM's <i>srun</i> command being replaced by <i>aprun</i> |
| and the job steps not being visible, all other SLURM commands will operate |
| as expected. Note that in order to build and install the aprun wrapper |
| described above, execute "configure" with the <i>--with-srun2aprun</i> |
| option or add <i>%_with_srun2aprun 1</i> to your <i>~/.rpmmacros</i> |
| file. This option is set with RPMs from Cray.</p> |
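<p>As an illustrative sketch (the executable name, node count and task count below are
placeholders; 24-core nodes are assumed, as in the memory example later in this document),
a minimal batch script simply pairs an <i>sbatch</i> allocation with an <i>aprun</i> launch:</p>
<pre>
#SBATCH --comment="allocate 2 nodes via SLURM/ALPS, launch tasks with aprun"
#SBATCH --nodes=2
#SBATCH --ntasks=48
aprun -n 48 ./my_exe
</pre>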
| |
| <h3>Node naming and node geometry on Cray XT/XE systems</h3> |
| <p>SLURM node names will be of the form "nid#####" where "#####" is a five-digit sequence number. |
Other information available about a node is its XYZ coordinate in the node's <i>NodeAddr</i>
field and its component label in the <i>NodeHostName</i> field.
The format of the component label is "c#-#c#s#n#" where the "#" fields represent in order:
cabinet, row, cage, blade or slot, and node.
For example "c0-1c2s5n3" is cabinet 0, row 1, cage 2, slot 5 and node 3.</p>
| |
| <p>Cray XT/XE systems come with a 3D torus by default. On smaller systems the cabling in X dimension is |
| omitted, resulting in a two-dimensional torus (1 x Y x Z). On Gemini/XE systems, pairs of adjacent nodes |
| (nodes 0/1 and 2/3 on each blade) share one network interface each. This causes the same Y coordinate to |
| be assigned to those nodes, so that the number of distinct torus coordinates is half the number of total |
| nodes.</p> |
| <p>The SLURM <i>smap</i> and <i>sview</i> tools can visualize node torus positions. Clicking on a particular |
| node shows its <i>NodeAddr</i> field, which is its (X,Y,Z) torus coordinate base-36 encoded as a 3-character |
| string. For example, a NodeAddr of '07A' corresponds to the coordinates X = 0, Y = 7, Z = 10. |
| The <i>NodeAddr</i> of a node can also be shown using 'scontrol show node nid#####'.</p> |
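<p>As a small illustration (the node ID below is a placeholder), the base-36 coordinate
string reported in <i>NodeAddr</i> can be decoded with ordinary bash arithmetic:</p>
<pre>
scontrol show node nid00123 | grep NodeAddr
addr=07A     # the NodeAddr value reported above
echo "X=$((36#${addr:0:1})) Y=$((36#${addr:1:1})) Z=$((36#${addr:2:1}))"
</pre>
<p>which prints "X=0 Y=7 Z=10" for the '07A' example above.</p>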
| |
| <p>Please note that the sbatch/salloc options "<i>--geometry</i>" and "<i>--no-rotate</i>" are BlueGene-specific |
| and have no impact on Cray systems. Topological node placement depends on what Cray makes available via the |
| ALPS_NIDORDER configuration option (see below).</p> |
| |
| <h3>Specifying thread depth</h3> |
| <p>For threaded applications, use the <i>--cpus-per-task</i>/<i>-c</i> parameter of sbatch/salloc to set |
| the thread depth per node. This corresponds to mppdepth in PBS and to the aprun -d parameter. Please |
| note that SLURM does not set the OMP_NUM_THREADS environment variable. Hence, if an application spawns |
| 4 threads, an example script would look like</p> |
| <pre> |
| #SBATCH --comment="illustrate the use of thread depth and OMP_NUM_THREADS" |
| #SBATCH --ntasks=3 |
| #SBATCH -c 4 |
| export OMP_NUM_THREADS=4 |
| aprun -n 3 -d $OMP_NUM_THREADS ./my_exe |
| </pre> |
| |
| <h3>Specifying number of tasks per node</h3> |
| <p>SLURM uses the same default as ALPS, assigning each task to a single core/CPU. In order to |
| make more resources available per task, you can reduce the number of processing elements |
| per node (<i>aprun -N</i> parameter, <i>mppnppn</i> in PBS) with the |
| <i>--ntasks-per-node</i> option of <i>sbatch/salloc</i>. |
This is particularly necessary when tasks require more memory than the per-CPU default.</p>
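<p>As a sketch (a 24-core node is assumed here, matching the memory example below), halving
the number of tasks per node doubles the resources available to each task:</p>
<pre>
#SBATCH --comment="12 tasks per 24-core node instead of the default of one task per core"
#SBATCH --ntasks=24
#SBATCH --ntasks-per-node=12
aprun -n 24 -N 12 ./my_exe
</pre>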
| |
| <h3>Specifying per-task memory</h3> |
| <p>In Cray terminology, a task is also called a "processing element" (PE), hence below we |
| refer to the per-task memory and "per-PE" memory interchangeably. The per-PE memory |
| requested through the batch system corresponds to the <i>aprun -m</i> parameter.</p> |
| |
| <p>Due to the implicit default assumption that 1 task runs per core/CPU, the default memory |
| available per task is the <i>per-CPU share</i> of node_memory / number_of_cores. For |
example, on an XT5 system with 16000MB per 12-core node, the per-CPU share is 1333MB.</p>
| |
<p>If nothing else is specified, the <i>--mem</i> option to sbatch/salloc can only be used to
<i>reduce</i> the per-PE memory below the per-CPU share. The same restriction applies to the
<i>--mem-per-cpu</i> option, which in addition is ignored if the user forgets to set
--ntasks/-n.
Thus, the preferred way of specifying memory is the more general <i>--mem</i> option.</p>
| |
| <p>To <i>increase</i> the per-PE memory settable via the <i>--mem</i> option requires making |
| more per-task resources available using the <i>--ntasks-per-node</i> option to sbatch/salloc. |
This allows <i>--mem</i> to request up to node_memory / ntasks_per_node megabytes.</p>
| |
<p>When <i>--ntasks-per-node</i> is 1, the entire node memory may be requested by the application.
Setting <i>--ntasks-per-node</i> to the number of cores per node yields the minimum,
i.e. the default per-CPU share.</p>
| |
<p>For all cases in between these extremes, set --mem=per_task_memory (or --mem-per-cpu=memory_per_cpu; note that node CPU count and task count may differ) and</p>
| <pre> |
| --ntasks-per-node=floor(node_memory / per_task_memory) |
| </pre> |
| <p>whenever per_task_memory needs to be larger than the per-CPU share.</p> |
| |
| <p><b>Example:</b> An application with 64 tasks needs 7500MB per task on a cluster with 32000MB and 24 cores |
| per node. Hence ntasks_per_node = floor(32000/7500) = 4.</p> |
| <pre> |
| #SBATCH --comment="requesting 7500MB per task on 32000MB/24-core nodes" |
| #SBATCH --ntasks=64 |
| #SBATCH --ntasks-per-node=4 |
| #SBATCH --mem=30000 |
| </pre> |
| <p>If you would like to fine-tune the memory limit of your application, you can set the same parameters in |
| a salloc session and then check directly, using</p> |
| <pre> |
| apstat -rvv -R $BASIL_RESERVATION_ID |
| </pre> |
| <p>to see how much memory has been requested.</p> |
| |
| <h3>Using aprun -B</h3> |
| <p>CLE 3.x allows a nice <i>aprun</i> shortcut via the <i>-B</i> option, which |
| reuses all the batch system parameters (<i>--ntasks, --ntasks-per-node, |
| --cpus-per-task, --mem</i>) at application launch, as if the corresponding |
| (<i>-n, -N, -d, -m</i>) parameters had been set; see the aprun(1) manpage |
| on CLE 3.x systems for details.</p> |
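<p>As a hedged sketch (the executable name is a placeholder), the thread-depth example above
could then be written more compactly on CLE 3.x:</p>
<pre>
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=4
export OMP_NUM_THREADS=4
aprun -B ./my_exe    # -B reuses the batch values of -n, -N, -d and -m
</pre>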
| |
| <h3>Node ordering options</h3> |
| <p>SLURM honors the node ordering policy set for Cray's Application Level Placement Scheduler (ALPS). Node |
| ordering is a configurable system option (ALPS_NIDORDER in /etc/sysconfig/alps). The current |
| setting is reported by '<i>apstat -svv</i>' (look for the line starting with "nid ordering option") and |
| can not be changed at runtime. The resulting, effective node ordering is revealed by '<i>apstat -no</i>' |
| (if no special node ordering has been configured, 'apstat -no' shows the |
| same order as '<i>apstat -n</i>').</p> |
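<p>For example, the current setting can be checked with:</p>
<pre>
apstat -svv | grep "nid ordering"
</pre>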
| |
| <p>SLURM uses exactly the same order as '<i>apstat -no</i>' when selecting |
| nodes for a job. With the <i>--contiguous</i> option to <i>sbatch/salloc</i> |
| you can request a contiguous (relative to the current ALPS nid ordering) set |
| of nodes. Note that on a busy system there is typically more fragmentation, |
| hence it may take longer (or even prove impossible) to allocate contiguous |
| sets of a larger size.</p> |
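<p>For example, to request four nodes that are contiguous with respect to the current
ALPS nid ordering:</p>
<pre>
salloc -N 4 --contiguous
</pre>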
| |
| <p>Cray/ALPS node ordering is a topic of ongoing work, some information can be found in the CUG-2010 paper |
| "<i>ALPS, Topology, and Performance</i>" by Carl Albing and Mark Baker.</p> |
| |
| <h3>GPU Use</h3> |
| |
| <p>Users may specify GPU memory required per node using the <i>--gres=gpu_mem:#</i> |
| option to any of the commands used to create a job allocation/reservation.</p> |
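<p>For example (the script name is a placeholder, and the "2g" count assumes the same syntax
used for the node <i>Gres</i> definition in the sample configuration in the Administrator
Guide below), to request 2 GB of GPU memory per node on two nodes:</p>
<pre>
sbatch --gres=gpu_mem:2g -N 2 my_script.bash
</pre>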
| |
| <h3>Front-End Node Use</h3> |
| |
<p>If you want to be allocated resources on a front-end node and no compute
nodes (typically used for pre- or post-processing), then submit a
batch job with a node count specification of zero.</p>
| <pre> |
| sbatch -N0 pre_process.bash |
| </pre> |
| <p><b>Note</b>: Support for Cray job allocations with zero compute nodes was |
| added to SLURM version 2.4. Earlier versions of SLURM will return an error for |
| zero compute node job requests.</p> |
| <p><b>Note</b>: Job allocations with zero compute nodes can only be made in |
| SLURM partitions explicitly configured with <b>MinNodes=0</b> (the default |
| minimum node count for a partition is one compute node).</p> |
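<p>A partition permitting such zero-node allocations might look like this in
<i>slurm.conf</i> (the partition and node names are placeholders):</p>
<pre>
PartitionName=prepost Nodes=nid00[002-013] MinNodes=0 State=UP
</pre>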
| |
| <HR SIZE=4> |
| |
| <h2><a name="admin_guide">Administrator Guide</a></h2> |
| |
| <h3>Install supporting RPMs</h3> |
| |
<p>The build requires a few -devel RPMs listed below. You can obtain these from
SuSE/Novell.
<ul>
<li>CLE 2.x uses SuSE SLES 10 packages (RPMs may be on the normal ISOs)</li>
<li>CLE 3.x uses SuSE SLES 11 packages (RPMs are on the SDK ISOs; there
are two SDK ISO files)</li>
</ul></p>
| |
| <p>You can check by logging onto the boot node and running</p> |
| <pre> |
| boot: # xtopview |
| default: # rpm -qa |
| </pre> |
| |
| <p>The list of packages that should be installed is:</p> |
| <ul> |
| <li>expat-2.0.xxx</li> |
| <li>libexpat-devel-2.0.xxx</li> |
| <li>cray-MySQL-devel-enterprise-5.0.64 (this should be on the Cray ISO)</li> |
| </ul> |
| |
<p>For example, copying the MySQL devel RPM to the boot node can be done like this:</p>
<pre>
smw: # mkdir mnt
smw: # mount -o loop,ro xe-sles11sp1-trunk.201107070231a03.iso mnt
smw: # find mnt -name cray-MySQL-devel-enterprise\*
mnt/craydist/xt-packages/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm
smw: # scp mnt/craydist/xt-packages/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm root@boot:/rr/current/software
</pre>
| |
| <p>Then switch to boot node and run:</p> |
| <pre> |
| boot: # xtopview |
| default: # rpm -ivh /software/cray-MySQL-devel-enterprise-5.0.64.1.0000.2899.19.2.x86_64.rpm |
| default: # exit |
| </pre> |
| |
<p>All Cray-specific PrgEnv and compiler modules should be removed, and root
privileges will be required to install these files.</p>
| |
| <h3>Install Munge</h3> |
| |
| <p>Note the Munge installation process on Cray systems differs |
| somewhat from that described in the |
| <a href="http://code.google.com/p/munge/wiki/InstallationGuide"> |
| MUNGE Installation Guide</a>.</p> |
| |
<p>Munge is the authentication daemon required by SLURM. You can get
Munge RPMs from Cray. Use the method below to install and test it. The
Cray Munge RPM installs Munge in /opt/munge.</p>
| |
| <p>If needed copy the RPMs over to the boot node</p> |
| <pre> |
| login: # scp munge-*.rpm root@boot:/rr/current/software |
| </pre> |
| |
<p>Install the RPMs on the boot node. While this process creates a
Munge key, it cannot use the /etc/munge directory, so we create a
/opt/munge/key directory and place the key there instead.</p>
| <pre> |
| boot: # xtopview |
| default: # rpm -ivh /software/munge-*.x86_64.rpm |
| default: # mkdir /opt/munge/key |
| default: # dd if=/dev/urandom bs=1 count=1024 >/opt/munge/key/munge.key |
| default: # chmod go-rxw /opt/munge/key/munge.key |
| default: # chown daemon /opt/munge/key/munge.key |
| default: # perl -pi -e 's/#DAEMON_ARGS=/DAEMON_ARGS="--key-file \/opt\/munge\/key\/munge.key"/g' /etc/init.d/munge |
| default: # exit |
| </pre> |
| |
| <h3>Configure Munge</h3> |
| |
| <p>The following steps apply to each login node and the sdb, where |
| <ul> |
| <li>The <i>slurmd</i> or <i>slurmctld</i> daemon will run and/or</li> |
| <li>Users will be submitting jobs</li> |
| </ul></p> |
| |
| <pre> |
| sdb: # mkdir --mode=0711 -vp /var/lib/munge |
| sdb: # mkdir --mode=0700 -vp /var/log/munge |
| sdb: # mkdir --mode=0755 -vp /var/run/munge |
| sdb: # chown daemon /var/lib/munge |
| sdb: # chown daemon /var/log/munge |
| sdb: # chown daemon /var/run/munge |
| sdb: # /etc/init.d/munge start |
| |
| </pre> |
| |
| <p>Start the Munge daemon and test it.</p> |
| <pre> |
| login: # export PATH=/opt/munge/bin:$PATH |
| login: # munge -n |
| MUNGE:AwQDAAAEy341MRViY+LacxYlz+mchKk5NUAGrYLqKRUvYkrR+MJzHTgzSm1JALqJcunWGDU6k3vpveoDFLD7fLctee5+OoQ4dCeqyK8slfAFvF9DT5pccPg=: |
| login: # munge -n | unmunge |
| </pre> |
| |
<p>When done, verify network connectivity by executing the following (the
munged daemon must be started on the other-login-host as well):
<ul>
<li><i>munge -n | ssh other-login-host /opt/munge/bin/unmunge</i></li>
</ul></p>
| |
| <h3>Enable the Cray job service</h3> |
| |
| <p>This is a common dependency on Cray systems. ALPS relies on the Cray job service to |
| generate cluster-unique job container IDs (PAGG IDs). These identifiers are used by |
| ALPS to track running (aprun) job steps. The default (session IDs) is not unique |
| across multiple login nodes. This standard procedure is described in chapter 9 of |
| <a href="http://docs.cray.com/books/S-2393-4003/">S-2393</a> and takes only two |
| steps, both to be done on all 'login' class nodes (xtopview -c login):</p> |
| <ul> |
| <li>make sure that the /etc/init.d/job service is enabled (chkconfig) and started</li> |
| <li>enable the pam_job.so module from /opt/cray/job/default in /etc/pam.d/common-session<br/> |
| (NB: the default pam_job.so is very verbose, a simpler and quieter variant is provided |
| in contribs/cray.)</li> |
| </ul> |
| <p>The latter step is required only if you would like to run interactive |
| <i>salloc</i> sessions.</p> |
| <pre> |
| boot: # xtopview -c login |
| login: # chkconfig job on |
| login: # emacs -nw /etc/pam.d/common-session |
| (uncomment the pam_job.so line) |
| session optional /opt/cray/job/default/lib64/security/pam_job.so |
| login: # exit |
| boot: # xtopview -n 31 |
| node/31:# chkconfig job on |
| node/31:# emacs -nw /etc/pam.d/common-session |
| (uncomment the pam_job.so line as shown above) |
| </pre> |
| |
| <h3>Install and Configure SLURM</h3> |
| |
<p>SLURM can be built and installed as on any other computer, as described in the
<a href="quickstart_admin.html">Quick Start Administrator Guide</a>.
| You can also get current SLURM RPMs from Cray. An installation |
| process for the RPMs is described below. The |
| Cray SLURM RPMs install in /opt/slurm.</p> |
| |
<p><b>NOTE:</b> By default neither the <i>salloc</i> command nor the <i>srun</i>
command wrapper can be executed as a background process. This is done for two
| reasons:</p> |
| <ol> |
| <li>Only one ALPS reservation can be created from each session ID. The |
<i>salloc</i> command cannot change its session ID without disconnecting
| itself from the terminal and its parent process, meaning the process could not |
| be later put into the foreground or easily identified</li> |
| <li>To better identify every process spawned under the <i>salloc</i> process |
| using terminal foreground process group IDs</li> |
| </ol> |
<p>You can optionally enable <i>salloc</i> and <i>srun</i> to execute as
background processes by using the configure option
<i>"--enable-salloc-background"</i> (or the .rpmmacros option
<i>"%_with_salloc_background 1"</i>). However, doing so will result in failed
resource allocations
(<i>error: Failed to allocate resources: Requested reservation is in use</i>)
if they are not executed sequentially, and it will
increase the likelihood of orphaned processes. Specifically request
this version when requesting RPMs from Cray, as it is not enabled by default.</p>
| <!-- Example: |
| Modify srun script or ask user to execute "/usr/bin/setsid" |
| before salloc or srun command --> |
| <!-- Example: |
| salloc spawns zsh, zsh spawns bash, etc. |
| when salloc terminates, bash becomes a child of init --> |
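<p>If background execution has been enabled, one possible way to avoid the
one-reservation-per-session-ID limitation is to start each <i>salloc</i> in a session
of its own using <i>/usr/bin/setsid</i>; a sketch only (the script name is a placeholder):</p>
<pre>
/usr/bin/setsid salloc -N 2 ./my_job.bash &
</pre>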
| |
| <p>If needed copy the RPMs over to the boot node.</p> |
| <pre> |
| login: # scp slurm-*.rpm root@boot:/rr/current/software |
| </pre> |
| |
| <p>Install the RPMs on the boot node.</p> |
| <pre> |
| boot: # xtopview |
| default: # rpm -ivh /software/slurm-*.x86_64.rpm |
| <i>edit /etc/slurm/slurm.conf and /etc/slurm/cray.conf</i> |
| default: # exit |
| </pre> |
| |
| <p>When building SLURM's <i>slurm.conf</i> configuration file, use the |
| <i>NodeName</i> parameter to specify all batch nodes to be scheduled. |
| If nodes are defined in ALPS, but not defined in the <i>slurm.conf</i> file, a |
| complete list of all batch nodes configured in ALPS will be logged by |
| the <i>slurmctld</i> daemon when it starts. |
| One would typically use this information to modify the <i>slurm.conf</i> file |
| and restart the <i>slurmctld</i> daemon. |
| Note that the <i>NodeAddr</i> and <i>NodeHostName</i> fields should not be |
| configured, but will be set by SLURM using data from ALPS. |
<i>NodeAddr</i> will be set to the node's XYZ coordinate and is used by SLURM's
<i>smap</i> and <i>sview</i> commands.
| <i>NodeHostName</i> will be set to the node's component label. |
| The format of the component label is "c#-#c#s#n#" where the "#" fields |
| represent in order: cabinet, row, cage, blade or slot, and node. |
| For example "c0-1c2s5n3" is cabinet 0, row 1, cage 3, slot 5 and node 3.</p> |
| |
| <p>The <i>slurmd</i> daemons will not execute on the compute nodes, but will |
| execute on one or more front end nodes. |
| It is from here that batch scripts will execute <i>aprun</i> commands to |
| launch tasks. |
| This is specified in the <i>slurm.conf</i> file by using the |
<i>FrontendName</i> and optionally the <i>FrontendAddr</i> fields
| as seen in the examples below.</p> |
| |
| <p>Note that SLURM will by default kill running jobs when a node goes DOWN, |
| while a DOWN node in ALPS only prevents new jobs from being scheduled on the |
| node. To help avoid confusion, we recommend that <i>SlurmdTimeout</i> in the |
| <i>slurm.conf</i> file be set to the same value as the <i>suspectend</i> |
| parameter in ALPS' <i>nodehealth.conf</i> file.</p> |
| |
| <p>You need to specify the appropriate resource selection plugin (the |
| <i>SelectType</i> option in SLURM's <i>slurm.conf</i> configuration file). |
Configure <i>SelectType</i> to <i>select/cray</i>. The <i>select/cray</i>
plugin provides an interface to ALPS and also issues calls to the
<i>select/linear</i> plugin, which selects resources for jobs using a best-fit
| algorithm to allocate whole nodes to jobs (rather than individual sockets, |
| cores or threads).</p> |
| |
| <p>Note that the system topology is based upon information gathered from |
| the ALPS database and is based upon the ALPS_NIDORDER configuration in |
| <i>/etc/sysconfig/alps</i>. Excerpts of a <i>slurm.conf</i> file for |
use on a Cray system follow:</p>
| |
| <pre> |
| #--------------------------------------------------------------------- |
| # SLURM USER |
| #--------------------------------------------------------------------- |
| # SLURM user on cray systems must be root |
| # This requirement derives from Cray ALPS: |
| # - ALPS reservations can only be created by the job owner or root |
| # (confirmation may be done by other non-privileged users) |
| # - Freeing a reservation always requires root privileges |
| SlurmUser=root |
| |
| #--------------------------------------------------------------------- |
| # PLUGINS |
| #--------------------------------------------------------------------- |
| # Network topology (handled internally by ALPS) |
| TopologyPlugin=topology/none |
| |
| # Scheduling |
| SchedulerType=sched/backfill |
| |
| # Node selection: use the special-purpose "select/cray" plugin. |
| # Internally this uses select/linear, i.e. nodes are always allocated |
| # in units of nodes (other allocation is currently not possible, since |
# ALPS does not yet allow running more than one executable on the same
| # node, see aprun(1), section LIMITATIONS). |
| # |
# Add CR_Memory as parameter to support --mem/--mem-per-cpu.
| # GPU memory allocation supported as generic resource. |
| # NOTE: No gres/gpu_mem plugin is required, only generic SLURM GRES logic. |
| SelectType=select/cray |
| SelectTypeParameters=CR_Memory |
| GresTypes=gpu_mem |
| |
| # Proctrack plugin: only/default option is proctrack/sgi_job |
| # ALPS requires cluster-unique job container IDs and thus the /etc/init.d/job |
| # service needs to be started on all slurmd and login nodes, as described in |
| # S-2393, chapter 9. Due to this requirement, ProctrackType=proctrack/sgi_job |
| # is the default on Cray and need not be specified explicitly. |
| |
| #--------------------------------------------------------------------- |
| # PATHS |
| #--------------------------------------------------------------------- |
| SlurmdSpoolDir=/ufs/slurm/spool |
| StateSaveLocation=/ufs/slurm/spool/state |
| |
| # main logfile |
| SlurmctldLogFile=/ufs/slurm/log/slurmctld.log |
| # slurmd logfiles (using %h for hostname) |
| SlurmdLogFile=/ufs/slurm/log/%h.log |
| |
| # PIDs |
| SlurmctldPidFile=/var/run/slurmctld.pid |
| SlurmdPidFile=/var/run/slurmd.pid |
| |
| #--------------------------------------------------------------------- |
| # COMPUTE NODES |
| #--------------------------------------------------------------------- |
| # Return DOWN nodes to service when e.g. slurmd has been unresponsive |
| ReturnToService=1 |
| |
| # Configure the suspectend parameter in ALPS' nodehealth.conf file to the same |
| # value as SlurmdTimeout for consistent behavior (e.g. "suspectend: 600") |
| SlurmdTimeout=600 |
| |
| # Controls how a node's configuration specifications in slurm.conf are |
| # used. |
| # 0 - use hardware configuration (must agree with slurm.conf) |
| # 1 - use slurm.conf, nodes with fewer resources are marked DOWN |
| # 2 - use slurm.conf, but do not mark nodes down as in (1) |
| FastSchedule=2 |
| |
| # Per-node configuration for PALU AMD G34 dual-socket "Magny Cours" |
| # Compute Nodes. We deviate from slurm's idea of a physical socket |
| # here, since the Magny Cours hosts two NUMA nodes each, which is |
| # also visible in the ALPS inventory (4 Segments per node, each |
| # containing 6 'Processors'/Cores). |
| # Also specify that 2 GB of GPU memory is available on every node |
| NodeName=DEFAULT Sockets=4 CoresPerSocket=6 ThreadsPerCore=1 |
| NodeName=DEFAULT RealMemory=32000 State=UNKNOWN |
| NodeName=DEFAULT Gres=gpu_mem:2g |
| |
| # List the nodes of the compute partition below (service nodes are not |
| # allowed to appear) |
| NodeName=nid00[002-013,018-159,162-173,178-189] |
| |
| # Frontend nodes: these should not be available to user logins, but |
| # have all filesystems mounted that are also |
| # available on a login node (/scratch, /home, ...). |
| FrontendName=palu[7-9] |
| |
| #--------------------------------------------------------------------- |
| # ENFORCING LIMITS |
| #--------------------------------------------------------------------- |
| # Enforce the use of associations: {associations, limits, wckeys} |
| AccountingStorageEnforce=limits |
| |
| # Do not propagate any resource limits from the user's environment to |
| # the slurmd |
| PropagateResourceLimits=NONE |
| |
| #--------------------------------------------------------------------- |
| # Resource limits for memory allocation: |
| # * the Def/Max 'PerCPU' and 'PerNode' variants are mutually exclusive; |
| # * use the 'PerNode' variant for both default and maximum value, since |
| # - slurm will automatically adjust this value depending on |
| # --ntasks-per-node |
| # - if using a higher per-cpu value than possible, salloc will just |
| # block. |
| #-------------------------------------------------------------------- |
| # XXX replace both values below with your values from 'xtprocadmin -A' |
| DefMemPerNode=32000 |
| MaxMemPerNode=32000 |
| |
| #--------------------------------------------------------------------- |
| # PARTITIONS |
| #--------------------------------------------------------------------- |
| # defaults common to all partitions |
| PartitionName=DEFAULT Nodes=nid00[002-013,018-159,162-173,178-189] |
| PartitionName=DEFAULT MaxNodes=178 |
| PartitionName=DEFAULT Shared=EXCLUSIVE State=UP DefaultTime=60 |
| |
| # "User Support" partition with a higher priority |
| PartitionName=usup Hidden=YES Priority=10 MaxTime=720 AllowGroups=staff |
| |
| # normal partition available to all users |
| PartitionName=day Default=YES Priority=1 MaxTime=01:00:00 |
| </pre> |
| |
| <p>SLURM supports an optional <i>cray.conf</i> file containing Cray-specific |
| configuration parameters. <b>This file is NOT needed for production systems</b>, |
| but is provided for advanced configurations. If used, <i>cray.conf</i> must be |
| located in the same directory as the <i>slurm.conf</i> file. Configuration |
| parameters supported by <i>cray.conf</i> are listed below.</p> |
| |
| <p><dl> |
| <dt><b>apbasil</b></dt> |
| <dd>Fully qualified pathname to the apbasil command. |
| The default value is <i>/usr/bin/apbasil</i>.</dd> |
| <dt><b>apkill</b></dt> |
| <dd>Fully qualified pathname to the apkill command. |
| The default value is <i>/usr/bin/apkill</i>.</dd> |
| <dt><b>SDBdb</b></dt> |
| <dd>Name of the ALPS database. |
| The default value is <i>XTAdmin</i>.</dd> |
| <dt><b>SDBhost</b></dt> |
| <dd>Hostname of the database server. |
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information, which defaults to the host 'sdb'.</dd>
| <dt><b>SDBpass</b></dt> |
| <dd>Password used to access the ALPS database. |
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information, which defaults to the password of the user 'basic'.</dd>
| <dt><b>SDBport</b></dt> |
| <dd>Port used to access the ALPS database. |
| The default value is 0.</dd> |
| <dt><b>SDBuser</b></dt> |
| <dd>Name of user used to access the ALPS database. |
The default value is based upon the contents of the 'my.cnf' file used to
store default database access information, which defaults to the user 'basic'.</dd>
| </dl></p> |
| |
| <pre> |
| # Example cray.conf file |
| apbasil=/opt/alps_simulator_40_r6768/apbasil.sh |
| SDBhost=localhost |
| SDBuser=alps_user |
| SDBdb=XT5istanbul |
| </pre> |
| |
<p>One additional configuration script can be used to ensure that the slurmd
daemons execute with the highest resource limits possible, overriding default
limits on SuSE systems. Depending upon what resource limits are propagated
from the user's environment, lower limits may apply to user jobs, but this
script will ensure that higher limits are possible. Copy the file
| <i>contribs/cray/etc_sysconfig_slurm</i> into <i>/etc/sysconfig/slurm</i> |
| for these limits to take effect. This script is executed from |
| <i>/etc/init.d/slurm</i>, which is typically executed to start the SLURM |
| daemons. An excerpt of <i>contribs/cray/etc_sysconfig_slurm</i> is shown |
| below.</p> |
| |
| <pre> |
| # |
| # /etc/sysconfig/slurm for Cray XT/XE systems |
| # |
| # Cray is SuSe-based, which means that ulimits from |
| # /etc/security/limits.conf will get picked up any time SLURM is |
| # restarted e.g. via pdsh/ssh. Since SLURM respects configured limits, |
| # this can mean that for instance batch jobs get killed as a result |
| # of configuring CPU time limits. Set sane start limits here. |
| # |
| # Values were taken from pam-1.1.2 Debian package |
| ulimit -t unlimited # max amount of CPU time in seconds |
| ulimit -d unlimited # max size of a process's data segment in KB |
| </pre> |
| |
<p>SLURM will ignore any interactive jobs and any nodes set to interactive mode,
so set all of your nodes to batch mode from any service node. Dropping the
-n option will set all nodes to batch mode.</p>
| |
| <pre> |
| # xtprocadmin -k m batch -n NODEIDS |
| </pre> |
| |
| <p>Now create the needed directories for logs and state files then start the |
| daemons on the sdb and login nodes as shown below.</p> |
| |
| <pre> |
| sdb: # mkdir -p /ufs/slurm/log |
| sdb: # mkdir -p /ufs/slurm/spool |
| sdb: # module load slurm |
| sdb: # /etc/init.d/slurm start |
| </pre> |
| |
| <pre> |
| login: # /etc/init.d/slurm start |
| </pre> |
| |
| <h3>Srun wrapper configuration</h3> |
| |
| <p>The <i>srun</i> wrapper to <i>aprun</i> might require modification to run |
as desired. Specifically, the <i>$aprun</i> variable can be set to the
absolute pathname of the <i>aprun</i> executable. Without that modification, the
| <i>aprun</i> command executed will depend upon the user's search path.</p> |
| |
| <p>In order to debug the <i>srun</i> wrapper, uncomment the line</p> |
| <pre> |
| print "comment=$command\n" |
| </pre> |
| <p>If the <i>srun</i> wrapper is executed from |
| within an existing SLURM job allocation (i.e. within <i>salloc</i> or an |
| <i>sbatch</i> script), then it just executes the <i>aprun</i> command with |
| appropriate options. If executed without an allocation, the wrapper executes |
| <i>salloc</i>, which then executes the <i>srun</i> wrapper again. This second |
| execution of the <i>srun</i> wrapper is required in order to process environment |
| variables that are set by the <i>salloc</i> command based upon the resource |
| allocation.</p> |
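<p>For example (the executable name is a placeholder), both invocation styles look the
same to the user:</p>
<pre>
# outside an allocation: the wrapper runs salloc, which re-invokes the wrapper
srun -n 8 ./a.out

# inside an salloc session or sbatch script: the wrapper executes aprun directly
srun -n 8 ./a.out
</pre>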
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p style="text-align:center;">Last modified 18 September 2012</p></td> |
| |
| <!--#include virtual="footer.txt"--> |