<!--#include virtual="header.txt"-->
<h1><a name="top">Frequently Asked Questions</a></h1>
<h2>For Users</h2>
<ol>
<li><a href="#comp">Why is my job/node in COMPLETING state?</a></li>
<li><a href="#rlimit">Why are my resource limits not propagated?</a></li>
<li><a href="#pending">Why is my job not running?</a></li>
<li><a href="#sharing">Why does the srun --overcommit option not permit multiple jobs
to run on nodes?</a></li>
<li><a href="#purge">Why is my job killed prematurely?</a></li>
<li><a href="#opts">Why are my srun options ignored?</a></li>
<li><a href="#cred">Why are &quot;Invalid job credential&quot; errors generated?</a></li>
<li><a href="#backfill">Why is the SLURM backfill scheduler not starting my
job?</a></li>
<li><a href="#steps">How can I run multiple jobs from within a single script?</a></li>
<li><a href="#orphan">Why do I have job steps when my job has already COMPLETED?</a></li>
<li><a href="#multi_batch">How can I run a job within an existing job allocation?</a></li>
<li><a href="#user_env">How does SLURM establish the environment for my job?</a></li>
<li><a href="#prompt">How can I get shell prompts in interactive mode?</a></li>
<li><a href="#batch_out">How can I get the task ID in the output or error file
name for a batch job?</a></li>
<li><a href="#parallel_make">Can the <i>make</i> command utilize the resources
allocated to a SLURM job?</a></li>
<li><a href="#terminal">Can tasks be launched with a remote terminal?</a></li>
</ol>
<h2>For Administrators</h2>
<ol>
<li><a href="#suspend">How is job suspend/resume useful?</a></li>
<li><a href="#fast_schedule">How can I configure SLURM to use the resources actually
found on a node rather than what is defined in <i>slurm.conf</i>?</a></li>
<li><a href="#return_to_service">Why is a node shown in state DOWN when the node
has registered for service?</a></li>
<li><a href="#down_node">What happens when a node crashes?</a></li>
<li><a href="#multi_job">How can I control the execution of multiple
jobs per node?</a></li>
<li><a href="#inc_plugin">When the SLURM daemon starts, it prints
&quot;cannot resolve X plugin operations&quot; and exits. What does this mean?</a></li>
<li><a href="#sigpipe">Why are user tasks intermittently dying at launch with SIGPIPE
error messages?</a></li>
<li><a href="#maint_time">How can I dry up the workload for a maintenance
period?</a></li>
<li><a href="#pam">How can PAM be used to control a user's limits on or
access to compute nodes?</a></li>
<li><a href="#time">Why are jobs allocated nodes and then unable to initiate
programs on some nodes?</a></li>
<li><a href="#ping"> Why does <i>slurmctld</i> log that some nodes
are not responding even if they are not in any partition?</a></li>
<li><a href="#controller">How should I relocate the primary or backup
controller?</a></li>
<li><a href="#multi_slurm">Can multiple SLURM systems be run in
parallel for testing purposes?</a></li>
<li><a href="#multi_slurmd">Can slurm emulate a larger cluster?</a></li>
<li><a href="#extra_procs">Can SLURM emulate nodes with more
resources than physically exist on the node?</a></li>
<li><a href="#credential_replayed">What does a "credential
replayed" error in the <i>SlurmdLogFile</i> indicate?</a></li>
<li><a href="#large_time">What does a "Warning: Note very large
processing time" in the <i>SlurmctldLogFile</i> indicate?</a></li>
<li><a href="#lightweight_core">How can I add support for lightweight
core files?</a></li>
<li><a href="#limit_propagation">Is resource limit propagation
useful on a homogeneous cluster?</a></li>
</ol>
<h2>For Users</h2>
<p><a name="comp"><b>1. Why is my job/node in COMPLETING state?</b></a><br>
When a job is terminating, both the job and its nodes enter the COMPLETING state.
As the SLURM daemon on each node determines that all processes associated with
the job have terminated, that node changes state to IDLE or some other appropriate
state.
When every node allocated to a job has determined that all processes associated
with it have terminated, the job changes state to COMPLETED or some other
appropriate state (e.g. FAILED).
Normally, this happens within a second.
However, if the job has processes that cannot be terminated with a SIGKILL
signal, the job and one or more nodes can remain in the COMPLETING state
for an extended period of time.
This may indicate that processes are hung waiting for a core file
to complete I/O, or an operating system failure.
If this state persists, the system administrator should check for processes
associated with the job that cannot be terminated, then use the
<span class="commandline">scontrol</span> command to change the node's
state to DOWN (e.g. &quot;scontrol update NodeName=<i>name</i> State=DOWN Reason=hung_completing&quot;),
reboot the node, then reset the node's state to IDLE
(e.g. &quot;scontrol update NodeName=<i>name</i> State=RESUME&quot;).
Note that setting the node DOWN will terminate all running or suspended
jobs associated with that node.
An alternative is to set the node's state to DRAIN until all jobs
associated with it terminate before setting it DOWN and re-booting.</p>
<p><a name="rlimit"><b>2. Why are my resource limits not propagated?</b></a><br>
When the <span class="commandline">srun</span> command executes, it captures the
resource limits in effect at that time. These limits are propagated to the allocated
nodes before initiating the user's job. The SLURM daemon running on that node then
tries to establish identical resource limits for the job being initiated.
There are several possible reasons for not being able to establish those
resource limits.
<ul>
<li>The hard resource limits applied to SLURM's slurmd daemon are lower
than the user's soft resource limits on the submit host. Typically
the slurmd daemon is initiated by the init daemon with the operating
system default limits. This may be addressed either through use of the
ulimit command in the /etc/sysconfig/slurm file (see the example below)
or by enabling <a href="#pam">PAM in SLURM</a>.</li>
<li>The user's hard resource limits on the allocated node are lower than
the same user's soft resource limits on the node from which the
job was submitted. It is recommended that the system administrator
establish uniform hard resource limits for users on all nodes
within a cluster to prevent this from occurring.</li>
</ul></p>
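<p>For example, if your init script sources <i>/etc/sysconfig/slurm</i> before
starting the slurmd daemon, limits can be raised there. This is only a sketch;
the specific limits shown, and whether your init script reads this file, depend
upon your installation:</p>
<pre>
# /etc/sysconfig/slurm - sourced before the slurmd daemon is started
ulimit -l unlimited    # locked memory
ulimit -n 4096         # open files
ulimit -u 1024         # user processes
</pre>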
<p>NOTE: This may produce the error message &quot;Can't propagate RLIMIT_...&quot;.
The error message is printed only if the user explicitly specifies that
the resource limit should be propagated or the srun command is running
with verbose logging of actions from the slurmd daemon (e.g. "srun -d6 ...").</p>
<p><a name="pending"><b>3. Why is my job not running?</b></a><br>
The answer to this question depends upon the scheduler used by SLURM. Executing
the command</p>
<blockquote>
<p> <span class="commandline">scontrol show config | grep SchedulerType</span></p>
</blockquote>
<p> will supply this information. If the scheduler type is <b>builtin</b>, then
jobs will be executed in the order of submission for a given partition. Even if
resources are available to initiate your job immediately, it will be deferred
until no previously submitted job is pending. If the scheduler type is <b>backfill</b>,
then jobs will generally be executed in the order of submission for a given partition
with one exception: later submitted jobs will be initiated early if doing so does
not delay the expected execution time of an earlier submitted job. In order for
backfill scheduling to be effective, users' jobs should specify reasonable time
limits. If jobs do not specify time limits, then all jobs will receive the same
time limit (that associated with the partition), and the ability to backfill schedule
jobs will be limited. The backfill scheduler does not alter job specifications
of required or excluded nodes, so jobs which specify nodes will substantially
reduce the effectiveness of backfill scheduling. See the <a href="#backfill">
backfill</a> section for more details. If the scheduler type is <b>wiki</b>,
this represents
<a href="http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php">
The Maui Scheduler</a> or
<a href="http://www.clusterresources.com/pages/products/moab-cluster-suite.php">
Moab Cluster Suite</a>.
Please refer to its documentation for help. For any scheduler, you can check priorities
of jobs using the command <span class="commandline">scontrol show job</span>.</p>
<p><a name="sharing"><b>4. Why does the srun --overcommit option not permit multiple jobs
to run on nodes?</b></a><br>
The <b>--overcommit</b> option is a means of indicating that a job or job step is willing
to execute more than one task per processor in the job's allocation. For example,
consider a cluster of two processor nodes. The srun execute line may be something
of this sort</p>
<blockquote>
<p><span class="commandline">srun --ntasks=4 --nodes=1 a.out</span></p>
</blockquote>
<p>This will result in not one, but two nodes being allocated so that each of the four
tasks is given its own processor. Note that the srun <b>--nodes</b> option specifies
a minimum node count and optionally a maximum node count. A command line of</p>
<blockquote>
<p><span class="commandline">srun --ntasks=4 --nodes=1-1 a.out</span></p>
</blockquote>
<p>would result in the request being rejected. If the <b>--overcommit</b> option
is added to either command line, then only one node will be allocated for all
four tasks to use.</p>
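<p>For example, adding the option to the second command line above yields a
request that will be accepted, with all four tasks sharing the single node's
two processors:</p>
<blockquote>
<p><span class="commandline">srun --ntasks=4 --nodes=1-1 --overcommit a.out</span></p>
</blockquote>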
<p>More than one job can execute simultaneously on the same nodes through the use
of srun's <b>--shared</b> option in conjunction with the <b>Shared</b> parameter
in SLURM's partition configuration. See the man pages for srun and slurm.conf for
more information.</p>
<p><a name="purge"><b>5. Why is my job killed prematurely?</b></a><br>
SLURM has a job purging mechanism to remove inactive jobs (resource allocations)
before they reach their time limit, which may be infinite.
This inactivity time limit is configurable by the system administrator.
You can check its value with the command</p>
<blockquote>
<p><span class="commandline">scontrol show config | grep InactiveLimit</span></p>
</blockquote>
<p>The value of InactiveLimit is in seconds.
A zero value indicates that job purging is disabled.
A job is considered inactive if it has no active job steps or if the srun
command creating the job is not responding.
In the case of a batch job, the srun command terminates after the job script
is submitted.
Therefore batch job pre- and post-processing is limited to the InactiveLimit.
Contact your system administrator if you believe the InactiveLimit value
should be changed.
<p><a name="opts"><b>6. Why are my srun options ignored?</b></a><br>
Everything after the command <span class="commandline">srun</span> is
examined to determine if it is a valid option for srun. The first
token that is not a valid option for srun is considered the command
to execute and everything after that is treated as an option to
the command. For example:</p>
<blockquote>
<p><span class="commandline">srun -N2 hostname -pdebug</span></p>
</blockquote>
<p>srun processes "-N2" as an option to itself. "hostname" is the
command to execute and "-pdebug" is treated as an option to the
hostname command. This would change the name of the computer
on which SLURM executes the command, which is very bad. <b>Do not run
this command as user root!</b></p>
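<p>If the intent was to pass <i>-pdebug</i> to srun (to select the debug
partition), place it before the command name:</p>
<blockquote>
<p><span class="commandline">srun -N2 -pdebug hostname</span></p>
</blockquote>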
<p><a name="cred"><b>7. Why are &quot;Invalid job credential&quot; errors generated?
</b></a><br>
This error is indicative of SLURM's job credential files being inconsistent across
the cluster. All nodes in the cluster must have the matching public and private
keys as defined by <b>JobCredPrivateKey</b> and <b>JobCredPublicKey</b> in the
slurm configuration file <b>slurm.conf</b>.
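<p>One way to generate a matching key pair is with <i>openssl</i>, then copy
both files to every node and reference them with the parameters named above.
The paths below are only placeholders and the exact procedure may vary with
your installation:</p>
<pre>
# Generate a key pair once, then distribute both files to the same
# location on every node in the cluster.
$ openssl genrsa -out /path/to/slurm.key 1024
$ openssl rsa -in /path/to/slurm.key -pubout -out /path/to/slurm.cert
</pre>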
<p><a name="backfill"><b>8. Why is the SLURM backfill scheduler not starting my job?
</b></a><br>
There are significant limitations in the current backfill scheduler plugin.
It was designed to perform backfill node scheduling for a homogeneous cluster.
It does not manage scheduling on individual processors (or other consumable
resources). It also does not update the required or excluded node list of
individual jobs. These are the current limitations. You can use the
scontrol show command to check whether these conditions apply, as shown
in the example after the list below.</p>
<ul>
<li>partition: State=UP</li>
<li>partition: RootOnly=NO</li>
<li>partition: Shared=NO</li>
<li>job: ReqNodeList=NULL</li>
<li>job: ExcNodeList=NULL</li>
<li>job: Contiguous=0</li>
<li>job: Features=NULL</li>
<li>job: MinProcs, MinMemory, and MinTmpDisk satisfied by all nodes in
the partition</li>
<li>job: MinProcs or MinNodes not to exceed partition's MaxNodes</li>
</ul>
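<p>A quick way to check these conditions is to filter the <i>scontrol show</i>
output. The partition name and job ID below are only examples:</p>
<pre>
$ scontrol show partition debug | egrep "State|RootOnly|Shared"
$ scontrol show job 65541 | egrep "ReqNodeList|ExcNodeList|Contiguous|Features"
</pre>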
<p>As soon as any job in the partition's priority-ordered queue fails to
satisfy these conditions, no lower priority job in that partition's queue
will be considered as a backfill candidate. Any programmer wishing
to augment the existing code is welcome to do so.
<p><a name="steps"><b>9. How can I run multiple jobs from within a
single script?</b></a><br>
A SLURM job is just a resource allocation. You can execute many
job steps within that allocation, either in parallel or sequentially.
Some jobs actually launch thousands of job steps this way. The job
steps will be allocated nodes that are not already allocated to
other job steps. This essentially provides a second level of resource
management within the job for the job steps.</p>
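<p>A minimal sketch of a batch script that runs job steps both in parallel and
sequentially within one allocation (the program names are placeholders):</p>
<pre>
#!/bin/sh
# Two job steps share the allocation and run at the same time
srun -n4 prog_a &
srun -n4 prog_b &
wait                  # wait for both parallel steps to complete
# A final step then uses the entire allocation
srun -n8 prog_c
</pre>
<p>Submitting this script with <span class="commandline">sbatch -n8 script</span>
allocates enough resources for all of the steps to share.</p>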
<p><a name="orphan"><b>10. Why do I have job steps when my job has
already COMPLETED?</b></a><br>
NOTE: This only applies to systems configured with
<i>SwitchType=switch/elan</i> or <i>SwitchType=switch/federation</i>.
All other systems will purge all job steps on job completion.</p>
<p>SLURM maintains switch (network interconnect) information within
the job step for Quadrics Elan and IBM Federation switches.
This information must be maintained until we are absolutely certain
that the processes associated with the switch have been terminated
to avoid the possibility of re-using switch resources for other
jobs (even on different nodes).
SLURM considers jobs COMPLETED when all nodes allocated to the
job are either DOWN or confirm termination of all of its processes.
This enables SLURM to purge job information in a timely fashion
even when there are many failing nodes.
Unfortunately the job step information may persist longer.</p>
<p><a name="multi_batch"><b>11. How can I run a job within an existing
job allocation?</b></a><br>
There is a srun option <i>--jobid</i> that can be used to specify
a job's ID.
For a batch job or within an existing resource allocation, the
environment variable <i>SLURM_JOBID</i> has already been defined,
so all job steps will run within that job allocation unless
otherwise specified.
The one exception to this is when submitting batch jobs.
When a batch job is submitted from within an existing batch job,
it is treated as a new job allocation request and will get a
new job ID unless explicitly set with the <i>--jobid</i> option.
If you specify that a batch job should use an existing allocation,
that job allocation will be released upon the termination of
that batch job.</p>
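<p>For example, to launch a job step inside an existing allocation with job ID
65541 (an arbitrary example):</p>
<blockquote>
<p><span class="commandline">srun --jobid=65541 -N1 hostname</span></p>
</blockquote>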
<p><a name="user_env"><b>12. How does SLURM establish the environment
for my job?</b></a><br>
SLURM processes are not run under a shell, but directly exec'ed
by the <i>slurmd</i> daemon (assuming <i>srun</i> is used to launch
the processes).
The environment variables in effect at the time the <i>srun</i> command
is executed are propagated to the spawned processes.
The <i>~/.profile</i> and <i>~/.bashrc</i> scripts are not executed
as part of the process launch.</p>
<p><a name="prompt"><b>13. How can I get shell prompts in interactive
mode?</b></a><br>
<i>srun -u bash -i</i><br>
Srun's <i>-u</i> option turns off buffering of stdout.
Bash's <i>-i</i> option tells it to run in interactive mode (with prompts).
<p><a name="batch_out"><b>14. How can I get the task ID in the output
or error file name for a batch job?</b></a><br>
The <i>srun -b</i> or <i>sbatch</i> commands are meant to accept a
script rather than a command line. If you specify a command line
rather than a script, it gets translated to a simple script of this
sort:</p>
<pre>
#!/bin/sh
srun hostname
</pre>
<p>You will note that the srun command lacks the output file specification.
Its output (for all tasks) becomes the output of the job. If you
want separate output by task, you will need to build a script containing
this specification. For example:</p>
<pre>
$ cat test
#!/bin/sh
echo begin_test
srun -o out_%j_%t hostname
$ sbatch -n7 -o out_%j test
sbatch: Submitted batch job 65541
$ ls -l out*
-rw-rw-r-- 1 jette jette 11 Jun 15 09:15 out_65541
-rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_0
-rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_1
-rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_2
-rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_3
-rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_4
-rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_5
-rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_6
$ cat out_65541
begin_test
$ cat out_65541_2
tdev2
</pre>
<p><a name="parallel_make"><b>15. Can the <i>make</i> command
utilize the resources allocated to a SLURM job?</b></a><br>
Yes. There is a patch available for GNU make version 3.81
available as part of the SLURM distribution in the file
<i>contribs/make.slurm.patch</i>.
This patch will use SLURM to launch tasks across a job's current resource
allocation. Depending upon the size of modules to be compiled, this may
or may not improve performance. If most modules are thousands of lines
long, the use of additional resources should more than compensate for the
overhead of SLURM's task launch. Use with make's <i>-j</i> option within an
existing SLURM allocation. Outside of a SLURM allocation, make's behavior
will be unchanged.</p>
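<p>A rough outline of the procedure, assuming the GNU make 3.81 sources are
unpacked in <i>make-3.81</i> (paths and the patch level are examples; adjust
them for your environment):</p>
<pre>
$ cd make-3.81
$ patch -p1 -i /path/to/slurm/contribs/make.slurm.patch
$ ./configure
$ make
$ make install
# Later, from within an existing SLURM allocation:
$ make -j8
</pre>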
<p><a name="terminal"><b>16. Can tasks be launched with a remote
terminal?</b></a><br>
SLURM does not directly support a remote pseudo terminal for spawned
tasks.
We intend to remedy this in Slurm version 1.3.
Until then, you can accomplish this by starting an appropriate program
or script. In the simplest case (X11 over TCP with the DISPLAY
environment already set), <i>srun xterm</i> may suffice. In the more
general case, the following scripts should work.
<b>NOTE: The pathnames of the additional scripts are included in the
variables BS and IS of the first script. You must change these to match
your installation.</b>
Execute the script with the sbatch options desired.
For example, <i>interactive -N2 -pdebug</i>.
<pre>
#!/bin/bash
# -*- coding: utf-8 -*-
# Author: P&auml;r Andersson (National Supercomputer Centre, Sweden)
# Version: 0.3 2007-07-30
#
# This will submit a batch script that starts screen on a node.
# Then ssh is used to connect to the node and attach the screen.
# The result is very similar to an interactive shell in PBS
# (qsub -I)
# Batch Script that starts SCREEN
BS=/INSTALL_DIRECTORY/_interactive
# Interactive screen script
IS=/INSTALL_DIRECTORY/_interactive_screen
# Submit the job and get the job id
JOB=`sbatch --output=/dev/null --error=/dev/null $@ $BS 2>&1 \
| egrep -o -e "\b[0-9]+$"`
# Make sure the job is always canceled
trap "{ /usr/bin/scancel -q $JOB; exit; }" SIGINT SIGTERM EXIT
echo "Waiting for JOBID $JOB to start"
while true;do
sleep 5s
# Check job status
STATUS=`squeue -j $JOB -t PD,R -h -o %t`
if [ "$STATUS" = "R" ];then
# Job is running, break the while loop
break
elif [ "$STATUS" != "PD" ];then
echo "Job is not Running or Pending. Aborting"
scancel $JOB
exit 1
fi
echo -n "."
done
# Determine the first node in the job:
NODE=`srun --jobid=$JOB -N1 hostname`
# SSH to the node and attach the screen
sleep 1s
ssh -X -t $NODE $IS slurm$JOB
# The trap will now cancel the job before exiting.
</pre>
<p>NOTE: The above script executes the script below,
named <i>_interactive</i>.</p>
<pre>
#!/bin/sh
# -*- coding: utf-8 -*-
# Author: P&auml;r Andersson (National Supercomputer Centre, Sweden)
# Version: 0.2 2007-07-30
#
# Simple batch script that starts SCREEN.
exec screen -Dm -S slurm$SLURM_JOBID
</pre>
<p>The following script named <i>_interactive_screen</i> is also used.</p>
<pre>
#!/bin/sh
# -*- coding: utf-8 -*-
# Author: P&auml;r Andersson (National Supercomputer Centre, Sweden)
# Version: 0.3 2007-07-30
#
SCREENSESSION=$1
# If DISPLAY is set then set that in the screen, then create a new
# window with that environment and kill the old one.
if [ "$DISPLAY" != "" ];then
screen -S $SCREENSESSION -X unsetenv DISPLAY
screen -p0 -S $SCREENSESSION -X setenv DISPLAY $DISPLAY
screen -p0 -S $SCREENSESSION -X screen
screen -p0 -S $SCREENSESSION -X kill
fi
exec screen -S $SCREENSESSION -rd
</pre>
<p class="footer"><a href="#top">top</a></p>
<h2>For Administrators</h2>
<p><a name="suspend"><b>1. How is job suspend/resume useful?</b></a><br>
Job suspend/resume is most useful to get particularly large jobs initiated
in a timely fashion with minimal overhead. Say you want to get a full-system
job initiated. Normally you would need to either cancel all running jobs
or wait for them to terminate. Canceling jobs results in the loss of
their work to that point from either their beginning or last checkpoint.
Waiting for the jobs to terminate can take hours, depending upon your
system configuration. A more attractive alternative is to suspend the
running jobs, run the full-system job, then resume the suspended jobs.
This can easily be accomplished by configuring a special queue for
full-system jobs and using a script to control the process.
The script would stop the other partitions, suspend running jobs in those
partitions, and start the full-system partition.
The process can be reversed when desired.
One can effectively gang schedule (time-slice) multiple jobs
using this mechanism, although the algorithms to do so can get quite
complex.
Suspending and resuming a job makes use of the SIGSTOP and SIGCONT
signals respectively, so swap and disk space should be sufficient to
accommodate all jobs allocated to a node, either running or suspended.
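<p>A sketch of such a script using <i>scontrol</i> and <i>squeue</i> (the
partition name <i>pbatch</i> is only an example):</p>
<pre>
#!/bin/sh
# Stop new scheduling in the normal partition and suspend its running jobs
scontrol update PartitionName=pbatch State=DOWN
for job in `squeue -p pbatch -t RUNNING -h -o %i`; do
   scontrol suspend $job
done
# ... run the full-system job in its own partition here ...
# Resume the suspended jobs and reopen the partition
for job in `squeue -p pbatch -t SUSPENDED -h -o %i`; do
   scontrol resume $job
done
scontrol update PartitionName=pbatch State=UP
</pre>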
<p><a name="fast_schedule"><b>2. How can I configure SLURM to use
the resources actually found on a node rather than what is defined
in <i>slurm.conf</i>?</b></a><br>
SLURM can either base its scheduling decisions upon the node
configuration defined in <i>slurm.conf</i> or upon what each node
actually reports as its available resources.
This is controlled using the configuration parameter <i>FastSchedule</i>.
Set its value to zero in order to use the resources actually
found on each node, but with a higher overhead for scheduling.
A value of one is the default and results in the node configuration
defined in <i>slurm.conf</i> being used. See &quot;man slurm.conf&quot;
for more details.</p>
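<p>A one-line example in <i>slurm.conf</i>:</p>
<pre>
# Schedule based upon the resources each node actually reports
FastSchedule=0
</pre>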
<p><a name="return_to_service"><b>3. Why is a node shown in state
DOWN when the node has registered for service?</b></a><br>
The configuration parameter <i>ReturnToService</i> in <i>slurm.conf</i>
controls how DOWN nodes are handled.
Set its value to one in order for DOWN nodes to automatically be
returned to service once the <i>slurmd</i> daemon registers
with a valid node configuration.
A value of zero is the default and results in a node staying DOWN
until an administrator explicitly returns it to service using
the command &quot;scontrol update NodeName=whatever State=RESUME&quot;.
See &quot;man slurm.conf&quot; and &quot;man scontrol&quot; for more
details.</p>
<p><a name="down_node"><b>4. What happens when a node crashes?</b></a><br>
A node is set DOWN when the slurmd daemon on it stops responding
for <i>SlurmdTimeout</i> as defined in <i>slurm.conf</i>.
The node can also be set DOWN when certain errors occur or the
node's configuration is inconsistent with that defined in <i>slurm.conf</i>.
Any active job on that node will be killed unless it was submitted
with the srun option <i>--no-kill</i>.
Any active job step on that node will be killed.
See the slurm.conf and srun man pages for more information.</p>
<p><a name="multi_job"><b>5. How can I control the execution of multiple
jobs per node?</b></a><br>
There are two mechanisms to control this.
If you want to allocate individual processors on a node to jobs,
configure <i>SelectType=select/cons_res</i>.
See <a href="cons_res.html">Consumable Resources in SLURM</a>
for details about this configuration.
If you want to allocate whole nodes to jobs, configure
<i>SelectType=select/linear</i>.
Each partition also has a configuration parameter <i>Shared</i>
that enables more than one job to execute on each node.
See <i>man slurm.conf</i> for more information about these
configuration parameters.</p>
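<p>A minimal <i>slurm.conf</i> sketch combining these parameters (the node and
partition names are only examples):</p>
<pre>
# Allocate individual processors rather than whole nodes
SelectType=select/cons_res
# Permit more than one job per node in this partition
PartitionName=debug Nodes=tux[0-31] Default=YES Shared=YES
</pre>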
<p><a name="inc_plugin"><b>6. When the SLURM daemon starts, it
prints &quot;cannot resolve X plugin operations&quot; and exits.
What does this mean?</b></a><br>
This means that symbols expected in the plugin were
not found by the daemon. This typically happens when the
plugin was built or installed improperly or the configuration
file is telling the plugin to use an old plugin (say from the
previous version of SLURM). Restart the daemon in verbose mode
for more information (e.g. &quot;slurmctld -Dvvvvv&quot;).
<p><a name="sigpipe"><b>7. Why are user tasks intermittently dying
at launch with SIGPIPE error messages?</b></a><br>
If you are using LDAP or some other remote name service for
username and groups lookup, chances are that the underlying
libc library functions are triggering the SIGPIPE. You can likely
work around this problem by setting <i>CacheGroups=1</i> in your slurm.conf
file. However, be aware that you will need to run &quot;scontrol
reconfigure&quot; any time your groups database is updated.
<p><a name="maint_time"><b>8. How can I dry up the workload for a
maintenance period?</b></a><br>
There isn't a mechanism to tell SLURM that all jobs should be
completed by a specific time. The best way to address this is
to shorten the <i>MaxTime</i> associated with the partitions so
as to avoid initiating jobs that will not have completed by
the maintenance period.
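<p>For example, ahead of the maintenance window you might shorten the limit
and, once the window begins, stop new job initiation entirely (the partition
name is an example):</p>
<pre>
# Limit new jobs to 60 minutes so they finish before the maintenance window
$ scontrol update PartitionName=debug MaxTime=60
# Once the window starts, prevent any further job initiation
$ scontrol update PartitionName=debug State=DOWN
</pre>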
<p><a name="pam"><b>9. How can PAM be used to control a user's limits on
or access to compute nodes?</b></a><br>
First, enable SLURM's use of PAM by setting <i>UsePAM=1</i> in
<i>slurm.conf</i>.<br>
Second, establish a PAM configuration file for slurm in <i>/etc/pam.d/slurm</i>.
A basic configuration you might use is:</p>
<pre>
auth required pam_localuser.so
account required pam_unix.so
session required pam_limits.so
</pre>
<p>Third, set the desired limits in <i>/etc/security/limits.conf</i>.
For example, to set the locked memory limit to unlimited for all users:</p>
<pre>
* hard memlock unlimited
* soft memlock unlimited
</pre>
<p>Finally, you need to disable SLURM's forwarding of the limits from the
session from which the <i>srun</i> command initiating the job was run. By default
all resource limits are propagated from that session. For example, adding
the following line to <i>slurm.conf</i> will prevent the locked memory
limit from being propagated:<i>PropagateResourceLimitsExcept=MEMLOCK</i>.</p>
<p>We also have a PAM module for SLURM that prevents users from
logging into nodes that they have not been allocated (except for user
root, which can always login). pam_slurm is available for download from
<a href="ftp://ftp.llnl.gov/pub/linux/pam_slurm/">ftp://ftp.llnl.gov/pub/linux/pam_slurm</a>.
The use of pam_slurm does not require <i>UsePAM</i> being set. The
two uses of PAM are independent.
<p><a name="time"><b>10. Why are jobs allocated nodes and then unable
to initiate programs on some nodes?</b></a><br>
This typically indicates that the time on some nodes is not consistent
with the node on which the <i>slurmctld</i> daemon executes. In order to
initiate a job step (or batch job), the <i>slurmctld</i> daemon generates
a credential containing a time stamp. If the <i>slurmd</i> daemon
receives a credential containing a time stamp later than the current
time or more than a few minutes in the past, it will be rejected.
If you check in the <i>SlurmdLog</i> on the nodes of interest, you
will likely see messages of this sort: "<i>Invalid job credential from
&lt;some IP address&gt;: Job credential expired</i>." Make the times
consistent across all of the nodes and all should be well.
<p><a name="ping"><b>11. Why does <i>slurmctld</i> log that some nodes
are not responding even if they are not in any partition?</b></a><br>
The <i>slurmctld</i> daemon periodically pings the <i>slurmd</i>
daemon on every configured node, even if not associated with any
partition. You can control the frequency of this ping with the
<i>SlurmdTimeout</i> configuration parameter in <i>slurm.conf</i>.
<p><a name="controller"><b>12. How should I relocate the primary or
backup controller?</b></a><br>
If the cluster's computers used for the primary or backup controller
will be out of service for an extended period of time, it may be desirable
to relocate them. In order to do so, follow this procedure:</p>
<ol>
<li>Stop all SLURM daemons</li>
<li>Modify the <i>ControlMachine</i>, <i>ControlAddr</i>,
<i>BackupController</i>, and/or <i>BackupAddr</i> in the <i>slurm.conf</i> file</li>
<li>Distribute the updated <i>slurm.conf</i> file to all nodes</li>
<li>Restart all SLURM daemons</li>
</ol>
<p>There should be no loss of any running or pending jobs. Ensure that
any nodes added to the cluster have a current <i>slurm.conf</i> file
installed.
<b>CAUTION:</b> If two nodes are simultaneously configured as the primary
controller (two nodes on which <i>ControlMachine</i> specify the local host
and the <i>slurmctld</i> daemon is executing on each), system behavior will be
destructive. If a compute node has an incorrect <i>ControlMachine</i> or
<i>BackupController</i> parameter, that node may be rendered unusable, but no
other harm will result.
<p><a name="multi_slurm"><b>13. Can multiple SLURM systems be run in
parallel for testing purposes?</b></a><br>
Yes, this is a great way to test new versions of SLURM.
Just install the test version in a different location with a different
<i>slurm.conf</i>.
The test system's <i>slurm.conf</i> should specify different
pathnames and port numbers to avoid conflicts.
The only problem is if more than one version of SLURM is configured
with <i>switch/elan</i> or <i>switch/federation</i>.
In that case, there can be conflicting switch window requests from
the different SLURM systems.
This can be avoided by configuring the test system with <i>switch/none</i>.
MPI jobs started on an Elan or Federation switch system without the
switch windows configured will not execute properly, but other jobs
will run fine.
Another option for testing on Elan or Federation systems is to use
a different set of nodes for the different SLURM systems.
That will permit both systems to allocate switch windows without
conflicts.
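<p>A sketch of the sort of overrides the test system's <i>slurm.conf</i> might
contain (the ports and paths are arbitrary examples):</p>
<pre>
SlurmctldPort=7010
SlurmdPort=7011
StateSaveLocation=/var/spool/slurm.test
SlurmdSpoolDir=/var/spool/slurmd.test
SlurmctldPidFile=/var/run/slurmctld.test.pid
SlurmdPidFile=/var/run/slurmd.test.pid
SlurmctldLogFile=/var/log/slurmctld.test.log
SlurmdLogFile=/var/log/slurmd.test.log
SwitchType=switch/none
</pre>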
<p><a name="multi_slurmd"><b>14. Can slurm emulate a larger
cluster?</b></a><br>
Yes, this can be useful for testing purposes.
It has also been used to partition "fat" nodes into multiple SLURM nodes.
There are two ways to do this.
The best method for most conditions is to run one <i>slurmd</i>
daemon per emulated node in the cluster as follows.
<ol>
<li>When executing the <i>configure</i> program, use the option
<i>--multiple-slurmd</i> (or add that option to your <i>~/.rpmmacros</i>
file).</li>
<li>Build and install SLURM in the usual manner.</li>
<li>In <i>slurm.conf</i> define the desired node names (arbitrary
names used only by SLURM) as <i>NodeName</i> along with the actual
address of the physical node in <i>NodeHostname</i>. Multiple
<i>NodeName</i> values can be mapped to a single
<i>NodeHostname</i>. Note that each <i>NodeName</i> on a single
physical node needs to be configured to use a different port number. You
will also want to use the "%n" symbol in slurmd related path options in
slurm.conf (see the example fragment after this list).</li>
<li>When starting the <i>slurmd</i> daemon, include the <i>NodeName</i>
of the node that it is supposed to serve on the execute line.</li>
</ol>
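<p>A minimal <i>slurm.conf</i> fragment illustrating step 3 (the host name,
node names, and port numbers are only examples):</p>
<pre>
# Per-node paths use "%n" so each slurmd gets its own files
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid
SlurmdSpoolDir=/var/spool/slurmd.%n
# Four emulated nodes on one physical host, each with its own port
NodeName=node1 NodeHostname=realhost Port=17001
NodeName=node2 NodeHostname=realhost Port=17002
NodeName=node3 NodeHostname=realhost Port=17003
NodeName=node4 NodeHostname=realhost Port=17004
</pre>
<p>Each daemon is then started with the name of the node it serves, for example
<span class="commandline">slurmd -N node1</span> (see &quot;man slurmd&quot;
for the exact option in your version).</p>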
<p>It is strongly recommended that SLURM version 1.2 or higher be used
for this due to its improved support for multiple slurmd daemons.
See the
<a href="programmer_guide.shtml#multiple_slurmd_support">Programmers Guide</a>
for more details about configuring multiple slurmd support.
<p>In order to emulate a really large cluster, it can be more
convenient to use a single <i>slurmd</i> daemon.
That daemon will not be able to launch many tasks, but can
suffice for developing or testing scheduling software.
Do not run job steps with more than a couple of tasks each
or execute more than a few jobs at any given time.
Doing so may result in the <i>slurmd</i> daemon exhausting its
memory and failing.
<b>Use this method with caution.</b>
<ol>
<li>Execute the <i>configure</i> program with your normal options.</li>
<li>Append the line "<i>#define HAVE_FRONT_END 1</i>" to the resulting
<i>config.h</i> file.</li>
<li>Build and install SLURM in the usual manner.</li>
<li>In <i>slurm.conf</i> define the desired node names (arbitrary
names used only by SLURM) as <i>NodeName</i> along with the actual
name and address of the <b>one</b> physical node in <i>NodeHostName</i>
and <i>NodeAddr</i>.
Up to 64k nodes can be configured in this virtual cluster.</li>
<li>Start your <i>slurmctld</i> and one <i>slurmd</i> daemon.
It is advisable to use the "-c" option to start the daemons without
trying to preserve any state files from previous executions.
Be sure to use the "-c" option when switching out of this mode as well.</li>
<li>Create job allocations as desired, but do not run job steps
with more than a couple of tasks.</li>
</ol>
<pre>
$ ./configure --enable-debug --prefix=... --sysconfdir=...
$ echo "#define HAVE_FRONT_END 1" >>config.h
$ make install
$ grep NodeHostName slurm.conf
<i>NodeName=dummy[1-1200] NodeHostName=localhost NodeAddr=127.0.0.1</i>
$ slurmctld -c
$ slurmd -c
$ sinfo
<i>PARTITION AVAIL TIMELIMIT NODES STATE NODELIST</i>
<i>pdebug* up 30:00 1200 idle dummy[1-1200]</i>
$ cat tmp
<i>#!/bin/bash</i>
<i>sleep 30</i>
$ srun -N200 -b tmp
<i>srun: jobid 65537 submitted</i>
$ srun -N200 -b tmp
<i>srun: jobid 65538 submitted</i>
$ srun -N800 -b tmp
<i>srun: jobid 65539 submitted</i>
$ squeue
<i>JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)</i>
<i>65537 pdebug tmp jette R 0:03 200 dummy[1-200]</i>
<i>65538 pdebug tmp jette R 0:03 200 dummy[201-400]</i>
<i>65539 pdebug tmp jette R 0:02 800 dummy[401-1200]</i>
</pre>
<p><a name="extra_procs"><b>15. Can SLURM emulate nodes with more
resources than physically exist on the node?</b></a><br>
Yes in SLURM version 1.2 or higher.
In the <i>slurm.conf</i> file, set <i>FastSchedule=2</i> and specify
any desired node resource specifications (<i>Procs</i>, <i>Sockets</i>,
<i>CoresPerSocket</i>, <i>ThreadsPerCore</i>, and/or <i>TmpDisk</i>).
SLURM will use the resource specification for each node that is
given in <i>slurm.conf</i> and will not check these specifications
against those actually found on the node.
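<p>For example, to present each node as having 16 processors regardless of the
actual hardware (the node names and counts are arbitrary):</p>
<pre>
FastSchedule=2
NodeName=tux[0-31] Procs=16 TmpDisk=16384
</pre>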
<p><a name="credential_replayed"><b>16. What does a "credential
replayed" error in the <i>SlurmdLogFile</i> indicate?</b></a><br>
This error is indicative of the <i>slurmd</i> daemon not being able
to respond to job initiation requests from the <i>srun</i> command
in a timely fashion (a few seconds).
<i>Srun</i> responds by resending the job initiation request.
When the <i>slurmd</i> daemon finally starts to respond, it
processes both requests.
The second request is rejected and the event is logged with
the "credential replayed" error.
If you check the <i>SlurmdLogFile</i> and <i>SlurmctldLogFile</i>,
you should see signs of the <i>slurmd</i> daemon's non-responsiveness.
A variety of factors can be responsible for this problem
including
<ul>
<li>Diskless nodes encountering network problems</li>
<li>Very slow Network Information Service (NIS)</li>
<li>The <i>Prolog</i> script taking a long time to complete</li>
</ul>
<p>In Slurm version 1.2, this can be addressed with the
<i>MessageTimeout</i> configuration parameter by setting a
value higher than the default 5 seconds.
In earlier versions of Slurm, the <i>--msg-timeout</i> option
of <i>srun</i> serves a similar purpose.
<p><a name="large_time"><b>17. What does a "Warning: Note very large
processing time" in the <i>SlurmctldLogFile</i> indicate?</b></a><br>
This error is indicative of some operation taking an unexpectedly
long time to complete, over one second to be specific.
Setting the <i>SlurmctldDebug</i> configuration parameter to
a value of six or higher should identify which operation(s) are
experiencing long delays.
This message typically indicates long delays in file system access
(writing state information or getting user information).
Another possibility is that the node on which the slurmctld
daemon executes has exhausted memory and is paging.
Try running the program <i>top</i> to check for this possibility.
<p><a name="lightweight_core"><b>18. How can I add support for
lightweight core files?</b></a><br>
SLURM supports lightweight core files by setting environment variables
based upon the <i>srun --core</i> option. Of particular note, it
sets the <i>LD_PRELOAD</i> environment variable to load new functions
used to process a core dump.
First you will need to acquire and install a shared object
library with the appropriate functions.
Then edit the SLURM code in <i>src/srun/core-format.c</i> to
specify a name for the core file type,
add a test for the existence of the library,
and set environment variables appropriately when it is used.
<p><a name="limit_propagation"><b>19. Is resource limit propagation
useful on a homogeneous cluster?</b></a><br>
Resource limit propagation permits a user to modify resource limits
and submit a job with those limits.
By default, SLURM automatically propagates all resource limits in
effect at the time of job submission to the tasks spawned as part
of that job.
System administrators can utilize the <i>PropagateResourceLimits</i>
and <i>PropagateResourceLimitsExcept</i> configuration parameters to
change this behavior.
Users can override defaults using the <i>srun --propagate</i>
option.
See <i>"man slurm.conf"</i> and <i>"man srun"</i> for more information
about these options.
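<p>For example, a user could raise a soft limit in the submitting shell and
have it take effect for the job's tasks (the default behavior, subject to the
configuration parameters above):</p>
<pre>
$ ulimit -S -c unlimited     # raise the soft core file size limit in this shell
$ srun a.out                 # by default the new limit is propagated to the tasks
</pre>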
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 30 July 2007</p>
<!--#include virtual="footer.txt"-->