| <!--#include virtual="header.txt"--> |
| |
| <h1><a name="top">Frequently Asked Questions</a></h1> |
| <h2>For Users</h2> |
| <ol> |
| <li><a href="#comp">Why is my job/node in COMPLETING state?</a></li> |
| <li><a href="#rlimit">Why are my resource limits not propagated?</a></li> |
| <li><a href="#pending">Why is my job not running?</a></li> |
| <li><a href="#sharing">Why does the srun --overcommit option not permit multiple jobs |
| to run on nodes?</a></li> |
| <li><a href="#purge">Why is my job killed prematurely?</a></li> |
| <li><a href="#opts">Why are my srun options ignored?</a></li> |
| <li><a href="#cred">Why are "Invalid job credential" errors generated?</a></li> |
| <li><a href="#backfill">Why is the SLURM backfill scheduler not starting my |
| job?</a></li> |
| <li><a href="#steps">How can I run multiple jobs from within a single script?</a></li> |
| <li><a href="#orphan">Why do I have job steps when my job has already COMPLETED?</a></li> |
| <li><a href="#multi_batch">How can I run a job within an existing job allocation?</a></li> |
| <li><a href="#user_env">How does SLURM establish the environment for my job?</a></li> |
| <li><a href="#prompt">How can I get shell prompts in interactive mode?</a></li> |
| <li><a href="#batch_out">How can I get the task ID in the output or error file |
| name for a batch job?</a></li> |
| <li><a href="#parallel_make">Can the <i>make</i> command utilize the resources |
| allocated to a SLURM job?</a></li> |
| <li><a href="#terminal">Can tasks be launched with a remote terminal?</a></li> |
| </ol> |
| <h2>For Administrators</h2> |
| <ol> |
| <li><a href="#suspend">How is job suspend/resume useful?</a></li> |
| <li><a href="#fast_schedule">How can I configure SLURM to use the resources actually |
| found on a node rather than what is defined in <i>slurm.conf</i>?</a></li> |
| <li><a href="#return_to_service">Why is a node shown in state DOWN when the node |
| has registered for service?</a></li> |
| <li><a href="#down_node">What happens when a node crashes?</a></li> |
| <li><a href="#multi_job">How can I control the execution of multiple |
| jobs per node?</a></li> |
| <li><a href="#inc_plugin">When the SLURM daemon starts, it prints |
| "cannot resolve X plugin operations" and exits. What does this mean?</a></li> |
| <li><a href="#sigpipe">Why are user tasks intermittently dying at launch with SIGPIPE |
| error messages?</a></li> |
| <li><a href="#maint_time">How can I dry up the workload for a maintenance |
| period?</a></li> |
| <li><a href="#pam">How can PAM be used to control a user's limits on or |
| access to compute nodes?</a></li> |
| <li><a href="#time">Why are jobs allocated nodes and then unable to initiate |
| programs on some nodes?</a></li> |
| <li><a href="#ping"> Why does <i>slurmctld</i> log that some nodes |
| are not responding even if they are not in any partition?</a></li> |
| <li><a href="#controller"> How should I relocated the primary or backup |
| controller?</a></li> |
| <li><a href="#multi_slurm">Can multiple SLURM systems be run in |
| parallel for testing purposes?</a></li> |
| <li><a href="#multi_slurmd">Can slurm emulate a larger cluster?</a></li> |
| <li><a href="#extra_procs">Can SLURM emulate nodes with more |
| resources than physically exist on the node?</a></li> |
| <li><a href="#credential_replayed">What does a "credential |
| replayed" error in the <i>SlurmdLogFile</i> indicate?</a></li> |
| <li><a href="#large_time">What does a "Warning: Note very large |
| processing time" in the <i>SlurmctldLogFile</i> indicate?</a></li> |
| <li><a href="#lightweight_core">How can I add support for lightweight |
| core files?</a></li> |
| <li><a href="#limit_propagation">Is resource limit propagation |
| useful on a homogeneous cluster?</a></li> |
| </ol> |
| |
| <h2>For Users</h2> |
| <p><a name="comp"><b>1. Why is my job/node in COMPLETING state?</b></a><br> |
| When a job is terminating, both the job and its nodes enter the COMPLETING state. |
| As the SLURM daemon on each node determines that all processes associated with |
| the job have terminated, that node changes state to IDLE or some other appropriate |
| state. |
| When every node allocated to a job has determined that all processes associated |
| with it have terminated, the job changes state to COMPLETED or some other |
| appropriate state (e.g. FAILED). |
| Normally, this happens within a second. |
| However, if the job has processes that cannot be terminated with a SIGKILL |
| signal, the job and one or more nodes can remain in the COMPLETING state |
| for an extended period of time. |
| This may be indicative of processes hung waiting for a core file |
| to complete I/O or operating system failure. |
| If this state persists, the system administrator should check for processes |
| associated with the job that can not be terminated then use the |
| <span class="commandline">scontrol</span> command to change the node's |
| state to DOWN (e.g. "scontrol update NodeName=<i>name</i> State=DOWN Reason=hung_completing"), |
| reboot the node, then reset the node's state to IDLE |
| (e.g. "scontrol update NodeName=<i>name</i> State=RESUME"). |
| Note that setting the node DOWN will terminate all running or suspended |
| jobs associated with that node. |
| An alternative is to set the node's state to DRAIN until all jobs |
| associated with it terminate before setting it DOWN and re-booting.</p> |
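<p>For example, the drain/reboot/resume sequence might look like this
(the node name <i>tux3</i> is hypothetical):</p>
<pre>
# Let running jobs finish, but start nothing new on the node:
$ scontrol update NodeName=tux3 State=DRAIN Reason=hung_completing
# Once the node is idle (or if its jobs must be killed), set it DOWN and reboot:
$ scontrol update NodeName=tux3 State=DOWN Reason=hung_completing
# After the reboot, return the node to service:
$ scontrol update NodeName=tux3 State=RESUME
</pre>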
| |
| <p><a name="rlimit"><b>2. Why are my resource limits not propagated?</b></a><br> |
| When the <span class="commandline">srun</span> command executes, it captures the |
| resource limits in effect at that time. These limits are propagated to the allocated |
| nodes before initiating the user's job. The SLURM daemon running on that node then |
| tries to establish identical resource limits for the job being initiated. |
| There are several possible reasons for not being able to establish those |
| resource limits. |
| <ul> |
<li>The hard resource limits applied to SLURM's slurmd daemon are lower
than the user's soft resource limits on the submit host. Typically
the slurmd daemon is initiated by the init daemon with the operating
system default limits. This may be addressed either through use of the
ulimit command in the /etc/sysconfig/slurm file (see the sketch below)
or by enabling <a href="#pam">PAM in SLURM</a>.</li>
<li>The user's hard resource limits on the allocated node are lower than
the same user's soft resource limits on the node from which the
job was submitted. It is recommended that the system administrator
establish uniform hard resource limits for users on all nodes
within a cluster to prevent this from occurring.</li>
| </ul></p> |
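<p>A minimal sketch of raising the daemon's limits via
<i>/etc/sysconfig/slurm</i> (the specific limits shown are assumptions;
raise whichever limits your applications need):</p>
<pre>
# /etc/sysconfig/slurm -- sourced by the init script before slurmd starts
ulimit -l unlimited   # locked memory
ulimit -n 4096        # open files
</pre>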
| <p>NOTE: This may produce the error message "Can't propagate RLIMIT_...". |
The error message is printed only if the user explicitly specifies that
| the resource limit should be propagated or the srun command is running |
| with verbose logging of actions from the slurmd daemon (e.g. "srun -d6 ...").</p> |
| |
| <p><a name="pending"><b>3. Why is my job not running?</b></a><br> |
| The answer to this question depends upon the scheduler used by SLURM. Executing |
| the command</p> |
| <blockquote> |
| <p> <span class="commandline">scontrol show config | grep SchedulerType</span></p> |
| </blockquote> |
| <p> will supply this information. If the scheduler type is <b>builtin</b>, then |
| jobs will be executed in the order of submission for a given partition. Even if |
| resources are available to initiate your job immediately, it will be deferred |
| until no previously submitted job is pending. If the scheduler type is <b>backfill</b>, |
| then jobs will generally be executed in the order of submission for a given partition |
| with one exception: later submitted jobs will be initiated early if doing so does |
| not delay the expected execution time of an earlier submitted job. In order for |
backfill scheduling to be effective, users' jobs should specify reasonable time
| limits. If jobs do not specify time limits, then all jobs will receive the same |
| time limit (that associated with the partition), and the ability to backfill schedule |
| jobs will be limited. The backfill scheduler does not alter job specifications |
| of required or excluded nodes, so jobs which specify nodes will substantially |
| reduce the effectiveness of backfill scheduling. See the <a href="#backfill"> |
| backfill</a> section for more details. If the scheduler type is <b>wiki</b>, |
| this represents |
| <a href="http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php"> |
| The Maui Scheduler</a> or |
| <a href="http://www.clusterresources.com/pages/products/moab-cluster-suite.php"> |
| Moab Cluster Suite</a>. |
| Please refer to its documentation for help. For any scheduler, you can check priorities |
| of jobs using the command <span class="commandline">scontrol show job</span>.</p> |
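<p>For example, on a backfill-scheduled system, specifying a realistic
time limit at submission gives the scheduler room to start the job early
(a sketch; the 30-minute limit and script name are arbitrary):</p>
<pre>
$ sbatch --time=30 -N4 my_script
</pre>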
| |
| <p><a name="sharing"><b>4. Why does the srun --overcommit option not permit multiple jobs |
| to run on nodes?</b></a><br> |
| The <b>--overcommit</b> option is a means of indicating that a job or job step is willing |
| to execute more than one task per processor in the job's allocation. For example, |
| consider a cluster of two processor nodes. The srun execute line may be something |
| of this sort</p> |
| <blockquote> |
| <p><span class="commandline">srun --ntasks=4 --nodes=1 a.out</span></p> |
| </blockquote> |
| <p>This will result in not one, but two nodes being allocated so that each of the four |
| tasks is given its own processor. Note that the srun <b>--nodes</b> option specifies |
| a minimum node count and optionally a maximum node count. A command line of</p> |
| <blockquote> |
| <p><span class="commandline">srun --ntasks=4 --nodes=1-1 a.out</span></p> |
| </blockquote> |
| <p>would result in the request being rejected. If the <b>--overcommit</b> option |
| is added to either command line, then only one node will be allocated for all |
| four tasks to use.</p> |
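<p>For example, adding <b>--overcommit</b> to the second command above
packs all four tasks onto a single node:</p>
<pre>
srun --ntasks=4 --nodes=1-1 --overcommit a.out
</pre>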
| <p>More than one job can execute simultaneously on the same nodes through the use |
| of srun's <b>--shared</b> option in conjunction with the <b>Shared</b> parameter |
| in SLURM's partition configuration. See the man pages for srun and slurm.conf for |
| more information.</p> |
| |
| <p><a name="purge"><b>5. Why is my job killed prematurely?</b></a><br> |
SLURM has a job purging mechanism to remove inactive jobs (resource allocations)
before they reach their time limit, which may be infinite.
This inactivity time limit is configurable by the system administrator.
You can check its value with the command</p>
| <blockquote> |
| <p><span class="commandline">scontrol show config | grep InactiveLimit</span></p> |
| </blockquote> |
| <p>The value of InactiveLimit is in seconds. |
| A zero value indicates that job purging is disabled. |
| A job is considered inactive if it has no active job steps or if the srun |
| command creating the job is not responding. |
| In the case of a batch job, the srun command terminates after the job script |
| is submitted. |
| Therefore batch job pre- and post-processing is limited to the InactiveLimit. |
| Contact your system administrator if you believe the InactiveLimit value |
should be changed.</p>
| |
| <p><a name="opts"><b>6. Why are my srun options ignored?</b></a><br> |
| Everything after the command <span class="commandline">srun</span> is |
| examined to determine if it is a valid option for srun. The first |
| token that is not a valid option for srun is considered the command |
| to execute and everything after that is treated as an option to |
| the command. For example:</p> |
| <blockquote> |
| <p><span class="commandline">srun -N2 hostname -pdebug</span></p> |
| </blockquote> |
| <p>srun processes "-N2" as an option to itself. "hostname" is the |
| command to execute and "-pdebug" is treated as an option to the |
| hostname command. Which will change the name of the computer |
| on which SLURM executes the command - Very bad, <b>Don't run |
| this command as user root!</b></p> |
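<p>To pass the partition option to srun itself rather than to the
command, place it before the command name:</p>
<pre>
srun -N2 -pdebug hostname
</pre>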
| |
| <p><a name="cred"><b>7. Why are "Invalid job credential" errors generated? |
| </b></a><br> |
| This error is indicative of SLURM's job credential files being inconsistent across |
| the cluster. All nodes in the cluster must have the matching public and private |
| keys as defined by <b>JobCredPrivateKey</b> and <b>JobCredPublicKey</b> in the |
slurm configuration file <b>slurm.conf</b>.</p>
| |
| <p><a name="backfill"><b>8. Why is the SLURM backfill scheduler not starting my job? |
| </b></a><br> |
| There are significant limitations in the current backfill scheduler plugin. |
| It was designed to perform backfill node scheduling for a homogeneous cluster. |
| It does not manage scheduling on individual processors (or other consumable |
resources). It also does not update the required or excluded node list of
individual jobs. These are the current limitations. You can use the
scontrol show command to check whether these conditions apply
(see the example following the list below).</p>
| <ul> |
| <li>partition: State=UP</li> |
| <li>partition: RootOnly=NO</li> |
| <li>partition: Shared=NO</li> |
| <li>job: ReqNodeList=NULL</li> |
| <li>job: ExcNodeList=NULL</li> |
| <li>job: Contiguous=0</li> |
| <li>job: Features=NULL</li> |
| <li>job: MinProcs, MinMemory, and MinTmpDisk satisfied by all nodes in |
| the partition</li> |
| <li>job: MinProcs or MinNodes not to exceed partition's MaxNodes</li> |
| </ul> |
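<p>For example, the relevant fields might be checked as follows (a sketch;
the job ID and partition name are hypothetical, and exact field names can
vary by SLURM version):</p>
<pre>
$ scontrol show partition debug | grep -E "State|RootOnly|Shared"
$ scontrol show job 65541 | grep -E "ReqNodeList|ExcNodeList|Contiguous|Features"
</pre>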
<p>As soon as any priority-ordered job in the partition's queue cannot
have its resource request satisfied, no lower priority job in that
partition's queue will be considered as a backfill candidate. Any
programmer wishing to augment the existing code is welcome to do so.</p>
| |
| <p><a name="steps"><b>9. How can I run multiple jobs from within a |
| single script?</b></a><br> |
| A SLURM job is just a resource allocation. You can execute many |
| job steps within that allocation, either in parallel or sequentially. |
| Some jobs actually launch thousands of job steps this way. The job |
| steps will be allocated nodes that are not already allocated to |
other job steps. This essentially provides a second level of resource
| management within the job for the job steps.</p> |
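<p>A minimal sketch of a batch script that runs two job steps in parallel
within one allocation (<i>prog1</i> and <i>prog2</i> are placeholders):</p>
<pre>
#!/bin/sh
# Each srun launches one job step on nodes drawn from the job's allocation.
srun -N1 -n1 prog1 &
srun -N1 -n1 prog2 &
# Wait for both job steps to finish before the job completes.
wait
</pre>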
| |
| <p><a name="orphan"><b>10. Why do I have job steps when my job has |
| already COMPLETED?</b></a><br> |
| NOTE: This only applies to systems configured with |
| <i>SwitchType=switch/elan</i> or <i>SwitchType=switch/federation</i>. |
| All other systems will purge all job steps on job completion.</p> |
| <p>SLURM maintains switch (network interconnect) information within |
| the job step for Quadrics Elan and IBM Federation switches. |
| This information must be maintained until we are absolutely certain |
| that the processes associated with the switch have been terminated |
| to avoid the possibility of re-using switch resources for other |
| jobs (even on different nodes). |
| SLURM considers jobs COMPLETED when all nodes allocated to the |
job are either DOWN or confirm termination of all of its processes.
| This enables SLURM to purge job information in a timely fashion |
| even when there are many failing nodes. |
| Unfortunately the job step information may persist longer.</p> |
| |
| <p><a name="multi_batch"><b>11. How can I run a job within an existing |
| job allocation?</b></a><br> |
| There is a srun option <i>--jobid</i> that can be used to specify |
| a job's ID. |
| For a batch job or within an existing resource allocation, the |
| environment variable <i>SLURM_JOBID</i> has already been defined, |
| so all job steps will run within that job allocation unless |
| otherwise specified. |
| The one exception to this is when submitting batch jobs. |
| When a batch job is submitted from within an existing batch job, |
| it is treated as a new job allocation request and will get a |
| new job ID unless explicitly set with the <i>--jobid</i> option. |
| If you specify that a batch job should use an existing allocation, |
| that job allocation will be released upon the termination of |
| that batch job.</p> |
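<p>For example, to launch a job step inside an existing allocation from
another shell (the job ID 65541 is hypothetical):</p>
<pre>
$ srun --jobid=65541 -N1 hostname
</pre>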
| |
| <p><a name="user_env"><b>12. How does SLURM establish the environment |
| for my job?</b></a><br> |
| SLURM processes are not run under a shell, but directly exec'ed |
| by the <i>slurmd</i> daemon (assuming <i>srun</i> is used to launch |
| the processes). |
| The environment variables in effect at the time the <i>srun</i> command |
| is executed are propagated to the spawned processes. |
| The <i>~/.profile</i> and <i>~/.bashrc</i> scripts are not executed |
| as part of the process launch.</p> |
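<p>Because the environment is captured when <i>srun</i> is invoked, any
variables your program needs must be set beforehand. A quick check
(a sketch; the variable name is arbitrary):</p>
<pre>
$ export MY_SETTING=test
$ srun -N1 env | grep MY_SETTING
MY_SETTING=test
</pre>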
| |
| <p><a name="prompt"><b>13. How can I get shell prompts in interactive |
| mode?</b></a><br> |
| <i>srun -u bash -i</i><br> |
| Srun's <i>-u</i> option turns off buffering of stdout. |
Bash's <i>-i</i> option tells it to run in interactive mode (with prompts).</p>
| |
| <p><a name="batch_out"><b>14. How can I get the task ID in the output |
| or error file name for a batch job?</b></a><br> |
| The <i>srun -b</i> or <i>sbatch</i> commands are meant to accept a |
| script rather than a command line. If you specify a command line |
| rather than a script, it gets translated to a simple script of this |
| sort:</p> |
| <pre> |
| #!/bin/sh |
| srun hostname |
| </pre> |
<p>You will note that the srun command lacks the output file specification.
Its output (for all tasks) becomes the output of the job. If you
| want separate output by task, you will need to build a script containing |
| this specification. For example:</p> |
| <pre> |
| $ cat test |
| #!/bin/sh |
| echo begin_test |
| srun -o out_%j_%t hostname |
| |
| $ sbatch -n7 -o out_%j test |
| sbatch: Submitted batch job 65541 |
| |
| $ ls -l out* |
| -rw-rw-r-- 1 jette jette 11 Jun 15 09:15 out_65541 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_0 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_1 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_2 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_3 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_4 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_5 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_6 |
| |
| $ cat out_65541 |
| begin_test |
| |
| $ cat out_65541_2 |
| tdev2 |
| </pre> |
| |
| <p><a name="parallel_make"><b>15. Can the <i>make</i> command |
| utilize the resources allocated to a SLURM job?</b></a><br> |
Yes. A patch for GNU make version 3.81 is available as part of the
SLURM distribution in the file
<i>contribs/make.slurm.patch</i>.
| This patch will use SLURM to launch tasks across a job's current resource |
| allocation. Depending upon the size of modules to be compiled, this may |
| or may not improve performance. If most modules are thousands of lines |
| long, the use of additional resources should more than compensate for the |
| overhead of SLURM's task launch. Use with make's <i>-j</i> option within an |
| existing SLURM allocation. Outside of a SLURM allocation, make's behavior |
| will be unchanged.</p> |
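<p>A sketch of applying the patch and using it (the source directory and
patch level are assumptions; adjust to your installation):</p>
<pre>
$ cd make-3.81
$ patch -p0 &lt; /path/to/slurm/contribs/make.slurm.patch
$ ./configure && make && make install
# Then, from within an existing SLURM allocation:
$ make -j16
</pre>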
| |
| <p><a name="terminal"><b>16. Can tasks be launched with a remote |
| terminal?</b></a><br> |
| SLURM does not directly support a remote pseudo terminal for spawned |
| tasks. |
| We intend to remedy this in Slurm version 1.3. |
| Until then, you can accomplish this by starting an appropriate program |
| or script. In the simplest case (X11 over TCP with the DISPLAY |
| environment already set), <i>srun xterm</i> may suffice. In the more |
| general case, the following scripts should work. |
<b>NOTE: The pathnames of the additional scripts are included in the
variables BS and IS of the first script. You must change these to match
your installation.</b>
Execute the script with the sbatch options desired.
For example, <i>interactive -N2 -pdebug</i>.</p>
| |
| <pre> |
| #!/bin/bash |
| # -*- coding: utf-8 -*- |
| # Author: Pär Andersson (National Supercomputer Centre, Sweden) |
| # Version: 0.3 2007-07-30 |
| # |
| # This will submit a batch script that starts screen on a node. |
| # Then ssh is used to connect to the node and attach the screen. |
| # The result is very similar to an interactive shell in PBS |
| # (qsub -I) |
| |
| # Batch Script that starts SCREEN |
| BS=/INSTALL_DIRECTORY/_interactive |
| # Interactive screen script |
| IS=/INSTALL_DIRECTORY/_interactive_screen |
| |
| # Submit the job and get the job id |
JOB=`sbatch --output=/dev/null --error=/dev/null "$@" $BS 2>&1 \
| | egrep -o -e "\b[0-9]+$"` |
| |
| # Make sure the job is always canceled |
| trap "{ /usr/bin/scancel -q $JOB; exit; }" SIGINT SIGTERM EXIT |
| |
| echo "Waiting for JOBID $JOB to start" |
| while true;do |
| sleep 5s |
| |
| # Check job status |
| STATUS=`squeue -j $JOB -t PD,R -h -o %t` |
| |
| if [ "$STATUS" = "R" ];then |
| # Job is running, break the while loop |
| break |
| elif [ "$STATUS" != "PD" ];then |
| echo "Job is not Running or Pending. Aborting" |
| scancel $JOB |
| exit 1 |
| fi |
| |
| echo -n "." |
| |
| done |
| |
| # Determine the first node in the job: |
| NODE=`srun --jobid=$JOB -N1 hostname` |
| |
| # SSH to the node and attach the screen |
| sleep 1s |
| ssh -X -t $NODE $IS slurm$JOB |
| # The trap will now cancel the job before exiting. |
| </pre> |
| |
| <p>NOTE: The above script executes the script below, |
named <i>_interactive</i>.</p>
| <pre> |
| #!/bin/sh |
| # -*- coding: utf-8 -*- |
| # Author: Pär Andersson (National Supercomputer Centre, Sweden) |
| # Version: 0.2 2007-07-30 |
| # |
| # Simple batch script that starts SCREEN. |
| |
| exec screen -Dm -S slurm$SLURM_JOBID |
| </pre> |
| |
| <p>The following script named <i>_interactive_screen</i> is also used.</p> |
| <pre> |
| #!/bin/sh |
| # -*- coding: utf-8 -*- |
| # Author: Pär Andersson (National Supercomputer Centre, Sweden) |
| # Version: 0.3 2007-07-30 |
| # |
| |
| SCREENSESSION=$1 |
| |
| # If DISPLAY is set then set that in the screen, then create a new |
| # window with that environment and kill the old one. |
| if [ "$DISPLAY" != "" ];then |
| screen -S $SCREENSESSION -X unsetenv DISPLAY |
| screen -p0 -S $SCREENSESSION -X setenv DISPLAY $DISPLAY |
| screen -p0 -S $SCREENSESSION -X screen |
| screen -p0 -S $SCREENSESSION -X kill |
| fi |
| |
| exec screen -S $SCREENSESSION -rd |
| </pre> |
| |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>For Administrators</h2> |
| <p><a name="suspend"><b>1. How is job suspend/resume useful?</b></a><br> |
| Job suspend/resume is most useful to get particularly large jobs initiated |
| in a timely fashion with minimal overhead. Say you want to get a full-system |
| job initiated. Normally you would need to either cancel all running jobs |
| or wait for them to terminate. Canceling jobs results in the loss of |
| their work to that point from either their beginning or last checkpoint. |
| Waiting for the jobs to terminate can take hours, depending upon your |
| system configuration. A more attractive alternative is to suspend the |
| running jobs, run the full-system job, then resume the suspended jobs. |
| This can easily be accomplished by configuring a special queue for |
| full-system jobs and using a script to control the process. |
| The script would stop the other partitions, suspend running jobs in those |
| partitions, and start the full-system partition. |
| The process can be reversed when desired. |
| One can effectively gang schedule (time-slice) multiple jobs |
| using this mechanism, although the algorithms to do so can get quite |
| complex. |
Suspending and resuming a job makes use of the SIGSTOP and SIGCONT
signals respectively, so swap and disk space should be sufficient to
accommodate all jobs allocated to a node, either running or suspended.</p>
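<p>A sketch of scripting the suspend/run/resume cycle (the partition names
are hypothetical; verify these scontrol and squeue options against your
SLURM version):</p>
<pre>
#!/bin/sh
# Keep new jobs from starting in the normal partition.
scontrol update PartitionName=batch State=DOWN
# Suspend every running job in that partition.
for job in `squeue -p batch -t R -h -o %i`; do
    scontrol suspend $job
done
# ... submit and run the full-system job in its own partition ...
# Then reverse the process:
for job in `squeue -p batch -t S -h -o %i`; do
    scontrol resume $job
done
scontrol update PartitionName=batch State=UP
</pre>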
| |
| <p><a name="fast_schedule"><b>2. How can I configure SLURM to use |
| the resources actually found on a node rather than what is defined |
| in <i>slurm.conf</i>?</b></a><br> |
SLURM can base its scheduling decisions either upon the node
configuration defined in <i>slurm.conf</i> or upon what each node
actually returns as available resources.
This is controlled using the configuration parameter <i>FastSchedule</i>.
Set its value to zero in order to use the resources actually
found on each node, but with a higher overhead for scheduling.
| A value of one is the default and results in the node configuration |
| defined in <i>slurm.conf</i> being used. See "man slurm.conf" |
| for more details.</p> |
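<p>For example, in <i>slurm.conf</i>:</p>
<pre>
# Schedule using the resources each node actually reports:
FastSchedule=0
</pre>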
| |
| <p><a name="return_to_service"><b>3. Why is a node shown in state |
| DOWN when the node has registered for service?</b></a><br> |
| The configuration parameter <i>ReturnToService</i> in <i>slurm.conf</i> |
| controls how DOWN nodes are handled. |
| Set its value to one in order for DOWN nodes to automatically be |
| returned to service once the <i>slurmd</i> daemon registers |
| with a valid node configuration. |
| A value of zero is the default and results in a node staying DOWN |
until an administrator explicitly returns it to service using
| the command "scontrol update NodeName=whatever State=RESUME". |
| See "man slurm.conf" and "man scontrol" for more |
| details.</p> |
| |
| <p><a name="down_node"><b>4. What happens when a node crashes?</b></a><br> |
| A node is set DOWN when the slurmd daemon on it stops responding |
| for <i>SlurmdTimeout</i> as defined in <i>slurm.conf</i>. |
| The node can also be set DOWN when certain errors occur or the |
| node's configuration is inconsistent with that defined in <i>slurm.conf</i>. |
| Any active job on that node will be killed unless it was submitted |
| with the srun option <i>--no-kill</i>. |
| Any active job step on that node will be killed. |
| See the slurm.conf and srun man pages for more information.</p> |
| |
| <p><a name="multi_job"><b>5. How can I control the execution of multiple |
| jobs per node?</b></a><br> |
There are two mechanisms to control this.
If you want to allocate individual processors on a node to jobs,
configure <i>SelectType=select/cons_res</i>.
See <a href="cons_res.html">Consumable Resources in SLURM</a>
for details about this configuration.
If you want to allocate whole nodes to jobs, configure
<i>SelectType=select/linear</i>.
Each partition also has a configuration parameter <i>Shared</i>
that enables more than one job to execute on each node.
See <i>man slurm.conf</i> for more information about these
configuration parameters.</p>
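<p>A sketch of the relevant <i>slurm.conf</i> entries (the node and
partition names are hypothetical):</p>
<pre>
# Allocate individual processors rather than whole nodes:
SelectType=select/cons_res
# Permit more than one job per node in this partition:
PartitionName=debug Nodes=tux[0-31] Shared=YES
</pre>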
| |
| <p><a name="inc_plugin"><b>6. When the SLURM daemon starts, it |
| prints "cannot resolve X plugin operations" and exits. |
| What does this mean?</b></a><br> |
| This means that symbols expected in the plugin were |
| not found by the daemon. This typically happens when the |
plugin was built or installed improperly or the configuration
file is telling the daemon to use an old plugin (say from the
previous version of SLURM). Restart the daemon in verbose mode
for more information (e.g. "slurmctld -Dvvvvv").</p>
| |
| <p><a name="sigpipe"><b>7. Why are user tasks intermittently dying |
| at launch with SIGPIPE error messages?</b></a><br> |
If you are using LDAP or some other remote name service for
username and group lookup, chances are that the underlying
| libc library functions are triggering the SIGPIPE. You can likely |
| work around this problem by setting <i>CacheGroups=1</i> in your slurm.conf |
file. However, be aware that you will need to run "scontrol
reconfigure" any time your groups database is updated.</p>
| |
| <p><a name="maint_time"><b>8. How can I dry up the workload for a |
| maintenance period?</b></a><br> |
| There isn't a mechanism to tell SLURM that all jobs should be |
| completed by a specific time. The best way to address this is |
| to shorten the <i>MaxTime</i> associated with the partitions so |
as to avoid initiating jobs that will not have completed by
the maintenance period.</p>
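<p>For example, an hour before a maintenance window you might limit new
jobs to 30 minutes (the partition name is hypothetical):</p>
<pre>
$ scontrol update PartitionName=batch MaxTime=30
</pre>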
| |
| <p><a name="pam"><b>9. How can PAM be used to control a user's limits on |
| or access to compute nodes?</b></a><br> |
| First, enable SLURM's use of PAM by setting <i>UsePAM=1</i> in |
| <i>slurm.conf</i>.<br> |
| Second, establish a PAM configuration file for slurm in <i>/etc/pam.d/slurm</i>. |
| A basic configuration you might use is:</p> |
| <pre> |
| auth required pam_localuser.so |
| account required pam_unix.so |
| session required pam_limits.so |
| </pre> |
| <p>Third, set the desired limits in <i>/etc/security/limits.conf</i>. |
| For example, to set the locked memory limit to unlimited for all users:</p> |
| <pre> |
| * hard memlock unlimited |
| * soft memlock unlimited |
| </pre> |
<p>Finally, you need to disable SLURM's forwarding of the limits from the
session from which the <i>srun</i> initiating the job ran. By default
all resource limits are propagated from that session. For example, adding
the following line to <i>slurm.conf</i> will prevent the locked memory
limit from being propagated: <i>PropagateResourceLimitsExcept=MEMLOCK</i>.</p>
| |
<p>We also have a PAM module for SLURM that prevents users from
logging into nodes that they have not been allocated (except for user
root, which can always log in). pam_slurm is available for download from
<a href="ftp://ftp.llnl.gov/pub/linux/pam_slurm/">ftp://ftp.llnl.gov/pub/linux/pam_slurm</a>.
The use of pam_slurm does not require <i>UsePAM</i> being set. The
two uses of PAM are independent.</p>
| |
| <p><a name="time"><b>10. Why are jobs allocated nodes and then unable |
| to initiate programs on some nodes?</b></a><br> |
| This typically indicates that the time on some nodes is not consistent |
| with the node on which the <i>slurmctld</i> daemon executes. In order to |
| initiate a job step (or batch job), the <i>slurmctld</i> daemon generates |
a credential containing a time stamp. If the <i>slurmd</i> daemon
receives a credential containing a time stamp later than the current
time or more than a few minutes in the past, the credential will be rejected.
| If you check in the <i>SlurmdLog</i> on the nodes of interest, you |
| will likely see messages of this sort: "<i>Invalid job credential from |
| <some IP address>: Job credential expired</i>." Make the times |
consistent across all of the nodes and all should be well.</p>
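<p>A quick way to compare clocks across nodes (assumes the <i>pdsh</i>
parallel shell is installed; any remote execution tool will do):</p>
<pre>
$ pdsh -w tux[0-15] date
# If the times differ, synchronize them, e.g. with NTP.
</pre>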
| |
| <p><a name="ping"><b>11. Why does <i>slurmctld</i> log that some nodes |
| are not responding even if they are not in any partition?</b></a><br> |
| The <i>slurmctld</i> daemon periodically pings the <i>slurmd</i> |
| daemon on every configured node, even if not associated with any |
| partition. You can control the frequency of this ping with the |
<i>SlurmdTimeout</i> configuration parameter in <i>slurm.conf</i>.</p>
| |
| <p><a name="controller"><b>12. How should I relocated the primary or |
| backup controller?</b></a><br> |
| If the cluster's computers used for the primary or backup controller |
| will be out of service for an extended period of time, it may be desirable |
| to relocate them. In order to do so, follow this procedure:</p> |
| <ol> |
| <li>Stop all SLURM daemons</li> |
| <li>Modify the <i>ControlMachine</i>, <i>ControlAddr</i>, |
| <i>BackupController</i>, and/or <i>BackupAddr</i> in the <i>slurm.conf</i> file</li> |
<li>Distribute the updated <i>slurm.conf</i> file to all nodes</li>
| <li>Restart all SLURM daemons</li> |
| </ol> |
<p>There should be no loss of any running or pending jobs. Ensure that
any nodes added to the cluster have a current <i>slurm.conf</i> file
installed.
<b>CAUTION:</b> If two nodes are simultaneously configured as the primary
controller (two nodes on which <i>ControlMachine</i> specifies the local host
and the <i>slurmctld</i> daemon is executing on each), system behavior will be
destructive. If a compute node has an incorrect <i>ControlMachine</i> or
<i>BackupController</i> parameter, that node may be rendered unusable, but no
other harm will result.</p>
| |
| <p><a name="multi_slurm"><b>13. Can multiple SLURM systems be run in |
| parallel for testing purposes?</b></a><br> |
| Yes, this is a great way to test new versions of SLURM. |
| Just install the test version in a different location with a different |
| <i>slurm.conf</i>. |
| The test system's <i>slurm.conf</i> should specify different |
| pathnames and port numbers to avoid conflicts. |
| The only problem is if more than one version of SLURM is configured |
| with <i>switch/elan</i> or <i>switch/federation</i>. |
| In that case, there can be conflicting switch window requests from |
| the different SLURM systems. |
| This can be avoided by configuring the test system with <i>switch/none</i>. |
| MPI jobs started on an Elan or Federation switch system without the |
| switch windows configured will not execute properly, but other jobs |
| will run fine. |
| Another option for testing on Elan or Federation systems is to use |
| a different set of nodes for the different SLURM systems. |
| That will permit both systems to allocate switch windows without |
conflicts.</p>
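<p>A sketch of the overrides for the test system's <i>slurm.conf</i>
(the port numbers and pathnames are illustrative):</p>
<pre>
SlurmctldPort=7010
SlurmdPort=7011
StateSaveLocation=/var/spool/slurm.test
SlurmdSpoolDir=/var/spool/slurmd.test
SwitchType=switch/none
</pre>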
| |
| <p><a name="multi_slurmd"><b>14. Can slurm emulate a larger |
| cluster?</b></a><br> |
| Yes, this can be useful for testing purposes. |
| It has also been used to partition "fat" nodes into multiple SLURM nodes. |
There are two ways to do this.
The best method for most conditions is to run one <i>slurmd</i>
daemon per emulated node in the cluster, as follows.</p>
| <ol> |
<li>When executing the <i>configure</i> program, use the option
<i>--enable-multiple-slurmd</i> (or add that option to your <i>~/.rpmmacros</i>
file).</li>
| <li>Build and install SLURM in the usual manner.</li> |
| <li>In <i>slurm.conf</i> define the desired node names (arbitrary |
| names used only by SLURM) as <i>NodeName</i> along with the actual |
| address of the physical node in <i>NodeHostname</i>. Multiple |
| <i>NodeName</i> values can be mapped to a single |
| <i>NodeHostname</i>. Note that each <i>NodeName</i> on a single |
| physical node needs to be configured to use a different port number. You |
| will also want to use the "%n" symbol in slurmd related path options in |
| slurm.conf. </li> |
| <li>When starting the <i>slurmd</i> daemon, include the <i>NodeName</i> |
| of the node that it is supposed to serve on the execute line.</li> |
| </ol> |
<p>It is strongly recommended that SLURM version 1.2 or higher be used
for this due to its improved support for multiple slurmd daemons.
See the
<a href="programmer_guide.shtml#multiple_slurmd_support">Programmers Guide</a>
for more details about configuring multiple slurmd support.</p>
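<p>A sketch of such a configuration (the node names, host name, and ports
are illustrative):</p>
<pre>
# slurm.conf: two emulated nodes on one physical host
NodeName=tux1 NodeHostname=realhost Port=17001
NodeName=tux2 NodeHostname=realhost Port=17002
SlurmdLogFile=/var/log/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid
# Start one daemon per emulated node:
# slurmd -N tux1
</pre>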
| |
| <p>In order to emulate a really large cluster, it can be more |
| convenient to use a single <i>slurmd</i> daemon. |
| That daemon will not be able to launch many tasks, but can |
| suffice for developing or testing scheduling software. |
| Do not run job steps with more than a couple of tasks each |
| or execute more than a few jobs at any given time. |
| Doing so may result in the <i>slurmd</i> daemon exhausting its |
| memory and failing. |
<b>Use this method with caution.</b></p>
| <ol> |
| <li>Execute the <i>configure</i> program with your normal options.</li> |
| <li>Append the line "<i>#define HAVE_FRONT_END 1</i>" to the resulting |
| <i>config.h</i> file.</li> |
| <li>Build and install SLURM in the usual manner.</li> |
| <li>In <i>slurm.conf</i> define the desired node names (arbitrary |
| names used only by SLURM) as <i>NodeName</i> along with the actual |
| name and address of the <b>one</b> physical node in <i>NodeHostName</i> |
| and <i>NodeAddr</i>. |
| Up to 64k nodes can be configured in this virtual cluster.</li> |
| <li>Start your <i>slurmctld</i> and one <i>slurmd</i> daemon. |
| It is advisable to use the "-c" option to start the daemons without |
| trying to preserve any state files from previous executions. |
| Be sure to use the "-c" option when switch from this mode too.</li> |
| <li>Create job allocations as desired, but do not run job steps |
| with more than a couple of tasks.</li> |
| </ol> |
| <pre> |
| $ ./configure --enable-debug --prefix=... --sysconfdir=... |
| $ echo "#define HAVE_FRONT_END 1" >>config.h |
| $ make install |
| $ grep NodeHostName slurm.conf |
| <i>NodeName=dummy[1-1200] NodeHostName=localhost NodeAddr=127.0.0.1</i> |
| $ slurmctld -c |
| $ slurmd -c |
| $ sinfo |
| <i>PARTITION AVAIL TIMELIMIT NODES STATE NODELIST</i> |
| <i>pdebug* up 30:00 1200 idle dummy[1-1200]</i> |
| $ cat tmp |
| <i>#!/bin/bash</i> |
| <i>sleep 30</i> |
| $ srun -N200 -b tmp |
| <i>srun: jobid 65537 submitted</i> |
| $ srun -N200 -b tmp |
| <i>srun: jobid 65538 submitted</i> |
| $ srun -N800 -b tmp |
| <i>srun: jobid 65539 submitted</i> |
| $ squeue |
| <i>JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)</i> |
| <i>65537 pdebug tmp jette R 0:03 200 dummy[1-200]</i> |
| <i>65538 pdebug tmp jette R 0:03 200 dummy[201-400]</i> |
| <i>65539 pdebug tmp jette R 0:02 800 dummy[401-1200]</i> |
| </pre> |
| |
| <p><a name="extra_procs"><b>15. Can SLURM emulate nodes with more |
| resources than physically exist on the node?</b></a><br> |
Yes, in SLURM version 1.2 or higher.
| In the <i>slurm.conf</i> file, set <i>FastSchedule=2</i> and specify |
| any desired node resource specifications (<i>Procs</i>, <i>Sockets</i>, |
| <i>CoresPerSocket</i>, <i>ThreadsPerCore</i>, and/or <i>TmpDisk</i>). |
SLURM will use the resource specification for each node that is
given in <i>slurm.conf</i> and will not check these specifications
against those actually found on the node.</p>
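<p>A sketch (the node name and resource counts are illustrative):</p>
<pre>
# slurm.conf: report 16 processors per node regardless of the hardware
FastSchedule=2
NodeName=tux[0-31] Procs=16 TmpDisk=81920
</pre>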
| |
| <p><a name="credential_replayed"><b>16. What does a "credential |
| replayed" error in the <i>SlurmdLogFile</i> indicate?</b></a><br> |
| This error is indicative of the <i>slurmd</i> daemon not being able |
| to respond to job initiation requests from the <i>srun</i> command |
| in a timely fashion (a few seconds). |
| <i>Srun</i> responds by resending the job initiation request. |
| When the <i>slurmd</i> daemon finally starts to respond, it |
| processes both requests. |
| The second request is rejected and the event is logged with |
| the "credential replayed" error. |
| If you check the <i>SlurmdLogFile</i> and <i>SlurmctldLogFile</i>, |
| you should see signs of the <i>slurmd</i> daemon's non-responsiveness. |
A variety of factors can be responsible for this problem,
including:</p>
| <ul> |
| <li>Diskless nodes encountering network problems</li> |
| <li>Very slow Network Information Service (NIS)</li> |
| <li>The <i>Prolog</i> script taking a long time to complete</li> |
| </ul> |
| <p>In Slurm version 1.2, this can be addressed with the |
| <i>MessageTimeout</i> configuration parameter by setting a |
| value higher than the default 5 seconds. |
In earlier versions of SLURM, the <i>--msg-timeout</i> option
of <i>srun</i> serves a similar purpose.</p>
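<p>For example, in <i>slurm.conf</i>:</p>
<pre>
# Allow slow-responding daemons more time (the default is 5 seconds):
MessageTimeout=10
</pre>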
| |
| <p><a name="large_time"><b>17. What does a "Warning: Note very large |
| processing time" in the <i>SlurmctldLogFile</i> indicate?</b></a><br> |
| This error is indicative of some operation taking an unexpectedly |
| long time to complete, over one second to be specific. |
Setting the <i>SlurmctldDebug</i> configuration parameter to a value
of six or higher should identify which operation(s) are
experiencing long delays.
| This message typically indicates long delays in file system access |
| (writing state information or getting user information). |
| Another possibility is that the node on which the slurmctld |
| daemon executes has exhausted memory and is paging. |
Try running the program <i>top</i> to check for this possibility.</p>
| |
| <p><a name="lightweight_core"><b>18. How can I add support for |
| lightweight core files?</b></a><br> |
| SLURM supports lightweight core files by setting environment variables |
| based upon the <i>srun --core</i> option. Of particular note, it |
| sets the <i>LD_PRELOAD</i> environment variable to load new functions |
| used to process a core dump. |
First you will need to acquire and install a shared object
| library with the appropriate functions. |
| Then edit the SLURM code in <i>src/srun/core-format.c</i> to |
| specify a name for the core file type, |
| add a test for the existence of the library, |
and set environment variables appropriately when it is used.</p>
| |
| <p><a name="limit_propagation"><b>19. Is resource limit propagation |
| useful on a homogeneous cluster?</b></a><br> |
| Resource limit propagation permits a user to modify resource limits |
| and submit a job with those limits. |
| By default, SLURM automatically propagates all resource limits in |
| effect at the time of job submission to the tasks spawned as part |
| of that job. |
| System administrators can utilize the <i>PropagateResourceLimits</i> |
| and <i>PropagateResourceLimitsExcept</i> configuration parameters to |
| change this behavior. |
| Users can override defaults using the <i>srun --propagate</i> |
| option. |
| See <i>"man slurm.conf"</i> and <i>"man srun"</i> for more information |
about these options.</p>
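<p>For example, to propagate only the core file size limit for one run
(a sketch; see <i>man srun</i> for the limit names your version accepts):</p>
<pre>
$ ulimit -c unlimited
$ srun --propagate=CORE a.out
</pre>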
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p style="text-align:center;">Last modified 30 July 2007</p> |
| |
| <!--#include virtual="footer.txt"--> |