| .TH "srun" "1" "SLURM 2.0" "April 2009" "SLURM Commands" |
| |
| .SH "NAME" |
| srun \- Run parallel jobs |
| |
| .SH "SYNOPSIS" |
| \fBsrun\fR [\fIOPTIONS\fR...] \fIexecutable \fR[\fIargs\fR...] |
| |
| .SH "DESCRIPTION" |
Run a parallel job on a cluster managed by SLURM. If necessary, srun will
| first create a resource allocation in which to run the parallel job. |
| |
| .SH "OPTIONS" |
| .LP |
| |
| .TP |
| \fB\-\-acctg\-freq\fR=<\fIseconds\fR> |
| Define the job accounting sampling interval. |
| This can be used to override the \fIJobAcctGatherFrequency\fR parameter in SLURM's |
| configuration file, \fIslurm.conf\fR. |
A value of zero disables the periodic job sampling and provides accounting
| information only on job termination (reducing SLURM interference with the job). |
| |
| .TP |
| \fB\-B\fR \fB\-\-extra\-node\-info\fR=<\fIsockets\fR[:\fIcores\fR[:\fIthreads\fR]]> |
| Request a specific allocation of resources with details as to the |
| number and type of computational resources within a cluster: |
| number of sockets (or physical processors) per node, |
| cores per socket, and threads per core. |
| The total amount of resources being requested is the product of all of |
| the terms. |
| As with \-\-nodes, each value can be a single number or a range (e.g. min\-max). |
| An asterisk (*) can be used as a placeholder indicating that all available |
| resources of that type are to be utilized. |
| As with nodes, the individual levels can also be specified in separate |
| options if desired: |
| .nf |
| \fB\-\-sockets\-per\-node\fR=<\fIsockets\fR> |
| \fB\-\-cores\-per\-socket\fR=<\fIcores\fR> |
| \fB\-\-threads\-per\-core\fR=<\fIthreads\fR> |
| .fi |
| When the task/affinity plugin is enabled, |
| specifying an allocation in this manner also instructs SLURM to use |
| a CPU affinity mask to guarantee the request is filled as specified. |
NOTE: Support for these options is configuration dependent.
| The task/affinity plugin must be configured. |
| In addition either select/linear or select/cons_res plugin must be |
| configured. |
| If select/cons_res is configured, it must have a parameter of CR_Core, |
| CR_Core_Memory, CR_Socket, or CR_Socket_Memory. |
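For example, a request for two sockets per node and four cores per socket
on each of two nodes might look like this (illustrative counts):
.nf
srun \-N 2 \-B 2:4 a.out
srun \-N 2 \-\-sockets\-per\-node=2 \-\-cores\-per\-socket=4 a.out
.fi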
| |
| .TP |
| \fB\-\-begin\fR=<\fItime\fR> |
| Defer initiation of this job until the specified time. |
| It accepts times of the form \fIHH:MM:SS\fR to run a job at |
| a specific time of day (seconds are optional). |
| (If that time is already past, the next day is assumed.) |
| You may also specify \fImidnight\fR, \fInoon\fR, or |
| \fIteatime\fR (4pm) and you can have a time\-of\-day suffixed |
| with \fIAM\fR or \fIPM\fR for running in the morning or the evening. |
| You can also say what day the job will be run, by specifying |
a date of the form \fIMMDDYY\fR or \fIMM/DD/YY\fR or
\fIYYYY\-MM\-DD\fR. Combine date and time using the following
| format \fIYYYY\-MM\-DD[THH:MM[:SS]]\fR. You can also |
| give times like \fInow + count time\-units\fR, where the time\-units |
| can be \fIseconds\fR (default), \fIminutes\fR, \fIhours\fR, |
| \fIdays\fR, or \fIweeks\fR and you can tell SLURM to run |
| the job today with the keyword \fItoday\fR and to run the |
| job tomorrow with the keyword \fItomorrow\fR. |
| The value may be changed after job submission using the |
| \fBscontrol\fR command. |
| For example: |
| .nf |
| \-\-begin=16:00 |
| \-\-begin=now+1hour |
| \-\-begin=now+60 (seconds by default) |
| \-\-begin=2010\-01\-20T12:34:00 |
| .fi |
| |
| .RS |
| .PP |
| Notes on date/time specifications: |
| \- Although the 'seconds' field of the HH:MM:SS time specification is |
| allowed by the code, note that the poll time of the SLURM scheduler |
| is not precise enough to guarantee dispatch of the job on the exact |
| second. The job will be eligible to start on the next poll |
| following the specified time. The exact poll interval depends on the |
| SLURM scheduler (e.g., 60 seconds with the default sched/builtin). |
| \- If no time (HH:MM:SS) is specified, the default is (00:00:00). |
| \- If a date is specified without a year (e.g., MM/DD) then the current |
| year is assumed, unless the combination of MM/DD and HH:MM:SS has |
| already passed for that year, in which case the next year is used. |
| .RE |
| |
| .TP |
| \fB\-\-checkpoint\fR=<\fItime\fR> |
| Specifies the interval between creating checkpoints of the job step. |
By default, no checkpoints will be created for the job step.
| Acceptable time formats include "minutes", "minutes:seconds", |
| "hours:minutes:seconds", "days\-hours", "days\-hours:minutes" and |
| "days\-hours:minutes:seconds". |
| |
| .TP |
| \fB\-\-checkpoint\-dir\fR=<\fIdirectory\fR> |
| Specifies the directory into which the job or job step's checkpoint should |
| be written (used by the checkpoint/blcr and checkpoint/xlch plugins only). |
| The default value is the current working directory. |
| Checkpoint files will be of the form "<job_id>.ckpt" for jobs |
| and "<job_id>.<step_id>.ckpt" for job steps. |
| |
| .TP |
| \fB\-\-comment\fR=<\fIstring\fR> |
| An arbitrary comment. |
| |
| .TP |
| \fB\-C\fR, \fB\-\-constraint\fR=<\fIlist\fR> |
| Specify a list of constraints. |
| The constraints are features that have been assigned to the nodes by |
| the slurm administrator. |
| The \fIlist\fR of constraints may include multiple features separated |
| by ampersand (AND) and/or vertical bar (OR) operators. |
| For example: \fB\-\-constraint="opteron&video"\fR or |
| \fB\-\-constraint="fast|faster"\fR. |
| In the first example, only nodes having both the feature "opteron" AND |
| the feature "video" will be used. |
| There is no mechanism to specify that you want one node with feature |
| "opteron" and another node with feature "video" in that case that no |
| node has both features. |
| If only one of a set of possible options should be used for all allocated |
| nodes, then use the OR operator and enclose the options within square brackets. |
| For example: "\fB\-\-constraint=[rack1|rack2|rack3|rack4]"\fR might |
| be used to specify that all nodes must be allocated on a single rack of |
| the cluster, but any of those four racks can be used. |
| A request can also specify the number of nodes needed with some feature |
| by appending an asterisk and count after the feature name. |
| For example "\fBsrun \-\-nodes=16 \-\-constraint=graphics*4 ..."\fR |
indicates that the job requires 16 nodes and that at least four of those
| nodes must have the feature "graphics." |
| Constraints with node counts may only be combined with AND operators. |
| If no nodes have the requested features, then the job will be rejected |
| by the slurm job manager. |
| |
| .TP |
| \fB\-\-contiguous\fR |
| If set, then the allocated nodes must form a contiguous set. |
| Not honored with the \fBtopology/tree\fR or \fBtopology/3d_torus\fR |
| plugins, both of which can modify the node ordering. |
| Not honored for a job step's allocation. |
| |
| .TP |
| \fB\-\-core\fR=<\fItype\fR> |
| Adjust corefile format for parallel job. If possible, srun will set |
| up the environment for the job such that a corefile format other than |
| full core dumps is enabled. If run with type = "list", srun will |
| print a list of supported corefile format types to stdout and exit. |
| |
| .TP |
| \fB\-\-cpu_bind\fR=[{\fIquiet,verbose\fR},]\fItype\fR |
| Bind tasks to CPUs. Used only when the task/affinity plugin is enabled. |
| The configuration parameter \fBTaskPluginParam\fR may override these options. |
| For example, if \fBTaskPluginParam\fR is configured to bind to cores, |
| your job will not be able to bind tasks to sockets. |
| NOTE: To have SLURM always report on the selected CPU binding for all |
| commands executed in a shell, you can enable verbose mode by setting |
| the SLURM_CPU_BIND environment variable value to "verbose". |
| |
| The following informational environment variables are set when \fB\-\-cpu_bind\fR |
| is in use: |
| .nf |
| SLURM_CPU_BIND_VERBOSE |
| SLURM_CPU_BIND_TYPE |
| SLURM_CPU_BIND_LIST |
| .fi |
| |
| See the \fBENVIRONMENT VARIABLE\fR section for a more detailed description |
| of the individual SLURM_CPU_BIND* variables. |
| |
| When using \fB\-\-cpus\-per\-task\fR to run multithreaded tasks, be aware that |
| CPU binding is inherited from the parent of the process. This means that |
| the multithreaded task should either specify or clear the CPU binding |
| itself to avoid having all threads of the multithreaded task use the same |
| mask/CPU as the parent. Alternatively, fat masks (masks which specify more |
| than one allowed CPU) could be used for the tasks in order to provide |
| multiple CPUs for the multithreaded tasks. |
| |
| By default, a job step has access to every CPU allocated to the job. |
To ensure that distinct CPUs are allocated to each job step, use the
| \fB\-\-exclusive\fR option. |
| |
| If the job step allocation includes an allocation with a number of |
| sockets, cores, or threads equal to the number of tasks to be started |
| then the tasks will by default be bound to the appropriate resources. |
Disable this mode of operation by explicitly setting "\-\-cpu_bind=none".
| |
| Note that a job step can be allocated different numbers of CPUs on each node |
| or be allocated CPUs not starting at location zero. Therefore one of the |
| options which automatically generate the task binding is recommended. |
| Explicitly specified masks or bindings are only honored when the job step |
| has been allocated every available CPU on the node. |
| |
| Binding a task to a NUMA locality domain means to bind the task to the set of |
| CPUs that belong to the NUMA locality domain or "NUMA node". |
| If NUMA locality domain options are used on systems with no NUMA support, then |
| each socket is considered a locality domain. |
| |
| Supported options include: |
| .PD 1 |
| .RS |
| .TP |
| .B q[uiet] |
| Quietly bind before task runs (default) |
| .TP |
| .B v[erbose] |
| Verbosely report binding before task runs |
| .TP |
| .B no[ne] |
| Do not bind tasks to CPUs (default) |
| .TP |
| .B rank |
| Automatically bind by task rank. |
| Task zero is bound to socket (or core or thread) zero, etc. |
| Not supported unless the entire node is allocated to the job. |
| .TP |
| .B map_cpu:<list> |
| Bind by mapping CPU IDs to tasks as specified |
| where <list> is <cpuid1>,<cpuid2>,...<cpuidN>. |
| CPU IDs are interpreted as decimal values unless they are preceded |
| with '0x' in which case they are interpreted as hexadecimal values. |
| Not supported unless the entire node is allocated to the job. |
| .TP |
| .B mask_cpu:<list> |
| Bind by setting CPU masks on tasks as specified |
| where <list> is <mask1>,<mask2>,...<maskN>. |
| CPU masks are \fBalways\fR interpreted as hexadecimal values but can be |
| preceded with an optional '0x'. |
| Not supported unless the entire node is allocated to the job. |
| .TP |
| .B rank_ldom |
| Bind to a NUMA locality domain by rank |
| .TP |
| .B map_ldom:<list> |
| Bind by mapping NUMA locality domain IDs to tasks as specified where |
| <list> is <ldom1>,<ldom2>,...<ldomN>. |
| The locality domain IDs are interpreted as decimal values unless they are |
preceded with '0x' in which case they are interpreted as hexadecimal values.
| Not supported unless the entire node is allocated to the job. |
| .TP |
| .B mask_ldom:<list> |
| Bind by setting NUMA locality domain masks on tasks as specified |
| where <list> is <mask1>,<mask2>,...<maskN>. |
| NUMA locality domain masks are \fBalways\fR interpreted as hexadecimal |
| values but can be preceded with an optional '0x'. |
| Not supported unless the entire node is allocated to the job. |
| .TP |
| .B sockets |
| Automatically generate masks binding tasks to sockets. |
| If the number of tasks differs from the number of allocated sockets |
| this can result in sub\-optimal binding. |
| .TP |
| .B cores |
| Automatically generate masks binding tasks to cores. |
| If the number of tasks differs from the number of allocated cores |
| this can result in sub\-optimal binding. |
| .TP |
| .B threads |
| Automatically generate masks binding tasks to threads. |
| If the number of tasks differs from the number of allocated threads |
| this can result in sub\-optimal binding. |
| .TP |
| .B ldoms |
| Automatically generate masks binding tasks to NUMA locality domains. |
| If the number of tasks differs from the number of allocated locality domains |
| this can result in sub\-optimal binding. |
| .TP |
| .B help |
| Show this help message |
| .RE |
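For example, the following illustrative commands bind eight tasks to cores
with verbose reporting, and bind four tasks to explicit CPU masks:
.nf
srun \-n 8 \-\-cpu_bind=verbose,cores a.out
srun \-n 4 \-\-cpu_bind=mask_cpu:0x1,0x2,0x4,0x8 a.out
.fi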
| |
| .TP |
| \fB\-c\fR, \fB\-\-cpus\-per\-task\fR=<\fIncpus\fR> |
| Request that \fIncpus\fR be allocated \fBper process\fR. This may be |
| useful if the job is multithreaded and requires more than one CPU |
| per task for optimal performance. The default is one CPU per process. |
| If \fB\-c\fR is specified without \fB\-n\fR, as many |
| tasks will be allocated per node as possible while satisfying |
| the \fB\-c\fR restriction. For instance on a cluster with 8 CPUs |
| per node, a job request for 4 nodes and 3 CPUs per task may be |
| allocated 3 or 6 CPUs per node (1 or 2 tasks per node) depending |
| upon resource consumption by other jobs. Such a job may be |
| unable to execute more than a total of 4 tasks. |
| This option may also be useful to spawn tasks without allocating |
| resources to the job step from the job's allocation when running |
| multiple job steps with the \fB\-\-exclusive\fR option. |
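For example, the request described above might be expressed as
(illustrative counts):
.nf
srun \-N 4 \-c 3 a.out
.fi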
| |
| .TP |
| \fB\-D\fR, \fB\-\-chdir\fR=<\fIpath\fR> |
| have the remote processes do a chdir to \fIpath\fR before beginning |
| execution. The default is to chdir to the current working directory |
| of the \fBsrun\fR process. |
| |
| .TP |
| \fB\-d\fR, \fB\-\-slurmd\-debug\fR=<\fIlevel\fR> |
| Specify a debug level for slurmd(8). \fIlevel\fR may be an integer value |
| between 0 [quiet, only errors are displayed] and 4 [verbose operation]. |
| The slurmd debug information is copied onto the stderr of |
| the job. By default only errors are displayed. |
| |
| .TP |
| \fB\-e\fR, \fB\-\-error\fR=<\fImode\fR> |
| Specify how stderr is to be redirected. By default in interactive mode, |
| .B srun |
| redirects stderr to the same file as stdout, if one is specified. The |
| \fB\-\-error\fR option is provided to allow stdout and stderr to be |
| redirected to different locations. |
| See \fBIO Redirection\fR below for more options. |
| If the specified file already exists, it will be overwritten. |
| |
| .TP |
| \fB\-E\fR, \fB\-\-preserve-env\fR |
| Pass the current values of environment variables SLURM_NNODES and |
| SLURM_NPROCS through to the \fIexecutable\fR, rather than computing them |
| from commandline parameters. |
| |
| .TP |
| \fB\-\-epilog\fR=<\fIexecutable\fR> |
| \fBsrun\fR will run \fIexecutable\fR just after the job step completes. |
| The command line arguments for \fIexecutable\fR will be the command |
| and arguments of the job step. If \fIexecutable\fR is "none", then |
| no epilog will be run. This parameter overrides the SrunEpilog |
| parameter in slurm.conf. |
| |
| .TP |
| \fB\-\-exclusive\fR |
| When used to initiate a job, the job allocation cannot share nodes with |
other running jobs. This is the opposite of \-\-share; whichever option
is seen last on the command line wins. (The default shared/exclusive
| behaviour depends on system configuration.) |
| |
This option can also be used when initiating more than one job step within
| an existing resource allocation and you want separate processors to |
| be dedicated to each job step. If sufficient processors are not |
| available to initiate the job step, it will be deferred. This can |
| be thought of as providing resource management for the job within |
its allocation. Note that all CPUs allocated to a job are available
| to each job step unless the \fB\-\-exclusive\fR option is used plus |
| task affinity is configured. Since resource management is provided by |
| processor, the \fB\-\-ntasks\fR option must be specified, but the |
following options should NOT be specified: \fB\-\-nodes\fR,
\fB\-\-relative\fR, or \fB\-\-distribution\fR=\fIarbitrary\fR.
| See \fBEXAMPLE\fR below. |
| |
| .TP |
| \fB\-\-gid\fR=<\fIgroup\fR> |
| If \fBsrun\fR is run as root, and the \fB\-\-gid\fR option is used, |
| submit the job with \fIgroup\fR's group access permissions. \fIgroup\fR |
| may be the group name or the numerical group ID. |
| |
| .\".TP |
| .\"NOTE: Do not document feature until user release mechanism is available. |
| .\"\-H, \-\-hold |
| .\"Specify the job is to be submitted in a held state (priority of zero). |
| .\"A held job can now be released using scontrol to reset its priority. |
| |
| .TP |
| \fB\-\-help\fR |
| Display help information and exit. |
| |
| .TP |
| \fB\-\-hint\fR=<\fItype\fR> |
| Bind tasks according to application hints |
| .RS |
| .TP |
| .B compute_bound |
| Select settings for compute bound applications: |
| use all cores in each physical CPU |
| .TP |
| .B memory_bound |
| Select settings for memory bound applications: |
| use only one core in each physical CPU |
| .TP |
| .B [no]multithread |
| [don't] use extra threads with in-core multi-threading |
| which can benefit communication intensive applications |
| .TP |
| .B help |
| show this help message |
| .RE |
| |
| .TP |
| \fB\-I\fR, \fB\-\-immediate\fR[=<\fIseconds\fR>] |
| exit if resources are not available within the |
| time period specified. |
| If no argument is given, resources must be available immediately |
| for the request to succeed. |
| By default, \fB\-\-immediate\fR is off, and the command |
| will block until resources become available. |
| |
| .TP |
| \fB\-i\fR, \fB\-\-input\fR=<\fImode\fR> |
Specify how stdin is to be redirected. By default,
.B srun
redirects stdin from the terminal to all tasks. See \fBIO Redirection\fR
| below for more options. |
| For OS X, the poll() function does not support stdin, so input from |
| a terminal is not possible. |
| |
| .TP |
| \fB\-J\fR, \fB\-\-job\-name\fR=<\fIjobname\fR> |
| Specify a name for the job. The specified name will appear along with |
| the job id number when querying running jobs on the system. The default |
| is the supplied \fBexecutable\fR program's name. |
| |
| .TP |
| \fB\-\-jobid\fR=<\fIjobid\fR> |
Initiate a job step under an already allocated job with job id \fIjobid\fR.
| Using this option will cause \fBsrun\fR to behave exactly as if the |
| SLURM_JOB_ID environment variable was set. |
| |
| .TP |
| \fB\-K\fR, \fB\-\-kill\-on\-bad\-exit\fR |
| Terminate a job if any task exits with a non\-zero exit code. |
| |
| .TP |
| \fB\-k\fR, \fB\-\-no\-kill\fR |
Do not automatically terminate a job if one of the nodes it has been
| allocated fails. This option is only recognized on a job allocation, |
| not for the submission of individual job steps. |
| The job will assume all responsibilities for fault\-tolerance. The |
| active job step (MPI job) will almost certainly suffer a fatal error, |
| but subsequent job steps may be run if this option is specified. The |
default action is to terminate the job upon node failure.
| |
| .TP |
| \fB\-l\fR, \fB\-\-label\fR |
| prepend task number to lines of stdout/err. Normally, stdout and stderr |
| from remote tasks is line\-buffered directly to the stdout and stderr of |
| \fBsrun\fR. |
| The \fB\-\-label\fR option will prepend lines of output with the remote |
| task id. |
| |
| .TP |
| \fB\-L\fR, \fB\-\-licenses\fR=<\fBlicense\fR> |
| Specification of licenses (or other resources available on all |
| nodes of the cluster) which must be allocated to this job. |
| License names can be followed by an asterisk and count |
| (the default count is one). |
| Multiple license names should be comma separated (e.g. |
| "\-\-licenses=foo*4,bar"). |
| |
| .TP |
| \fB\-m\fR, \fB\-\-distribution\fR= |
| <\fIblock\fR|\fIcyclic\fR|\fIarbitrary\fR|\fIplane=<options>\fR> |
| Specify an alternate distribution method for remote processes. |
| .RS |
| .TP |
| .B block |
| The block method of distribution will allocate processes in\-order to |
| the cpus on a node. If the number of processes exceeds the number of |
| cpus on all of the nodes in the allocation then all nodes will be |
| utilized. For example, consider an allocation of three nodes each with |
| two cpus. A four\-process block distribution request will distribute |
| those processes to the nodes with processes one and two on the first |
| node, process three on the second node, and process four on the third node. |
| Block distribution is the default behavior if the number of tasks |
| exceeds the number of nodes requested. |
| .TP |
| .B cyclic |
| The cyclic method distributes processes in a round\-robin fashion across |
| the allocated nodes. That is, process one will be allocated to the first |
| node, process two to the second, and so on. This is the default behavior |
| if the number of tasks is no larger than the number of nodes requested. |
| .TP |
| .B plane |
| The tasks are distributed in blocks of a specified size. |
| The options include a number representing the size of the task block. |
| This is followed by an optional specification of the task distribution |
| scheme within a block of tasks and between the blocks of tasks. |
| For more details (including examples and diagrams), please see |
| .br |
| https://computing.llnl.gov/linux/slurm/mc_support.html |
| .br |
| and |
| .br |
| https://computing.llnl.gov/linux/slurm/dist_plane.html. |
| .TP |
| .B arbitrary |
| The arbitrary method of distribution will allocate processes in\-order as |
listed in the file designated by the environment variable SLURM_HOSTFILE. If
this variable is set it will override any other method specified.
If not set the method will default to block. The hostfile must
contain at minimum the number of hosts requested. If requesting a task
count (\-n), your tasks will be laid out on the nodes in the order of the file.
| .RE |
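For example (illustrative task counts; plane=4 requests blocks of four tasks):
.nf
srun \-n 8 \-m cyclic a.out
srun \-n 16 \-m plane=4 a.out
.fi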
| |
| .TP |
| \fB\-\-mail\-type\fR=<\fItype\fR> |
| Notify user by email when certain event types occur. |
| Valid \fItype\fR values are BEGIN, END, FAIL, ALL (any state change). |
| The user to be notified is indicated with \fB\-\-mail\-user\fR. |
| |
| .TP |
| \fB\-\-mail\-user\fR=<\fIuser\fR> |
| User to receive email notification of state changes as defined by |
| \fB\-\-mail\-type\fR. |
| The default value is the submitting user. |
| |
| .TP |
| \fB\-\-mem\fR=<\fIMB\fR> |
| Specify the real memory required per node in MegaBytes. |
| Default value is \fBDefMemPerNode\fR and the maximum value is |
\fBMaxMemPerNode\fR. If configured, both of these parameters can be
seen using the \fBscontrol show config\fR command.
This parameter would generally be used if whole nodes
are allocated to jobs (\fBSelectType=select/linear\fR).
| Also see \fB\-\-mem\-per\-cpu\fR. |
| \fB\-\-mem\fR and \fB\-\-mem\-per\-cpu\fR are mutually exclusive. |
| |
| .TP |
| \fB\-\-mem\-per\-cpu\fR=<\fIMB\fR> |
Minimum memory required per allocated CPU in MegaBytes.
Default value is \fBDefMemPerCPU\fR and the maximum value is
\fBMaxMemPerCPU\fR. If configured, both of these parameters can be
seen using the \fBscontrol show config\fR command.
This parameter would generally be used if individual processors
are allocated to jobs (\fBSelectType=select/cons_res\fR).
| Also see \fB\-\-mem\fR. |
| \fB\-\-mem\fR and \fB\-\-mem\-per\-cpu\fR are mutually exclusive. |
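For example (illustrative sizes, in MegaBytes):
.nf
srun \-N 2 \-\-mem=2048 a.out
srun \-n 8 \-\-mem\-per\-cpu=512 a.out
.fi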
| |
| .TP |
| \fB\-\-mem_bind\fR=[{\fIquiet,verbose\fR},]\fItype\fR |
| Bind tasks to memory. Used only when the task/affinity plugin is enabled |
| and the NUMA memory functions are available. |
| \fBNote that the resolution of CPU and memory binding |
| may differ on some architectures.\fR For example, CPU binding may be performed |
| at the level of the cores within a processor while memory binding will |
| be performed at the level of nodes, where the definition of "nodes" |
| may differ from system to system. \fBThe use of any type other than |
| "none" or "local" is not recommended.\fR |
| If you want greater control, try running a simple test code with the |
| options "\-\-cpu_bind=verbose,none \-\-mem_bind=verbose,none" to determine |
| the specific configuration. |
| |
| NOTE: To have SLURM always report on the selected memory binding for |
| all commands executed in a shell, you can enable verbose mode by |
| setting the SLURM_MEM_BIND environment variable value to "verbose". |
| |
The following informational environment variables are set when \fB\-\-mem_bind\fR
| is in use: |
| |
| .nf |
| SLURM_MEM_BIND_VERBOSE |
| SLURM_MEM_BIND_TYPE |
| SLURM_MEM_BIND_LIST |
| .fi |
| |
| See the \fBENVIRONMENT VARIABLES\fR section for a more detailed description |
| of the individual SLURM_MEM_BIND* variables. |
| |
| Supported options include: |
| .RS |
| .TP |
| .B q[uiet] |
| quietly bind before task runs (default) |
| .TP |
| .B v[erbose] |
| verbosely report binding before task runs |
| .TP |
| .B no[ne] |
| don't bind tasks to memory (default) |
| .TP |
| .B rank |
| bind by task rank (not recommended) |
| .TP |
| .B local |
| Use memory local to the processor in use |
| .TP |
| .B map_mem:<list> |
| bind by mapping a node's memory to tasks as specified |
| where <list> is <cpuid1>,<cpuid2>,...<cpuidN>. |
| CPU IDs are interpreted as decimal values unless they are preceded |
with '0x' in which case they are interpreted as hexadecimal values
| (not recommended) |
| .TP |
| .B mask_mem:<list> |
| bind by setting memory masks on tasks as specified |
| where <list> is <mask1>,<mask2>,...<maskN>. |
| memory masks are \fBalways\fR interpreted as hexadecimal values. |
| Note that masks must be preceded with a '0x' if they don't begin |
| with [0-9] so they are seen as numerical values by srun. |
| .TP |
| .B help |
| show this help message |
| .RE |
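For example, the following illustrative command binds each task's memory
to the locality domain of the processor in use and reports the binding:
.nf
srun \-n 4 \-\-mem_bind=verbose,local a.out
.fi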
| |
| .TP |
| \fB\-\-mincores\fR=<\fIn\fR> |
| Specify a minimum number of cores per socket. |
| |
| .TP |
| \fB\-\-mincpus\fR=<\fIn\fR> |
| Specify a minimum number of logical cpus/processors per node. |
| |
| .TP |
| \fB\-\-minsockets\fR=<\fIn\fR> |
| Specify a minimum number of sockets (physical processors) per node. |
| |
| .TP |
| \fB\-\-minthreads\fR=<\fIn\fR> |
| Specify a minimum number of threads per core. |
| |
| .TP |
| \fB\-\-msg\-timeout\fR=<\fIseconds\fR> |
| Modify the job launch message timeout. |
| The default value is \fBMessageTimeout\fR in the SLURM configuration file slurm.conf. |
| Changes to this are typically not recommended, but could be useful to diagnose problems. |
| |
| .TP |
| \fB\-\-mpi\fR=<\fImpi_type\fR> |
| Identify the type of MPI to be used. May result in unique initiation |
| procedures. |
| .RS |
| .TP |
| .B list |
| Lists available mpi types to choose from. |
| .TP |
| .B lam |
| Initiates one 'lamd' process per node and establishes necessary |
| environment variables for LAM/MPI. |
| .TP |
| .B mpich1_shmem |
| Initiates one process per node and establishes necessary |
| environment variables for mpich1 shared memory model. |
| This also works for mvapich built for shared memory. |
| .TP |
| .B mpichgm |
| For use with Myrinet. |
| .TP |
| .B mvapich |
| For use with Infiniband. |
| .TP |
| .B openmpi |
| For use with OpenMPI. |
| .TP |
| .B none |
| No special MPI processing. This is the default and works with |
| many other versions of MPI. |
| .RE |
| |
| .TP |
| \fB\-\-multi\-prog\fR |
| Run a job with different programs and different arguments for |
| each task. In this case, the executable program specified is |
| actually a configuration file specifying the executable and |
| arguments for each task. See \fBMULTIPLE PROGRAM CONFIGURATION\fR |
| below for details on the configuration file contents. |
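For example, assuming a configuration file named \fImulti.conf\fR
(illustrative name), four tasks could be launched with:
.nf
srun \-n 4 \-\-multi\-prog multi.conf
.fi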
| |
| .TP |
| \fB\-N\fR, \fB\-\-nodes\fR=<\fIminnodes\fR[\-\fImaxnodes\fR]> |
| Request that a minimum of \fIminnodes\fR nodes be allocated to this job. |
| The scheduler may decide to launch the job on more than \fIminnodes\fR nodes. |
| A limit on the maximum node count may be specified with \fImaxnodes\fR |
| (e.g. "\-\-nodes=2\-4"). The minimum and maximum node count may be the |
| same to specify a specific number of nodes (e.g. "\-\-nodes=2\-2" will ask |
| for two and ONLY two nodes). |
| The partition's node limits supersede those of the job. |
| If a job's node limits are outside of the range permitted for its |
| associated partition, the job will be left in a PENDING state. |
| This permits possible execution at a later time, when the partition |
| limit is changed. |
| If a job node limit exceeds the number of nodes configured in the |
| partition, the job will be rejected. |
| Note that the environment |
| variable \fBSLURM_NNODES\fR will be set to the count of nodes actually |
| allocated to the job. See the \fBENVIRONMENT VARIABLES \fR section |
| for more information. If \fB\-N\fR is not specified, the default |
| behavior is to allocate enough nodes to satisfy the requirements of |
| the \fB\-n\fR and \fB\-c\fR options. |
| The job will be allocated as many nodes as possible within the range specified |
| and without delaying the initiation of the job. |
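For example, a node range can be combined with a task count
(illustrative counts):
.nf
srun \-N 2\-4 \-n 16 a.out
.fi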
| |
| .TP |
| \fB\-n\fR, \fB\-\-ntasks\fR=<\fInumber\fR> |
| Specify the number of tasks to run. Request that \fBsrun\fR |
| allocate resources for \fIntasks\fR tasks. |
| The default is one task per socket or core (depending upon the value |
| of the \fISelectTypeParameters\fR parameter in slurm.conf), but note |
| that the \fB\-\-cpus\-per\-task\fR option will change this default. |
| |
| .TP |
| \fB\-\-network\fR=<\fItype\fR> |
| Specify the communication protocol to be used. |
| This option is supported on AIX systems. |
| Since POE is used to launch tasks, this option is not normally used or |
| is specified using the \fBSLURM_NETWORK\fR environment variable. |
| The interpretation of \fItype\fR is system dependent. |
| For systems with an IBM Federation switch, the following |
| comma\-separated and case insensitive types are recognized: |
| \fBIP\fR (the default is user\-space), \fBSN_ALL\fR, \fBSN_SINGLE\fR, |
| \fBBULK_XFER\fR and adapter names (e.g. \fBSNI0\fR and \fBSNI1\fR). |
| For more information, on IBM systems see \fIpoe\fR documentation on |
| the environment variables \fBMP_EUIDEVICE\fR and \fBMP_USE_BULK_XFER\fR. |
Note that only four job steps may be active at once on a node with the
| \fBBULK_XFER\fR option due to limitations in the Federation switch driver. |
| |
| .TP |
| \fB\-\-nice\fR[=\fIadjustment\fR] |
| Run the job with an adjusted scheduling priority within SLURM. |
| With no adjustment value the scheduling priority is decreased |
| by 100. The adjustment range is from \-10000 (highest priority) |
| to 10000 (lowest priority). Only privileged users can specify |
| a negative adjustment. NOTE: This option is presently |
| ignored if \fISchedulerType=sched/wiki\fR or |
| \fISchedulerType=sched/wiki2\fR. |
| |
| .TP |
| \fB\-\-ntasks\-per\-core\fR=<\fIntasks\fR> |
| Request that no more than \fIntasks\fR be invoked on each core. |
| Similar to \fB\-\-ntasks\-per\-node\fR except at the core level |
| instead of the node level. Masks will automatically be generated |
to bind the tasks to specific cores unless \fB\-\-cpu_bind=none\fR
| is specified. |
| NOTE: This option is not supported unless |
| \fISelectTypeParameters=CR_Core\fR or |
| \fISelectTypeParameters=CR_Core_Memory\fR is configured. |
| |
| .TP |
| \fB\-\-ntasks\-per\-socket\fR=<\fIntasks\fR> |
| Request that no more than \fIntasks\fR be invoked on each socket. |
| Similar to \fB\-\-ntasks\-per\-node\fR except at the socket level |
| instead of the node level. Masks will automatically be generated |
| to bind the tasks to specific sockets unless \fB\-\-cpu_bind=none\fR |
| is specified. |
| NOTE: This option is not supported unless |
| \fISelectTypeParameters=CR_Socket\fR or |
| \fISelectTypeParameters=CR_Socket_Memory\fR is configured. |
| |
| .TP |
| \fB\-\-ntasks\-per\-node\fR=<\fIntasks\fR> |
| Request that no more than \fIntasks\fR be invoked on each node. |
| This is similar to using \fB\-\-cpus\-per\-task\fR=\fIncpus\fR |
| but does not require knowledge of the actual number of cpus on |
| each node. In some cases, it is more convenient to be able to |
| request that no more than a specific number of ntasks be invoked |
| on each node. Examples of this include submitting |
| a hybrid MPI/OpenMP app where only one MPI "task/rank" should be |
| assigned to each node while allowing the OpenMP portion to utilize |
| all of the parallelism present in the node, or submitting a single |
| setup/cleanup/monitoring job to each node of a pre\-existing |
| allocation as one step in a larger job script. |
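For example, the hybrid MPI/OpenMP case described above might be launched
with one task per node (illustrative counts):
.nf
srun \-N 4 \-\-ntasks\-per\-node=1 \-\-cpus\-per\-task=8 a.out
.fi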
| |
| .TP |
| \fB\-O\fR, \fB\-\-overcommit\fR |
| Overcommit resources. Normally, \fBsrun\fR will allocate one task |
| per processor. By specifying \fB\-\-overcommit\fR you are explicitly |
| allowing more than one task per processor. However no more than |
| \fBMAX_TASKS_PER_NODE\fR tasks are permitted to execute per node. |
| |
| .TP |
| \fB\-o\fR, \fB\-\-output\fR=<\fImode\fR> |
| Specify the mode for stdout redirection. By default in interactive mode, |
| .B srun |
| collects stdout from all tasks and line buffers this output to |
| the attached terminal. With \fB\-\-output\fR stdout may be redirected |
| to a file, to one file per task, or to /dev/null. See section |
| \fBIO Redirection\fR below for the various forms of \fImode\fR. |
| If the specified file already exists, it will be overwritten. |
| .br |
| |
| If \fB\-\-error\fR is not also specified on the command line, both |
stdout and stderr will be directed to the file specified by \fB\-\-output\fR.
| |
| .TP |
| \fB\-\-open\-mode\fR=<\fIappend|truncate\fR> |
| Open the output and error files using append or truncate mode as specified. |
| The default value is specified by the system configuration parameter |
| \fIJobFileAppend\fR. |
| |
| .TP |
| \fB\-P\fR, \fB\-\-dependency\fR=<\fIdependency_list\fR> |
| Defer the start of this job until the specified dependencies have been |
satisfied.
| <\fIdependency_list\fR> is of the form |
| <\fItype:job_id[:job_id][,type:job_id[:job_id]]\fR>. |
| Many jobs can share the same dependency and these jobs may even belong to |
| different users. The value may be changed after job submission using the |
| scontrol command. |
| .PD |
| .RS |
| .TP |
| \fBafter:job_id[:jobid...]\fR |
| This job can begin execution after the specified jobs have begun |
| execution. |
| .TP |
| \fBafterany:job_id[:jobid...]\fR |
| This job can begin execution after the specified jobs have terminated. |
| .TP |
| \fBafternotok:job_id[:jobid...]\fR |
| This job can begin execution after the specified jobs have terminated |
| in some failed state (non-zero exit code, node failure, timed out, etc). |
| .TP |
| \fBafterok:job_id[:jobid...]\fR |
| This job can begin execution after the specified jobs have successfully |
executed (ran to completion with an exit code of zero).
| .TP |
| \fBsingleton\fR |
| This job can begin execution after any previously launched jobs sharing the same |
| job name and user have terminated. |
| .RE |
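For example (illustrative job id):
.nf
srun \-\-dependency=afterok:12345 a.out
.fi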
| |
| .TP |
| \fB\-p\fR, \fB\-\-partition\fR=<\fIpartition name\fR> |
| Request a specific partition for the resource allocation. If not specified, |
| the default behaviour is to allow the slurm controller to select the default |
| partition as designated by the system administrator. |
| |
| .TP |
| \fB\-\-prolog\fR=<\fIexecutable\fR> |
| \fBsrun\fR will run \fIexecutable\fR just before launching the job step. |
| The command line arguments for \fIexecutable\fR will be the command |
| and arguments of the job step. If \fIexecutable\fR is "none", then |
| no prolog will be run. This parameter overrides the SrunProlog |
| parameter in slurm.conf. |
| |
| .TP |
| \fB\-\-propagate\fR[=\fIrlimits\fR] |
| Allows users to specify which of the modifiable (soft) resource limits |
| to propagate to the compute nodes and apply to their jobs. If |
| \fIrlimits\fR is not specified, then all resource limits will be |
| propagated. |
| The following rlimit names are supported by Slurm (although some |
| options may not be supported on some systems): |
| .RS |
| .TP 10 |
| \fBALL\fR |
| All limits listed below |
| .TP |
| \fBAS\fR |
The maximum address space for a process
| .TP |
| \fBCORE\fR |
The maximum size of a core file
| .TP |
| \fBCPU\fR |
| The maximum amount of CPU time |
| .TP |
| \fBDATA\fR |
| The maximum size of a process's data segment |
| .TP |
| \fBFSIZE\fR |
| The maximum size of files created |
| .TP |
| \fBMEMLOCK\fR |
| The maximum size that may be locked into memory |
| .TP |
| \fBNOFILE\fR |
| The maximum number of open files |
| .TP |
| \fBNPROC\fR |
| The maximum number of processes available |
| .TP |
| \fBRSS\fR |
| The maximum resident set size |
| .TP |
| \fBSTACK\fR |
| The maximum stack size |
| .RE |
| |
| .TP |
| \fB\-\-pty\fR |
Execute task zero in a pseudo terminal.
| Implicitly sets \fB\-\-unbuffered\fR. |
| Implicitly sets \fB\-\-error\fR and \fB\-\-output\fR to /dev/null |
| for all tasks except task zero. |
| Not currently supported on AIX platforms. |
| |
| .TP |
| \fB\-Q\fR, \fB\-\-quiet\fR |
| Suppress informational messages from srun. Errors will still be displayed. |
| |
| .TP |
| \fB\-q\fR, \fB\-\-quit\-on\-interrupt\fR |
| Quit immediately on single SIGINT (Ctrl\-C). Use of this option |
| disables the status feature normally available when \fBsrun\fR receives |
| a single Ctrl\-C and causes \fBsrun\fR to instead immediately terminate the |
| running job. |
| |
| .TP |
| \fB\-r\fR, \fB\-\-relative\fR=<\fIn\fR> |
| Run a job step relative to node \fIn\fR of the current allocation. |
| This option may be used to spread several job steps out among the |
| nodes of the current job. If \fB\-r\fR is used, the current job |
| step will begin at node \fIn\fR of the allocated nodelist, where |
| the first node is considered node 0. The \fB\-r\fR option is not |
| permitted along with \fB\-w\fR or \fB\-x\fR, and will be silently |
| ignored when not running within a prior allocation (i.e. when |
| SLURM_JOB_ID is not set). The default for \fIn\fR is 0. If the |
| value of \fB\-\-nodes\fR exceeds the number of nodes identified |
| with the \fB\-\-relative\fR option, a warning message will be |
| printed and the \fB\-\-relative\fR option will take precedence. |
| |
| .TP |
| \fB\-\-resv-ports\fR |
| Reserve communication ports for this job. |
| Used for OpenMPI. |
| |
| .TP |
| \fB\-\-reservation\fR=<\fIname\fR> |
| Allocate resources for the job from the named reservation. |
| |
| .TP |
| \fB\-\-restart\-dir\fR=<\fIdirectory\fR> |
| Specifies the directory from which the job or job step's checkpoint should |
be read (used by the checkpoint/blcr and checkpoint/xlch plugins only).
| |
| .TP |
| \fB\-s\fR, \fB\-\-share\fR |
| The job can share nodes with other running jobs. This may result in faster job |
| initiation and higher system utilization, but lower application performance. |
| |
| .TP |
| \fB\-T\fR, \fB\-\-threads\fR=<\fInthreads\fR> |
| Request that \fBsrun\fR |
| use \fInthreads\fR to initiate and control the parallel job. The |
| default value is the smaller of 60 or the number of nodes allocated. |
| This should only be used to set a low thread count for testing on |
| very small memory computers. |
| |
| .TP |
| \fB\-t\fR, \fB\-\-time\fR=<\fItime\fR> |
| Set a limit on the total run time of the job step. If the |
| requested time limit exceeds the partition's time limit, the job will |
| be left in a PENDING state (possibly indefinitely). The default time |
| limit is the partition's time limit. When the time limit is reached, |
| all of the job's tasks are sent SIGTERM followed by SIGKILL. The |
| interval between signals is specified by the SLURM configuration |
| parameter \fBKillWait\fR. A time limit of zero requests that no time |
| limit be imposed. Acceptable time formats include "minutes", |
| "minutes:seconds", "hours:minutes:seconds", "days\-hours", |
| "days\-hours:minutes" and "days\-hours:minutes:seconds". |
| |
| .TP |
| \fB\-\-task\-epilog\fR=<\fIexecutable\fR> |
| The \fBslurmstepd\fR daemon will run \fIexecutable\fR just after each task |
| terminates. This will be executed before any TaskEpilog parameter in |
| slurm.conf is executed. This is meant to be a very short\-lived |
| program. If it fails to terminate within a few seconds, it will be |
| killed along with any descendant processes. |
| |
| .TP |
| \fB\-\-task\-prolog\fR=<\fIexecutable\fR> |
| The \fBslurmstepd\fR daemon will run \fIexecutable\fR just before launching |
| each task. This will be executed after any TaskProlog parameter |
| in slurm.conf is executed. |
| Besides the normal environment variables, this has SLURM_TASK_PID |
| available to identify the process ID of the task being started. |
| Standard output from this program of the form |
| "export NAME=value" will be used to set environment variables |
| for the task being spawned. |
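For example, an illustrative task prolog script could export a variable
to each task based on SLURM_TASK_PID:
.nf
#!/bin/sh
# Any "export NAME=value" line on stdout becomes an environment
# variable for the task being spawned.
echo "export MY_TASK_PID=$SLURM_TASK_PID"
.fi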
| |
| .TP |
| \fB\-\-tmp\fR=<\fIMB\fR> |
| Specify a minimum amount of temporary disk space. |
| |
| .TP |
| \fB\-U\fR, \fB\-\-account\fR=<\fIaccount\fR> |
Change resource use by this job to the specified account.
| The \fIaccount\fR is an arbitrary string. The account name may |
| be changed after job submission using the \fBscontrol\fR |
| command. |
| |
| .TP |
| \fB\-u\fR, \fB\-\-unbuffered\fR |
| Do not line buffer stdout from remote tasks. This option cannot be used |
| with \fI\-\-label\fR. |
| |
| .TP |
| \fB\-\-usage\fR |
| Display brief help message and exit. |
| |
| .TP |
| \fB\-\-uid\fR=<\fIuser\fR> |
| Attempt to submit and/or run a job as \fIuser\fR instead of the |
| invoking user id. The invoking user's credentials will be used |
| to check access permissions for the target partition. User root |
| may use this option to run jobs as a normal user in a RootOnly |
| partition for example. If run as root, \fBsrun\fR will drop |
| its permissions to the uid specified after node allocation is |
| successful. \fIuser\fR may be the user name or numerical user ID. |
| |
| .TP |
| \fB\-V\fR, \fB\-\-version\fR |
| Display version information and exit. |
| |
| .TP |
| \fB\-v\fR, \fB\-\-verbose\fR |
| Increase the verbosity of srun's informational messages. Multiple |
| \fB\-v\fR's will further increase srun's verbosity. By default only |
| errors will be displayed. |
| |
| .TP |
| \fB\-W\fR, \fB\-\-wait\fR=<\fIseconds\fR> |
| Specify how long to wait after the first task terminates before terminating |
| all remaining tasks. A value of 0 indicates an unlimited wait (a warning will |
| be issued after 60 seconds). The default value is set by the WaitTime |
| parameter in the slurm configuration file (see \fBslurm.conf(5)\fR). This |
option can be useful to ensure that a job is terminated in a timely fashion
| in the event that one or more tasks terminate prematurely. |
| |
| .TP |
| \fB\-w\fR, \fB\-\-nodelist\fR=<\fIhost1,host2,...\fR or \fIfilename\fR> |
| Request a specific list of hosts. The job will contain \fIat least\fR |
| these hosts. The list may be specified as a comma\-separated list of |
| hosts, a range of hosts (host[1\-5,7,...] for example), or a filename. |
| The host list will be assumed to be a filename if it contains a "/" |
character. If you specify a maximum node count (e.g. \-N1\-2) and there are
more than two hosts in the file, only the first two nodes will be used in the
request list.
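For example (illustrative host names and path):
.nf
srun \-N 3 \-\-nodelist=tux[1\-3] a.out
srun \-\-nodelist=/path/to/hostfile a.out
.fi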
| |
| .TP |
| \fB\-\-wckey\fR=<\fIwckey\fR> |
| Specify wckey to be used with job. If TrackWCKey=no (default) in the |
| slurm.conf this value is ignored. |
| |
| .TP |
| \fB\-X\fR, \fB\-\-disable\-status\fR |
| Disable the display of task status when srun receives a single SIGINT |
| (Ctrl\-C). Instead immediately forward the SIGINT to the running job. |
| Without this option a second Ctrl\-C in one second is required to forcibly |
| terminate the job and \fBsrun\fR will immediately exit. May also be |
| set via the environment variable SLURM_DISABLE_STATUS. |
| |
| .TP |
| \fB\-x\fR, \fB\-\-exclude\fR=<\fIhost1,host2,...\fR or \fIfilename\fR> |
| Request that a specific list of hosts not be included in the resources |
| allocated to this job. The host list will be assumed to be a filename |
| if it contains a "/"character. |
| |
| .PP |
| The following options support Blue Gene systems, but may be |
| applicable to other systems as well. |
| |
| .TP |
| \fB\-\-blrts\-image\fR=<\fIpath\fR> |
| Path to blrts image for bluegene block. BGL only. |
Default from \fIbluegene.conf\fR if not set.
| |
| .TP |
| \fB\-\-cnload\-image\fR=<\fIpath\fR> |
| Path to compute node image for bluegene block. BGP only. |
Default from \fIbluegene.conf\fR if not set.
| |
| .TP |
| \fB\-\-conn\-type\fR=<\fItype\fR> |
| Require the partition connection type to be of a certain type. |
On Blue Gene the acceptable values of \fItype\fR are MESH, TORUS and NAV.
| If NAV, or if not set, then SLURM will try to fit a TORUS else MESH. |
| You should not normally set this option. |
| SLURM will normally allocate a TORUS if possible for a given geometry. |
If running on a BGP system and wanting to run in HTC mode (only for 1
midplane and below), you can use HTC_S for SMP, HTC_D for Dual, HTC_V
for virtual node mode, and HTC_L for Linux mode.
| |
| .TP |
| \fB\-g\fR, \fB\-\-geometry\fR=<\fIXxYxZ\fR> |
| Specify the geometry requirements for the job. The three numbers |
| represent the required geometry giving dimensions in the X, Y and |
| Z directions. For example "\-\-geometry=2x3x4", specifies a block |
| of nodes having 2 x 3 x 4 = 24 nodes (actually base partitions on |
| Blue Gene). |
| |
| .TP |
| \fB\-\-ioload\-image\fR=<\fIpath\fR> |
| Path to io image for bluegene block. BGP only. |
Default from \fIbluegene.conf\fR if not set.
| |
| .TP |
| \fB\-\-linux\-image\fR=<\fIpath\fR> |
| Path to linux image for bluegene block. BGL only. |
Default from \fIbluegene.conf\fR if not set.
| |
| .TP |
| \fB\-\-mloader\-image\fR=<\fIpath\fR> |
| Path to mloader image for bluegene block. |
Default from \fIbluegene.conf\fR if not set.
| |
| .TP |
| \fB\-R\fR, \fB\-\-no\-rotate\fR |
| Disables rotation of the job's requested geometry in order to fit an |
| appropriate partition. |
| By default the specified geometry can rotate in three dimensions. |
| |
| .TP |
| \fB\-\-ramdisk\-image\fR=<\fIpath\fR> |
| Path to ramdisk image for bluegene block. BGL only. |
Default from \fIbluegene.conf\fR if not set.
| |
| .TP |
| \fB\-\-reboot\fR |
| Force the allocated nodes to reboot before starting the job. |
| |
| .PP |
| .B srun |
| will submit the job request to the slurm job controller, then initiate all |
| processes on the remote nodes. If the request cannot be met immediately, |
| .B srun |
| will block until the resources are free to run the job. If the |
| \fB\-I\fR (\fB\-\-immediate\fR) option is specified |
| .B srun |
| will terminate if resources are not immediately available. |
| .PP |
| When initiating remote processes |
| .B srun |
| will propagate the current working directory, unless |
| \fB\-\-chdir\fR=<\fIpath\fR> is specified, in which case \fIpath\fR will |
| become the working directory for the remote processes. |
| .PP |
The \fB\-n\fR, \fB\-c\fR, and \fB\-N\fR options control how CPUs and
| nodes will be allocated to the job. When specifying only the number |
| of processes to run with \fB\-n\fR, a default of one CPU per process |
| is allocated. By specifying the number of CPUs required per task (\fB\-c\fR), |
| more than one CPU may be allocated per process. If the number of nodes |
| is specified with \fB\-N\fR, |
| .B srun |
| will attempt to allocate \fIat least\fR the number of nodes specified. |
| .PP |
| Combinations of the above three options may be used to change how |
| processes are distributed across nodes and cpus. For instance, by specifying |
| both the number of processes and number of nodes on which to run, the |
| number of processes per node is implied. However, if the number of CPUs |
per process is more important, then the number of processes (\fB\-n\fR) and the
| number of CPUs per process (\fB\-c\fR) should be specified. |
| .PP |
| .B srun |
| will refuse to allocate more than one process per CPU unless |
| \fB\-\-overcommit\fR (\fB\-O\fR) is also specified. |
| .PP |
| .B srun |
| will attempt to meet the above specifications "at a minimum." That is, |
| if 16 nodes are requested for 32 processes, and some nodes do not have |
| 2 CPUs, the allocation of nodes will be increased in order to meet the |
| demand for CPUs. In other words, a \fIminimum\fR of 16 nodes are being |
| requested. However, if 16 nodes are requested for 15 processes, |
| .B srun |
| will consider this an error, as 15 processes cannot run across 16 nodes. |
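For example, the requests described above might be expressed as
(illustrative counts):
.nf
srun \-N 16 \-n 32 a.out
srun \-n 8 \-c 4 a.out
.fi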
| |
| .PP |
| .B "IO Redirection" |
| .PP |
| By default, stdout and stderr will be redirected from all tasks to the |
| stdout and stderr of \fBsrun\fR, and stdin will be redirected from the |
| standard input of \fBsrun\fR to all remote tasks. |
| For OS X, the poll() function does not support stdin, so input from |
| a terminal is not possible. |
| This behavior may be changed with the |
| \fB\-\-output\fR, \fB\-\-error\fR, and \fB\-\-input\fR |
| (\fB\-o\fR, \fB\-e\fR, \fB\-i\fR) options. Valid format specifications |
| for these options are |
| .TP 10 |
| \fBall\fR |
stdout and stderr are redirected from all tasks to srun.
| stdin is broadcast to all remote tasks. |
| (This is the default behavior) |
| .TP |
| \fBnone\fR |
stdout and stderr are not received from any task.
| stdin is not sent to any task (stdin is closed). |
| .TP |
| \fItaskid\fR |
| stdout and/or stderr are redirected from only the task with relative |
id equal to \fItaskid\fR, where 0 <= \fItaskid\fR < \fIntasks\fR,
| where \fIntasks\fR is the total number of tasks in the current job step. |
| stdin is redirected from the stdin of \fBsrun\fR to this same task. |
| This file will be written on the node executing the task. |
| .TP |
| \fIfilename\fR |
| \fBsrun\fR will redirect stdout and/or stderr to the named file from |
| all tasks. |
| stdin will be redirected from the named file and broadcast to all |
| tasks in the job. \fIfilename\fR refers to a path on the host |
| that runs \fBsrun\fR. Depending on the cluster's file system layout, |
| this may result in the output appearing in different places depending |
| on whether the job is run in batch mode. |
| .TP |
| format string |
| \fBsrun\fR allows for a format string to be used to generate the |
| named IO file |
| described above. The following list of format specifiers may be |
| used in the format string to generate a filename that will be |
| unique to a given jobid, stepid, node, or task. In each case, |
| the appropriate number of files are opened and associated with |
| the corresponding tasks. Note that any format string containing |
| %t, %n, and/or %N will be written on the node executing the task |
| rather than the node where \fBsrun\fR executes. |
| .RS 10 |
| .TP |
| %J |
| jobid.stepid of the running job. (e.g. "128.0") |
| .TP |
| %j |
| jobid of the running job. |
| .TP |
| %s |
| stepid of the running job. |
| .TP |
| %N |
| short hostname. This will create a separate IO file per node. |
| .TP |
| %n |
| Node identifier relative to current job (e.g. "0" is the first node of |
| the running job) This will create a separate IO file per node. |
| .TP |
| %t |
| task identifier (rank) relative to current job. This will create a |
| separate IO file per task. |
| .PP |
| A number placed between the percent character and format specifier may be |
| used to zero\-pad the result in the IO filename. This number is ignored if |
| the format specifier corresponds to non\-numeric data (%N for example). |
| |
| Some examples of how the format string may be used for a 4 task job step |
| with a Job ID of 128 and step id of 0 are included below: |
| .TP 15 |
| job%J.out |
| job128.0.out |
| .TP |
| job%4j.out |
| job0128.out |
| .TP |
| job%j\-%2t.out |
| job128\-00.out, job128\-01.out, ... |
| .PP |
.RE
| .PP |
| |
| .SH "INPUT ENVIRONMENT VARIABLES" |
| .PP |
| Some srun options may be set via environment variables. |
| These environment variables, along with their corresponding options, |
| are listed below. |
| Note: Command line options will always override these settings. |
| .TP 22 |
| \fBPMI_FANOUT\fR |
| This is used exclusively with PMI (MPICH2 and MVAPICH2) and |
| controls the fanout of data communications. The srun command |
| sends messages to application programs (via the PMI library) |
| and those applications may be called upon to forward that |
| data to up to this number of additional tasks. Higher values |
| offload work from the srun command to the applications and |
| likely increase the vulnerability to failures. |
| The default value is 32. |
| .TP |
| \fBPMI_FANOUT_OFF_HOST\fR |
| This is used exclusively with PMI (MPICH2 and MVAPICH2) and |
| controls the fanout of data communications. The srun command |
| sends messages to application programs (via the PMI library) |
| and those applications may be called upon to forward that |
| data to additional tasks. By default, srun sends one message |
| per host and one task on that host forwards the data to other |
| tasks on that host up to \fBPMI_FANOUT\fR. |
| If \fBPMI_FANOUT_OFF_HOST\fR is defined, the user task |
| may be required to forward the data to tasks on other hosts. |
| Setting \fBPMI_FANOUT_OFF_HOST\fR may increase performance. |
| Since more work is performed by the PMI library loaded by |
| the user application, failures also can be more common and |
| more difficult to diagnose. |
| .TP |
| \fBPMI_TIME\fR |
| This is used exclusively with PMI (MPICH2 and MVAPICH2) and |
| controls how much the communications from the tasks to the |
| srun are spread out in time in order to avoid overwhelming the |
| srun command with work. The default value is 500 (microseconds) |
| per task. On relatively slow processors or systems with very |
| large processor counts (and large PMI data sets), higher values |
| may be required. |
| .TP |
| \fBSLURM_CONF\fR |
| The location of the SLURM configuration file. |
| .TP |
| \fBSLURM_ACCOUNT\fR |
| Same as \fB\-U, \-\-account\fR |
| .TP |
| \fBSLURM_ACCTG_FREQ\fR |
| Same as \fB\-\-acctg\-freq\fR |
| .TP |
| \fBSLURM_CHECKPOINT\fR |
| Same as \fB\-\-checkpoint\fR |
| .TP |
| \fBSLURM_CHECKPOINT_DIR\fR |
| Same as \fB\-\-checkpoint\-dir\fR |
| .TP |
| \fBSLURM_CONN_TYPE\fR |
| Same as \fB\-\-conn\-type\fR |
| .TP |
| \fBSLURM_CORE_FORMAT\fR |
| Same as \fB\-\-core\fR |
| .TP |
| \fBSLURM_CPU_BIND\fR |
| Same as \fB\-\-cpu_bind\fR |
| .TP |
| \fBSLURM_CPUS_PER_TASK\fR |
Same as \fB\-c, \-\-cpus\-per\-task\fR
| .TP |
| \fBSLURM_DEBUG\fR |
| Same as \fB\-v, \-\-verbose\fR |
| .TP |
| \fBSLURMD_DEBUG\fR |
| Same as \fB\-d, \-\-slurmd\-debug\fR |
| .TP |
| \fBSLURM_DEPENDENCY\fR |
Same as \fB\-P, \-\-dependency\fR=<\fIjobid\fR>
| .TP |
| \fBSLURM_DISABLE_STATUS\fR |
| Same as \fB\-X, \-\-disable\-status\fR |
| .TP |
| \fBSLURM_DIST_PLANESIZE\fR |
| Same as \fB\-m plane\fR |
| .TP |
| \fBSLURM_DISTRIBUTION\fR |
| Same as \fB\-m, \-\-distribution\fR |
| .TP |
| \fBSLURM_EPILOG\fR |
| Same as \fB\-\-epilog\fR |
| .TP |
| \fBSLURM_EXCLUSIVE\fR |
| Same as \fB\-\-exclusive\fR |
| .TP |
| \fBSLURM_GEOMETRY\fR |
| Same as \fB\-g, \-\-geometry\fR |
| .TP |
| \fBSLURM_JOB_NAME\fR |
| Same as \fB\-J, \-\-job\-name\fR except within an existing |
| allocation, in which case it is ignored to avoid using the batch job's name |
| as the name of each job step. |
| .TP |
| \fBSLURM_LABELIO\fR |
| Same as \fB\-l, \-\-label\fR |
| .TP |
| \fBSLURM_MEM_BIND\fR |
| Same as \fB\-\-mem_bind\fR |
| .TP |
| \fBSLURM_NETWORK\fR |
| Same as \fB\-\-network\fR |
| .TP |
| \fBSLURM_NNODES\fR |
| Same as \fB\-N, \-\-nodes\fR |
| .TP |
| \fBSLURM_NTASKS_PER_CORE\fR |
| Same as \fB\-\-ntasks\-per\-core\fR |
| .TP |
| \fBSLURM_NTASKS_PER_NODE\fR |
| Same as \fB\-\-ntasks\-per\-node\fR |
| .TP |
| \fBSLURM_NTASKS_PER_SOCKET\fR |
| Same as \fB\-\-ntasks\-per\-socket\fR |
| .TP |
| \fBSLURM_NO_ROTATE\fR |
| Same as \fB\-R, \-\-no\-rotate\fR |
| .TP |
| \fBSLURM_NPROCS\fR |
| Same as \fB\-n, \-\-ntasks\fR |
| .TP |
| \fBSLURM_OPEN_MODE\fR |
| Same as \fB\-\-open\-mode\fR |
| .TP |
| \fBSLURM_OVERCOMMIT\fR |
| Same as \fB\-O, \-\-overcommit\fR |
| .TP |
| \fBSLURM_PARTITION\fR |
| Same as \fB\-p, \-\-partition\fR |
| .TP |
| \fBSLURM_PROLOG\fR |
| Same as \fB\-\-prolog\fR |
| .TP |
| \fBSLURM_REMOTE_CWD\fR |
| Same as \fB\-D, \-\-chdir=\fR |
| .TP |
| \fBSLURM_RESTART_DIR\fR |
| Same as \fB\-\-restart-dir\fR |
| .TP |
| \fBSLURM_STDERRMODE\fR |
| Same as \fB\-e, \-\-error\fR |
| .TP |
| \fBSLURM_STDINMODE\fR |
| Same as \fB\-i, \-\-input\fR |
| .TP |
| \fBSLURM_STDOUTMODE\fR |
| Same as \fB\-o, \-\-output\fR |
| .TP |
| \fBSLURM_TASK_EPILOG\fR |
| Same as \fB\-\-task\-epilog\fR |
| .TP |
| \fBSLURM_TASK_PROLOG\fR |
| Same as \fB\-\-task\-prolog\fR
| .TP |
| \fBSLURM_THREADS\fR |
| Same as \fB\-T, \-\-threads\fR |
| .TP |
| \fBSLURM_TIMELIMIT\fR |
| Same as \fB\-t, \-\-time\fR |
| .TP |
| \fBSLURM_UNBUFFEREDIO\fR |
| Same as \fB\-u, \-\-unbuffered\fR |
| .TP |
| \fBSLURM_WAIT\fR |
| Same as \fB\-W, \-\-wait\fR |
| .TP |
| \fBSLURM_WORKING_DIR\fR |
| Same as \fB\-D, \-\-chdir\fR
| |
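| .PP
| Most of the variables above may be set in the environment in place of the
| corresponding command line option. For example, assuming a partition named
| "debug" exists (the partition name here is only illustrative), the
| following two commands are equivalent:
| .nf
|
| > SLURM_NPROCS=8 SLURM_PARTITION=debug srun hostname
| > srun \-n8 \-p debug hostname
|
| .fi
|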
| .SH "OUTPUT ENVIRONMENT VARIABLES" |
| .PP |
| srun will set some environment variables in the environment |
| of the executing tasks on the remote compute nodes. |
| These environment variables are: |
| |
| .TP 22 |
| \fBBASIL_RESERVATION_ID\fR |
| The reservation ID on Cray systems running ALPS/BASIL only. |
| |
| .TP |
| \fBSLURM_CHECKPOINT_IMAGE_DIR\fR |
| Directory into which checkpoint images should be written |
| if specified on the execute line. |
| |
| .TP |
| \fBSLURM_CPU_BIND_VERBOSE\fR |
| \-\-cpu_bind verbosity (quiet,verbose). |
| .TP |
| \fBSLURM_CPU_BIND_TYPE\fR |
| \-\-cpu_bind type (none,rank,map_cpu:,mask_cpu:) |
| .TP |
| \fBSLURM_CPU_BIND_LIST\fR |
| \-\-cpu_bind map or mask list (<list of IDs or masks for this node>) |
| |
| .TP |
| \fBSLURM_CPUS_ON_NODE\fR |
| Count of processors available to the job on this node. |
| Note the select/linear plugin allocates entire nodes to |
| jobs, so the value indicates the total count of CPUs on the node. |
| The select/cons_res plugin allocates individual processors |
| to jobs, so this number indicates the number of processors |
| on this node allocated to the job. |
| |
| .TP |
| \fBSLURM_GTIDS\fR |
| Global task IDs running on this node. |
| Zero origin and comma separated. |
| .TP |
| \fBSLURM_JOB_DEPENDENCY\fR |
| Set to value of the \-\-dependency option. |
| .TP |
| \fBSLURM_JOB_ID\fR (and \fBSLURM_JOBID\fR for backwards compatibility) |
| Job id of the executing job |
| |
| .TP |
| \fBSLURM_LAUNCH_NODE_IPADDR\fR |
| IP address of the node from which the task launch was |
| initiated (where the srun command ran from) |
| .TP |
| \fBSLURM_LOCALID\fR |
| Node local task ID for the process within a job |
| |
| .TP |
| \fBSLURM_MEM_BIND_VERBOSE\fR |
| \-\-mem_bind verbosity (quiet,verbose). |
| .TP |
| \fBSLURM_MEM_BIND_TYPE\fR |
| \-\-mem_bind type (none,rank,map_mem:,mask_mem:) |
| .TP |
| \fBSLURM_MEM_BIND_LIST\fR |
| \-\-mem_bind map or mask list (<list of IDs or masks for this node>) |
| |
| .TP |
| \fBSLURM_NNODES\fR |
| Total number of nodes in the job's resource allocation |
| .TP |
| \fBSLURM_NODEID\fR |
| The relative node ID of the current node |
| .TP |
| \fBSLURM_NODELIST\fR |
| List of nodes allocated to the job |
| .TP |
| \fBSLURM_NPROCS\fR |
| Total number of processes in the current job |
| .TP |
| \fBSLURM_PRIO_PROCESS\fR |
| The scheduling priority (nice value) at the time of job submission. |
| This value is propagated to the spawned processes. |
| .TP |
| \fBSLURM_PROCID\fR |
| The MPI rank (or relative process ID) of the current process |
| .TP |
| \fBSLURM_STEPID\fR |
| The step ID of the current job |
| .TP |
| \fBSLURM_TASK_PID\fR |
| The process ID of the task being started. |
| .TP |
| \fBSLURM_TASKS_PER_NODE\fR |
| Number of tasks to be initiated on each node. Values are |
| comma separated and in the same order as SLURM_NODELIST. |
| If two or more consecutive nodes are to have the same task |
| count, that count is followed by "(x#)" where "#" is the |
| repetition count. For example, "SLURM_TASKS_PER_NODE=2(x3),1" |
| indicates that the first three nodes will each execute two
| tasks and the fourth node will execute one task. |
| .TP |
| \fBSLURM_UMASK\fR |
| The umask (user file\-create mask) at the time of job submission. |
| This value is propagated to the spawned processes. |
| .TP |
| \fBMPIRUN_NOALLOCATE\fR
| Do not allocate a block. Blue Gene systems only.
| .TP
| \fBMPIRUN_NOFREE\fR
| Do not free a block. Blue Gene systems only.
| .TP
| \fBMPIRUN_PARTITION\fR
| The block name. Blue Gene systems only.
| |
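| .PP
| For example, a short script such as the following (the script name is
| hypothetical) can be launched with srun so that each task reports a few
| of these variables:
| .nf
|
| > cat show_env.sh
| #!/bin/sh
| # Each task prints its global rank, node\-local task ID, and node ID.
| echo "rank=$SLURM_PROCID localid=$SLURM_LOCALID nodeid=$SLURM_NODEID"
|
| > srun \-N2 \-n4 \-l show_env.sh
|
| .fi
|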
| .SH "SIGNALS AND ESCAPE SEQUENCES" |
| Signals sent to the \fBsrun\fR command are automatically forwarded to |
| the tasks it is controlling with a few exceptions. The escape sequence |
| \fB<control\-c>\fR will report the state of all tasks associated with |
| the \fBsrun\fR command. If \fB<control\-c>\fR is entered twice within |
| one second, then the associated SIGINT signal will be sent to all tasks |
| and a termination sequence will be entered sending SIGCONT, SIGTERM, |
| and SIGKILL to all spawned tasks. |
| If a third \fB<control\-c>\fR is received, the srun program will be |
| terminated without waiting for remote tasks to exit or their I/O to |
| complete. |
| |
| The escape sequence \fB<control\-z>\fR is presently ignored. Our intent |
| is for this to put the \fBsrun\fR command into a mode where various special
| actions may be invoked. |
| |
| .SH "MPI SUPPORT" |
| MPI use depends upon the type of MPI being used. |
| There are three fundamentally different modes of operation used |
| by these various MPI implementations.
| |
| 1. SLURM directly launches the tasks and performs initialization |
| of communications (Quadrics MPI, MPICH2, MPICH-GM, MVAPICH, MVAPICH2 |
| and some MPICH1 modes). For example: "srun \-n16 a.out". |
| |
| 2. SLURM creates a resource allocation for the job and then |
| mpirun launches tasks using SLURM's infrastructure (OpenMPI, |
| LAM/MPI, HP-MPI and some MPICH1 modes). |
| |
| 3. SLURM creates a resource allocation for the job and then |
| mpirun launches tasks using some mechanism other than SLURM, |
| such as SSH or RSH (BlueGene MPI and some MPICH1 modes). |
| These tasks are initiated outside of SLURM's monitoring
| or control. SLURM's epilog should be configured to purge |
| these tasks when the job's allocation is relinquished. |
| |
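| For example, in the second mode of operation an MPI implementation such as
| OpenMPI built with SLURM support can launch its tasks from within a SLURM
| allocation. The exact mpirun options vary by implementation, so the command
| below is only a sketch:
| .nf
|
| > salloc \-n16 mpirun a.out
|
| .fi
|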
| See \fIhttps://computing.llnl.gov/linux/slurm/mpi_guide.html\fR |
| for more information on use of these various MPI implementations
| with SLURM. |
| |
| .SH "MULTIPLE PROGRAM CONFIGURATION" |
| This section describes the format of the configuration file used with
| the \fB\-\-multi\-prog\fR option.
| Comments in the configuration file must have a "#" in column one.
| The configuration file contains the following fields, separated by white
| space:
| .TP |
| Task rank |
| One or more task ranks to use this configuration. |
| Multiple values may be comma separated. |
| Ranges may be indicated with two numbers separated with a '\-' with |
| the smaller number first (e.g. "0\-4" and not "4\-0"). |
| To indicate all tasks, specify a rank of '*' (in which case you probably |
| should not be using this option). |
| If an attempt is made to initiate a task for which no executable |
| program is defined, the following error message will be produced |
| "No executable program specified for this task". |
| .TP |
| Executable |
| The name of the program to execute. |
| May be fully qualified pathname if desired. |
| .TP |
| Arguments |
| Program arguments. |
| The expression "%t" will be replaced with the task's number. |
| The expression "%o" will be replaced with the task's offset within |
| this range (e.g. a configured task rank value of "1\-5" would |
| have offset values of "0\-4"). |
| Single quotes may be used to avoid having the enclosed values interpreted. |
| This field is optional. |
| .PP |
| For example: |
| .nf |
| ################################################################### |
| # srun multiple program configuration file |
| # |
| # srun \-n8 \-l \-\-multi\-prog silly.conf |
| ################################################################### |
| 4\-6 hostname |
| 1,7 echo task:%t |
| 0,2\-3 echo offset:%o |
| |
| > srun \-n8 \-l \-\-multi\-prog silly.conf |
| 0: offset:0 |
| 1: task:1 |
| 2: offset:1 |
| 3: offset:2 |
| 4: linux15.llnl.gov |
| 5: linux16.llnl.gov |
| 6: linux17.llnl.gov |
| 7: task:7 |
| |
| .fi |
| |
| |
| .SH "EXAMPLES" |
| This simple example demonstrates the execution of the command \fBhostname\fR |
| in eight tasks. At least eight processors will be allocated to the job |
| (the same as the task count) on however many nodes are required to satisfy |
| the request. The output of each task will be preceded by its task number.
| (The machine "dev" in the example below has a total of two CPUs per node) |
| |
| .nf |
| |
| > srun \-n8 \-l hostname |
| 0: dev0 |
| 1: dev0 |
| 2: dev1 |
| 3: dev1 |
| 4: dev2 |
| 5: dev2 |
| 6: dev3 |
| 7: dev3 |
| |
| .fi |
| .PP |
| The srun \fB\-r\fR option is used within a job script |
| to run two job steps on disjoint nodes in the following |
| example. The script is run using allocate mode instead |
| of as a batch job in this case. |
| |
| .nf |
| |
| > cat test.sh |
| #!/bin/sh |
| echo $SLURM_NODELIST |
| srun \-lN2 \-r2 hostname |
| srun \-lN2 hostname |
| |
| > salloc \-N4 test.sh |
| dev[7\-10] |
| 0: dev9 |
| 1: dev10 |
| 0: dev7 |
| 1: dev8 |
| |
| .fi |
| .PP |
| The following script runs two job steps in parallel
| within an allocated set of nodes. |
| |
| .nf |
| |
| > cat test.sh |
| #!/bin/bash |
| srun \-lN2 \-n4 \-r 2 sleep 60 & |
| srun \-lN2 \-r 0 sleep 60 & |
| sleep 1 |
| squeue |
| squeue \-s |
| wait |
| |
| > salloc \-N4 test.sh |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST |
| 65641 batch test.sh grondo R 0:01 4 dev[7\-10] |
| |
| STEPID PARTITION USER TIME NODELIST |
| 65641.0 batch grondo 0:01 dev[7\-8] |
| 65641.1 batch grondo 0:01 dev[9\-10] |
| |
| .fi |
| .PP |
| This example demonstrates how one executes a simple MPICH job. |
| We use \fBsrun\fR to build a list of machines (nodes) to be used by |
| \fBmpirun\fR in its required format. A sample command line and |
| the script to be executed follow. |
| |
| .nf |
| |
| > cat test.sh |
| #!/bin/sh |
| MACHINEFILE="nodes.$SLURM_JOB_ID" |
| |
| # Generate Machinefile for mpich such that hosts are in the same |
| # order as if run via srun |
| # |
| srun \-l /bin/hostname | sort \-n | awk '{print $2}' > $MACHINEFILE |
| |
| # Run using generated Machine file: |
| mpirun \-np $SLURM_NPROCS \-machinefile $MACHINEFILE mpi\-app |
| |
| rm $MACHINEFILE |
| |
| > salloc \-N2 \-n4 test.sh |
| |
| .fi |
| .PP |
| This example demonstrates how to execute different programs on different
| nodes with a single srun command. You can do this for any number of nodes
| and any number of programs. The program executed on each node is selected
| according to the \fBSLURM_NODEID\fR environment variable, which ranges from
| 0 up to one less than the number of nodes specified on the srun command line.
| |
| .nf |
| |
| > cat test.sh |
| case $SLURM_NODEID in |
| 0) echo "I am running on " |
| hostname ;; |
| 1) hostname |
| echo "is where I am running" ;; |
| esac |
| |
| > srun \-N2 test.sh |
| dev0 |
| is where I am running |
| I am running on |
| dev1 |
| |
| .fi |
| .PP |
| This example demonstrates use of multi\-core options to control layout |
| of tasks. |
| We request that four sockets per node and two cores per socket be |
| dedicated to the job. |
| |
| .nf |
| |
| > srun \-N2 \-B 4\-4:2\-2 a.out |
| .fi |
| .PP |
| This example shows a script in which SLURM is used to provide resource
| management for a job by executing the various job steps as processors |
| become available for their dedicated use. |
| |
| .nf |
| |
| > cat my.script |
| #!/bin/bash |
| srun \-\-exclusive \-n4 prog1 & |
| srun \-\-exclusive \-n3 prog2 & |
| srun \-\-exclusive \-n1 prog3 & |
| srun \-\-exclusive \-n1 prog4 & |
| wait |
| .fi |
| |
| |
| .SH "COPYING" |
| Copyright (C) 2006\-2007 The Regents of the University of California. |
| Copyright (C) 2008\-2009 Lawrence Livermore National Security. |
| Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). |
| CODE\-OCEC\-09\-009. All rights reserved. |
| .LP |
| This file is part of SLURM, a resource management program. |
| For details, see <https://computing.llnl.gov/linux/slurm/>. |
| .LP |
| SLURM is free software; you can redistribute it and/or modify it under |
| the terms of the GNU General Public License as published by the Free |
| Software Foundation; either version 2 of the License, or (at your option) |
| any later version. |
| .LP |
| SLURM is distributed in the hope that it will be useful, but WITHOUT ANY |
| WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS |
| FOR A PARTICULAR PURPOSE. See the GNU General Public License for more |
| details. |
| |
| .SH "SEE ALSO" |
| \fBsalloc\fR(1), \fBsattach\fR(1), \fBsbatch\fR(1), \fBsbcast\fR(1),
| \fBscancel\fR(1), \fBscontrol\fR(1), \fBsqueue\fR(1), \fBslurm.conf\fR(5), |
| \fBsched_setaffinity\fR(2), \fBnuma\fR(3),
| \fBgetrlimit\fR(2)