| <!--#include virtual="header.txt"--> |
| |
| <h1>Frequently Asked Questions</h1> |
| |
| <h2>For Management</h2> |
| <ul> |
| <li><a href="#free">Is Slurm really free?</a></li> |
| <li><a href="#foss">Why should I use Slurm or other free software?</a></li> |
| <li><a href="#support">Why should I pay for free software?</a></li> |
| <li><a href="#acronym">What does "Slurm" stand for?</a></li> |
| </ul> |
| |
| <h2>For Researchers</h2> |
| <ul> |
| <li><a href="#cite">How should I cite work involving Slurm?</a></li> |
| </ul> |
| |
| <h2>For Users</h2> |
| <h3>Designing Jobs</h3> |
| <ul> |
| <li><a href="#steps">How can I run multiple jobs from within a single |
| script?</a></li> |
| <li><a href="#multi_batch">How can I run a job within an existing job |
| allocation?</a></li> |
| <li><a href="#cpu_count">Slurm documentation refers to CPUs, cores and threads. |
| What exactly is considered a CPU?</a></li> |
| <li><a href="#arbitrary">How do I run specific tasks on certain nodes |
| in my allocation?</a></li> |
| <li><a href="#batch_out">How can I get the task ID in the output or error file |
| name for a batch job?</a></li> |
| <li><a href="#user_env">How does Slurm establish the environment for my |
| job?</a></li> |
| <li><a href="#parallel_make">Can the <i>make</i> command utilize the resources |
| allocated to a Slurm job?</a></li> |
| <li><a href="#ansys">How can I run an Ansys program with Slurm?</a></li> |
| </ul> |
| <h3>Submitting Jobs</h3> |
| <ul> |
| <li><a href="#opts">Why are my srun options ignored?</a></li> |
| <li><a href="#sharing">Why does the srun --overcommit option not permit |
| multiple jobs to run on nodes?</a></li> |
| <li><a href="#unbuffered_cr">Why is the srun --u/--unbuffered option adding |
| a carriage return to my output?</a></li> |
| <li><a href="#sbatch_srun">What is the difference between the sbatch |
| and srun commands?</a></li> |
| <li><a href="#terminal">Can tasks be launched with a remote (pseudo) |
| terminal?</a></li> |
| <li><a href="#prompt">How can I get shell prompts in interactive mode?</a></li> |
| <li><a href="#x11">Can Slurm export an X11 display on an allocated compute node?</a></li> |
| </ul> |
| <h3>Scheduling</h3> |
| <ul> |
| <li><a href="#pending">Why is my job not running?</a></li> |
| <li><a href="#backfill">Why is the Slurm backfill scheduler not starting my |
| job?</a></li> |
| </ul> |
| <h3>Killed Jobs</h3> |
| <ul> |
| <li><a href="#purge">Why is my job killed prematurely?</a></li> |
| <li><a href="#inactive">Why is my batch job that launches no job steps being |
| killed?</a></li> |
| <li><a href="#force">What does "srun: Force Terminated job" |
| indicate?</a></li> |
| <li><a href="#early_exit">What does this mean: "srun: First task exited |
| 30s ago" followed by "srun Job Failed"?</a></li> |
| </ul> |
| <h3>Managing Jobs</h3> |
| <ul> |
| <li><a href="#hold">How can I temporarily prevent a job from running |
| (e.g. place it into a <i>hold</i> state)?</a></li> |
| <li><a href="#job_size">Can I change my job's size after it has started |
| running?</a></li> |
| <li><a href="#estimated_start_time">Why does squeue (and "scontrol show |
| jobid") sometimes not display a job's estimated start time?</a></li> |
| <li><a href="#squeue_color">Can squeue output be color coded?</a></li> |
| <li><a href="#comp">Why is my job/node in a COMPLETING state?</a></li> |
| <li><a href="#req">How can a job in a complete or failed state be requeued?</a></li> |
| <li><a href="#sview_colors">Why is sview not coloring/highlighting nodes |
| properly?</a></li> |
| <li><a href="#mpi_symbols">Why is my MPICH2 or MVAPICH2 job not running with |
| Slurm? Why does the DAKOTA program not run with Slurm?</a></li> |
| </ul> |
| <h3>Resource Limits</h3> |
| <ul> |
| <li><a href="#rlimit">Why are my resource limits not propagated?</a></li> |
| <li><a href="#mem_limit">Why are jobs not getting the appropriate |
| memory limit?</a></li> |
| <li><a href="#memlock">Why is my MPI job failing due to the locked memory |
| (memlock) limit being too low?</a></li> |
| </ul> |
| |
| <h2>For Administrators</h2> |
| <h3>Test Environments</h3> |
| <ul> |
| <li><a href="#multi_slurm">Can multiple Slurm systems be run in |
| parallel for testing purposes?</a></li> |
| <li><a href="#multi_slurmd">Can Slurm emulate a larger cluster?</a></li> |
| <li><a href="#extra_procs">Can Slurm emulate nodes with more |
| resources than physically exist on the node?</a></li> |
| </ul> |
| <h3>Build and Install</h3> |
| <ul> |
| <li><a href="#rpm">Why aren't pam_slurm.so, auth_none.so, or other components in a |
| Slurm RPM?</a></li> |
| <li><a href="#debug">How can I build Slurm with debugging symbols?</a></li> |
| <li><a href="#git_patch">How can a patch file be generated from a Slurm commit |
| in GitHub?</a></li> |
| <li><a href="#apply_patch">How can I apply a patch to my Slurm source?</a></li> |
| <li><a href="#epel">Why am I being offered an automatic update for Slurm?</a></li> |
| </ul> |
| <h3>Cluster Management</h3> |
| <ul> |
| <li><a href="#controller"> How should I relocate the primary or backup |
| controller?</a></li> |
| <li><a href="#clock">Do I need to maintain synchronized clocks |
| on the cluster?</a></li> |
| <li><a href="#stop_sched">How can I stop Slurm from scheduling jobs?</a></li> |
| <li><a href="#maint_time">How can I dry up the workload for a maintenance |
| period?</a></li> |
| <li><a href="#upgrade">What should I be aware of when upgrading Slurm?</a></li> |
| <li><a href="#db_upgrade">Is there anything exceptional to be aware of when |
| upgrading my database server?</a></li> |
| <li><a href="#cluster_acct">When adding a new cluster, how can the Slurm cluster |
| configuration be copied from an existing cluster to the new cluster?</a></li> |
| <li><a href="#state_info">How could some jobs submitted immediately before the |
| slurmctld daemon crashed be lost?</a></li> |
| <li><a href="#limit_propagation">Is resource limit propagation |
| useful on a homogeneous cluster?</a></li> |
| <li><a href="#enforce_limits">Why are the resource limits set in the database |
| not being enforced?</a></li> |
| <li><a href="#licenses">Can Slurm be configured to manage licenses?</a></li> |
| <li><a href="#torque">How easy is it to switch from PBS or Torque to Slurm?</a></li> |
| <li><a href="#mpi_perf">What might account for MPI performance being below the |
| expected level?</a></li> |
| <li><a href="#delete_partition">How do I safely remove partitions?</a></li> |
| <li><a href="#routing_queue">How can a routing queue be configured?</a></li> |
| <li><a href="#none_plugins">What happened to the "none" plugins?</a></li> |
| </ul> |
| <h3>Accounting Database</h3> |
| <ul> |
| <li><a href="#slurmdbd">Why should I use the slurmdbd instead of the |
| regular database plugins?</a></li> |
| <li><a href="#dbd_rebuild">How can I rebuild the database hierarchy?</a></li> |
| <li><a href="#ha_db">How critical is configuring high availability for my |
| database?</a></li> |
| <li><a href="#sql">How can I use double quotes in MySQL queries?</a></li> |
| </ul> |
| <h3>Compute Nodes (slurmd)</h3> |
| <ul> |
| <li><a href="#return_to_service">Why is a node shown in state DOWN when the node |
| has registered for service?</a></li> |
| <li><a href="#down_node">What happens when a node crashes?</a></li> |
| <li><a href="#multi_job">How can I control the execution of multiple |
| jobs per node?</a></li> |
| <li><a href="#time">Why are jobs allocated nodes and then unable to initiate |
| programs on some nodes?</a></li> |
| <li><a href="#ping"> Why does <i>slurmctld</i> log that some nodes |
| are not responding even if they are not in any partition?</a></li> |
| <li><a href="#state_preserve">How can I easily preserve drained node |
| information between major Slurm updates?</a></li> |
| <li><a href="#health_check_example">Does anyone have an example node health check |
| script for Slurm?</a></li> |
| <li><a href="#health_check">Why doesn't the <i>HealthCheckProgram</i> |
| execute on DOWN nodes?</a></li> |
| <li><a href="#slurmd_oom">How can I prevent the <i>slurmd</i> and |
| <i>slurmstepd</i> daemons from being killed when a node's memory |
| is exhausted?</a></li> |
| <li><a href="#ubuntu">I see the host of my calling node as 127.0.1.1 |
| instead of the correct IP address. Why is that?</a></li> |
| <li><a href="#add_nodes">How should I add nodes to Slurm?</a></li> |
| <li><a href="#rem_nodes">How should I remove nodes from Slurm?</a></li> |
| <li><a href="#reboot">Why is a compute node down with the reason set to |
| "Node unexpectedly rebooted"?</a></li> |
| <li><a href="#cgroupv2">How do I convert my nodes to Control Group (cgroup) |
| v2?</a></li> |
| <li><a href="#amazon_ec2">Can Slurm be used to run jobs on Amazon's EC2?</a></li> |
| </ul> |
| <h3>User Management</h3> |
| <ul> |
| <li><a href="#pam">How can PAM be used to control a user's limits on or |
| access to compute nodes?</a></li> |
| <li><a href="#pam_exclude">How can I exclude some users from pam_slurm?</a></li> |
| <li><a href="#user_account">Can a user's account be changed in the database?</a></li> |
| <li><a href="#changed_uid">I had to change a user's UID and now they cannot submit |
| jobs. How do I get the new UID to take effect?</a></li> |
| <li><a href="#sssd">How can I get SSSD to work with Slurm?</a></li> |
| </ul> |
| <h3>Jobs</h3> |
| <ul> |
| <li><a href="#suspend">How is job suspend/resume useful?</a></li> |
| <li><a href="#squeue_script">How can I suspend, resume, hold or release all |
| of the jobs belonging to a specific user, partition, etc?</a></li> |
| <li><a href="#restore_priority">After manually setting a job priority value, |
| how can its priority value be returned to being managed by the |
| priority/multifactor plugin?</a></li> |
| <li><a href="#scontrol_multi_jobs">Can I update multiple jobs with a single |
| <i>scontrol</i> command?</a></li> |
| <li><a href="#task_prolog">How could I automatically print a job's |
| Slurm job ID to its standard output?</a></li> |
| <li><a href="#write_to_job_stdout">Is it possible to write to user stdout?</a></li> |
| <li><a href="#orphan_procs">Why are user processes and <i>srun</i> |
| running even though the job is supposed to be completed?</a></li> |
| <li><a href="#reqspec">How can a job which has exited with a specific exit code |
| be requeued?</a></li> |
| <li><a href="#cpu_freq">Why is Slurm unable to set the CPU frequency for jobs?</a></li> |
| <li><a href="#salloc_default_command">Can the salloc command be configured to |
| launch a shell on a node in the job's allocation?</a></li> |
| <li><a href="#tmpfs_jobcontainer">How can I set up a private /tmp and /dev/shm for |
| jobs on my machine?</a></li> |
| <li><a href="#sysv_memory">How do I configure Slurm to work with System V IPC |
| enabled applications?</a></li> |
| </ul> |
| <h3>General Troubleshooting</h3> |
| <ul> |
| <li><a href="#core_dump">If a Slurm daemon core dumps, where can I find the |
| core file?</a></li> |
| <li><a href="#backtrace">How can I get a backtrace from a core file?</a></li> |
| </ul> |
| <h3>Error Messages</h3> |
| <ul> |
| <li><a href="#inc_plugin">"Cannot resolve X plugin operations" on |
| daemon startup</a></li> |
| <li><a href="#credential_replayed">"Credential replayed" in |
| <i>SlurmdLogFile</i></a></li> |
| <li><a href="#cred_invalid">"Invalid job credential"</a></li> |
| <li><a href="#cred_replay">"Task launch failed on node ... Job credential |
| replayed"</a></li> |
| <li><a href="#file_limit">"Unable to accept new connection: Too many open |
| files"</a></li> |
| <li><a href="#slurmd_log"><i>SlurmdDebug</i> fails to log job step information |
| at the appropriate level</a></li> |
| <li><a href="#batch_lost">"Batch JobId=# missing from batch node <node> |
| (not found BatchStartTime after startup)"</a></li> |
| <li><a href="#opencl_pmix">Multi-Instance GPU not working with Slurm and |
| PMIx; GPUs are "In use by another client"</a></li> |
| <li><a href="#accept_again">"srun: error: Unable to accept connection: |
| Resources temporarily unavailable"</a></li> |
| <li><a href="#large_time">"Warning: Note very large processing time" |
| in <i>SlurmctldLogFile</i></a></li> |
| <li><a href="#mysql_duplicate">"Duplicate entry" causes slurmdbd to |
| fail</a></li> |
| <li><a href="#json_serializer">"Unable to find plugin: serializer/json"</a></li> |
| </ul> |
| <h3>Third Party Integrations</h3> |
| <ul> |
| <li><a href="#globus">Can Slurm be used with Globus?</a></li> |
| <li><a href="#totalview">How can TotalView be configured to operate with |
| Slurm?</a></li> |
| </ul> |
| |
| <h2>For Management</h2> |
| <p><a id="free"><b>Is Slurm really free?</b></a><br> |
| Yes, Slurm is free and open source: |
| <ul> |
| <li>Slurm is free as defined by the |
| <a href="https://www.gnu.org/philosophy/free-sw.en.html">Free Software |
| Foundation</a></li> |
| <li>Slurm’s <a href="https://github.com/SchedMD/slurm">source code</a> and |
| <a href="https://slurm.schedmd.com/documentation.html">documentation</a> are |
| publicly available under the GNU GPL v2</li> |
| <li>Slurm can be <a href="https://www.schedmd.com/download-slurm/"> |
| downloaded</a>, used, modified, and redistributed at no monetary cost</li> |
| </ul></p> |
| |
| <p><a id="foss"><b>Why should I use Slurm or other free software?</b></a><br> |
Free software, like proprietary software, varies widely in quality, but the
development model has proven capable of producing high-quality software
that is trusted by companies around the world. A prominent example is the
Linux kernel, which is trusted to run web servers, infrastructure servers,
supercomputers, and mobile devices.</p>
| |
| <p>Likewise, Slurm has become a trusted tool in the supercomputing world since |
| its initial release in 2002 and the founding of SchedMD in 2010 to continue |
| developing Slurm. Today, Slurm powers a majority of the |
| <a href="https://www.top500.org/">TOP500</a> supercomputers. Customers switching |
| from commercial workload managers to Slurm typically report higher scalability, |
| better performance and lower costs.</p> |
| |
| <p><a id="support"><b>Why should I pay for free software?</b></a><br> |
| Free software does not mean that it is without cost. Software requires |
| significant time and expertise to write, test, distribute, and maintain. If the |
| software is large and complex, like Slurm or the Linux kernel, these costs can |
| become very substantial.</p> |
| |
| <p>Slurm is often used for highly important tasks at major computing clusters |
| around the world. Due to the extensive features available and the complexity of |
| the code required to provide those features, many organizations prefer to have |
| experts available to provide tailored recommendations and troubleshooting |
| assistance. While Slurm has a global development community incorporating leading |
| edge technology, <a href="https://www.schedmd.com">SchedMD</a> personnel have |
| developed most of the code and can provide competitively priced commercial |
| support and on-site training.</p> |
| |
| <p><a id="acronym"><b>What does "Slurm" stand for?</b></a><br> |
| Nothing.</p> |
| <p>Originally, "SLURM" (completely capitalized) was an acronym for |
| "Simple Linux Utility for Resource Management". In 2012 the preferred |
| capitalization was changed to Slurm, and the acronym was dropped — the |
| developers preferred to think of Slurm as "sophisticated" rather than "Simple" |
| by this point. And, as Slurm continued to expand it's scheduling capabilities, |
| the "Resource Management" label was also viewed as outdated.</p> |
| |
| <h2>For Researchers</h2> |
| <p><a id="cite"><b>How should I cite work involving Slurm?</b></a><br> |
| We recommend citing the peer-reviewed paper from JSSPP 2023: |
| <a href="https://doi.org/10.1007/978-3-031-43943-8_1"> |
| Architecture of the Slurm Workload Manager.</a></p> |
| <pre>Jette, M.A., Wickberg, T. (2023). Architecture of the Slurm Workload Manager. |
| In: Klusáček, D., Corbalán, J., Rodrigo, G.P. (eds) Job Scheduling Strategies |
| for Parallel Processing. JSSPP 2023. Lecture Notes in Computer Science, |
| vol 14283. Springer, Cham. https://doi.org/10.1007/978-3-031-43943-8_1 |
| </pre> |
| |
| <h2>For Users</h2> |
| |
| <h3>Designing Jobs</h3> |
| |
| <p><a id="steps"><b>How can I run multiple jobs from within a |
| single script?</b></a><br> |
| A Slurm job is just a resource allocation. You can execute many |
| job steps within that allocation, either in parallel or sequentially. |
| Some jobs actually launch thousands of job steps this way. The job |
| steps will be allocated nodes that are not already allocated to |
| other job steps. This essentially provides a second level of resource |
| management within the job for the job steps.</p> |
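<p>As a minimal sketch (the program names are placeholders), a batch script
might launch two job steps in parallel and then a final step across the whole
allocation:</p>
<pre>
#!/bin/bash
#SBATCH -N2
# Start two job steps in the background; they run concurrently
# within the resources allocated to this job
srun -N1 -n1 ./step_one &
srun -N1 -n1 ./step_two &
wait                    # wait for the parallel steps to finish
srun -N2 ./final_step   # then run a step using the full allocation
</pre>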
| |
| <p><a id="multi_batch"><b>How can I run a job within an existing |
| job allocation?</b></a><br> |
| There is an srun option <i>--jobid</i> that can be used to specify |
| a job's ID. |
| For a batch job or within an existing resource allocation, the |
| environment variable <i>SLURM_JOB_ID</i> has already been defined, |
| so all job steps will run within that job allocation unless |
| otherwise specified. |
| The one exception to this is when submitting batch jobs. |
| When a batch job is submitted from within an existing batch job, |
| it is treated as a new job allocation request and will get a |
| new job ID unless explicitly set with the <i>--jobid</i> option. |
| If you specify that a batch job should use an existing allocation, |
| that job allocation will be released upon the termination of |
| that batch job.</p> |
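<p>As a brief, hedged example (the job ID 1234 is hypothetical), a job step can
be launched from outside the job into an existing allocation:</p>
<pre>
# Run a one-task step inside the resources already allocated to job 1234
$ srun --jobid=1234 -n1 hostname
</pre>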
| |
| <p><a id="cpu_count"><b>Slurm documentation refers to CPUs, cores and threads. |
| What exactly is considered a CPU?</b></a><br> |
| If your nodes are configured with hyperthreading, then a CPU is equivalent |
| to a hyperthread. |
| Otherwise a CPU is equivalent to a core. |
| You can determine if your nodes have more than one thread per core |
| using the command "scontrol show node" and looking at the values of |
| "ThreadsPerCore".</p> |
| <p>Note that even on systems with hyperthreading enabled, the resources will |
| generally be allocated to jobs at the level of a core (see NOTE below). |
| Two different jobs will not share a core except through the use of a partition |
| OverSubscribe configuration parameter. |
| For example, a job requesting resources for three tasks on a node with |
| ThreadsPerCore=2 will be allocated two full cores. |
| Note that Slurm commands contain a multitude of options to control |
| resource allocation with respect to base boards, sockets, cores and threads.</p> |
| <p>(<b>NOTE</b>: An exception to this would be if the system administrator |
configured SelectTypeParameters=CR_CPU and specified each node's CPU count without its
| socket/core/thread specification. In that case, each thread would be |
| independently scheduled as a CPU. This is not a typical configuration.)</p> |
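<p>For example (the node name "tux1" is a placeholder and the output is only
illustrative):</p>
<pre>
$ scontrol show node tux1 | grep -o "ThreadsPerCore=[0-9]*"
ThreadsPerCore=2
</pre>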
| |
| <p><a id="arbitrary"><b>How do I run specific tasks on certain nodes |
| in my allocation?</b></a><br> |
One of the distribution methods for srun '<b>-m</b>
or <b>--distribution</b>' is 'arbitrary'. This means you can tell Slurm to
lay out your tasks in any fashion you want. For instance, if I had an
allocation of 2 nodes and wanted to run 4 tasks on the first node and
1 task on the second, and my nodes allocated from SLURM_JOB_NODELIST
were tux[0-1], my srun line would look like this:<br><br>
<i>srun -n5 -m arbitrary -w tux[0,0,0,0,1] hostname</i><br><br>
If I wanted something similar but wanted the third task to be on tux1
I could run this:<br><br>
<i>srun -n5 -m arbitrary -w tux[0,0,1,0,0] hostname</i><br><br>
Here is a simple Perl script named arbitrary.pl that can be run to easily lay
out tasks on nodes as they are in SLURM_JOB_NODELIST.</p>
| <pre> |
| #!/usr/bin/perl |
# Comma-separated task counts per node, e.g. "4,1"
my @tasks = split(',', $ARGV[0]);
# Read the allocation's node names from the SLURM_JOB_NODELIST environment variable
my @nodes = `scontrol show hostnames $ENV{SLURM_JOB_NODELIST}`;
| my $node_cnt = $#nodes + 1; |
| my $task_cnt = $#tasks + 1; |
| |
| if ($node_cnt < $task_cnt) { |
| print STDERR "ERROR: You only have $node_cnt nodes, but requested layout on $task_cnt nodes.\n"; |
| $task_cnt = $node_cnt; |
| } |
| |
| my $cnt = 0; |
| my $layout; |
| foreach my $task (@tasks) { |
| my $node = $nodes[$cnt]; |
| last if !$node; |
| chomp($node); |
| for(my $i=0; $i < $task; $i++) { |
| $layout .= "," if $layout; |
| $layout .= "$node"; |
| } |
| $cnt++; |
| } |
| print $layout; |
| </pre> |
| |
| <p>We can now use this script in our srun line in this fashion.<br><br> |
| <i>srun -m arbitrary -n5 -w `arbitrary.pl 4,1` -l hostname</i><br><br> |
This will lay out 4 tasks on the first node in the allocation and 1
| task on the second node.</p> |
| |
| <p><a id="batch_out"><b>How can I get the task ID in the output |
| or error file name for a batch job?</b></a><br> |
| If you want separate output by task, you will need to build a script |
| containing this specification. For example:</p> |
| <pre> |
| $ cat test |
| #!/bin/sh |
| echo begin_test |
| srun -o out_%j_%t hostname |
| |
| $ sbatch -n7 -o out_%j test |
| sbatch: Submitted batch job 65541 |
| |
| $ ls -l out* |
| -rw-rw-r-- 1 jette jette 11 Jun 15 09:15 out_65541 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_0 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_1 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_2 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_3 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_4 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_5 |
| -rw-rw-r-- 1 jette jette 6 Jun 15 09:15 out_65541_6 |
| |
| $ cat out_65541 |
| begin_test |
| |
| $ cat out_65541_2 |
| tdev2 |
| </pre> |
| |
| <p><a id="user_env"><b>How does Slurm establish the environment |
| for my job?</b></a><br> |
| Slurm processes are not run under a shell, but directly exec'ed |
| by the <i>slurmd</i> daemon (assuming <i>srun</i> is used to launch |
| the processes). |
| The environment variables in effect at the time the <i>srun</i> command |
| is executed are propagated to the spawned processes. |
| The <i>~/.profile</i> and <i>~/.bashrc</i> scripts are not executed |
| as part of the process launch. You can also look at the <i>--export</i> option of |
| srun and sbatch. See man pages for details.</p> |
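<p>A hedged example of using <i>--export</i> (the variable name and script are
placeholders):</p>
<pre>
# Propagate only PATH plus one explicitly set variable to the batch job,
# instead of the full submission environment
$ sbatch --export=PATH,MYVAR=42 my_script.sh
</pre>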
| |
| <p><a id="parallel_make"><b>Can the <i>make</i> command |
| utilize the resources allocated to a Slurm job?</b></a><br> |
| Yes. There is a patch available for GNU make version 3.81 |
| available as part of the Slurm distribution in the file |
| <i>contribs/make-3.81.slurm.patch</i>. For GNU make version 4.0 you |
| can use the patch in the file <i>contribs/make-4.0.slurm.patch</i>. |
| This patch will use Slurm to launch tasks across a job's current resource |
| allocation. Depending upon the size of modules to be compiled, this may |
| or may not improve performance. If most modules are thousands of lines |
| long, the use of additional resources should more than compensate for the |
| overhead of Slurm's task launch. Use with make's <i>-j</i> option within an |
| existing Slurm allocation. Outside of a Slurm allocation, make's behavior |
| will be unchanged.</p> |
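<p>A hedged usage sketch, assuming one of the patched GNU make versions is
installed and first in your PATH:</p>
<pre>
# From within an existing allocation, the patched make uses srun to
# spread compile tasks across the allocated resources
$ salloc -N4
$ make -j 16
</pre>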
| |
| <p><a id="ansys"><b>How can I run an Ansys program with Slurm?</b></a><br> |
| If you are talking about an interactive run of the Ansys app, then you can use |
| this simple script (it is for Ansys Fluent):</p> |
| <pre> |
| $ cat ./fluent-srun.sh |
| #!/usr/bin/env bash |
| HOSTSFILE=.hostlist-job$SLURM_JOB_ID |
| if [ "$SLURM_PROCID" == "0" ]; then |
| srun hostname -f > $HOSTSFILE |
| fluent -t $SLURM_NTASKS -cnf=$HOSTSFILE -ssh 3d |
| rm -f $HOSTSFILE |
| fi |
| exit 0 |
| </pre> |
| |
| <p>To run an interactive session, use srun like this:</p> |
| <pre> |
$ srun -n &lt;tasks&gt; ./fluent-srun.sh
| </pre> |
| |
| <h3>Submitting Jobs</h3> |
| |
| <p><a id="opts"><b>Why are my srun options ignored?</b></a><br> |
| Everything after the command <span class="commandline">srun</span> is |
| examined to determine if it is a valid option for srun. The first |
| token that is not a valid option for srun is considered the command |
| to execute and everything after that is treated as an option to |
| the command. For example:</p> |
| <blockquote> |
| <p><span class="commandline">srun -N2 uptime -pdebug</span></p> |
| </blockquote> |
| <p>srun processes "-N2" as an option to itself. "uptime" is the command to |
| execute and "-pdebug" is treated as an option to the uptime command. Depending |
| on the command and options provided, you may get an invalid option message or |
| unexpected behavior if the options happen to be valid.</p> |
| |
| <p>Options for srun should appear before the command to be run:</p> |
| |
| <blockquote> |
| <p><span class="commandline">srun -N2 -pdebug uptime</span></p> |
| </blockquote> |
| |
| <p><a id="sharing"><b>Why does the srun --overcommit option not permit multiple jobs |
| to run on nodes?</b></a><br> |
| The <b>--overcommit</b> option is a means of indicating that a job or job step is willing |
| to execute more than one task per processor in the job's allocation. For example, |
| consider a cluster of two processor nodes. The srun execute line may be something |
| of this sort</p> |
| <blockquote> |
| <p><span class="commandline">srun --ntasks=4 --nodes=1 a.out</span></p> |
| </blockquote> |
| <p>This will result in not one, but two nodes being allocated so that each of the four |
| tasks is given its own processor. Note that the srun <b>--nodes</b> option specifies |
| a minimum node count and optionally a maximum node count. A command line of</p> |
| <blockquote> |
| <p><span class="commandline">srun --ntasks=4 --nodes=1-1 a.out</span></p> |
| </blockquote> |
| <p>would result in the request being rejected. If the <b>--overcommit</b> option |
| is added to either command line, then only one node will be allocated for all |
| four tasks to use.</p> |
| <p>More than one job can execute simultaneously on the same compute resource |
| (e.g. CPU) through the use of srun's <b>--oversubscribe</b> option in |
| conjunction with the <b>OverSubscribe</b> parameter in Slurm's partition |
| configuration. See the man pages for srun and slurm.conf for more information.</p> |
| |
| <p><a id="unbuffered_cr"><b>Why is the srun --u/--unbuffered option adding |
| a carriage character return to my output?</b></a><br> |
| The libc library used by many programs internally buffers output rather than |
| writing it immediately. This is done for performance reasons. |
| The only way to disable this internal buffering is to configure the program to |
| write to a pseudo terminal (PTY) rather than to a regular file. |
| This configuration causes <u>some</u> implementations of libc to prepend the |
| carriage return character before all line feed characters. |
| Removing the carriage return character would result in desired formatting |
| in some instances, while causing bad formatting in other cases. |
| In any case, Slurm is not adding the carriage return character, but displaying |
| the actual program's output.</p> |
| |
| <p><a id="sbatch_srun"><b>What is the difference between the sbatch |
| and srun commands?</b></a><br> |
| The srun command has two different modes of operation. First, if not run within |
| an existing job (i.e. not within a Slurm job allocation created by salloc or |
| sbatch), then it will create a job allocation and spawn an application. |
| If run within an existing allocation, the srun command only spawns the |
| application. |
| For this question, we will only address the first mode of operation and compare |
| creating a job allocation using the sbatch and srun commands.</p> |
| |
| <p>The srun command is designed for interactive use, with someone monitoring |
| the output. |
| The output of the application is seen as output of the srun command, |
| typically at the user's terminal. |
| The sbatch command is designed to submit a script for later execution and its |
| output is written to a file. |
| Command options used in the job allocation are almost identical. |
| The most noticeable difference in options is that the sbatch command supports |
| the concept of <a href="job_array.html">job arrays</a>, while srun does not. |
| Another significant difference is in fault tolerance. |
| Failures involving sbatch jobs typically result in the job being requeued |
| and executed again, while failures involving srun typically result in an |
| error message being generated with the expectation that the user will respond |
| in an appropriate fashion.</p> |
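<p>A short illustration of the two commands creating comparable allocations
(the application name is a placeholder):</p>
<pre>
# Interactive: output is returned to the terminal running srun
$ srun -N2 -n4 ./my_app

# Batch: the work is queued for later execution and output goes to a file
$ sbatch -N2 -n4 --wrap="srun ./my_app"
</pre>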
| |
| <p><a id="terminal"><b>Can tasks be launched with a remote (pseudo) |
| terminal?</b></a><br> |
| The best method is to use <code>salloc</code> with |
| <b>use_interactive_step</b> set in the <b>LaunchParameters</b> option in |
| <i>slurm.conf</i>. See |
| <a href="#prompt">getting shell prompts in interactive mode</a>.</p> |
| |
| <p><a id="prompt"><b>How can I get shell prompts in interactive |
| mode?</b></a><br> |
| Starting in 20.11, the recommended way to get an interactive shell prompt is |
| to configure <b>use_interactive_step</b> in <i>slurm.conf</i>:</p> |
| <pre> |
| LaunchParameters=use_interactive_step |
| </pre> |
| <p>This configures <code>salloc</code> to automatically launch an interactive |
| shell via <code>srun</code> on a node in the allocation whenever |
| <code>salloc</code> is called without a program to execute.</p> |
| |
| <p>By default, <b>use_interactive_step</b> creates an <i>interactive step</i> on |
| a node in the allocation and runs the shell in that step. An interactive step |
| is to an interactive shell what a batch step is to a batch script - both have |
| access to all resources in the allocation on the node they are running on, but |
| do not "consume" them.</p> |
| |
| <p>Note that beginning in 20.11, steps created by srun are now exclusive. This |
| means that the previously-recommended way to get an interactive shell, |
| <span class="commandline">srun --pty $SHELL</span>, will no longer work, as the |
| shell's step will now consume all resources on the node and cause subsequent |
| <span class="commandline">srun</span> calls to pend.</p> |
| |
| <p>An alternative but not recommended method is to make use of srun's |
| <i>--pty</i> option, (e.g. <i>srun --pty bash -i</i>). |
| Srun's <i>--pty</i> option runs task zero in pseudo terminal mode. Bash's |
| <i>-i</i> option instructs it to run in interactive mode (with prompts). |
| However, unlike the batch or interactive steps, this launches a step which |
| consumes all resources in the job. This means that subsequent steps cannot be |
| launched in the job unless they use the <i>--overlap</i> option. If task plugins |
| are configured, the shell is limited to CPUs of the first task. Subsequent |
| steps (which must be launched with <i>--overlap</i>) may be limited to fewer |
| resources than expected or may fail to launch tasks altogether if multiple |
| nodes were requested. Therefore, this alternative should rarely be used; |
| <code>salloc</code> should be used instead. |
| </p> |
| |
| <p><a id="x11"><b>Can Slurm export an X11 display on an allocated compute node?</b></a><br/> |
You can use Slurm's built-in X11 feature starting with version 17.11.
| It is enabled by setting <i>PrologFlags=x11</i> in <i>slurm.conf</i>. |
| Other X11 plugins must be deactivated. |
| <br/> |
| Run it as shown: |
| </p> |
| <pre> |
| $ ssh -X user@login1 |
| $ srun -n1 --pty --x11 xclock |
| </pre> |
| <p> |
| An alternative for older versions is to build and install an optional SPANK |
| plugin for that functionality. Instructions to build and install the plugin |
follow. This SPANK plugin will not work if used in combination with native X11
support, so you must disable the native support by compiling Slurm with
<i>--disable-x11</i>. This plugin relies on the OpenSSH library and provides
features such as GSSAPI support.<br/> Update the Slurm installation path as needed:</p>
| <pre> |
| # It may be obvious, but don't forget the -X on ssh |
| $ ssh -X alex@testserver.com |
| |
| # Get the plugin |
| $ mkdir git |
| $ cd git |
| $ git clone https://github.com/hautreux/slurm-spank-x11.git |
| $ cd slurm-spank-x11 |
| |
| # Manually edit the X11_LIBEXEC_PROG macro definition |
| $ vi slurm-spank-x11.c |
| $ vi slurm-spank-x11-plug.c |
| $ grep "define X11_" slurm-spank-x11.c |
| #define X11_LIBEXEC_PROG "/opt/slurm/17.02/libexec/slurm-spank-x11" |
| $ grep "define X11_LIBEXEC_PROG" slurm-spank-x11-plug.c |
| #define X11_LIBEXEC_PROG "/opt/slurm/17.02/libexec/slurm-spank-x11" |
| |
| |
| # Compile |
| $ gcc -g -o slurm-spank-x11 slurm-spank-x11.c |
| $ gcc -g -I/opt/slurm/17.02/include -shared -fPIC -o x11.so slurm-spank-x11-plug.c |
| |
| # Install |
| $ mkdir -p /opt/slurm/17.02/libexec |
| $ install -m 755 slurm-spank-x11 /opt/slurm/17.02/libexec |
| $ install -m 755 x11.so /opt/slurm/17.02/lib/slurm |
| |
| # Configure |
| $ echo -e "optional x11.so" >> /opt/slurm/17.02/etc/plugstack.conf |
| $ cd ~/tests |
| |
| # Run |
| $ srun -n1 --pty --x11 xclock |
| alex@node1's password: |
| </pre> |
| |
| <h3>Scheduling</h3> |
| |
| <p><a id="pending"><b>Why is my job not running?</b></a><br> |
| The answer to this question depends on a lot of factors. The main one is which |
| scheduler is used by Slurm. Executing the command</p> |
| <blockquote> |
| <p> <span class="commandline">scontrol show config | grep SchedulerType</span></p> |
| </blockquote> |
| <p> will supply this information. If the scheduler type is <b>builtin</b>, then |
| jobs will be executed in the order of submission for a given partition. Even if |
| resources are available to initiate your job immediately, it will be deferred |
| until no previously submitted job is pending. If the scheduler type is <b>backfill</b>, |
| then jobs will generally be executed in the order of submission for a given partition |
| with one exception: later submitted jobs will be initiated early if doing so does |
| not delay the expected execution time of an earlier submitted job. In order for |
| backfill scheduling to be effective, users' jobs should specify reasonable time |
| limits. If jobs do not specify time limits, then all jobs will receive the same |
| time limit (that associated with the partition), and the ability to backfill schedule |
| jobs will be limited. The backfill scheduler does not alter job specifications |
| of required or excluded nodes, so jobs which specify nodes will substantially |
| reduce the effectiveness of backfill scheduling. See the <a href="#backfill"> |
| backfill</a> section for more details. For any scheduler, you can check priorities |
| of jobs using the command <span class="commandline">scontrol show job</span>. |
Other reasons can include waiting for resources, memory, qos, reservations, etc.
As a guideline, issue <span class="commandline">scontrol show job &lt;jobid&gt;</span>
and look at the <i>State</i> and <i>Reason</i> fields to investigate the cause.
| A full list and explanation of the different Reasons can be found in the |
| <a href="resource_limits.html#reasons">resource limits</a> page.</p> |
| |
| <p><a id="backfill"><b>Why is the Slurm backfill scheduler not starting my job? |
| </b></a><br> |
| The most common problem is failing to set job time limits. If all jobs have |
| the same time limit (for example the partition's time limit), then backfill |
| will not be effective. Note that partitions can have both default and maximum |
| time limits, which can be helpful in configuring a system for effective |
| backfill scheduling.</p> |
| |
| <p>In addition, there are a multitude of backfill scheduling parameters |
| which can impact which jobs are considered for backfill scheduling, such |
| as the maximum number of jobs tested per user. For more information see |
| the slurm.conf man page and check the configuration of SchedulerParameters |
| on your system.</p> |
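<p>As a hedged illustration (the values shown are examples, not recommendations),
the relevant slurm.conf entries might look like this:</p>
<pre>
SchedulerType=sched/backfill
SchedulerParameters=bf_window=4320,bf_max_job_user=50,bf_continue
</pre>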
| |
| <h3>Killed Jobs</h3> |
| |
| <p><a id="purge"><b>Why is my job killed prematurely?</b></a><br> |
Slurm has a job purging mechanism to remove inactive jobs (resource allocations)
before they reach their time limit, which could be infinite.
| This inactivity time limit is configurable by the system administrator. |
| You can check its value with the command</p> |
| <blockquote> |
| <p><span class="commandline">scontrol show config | grep InactiveLimit</span></p> |
| </blockquote> |
| <p>The value of InactiveLimit is in seconds. |
| A zero value indicates that job purging is disabled. |
| A job is considered inactive if it has no active job steps or if the srun |
| command creating the job is not responding. |
| In the case of a batch job, the srun command terminates after the job script |
| is submitted. |
| Therefore batch job pre- and post-processing is limited to the InactiveLimit. |
| Contact your system administrator if you believe the InactiveLimit value |
| should be changed.</p> |
| |
| <p><a id="inactive"><b>Why is my batch job that launches no |
| job steps being killed?</b></a><br> |
| Slurm has a configuration parameter <i>InactiveLimit</i> intended |
| to kill jobs that do not spawn any job steps for a configurable |
| period of time. Your system administrator may modify the <i>InactiveLimit</i> |
| to satisfy your needs. Alternately, you can just spawn a job step |
| at the beginning of your script to execute in the background. It |
| will be purged when your script exits or your job otherwise terminates. |
| A line of this sort near the beginning of your script should suffice:<br> |
| <i>srun -N1 -n1 sleep 999999 &</i></p> |
| |
| <p><a id="force"><b>What does "srun: Force Terminated job" |
| indicate?</b></a><br> |
| The srun command normally terminates when the standard output and |
| error I/O from the spawned tasks end. This does not necessarily |
| happen at the same time that a job step is terminated. For example, |
| a file system problem could render a spawned task non-killable |
| at the same time that I/O to srun is pending. Alternately a network |
| problem could prevent the I/O from being transmitted to srun. |
| In any event, the srun command is notified when a job step is |
| terminated, either upon reaching its time limit or being explicitly |
| killed. If the srun has not already terminated, the message |
| "srun: Force Terminated job" is printed. |
| If the job step's I/O does not terminate in a timely fashion |
| thereafter, pending I/O is abandoned and the srun command |
| exits.</p> |
| |
| <p><a id="early_exit"><b>What does this mean: |
| "srun: First task exited 30s ago" |
| followed by "srun Job Failed"?</b></a><br> |
| The srun command monitors when tasks exit. By default, 30 seconds |
| after the first task exits, the job is killed. |
| This typically indicates some type of job failure and continuing |
| to execute a parallel job when one of the tasks has exited is |
| not normally productive. This behavior can be changed using srun's |
| <i>--wait=<time></i> option to either change the timeout |
| period or disable the timeout altogether. See srun's man page |
| for details.</p> |
| |
| <h3>Managing Jobs</h3> |
| |
| <p><a id="hold"><b>How can I temporarily prevent a job from running |
| (e.g. place it into a <i>hold</i> state)?</b></a><br> |
| The easiest way to do this is to change a job's earliest begin time |
| (optionally set at job submit time using the <i>--begin</i> option). |
| The example below places a job into hold state (preventing its initiation |
| for 30 days) and later permitting it to start now.</p> |
| <pre> |
| $ scontrol update JobId=1234 StartTime=now+30days |
| ... later ... |
| $ scontrol update JobId=1234 StartTime=now |
| </pre> |
| |
| <p><a id="job_size"><b>Can I change my job's size after it has started |
| running?</b></a><br> |
| Slurm supports the ability to decrease the size of jobs. |
| Requesting fewer hardware resources, and changing partition, qos, |
| reservation, licenses, etc. is only allowed for pending jobs.</p> |
| |
| <p>Use the <i>scontrol</i> command to change a job's size either by specifying |
| a new node count (<i>NumNodes=</i>) for the job or identify the specific nodes |
| (<i>NodeList=</i>) that you want the job to retain. |
| Any job steps running on the nodes which are relinquished by the job will be |
| killed unless initiated with the <i>--no-kill</i> option. |
| After the job size is changed, some environment variables created by Slurm |
| containing information about the job's environment will no longer be valid and |
| should either be removed or altered (e.g. SLURM_JOB_NUM_NODES, |
| SLURM_JOB_NODELIST and SLURM_NTASKS). |
| The <i>scontrol</i> command will generate a script that can be executed to |
| reset local environment variables. |
| You must retain the SLURM_JOB_ID environment variable in order for the |
| <i>srun</i> command to gather information about the job's current state and |
| specify the desired node and/or task count in subsequent <i>srun</i> invocations. |
| A new accounting record is generated when a job is resized, showing the job to |
| have been resubmitted and restarted at the new size. |
| An example is shown below.</p> |
| <pre> |
| #!/bin/bash |
| srun my_big_job |
| scontrol update JobId=$SLURM_JOB_ID NumNodes=2 |
| . slurm_job_${SLURM_JOB_ID}_resize.sh |
| srun -N2 my_small_job |
| rm slurm_job_${SLURM_JOB_ID}_resize.* |
| </pre> |
| |
| <p><a id="estimated_start_time"><b>Why does squeue (and "scontrol show |
| jobid") sometimes not display a job's estimated start time?</b></a><br> |
| When the backfill scheduler is configured, it provides an estimated start time |
| for jobs that are candidates for backfill. Pending jobs with dependencies |
| will not have an estimate as it is difficult to predict what resources will |
| be available when the jobs they are dependent on terminate. Also note that |
| the estimate is better for jobs expected to start soon, as most running jobs |
| end before their estimated time. There are other restrictions on backfill that |
| may apply. See the <a href="#backfill">backfill</a> section for more details. |
| </p> |
| |
| <p><a id="squeue_color"><b>Can squeue output be color coded?</b></a><br> |
| The squeue command output is not color coded, but other tools can be used to |
| add color. One such tool is ColorWrapper |
| (<a href="https://github.com/rrthomas/cw">https://github.com/rrthomas/cw</a>). |
| A sample ColorWrapper configuration file and output are shown below.</p> |
| <pre> |
path /bin:/usr/bin:/sbin:/usr/sbin:&lt;env&gt;
| usepty |
| base green+ |
| match red:default (Resources) |
| match black:default (null) |
| match black:cyan N/A |
| regex cyan:default PD .*$ |
| regex red:default ^\d*\s*C .*$ |
| regex red:default ^\d*\s*CG .*$ |
| regex red:default ^\d*\s*NF .*$ |
| regex white:default ^JOBID.* |
| </pre> |
| <img src="squeue_color.png" width=600> |
| |
| <p><a id="comp"><b>Why is my job/node in a COMPLETING state?</b></a><br> |
| When a job is terminating, both the job and its nodes enter the COMPLETING state. |
| As the Slurm daemon on each node determines that all processes associated with |
| the job have terminated, that node changes state to IDLE or some other appropriate |
| state for use by other jobs. |
| When every node allocated to a job has determined that all processes associated |
| with it have terminated, the job changes state to COMPLETED or some other |
| appropriate state (e.g. FAILED). |
| Normally, this happens within a second. |
| However, if the job has processes that cannot be terminated with a SIGKILL |
| signal, the job and one or more nodes can remain in the COMPLETING state |
| for an extended period of time. |
| This may be indicative of processes hung waiting for a core file |
| to complete I/O or operating system failure. |
| If this state persists, the system administrator should check for processes |
| associated with the job that cannot be terminated then use the |
| <span class="commandline">scontrol</span> command to change the node's |
| state to DOWN (e.g. "scontrol update NodeName=<i>name</i> State=DOWN Reason=hung_completing"), |
| reboot the node, then reset the node's state to IDLE |
| (e.g. "scontrol update NodeName=<i>name</i> State=RESUME"). |
| Note that setting the node DOWN will terminate all running or suspended |
| jobs associated with that node. |
| An alternative is to set the node's state to DRAIN until all jobs |
| associated with it terminate before setting it DOWN and re-booting.</p> |
| <p>Note that Slurm has two configuration parameters that may be used to |
| automate some of this process. |
| <i>UnkillableStepProgram</i> specifies a program to execute when |
| non-killable processes are identified. |
| <i>UnkillableStepTimeout</i> specifies how long to wait for processes |
| to terminate. |
| See the "man slurm.conf" for more information about these parameters.</p> |
| |
| <p><a id="req"><b>How can a job in a complete or failed state be requeued?</b></a> |
| <br> |
| Slurm supports requeuing jobs in a done or failed state. Use the |
| command:</p> |
| <p><b>scontrol requeue job_id</b></p> |
| <p>The job will then be requeued back in the PENDING state and scheduled again. |
| See man(1) scontrol. |
| </p> |
| <p>Consider a simple job like this:</p> |
| <pre> |
$ cat zoppo
| #!/bin/sh |
| echo "hello, world" |
| exit 10 |
| |
$ sbatch -o here ./zoppo
| Submitted batch job 10 |
| </pre> |
| <p> |
The job finishes in the FAILED state because it exits with
a non-zero value. We can requeue the job back to
the PENDING state and the job will be dispatched again.
| </p> |
| <pre> |
| $ scontrol requeue 10 |
| $ squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 10 mira zoppo david PD 0:00 1 (NonZeroExitCode) |
| $ squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 10 mira zoppo david R 0:03 1 alanz1 |
| </pre> |
| <p>Slurm supports requeuing jobs in a hold state with the command:</p> |
| <p><b>scontrol requeuehold job_id</b></p> |
| <p>The job can be in state RUNNING, SUSPENDED, COMPLETED or FAILED |
| before being requeued.</p> |
| <pre> |
| $ scontrol requeuehold 10 |
| $ squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 10 mira zoppo david PD 0:00 1 (JobHeldUser) |
| </pre> |
| |
| <p><a id="sview_colors"><b>Why is sview not coloring/highlighting nodes |
| properly?</b></a><br> |
| sview color-coding is affected by the GTK theme. The node status grid |
| is made up of button widgets and certain GTK themes don't show the color |
| setting as desired. Changing GTK themes can restore proper color-coding.</p> |
| |
| <p><a id="mpi_symbols"><b>Why is my MPICH2 or MVAPICH2 job not running with |
| Slurm? Why does the DAKOTA program not run with Slurm?</b></a><br> |
| The Slurm library used to support MPICH2 or MVAPICH2 references a variety of |
| symbols. If those symbols resolve to functions or variables in your program |
| rather than the appropriate library, the application will fail. For example |
| <a href="http://dakota.sandia.gov">DAKOTA</a>, versions 5.1 and |
| older, contains a function named regcomp, which will get used rather |
| than the POSIX regex functions. Rename DAKOTA's function and |
| references from regcomp to something else to make it work properly.</p> |
| |
| <h3>Resource Limits</h3> |
| |
| <p><a id="rlimit"><b>Why are my resource limits not propagated?</b></a><br> |
| When the <span class="commandline">srun</span> command executes, it captures the |
| resource limits in effect at submit time on the node where srun executes. |
| These limits are propagated to the allocated nodes before initiating the |
| user's job. |
| The Slurm daemons running on the allocated nodes then try to establish |
| identical resource limits for the job being initiated. |
| There are several possible reasons for not being able to establish those |
| resource limits.</p> |
| <ul> |
<li>The hard resource limits applied to Slurm's slurmd daemon are lower
than the user's soft resource limits on the submit host. Typically
the slurmd daemon is initiated by the init daemon with the operating
system default limits. This may be addressed either through use of the
ulimit command in the /etc/sysconfig/slurm file (see the example after
this list) or by enabling <a href="#pam">PAM in Slurm</a>.</li>
<li>The user's hard resource limits on the allocated node are lower than
the same user's soft resource limits on the node from which the
| job was submitted. It is recommended that the system administrator |
| establish uniform hard resource limits for users on all nodes |
| within a cluster to prevent this from occurring.</li> |
<li>The PropagateResourceLimits or PropagateResourceLimitsExcept parameters are
configured in slurm.conf and prevent propagation of the specified limits.</li>
| </ul> |
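<p>For the first case, a minimal sketch of raising the daemon's limits via
<i>/etc/sysconfig/slurm</i> (the values are examples and assume your service
script sources this file as a shell script):</p>
<pre>
# /etc/sysconfig/slurm -- sourced before the slurmd daemon is started
ulimit -n 65536      # raise the open file limit
ulimit -l unlimited  # raise the locked memory limit
</pre>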
| <p><b>NOTE</b>: This may produce the error message |
| "Can't propagate RLIMIT_...". |
| The error message is printed only if the user explicitly specifies that |
| the resource limit should be propagated or the srun command is running |
| with verbose logging of actions from the slurmd daemon (e.g. "srun -d6 ...").</p> |
| |
| <p><a id="mem_limit"><b>Why are jobs not getting the appropriate |
| memory limit?</b></a><br> |
| This is probably a variation on the <a href="#memlock">locked memory limit</a> |
| problem described above. |
| Use the same solution for the AS (Address Space), RSS (Resident Set Size), |
| or other limits as needed.</p> |
| |
| <p><a id="memlock"><b>Why is my MPI job failing due to the |
| locked memory (memlock) limit being too low?</b></a><br> |
| By default, Slurm propagates all of your resource limits at the |
| time of job submission to the spawned tasks. |
| This can be disabled by specifically excluding the propagation of |
| specific limits in the <i>slurm.conf</i> file. For example |
| <i>PropagateResourceLimitsExcept=MEMLOCK</i> might be used to |
| prevent the propagation of a user's locked memory limit from a |
| <a href="quickstart_admin.html#login">login node</a> to a dedicated |
| node used for his parallel job. |
| If the user's resource limit is not propagated, the limit in |
| effect for the <i>slurmd</i> daemon will be used for the spawned job. |
A simple way to control this is to ensure that user <i>root</i> has a
sufficiently large resource limit and that <i>slurmd</i> takes
full advantage of it. For example, you can set user root's
locked memory limit to unlimited on the compute nodes (see
<i>"man limits.conf"</i>) and have <i>slurmd</i> inherit that limit
(e.g. by adding <i>"LimitMEMLOCK=infinity"</i>
to your systemd <i>slurmd.service</i> file). It may also be desirable to lock
the slurmd daemon's memory to help ensure that it keeps responding if memory
swapping begins. A sample <i>/etc/sysconfig/slurm</i> which can be read from
systemd is shown below.
| Related information about <a href="#pam">PAM</a> is also available.</p> |
| <pre> |
| # |
| # Example /etc/sysconfig/slurm |
| # |
| # Memlocks the slurmd process's memory so that if a node |
| # starts swapping, the slurmd will continue to respond |
| SLURMD_OPTIONS="-M" |
| </pre> |
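<p>A hedged sketch of the systemd approach mentioned above, using a drop-in file
so the packaged unit file is left untouched:</p>
<pre>
# /etc/systemd/system/slurmd.service.d/memlock.conf
[Service]
LimitMEMLOCK=infinity
</pre>
<p>Run "systemctl daemon-reload" and restart slurmd for the change to take effect.</p>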
| |
| <h2>For Administrators</h2> |
| |
| <h3>Test Environments</h3> |
| |
| <p><a id="multi_slurm"><b>Can multiple Slurm systems be run in |
| parallel for testing purposes?</b></a><br> |
| Yes, this is a great way to test new versions of Slurm. |
| Just install the test version in a different location with a different |
| <i>slurm.conf</i>. |
| The test system's <i>slurm.conf</i> should specify different |
| pathnames and port numbers to avoid conflicts. |
| The only problem is if more than one version of Slurm is configured |
| with <i>burst_buffer/*</i> plugins or others that may interact with external |
| system APIs. |
| In that case, there can be conflicting API requests from |
| the different Slurm systems. |
| This can be avoided by configuring the test system with <i>burst_buffer/none</i>.</p> |
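<p>A hedged sketch of the kinds of overrides a test system's <i>slurm.conf</i>
might carry (ports and paths are arbitrary examples):</p>
<pre>
SlurmctldPort=7817
SlurmdPort=7818
StateSaveLocation=/var/spool/slurm-test/state
SlurmdSpoolDir=/var/spool/slurm-test/d
SlurmctldPidFile=/var/run/slurm-test/slurmctld.pid
SlurmdPidFile=/var/run/slurm-test/slurmd.pid
</pre>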
| |
| <p><a id="multi_slurmd"><b>Can Slurm emulate a larger cluster?</b></a><br> |
| Yes, this can be useful for testing purposes. |
| It has also been used to partition "fat" nodes into multiple Slurm nodes. |
| There are two ways to do this. |
| The best method for most conditions is to run one <i>slurmd</i> |
| daemon per emulated node in the cluster as follows.</p> |
| <ol> |
| <li>When executing the <i>configure</i> program, use the option |
| <i>--enable-multiple-slurmd</i> (or add that option to your <i>~/.rpmmacros</i> |
| file).</li> |
| <li>Build and install Slurm in the usual manner.</li> |
| <li>In <i>slurm.conf</i> define the desired node names (arbitrary |
| names used only by Slurm) as <i>NodeName</i> along with the actual |
| address of the physical node in <i>NodeHostname</i>. Multiple |
| <i>NodeName</i> values can be mapped to a single |
| <i>NodeHostname</i>. Note that each <i>NodeName</i> on a single |
| physical node needs to be configured to use a different port number |
| (set <i>Port</i> to a unique value on each line for each node). You |
| will also want to use the "%n" symbol in slurmd related path options in |
| slurm.conf (<i>SlurmdLogFile</i> and <i>SlurmdPidFile</i>). </li> |
| <li>When starting the <i>slurmd</i> daemon, include the <i>NodeName</i> |
| of the node that it is supposed to serve on the execute line (e.g. |
| "slurmd -N hostname").</li> |
| <li> This is an example of the <i>slurm.conf</i> file with the emulated nodes |
| and ports configuration. Any valid value for the CPUs, memory or other |
| valid node resources can be specified.</li> |
| </ol> |
| |
| <pre> |
| NodeName=dummy26[1-100] NodeHostName=achille Port=[6001-6100] NodeAddr=127.0.0.1 CPUs=4 RealMemory=6000 |
| PartitionName=mira Default=yes Nodes=dummy26[1-100] |
| </pre> |
| |
| <p>See the |
| <a href="programmer_guide.html#multiple_slurmd_support">Programmers Guide</a> |
| for more details about configuring multiple slurmd support.</p> |
| |
| <p><a id="extra_procs"><b>Can Slurm emulate nodes with more |
| resources than physically exist on the node?</b></a><br> |
| Yes. In the slurm.conf file, configure <i>SlurmdParameters=config_overrides</i> |
| and specify |
| any desired node resource specifications (<i>CPUs</i>, <i>Sockets</i>, |
| <i>CoresPerSocket</i>, <i>ThreadsPerCore</i>, and/or <i>TmpDisk</i>). |
| Slurm will use the resource specification for each node that is |
| given in <i>slurm.conf</i> and will not check these specifications |
| against those actually found on the node. The system would best be configured |
| with <i>TaskPlugin=task/none</i>, so that launched tasks can run on any |
| available CPU under operating system control.</p> |
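<p>A minimal sketch (node names and resource counts are arbitrary):</p>
<pre>
SlurmdParameters=config_overrides
TaskPlugin=task/none
NodeName=tux[1-4] CPUs=64 RealMemory=262144
</pre>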
| |
| <h3>Build and Install</h3> |
| |
| <p><a id="rpm"><b>Why aren't pam_slurm.so, auth_none.so, or other components in a |
| Slurm RPM?</b></a><br> |
| It is possible that at build time the required dependencies for building the |
| library are missing. If you want to build the library then install pam-devel |
| and compile again. See the file slurm.spec in the Slurm distribution for a list |
| of other options that you can specify at compile time with rpmbuild flags |
| and your <i>rpmmacros</i> file.</p> |
| |
| <p>The auth_none plugin is in a separate RPM and not built by default. |
| Using the auth_none plugin means that Slurm communications are not |
| authenticated, so you probably do not want to run in this mode of operation |
| except for testing purposes. If you want to build the auth_none RPM then |
| add <i>--with auth_none</i> on the rpmbuild command line or add |
| <i>%_with_auth_none</i> to your ~/rpmmacros file. See the file slurm.spec |
| in the Slurm distribution for a list of other options.</p> |
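<p>For example, to build the optional auth_none package from the distribution
tarball (the filename is illustrative):</p>
<pre>
$ rpmbuild -ta --with auth_none slurm-23.11.1.tar.bz2
</pre>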
| |
| <p><a id="debug"><b>How can I build Slurm with debugging symbols?</b></a><br> |
When configuring, run the configure script with the <i>--enable-developer</i> option.
That will provide asserts, debug messages and the <i>-Werror</i> flag, and
will in turn activate <i>--enable-debug</i>.
| <br/>With the <i>--enable-debug</i> flag, the code will be compiled with |
| <i>-ggdb3</i> and <i>-g -O1 -fno-strict-aliasing</i> flags that will produce |
| extra debugging information. Another possible option to use is |
| <i>--disable-optimizations</i> that will set <i>-O0</i>. |
| See also <i>auxdir/x_ac_debug.m4</i> for more details.</p> |
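<p>A hedged example configure line (the installation prefix is a placeholder):</p>
<pre>
$ ./configure --prefix=/opt/slurm --enable-developer --disable-optimizations
$ make -j
$ make install
</pre>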
| |
| <p><a id="git_patch"><b>How can a patch file be generated from a Slurm |
| commit in GitHub?</b></a><br> |
| Find and open the commit in GitHub then append ".patch" to the URL and save |
| the resulting file. For an example, see: |
| <a href="https://github.com/SchedMD/slurm/commit/91e543d433bed11e0df13ce0499be641774c99a3.patch"> |
| https://github.com/SchedMD/slurm/commit/91e543d433bed11e0df13ce0499be641774c99a3.patch</a> |
| </p> |
| |
| <p><a id="apply_patch"><b>How can I apply a patch to my Slurm source?</b></a> |
| <br> |
| If you have a patch file that you need to apply to your source, such as a |
| security or bug fix patch supplied by SchedMD's support, you can do |
| so with the <b>patch</b> command. You would first extract the contents of the |
| source tarball for the version you are using. You can then apply the patch |
| to the extracted source. Below is an example of how to do this with the |
| source for Slurm 23.11.1: |
| <pre> |
| $ tar xjvf slurm-23.11.1.tar.bz2 > /dev/null |
| $ patch -p1 -d slurm-23.11.1/ < example.patch |
| patching file src/slurmctld/step_mgr.c |
| </pre> |
| </p> |
| |
| <p>Once the patch has been applied to the source code, you can proceed to |
| build Slurm as you would normally if you build with <b>make</b>. If you use |
| <b>rpmbuild</b> to build Slurm, you will have to create a tarball with the |
| patched files. The filename of the tarball must match the original filename |
| to avoid errors. |
| <pre> |
| $ tar cjvf slurm-23.11.1.tar.bz2 slurm-23.11.1/ > /dev/null |
| $ rpmbuild -ta slurm-23.11.1.tar.bz2 > /dev/null |
| </pre> |
| </p> |
| |
| <p>Alternatively, as of Slurm 24.11.0 when using <b>rpmbuild</b>, a patched |
| package may be created directly by placing the patch file in the same directory |
| as the source tarball and executing the following command:</p> |
| <pre> |
| $ rpmbuild -ta --define 'patch security.patch' slurm-24.11.0.tar.bz2 |
| </pre> |
| |
| <p><a id="epel"><b>Why am I being offered an automatic update for Slurm?</b></a> |
| <br> |
| EPEL has added Slurm packages to their repository to make them more widely |
| available to the Linux community. However, this packaged version is not |
| supported or maintained by SchedMD, and is not recommended for customers at this |
| time. If you are using the EPEL repo you could be offered an update for Slurm |
| that you may not anticipate. In order to prevent Slurm from being upgraded |
| unintentionally, we recommend you modify the EPEL repository configuration file |
| to exclude all Slurm packages from automatic updates.</p> |
| <pre> |
| exclude=slurm* |
| </pre> |
| |
| <h3>Cluster Management</h3> |
| |
| <p><a id="controller"><b>How should I relocate the primary or |
| backup controller?</b></a><br> |
| If the cluster's computers used for the primary or backup controller |
| will be out of service for an extended period of time, it may be desirable |
| to relocate them. In order to do so, follow this procedure:</p> |
| <ol> |
| <li>(Slurm 23.02 and older) Drain the cluster of running jobs</li> |
| <li>Stop all Slurm daemons</li> |
| <li>Modify the <i>SlurmctldHost</i> values in the <i>slurm.conf</i> file |
| (see the example after this list)</li> |
| <li>Distribute the updated <i>slurm.conf</i> file to all nodes</li> |
| <li>Copy the <i>StateSaveLocation</i> directory to the new host and |
| make sure the permissions allow the <i>SlurmUser</i> to read and write it.</li> |
| <li>Restart all Slurm daemons</li> |
| </ol> |
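| <p>For example, step 3 might change the <i>slurm.conf</i> entries as follows |
| (the host names are hypothetical):</p> |
| <pre> |
| SlurmctldHost=newctl1 |
| SlurmctldHost=newctl2 |
| </pre> |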
| <p>Starting with Slurm 23.11, jobs that were started by the old controller will |
| receive the updated controller address and will continue and finish normally. |
| On older versions, jobs started by the old controller will still try to report |
| back to the older controller. |
| In both cases, there should be no loss of any pending jobs. |
| Ensure that any nodes added to the cluster have a current <i>slurm.conf</i> |
| file installed.</p> |
| |
| <p><b>CAUTION:</b> If two nodes are simultaneously configured as the primary |
| controller (two nodes on which <i>SlurmctldHost</i> specifies the local host |
| and the <i>slurmctld</i> daemon is executing on each), system behavior will be |
| destructive. If a compute node has an incorrect <i>SlurmctldHost</i> parameter, |
| that node may be rendered unusable, but no other harm will result.</p> |
| |
| <p><a id="clock"><b>Do I need to maintain synchronized |
| clocks on the cluster?</b></a><br> |
| In general, yes. Having inconsistent clocks may cause nodes to be unusable and |
| generate errors in Slurm log files regarding expired credentials. For example: |
| </p> |
| <pre> |
| error: Munge decode failed: Expired credential |
| ENCODED: Wed May 12 12:34:56 2008 |
| DECODED: Wed May 12 12:01:12 2008 |
| </pre> |
| |
| <p><a id="stop_sched"><b>How can I stop Slurm from scheduling jobs?</b></a><br> |
| You can stop Slurm from scheduling jobs on a per partition basis by setting |
| that partition's state to DOWN. Set its state UP to resume scheduling. |
| For example:</p> |
| <pre> |
| $ scontrol update PartitionName=foo State=DOWN |
| $ scontrol update PartitionName=bar State=UP |
| </pre> |
| |
| <p><a id="maint_time"><b>How can I dry up the workload for a |
| maintenance period?</b></a><br> |
| Create a resource reservation as described in Slurm's |
| <a href="reservations.html">Resource Reservation Guide</a>.</p> |
| |
| <p><a id="upgrade"><b>What should I be aware of when upgrading Slurm?</b></a><br> |
| Refer to the <a href="upgrades.html">Upgrade Guide</a> for details.</p> |
| |
| <p><a id="db_upgrade"><b>Is there anything exceptional to be aware of when |
| upgrading my database server?</b></a><br> |
| Generally, no. Special cases are noted in the <a href="upgrades.html#db_server"> |
| Database server</a> section of the Upgrade Guide.</p> |
| |
| <p><a id="cluster_acct"><b>When adding a new cluster, how can the Slurm cluster |
| configuration be copied from an existing cluster to the new cluster?</b></a><br> |
| Accounts need to be configured for the cluster. An easy way to copy information from |
| an existing cluster is to use the sacctmgr command to dump that cluster's information, |
| modify it using some editor, then load the new information using the sacctmgr |
| command. See the sacctmgr man page for details, including an example.</p> |
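| <p>A minimal sketch of that workflow, assuming an existing cluster named |
| "tux" and a new cluster named "tux2", might look like this:</p> |
| <pre> |
| $ sacctmgr dump tux file=tux.cfg |
| # edit tux.cfg as needed (e.g. change the cluster name to tux2) |
| $ sacctmgr load file=tux.cfg |
| </pre> |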
| |
| <p><a id="state_info"><b>How could some jobs submitted immediately before |
| the slurmctld daemon crashed be lost?</b></a><br> |
| Any failure of the slurmctld daemon or its hardware before state information |
| reaches disk can result in lost state. |
| Slurmctld writes state frequently (every five seconds by default), but with |
| large numbers of jobs, the formatting and writing of records can take seconds |
| and recent changes might not be written to disk. |
| Another example is if the state information is written to file, but that |
| information is cached in memory rather than written to disk when the node fails. |
| The interval between state saves being written to disk can be configured at |
| build time by defining SAVE_MAX_WAIT to a different value than five.</p> |
| |
| <p><a id="limit_propagation"><b>Is resource limit propagation |
| useful on a homogeneous cluster?</b></a><br> |
| Resource limit propagation permits a user to modify resource limits |
| and submit a job with those limits. |
| By default, Slurm automatically propagates all resource limits in |
| effect at the time of job submission to the tasks spawned as part |
| of that job. |
| System administrators can utilize the <i>PropagateResourceLimits</i> |
| and <i>PropagateResourceLimitsExcept</i> configuration parameters to |
| change this behavior. |
| Users can override defaults using the <i>srun --propagate</i> |
| option. |
| See <i>"man slurm.conf"</i> and <i>"man srun"</i> for more information |
| about these options.</p> |
| |
| <p><a id="enforce_limits"><b>Why are the resource limits set in the |
| database not being enforced?</b></a><br> |
| In order to enforce resource limits, set the value of |
| <b>AccountingStorageEnforce</b> in each cluster's slurm.conf configuration |
| file appropriately. If <b>AccountingStorageEnforce</b> does not contain |
| an option of "limits", then resource limits will not be enforced on that cluster. |
| See <a href="resource_limits.html">Resource Limits</a> for more information.</p> |
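| <p>For example, a slurm.conf line of this form would enable both association |
| checking and limit enforcement:</p> |
| <pre> |
| AccountingStorageEnforce=associations,limits |
| </pre> |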
| |
| <p><a id="licenses"><b>Can Slurm be configured to manage licenses?</b></a><br> |
| Slurm does not provide a native integration with third party license managers, |
| but it does provide for the allocation of global resources called licenses. |
| Use the Licenses configuration parameter in your slurm.conf file |
| (e.g. "Licenses=foo:10,bar:20"). Jobs can request licenses and be granted |
| exclusive use of those resources (e.g. "sbatch --licenses=foo:2,bar:1 ..."). |
| It is not currently possible to change the total number of licenses on a system |
| without restarting the slurmctld daemon, but it is possible to dynamically |
| reserve licenses and remove them from being available to jobs on the system |
| (e.g. "scontrol update reservation=licenses_held licenses=foo:5,bar:2"). |
| For more information see the <a href="licenses.html">Licenses Guide</a>.</p> |
| |
| <p><a id="torque"><b>How easy is it to switch from PBS or Torque to Slurm?</b></a><br> |
| A lot of users don't even notice the difference. |
| Slurm has wrappers available for the mpiexec, pbsnodes, qdel, qhold, qrls, |
| qstat, and qsub commands (see contribs/torque in the distribution and the |
| "slurm-torque" RPM). |
| There is also a wrapper for the showq command at |
| <a href="https://github.com/pedmon/slurm_showq"> |
| https://github.com/pedmon/slurm_showq</a>.</p> |
| |
| <p>Slurm recognizes and translates the "#PBS" options in batch scripts. |
| Most, but not all options are supported.</p> |
| |
| <p>Slurm also includes a SPANK plugin that will set all of the PBS environment |
| variables based upon the Slurm environment (e.g. PBS_JOBID, PBS_JOBNAME, |
| PBS_WORKDIR, etc.). |
| One environment variable that is not set is PBS_ENVIRONMENT, which if set |
| would result in the failure of some MPI implementations. |
| The plugin will be installed in<br> |
| <install_directory>/lib/slurm/spank_pbs.so<br> |
| See the SPANK man page for configuration details.</p> |
| |
| <p><a id="mpi_perf"><b>What might account for MPI performance being below |
| the expected level?</b></a><br> |
| Starting the slurmd daemons with limited locked memory can account for this. |
| Adding the line "ulimit -l unlimited" to the <i>/etc/sysconfig/slurm</i> file can |
| fix this.</p> |
| |
| <p><a id="delete_partition"><b>How do I safely remove partitions? |
| </b></a><br> |
| Partitions should be removed using the |
| "scontrol delete PartitionName=<partition>" command. This is because |
| scontrol will prevent any partitions from being removed that are in use. |
| Partitions need to be removed from the slurm.conf after being removed using |
| scontrol or they will return after a restart. |
| An existing job's partition(s) can be updated with the "scontrol update |
| JobId=<jobid> Partition=<partition(s)>" command. |
| Removing a partition from the slurm.conf and restarting will cancel any existing |
| jobs that reference the removed partitions. |
| </p> |
| |
| <p><a id="routing_queue"><b>How can a routing queue be configured?</b></a><br> |
| A job submit plugin is designed to have access to a job request from a user, |
| plus information about all of the available system partitions/queues. |
| An administrator can write a C plugin or LUA script to set an incoming job's |
| partition based upon its size, time limit, etc. |
| See the <a href="https://slurm.schedmd.com/job_submit_plugins.html"> Job Submit Plugin API</a> |
| guide for more information. |
| Also see the available job submit plugins distributed with Slurm for examples |
| (look in the "src/plugins/job_submit" directory).</p> |
| |
| <p><a id="none_plugins"><b>What happened to the "none" plugins?</b></a><br> |
| In Slurm 23.02 and earlier, several parameters had a plugin named "none" |
| that would essentially disable the setting. In version 23.11, those plugins |
| named "none" were removed. To disable a setting you just need to leave it |
| unset. If you still have a plugin defined as "none", Slurm will still |
| recognize it and treat it as though it was unset. Parameters that previously |
| had a "none" plugin are: |
| <ul> |
| <li>AccountingStorageType</li> |
| <li>AcctGatherEnergyType</li> |
| <li>AcctGatherInterconnectType</li> |
| <li>AcctGatherFilesystemType</li> |
| <li>AcctGatherProfileType</li> |
| <li>CliFilterPlugins</li> |
| <li>CoreSpecPlugin</li> |
| <li>ExtSensorsType</li> |
| <li>JobAcctGatherType</li> |
| <li>JobCompType</li> |
| <li>JobContainerType</li> |
| <li>MCSPlugin</li> |
| <li>MpiDefault</li> |
| <li>PowerParameters</li> |
| <li>PreemptType</li> |
| <li>PrioritySiteFactorPlugin</li> |
| <li>SwitchType</li> |
| <li>TaskPlugin</li> |
| <li>TopologyPlugin</li> |
| </ul></p> |
| |
| <h3>Accounting Database</h3> |
| |
| <p><a id="slurmdbd"><b>Why should I use the slurmdbd instead of the |
| regular database plugins?</b></a><br> |
| While the normal storage plugins will work fine without the added |
| layer of the slurmdbd there are some great benefits to using the |
| slurmdbd.</p> |
| <ol> |
| <li>Added security. Using the slurmdbd you can have an authenticated |
| connection to the database.</li> |
| <li>Offloading processing from the controller. With the slurmdbd there is no |
| slowdown to the controller due to a slow or overloaded database.</li> |
| <li>Keeping enterprise wide accounting from all Slurm clusters in one database. |
| The slurmdbd is multi-threaded and designed to handle all the |
| accounting for the entire enterprise.</li> |
| <li>With the database plugins you can query with sacct accounting stats from |
| any node Slurm is installed on. With the slurmdbd you can also query any |
| cluster using the slurmdbd from any other cluster's nodes. Other tools like |
| sreport are also available.</li> |
| </ol> |
| |
| <p><a id="dbd_rebuild"><b>How can I rebuild the database hierarchy?</b></a><br> |
| If you see errors of this sort:</p> |
| <pre> |
| error: Can't find parent id 3358 for assoc 1504, this should never happen. |
| </pre> |
| <p>in the slurmctld log file, this is indicative that the database hierarchy |
| information has been corrupted, typically due to a hardware failure or |
| administrator error in directly modifying the database. In order to rebuild |
| the database information, start the slurmdbd daemon with the "-R" option |
| followed by an optional comma separated list of cluster names to operate on.</p> |
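| <p>A minimal sketch of such an invocation, run in the foreground so that the |
| rebuild messages are visible:</p> |
| <pre> |
| $ slurmdbd -D -R |
| </pre> |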
| |
| <p><a id="ha_db"><b>How critical is configuring high availability for my |
| database?</b></a></p> |
| <ul> |
| <li>Consider if you really need a high-availability MySQL setup. A short outage |
| of slurmdbd is not a problem, because slurmctld will store all data in memory |
| and send it to slurmdbd when it resumes operations. The slurmctld daemon will |
| also cache all user limits and fair share information.</li> |
| <li>You cannot use NDB, since SlurmDBD's MySQL implementation uses keys on BLOB |
| values (and potentially other features on the incompatibility list).</li> |
| <li>You can set up "classical" Linux HA, with heartbeat/corosync to migrate IP |
| between primary/backup mysql servers and: |
| <ul> |
| <li>Configure one way replication of mysql, and change primary/backup roles on |
| failure</li> |
| <li>Use shared storage for primary/backup mysql servers database, and start |
| backup on primary mysql failure.</li> |
| </ul> |
| </li> |
| </ul> |
| |
| <p><a id="sql"><b>How can I use double quotes in MySQL queries?</b></a><br> |
| Execute:</p> |
| <pre> |
| SET session sql_mode='ANSI_QUOTES'; |
| </pre> |
| <p>This will allow double quotes in queries like this:</p> |
| <pre> |
| show columns from "tux_assoc_table" where Field='is_def'; |
| </pre> |
| |
| <h3>Compute Nodes (slurmd)</h3> |
| |
| <p><a id="return_to_service"><b>Why is a node shown in state |
| DOWN when the node has registered for service?</b></a><br> |
| The configuration parameter <i>ReturnToService</i> in <i>slurm.conf</i> |
| controls how DOWN nodes are handled. |
| Set its value to one in order for DOWN nodes to automatically be |
| returned to service once the <i>slurmd</i> daemon registers |
| with a valid node configuration. |
| A value of zero is the default and results in a node staying DOWN |
| until an administrator explicitly returns it to service using |
| the command "scontrol update NodeName=whatever State=RESUME". |
| See "man slurm.conf" and "man scontrol" for more |
| details.</p> |
| |
| <p><a id="down_node"><b>What happens when a node crashes?</b></a><br> |
| A node is set DOWN when the slurmd daemon on it stops responding |
| for <i>SlurmdTimeout</i> as defined in <i>slurm.conf</i>. |
| The node can also be set DOWN when certain errors occur or the |
| node's configuration is inconsistent with that defined in <i>slurm.conf</i>. |
| Any active job on that node will be killed unless it was submitted |
| with the srun option <i>--no-kill</i>. |
| Any active job step on that node will be killed. |
| See the slurm.conf and srun man pages for more information.</p> |
| |
| <p><a id="multi_job"><b>How can I control the execution of multiple |
| jobs per node?</b></a><br> |
| There are two mechanisms to control this. |
| If you want to allocate individual processors on a node to jobs, |
| configure <i>SelectType=select/cons_tres</i>. |
| See <a href="cons_tres.html">Consumable Resources in Slurm</a> |
| for details about this configuration. |
| If you want to allocate whole nodes to jobs, configure |
| <i>SelectType=select/linear</i>. |
| Each partition also has a configuration parameter <i>OverSubscribe</i> |
| that enables more than one job to execute on each node. |
| See <i>man slurm.conf</i> for more information about these |
| configuration parameters.</p> |
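| <p>A minimal sketch of the per-processor approach (node and partition names |
| are hypothetical) might look like this in slurm.conf:</p> |
| <pre> |
| SelectType=select/cons_tres |
| SelectTypeParameters=CR_Core_Memory |
| PartitionName=shared Nodes=tux[01-16] OverSubscribe=FORCE:2 |
| </pre> |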
| |
| <p><a id="time"><b>Why are jobs allocated nodes and then unable |
| to initiate programs on some nodes?</b></a><br> |
| This typically indicates that the time on some nodes is not consistent |
| with the node on which the <i>slurmctld</i> daemon executes. In order to |
| initiate a job step (or batch job), the <i>slurmctld</i> daemon generates |
| a credential containing a time stamp. If the <i>slurmd</i> daemon |
| receives a credential containing a time stamp later than the current |
| time or more than a few minutes in the past, it will be rejected. |
| If you check in the <i>SlurmdLogFile</i> on the nodes of interest, you |
| will likely see messages of this sort: "<i>Invalid job credential from |
| <some IP address>: Job credential expired</i>." Make the times |
| consistent across all of the nodes and all should be well.</p> |
| |
| <p><a id="ping"><b>Why does <i>slurmctld</i> log that some nodes |
| are not responding even if they are not in any partition?</b></a><br> |
| The <i>slurmctld</i> daemon periodically pings the <i>slurmd</i> |
| daemon on every configured node, even if not associated with any |
| partition. You can control the frequency of this ping with the |
| <i>SlurmdTimeout</i> configuration parameter in <i>slurm.conf</i>.</p> |
| |
| <p><a id="state_preserve"><b>How can I easily preserve drained node |
| information between major Slurm updates?</b></a><br> |
| Major Slurm updates generally have changes in the state save files and |
| communication protocols, so a cold-start (without state) is generally |
| required. If you have nodes in a DRAIN state and want to preserve that |
| information, you can easily build a script to preserve that information |
| using the <i>sinfo</i> command. The following command line will report the |
| <i>Reason</i> field for every node in a DRAIN state and write the output |
| in a form that can be executed later to restore state.</p> |
| <pre> |
| sinfo -t drain -h -o "scontrol update nodename='%N' state=drain reason='%E'" |
| </pre> |
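| <p>For example, the output can be captured in a small script before the upgrade |
| and executed afterwards to restore the DRAIN states:</p> |
| <pre> |
| $ sinfo -t drain -h -o "scontrol update nodename='%N' state=drain reason='%E'" > restore_drain.sh |
| $ # ... perform the upgrade / cold-start ... |
| $ sh restore_drain.sh |
| </pre> |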
| |
| <p><a id="health_check_example"><b>Does anyone have an example node |
| health check script for Slurm?</b></a><br> |
| Probably the most comprehensive and lightweight health check tool out |
| there is |
| <a href="https://github.com/mej/nhc">Node Health Check</a>. |
| It has integration with Slurm as well as Torque resource managers.</p> |
| |
| <p><a id="health_check"><b>Why doesn't the <i>HealthCheckProgram</i> |
| execute on DOWN nodes?</b></a><br> |
| Hierarchical communications are used for sending this message. If there |
| are DOWN nodes in the communications hierarchy, messages will need to |
| be re-routed. This limits Slurm's ability to tightly synchronize the |
| execution of the <i>HealthCheckProgram</i> across the cluster, which |
| could adversely impact performance of parallel applications. |
| The use of CRON or node startup scripts may be better suited to ensure |
| that <i>HealthCheckProgram</i> gets executed on nodes that are DOWN |
| in Slurm.</p> |
| |
| <p><a id="slurmd_oom"><b>How can I prevent the <i>slurmd</i> and |
| <i>slurmstepd</i> daemons from being killed when a node's memory |
| is exhausted?</b></a><br> |
| You can set the value in the <i>/proc/self/oom_adj</i> for |
| <i>slurmd</i> and <i>slurmstepd</i> by initiating the <i>slurmd</i> |
| daemon with the <i>SLURMD_OOM_ADJ</i> and/or <i>SLURMSTEPD_OOM_ADJ</i> |
| environment variables set to the desired values. |
| A value of -17 typically will disable killing.</p> |
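| <p>One way to do this, assuming your init script or service sources |
| <i>/etc/sysconfig/slurm</i>, is to set the variables there:</p> |
| <pre> |
| export SLURMD_OOM_ADJ=-17 |
| export SLURMSTEPD_OOM_ADJ=-17 |
| </pre> |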
| |
| <p><a id="ubuntu"><b>I see the host of my calling node as 127.0.1.1 |
| instead of the correct IP address. Why is that?</b></a><br> |
| Some systems by default will put your host in the /etc/hosts file as |
| something like</p> |
| <pre> |
| 127.0.1.1 snowflake.llnl.gov snowflake |
| </pre> |
| <p>This will cause srun and Slurm commands to use the 127.0.1.1 address |
| instead of the correct address and prevent communications between nodes. |
| The solution is to either remove this line or configure a different NodeAddr |
| that is known by your other nodes.</p> |
| |
| <p>The CommunicationParameters=NoInAddrAny configuration parameter is subject to |
| this same problem, which can also be addressed by removing the actual node |
| name from the "127.0.1.1" as well as the "127.0.0.1" |
| addresses in the /etc/hosts file. It is ok if they point to |
| localhost, but not the actual name of the node.</p> |
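| <p>A corrected /etc/hosts entry might therefore look like this (the IP address |
| shown is only an example):</p> |
| <pre> |
| 127.0.0.1   localhost |
| 10.1.2.3    snowflake.llnl.gov snowflake |
| </pre> |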
| |
| <p><a id="add_nodes"><b>How should I add nodes to Slurm?</b></a><br> |
| The slurmctld daemon has many bitmaps to track state of nodes and cores in the |
| cluster. Adding nodes to a running cluster would require the slurmctld daemon |
| to rebuild all of those bitmaps, which required restarting the daemon in older |
| versions of Slurm. Communications from the slurmd daemons on the compute |
| nodes to the slurmctld daemon include a configuration file checksum, so you |
| should maintain the same slurm.conf file on all nodes.</p> |
| |
| <p>The following procedure is recommended on <b>Slurm 24.05</b> and older |
| (see below for 24.11 and newer):</p> |
| <ol> |
| <li>Stop the slurmctld daemon (e.g. <code>systemctl stop slurmctld</code> |
| on the head node)</li> |
| <li>Update the <b>slurm.conf</b> file on all nodes in the cluster</li> |
| <li>Restart the slurmd daemons on all nodes (e.g. |
| <code>systemctl restart slurmd</code> on all nodes)</li> |
| <li>Restart the slurmctld daemon (e.g. <code>systemctl start slurmctld</code> |
| on the head node)</li> |
| </ol> |
| |
| <p>The following procedure is sufficient on <b>Slurm 24.11</b> and newer:</p> |
| <ol> |
| <li>Update the <b>slurm.conf</b> file on all nodes in the cluster</li> |
| <li>Run <code>scontrol reconfigure</code></li> |
| </ol> |
| |
| <p><b>NOTE</b>: Jobs submitted with srun that are already waiting for an |
| allocation when new nodes are added to the slurm.conf can fail if the |
| job is allocated one of the new nodes.</p> |
| |
| <p><a id="rem_nodes"><b>How should I remove nodes from Slurm?</b></a><br> |
| To safely remove a node from a cluster, it's best to drain the node of all jobs. |
| This ensures that job processes aren't running on the node after removal. On |
| restart of the controller, if a node is removed from a running job the |
| controller will kill the job on any remaining allocated nodes and attempt to |
| requeue the job if possible.</p> |
| |
| <p>The following procedure is recommended on <b>Slurm 24.05</b> and older |
| (see below for 24.11 and newer):</p> |
| <ol> |
| <li>Drain node of all jobs (e.g. |
| <code>scontrol update nodename='%N' state=drain reason='removing nodes'</code> |
| )</li> |
| <li>Stop the slurmctld daemon (e.g. <code>systemctl stop slurmctld</code> |
| on the head node)</li> |
| <li>Update the <b>slurm.conf</b> file on all nodes in the cluster</li> |
| <li>Restart the slurmd daemons on all nodes (e.g. |
| <code>systemctl restart slurmd</code> on all nodes)</li> |
| <li>Restart the slurmctld daemon (e.g. <code>systemctl start slurmctld</code> |
| on the head node)</li> |
| </ol> |
| |
| <p>The following procedure is sufficient on <b>Slurm 24.11</b> and newer:</p> |
| <ol> |
| <li>Drain node of all jobs (e.g. |
| <code>scontrol update nodename='%N' state=drain reason='removing nodes'</code> |
| )</li> |
| <li>Update the <b>slurm.conf</b> file on all nodes in the cluster</li> |
| <li>Run <code>scontrol reconfigure</code></li> |
| </ol> |
| |
| <p><b>NOTE</b>: Removing nodes from the cluster may cause some errors in the |
| logs. Verify that any errors in the logs are for nodes that you intended to |
| remove.</p> |
| |
| <p><a id="reboot"><b>Why is a compute node down with the reason set to |
| "Node unexpectedly rebooted"?</b></a><br> |
| This is indicative of the slurmctld daemon running on the cluster's head node |
| as well as the slurmd daemon on the compute node when the compute node reboots. |
| If you want to prevent this condition from setting the node into a DOWN state |
| then configure ReturnToService to 2. See the slurm.conf man page for details. |
| Otherwise use scontrol or sview to manually return the node to service.</p> |
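| <p>For example, to return a node named "tux3" to service:</p> |
| <pre> |
| $ scontrol update NodeName=tux3 State=RESUME |
| </pre> |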
| |
| <p><a id="cgroupv2"><b>How do I convert my nodes to Control Group (cgroup) |
| v2?</b></a><br> |
| Refer to the <a href="cgroup_v2.html#conversion">cgroup v2</a> documentation |
| for the conversion procedure.</p> |
| |
| <p><a id="amazon_ec2"><b>Can Slurm be used to run jobs on |
| Amazon's EC2?</b></a><br> |
| Yes, here is a description of Slurm use with |
| <a href="http://aws.amazon.com/ec2/">Amazon's EC2</a> courtesy of |
| Ashley Pittman:</p> |
| <p>I do this regularly and have no problem with it, the approach I take is to |
| start as many instances as I want and have a wrapper around |
| ec2-describe-instances that builds a /etc/hosts file with fixed hostnames |
| and the actual IP addresses that have been allocated. The only other step |
| then is to generate a slurm.conf based on how many nodes you've chosen to boot |
| that day. I run this wrapper script on my laptop and it generates the files |
| and then rsyncs them to all the instances automatically.</p> |
| <p>One thing I found is that Slurm refuses to start if any nodes specified in |
| the slurm.conf file aren't resolvable, I initially tried to specify cloud[0-15] |
| in slurm.conf, but then if I configure fewer than 16 nodes in /etc/hosts this |
| doesn't work so I dynamically generate the slurm.conf as well as the hosts |
| file.</p> |
| <p>As a comment about EC2, I just run generic AMIs and have a persistent EBS |
| storage device which I attach to the first instance when I start up. This |
| contains a /usr/local which has my software like Slurm, pdsh and MPI installed |
| which I then copy over the /usr/local on the first instance and NFS export to |
| all other instances. This way I have persistent home directories and a very |
| simple first-login script that configures the virtual cluster for me.</p> |
| |
| <h3>User Management</h3> |
| |
| <p><a id="pam"><b>How can PAM be used to control a user's limits on |
| or access to compute nodes?</b></a><br> |
| To control a user's limits on a compute node:</p> |
| <p>First, enable Slurm's use of PAM by setting <i>UsePAM=1</i> in |
| <i>slurm.conf</i>.</p> |
| <p>Second, establish PAM configuration file(s) for Slurm in <i>/etc/pam.conf</i> |
| or the appropriate files in the <i>/etc/pam.d</i> directory (e.g. |
| <i>/etc/pam.d/sshd</i>) by adding the line "account required pam_slurm.so". |
| A basic configuration you might use is:</p> |
| <pre> |
| account required pam_unix.so |
| account required pam_slurm.so |
| auth required pam_localuser.so |
| session required pam_limits.so |
| </pre> |
| <p>Third, set the desired limits in <i>/etc/security/limits.conf</i>. |
| For example, to set the locked memory limit to unlimited for all users:</p> |
| <pre> |
| * hard memlock unlimited |
| * soft memlock unlimited |
| </pre> |
| <p>Finally, you need to disable Slurm's forwarding of the limits from the |
| session from which the <i>srun</i> initiating the job ran. By default |
| all resource limits are propagated from that session. For example, adding |
| the following line to <i>slurm.conf</i> will prevent the locked memory |
| limit from being propagated: <i>PropagateResourceLimitsExcept=MEMLOCK</i>.</p> |
| |
| <p>To control a user's access to a compute node:</p> |
| <p>The pam_slurm_adopt and pam_slurm modules prevent users from |
| logging into nodes that they have not been allocated (except for user |
| root, which can always login). |
| They are both included with the Slurm distribution.</p> |
| <p>The pam_slurm_adopt module is highly recommended for most installations, |
| and is documented in its <a href="pam_slurm_adopt.html">own guide</a>.</p> |
| <p>pam_slurm is older and less functional. |
| These modules are built by default for RPM packages, but can be disabled using |
| the .rpmmacros option "%_without_pam 1" or by entering the command line |
| option "--without pam" when the configure program is executed. |
| Their source code is in the "contribs/pam" and "contribs/pam_slurm_adopt" |
| directories respectively.</p> |
| <p>The use of either pam_slurm_adopt or pam_slurm does not require |
| <i>UsePAM</i> being set. The two uses of PAM are independent.</p> |
| |
| <p><a id="pam_exclude"><b>How can I exclude some users from pam_slurm?</b></a><br> |
| <b>CAUTION:</b> Please test this on a test machine/VM before you actually do |
| this on your Slurm computers.</p> |
| |
| <p><b>Step 1.</b> Make sure pam_listfile.so exists on your system. |
| The following command is an example on Redhat 6:</p> |
| <pre> |
| ls -la /lib64/security/pam_listfile.so |
| </pre> |
| |
| <p><b>Step 2.</b> Create user list (e.g. /etc/ssh/allowed_users):</p> |
| <pre> |
| # /etc/ssh/allowed_users |
| root |
| myadmin |
| </pre> |
| <p>And, change the file mode to keep it secret from regular users (optional):</p> |
| <pre> |
| chmod 600 /etc/ssh/allowed_users |
| </pre> |
| <p><b>NOTE</b>: root is not necessarily listed on the allowed_users, but I |
| feel somewhat safe if it's on the list.</p> |
| |
| <p><b>Step 3.</b> On /etc/pam.d/sshd, add pam_listfile.so with sufficient flag |
| before pam_slurm.so (e.g. my /etc/pam.d/sshd looks like this):</p> |
| <pre> |
| #%PAM-1.0 |
| auth required pam_sepermit.so |
| auth include password-auth |
| account sufficient pam_listfile.so item=user sense=allow file=/etc/ssh/allowed_users onerr=fail |
| account required pam_slurm.so |
| account required pam_nologin.so |
| account include password-auth |
| password include password-auth |
| # pam_selinux.so close should be the first session rule |
| session required pam_selinux.so close |
| session required pam_loginuid.so |
| # pam_selinux.so open should only be followed by sessions to be executed in the user context |
| session required pam_selinux.so open env_params |
| session optional pam_keyinit.so force revoke |
| session include password-auth |
| </pre> |
| <p>(Information courtesy of Koji Tanaka, Indiana University)</p> |
| |
| <p><a id="user_account"><b>Can a user's account be changed in the database?</b></a><br> |
| A user's account can not be changed directly. A new association needs to be |
| created for the user with the new account. Then the association with the old |
| account can be deleted.</p> |
| <pre> |
| # Assume user "adam" is initially in account "physics" |
| sacctmgr create user name=adam cluster=tux account=physics |
| sacctmgr delete user name=adam cluster=tux account=chemistry |
| </pre> |
| |
| <p><a id="changed_uid"><b>I had to change a user's UID and now they cannot submit |
| jobs. How do I get the new UID to take effect?</b></a><br> |
| When changing UIDs, you will also need to restart the slurmctld for the changes to |
| take effect. Normally, when adding a new user to the system, the UID is filled in |
| automatically and immediately. If the user isn't known on the system yet, there is a |
| thread that runs every hour that fills in those UIDs when they become known, but it |
| doesn't recognize UID changes of preexisting users. But you can simply restart the |
| slurmctld for those changes to be recognized.</p> |
| |
| <p><a id="sssd"><b>How can I get SSSD to work with Slurm?</b></a><br> |
| SSSD or System Security Services Daemon does not allow enumeration of |
| group members by default. Note that enabling enumeration in large |
| environments might not be feasible. However, Slurm does not need enumeration |
| except for some specific quirky configurations (multiple groups with the same |
| GID), so it's probably safe to leave enumeration disabled. |
| SSSD is also case sensitive by default for some configurations, which could |
| possibly raise other issues. Add the following lines |
| to <i>/etc/sssd/sssd.conf</i> on your head node to address these issues:</p> |
| <pre> |
| enumerate = True |
| case_sensitive = False |
| </pre> |
| |
| <h3>Jobs</h3> |
| |
| <p><a id="suspend"><b>How is job suspend/resume useful?</b></a><br> |
| Job suspend/resume is most useful to get particularly large jobs initiated |
| in a timely fashion with minimal overhead. Say you want to get a full-system |
| job initiated. Normally you would need to either cancel all running jobs |
| or wait for them to terminate. Canceling jobs results in the loss of |
| their work to that point from their beginning. |
| Waiting for the jobs to terminate can take hours, depending upon your |
| system configuration. A more attractive alternative is to suspend the |
| running jobs, run the full-system job, then resume the suspended jobs. |
| This can easily be accomplished by configuring a special queue for |
| full-system jobs and using a script to control the process. |
| The script would stop the other partitions, suspend running jobs in those |
| partitions, and start the full-system partition. |
| The process can be reversed when desired. |
| One can effectively gang schedule (time-slice) multiple jobs |
| using this mechanism, although the algorithms to do so can get quite |
| complex. |
| Suspending and resuming a job makes use of the SIGSTOP and SIGCONT |
| signals respectively, so swap and disk space should be sufficient to |
| accommodate all jobs allocated to a node, either running or suspended.</p> |
| |
| <p><a id="squeue_script"><b>How can I suspend, resume, hold or release all |
| of the jobs belonging to a specific user, partition, etc?</b></a><br> |
| There isn't any filtering by user, partition, etc. available in the scontrol |
| command; however the squeue command can be used to perform the filtering and |
| build a script which you can then execute. For example:</p> |
| <pre> |
| $ squeue -u adam -h -o "scontrol hold %i" >hold_script |
| </pre> |
| |
| <p><a id="restore_priority"><b>After manually setting a job priority |
| value, how can its priority value be returned to being managed by the |
| priority/multifactor plugin?</b></a><br> |
| Hold and then release the job as shown below.</p> |
| <pre> |
| $ scontrol hold <jobid> |
| $ scontrol release <jobid> |
| </pre> |
| |
| <p><a id="scontrol_multi_jobs"><b>Can I update multiple jobs with a |
| single <i>scontrol</i> command?</b></a><br> |
| No, but you can probably use <i>squeue</i> to build the script taking |
| advantage of its filtering and formatting options. For example:</p> |
| <pre> |
| $ squeue -tpd -h -o "scontrol update jobid=%i priority=1000" >my.script |
| </pre> |
| |
| <p><a id="task_prolog"><b>How could I automatically print a job's |
| Slurm job ID to its standard output?</b></a><br> |
| The configured <i>TaskProlog</i> is the only thing that can write to |
| the job's standard output or set extra environment variables for a job |
| or job step. To write to the job's standard output, precede the message |
| with "print ". To export environment variables, output a line of this |
| form "export name=value". The example below will print a job's Slurm |
| job ID and allocated hosts for a batch job only.</p> |
| |
| <pre> |
| #!/bin/sh |
| # |
| # Sample TaskProlog script that will print a batch job's |
| # job ID and node list to the job's stdout |
| # |
| |
| if [ X"$SLURM_STEP_ID" = "X" -a X"$SLURM_PROCID" = "X"0 ] |
| then |
| echo "print ==========================================" |
| echo "print SLURM_JOB_ID = $SLURM_JOB_ID" |
| echo "print SLURM_JOB_NODELIST = $SLURM_JOB_NODELIST" |
| echo "print ==========================================" |
| fi |
| </pre> |
| |
| <p><a id="write_to_job_stdout"><b>Is it possible to write to user stdout?</b></a> |
| <br>The way user I/O is handled by Slurm makes it impossible to write to the |
| user process as an admin after the user process is executed (execve is called). |
| This happens right after the call to |
| <a href="prolog_epilog.html">TaskProlog</a>, which is the last moment we can |
| write to the stdout of the user process. Slurm assumes that this file |
| descriptor is only owned by the user process while running. The file descriptor |
| is opened as specified and passed to the task so it makes use of the file |
| descriptor directly. Slurmstepd is able to log error messages to the error file |
| by duplicating the standard error of the process.</p> |
| |
| <p>It is possible to write to standard error from SPANK plugins, but this |
| can't be used to append a job summary, since the file descriptors are opened |
| with a close-on-exec flag and are closed by the operating system right after |
| the user process completes. In theory, a central place that could be used to |
| prepare some kind of job summary is EpilogSlurmctld. However, using it to |
| write to a file where user output is stored may be problematic. The script is |
| running as SlurmUser, so intensive validation of the file name may be required |
| (e.g. to prevent users from specifying something like /etc/passwd as the |
| output file). It's also possible that a job could have multiple output files |
| (see <a href=srun.html#OPT_filename-pattern>filename pattern</a> in the srun |
| man page).</p> |
| |
| <p><a id="orphan_procs"><b>Why are user processes and <i>srun</i> |
| running even though the job is supposed to be completed?</b></a><br> |
| Slurm relies upon a configurable process tracking plugin to determine |
| when all of the processes associated with a job or job step have completed. |
| Those plugins relying upon a kernel patch can reliably identify every process. |
| Those plugins dependent upon process group IDs or parent process IDs are not |
| reliable. See the <i>ProctrackType</i> description in the <i>slurm.conf</i> |
| man page for details. We rely upon the cgroup plugin for most systems.</p> |
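| <p>For example, the cgroup-based tracking mentioned above is selected with:</p> |
| <pre> |
| ProctrackType=proctrack/cgroup |
| </pre> |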
| |
| <p><a id="reqspec"><b>How can a job which has exited with a specific exit |
| code be requeued?</b></a><br> |
| Slurm supports requeue in hold with a <b>SPECIAL_EXIT</b> state using the |
| command:</p> |
| |
| <pre>scontrol requeuehold State=SpecialExit job_id</pre> |
| |
| <p>This is useful when users want to requeue and flag a job which has exited |
| with a specific error case. See man scontrol(1) for more details.</p> |
| |
| <pre> |
| $ scontrol requeuehold State=SpecialExit 10 |
| $ squeue |
| JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) |
| 10 mira zoppo david SE 0:00 1 (JobHeldUser) |
| </pre> |
| <p> |
| The job can be later released and run again. |
| </p> |
| <p> |
| The requeuing of jobs which exit with a specific exit code can be |
| automated using an <b>EpilogSlurmctld</b>, see man(5) slurm.conf. |
| This is an example of a script whose exit code depends on the existence |
| of a file. |
| </p> |
| |
| <pre> |
| $ cat exitme |
| #!/bin/sh |
| # |
| echo "hi! `date`" |
| if [ ! -e "/tmp/myfile" ]; then |
| echo "going out with 8" |
| exit 8 |
| fi |
| rm /tmp/myfile |
| echo "going out with 0" |
| exit 0 |
| </pre> |
| <p> |
| This is an example of an EpilogSlurmctld that checks the job exit value |
| looking at the <b>SLURM_JOB_EXIT_CODE2</b> environment variable and requeues a job |
| if it exited with value 8. The SLURM_JOB_EXIT_CODE2 variable has the format |
| "exit:sig": the first number is the exit code, typically as set by the exit() |
| function, and the second number is the signal that caused the process to |
| terminate, if it was terminated by a signal. |
| </p> |
| |
| <pre> |
| $ cat slurmctldepilog |
| #!/bin/sh |
| |
| export PATH=/bin:/home/slurm/linux/bin |
| LOG=/home/slurm/linux/log/logslurmepilog |
| |
| echo "Start `date`" >> $LOG 2>&1 |
| echo "Job $SLURM_JOB_ID exitcode $SLURM_JOB_EXIT_CODE2" >> $LOG 2>&1 |
| exitcode=`echo $SLURM_JOB_EXIT_CODE2|awk '{split($0, a, ":"); print a[1]}'` >> $LOG 2>&1 |
| if [ "$exitcode" == "8" ]; then |
| echo "Found REQUEUE_EXIT_CODE: $REQUEUE_EXIT_CODE" >> $LOG 2>&1 |
| scontrol requeuehold state=SpecialExit $SLURM_JOB_ID >> $LOG 2>&1 |
| echo $? >> $LOG 2>&1 |
| else |
| echo "Job $SLURM_JOB_ID exit all right" >> $LOG 2>&1 |
| fi |
| echo "Done `date`" >> $LOG 2>&1 |
| |
| exit 0 |
| </pre> |
| <p> |
| Using the exitme script as an example, we have it exit with a value of 8 on |
| the first run, then when it gets requeued in hold with SpecialExit state |
| we touch the file /tmp/myfile, then release the job which will finish |
| in a COMPLETE state. |
| </p> |
| |
| <p><a id="cpu_freq"><b>Why is Slurm unable to set the CPU frequency for |
| jobs?</b></a><br> |
| First check that Slurm is configured to bind jobs to specific CPUs by |
| making sure that TaskPlugin is configured to either affinity or cgroup. |
| Next check that your processor is configured to permit frequency |
| control by examining the values in the file |
| <i>/sys/devices/system/cpu/cpu0/cpufreq</i> where "cpu0" represents a CPU ID 0. |
| Of particular interest is the file <i>scaling_available_governors</i>, |
| which identifies the CPU governors available. |
| If "userspace" is not an available CPU governor, this may well be due to the |
| <i>intel_pstate</i> driver being installed. |
| Information about disabling the <i>intel_pstate</i> driver is available |
| from<br> |
| <a href="https://bugzilla.kernel.org/show_bug.cgi?id=57141"> |
| https://bugzilla.kernel.org/show_bug.cgi?id=57141</a> and<br> |
| <a href="http://unix.stackexchange.com/questions/121410/setting-cpu-governor-to-on-demand-or-conservative"> |
| http://unix.stackexchange.com/questions/121410/setting-cpu-governor-to-on-demand-or-conservative</a>.</p> |
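| <p>For example, the governors available on CPU 0 can be checked as shown below |
| (the output will vary by system):</p> |
| <pre> |
| $ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors |
| conservative ondemand userspace powersave performance |
| </pre> |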
| |
| <p><a id="salloc_default_command"><b>Can the salloc command be configured to |
| launch a shell on a node in the job's allocation?</b></a><br> |
| Yes, just set "use_interactive_step" as part of the LaunchParameters |
| configuration option in slurm.conf.</p> |
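| <p>For example, in slurm.conf:</p> |
| <pre> |
| LaunchParameters=use_interactive_step |
| </pre> |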
| |
| <p><a id="tmpfs_jobcontainer"><b>How can I set up a private /tmp and /dev/shm for |
| jobs on my machine?</b></a> |
| <br/> |
| The tmpfs job container plugin can be used by including |
| <i>JobContainerType=job_container/tmpfs</i> |
| in your slurm.conf file. It additionally requires a |
| <a href="job_container.conf.html">job_container.conf</a> file to be |
| set up which is further described in the man page. |
| The tmpfs plugin creates a private mount namespace in which it mounts a |
| private /tmp at a location configured in job_container.conf. The BasePath |
| is used to construct the mount path by creating a job-specific directory inside |
| it and mounting /tmp on it. Since all of the mounts are created inside a |
| private mount namespace, they are only visible inside the job. This makes the |
| plugin a useful solution for jobs on shared nodes, since each job can only see |
| mounts created in its own mount namespace. A private /dev/shm is also mounted |
| to isolate it between different jobs.</p> |
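| <p>A minimal sketch of the two configuration files involved (the BasePath shown |
| matches the example below and is only illustrative):</p> |
| <pre> |
| # slurm.conf |
| JobContainerType=job_container/tmpfs |
| |
| # job_container.conf |
| AutoBasePath=true |
| BasePath=/storage |
| </pre> |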
| <p> |
| Mount namespace construction also happens before the job's SPANK environment is |
| set up. Hence all SPANK-related job steps will see only the private /tmp that the |
| plugin creates. The plugin also provides an optional initialization script that |
| is invoked before the job's namespace is constructed. This can be useful for |
| any site specific customization that may be necessary.</p> |
| <pre> |
| parallels@linux_vb:~$ echo $SLURM_JOB_ID |
| 7 |
| parallels@linux_vb:~$ findmnt -o+PROPAGATION | grep /tmp |
| └─/tmp /dev/sda1[/storage/7/.7] ext4 rw,relatime,errors=remount-ro,data=ordered private |
| </pre> |
| <p>In the example above, <i>BasePath</i> points to /storage and a slurm job with |
| job id 7 is set up to mount /tmp on /storage/7/.7. When a user inside the job |
| looks up mounts, they can see that their /tmp is mounted. However, |
| they are prevented from mistakenly accessing the backing directory directly.</p> |
| <pre> |
| parallels@linux_vb:~$ cd /storage/7/ |
| bash: cd: /storage/7/: Permission denied |
| </pre> |
| <p>They are allowed to access (read/write) /tmp only.</p> |
| <p> |
| Additionally, pam_slurm_adopt has been extended to support this functionality. |
| If a user starts an ssh session which is managed by pam_slurm_adopt, then |
| the user's process joins the namespace that is constructed by the tmpfs plugin. |
| Hence in ssh sessions, the user has the same view of /tmp and /dev/shm as |
| their job. This functionality is enabled by default in pam_slurm_adopt |
| but can be disabled explicitly by appending <i>join_container=false</i> as shown:</p> |
| <pre> |
| account sufficient pam_slurm_adopt.so join_container=false |
| </pre> |
| |
| <p><a id="sysv_memory"><b>How do I configure Slurm to work with System V IPC |
| enabled applications?</b></a><br> |
| Slurm is generally agnostic to |
| <a href="http://man7.org/linux/man-pages/man2/ipc.2.html"> |
| System V IPC</a> (a.k.a. "sysv ipc" in the Linux kernel). |
| Memory accounting of processes using sysv ipc changes depending on the value |
| of <a href="https://www.kernel.org/doc/Documentation/sysctl/kernel.txt"> |
| sysctl kernel.shm_rmid_forced</a> (added in Linux kernel 3.1): |
| </p> |
| <ul> |
| <li>shm_rmid_forced = 1 |
| <br> |
| Forces all shared memory usage of processes to be accounted and reported by the |
| kernel to Slurm. This breaks the separate namespace of sysv ipc and may cause |
| unexpected application issues without careful planning. Processes that share |
| the same sysv ipc namespaces across jobs may end up getting OOM killed when |
| another job ends and their allocation percentage increases. |
| </li> |
| <li>shm_rmid_forced = 0 (default in most Linux distributions) |
| <br> |
| System V memory usage will not be reported by Slurm for jobs. |
| It is generally suggested to configure the |
| <a href="https://www.kernel.org/doc/Documentation/sysctl/kernel.txt"> |
| sysctl kernel.shmmax</a> parameter. The value of kernel.shmmax times the |
| maximum number of job processes should be deducted from each node's |
| configured RealMemory in your slurm.conf. Most Linux distributions set the |
| default to what is effectively unlimited, which can cause the OOM killer |
| to activate for unrelated new jobs or even for the slurmd process. If any |
| processes use sysv memory mechanisms, the Linux kernel OOM killer will never |
| be able to free the used memory. A Slurm job epilog script will be needed to |
| free any of the user memory. Setting kernel.shmmax=0 will disable sysv ipc |
| memory allocations but may cause application issues. |
| </li> |
| </ul> |
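| <p>For the default shm_rmid_forced = 0 case above, a rough sketch of the |
| suggested sysctl setting (the 4 GB value is purely illustrative and must be |
| sized for your applications) would be:</p> |
| <pre> |
| # /etc/sysctl.d/90-shmmax.conf |
| kernel.shmmax = 4294967296 |
| </pre> |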
| |
| <h3>General Troubleshooting</h3> |
| |
| <p><a id="core_dump"><b>If a Slurm daemon core dumps, where can I find the |
| core file?</b></a><br> |
| If <i>slurmctld</i> is started with the -D option, then the core file will be |
| written to the current working directory. If <i>SlurmctldLogFile</i> is an |
| absolute path, the core file will be written to this directory. Otherwise the |
| core file will be written to the <i>StateSaveLocation</i>, or "/var/tmp/" as a |
| last resort.<br> |
| SlurmUser must have write permission on the directories. If none of the above |
| directories have write permission for SlurmUser, no core file will be produced.</p> |
| |
| <p>If <i>slurmd</i> is started with the -D option, then the core file will also be |
| written to the current working directory. If <i>SlurmdLogFile</i> is an |
| absolute path, the core file will be written to this directory. |
| Otherwise the core file will be written to the <i>SlurmdSpoolDir</i>, or |
| "/var/tmp/" as a last resort.<br> |
| If none of the above directories can be written, no core file will be produced. |
| </p> |
| |
| <p>For <i>slurmstepd</i>, the core file will depend upon when the failure |
| occurs. If it is running in a privileged phase, it will be in the same location |
| as that described above for the slurmd daemon. If it is running in an |
| unprivileged phase, it will be in the spawned job's working directory.</p> |
| |
| |
| <p>Nevertheless, in some operating systems this can vary:</p> |
| <ul> |
| <li> |
| E.g. in RHEL the event |
| may be captured by the abrt daemon and the core file written to the configured |
| abrt dump location (e.g. /var/spool/abrt). |
| </li> |
| </ul> |
| |
| <p>Normally, distributions need some more tweaking in order to allow the core |
| files to be generated correctly.</p> |
| |
| <p>slurmstepd uses the setuid() (set user ID) function to escalate |
| privileges. It is possible that on certain systems, or under certain security |
| policies, this causes the core files not to be generated. |
| <br>To allow the generation in such systems you usually must enable the |
| suid_dumpable kernel parameter:</p> |
| |
| Set:<br> |
| /proc/sys/fs/suid_dumpable to 2<br> |
| or<br> |
| sysctl fs.suid_dumpable=2<br><br> |
| or set it permanently in sysctl.conf<br> |
| fs.suid_dumpable = 2<br><br> |
| |
| <p>The value of 2, "suidsafe", makes any binary which normally would not be dumped |
| be dumped readable by root only.<br>This allows the end user to remove such a dump |
| but not access it directly. For security reasons core dumps in this mode will |
| not overwrite one another or other files.<br> This mode is appropriate when |
| administrators are attempting to debug problems in a normal environment.</p> |
| |
| <p>Then you must also set the core pattern to an absolute pathname:</p> |
| |
| <pre>sysctl kernel.core_pattern=/tmp/core.%e.%p</pre> |
| |
| <p>We recommend reading your distribution's documentation about the |
| configuration of these parameters.</p> |
| |
| <p>It is also usually necessary to configure the system core limits, since they |
| may be set to 0.</p> |
| <pre> |
| $ grep core /etc/security/limits.conf |
| # - core - limits the core file size (KB) |
| * hard core unlimited |
| * soft core unlimited |
| </pre> |
| <p>On some systems it is not enough to set a hard limit; you must also set a |
| soft limit.</p> |
| |
| <p>Also, to propagate the core file size limit to user processes, the |
| <i>PropagateResourceLimits=CORE</i> parameter in slurm.conf may be needed.</p> |
| |
| <p>Also be sure to give SlurmUser the appropriate permissions to write to the |
| core location directories.</p> |
| |
| <p><b>NOTE</b>: On a diskless node, depending on the core_pattern, or if |
| /var/spool/abrt points to an in-memory filesystem like tmpfs, and the job |
| caused an OOM condition, generating the core may fill up your machine's |
| memory and hang it. It is therefore encouraged to write core dumps to |
| persistent storage. Be careful about multiple nodes writing a core dump to a |
| shared filesystem, since doing so may significantly impact it. |
| </p> |
| |
| <b>Other exceptions:</b> |
| |
| <p>On CentOS 6, also set "ProcessUnpackaged = yes" in the file |
| /etc/abrt/abrt-action-save-package-data.conf.</p> |
| |
| <p>On RHEL6, also set "DAEMON_COREFILE_LIMIT=unlimited" in the file |
| rc.d/init.d/functions.</p> |
| |
| <p>On a SELinux enabled system, or on a distribution with similar security |
| system, make sure it allows daemons to dump cores:</p> |
| |
| <pre>$ getsebool allow_daemons_dump_core</pre> |
| |
| <p>coredumpctl can also give valuable information:</p> |
| |
| <pre>$ coredumpctl info</pre> |
| |
| <p><a id="backtrace"><b>How can I get a backtrace from a core file?</b></a><br> |
| If you do have a crash that generates a core file, you will want to get a |
| backtrace of that crash to send to SchedMD for evaluation.</p> |
| |
| <p><b>NOTE</b>: Core files must be analyzed by the same binary that was used |
| when they were generated. Compile time differences make it almost impossible |
| for SchedMD to use a core file from a different system. You should always |
| send a backtrace rather than a core file when submitting a support request.</p> |
| |
| <p>In order to generate a backtrace you must use <i>gdb</i>, specify the |
| path to the <i>slurm*</i> binary that generated the crash, and specify the |
| path to the core file. Below is an example of how to get a backtrace of a |
| core file generated by <i>slurmctld</i>: |
| <pre> |
| gdb -ex 't a a bt full' -batch /path/to/slurmctld <core_file> |
| </pre> |
| </p> |
| |
| <p>You can also use <i>gdb</i> to generate a backtrace without a core file. |
| This can be useful if you are experiencing a crash on startup and aren't |
| getting a core file for some reason. You would want to start the binary |
| from inside of <i>gdb</i>, wait for it to crash, and generate the backtrace. |
| Below is an example, using <i>slurmctld</i> as the example binary: |
| <pre> |
| $ gdb /path/to/slurmctld |
| (gdb) set print pretty |
| (gdb) r -D |
| (gdb) t a a bt full |
| </pre> |
| </p> |
| |
| <p>You may also need to get a backtrace of a running daemon if it is stuck |
| or hung. To do this you would point <i>gdb</i> at the running binary and |
| have it generate the backtrace. Below is an example, again using |
| <i>slurmctld</i> as the example: |
| <pre> |
| gdb -ex 't a a bt' -batch -p $(pidof slurmctld) |
| </pre> |
| </p> |
| |
| <h3>Error Messages</h3> |
| |
| <p><a id="inc_plugin"><b>"Cannot resolve X plugin operations" on |
| daemon startup</b></a><br> |
| This means that symbols expected in the plugin were |
| not found by the daemon. This typically happens when the |
| plugin was built or installed improperly or the configuration |
| file is telling the plugin to use an old plugin (say from the |
| previous version of Slurm). Restart the daemon in verbose mode |
| for more information (e.g. "slurmctld -Dvvvvv").</p> |
| |
| <p><a id="credential_replayed"><b>"Credential replayed" in |
| <i>SlurmdLogFile</i></b></a><br> |
| This error is indicative of the <i>slurmd</i> daemon not being able |
| to respond to job initiation requests from the <i>srun</i> command |
| in a timely fashion (a few seconds). |
| <i>Srun</i> responds by resending the job initiation request. |
| When the <i>slurmd</i> daemon finally starts to respond, it |
| processes both requests. |
| The second request is rejected and the event is logged with |
| the "credential replayed" error. |
| If you check the <i>SlurmdLogFile</i> and <i>SlurmctldLogFile</i>, |
| you should see signs of the <i>slurmd</i> daemon's non-responsiveness. |
| A variety of factors can be responsible for this problem |
| including</p> |
| <ul> |
| <li>Diskless nodes encountering network problems</li> |
| <li>Very slow Network Information Service (NIS)</li> |
| <li>The <i>Prolog</i> script taking a long time to complete</li> |
| </ul> |
| <p>Configure <i>MessageTimeout</i> in slurm.conf to a value higher than the |
| default 10 seconds.</p> |
| |
| <p><a id="cred_invalid"><b>"Invalid job credential"</b></a><br> |
| This error is indicative of Slurm's job credential files being inconsistent across |
| the cluster. All nodes in the cluster must have the matching public and private |
| keys as defined by <b>JobCredPrivateKey</b> and <b>JobCredPublicKey</b> in the |
| Slurm configuration file <b>slurm.conf</b>.</p> |
| |
| <p><a id="cred_replay"><b>"Task launch failed on node ... Job credential |
| replayed"</b></a><br> |
| This error indicates that a job credential generated by the slurmctld daemon |
| corresponds to a job that the slurmd daemon has already revoked. |
| The slurmctld daemon selects job ID values based upon the configured |
| value of <b>FirstJobId</b> (the default value is 1) and each job gets |
| a value one larger than the previous job. |
| On job termination, the slurmctld daemon notifies the slurmd on each |
| allocated node that all processes associated with that job should be |
| terminated. |
| The slurmd daemon maintains a list of the jobs which have already been |
| terminated to avoid replay of task launch requests. |
| If the slurmctld daemon is cold-started (with the "-c" option |
| or "/etc/init.d/slurm startclean"), it starts job ID values |
| over based upon <b>FirstJobId</b>. |
| If the slurmd is not also cold-started, it will reject job launch requests |
| for jobs that it considers terminated. |
| The solution to this problem is to cold-start all slurmd daemons whenever |
| the slurmctld daemon is cold-started.</p> |
| |
| <p><a id="file_limit"><b>"Unable to accept new connection: Too many open |
| files"</b></a><br> |
| The srun command automatically increases its open file limit to |
| the hard limit in order to process all of the standard input and output |
| connections to the launched tasks. It is recommended that you set the |
| open file hard limit to 8192 across the cluster.</p> |
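| <p>One way to raise the limit (illustrative only; how limits are applied |
| depends on your distribution and on whether PAM or systemd sets them for the |
| relevant processes):</p> |
| <pre> |
| # /etc/security/limits.conf |
| *    soft    nofile    8192 |
| *    hard    nofile    8192 |
| </pre> |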
| |
| <p><a id="slurmd_log"><b><i>SlurmdDebug</i> fails to log job step information |
| at the appropriate level</b></a><br> |
| There are two programs involved here. One is <b>slurmd</b>, which is |
| a persistent daemon running at the desired debug level. The second |
| program is <b>slurmstepd</b>, which executes the user job and whose |
| debug level is controlled by the user. Submitting the job with |
| the <i>--slurmd-debug=#</i> option will result in the desired level of |
| detail being logged in the <i>SlurmdLogFile</i> plus the output |
| of the program.</p> |
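| <p>For example (an illustrative command; the debug level and the program being |
| launched are arbitrary):</p> |
| <pre> |
| $ srun --slurmd-debug=verbose -N1 hostname |
| </pre> |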
| |
| <p><a id="batch_lost"><b>"Batch JobId=# missing from batch node <node> |
| (not found BatchStartTime after startup)"</b></a><br> |
| A shell is launched on node zero of a job's allocation to execute |
| the submitted program. The <i>slurmd</i> daemon executing on each compute |
| node will periodically report to the <i>slurmctld</i> what programs it |
| is executing. If a batch program is expected to be running on some |
| node (i.e. node zero of the job's allocation) and is not found, the |
| message above will be logged and the job canceled. This typically is |
| associated with exhausting memory on the node or some other critical |
| failure that cannot be recovered from.</p> |
| |
| <p><a id="opencl_pmix"><b>Multi-Instance GPU not working with Slurm and PMIx; |
| GPUs are "In use by another client"</b></a><br/> |
| PMIx uses the <b>hwloc API</b> for different purposes, including |
| <i>OS device</i> features like querying sysfs folders (such as |
| <i>/sys/class/net</i> and <i>/sys/class/infiniband</i>) to get the names of |
| InfiniBand HCAs. As part of these features, hwloc defaults to |
| querying OpenCL devices, which creates handles on <i>/dev/nvidia*</i> files. |
| These handles are kept by slurmstepd and will result in the following error |
| inside a job: |
| </p> |
| <pre> |
| $ nvidia-smi mig --id 1 --create-gpu-instance FOO,FOO --default-compute-instance |
| Unable to create a GPU instance on GPU 1 using profile FOO: In use by another client |
| </pre> |
| <p> |
| In order to use Multi-Instance GPUs with Slurm and PMIx, you can instruct hwloc |
| not to query OpenCL devices by setting the |
| <span class="commandline">HWLOC_COMPONENTS=-opencl</span> environment |
| variable for slurmd, e.g. by setting this variable in the systemd unit file for |
| slurmd, as shown below. |
| </p> |
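| <p>A minimal sketch, assuming slurmd is managed by systemd (the drop-in file |
| name is arbitrary):</p> |
| <pre> |
| # /etc/systemd/system/slurmd.service.d/hwloc.conf |
| [Service] |
| Environment="HWLOC_COMPONENTS=-opencl" |
| </pre> |
| <p>After creating the drop-in, run "systemctl daemon-reload" and restart |
| slurmd on the affected nodes.</p> |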
| |
| <p><a id="accept_again"><b>"srun: error: Unable to accept connection: |
| Resources temporarily unavailable"</b></a><br> |
| This has been reported on some larger clusters running SUSE Linux when |
| a user's resource limits are reached. You may need to increase limits |
| for locked memory and stack size to resolve this problem.</p> |
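| <p>For example (illustrative values only; how limits are applied depends on |
| your PAM and systemd configuration):</p> |
| <pre> |
| # /etc/security/limits.conf |
| *    soft    memlock    unlimited |
| *    hard    memlock    unlimited |
| *    soft    stack      unlimited |
| *    hard    stack      unlimited |
| </pre> |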
| |
| <p><a id="large_time"><b>"Warning: Note very large processing time" |
| in <i>SlurmctldLogFile</i></b></a><br> |
| This error is indicative of some operation taking an unexpectedly long |
| time to complete (specifically, more than one second). |
| Setting the value of the <i>SlurmctldDebug</i> configuration parameter |
| to <i>debug2</i> or higher should identify which operation(s) are |
| experiencing long delays. |
| This message typically indicates long delays in file system access |
| (writing state information or getting user information). |
| Another possibility is that the node on which the slurmctld |
| daemon executes has exhausted memory and is paging. |
| Try running the program <i>top</i> to check for this possibility.</p> |
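| <p>The controller's debug level can also be raised temporarily at run time |
| (a sketch; restore your normal level afterward, typically <i>info</i>):</p> |
| <pre> |
| $ scontrol setdebug debug2    # raise logging while investigating |
| # ... examine the SlurmctldLogFile, then: |
| $ scontrol setdebug info      # restore the usual level |
| </pre> |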
| |
| <p><a id="mysql_duplicate"><b>"Duplicate entry" causes slurmdbd to |
| fail</b></a><br> |
| This problem has rarely been observed, and only with MySQL, not MariaDB. |
| The root cause of the failure appears to be reaching the upper limit of the |
| auto-increment field. Upgrading to MariaDB is recommended. |
| If that is not possible, then back up the database, remove the duplicate |
| record(s), and restart the slurmdbd daemon, as shown below.</p> |
| <pre> |
| $ slurmdbd -Dvv |
| ... |
| slurmdbd: debug: Table "cray_job_table" has changed. Updating... |
| slurmdbd: error: mysql_query failed: 1062 Duplicate entry '2711-1478734628' for key 'id_job' |
| ... |
| |
| $ mysqldump --single-transaction -u&lt;user&gt; -p&lt;password&gt; slurm_acct_db >/tmp/slurm_db_backup.sql |
| |
| $ mysql |
| mysql> use slurm_acct_db; |
| mysql> delete from cray_job_table where id_job='2711-1478734628'; |
| mysql> quit; |
| Bye |
| </pre> |
| |
| <p>If necessary, you can edit the database dump and recreate the database as |
| shown below.</p> |
| <pre> |
| $ mysql |
| mysql> drop database slurm_acct_db; |
| mysql> create database slurm_acct_db; |
| mysql> quit; |
| Bye |
| |
| $ mysql -u&lt;user&gt; -p&lt;password&gt; slurm_acct_db &lt; /tmp/slurm_db_backup.sql |
| </pre> |
| |
| <p><a id="json_serializer"><b>"Unable to find plugin: serializer/json" |
| </b></a><br/> |
| Several parts of Slurm have switched to using the centralized serializer |
| code. The JSON or YAML plugins are only required if a function that needs |
| them is executed. If such a function is executed and the plugin is missing, |
| Slurm will be unable to create the JSON/YAML output and will abort with the |
| following error: |
| </p> |
| <pre> |
| slurmctld: fatal: Unable to find plugin: serializer/json |
| </pre> |
| <p> |
| In most cases, these are required for new functionality added after Slurm-20.02. |
| However, with each release, we have been adding more places that use the |
| serializer plugins. Because the list is evolving we do not plan on listing all |
| the commands that require the plugins but will instead provide the error |
| (shown above). To correct the issue, please make sure that Slurm is configured, |
| compiled and installed with the relevant JSON or YAML library (or preferably |
| both). The configure script can be made to explicitly request these libraries: |
| </p> |
| <pre> |
| ./configure --with-json=PATH --with-yaml=PATH $@ |
| </pre> |
| <p> |
| Most distributions include packages to make installation relatively easy. |
| Please make sure to install the 'dev' or 'devel' packages along with the |
| library packages. We also provide explicit instructions on how to install from |
| source: <a href="related_software.html#yaml">libyaml</a> and |
| <a href="related_software.html#jwt">libjwt</a>. |
| </p> |
| |
| <h3>Third Party Integrations</h3> |
| |
| <p><a id="globus"><b>Can Slurm be used with Globus?</b></a><br> |
| Yes. Build and install Slurm's Torque/PBS command wrappers along with |
| the Perl APIs from Slurm's <i>contribs</i> directory and configure |
| <a href="http://www-unix.globus.org/">Globus</a> to use those PBS commands. |
| Note that there are RPMs available for both of these packages, named |
| <i>torque</i> and <i>perlapi</i> respectively.</p> |
| |
| <p><a id="totalview"><b>How can TotalView be configured to operate with |
| Slurm?</b></a><br> |
| The following lines should also be added to the global <i>.tvdrc</i> file |
| for TotalView to operate with Slurm:</p> |
| <pre> |
| # Enable debug server bulk launch: Checked |
| dset -set_as_default TV::bulk_launch_enabled true |
| |
| # Command: |
| # Beginning with TV 7X.1, TV supports Slurm and %J. |
| # Specify --mem-per-cpu=0 in case Slurm is configured with a default memory |
| # value; we want TotalView to share the job's memory limit without |
| # consuming any of the job's memory, so as not to block other job steps. |
| dset -set_as_default TV::bulk_launch_string {srun --mem-per-cpu=0 -N%N -n%N -w`awk -F. 'BEGIN {ORS=","} {if (NR==%N) ORS=""; print $1}' %t1` -l --input=none %B/tvdsvr%K -callback_host %H -callback_ports %L -set_pws %P -verbosity %V -working_directory %D %F} |
| |
| # Temp File 1 Prototype: |
| # Host Lines: |
| # Slurm NodeNames need to be unadorned hostnames. In case %R returns |
| # fully qualified hostnames, list the hostnames in %t1 here, and use |
| # awk in the launch string above to strip away domain name suffixes. |
| dset -set_as_default TV::bulk_launch_tmpfile1_host_lines {%R} |
| </pre> |
| |
| <p style="text-align:center;">Last modified 09 May 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |