| <!--#include virtual="header.txt"--> |
| |
| <h1>Containers Guide</h1> |
| |
| <h2 id="contents">Contents<a class="slurm_link" href="#contents"></a></h2> |
| <ul> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#limitations">Known limitations</a></li> |
| <li><a href="#prereq">Prerequisites</a></li> |
| <li><a href="#software">Required software</a></li> |
| <li><a href="#example">Example configurations for various OCI Runtimes</a></li> |
| <li><a href="#testing">Testing OCI runtime outside of Slurm</a></li> |
| <li><a href="#request">Requesting container jobs or steps</a></li> |
| <li><a href="#docker-scrun">Integration with Rootless Docker</a></li> |
| <li><a href="#podman-scrun">Integration with Podman</a></li> |
| <li><a href="#bundle">OCI Container bundle</a></li> |
| <li><a href="#ex-ompi5-pmix4">Example OpenMPI v5 + PMIx v4 container</a></li> |
| <li><a href="#plugin">Container support via Plugin</a> |
| <ul> |
| <li><a href="#shifter">Shifter</a></li> |
| <li><a href="#enroot1">ENROOT and Pyxis</a></li> |
| <li><a href="#sarus">Sarus</a></li> |
| </ul></li> |
| </ul> |
| |
| <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2> |
| <p>Containers are being adopted in HPC workloads. |
| Containers rely on existing kernel features to allow greater user control over |
what applications see and can interact with at any given time. For HPC
workloads, these are usually restricted to the
<a href="http://man7.org/linux/man-pages/man7/mount_namespaces.7.html">mount namespace</a>.
Slurm natively supports requesting unprivileged OCI containers for jobs
| and steps.</p> |
| |
| <p>Setting up containers requires several steps: |
| <ol> |
| <li>Set up the <a href="#prereq">kernel</a> and a |
| <a href="#software">container runtime</a>.</li> |
| <li>Deploy a suitable <a href="oci.conf.html">oci.conf</a> file accessible to |
| the compute nodes (<a href="#example">examples below</a>).</li> |
| <li>Restart or reconfigure slurmd on the compute nodes.</li> |
| <li>Generate <a href="#bundle">OCI bundles</a> for containers that are needed |
| and place them on the compute nodes.</li> |
| <li>Verify that you can <a href="#testing">run containers directly</a> through |
| the chosen OCI runtime.</li> |
| <li>Verify that you can <a href="#request">request a container</a> through |
| Slurm.</li> |
| </ol> |
| </p> |
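<p>As an example of step 3, after deploying or updating oci.conf, the compute
node daemons can be told to pick it up with either of the following (a sketch;
depending on the site and the change, a full restart of slurmd may be
preferred over a reconfigure):</p>
<pre>
# Reconfigure the cluster from the controller
scontrol reconfigure

# Or restart slurmd on each compute node
sudo systemctl restart slurmd
</pre>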
| |
| <h2 id="limitations">Known limitations |
| <a class="slurm_link" href="#limitations"></a> |
| </h2> |
| <p>The following is a list of known limitations of the Slurm OCI container |
| implementation.</p> |
| |
| <ul> |
| <li>All containers must run under unprivileged (i.e. rootless) invocation. |
| All commands are called by Slurm as the user with no special |
| permissions.</li> |
| |
| <li>Custom container networks are not supported. All containers should work |
| with the <a href="https://docs.docker.com/network/host/">"host" |
| network</a>.</li> |
| |
| <li>Slurm will not transfer the OCI container bundle to the execution |
| nodes. The bundle must already exist on the requested path on the |
| execution node.</li> |
| |
| <li>Containers are limited by the OCI runtime used. If the runtime does not |
| support a certain feature, then that feature will not work for any job |
| using a container.</li> |
| |
<li>oci.conf must be configured on the execution node for the job, otherwise the
	requested container will be ignored by Slurm (though it may still be
	used by the job or by a given plugin).</li>
| </ul> |
| |
| <h2 id="prereq">Prerequisites<a class="slurm_link" href="#prereq"></a></h2> |
| <p>The host kernel must be configured to allow user land containers:</p> |
| <pre> |
| sudo sysctl -w kernel.unprivileged_userns_clone=1 |
| sudo sysctl -w kernel.apparmor_restrict_unprivileged_unconfined=0 |
| sudo sysctl -w kernel.apparmor_restrict_unprivileged_userns=0 |
| </pre> |
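<p>These settings do not persist across reboots. To make them permanent, they
can be placed in a sysctl drop-in file (a sketch; the file name is arbitrary)
and applied with <code>sudo sysctl --system</code>:</p>
<pre>
# /etc/sysctl.d/90-unprivileged-containers.conf (example file name)
kernel.unprivileged_userns_clone=1
kernel.apparmor_restrict_unprivileged_unconfined=0
kernel.apparmor_restrict_unprivileged_userns=0
</pre>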
| |
| <p>Docker also provides a tool to verify the kernel configuration: |
| <pre>$ dockerd-rootless-setuptool.sh check --force |
| [INFO] Requirements are satisfied</pre> |
| </p> |
| |
| <h2 id="software">Required software: |
| <a class="slurm_link" href="#software"></a> |
| </h2> |
| <ul> |
| <li>Fully functional |
| <a href="https://github.com/opencontainers/runtime-spec/blob/master/runtime.md"> |
| OCI runtime</a>. It needs to be able to run outside of Slurm first.</li> |
| |
| <li>Fully functional OCI bundle generation tools. Slurm requires OCI |
| Container compliant bundles for jobs.</li> |
| </ul> |
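<p>As a quick sanity check (a sketch assuming <i>runc</i> is the chosen runtime
and <i>skopeo</i>/<i>umoci</i> are used for bundle generation), verify the
tools are installed and on the PATH of the compute nodes:</p>
<pre>
runc --version
skopeo --version
umoci --version
</pre>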
| |
| <h2 id="example">Example configurations for various OCI Runtimes |
| <a class="slurm_link" href="#example"></a> |
| </h2> |
| <p> |
| The <a href="https://github.com/opencontainers/runtime-spec">OCI Runtime |
| Specification</a> provides requirements for all compliant runtimes but |
| does <b>not</b> expressly provide requirements on how runtimes will use |
| arguments. In order to support as many runtimes as possible, Slurm provides |
| pattern replacement for commands issued for each OCI runtime operation. |
| This will allow a site to edit how the OCI runtimes are called as needed to |
| ensure compatibility. |
| </p> |
| <p> |
For <i>runc</i> and <i>crun</i>, two sets of examples are provided.
The OCI runtime specification only defines the <i>create</i> and <i>start</i>
operation sequence, but these runtimes provide a much more efficient <i>run</i>
operation. Sites are strongly encouraged to use the <i>run</i> operation
(if provided), as the <i>create</i> and <i>start</i> operations require that
Slurm poll the OCI runtime to know when the containers have completed execution.
While Slurm attempts to be as efficient as possible with polling, it will
result in a thread using CPU time inside of the job and a slower response by
Slurm in detecting when container execution has completed.
| </p> |
| <p> |
The examples provided have been tested to work but are only suggestions. Sites
are expected to ensure that the resultant root directory used will be secure
from cross-user viewing and modification. The examples provided point to
"/run/user/%U", where %U will be replaced with the numeric user id. Systemd
manages "/run/user/" (independently of Slurm) and will likely need additional
configuration to ensure the directories exist on compute nodes where users do
not log in directly. This configuration is generally achieved by calling
| <a href="https://www.freedesktop.org/software/systemd/man/latest/loginctl.html#enable-linger%20USER%E2%80%A6"> |
| loginctl to enable lingering sessions</a>. Be aware that the directory in this |
| example will be cleaned up by systemd once the user session ends on the node. |
| </p> |
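<p>For example, lingering can be enabled and verified for a given user with
the following commands (a sketch; substitute the actual user name):</p>
<pre>
sudo loginctl enable-linger $USERNAME
loginctl show-user $USERNAME --property=Linger
</pre>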
| |
| <h3 id="runc_create_start">oci.conf example for runc using create/start: |
| <a class="slurm_link" href="#runc_create_start"></a></h3> |
| <p> |
| <pre> |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeQuery="runc --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" |
| RunTimeCreate="runc --rootless=true --root=/run/user/%U/ create %n.%u.%j.%s.%t -b %b" |
| RunTimeStart="runc --rootless=true --root=/run/user/%U/ start %n.%u.%j.%s.%t" |
| RunTimeKill="runc --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" |
| RunTimeDelete="runc --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" |
| </pre> |
| </p> |
| |
| <h3 id="runc_run">oci.conf example for runc using run (recommended over using |
| create/start):<a class="slurm_link" href="#runc_run"></a></h3> |
| <p> |
| <pre> |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeQuery="runc --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" |
| RunTimeKill="runc --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" |
| RunTimeDelete="runc --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" |
| RunTimeRun="runc --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b" |
| </pre> |
| </p> |
| |
| <h3 id="crun_create_start">oci.conf example for crun using create/start: |
| <a class="slurm_link" href="#crun_create_start"></a></h3> |
| <p> |
| <pre> |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" |
| RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" |
| RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" |
| RunTimeCreate="crun --rootless=true --root=/run/user/%U/ create --bundle %b %n.%u.%j.%s.%t" |
| RunTimeStart="crun --rootless=true --root=/run/user/%U/ start %n.%u.%j.%s.%t" |
| </pre> |
| </p> |
| |
| <h3 id="crun_run">oci.conf example for crun using run (recommended over using |
| create/start):<a class="slurm_link" href="#crun_run"></a></h3> |
| <p> |
| <pre> |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeQuery="crun --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" |
| RunTimeKill="crun --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" |
| RunTimeDelete="crun --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" |
| RunTimeRun="crun --rootless=true --root=/run/user/%U/ run --bundle %b %n.%u.%j.%s.%t" |
| </pre> |
| </p> |
| |
| <h3 id="nvidia_create_start"> |
| oci.conf example for nvidia-container-runtime using create/start: |
| <a class="slurm_link" href="#nvidia_create_start"></a></h3> |
| <p> |
| <pre> |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" |
| RunTimeCreate="nvidia-container-runtime --rootless=true --root=/run/user/%U/ create %n.%u.%j.%s.%t -b %b" |
| RunTimeStart="nvidia-container-runtime --rootless=true --root=/run/user/%U/ start %n.%u.%j.%s.%t" |
| RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" |
| RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" |
| </pre> |
| </p> |
| |
| <h3 id="nvidia_run"> |
| oci.conf example for nvidia-container-runtime using run (recommended over using |
| create/start):<a class="slurm_link" href="#nvidia_run"></a></h3> |
| <p> |
| <pre> |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeQuery="nvidia-container-runtime --rootless=true --root=/run/user/%U/ state %n.%u.%j.%s.%t" |
| RunTimeKill="nvidia-container-runtime --rootless=true --root=/run/user/%U/ kill -a %n.%u.%j.%s.%t" |
| RunTimeDelete="nvidia-container-runtime --rootless=true --root=/run/user/%U/ delete --force %n.%u.%j.%s.%t" |
| RunTimeRun="nvidia-container-runtime --rootless=true --root=/run/user/%U/ run %n.%u.%j.%s.%t -b %b" |
| </pre> |
| </p> |
| |
| <h3 id="singularity_native">oci.conf example for |
| <a href="https://docs.sylabs.io/guides/4.1/admin-guide/installation.html"> |
| Singularity v4.1.3</a> using native runtime: |
| <a class="slurm_link" href="#singularity_native"></a></h3> |
| <p> |
| <pre> |
| IgnoreFileConfigJson=true |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeRun="singularity exec --userns %r %@" |
| RunTimeKill="kill -s SIGTERM %p" |
| RunTimeDelete="kill -s SIGKILL %p" |
| </pre> |
| </p> |
| |
| <h3 id="singularity_oci">oci.conf example for |
| <a href="https://docs.sylabs.io/guides/4.0/admin-guide/installation.html"> |
| Singularity v4.0.2</a> in OCI mode: |
| <a class="slurm_link" href="#singularity_oci"></a></h3> |
| <p> |
| Singularity v4.x requires setuid mode for OCI support. |
| <pre> |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeQuery="sudo singularity oci state %n.%u.%j.%s.%t" |
| RunTimeRun="sudo singularity oci run --bundle %b %n.%u.%j.%s.%t" |
| RunTimeKill="sudo singularity oci kill %n.%u.%j.%s.%t" |
| RunTimeDelete="sudo singularity oci delete %n.%u.%j.%s.%t" |
| </pre> |
| </p> |
| |
| <p><b>WARNING</b>: Singularity (v4.0.2) requires <i>sudo</i> or setuid binaries |
| for OCI support, which is a security risk since the user is able to modify |
| these calls. This example is only provided for testing purposes.</p> |
| <p><b>WARNING</b>: |
| <a href="https://groups.google.com/a/lbl.gov/g/singularity/c/vUMUkMlrpQc/m/gIsEiiP7AwAJ"> |
| Upstream singularity development</a> of the OCI interface appears to have |
| ceased and sites should use the <a href="#singularity_native">user |
| namespace support</a> instead.</p> |
| |
| <h3 id="singularity_hpcng">oci.conf example for hpcng Singularity v3.8.0: |
| <a class="slurm_link" href="#singularity_hpcng"></a></h3> |
| <p> |
| <pre> |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeQuery="sudo singularity oci state %n.%u.%j.%s.%t" |
| RunTimeCreate="sudo singularity oci create --bundle %b %n.%u.%j.%s.%t" |
| RunTimeStart="sudo singularity oci start %n.%u.%j.%s.%t" |
| RunTimeKill="sudo singularity oci kill %n.%u.%j.%s.%t" |
| RunTimeDelete="sudo singularity oci delete %n.%u.%j.%s.%t |
| </pre> |
| </p> |
| |
| <p><b>WARNING</b>: Singularity (v3.8.0) requires <i>sudo</i> or setuid binaries |
| for OCI support, which is a security risk since the user is able to modify |
| these calls. This example is only provided for testing purposes.</p> |
| <p><b>WARNING</b>: |
| <a href="https://groups.google.com/a/lbl.gov/g/singularity/c/vUMUkMlrpQc/m/gIsEiiP7AwAJ"> |
| Upstream singularity development</a> of the OCI interface appears to have |
| ceased and sites should use the <a href="#singularity_native">user |
| namespace support</a> instead.</p> |
| |
| <h3 id="charliecloud">oci.conf example for |
| <a href="https://github.com/hpc/charliecloud">Charliecloud</a> (v0.30) |
| <a class="slurm_link" href="#charliecloud"></a></h3> |
| <p> |
| <pre> |
| IgnoreFileConfigJson=true |
| CreateEnvFile=newline |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeRun="env -i PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin/:/sbin/ USER=$(whoami) HOME=/home/$(whoami)/ ch-run -w --bind /etc/group:/etc/group --bind /etc/passwd:/etc/passwd --bind /etc/slurm:/etc/slurm --bind %m:/var/run/slurm/ --bind /var/run/munge/:/var/run/munge/ --set-env=%e --no-passwd %r -- %@" |
| RunTimeKill="kill -s SIGTERM %p" |
| RunTimeDelete="kill -s SIGKILL %p" |
| </pre> |
| </p> |
| |
| <h3 id="enroot">oci.conf example for |
| <a href="https://github.com/NVIDIA/enroot">Enroot</a> (3.3.0) |
| <a class="slurm_link" href="#enroot"></a></h3> |
| <p> |
| <pre> |
| IgnoreFileConfigJson=true |
| CreateEnvFile=newline |
| EnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeEnvExclude="^(SLURM_CONF|SLURM_CONF_SERVER)=" |
| RunTimeRun="/usr/local/bin/enroot-start-wrapper %b %m %e -- %@" |
| RunTimeKill="kill -s SIGINT %p" |
| RunTimeDelete="kill -s SIGTERM %p" |
| </pre> |
| </p> |
| |
| <p>/usr/local/bin/enroot-start-wrapper: |
| <pre> |
| #!/bin/bash |
| BUNDLE="$1" |
| SPOOLDIR="$2" |
| ENVFILE="$3" |
shift 4 # drop the bundle, spool dir, env file and the "--" separator
| IMAGE= |
| |
| export USER=$(whoami) |
| export HOME="$BUNDLE/" |
| export TERM |
| export ENROOT_SQUASH_OPTIONS='-comp gzip -noD' |
| export ENROOT_ALLOW_SUPERUSER=n |
| export ENROOT_MOUNT_HOME=y |
| export ENROOT_REMAP_ROOT=y |
| export ENROOT_ROOTFS_WRITABLE=y |
| export ENROOT_LOGIN_SHELL=n |
| export ENROOT_TRANSFER_RETRIES=2 |
| export ENROOT_CACHE_PATH="$SPOOLDIR/" |
| export ENROOT_DATA_PATH="$SPOOLDIR/" |
| export ENROOT_TEMP_PATH="$SPOOLDIR/" |
| export ENROOT_ENVIRON="$ENVFILE" |
| |
| if [ ! -f "$BUNDLE" ] |
| then |
| IMAGE="$SPOOLDIR/container.sqsh" |
| enroot import -o "$IMAGE" -- "$BUNDLE" && \ |
| enroot create "$IMAGE" |
| CONTAINER="container" |
| else |
| CONTAINER="$BUNDLE" |
| fi |
| |
| enroot start -- "$CONTAINER" "$@" |
| rc=$? |
| |
[ -n "$IMAGE" ] && unlink "$IMAGE"
| |
| exit $rc |
| </pre> |
| </p> |
| |
| <h3 id="multiple-runtimes">Handling multiple runtimes |
| <a class="slurm_link" href="#multiple-runtimes"></a> |
| </h3> |
| |
| <p>If you wish to accommodate multiple runtimes in your environment, |
| it is possible to do so with a bit of extra setup. This section outlines one |
| possible way to do so:</p> |
| |
| <ol> |
| <li>Create a generic oci.conf that calls a wrapper script |
| <pre> |
| IgnoreFileConfigJson=true |
| RunTimeRun="/opt/slurm-oci/run %b %m %u %U %n %j %s %t %@" |
| RunTimeKill="kill -s SIGTERM %p" |
| RunTimeDelete="kill -s SIGKILL %p" |
| </pre> |
| </li> |
| <li>Create the wrapper script to check for user-specific run configuration |
| (e.g., /opt/slurm-oci/run) |
| <pre> |
| #!/bin/bash |
| if [[ -e ~/.slurm-oci-run ]]; then |
| ~/.slurm-oci-run "$@" |
| else |
| /opt/slurm-oci/slurm-oci-run-default "$@" |
| fi |
| </pre> |
| </li> |
| <li>Create a generic run configuration to use as the default |
| (e.g., /opt/slurm-oci/slurm-oci-run-default) |
| <pre> |
| #!/bin/bash --login |
| # Parse |
| CONTAINER="$1" |
| SPOOL_DIR="$2" |
| USER_NAME="$3" |
| USER_ID="$4" |
| NODE_NAME="$5" |
| JOB_ID="$6" |
| STEP_ID="$7" |
| TASK_ID="$8" |
| shift 8 # subsequent arguments are the command to run in the container |
| # Run |
| apptainer run --bind /var/spool --containall "$CONTAINER" "$@" |
| </pre> |
| </li> |
| <li>Add executable permissions to both scripts |
| <pre>chmod +x /opt/slurm-oci/run /opt/slurm-oci/slurm-oci-run-default</pre> |
| </li> |
| </ol> |
| |
<p>Once this is done, users may create a script at '~/.slurm-oci-run' if
they wish to customize the container run process, such as using a different
container runtime. Users should model this file after the default
'/opt/slurm-oci/slurm-oci-run-default'.</p>
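<p>A minimal '~/.slurm-oci-run' sketch that keeps the default apptainer
invocation but adds an extra bind mount (the '/scratch' path is illustrative
only). The file must be executable
(<code>chmod +x ~/.slurm-oci-run</code>):</p>
<pre>
#!/bin/bash --login
CONTAINER="$1"
shift 8 # arguments 2-8 carry the spool dir, user, node, job, step and task IDs
# Same as the site default, with one additional bind mount
apptainer run --bind /var/spool --bind /scratch --containall "$CONTAINER" "$@"
</pre>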
| |
| <h2 id="testing">Testing OCI runtime outside of Slurm |
| <a class="slurm_link" href="#testing"></a> |
| </h2> |
| <p>Slurm calls the OCI runtime directly in the job step. If it fails, |
| then the job will also fail.</p> |
| <ul> |
| <li>Go to the directory containing the OCI Container bundle: |
| <pre>cd $ABS_PATH_TO_BUNDLE</pre></li> |
| |
| <li>Execute OCI Container runtime (You can find a few examples on how to build |
| a bundle <a href="#bundle">below</a>): |
| <pre>$OCIRunTime $ARGS create test --bundle $PATH_TO_BUNDLE</pre> |
| <pre>$OCIRunTime $ARGS start test</pre> |
| <pre>$OCIRunTime $ARGS kill test</pre> |
| <pre>$OCIRunTime $ARGS delete test</pre> |
| If these commands succeed, then the OCI runtime is correctly |
| configured and can be tested in Slurm. |
| </li> |
| </ul> |
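<p>As a concrete example (a sketch assuming the runc configuration above and
the alpine bundle built <a href="#bundle">below</a>), a manual test using the
<i>run</i> operation could look like this:</p>
<pre>
cd ~/oci_images/alpine/
runc --rootless=true --root=/run/user/$(id -u)/ run test
</pre>
<p>This should drop into the container's shell; type <code>exit</code> to
leave. The <i>state</i>, <i>kill</i> and <i>delete</i> operations can be
exercised in the same way using the commands from the chosen oci.conf
example.</p>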
| |
| <h2 id="request">Requesting container jobs or steps |
| <a class="slurm_link" href="#request"></a> |
| </h2> |
| <p> |
| <i>salloc</i>, <i>srun</i> and <i>sbatch</i> (in Slurm 21.08+) have the |
| '--container' argument, which can be used to request container runtime |
execution. The requested job container will not be inherited by the steps
called, except for the batch and interactive steps.
| </p> |
| |
| <ul> |
| <li>Batch step inside of container: |
| <pre>sbatch --container $ABS_PATH_TO_BUNDLE --wrap 'bash -c "cat /etc/*rel*"' |
| </pre></li> |
| |
| <li>Batch job with step 0 inside of container: |
| <pre> |
sbatch --wrap 'srun --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*"'
| </pre></li> |
| |
| <li>Interactive step inside of container: |
| <pre>salloc --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*"</pre></li> |
| |
| <li>Interactive job step 0 inside of container: |
| <pre>salloc srun --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*" |
| </pre></li> |
| |
| <li>Job with step 0 inside of container: |
| <pre>srun --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*"</pre></li> |
| |
| <li>Job with step 1 inside of container: |
| <pre>srun srun --container $ABS_PATH_TO_BUNDLE bash -c "cat /etc/*rel*" |
| </pre></li> |
| </ul> |
| |
| <p><b>NOTE</b>: Commands run with the <code>--container</code> flag are resolved |
| through PATH <i>before</i> they are sent to the container. If the container has |
| a unique file structure, it may be necessary to give the full path to the |
| command or specify <code>--export=NONE</code> to have the container define the |
| PATH to be used: |
| <pre>srun --container $ABS_PATH_TO_BUNDLE --export=NONE bash -c "cat /etc/*rel*" |
| </pre></p> |
| |
| <h2 id="docker-scrun">Integration with Rootless Docker (Docker Engine v20.10+ & Slurm-23.02+) |
| <a class="slurm_link" href="#docker-scrun"></a> |
| </h2> |
| <p>Slurm's <a href="scrun.html">scrun</a> can be directly integrated with <a |
| href="https://docs.docker.com/engine/security/rootless/">Rootless Docker</a> to |
| run containers as jobs. No special user permissions are required and <b>should |
| not</b> be granted to use this functionality.</p> |
| <h3>Prerequisites</h3> |
| <ol> |
| <li><a href="slurm.conf.html">slurm.conf</a> must be configured to use Munge |
| authentication.<pre>AuthType=auth/munge</pre></li> |
| <li><a href="scrun.html#SECTION_Example-<B>scrun.lua</B>-scripts">scrun.lua</a> |
| must be configured for site storage configuration.</li> |
| <li><a href="https://docs.docker.com/engine/security/rootless/#routing-ping-packets"> |
| Configure kernel to allow pings</a></li> |
| <li><a href="https://docs.docker.com/engine/security/rootless/#exposing-privileged-ports"> |
| Configure rootless dockerd to allow listening on privileged ports |
| </a></li> |
| <li><a href="scrun.html#SECTION_Example-%3CB%3Escrun.lua%3C/B%3E-scripts"> |
| scrun.lua</a> must be present on any node where scrun may be run. The |
| example should be sufficient for most environments but paths should be |
| modified to match available local storage.</li> |
| <li><a href="oci.conf.html">oci.conf</a> must be present on any node where any |
| container job may be run. Example configurations for |
| <a href="https://slurm.schedmd.com/containers.html#example"> |
| known OCI runtimes</a> are provided above. Examples may require |
| paths to be correct to installation locations.</li> |
| </ol> |
| <h3>Limitations</h3> |
| <ol> |
| <li>JWT authentication is not supported.</li> |
| <li>Docker container building is not currently functional pending merge of |
| <a href="https://github.com/moby/moby/pull/41442"> Docker pull request</a>.</li> |
<li>Docker does <b>not</b> expose configuration options to disable the security
options needed to run jobs. This requires that all calls to docker provide the
following command line arguments, which can be done via a shell variable, an
alias, a wrapper function, or a wrapper script (see the sketch after this
list):
<pre>--security-opt label=disable --security-opt seccomp=unconfined --security-opt apparmor=unconfined --net=none</pre>
| Docker's builtin security functionality is not required (or wanted) for |
| containers being run by Slurm. Docker is only acting as a container image |
| lifecycle manager. The containers will be executed remotely via Slurm following |
| the existing security configuration in Slurm outside of unprivileged user |
| control.</li> |
| <li>All containers must use the |
| <a href="https://docs.docker.com/network/drivers/none/">"none" networking driver |
| </a>. Attempting to use bridge, overlay, host, ipvlan, or macvlan can result in |
| scrun being isolated from the network and not being able to communicate with |
| the Slurm controller. The container is run by Slurm on the compute nodes which |
| makes having Docker setup a network isolation layer ineffective for the |
| container.</li> |
| <li><code>docker exec</code> command is not supported.</li> |
| <li><code>docker swarm</code> command is not supported.</li> |
| <li><code>docker compose</code>/<code>docker-compose</code> command is not |
| supported.</li> |
| <li><code>docker pause</code> command is not supported.</li> |
| <li><code>docker unpause</code> command is not supported.</li> |
<li><code>docker</code> commands are not supported inside of containers.</li>
| <li><a href="https://docs.docker.com/reference/api/engine/">Docker API</a> is |
| not supported inside of containers.</li> |
| </ol> |
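<p>As noted in the limitations above, every <code>docker run</code> invocation
needs the extra security options. One possible approach (a sketch only; an
exported shell variable or a wrapper script works just as well) is a small
wrapper function:</p>
<pre>
# Wrapper function: inject the required options into "docker run",
# pass every other docker subcommand through unchanged.
docker() {
    if [ "$1" = "run" ]; then
        shift
        command docker run \
            --security-opt label=disable \
            --security-opt seccomp=unconfined \
            --security-opt apparmor=unconfined \
            --net=none "$@"
    else
        command docker "$@"
    fi
}
</pre>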
| |
| <h3>Setup procedure</h3> |
| <ol> |
| <li><a href="https://docs.docker.com/engine/security/rootless/"> Install and |
| configure Rootless Docker</a><br> Rootless Docker must be fully operational and |
| able to run containers before continuing.</li> |
| <li> |
| Setup environment for all docker calls: |
| <pre>export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock</pre> |
| All commands following this will expect this environment variable to be set.</li> |
| <li>Stop rootless docker: <pre>systemctl --user stop docker</pre></li> |
| <li>Configure Docker to call scrun instead of the default OCI runtime. |
| <!-- Docker does not document: --runtime= argument --> |
| <ul> |
| <li>To configure for all users: <pre>/etc/docker/daemon.json</pre></li> |
| <li>To configure per user: <pre>~/.config/docker/daemon.json</pre></li> |
| </ul> |
| Set the following fields to configure Docker: |
| <pre>{ |
| "experimental": true, |
| "iptables": false, |
| "bridge": "none", |
| "no-new-privileges": true, |
| "rootless": true, |
| "selinux-enabled": false, |
| "default-runtime": "slurm", |
| "runtimes": { |
| "slurm": { |
| "path": "/usr/local/bin/scrun" |
| } |
| }, |
| "data-root": "/run/user/${USER_ID}/docker/", |
| "exec-root": "/run/user/${USER_ID}/docker-exec/" |
| }</pre> |
Correct the path to scrun to match the installation prefix used. Replace
${USER_ID} with the numeric user id, or target a different directory with
global write permissions and the sticky bit set. Rootless Docker requires a
different root directory than the system default to avoid permission
errors.</li>
| <li>It is strongly suggested that sites consider using inter-node shared |
| filesystems to store Docker's containers. While it is possible to have a |
| scrun.lua script to push and pull images for each deployment, there can be a |
| massive performance penalty. Using a shared filesystem will avoid moving these |
| files around.<br>Possible configuration additions to daemon.json to use a |
| shared filesystem with <a |
| href="https://docs.docker.com/storage/storagedriver/vfs-driver/"> vfs storage |
| driver</a>: |
| <pre>{ |
| "storage-driver": "vfs", |
| "data-root": "/path/to/shared/filesystem/user_name/data/", |
| "exec-root": "/path/to/shared/filesystem/user_name/exec/", |
| }</pre> |
Any node expected to run containers from Docker must be able to at least read
the filesystem used. Full write privileges are suggested and will be required
if changes to the container filesystem are desired.</li>
| |
<li>Configure dockerd not to set up a network namespace, which would otherwise
break scrun's ability to talk to the Slurm controller.
| <ul> |
| <li>To configure for all users: |
| <pre>/etc/systemd/user/docker.service.d/override.conf</pre></li> |
| <li>To configure per user: |
| <pre>~/.config/systemd/user/docker.service.d/override.conf</pre></li> |
| </ul> |
| <pre> |
| [Service] |
| Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_PORT_DRIVER=none" |
| Environment="DOCKERD_ROOTLESS_ROOTLESSKIT_NET=host" |
| </pre> |
| </li> |
| <li>Reload docker's service unit in systemd: |
| <pre>systemctl --user daemon-reload</pre></li> |
| <li>Start rootless docker: <pre>systemctl --user start docker</pre></li> |
| <li>Verify Docker is using scrun: |
| <pre>export DOCKER_SECURITY="--security-opt label=disable --security-opt seccomp=unconfined --security-opt apparmor=unconfined --net=none" |
| docker run $DOCKER_SECURITY hello-world |
| docker run $DOCKER_SECURITY alpine /bin/printenv SLURM_JOB_ID |
| docker run $DOCKER_SECURITY alpine /bin/hostname |
| docker run $DOCKER_SECURITY -e SCRUN_JOB_NUM_NODES=10 alpine /bin/hostname</pre> |
| </li> |
| </ol> |
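<p>A lighter-weight check (a sketch; the template field assumes a recent Docker
release) is to ask Docker which runtime it considers the default, which should
report the "slurm" runtime configured in daemon.json:</p>
<pre>
$ docker info --format '{{.DefaultRuntime}}'
slurm
</pre>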
| |
| <h2 id="podman-scrun">Integration with Podman (Slurm-23.02+) |
| <a class="slurm_link" href="#podman-scrun"></a> |
| </h2> |
| <p> |
| Slurm's <a href="scrun.html">scrun</a> can be directly integrated with |
| <a href="https://podman.io/">Podman</a> |
| to run containers as jobs. No special user permissions are required and |
| <b>should not</b> be granted to use this functionality. |
| </p> |
| <h3>Prerequisites</h3> |
| <ol> |
<li>Slurm must be fully configured and running on the host running Podman.</li>
| <li><a href="slurm.conf.html">slurm.conf</a> must be configured to use Munge |
| authentication.<pre>AuthType=auth/munge</pre></li> |
| <li><a href="scrun.html">scrun.lua</a> must be configured for site storage |
| configuration.</li> |
| <li><a href="scrun.html#SECTION_Example-%3CB%3Escrun.lua%3C/B%3E-scripts"> |
| scrun.lua</a> must be present on any node where scrun may be run. The |
| example should be sufficient for most environments but paths should be |
| modified to match available local storage.</li> |
| <li><a href="oci.conf.html">oci.conf</a> |
| must be present on any node where any container job may be run. |
| Example configurations for |
| <a href="https://slurm.schedmd.com/containers.html#example"> |
| known OCI runtimes</a> are provided above. Examples may require |
| paths to be correct to installation locations.</li> |
| </ol> |
| <h3>Limitations</h3> |
| <ol> |
| <li>JWT authentication is not supported.</li> |
| <li>All containers must use |
| <a href="https://github.com/containers/podman/blob/main/docs/tutorials/basic_networking.md"> |
| host networking</a></li> |
| <li><code>podman exec</code> command is not supported.</li> |
<li><code>podman-compose</code> command is not supported, as it is only
	partially implemented. Some compositions may work, but each container
	may be run on a different node. The network for all containers must be
	the <code>network_mode: host</code> device.</li>
| <li><code>podman kube</code> command is not supported.</li> |
| <li><code>podman pod</code> command is not supported.</li> |
| <li><code>podman farm</code> command is not supported.</li> |
<li><code>podman</code> commands are not supported inside of containers.</li>
| <li>Podman REST API is not supported inside of containers.</li> |
| </ol> |
| |
| <h3>Setup procedure</h3> |
| <ol> |
| <li><a href="https://podman.io/docs/installation">Install Podman</li> |
| <li><a href="https://github.com/containers/podman/blob/main/docs/tutorials/rootless_tutorial.md"> |
| Configure rootless Podman</a></li> |
| <li>Verify rootless podman is configured |
| <pre>$ podman info --format '{{.Host.Security.Rootless}}' |
| true</pre></li> |
| <li>Verify rootless Podman is fully functional before adding Slurm support: |
| <ul> |
| <li>The value printed by the following commands should be the same: |
| <pre>$ id |
| $ podman run --userns keep-id alpine id</pre> |
| <pre>$ sudo id |
| $ podman run --userns nomap alpine id</pre></li> |
| </ul></li> |
| <li> |
| Configure Podman to call scrun instead of the <a |
| href="https://github.com/opencontainers/runtime-spec"> default OCI runtime</a>. |
| See <a href="https://github.com/containers/common/blob/main/docs/containers.conf.5.md"> |
| upstream documentation</a> for details on configuration locations and loading |
| order for containers.conf. |
| <ul> |
| <li>To configure for all users: |
| <code>/etc/containers/containers.conf</code></li> |
| <li>To configure per user: |
| <code>$XDG_CONFIG_HOME/containers/containers.conf</code> |
| or |
| <code>~/.config/containers/containers.conf</code> |
| (if <code>$XDG_CONFIG_HOME</code> is not defined).</li> |
| </ul> |
| Set the following configuration parameters to configure Podman's containers.conf: |
| <pre>[containers] |
| apparmor_profile = "unconfined" |
| cgroupns = "host" |
| cgroups = "enabled" |
| default_sysctls = [] |
| label = false |
| netns = "host" |
| no_hosts = true |
| pidns = "host" |
| utsns = "host" |
| userns = "host" |
| log_driver = "journald" |
| |
| [engine] |
| cgroup_manager = "systemd" |
| runtime = "slurm" |
| remote = false |
| |
| [engine.runtimes] |
| slurm = [ |
| "/usr/local/bin/scrun", |
| "/usr/bin/scrun" |
| ]</pre> |
Correct the path to scrun to match the installation prefix used.</li>
<li>The "cgroup_manager" field will need to be changed to "cgroupfs" on systems
not running systemd.</li>
| <li>It is strongly suggested that sites consider using inter-node shared |
| filesystems to store Podman's containers. While it is possible to have a |
| scrun.lua script to push and pull images for each deployment, there can be a |
| massive performance penalty. Using a shared filesystem will avoid moving these |
| files around.<br> |
| <ul> |
| <li>To configure for all users: <pre>/etc/containers/storage.conf</pre></li> |
| <li>To configure per user: <pre>$XDG_CONFIG_HOME/containers/storage.conf</pre></li> |
| </ul> |
| Possible configuration additions to storage.conf to use a shared filesystem with |
| <a href="https://docs.podman.io/en/latest/markdown/podman.1.html#storage-driver-value"> |
| vfs storage driver</a>: |
| <pre>[storage] |
| driver = "vfs" |
| runroot = "$HOME/containers" |
| graphroot = "$HOME/containers" |
| |
| [storage.options] |
| pull_options = {use_hard_links = "true", enable_partial_images = "true"} |
| |
| |
| [storage.options.vfs] |
| ignore_chown_errors = "true"</pre> |
Any node expected to run containers from Podman must be able to at least read
the filesystem used. Full write privileges are suggested and will be required
if changes to the container filesystem are desired.</li>
| <li> Verify Podman is using scrun: |
| <pre>podman run hello-world |
| podman run alpine printenv SLURM_JOB_ID |
| podman run alpine hostname |
podman run -e SCRUN_JOB_NUM_NODES=10 alpine hostname
| salloc podman run --env-host=true alpine hostname |
| salloc sh -c 'podman run -e SLURM_JOB_ID=$SLURM_JOB_ID alpine hostname'</pre> |
| </li> |
| <li>Optional: Create alias for Docker: |
| <pre>alias docker=podman</pre> or |
| <pre>alias docker='podman --config=/some/path "$@"'</pre> |
| </li> |
| </ol> |
| |
| <h3>Troubleshooting</h3> |
| <ul> |
| <li>Podman runs out of locks: |
| <pre>$ podman run alpine uptime |
| Error: allocating lock for new container: allocation failed; exceeded num_locks (2048) |
| </pre> |
| <ol> |
| <li>Try renumbering:<pre>podman system renumber</pre></li> |
| <li>Try resetting all storage:<pre>podman system reset</pre></li> |
| </ol> |
| </li> |
| </ul> |
| |
| <h2 id="bundle">OCI Container bundle |
| <a class="slurm_link" href="#bundle"></a> |
| </h2> |
<p>There are multiple ways to generate an OCI Container bundle. The
instructions below describe the methods we found easiest. The OCI standard
provides the requirements for any given bundle:
| <a href="https://github.com/opencontainers/runtime-spec/blob/master/bundle.md"> |
| Filesystem Bundle</a> |
| </p> |
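<p>Whichever tool is used, the resulting bundle is simply a directory holding a
<code>config.json</code> and a root filesystem (named by the
<code>root.path</code> field in <code>config.json</code>; "rootfs" is the
default). For example, a finished bundle might look like:</p>
<pre>
$ ls ~/oci_images/alpine/
config.json  rootfs
</pre>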
| |
| <p>Here are instructions on how to generate a container using a few |
| alternative container solutions:</p> |
| |
| <ul> |
| <li>Create an image and prepare it for use with runc: |
| <ol> |
| <li> |
| Use an existing tool to create a filesystem image in /image/rootfs: |
| <ul> |
| <li> |
| debootstrap: |
| <pre>sudo debootstrap stable /image/rootfs http://deb.debian.org/debian/</pre> |
| </li> |
| <li> |
| yum: |
| <pre>sudo yum --config /etc/yum.conf --installroot=/image/rootfs/ --nogpgcheck --releasever=${CENTOS_RELEASE} -y</pre> |
| </li> |
| <li> |
| docker: |
| <pre> |
| mkdir -p ~/oci_images/alpine/rootfs |
| cd ~/oci_images/ |
| docker pull alpine |
| docker create --name alpine alpine |
| docker export alpine | tar -C ~/oci_images/alpine/rootfs -xf - |
| docker rm alpine</pre> |
| </li> |
</ul>
</li>

<li>
| Configure a bundle for runtime to execute: |
| <ul> |
| <li>Use <a href="https://github.com/opencontainers/runc">runc</a> |
| to generate a config.json: |
| <pre> |
| cd ~/oci_images/alpine |
| runc --rootless=true spec --rootless</pre> |
| </li> |
<li>Test running image:
<pre>
srun --container ~/oci_images/alpine/ uptime</pre>
</li>
</ul>
</li>
| </ol> |
| </li> |
| |
| <li>Use <a href="https://github.com/opencontainers/umoci">umoci</a> |
| and skopeo to generate a full image: |
| <pre> |
| mkdir -p ~/oci_images/ |
| cd ~/oci_images/ |
| skopeo copy docker://alpine:latest oci:alpine:latest |
| umoci unpack --rootless --image alpine ~/oci_images/alpine |
| srun --container ~/oci_images/alpine uptime</pre> |
| </li> |
| |
| <li> |
| Use <a href="https://sylabs.io/guides/3.1/user-guide/oci_runtime.html"> |
| singularity</a> to generate a full image: |
| <pre> |
| mkdir -p ~/oci_images/alpine/ |
| cd ~/oci_images/alpine/ |
| singularity pull alpine |
| sudo singularity oci mount ~/oci_images/alpine/alpine_latest.sif ~/oci_images/alpine |
| mv config.json singularity_config.json |
| runc spec --rootless |
| srun --container ~/oci_images/alpine/ uptime</pre> |
| </li> |
| </ul> |
| |
| <h2 id="ex-ompi5-pmix4">Example OpenMPI v5 + PMIx v4 container |
| <a class="slurm_link" href="#ex-ompi5-pmix4"></a> |
| </h2> |
| |
<p>Minimalist Dockerfile to generate an image with OpenMPI and PMIx for testing
basic MPI jobs.</p>
| |
| <h4>Dockerfile</h4> |
| <pre> |
| FROM almalinux:latest |
| RUN dnf -y update && dnf -y upgrade && dnf install -y epel-release && dnf -y update |
| RUN dnf -y install make automake gcc gcc-c++ kernel-devel bzip2 python3 wget libevent-devel hwloc-devel |
| |
| WORKDIR /usr/local/src/ |
| RUN wget --quiet 'https://github.com/openpmix/openpmix/releases/download/v5.0.7/pmix-5.0.7.tar.bz2' -O - | tar --no-same-owner -xvjf - |
| WORKDIR /usr/local/src/pmix-5.0.7/ |
| RUN ./configure && make -j && make install |
| |
| WORKDIR /usr/local/src/ |
| RUN wget --quiet --inet4-only 'https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.7.tar.bz2' -O - | tar --no-same-owner -xvjf - |
| WORKDIR /usr/local/src/openmpi-5.0.7/ |
| RUN ./configure --disable-pty-support --enable-ipv6 --without-slurm --with-pmix --enable-debug && make -j && make install |
| |
| WORKDIR /usr/local/src/openmpi-5.0.7/examples |
| RUN make && cp -v hello_c ring_c connectivity_c spc_example /usr/local/bin |
| </pre> |
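<p>One way to turn this image into an OCI bundle and run it through Slurm
(a sketch reusing the docker export method from the
<a href="#bundle">bundle section</a>, and assuming Slurm was built with PMIx
support; the image tag and paths are illustrative):</p>
<pre>
docker build -t ompi5-pmix4 .
mkdir -p ~/oci_images/ompi5/rootfs
docker create --name ompi5 ompi5-pmix4
docker export ompi5 | tar -C ~/oci_images/ompi5/rootfs -xf -
docker rm ompi5
cd ~/oci_images/ompi5
runc spec --rootless
srun -N2 --mpi=pmix --container ~/oci_images/ompi5 /usr/local/bin/hello_c
</pre>
<p>As noted in the limitations above, the bundle must be present on every node
in the allocation.</p>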
| |
| <h2 id="plugin">Container support via Plugin |
| <a class="slurm_link" href="#plugin"></a></h2> |
| |
| <p>Slurm allows container developers to create <a href="plugins.html">SPANK |
| Plugins</a> that can be called at various points of job execution to support |
| containers. Any site using one of these plugins to start containers <b>should |
| not</b> have an "oci.conf" configuration file. The "oci.conf" file activates the |
| builtin container functionality which may conflict with the SPANK based plugin |
| functionality.</p> |
| |
| <p>The following projects are third party container solutions that have been |
| designed to work with Slurm, but they have not been tested or validated by |
| SchedMD.</p> |
| |
| <h3 id="shifter">Shifter<a class="slurm_link" href="#shifter"></a></h3> |
| |
| <p><a href="https://github.com/NERSC/shifter">Shifter</a> is a container |
| project out of <a href="http://www.nersc.gov/">NERSC</a> |
| to provide HPC containers with full scheduler integration. |
| |
| <ul> |
| <li>Shifter provides full |
| <a href="https://github.com/NERSC/shifter/wiki/SLURM-Integration"> |
| instructions to integrate with Slurm</a>. |
| </li> |
| <li>Presentations about Shifter and Slurm: |
| <ul> |
| <li> <a href="https://slurm.schedmd.com/SLUG15/shifter.pdf"> |
| Never Port Your Code Again - Docker functionality with Shifter using SLURM |
| </a> </li> |
| <li> <a href="https://www.slideshare.net/insideHPC/shifter-containers-in-hpc-environments"> |
| Shifter: Containers in HPC Environments |
| </a> </li> |
| </ul> |
| </li> |
| </ul> |
| </p> |
| |
| <h3 id="enroot1">ENROOT and Pyxis<a class="slurm_link" href="#enroot1"></a></h3> |
| |
| <p><a href="https://github.com/NVIDIA/enroot">Enroot</a> is a user namespace |
| container system sponsored by <a href="https://www.nvidia.com">NVIDIA</a> |
| that supports: |
| <ul> |
| <li>Slurm integration via |
| <a href="https://github.com/NVIDIA/pyxis">pyxis</a> |
| </li> |
| <li>Native support for Nvidia GPUs</li> |
| <li>Faster Docker image imports</li> |
| </ul> |
| </p> |
| |
| <h3 id="sarus">Sarus<a class="slurm_link" href="#sarus"></a></h3> |
| |
| <p><a href="https://github.com/eth-cscs/sarus">Sarus</a> is a privileged |
| container system sponsored by ETH Zurich |
| <a href="https://user.cscs.ch/tools/containers/sarus/">CSCS</a> that supports: |
| <ul> |
| <li> |
| <a href="https://sarus.readthedocs.io/en/latest/config/slurm-global-sync-hook.html"> |
| Slurm image synchronization via OCI hook</a> |
| </li> |
| <li>Native OCI Image support</li> |
| <li>NVIDIA GPU Support</li> |
| <li>Similar design to <a href="#shifter">Shifter</a></li> |
| </ul> |
| Overview slides of Sarus are |
| <a href="http://hpcadvisorycouncil.com/events/2019/swiss-workshop/pdf/030419/K_Mariotti_CSCS_SARUS_OCI_ContainerRuntime_04032019.pdf"> |
| here</a>. |
| </p> |
| |
| <hr size=4 width="100%"> |
| |
| <p style="text-align:center;">Last modified 26 June 2025</p> |
| |
| <!--#include virtual="footer.txt"--> |