| <!--#include virtual="header.txt"--> |
| |
| <h1>Slurm Burst Buffer Guide</h1> |
| |
| <ul> |
| <li><a href="#overview">Overview</a></li> |
| <li><a href="#configuration">Configuration (for system administrators)</a> |
| <ul> |
| <li><a href="#common_config">Common Configuration</a></li> |
| <li><a href="#datawarp_config">Datawarp</a></li> |
| <li><a href="#lua_config">Lua</a></li> |
| </ul> |
| </li> |
| <li><a href="#lua-implementation">Lua Implementation (for system |
| administrators)</a> |
| <ul> |
| <li><a href="#burst_buffer_lua">How does burst_buffer.lua run?</a></li> |
| <li><a href="#lua_warnings">Warnings</a></li> |
| </ul> |
| </li> |
| <li><a href="#resources">Burst Buffer Resources</a> |
| <ul> |
| <li><a href="#datawarp_resources">Datawarp</a></li> |
| <li><a href="#lua_resources">Lua</a></li> |
| </ul> |
| </li> |
| <li><a href="#submit">Job Submission Commands</a> |
| <ul> |
| <li><a href="#submit_dw">Datawarp</a></li> |
| <li><a href="#submit_lua">Lua</a></li> |
| </ul> |
| </li> |
| <li><a href="#persist">Persistent Burst Buffer Creation and Deletion Directives</a></li> |
| <li><a href="#het-job-support">Heterogeneous Job Support</a></li> |
| <li><a href="#command-line">Command-line Job Options</a> |
| <ul> |
| <li><a href="#command-line-dw">Datawarp</a></li> |
| <li><a href="#command-line-lua">Lua</a></li> |
| </ul> |
| </li> |
| <li><a href="#symbols">Symbol Replacement</a></li> |
| <li><a href="#status">Status Commands</a></li> |
| <li><a href="#reservation">Advanced Reservations</a></li> |
| <li><a href="#dependencies">Job Dependencies</a></li> |
| <li><a href="#states">Burst Buffer States and Job States</a></li> |
| </ul> |
| |
| <h2 id="overview">Overview<a class="slurm_link" href="#overview"></a></h2> |
| |
| <p>This guide explains how to use Slurm burst buffer plugins. Where appropriate, |
| it also explains how these plugins work, so that you can use them |
| effectively.</p> |
| |
| <p>The Slurm burst buffer plugins call a script at different points during the |
| lifetime of a job:</p> |
| <ol> |
| <li>At job submission</li> |
| <li>While the job is pending after an estimated start time is |
| established. This is called "stage-in."</li> |
| <li>Once the job has been scheduled but has not started running yet. |
| This is called "pre-run."</li> |
| <li>Once the job has completed or been cancelled, but Slurm has not |
| released resources for the job yet. This is called "stage-out."</li> |
| <li>Once the job has completed, and Slurm has released resources for |
| the job. This is called "teardown."</li> |
| </ol> |
| |
| <p>This script runs on the slurmctld node. These are the supported plugins:</p> |
| <ul> |
| <li>datawarp</li> |
| <li>lua</li> |
| </ul> |
| |
| <h3 id="overview-dw">Datawarp |
| <a class="slurm_link" href="#overview-dw"></a> |
| </h3> |
| |
| <p>This plugin provides hooks to Cray's Datawarp APIs. Datawarp implements burst |
| buffers, which are a shared high-speed storage resource. Slurm provides support |
| for allocating these resources, staging files in, scheduling compute nodes for |
| jobs using these resources, and staging files out. Burst buffers can also be |
| used as temporary storage during a job's lifetime, without file staging. |
| Another typical use case is for persistent storage, not associated with any |
| specific job.</p> |
| |
| <h3 id="overview-lua">Lua |
| <a class="slurm_link" href="#overview-lua"></a> |
| </h3> |
| |
| <p>This plugin provides hooks to an API that is defined by a Lua script. This |
| plugin was developed to provide system administrators with a way to do any task |
| (not only file staging) at different points in a job's life cycle. These tasks |
| might include file staging, node maintenance, or any other task that should |
| run at one or more of the five points in a job's lifetime listed above.</p> |
| |
| <p>The burst buffer APIs will only be called for a job that specifically |
| requests using them. The <a href="#submit">Job Submission Commands</a> section |
| explains how a job can request using the burst buffer APIs.</p> |
| |
| |
| <h2 id="configuration">Configuration (for system administrators) |
| <a class="slurm_link" href="#configuration"></a> |
| </h2> |
| |
| <h3 id="common_config">Common Configuration |
| <a class="slurm_link" href="#common_config"></a> |
| </h3> |
| |
| <ul> |
| <li>To enable a burst buffer plugin, set <code>BurstBufferType</code> in |
| slurm.conf. If it is not set, then no burst buffer plugin will be loaded. |
| Only one burst buffer plugin may be specified.</li> |
| <li>In slurm.conf, you may set <code>DebugFlags=BurstBuffer</code> for detailed |
| logging from the burst buffer plugin. This will result in very verbose logging |
| and is not intended for prolonged use in a production system, but this may be |
| useful for debugging.</li> |
| <li><a href="resource_limits.html">TRES limits</a> for burst buffers can be |
| configured by association or QOS in the same way that TRES limits can be |
| configured for nodes, CPUs, or any GRES. To make Slurm track burst buffer |
| resources, add <code>bb/datawarp</code> (for the datawarp plugin) or |
| <code>bb/lua</code> (for the lua plugin) to <code>AccountingStorageTres</code> |
| in slurm.conf.</li> |
| <li>The size of a job's burst buffer requirements can be used as a factor in |
| setting the job priority as described in the |
| <a href="priority_multifactor.html">multifactor priority document</a>. |
| The <a href="#resources">Burst Buffer Resources</a> section explains how |
| these resources are defined.</li> |
| <li>Burst-buffer-specific configurations can be set in burst_buffer.conf. |
| Configuration settings include things like which users may use burst buffers, |
| timeouts, paths to burst buffer scripts, etc. See the |
| <a href="burst_buffer.conf.html">burst_buffer.conf</a> manual |
| for more information.</li> |
| <li>The JSON-C library must be installed in order to build Slurm's |
| <code>burst_buffer/datawarp</code> and <code>burst_buffer/lua</code> plugins, |
| which must parse JSON format data. See Slurm's |
| <a href="related_software.html#json">JSON installation information</a> for |
| details.</li> |
| </ul> |
| |
| <h3 id="datawarp_config">Datawarp |
| <a class="slurm_link" href="#datawarp_config"></a> |
| </h3> |
| |
| <p>slurm.conf:</p> |
| <pre> |
| BurstBufferType=burst_buffer/datawarp |
| </pre> |
| |
| <p>The datawarp plugin calls two scripts:</p> |
| <ul> |
| <li><b>dw_wlm_cli</b> - the Slurm burst_buffer/datawarp plugin calls this |
| script to perform burst buffer functions. It should have been provided by Cray. |
| The location of this script is defined by GetSysState in burst_buffer.conf. A |
| template of this script is provided with Slurm: |
| <code>src/plugins/burst_buffer/datawarp/dw_wlm_cli</code></li> |
| <li><b>dwstat</b> - the Slurm burst_buffer/datawarp plugin calls this script to |
| get status information. It should have been provided by Cray. The location of |
| this script is defined by GetSysStatus in burst_buffer.conf. A template of this |
| script is provided with Slurm: |
| <code>src/plugins/burst_buffer/datawarp/dwstat</code></li> |
| </ul> |
| |
| <h3 id="lua_config">Lua<a class="slurm_link" href="#lua_config"></a></h3> |
| |
| <p>slurm.conf:</p> |
| <pre> |
| BurstBufferType=burst_buffer/lua |
| </pre> |
| |
| <p>The lua plugin calls a single script which must be named burst_buffer.lua. |
| This script needs to exist in the same directory as slurm.conf. The following |
| functions are required to exist, although they may do nothing but return |
| success:</p> |
| <ul> |
| <li><code>slurm_bb_job_process</code></li> |
| <li><code>slurm_bb_pools</code></li> |
| <li><code>slurm_bb_job_teardown</code></li> |
| <li><code>slurm_bb_setup</code></li> |
| <li><code>slurm_bb_data_in</code></li> |
| <li><code>slurm_bb_test_data_in</code></li> |
| <li><code>slurm_bb_real_size</code></li> |
| <li><code>slurm_bb_paths</code></li> |
| <li><code>slurm_bb_pre_run</code></li> |
| <li><code>slurm_bb_post_run</code></li> |
| <li><code>slurm_bb_data_out</code></li> |
| <li><code>slurm_bb_test_data_out</code></li> |
| <li><code>slurm_bb_get_status</code></li> |
| </ul> |
| |
| <p>A template of burst_buffer.lua is provided with Slurm: |
| <code>etc/burst_buffer.lua.example</code></p> |
| |
| <p>This template documents many more details about the functions such as |
| required parameters, when each function is called, return values for each |
| function, and some simple examples.</p> |
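<p>Because every function in the list above must exist, it can be convenient to
check a candidate script before installing it. The following is a minimal sketch
of such a check. <code>check_bb_lua_functions</code> is a hypothetical helper
(it is not part of Slurm), and its pattern only recognizes definitions written
as <code>function name(</code>, so a script that defines its functions in another
way would be reported as incomplete.</p>

```shell
# Hypothetical helper, not part of Slurm: report which of the required
# burst_buffer.lua functions are not defined in the given file.
# Assumption: functions are defined as "function <name>(" at the start of a line
# (leading whitespace allowed).
check_bb_lua_functions() (
    script=$1
    missing=0
    for fn in slurm_bb_job_process slurm_bb_pools slurm_bb_job_teardown \
              slurm_bb_setup slurm_bb_data_in slurm_bb_test_data_in \
              slurm_bb_real_size slurm_bb_paths slurm_bb_pre_run \
              slurm_bb_post_run slurm_bb_data_out slurm_bb_test_data_out \
              slurm_bb_get_status; do
        if ! grep -Eq "^[[:space:]]*function[[:space:]]+${fn}[[:space:]]*\(" "$script"; then
            echo "missing: $fn"
            missing=1
        fi
    done
    # The function body runs in a subshell, so exit sets the return status.
    exit "$missing"
)
```

<p>Called with the path to a script, the helper prints one <code>missing:</code>
line per absent function and returns non-zero if any are absent. It checks only
that the names exist, not that the functions behave correctly.</p>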
| |
| <h2 id="lua-implementation">Lua Implementation |
| <a class="slurm_link" href="#lua-implementation"></a> |
| </h2> |
| |
| <p>The purpose of this section is to provide additional information about the |
| Lua plugin to help system administrators who want to implement the Lua API. |
| The most important points in this section are:</p> |
| <ul> |
| <li>Some functions in burst_buffer.lua must run quickly and cannot be killed; |
| the remaining functions are allowed to run for as long as needed and can be |
| killed.</li> |
| <li>A maximum of 512 copies of burst_buffer.lua are allowed to run concurrently |
| in order to avoid exceeding system limits.</li> |
| </ul> |
| |
| <h3 id="burst_buffer_lua">How does burst_buffer.lua run? |
| <a class="slurm_link" href="#burst_buffer_lua"></a> |
| </h3> |
| |
| <p>Lua scripts may either be run by themselves in a separate process via the |
| <code>fork()</code> and <code>exec()</code> system calls, or they may be called |
| via Lua's C API from within an existing process. One of the goals of the lua |
| plugin was to avoid calling <code>fork()</code> from within slurmctld because |
| it can severely harm performance of the slurmctld. The datawarp plugin calls |
| <code>fork()</code> and <code>exec()</code> from slurmctld for every burst |
| buffer API call, and this has been shown to severely harm slurmctld |
| performance. Therefore, slurmctld calls burst_buffer.lua using Lua's C API |
| instead of using <code>fork()</code>.</p> |
| |
| <p>Some functions in burst_buffer.lua are allowed to run for a long time, but |
| they may need to be killed if the job is cancelled, if slurmctld is restarted, |
| or if they run for longer than the configured timeout in burst_buffer.conf. |
| However, a call to a Lua script via Lua's C API cannot be killed from within |
| the same process; only killing the entire process that called the Lua |
| script can kill the Lua script.</p> |
| |
| <p>To address this situation, burst_buffer.lua is called in two different |
| ways:</p> |
| |
| <ul> |
| <li>The <code>slurm_bb_job_process</code>, <code>slurm_bb_pools</code> and |
| <code>slurm_bb_paths</code> functions are called directly from slurmctld. |
| As explained above, a script running one of these functions cannot be killed. |
| Because these functions are called while slurmctld holds some mutexes, slow |
| implementations will severely harm slurmctld performance and responsiveness. |
| Calling these functions directly is faster than calling <code>fork()</code> to |
| create a new process, so this was deemed an acceptable tradeoff. As a result, |
| <i>these functions cannot be killed</i>.</li> |
| <li>The remaining functions in burst_buffer.lua are able to run longer without |
| adverse effects. These need to be able to be killed. These functions are called |
| from a lightweight Slurm daemon called slurmscriptd. Whenever one of these |
| functions needs to run, slurmctld tells slurmscriptd to run that function; |
| slurmscriptd then calls <code>fork()</code> to create a new process, then calls |
| the appropriate function. This avoids calling <code>fork()</code> from |
| slurmctld while still providing a way to kill running copies of burst_buffer.lua |
| when needed. As a result, <i>these functions can be killed, and they will be |
| killed if they run for longer than the appropriate timeout value as configured |
| in burst_buffer.conf</i>.</li> |
| </ul> |
| |
| <p>The way in which each function is called is also documented in the |
| burst_buffer.lua.example file.</p> |
| |
| <h3 id="lua_warnings">Warnings |
| <a class="slurm_link" href="#lua_warnings"></a> |
| </h3> |
| |
| <p>Do not install a signal handler in burst_buffer.lua because |
| it is called directly from slurmctld. If slurmctld receives a signal, it |
| could attempt to run the signal handler from burst_buffer.lua, even after a call |
| to burst_buffer.lua is completed, which results in a crash.</p> |
| |
| |
| <h2 id="resources">Burst Buffer Resources |
| <a class="slurm_link" href="#resources"></a> |
| </h2> |
| |
| <p>The burst buffer API may define burst buffer resource "pools" from which a |
| job may request a certain amount of pool space. If a pool does not have |
| sufficient space to fulfill a job's request, that job will remain pending until |
| the pool does have enough space. Once the pool has enough space, Slurm may begin |
| stage-in for the job. When stage-in begins, Slurm subtracts the job's requested |
| space from the pool's available space. When teardown completes, Slurm adds the |
| job's requested space back into the pool's available space. The |
| <a href="#submit">Job Submission Commands</a> section explains how a job may |
| request space from a pool. Pool space is a scalar quantity.</p> |
| |
| <h3 id="datawarp_resources">Datawarp |
| <a class="slurm_link" href="#datawarp_resources"></a> |
| </h3> |
| |
| <ul> |
| <li>Pools are defined by <code>dw_wlm_cli</code>, and represent bytes. This |
| script prints a JSON-formatted string defining the pools to stdout.</li> |
| <li>If a job does not request a pool, then the pool defined by |
| <code>DefaultPool</code> in burst_buffer.conf will be used. If a job does |
| not request a pool and <code>DefaultPool</code> |
| is not defined, then the job will be rejected.</li> |
| </ul> |
| |
| <h3 id="lua_resources">Lua |
| <a class="slurm_link" href="#lua_resources"></a> |
| </h3> |
| |
| <ul> |
| <li>Pools are optional in this plugin, and can represent anything.</li> |
| <li><code>DefaultPool</code> in burst_buffer.conf is not used in this |
| plugin.</li> |
| <li>Pools are defined by burst_buffer.lua in the function |
| <code>slurm_bb_pools</code>. If pools are not desired, then this function should |
| just return <code>slurm.SUCCESS</code>. If pools are desired, then this function |
| should return two values: (1) <code>slurm.SUCCESS</code>, and (2) a |
| JSON-formatted string defining the pools. An example is provided in |
| burst_buffer.lua.example. The current valid fields in the JSON string are:</li> |
| <ul> |
| <li><b>id</b> - a string defining the name of the pool</li> |
| <li><b>quantity</b> - a number defining the amount of space in the |
| pool</li> |
| <li><b>granularity</b> - a number defining the lowest resolution of |
| space that may be allocated from this pool. If a job does not request a |
| number that is a multiple of granularity, then the job's request will |
| be rounded up to the nearest multiple of granularity. For example, |
| if granularity equals 1000, then the smallest amount of space that may |
| be allocated from this pool for a single job is 1000. If a job requests |
| less than 1000 units from this pool, then the job's request will be |
| rounded up to 1000.</li> |
| </ul> |
| </ul> |
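<p>The granularity rule described above amounts to rounding a request up to the
next multiple of the pool's granularity. A minimal sketch of that arithmetic
follows. <code>round_up_to_granularity</code> is a hypothetical name used only
for illustration, and the sketch assumes non-negative integer inputs with a
granularity of at least 1.</p>

```shell
# Illustration only: round a requested amount up to the nearest multiple of a
# pool's granularity, using integer arithmetic.
# Usage: round_up_to_granularity <request> <granularity>
round_up_to_granularity() (
    request=$1
    granularity=$2
    echo $(( ($request + $granularity - 1) / $granularity * $granularity ))
)
```

<p>With a granularity of 1000, a request of 250 becomes 1000 and a request of
1001 becomes 2000, while a request of exactly 1000 is left unchanged.</p>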
| |
| |
| <h2 id="submit">Job Submission Commands |
| <a class="slurm_link" href="#submit"></a> |
| </h2> |
| |
| <p>The normal mode of operation is for batch jobs to specify burst buffer |
| requirements within the batch script. Commented batch script lines containing a |
| specific directive (depending on which plugin is being used) will inform Slurm |
| that it should run the burst buffer stages for that job. These lines will also |
| describe the burst buffer requirements for the job.</p> |
| |
| <p>The salloc and srun commands can specify burst buffer requirements with the |
| <code>--bb</code> and <code>--bbf</code> options. This is described in the |
| <a href="#command-line">Command-line Job Options</a> section.</p> |
| |
| <p>All burst buffer directives should be specified in comments at the top of |
| the batch script. They may be placed before, after, or interspersed with any |
| <code>#SBATCH</code> directives. All burst buffer stages happen at specific |
| points in the job's life cycle, as described in the |
| <a href="#overview">Overview</a> section; they do not happen during the job's |
| execution. For example, all of the persistent burst buffer (used only by the |
| datawarp plugin) creations and deletions happen before the job's compute |
| portion happens. In a similar fashion, you can't run stage-in at various points |
| in the script execution; burst buffer stage-in is performed before the job |
| begins and stage-out is performed after the job completes.</p> |
| |
| <p>For both plugins, a job may request a certain amount of space (size or |
| <b>capacity</b>) from a burst buffer resource <b>pool</b>.</p> |
| |
| <ul> |
| <li>A <b>pool</b> specification is simply a string that matches the name of the |
| pool. For example: <code>pool=pool1</code></li> |
| <li>A <b>capacity</b> specification is a number indicating the amount of space |
| required from the pool. A <b>capacity</b> specification can include a suffix of |
| "N" (nodes), "K|KiB", "M|MiB", "G|GiB", "T|TiB", "P|PiB" (for powers of 1024) |
| and "KB", "MB", "GB", "TB", "PB" (for powers of 1000). <b>NOTE</b>: Slurm |
| usually interprets KB, MB, GB, TB and PB units as powers of 1024, but for burst |
| buffer size specifications Slurm supports both IEC and SI formats, because the |
| Cray API supports both formats.</li> |
| </ul> |
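<p>To make the difference between the two suffix families concrete, the sketch
below converts a capacity specification into a plain number using the suffixes
listed above. <code>capacity_to_bytes</code> is a hypothetical helper and is not
how Slurm itself parses the value. It assumes an integer followed by at most one
suffix, returns a bare number unchanged, and rejects the "N" (nodes) suffix
because a node count has no fixed size.</p>

```shell
# Illustration only: expand a capacity specification such as 1GiB or 1GB.
# K/KiB ... P/PiB are powers of 1024; KB ... PB are powers of 1000.
capacity_to_bytes() (
    spec=$1
    num=${spec%%[!0-9]*}      # leading digits
    unit=${spec#"$num"}       # whatever follows the digits
    if [ -z "$num" ]; then
        echo "invalid capacity: $spec" >&2
        exit 1
    fi
    case $unit in
        "")    mult=1 ;;
        K|KiB) mult=1024 ;;
        M|MiB) mult=$((1024 * 1024)) ;;
        G|GiB) mult=$((1024 * 1024 * 1024)) ;;
        T|TiB) mult=$((1024 * 1024 * 1024 * 1024)) ;;
        P|PiB) mult=$((1024 * 1024 * 1024 * 1024 * 1024)) ;;
        KB)    mult=1000 ;;
        MB)    mult=$((1000 * 1000)) ;;
        GB)    mult=$((1000 * 1000 * 1000)) ;;
        TB)    mult=$((1000 * 1000 * 1000 * 1000)) ;;
        PB)    mult=$((1000 * 1000 * 1000 * 1000 * 1000)) ;;
        N)     echo "node counts cannot be converted: $spec" >&2; exit 1 ;;
        *)     echo "unknown suffix in: $spec" >&2; exit 1 ;;
    esac
    echo $(( $num * $mult ))
)
```

<p>For example, <code>capacity_to_bytes 1G</code> and
<code>capacity_to_bytes 1GiB</code> both give 1073741824, whereas
<code>capacity_to_bytes 1GB</code> gives 1000000000.</p>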
| |
| <p>At job submission, Slurm performs basic directive validation and also runs a |
| function in the burst buffer script. This function can perform validation of |
| the directives used in the job script. If Slurm determines options are invalid, |
| or if the burst buffer script returns an error, the job will be rejected and an |
| error message will be returned directly to the user.</p> |
| |
| <p>Note that unrecognized options may be ignored in order to support backward |
| compatibility (i.e. a job submission would not fail in the case of an option |
| recognized by some versions of Slurm, but not recognized by other versions). If |
| the job is accepted, but later fails (e.g. some problem staging files), the job |
| will be held and its "Reason" field will be set to an error message provided by |
| the underlying infrastructure.</p> |
| |
| <p>Users may also request to be notified by email upon completion of burst |
| buffer stage out using the <code>--mail-type=stage_out</code> or |
| <code>--mail-type=all</code> option. The subject line of the email will be of |
| this form:</p> |
| |
| <pre> |
| SLURM Job_id=12 Name=my_app Staged Out, StageOut time 00:05:07 |
| </pre> |
| |
| <p>The following plugin subsections give additional information that is |
| specific to each plugin and provide example job scripts. Command-line examples |
| are given in the |
| <a href="#command-line">Command-line Job Options</a> section.</p> |
| |
| <h3 id="submit_dw">Datawarp |
| <a class="slurm_link" href="#submit_dw"></a> |
| </h3> |
| |
| <p>The directive of <code>#DW</code> (for "DataWarp") is used for burst buffer |
| directives when using the <code>burst_buffer/datawarp</code> plugin. Please |
| reference Cray documentation for details about the DataWarp options. For |
| DataWarp systems, the directive of <code>#BB</code> can be used to create or |
| delete persistent burst buffer storage. |
| <br> |
| <b>NOTE</b>: The <code>#BB</code> directive is used because the |
| command is interpreted by Slurm and not by the Cray Datawarp software. This is |
| discussed further in the <a href="#persist">Persistent Burst Buffer Creation and |
| Deletion Directives</a> section.</p> |
| |
| <p>For job-specific burst buffers, it is required to specify a burst buffer |
| <b>capacity</b>. If the job does not specify <b>capacity</b> then the job will |
| be rejected. A job may also specify the pool from which it wants resources; if |
| the job does not specify a pool, then the pool specified by DefaultPool in |
| burst_buffer.conf will be used (if configured).</p> |
| |
| <p>The following job script requests burst buffer resources from the default |
| pool and requests files to be staged in and staged out:</p> |
| |
| <pre> |
| #!/bin/bash |
| #DW jobdw type=scratch capacity=1GB access_mode=striped,private pfs=/scratch |
| #DW stage_in type=file source=/tmp/a destination=/ss/file1 |
| #DW stage_out type=file destination=/tmp/b source=/ss/file1 |
| srun application.sh |
| </pre> |
| |
| <h3 id="submit_lua">Lua |
| <a class="slurm_link" href="#submit_lua"></a> |
| </h3> |
| |
| <p>The default directive for this plugin is <code>#BB_LUA</code>. The directive |
| used by this plugin may be changed by setting the <b>Directive</b> option in |
| burst_buffer.conf. Since the directive must always begin with a <code>#</code> |
| sign (which starts a comment in a shell script) this option should specify only |
| the string following the <code>#</code> sign. For example, if burst_buffer.conf |
| contains the following:</p> |
| |
| <pre>Directive=BB_EXAMPLE</pre> |
| |
| <p>then the burst buffer directive will be <code>#BB_EXAMPLE</code>.</p> |
| |
| <p>If the <b>Directive</b> option is not specified in burst_buffer.conf, then |
| the default directive for this plugin (<code>#BB_LUA</code>) will be used.</p> |
| |
| <p>Since this plugin was designed to be generic and flexible, this plugin only |
| requires the directive to be given. If the directive is given, Slurm will run |
| all burst buffer stages for the job.</p> |
| |
| <p>Example of the minimum information required for all burst buffer stages to |
| run for the job:</p> |
| |
| <pre> |
| #!/bin/bash |
| #BB_LUA |
| srun application.sh |
| </pre> |
| |
| <p>Because burst buffer pools are optional for this plugin (see the <a |
| href="#resources">Burst Buffer Resources</a> section), a job is not required to |
| specify a pool or capacity. If pools are provided by the burst buffer API, |
| then a job may request a pool and capacity:</p> |
| |
| <pre> |
| #!/bin/bash |
| #BB_LUA pool=pool1 capacity=1K |
| srun application.sh |
| </pre> |
| |
| <p>A job may choose whether or not to specify a pool. If a job does not specify |
| a pool, then the job is still allowed to run and the burst buffer stages will |
| still run for this job (as long as the burst buffer directive was given). If |
| the job specifies a pool but that pool is not found, then the job is |
| rejected.</p> |
| |
| <p>The system administrator may validate burst buffer options in the |
| <code>slurm_bb_job_process</code> function in burst_buffer.lua. This might |
| include requiring a job to specify a pool or validating any additional options |
| that the system administrator decides to implement.</p> |
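<p>When debugging a submission it can help to see which lines of a batch script
would count as burst buffer directives. The sketch below prints the directive
lines found in the leading comment block of a script.
<code>list_bb_directives</code> is a hypothetical helper, not a Slurm command.
It stops at the first line that is neither blank nor a comment, which reflects
the advice to place directives at the top of the script. Whether Slurm's own
parser applies exactly the same rule is an assumption.</p>

```shell
# Illustration only: print burst buffer directive lines from the top of a
# batch script.
# Usage: list_bb_directives <script> [directive]
# The directive is given without the leading "#" and defaults to BB_LUA.
list_bb_directives() (
    script=$1
    directive=${2:-BB_LUA}
    while IFS= read -r line; do
        case $line in
            "#$directive"|"#$directive "*) printf '%s\n' "$line" ;;
            "#"*|"") ;;    # shebang, #SBATCH, other comments, blank lines
            *) break ;;    # first command: stop scanning
        esac
    done < "$script"
)
```

<p>For a datawarp script, <code>list_bb_directives job.sh DW</code> would print
only the <code>#DW</code> lines that appear before the first command. The file
name <code>job.sh</code> is a placeholder.</p>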
| |
| |
| <h2 id="persist">Persistent Burst Buffer Creation and Deletion Directives |
| <a class="slurm_link" href="#persist"></a> |
| </h2> |
| |
| <p>This section only applies to the datawarp plugin, since persistent burst |
| buffers are not used in any other burst buffer plugin.</p> |
| |
| <p>These options are used to create and delete persistent burst buffers:</p> |
| <ul> |
| <li><code>#BB create_persistent name=&lt;name&gt; capacity=&lt;number&gt; |
| [access=&lt;access&gt;] [pool=&lt;pool&gt;] [type=&lt;type&gt;]</code></li> |
| <li><code>#BB destroy_persistent name=&lt;name&gt; [hurry]</code></li> |
| </ul> |
| |
| <p>Options for creating and deleting persistent burst buffers:</p> |
| <ul> |
| <li><b>name</b> - The persistent burst buffer name may not start with a numeric |
| value (numeric names are reserved for job-specific burst buffers).</li> |
| <li><b>capacity</b> - Described in the |
| <a href="#submit">Job Submission Commands</a> section.</li> |
| <li><b>pool</b> - Described in the |
| <a href="#submit">Job Submission Commands</a> section.</li> |
| <li><b>access</b> - The access parameter identifies the buffer access mode. |
| Supported access modes for the datawarp plugin include:</li> |
| <ul> |
| <li>striped</li> |
| <li>private</li> |
| <li>ldbalance</li> |
| </ul> |
| <li><b>type</b> - The type parameter identifies the buffer type. Supported type |
| modes for the datawarp plugin include:</li> |
| <ul> |
| <li>cache</li> |
| <li>scratch</li> |
| </ul> |
| </ul> |
| |
| <p>Multiple persistent burst buffers may be created or deleted within a single |
| job.</p> |
| |
| <p>Example - Creating two persistent burst buffers:</p> |
| |
| <pre> |
| #!/bin/bash |
| #BB create_persistent name=alpha capacity=32GB access=striped type=scratch |
| #BB create_persistent name=beta capacity=16GB access=striped type=scratch |
| srun application.sh |
| </pre> |
| |
| <p>Example - Destroying two persistent burst buffers:</p> |
| |
| <pre> |
| #!/bin/bash |
| #BB destroy_persistent name=alpha |
| #BB destroy_persistent name=beta |
| srun application.sh |
| </pre> |
| |
| <p>Persistent burst buffers can be created and deleted by a job requiring no |
| compute resources. Submit a job with the desired burst buffer directives and |
| specify a node count of zero (e.g. <code>sbatch -N0 setup_buffers.bash</code>). |
| Attempts to submit a zero size job without burst buffer directives or with |
| job-specific burst buffer directives will generate an error. Note that zero |
| size jobs are not supported for job arrays or heterogeneous job |
| allocations.</p> |
| |
| <p><b>NOTE</b>: The ability to create and destroy persistent burst buffers may |
| be limited by the <code>Flags</code> option in the burst_buffer.conf file. |
| See the <a href="burst_buffer.conf.html">burst_buffer.conf</a> man page for |
| more information. |
| By default only <a href="user_permissions.html">privileged users</a> |
| (i.e. Slurm operators and administrators) |
| can create or destroy persistent burst buffers.</p> |
| |
| <h2 id="het-job-support">Heterogeneous Job Support |
| <a class="slurm_link" href="#het-job-support"></a> |
| </h2> |
| |
| <p>Heterogeneous jobs may request burst buffers. Burst buffer hooks will run |
| once for each component that has burst buffer directives. For example, if a |
| heterogeneous job has three components and two of them have burst buffer |
| directives, the burst buffer hooks will run once for each of the two components |
| with burst buffer directives, but not for the third component without burst |
| buffer directives. Further information and examples can be found in the |
| <a href="heterogeneous_jobs.html#burst_buffer">heterogeneous jobs</a> page. |
| </p> |
| |
| <h2 id="command-line">Command-line Job Options |
| <a class="slurm_link" href="#command-line"></a> |
| </h2> |
| |
| <p>In addition to putting burst buffer directives in the batch script, the |
| command-line options <code>--bb</code> and <code>--bbf</code> may also include |
| burst buffer directives. These command-line options are available for salloc, |
| sbatch, and srun. Note that the <code>--bb</code> option cannot create or |
| destroy persistent burst buffers.</p> |
| |
| <p>The <code>--bbf</code> option takes a filename as its argument. That file |
| should contain a collection of burst buffer operations identical to those used |
| for batch jobs.</p> |
| |
| <p>Alternatively, the <code>--bb</code> option may be used to specify burst |
| buffer directives as the option argument. The behavior of this option depends |
| on which burst buffer plugin is used. When the <code>--bb</code> option is |
| used, Slurm parses this option and creates a temporary burst buffer script file |
| that is used internally by the burst buffer plugins.</p> |
| |
| <h3 id="command-line-dw">Datawarp |
| <a class="slurm_link" href="#command-line-dw"></a> |
| </h3> |
| |
| <p>When using the <code>--bb</code> option, the format of the directives can |
| either be identical to those used in a batch script OR a very limited set of |
| options can be used, which are translated to the equivalent script for later |
| processing. The following options are allowed:</p> |
| <ul> |
| <li><code>access=&lt;access&gt;</code></li> |
| <li><code>capacity=&lt;number&gt;</code></li> |
| <li><code>swap=&lt;number&gt;</code></li> |
| <li><code>type=&lt;type&gt;</code></li> |
| <li><code>pool=&lt;name&gt;</code></li> |
| </ul> |
| |
| <p>Multiple options should be space separated. If a swap option is specified, |
| the job must also specify the required node count.</p> |
| |
| <p>Example:</p> |
| |
| <pre> |
| # Sample execute line: |
| srun --bb="capacity=1G access=striped type=scratch" a.out |
| |
| # Equivalent script as generated by Slurm's burst_buffer/datawarp plugin |
| #DW jobdw capacity=1GiB access_mode=striped type=scratch |
| </pre> |
| |
| <h3 id="command-line-lua">Lua |
| <a class="slurm_link" href="#command-line-lua"></a> |
| </h3> |
| |
| <p>This plugin does not do any special parsing or translating of burst buffer |
| directives given by the <code>--bb</code> option. When using the |
| <code>--bb</code> option, the format is identical to the batch script: Slurm |
| only enforces that the burst buffer directive must be specified. See additional |
| information in the Lua subsection of <a href="#submit">Job Submission |
| Commands</a>.</p> |
| |
| <p>Example:</p> |
| |
| <pre> |
| # Sample execute line: |
| srun --bb="#BB_LUA pool=pool1 capacity=1K" |
| |
| # Equivalent script as generated by Slurm's burst_buffer/lua plugin |
| #BB_LUA pool=pool1 capacity=1K |
| </pre> |
| |
| |
| <h2 id="symbols">Symbol Replacement |
| <a class="slurm_link" href="#symbols"></a> |
| </h2> |
| |
| <p>Slurm supports a number of symbols that can be used to automatically |
| fill in certain job details, e.g. to make stage-in or stage-out directory |
| paths vary with each job submission.</p> |
| |
| <p>Supported symbols include:</p> |
| |
| <table border=1 cellspacing=4 cellpadding=4> |
| <tr><td>%%</td><td>%</td></tr> |
| <tr><td>%A</td><td>Array Master Job Id</td></tr> |
| <tr><td>%a</td><td>Array Task Id</td></tr> |
| <tr><td>%d</td><td>Workdir</td></tr> |
| <tr><td>%j</td><td>Job Id</td></tr> |
| <tr><td>%u</td><td>User Name</td></tr> |
| <tr><td>%x</td><td>Job Name</td></tr> |
| <tr><td>\\</td><td>Stop further processing of the line</td></tr> |
| </table> |
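<p>To show what this replacement produces, the sketch below expands a subset of
the symbols in a path template. <code>expand_bb_symbols</code> is a hypothetical
helper for illustration and does not reproduce Slurm's implementation. It
handles only <code>%%</code>, <code>%j</code>, <code>%u</code> and
<code>%x</code>, takes the job values as arguments, and copies everything else
(including the remaining symbols and <code>\\</code>) through unchanged.</p>

```shell
# Illustration only: expand %%, %j, %u and %x in a template string.
# Usage: expand_bb_symbols <template> <job_id> <user_name> <job_name>
expand_bb_symbols() (
    rest=$1 job_id=$2 user=$3 job_name=$4
    out=
    while [ -n "$rest" ]; do
        case $rest in
            "%%"*) out="$out%";         rest=${rest#"%%"} ;;
            "%j"*) out="$out$job_id";   rest=${rest#"%j"} ;;
            "%u"*) out="$out$user";     rest=${rest#"%u"} ;;
            "%x"*) out="$out$job_name"; rest=${rest#"%x"} ;;
            *)     # copy one ordinary character and continue
                   tail=${rest#?}
                   first=${rest%"$tail"}
                   out="$out$first"
                   rest=$tail ;;
        esac
    done
    printf '%s\n' "$out"
)
```

<p>For example, the template <code>/scratch/%u/%x_%j</code> with job ID 1234,
user alice and job name sim expands to <code>/scratch/alice/sim_1234</code>, so
each submission gets its own path.</p>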
| |
| <h2 id="status">Status Commands<a class="slurm_link" href="#status"></a></h2> |
| |
| <p>Burst buffer information that Slurm tracks is available by using the |
| <code>scontrol show burst</code> command or by using the sview command's |
| Burst Buffer tab. Examples follow.</p> |
| |
| <p>Datawarp plugin example:</p> |
| |
| <pre> |
| $ scontrol show burst |
| Name=datawarp DefaultPool=wlm_pool Granularity=200GiB TotalSpace=5800GiB FreeSpace=4200GiB UsedSpace=1600GiB |
| Flags=EmulateCray |
| StageInTimeout=86400 StageOutTimeout=86400 ValidateTimeout=5 OtherTimeout=300 |
| GetSysState=/home/marshall/slurm/master/install/c1/sbin/dw_wlm_cli |
| GetSysStatus=/home/marshall/slurm/master/install/c1/sbin/dwstat |
| Allocated Buffers: |
| JobID=169509 CreateTime=2021-08-11T10:19:06 Pool=wlm_pool Size=1200GiB State=allocated UserID=marshall(1017) |
| JobID=169508 CreateTime=2021-08-11T10:18:46 Pool=wlm_pool Size=400GiB State=staged-in UserID=marshall(1017) |
| Per User Buffer Use: |
| UserID=marshall(1017) Used=1600GiB |
| </pre> |
| |
| <p>Lua plugin example:</p> |
| |
| <pre> |
| $ scontrol show burst |
| Name=lua DefaultPool=(null) Granularity=1 TotalSpace=0 FreeSpace=0 UsedSpace=0 |
| PoolName[0]=pool1 Granularity=1KiB TotalSpace=10000KiB FreeSpace=9750KiB UsedSpace=250KiB |
| PoolName[1]=pool2 Granularity=2 TotalSpace=10 FreeSpace=10 UsedSpace=0 |
| PoolName[2]=pool3 Granularity=1 TotalSpace=4 FreeSpace=4 UsedSpace=0 |
| PoolName[3]=pool4 Granularity=1 TotalSpace=5GB FreeSpace=4GB UsedSpace=1GB |
| Flags=DisablePersistent |
| StageInTimeout=86400 StageOutTimeout=86400 ValidateTimeout=5 OtherTimeout=300 |
| GetSysState=(null) |
| GetSysStatus=(null) |
| Allocated Buffers: |
| JobID=169504 CreateTime=2021-08-11T10:13:38 Pool=pool1 Size=250KiB State=allocated UserID=marshall(1017) |
| JobID=169502 CreateTime=2021-08-11T10:12:06 Pool=pool4 Size=1GB State=allocated UserID=marshall(1017) |
| Per User Buffer Use: |
| UserID=marshall(1017) Used=1000256KB |
| </pre> |
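| |
| <p>When this output needs to be consumed by a script, the per-job lines under |
| "Allocated Buffers" can be reduced to a few columns. The following is a minimal |
| sketch, not part of Slurm: <code>bb_allocated</code> is a hypothetical helper |
| name, and the field names (<code>JobID</code>, <code>Pool</code>, |
| <code>Size</code>, <code>State</code>) are assumed to match the sample output |
| above, which may differ between Slurm versions.</p> |

```shell
# Hypothetical helper: list job burst buffers as "JobID Pool Size State".
# Reads `scontrol show burst` output on stdin. Only lines containing
# "JobID=" are considered, so pool and per-user summary lines are skipped.
bb_allocated() {
    awk '
        /JobID=/ {
            job = pool = size = state = ""
            for (i = 1; i <= NF; i++) {
                # Each field is assumed to have the form Key=Value.
                split($i, kv, "=")
                if (kv[1] == "JobID") job = kv[2]
                else if (kv[1] == "Pool") pool = kv[2]
                else if (kv[1] == "Size") size = kv[2]
                else if (kv[1] == "State") state = kv[2]
            }
            print job, pool, size, state
        }
    '
}
```

| <p>It would typically be used as <code>scontrol show burst | bb_allocated</code>. |
| Sizes are printed exactly as reported, without unit conversion.</p> |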
| |
| <p>Access to a burst buffer status API is available from scontrol using the |
| <code>scontrol show bbstat ...</code> or <code>scontrol show dwstat ...</code> |
| commands. Options following <code>bbstat</code> or <code>dwstat</code> on the |
| scontrol command line are passed directly to the bbstat or dwstat commands, as |
| shown below. In the datawarp plugin, this command calls Cray's dwstat script. |
| See Cray Datawarp documentation for details about dwstat options and output. In |
| the lua plugin, this command calls the <code>slurm_bb_get_status</code> |
| function in burst_buffer.lua.</p> |
| |
| <p>Datawarp plugin example:</p> |
| |
| <pre> |
| /opt/cray/dws/default/bin/dwstat |
| $ scontrol show dwstat |
|     pool units quantity    free gran |
| wlm_pool bytes  7.28TiB 7.28TiB 1GiB |
| |
| $ scontrol show dwstat sessions |
|  sess state      token creator owner             created expiration nodes |
|   832 CA---  783000000  tester 12345 2015-09-08T16:20:36      never    20 |
|   833 CA---  784100000  tester 12345 2015-09-08T16:21:36      never     1 |
|   903 D---- 1875700000  tester 12345 2015-09-08T17:26:05      never     0 |
| |
| $ scontrol show dwstat configurations |
|  conf state inst    type access_type activs |
|   715 CA---  753 scratch      stripe      1 |
|   716 CA---  754 scratch      stripe      1 |
|   759 D--T-  807 scratch      stripe      0 |
|   760 CA---  808 scratch      stripe      1 |
| </pre> |
| |
| <p>A Lua plugin example can be found in the <code>slurm_bb_get_status</code> |
| function in the <code>etc/burst_buffer.lua.example</code> file provided |
| with Slurm.</p> |
| |
| |
| <h2 id="reservation">Advanced Reservations |
| <a class="slurm_link" href="#reservation"></a> |
| </h2> |
| |
| <p>Burst buffer resources can be placed in an advanced reservation using the |
| <i>BurstBuffer</i> option. |
| The argument consists of four elements:<br> |
| <code>[plugin:][pool:]#[units]</code></p> |
| |
| <ul> |
| <li><b>plugin</b> is the burst buffer plugin name, currently either "datawarp" |
| or "lua".</li> |
| <li><b>pool</b> specifies a burst buffer resource pool. |
| If "pool" is not specified, the number is a measure of storage space.</li> |
| <li><b>#</b> (meaning number) should be replaced with a positive integer.</li> |
| <li><b>units</b> has the same format as the suffix of capacity in the |
| <a href="#submit">Job Submission Commands</a> section.</li> |
| </ul> |
| |
| <p>Jobs using this reservation are not restricted to these burst buffer |
| resources, but may use these reserved resources plus any which are generally |
| available. Some examples follow.</p> |
| |
| <pre> |
| $ scontrol create reservation starttime=now duration=60 \ |
| users=alan flags=any_nodes \ |
| burstbuffer=datawarp:100G |
| |
| $ scontrol create reservation StartTime=noon duration=60 \ |
| users=brenda NodeCnt=8 \ |
| BurstBuffer=datawarp:20G |
| |
| $ scontrol create reservation StartTime=16:00 duration=60 \ |
| users=joseph flags=any_nodes \ |
| BurstBuffer=datawarp:pool_test:4G |
| </pre> |
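| |
| <p>The <code>[plugin:][pool:]#[units]</code> format can also be assembled by a |
| small wrapper when reservations are created from scripts. This is only a |
| sketch: <code>bb_spec</code> is a hypothetical helper, not a Slurm command, and |
| it checks nothing beyond the number being a positive integer.</p> |

```shell
# Hypothetical helper: compose a BurstBuffer reservation argument.
# Usage: bb_spec PLUGIN POOL NUMBER [UNITS]
# Pass an empty string for PLUGIN or POOL to omit that element.
bb_spec() {
    case ${3:-} in
        ''|*[!0-9]*)
            echo "bb_spec: NUMBER must be a positive integer" >&2
            return 1 ;;
    esac
    if [ "$3" -le 0 ]; then
        echo "bb_spec: NUMBER must be a positive integer" >&2
        return 1
    fi
    # ${1:+$1:} expands to "PLUGIN:" only when PLUGIN is non-empty.
    printf '%s%s%s%s\n' "${1:+$1:}" "${2:+$2:}" "$3" "${4:-}"
}
```

| <p>For example, <code>bb_spec datawarp pool_test 4 G</code> prints |
| <code>datawarp:pool_test:4G</code>, which could then be passed as |
| <code>BurstBuffer="$(bb_spec datawarp pool_test 4 G)"</code> to |
| <code>scontrol create reservation</code>.</p> |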
| |
| <h2 id="dependencies">Job Dependencies |
| <a class="slurm_link" href="#dependencies"></a> |
| </h2> |
| |
| <p>If two jobs use burst buffers and one is dependent on the other (e.g. |
| <code>sbatch --dependency=afterok:123 ...</code>) then the second job will not |
| begin until the first job completes and its burst buffer stage-out completes. |
| If the second job does not use a burst buffer, but is dependent upon the first |
| job's completion, then it will not wait for the stage-out operation of the first |
| job to complete. |
| In that case, the second job can be made to wait for the first job's stage-out |
| operation to complete by using the "afterburstbuffer" dependency option (e.g. |
| <code>sbatch --dependency=afterburstbuffer:123 ...</code>).</p> |
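| |
| <p>In a submission script, the job ID of the first job is usually captured |
| rather than typed by hand. The following is a minimal sketch of that pattern; |
| <code>submit_after_stage_out</code> is a hypothetical function name and the two |
| batch script paths are supplied by the caller.</p> |

```shell
# Hypothetical helper: submit two batch scripts so that the second one
# starts only after the first job's burst buffer stage-out has completed.
# Usage: submit_after_stage_out FIRST_SCRIPT SECOND_SCRIPT
submit_after_stage_out() {
    first_id=$(sbatch --parsable "$1") || return 1
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    first_id=${first_id%%;*}
    sbatch --parsable --dependency="afterburstbuffer:$first_id" "$2"
}
```

| <p>Calling <code>submit_after_stage_out producer.sh consumer.sh</code> (both |
| file names are placeholders) prints whatever <code>sbatch --parsable</code> |
| reports for the second job.</p> |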
| |
| |
| <h2 id="states">Burst Buffer States and Job States |
| <a class="slurm_link" href="#states"></a> |
| </h2> |
| |
| <p>These are the different possible burst buffer states:</p> |
| |
| <ul> |
| <li><code>pending</code></li> |
| <li><code>allocating</code></li> |
| <li><code>allocated</code></li> |
| <li><code>deleting</code></li> |
| <li><code>deleted</code></li> |
| <li><code>staging-in</code></li> |
| <li><code>staged-in</code></li> |
| <li><code>pre-run</code></li> |
| <li><code>alloc-revoke</code></li> |
| <li><code>running</code></li> |
| <li><code>suspended</code></li> |
| <li><code>post-run</code></li> |
| <li><code>staging-out</code></li> |
| <li><code>teardown</code></li> |
| <li><code>teardown-fail</code></li> |
| <li><code>complete</code></li> |
| </ul> |
| |
| <p>These states appear in the "BurstBufferState" field in the output of |
| <code>scontrol show job</code>. This field only appears for jobs that requested |
| a burst buffer. The states <code>allocating</code>, <code>allocated</code>, |
| <code>deleting</code> and <code>deleted</code> are used |
| for persistent burst buffers only (not for job-specific burst buffers). The |
| state <code>alloc-revoke</code> happens if a failure in Slurm's select plugin |
| occurs in between Slurm allocating resources for a job and actually starting |
| the job. This should never happen.</p> |
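| |
| <p>To read this field from a script, the value can be extracted from the |
| <code>scontrol show job</code> output. This is a sketch only: |
| <code>bb_state</code> is a hypothetical helper, and it assumes the field is |
| printed as <code>BurstBufferState=</code> followed by a value without spaces. |
| </p> |

```shell
# Hypothetical helper: print the BurstBufferState value found in
# `scontrol show job` output read from stdin. Prints nothing if the job
# did not request a burst buffer (the field is then absent).
bb_state() {
    sed -n 's/.*BurstBufferState=\([^ ]*\).*/\1/p'
}
```

| <p>It would typically be used as <code>scontrol show job 169509 | bb_state</code>, |
| where the job ID is taken from the earlier sample output.</p> |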
| <p>When a job requests a burst buffer, this is what the job and burst buffer |
| state transitions look like:</p> |
| |
| <ol> |
| <li>Job is submitted. Job state and burst buffer state are both |
| <code>pending</code>.</li> |
| <li>Burst buffer stage-in starts. Job state: <code>pending</code> with reason: |
| <code>BurstBufferStageIn</code>. Burst buffer state: <code>staging-in</code>. |
| </li> |
| <li>When stage-in completes, the job is eligible to be scheduled (barring any |
| other limits). Job state: <code>pending</code>. Burst buffer state: |
| <code>staged-in</code>.</li> |
| <li>When the job is scheduled and allocated resources, the burst buffer pre-run |
| stage begins. Job state: <code>running+configuring</code>. Burst buffer state: |
| <code>pre-run</code>.</li> |
| <li>When pre-run finishes, the <code>configuring</code> flag is cleared from |
| the job and the job can actually start running. Job state and burst buffer |
| state are both <code>running</code>.</li> |
| <li>When the job completes (even if it fails), burst buffer stage-out starts. |
| Job state: <code>stage-out</code>. Burst buffer state: |
| <code>staging-out</code>.</li> |
| <li>When stage-out completes, teardown starts. Job state: <code>complete</code>. |
| Burst buffer state: <code>teardown</code>.</li> |
| </ol> |
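| |
| <p>The pairing of burst buffer state and job state in the steps above can be |
| written down as a lookup, for example to sanity-check monitoring output. This |
| is a sketch under the assumption that nothing else affects the job state; |
| <code>expected_job_state</code> is a hypothetical helper name, and the job |
| state strings are the ones used in the list above.</p> |

```shell
# Hypothetical helper: print the job state that the list above pairs with
# a given burst buffer state. Returns 1 for states the list does not cover
# (for example post-run, or the persistent-only states).
expected_job_state() {
    case $1 in
        pending|staging-in|staged-in) echo "pending" ;;
        pre-run) echo "running+configuring" ;;
        running) echo "running" ;;
        staging-out) echo "stage-out" ;;
        teardown) echo "complete" ;;
        *) return 1 ;;
    esac
}
```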
| |
| <p>There are some situations which will change the state transitions. Examples |
| include:</p> |
| |
| <ul> |
| <li>Burst buffer operation failures: |
| <ul> |
| <li>If teardown fails, then the burst buffer state changes to |
| <code>teardown-fail</code>. Teardown will be retried. For the burst_buffer/lua |
| plugin, teardown will run a maximum of 3 times before giving up and |
| destroying the burst buffer.</li> |
| <li>If either stage-in or stage-out fails and |
| <code>Flags=TeardownFailure</code> is configured in burst_buffer.conf, then |
| teardown runs. Otherwise, the job is held and the burst buffer remains in the |
| same state so it may be inspected and manually destroyed with |
| <code>scancel --hurry</code>.</li> |
| <li>If pre-run fails, then the job is held and teardown runs.</li> |
| </ul> |
| </li> |
| <li>When a job is cancelled, the current burst buffer script for that job |
| (if running) is killed. If <code>scancel --hurry</code> was used, or if the job |
| never ran, stage-out is skipped and the burst buffer goes straight to teardown. |
| Otherwise, stage-out begins.</li> |
| <li>If slurmctld is stopped, Slurm kills all running burst buffer scripts for |
| all jobs and burst buffer state is saved for each job. When slurmctld restarts, |
| for each job it reads the burst buffer state and does one of the following: |
| <ul> |
| <li><b>Pending</b> - Do nothing, since no burst buffer scripts were |
| killed.</li> |
| <li><b>Staging-in, staged-in</b> - Run teardown, wait for a short time, |
| then restart stage-in.</li> |
| <li><b>Pre-run</b> - Restart pre-run.</li> |
| <li><b>Running</b> - Do nothing, since no burst buffer scripts were |
| killed.</li> |
| <li><b>Post-run, staging-out</b> - Restart post-run.</li> |
| <li><b>Teardown, teardown-fail</b> - Restart teardown.</li> |
| </ul> |
| </li> |
| </ul> |
| |
| <p><b>NOTE</b>: There are many other things not listed here that affect the job |
| state. This document focuses on burst buffers and does not attempt to address |
| all possible job state transitions.</p> |
| |
| <p style="text-align:center;">Last modified 21 August 2023</p> |
| |
| <!--#include virtual="footer.txt"--> |