| <!--#include virtual="header.txt"--> |
| |
| <h1>Slurm User and Administrator Guide for Cray XC Systems</h1> |
| |
| <ul> |
| <li><a href="#user_guide">User Guide</a></li> |
| <li><a href="#features">Cray Specific Features</a></li> |
| <li><a href="#admin_guide">Administrator Guide</a></li> |
| <li><a href="#setup">Cray System Setup</a></li> |
| <li><a href="#ha">High Availability</a></li> |
| <li><a href="http://www.cray.com">Cray</a></li> |
| </ul> |
| |
| <HR SIZE=4> |
| |
| <h2 id="user_guide">User Guide<a class="slurm_link" href="#user_guide"></a></h2> |
| |
| <p>This document describes the unique features of Slurm on Cray XC |
| computers natively, or without the use of Cray's Application Level |
Placement Scheduler (ALPS). You should be familiar with Slurm's
mode of operation on Linux clusters before studying the differences
in Cray system operation described in this document. When running
Slurm in native mode, a Cray system will function very similarly to a
Linux cluster.
| </p> |
| |
| <p>Slurm is designed to operate as a workload manager on Cray XC systems |
(Cascade) without the use of ALPS. In addition to providing the same look and
feel as a regular Linux cluster, this also provides capabilities
such as:
| <ul> |
| <li>Ability to run multiple jobs per node</li> |
<li>Ability to query the status of running jobs with sstat</li>
| <li>Full accounting support for job steps</li> |
| <li>Ability to run multiple jobs/steps in background from the same |
| session</li> |
| </ul> |
| </p> |
| |
| <h2 id="features">Cray Specific Features |
| <a class="slurm_link" href="#features"></a> |
| </h2> |
| <ul> |
| <li>Network Performance Counters</li> |
<p>
To access Cray's Network Performance Counters (NPC), use the
<i>--network</i> option in sbatch/salloc/srun to request them.
There are two different types of counters: system and blade.
</p>
<p>
For the system option (--network=system), only one job at a time can use
the system-wide counters. Only the nodes requested will be marked in use
for the job allocation. If the job does not fill up the entire system,
the remaining nodes cannot be used by other jobs using NPC; if idle,
their state will appear as PerfCnts. These nodes are still
available for other jobs not using NPC.
</p>
<p>
For the blade option (--network=blade), only the nodes requested will
be marked in use for the job allocation. If the job does not fill up
the entire blade(s) allocated to the job, those blade(s) cannot be used
by other jobs using NPC; if idle, their state will appear as PerfCnts.
These nodes are still available for other jobs not using NPC.
</p>
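<p>
For example, a job requesting blade counters might be submitted as
follows (the node count and application name here are placeholders):
</p>
<pre>
$ srun --network=blade -N2 ./my_app
</pre>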
| |
| <li>Core Specialization</li> |
| <p> |
To use this feature, set <b><i>CoreSpecPlugin=core_spec/cray_aries</i></b>.
Core specialization reserves a number of the cores allocated to the job
for system operations rather than for the application. The application
will not use these cores, but will be charged for their allocation.
| </p> |
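<p>
As an illustrative sketch, assuming core specialization is enabled as
described above, a job could reserve two cores per node for system
operations using the <i>--core-spec</i> option (the script name and
values here are placeholders):
</p>
<pre>
$ sbatch --core-spec=2 -N4 my_batch_script.sh
</pre>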
| </ul> |
| |
| <h2 id="admin_guide">Admin Guide |
| <a class="slurm_link" href="#admin_guide"></a> |
| </h2> |
| <p> |
Many new plugins were added to utilize the Cray system without
ALPS. These should be set up in your slurm.conf in addition to your
normal configuration.
| <ul> |
| |
| <li>AcctGatherEnergyType</li> |
| <p> |
| Set <b><i>AcctGatherEnergyType=acct_gather_energy/pm_counters</i></b> to |
| have the Cray XC baseboard management controller report energy usage |
| data to Slurm. |
| </p> |
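<p>
A minimal slurm.conf sketch; the sampling interval shown is an
illustrative value, not a requirement:
</p>
<pre>
AcctGatherEnergyType=acct_gather_energy/pm_counters
AcctGatherNodeFreq=30
</pre>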
| |
| <li>BurstBuffer</li> |
| <p> |
Set <b><i>BurstBufferType=burst_buffer/datawarp</i></b> to use this plugin.
| The burst buffer capability on Cray systems is also known by the name |
| <i>DataWarp</i>. |
| For more information, see |
| <a href="burst_buffer.html">Slurm Burst Buffer Guide</a>. |
| </p> |
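<p>
As a sketch of how a user might then request a burst buffer in a batch
script (the capacity and mode values are illustrative only; see the
Burst Buffer Guide for the full directive syntax):
</p>
<pre>
#!/bin/bash
#DW jobdw type=scratch access_mode=striped capacity=1TiB
srun ./my_app
</pre>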
| |
| <li>CoreSpec</li> |
| <p> |
Set <b><i>CoreSpecPlugin=core_spec/cray_aries</i></b> to use this plugin.
| </p> |
| |
| <li>JobSubmit</li> |
| <p> |
| Set <b><i>JobSubmitPlugins=job_submit/cray_aries</i></b> to use. |
| This plugin is primarily used to set a gres=craynetwork value which |
| is used to limit the number of applications that can run on a node |
| at once. That number can be at most 4. This craynetwork gres needs |
| to be set up in your slurm.conf to ensure proper functionality. |
For example:
<pre>
...
GresTypes=craynetwork
NodeName=nid000[00-10] Gres=craynetwork:4
...
</pre>
| </p> |
| |
| <li>Power</li> |
| <p> |
| Set <b><i>PowerPlugin=power/cray_aries</i></b> to use. |
| <b><i>PowerParameters</i></b> is also typically configured. |
| For more information, see |
| <a href="power_mgmt.html">Slurm Power Management Guide</a>. |
| </p> |
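<p>
An illustrative configuration sketch; the parameter values shown are
examples only, not recommendations:
</p>
<pre>
PowerPlugin=power/cray_aries
PowerParameters=capmc_path=/opt/cray/capmc/default/bin/capmc,cap_watts=0
</pre>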
| |
| <li>Proctrack</li> |
| <p> |
| Set <b><i>ProctrackType=proctrack/cray_aries</i></b> to use. |
| </p> |
| |
| <li>Select</li> |
| <p> |
Set <b><i>SelectType=select/cray_aries</i></b> to use. This plugin is
a layered plugin, meaning it enhances a lower-level select
plugin. By default it is layered on top of the <i>select/linear</i>
plugin. It can also be layered on top of the <i>select/cons_tres</i> plugin
by setting <b><i>SelectTypeParameters=other_cons_tres</i></b>;
doing this allows you to run multiple jobs on a Cray node just
like on a normal Linux cluster.
| Use additional <b><i>SelectTypeParameters</i></b> to identify the resources |
| to allocate (e.g. cores, sockets, memory, etc.). See the slurm.conf man |
| page for details. |
| </p> |
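<p>
A minimal sketch of the cons_tres layering described above; the
CR_Core_Memory setting is just one possible choice of resources to
track:
</p>
<pre>
SelectType=select/cray_aries
SelectTypeParameters=other_cons_tres,CR_Core_Memory
</pre>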
| |
| <li>SlurmctldPort, SlurmdPort, SrunPortRange</li> |
| <p> |
| Realm-Specific IP Addressing (RSIP) will automatically try to interact with |
| anything opened on ports 8192 to 60000. Configure SlurmctldPort, SlurmdPort, |
| and SrunPortRange to use ports above 60000. In the case of SrunPortRange, |
| making 1000 or more ports available is recommended. |
| </p> |
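<p>
For example, one possible port layout above the RSIP range (these
specific port numbers are illustrative, not required):
</p>
<pre>
SlurmctldPort=61000
SlurmdPort=61001
SrunPortRange=62000-63000
</pre>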
| |
| <li>Switch</li> |
| <p> |
| Set <b><i>SwitchType=switch/cray_aries</i></b> to use. |
| </p> |
| |
| <li>Task</li> |
| <p> |
| Set <b><i>TaskPlugin=cray_aries,cgroup</i></b> to use. Use of the |
| <b><i>task/cgroup</i></b> plugin is required alongside <i>task/cray_aries</i>. |
| You may also use the <i>task/affinity</i> plugin along with |
| <i>task/cray_aries,task/cgroup</i> if desired (i.e. |
| <b><i>TaskPlugin=cray_aries,affinity,cgroup</i></b>). Note that plugins |
| are used in the order they are defined in the comma separated list, |
| and that <i>task/cray_aries</i> must be listed before <i>task/cgroup</i> |
| due to internal dependencies between the two plugins. |
| </p> |
| </ul> |
| |
| <h2 id="setup">Cray system setup<a class="slurm_link" href="#setup"></a></h2> |
<p>Some Slurm plugins (burst_buffer/datawarp and power/cray_aries)
parse JSON format data.
| These plugins are designed to make use of the JSON-C library for this purpose. |
| See <a href="download.html#json">JSON-C installation instructions</a> for |
| details.</p> |
| |
| <p> |
Some services on the system need to be set up to run correctly with
Slurm. Below are the commands to restart those services and the nodes they
run on. It is probably a good idea to set this up to happen automatically.
| <ul> |
| <li>boot node</li> |
| <ul> |
| <li>WLM_DETECT_ACTIVE=SLURM /etc/init.d/aeld restart</li> |
| </ul> |
| <li>sdb node</li> |
| <ul> |
| <li>WLM_DETECT_ACTIVE=SLURM /etc/init.d/ncmd restart</li> |
<li>WLM_DETECT_ACTIVE=SLURM /etc/init.d/apptermd restart</li>
| </ul> |
| </ul> |
| <p> |
As with Linux clusters, you will need to start a slurmd on each of your
compute nodes. If you choose to use MUNGE authentication, which is
advised, you will also need MUNGE installed and a munged daemon running
on each of your compute nodes as well. See the <a href="quickstart_admin.html">
quick start guide</a> for more information. Apart from the differences
described in this document, that guide can be used to set up your Cray
system to run Slurm natively.
| </p> |
| <p> |
| On larger systems, you may wish to set the PMI_MMAP_SYNC_WAIT_TIME |
| environment variable in your users' profiles to a larger value than |
| the default (180 seconds) to prevent PMI from falsely detecting job |
| launch failures. |
| </p> |
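<p>
For example, a line such as the following could be added to a shared
profile script (the 300 second value is an arbitrary illustration):
</p>
<pre>
export PMI_MMAP_SYNC_WAIT_TIME=300
</pre>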
| <h2 id="ha">High Availability<a class="slurm_link" href="#ha"></a></h2> |
| <p> |
A backup controller can be set up inside or outside the Cray. However, when the
backup is within the Cray, both the primary and the backup controllers will go
down when the Cray is rebooted. It is best to set up the backup controller on a
node external to the Cray so that the controller can still receive new jobs when
the Cray is down. When the backup is configured on an external node, the
<b><i>no_backup_scheduling</i></b> <b><i>SchedulerParameter</i></b> should be
specified in slurm.conf. This allows new jobs to be submitted while the Cray
is down, but prevents the backup controller from starting any new jobs.
| </p> |
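<p>
A minimal sketch of such a configuration; the hostnames are
placeholders:
</p>
<pre>
SlurmctldHost=primary-controller-host
SlurmctldHost=external-backup-host
SchedulerParameters=no_backup_scheduling
</pre>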
| |
| |
| <p style="text-align:center;">Last modified 27 July 2023</p> |
| |
| <!--#include virtual="footer.txt"--> |