<!--#include virtual="header.txt"-->
<h1>BlueGene User and Administrator Guide</h1>
<h2>Overview</h2>
<p>This document describes the unique features of SLURM on the
<a href="http://www.research.ibm.com/bluegene/">IBM BlueGene</a> systems.
You should be familiar with SLURM's mode of operation on Linux clusters
before studying the relatively few differences in BlueGene operation
described in this document.</p>
<p>BlueGene systems have several unique features making for a few
differences in how SLURM operates there.
The BlueGene system consists of one or more <i>base partitions</i> or
<i>midplanes</i> connected in a three-dimensional torus.
Each <i>base partition</i> consists of 512 <i>c-nodes</i> each containing two processors;
one designed primarily for computations and the other primarily for managing communications.
The <i>c-nodes</i> can execute only one process and thus are unable to execute both
the user's jobs and SLURM's <i>slurmd</i> daemon.
Thus the <i>slurmd</i> daemon executes on one of the BlueGene <i>Front End Nodes</i>.
This single <i>slurmd</i> daemon provides (almost) all of the normal SLURM services
for every <i>base partition</i> on the system. </p>
<p>Internally SLURM treats each <i>base partition</i> as one node with
1024 processors, which keeps the number of entities being managed reasonable.
Since the current BlueGene software can sub-allocate a <i>base partition</i>
into blocks of 32 and/or 128 <i>c-nodes</i>, more than one user job can execute
on each <i>base partition</i> (subject to system administrator configuration).
To effectively utilize this environment, SLURM tools present the user with
the view that each <i>c-node</i> is a separate node, so allocation requests
and status information use <i>c-node</i> counts (this is a new feature in
SLURM version 1.1).
Since the <i>c-node</i> count can be very large, the suffix "k" can be used
to represent multiples of 1024 (e.g. "2k" is equivalent to "2048").</p>
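<p>For example, assuming the system administrator has enabled sub-midplane
allocations, c-node counts might be requested as in the following hedged
sketch (the script name is a placeholder):</p>
<pre>
# Request 32 c-nodes (one node card) for a batch script
sbatch -N32 myscript.sh

# Request 2048 c-nodes (four midplanes) using the "k" suffix
sbatch -N2k myscript.sh
</pre>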
<h2>User Tools</h2>
<p>The normal set of SLURM user tools: sbatch, scancel, sinfo, squeue, and scontrol
provide all of the expected services except support for job steps.
SLURM performs resource allocation for the job, but initiation of tasks is performed
using the <i>mpirun</i> command. SLURM has no concept of a job step on BlueGene.
The following new sbatch options are available:
<i>--geometry</i> (specify job size in each dimension),
<i>--no-rotate</i> (disable rotation of geometry),
<i>--conn-type</i> (specify interconnect type between base partitions, mesh or torus),
<i>--blrts-image</i> (specify an alternative blrts image for the bgblock; the default is used if not set; BlueGene/L only),
<i>--cnload-image</i> (specify an alternative c-node image for the bgblock; the default is used if not set; BlueGene/P only),
<i>--ioload-image</i> (specify an alternative io image for the bgblock; the default is used if not set; BlueGene/P only),
<i>--linux-image</i> (specify an alternative linux image for the bgblock; the default is used if not set; BlueGene/L only),
<i>--mloader-image</i> (specify an alternative mloader image for the bgblock; the default is used if not set), and
<i>--ramdisk-image</i> (specify an alternative ramdisk image for the bgblock; the default is used if not set; BlueGene/L only).
The <i>--nodes</i> option with a minimum and (optionally) maximum node count continues
to be available.
Note that this is a c-node count.</p>
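<p>A hedged example combining several of these options is shown below; the
script name is a placeholder and the option values are purely illustrative:</p>
<pre>
# Request eight midplanes (4k c-nodes) in a fixed 2x2x2 torus
sbatch --geometry=2x2x2 --no-rotate --conn-type=torus -N4k myscript.sh
</pre>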
<p>To reiterate: sbatch is used to submit a job script,
but mpirun is used to launch the parallel tasks.
Note that a SLURM batch job's default stdout and stderr file names are generated
using the SLURM job ID.
When the SLURM control daemon is restarted, SLURM job ID values can be repeated,
therefore it is recommended that batch jobs explicitly specify unique names for
stdout and stderr files using the sbatch options <i>--output</i> and <i>--error</i>
respectively.
While the salloc command may be used to create an interactive SLURM job,
it will be the responsibility of the user to ensure that the <i>bgblock</i>
is ready for use before initiating any mpirun commands.
SLURM will assume this responsibility for batch jobs.
The script that you submit to SLURM can contain multiple invocations of mpirun as
well as any desired commands for pre- and post-processing.
The mpirun command will get its <i>bgblock</i> information from the
<i>MPIRUN_PARTITION</i> environment variable as set by SLURM. A sample script is shown below.
<pre>
#!/bin/bash
# pre-processing
date
# processing
mpirun -exec /home/user/prog -cwd /home/user -args 123
mpirun -exec /home/user/prog -cwd /home/user -args 124
# post-processing
date
</pre></p>
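<p>A hedged example of submitting such a script with explicitly named
stdout and stderr files (the file and script names are placeholders):</p>
<pre>
sbatch -N512 --output=/home/user/run1.out --error=/home/user/run1.err myscript.sh
</pre>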
<h3><a name="naming">Naming Convensions</a></h3>
<p>The naming of base partitions includes a three-digit suffix representing its
coordinates in the X, Y and Z dimensions with a zero origin.
For example, "bg012" represents the base partition whose coordinate is at X=0, Y=1 and Z=2. In a system
configured with <i>small blocks</i> (any block less than a full base partition) the base partition
notation is further subdivided. For example, if there were 64 psets in the
configuration, bg012[0-15] represents
the first quarter or first 16 ionodes of a midplane. In BlueGene/L
this would be a 128 c-node block. To represent the first node card in the
second quarter (ionodes 16-19), the notation would be bg012[16-19], or
a 32 c-node block.
Since jobs must allocate consecutive base partitions in all three dimensions, we have developed
an abbreviated format for describing the base partitions in one of these three-dimensional blocks.
The base partition name has a prefix determined from the system, which is followed by the end-points
of the block enclosed in square-brackets and separated by an "x".
For example, "bg[620x731]" is used to represent the eight base partitions enclosed in a block
with end-points of bg620 and bg731 (bg620, bg621, bg630, bg631, bg720, bg721,
bg730 and bg731).</p>
<p>
<b>IMPORTANT:</b> SLURM version 1.2 or higher can handle a BlueGene system of
sizes up to 36x36x36. To keep with the three-digit suffix
representing the coordinates in the X, Y and Z dimensions with a
zero origin, the letters A-Z are now also supported as valid digits. As a result,
the prefix <b>must always be lower case</b>, and any letters in the
three-digit suffix <b>must always be upper case</b>. This scheme
should be used in your slurm.conf file and in your bluegene.conf file
if you put a prefix there, even though a prefix is not necessary there. This
scheme should also be used to specify midplanes or locations in
the configure mode of smap.
<br>
valid: bgl[000xC44] bgl000 bglZZZ
<br>
invalid: BGL[000xC44] BglC00 bglb00 Bglzzz
</p>
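<p>These names can be used directly with the SLURM commands. A brief sketch,
assuming a node name prefix of "bg" as defined in slurm.conf:</p>
<pre>
# Show information about a single midplane (base partition)
scontrol show node bg000

# Show status of the eight midplanes in a three-dimensional block
sinfo --nodes="bg[620x731]"
</pre>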
<p>One new tool provided is <i>smap</i>.
As of SLURM version 1.2, <i>sview</i> is
another new tool offering even more viewing and configuring options.
Smap is aware of system topography and provides a map of what base partitions
are allocated to jobs, partitions, etc.
See the smap man page for details.
A sample of smap output is provided below showing the location of five jobs.
Note the format of the list of base partitions allocated to each job.
Also note that idle (unassigned) base partitions are indicated by a period.
Down and drained base partitions (those not available for use) are
indicated by a number sign (bg703 in the display below).
The legend is for illustrative purposes only.
The origin (zero in every dimension) is shown at the rear left corner of the bottom plane.
Each set of four consecutive lines represents a plane in the Y dimension.
Values in the X dimension increase to the right.
Values in the Z dimension increase down and toward the left.</p>
<pre>
a a a a b b d d ID JOBID PARTITION BG_BLOCK USER NAME ST TIME NODES BP_LIST
a a a a b b d d a 12345 batch RMP0 joseph tst1 R 43:12 32k bg[000x333]
a a a a b b c c b 12346 debug RMP1 chris sim3 R 12:34 8k bg[420x533]
a a a a b b c c c 12350 debug RMP2 danny job3 R 0:12 4k bg[622x733]
d 12356 debug RMP3 dan colu R 18:05 8k bg[600x731]
a a a a b b d d e 12378 debug RMP4 joseph asx4 R 0:34 2k bg[612x713]
a a a a b b d d
a a a a b b c c
a a a a b b c c
a a a a . . d d
a a a a . . d d
a a a a . . e e Y
a a a a . . e e |
|
a a a a . . d d 0----X
a a a a . . d d /
a a a a . . . . /
a a a a . . . # Z
</pre>
<p>Note that jobs enter the SLURM state RUNNING as soon as they have been
allocated a bgblock.
If the bgblock is in a READY state, the job will begin execution almost
immediately.
Otherwise the execution of the job will not actually begin until the
bgblock is in a READY state, which can require booting the block and
a delay of minutes to do so.
You can identify the bgblock associated with your job using the command
<i>smap -Dj -c</i> and the state of the bgblock with the command
<i>smap -Db -c</i>.
The time to boot a bgblock is related to its size, but should range
from a few minutes to about 15 minutes for a bgblock containing 128
base partitions.
Only after the bgblock is READY will your job's output file be created
and the script execution begin.
If the bgblock boot fails, SLURM will attempt to reboot several times
before draining the associated base partitions and aborting the job.</p>
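<p>For interactive (salloc) use, where the user is responsible for waiting
until the bgblock is READY, a minimal sketch such as the following could be
run before invoking mpirun. It assumes that the block name from
MPIRUN_PARTITION and the word READY appear on the same line of the
<i>smap -Db -c</i> output; adjust the pattern to match your system, or simply
call the supplied <i>slurm_prolog</i> program instead:</p>
<pre>
#!/bin/bash
# Poll the block state until it is READY, then launch the tasks
while ! smap -Db -c | grep "$MPIRUN_PARTITION" | grep -q READY; do
    sleep 5
done
mpirun -exec /home/user/prog -cwd /home/user -args 123
</pre>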
<p>The job will continue to be in a RUNNING state until the bgjob has
completed and the bgblock ownership is changed.
The time for completing a bgjob has frequently been on the order of
five minutes.
In summary, your job may appear in SLURM as RUNNING from up to 15 minutes
before the script actually begins until about 5 minutes after it completes.
These delays are the result of BlueGene infrastructure issues and are
not due to anything in SLURM.</p>
<p>When using smap in default output mode you can scroll through
the different windows using the arrow keys.
The <b>up</b> and <b>down</b> arrow keys scroll
the window containing the grid, and the <b>left</b> and <b>right</b> arrow
keys scroll the window containing the text information.</p>
<p class="footer"><a href="#top">top</a></p>
<h2>System Administration</h2>
<p>Building a BlueGene compatible system is dependent upon the
<i>configure</i> program locating some expected files.
In particular for a BlueGene/L system, the configure script searches
for <i>libdb2.so</i> in the directories <i>/home/bgdb2cli/sqllib</i>
and <i>/u/bgdb2cli/sqllib</i>. If your DB2 library file is in a
different location, use the configure
option <i>--with-db2-dir=PATH</i> to specify the parent directory.
If you have the same version of the operating system on both the
Service Node (SN) and the Front End Nodes (FEN) then you can configure
and build one set of files on the SN and install them on both the SN and FEN.
Note that all smap functionality will be provided on the FEN
except for the ability to map SLURM node names to and from
row/rack/midplane data, which requires direct use of the Bridge API
calls only available on the SN.</p>
<p>If you have different versions of the operating system on the SN and FEN
(as was the case for some early system installations), then you will need
to configure and build two sets of files for installation.
One set will be for the Service Node (SN), which has direct access to the
Bridge APIs.
The second set will be for the Front End Nodes (FEN), which lack access to the
Bridge APIs and instead interact with the slurmctld daemon using Remote
Procedure Calls.
You should see "#define HAVE_BG 1" and "#define HAVE_FRONT_END 1" in the "config.h"
file for both the SN and FEN builds.
You should also see "#define HAVE_BG_FILES 1" in config.h on the SN before
building SLURM. </p>
<p>The slurmctld daemon should execute on the system's service node.
If an optional backup daemon is used, it must be in some location where
it is capable of executing Bridge APIs.
One slurmd daemon should be configured to execute on one of the front end nodes.
That one slurmd daemon represents the communications channel for every base partition.
You can use the scontrol command to drain individual nodes as desired and
return them to service. </p>
<p>The <i>slurm.conf</i> (configuration) file needs to have the value of <i>InactiveLimit</i>
set to zero or not specified (it defaults to a value of zero).
This is because there are no job steps and we don't want to purge jobs prematurely.
The value of <i>SelectType</i> must be set to "select/bluegene" in order to have
node selection performed using a system aware of the system's topography
and interfaces.
The value of <i>Prolog</i> should be set to the full pathname of a program that
will delay execution until the bgblock identified by the MPIRUN_PARTITION
environment variable is ready for use. It is recommended that you construct a script
that serves this function and calls the supplied program <i>sbin/slurm_prolog</i>.
The value of <i>Epilog</i> should be set to the full pathname of a program that
will wait until the bgblock identified by the MPIRUN_PARTITION environment
variable is no longer usable by this job. It is recommended that you construct a script
that serves this function and calls the supplied program <i>sbin/slurm_epilog</i>.
The prolog and epilog programs are used to ensure proper synchronization
between the slurmctld daemon, the user job, and MMCS.
A multitude of other functions may also be placed into the prolog and
epilog as desired (e.g. enabling/disabling user logins, purging file systems,
etc.). Sample prolog and epilog scripts follow, along with a <i>slurm.conf</i>
excerpt showing the settings described above. </p>
<pre>
#!/bin/bash
# Sample BlueGene Prolog script
#
# Wait for bgblock to be ready for this job's use
/usr/sbin/slurm_prolog
#!/bin/bash
# Sample BlueGene Epilog script
#
# Cancel job to start the termination process for this job
# and release the bgblock
/usr/bin/scancel $SLURM_JOB_ID
#
# Wait for bgblock to be released from this job's use
/usr/sbin/slurm_epilog
</pre>
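<p>A minimal <i>slurm.conf</i> excerpt reflecting the settings described
above might look like the following sketch (the prolog and epilog pathnames
are placeholders for the wrapper scripts shown above):</p>
<pre>
InactiveLimit=0
SelectType=select/bluegene
Prolog=/usr/local/slurm/etc/bg_prolog    # wrapper calling /usr/sbin/slurm_prolog
Epilog=/usr/local/slurm/etc/bg_epilog    # wrapper calling /usr/sbin/slurm_epilog
</pre>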
<p>Since jobs with different geometries or other characteristics might not
interfere with each other, scheduling is somewhat different on a BlueGene
system than typical clusters.
SLURM's builtin scheduler on BlueGene will sort pending jobs and then attempt
to schedule <b>all</b> of them in priority order.
This essentially functions as if there is a separate queue for each job size.
SLURM's backfill scheduler on BlueGene will enforce FIFO (first-in first-out)
scheduling with backfill (lower priority jobs will start early if doing so
will not impact the expected initiation time of a higher priority job).
As on other systems, effective backfill relies upon users setting reasonable
job time limits.
Note that SLURM does support different partitions with an assortment of
different scheduling parameters.
For example, SLURM can define a partition for full-system jobs that
is enabled to execute jobs only at certain times, while a default partition
could be configured to execute jobs at other times.
Jobs could still be queued in a partition that is configured in a DOWN
state and scheduled to execute when changed to an UP state.
Base partitions can also be moved between slurm partitions either by changing
the <i>slurm.conf</i> file and restarting the slurmctld daemon or by using
the scontrol reconfig command. </p>
<p>SLURM node and partition descriptions should make use of the
<a href="#naming">naming</a> conventions described above. For example,
"NodeName=bg[000x733] NodeAddr=frontend0 NodeHostname=frontend0 Procs=1024"
is used in <i>slurm.conf</i> to define a BlueGene system with 128 midplanes
in an 8 by 4 by 4 matrix.
The node name prefix of "bg" defined by NodeName can be anything you want,
but needs to be consistent throughout the <i>slurm.conf</i> file.
Note that the values of both NodeAddr and NodeHostname for all
128 base partitions are the name of the front-end node executing
the slurmd daemon.
No computer is actually expected to have a hostname of "bg000" and no
attempt will be made to route message traffic to this address. </p>
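<p>A hedged <i>slurm.conf</i> sketch for such a system follows; the front-end
host name and the partition definition are placeholders:</p>
<pre>
# 128 midplanes in an 8 x 4 x 4 matrix, all served by one front-end node
NodeName=bg[000x733] NodeAddr=frontend0 NodeHostname=frontend0 Procs=1024
PartitionName=batch Nodes=bg[000x733] Default=YES MaxTime=INFINITE State=UP
</pre>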
<p>While users are unable to initiate SLURM job steps on BlueGene systems,
this restriction does not apply to user root or <i>SlurmUser</i>.
Be advised that the one slurmd supporting all nodes is unable to manage a
large number of job steps, so this ability should be used only to verify normal
SLURM operation.
If large numbers of job steps are initiated by slurmd, expect the daemon to
fail due to lack of memory or other resources.
It is best to minimize other work on the front-end node executing slurmd
so as to maximize its performance and minimize other risk factors.</p>
<a name="bluegene-conf"><h2>Bluegene.conf File Creation</h2></a>
<p>In addition to the normal <i>slurm.conf</i> file, a new
<i>bluegene.conf</i> configuration file is required with information pertinent
to the system.
Put <i>bluegene.conf</i> into the SLURM configuration directory with
<i>slurm.conf</i>.
A sample file is installed in <i>bluegene.conf.example</i>.
System administrators should use the <i>smap</i> tool to build an appropriate
configuration file for static partitioning.
Note that <i>smap -Dc</i> can be run without the SLURM daemons
active to establish the initial configuration.
Note that the defined bgblocks may not overlap (except for the
full-system bgblock, which is implicitly created).
See the smap man page for more information.</p>
<p>There are three different modes in which the system administrator can define
the BlueGene partitions (or bgblocks) available to execute jobs: static,
overlap, and dynamic.
Jobs must then execute in one of the created bgblocks.
(<b>NOTE:</b> bgblocks are unrelated to SLURM partitions.)</p>
<p>The default mode of partitioning is <i>static</i>.
In this mode, the system administrator must explicitly define each
of the bgblocks in the <i>bluegene.conf</i> file.
Each of these bgblocks is explicitly configured with either a
mesh or torus interconnect.
They must also not overlap, except for the implicitly defined full-system
bgblock.
Note that bgblocks are not rebooted between jobs in this mode
except when going to/from full-system jobs.
Eliminating bgblock booting can significantly improve system
utilization (eliminating boot time) and reliability.</p>
<p>The second mode is <i>overlap</i> partitioning.
Overlap partitioning is very similar to static partitioning in that
each bgblock must be explicitly defined in the <i>bluegene.conf</i>
file, but these partitions can overlap each other.
In this mode <b>it is highly recommended that none of the bgblocks
have any passthroughs in the X-dimension associated with them</b>.
Usually this is only an issue on larger BlueGene systems.
<b>It is advisable to use this mode with extreme caution.</b>
Make sure you know what you are doing to ensure the bgblocks will
boot without depending on the state of any base partition
not included in the bgblock.</p>
<p>In the two previous modes you must ensure that the base
partitions defined in <i>bluegene.conf</i> are consistent with
those defined in <i>slurm.conf</i>.
Note the <i>bluegene.conf</i> file contains only the numeric
coordinates of base partitions while <i>slurm.conf</i> contains
the name prefix in addition to the numeric coordinates.</p>
<p>The final mode is <i>dynamic</i> partitioning.
Dynamic partitioning was developed primarily for smaller BlueGene systems,
but can be used on larger systems.
Dynamic partitioning may introduce fragmentation of resources.
This fragmentation may be severe, since SLURM will run a job wherever
resources are available with little regard for future scheduling needs.
As with overlap partitioning, <b>use dynamic partitioning with
caution!</b>
This mode can result in job starvation since smaller jobs will run
if resources are available and prevent larger jobs from running.
Bgblocks need not be assigned in the <i>bluegene.conf</i> file
for this mode.</p>
<p>Blocks can be freed or set in an error state with scontrol
(i.e. "<i>scontrol update BlockName=RMP0 state=error</i>").
This will end any job on the block and set the state of the block to ERROR,
so that no job will run on the block. To set it back to a usable
state, set the state to free (i.e.
"<i>scontrol update BlockName=RMP0 state=free</i>").</p>
<p>Alternatively, if only part of a base partition needs to be put
into an error state and it is not already in a block of the size you
need, you can set a range of ionodes into an error state with scontrol
(i.e. "<i>scontrol update subbpname=bg000[0-3] state=error</i>").
This will end any job on the nodes listed, create a block there, and set
the state of the block to ERROR, so that no job will run on the
block. To set it back to a usable state, set the state to free (i.e.
"<i>scontrol update BlockName=RMP0 state=free</i>" or
"<i>scontrol update subbpname=bg000[0-3] state=free</i>"). This is
helpful to allow other jobs to run on the unaffected nodes in
the base partition.</p>
<p>One of these modes must be defined in the <i>bluegene.conf</i> file
with the option <i>LayoutMode=MODE</i> (where MODE=STATIC, DYNAMIC or OVERLAP).</p>
<p>The number of c-nodes in a base partition and in a node card must
be defined.
This is done using the keywords <i>BasePartitionNodeCnt=NODE_COUNT</i>
and <i>NodeCardNodeCnt=NODE_COUNT</i> respectively in the <i>bluegene.conf</i>
file (i.e. <i>BasePartitionNodeCnt=512</i> and <i>NodeCardNodeCnt=32</i>).</p>
<p>Note that the <i>Numpsets</i> value defined in
<i>bluegene.conf</i> is used only when SLURM creates bgblocks; it
determines whether the system is IO rich or not. For most BlueGene/L
systems this value is either 8 (for IO poor systems) or 64 (for IO rich
systems).</p>
<p>The <i>Images</i> can change during job start based on input from
the user.
If you change the bgblock layout, then slurmctld and slurmd should
both be cold-started (e.g. <b>/etc/init.d/slurm startclean</b>).
If you wish to modify the <i>Numpsets</i> values
for existing bgblocks, either modify them manually or destroy the bgblocks
and let SLURM recreate them.
Note that in addition to the bgblocks defined in <i>bluegene.conf</i>, an
additional bgblock is created containing all of the resources defined
in the other bgblocks.
Make use of the SLURM partition mechanism to control access to these
bgblocks.
A sample <i>bluegene.conf</i> file is shown below.
<pre>
###############################################################################
# Global specifications for BlueGene system
#
# BlrtsImage: BlrtsImage used for creation of all bgblocks.
# LinuxImage: LinuxImage used for creation of all bgblocks.
# MloaderImage: MloaderImage used for creation of all bgblocks.
# RamDiskImage: RamDiskImage used for creation of all bgblocks.
#
# You may add extra images which a user can specify from the srun
# command line (see man srun). When adding these images you may also add
# a Groups= at the end of the image path to specify which groups can
# use the image.
#
# AltBlrtsImage: Alternative BlrtsImage(s).
# AltLinuxImage: Alternative LinuxImage(s).
# AltMloaderImage: Alternative MloaderImage(s).
# AltRamDiskImage: Alternative RamDiskImage(s).
#
# LayoutMode: Mode in which slurm will create blocks:
# STATIC: Use defined non-overlapping bgblocks
# OVERLAP: Use defined bgblocks, which may overlap
# DYNAMIC: Create bgblocks as needed for each job
# BasePartitionNodeCnt: Number of c-nodes per base partition
# NodeCardNodeCnt: Number of c-nodes per node card.
# Numpsets: The Numpsets used for creation of all bgblocks
# equals this value multiplied by the number of
# base partitions in the bgblock.
#
# BridgeAPILogFile: Pathname of file in which to write the
# Bridge API logs.
# BridgeAPIVerbose: How verbose the BG Bridge API logs should be
# 0: Log only error and warning messages
# 1: Log level 0 and information messages
# 2: Log level 1 and basic debug messages
# 3: Log level 2 and more debug messages
# 4: Log all messages
# DenyPassthrough: Prevents use of passthrough ports in specific
# dimensions, X, Y, and/or Z, plus ALL
#
# NOTE: The bgl_serial value is set at configuration time using the
# "--with-bgl-serial=" option. Its default value is "BGL".
###############################################################################
# These are the default images which are used if the user doesn't specify
# which image they want
BlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts
LinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf
MloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts
RamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf
#Only group jette can use these images
AltBlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw2.rts Groups=jette
AltLinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage2.elf Groups=jette
AltMloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader2.rts Groups=jette
AltRamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk2.elf Groups=jette
# Since no groups are specified here any user can use them
AltBlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw3.rts
AltLinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage3.elf
AltMloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader3.rts
AltRamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk3.elf
# Another option for images would be a "use anything you like" image, specified as "*"
# This allows the user to use any image entered with no security checking
AltBlrtsImage=* Groups=da,adamb
AltLinuxImage=* Groups=da,adamb
AltMloaderImage=* Groups=da,adamb
AltRamDiskImage=* Groups=da,adamb
LayoutMode=STATIC
BasePartitionNodeCnt=512
NodeCardNodeCnt=32
NumPsets=64 # An I/O rich environment
BridgeAPILogFile=/var/log/slurm/bridgeapi.log
BridgeAPIVerbose=0
#DenyPassthrough=X,Y,Z
###############################################################################
# Define the static/overlap partitions (bgblocks)
#
# BPs: The base partitions (midplanes) in the bgblock using XYZ coordinates
# Type: Connection type "MESH" or "TORUS" or "SMALL", default is "TORUS"
# Type SMALL will divide a midplane into multiple bgblocks
# based off options NodeCards and Quarters to determine type of
# small blocks.
#
# IMPORTANT NOTES:
# * Ordering is very important for laying out switch wires. Please create
# blocks with smap, and once done don't move the order of blocks
# created.
# * A bgblock is implicitly created containing all resources on the system
# * Bgblocks must not overlap (except for implicitly created bgblock)
# This will be the case when smap is used to create a configuration file
# * All Base partitions defined here must also be defined in the slurm.conf file
# * Define only the numeric coordinates of the bgblocks here. The prefix
# will be based upon the name defined in slurm.conf
###############################################################################
# LEAVE NEXT LINE AS A COMMENT, Full-system bgblock, implicitly created
# BPs=[000x001] Type=TORUS # 1x1x2 = 2 midplanes
###############################################################################
# volume = 1x1x1 = 1
BPs=[000x000] Type=TORUS # 1x1x1 = 1 midplane
BPs=[001x001] Type=SMALL 32CNBlocks=4 128CNBlocks=3 # 1x1x1 = 4 node-card sized
                                                    # (32 c-node) blocks plus
                                                    # 3 quarter-midplane sized
                                                    # (128 c-node) blocks
</pre></p>
<p>The above <i>bluegene.conf</i> file defines multiple bgblocks to be
created in a single midplane (see the "SMALL" option).
Using this mechanism, up to 32 independent jobs, each consisting of
32 c-nodes, can be executed
simultaneously on a one-rack BlueGene system.
If defining bgblocks of <i>Type=SMALL</i>, the SLURM partition
containing them as defined in <i>slurm.conf</i> must have the
parameter <i>Shared=force</i> to enable scheduling of multiple
jobs on what SLURM considers a single node.
SLURM partitions that do not contain bgblocks of <i>Type=SMALL</i>
may have the parameter <i>Shared=no</i> for a slight improvement in
scheduler performance.
As in all SLURM configuration files, parameters and values
are case insensitive.</p>
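<p>For example, a hedged <i>slurm.conf</i> sketch matching the sample
<i>bluegene.conf</i> above (partition names and time limits are placeholders):</p>
<pre>
# Midplane bg001 contains Type=SMALL bgblocks, so sharing must be forced
PartitionName=debug Nodes=bg001 Shared=FORCE MaxTime=30 State=UP
# Midplane bg000 contains no SMALL bgblocks, so sharing can be disabled
PartitionName=batch Nodes=bg000 Shared=NO Default=YES MaxTime=INFINITE State=UP
</pre>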
<p> With a BlueGene/P system the image names are different. The
correct image names are CnloadImage, MloaderImage, and IoloadImage.
You can also use alternate images just as described above.</p>
<p>One more thing is required to support SLURM interactions with
the DB2 database (at least as of the time this was written).
DB2 database access is required by the slurmctld daemon only.
All other SLURM daemons and commands interact with DB2 using
remote procedure calls, which are processed by slurmctld.
DB2 access is dependent upon the environment variable
<i>BRIDGE_CONFIG_FILE</i>.
Make sure this is set appropriately before initiating the
slurmctld daemon.
If desired, this environment variable and any other required logic
can be established through the script <i>/etc/sysconfig/slurm</i>,
which is automatically executed by <i>/etc/init.d/slurm</i>
prior to initiating the SLURM daemons.</p>
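<p>A minimal sketch of <i>/etc/sysconfig/slurm</i> is shown below; both
pathnames are placeholders and must be adjusted to your installation:</p>
<pre>
# Point the Bridge API at its configuration file before slurmctld starts
export BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config
# Any other required environment setup can also go here, for example
# sourcing the DB2 profile:
# . /home/bgdb2cli/sqllib/db2profile
</pre>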
<p>When slurmctld is initially started on an idle system, the bgblocks
already defined in MMCS are read using the Bridge APIs.
If these bgblocks do not correspond to those defined in the <i>bluegene.conf</i>
file, the old bgblocks with a prefix of "RMP" are destroyed and new ones
created.
When a job is scheduled, the appropriate bgblock is identified,
its user set, and it is booted.
Node use (virtual or coprocessor) is now set from the mpirun command line;
SLURM plays no role in setting the node use.
Subsequent jobs use this same bgblock without rebooting by changing
the associated user field.
The only time bgblocks should be freed and rebooted, in normal operation,
is when going to or from full-system
jobs (two or more bgblocks sharing base partitions can not be in a
ready state at the same time).
When this logic became available at LLNL, approximately 85 percent of
bgblock boots were eliminated and the overhead of job startup went
from about 24% to about 6% of total job time.
Note that bgblocks will remain in a ready (booted) state when
the SLURM daemons are stopped.
This permits SLURM daemon restarts without loss of running jobs
or rebooting of bgblocks. </p>
<p>Be aware that SLURM will issue multiple bgblock boot requests as
needed (e.g. when the boot fails).
If the bgblock boot requests repeatedly fail, SLURM will configure
the failing base partitions to a DRAINED state so as to avoid continuing
repeated reboots and the likely failure of user jobs.
A system administrator should address the problem before returning
the base partitions to service.</p>
<p>If you cold-start slurmctld (<b>/etc/init.d/slurm startclean</b>
or <b>slurmctld -c</b>) it is recommended that you also cold-start
the slurmd at the same time.
Failure to do so may result in errors being reported by both slurmd
and slurmctld due to bgblocks that previously existed being deleted.</p>
<p>A new tool <i>sfree</i> has also been added to help system
administrators free a bgblock on request (i.e.
"<i>sfree --bgblock=&lt;blockname&gt;</i>").
Run <i>sfree --help</i> for more information.</p>
<h4>Resource Reservations</h4>
<p><b>This reservation mechanism for less than an entire midplane is still
under development.</b></p>
<p>SLURM's advance reservation mechanism is designed to reserve resources
at the level of whole nodes, which on a BlueGene system would represent
whole midplanes. In order to support advanced reservations with a finer
grained resolution, you can configure one license per c-node on the system
and reserve c-nodes instead of entire midplanes. For example, in slurm.conf
specify something of this sort: "<i>Licenses=cnode*512</i>". Then create an
advanced reservation with a command like this:<br>
"<i>scontrol create reservation licenses="cnode*32" starttime=now duration=30:00 users=joe</i>".</p>
<p>There is also a job_submit/cnode plugin available for use that will
automatically set a job's license specification to match its c-node request
(i.e. a command like<br>
"<i>sbatch -N32 my.sh</i>" would automatically be translated to<br>
"<i>sbatch -N32 --licenses=cnode*32 my.sh</i>" by the slurmctld daemon).
Enable this plugin in the slurm.conf configuration file with the option
"<i>JobSubmitPlugins=cnode</i>".</p>
<h4>Debugging</h4>
<p>All of the testing and debugging guidance provided in
<a href="quickstart_admin.html"> Quick Start Administrator Guide</a>
applies to BlueGene systems.
One can start the <i>slurmctld</i> and <i>slurmd</i> in the foreground
with extensive debugging to establish basic functionality.
Once running in production, the configured <i>SlurmctldLog</i> and
<i>SlurmdLog</i> files will provide historical system information.
On BlueGene systems, there is also a <i>BridgeAPILogFile</i> defined
in <i>bluegene.conf</i> which can be configured to contain detailed
information about every Bridge API call issued.</p>
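<p>For example, the daemons can be run in the foreground with verbose
logging (stop any running daemons first):</p>
<pre>
slurmctld -D -vvvv
slurmd -D -vvvv
</pre>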
<p>Note that slurmctld log messages of the sort
<i>Nodes bg[000x133] not responding</i> indicate that the slurmd
daemon serving as a front-end to those base partitions is not responding (on
non-BlueGene systems, the slurmd actually does run on the compute
nodes, so the message is more meaningful there). </p>
<p>Note that you can emulate a BlueGene/L system on a stand-alone Linux
system.
Run <b>configure</b> with the <b>--enable-bgl-emulation</b> option.
This will define "HAVE_BG", "HAVE_BGL", and "HAVE_FRONT_END" in the
config.h file.
You can also emulate a BlueGene/P system with
the <b>--enable-bgp-emulation</b> option.
This will define "HAVE_BG", "HAVE_BGP", and "HAVE_FRONT_END" in the
config.h file.
Then execute <b>make</b> normally.
These defines will build the code as if it were running
on an actual BlueGene computer, but avoid making calls to the
Bridge library (that is controlled by the variable "HAVE_BG_FILES",
which is left undefined). You can use this to test configurations,
scheduling logic, etc. </p>
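<p>A brief sketch of an emulation build, run from the top of the SLURM
source tree:</p>
<pre>
./configure --enable-bgl-emulation
grep -E 'HAVE_BG|HAVE_BGL|HAVE_FRONT_END' config.h   # verify the expected defines
make
</pre>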
<p class="footer"><a href="#top">top</a></p>
<p style="text-align:center;">Last modified 17 March 2009</p>
<!--#include virtual="footer.txt"-->