| <!--#include virtual="header.txt"--> |
| |
| <h1>BlueGene User and Administrator Guide</h1> |
| |
| <h2>Overview</h2> |
| |
| <p>This document describes the unique features of SLURM on the |
| <a href="http://www.research.ibm.com/bluegene/">IBM BlueGene</a> systems. |
You should be familiar with SLURM's mode of operation on Linux clusters
| before studying the relatively few differences in BlueGene operation |
| described in this document.</p> |
| |
<p>BlueGene systems have several unique features that result in a few
differences in how SLURM operates on them.
| The BlueGene system consists of one or more <i>base partitions</i> or |
| <i>midplanes</i> connected in a three-dimensional torus. |
| Each <i>base partition</i> consists of 512 <i>c-nodes</i> each containing two processors; |
| one designed primarily for computations and the other primarily for managing communications. |
The <i>c-nodes</i> can execute only one process and thus are unable to execute both
the user's job and SLURM's <i>slurmd</i> daemon.
Instead, the <i>slurmd</i> daemon executes on one of the BlueGene <i>Front End Nodes</i>.
| This single <i>slurmd</i> daemon provides (almost) all of the normal SLURM services |
| for every <i>base partition</i> on the system. </p> |
| |
| <p>Internally SLURM treats each <i>base partition</i> as one node with |
| 1024 processors, which keeps the number of entities being managed reasonable. |
| Since the current BlueGene software can sub-allocate a <i>base partition</i> |
| into blocks of 32 and/or 128 <i>c-nodes</i>, more than one user job can execute |
| on each <i>base partition</i> (subject to system administrator configuration). |
| To effectively utilize this environment, SLURM tools present the user with |
the view that each <i>c-node</i> is a separate node, so allocation requests
| and status information use <i>c-node</i> counts (this is a new feature in |
| SLURM version 1.1). |
| Since the <i>c-node</i> count can be very large, the suffix "k" can be used |
| to represent multiples of 1024 (e.g. "2k" is equivalent to "2048").</p> |
| |
| <h2>User Tools</h2> |
| |
<p>The normal set of SLURM user tools (sbatch, scancel, sinfo, squeue, and scontrol)
provides all of the expected services except support for job steps.
| SLURM performs resource allocation for the job, but initiation of tasks is performed |
| using the <i>mpirun</i> command. SLURM has no concept of a job step on BlueGene. |
Nine new sbatch options are available:
<i>--geometry</i> (specify job size in each dimension),
<i>--no-rotate</i> (disable rotation of geometry),
<i>--conn-type</i> (specify interconnect type between base partitions, mesh or torus),
<i>--blrts-image</i> (specify alternative blrts image for the bluegene block; default used if not set; BGL only),
<i>--cnload-image</i> (specify alternative c-node image for the bluegene block; default used if not set; BGP only),
<i>--ioload-image</i> (specify alternative io image for the bluegene block; default used if not set; BGP only),
<i>--linux-image</i> (specify alternative linux image for the bluegene block; default used if not set; BGL only),
<i>--mloader-image</i> (specify alternative mloader image for the bluegene block; default used if not set), and
<i>--ramdisk-image</i> (specify alternative ramdisk image for the bluegene block; default used if not set; BGL only).
The <i>--nodes</i> option with a minimum and (optionally) maximum node count continues
to be available. Note that this is a c-node count.</p>
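<p>For example, an illustrative submission (the script name and sizes are hypothetical,
and the geometry is expressed in base partitions) requesting eight base partitions
wired as a torus might look like:
<pre>
# Request 4k c-nodes (eight base partitions) in a 2x2x2 geometry, torus
# wired, with rotation of the requested geometry disabled.
sbatch --nodes=4k --geometry=2x2x2 --conn-type=torus --no-rotate myscript.sh
</pre></p>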
| |
| <p>To reiterate: sbatch is used to submit a job script, |
| but mpirun is used to launch the parallel tasks. |
| Note that a SLURM batch job's default stdout and stderr file names are generated |
| using the SLURM job ID. |
When the SLURM control daemon is restarted, SLURM job ID values can be repeated;
therefore it is recommended that batch jobs explicitly specify unique names for
stdout and stderr files using the sbatch options <i>--output</i> and <i>--error</i>
respectively.
While the salloc command may be used to create an interactive SLURM job,
it will be the responsibility of the user to ensure that the <i>bgblock</i>
is ready for use before initiating any mpirun commands.
SLURM will assume this responsibility for batch jobs.
The script that you submit to SLURM can contain multiple invocations of mpirun as
well as any desired commands for pre- and post-processing.
The mpirun command will get its <i>bgblock</i> information from the
<i>MPIRUN_PARTITION</i> environment variable, which is set by SLURM. A sample script is shown below.
| <pre> |
| #!/bin/bash |
| # pre-processing |
| date |
| # processing |
| mpirun -exec /home/user/prog -cwd /home/user -args 123 |
| mpirun -exec /home/user/prog -cwd /home/user -args 124 |
| # post-processing |
| date |
| </pre></p> |
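<p>An illustrative submission of a script such as the one above, using explicitly
named (hypothetical) stdout and stderr files rather than names derived from the
job ID:
<pre>
# File names are illustrative; choose names unique to each run, since
# SLURM job IDs can be reused after a slurmctld restart.
sbatch --nodes=2k --output=sim3_run42.out --error=sim3_run42.err myscript.sh
</pre></p>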
| |
<h3><a name="naming">Naming Conventions</a></h3>
<p>The naming of base partitions includes a three-digit suffix representing its
coordinates in the X, Y and Z dimensions with a zero origin.
For example, "bg012" represents the base partition whose coordinate is at X=0, Y=1 and Z=2. In a system
configured with <i>small blocks</i> (any block less than a full base partition), the
base partition notation is extended with subdivisions. For example, if there were 64 psets in the
configuration, bg012[0-15] represents
the first quarter, or first 16 ionodes, of a midplane. In BlueGene/L
this would be a 128 c-node block. To represent the first nodecard in the
second quarter, or ionodes 16-19, the notation would be bg012[16-19],
a 32 c-node block.
Since jobs must allocate consecutive base partitions in all three dimensions, we have developed
an abbreviated format for describing the base partitions in one of these three-dimensional blocks.
The base partition names have a prefix determined from the system, which is followed by the end-points
of the block enclosed in square-brackets and separated by an "x".
For example, "bg[620x731]" is used to represent the eight base partitions enclosed in a block
with end-points bg620 and bg731 (bg620, bg621, bg630, bg631, bg720, bg721,
bg730 and bg731).</p>
| |
| <p> |
<b>IMPORTANT:</b> SLURM version 1.2 or higher can handle a BlueGene system of
sizes up to 36x36x36. To preserve the three-digit suffix
representing a midplane's coordinates in the X, Y and Z dimensions with a
zero origin, we now support A-Z as valid digits. As a result,
the prefix <b>must always be lower case</b>, and any letters in the
three-digit suffix <b>must always be upper case</b>. This schema
should be used in your slurm.conf file, and in your bluegene.conf file
if you put a prefix there (even though a prefix is not necessary there). This
schema should also be used to specify midplanes or locations in the
configure mode of smap.
| |
| <br> |
| valid: bgl[000xC44] bgl000 bglZZZ |
| <br> |
| invalid: BGL[000xC44] BglC00 bglb00 Bglzzz |
| </p> |
| |
| <p>One new tool provided is <i>smap</i>. |
| As of SLURM version 1.2, <i>sview</i> is |
| another new tool offering even more viewing and configuring options. |
Smap is aware of the system topology and provides a map of which base partitions
are allocated to jobs, partitions, etc.
| See the smap man page for details. |
| A sample of smap output is provided below showing the location of five jobs. |
| Note the format of the list of base partitions allocated to each job. |
| Also note that idle (unassigned) base partitions are indicated by a period. |
| Down and drained base partitions (those not available for use) are |
| indicated by a number sign (bg703 in the display below). |
| The legend is for illustrative purposes only. |
| The origin (zero in every dimension) is shown at the rear left corner of the bottom plane. |
| Each set of four consecutive lines represents a plane in the Y dimension. |
| Values in the X dimension increase to the right. |
| Values in the Z dimension increase down and toward the left.</p> |
| |
| <pre> |
| a a a a b b d d ID JOBID PARTITION BG_BLOCK USER NAME ST TIME NODES BP_LIST |
| a a a a b b d d a 12345 batch RMP0 joseph tst1 R 43:12 32k bg[000x333] |
| a a a a b b c c b 12346 debug RMP1 chris sim3 R 12:34 8k bg[420x533] |
| a a a a b b c c c 12350 debug RMP2 danny job3 R 0:12 4k bg[622x733] |
| d 12356 debug RMP3 dan colu R 18:05 8k bg[600x731] |
| a a a a b b d d e 12378 debug RMP4 joseph asx4 R 0:34 2k bg[612x713] |
| a a a a b b d d |
| a a a a b b c c |
| a a a a b b c c |
| |
| a a a a . . d d |
| a a a a . . d d |
| a a a a . . e e Y |
| a a a a . . e e | |
| | |
| a a a a . . d d 0----X |
| a a a a . . d d / |
| a a a a . . . . / |
| a a a a . . . # Z |
| </pre> |
| |
<p>Note that jobs enter the SLURM state RUNNING as soon as they have been
| allocated a bgblock. |
| If the bgblock is in a READY state, the job will begin execution almost |
| immediately. |
Otherwise the execution of the job will not actually begin until the
bgblock is in a READY state, which can require booting the block,
a process that may take several minutes.
| You can identify the bgblock associated with your job using the command |
| <i>smap -Dj -c</i> and the state of the bgblock with the command |
| <i>smap -Db -c</i>. |
The time to boot a bgblock is related to its size, but should range from
a few minutes to about 15 minutes for a bgblock containing 128
| base partitions. |
| Only after the bgblock is READY will your job's output file be created |
| and the script execution begin. |
| If the bgblock boot fails, SLURM will attempt to reboot several times |
| before draining the associated base partitions and aborting the job.</p> |
| |
| <p>The job will continue to be in a RUNNING state until the bgjob has |
| completed and the bgblock ownership is changed. |
| The time for completing a bgjob has frequently been on the order of |
| five minutes. |
In summary, your job may appear in SLURM as RUNNING from 15 minutes
before the script actually begins until 5 minutes after it completes.
These delays are the result of BlueGene infrastructure issues and are
not due to anything in SLURM.</p>
| |
| <p>When using smap in default output mode you can scroll through |
| the different windows using the arrow keys. |
| The <b>up</b> and <b>down</b> arrow keys scroll |
| the window containing the grid, and the <b>left</b> and <b>right</b> arrow |
| keys scroll the window containing the text information.</p> |
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <h2>System Administration</h2> |
| |
| <p>Building a BlueGene compatible system is dependent upon the |
| <i>configure</i> program locating some expected files. |
| In particular for a BlueGene/L system, the configure script searches |
| for <i>libdb2.so</i> in the directories <i>/home/bgdb2cli/sqllib</i> |
| and <i>/u/bgdb2cli/sqllib</i>. If your DB2 library file is in a |
| different location, use the configure |
| option <i>--with-db2-dir=PATH</i> to specify the parent directory. |
| If you have the same version of the operating system on both the |
| Service Node (SN) and the Front End Nodes (FEN) then you can configure |
| and build one set of files on the SN and install them on both the SN and FEN. |
| Note that all smap functionality will be provided on the FEN |
| except for the ability to map SLURM node names to and from |
| row/rack/midplane data, which requires direct use of the Bridge API |
| calls only available on the SN.</p> |
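<p>For example, a build on the SN with a non-default DB2 library location might
look like the following sketch (the DB2 path is illustrative):
<pre>
# Point configure at the parent directory of your DB2 library
./configure --with-db2-dir=/opt/ibmdb2/V8.1
make
make install
</pre></p>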
| |
| <p>If you have different versions of the operating system on the SN and FEN |
| (as was the case for some early system installations), then you will need |
| to configure and build two sets of files for installation. |
| One set will be for the Service Node (SN), which has direct access to the |
| Bridge APIs. |
The second set will be for the Front End Nodes (FEN), which lack access to the
Bridge APIs and interact with the slurmctld daemon using Remote Procedure
Calls.
| You should see "#define HAVE_BG 1" and "#define HAVE_FRONT_END 1" in the "config.h" |
| file for both the SN and FEN builds. |
| You should also see "#define HAVE_BG_FILES 1" in config.h on the SN before |
| building SLURM. </p> |
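<p>A quick sketch of how one might verify the expected defines in each build:
<pre>
# HAVE_BG and HAVE_FRONT_END should appear in both the SN and FEN builds;
# HAVE_BG_FILES should appear only in the SN build.
grep -E 'HAVE_BG |HAVE_FRONT_END|HAVE_BG_FILES' config.h
</pre></p>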
| |
| <p>The slurmctld daemon should execute on the system's service node. |
| If an optional backup daemon is used, it must be in some location where |
| it is capable of executing Bridge APIs. |
| One slurmd daemon should be configured to execute on one of the front end nodes. |
That one slurmd daemon represents the communications channel for every base partition.
| You can use the scontrol command to drain individual nodes as desired and |
| return them to service. </p> |
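<p>For example, a base partition can be drained and later returned to service
with commands of the following form (the node name and reason are illustrative):
<pre>
scontrol update NodeName=bg000 State=DRAIN Reason="hardware maintenance"
scontrol update NodeName=bg000 State=RESUME
</pre></p>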
| |
| <p>The <i>slurm.conf</i> (configuration) file needs to have the value of <i>InactiveLimit</i> |
| set to zero or not specified (it defaults to a value of zero). |
| This is because there are no job steps and we don't want to purge jobs prematurely. |
The value of <i>SelectType</i> must be set to "select/bluegene" in order to have
node selection performed using a plugin aware of the system's topology
and interfaces.
| The value of <i>Prolog</i> should be set to the full pathname of a program that |
| will delay execution until the bgblock identified by the MPIRUN_PARTITION |
| environment variable is ready for use. It is recommended that you construct a script |
| that serves this function and calls the supplied program <i>sbin/slurm_prolog</i>. |
| The value of <i>Epilog</i> should be set to the full pathname of a program that |
| will wait until the bgblock identified by the MPIRUN_PARTITION environment |
| variable is no longer usable by this job. It is recommended that you construct a script |
| that serves this function and calls the supplied program <i>sbin/slurm_epilog</i>. |
The prolog and epilog programs are used to ensure proper synchronization
between the slurmctld daemon, the user job, and MMCS.
A multitude of other functions may also be placed into the prolog and
epilog as desired (e.g. enabling/disabling user logins, purging file systems,
etc.). Sample prolog and epilog scripts follow. </p>
| |
| <pre> |
| #!/bin/bash |
| # Sample BlueGene Prolog script |
| # |
| # Wait for bgblock to be ready for this job's use |
| /usr/sbin/slurm_prolog |
| |
| |
| #!/bin/bash |
| # Sample BlueGene Epilog script |
| # |
| # Cancel job to start the termination process for this job |
| # and release the bgblock |
| /usr/bin/scancel $SLURM_JOB_ID |
| # |
| # Wait for bgblock to be released from this job's use |
| /usr/sbin/slurm_epilog |
| </pre> |
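<p>The related <i>slurm.conf</i> entries might look like the following sketch
(the wrapper script paths are illustrative):
<pre>
InactiveLimit=0                             # never purge jobs for inactivity
SelectType=select/bluegene                  # topology-aware node selection
Prolog=/usr/local/sbin/bluegene_prolog.sh   # wraps /usr/sbin/slurm_prolog
Epilog=/usr/local/sbin/bluegene_epilog.sh   # wraps /usr/sbin/slurm_epilog
</pre></p>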
| |
| <p>Since jobs with different geometries or other characteristics might not |
| interfere with each other, scheduling is somewhat different on a BlueGene |
| system than typical clusters. |
| SLURM's builtin scheduler on BlueGene will sort pending jobs and then attempt |
| to schedule <b>all</b> of them in priority order. |
| This essentially functions as if there is a separate queue for each job size. |
| SLURM's backfill scheduler on BlueGene will enforce FIFO (first-in first-out) |
| scheduling with backfill (lower priority jobs will start early if doing so |
| will not impact the expected initiation time of a higher priority job). |
| As on other systems, effective backfill relies upon users setting reasonable |
| job time limits. |
| Note that SLURM does support different partitions with an assortment of |
| different scheduling parameters. |
For example, SLURM can have a partition defined for full-system jobs that
is enabled to execute jobs only at certain times, while a default partition
could be configured to execute jobs at other times.
| Jobs could still be queued in a partition that is configured in a DOWN |
| state and scheduled to execute when changed to an UP state. |
| Base partitions can also be moved between slurm partitions either by changing |
| the <i>slurm.conf</i> file and restarting the slurmctld daemon or by using |
| the scontrol reconfig command. </p> |
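<p>A sketch of such a configuration in <i>slurm.conf</i>, with illustrative
partition names and limits:
<pre>
# Default partition for ordinary work; full-system partition left DOWN
# until its scheduled time window.
PartitionName=pdebug Nodes=bg[000x733] Default=YES MaxTime=60 State=UP
PartitionName=pfull  Nodes=bg[000x733] MaxTime=UNLIMITED State=DOWN
</pre></p>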
| |
| <p>SLURM node and partition descriptions should make use of the |
| <a href="#naming">naming</a> conventions described above. For example, |
| "NodeName=bg[000x733] NodeAddr=frontend0 NodeHostname=frontend0 Procs=1024" |
| is used in <i>slurm.conf</i> to define a BlueGene system with 128 midplanes |
| in an 8 by 4 by 4 matrix. |
| The node name prefix of "bg" defined by NodeName can be anything you want, |
| but needs to be consistent throughout the <i>slurm.conf</i> file. |
Note that the values of both NodeAddr and NodeHostname for all
128 base partitions are the name of the front-end node executing
the slurmd daemon.
No computer is actually expected to have a hostname of "bg000" and no
attempt will be made to route message traffic to this address. </p>
| |
| <p>While users are unable to initiate SLURM job steps on BlueGene systems, |
| this restriction does not apply to user root or <i>SlurmUser</i>. |
| Be advised that the one slurmd supporting all nodes is unable to manage a |
| large number of job steps, so this ability should be used only to verify normal |
| SLURM operation. |
| If large numbers of job steps are initiated by slurmd, expect the daemon to |
| fail due to lack of memory or other resources. |
| It is best to minimize other work on the front-end node executing slurmd |
| so as to maximize its performance and minimize other risk factors.</p> |
| |
| <a name="bluegene-conf"><h2>Bluegene.conf File Creation</h2></a> |
| <p>In addition to the normal <i>slurm.conf</i> file, a new |
| <i>bluegene.conf</i> configuration file is required with information pertinent |
to the system.
| Put <i>bluegene.conf</i> into the SLURM configuration directory with |
| <i>slurm.conf</i>. |
| A sample file is installed in <i>bluegene.conf.example</i>. |
System administrators should use the <i>smap</i> tool to build an appropriate
configuration file for static partitioning.
| Note that <i>smap -Dc</i> can be run without the SLURM daemons |
| active to establish the initial configuration. |
| Note that the defined bgblocks may not overlap (except for the |
| full-system bgblock, which is implicitly created). |
| See the smap man page for more information.</p> |
| |
<p>There are three different modes in which the system administrator can define
BlueGene partitions (or bgblocks) available to execute jobs: static,
overlap, and dynamic.
| Jobs must then execute in one of the created bgblocks. |
| (<b>NOTE:</b> bgblocks are unrelated to SLURM partitions.)</p> |
| |
| <p>The default mode of partitioning is <i>static</i>. |
| In this mode, the system administrator must explicitly define each |
| of the bgblocks in the <i>bluegene.conf</i> file. |
Each of these bgblocks is explicitly configured with either a
mesh or torus interconnect.
| They must also not overlap, except for the implicitly defined full-system |
| bgblock. |
Note that bgblocks are not rebooted between jobs in this mode
except when going to/from full-system jobs.
Eliminating bgblock booting can significantly improve system
utilization (by eliminating boot time) and reliability.</p>
| |
| <p>The second mode is <i>overlap</i> partitioning. |
Overlap partitioning is very similar to static partitioning in that
each bgblock must be explicitly defined in the <i>bluegene.conf</i>
file, but these partitions can overlap each other.
In this mode <b>it is highly recommended that none of the bgblocks
have any passthroughs in the X-dimension associated with them</b>.
Usually this is only an issue on larger BlueGene systems.
<b>It is advisable to use this mode with extreme caution.</b>
Make sure you know what you are doing to ensure that the bgblocks will
boot without dependency on the state of any base partition
not included in the bgblock.</p>
| |
<p>In the two previous modes you must ensure that the base
| partitions defined in <i>bluegene.conf</i> are consistent with |
| those defined in <i>slurm.conf</i>. |
| Note the <i>bluegene.conf</i> file contains only the numeric |
| coordinates of base partitions while <i>slurm.conf</i> contains |
| the name prefix in addition to the numeric coordinates.</p> |
| |
| <p>The final mode is <i>dynamic</i> partitioning. |
| Dynamic partitioning was developed primarily for smaller BlueGene systems, |
| but can be used on larger systems. |
Dynamic partitioning may introduce fragmentation of resources.
This fragmentation may be severe, since SLURM will run a job anywhere
resources are available with little thought of the future.
| As with overlap partitioning, <b>use dynamic partitioning with |
| caution!</b> |
| This mode can result in job starvation since smaller jobs will run |
| if resources are available and prevent larger jobs from running. |
| Bgblocks need not be assigned in the <i>bluegene.conf</i> file |
| for this mode.</p> |
| |
<p>Blocks can be freed or set in an error state with scontrol
(e.g. "<i>scontrol update BlockName=RMP0 state=error</i>").
This will end any job on the block and set the state of the block to ERROR
so that no job will run on the block. To set it back to a usable
state, set the state to free (e.g.
"<i>scontrol update BlockName=RMP0 state=free</i>").</p>
| |
<p>Alternatively, if only part of a base partition needs to be put
into an error state and it isn't already in a block of the size you
need, you can set a set of ionodes into an error state with scontrol
(e.g. "<i>scontrol update subbpname=bg000[0-3] state=error</i>").
This will end any job on the nodes listed, create a block there, and set
the state of the block to ERROR so that no job will run on the
block. To set it back to a usable state, set the state to free (e.g.
"<i>scontrol update BlockName=RMP0 state=free</i>" or
"<i>scontrol update subbpname=bg000[0-3] state=free</i>"). This is
helpful to allow other jobs to run on the unaffected nodes in
the base partition.</p>
| |
| |
| <p>One of these modes must be defined in the <i>bluegene.conf</i> file |
| with the option <i>LayoutMode=MODE</i> (where MODE=STATIC, DYNAMIC or OVERLAP).</p> |
| |
| <p>The number of c-nodes in a base partition and in a node card must |
| be defined. |
| This is done using the keywords <i>BasePartitionNodeCnt=NODE_COUNT</i> |
| and <i>NodeCardNodeCnt=NODE_COUNT</i> respectively in the <i>bluegene.conf</i> |
file (e.g. <i>BasePartitionNodeCnt=512</i> and <i>NodeCardNodeCnt=32</i>).</p>
| |
<p>Note that the <i>Numpsets</i> value defined in
<i>bluegene.conf</i> is used only when SLURM creates bgblocks; it
determines whether the system is IO rich or not. For most BlueGene/L
systems this value is either 8 (for IO-poor systems) or 64 (for IO-rich
systems).</p>
| <p>The <i>Images</i> can change during job start based on input from |
| the user. |
| If you change the bgblock layout, then slurmctld and slurmd should |
| both be cold-started (e.g. <b>/etc/init.d/slurm startclean</b>). |
| If you wish to modify the <i>Numpsets</i> values |
| for existing bgblocks, either modify them manually or destroy the bgblocks |
| and let SLURM recreate them. |
Note that in addition to the bgblocks defined in <i>bluegene.conf</i>, an
additional bgblock is created containing all resources defined in
all of the other defined bgblocks.
| Make use of the SLURM partition mechanism to control access to these |
| bgblocks. |
| A sample <i>bluegene.conf</i> file is shown below. |
| <pre> |
| ############################################################################### |
| # Global specifications for BlueGene system |
| # |
| # BlrtsImage: BlrtsImage used for creation of all bgblocks. |
| # LinuxImage: LinuxImage used for creation of all bgblocks. |
| # MloaderImage: MloaderImage used for creation of all bgblocks. |
| # RamDiskImage: RamDiskImage used for creation of all bgblocks. |
| # |
| # You may add extra images which a user can specify from the srun |
| # command line (see man srun). When adding these images you may also add |
| # a Groups= at the end of the image path to specify which groups can |
| # use the image. |
| # |
| # AltBlrtsImage: Alternative BlrtsImage(s). |
| # AltLinuxImage: Alternative LinuxImage(s). |
| # AltMloaderImage: Alternative MloaderImage(s). |
| # AltRamDiskImage: Alternative RamDiskImage(s). |
| # |
| # LayoutMode: Mode in which slurm will create blocks: |
| # STATIC: Use defined non-overlapping bgblocks |
| # OVERLAP: Use defined bgblocks, which may overlap |
| # DYNAMIC: Create bgblocks as needed for each job |
| # BasePartitionNodeCnt: Number of c-nodes per base partition |
| # NodeCardNodeCnt: Number of c-nodes per node card. |
| # Numpsets: The Numpsets used for creation of all bgblocks |
| # equals this value multiplied by the number of |
| # base partitions in the bgblock. |
| # |
| # BridgeAPILogFile: Pathname of file in which to write the |
| # Bridge API logs. |
| # BridgeAPIVerbose: How verbose the BG Bridge API logs should be |
| # 0: Log only error and warning messages |
| # 1: Log level 0 and information messages |
| # 2: Log level 1 and basic debug messages |
| # 3: Log level 2 and more debug message |
| # 4: Log all messages |
| # DenyPassthrough: Prevents use of passthrough ports in specific |
| # dimensions, X, Y, and/or Z, plus ALL |
| # |
| # NOTE: The bgl_serial value is set at configuration time using the |
| # "--with-bgl-serial=" option. Its default value is "BGL". |
| ############################################################################### |
# These are the default images which are used if the user doesn't specify
# which image they want
| BlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw.rts |
| LinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage.elf |
| MloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader.rts |
| RamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk.elf |
| |
| #Only group jette can use these images |
| AltBlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw2.rts Groups=jette |
| AltLinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage2.elf Groups=jette |
| AltMloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader2.rts Groups=jette |
| AltRamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk2.elf Groups=jette |
| |
| # Since no groups are specified here any user can use them |
| AltBlrtsImage=/bgl/BlueLight/ppcfloor/bglsys/bin/rts_hw3.rts |
| AltLinuxImage=/bgl/BlueLight/ppcfloor/bglsys/bin/zImage3.elf |
| AltMloaderImage=/bgl/BlueLight/ppcfloor/bglsys/bin/mmcs-mloader3.rts |
| AltRamDiskImage=/bgl/BlueLight/ppcfloor/bglsys/bin/ramdisk3.elf |
| |
# Another option for images is the wildcard "*" ("you can use anything you like")
# This allows the user to use any image entered with no security checking
| AltBlrtsImage=* Groups=da,adamb |
| AltLinuxImage=* Groups=da,adamb |
| AltMloaderImage=* Groups=da,adamb |
| AltRamDiskImage=* Groups=da,adamb |
| |
| LayoutMode=STATIC |
| BasePartitionNodeCnt=512 |
| NodeCardNodeCnt=32 |
| NumPsets=64 # An I/O rich environment |
| BridgeAPILogFile=/var/log/slurm/bridgeapi.log |
| BridgeAPIVerbose=0 |
| |
| #DenyPassthrough=X,Y,Z |
| |
| ############################################################################### |
| # Define the static/overlap partitions (bgblocks) |
| # |
| # BPs: The base partitions (midplanes) in the bgblock using XYZ coordinates |
| # Type: Connection type "MESH" or "TORUS" or "SMALL", default is "TORUS" |
| # Type SMALL will divide a midplane into multiple bgblocks |
| # based off options NodeCards and Quarters to determine type of |
| # small blocks. |
| # |
| # IMPORTANT NOTES: |
| # * Ordering is very important for laying out switch wires. Please create |
| # blocks with smap, and once done don't move the order of blocks |
| # created. |
| # * A bgblock is implicitly created containing all resources on the system |
| # * Bgblocks must not overlap (except for implicitly created bgblock) |
| # This will be the case when smap is used to create a configuration file |
| # * All Base partitions defined here must also be defined in the slurm.conf file |
| # * Define only the numeric coordinates of the bgblocks here. The prefix |
| # will be based upon the name defined in slurm.conf |
| ############################################################################### |
| # LEAVE NEXT LINE AS A COMMENT, Full-system bgblock, implicitly created |
| # BPs=[000x001] Type=TORUS # 1x1x2 = 2 midplanes |
| ############################################################################### |
| # volume = 1x1x1 = 1 |
| BPs=[000x000] Type=TORUS # 1x1x1 = 1 midplane |
BPs=[001x001] Type=SMALL 32CNBlocks=4 128CNBlocks=3  # 1x1x1 = 4 nodecard-sized
                                                     # (32 c-node) blocks and 3
                                                     # quarter-midplane (128 c-node)
                                                     # blocks
| |
| </pre></p> |
| |
| <p>The above <i>bluegene.conf</i> file defines multiple bgblocks to be |
| created in a single midplane (see the "SMALL" option). |
Using this mechanism, up to 32 independent jobs, each consisting of
32 c-nodes, can be executed
simultaneously on a one-rack BlueGene system.
| If defining bgblocks of <i>Type=SMALL</i>, the SLURM partition |
| containing them as defined in <i>slurm.conf</i> must have the |
| parameter <i>Shared=force</i> to enable scheduling of multiple |
| jobs on what SLURM considers a single node. |
| SLURM partitions that do not contain bgblocks of <i>Type=SMALL</i> |
| may have the parameter <i>Shared=no</i> for a slight improvement in |
| scheduler performance. |
| As in all SLURM configuration files, parameters and values |
| are case insensitive.</p> |
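<p>A sketch of the corresponding <i>slurm.conf</i> partition definitions
(partition names and node ranges are illustrative):
<pre>
# Partition containing Type=SMALL bgblocks must allow node sharing
PartitionName=psmall Nodes=bg[001x001] Shared=FORCE State=UP
# Partition without small bgblocks
PartitionName=pbatch Nodes=bg[000x000] Shared=NO Default=YES State=UP
</pre></p>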
| |
<p>With a BlueGene/P system the image names are different. The
correct image names are CnloadImage, MloaderImage, and IoloadImage.
You can also use alternate images just as described above.</p>
| |
| <p>One more thing is required to support SLURM interactions with |
| the DB2 database (at least as of the time this was written). |
| DB2 database access is required by the slurmctld daemon only. |
| All other SLURM daemons and commands interact with DB2 using |
| remote procedure calls, which are processed by slurmctld. |
| DB2 access is dependent upon the environment variable |
| <i>BRIDGE_CONFIG_FILE</i>. |
Make sure this is set appropriately before initiating the
slurmctld daemon.
| If desired, this environment variable and any other logic |
| can be executed through the script <i>/etc/sysconfig/slurm</i>, |
| which is automatically executed by <i>/etc/init.d/slurm</i> |
| prior to initiating the SLURM daemons.</p> |
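<p>A minimal sketch of <i>/etc/sysconfig/slurm</i> (the path to the bridge
configuration file is hypothetical):
<pre>
# Sourced by /etc/init.d/slurm before the SLURM daemons are started
BRIDGE_CONFIG_FILE=/bgl/BlueLight/ppcfloor/bglsys/bin/bridge.config
export BRIDGE_CONFIG_FILE
</pre></p>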
| |
| <p>When slurmctld is initially started on an idle system, the bgblocks |
| already defined in MMCS are read using the Bridge APIs. |
| If these bgblocks do not correspond to those defined in the <i>bluegene.conf</i> |
| file, the old bgblocks with a prefix of "RMP" are destroyed and new ones |
| created. |
| When a job is scheduled, the appropriate bgblock is identified, |
| its user set, and it is booted. |
Node use (virtual or coprocessor) is now set from the mpirun command line;
SLURM has nothing to do with setting the node use.
| Subsequent jobs use this same bgblock without rebooting by changing |
| the associated user field. |
| The only time bgblocks should be freed and rebooted, in normal operation, |
| is when going to or from full-system |
| jobs (two or more bgblocks sharing base partitions can not be in a |
| ready state at the same time). |
| When this logic became available at LLNL, approximately 85 percent of |
| bgblock boots were eliminated and the overhead of job startup went |
| from about 24% to about 6% of total job time. |
| Note that bgblocks will remain in a ready (booted) state when |
| the SLURM daemons are stopped. |
| This permits SLURM daemon restarts without loss of running jobs |
| or rebooting of bgblocks. </p> |
| |
| <p>Be aware that SLURM will issue multiple bgblock boot requests as |
| needed (e.g. when the boot fails). |
| If the bgblock boot requests repeatedly fail, SLURM will configure |
| the failing base partitions to a DRAINED state so as to avoid continuing |
| repeated reboots and the likely failure of user jobs. |
| A system administrator should address the problem before returning |
| the base partitions to service.</p> |
| |
| <p>If you cold-start slurmctld (<b>/etc/init.d/slurm startclean</b> |
| or <b>slurmctld -c</b>) it is recommended that you also cold-start |
| the slurmd at the same time. |
| Failure to do so may result in errors being reported by both slurmd |
| and slurmctld due to bgblocks that previously existed being deleted.</p> |
| |
| <p>A new tool <i>sfree</i> has also been added to help system |
| administrators free a bgblock on request (i.e. |
| "<i>sfree --bgblock=<blockname></i>"). |
| Run <i>sfree --help</i> for more information.</p> |
| |
| <h4>Resource Reservations</h4> |
| |
| <p><b>This reservation mechanism for less than an entire midplane is still |
| under development.</b></p> |
| |
| <p>SLURM's advance reservation mechanism is designed to reserve resources |
at the level of whole nodes, which on a BlueGene system would represent
| whole midplanes. In order to support advanced reservations with a finer |
| grained resolution, you can configure one license per c-node on the system |
| and reserve c-nodes instead of entire midplanes. For example, in slurm.conf |
| specify something of this sort: "<i>Licenses=cnode*512</i>". Then create an |
| advanced reservation with a command like this:<br> |
| "<i>scontrol create reservation licenses="cnode*32" starttime=now duration=30:00 users=joe</i>".</p> |
| |
<p>There is also a job_submit/cnode plugin available for use that will
automatically set a job's license specification to match its c-node request
(e.g. a command like<br>
"<i>sbatch -N32 my.sh</i>" would automatically be translated to<br>
"<i>sbatch -N32 --licenses=cnode*32 my.sh</i>" by the slurmctld daemon).
| Enable this plugin in the slurm.conf configuration file with the option |
| "<i>JobSubmitPlugins=cnode</i>".</p> |
| |
| <h4>Debugging</h4> |
| |
<p>All of the testing and debugging guidance provided in the
<a href="quickstart_admin.html"> Quick Start Administrator Guide</a>
applies to BlueGene systems.
| One can start the <i>slurmctld</i> and <i>slurmd</i> in the foreground |
| with extensive debugging to establish basic functionality. |
| Once running in production, the configured <i>SlurmctldLog</i> and |
| <i>SlurmdLog</i> files will provide historical system information. |
| On BlueGene systems, there is also a <i>BridgeAPILogFile</i> defined |
| in <i>bluegene.conf</i> which can be configured to contain detailed |
| information about every Bridge API call issued.</p> |
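<p>For example, the daemons can be started in the foreground with verbose
logging (adjust the verbosity to taste):
<pre>
# Run each in its own terminal; -D keeps the daemon in the foreground,
# and each -v increases the logging verbosity.
slurmctld -D -vvvv
slurmd -D -vvvv
</pre></p>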
| |
<p>Note that slurmctld log messages of the sort
<i>Nodes bg[000x133] not responding</i> indicate that the slurmd
daemon serving as a front-end to those base partitions is not responding (on
non-BlueGene systems, the slurmd actually does run on the compute
nodes, so the message is more meaningful there). </p>
| |
<p>Note that you can emulate a BlueGene/L system on a stand-alone Linux
system.
| Run <b>configure</b> with the <b>--enable-bgl-emulation</b> option. |
| This will define "HAVE_BG", "HAVE_BGL", and "HAVE_FRONT_END" in the |
| config.h file. |
| You can also emulate a BlueGene/P system with |
| the <b>--enable-bgp-emulation</b> option. |
| This will define "HAVE_BG", "HAVE_BGP", and "HAVE_FRONT_END" in the |
| config.h file. |
| Then execute <b>make</b> normally. |
These variables will build the code as if it were running
on an actual BlueGene computer, but avoid making calls to the
Bridge library (that is controlled by the variable "HAVE_BG_FILES",
which is left undefined). You can use this to test configurations,
| scheduling logic, etc. </p> |
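<p>A sketch of the BlueGene/L emulation build:
<pre>
# Build SLURM as if on BlueGene/L hardware, without Bridge library calls
./configure --enable-bgl-emulation
make
</pre></p>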
| |
| <p class="footer"><a href="#top">top</a></p> |
| |
| <p style="text-align:center;">Last modified 17 March 2009</p> |
| |
| <!--#include virtual="footer.txt"--> |