RELEASE NOTES FOR SLURM VERSION 2.0
11 February 2009 (after SLURM 1.4.0-pre8 released)

IMPORTANT NOTE:
SLURM state files in version 2.0 are different from those of version 1.3.
After installing SLURM version 2.0, plan to restart without preserving
jobs or other state information. While SLURM version 1.3 is still running,
cancel all pending and running jobs (e.g.
"scancel --state=pending; scancel --state=running"). Then stop and restart
daemons with the "-c" option or use "/etc/init.d/slurm startclean".
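As a sketch, the full clean-restart sequence might look like the
following (the commands are those named above; adjust paths for your
site's installation):

   # While version 1.3 is still running, drain the queues:
   scancel --state=pending
   scancel --state=running
   # Stop the daemons, install version 2.0, then start without state:
   /etc/init.d/slurm stop
   ... install SLURM version 2.0 ...
   /etc/init.d/slurm startclean   # same as starting daemons with "-c"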

If using the slurmdbd (SLURM DataBase Daemon), you must update it first.
The 2.0 slurmdbd will work with SLURM daemons at version 1.3.7 and above,
so you need not update all clusters at the same time, but it is very
important to update the slurmdbd first and have it running before updating
any other clusters that make use of it. No real harm will come from
updating your systems before the slurmdbd, but they will not talk to each
other until you do.

There are substantial changes in the slurm.conf configuration file. It
is recommended that you rebuild your configuration file using the tool
doc/html/configurator.html that comes with the distribution.

SLURM can continue to be used as a simple resource manager, but optional
plugins support sophisticated scheduling algorithms. These plugins do
require the use of a database containing user and bank account
information, so more administrative work is required. SLURM's modular
design lets you control the functionality that you want it to provide.

HIGHLIGHTS
* Sophisticated scheduling algorithms are available in a new plugin. Jobs
  can be prioritized based upon their age, size and/or fair-share resource
  allocation using hierarchical bank accounts. For more information see:
  https://computing.llnl.gov/linux/slurm/job_priority.html
* An assortment of resource limits can be imposed upon individual users
  and/or hierarchical bank accounts, such as maximum job time limit,
  maximum job size and maximum number of running jobs. For more
  information see:
  https://computing.llnl.gov/linux/slurm/resource_limits.html
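  As a sketch, a per-user limit might be set with sacctmgr along these
  lines (the user name here is hypothetical):
    sacctmgr modify user where name=alice set MaxJobs=20 MaxWall=24:00:00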
* Advanced reservations can be made to ensure resources will be available
  when needed. For more information see:
  https://computing.llnl.gov/linux/slurm/reservations.html
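  As a sketch, a reservation might be created with scontrol along these
  lines (user name, node list and times are hypothetical):
    scontrol create reservation user=alice nodes=tux[0-3] \
             starttime=2009-03-01T08:00:00 duration=120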
* Nodes can now be completely powered down when idle and automatically
  restarted when there is work available. For more information see:
  https://computing.llnl.gov/linux/slurm/power_save.html
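  A minimal slurm.conf sketch (script paths are hypothetical; SuspendTime
  is the idle time, in seconds, before a node is powered down):
    SuspendTime=600
    SuspendProgram=/etc/slurm/node_suspend.sh
    ResumeProgram=/etc/slurm/node_resume.sh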
* SLURM has been modified to allocate specific cores to jobs and job steps
  in the centralized scheduler rather than in the daemons running on the
  individual compute nodes. This permits effective preemption and gang
  scheduling of jobs.
* New configuration parameters, PrologSlurmctld and EpilogSlurmctld, can be
  used to support the booting of different operating systems for each job.
  See "man slurm.conf" for details.
* Preemption of jobs from lower priority partitions in order to execute
  jobs in higher priority partitions is now supported. The jobs from the
  lower priority partition will resume once the preempting job completes.
  For more information see:
  https://computing.llnl.gov/linux/slurm/preempt.html
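  One possible slurm.conf sketch, assuming the gang scheduler is used for
  preemption as described at the URL above (partition names and node list
  are hypothetical):
    SchedulerType=sched/gang
    PartitionName=low  Nodes=tux[0-31] Priority=1  Shared=FORCE:1
    PartitionName=high Nodes=tux[0-31] Priority=10 Shared=FORCE:1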
* Added support for optimized resource allocation with respect to network
  topology. Requires that switch configuration information be added (see
  the TopologyPlugin and topology.conf entries below).
* Support added for Sun Constellation system with optimized resource
  allocation for a 3-dimensional torus interconnect. For more information
  see:
  https://computing.llnl.gov/linux/slurm/sun_const.html
* Support added for IBM BlueGene/P systems, including High Throughput
  Computing (HTC) mode.
* Support added for checkpoint/restart using BLCR via the checkpoint/blcr
  plugin. For more information see:
  https://computing.llnl.gov/linux/slurm/checkpoint_blcr.html
  https://ftg.lbl.gov/CheckpointRestart/CheckpointRestart.shtml
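  As a sketch, once the plugin is configured a running job might be
  checkpointed with (the job ID is hypothetical):
    scontrol checkpoint create 1234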

CONFIGURATION FILE CHANGES (see "man slurm.conf" for details)
* The default AuthType is now "auth/munge" rather than "auth/none".
* The default CryptoType is now "crypto/munge". OpenSSL is no longer
  required by SLURM in the default configuration.
* DefaultTime has been added to specify a default job time limit within a
  partition. If not set, the partition's MaxTime is used.
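  For example (partition name and node list are hypothetical; times are
  in minutes):
    PartitionName=debug Nodes=tux[0-31] DefaultTime=30 MaxTime=120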
* PrologSlurmctld has been added and can be used to boot nodes into a
  particular state for each job.
* DefMemPerTask has been removed. Use DefMemPerCPU or DefMemPerNode
  instead.
* KillOnBadExit added to immediately terminate a job step whenever any
  task terminates with a non-zero exit code.
* Added new node state of "FUTURE". These node records are created in SLURM
  tables for future use without a reboot of the SLURM daemons, but are not
  reported by any SLURM commands or APIs.
* BatchStartTime has been added to control how long to wait for a batch job
  to start (complete Prolog, load environment for Moab, etc.).
* CompleteTime has been added to control how long to wait for a job's
  completion before allocating already released resources to pending jobs.
* OverTimeLimit added to permit jobs to exceed their (soft) time limit by a
  configurable amount. Backfill scheduling will be based upon the soft time
  limit.
* For select/cons_res or sched/gang only: each node's processor count must
  be specified in the configuration file. Additional resources found by
  SLURM daemons on the compute nodes will not be used.
* DebugFlags added to provide detailed logging for specific subsystems.
* Added a job priority plugin. The default for PriorityType is
  "priority/basic", which preserves the existing behavior (job priorities
  are assigned at submit time with decreasing value). "priority/multifactor"
  is a new plugin which sets a job's priority based upon many different
  configuration parameters, as described here:
  https://computing.llnl.gov/linux/slurm/job_priority.html
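  A slurm.conf sketch (the weight values are arbitrary examples):
    PriorityType=priority/multifactor
    PriorityWeightAge=1000
    PriorityWeightFairshare=10000
    PriorityWeightJobSize=1000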
* The task/affinity plugin will automatically bind a job step to the CPUs
  it has been allocated. The entity bound to (sockets, cores or threads)
  will be automatically set based upon the allocation size and task count.
  SLURM's SPANK cpuset plugin is no longer needed.
* Resource allocations can now be optimized according to network topology.
  The following switch topology configuration options have been added:
  TopologyPlugin in slurm.conf, plus SwitchName, Nodes and Switches in a
  new topology.conf file. More information is available in the man pages
  for slurm.conf and topology.conf, and at
  https://computing.llnl.gov/linux/slurm/topology.html
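  A sketch with two leaf switches below one top-level switch (switch and
  node names are hypothetical):
    # In slurm.conf:
    TopologyPlugin=topology/tree
    # In topology.conf:
    SwitchName=s0  Nodes=tux[0-7]
    SwitchName=s1  Nodes=tux[8-15]
    SwitchName=top Switches=s[0-1]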
* SrunIOTimeout has been added to optionally ping srun's tasks for better
  fault tolerance (e.g. after SLURM daemons on a compute node have been
  killed and restarted).
* ResumeDelay added to control how long to wait after a node has been
  suspended before resuming it (e.g. powering it back up).
* BLUEGENE - Added option DenyPassthrough in bluegene.conf. Can be set to
  any combination of X, Y and Z to disallow passthroughs when running in
  dynamic layout mode (see "man bluegene.conf" for details).
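  For example, to disallow passthroughs in the X and Y dimensions:
    DenyPassthrough=X,Y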

COMMAND CHANGES (see man pages for details)
* --task-mem and --job-mem options have been removed from salloc, sbatch
  and srun. Use --mem-per-cpu or --mem instead.
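  For example (the program name is hypothetical):
    srun -n16 --mem-per-cpu=1024 ./my_app   # memory per CPU in megabytes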
* Added the srun option --preserve-env to pass the current values of the
  environment variables SLURM_NNODES and SLURM_NPROCS through to the
  executable, rather than computing them from command line parameters.
* The --ctrl-comm-ifhn-addr option has been removed from the srun command
  (it is no longer useful).
* Batch jobs have an environment variable SLURM_RESTART_COUNT set when
  restarted.
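  A batch script sketch that reacts to being restarted:
    #!/bin/sh
    if [ -n "$SLURM_RESTART_COUNT" ]; then
        echo "This is restart number $SLURM_RESTART_COUNT of this job"
    fi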
* To create a partition using the scontrol command, use the "create"
  command rather than "update" with a new partition name.
* The time format of all SLURM commands is now ISO 8601
  (yyyy-mm-ddThh:mm:ss) unless the configure option "--disable-iso8601" is
  used at build time.
* Using "sacct -S" to get the status of a job will no longer work. Use
  sstat from now on.
* The sacct --nodes option can be used to filter jobs by allocated node.
* The sacct default start time is now midnight of the previous day rather
  than the start of the database.
* sacct and sstat have been rewritten to have a more sacctmgr-like feel.
* Added the sprio command to view the factors that comprise a job's
  scheduling priority. It works only with the priority/multifactor plugin.

ACCOUNTING CHANGES
* Added the ability for slurmdbd to archive and purge step and/or job
  records, as sketched below.
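  A slurmdbd.conf sketch (parameter names as documented in
  "man slurmdbd.conf"; verify them against your installed version):
    ArchiveJobs=yes
    ArchiveSteps=yes
    ArchiveDir=/var/spool/slurmdbd/archive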
* Added support for a Workload Characterization Key (WCKey) in accounting
  records. This is an optional string that can be used to identify the
  type of work being performed (in addition to user ID, account name, job
  name, etc.).
* Added the configuration parameter AccountingStorageBackupHost for
  fault-tolerance in communications to SlurmDBD.
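  For example, in slurm.conf (the host names are hypothetical):
    AccountingStorageHost=dbd1
    AccountingStorageBackupHost=dbd2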

OTHER CHANGES
* Modify PMI_Get_clique_ranks() to return an array of integers rather
  than a char * to satisfy the PMI standard. Correct logic in
  PMI_Get_clique_size() for when the srun --overcommit option is used.
| * Set "/proc/self/oom_adj" for slurmd and slurmstepd daemons based upon |
| the values of SLURMD_OOM_ADJ and SLURMSTEPD_OOM_ADJ environment |
| variables. This can be used to prevent daemons being killed when |
| a node's memory is exhausted. |
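  For example, before starting slurmd (on most 2.6 kernels the value -17
  disables OOM kills entirely; check your kernel's oom_adj semantics):
    export SLURMD_OOM_ADJ=-17
    export SLURMSTEPD_OOM_ADJ=-17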