RELEASE NOTES FOR SLURM VERSION 2.2
10 January 2011
IMPORTANT NOTE:
If using the slurmdbd (SLURM DataBase Daemon) you must update this first.
The 2.2 slurmdbd will work with SLURM daemons of version 2.1.3 and above.
You do not need to update all clusters at the same time, but it is very
important to update the slurmdbd first and have it running before updating
any other clusters that make use of it. No real harm will come from updating
your systems before the slurmdbd, but they will not talk to each other
until you do. Also, at least the first time you run the slurmdbd, make sure
your my.cnf file has innodb_buffer_pool_size set to at least 64M.
You can accomplish this by adding the line
innodb_buffer_pool_size=64M
under the [mysqld] section of the my.cnf file and restarting mysqld.
This is needed when converting large tables over to the new database schema.
SLURM can be upgraded from version 2.1 to version 2.2 without loss of jobs or
other state information.
HIGHLIGHTS
==========
* Slurmctld restart/reconfiguration operations have been altered.
NOTE: There will be no change in behavior unless partition configuration
or node Features/Weight are altered using the scontrol command to differ
from the contents of the slurm.conf configuration file.
Preserve current partition state information plus node Feature and Weight
state information after slurmctld receives a SIGHUP signal or is restarted
with the -R option. Recreate partition plus node information (except node
State and Reason) from slurm.conf file after executing "scontrol reconfig"
or restarting slurmctld *without* the -R option.
  OPERATION              ACTION
  slurmctld -R           Recover all job, node and partition state
  slurmctld              Recover job state plus state and reason for DOWN and
                         DRAINED nodes only, recreate all other node state
                         plus all partition state
  slurmctld -c           Recover no jobs, recreate node and partition state
  SIGHUP to slurmctld    Preserve all job, node and partition state
  scontrol reconfig      Preserve job state, recreate node and partition state
Old logic preserved node Feature plus partition state after "slurmctld" or
"scontrol reconfig" rather than recreating it from slurm.conf. Node Weight
was formerly always recreated from slurm.conf.
* SLURM commands (squeue, sinfo, sview, etc.) can now operate between
clusters. Jobs can also be submitted with sbatch to other cluster(s), with the
job routed to the cluster expected to initiate it first.
* Accounting through the SlurmDBD with the MySQL plugin can now support
a default account and wckey per cluster.
CONFIGURATION FILE CHANGES (see "man slurm.conf" for details)
=============================================================
* A hash of the slurm.conf running on each node in the cluster is sent when
registering with the slurmctld so it can verify the slurm.conf is the same
as the one it is running. If not, an error message is displayed. To
silence this message, add NO_CONF_HASH to DebugFlags in your slurm.conf.
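For example, the message could be silenced with a slurm.conf line like the
following (illustrative only):
  DebugFlags=NO_CONF_HASH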
* Added VSizeFactor to enforce virtual memory limits for jobs and job steps as
a percentage of their real memory allocation.
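As an illustrative setting (the value is hypothetical), the following would
limit a job's virtual memory to 110% of its real memory allocation:
  VSizeFactor=110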
* Added new option for SelectTypeParameters of CR_ONE_TASK_PER_CORE. This
option will allocate one task per core by default. Without this option,
by default one task will be allocated per thread on nodes with more than
one ThreadsPerCore configured (i.e. no change in behavior without this
option).
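An illustrative combination, assuming select/cons_res with core-level
allocation (other parameters are site-specific):
  SelectType=select/cons_res
  SelectTypeParameters=CR_Core,CR_ONE_TASK_PER_CORE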
* Added new configuration parameters GroupUpdateForce and GroupUpdateTime. These
control when slurmctld updates its information about which users are in the
groups allowed to use partitions. NOTE: There is no change in the default
behavior.
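Illustrative values (the parameter names are from this release; the numbers
shown are hypothetical):
  GroupUpdateForce=1
  GroupUpdateTime=600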
* Added new configuration parameters SlurmSchedLogFile and SlurmSchedLogLevel
to support writing scheduling events to a separate log file.
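For example (the path and level shown are hypothetical):
  SlurmSchedLogFile=/var/log/slurm/slurmsched.log
  SlurmSchedLogLevel=1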
* Added new configuration parameter JobSubmitPlugins which provides a mechanism
to set default job parameters or perform other site-configurable actions at
job submit time. Site-specific job submission plugins may be written in either
C or LUA.
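An illustrative configuration, assuming the LUA plugin is built and its
job_submit.lua script is installed alongside slurm.conf:
  JobSubmitPlugins=lua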
* MaxJobCount changed from 16-bit to 32-bit field. The default MaxJobCount was
changed from 5,000 to 10,000.
* Added support for a PropagatePrioProcess configuration parameter value of 2
to restrict spawned task nice values to that of the slurmd daemon plus 1.
This ensures that the slurmd daemon always has a higher scheduling priority
than spawned tasks. Also added support in slurmctld, slurmd and slurmdbd for
option of "-n <value>" to reset the daemon's nice value.
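Illustrative usage (the nice value shown is arbitrary):
  PropagatePrioProcess=2    (in slurm.conf)
  slurmd -n -10             (reset the slurmd daemon's nice value at startup)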
* Support has been added for the allocation of generic resources (GRES). A
new configuration parameter, GresPlugins, has been added along with a node-
specific parameter, Gres. There is also a gres.conf file to be configured on
each node. For more information, see the web page
https://computing.llnl.gov/linux/slurm/gang_scheduling.html
Support for enforcement of these allocations using Linux CGroup will be
provided in a later release.
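An illustrative slurm.conf fragment for nodes with two GPUs each (node names
and counts are hypothetical; per-node device details go in gres.conf as
described on the web page above):
  GresPlugins=gpu
  NodeName=tux[0-15] Gres=gpu:2 ...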
* Added support for new partition states of DRAIN (run queued jobs, but accept
no new jobs) and INACTIVE (do not accept or run any more jobs) and new
partition option of "Alternate" (alternate partition to use for jobs
submitted to partitions that are currently in a state of DRAIN or INACTIVE).
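For example (partition and node names are hypothetical), jobs submitted to
the drained partition below would be routed to its alternate:
  PartitionName=debug Nodes=tux[0-15] State=DRAIN Alternate=batch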
* Added the ability to configure PreemptMode on a per-partition or per-QOS
basis.
* Modified the meaning of InactiveLimit slightly. It will now cancel the job
allocation created using the salloc or srun command if those commands cease
responding for the InactiveLimit, regardless of any running job steps. This
parameter no longer affects jobs spawned using sbatch.
* Added SchedulerParameters option of bf_window to control how far into the
future that the backfill scheduler will look when considering jobs to start.
The default value is one day.
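For example, to let the backfill scheduler look two days ahead (assuming the
value is expressed in minutes):
  SchedulerParameters=bf_window=2880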
* Added the ability to specify a range of ports in the SlurmctldPort parameter
for better handling of high bursts of RPCs (e.g. "SlurmctldPort=1234-1237").
COMMAND CHANGES (see man pages for details)
===========================================
* sinfo -R now has the user and timestamp in separate fields from the reason.
* Job submission commands (salloc, sbatch and srun) have a new option,
--time-min, which permits the job's time limit to be reduced, down to the
specified minimum, if doing so allows the job to start earlier through
backfill scheduling.
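For example (the script name is hypothetical):
  sbatch --time=8:00:00 --time-min=4:00:00 my_job.sh
Backfill may then start the job early with any time limit between four and
eight hours.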
* scontrol now has the ability to change a job step's time limit.
* scontrol now has the ability to shrink a job's size. Use a command of
"scontrol update JobId=# NumNodes=#" or
"scontrol update JobId=# NodeList=<names>". This command generates a script
to be executed in order to reset SLURM environment variables for proper
execution of subsequent job steps.
* We have given Operators, Administrators, and bank account Coordinators (as
defined in the SLURM database) the ability to invoke commands that view/modify
user jobs and reservations. Previously, one had to be root to invoke
"scontrol update JobId" for example. In addition, Administrators have the
ability to view/modify node and partition info without having to become root.
For more details, see the AUTHORIZATION section of the man pages for the
following commands: scontrol, scancel and sbcast.
* Users can hold and release their own jobs. Submit in held state using srun
or sbatch --hold or -H options. Hold after submission using the command
"scontrol hold <jobid>". Release with "scontrol release <jobid>". Users can
not release jobs held by a system administrator unless the administrator uses
the command "scontrol uhold <jobid>" ("uhold" for "user hold").
* Add support for slurmctld and slurmd option of "-n <value>" to reset the
daemon's nice value.
* srun's --core option has been removed. Use the SPANK "Core" plugin from
http://code.google.com/p/slurm-spank-plugins/ for continued support.
* Added salloc and sbatch option --wait-all-nodes. If set non-zero, job
initiation will be delayed until all allocated nodes have booted. Salloc
will log the delay with the messages "Waiting for nodes to boot" and "Nodes
are ready for job".
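For example, the following would not start the spawned shell until all of the
allocated nodes have booted (the node count is arbitrary):
  salloc -N16 --wait-all-nodes=1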
* Added scontrol "wait_job <job_id>" option to wait for nodes to boot as needed.
Useful for batch jobs (in Prolog, PrologSlurmctld or the script) if powering
down idle nodes.
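For example, near the top of a batch script (assuming the standard
SLURM_JOB_ID environment variable is set in the batch environment):
  scontrol wait_job $SLURM_JOB_ID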
* Modified sview to display database configuration and add/remove visible tabs.
* Modified sview to save default configuration in .slurm/sviewrc file.
Default settings can be set by using the menu Options->Set Default Settings
or typing Ctrl-S.
* Modified select/cons_res plugin so that if MaxMemPerCPU is configured and a
job specifies its memory requirement, then more CPUs than requested will
automatically be allocated to a job to honor the MaxMemPerCPU parameter.
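As an illustrative case (the values are hypothetical): with MaxMemPerCPU=1024
configured and a job requesting --mem-per-cpu=2048, twice the requested
number of CPUs would be allocated so that the 1024 MB per-CPU limit is
honored.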
* Add new scontrol option of "show aliases" to report every NodeName that is
associated with a given NodeHostName when running multiple slurmd daemons
per compute node (typically used for testing purposes).
BLUEGENE SPECIFIC CHANGES
=========================
OTHER CHANGES
=============
* Added support for a default account and wckey per cluster within accounting.
* Added support for several new trigger types: SlurmDBD failure/restart,
Database failure/restart, Slurmctld failure/restart.
* Support has been added for TotalView to attach to a subset of launched tasks
instead of requiring that all tasks be attached to. This is the default
behavior unless the option "--enable-partial-attach=no" is passed to the
configure (build) script.
* A web application (chart_stats.cgi) has been added that invokes sreport to
retrieve from the accounting storage db a user's request for job usage or
machine utilization statistics and charts the results to a browser.
* Much functionality has been added to account_storage/pgsql. The plugin
is still in an early beta state.
* SLURM's PMI library (for MPICH2) has been modified to properly execute an
executable program stand-alone (single MPI task launched without srun).
* The PMI was also modified to use more socket connections for better
scalability and to clear state between job step invocations.
* Added support for spank_get_item() to get S_STEP_ALLOC_CORES and
S_STEP_ALLOC_MEM. Support will remain for S_JOB_ALLOC_CORES and
S_JOB_ALLOC_MEM.
* Changed error message from "Requested time limit exceeds partition limit"
to "Requested time limit is invalid (exceeds some limit)". The error can be
triggered by a time limit exceeding the user/bank limit or the time-min
exceeding the job or partition's time limit.
* Added proctrack/cgroup plugin which uses Linux control groups (aka cgroup) to
track processes on Linux systems with this feature (kernel >= 2.6.24).
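For example (assuming cgroup support is enabled in the kernel and mounted):
  ProctrackType=proctrack/cgroup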
* Added the derived_ec (exit code) member to job_info_t. exit_code captures
the exit code of the job script (or salloc) while derived_ec contains the
highest exit code of all the job steps.
* Added the derived exit code and derived exit string fields to the database's
job record. Both can be modified by the user after the job completes. See
job_exit_code.html
API CHANGES
===========
Changed members of the following structs
========================================
job_info_t
num_procs -> num_cpus
job_min_cpus -> pn_min_cpus
job_min_memory -> pn_min_memory
job_min_tmp_disk -> pn_min_tmp_disk
min_sockets -> sockets_per_node
min_cores -> cores_per_socket
min_threads -> threads_per_core
job_desc_msg_t
num_procs -> min_cpus
job_min_cpus -> pn_min_cpus
job_min_memory -> pn_min_memory
job_min_tmp_disk -> pn_min_tmp_disk
min_sockets -> sockets_per_node
min_cores -> cores_per_socket
min_threads -> threads_per_core
partition_info_t
state_up (new states added PARTITION_DRAIN and PARTITION_INACTIVE)
default_part -> flags (as PART_FLAG_DEFAULT flag)
disable_root_jobs -> flags (as PART_FLAG_NO_ROOT flag)
hidden -> flags (as PART_FLAG_HIDDEN flag)
root_only -> flags (as PART_FLAG_ROOT_ONLY flag)
slurm_step_ctx_params_t
node_count -> min_nodes
slurm_ctl_conf_t
cache_groups -> group_info (as GROUP_CACHE flag)
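As a minimal C sketch of reading the renamed job_info_t members listed above
(the slurm_load_jobs(), slurm_free_job_info_msg() and slurm_perror() calls,
the SHOW_ALL flag, and the job_info_msg_t record_count/job_array members are
assumed from the existing API and are not part of the changes listed here):
  #include <stdio.h>
  #include <stdint.h>
  #include <slurm/slurm.h>
  #include <slurm/slurm_errno.h>

  int main(void)
  {
      job_info_msg_t *jobs = NULL;
      uint32_t i;

      /* Load all current job records */
      if (slurm_load_jobs((time_t) 0, &jobs, SHOW_ALL) != SLURM_SUCCESS) {
          slurm_perror("slurm_load_jobs");
          return 1;
      }
      for (i = 0; i < jobs->record_count; i++) {
          job_info_t *job = &jobs->job_array[i];
          /* num_cpus and pn_min_memory are the renamed members
           * (formerly num_procs and job_min_memory) */
          printf("job %u: cpus=%u mem=%u\n", job->job_id,
                 job->num_cpus, (unsigned) job->pn_min_memory);
      }
      slurm_free_job_info_msg(jobs);
      return 0;
  }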
Added the following struct definitions
======================================
block_info_t (BlueGene-specific information)
reason
job_info_t
derived_ec
gres
max_cpus
resize_time
show_flags
time_min
job_desc_msg_t
gres
max_cpus
time_min
wait_all_nodes
job_step_info_t
gres
node_info_t
boot_time
gres
reason_time
reason_uid
slurmd_start_time
partition_info_t
alternate
flags
preempt_mode
slurm_ctl_conf_t
gres_plugins
group_info
hash_val
job_submit_plugins
sched_logfile
sched_log_level
slurmctld_port_count
vsize_factor
slurm_step_ctx_params_t
features
gres
max_nodes
update_node_msg_t
gres
preempt_mode
reason_uid
Changed the following enums
===========================
job_state_reason
FAIL_BANK_ACCOUNT -> FAIL_ACCOUNT
FAIL_QOS /* invalid QOS */
WAIT_QOS_THRES /* required QOS threshold has been breached */
select_jobdata_type
SELECT_JOBDATA_PTR /* data-> select_jobinfo_t *jobinfo */
select_nodedata_type
SELECT_NODEDATA_PTR /* data-> select_nodeinfo_t *nodeinfo */
The select_type_plugin_info enum no longer exists; its contents are now mostly #defines.
Added the following APIs
========================
slurm_checkpoint_requeue()
slurm_init_update_step_msg()
slurm_job_step_get_pids()
slurm_job_step_pids_free()
slurm_job_step_pids_response_msg_free()
slurm_job_step_stat()
slurm_job_step_stat_free()
slurm_job_step_stat_response_msg_free()
slurm_list_append()
slurm_list_count()
slurm_list_create()
slurm_list_destroy()
slurm_list_find()
slurm_list_is_empty()
slurm_list_iterator_create()
slurm_list_iterator_reset()
slurm_list_iterator_destroy()
slurm_list_next()
slurm_list_sort()
slurm_set_schedlog_level()
slurm_step_launch_fwd_wake()
slurm_update_step()
Changed the following APIs
==========================
slurm_load_block_info(): Added show_flag parameter