| RELEASE NOTES FOR SLURM VERSION 2.2 |
| 10 January 2011 |
| |
| |
| IMPORTANT NOTE: |
| If using the slurmdbd (SLURM DataBase Daemon) you must update this first. |
| The 2.2 slurmdbd will work with SLURM daemons of version 2.1.3 and above. |
| You do not need to update all clusters at the same time, but it is very |
| important to update the slurmdbd first and have it running before updating |
| any other clusters that make use of it. No real harm will come from updating |
| your systems before the slurmdbd, but they will not talk to each other |
| until you do. Also, at least the first time you run the slurmdbd, make sure |
| your my.cnf file has innodb_buffer_pool_size set to at least 64M. |
| You can accomplish this by adding the line |
| |
| innodb_buffer_pool_size=64M |
| |
| under the [mysqld] reference in the my.cnf file and restarting the mysqld. |
| This is needed when converting large tables over to the new database schema. |
| |
| SLURM can be upgraded from version 2.1 to version 2.2 without loss of jobs or |
| other state information. |
| |
| |
| HIGHLIGHTS |
| ========== |
| * Slurmctld restart/reconfiguration operations have been altered. |
| NOTE: There will be no change in behavior unless partition configuration |
| or node Features/Weight are altered using the scontrol command to differ |
| from the contents of the slurm.conf configuration file. |
| |
| Preserve current partition state information plus node Feature and Weight |
| state information after slurmctld receives a SIGHUP signal or is restarted |
| with the -R option. Recreate partition plus node information (except node |
| State and Reason) from slurm.conf file after executing "scontrol reconfig" |
| or restarting slurmctld *without* the -R option. |
| |
| OPERATION              ACTION |
| slurmctld -R           Recover all job, node and partition state |
| slurmctld              Recover job state plus state and reason for DOWN |
|                        and DRAINED nodes only, recreate all other node |
|                        state plus all partition state |
| slurmctld -c           Recover no jobs, recreate node and partition state |
| SIGHUP to slurmctld    Preserve all job, node and partition state |
| scontrol reconfig      Preserve job state, recreate node and partition state |
| |
| Old logic preserved node Feature plus partition state after "slurmctld" or |
| "scontrol reconfig" rather than recreating it from slurm.conf. Node Weight |
| was formerly always recreated from slurm.conf. |
| |
| * SLURM commands (squeue, sinfo, sview, etc.) can now operate between |
| clusters. Jobs can also be submitted with sbatch to other cluster(s), with |
| the job routed to the one cluster expected to initiate the job first. |
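| For example, a job might be submitted to whichever of several clusters can |
| start it first, and a remote queue then inspected (the cluster names below |
| are only illustrative): |
| |
| sbatch --clusters=cluster1,cluster2 my_job.sh |
| squeue --clusters=cluster1 |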
| |
| * Accounting through the SlurmDBD with the MySQL plugin can now support |
| a default account and wckey per cluster. |
| |
| CONFIGURATION FILE CHANGES (see "man slurm.conf" for details) |
| ============================================================= |
| * A hash of the slurm.conf running on each node in the cluster is sent when |
| registering with the slurmctld so it can verify the slurm.conf is the same |
| as the one it is running. If not, an error message is displayed. To |
| silence this message add NO_CONF_HASH to DebugFlags in your slurm.conf. |
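| For example, the slurm.conf entry to silence the message would look like: |
| |
| DebugFlags=NO_CONF_HASH |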
| |
| * Added VSizeFactor to enforce virtual memory limits for jobs and job steps as |
| a percentage of their real memory allocation. |
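| As an illustrative slurm.conf setting, the following would cap a job's |
| virtual memory at 110% of its real memory allocation: |
| |
| VSizeFactor=110 |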
| |
| * Added new option for SelectTypeParameters of CR_ONE_TASK_PER_CORE. This |
| option will allocate one task per core by default. Without this option, |
| by default one task will be allocated per thread on nodes with more than |
| one ThreadsPerCore configured (i.e. no change in behavior without this |
| option). |
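| A sketch of the corresponding slurm.conf entry, assuming the option is |
| combined with a CR_* resource selection option such as CR_Core: |
| |
| SelectType=select/cons_res |
| SelectTypeParameters=CR_Core,CR_ONE_TASK_PER_CORE |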
| |
| * Add new configuration parameters GroupUpdateForce and GroupUpdateTime. These |
| control when slurmctld updates its information of which users are in the |
| groups allowed to use partitions. NOTE: There is no change in the default |
| behavior. |
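| An illustrative configuration (the values shown are examples only): |
| |
| GroupUpdateForce=1 |
| GroupUpdateTime=600 |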
| |
| * Added new configuration parameters SlurmSchedLogFile and SlurmSchedLogLevel |
| to support writing scheduling events to a separate log file. |
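| An illustrative configuration (the log file path is an example): |
| |
| SlurmSchedLogFile=/var/log/slurm/sched.log |
| SlurmSchedLogLevel=1 |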
| |
| * Added new configuration parameter JobSubmitPlugins which provides a mechanism |
| to set default job parameters or perform other site-configurable actions at |
| job submit time. Site-specific job submission plugins may be written in |
| either C or Lua. |
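| For example, a site wanting to use the Lua plugin shipped with SLURM would |
| add something like the following to slurm.conf (assuming the plugin name |
| "lua"): |
| |
| JobSubmitPlugins=lua |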
| |
| * MaxJobCount changed from 16-bit to 32-bit field. The default MaxJobCount was |
| changed from 5,000 to 10,000. |
| |
| * Added support for a PropagatePrioProcess configuration parameter value of 2 |
| to restrict spawned task nice values to that of the slurmd daemon plus 1. |
| This ensures that the slurmd daemon always has a higher scheduling priority |
| than spawned tasks. Also added support in slurmctld, slurmd and slurmdbd for |
| option of "-n <value>" to reset the daemon's nice value. |
| |
| * Support has been added for the allocation of generic resources (GRES). A |
| new configuration parameter, GresPlugins, has been added along with a node- |
| specific parameter, Gres. There is also a gres.conf file to be configured on |
| each node. For more information, see the web page |
| https://computing.llnl.gov/linux/slurm/gres.html |
| Support for enforcement of these allocations using Linux CGroup will be |
| provided in a later release. |
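| A minimal sketch of a GPU configuration (node names, GPU counts, and device |
| paths are illustrative): |
| |
| In slurm.conf: |
| GresPlugins=gpu |
| NodeName=tux[1-4] Gres=gpu:2 ... |
| |
| In gres.conf on each node: |
| Name=gpu File=/dev/nvidia0 |
| Name=gpu File=/dev/nvidia1 |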
| |
| * Added support for new partition states of DRAIN (run queued jobs, but accept |
| no new jobs) and INACTIVE (do not accept or run any more jobs) and new |
| partition option of "Alternate" (alternate partition to use for jobs |
| submitted to partitions that are currently in a state of DRAIN or INACTIVE). |
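| For example, a partition might be drained for maintenance while new |
| submissions are routed to an alternate partition (partition names are |
| illustrative): |
| |
| PartitionName=batch Nodes=tux[1-32] Alternate=debug |
| scontrol update PartitionName=batch State=DRAIN |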
| |
| * Added the ability to configure PreemptMode on a per-partition or per-QOS |
| basis. |
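| For example, a low-priority partition could be made preemptable via requeue |
| with a slurm.conf line such as (partition name illustrative): |
| |
| PartitionName=low Nodes=tux[1-32] PreemptMode=requeue |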
| |
| * Modified the meaning of InactiveLimit slightly. It will now cancel the job |
| allocation created using the salloc or srun command if those commands cease |
| responding for the InactiveLimit regardless of any running job steps. This |
| parameter will no longer affect jobs spawned using sbatch. |
| |
| * Added SchedulerParameters option of bf_window to control how far into the |
| future that the backfill scheduler will look when considering jobs to start. |
| The default value is one day. |
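| For example, to have the backfill scheduler look two days ahead (the value |
| is assumed to be in minutes): |
| |
| SchedulerParameters=bf_window=2880 |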
| |
| * Added the ability to specify a range of ports in the SlurmctldPort parameter |
| for better handling of high bursts of RPCs (e.g. "SlurmctldPort=1234-1237"). |
| |
| COMMAND CHANGES (see man pages for details) |
| =========================================== |
| * sinfo -R now has the user and timestamp in separate fields from the reason. |
| |
| * Job submission commands (salloc, sbatch and srun) have a new option, |
| --time-min, which permits a job's time limit to be reduced, down to the |
| specified minimum, when doing so lets the job start earlier through |
| backfill scheduling. |
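| For example, a job that prefers eight hours but can make useful progress in |
| as little as two might be submitted as: |
| |
| sbatch --time=08:00:00 --time-min=02:00:00 my_job.sh |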
| |
| * scontrol now has the ability to change a job step's time limit. |
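| A sketch of the command, assuming the usual JobID.StepID notation and a |
| time limit given in minutes: |
| |
| scontrol update StepId=1234.0 TimeLimit=30 |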
| |
| * scontrol now has the ability to shrink a job's size. Use a command of |
| "scontrol update JobId=# NumNodes=#" or |
| "scontrol update JobId=# NodeList=<names>". This command generates a script |
| to be executed in order to reset SLURM environment variables for proper |
| execution of subsequent job steps. |
| |
| * We have given Operators, Administrators, and bank account Coordinators (as |
| defined in the SLURM database) the ability to invoke commands that view/modify |
| user jobs and reservations. Previously, one had to be root to invoke |
| "scontrol update JobId" for example. In addition, Administrators have the |
| ability to view/modify node and partition info without having to become root. |
| For more details, see the AUTHORIZATION section of the man pages for the |
| following commands: scontrol, scancel and sbcast. |
| |
| * Users can hold and release their own jobs. Submit in held state using srun |
| or sbatch --hold or -H options. Hold after submission using the command |
| "scontrol hold <jobid>". Release with "scontrol release <jobid>". Users can |
| not release jobs held by a system administrator unless the administrator uses |
| the command "scontrol uhold <jobid>" ("uhold" for "user hold"). |
| |
| * Add support for slurmctld and slurmd option of "-n <value>" to reset the |
| daemon's nice value. |
| |
| * srun's --core option has been removed. Use the SPANK "Core" plugin from |
| http://code.google.com/p/slurm-spank-plugins/ for continued support. |
| |
| * Added salloc and sbatch option --wait-all-nodes. If set non-zero, job |
| initiation will be delayed until all allocated nodes have booted. Salloc |
| will log the delay with the messages "Waiting for nodes to boot" and "Nodes |
| are ready for job". |
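| For example: |
| |
| salloc -N8 --wait-all-nodes=1 ./my_script |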
| |
| * Added scontrol "wait_job <job_id>" option to wait for nodes to boot as needed. |
| Useful for batch jobs (in Prolog, PrologSlurmctld or the script) if powering |
| down idle nodes. |
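| A minimal batch-script sketch using the job's own ID: |
| |
| #!/bin/sh |
| scontrol wait_job $SLURM_JOB_ID |
| srun my_program |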
| |
| * Modified sview to display database configuration and add/remove visible tabs. |
| |
| * Modified sview to save default configuration in .slurm/sviewrc file. |
| Default setting can be set by using the menus Options->Set Default Settings |
| or typing Ctrl-S. |
| |
| * Modified select/cons_res plugin so that if MaxMemPerCPU is configured and a |
| job specifies its memory requirement, then more CPUs than requested will |
| automatically be allocated to a job to honor the MaxMemPerCPU parameter. |
| |
| * Add new scontrol option of "show aliases" to report every NodeName that is |
| associated with a given NodeHostName when running multiple slurmd daemons |
| per compute node (typically used for testing purposes). |
| |
| BLUEGENE SPECIFIC CHANGES |
| ========================= |
| |
| OTHER CHANGES |
| ============= |
| * Added support for a default account and wckey per cluster within accounting. |
| |
| * Added support for several new trigger types: SlurmDBD failure/restart, |
| Database failure/restart, Slurmctld failure/restart. |
| |
| * Support has been added for TotalView to attach to a subset of launched tasks |
| instead of requiring that all tasks be attached to. This is the default |
| behavior unless the option "--enable-partial-attach=no" is passed to the |
| configure (build) script. |
| |
| * A web application (chart_stats.cgi) has been added that invokes sreport to |
| retrieve from the accounting storage db a user's request for job usage or |
| machine utilization statistics and charts the results to a browser. |
| |
| * Much functionality has been added to account_storage/pgsql. The plugin |
| is still in an early beta state. |
| |
| * SLURM's PMI library (for MPICH2) has been modified to properly execute an |
| executable program stand-alone (single MPI task launched without srun). |
| |
| * The PMI was also modified to use more socket connections for better |
| scalability and to clear state between job step invocations. |
| |
| * Added support for spank_get_item() to get S_STEP_ALLOC_CORES and |
| S_STEP_ALLOC_MEM. Support will remain for S_JOB_ALLOC_CORES and |
| S_JOB_ALLOC_MEM. |
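| A minimal SPANK plugin sketch retrieving the step's allocated cores; the |
| assumption here is that S_STEP_ALLOC_CORES fills in the core list as a |
| string (see slurm/spank.h for the authoritative item types): |
| |
| #include <slurm/spank.h> |
| |
| SPANK_PLUGIN(alloc_cores_info, 1); |
| |
| int slurm_spank_task_init(spank_t sp, int ac, char **av) |
| { |
|     char *cores = NULL;  /* assumed: core list returned as a string */ |
| |
|     if (spank_get_item(sp, S_STEP_ALLOC_CORES, &cores) == ESPANK_SUCCESS) |
|         slurm_info("step allocated cores: %s", cores); |
|     return 0; |
| } |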
| |
| * Changed error message from "Requested time limit exceeds partition limit" |
| to "Requested time limit is invalid (exceeds some limit)". The error can be |
| triggered by a time limit exceeding the user/bank limit or the time-min |
| exceeding the job or partition's time limit. |
| |
| * Added proctrack/cgroup plugin which uses Linux control groups (aka cgroup) to |
| track processes on Linux systems with this feature (kernel >= 2.6.24). |
| |
| * Added the derived_ec (exit code) member to job_info_t. exit_code captures |
| the exit code of the job script (or salloc) while derived_ec contains the |
| highest exit code of all the job steps. |
| |
| * Added the derived exit code and derived exit string fields to the database's |
| job record. Both can be modified by the user after the job completes. See |
| job_exit_code.html |
| |
| |
| API CHANGES |
| =========== |
| |
| Changed members of the following structs |
| ======================================== |
| job_info_t |
| num_procs -> num_cpus |
| job_min_cpus -> pn_min_cpus |
| job_min_memory -> pn_min_memory |
| job_min_tmp_disk -> pn_min_tmp_disk |
| min_sockets -> sockets_per_node |
| min_cores -> cores_per_socket |
| min_threads -> threads_per_core |
| |
| job_desc_msg_t |
| num_procs -> min_cpus |
| job_min_cpus -> pn_min_cpus |
| job_min_memory -> pn_min_memory |
| job_min_tmp_disk -> pn_min_tmp_disk |
| min_sockets -> sockets_per_node |
| min_cores -> cores_per_socket |
| min_threads -> threads_per_core |
| |
| partition_info_t |
| state_up (new states added PARTITION_DRAIN and PARTITION_INACTIVE) |
| default_part -> flags (as PART_FLAG_DEFAULT flag) |
| disable_root_jobs -> flags (as PART_FLAG_NO_ROOT flag) |
| hidden -> flags (as PART_FLAG_HIDDEN flag) |
| root_only -> flags (as PART_FLAG_ROOT_ONLY flag) |
| |
| slurm_step_ctx_params_t |
| node_count -> min_nodes |
| |
| slurm_ctl_conf_t |
| cache_groups -> group_info (as GROUP_CACHE flag) |
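| |
| As a small illustration of the job_info_t renames above, a sketch (using |
| the standard slurm_load_jobs() call) that prints the renamed per-job and |
| per-node values: |
| |
| #include <stdio.h> |
| #include <slurm/slurm.h> |
| |
| int main(void) |
| { |
|     job_info_msg_t *msg = NULL; |
|     uint32_t i; |
| |
|     if (slurm_load_jobs((time_t) 0, &msg, SHOW_ALL) != 0) { |
|         slurm_perror("slurm_load_jobs"); |
|         return 1; |
|     } |
|     for (i = 0; i < msg->record_count; i++) { |
|         job_info_t *job = &msg->job_array[i]; |
|         /* num_procs is now num_cpus; job_min_* are now pn_min_* */ |
|         printf("job %u: num_cpus=%u pn_min_cpus=%u pn_min_memory=%u\n", |
|                job->job_id, job->num_cpus, job->pn_min_cpus, |
|                (unsigned) job->pn_min_memory); |
|     } |
|     slurm_free_job_info_msg(msg); |
|     return 0; |
| } |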
| |
| |
| Added the following struct definitions |
| ====================================== |
| block_info_t (BlueGene-specific information) |
| reason |
| |
| job_info_t |
| derived_ec |
| gres |
| max_cpus |
| resize_time |
| show_flags |
| time_min |
| |
| job_desc_msg_t |
| gres |
| max_cpus |
| time_min |
| wait_all_nodes |
| |
| job_step_info_t |
| gres |
| |
| node_info_t |
| boot_time |
| gres |
| reason_time |
| reason_uid |
| slurmd_start_time |
| |
| partition_info_t |
| alternate |
| flags |
| preempt_mode |
| |
| slurm_ctl_conf_t |
| gres_plugins |
| group_info |
| hash_val |
| job_submit_plugins |
| sched_logfile |
| sched_log_level |
| slurmctld_port_count |
| vsize_factor |
| |
| slurm_step_ctx_params_t |
| features |
| gres |
| max_nodes |
| |
| update_node_msg_t |
| gres |
| preempt_mode |
| reason_uid |
| |
| |
| Changed the following enums |
| =========================== |
| job_state_reason |
| FAIL_BANK_ACCOUNT -> FAIL_ACCOUNT |
| FAIL_QOS /* invalid QOS */ |
| WAIT_QOS_THRES /* required QOS threshold has been breached */ |
| |
| select_jobdata_type |
| SELECT_JOBDATA_PTR /* data-> select_jobinfo_t *jobinfo */ |
| |
| select_nodedata_type |
| SELECT_NODEDATA_PTR /* data-> select_nodeinfo_t *nodeinfo */ |
| |
| The select_type_plugin_info enum has been removed and its contents are now |
| mostly #defines. |
| |
| Added the following APIs |
| ========================= |
| slurm_checkpoint_requeue() |
| slurm_init_update_step_msg() |
| slurm_job_step_get_pids() |
| slurm_job_step_pids_free() |
| slurm_job_step_pids_response_msg_free() |
| slurm_job_step_stat() |
| slurm_job_step_stat_free() |
| slurm_job_step_stat_response_msg_free() |
| slurm_list_append() |
| slurm_list_count() |
| slurm_list_create() |
| slurm_list_destroy() |
| slurm_list_find() |
| slurm_list_is_empty() |
| slurm_list_iterator_create() |
| slurm_list_iterator_reset() |
| slurm_list_iterator_destroy() |
| slurm_list_next() |
| slurm_list_sort() |
| slurm_set_schedlog_level() |
| slurm_step_launch_fwd_wake() |
| slurm_update_step() |
| |
| |
| Changed the following APIs |
| =========================== |
| slurm_load_block_info(): Added show_flag parameter |