| This file describes changes in recent versions of SLURM. It primarily |
| documents those changes that are of interest to users and admins. |
| |
| * Changes in SLURM 1.2.36 |
| ========================= |
| -- For spank_get_item(S_JOB_ARGV) for batch job with script input via STDIN, |
| set argc value to 1 (rather than 2, argv[0] still set to path of generated |
| script). |
| -- sacct now correctly displays allocations made with salloc that have only |
| one step. |
| |
| * Changes in SLURM 1.2.35 |
| ========================= |
| -- Permit SPANK plugins to dynamically register options at runtime based upon |
| configuration or other runtime checks. |
| -- Add "include" keyword to SPANK plugstack.conf file to optionally include |
| other configuration files or directories of configuration files. |
| -- Srun to wait indefinitely for resource allocation to be made. Used to |
| abort after two minutes. |
| |
| * Changes in SLURM 1.2.34 |
| ========================= |
| -- Permit the cancellation of a job that is in the process of being |
| requeued. |
| -- Ignore the show_flag when getting job, step, node or partition information |
| for user root. |
| -- Convert some functions to thread-safe versions: getpwnam, getpwuid, |
| getgrnam, and getgrgid to similar functions with "_r" suffix. While no |
| failures have been observed, a race condition would in the worst case |
| permit a user access to a partition not normally allowed due to the |
| AllowGroup specification or the wrong user identified in an accounting |
| record. The job would NOT be run as the wrong user. |
| -- For PMI only (MPICH2/MVAPICH2) base address to send messages to (the srun) |
| upon the address from which slurmd gets the task launch request rather than |
| "hostname" where srun executes. |
| -- Make test for StateSaveLocation directory more comprehensive. |
| -- For jobcomp/script plugin, PROCS environment variable is now the actual |
| count of allocated processors rather than the count of processes to |
| be started. |
| |
| * Changes in SLURM 1.2.33 |
| ========================= |
| -- Cancelled or Failed jobs will now report their job and step id on exit |
| -- Add SPANK items available to get: SLURM_VERSION, SLURM_VERSION_MAJOR, |
| SLURM_VERSION_MINOR and SLURM_VERSION_MICRO. |
| -- Fixed handling of SIGPIPE in srun. Abort job. |
| -- Fix bug introduced to MVAPICH plugin preventing use of TotalView debugger. |
| -- Modify slurmctld to get srun/salloc network address based upon the incoming |
| message rather than hostname set by the user command (backport of logic in |
| SLURM v1.3). |
| |
| * Changes in SLURM 1.2.32 |
| ========================= |
| -- LSF only: Enable scancel of job in RootOnly partition by the job's owner. |
| -- Add support for sbatch --distribution and --network options. |
| -- Correct pending job's wait reason to "Priority" rather than "Resources" if |
| required resources are being held in reserve for a higher priority job. |
| -- In sched/wiki2 (Moab) report a node's state as "Drained" rather than |
| "Draining" if it has no allocated work (An undocumented Moab wiki option, |
| see CRI ticket #2394). |
| -- Log to job's output when it is cancelled or reaches its time limit (ported |
| from existing code in slurm v1.3). |
| -- Add support in salloc and sbatch commands for --network option. |
| -- Add support for user environment variables that include '\n' (e.g. |
| bash functions). |
| -- Partial rewrite of mpi/mvapich plugin for improved scalability. |
| |
| * Changes in SLURM 1.2.31 |
| ========================= |
| -- For Moab only: If GetEnvTimeout=0 in slurm.conf then do not run "su" to get |
| the user's environment, only use the cache file. |
| -- For sched/wiki2 (Moab), treat the lack of a wiki.conf file or the lack |
| of a configured AuthKey as a fatal error (lacks effective security). |
| -- For sched/wiki and sched/wiki2 (Maui or Moab) report a node's state as |
| Busy rather than Running when allocated if SelectType=select/linear. Moab |
| was trying to schedule jobs on nodes that were already allocated to jobs |
| that were hidden from it via the HidePartitionJobs option in Slurm's wiki.conf. |
| -- In select/cons_res improve the resource selection when a job has specified |
| a processor count along with a maximum node count. |
| -- For an srun command with --ntasks-per-node option and *no* --ntasks count, |
| spawn a task count equal to the number of nodes selected multiplied by the |
| --ntasks-per-node value. |
| -- In jobcomp/script: Set TZ if set in slurmctld's environment. |
| -- In srun with --verbose option properly format CPU allocation information |
| logged for clusters with 1000+ nodes and 10+ CPUs per node. |
| -- Process a job's --mail-type=end option on any type of job termination, not |
| just normal completion (e.g. all failure modes too). |
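The --ntasks-per-node entry above amounts to a simple multiplication. A minimal sketch of that arithmetic follows; the function name `default_task_count` is hypothetical and is not part of SLURM.

```python
def default_task_count(num_nodes: int, ntasks_per_node: int) -> int:
    """Task count when --ntasks-per-node is given and --ntasks is omitted.

    Illustrative only: mirrors the rule stated in the changelog entry,
    not the actual srun source code.
    """
    if num_nodes < 1 or ntasks_per_node < 1:
        raise ValueError("node count and tasks per node must be positive")
    # Number of selected nodes multiplied by the per-node task value.
    return num_nodes * ntasks_per_node


if __name__ == "__main__":
    # Four selected nodes with --ntasks-per-node=8 gives 32 tasks.
    print(default_task_count(4, 8))
```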
| |
| * Changes in SLURM 1.2.30 |
| ========================= |
| -- Fix for gold not to print out 720 error messages since they are |
| potentially harmful. |
| -- In sched/wiki2 (Moab), permit changes to a pending job's required features: |
| CMD=CHANGEJOB ARG=<jobid> RFEATURES=<features> |
| -- Fix for not aborting when node selection doesn't load, fatal error instead |
| -- In sched/wiki and sched/wiki2 DO NOT report a job's state as "Hold" if its |
| dependencies have not been satisfied. This reverses a change made in SLURM |
| version 1.2.29 (which was requested by Cluster Resources, but places jobs |
| in a HELD state indefinitely). |
| |
| * Changes in SLURM 1.2.29 |
| ========================= |
| -- Modified global configuration option "DisableRootJobs" from number (0 or 1) |
| to boolean (YES or NO) to match partition parameter. |
| -- Set "DisableRootJobs" for a partition to match the global parameter's value |
| for newly created partitions. |
| -- In sched/wiki and sched/wiki2 report a node's updated features if changed |
| after startup using "scontrol update ..." command. |
| -- In sched/wiki and sched/wiki2 report a job's state as "Hold" if its |
| dependencies have not been satisfied. |
| -- In sched/wiki and sched/wiki2 do not process incoming requests until |
| slurm configuration is completely loaded. |
| -- In sched/wiki and sched/wiki2 do not report a job's node count after it |
| has completed (slurm decrements the allocated node count when the nodes |
| transition from completing to idle state). |
| -- If job prolog or epilog fail, log the program's exit code. |
| -- In jobacct/gold map job names containing any non-alphanumeric characters |
| to '_' to avoid MySQL parsing problems. |
| -- In jobacct/linux correct parsing if command name contains spaces. |
| -- In sched/wiki and sched/wiki2 make job info TASK count reflect the |
| actual task allocation (not requested tasks) even after job terminates. |
| Useful for accounting purposes only. |
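The jobacct/gold entry above maps non-alphanumeric characters in job names to '_'. A minimal sketch of that kind of mapping is shown below. It assumes "alphanumeric" means ASCII letters and digits, and the function name `sanitize_job_name` is hypothetical rather than taken from the plugin.

```python
import re


def sanitize_job_name(name: str) -> str:
    """Replace every character that is not an ASCII letter or digit with '_'.

    Assumption: the plugin's notion of "alphanumeric" is ASCII-only.
    """
    return re.sub(r"[^A-Za-z0-9]", "_", name)


if __name__ == "__main__":
    print(sanitize_job_name("my job's-run.1"))
```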
| |
| * Changes in SLURM 1.2.28 |
| ========================= |
| -- Added configuration option "DisableRootJobs" for parameter |
| "PartitionName". See "man slurm.conf" for details. |
| -- Fix for faking a large system to correctly handle node_id in the task |
| affinity plugin for ia64 systems. |
| |
| * Changes in SLURM 1.2.27 |
| ========================= |
| -- Record job eligible time in accounting database (for jobacct/gold only). |
| -- Prevent user root from executing a job step within a job allocation |
| belonging to another user. |
| -- Fixed limiting issue for strings larger than 4096 in xstrfmtcat |
| -- Fix bug in how Slurm reports job state to Maui/Moab when a job is requeued |
| due to a node failure, but we can't terminate the job's spawned processes. |
| Job was being reported as PENDING when it was really still COMPLETING. |
| -- Added patch from Jerry Smith for qstat -a output |
| -- Fixed looking at the correct perl path for Slurm.pm in torque wrappers. |
| -- Enhance job requeue on node failure to be more robust. |
| -- Added configuration parameter "DisableRootJobs". See "man slurm.conf" |
| for details. |
| -- Fixed issue with account = NULL in Gold job accounting plugin |
| |
| * Changes in SLURM 1.2.26 |
| ========================= |
| -- Correct number of sockets/cores/threads reported by slurmd (from |
| Par Andersson, National Supercomputer Centre, Sweden). |
| -- Update libpmi linking so that libslurm is not required for PMI use |
| (from Steven McDougal, SiCortex). |
| -- In srun and sbatch, do not check the PATH env var if an absolute pathname |
| of the program is specified (previously reported an error if no PATH). |
| -- Correct output of "sinfo -o %C" (CPU counts by node state). |
| |
| * Changes in SLURM 1.2.25 |
| ========================= |
| -- Bug fix for setting exit code in accounting for batch script. |
| -- Add salloc option, --no-shell (for LSF). |
| -- Added new options for sacct output |
| -- mvapich: Ensure MPIRUN_ID is unique for all job steps within a job. |
| (Fixes crashes when running multiple job steps within a job on one node) |
| -- Prevent "scontrol show job" from failing with buffer overflow when a job |
| has a very long Comment field. |
| -- Make certain that a job step is purged when a job has been completed. |
| Previous versions could have the job step persist if an allocated node |
| went DOWN and the slurmctld restarted. |
| -- Fix bug in sbcast that can cause communication problems for large files. |
| -- Add sbcast option -t/--timeout and SBCAST_TIMEOUT environment variable |
| to control message timeout. |
| -- Add threaded agent to manage a queue of Gold update requests for |
| performance reasons. |
| -- Add salloc options --chdir and --get-user-env (for Moab). |
| -- Modify scontrol update to support job comment changes. |
| -- Do not clear a DRAINED node's reason field when slurmctld restarts. |
| -- Do not cancel a pending job if Moab or Maui try to start it on unusable nodes. |
| Leave the job queued. |
| -- Add --requeue option to srun and sbatch (these undocumented options have no |
| effect in slurm v1.2, but are legitimate options in slurm v1.3). |
| |
| * Changes in SLURM 1.2.24 |
| ========================= |
| -- In sched/wiki and sched/wiki2, support non-zero UPDATE_TIME specification |
| for GETNODES and GETJOBS commands. |
| -- Bug fix for sending accounting information multiple times for same |
| info. patch from Hongjia Cao (NUDT). |
| -- BLUEGENE - try FILE pointer rotation logic to avoid core dump on |
| bridge log rotate |
| -- Spread out in time the EPILOG_COMPLETE messages from slurmd to slurmctld |
| to avoid message congestion and retransmission. |
| |
| * Changes in SLURM 1.2.23 |
| ========================= |
| -- Fix for libpmi to not export unneeded variables like xstr* |
| -- BLUEGENE - added per partition dynamic block creation |
| -- fix infinite loop bug in sview when there were multiple partitions |
| -- Send message to srun command when a job is requeued due to node failure. |
| Note this will be overwritten in the output file unless JobFileAppend |
| is set in slurm.conf. In slurm version 1.3, srun's --open-mode=append |
| option will offer this control for each job. |
| -- Change a node's default TmpDisk from 1MB to 0MB and change job's default |
| disk space requirement from 1MB to 0MB. |
| -- In sched/wiki (Maui scheduler) specify a QOS (quality of service) by |
| specifying an account of the form "qos-name". |
| -- In select/linear, fix bug in scheduling required nodes that already have |
| a job running on them (req.load.patch from Chris Holmes, HP). |
| -- For use with Moab only: change timeout for srun/sbatch --get-user-env |
| option to 2 secs, don't get DISPLAY environment variables, but explicitly |
| set ENVIRONMENT=BATCH and HOSTNAME to the execution host of the batch script. |
| -- Add configuration parameter GetEnvTimeout for use with Moab. See |
| "man slurm.conf" for details. |
| -- Modify salloc and sbatch to accept both "--tasks" and "--ntasks" as |
| equivalent options for compatibility with srun. |
| -- If a partition's node list contains space separators, replace them with |
| commas for easier parsing. |
| -- BLUEGENE - fixed bug in geometry specs when creating a block. |
| -- Add support for Moab and Maui to start jobs with select/cons_res plugin |
| and jobs requiring more than one CPU per task. |
| |
| * Changes in SLURM 1.2.22 |
| ========================= |
| -- In sched/wiki2, add support for MODIFYJOB option "MINSTARTTIME=<time>" |
| to modify a job's earliest start time. |
| -- In sbcast, fix bug with large files that caused sbcast to die. |
| -- In sched/wiki2, add support for COMMENT= option in STARTJOB and CANCELJOB |
| commands. |
| -- Avoid printing negative job run time in squeue due to clock skew. |
| -- In sched/wiki and sched/wiki2, add support for wiki.conf option |
| HidePartitionJobs (see man pages for details). |
| -- Update to srun/sbatch --get-user-env option logic (needed by Moab). |
| -- In slurmctld (for Moab) added job->details->reserved_resources field |
| to report resources that were kept in reserve for job while it was |
| pending. |
| -- In sched/wiki (for Maui scheduler) report a pending job's node feature |
| requirements (from Miguel Roa, BSC). |
| -- Permit a user to change a pending job's TasksPerNode specification |
| using scontrol (from Miguel Roa, BSC). |
| -- Add support for node UP/DOWN event logging in jobacct/gold plugin. |
| WARNING: using the jobacct/gold plugin slows the system startup; set the |
| MessageTimeout variable in slurm.conf to around 20+. |
| -- Added check at start of slurmctld to look for /tmp/slurm_gold_first. If it |
| is there and the gold plugin is in use, slurm will make a record of all nodes |
| in downed or drained state. |
| |
| * Changes in SLURM 1.2.21 |
| ========================= |
| -- Fixed torque wrappers to look in the correct spot for the perl api |
| -- Do not treat user resetting his time limit to the current value as |
| an error. |
| -- Set correct executable names for Totalview when --multi-prog option |
| is used and more than one node is allocated to the job step. |
| -- When a batch job gets requeued, record in accounting logs that |
| the job was cancelled, the requeued job's submit time will be |
| set to the time of its requeue so it looks like a different job. |
| -- Prevent communication problems if the slurmd/slurmstepd have a |
| different JobAcct plugin configured than slurmctld. |
| -- Adding Gold plugin for job accounting |
| -- In sched/wiki2, add support for MODIFYJOB option "JOBNAME=<name>" |
| to modify a job's name. |
| -- Add configuration check for sys/syslog.h and include it as needed. |
| -- Add --propagate option to sbatch for control over limit propagation. |
| -- Added Gold interface to the jobacct plugin. To configure in the config |
| file specify... |
| JobAcctType=jobacct/gold |
| JobAcctLogFile=CLUSTER_NAME:GOLD_AUTH_KEY_FILE:GOLDD_HOST:GOLDD_PORT7112 |
| -- In slurmctld job record, set begin_time to time when all of a job's |
| dependencies are met. |
| |
| * Changes in SLURM 1.2.20 |
| ========================= |
| -- In switch/federation, fix small memory leak affecting slurmd. |
| -- Add PMI_FANOUT_OFF_HOST environment variable to control how message |
| forwarding is done for PMI (MPICH2). See "man srun" for details. |
| -- From sbatch set SLURM_NTASKS_PER_NODE when --ntasks-per-node option is |
| specified. |
| -- BLUEGENE: Documented the prefix should always be lower case and the 3 |
| digit suffix should be uppercase if any letters are used as digits. |
| -- In sched/wiki and sched/wiki2, add support for --cpus-per-task option. |
| From Miguel Ros, BSC. |
| -- In sched/wiki2, prevent invalid memory pointer (and likely seg fault) |
| for job associated with a partition that has since been deleted. |
| -- In sched/wiki2 plus select/cons_res, prevent invalid memory pointer |
| (and likely seg fault) when a job is requeued. |
| -- In sched/wiki, add support for job suspend, resume, and modify. |
| -- In sched/wiki, add support for processor allocation (not just node allocation) |
| with layout control. |
| -- Prevent re-sending job termination RPC to a node that has already completed |
| the job. Only send it to specific nodes which have not reported completion. |
| -- Support larger environment variables 64K instead of BUFSIZ (8k on some |
| systems). |
| -- If a job is being requeued, job step create requests will print a |
| warning and repeatedly retry rather than aborting. |
| -- Add optional mode value to srun and sbatch --get-user-env option. |
| -- Print error message and retry job submit commands when MaxJobCount |
| is reached. From Don Albert, Bull. |
| -- Treat invalid begin time specification as a fatal error in sbatch and |
| srun. From Don Albert, Bull. |
| -- Validate begin time specification to avoid hours >24, minutes >59, etc. |
| |
| * Changes in SLURM 1.2.19 |
| ========================= |
| *** NOTE IMPORTANT CHANGE IN RPM BUILD BELOW **** |
| -- slurm.spec file (used to build RPMs) was updated in order to support Mock, a |
| chroot build environment. See https://hosted.fedoraproject.org/projects/mock/ |
| for more information. The following RPMs are no longer built by default: |
| aix-federation, auth_none, authd, bluegene, sgijob, and switch-elan. Change |
| the RPMs built using the following options in ~/.rpmmacros: "%_with_authd 1", |
| "%_without_munge 1", etc. See the slurm.spec file for more details. |
| -- Print warning if non-privileged user requests negative "--nice" value on |
| job submission (srun, salloc, and sbatch commands). |
| -- In sched/wiki and sched/wiki2, add support for srun's --ntasks-per-node |
| option. |
| -- In select/bluegene with Groups defined for Images, fix possible memory |
| corruption. Other configurations are not affected. |
| -- BLUEGENE - Fix bug that prevented user specification of linux-image, |
| mloader-image, and ramdisk-image on job submission. |
| -- BLUEGENE - filter Groups specified for image not just by submitting |
| user's current group, but all groups the user has access to. |
| -- BLUEGENE - Add salloc options to specify images to be loaded (--blrts-image, |
| --linux-image, --mloader-image, and --ramdisk-image). |
| -- BLUEGENE - In bluegene.conf, permit Groups to be comma separated in addition |
| to colon separators previously supported. |
| -- sbatch will accept batch script containing "#SLURM" options and advise |
| changing them to "#SBATCH". |
| -- If srun --output or --error specification contains a task number rather |
| than a file name, send stdout/err from specified task to srun's stdout/err |
| rather than to a file by the same name as the task's number. |
| -- For srun --multi-prog option, verify configuration file before attempting |
| to launch tasks, report clear explanation of any configuration file errors. |
| -- For sched/wiki2, add optional timeout option to srun's --get-user-env |
| parameter, change default timeout for "su - <user> env" from 3 to 8 seconds. |
| On timeout, attempt to load env from file at StateSaveLocation/env_cache/<user>. |
| The format of this file is the same as output of "env" command. If there |
| is no env cache file, then abort the request. |
| -- squeue modified for completing job to remove nodes that have already |
| completed the job before applying node filter logic. |
| -- squeue formatted output option added for job comment, "%q" (the obvious |
| choices for letters are already in use). |
| -- Added configure option --enable-load-env-no-login for use with Moab. If |
| set then the user job runs with the environment built without a login |
| ("su <user> env" rather than "su - <user> env"). |
| -- Fix output of "squeue -o %C" (allocated CPU count) for running jobs. This was |
| broken in 1.2.18 for handling requeue of Moab jobs. |
| -- Added logic to mpiexec wrapper to read in the MPIEXEC_TIMEOUT var |
| -- Updated qstat wrapper to display information for partitions (-Q) option |
| -- NOTE: SLURM should now work directly with Globus using the PBS GRAM. |
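The --get-user-env entry above says the env cache file uses the same format as the output of the "env" command. A minimal sketch of reading such a file into a dictionary follows. The function name `parse_env_output` is hypothetical, and the sketch assumes one NAME=value pair per line, so values containing embedded newlines are not handled.

```python
def parse_env_output(text: str) -> dict:
    """Parse text in `env` output format into a {name: value} dictionary.

    Assumption: one NAME=value pair per line. Lines without '=' or with an
    empty name are skipped. Multi-line values are not supported here.
    """
    env = {}
    for line in text.splitlines():
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = line.partition("=")
        if sep and name:
            env[name] = value
    return env


if __name__ == "__main__":
    sample = "HOME=/home/alice\nPATH=/bin:/usr/bin\nOPTS=a=b\n"
    print(parse_env_output(sample))
```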
| |
| * Changes in SLURM 1.2.18 |
| ========================= |
| -- BLUEGENE - bug fix for smap stating passthroughs are used when they aren't |
| -- Fixed bug in sview to be able to edit partitions correctly |
| -- Fixed bug so in slurm.conf files where SlurmdPort isn't defined things |
| work correctly. |
| -- In sched/wiki2 and sched/wiki add support for batch job being requeued |
| in Slurm either when nodes fail or upon request. |
| -- In sched/wiki2 and sched/wiki with FastSchedule=2 configured and nodes |
| configured with more CPUs than actually exist, return a value of TASKS |
| equal to the number of configured CPUs that are allocated to a job rather |
| than the number of physical CPUs allocated. |
| -- For sched/wiki2, timeout "srun --get-user-env ..." command after 3 seconds |
| if unable to perform pseudo-login and get user environment variables. |
| -- Add contribs/time_login.c program to test how long pseudo-login takes |
| for specific users or all users. This can identify users for which Moab |
| job submissions are unable to set the proper environment variables. |
| -- Fix problem in parallel make of Slurm. |
| -- Fixed bug in consumable resources when CR_Core_Memory is enabled |
| -- Add delay in slurmctld for "scontrol shutdown" RPC to get propagated |
| to slurmd daemons. |
| |
| * Changes in SLURM 1.2.17 |
| ========================= |
| -- In select/cons_res properly release resources allocated to job being |
| suspended (rmbreak.patch, from Chris Holmes, HP). |
| -- Fix AIX linking problem for PMI (mpich2) support. |
| -- Improve PMI logic for greater scalability (up to 16k tasks run). |
| -- Add srun support for SLURM_THREADS and PMI_FANOUT environment variables. |
| -- Fix support in squeue for output format with left justification of |
| reason (%r) and reason/node_list (%R) output. |
| -- Automatically requeue a batch job when a node allocated to it fails |
| or the prolog fails (unless --no-requeue or --no-kill option used). |
| -- In sched/wiki, enable use of wiki.conf parameter ExcludePartitions to |
| directly schedule selected partitions without Maui control. |
| -- In sched/backfill, if a job requires specific nodes, schedule other jobs |
| ahead of it rather than completely stopping backfill scheduling for that |
| partition. |
| -- BLUEGENE - corrected logic making block allocation work in a circular |
| fashion instead of linear. |
| |
| * Changes in SLURM 1.2.16 |
| ========================= |
| -- Add --overcommit option to the salloc command. |
| -- Run task epilog from job's working directory rather than directory |
| where slurmd daemon started from. |
| -- Log errors running task prolog or task epilog to srun's output. |
| -- In sched/wiki2, fix bug processing condensed hostlist expressions. |
| -- Release contribs/mpich1.slurm.patch without GPL license. |
| -- Fix bug in mvapich plugin for read/write calls that return EAGAIN. |
| -- Don't start MVAPICH timeout logic until we know that srun is starting |
| an MVAPICH program. |
| -- Fix to srun only allocating number of nodes needed for requested task |
| count when combining allocation and step creation in srun. |
| -- Execute task-prolog within proctrack container to insure that all |
| child processes get terminated. |
| -- Fixed job accounting to work with sgi_job proctrack plugin. |
| |
| * Changes in SLURM 1.2.15 |
| ========================= |
| -- In sched/wiki2, fix bug processing hostlist expressions where hosts |
| lack a numeric suffix. |
| -- Fix bug in srun. When user did not specify time limit, it defaulted to |
| INFINITE rather than partition's limit. |
| -- In select/cons_res with SelectTypeParameters=CR_Socket_Memory, fix bug in |
| memory allocation tracking, mem.patch from Chris Holmes, HP. |
| -- Add --overcommit option to the sbatch command. |
| |
| * Changes in SLURM 1.2.14 |
| ========================= |
| -- Fix a couple of bugs in MPICH/MX support (from Asier Roa, BSC). |
| -- Fix perl api for AIX |
| -- Add wiki.conf parameter ExcludePartitions for selected partitions to |
| be directly scheduled by Slurm without Moab control |
| -- Optimize load leveling for shared nodes (alloc.patch, contributed |
| by Chris Holmes, HP). |
| -- Added PMI_TIME environment variable for user to control how PMI |
| communications are spread out in time. See "man srun" for details. |
| -- Added PMI timing information to srun debug mode to aid in tuning. |
| Use "srun -vv ..." to see the information. |
| -- Added checkpoint/ompi (OpenMPI) plugin (still under development). |
| -- Fix bug in load leveling logic added to v1.2.13 which can cause an |
| infinite loop and hang slurmctld when sharing nodes between jobs. |
| -- Added support for sbatch to read in #PBS options from a script |
| |
| * Changes in SLURM 1.2.13 |
| ========================= |
| -- Add slurm.conf parameter JobFileAppend. |
| -- Fix for segv in "scontrol listpids" on nodes not in SLURM config. |
| -- Add support for SCANCEL_CTLD env var. |
| -- In mpi/mvapich plugin, add startup timeout logic. Time based upon |
| SLURM_MVAPICH_TIMEOUT (value in seconds). |
| -- Fixed pick_step_node logic to only pick the number of nodes requested |
| from the user when excluding nodes, to avoid an error message. |
| -- Disable salloc, sbatch and srun -I/--immediate options with |
| Moab scheduler. |
| -- Added "contribs" directory with a Perl API and Torque wrappers for Torque |
| to SLURM migration. This directory should be used to put anything that |
| is outside of SLURM proper such as a different API. Perl APIs contributed |
| by Hongjia Cao (NUDT). |
| -- In sched/wiki2: add support for tasklist with node name expressions |
| and task counts (e.g. "TASKLIST=tux[1-4]*2:tux[12-14]*4"). |
| -- In select/cons_res with sched/wiki2: fix bug in task layout logic. |
| -- Removed all curses info from the bluegene plugin putting it into smap |
| where it belongs. |
| -- Add support for job time limit specification formats: min, min:sec, |
| hour:min:sec, and days-hour:min:sec (formerly only supported minutes). |
| Applies to salloc, sbatch, and srun commands. |
| -- Improve scheduling support for exclusive constraint list, nodes can |
| now be in more than one constraint specified exclusively for a job |
| (e.g. "srun -C [rack1|rack2|rack3|rowB] srun") |
| -- Create separate MPICH/MX plugin (split out from MPICH/GM plugin) |
| -- Increase default MessageTimeout (in slurm.conf) from 5 to 10 secs. |
| -- Fix bug in batch job requeue if node zero of allocation fails to respond |
| to task launch request. |
| -- Improve load leveling logic to more evenly distribute the workload |
| (best_load.patch, contributed by Chris Holmes, HP). |
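The time limit entry above lists four accepted formats: min, min:sec, hour:min:sec, and days-hour:min:sec. A minimal sketch of a parser for exactly those four forms is shown below. The function name `parse_time_limit` is hypothetical, returning seconds is a choice made for this sketch, and field ranges are not validated.

```python
def parse_time_limit(spec: str) -> int:
    """Convert a time limit string to seconds.

    Handles only the four formats named in the changelog entry:
    min, min:sec, hour:min:sec, and days-hour:min:sec.
    Illustrative only; this is not SLURM's own parser.
    """
    has_days = "-" in spec
    days = 0
    rest = spec
    if has_days:
        day_part, rest = spec.split("-", 1)
        days = int(day_part)
    fields = [int(f) for f in rest.split(":")]
    if has_days:
        if len(fields) != 3:
            raise ValueError("expected days-hour:min:sec")
        hours, minutes, seconds = fields
    elif len(fields) == 1:
        hours, minutes, seconds = 0, fields[0], 0
    elif len(fields) == 2:
        hours = 0
        minutes, seconds = fields
    elif len(fields) == 3:
        hours, minutes, seconds = fields
    else:
        raise ValueError("unrecognized time limit format: " + spec)
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


if __name__ == "__main__":
    for example in ("30", "30:15", "1:30:00", "2-01:00:00"):
        print(example, parse_time_limit(example))
```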
| |
| * Changes in SLURM 1.2.12 |
| ========================= |
| -- Increase maximum message size from 1MB to 16MB (from Ernest Artiaga, BSC). |
| -- In PMI_Abort(), log the event and abort the entire job step. |
| -- Add support for additional PMI functions: PMI_Get_clique_ranks and |
| PMI_Get_clique_size (from Chuck Clouston, Bull). |
| -- Report an error when a hostlist comes in appearing to be a box but not |
| formatted in XYZxXYZ format. |
| -- Add support for partition configuration "Shared=exclusive". This is |
| equivalent to "srun --exclusive" when select/cons_res is configured. |
| -- In sched/wiki2, report the reason for a node being unavailable for the |
| GETNODES command using the CAT="<reason>" field. |
| -- In sched/wiki2 with select/linear, duplicate hostnames in HOSTLIST, one |
| per allocated processor. |
| -- Fix bug in scancel with specific signal and job lacks active steps. |
| -- In sched/wiki2, add support for NOTIFYJOB ARG=<jobid> MSG=<message>. |
| This sends a message to an active srun command. |
| -- salloc will now set SLURM_NPROCS to improve srun's behavior under salloc. |
| -- In sched/wiki2 and select/cons_res: insure that Slurm's CPU allocation |
| is identical to Moab's (from Ernest Artiaga and Asier Roa, BSC). |
| -- Added "scontrol show slurmd" command to status local slurmd daemon. |
| -- Set node DOWN if prolog fails on node zero of batch job launch. |
| -- Properly handle "srun --cpus-per-task" within a job allocation when |
| SLURM_TASKS_PER_NODE environment variable is not set. |
| -- Fixed return of slurm_send_rc_msg: if msg->conn_fd is < 0, set errno to |
| ENOTCONN and return SLURM_ERROR instead of returning ENOTCONN. |
| -- Added read before we send anything down a socket to make sure the socket |
| is still there. |
| -- Add slurm.conf variables UnkillableStepProgram and UnkillableStepTimeout. |
| -- Enable nice file propagation from sbatch command. |
| |
| * Changes in SLURM 1.2.11 |
| ========================= |
| -- Updated "etc/mpich1.slurm.patch" for direct srun launch of MPICH1_P4 |
| tasks. See the "README" portion of the patch for details. |
| -- Added new scontrol command "show hostlist <hostnames>" to translate a list |
| of hostnames into a hostlist expression (e.g. "tux1,tux2" -> "tux[1-2]") |
| and "show hostnames <list>", returns a list of nodes (one node per line) |
| from SLURM hostlist expression or from SLURM_NODELIST environment variable |
| if no hostlist specified. |
| -- Add the sbatch option "--wrap". |
| -- Add the sbatch option "--get-user-env". |
| -- Added support for mpich-mx (use the mpichgm plugin). |
| -- Make job's stdout and stderr file access rights be based upon user's umask |
| at job submit time. |
| -- Add support for additional PMI functions: PMI_Parse_option, |
| PMI_Args_to_keyval, PMI_Free_keyvals and PMI_Get_options (from Puenlap Lee |
| and Nancy Kritkausky, Bull). |
| -- Make default value of SchedulerPort (configuration parameter) be 7321. |
| -- Use SLURM_UMASK environment variable (if set) at job submit time as umask |
| for spawned job. |
| -- Correct some format issues in the man pages (from Gennero Oliva, ICAR). |
| -- Added support for parallel make across an existing SLURM allocation |
| based upon GNU make-3.81. Patch is in "etc/make.slurm.patch". |
| -- Added '-b' option to sbatch for easy MOAB transition to sbatch instead of |
| srun. Option does nothing in sbatch. |
| -- Changed wiki2's handling of a node state in Completing to return 'busy' |
| instead of 'running' which matches slurm version 1.1 |
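The "show hostlist" and "show hostnames" entry above describes translating between a list of hostnames and a hostlist expression (e.g. "tux1,tux2" -> "tux[1-2]"). A minimal sketch of both directions is shown below. It assumes a single common prefix followed by a numeric suffix and ignores zero padding; the function names are hypothetical and this is not the algorithm scontrol uses.

```python
import re


def compress_hostlist(names: str) -> str:
    """Turn 'tux1,tux2' into 'tux[1-2]' (single prefix, no zero padding)."""
    parsed = [re.fullmatch(r"(.*?)(\d+)", h) for h in names.split(",")]
    if not all(parsed):
        return names  # some host lacks a numeric suffix; leave unchanged
    prefixes = {m.group(1) for m in parsed}
    if len(prefixes) != 1:
        return names  # mixed prefixes are out of scope for this sketch
    prefix = prefixes.pop()
    nums = sorted({int(m.group(2)) for m in parsed})
    if len(nums) == 1:
        return f"{prefix}{nums[0]}"
    # Collapse consecutive numbers into (start, end) ranges.
    ranges = []
    start = prev = nums[0]
    for n in nums[1:]:
        if n != prev + 1:
            ranges.append((start, prev))
            start = n
        prev = n
    ranges.append((start, prev))
    body = ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)
    return f"{prefix}[{body}]"


def expand_hostlist(expr: str) -> list:
    """Turn 'tux[1-3,7]' into ['tux1', 'tux2', 'tux3', 'tux7']."""
    m = re.fullmatch(r"([^\[\],]+)\[([0-9,\-]+)\]", expr)
    if not m:
        return expr.split(",")  # already a plain comma-separated list
    prefix, ranges = m.groups()
    hosts = []
    for part in ranges.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo
        hosts.extend(f"{prefix}{n}" for n in range(int(lo), int(hi) + 1))
    return hosts


if __name__ == "__main__":
    print(compress_hostlist("tux1,tux2"))
    print("\n".join(expand_hostlist("tux[1-2]")))
```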
| |
| * Changes in SLURM 1.2.10 |
| ========================= |
| -- Fix race condition in jobacct/linux with use of proctrack/pgid and a |
| realloc issue inside proctrack/linux |
| -- Added MPICH1_P4 plugin for direct launch of mpich1/p4 tasks using srun |
| and a patched version of the mpi library. See "etc/mpich1.slurm.patch". |
| NOTE: This is still under development and not ready for production use. |
| |
| * Changes in SLURM 1.2.9 |
| ======================== |
| -- Add new sinfo sort field "%E", which sorts by the time associated with a |
| node's state (from Prashanth Tamraparni, HP). |
| -- In sched/wiki: fix logic for restarting backup slurmctld. |
| -- Preload SLURM plugins early in the slurmstepd operation to avoid |
| multiple dlopens after forking (and to avoid a glibc bug |
| that leaves dlopen locks in a bad state after a fork). |
| -- Added MPICH1_P4 patch to launch tasks using srun rather than rsh and |
| automatically generate mpirun's machinefile based upon the job's |
| allocation. See "etc/mpich1.slurm.patch". |
| -- BLUEGENE - fix for overlap mode to mark all other base partitions as used |
| when creating a new block from the file to insure we only use the base |
| partitions we are asking for. |
| |
| * Changes in SLURM 1.2.8 |
| ======================== |
| -- Added mpi/mpich1_shmem plugin. |
| -- Fix in proctrack/sgi_job plugin that could cause slurmstepd to seg_fault |
| preventing timely clean-up of batch jobs in some cases. |
| |
| * Changes in SLURM 1.2.7 |
| ======================== |
| -- BLUEGENE - code to make it so you can make a 36x36x36 system. |
| The wiring should be correct for a system with x-dim of 1,2,4,5,8,13 |
| in emulation mode. It will work with any real system no matter the size. |
| -- Major re-write of jobcomp/script plugin: fix memory leak and |
| general code clean-up. |
| -- Add ability to change MaxNodes and ExcNodeList for pending job |
| using scontrol. |
| -- Purge zombie processes spawned via event triggers. |
| -- Add support for power saving mode (experimental code to reduce voltage |
| and frequency on nodes that stay in the IDLE state, for more information |
| see http://www.llnl.gov/linux/slurm/power_save.html). None of this |
| code is enabled by default. |
| |
| * Changes in SLURM 1.2.6 |
| ======================== |
| -- Fix MPIRUN_PORT env variable in mvapich plugin |
| -- Disable setting triggers by other than user SlurmUser unless SlurmUser |
| is root for improved security. |
| -- Add event trigger for IDLE nodes. |
| |
| * Changes in SLURM 1.2.5 |
| ======================== |
| -- Fix nodelist truncation in "scontrol show jobs" output |
| -- In mpi/mpichgm, fix potential problem formatting GMPI_PORT, from |
| Ernest Artiaga, BSC. |
| -- In sched/wiki2 - Report job's account, from Ernest Artiaga, BSC. |
| -- Add sbatch option "--ntasks-per-node". |
| |
| * Changes in SLURM 1.2.4 |
| ======================== |
| -- In select/cons_res - fix for function argument type mis-match in getting |
| CPU count for a job, from Ernest Artiaga, BSC. |
| -- In sched/wiki2 - Report job's tasks_per_node requirement. |
| -- In forward logic, fix to check whether the forwarding node receives a |
| connection but never gets the message from the sender (e.g. due to a network |
| issue). Also verify that any response accounts for everything that was sent |
| out before treating the forward as complete. |
| -- Another fix to make sure steps with requested nodes have correct cpus |
| accounted for and a fix to make sure the user can't allocate more |
| cpus than they have requested. |
| |
| * Changes in SLURM 1.2.3 |
| ======================== |
| -- Cpuset logic added to task/affinity, from Don Albert (Bull) and |
| Moe Jette (LLNL). The /dev/cpuset file system must be mounted and |
| set "TaskPluginParam=cpusets" in slurm.conf to enable. |
| -- In sched/wiki2, fix possible overflow in job's nodelist, from |
| Ernest Artiaga, BSC. |
| -- Defer creation of new job steps until a suspended job is resumed. |
| -- In select/linear - fix for potential stack corruption bug. |
| |
| * Changes in SLURM 1.2.2 |
| ======================== |
| -- Added new command "strigger" for event trigger management, a new |
| capability. See "man strigger" for details. |
| -- srun --get-user-env now sends su's stderr to /dev/null |
| -- Fix in node_scheduling logic with multiple node_sets, from |
| Ernest Artiaga, BSC. |
| -- In select/cons_res, fix for function argument type mis-match in getting |
| CPU count for a job. |
| |
| * Changes in SLURM 1.2.1 |
| ======================== |
| -- MPICHGM support bug fixes from Ernest Artiaga, BSC. |
| -- Support longer hostlist strings, from Ernest Artiaga, BSC. |
| |
| * Changes in SLURM 1.2.0 |
| ======================== |
| -- Srun to use env vars for SLURM_PROLOG, SLURM_EPILOG, SLURM_TASK_PROLOG, |
| and SLURM_TASK_EPILOG. patch.1.2.0-pre11.070201.envproepilog from |
| Dan Palermo, HP. |
| -- Documentation update. patch.1.2.0-pre11.070201.mchtml from Dan Palermo, HP. |
| -- Set SLURM_DIST_CYCLIC = 1 (needed for HP MPI, slurm.hp.env.patch). |
| |
| * Changes in SLURM 1.2.0-pre15 |
| ============================== |
| -- Fix for another spot where the backup controller calls switch/federation |
| code before switch/federation is initialized. |
| |
| * Changes in SLURM 1.2.0-pre14 |
| ============================== |
| -- In sched/wiki2, clear required nodes list when a job is requeued. |
| Note that the required node list is set to every node used when |
| a job is started via sched/wiki2. |
| -- BLUEGENE - Added display of deallocating blocks to smap and other tools. |
| -- Make slurmctld's working directory be same as SlurmctldLogFile (if any), |
| otherwise StateSaveDir (which is likely a shared directory, possibly |
| making core file identification more difficult). |
| -- Fix bug in switch/federation that results in the backup controller |
| aborting if it receives an epilog-complete message. |
| |
| * Changes in SLURM 1.2.0-pre13 |
| ============================== |
| -- Fix for --get-user-env. |
| |
| * Changes in SLURM 1.2.0-pre12 |
| ============================== |
| -- BLUEGENE - Added correct node info for sinfo and sview for viewing |
| allocated nodes in a partition. |
| -- BLUEGENE - Added state save on slurmctld shutdown of blocks in an error |
| state on real systems and total block config on emulation systems. |
| -- Major update to Slurm's PMI internal logic for better scalability. |
| Communications now supported directly between application tasks via |
| Slurm's PMI library. Srun sends single message to one task on each node |
| and that task forwards key-pairs to other tasks on that node. The old |
| code sent key-pairs directly to each task. |
| NOTE: PMI applications must re-link with this new library. |
| -- For multi-core support: Fix task distribution bug and add automated |
| tests, patch.1.2.0-pre11.070111.plane from Dan Palermo (HP). |
| |
| * Changes in SLURM 1.2.0-pre11 |
| ============================== |
| -- Add multi-core options to slurm_step_launch API. |
| -- Add man pages for slurm_step_launch() and related functions. |
| -- Jobacct plugin only looks at the proctrack list instead of the entire |
| list of processes running on the node. Cutting down a lot of unnecessary |
| file opens in linux and cutting down the time to query the procs by |
| more than half. |
| -- Multi-core bug fix, mask re-use with multiple job steps, |
| patch.1.2.0-pre10.061214.affinity_stepid from Dan Palermo (HP). |
| -- Modify jobacct/linux plugin to completely eliminate open /proc files. |
| -- Added slurm_sched_plugin_reconfig() function to re-read config files. |
| -- BLUEGENE - --reboot option to srun, salloc, and sbatch actually works. |
| -- Modified step context and step launch APIs. |
| |
| * Changes in SLURM 1.2.0-pre10 |
| ============================== |
| -- Fix for sinfo node state counts by state (%A and %F output options). |
| -- Add ability to change a node's features via "scontrol update". NOTE: |
| Update slurm.conf also to preserve changes over slurmctld restart or |
| reconfig. |
| NOTE: Job and node state information can not be preserved from earlier |
| versions. |
| -- Added new slurm.conf parameter TaskPluginParam. |
| -- Fix for job requeue and credential revoke logic from Hongjia Cao (NUDT). |
| -- Fix for incorrectly generated masks for task/affinity plugin, |
| patch.1.2.0-pre9.061207.bitfmthex from Dan Palermo (HP). |
| -- Make mask_cpu options of srun and slaunch commands not require prefix |
| of "0x". patch.1.2.0-pre9.061208.srun_maskparse from Dan Palermo (HP). |
| -- Add -c support to the -B automatic mask generation for multi-core |
| support, patch.1.2.0-pre9.061208.mcore_cpuspertask from Dan Palermo (HP). |
| -- Fix bug in MASK_CPU calculation, |
| patch.1.2.0-pre9.061211.avail_cpuspertask from Dan Palermo (HP). |
| -- BLUEGENE - Added --reboot option to srun, salloc, and sbatch commands. |
| -- Add "scontrol listpids [JOBID[.STEPID]]" support. |
| -- Multi-core support patches, fixed SEGV and clean up output for large |
| task counts, patch.1.2.0-pre9.061212.cpubind_verbose from Dan Palermo (HP). |
| -- Make sure jobacct plugin files are closed before exec of user tasks to |
| prevent problems with job checkpoint/restart (based on work by |
| Hongjia Cao, NUDT). |
| |
| * Changes in SLURM 1.2.0-pre9 |
| ============================= |
| -- Fix for select/cons_res state preservation over slurmctld restart, |
| patch.1.2.0-pre7.061130.cr_state from Dan Palermo. |
| -- Validate product of socket*core*thread count on node registration rather |
| than individual values. Correct values will need to be specified in slurm.conf |
| with FastSchedule=1 for correct multi-core scheduling behavior. |
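
The registration check described above can be sketched as follows. This is a minimal illustration, not slurmd or slurmctld code: the function name, the dictionary layout, and the choice to accept a node that reports at least as many processors as configured are all assumptions made for the example.

```python
def registration_ok(configured, reported):
    """Validate the socket*core*thread product rather than each value.

    Both arguments are hypothetical dicts with 'sockets', 'cores' and
    'threads' keys. Accepting reported >= configured is an assumption.
    """
    def total(node):
        return node["sockets"] * node["cores"] * node["threads"]

    return total(reported) >= total(configured)


# A node configured as 2 sockets x 2 cores still validates when it reports
# 1 socket x 4 cores, because only the product (4) is compared.
configured = {"sockets": 2, "cores": 2, "threads": 1}
reported = {"sockets": 1, "cores": 4, "threads": 1}
print(registration_ok(configured, reported))
```

Because only the product is compared, a mismatch in the individual values goes unnoticed, which is why the entry asks for correct values in slurm.conf.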
| |
| * Changes in SLURM 1.2.0-pre8 |
| ============================= |
| -- Modify job state "reason" field to report why a job failed (previously |
| reported only reason waiting to run). Requires cold-start of |
| slurmctld (-c option). |
| -- For sched/wiki2 job state request, return REJMESSAGE= with reason for |
| a job's failure. |
| -- New FastSchedule configuration parameter option "2" means to base |
| scheduling decisions upon the node's configuration as specified in |
| slurm.conf and ignore the node's actual hardware configuration. This |
| can be useful for testing. |
| -- Add sinfo output format option "%C" for CPUs (active/idle/other/total). |
| Based upon work by Anne-Marie Wunderlin (BULL). |
| -- Assorted multi-core bug fixes (patch1.2.0-pre7.061128.mcorefixes). |
| -- Report SelectTypeParameters from "scontrol show config". |
| -- Build sched/wiki plugin for Maui Scheduler (based upon new sched/wiki2 |
| code for Moab Scheduler). |
| -- BLUEGENE - changed way of keeping track of smaller partitions using |
| ionode range instead of quarter nodecard notation. |
| (i.e. bgl000[0-3] instead of bgl000.0.0) |
| -- Patch from Hongjia Cao (EINPROGRESS error message change) |
| -- Fix for correct requid for jobacct plugin |
| -- Added subsec timing display for sacct |
| |
| * Changes in SLURM 1.2.0-pre7 |
| ============================= |
| -- BLUEGENE - added configurable images for bluegene block creation. |
| -- Plug a bunch of memory leaks. |
| -- Support processors, core, and physical IDs that are not in numeric |
| order (in slurmd when gathering node state information, based on patch |
| by Don Albert, Bull). |
| -- Fixed bug with aix not looking in the correct dir for the proctrack |
| include files |
| -- Removed global_srun.* from common; merged it into srun proper |
| -- Added bluegene section to troubleshooting guide (web page). |
| -- NOTE: Requires cold-start when moving from 1.2.0-pre6, save state |
| info for jobs changed. |
| -- BLUEGENE - Changed logic for wiring bgl blocks to be more maintainable. |
| (Haven't tested on large system yet, works on 2 base partition system) |
| -- Do not read the select/cons_res state save file if slurmctld is |
| cold-started (with the "-c" option). |
| |
| * Changes in SLURM 1.2.0-pre6 |
| ============================= |
| -- Maintain actual job step run time with suspend/resume use. |
| -- Allow slurm.conf options to appear multiple times. SLURM will use the |
| last instance of any particular option. |
| -- Add version number to node state save file. Will not recover node |
| state information on restart from older version. |
| -- Add logic to save/restore multi-core state information. |
| -- Updated multi-core logic to use types uint16_t and uint32_t instead |
| of just type int. |
| -- Race condition for forwarding logic fix from Hongjia Cao |
| -- Add support for Portable Linux Processor Affinity (PLPA, see |
| http://www.open-mpi.org/software/plpa). |
| -- When a job epilog completes on all non-DOWN nodes, immediately purge |
| its job steps that lack switch windows. Needed for LSF operation. |
| Based upon slurm.hp.node_fail.patch. |
| -- Modify srun to ignore entries on --nodelist for job step creation |
| if their count exceeds the task count. Based on slurm.hp.srun.patch. |
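
The "last instance wins" rule for repeated slurm.conf options noted above can be sketched as below. This is only an illustration under stated assumptions: the parser is hypothetical, handles one Key=Value pair per line, treats "#" as a comment marker, and lower-cases keys; it is not SLURM's configuration parser. The option values are made up.

```python
def parse_options(text):
    """Parse 'Key=Value' lines; a later instance of a key replaces an earlier one."""
    options = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Assigning to the same dict key again overwrites the earlier value.
        options[key.strip().lower()] = value.strip()
    return options


sample = """
MessageTimeout=5
MailProg=/bin/mail
MessageTimeout=10
"""
print(parse_options(sample)["messagetimeout"])
```

The second MessageTimeout line overrides the first, so a site-wide default can be followed by a local override later in the same file.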
| |
| * Changes in SLURM 1.2.0-pre5 |
| ============================= |
| -- Patch from HP patch.1.2.0.pre4.061017.crcore_hints, supports cores as |
| consumable resource. |
| |
| * Changes in SLURM 1.2.0-pre4 |
| ============================= |
| -- Added node_inx to job_step_info_t to get the node indices for mapping out |
| steps in a job by nodes. |
| -- sview grid added |
| -- BLUEGENE node_inx added to blocks for reference. |
| -- Automatic CPU_MASK generation for task launch, new srun option -B. |
| -- Automatic logical to physical processor identification and mapping. |
| -- Added new srun options to --cpu_bind: sockets, cores, and threads |
| -- Updated select/cons_res to operate at socket granularity. |
| -- New srun task distribution options to -m: plane |
| -- Multi-core support in sinfo, squeue, and scontrol. |
| -- Memory can be treated as a consumable resource. |
| -- New srun options --ntasks-per-[node|socket|core]. |
| |
| * Changes in SLURM 1.2.0-pre3 |
| ============================= |
| -- Remove configuration parameter SchedulerAuth (defunct). |
| -- Add NextJobId to "scontrol show config" output. |
| -- Add new slurm.conf parameter MailProg. |
| -- New forwarding logic. New receive_msg functions depending on what you |
| are expecting to get back. No srun_node_id anymore passed around in |
| a slurm_msg_t |
| -- Remove sched/wiki plugin (use sched/wiki2 for now) |
| -- Disable pthread_create() for PMI_send when TotalView is running for |
| better performance. |
| -- Fixed certain tests in test suite to not run with bluegene or front-end |
| systems |
| -- Removed addresses from slurm_step_layout_t |
| -- Added new job field, "comment". Set by srun, salloc and sbatch. See |
| with "scontrol show job". Used in sched/wiki2. |
| -- Report a job's exit status in "scontrol show job". |
| -- In sched/wiki2: add support for JOBREQUEUE command. |
| |
| * Changes in SLURM 1.2.0-pre2 |
| ============================= |
| -- Added function slurm_init_slurm_msg to be used to init any slurm_msg_t; |
| you no longer need to do any other type of initialization to the type. |
| |
| * Changes in SLURM 1.2.0-pre2 |
| ============================= |
| -- Fixed task dist to work with hostfile and warn about asking for more tasks |
| than you have nodes for in arbitrary mode. |
| -- Added "account" field to job and step accounting information and sacct output. |
| -- Moved task layout to slurmctld instead of srun. Job step create returns |
| step_layout structure with hostnames and addresses that correspond |
| to those nodes. |
| -- Changed api slurm_lookup_allocation params, |
| resource_allocation_response_msg_t changed to job_alloc_info_response_msg_t |
| this structure is being renamed so contents are the same. |
| -- alter resource_allocation_response_msg_t see slurm.h.in |
| -- remove old_job_alloc_msg_t and function slurm_confirm_alloc |
| -- Slurm configuration files now support an "Include" directive to |
| include other files inline. |
| -- BLUEGENE New --enable-bluegene-emulation configure parameter to allow |
| running system in bluegene emulation mode. Only |
| really useful for developers. |
| -- Added new tool sview, a GUI for displaying slurm info. |
| -- fixed bug in step layout to lay out tasks correctly |
| |
| * Changes in SLURM 1.2.0-pre1 |
| ============================= |
| -- Fix bug that could run a job's prolog more than once |
| -- Permit batch jobs to be requeued, scontrol requeue <jobid> |
| -- Send overcommit flag from srun in RPCs and have slurmd set SLURM_OVERCOMMIT |
| flag at batch job launch time. |
| -- Added new configuration parameter MessageTimeout (replaces #define in |
| the code) |
| -- Added support for OSX build. |
| |
| * Changes in SLURM 1.1.37 |
| ========================= |
| - In sched/wiki2: Add NAME to job record. |
| - Changed -w (--nodelist) option to only read in number of nodes specified |
| by -N option unless nprocs was set and in Arbitrary layout mode. |
| - Added some loops around pthread creates in case they fail and also fixed an |
| issue in srun so that it fails when the job has failed instead of waiting |
| around for threads that will never end. |
| - Added fork handlers in the slurmstepd |
| - In sched/wiki2: fix logic for restarting backup slurmctld. |
| - In sched/wiki2: if job has no time limit specified, return the partition's |
| time limit (which is the default for the job) rather than 365 days. |
| |
| * Changes in SLURM 1.1.36 |
| ========================= |
| - Permit node state specification of DRAIN in slurm.conf. |
| - In jobcomp/script - fix bug that prevented UID and JOBID environment |
| variables from being set. |
| |
| * Changes in SLURM 1.1.35 |
| ========================= |
| - In sched/wiki2: Add support for CMD=SIGNALJOB to accept option |
| of VALUE=SIGXXX in addition to VALUE=# and VALUE=XXX options. |
| - In sched/wiki2: Add support for CMD=MODIFYJOB to accept option of |
| DEPEND=afterany:<jobid>, specify jobid=0 to clear. |
| - Correct logic for job allocation with task count (srun -n ...) AND |
| FastSchedule=0 AND low CPUs count in Slurm's node configuration. |
| - Add new and undocumented scancel option, --ctld, to route signal |
| requests through slurmctld rather than directly to slurmd daemons. |
| Useful for testing purposes. |
| - Fixed issue with hostfile support not working in a job step. |
| - Set supplemental groups for SlurmUser in slurmctld daemon, from |
| Anne Marie Wunderlin, Bull. |
| - In jobcomp/script: Add ACCOUNT and PROCS (count) to environment |
| variables set. Fix bug that prevented UID and JOBID from being |
| overwritten. |
| |
| * Changes in SLURM 1.1.34 |
| ========================= |
| - Ensure that slurm_signal_job_step() is defined in srun for mvapich |
| and mpichgm error conditions. |
| - Modify /etc/init.d/slurm restart command to wait for daemon to terminate |
| before starting a new one |
| - Permit job steps to be started on draining nodes that have already |
| been allocated to that job. |
| - Prevent backup slurmctld from purging pending batch job scripts when a |
| SIGHUP is received. |
| - BLUEGENE - check to make sure set_block_user works when the block |
| is in a ready state. |
| - Fix to slurmstepd to not use local variables in a pthread create. |
| - In sched/wiki2 - add wiki.conf parameter HostFormat specifying |
| format of hostlists exchanged between Slurm and Moab (experimental). |
| - mpi/mvapich: Support Adam Moody's fast MPI initialization protocol |
| (MVAPICH protocol version 8). |
| |
| * Changes in SLURM 1.1.33 |
| ========================= |
| - sched/wiki2 - Do not wait for job completion before permitting |
| additional jobs to be scheduled. |
| - Add srun SLURM_EXCLUSIVE environment variable support, from |
| Gilles Civario (Bull). |
| - sched/wiki2 - Report job's node sharing options. |
| - sched/wiki2 - If SchedulerPort is in use, retry opening it indefinitely. |
| - sched/wiki2 - Add support for changing the size of a pending job. |
| - BLUEGENE - Fix to correctly look at downed/drained nodes when picking |
| a block to run a job and not confuse it with another running job. |
| |
| * Changes in SLURM 1.1.32 |
| ========================= |
| - If a job's stdout/err file names are unusable (bad path), use the |
| default names. |
| - sched/wiki2 - Fix logic to be compatible with select/cons_res plugin |
| for allocating individual processors within nodes. |
| - Fix job end time calculation when changed from an initial value of |
| INFINITE. |
| |
| * Changes in SLURM 1.1.31 |
| ========================= |
| - Correctly identify a user's login shell when running "srun -b --uid" |
| as root. Use the --uid field for the /etc/passwd lookup instead of |
| getuid(). |
| |
| * Changes in SLURM 1.1.30 |
| ========================= |
| - Fix to make sure users don't include and exclude the same node in |
| their srun line. |
| - mpi/mvapich: Forcibly terminate job 60s after first MPI_Abort() |
| to avoid waiting indefinitely for hung processes. |
| - proctrack/sgi_job: Fix segv when destroying an active job container |
| with processes still running. |
| - Abort a job's stdout/err to srun if not processed within 5 minutes |
| (prevents node hanging in completing state if the srun is stopped). |
| |
| * Changes in SLURM 1.1.29 |
| ========================= |
| - Fix bug which could leave orphan process put into background from |
| batch script. |
| |
| * Changes in SLURM 1.1.28 |
| ========================= |
| - BLUEGENE - Fixed issue where nodes that return to service outside of an |
| admin state were not updated in the bluegene plugin. |
| - Fix for --get-user-env parsing of non-printing characters in users' logins. |
| - Restore "squeue -n localhost" support. |
| - Report lack of PATH env var as verbose message, not error in srun. |
| |
| * Changes in SLURM 1.1.27 |
| ========================= |
| - Fix possible race condition for two simultaneous "scontrol show config" |
| calls resulting in slurm_xfree() Error: from read_config.c:642 |
| - BLUEGENE - Put back logic to make a block fail a boot 3 times before |
| cancelling a user's job. |
| - Fix problem using srun --exclude option for a job step. |
| - Fix problem generating slurmd error "Unrecognized request: 0" with |
| some compilers. |
| |
| * Changes in SLURM 1.1.26 |
| ========================= |
| - In sched/wiki2, fixes for support of job features. |
| - In sched/wiki2, add "FLAGS=INTERACTIVE;" to GETJOBS response for |
| non-batch (not srun --batch) jobs. |
| |
| * Changes in SLURM 1.1.25 |
| ========================= |
| - switch/elan: Fix for "Failed to initialise stats structure" from |
| libelan when ELAN_STATKEY > MAX_INT. |
| - Tune PMI support logic for better scalability and performance. |
| - Fix for running a task on each node of an allocation if not specified. |
| - In sched/wiki2, set TASKLIST for running jobs. |
| - In sched/wiki2, set STARTDATE for pending jobs with deferred start. |
| - Added srun --get-user-env option (for Moab scheduler). |
| |
| * Changes in SLURM 1.1.24 |
| ========================= |
| - In sched/wiki2, add support for direct "srun --dependency=" use. |
| - mpi/mvapich: Add support for MVAPICH protocol version 6. |
| - In sched/wiki2, change "JOBMODIFY" command to "MODIFYJOB". |
| - In sched/wiki2, change "JOBREQUEUE" command to "REQUEUEJOB". |
| - For sched/wiki2, permit normal user to specify arbitrary job id. |
| - In sched/wiki2, set buffer pointer to NULL after free() to avoid |
| possible memory corruption. |
| - In sched/wiki2, report a job's exit code on completion. |
| - For AIX, fix mail for job event notification. |
| - Add documentation for propagation options in man srun and slurm.conf. |
| |
| * Changes in SLURM 1.1.23 |
| ========================= |
| - Fix bug in non-blocking connect() code affecting AIX. |
| |
| * Changes in SLURM 1.1.22 |
| ========================= |
| - Add squeue option to print a job step's task count (-o %A). |
| - Initialize forward_struct to avoid trying to free a bad pointer, |
| patch from Anton Blanchard (SAMBA). |
| - In sched/wiki2, fix fatal race condition on slurmctld startup. |
| - Fix for displaying launching verbose messages for each node under the |
| tree instead of just the head one. |
| - Fix job suspend bug, job accounting plugin would SEGV when given a |
| bad job ID. |
| |
| * Changes in SLURM 1.1.21 |
| ========================= |
| - BLUEGENE - Wait on a fini to make sure all threads are finished before |
| cleaning up. |
| - BLUEGENE - replacements to not destroy lists but just empty it to avoid |
| losing the pointer to the list in the block allocator. |
| - BLUEGENE - added --enable-bluegene-emulation configure option to 1.1 |
| - In sched/wiki2, enclose a job's COMMENT value in double quotes. |
| - In sched/wiki2, support newly defined SIGNALJOB command. |
| - In sched/wiki2, maintain open event socket, don't open and close |
| for each event. |
| - In sched/wiki2, fix for scalability problem starting large jobs. |
| - Fix logic to execute a batch job step (under an existing resource |
| allocation) as needed by LSF. |
| - Patches from Hongjia Cao (pmi finalize issues and type declaration) |
| - Delete pending job if its associated partition is deleted. |
| - fix for handling batch steps completing correctly and setting the |
| return code. |
| - Altered ncurses check to make sure programs can link before saying we |
| have a working curses lib and header. |
| - Fixed an init issue with forward_struct_init not being set correctly in |
| a few locations in the slurmd. |
| - Fix for user to use the NodeHostname (when specified in the slurm.conf file) |
| to start jobs on. |
| |
| * Changes in SLURM 1.1.20 |
| ========================= |
| - Added new SPANK plugin hook slurm_spank_local_user_init() called |
| from srun after node allocation. |
| - Fixed bug with hostfile support not working on a direct srun |
| |
| * Changes in SLURM 1.1.19 |
| ========================= |
| - BLUEGENE - make sure the order of blocks read in from the bluegene.conf |
| are created in that order (static mode). |
| - Fix logic in connect(), slurmctld fail-over was broken in v1.1.18. |
| - Fix logic to calculate the correct timeout for fan out. |
| |
| * Changes in SLURM 1.1.18 |
| ========================= |
| - In sched/wiki2, add support for EHost and EHostBackup configuration |
| parameters in wiki.conf file |
| - In sched/wiki2, fix memory management bug for JOBWILLRUN command. |
| - In sched/wiki2, consider job Busy while in Completing state for |
| KillWait+10 seconds (used to be 30 seconds). |
| - BLUEGENE - Fixes to allow full block creation on the system and not to add |
| passthrough nodes to the allocation when creating a block. |
| - BLUEGENE - Fix deadlock issue with starting and failing jobs at the same |
| time |
| - Make connect() non-blocking and poll() with timeout to avoid huge |
| waits under some conditions. |
| - Set "ENVIRONMENT=BATCH" environment variable for "srun --batch" jobs only. |
| - Add logic to save/restore select/cons_res state information. |
| - BLUEGENE - make all sprintf's into snprintf's |
| - Fix for "srun -A" segfault on a node failure. |
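
The non-blocking connect() with a poll() timeout mentioned in the list above follows a common socket pattern. The sketch below shows that pattern in Python for illustration only; SLURM itself is written in C, and the function name and parameters here are hypothetical rather than taken from the SLURM source.

```python
import errno
import os
import select
import socket


def connect_with_timeout(host, port, timeout_s):
    """Start a non-blocking connect, then poll() for completion with a timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    rc = sock.connect_ex((host, port))  # returns at once, usually EINPROGRESS
    if rc not in (0, errno.EINPROGRESS):
        sock.close()
        raise OSError(rc, os.strerror(rc))

    poller = select.poll()
    poller.register(sock, select.POLLOUT)  # writable means the connect finished
    if not poller.poll(int(timeout_s * 1000)):  # poll() takes milliseconds
        sock.close()
        raise TimeoutError("connect timed out")

    # A finished connect may still have failed; SO_ERROR holds the result.
    err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err:
        sock.close()
        raise OSError(err, os.strerror(err))

    sock.setblocking(True)
    return sock
```

A caller would invoke, for example, connect_with_timeout("127.0.0.1", port, 2.0) and get either a connected socket or an exception within roughly the requested time, instead of waiting for the kernel's much longer default connect timeout.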
| |
| * Changes in SLURM 1.1.17 |
| ========================= |
| - BLUEGENE - fix to make dynamic partitioning not go create block where |
| there are nodes that are down or draining. |
| - Fix srun's default node count with an existing allocation when neither |
| SLURM_NNODES nor -N are set. |
| - Stop srun from setting SLURM_DISTRIBUTION under job steps when a |
| specific distribution was not explicitly requested by the user. |
| |
| * Changes in SLURM 1.1.16 |
| ========================= |
| - BLUEGENE - fix to make prolog run 5 minutes longer to make sure we have |
| enough time to free the overlapping blocks when starting a new job on a |
| block. |
| - BLUEGENE - edit to the libsched_if.so to read env and look at |
| MPIRUN_PARTITION to see if we are in slurm or running mpirun natively. |
| - Plugins are now dlopened RTLD_LAZY instead of RTLD_NOW. |
| |
| * Changes in SLURM 1.1.15 |
| ========================= |
| - BLUEGENE - fix to be able to create static partitions |
| - Fixed fanout timeout logic. |
| - Fix for slurmctld timeout on outgoing message (Hongjia Cao, NUDT.edu.cn). |
| |
| * Changes in SLURM 1.1.14 |
| ========================= |
| - In sched/wiki2: report job/node id and state only if no changes since |
| time specified in request. |
| - In sched/wiki2: include a job's exit code in job state information. |
| - In sched/wiki2: add event notification logic on job submit and completion. |
| - In sched/wiki2: add support for JOBWILLRUN command type. |
| - In sched/wiki2: for job info, include required HOSTLIST if applicable. |
| - In sched/wiki2: for job info, replace PARTITIONMASK with RCLASS (report |
| partition name associated with a job, but no task count) |
| - In sched/wiki2: for job and node info, report all data if TS==0, |
| volatile data if TS<=update_time, state only if TS>update_time |
| - In sched/wiki2: add support for CMD=JOBSIGNAL ARG=jobid SIGNAL=name or # |
| - In sched/wiki2: add support for CMD=JOBMODIFY ARG=jobid [BANK=name] |
| [TIMELIMIT=minutes] [PARTITION=name] |
| - In sched/wiki2: add support for CMD=INITIALIZE ARG=[USEHOSTEXP=T|F] |
| [EPORT=#]; RESPONSE=EPORT=# USEHOSTEXP=T |
| - In sched/wiki2: fix memory leak. |
| - Fix sinfo node state filtering when asking for idle nodes that are also |
| draining. |
| - Add Fortran extension to slurm_get_rem_time() API. |
| - Fix bug when changing the time limit of a running job that has previously |
| been suspended (formerly failed to account for suspend time in setting |
| termination time). |
| - fix for step allocation to be able to specify only a few nodes in a |
| step and ask for more than specified. |
| - patch from Hongjia Cao for forwarding logic |
| - BLUEGENE - able to allocate specific nodes without locking up. |
| - BLUEGENE - better tracking of blocks that are created dynamically, |
| less hitting the db2. |
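
The timestamp rule for sched/wiki2 job and node info listed above (all data if TS==0, volatile data if TS<=update_time, state only if TS>update_time) can be written as a small decision function. The function name and the returned labels are hypothetical and chosen only for this sketch; they are not identifiers from the plugin.

```python
def report_level(ts, update_time):
    """Choose how much data to report for a request carrying timestamp ts."""
    if ts == 0:
        return "all"        # no timestamp given: report everything
    if ts <= update_time:
        return "volatile"   # record changed at or after ts: report volatile data
    return "state"          # nothing changed since ts: report state only


print(report_level(0, 100), report_level(50, 100), report_level(150, 100))
```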
| |
| * Changes in SLURM 1.1.13 |
| ========================= |
| - Fix hang in sched/wiki2 if Moab stops responding when |
| response is outgoing. |
| - BLUEGENE - fix to make sure the block is good to go when picking it |
| - BLUEGENE - add libsched_if.so so mpirun doesn't try to create a block |
| by itself. |
| - Enable specification of srun --jobid=# option with --batch (for user root). |
| - Verify that job actually starts when requested by sched/wiki2. |
| - Add new wiki.conf parameters: EPort and JobAggregationTime for event |
| notification logic (see wiki.conf man page for details) |
| |
| * Changes in SLURM 1.1.12 |
| ========================= |
| - Sched/wiki2 to report a job's account as COMMENT response to GETJOBS |
| request. |
| - Add srun option "--comment" (maps to job account until slurm v1.2, |
| needed for Moab scheduler functionality). |
| - fixed some timeout issues in the controller hopefully stopping all the |
| issues with excessive timeouts. |
| - unit conversion (i.e. 1024 => 1k) only happens on bgl systems for node |
| count. |
| - Sched/wiki2 to report a job's COMPLETETIME and SUSPENDTIME in GETJOBS |
| response. |
| - Added support for Mellanox's version of mvapich-0.9.7. |
| |
| * Changes in SLURM 1.1.11 |
| ========================= |
| - Update file headers adding permission to link with OpenSSL. |
| - Enable sched/wiki2 message authentication. |
| - Fix libpmi compilation issue. |
| - Remove "gcc-c++ python" from slurm.spec BuildRequires. It breaks |
| the AIX build, so we'll have to find another way to deal with that. |
| |
| * Changes in SLURM 1.1.10 |
| ========================= |
| -- task distribution fix for steps that are smaller than job allocation. |
| -- BLUEGENE - fix to only send a success when block was created when trying |
| to allocate the block. |
| -- fix so if slurm_send_recv_node_msg fails on the send the auth_cred returned |
| by the resp is NULL. |
| -- Fix switch/federation plugin so backup controller can assume control |
| repeatedly without leaking or corrupting memory. |
| -- Add new error code (for Maui/Moab scheduler): ESLURM_JOB_HELD |
| -- Tweak slurmctld's node ping logic to better handle failed nodes with |
| hierarchical communications fail-over logic. |
| -- Add support for sched/wiki specific configuration file "wiki.conf". |
| -- Added sched/wiki2 plugin (new experimental wiki plugin). |
| |
| * Changes in SLURM 1.1.9 |
| ======================== |
| -- BLUEGENE - fix to handle a NO_VAL sent in as num procs in the job |
| description. |
| -- Fix bug in slurmstepd code for parsing --multi-prog command script. |
| Parser was failing for commands with no arguments. |
| -- Fix bug to check unsigned ints correctly in bitstring.c |
| -- Alter node count conversion to kilo to only convert numbers divisible by |
| 1024 or 512 |
| |
| * Changes in SLURM 1.1.8 |
| ======================== |
| -- Added bug fixes (fault-tolerance and memory leaks) from Hongjia Cao |
| <hjcao@nudt.edu.cn> |
| -- Fixed some potential BLUEGENE issues with the bridge log file not having |
| a mutex around the fclose and fopen. |
| -- BLUEGENE - srun -n procs now registers correctly |
| -- Fixed problem with reattach double allocating step_layout->tids |
| -- BLUEGENE - fix race condition where job is finished before it starts. |
| |
| * Changes in SLURM 1.1.7 |
| ======================== |
| -- BLUEGENE - fixed issue with doing an allocation for nodes since asking |
| for 32,128, or 512 all mean 1 to the controller. |
| -- Add "Include" directive to slurm.conf files. If "Include" is found |
| at the beginning of a line followed by whitespace and then |
| the full path to a file, that file is included inline with the current |
| slurm.conf file. |
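The "Include" behavior described above can be sketched as follows. This is an illustration only, not the SLURM parser: expand_includes and its max_depth guard are hypothetical names, and treating nested includes recursively is an assumption. The sketch also does not verify that the path is absolute.

```python
import re
from pathlib import Path

# "Include" at the beginning of a line, then whitespace, then a path.
INCLUDE_RE = re.compile(r"^Include\s+(\S.*?)\s*$")


def expand_includes(path, depth=0, max_depth=8):
    """Return the lines of a config file with Include lines expanded inline."""
    if depth > max_depth:
        # Hypothetical safety guard against include loops.
        raise RecursionError(f"Include nesting too deep at {path}")
    lines = []
    for line in Path(path).read_text().splitlines():
        match = INCLUDE_RE.match(line)
        if match:
            # Splice the included file in at the position of the directive.
            lines.extend(expand_includes(match.group(1), depth + 1, max_depth))
        else:
            lines.append(line)
    return lines
```

Because the pattern is anchored at the start of the line, an indented or commented "Include" is left untouched.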
| |
| * Changes in SLURM 1.1.6 |
| ======================== |
| -- Improved task layout for relative positions |
| -- Fixed heterogeneous cpu overcommit issue |
| -- Fix bug where srun would hang if it ran on one node and that |
| node's slurmd died |
| -- Fix bug where srun task layout would be bad when min-max node range is |
| specified (e.g. "srun -N1-4 ...") |
| -- Made slurmctld_conf.node_prefix only be set on Bluegene systems. |
| -- Fixed a race condition in the controller to make it so a plugin thread |
| wouldn't be able to access the slurmctld_conf structure before it was |
| filled. |
| |
| * Changes in SLURM 1.1.5 |
| ======================== |
| -- Ignore partition's MaxNodes for SlurmUser and root. |
| -- Fix possible memory corruption with use of PMI_KVS_Create call. |
| -- Fix race condition when multiple PMI_KVS_Barrier calls. |
| -- Fix logic in which slurmctld outgoing RPC requests could get delayed. |
| -- Fix logic for laying out steps without a hostlist. |
| |
| * Changes in SLURM 1.1.4 |
| ======================== |
| -- Improve error handling in hierarchical communications logic. |
| |
| * Changes in SLURM 1.1.3 |
| ======================== |
| -- Fix big-endian bug in the bitstring code which plagued AIX. |
| -- Fix bug in handling srun's --multi-prog option, could go off end of buffer. |
| -- Added support for job step completion (and switch window release) on |
| subset of allocated nodes. |
| -- BLUEGENE - removed configure option --with-bg-link. The bridge is now |
| linked with dlopen, so fake database .so files are no longer needed on the |
| frontend node. |
| -- BLUEGENE - implemented use of rm_get_partition_info instead of |
| ...partitions_info which has made a much better design improving stability. |
| -- Streamline PMI communications and increase timeouts for highly parallel |
| jobs. Improves scalability of PMI. |
| |
| * Changes in SLURM 1.1.2 |
| ======================== |
| -- Fix bug in jobcomp/filetxt plugin to report proper NodeCnt when a job |
| fails due to a node failure. |
| -- Fix Bluegene configure to work with the new 64bit libs. |
| -- Fix bug in controller that causes it to segfault when hit with a malformed |
| message. |
| -- For "srun --attach=X" to another user's job, report an error and exit (it |
| previously just hung). |
| -- BLUEGENE - fix for doing correct small block logic on user error. |
| -- BLUEGENE - Added support in slurmd to create a fake libdb2.so if it |
| doesn't exist so smap won't seg fault |
| -- BLUEGENE - "scontrol show job" reports "MaxProcs=None" and "Start=None" |
| if values are not specified at job submit time |
| -- Add retry logic for PMI communications, may be needed for highly parallel |
| jobs. |
| -- Fix bug in slurmd where variable is used in logging message after freed |
| (slurmstepd rank info). |
| -- Fix bug in "scontrol show daemons": with NodeName=localhost it now |
| displays slurmd as running on the local node. |
| -- Patch from HP for init nodes before init_bitmaps |
| -- ctrl-c killed sruns will result in job state as cancelled instead of |
| completed. |
| -- BLUEGENE - added configure option --with-bg-link to choose dynamic linking |
| or static linking with the bridgeapi. |
| |
| * Changes in SLURM 1.1.1 |
| ======================== |
| -- Fix bug in packing job suspend/resume RPC. |
| -- If a user breaks out of srun before the allocation takes place, mark the |
| job as CANCELLED rather than COMPLETED and change its start and end time |
| to that time. |
| -- Fix bug in PMI support that prevented use of second PMI_Barrier call. |
| This fix is needed for MVAPICH2 use. |
| -- Add "-V" options to slurmctld and slurmd to print version number and exit. |
| -- Fix scalability bug in sbcast. |
| -- Fix bug in cons_res allocation strategy. |
| -- Fix bug in forwarding with mpi |
| -- Fix bug sacct forwarding with stat option |
| -- Added nodeid to sacct stat information |
| -- Cleaned up the way slurm_send_recv_node_msg works; it no longer clears errno. |
| -- Fix error handling bug in the networking code that causes the slurmd to |
| xassert if the server is not running when the slurmd tries to register. |
| |
| * Changes in SLURM 1.1.0 |
| ======================== |
| -- Fix bug that could temporarily make nodes DOWN when they are really |
| responding. |
| -- Fix bug preventing backup slurmctld from responding to PING RPCs. |
| -- Set "CFLAGS=-DISO8601" before configuration to get ISO8601 format |
| times for all SLURM commands. NOTE: This may break Moab, Maui, and/or |
| LSF schedulers. |
| -- Fix for srun -n and -O options when paired with -b. |
| -- Added logic for fanout to failover to forward list if main node is |
| unreachable |
| -- sacct also now keeps track of submitted, started and ending times of jobs |
| -- reinit config file mutex at beginning of slurmstepd to avoid fork issues |
| |
| * Changes in SLURM 1.1.0-pre8 |
| ============================= |
| -- Fix bug in enforcement of partition's MaxNodes limit. |
| -- BLUEGENE - added support for srun -w option also fixed the geometry option |
| for srun. |
| |
| * Changes in SLURM 1.1.0-pre7 |
| ============================= |
| -- Accounting works for aix systems, use jobacct/aix |
| -- Support large (over 2GB) files on 32-bit linux systems |
| -- changed all writes to safe_write in srun |
| -- added $float to globals.example in the testsuite |
| -- Set job's num_proc correctly for jobs that do not have exclusive use |
| of their allocated nodes. |
| -- Change in support for test suite: 'testsuite/expect/globals.example' |
| is now 'testsuite/expect/globals' and you can override variable |
| settings with a new file 'testsuite/expect/globals.local'. |
| -- Job suspend now sends SIGTSTP, sleep(1), sends SIGSTOP for better |
| MPI support. |
| -- Plug a bunch of memory leaks in various places. |
| -- Bluegene - before assigning a job to a block the plugin will check the bps |
| to make sure they aren't in error state. |
| -- Change time format in job completion logging (JobCompType=jobcomp/filetxt) |
| from "MM/DD HH:MM:SS" to "YYYY-MM-DDTHH:MM:SS", conforming with the ISO8601 |
| standard format. |
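The two timestamp layouts named in the last entry above can be compared with a short sketch. The helper names old_stamp and iso_stamp are hypothetical and only show what the two format strings produce for the same instant.

```python
from datetime import datetime


def old_stamp(t: datetime) -> str:
    """Former jobcomp/filetxt layout: MM/DD HH:MM:SS (no year)."""
    return t.strftime("%m/%d %H:%M:%S")


def iso_stamp(t: datetime) -> str:
    """New layout: YYYY-MM-DDTHH:MM:SS, following ISO8601."""
    return t.strftime("%Y-%m-%dT%H:%M:%S")


when = datetime(2006, 5, 9, 14, 3, 7)
print(old_stamp(when))  # 05/09 14:03:07
print(iso_stamp(when))  # 2006-05-09T14:03:07
```

Unlike the old layout, the ISO8601 strings carry the year and sort chronologically as plain text. Tools that parse the job completion log need to expect the new layout.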
| |
| * Changes in SLURM 1.1.0-pre6 |
| ============================= |
| -- Added logic to "stat" a running job with sacct option -S; use -j to |
| specify job.step. |
| -- removed jobacct/bluegene (no real need for this) meaning, I don't think |
| there is a way to gather the data yet. |
| -- Added support for mapping "%h" in configured SlurmdLog to the hostname. |
| -- Add PropagatePrioProcess to control propagation of a user's nice value |
| to spawned tasks (based upon work by Daniel Christians, HP). |
| |
| * Changes in SLURM 1.1.0-pre5 |
| ============================= |
| -- Added step completion RPC logic |
| -- Vastly changed sacct and the jobacct plugin. Read documentation for full |
| details. |
| -- Added jobacct plugin for AIX and BlueGene, they currently don't work, |
| but infrastructure is in place. |
| -- Add support for srun option --ctrl-comm-ifhn to set PMI communications |
| address (Hongjia Cao, National University of Defense Technology). |
| -- Moved safe_read/write to slurm_protocol_defs.h removing multiple copies. |
| -- Remove vestigial functions slurm_allocate_resources_and_run() and |
| slurm_free_resource_allocation_and_run_response_msg(). |
| -- Added support for different executable files and arguments by task based |
| upon a configuration file. See srun's --multi-prog option (based upon |
| work by Hongjia Cao, National University of Defense Technology). |
| -- moved the way forward logic waited for fanout logic mostly eliminating |
| problems with scalability issues. |
| -- changed -l option in sacct to display different params see sacct/sacct.h |
| for details. |
| |
| * Changes in SLURM 1.1.0-pre4 |
| ============================= |
| -- Bluegene specific - Added support to set bluegene block state to |
| free/error via scontrol update BlockName |
| -- Add needed symbol to select/bluegene in order to load plugin. |
| |
| * Changes in SLURM 1.1.0-pre3 |
| ============================= |
| -- Added framework for XCPU job launch support. |
| -- New general configuration file parser and slurm.conf handling code. |
| Allows long lines to be continued on the next line by ending with a "\". |
| Whitespace is allowed between the key and "=", and between the "=" and |
| value. |
| WARNING: A NodeName may now occur only once in a slurm.conf file. |
| If you want to temporarily make nodes DOWN in the slurm.conf, |
| use the new DownNodes keyword (see "man slurm.conf"). |
| -- Gracefully handle request to submit batch job from within an existing |
| batch job. |
| -- Warn user attempting to create a job allocation from within an existing job |
| allocation. |
| -- Add web page description for proctrack plugin. |
| -- Add new function slurm_get_rem_time() for job's time limit. |
| -- JobAcct plugin renamed from "log" to "linux" in preparation for support of |
| new system types. |
| WARNING: "JobAcctType=jobacct/log" is no longer supported. |
| -- Removed vestigial 'bg' names from bluegene plugin and smap |
| -- InactiveLimit parameter is not enforced for RootOnly partitions. |
| -- Update select/cons_res web page (Susanne Balle, HP, |
| cons_res_doc_patch_3_29_06). |
| -- Build a "slurmd.test" along with slurmd. slurmd.test has the path to |
| slurmstepd set allowing it to run unmodified out of the builddir for |
| testing (Mark Grondona). |
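The parsing rules introduced by the new configuration file parser in this release (a trailing "\" continues a line, and whitespace is allowed around "=") can be sketched as below. This is a simplified illustration, not the SLURM parser: parse_conf is a hypothetical name, and quoted values and duplicate NodeName checks are deliberately left out.

```python
import re

# key, optional whitespace, "=", optional whitespace, value without spaces
PAIR_RE = re.compile(r"(\w+)\s*=\s*(\S+)")


def parse_conf(text):
    """Return one dict of key/value pairs per logical configuration line."""
    logical, pending = [], ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            # Drop the backslash and join with the following physical line.
            pending += raw[:-1]
            continue
        logical.append(pending + raw)
        pending = ""
    if pending:
        logical.append(pending)

    result = []
    for line in logical:
        line = line.split("#", 1)[0]  # strip comments
        pairs = dict(PAIR_RE.findall(line))
        if pairs:
            result.append(pairs)
    return result
```

For example, the two physical lines "NodeName=n[1-4] \" and "  Procs = 2" are read as a single record with the keys NodeName and Procs.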
| |
| * Changes in SLURM 1.1.0-pre2 |
| ============================= |
| -- Added "bcast" command to transmit copies of a file to compute nodes |
| with message fanout. |
| -- Bluegene specific - Added support for overlapping partitions and |
| dynamic partitioning. |
| -- Bluegene specific - Added support for nodecard sized blocks. |
| -- Added logic to accept 1k for 1024 and so on for --nodes option of srun. |
| This logic is also applied in display tools such as smap, sinfo, scontrol, |
| and squeue. |
| -- Added bluegene.conf man page. |
| -- Added support for memory affinity, see srun --mem_bind option. |
| |
| * Changes in SLURM 1.1.0-pre1 |
| ============================= |
| -- New --enable-multiple-slurmd configure parameter to allow running |
| more than one copy of slurmd on a node at the same time. Only |
| really useful for developers. |
| -- New communication is now branched on all processes to slurmd's from |
| slurmctld and srun launch command. This is done with a tree type |
| algorithm. Spawn and batch mode work the same as before. New slurm.conf |
| variable TreeWidth=50 is default. This is the number of threads per |
| stop on the tree. |
| -- Configuration parameter HeartBeatInterval is deprecated. Now use half |
| of SlurmdTimeout and SlurmctldTimeout for communications to slurmd and |
| slurmctld daemons respectively. |
| -- Add hash tables for select/cons_res plugin (Susanne Balle, HP, |
| patch_02222006). |
| -- Remove some use of cr_enabled flag in slurmctld job record, use |
| new flag "test_only" in select_g_job_test() instead. |
| |
| * Changes in SLURM 1.0.17 |
| ========================= |
| -- Set correct user groups for task epilogs. |
| -- Add more debugging for tracking slow slurmd job initiations |
| (slurm.hp.replaydebug.patch). |
| |
| * Changes in SLURM 1.0.16 |
| ========================= |
| -- For "srun --attach=X" to another user's job, report an error and exit (it |
| previously just hung). |
| -- Make sure that "scancel -s KILL" terminates the job just like "scancel" |
| including deletion of all job steps (Chris Holmes, HP, slurm,patch). |
| -- Recognize ISO-8859 input to srun as a script (for non-English scripts). |
| -- switch/elan: Fix bug in propagation of ELAN_STATKEY environment variable. |
| -- Fix bug in slurmstepd IO code that can result in it spinning if a |
| certain error occurs. |
| -- Remove nodes from srun's required node list if their count exceeds |
| the number of requested tasks. |
| -- sched/backfill to schedule around jobs that are hung in a completing |
| state. |
| -- Avoid possibly re-running the epilog for a job on slurmctld restart or |
| reconfig by saving and restoring a hostlist of nodes still completing |
| the job. |
| |
| * Changes in SLURM 1.0.15 |
| ========================= |
| -- In srun, reset stdin to blocking mode (if it was originally blocking before |
| we set it to O_NONBLOCK) on exit to avoid trouble with things like running |
| srun under a bash shell in an emacs *shell* buffer. |
| -- Fix srun race condition that occasionally causes segfaults at shutdown |
| -- Fix obscure locking issues in log.c code. |
| -- Explicitly close IO related sockets. If an srun gets "stuck", possibly |
| because of unkillable tasks in its job step, it will not hold many TCP |
| sockets in the CLOSE_WAIT state. |
| -- Increase the SLURM protocol timeout from 5 seconds to 10 seconds. |
| (In 1.2 there will be a slurm.conf parameter for this, rather than having |
| it hardcoded.) |
| |
| * Changes in SLURM 1.0.14 |
| ========================= |
| -- Fix for bad xfree() call in auth/munge which can raise an assert(). |
| -- Fix installed fork handlers for the conf mutex for slurmd and slurmstepd. |
| |
| * Changes in SLURM 1.0.13 |
| ========================= |
| -- Fix for AllowGroups option to work when the /etc/group file doesn't |
| contain all users in group by adding the uids of the names in /etc/passwd |
| that have a gid of that which we are looking for. |
| -- Fix bug in InactiveLimit support that can potentially purge active jobs. |
| NOTE: This is highly unlikely except on very large AIX clusters. |
| -- Fix bug for reiniting the config_lock around the control_file in |
| slurm_protocol_api.c. Logic has changed in 1.1, so no need to merge. |
| |
| * Changes in SLURM 1.0.12 |
| ========================= |
| -- Report node state of DRAIN rather than DOWN if DOWN with DRAIN flag set. |
| -- Initialize job->mail_type to 0 (NONE) for job submission. |
| -- Fix for stalled task stdout/stderr when buffered I/O is used, and |
| a single line exceeds 4096 bytes. |
| -- Memory leak fixes for maui plugin (hjcao@nudt.edu.cn) |
| -- Fix for spinning srun when the terminal to which srun is talking |
| goes away. |
| -- Don't set avail_node_bitmap for DRAINED nodes on slurmctld reconfig |
| (can schedule a job on drained node after reconfig). |
| |
| |
| * Changes in SLURM 1.0.11 |
| ========================= |
| -- Fix for slurmstepd hang when launching a task. (Needed to install |
| list library's atfork handlers). |
| -- Fix memory leak on AIX (and possibly other architectures) due to |
| missing pthread_attr_destroy() calls. |
| -- Fix rare task standard I/O setup bug. When the bug hit, stdin, stdout, |
| or stderr could be an invalid file descriptor. |
| -- General slurmstepd file descriptor cleanup. |
| -- Fix memory leak in job accounting logic (Andy Riebs, HP, memory_leak.patch). |
| |
| * Changes in SLURM 1.0.10 |
| ========================= |
| -- Fix for job accounting logic submitted from Andy Riebs to handle issues |
| with suspending jobs and such. patch file named requeue.patch |
| -- Make select/cons_res interoperate with mpi/lam plugin for task counts. |
| -- Fix race condition where srun could seg-fault due to use of logging functions |
| within pthread after calling log_fini. |
| -- Code changes for clean build with gcc 2.96 (gcc_2_96.patch, Takao Hatazaki, HP). |
| -- Add CacheGroups configuration support in configurator.html (configurator.patch, |
| Takao Hatazaki, HP). |
| -- Fix bug preventing use of mpich-gm plugin (mpichgm.patch, Takao Hatazaki, HP). |
| |
| * Changes in SLURM 1.0.9 |
| ======================== |
| -- Fix job accounting logic to open new log file on slurmctld reconfig. |
| (Andy Riebs, slurm.hp.logfile.patch). |
| -- Fix bug which allows a user to run a batch script on a node not allocated |
| by the slurmctld. |
| -- Fix poe MP_HOSTFILE handling bug on AIX. |
| |
| * Changes in SLURM 1.0.8 |
| ======================== |
| -- Fix to communication between slurmd and slurmstepd to allow for partial |
| reads and writes on their communication pipes. |
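Handling partial reads and writes on a pipe generally means looping until the full byte count has been transferred. The sketch below shows that general technique in Python. It is not the slurmd/slurmstepd code, and read_exact and write_all are hypothetical names.

```python
import os


def read_exact(fd: int, n: int) -> bytes:
    """Read exactly n bytes from fd, tolerating short reads."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:
            # The peer closed the pipe before all data arrived.
            raise EOFError(f"expected {n} bytes, got {n - remaining}")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


def write_all(fd: int, data: bytes) -> None:
    """Write all of data to fd, tolerating short writes."""
    view = memoryview(data)
    while view:
        written = os.write(fd, view)
        view = view[written:]  # keep only the unsent tail
```

A single read() or write() call on a pipe may transfer fewer bytes than requested, so a caller that assumes a complete transfer can lose or misalign message data.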
| |
| * Changes in SLURM 1.0.7 |
| ======================== |
| -- Change in how AuthType=auth/dummy is handled for security testing. |
| -- Fix for bluegene systems to allow full system partitions to stay booted |
| when other jobs are submitted to the queue. |
| |
| * Changes in SLURM 1.0.6 |
| ======================== |
| -- Prevent slurmstepd from crashing when srun attaches to batch job. |
| |
| * Changes in SLURM 1.0.5 |
| ======================== |
| -- Restructure logic for scheduling BlueGene small block jobs. Added |
| "test_only" flag to select_p_job_test() in select plugin. |
| -- Correct squeue "NODELIST" output for BlueGene small block jobs. |
| -- Fix possible deadlock situations on BlueGene plugin on errors. |
| |
| * Changes in SLURM 1.0.4 |
| ======================== |
| -- Release job allocation if step creation fails (especially for BlueGene). |
| -- Fix bug select/bluegene warm start with changed bglblock layout. |
| -- Fix bug for queuing full-system BlueGene jobs. |
| |
| * Changes in SLURM 1.0.3 |
| ======================== |
| -- Fix bug that could refuse to queue batch jobs for BlueGene system. |
| -- Add BlueGene plugin mutex lock for reconfig. |
| -- Ignore BlueGene bgljobs in ERROR state (don't try to kill). |
| -- Fix job accounting for batch jobs (Andy Riebs, HP, |
| slurm.hp.jobacct_divby0a.patch). |
| -- Added proctrack/linuxproc.so to the main RPM. |
| -- Added mutex around bridge api file to avoid locking up the api. |
| -- BlueGene mod: Terminate slurm_prolog and slurm_epilog immediately if |
| SLURM_JOBID environment variable is invalid. |
| -- Federation driver: allow selection of a specific switch interface |
| (sni0, sni1, etc.) with -euidevice/MP_EUIDEVICE. |
| -- Return an error for "scontrol reconfig" if there is already one in |
| progress |
| |
| * Changes in SLURM 1.0.2 |
| ======================== |
| -- Correctly report DRAINED node state as type OTHER for "sinfo --summarize". |
| -- Fixes in sacct use of malloc (Andy Riebs, HP, sacct_malloc.patch). |
| -- Smap mods: eliminate screen flicker, fix window resize, report more clear |
| message if window too small (Dan Palermo, HP, patch.1.0.0.1.060126.smap). |
| -- Sacct mods for inconsistent records (race condition) and replace --debug |
| option with --verbose (Andy Riebs, HP, slurm.hp.sacct_exp_vvv.patch). |
| -- scancel of a job step will now send a job-step-completed message |
| to the controller after verifying that the step has completed on all nodes. |
| -- Fix task layout bug in srun. |
| -- Added times to node "Reason" field when set down for insufficient |
| resources or if not responding. |
| -- Validate operation with Elan switch and heterogeneous nodes. |
| |
| * Changes in SLURM 1.0.1 |
| ======================== |
| -- Assorted updates and clarifications in documentation. |
| -- Detect which munge installation to use 32/64 bit. |
| |
| * Changes in SLURM 1.0.0 |
| ======================== |
| -- Fix sinfo filtering bug, especially "sinfo -R" output. |
| -- Fix node state change bug, resuming down or drained nodes. |
| -- Fix "scontrol show config" to display JobCredentialPrivateKey instead |
| of JobCredPrivateKey and JobCredentialPublicCertificate instead of |
| JobCredPublicKey. They now match the options in the slurm.conf. |
| -- Fix bug in job accounting for very long node list records (Andy Riebs, |
| HP, sacct_buf.patch). |
| -- BLUEGENE SPECIFIC - added load function to smap to load an already |
| existing bluegene.conf file. |
| -- Fix bug in sacct: If user requests specific job or job step ID, |
| only the last one with that ID will be reported. If multiple |
| nodes fail, the job has its state recorded as "JOB_TERMINATED...nf" |
| (Andy Riebs, HP, slurm.hp.sacct_dup.patch). |
| -- Fix some inconsistencies in sacct's help message (Andy Riebs, HP, |
| slurm.hp.sacct_help.patch). |
| -- Validate input to sacct command and allow embedded spaces in |
| arguments (Andy Riebs, HP, slurm.hp.sacct_validate.patch). |
| |
| * Changes in SLURM 0.7.0-pre8 |
| ============================= |
| -- BGL specific -- bug fix for smap configure function down configuration |
| -- Add support for job suspend/resume. |
| -- Add slurmd cache for group IDs (Takao Hatazaki, HP). |
| -- Fix bug in processing of "#SLURM" batch script option parsing. |
| |
| * Changes in SLURM 0.7.0-pre7 |
| ============================= |
| -- Fix issue with NODE_STATE_COMPLETING, could start job on node before |
| epilog completed. |
| -- Added some infrastructure for job suspend/resume (scontrol, api, and |
| slurmctld stub). |
| -- Set job's num_procs to the actual processor count allocated to the job. |
| -- Fix bug in HAVE_FRONT_END support for cluster emulation. |
| |
| * Changes in SLURM 0.7.0-pre6 |
| ============================= |
| -- Added support for task affinity for binding tasks to CPUs (Daniel |
| Palermo, HP). |
| -- Integrate task affinity support with configuration, add validation |
| test. |
| |
| * Changes in SLURM 0.7.0-pre5 |
| ============================= |
| -- Enhanced performance and debugging for slurmctld reconfiguration. |
| -- Add "scontrol update Jobid=# Nice=#" support. |
| -- Basic slurmctld and tool functionality validated to 16k nodes. |
| -- squeue and smap now display correct info for jobs in bluegene environment. |
| -- Fix setting of SLURM_NODELIST for batch jobs. |
| -- Add SubmitTime to job information available for display. |
| -- API function slurm_confirm_allocation() has been marked OBSOLETE |
| and will go away in some future version of SLURM. Use |
| slurm_allocation_lookup() instead. |
| -- New API calls slurm_signal_job and slurm_signal_job_step to send |
| signals directly to the slurmds without triggering the shutdown sequence. |
| -- remove "uid" from old_job_alloc_msg_t, no longer needed. |
| -- Several bug fixes in maui scheduler plugin from Dave Jackson |
| (Cluster Resources). |
| |
| * Changes in SLURM 0.7.0-pre4 |
| ============================= |
| -- Remove BNR library functions and add those for PMI (KVS and basic |
| MPI-1 functions only for now) |
| -- Added Hostfile support for POE and srun. MP_HOSTFILE env var to set |
| location of hostfile. Tasks will run from list order in the file. |
| -- Removes the slurmd's use of SysV shared memory. Instead the slurmd |
| communicates with the slurmstepd processes through the slurmstepd's |
| new named unix domain socket. The "stepd_api" is used to talk to the |
| slurmstepd (src/slurmd/common/stepd_api.[ch]). |
| -- Bluegene specific - bluegene block allocator will find most any |
| partition size now. Added support to start at any point in smap |
| to request a partition instead of always starting at 000. |
| -- Bluegene specific - Support to smap to down or bring up nodes in |
| configure mode. Added commands include allup, alldown, |
| up [range], down [range] |
| -- Time format in sinfo/squeue/smap/sacct changed from D:HH:MM:SS to |
| D-HH:MM:SS per POSIX standards document. |
| -- Treat scontrol update request without any requested changes as an |
| error condition. |
| -- Bluegene plugin renamed with BG instead of BGL. partition_allocator moved |
| into bluegene plugin and renamed block_allocator. Format for bluegene.conf |
| file changed also. Read bluegene html page. Code is backwards compatible; |
| smap will generate the new format. |
| -- Add srun option --nice to give user some control over job priority. |
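The D-HH:MM:SS time format adopted in this release can be illustrated with a short sketch. The function name format_elapsed is hypothetical, and always printing the day field (even when it is zero) is an assumption of this example rather than documented tool behavior.

```python
def format_elapsed(seconds: int) -> str:
    """Format a duration in seconds as D-HH:MM:SS."""
    days, rem = divmod(seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    # A dash separates days from the clock part, replacing the old "D:HH:MM:SS".
    return f"{days}-{hours:02d}:{minutes:02d}:{secs:02d}"


print(format_elapsed(93784))  # 1-02:03:04
```

Using a dash after the day count makes it clear which field is days, where the old colon-separated form could be misread.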
| |
| * Changes in SLURM 0.7.0-pre3 |
| ============================= |
| -- Restructure node states: DRAINING and DRAINED states are replaced |
| with a DRAIN flag. COMPLETING state is changed to a COMPLETING flag. |
| -- Test suite moved into testsuite/expect from separate repository. |
| -- Added new document describing slurm APIs (doc/html/api.html). |
| -- Permit nodes to be in multiple partitions simultaneously. |
| |
| * Changes in SLURM 0.7.0-pre2 |
| ============================= |
| -- New stdio protocol. Now srun has just a single TCP stream to each node |
| of a job-step. srun and slurmd communicate over the TCP stream using a |
| simple messaging protocol. |
| -- Added task plugin and use task prolog/epilog(s). |
| -- New slurmd_step functionality added. Fork exec instead of using shared |
| memory. Not completely tested. |
| -- BGL small partition logic in place in plugin and smap. Scheduler needs |
| to be rewritten to handle multiple partitions on a single node. No |
| documentation written on process yet. |
| -- If running select/bluegene plugin without access to BGL DB2, then |
| full-system bglblock is of system size defined in bluegene.conf. |
| |
| * Changes in SLURM 0.7.0-pre1 |
| ============================= |
| -- Support deferred initiation of job (e.g. srun --begin=11:30 ...). |
| -- Add support for srun --cpus-per-task through task allocation in |
| slurmctld. |
| -- fixed partition_allocator to work without curses |
| -- made change to srun to start message thread before other threads |
| to make sure localtime doesn't interfere. |
| -- Added new RPCs for slurmctld REQUEST_TERMINATE_JOB or TASKS, |
| REQUEST_KILL_JOB/TASKS changed to REQUEST_SIGNAL_JOB/TASKS. |
| -- Add support for e-mail notification on job state changes. |
| -- Some infrastructure added for task launch controls (slurm.conf: |
| TaskProlog, TaskEpilog, TaskPlugin; srun --task-prolog, --task-epilog). |
| |
| * Changes in SLURM 0.6.11 |
| ========================= |
| -- Fix bug in sinfo partition sorting order. |
| -- Fix bugs in srun use of #SLURM options in batch script. |
| -- Use full Elan credential space rather than re-using credentials as soon |
| as job step completes (helps with fault-tolerance). |
| |
| * Changes in SLURM 0.6.10 |
| ========================= |
| -- Fix for slurmd job termination logic (could hang in COMPLETING state). |
| -- Sacct bug fixes: Report correct user name for job step, show "uid.gid" |
| as fifth field of job step record (Andy Riebs, slurm.hp.sacct_uid.patch). |
| -- Add job_id to maui scheduler plugin start job status message. |
| -- Fix for srun's handling of null characters in stdout or stderr. |
| -- Update job accounting for larger systems (Andy Riebs, uptodate.patch). |
| -- Fixes for proctrack/linuxproc and mpich-gm support (Takao Hatazaki, HP). |
| -- Fix bug in switch/elan for large task count job having irregular task |
| distribution across nodes. |
| |
| * Changes in SLURM 0.6.9 |
| ======================== |
| -- Fix bug in mpi plugin to set the ID correctly |
| -- Accounting bug causing segv fixed (Andy Riebs, 14oct.jobacct.patch) |
| -- Fix for failed launch of a debugged job (e.g. bad executable name). |
| -- Wiki plugin fix for tracking allocated nodes (Ernest Artiaga, BSC). |
| -- Fix memory leaks in slurmctld and federation plugin. |
| -- Fix segfault in federation plugin function fed_libstate_clear(). |
| -- Align job accounting data (Andy Riebs, slurm.hp.unal_jobacct.patch) |
| -- Restore switch state in backup controller restarts |
| |
| * Changes in SLURM 0.6.8 |
| ======================== |
| -- Invalid AllowGroup value in slurm.conf to not cause seg fault. |
| -- Fix bug that would cause slurmctld to seg-fault with select/cons_res |
| and batch job containing more than one step. |
| |
| * Changes in SLURM 0.6.7 |
| ======================== |
| -- Make proctrack/linuxproc thread safe, could cause slurmd seg fault. |
| -- Propagate umask from srun to spawned tasks. |
| -- Fix problem in switch/elan error handling that could hang a slurmd |
| step manager process. |
| -- Build on AIX with -bmaxdata:0x70000000 for memory limit more than 256MB. |
| -- Restore srun's return code support. |
| |
| * Changes in SLURM 0.6.6 |
| ======================== |
| -- Fix for bad socket close() in the spawn-io code. |
| |
| * Changes in SLURM 0.6.5 |
| ======================== |
| -- Sacct to report on job steps that never actually started. |
| -- Added proctrack/rms to elan rpm. |
| -- Restructure slurmctld/agent.c logic to insure timely reaping of |
| terminating pthreads. |
| -- Srun not to hang if job fails before all task launches have completed. |
| -- Fix for consumable resources properly scheduling nodes that have more |
| nodes than configured (Susanne Balle, HP, cons_res_patch.10.14.2005) |
| |
| * Changes in SLURM 0.6.4 |
| ======================== |
| -- Bluegene plugin drains an entire bglblock on repeated boot failures |
| only if it has not identified a specific node as being bad. |
| |
| * Changes in SLURM 0.6.3 |
| ======================== |
| -- Fix slurmctld mem leaks (step name and hostlist struct). |
| -- Bluegene plugin sets end time for job terminated due to removed |
| bglblock. |
| |
| * Changes in SLURM 0.6.2 |
| ======================== |
| -- Fix sinfo and squeue formatting to properly handle slurm nodes, |
| jobs, and other names containing "%". |
| |
| * Changes in SLURM 0.6.1 |
| ======================== |
| -- Fixed smap -Db to display slurm partitions correctly (take 2). |
| -- Add srun fork() retry logic for very heavily loaded system. |
| -- Fix possible srun hang on task launch failure. |
| -- Add support for mvapich v0.9.4, 0.9.5 and gen2. |
| |
| * Changes in SLURM 0.6.0 |
| ======================== |
| -- Add documentation for ProctrackType=proctrack/rms. |
| -- Make proctrack/rms be the default for switch/elan. |
| -- Do not precede SIGKILL or SIGTERM to job step with (non-requested) SIGCONT. |
| -- Fixed smap -Db to display slurm partitions correctly. |
| -- Explicitly disallow ProctrackType=proctrack/linuxproc with |
| SwitchType=switch/elan. They will not work properly together. |
| |
| * Changes in SLURM 0.6.0-pre8 |
| ============================= |
| -- Remove debugging xassert in switch/federation that were accidentally |
| committed |
| -- Make slurmd step manager retry slurm_container_destroy() indefinitely |
| instead of giving up after 30 seconds. If something prevents a job |
| step's processes from being killed, the job will be stuck in the |
| completing state until the container destroy succeeds. |
| |
| * Changes in SLURM 0.6.0-pre7 |
| ============================= |
| -- Disable localtime_r() calls from forked processes (semaphore set |
| in another pthread can deadlock calls to localtime_r made from |
| the forked process, this will be properly fixed in the next |
| major release of SLURM). |
| -- Added SLURM_LOCALID environment variable for spawned tasks |
| (Dan Palermo, HP). |
| -- Modify switch logic to restore state based exclusively upon |
| recovered job steps (not state save file). |
| -- Gracefully refuse job if there are too many job steps in slurmd. |
| -- Fix race condition in job completion that can leave nodes in |
| COMPLETING state after job is COMPLETED. |
| -- Added frees for BGL BridgeAPI strdups that were to this point unknown. |
| -- smap scrolls correctly for BGL systems. |
| -- slurm_pid2jobid() API call will now return the jobid for a step |
| manager slurmd process. |
| |
| * Changes in SLURM 0.6.0-pre6 |
| ============================= |
| -- Added logic to return scheduled nodes to Maui scheduler (David |
| Jackson, Cluster Resources) |
| -- Fix bug in handling job request with maximum node count. |
| -- Fix node selection scheduling bug with heterogeneous nodes and |
| srun --cpus-per-task option |
| -- Generate error file to note prolog failures. |
| |
| * Changes in SLURM 0.6.0-pre5 |
| ============================= |
| -- Modify sfree (BGL command) so that --all option no longer requires |
| an argument. |
| -- Modify smap so it shows all nodes and partitions by default (even |
| nodes that the user can't access, otherwise there are holes in |
| its maps). |
| -- Added module to parse time string (src/common/parse_time.c) for |
| future use. |
| -- Fix BlueGene hostlist processing for non-rectangular prisms and |
| add string length checking. |
| -- Modify orphan batch job time calculation for BGL to account for |
| slowness when booting many bglblocks at the same time. |
| |
| * Changes in SLURM 0.6.0-pre4 |
| ============================= |
| -- Added etc/slurm.epilog.clean to kill processes initiated outside of |
| slurm when a user's last job on a node terminates. |
| -- Added config.xml and configurator.html files for use by OSCAR. |
| -- Increased maximum job step count from 64 to 130 for BGL systems only. |
| |
| * Changes in SLURM 0.6.0-pre3 |
| ============================= |
| -- Add code so job request for shared nodes gets explicitly requested |
| nodes, but lightly loaded nodes otherwise. |
| -- Add job step name field. |
| -- Add job step network specification field. |
| -- Add proctrack/rms plugin |
| -- Change the proctrack API to send a slurmd_job_t pointer to both |
| slurm_container_create() and slurm_container_add(). One of those |
| functions MUST set job->cont_id. |
| -- Remove vestigial node_use (virtual or coprocessor) field from job |
| request RPC. |
| -- Fix mpich-gm bugs, thanks to Takao Hatazaki (HP). |
| -- Fix code for clean build with gcc 2.96, Takao Hatazaki (HP). |
| -- Add node update state of "RESUME" to return DRAINED, DRAINING, or |
| DOWN node to service (IDLE or ALLOCATED state). |
| -- smap keeps trying to connect to slurmctld in iterative mode rather |
| than just aborting on failure. |
| -- Add squeue option --node to filter by node name. |
| -- Modify squeue --user option to accept not only user names, but also |
| user IDs. |
| |
| * Changes in SLURM 0.6.0-pre2 |
| ============================= |
| -- Removed "make rpm" target. |
| |
| * Changes in SLURM 0.6.0-pre1 |
| ============================= |
| -- Added bgl/partition_allocator/smap changes from 0.5.7. |
| -- Added configurable resource limit propagation (Daniel Christians, HP). |
| -- Added mpi plugin, specified at start of srun. |
| -- Changed SlurmUser ID from 16-bit to 32-bit. |
| -- Added MpiDefault slurm.conf parameter. |
| -- Remove KillTree configuration parameter (replace with |
| "ProctrackType=proctrack/linuxproc") |
| -- Remove MpichGmDirectSupport configuration parameter (replace with |
| "MpiDefault=mpich-gm") |
| -- Make default plugin be "none" for mpi. |
| -- Added mpi/none plugin and made it the default. |
| -- Replace extern program_invocation_short_name with program_invocation_name |
| due to short name being truncated to 16 bytes on some systems. |
| -- Added support for Elan clusters with different CPU counts on nodes |
| (Chris Holmes, HP). |
| -- Added Consumable Resources web page (Susanne Balle, HP). |
| -- "Session manager" slurmd process has been eliminated. |
| -- switch/federation fixes migrated from 0.5.* |
| -- srun pthreads really set detached, fixes scaling problem |
| -- srun spawns message handler process so it can now be stopped (via |
| Ctrl-Z or TotalView) without inducing failures. |
| |
| * Changes in SLURM 0.5.7 |
| ======================== |
| -- added infrastructure for (eventual) support of AIX checkpointing |
| of slurm batch and interactive poe jobs |
| -- changed BGL wiring to wire by physical location first and then |
| logical. |
| -- only one thread used to query database before polling thread is there. |
| |
| * Changes in SLURM 0.5.6 |
| ======================== |
| -- fix for BGL hostnames and full system partition finding |
| |
| * Changes in SLURM 0.5.5 |
| ======================== |
| -- Increase SLURM_MESSAGE_TIMEOUT_MSEC_STATIC to 15000 |
| -- Fix for premature timeout in _slurm_send_timeout |
| -- Fix for federation overlapping calls to non-thread-safe _get_adapters |
| |
| * Changes in SLURM 0.5.4 |
| ======================== |
| -- Added support for no reboot for VN to CO on BGL |
| -- Fix for case where a job starts after it finishes on BGL |
| |
| * Changes in SLURM 0.5.3 |
| ======================== |
| -- federation patch so the slurm controller has sane window status at |
| start-up regardless of the window status reported in the slurmd |
| registration. |
| -- federation driver exits with fatal() if the federation driver can not |
| find all of the adapters listed in the federation.conf |
| |
| * Changes in SLURM 0.5.2 |
| ======================== |
| -- Extra federation driver sanity checks |
| |
| * Changes in SLURM 0.5.1 |
| ======================== |
| -- Fix federation driver bad free(), other minor fed fixes |
| -- Allow slurm to parse very long lines in the slurm.conf |
| |
| * Changes in SLURM 0.5.0 |
| ======================== |
| -- Fix race condition in job accounting plugin, could hang slurmd |
| -- Report SlurmUser id over 16 bits as an error (fix on v0.6) |
| |
| * Changes in SLURM 0.5.0-pre19 |
| ============================== |
| -- Fix memory management bug in federation driver |
| |
| * Changes in SLURM 0.5.0-pre18 |
| ============================== |
| -- elan switch plugin memory leak plugged |
| -- added g_slurmctld_jobacct_fini() to release all memory (useful |
| to confirm no memory leaks) |
| -- Fix slurmd bug introduced in pre17 |
| |
| * Changes in SLURM 0.5.0-pre17 |
| ============================== |
| -- slurmd calls the proctrack destroy function at job step completion |
| -- federation driver tries harder to clean up switch windows |
| -- BGL wiring changes |
| |
| * Changes in SLURM 0.5.0-pre16 |
| ============================== |
| -- Check slurm.conf values for under/overflows (some are 16 bit values). |
| -- Federation driver clears windows at job step completion |
| -- Modify code for clean build with gcc v4.0 |
| -- New SLURM_NETWORK environment variable used by slurm_ll_api |
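
The slurm.conf under/overflow check noted above can be illustrated with a
minimal Python sketch. This is not SLURM's C implementation: the function
name parse_uint16 is hypothetical, and KillWait is used only as an example
parameter name (whether it is stored in a 16-bit field is an assumption).

```python
UINT16_MAX = 0xFFFF  # largest value a 16-bit unsigned field can hold


def parse_uint16(name, text):
    """Parse a config value, rejecting anything outside 0..65535.

    Hypothetical helper for illustration; not part of SLURM.
    """
    value = int(text, 10)  # raises ValueError on non-numeric input
    if value < 0:
        raise ValueError(f"{name}={text} underflows a 16-bit field")
    if value > UINT16_MAX:
        raise ValueError(f"{name}={text} overflows a 16-bit field")
    return value
```

Rejecting the value at parse time, rather than letting it wrap silently when
stored, is the point of such a check.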
| |
| * Changes in SLURM 0.5.0-pre15 |
| ============================== |
| -- Added "network" field to "scontrol show job" output. |
| -- Federation fix for unfreed windows when multiple adapters on |
| one node use the same LID |
| |
| * Changes in SLURM 0.5.0-pre14 |
| ============================== |
| -- RDMA works on fed plugin. |
| |
| * Changes in SLURM 0.5.0-pre13 |
| ============================== |
| -- Major mods to support checkpoint on AIX. |
| -- Job accounting documentation expanded, added tuning options, minor bug fixes |
| -- BGL wiring will now work on <= 4 node X-dim partitions and also 8 node |
| X-dim partitions. |
| -- ENV variables set for spawning jobs. |
| -- jobacct patch from HP to not erroneously lock a mutex in the |
| jobacct_log plugin. |
| -- switch/federation supports multiple adapters per task. sn_all behaviour |
| is now correct, and it also supports sn_single. |
| |
| * Changes in SLURM 0.5.0-pre12 |
| ============================== |
| -- Minor build changes to support RPM creation on AIX |
| |
| * Changes in SLURM 0.5.0-pre11 |
| ============================== |
| -- Slurmd tests for initialized session manager (user's) slurmd pid before |
| killing it to avoid killing system daemon (race condition). |
| -- srun --output or --error file names of "none" mapped to /dev/null for |
| batch jobs rather than a file actually named "none". |
| -- BGL: don't try to read bglblock state until they are all created to |
| avoid having BGL Bridge API seg fault. |
| |
| * Changes in SLURM 0.5.0-pre10 |
| ============================== |
| -- Fix bug that was resetting BGL job geometry on unrelated field update. |
| -- squeue and sinfo print timestamp in iterate mode by default. |
| -- added scrolling windows in smap |
| -- introduced new variable to start polling thread in the bluegene plugin. |
| -- Latest accounting patches from Riebs/HP, retry communications. |
| -- Added srun option --kill-on-bad-exit from Holmes/HP. |
| -- Support large (64-bit address) log files where possible. |
| -- Fix problem of signals being delivered twice to tasks. Note that as |
| part of the fix the slurmd session manager no longer calls setsid to |
| create a new session. |
| |
| * Changes in SLURM 0.5.0-pre9 |
| ============================= |
| -- If a job and node are in COMPLETING state and slurmd stops responding for |
| SlurmdTimeout, then set the node DOWN and the job COMPLETED. |
| -- Add logic to switch/elan to track contexts allocated to active job steps |
| rather than just using a cyclic counter and hoping to avoid collisions. |
| -- Plug memory leak in freeing job info retrieved using API. |
| -- Bluegene Plugin handles long deallocating states from driver 202. |
| -- Fix bug in bitfmt2int() which can go off allocated memory. |
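
The switch/elan change above (tracking allocated contexts instead of relying
on a cyclic counter) can be sketched roughly as follows. This is an
illustrative Python model, not the plugin's C code; the class name
ContextPool and its methods are hypothetical.

```python
class ContextPool:
    """Hand out context numbers from a fixed range, skipping ones in use."""

    def __init__(self, base, count):
        self.base = base        # first context number in the range
        self.count = count      # how many context numbers exist
        self.in_use = set()     # contexts held by active job steps
        self.next_offset = 0    # where the next search starts

    def allocate(self):
        # Scan at most one full cycle, starting after the last allocation.
        for step in range(self.count):
            offset = (self.next_offset + step) % self.count
            ctx = self.base + offset
            if ctx not in self.in_use:
                self.in_use.add(ctx)
                self.next_offset = (offset + 1) % self.count
                return ctx
        raise RuntimeError("no free contexts")

    def release(self, ctx):
        # Called when the job step holding this context completes.
        self.in_use.discard(ctx)
```

A bare cyclic counter would hand out a context again once it wrapped, even
if a long-running step still held it; recording the set in use avoids that.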
| |
| * Changes in SLURM 0.5.0-pre8 |
| ============================= |
| -- BlueGene srun --geometry was not getting propagated properly. |
| -- Fix race condition with multiple simultaneous epilogs. |
| -- Modify slurmd to resend job completion RPC to slurmctld in the |
| case where slurmctld is not responding. |
| -- Updated sacct: handle cancelled jobs correctly, add user/group |
| output, add ntasks as synonym for nprocs, display error field |
| by default, display ncpus instead of nprocs |
| -- Parallelization of queuing jobs up to 32 at once. Variable |
| MAX_AGENT_COUNT used in bgl_job_run.c to specify. |
| -- bgl_job_run.c fixed threading issue with uid_to_string use. |
| |
| * Changes in SLURM 0.5.0-pre7 |
| ============================= |
| -- Preserve next_job_id across restarts. |
| -- Add support for really long job names (256 bytes). |
| -- Add configuration parameter SchedulerRootFilter to control what |
| entity manages prioritization of jobs in RootOnly partition |
| (internal scheduler plugin or external entity). |
| -- Added support for job accounting. |
| -- Added support for consumable resource based node scheduling. |
| -- Permit batch job to be launched to pre-existing allocation. |
| |
| * Changes in SLURM 0.5.0-pre6 |
| ============================= |
| -- Load bluegene.conf and federation.conf based upon SLURM_CONF env |
| var (if set). |
| -- Fix slurmd shutdown signal synchronization bug (not consistently |
| terminating). |
| -- Add doc/html/ibm.html document. Update bluegene.html. |
| -- Add sfree to bluegene plugin. |
| -- Remove geometry[SYSTEM_DIMENSIONS] from opaque node_select data |
| type if SYSTEM_DIMENSIONS==0 (not ANSI-C compliant). |
| -- Modify smap to test for valid libdb2.so before issuing any BGL |
| Bridge API calls. |
| -- Modify spec file for optional inclusion of select_bluegene and |
| sched_wiki plugin libraries. |
| -- Initialize job->network in data structure, could cause job |
| submit/update to fail depending upon what is left on stack. |
| |
| * Changes in SLURM 0.5.0-pre5 |
| ============================= |
| -- Expand buffer to hold node_select info in job termination log. |
| -- Modify slurmctld node hashing function to reduce collisions. |
| -- Treat bglblock vanishing as fatal error for job, prolog and epilog |
| exit immediately. |
| -- bug fix for following multiple X-dim partitions |
| |
| * Changes in SLURM 0.5.0-pre4 |
| ============================= |
| -- Fix bug in slurmd that could double KillWait time on job timeout. |
| -- Fix bug in srun's error code reporting to slurmctld, could DOWN |
| a node if job run as root has non-zero error code. |
| -- Remove a node's partition info when removed from existing partition. |
| -- Use proctrack plugin to kill all processes in a job step before |
| calling interconnect_postfini() to ensure no processes escape from |
| job and prevent switch windows from being released. |
| -- Added mail.html web page telling how to get on slurm mailing lists. |
| -- Added another directory to search for DB2 files on BGL system. |
| -- Added overview man page slurm.1. |
| -- Added new configure option "--with-db2-dir=PATH" for BGL. |
| |
| * Changes in SLURM 0.5.0-pre3 |
| ============================= |
| -- Merge of SLURM v0.4-branch into v0.5/HEAD. |
| |
| * Changes in SLURM 0.5.0-pre2 |
| ============================= |
| -- Fix bug in srun to clean-up upon failure of an allocated node |
| (srun -A would generate a segmentation fault, Chris Holmes, HP). |
| -- If slurmd's node name is mapped to NULL (due to bad configuration) |
| terminate slurmd with a fatal error and don't crash slurmctld. |
| -- Add SLURMD_DEBUG env var for use with AIX/POE in spawn_task RPC. |
| -- Always pack job's "features" for access by prolog/epilog |
| |
| * Changes in SLURM 0.5.0-pre1 |
| ============================= |
| -- Add network option to srun and job creation API for specification |
| of communication protocol over IBM Federation switch. |
| -- Add new slurm.conf parameter ProctrackType (process tracking) and |
| associated plugin in the slurmd module. |
| -- Send node's switch state with job epilog completion RPC and |
| node registration (only when slurmd starts, not on periodic |
| registrations). |
| -- Add federation switch plugin. |
| -- Add new configuration keyword, SchedulerRootFilter, to control |
| external scheduler control of RootOnly partition (Chris Holmes, HP). |
| -- Modify logic to set process group ID for spawned processes (last |
| patch from slurm v0.3.11). |
| -- "srun -A" modified to return exit code of last command executed |
| (Chris Holmes, HP). |
| -- Add support for different slurm.conf files controlled via SLURM_CONF |
| env var (Brian O'Sullivan, pathscale) |
| -- Fix bug if srun given --uid without --gid option (Chris Holmes, HP). |
| |
| * Changes in SLURM 0.4.24 |
| ========================= |
| -- DRAIN nodes whose base partition switches are in ERROR, MISSING, |
| or DOWN states. |
| |
| * Changes in SLURM 0.4.23 |
| ========================= |
| -- Modified bluegene plugin to only sync bglblocks to jobs on initial |
| startup, not on reconfig. Fixes race condition. |
| -- Modified bluegene plugin to work with 141 driver. Enabling it to |
| only have to reboot when switching from coproc -> virtual and back. |
| -- added support for a full system partition to make sure every other |
| partition is free and vice versa. |
| -- smap resizing issue fixed. |
| -- change prolog not to add time when a partition is in deallocating |
| state. |
| -- NOTE: This version of SLURM requires BGL driver 141/2005. |
| |
| * Changes in SLURM 0.4.22 |
| ========================= |
| -- Modified bluegene plugin to not do anything if the bluegene.conf file |
| is altered. |
| -- added checking for lists before trying to create iterator on the list. |
| |
| * Changes in SLURM 0.4.21 |
| ========================= |
| -- Fix for race condition with time in Status Thread of BGL |
| -- Fix no leading zeros in smap output. |
| |
| * Changes in SLURM 0.4.20 |
| ========================= |
| -- Smap output is more user friendly with -c option |
| |
| * Changes in SLURM 0.4.19 |
| ========================= |
| -- Added new RPCs for getting bglblock state info remotely and cache data |
| within the plugin (permits removal of DB2 access from BGL FEN and |
| dramatically increases smap responsiveness, also changed prolog/epilog |
| operation) |
| -- Move smap executable to main slurm RPM (from separate RPM). |
| -- smap uses RPC instead of DB2 to get info about bgl partitions. |
| -- Status function added to bluegene_agent thread. Keeps current state |
| of BGL partitions updating every second. Will handle multiple attempts |
| at booting if booting a partition fails. |
| |
| * Changes in SLURM 0.4.18 |
| ========================= |
| -- Added error checking of rm_remove_partition calls. |
| -- job_term() was terminating a job in real time rather than |
| queueing the request. This would result in slurmctld hanging |
| for many seconds when a job termination was required. |
| |
| * Changes in SLURM 0.4.17 |
| ========================= |
| -- Bug fixes from testing .16. |
| |
| * Changes in SLURM 0.4.16 |
| ========================= |
| -- Added error checking to a bunch of Bridge API calls and more |
| gracefully handle failure modes. |
| -- Made smap more robust for more jobs. |
| |
| * Changes in SLURM 0.4.15 |
| ========================= |
| -- Added error checking to a bunch of Bridge API calls and more |
| gracefully handle failure modes. |
| |
| * Changes in SLURM 0.4.14 |
| ========================= |
| -- job state is kept on warm start of slurm |
| |
| * Changes in SLURM 0.4.13 |
| ========================= |
| -- epilog fix for bgl plugin |
| |
| * Changes in SLURM 0.4.12 |
| ========================= |
| -- bug fixes for new API calls. |
| -- added BridgeAPILogFile as an option for bluegene.conf file |
| |
| * Changes in SLURM 0.4.11 |
| ========================= |
| -- changed as many rm_get_partition() calls to rm_get_partitions_info() as |
| we could to save time. |
| |
| * Changes in SLURM 0.4.10 |
| ========================= |
| -- redesign for BGL external wiring. |
| -- smap display bug fix for smaller systems. |
| |
| * Changes in SLURM 0.4.9 |
| ======================== |
| -- setpnum works now, have to include this in bluegene.conf |
| |
| * Changes in SLURM 0.4.8 |
| ======================== |
| -- Changed the prolog and the epilog to use the env var MPIRUN_PARTITION |
| instead of BGL_PARTITION_ID |
| |
| * Changes in SLURM 0.4.7 |
| ======================== |
| -- Remove some BGL specific headers that IBM now distributes, NOTE |
| BGL driver 080 or greater required. |
| -- Change autogen.sh to deal with problems running autoconf on one |
| system and configure on another with different software versions. |
| |
| * Changes in SLURM 0.4.6 |
| ======================== |
| -- smap now works on non-BGL systems. |
| -- took tv.h out of partition_allocator so it would work with driver 080 |
| from IBM. |
| -- updated slurmd signal handling to prevent possible user killing of daemon. |
| |
| * Changes in SLURM 0.4.5 |
| ======================== |
| -- Change sinfo default time limit field to have 10 bytes (up from 9). |
| -- Fix bug in bluegene partition selection (sorting bug). |
| -- Don't display any completed jobs in smap. |
| -- Add NodeCnt to filetxt job completion plugin. |
| -- Minor restructuring of how MMCS is polled for DOWN nodes and switches. |
| -- Fix squeue output format for "%s" (node select data). |
| -- Queue job requesting more resources than exist in a partition if |
| that partition's state is DOWN (rather than just abort it). |
| -- Add prolog/epilog for bluegene to code base (moved from mpirun in CVS) |
| -- Add prolog, epilog and bluegene.conf.example to bluegene RPM |
| -- In smap, Admin can get the Rack/midplane id from an XYZ input and vice versa. |
| -- Add smap line-oriented output capability. |
| |
| * Changes in SLURM 0.4.4 |
| ======================== |
| -- Fix race condition in slurmd setting pgid of spawned tasks for |
| process tracking. |
| -- Fix so that scontrol reconfig neither affects running jobs nor crashes |
| the system. |
| -- Fix so that sort of bgl_list happens only once in select_bluegene.c |
| instead of every time a new job is inserted. |
| |
| * Changes in SLURM 0.4.3 |
| ======================== |
| -- Turn off some RPM build checks (bug in RPM, see slurm.spec.in) |
| -- starting slurmctld will destroy all RMP*** partitions every time. |
| |
| * Changes in SLURM 0.4.2 |
| ======================== |
| -- Fix memory leak in BlueGene plugin. |
| -- Srun's --test-only option takes precedence over --batch option. |
| -- Add sleep(1) after setting bglblock owner due to apparent race condition |
| in the BGL API. |
| -- Slurm was timing out and killing batch jobs if the node registered when |
| a job prolog was still running. |
| |
| * Changes in SLURM 0.4.1 |
| ======================== |
| -- BlueGene plugin kills jobs running in defunct bglblock on restart. |
| -- Smap displays pending jobs now, in addition to running and completing jobs. |
| -- Remove node "use=" from bluegene.conf file, create both coprocessor and |
| virtual bglblocks for now (later create just one and use API to change |
| it when such an API is available). |
| -- Add "ChangeNumpsets" parameter to bluegene.conf to use script to |
| update the numpsets parameter for newly created bglblocks (to be |
| removed once the API functions). |
| -- Add all patches from slurm v0.3.11 (through 2/7/2005) |
| - Added srun option --disable-status,-X to disable srun status feature |
| and instead forward SIGINT immediately to job upon receipt of Ctrl-C. |
| - Fix for bogus slurmd error message "Unable to put task N into pgrp..." |
| - Fix case where slurmd may erroneously detect shared memory entry |
| as "stale" and delete entry for unkillable or slow-to-exit job. |
| - (qsnet) Fix for running slurmd on node without an elan3 adapter. |
| - Fix for reported problem: slurm/538: user tasks block writing to stdio |
| |
| * Changes in SLURM 0.4.0 |
| ======================== |
| -- Minor tweak to init.d/slurm for BlueGene systems. |
| -- Added smap RPM package (to install binary built on BlueGene |
| service node on front-end nodes). |
| -- Added wait between bglblock destroy and creation of new blocks |
| so that MMCS can complete the operation. |
| -- Fix bug in synchronizing bglblock owners on slurmctld restart. |
| |
| * Changes in SLURM 0.4.0-pre11 |
| ============================== |
| -- Add new srun option "--test-only" for testing slurm_job_will_run API. |
| -- Fix bugs in slurm_job_will_run() processing. |
| -- Change slurm_job_will_run() to not return a message, just an error code. |
| -- Sync partition owners with running jobs on slurmctld restart. |
| |
| * Changes in SLURM 0.4.0-pre10 |
| ============================== |
| -- Specify number of I/O nodes associated with BlueGene partition. |
| -- Do not launch a job's tasks if the job is cancelled while its |
| prolog is running (which can be slow on BlueGene). |
| -- Add new error code, ESLURM_BATCH_ONLY for attempts to launch |
| job steps on front-end system (e.g. Blue Gene). |
| -- Updates to html documents. |
| -- Assorted fixes in smap, partition creation mode. |
| -- Add proper support for "srun -n" option on BGL recognizing |
| processor count in both virtual and coprocessor modes. |
| -- Make default node_use on Blue Gene be coprocessor, as documented. |
| -- Add SIGKILL to BlueGene jobs as part of cleanup. |
| |
| * Changes in SLURM 0.4.0-pre9 |
| ============================= |
| -- Change in /etc/init.d/slurm for RedHat and SuSE compatibility |
| |
| * Changes in SLURM 0.4.0-pre8 |
| ============================= |
| -- Add logic to create and destroy Bluegene Blocks automatically as needed. |
| -- Update smap man page to include Bluegene configuration commands. |
| |
| * Changes in SLURM 0.4.0-pre7 |
| ============================= |
| -- Port all patches from slurm v0.3 up through v0.3.10: |
| - Remove calls in auth/munge plugin deprecated by munge-0.4. |
| - Allow single task id to be selected with --input, --output, and --error. |
| - Create shared memory segment for Elan statistics when using the |
| switch/elan plugin. |
| - More fixes necessary for TotalView. |
| |
| * Changes in SLURM 0.4.0-pre6 |
| ============================= |
| -- Add new job reason value "JobHeld" for jobs with priority==0 |
| -- Move startup script from "/etc/rc.d/init.d/slurm" to "/etc/init.d/slurm" |
| -- Modify prolog/epilog logic in slurmd to accommodate very long run times, |
| on BGL these scripts wait for events that can take a very long time |
| (tens of seconds). |
| -- This code base was used for BGLb acceptance test with pre-defined |
| BGL blocks. |
| |
| * Changes in SLURM 0.4.0-pre5 |
| ============================= |
| -- select/bluegene plugin confirms db.properties file in $sysconfdir |
| and copies it to StateSaveLocation (slurmctld's working directory) |
| -- select/bluegene plugin confirms environment variables required for |
| DB2 interaction are set (execute "db2profile" script before slurmctld) |
| -- slurmd to always give jobs KillWait time between SIGTERM and SIGKILL |
| at termination |
| -- set job's start_time and end_time = now rather than leaving zero if |
| they fail to execute |
| -- modify srun to forward SIGTERM |
| -- enable select/bluegene testing for DOWN nodes and switches |
| -- select/bluegene plugin to delete orphan jobs, free BGLblocks and |
| set owner as jobs terminate/start |
| |
| * Changes in SLURM 0.4.0-pre4 |
| ============================= |
| -- Fixes for reported problems: |
| - slurm/512: Let job steps run on DRAINING nodes |
| - slurm/513: Gracefully deal with UIDs missing from passwd file |
| -- Add support for MPICH-GM (from takao.hatazaki@hp.com) |
| -- Add support for NodeHostname in node configuration |
| -- Make "scontrol show daemons" function properly on front-end system |
| (e.g. Blue Gene) |
| -- Fix srun bug when --input, --output and --error are all "none" |
| -- Don't schedule jobs for user root if partition is DOWN |
| -- Modify select/bluegene to honor job's required node list |
| -- Modify user name logic to explicitly set UID=0 to "root", |
| Suse Linux was not handling multiple users with UID=0 well. |
| |
| * Changes in SLURM 0.4.0-pre3 |
| ============================= |
| -- Send SIGTERM to batch script before SIGKILL for mpirun cleanup on |
| Blue Gene/L |
| -- Create new allocation as needed for debugger in case old allocation |
| has been purged |
| -- Add Blue Gene User Guide to html documents |
| -- Fix srun bug that could cause seg fault with --no-shell option if not |
| running under a debugger |
| -- Propagate job's task count (if set) for batch job via SLURM_NPROCS. |
| -- Add new job parameters for Blue Gene: geometry, rotate, mode (virtual |
| or co-processor), communications type (mesh or torus), and partition ID. |
| -- Exercise a bunch of new switch plugin functions for Federation |
| switch support. |
| -- Fix bug in scheduling jobs when a processor count is specified |
| and FastSchedule=0 and the cluster is heterogeneous. |
| |
| * Changes in SLURM 0.4.0-pre2 |
| ============================= |
| -- NOTE: "startclean" when transitioning from version 0.4.0-pre1, JOBS ARE LOST |
| -- Fixes for reported problems: |
| - slurm/477: Signal of batch job script (scancel -b) fixed |
| - slurm/481: Permit clearing of AllowGroups field for a partition |
| - slurm/482: Adjust Elan base context number to match RMS range |
| - slurm/489: Job completion logger was writing NULL to text file |
| -- Preserve job's requested processor count info after job is initiated |
| (for viewing by squeue and scontrol) |
| -- srun cancels created job if job step creation fails |
| -- Added a lot of Blue Gene/L support logic: slurmd executes on a single |
| node to front-end the 512-CPU base-partitions (Blue Gene/L's nodes) |
| -- Add node selection plugin infrastructure, relocate existing logic |
| to select/linear, add configuration parameter SelectType |
| -- Modify node hashing algorithm for better performance on Blue Gene/L |
| -- Add ability to specify node ranges for 3-D rectangular prism |
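
To illustrate what a 3-D rectangular prism node range means, here is a
minimal Python sketch that expands two opposite corners into node names.
It assumes single-digit X, Y, Z coordinates appended to a name prefix
(for example corners "000" and "133" with prefix "bgl"); the function
expand_prism is hypothetical and is not SLURM's hostlist code.

```python
from itertools import product


def expand_prism(prefix, start, end):
    """Expand two opposite corners, e.g. "000" and "133", into node names."""
    if len(start) != 3 or len(end) != 3:
        raise ValueError("corners must be three digits (X, Y, Z)")
    lo = [int(ch) for ch in start]
    hi = [int(ch) for ch in end]
    if any(l > h for l, h in zip(lo, hi)):
        raise ValueError("start corner must not exceed end corner")
    # One inclusive range per dimension; product() walks every coordinate.
    ranges = [range(l, h + 1) for l, h in zip(lo, hi)]
    return [f"{prefix}{x}{y}{z}" for x, y, z in product(*ranges)]
```

With corners "000" and "133" the prism spans 2 x 4 x 4 = 32 nodes, from
bgl000 through bgl133.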
| |
| * Changes in SLURM 0.4.0-pre1 |
| ============================= |
| -- NOTE: "startclean" when transitioning from version 0.3, JOBS ARE LOST |
| -- Added support for job account information (arbitrary string) |
| -- Added support for job dependencies (start job X after job Y completes) |
| -- Added support for configuration parameter CheckpointType |
| -- Added new job state "CANCELLED" |
| -- Don't strip binaries, breaks parallel debuggers |
| -- Fix bug in Munge authentication retry logic |
| -- Change srun handling of interrupts to work properly with TotalView |
| -- Added "reason" field to job info showing why a job is waiting to run |
| |
| * Changes in SLURM 0.3.7 |
| ======================== |
| -- Fixes required for TotalView operability under RHEL3.0 |
| (Reported by Dong Ahn <dahn@llnl.gov>) |
| - Do not create detached threads when running under parallel debugger. |
| - Handle EINTR from sigwait(). |
| |
| * Changes in SLURM 0.3.6 |
| ======================== |
| -- Fixes for reported problems: |
| - slurm/459: Properly support partition's "Shared=force" configuration. |
| -- Resync node state to DRAINED or DRAINING on restart in case job |
| and node state recovered are out of sync. |
| -- Added jobcomp/script plugin (execute script on job completion, |
| from Nathan Huff, North Dakota State University). |
| -- Added new error code ESLURM_FRAGMENTED for immediate resource |
| allocation requests which are refused due to completing job (formerly |
| returned ESLURM_NOT_TOP_PRIORITY) |
| -- Modified job completion logging plugin calling sequence. |
| -- Added much of the infrastructure required for system checkpoint |
| (APIs, RPCs, and NULL plugin) |
| |
| * Changes in SLURM 0.3.5 |
| ======================== |
| -- Fix "SLURM_RLIMIT_* not found in environment" error message when |
| distributing large rlimit to jobs. |
| -- Add support for slurm_spawn() and associated APIs (needed for IBM |
| SP systems). |
| -- Fix bug in update of node state to DRAINING/DRAINED when update |
| request occurs prior to initial node registration. |
| -- Fix bug in purging of batch jobs (active batch jobs were being |
| improperly purged starting in version 0.3.0). |
| -- When updating a node state to DRAINING/DRAINED a Reason must be |
| provided. The user name and a timestamp will automatically be |
| appended to that Reason. |
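
The Reason handling described above can be sketched in Python as follows.
The exact text SLURM appends is not given in this file, so the
"[user@timestamp]" layout below is an assumption, and annotate_reason is a
hypothetical helper rather than a SLURM function.

```python
from datetime import datetime


def annotate_reason(reason, user, when):
    """Return the drain reason with "[user@timestamp]" appended.

    The bracketed layout is an assumed format for illustration only.
    """
    if not reason or not reason.strip():
        raise ValueError("a Reason is required when draining a node")
    stamp = when.strftime("%Y-%m-%dT%H:%M:%S")
    return f"{reason.strip()} [{user}@{stamp}]"
```

For example, annotate_reason("bad disk", "root", datetime(2004, 6, 1, 12, 0))
yields "bad disk [root@2004-06-01T12:00:00]", while an empty Reason is
rejected.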
| |
| * Changes in SLURM 0.3.4 |
| ======================== |
| -- Fixes for reported problems: |
| - slurm/404: Explicitly set pthread stack size to 1MB for srun |
| -- Allow srun to respond to ctrl-c and kill queued job while waiting |
| for allocation from controller. |
| |
| * Changes in SLURM 0.3.3 |
| ======================== |
| -- Fix slurmctld handling of heterogeneous processor count on elan |
| switch (was setting DRAINED nodes in state DRAINING). |
| -- Fix sinfo -R, --list-reasons to list all relevant node states. |
| -- Fix slurmctld to honor srun's node configuration specifications |
| with FastSchedule==0 configuration. |
| -- Added srun option --debugger-test to confirm that slurm's debugger |
| infrastructure is operational. |
| -- Removed debugging hacks for srun.wrapper.c. Temporarily use |
| RPM's debugedit utility if available for similar effect. |
| |
| * Changes in SLURM 0.3.2 |
| ======================== |
| -- The srun command wakes immediately upon resource allocation (via new RPC) |
| rather than polling. |
| -- SLURM daemons log current version number at startup. |
| -- If slurmd can't respond to ping (e.g. paging is keeping it from |
| responding in a timely fashion) then send a registration RPC |
| to slurmctld. |
| -- Fix slurmd -M option to call mlockall() after daemonizing. |
| -- Add "slurm_" prefix to slurm's hostlist_ function man pages. |
| -- More AIX support added. |
| -- Change get info calls from using show_all to more general show_flags |
| with #define for SHOW_ALL flag. |
| |
| * Changes in SLURM 0.3.1 |
| ======================== |
| -- Set SLURM_TASKS_PER_NODE env var for batch jobs (and LAM/MPI). |
| -- Fix for slurmd spinning when stdin buffers full (gnats:434) |
| -- Change some slurmctld malloc sizes to reduce demand for realloc calls, |
| improves performance and eliminates realloc failure on RH EL3 under |
| extremely heavy workload apparently due to memory fragmentation. |
| -- Fix scheduling logic for heterogeneous processor count. |
| -- Modify security_2_2 test to function with release 0.3 |
| -- Fix broken rpm build when libslurm not already installed. |
| -- New slurmd option -M to mlock() slurmd process into memory. |
| -- New srun option --no-shell causes srun to exit instead of spawning |
| shell when using --allocate, -A. |
| -- Modify srun --uid=user and --gid=group options to maintain invoking |
| user's credentials until after nodes have been allocated to requested |
| user/group (allows root to run jobs and allocate nodes for other users |
| in a RootOnly partition). |
| -- Fix node processing if state change requested via scontrol prior to |
| initial node registration. |
| |
| * Changes in SLURM 0.3.0 |
| ======================== |
| -- Support for AIX added (a few bugs do remain). |
| -- Fix memory leak in slurmctld, slurm_cred_create(). |
| -- On ELF systems, export BNR_* functions from SLURM API. |
| -- Add support for "hidden" partitions (applies to their |
| nodes, jobs, and job steps as well). APIs and commands |
| modified to optionally display hidden partitions. |
| -- Modify partition's group_allow test to be based upon the user |
| of the allocation rather than the user making the allocation |
| request (user root for LCRM batch jobs). |
| -- Restructure plugin directory structure. |
| -- New --core=type option in srun for lightweight corefile support. |
| (requires liblwcf). |
| -- Let user root and SlurmUser exceed any partition limits. |
| -- Srun treats "--time=0" as a request for an infinite time limit. |
| |
| * Changes in SLURM 0.3.0.0-pre10 |
| ================================ |
| -- Fix bugs in support of slurmctld "-f" option (specify different |
| slurm.conf pathname). |
| -- Remove slurmd "-f" option. |
| -- Several documentation changes for slurm administrators. |
| -- On ELF systems, export only slurm_* functions from slurm API and |
| ensure plugins use only slurm_ prefixed functions (created aliases |
| where necessary). |
| -- New srun option -Q, --quiet to suppress informational messages. |
| -- Fix bug in slurmctld's building of nodelist for job (failed if |
| more than one numeric field in node name). |
| -- Change "scontrol completing" and "sinfo" to use job's node bitmap |
| to identify nodes associated with that particular job that are |
| still processing job completion. This will work properly for |
| shared nodes. |
| -- Set SLURM_DISTRIBUTION environment variable for user tasks. |
| -- Fix for file descriptor leak in slurmd. |
| -- Propagate stacksize limit to jobs along with other resource limits |
| that were previously ignored. |
| |
| * Changes in SLURM 0.3.0.0-pre9 |
| =============================== |
| -- Restructure how slurmctld state saves are performed for better |
| scalability. |
| -- New sinfo option "--list-reason" or "-R". Displays down or drained |
| nodes along with their REASON field. |
| |
| * Changes in SLURM 0.3.0.0-pre8 |
| =============================== |
| -- Queue outgoing message traffic rather than immediately spawning |
| pthreads (under heavy load this resulted in hundreds of pthreads |
| using more memory than was available). |
| -- Restructure slurmctld message agent for higher throughput. |
| -- Add new sinfo options --responding and --dead (i.e. non-responding) |
| for filtering node states. |
| -- Fix bug in sinfo to properly process specified state filter including |
| "*" suffix for non-responding nodes. |
| -- Create StateSaveLocation directory if changed via slurmctld reconfig. |
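The idea behind queuing outgoing message traffic instead of spawning one thread per message can be sketched as follows. This is a minimal sketch and not slurmctld's implementation: send_all, send_fn, and max_workers are hypothetical names, and the real agent is written in C with pthreads.

```python
import queue
import threading


def send_all(messages, send_fn, max_workers=4):
    """Deliver messages with at most max_workers threads.

    Messages wait in a queue instead of each getting its own thread,
    so the thread count stays bounded under heavy load.
    """
    pending = queue.Queue()
    for msg in messages:
        pending.put(msg)

    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            try:
                msg = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            outcome = send_fn(msg)
            with results_lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# 100 messages are handled by 4 threads rather than by 100.
delivered = send_all(range(100), lambda m: m * 2, max_workers=4)
print(len(delivered))
```

The trade-off is latency for predictability: a message may wait in the queue, but the number of live threads no longer grows with the backlog.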
| |
| * Changes in SLURM 0.3.0.0-pre7 |
| =============================== |
| -- Fixes for reported problems: |
| - slurm/381: Hold jobs requesting more resources than partition limit. |
| - slurm/387: Jobs lost and nodes DOWN on slurmctld restart. |
| -- Add support for getting node's real memory size on AIX. |
| -- Sinfo sorts partitions in slurm.conf order, new sort option ("#P"). |
| -- Document how to gracefully change plugin values. |
| -- Slurmctld does not attempt to recover jobs when the switch plugin |
| value changes (decision reached when any job's switch state recovery |
| fails). |
| -- Node does not transition from COMPLETING to DOWN state due to |
| not responding. Wait for tasks to complete or admin to set DOWN. |
| -- Always chmod SlurmdSpoolDir to 755 (a umask of 007 was resulting |
| in batch jobs failing). |
| -- Return errors when trying to change configuration parameters |
| AuthType, SchedulerType, and SwitchType via "scontrol reconfig" |
| or SIGHUP. Document how to safely change these parameters. |
| -- Plugin-specific error number definitions and descriptive strings |
| moved from common into plugin modules. |
| -- Documentation for writing scheduler, switch, and job completion |
| logging plugins added. |
| -- Added job and node state descriptions to the squeue and sinfo man pages. |
| -- Backup slurmctld to generate core file on SIGABRT. |
| -- Backup slurmctld to re-read slurm.conf on SIGHUP. |
| -- Added -q,--quit-on-interrupt option to srun. |
| -- Elan switch plugin now starts neterr resolver thread on all Elan3 |
| systems (QsNet and QsNetII). |
| -- Added some missing read locks for references to slurmctld's |
| configuration data structure |
| -- Modify processing of queued slurmctld message traffic to get better |
| throughput (resulted in job inactivity limit being reached improperly |
| when hundreds of jobs running simultaneously) |
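The SlurmdSpoolDir entry above turns on a general POSIX behavior: a directory created under a umask of 007 has no permissions for "other" users, while an explicit chmod is not subject to the umask. A minimal sketch, assuming a POSIX system; make_spool_dir and the temporary paths are hypothetical and not taken from slurmd:

```python
import os
import stat
import tempfile


def make_spool_dir(path):
    """Create path and force mode 755 regardless of the process umask."""
    os.makedirs(path, exist_ok=True)
    os.chmod(path, 0o755)  # chmod is not affected by the umask
    return stat.S_IMODE(os.stat(path).st_mode)


old_umask = os.umask(0o007)  # restrictive umask, as in the reported failure
try:
    base = tempfile.mkdtemp()
    plain = os.path.join(base, "plain")
    os.mkdir(plain)  # mode is 0o777 & ~0o007 = 0o770: "other" users locked out
    plain_mode = stat.S_IMODE(os.stat(plain).st_mode)
    spool_mode = make_spool_dir(os.path.join(base, "spool"))
finally:
    os.umask(old_umask)

print(oct(plain_mode), oct(spool_mode))
```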
| |
| * Changes in SLURM 0.3.0.0-pre6 |
| =============================== |
| -- Fixes for reported problems: |
| - slurm/372: job state descriptions added to squeue man page |
| -- Switch plugin added. Add "SwitchType=switch/elan" to slurm.conf for |
| systems with Quadrics Elan3 or Elan4 switches. |
| -- Don't treat DOWN nodes with too few CPUs as a fatal error on Elan |
| -- Major re-write of html documents |
| -- Updates to node pinging for large numbers of unresponsive nodes |
| -- Explicitly set default action for SIGTERM (action on Thunder was |
| to ignore SIGTERM) |
| -- Sinfo "--exact" option only applies to fields actually displayed |
| -- Partition processor count not correctly computed for heterogeneous |
| clusters with FastSchedule=0 configuration |
| -- Only return DOWN nodes to service if the reason for them being in |
| that state is non-responsiveness and "ReturnToService=1" configuration |
| -- Partition processor count now correctly computed for heterogeneous |
| clusters with FastSchedule configured off |
| -- New macros and function to export SLURM version number |
| |
| * Changes in SLURM 0.3.0.0-pre5 |
| =============================== |
| -- Fixes for reported problems: |
| - slurm/346: Support multiple colon-separated PluginDir values |
| -- Fix node state transition: DOWN to DRAINED (instead of DRAINING) |
| -- Fix a couple of minor slurmctld memory leaks |
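Handling a colon-separated PluginDir value amounts to splitting the string and searching each directory in turn. The sketch below is an assumption about the general approach, not SLURM's loader: find_plugin, the first-match-wins order, and the auth_none.so file name are hypothetical, and POSIX-style paths are assumed.

```python
import os
import tempfile


def find_plugin(plugin_dir_value, filename):
    """Return the first existing path for filename across colon-separated dirs."""
    for directory in plugin_dir_value.split(":"):
        if not directory:
            continue  # tolerate empty entries such as a trailing colon
        candidate = os.path.join(directory, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Two throwaway directories stand in for real plugin locations.
first = tempfile.mkdtemp()
second = tempfile.mkdtemp()
with open(os.path.join(second, "auth_none.so"), "w") as fh:
    fh.write("")

found = find_plugin(f"{first}:{second}", "auth_none.so")
print(found)
```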
| |
| * Changes in SLURM 0.3.0.0-pre4 |
| =============================== |
| -- Fix bug where early launch failures (such as invalid UID/GID) resulted |
| in jobs not terminating properly. |
| -- Initial support for BNR committed (not yet functional). |
| -- QsNet: SLURM now uses /etc/elanhosts exclusively for converting |
| hostnames to ElanIDs. |
| |
| * Changes in SLURM 0.3.0.0-pre3 |
| =============================== |
| -- Fixes for reported problems: |
| - slurm/328: Slurmd was restarting with a new shared memory segment and |
| losing track of jobs |
| - slurm/329: Job processes may be left running when one task dies |
| - slurm/333: Slurmd fails to launch a job and deletes a step, due to |
| a race condition in shared memory management |
| - slurm/334: Slurmd was getting a segv due to a race condition in shared |
| memory management |
| - slurm/342: Properly handle nodes being removed from configuration |
| even when there are partitions, nodes, or job steps still associated |
| with them |
| -- Srun properly terminates jobs/steps upon node failure (used to hang |
| waiting for I/O completion) |
| -- Job time limits enforced even if InactiveLimit configured as zero |
| -- Support the sending of an arbitrary signal to a batch script (but not |
| the processes in its job steps) |
| -- Re-read slurm configuration file whenever changed, needed by users |
| of SLURM APIs |
| -- Scancel was generating an assert failure |
| -- Slurmctld sends a launch response message upon scheduling of a queued |
| job (for immediate srun response) |
| -- Maui scheduler plugin added |
| -- Backfill scheduler plugin added |
| -- Batch scripts can now have arguments that are propagated |
| -- MPICH support added (via patch, not in SLURM CVS) |
| -- New SLURM environment variables added: SLURM_CPUS_ON_NODE and |
| SLURM_LAUNCH_NODE_IPADDR. These provide support needed for LAM/MPI |
| (version 7.0.4+) |
| -- The TMPDIR directory is created as needed before job launch |
| -- Do not create duplicate SLURM environment variables with the same name |
| -- Ensure proper enforcement of node sharing by job |
| -- Treat lack of SpoolDir or StateSaveDir as a fatal error |
| -- Quickstart.html guide expanded |
| -- Increase maximum job steps per node from 16 to 64 |
| -- Delete correct shared memory segment on slurmd -c (clean start) |
| |
| * Changes in SLURM 0.3.0.0-pre2 |
| =============================== |
| -- Fixes for reported problems: |
| - slurm/326: Properly clean-up jobs terminating on non-responding nodes |
| -- Move all configuration data structure into common/read_config, scontrol |
| now always shows default values if not specified in slurm.conf file |
| -- Remove the unused "Prioritize" configuration parameter |
| |
| * Changes in SLURM 0.3.0.0-pre1 |
| =============================== |
| -- Fixes for reported problems: |
| - slurm/252: "jobs left orphaned when using TotalView:" SLURM controller |
| now pings srun and kills defunct jobs. |
| - slurm/253: "srun fails to accept new IO connection." |
| - slurm/317: "Lack of default partition in config file causes errors." |
| - slurm/319: Socket errors on multiple simultaneous job launches fixed |
| - slurm/321: slurmd shared memory synchronization error. |
| -- Removed slurm_tv_clean daemon which has been obsoleted by slurm/252 fix. |
| -- New scontrol command ``delete'' and RPC added to delete a partition |
| -- Squeue can now print and sort by group id/name |
| -- Scancel has new option -q,--quiet to not report an error if a job |
| is already complete |
| -- Add the excluded node list to job information reported. |
| -- RPC version mis-match now properly handled |
| -- New job completion plugin interface added for logging completed jobs. |
| -- Fixed lost digit in scontrol job priority specification. |
| -- Remove restriction on the number of consecutive node sets (no longer |
| needed after DPCS upgrade) |
| -- Incomplete state save write now properly handled. |
| -- Modified slurmd setrlimit error for greater clarity. |
| -- Slurmctld performs load-leveling across shared nodes. |
| -- New user function added slurm_get_end_time for user jobs. |
| -- Always compile srun with stabs debug section when TotalView support |
| is requested. |
| |
| * Changes in SLURM 0.2.21 |
| ========================= |
| -- Fixes for reported problems: |
| - slurm/253: Try using different port if connect() fails (was rarely |
| failing when an existing defunct connection was in TIME_WAIT state) |
| - slurm/300: Possibly killing wrong job on slurmd restart |
| - slurm/312: Freeing non-allocated memory and killing slurmd |
| -- Assorted changes to support RedHat Enterprise Linux 3.0 and IA64 |
| -- Initial Elan4 and libelanctrl support (--with-elan). |
| -- Slurmctld was sometimes inappropriately setting a job's priority |
| to 1 when a node was down (even if up nodes could be used for the |
| job when a running job completes) |
| -- Convert all user commands from use of popt library to getopt_long() |
| -- If TotalView support is requested, srun exports "totalview_jobid" |
| variable for `%J' expansion in TV bulk launch string. |
| -- Fix several locking bugs in slurmd IO layer. |
| -- Throttle back repetitious error messages in slurmd to avoid filling |
| log files. |
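Throttling repetitious error messages can be done by remembering when each message was last written and dropping repeats inside a time window. The sketch below is one possible approach under that assumption, not slurmd's actual logic: ThrottledLog, its 60-second interval, and the "repeated N more times" suffix are all hypothetical.

```python
import time


class ThrottledLog:
    """Emit a given message at most once per `interval` seconds."""

    def __init__(self, interval=60.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_emit = {}   # message -> time it was last written
        self.suppressed = {}  # message -> repeats dropped since then
        self.lines = []       # stands in for the log file

    def error(self, message):
        now = self.clock()
        last = self.last_emit.get(message)
        if last is not None and now - last < self.interval:
            self.suppressed[message] = self.suppressed.get(message, 0) + 1
            return False
        dropped = self.suppressed.pop(message, 0)
        suffix = f" (repeated {dropped} more times)" if dropped else ""
        self.lines.append(message + suffix)
        self.last_emit[message] = now
        return True


# A fake clock keeps the demonstration deterministic.
fake_now = [0.0]
log = ThrottledLog(interval=60.0, clock=lambda: fake_now[0])
for _ in range(1000):
    log.error("connect failed")
fake_now[0] = 61.0
log.error("connect failed")
print(log.lines)
```

Here 1001 calls produce two log lines, and the second line still records how many repeats were dropped so the information is not lost entirely.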
| |
| |
| * Changes in SLURM 0.2.20 |
| ========================= |
| -- Fixes for reported problems: |
| - slurm/298: Elan initialization error (Invalid vp 2147483674). |
| - slurm/299: srun fails to exit with multiple ^C's. |
| -- Temporarily prevent DPCS from allocating jobs with more than eight |
| sets of consecutive nodes. This was likely causing user applications |
| to fail with libelan errors. This will be removed after DPCS is updated. |
| -- Fix bug in popt use, was failing in some versions of Linux. |
| -- Resend KILL_JOB messages as needed to clear COMPLETING jobs. |
| -- Install dummy SIGCHLD handler in slurmd to fix problem on NPTL systems |
| where slurmd was not notified of terminated tasks. |
| |
| * Changes in SLURM 0.2.19 |
| ========================= |
| -- Fixed memory corruption bug that was causing slurmctld to seg-fault |
| |
| * Changes in SLURM 0.2.18 |
| ========================= |
| -- Fixes for reported problems: |
| - slurm/287: slurm protocol timeouts when using TotalView. |
| - slurm/291: srun fails using ``-n 1'' under multi-node allocation. |
| - slurm/294: srun IO buffer reports ENOSPC. |
| -- Fixed memory corruption bug that was causing slurmctld to seg-fault |
| -- Non-responding nodes now go from DRAINING to DRAINED state when |
| jobs complete |
| -- Do not schedule pending jobs while any job is actively COMPLETING |
| unless the submitted job specifically identifies its nodes (like DPCS) |
| -- Reset priority of jobs with priority==1 when a non-responding node |
| starts to respond again |
| -- Ignore jobs with priority==1 when establishing new baseline upon |
| slurmctld restart |
| -- Make slurmctld/message retry be timer based rather than queue based |
| for better scalability |
| -- Slurmctld logging is more concise, using hostlists more |
| -- srun --no-allocate now uses a special job_id range to avoid conflicts |
| or premature job termination (purging by slurmctld) |
| -- New --jobid=id option in srun to initiate job step under an existing |
| allocation. |
| -- Support in srun for TotalView bulk launch. |
| |
| * Changes in SLURM 0.2.17 |
| ========================= |
| -- Fixes for reported problems: |
| - slurm/279: Hold jobs that can't execute due to DOWN or DRAINED |
| nodes and release when nodes are returned to service. |
| - slurm/285: "srun killed due to SIGPIPE" |
| -- Support for running job steps on nodes relative to current |
| allocation via srun -r, --relative=n option. |
| -- SIGKILL no longer broadcast to job via srun on task failure unless |
| --no-allocate option is used. |
| -- Re-enabled "chkconfig --add" in default RPMs. |
| -- Backup controller setting proper PID into slurmctld.pid file. |
| -- Backup controller restores QSW state each time it assumes control |
| -- Backup controller purges old job records before assuming control |
| to avoid resurrecting defunct jobs. |
| -- Kill jobs on non-responding DRAINING nodes and make their state |
| DRAINED. |
| -- Save state upon completion of a job's last EPILOG_COMPLETION to |
| reduce possibility of inconsistent job and node records when the |
| controller is transitioning between primary and backup. |
| -- Change logging level of detailed communication errors to not print |
| them unless detailed debugging is requested. |
| -- Increase number of concurrent controller server threads from 20 |
| to 50 and restructure code to handle backlogs more efficiently. |
| -- Partition state at controller startup is based upon slurm.conf |
| rather than previously saved state. Additional improvements to |
| avoid inconsistent job/node/partition states at restart. Job state |
| information is used to arbitrate conflicts. |
| -- Orphaned file descriptors eliminated. |
| |
| * Changes in SLURM 0.2.16 |
| ========================= |
| -- Fixes for reported problems: |
| - slurm/265: Early termination of srun could cause job to remain in queue. |
| - slurm/268: Slurmctld could deadlock if there was a delay in the |
| termination of a large node-count job. An EPILOG_COMPLETE RPC was |
| added so that slurmd could notify slurmctld whenever the job |
| termination was completed. |
| - slurm/270: Segfault in sinfo if a configured node lacked a partition. |
| - slurm/278: Exit code in scontrol did not indicate failure. |
| -- Fixed bug in slurmd that caused the daemon to occasionally kill itself. |
| -- Fixed bug in srun when running with --no-allocate and >1 process per node. |
| -- Small fixes and updates for srun manual. |
| |
| * Changes in SLURM 0.2.15 |
| ========================= |
| -- Fixes for reported problems: |
| - slurm/265: Job was orphaned when allocation response message could |
| not be sent. Job is now killed on allocation response message transmit |
| failure and socket error details are logged. |
| - Fix for slurm/267: "Job epilog may run multiple times." |
| -- Squeue job TIMELIMIT format changed from "h:mm" to "d:h:mm:ss". |
| -- DPCS initiated jobs have steps execute properly without explicit |
| specification of node count. |
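The new "d:h:mm:ss" TIMELIMIT layout can be illustrated with a small formatter. This is a sketch under assumptions: format_timelimit is a hypothetical helper, the input is taken to be in seconds, and the zero-padding of the minute and second fields is inferred from the "mm" and "ss" in the format string rather than from squeue's source.

```python
def format_timelimit(total_seconds):
    """Render a duration as d:h:mm:ss."""
    days, rem = divmod(total_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{days}:{hours}:{minutes:02d}:{seconds:02d}"


print(format_timelimit(90 * 60))            # a 90-minute limit
print(format_timelimit(86400 + 3600 + 61))  # just over one day and one hour
```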
| |
| * Changes in SLURM 0.2.14 |
| ========================= |
| -- Fixes for reported problems: |
| - slurm/194: "srun doesn't handle most options when run under an allocation." |
| - slurm/244: "REQ: squeue shows requested size of pending jobs." |
| -- SLURM_NODELIST environment variable now exported to all jobs, not |
| only batch jobs. |
| -- Nodelist displayed in squeue for completing jobs is now restricted to |
| completing nodes. |
| -- Node "reason" field properly displayed in sinfo even with filtering. |
| -- ``slurm_tv_clean'' daemon now supports a log file. |
| -- Batch jobs are now re-queued on launch failure. |
| -- Controller confirms job scripts for batch jobs are still running on |
| node zero at node registration. |
| -- Default RPMs no longer stop/start SLURM daemons on upgrade or install. |
| |
| * Changes in SLURM 0.2.13 |
| ========================= |
| -- Fixes for reported problems: |
| - Fixed bug in slurmctld where "drained" nodes would go back into |
| the "idle" state under some conditions (slurm/228). |
| - Added possible fix for slurm/229: "slurmd occasionally fails |
| to reap all children." |
| -- Fixed memory leak in auth_munge plugin. |
| -- Added fix to slurmctld to allow arbitrarily large job specifications |
| to be saved and recovered in the state file. |
| -- Allow "updates" in the configuration file of previously defined |
| node state and reason. |
| -- On "forceful termination" of a running job step, srun now exits |
| unconditionally, instead of waiting for all I/O. |
| -- Slurmctld now uses pidfile to kill old daemon when a new one is started. |
| -- Addition of new daemon "slurm_tv_clean" used to clean up jobs orphaned |
| due to use of the TotalView parallel debugger. |
| |
| * Changes in SLURM 0.2.12 |
| ========================= |
| -- Fixes for reported problems: |
| - Fix for "waitpid: No child processes" when using TotalView (slurm/217). |
| - Implemented temporary workaround for slurm/223: "Munge decode failed: |
| Munged communication error." |
| - Temporary fix for slurm/222: "elan3_create(0): Invalid argument." |
| -- Fixed memory leaks in slurmctld (mostly due to reconfigure). |
| -- More squeue/sinfo interface changes (see squeue(1), sinfo(1)). |
| -- Sinfo now accepts list of node states to -t,--state option. |
| -- Node "reason" field now available via sinfo command (see sinfo(1)). |
| -- Wrapper source for srun (srun.wrapper.c) now installed and available |
| for TotalView support. |
| -- Improved retry logic in user commands for periods when slurmctld |
| primary is down and backup has not yet taken over. |
| |
| * Changes in SLURM 0.2.11 |
| ========================= |
| -- Changes in srun: |
| - Fixed bug in signal handling that occasionally resulted in orphaned |
| jobs when using Ctrl-C. |
| - Return non-zero exit code when remote tasks are killed by a signal. |
| - SIGALRM is now blocked by default. |
| -- Added ``reason'' string for down, drained, or draining nodes. |
| -- Added -V,--version option to squeue and sinfo. |
| -- Improved some error messages from user utilities. |
| |
| * Changes in SLURM 0.2.10 |
| ========================= |
| -- New slurm.conf configuration parameters: |
| - WaitTime: Default for srun -w,--wait parameter. |
| - MaxJobCount: Maximum number of jobs SLURM can handle at one time. |
| - MinJobAge: Minimum time since completing before job is purged from |
| slurmctld memory. |
| -- Block user defined signals USR1 and USR2 in slurmd session manager. |
| -- More squeue cleanup. |
| -- Support for passing options to sinfo via environment variables. |
| -- Added option to scontrol to find intersection of completing jobs and nodes. |
| -- Added fix in auth_munge to prevent "Munged communication error" message. |
| |
| * Changes in SLURM 0.2.9 |
| ======================== |
| -- Fixes for reported problems: |
| - Argument to srun `-n' option was taken as octal if preceded with a `0'. |
| -- New format for Elan hosts config file (/etc/elanhosts; see README). |
| -- Various fixes for managing COMPLETING jobs. |
| -- Support for passing options to squeue via environment variables |
| (see squeue(1)) |
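The octal bug fixed for the srun `-n' option is typical of parsers that auto-detect the numeric base, as C's strtol() does when called with base 0. Whether srun used exactly that call is an assumption. The sketch below mimics the two behaviors; parse_auto_base and parse_decimal are hypothetical helpers.

```python
def parse_auto_base(text):
    """Mimic C strtol(text, NULL, 0): a leading 0 selects octal, 0x selects hex."""
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered, 16)
    if lowered.startswith("0") and len(lowered) > 1:
        return int(lowered, 8)
    return int(lowered, 10)


def parse_decimal(text):
    """Always read the digits as base 10, so leading zeros are harmless."""
    return int(text, 10)


# "010" means 8 tasks under auto-detection but 10 tasks when forced to decimal.
print(parse_auto_base("010"), parse_decimal("010"))
```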
| |
| * Changes in SLURM 0.2.8 |
| ======================== |
| -- Fix for bug in slurmd that could make debug messages appear in job output. |
| -- Fix for bug in slurmctld retry count computation. |
| -- Srun now times out slow launch threads. |
| -- "Time Used" output in squeue now includes seconds. |
| |
| * Changes in SLURM 0.2.7 |
| ======================== |
| -- Fix for bug in Elan module that results in slurmd hang. |
| -- Added completing job state to default list of states to print with squeue. |
| |
| * Changes in SLURM 0.2.6 |
| ======================== |
| -- More fixes for handling cleanup of slow terminating jobs. |
| -- Fixed bug in srun that might leave nodes allocated after a Ctrl-C. |
| |
| * Changes in SLURM 0.2.5 |
| ======================== |
| -- Various fixes for cleanup of slow terminating or unkillable jobs. |
| -- Fixed some small memory leaks in communications code. |
| -- Added hack for synchronized exit of jobs on large node count. |
| -- Long lists of nodes are no longer truncated in sinfo. |
| -- Print more descriptive error message when tasks exit with nonzero status. |
| -- Fixed bug in srun where unsuccessful launch attempts weren't detected. |
| -- Elan network error resolver thread now runs from elan module in slurmd. |
| -- Slurmctld uses consecutive Elan context and program description numbers |
| instead of choosing them randomly. |
| |
| * Changes in SLURM 0.2.4 |
| ======================== |
| -- Fix for file descriptor leak in slurmctld. |
| -- auth_munge plugin now prints credential info on decode failure. |
| -- Minor changes to scancel interface. |
| -- Filename format option "%J" now works again for srun --output and --error. |
| |
| * Changes in SLURM 0.2.3 |
| ======================== |
| -- Fix bug in srun when using per-task files for stderr. |
| -- Better error reporting on failure to open per-task input/output files. |
| -- Update auth_munge plugin for munge 0.1. |
| -- Minor changes to squeue interface. |
| -- New srun option `--hold' to submit job in "held" state. |
| |
| * Changes in SLURM 0.2.2 |
| ======================== |
| -- Fixes for reported problems: |
| - Execution of script allocate mode fails in some cases. (gnats:161) |
| - Errors using per-task input files with Elan support. (gnats:162) |
| - srun doesn't handle all environment variables properly. (gnats:164) |
| -- Parallel job is now terminated if a task is killed by a signal. |
| -- Exit status of srun is set based on exit codes of tasks. |
| -- Redesign of sinfo interface and options. |
| -- Shutdown of slurmctld no longer propagates shutdown to all nodes. |
| |
| * Changes in SLURM 0.2.1 |
| ======================== |
| -- Fix bug where reconfigure request to slurmctld killed the daemon. |
| |
| * Changes in SLURM 0.2.0 |
| ======================== |
| |
| -- SlurmdTimeout of 0 means never set a non-responding node to DOWN. |
| -- New srun option, -u,--unbuffered, for unbuffered stdout. |
| -- Enhancements for sinfo |
| - Non-responding nodes show "*" character appended instead of "NoResp+". |
| - Node states show abbreviated variant by default |
| -- Enhancements for scontrol. |
| - Added "ping" command to show current state of SLURM controllers. |
| - Job dump in scontrol shows user name as well as UID. |
| - Node state of DRAIN is appropriately mapped to DRAINING or DRAINED. |
| -- Fix for bug where request for task count greater than partition limit |
| was queued anyway. |
| -- Fix for bugs in job end time handling. |
| -- Modifications for error free builds on 64 bit architectures. |
| -- Job cancel immediately deallocates nodes instead of waiting on srun. |
| -- Attempt to create slurmd spool if it does not exist. |
| -- Fixed signal handling bug in srun allocate mode. |
| -- Earlier error detection in slurmd startup. |
| -- "fatal: _shm_unlock: Numerical result out of range" bug fixed in slurmd. |
| -- Config file parsing is now case insensitive. |
| -- SLURM_NODELIST environment variable now set in allocate mode. |
| |
| * Changes in SLURM 0.2.0-pre2 |
| ============================= |
| |
| -- Fix for reconfigure when public/private key path is changed. |
| -- Shared memory fixes in slurmd. |
| - fix for infinite semaphore incrementation bug. |
| -- Semaphore fixes in slurmctld. |
| -- Slurmctld now remembers which nodes have registered after recover. |
| -- Fixed reattach bug when tasks have exited. |
| -- Change directory to /tmp in slurmd if daemonizing. |
| -- Logfiles are reopened on reconfigure. |
| |
| $Id$ |