-*- text -*- $Id: README.arrayrun,v 1.2 2011/06/28 11:21:27 bhm Exp $
Overview
========
Arrayrun is an attempt to simulate array jobs as found in SGE and PBS. It
is invoked much like mpirun:

    arrayrun [-r] taskids [sbatch arguments] YourCommand [arguments]

In principle, arrayrun runs

    TASK_ID=id sbatch [sbatch arguments] YourCommand [arguments]

for each id in the 'taskids' specification. 'taskids' is a comma-separated
list of integers, ranges of integers (first-last), or ranges with a step size
(first-last:step). If -r is specified, arrayrun will restart a job that has
failed. To avoid endless loops, each job is restarted at most once, and at
most 10 jobs (configurable) will be restarted in total.
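
To make the 'taskids' syntax concrete, here is a minimal sketch of how such a
specification could be expanded. It is not arrayrun's own code:
expand_taskids is a hypothetical helper, and it assumes that ranges include
their last value and that the step is a positive integer. The loop at the end
only echoes the commands instead of calling sbatch.

```shell
#!/bin/sh
# expand_taskids SPEC
# Hypothetical helper (not part of arrayrun): prints one task id per line
# for a specification such as "1-3,7,10-20:5".
expand_taskids() {
    # Split the specification on commas into the positional parameters.
    old_ifs=$IFS
    IFS=,
    set -- $1
    IFS=$old_ifs

    for item in "$@"; do
        case $item in
            *-*:*)  # first-last:step
                first=${item%%-*}
                rest=${item#*-}
                last=${rest%%:*}
                step=${rest#*:}
                ;;
            *-*)    # first-last
                first=${item%%-*}
                last=${item#*-}
                step=1
                ;;
            *)      # single integer
                first=$item
                last=$item
                step=1
                ;;
        esac

        # Assumption: a step below 1 is an error (it would never terminate).
        if [ "$step" -lt 1 ]; then
            echo "invalid step in '$item'" >&2
            return 1
        fi

        id=$first
        while [ "$id" -le "$last" ]; do
            echo "$id"
            id=$((id + step))
        done
    done
}

# Dry run: show the command that would be issued for each id.
for id in $(expand_taskids "1-3,7,10-20:5"); do
    echo "TASK_ID=$id sbatch workerScript"
done
```

With the specification above, the dry run lists the ids 1, 2, 3, 7, 10, 15
and 20.
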
The idea is to submit a master job that calls arrayrun to start the jobs,
for instance:

    $ cat workerScript
    #!/bin/sh
    #SBATCH --account=YourProject
    #SBATCH --time=1:0:0
    #SBATCH --mem-per-cpu=1G
    DATASET=dataset.$TASK_ID
    OUTFILE=result.$TASK_ID
    cd $SCRATCH
    YourProgram $DATASET > $OUTFILE
    # end of workerScript

    $ cat submitScript
    #!/bin/sh
    #SBATCH --account=YourProject
    #SBATCH --time=50:0:0
    #SBATCH --mem-per-cpu=100M
    arrayrun 1-200 workerScript
    # end of submitScript

    $ sbatch submitScript

The --time specification in the master script must be long enough for all
jobs to finish.
Alternatively, arrayrun can be run on the command line of a login or master
node.
If the master job is cancelled, or the arrayrun process is killed, it tries
to scancel all running or pending jobs before it exits.
Arrayrun tries not to flood the queue with jobs. It works by submitting a
limited number of jobs, sleeping a while, checking the status of its jobs,
and iterating, until all jobs have finished. All limits and times are
configurable (see below). It also tries to handle all errors in a graceful
manner.
Installation and configuration
==============================
There are two files, arrayrun (to be called by users) and arrayrun_worker
(exec'ed or srun'ed by arrayrun, to make scancel work).
arrayrun should be placed somewhere on the $PATH. arrayrun_worker can be
placed anywhere. Both files should be accessible from all nodes.
There are quite a few configuration variables, so arrayrun can be tuned to
work under different policies and work loads.
Configuration variables in arrayrun:
- WORKER: the location of arrayrun_worker
Configuration variables in arrayrun_worker:
- $maxJobs: The maximum number of jobs arrayrun will allow in the
  queue at any time
- $maxIdleJobs: The maximum number of _pending_ jobs arrayrun will allow
  in the queue at any time
- $maxBurst: The maximum number of jobs submitted at a time
- $pollSeconds: How many seconds to sleep between each iteration
- $maxFails: The maximum number of errors to accept when submitting a
  job
- $retrySleep: The number of seconds to sleep between each retry when
  submitting a job
- $doubleCheckSleep: The number of seconds to sleep after a failed sbatch
  before running squeue to double check whether the job
  was submitted or not
- $maxRestarts: The maximum number of restarts in total
- $sbatch: The full path of the sbatch command to use
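
As a rough illustration of how the three queue limits might interact in one
iteration, the sketch below computes how many jobs could be submitted without
exceeding any of them. This is an assumption about the logic, not a copy of
arrayrun_worker: the function names and the values assigned to the variables
are hypothetical, and the real defaults are set in arrayrun_worker.

```shell
#!/bin/sh
# Hypothetical values for illustration only.
maxJobs=100      # total jobs allowed in the queue
maxIdleJobs=10   # pending jobs allowed in the queue
maxBurst=10      # jobs submitted per iteration

# min A B: print the smaller of two integers.
min() {
    if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi
}

# jobs_to_submit QUEUED PENDING REMAINING
#   QUEUED:    jobs currently in the queue (running plus pending)
#   PENDING:   jobs currently pending
#   REMAINING: tasks that have not been submitted yet
# Prints the number of jobs that fits under all three limits.
jobs_to_submit() {
    n=$(min "$maxBurst" "$3")
    n=$(min "$n" $((maxJobs - $1)))
    n=$(min "$n" $((maxIdleJobs - $2)))
    # Never report a negative number when a limit is already exceeded.
    if [ "$n" -lt 0 ]; then
        n=0
    fi
    echo "$n"
}

# 95 jobs queued, 2 of them pending, 200 tasks left: room for 5 more.
jobs_to_submit 95 2 200   # prints 5
```

In this sketch the tightest limit wins: here $maxJobs leaves room for only 5
jobs, even though $maxBurst and $maxIdleJobs would allow more.
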
Notes and caveats
=================
Arrayrun is an attempt to simulate array jobs. As such, it is not
perfect or foolproof. Here are a few caveats.
- Sometimes, arrayrun fails to scancel all jobs when it is itself cancelled.
- When arrayrun is run as a master job, it consumes one CPU for the whole
  duration of the job. Also, the --time limit must be long enough. This can
  be avoided by running arrayrun interactively on a master/login node (in
  which case running it under screen is probably a good idea).
- Arrayrun does not (currently) checkpoint, so if an arrayrun is restarted,
  it starts from scratch with the first taskid.
We welcome any suggestions for improvements or additional functionality!
Copyright
=========
Copyright 2009,2010,2011 Bjørn-Helge Mevik <b.h.mevik@usit.uio.no>
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License version 2 as
published by the Free Software Foundation.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License version 2 for more details.
A copy of the GPL v. 2 text is available here:
http://www.gnu.org/licenses/old-licenses/gpl-2.0.txt