| This information is meant primarily for the Slurm developers. |
| System administrators should read the instructions at |
| http://www.llnl.gov/linux/slurm/quickstart_admin.html |
| (also found in the file doc/html/quickstart_admin.shtml). |
| The "INSTALL" file contains generic Linux build instructions. |
| |
| Simple build/install on Linux: |
| ./configure --enable-debug \ |
| --prefix=<install-dir> --sysconfdir=<config-dir> |
| make |
| make install |
| |
| To build the files in the contribs directory: |
| make contrib |
| make install-contrib |
| (The RPMs are built by default) |
| |
| If you make changes to any auxdir/* or Makefile.am file, then run |
| _snowflake_ (where there are recent versions of autoconf, automake |
| and libtool installed): |
| ./autogen.sh |
| then check-in the new Makefile.am and Makefile.in files |
| |
| Here is a step-by-step HOWTO for creating a new release of SLURM on a |
| Linux cluster (See BlueGene and AIX specific notes below for some differences). |
| 0. Get current copies of SLURM and buildfarm |
| > git clone https://<user_name>@github.com/chaos/slurm.git |
| > svn co https://eris.llnl.gov/svn/chaos/private/buildfarm/trunk buildfarm |
| place the buildfarm directory in your search path |
| > export PATH=~/buildfarm:$PATH |
| 1. Update NEWS and META files for the new release. In the META file, |
| the API, Major, Minor, Micro, Version, and Release fields must all |
| by up-to-date. **** DON'T UPDATE META UNTIL RIGHT BEFORE THE TAG **** |
| The Release field should always be 1 unless one of |
| the following is true |
| - Changes were made to the spec file, documentation, or example |
| files, but not to code. |
| - this is a prerelease (Release = 0.preX) |
| 2. Tag the repository with the appropriate name for the new version. |
| Note the first three digits are the version number. For a proper release, |
| the last digit is "1" (except for a rebuild without code changes which |
| could be "2"). For pre-releases, the last digit should be "0" followed by |
| "pre#" or "rc#". |
| > git tag -a slurm-2-6-7-1 -m "create tag v2.6.7" OR |
| > git tag -a slurm-2-7-0-0pre5 -m "create tag v2.7.0-pre5" |
| > git push --tags |
| 3. Use the rpm make target to create the new RPMs. This requires a .rpmmacros |
| (.rpmrc for newer versions of rpmbuild) file containing: |
| %_slurm_sysconfdir /etc/slurm |
| %_with_debug 1 |
| %_with_sgijob 1 |
| %_with_elan 1 (ONLY ON SYSTEMS WITH ELAN SWITCH) |
| NOTE: build will make a tar-ball based upon ALL of the files in your current |
| local directory. If that includes scratch files, everyone will get those |
| files in the tar-ball. For that reason, it is a good idea to clone a clean |
| copy of the repository and build from that |
| > git clone https://<user_name>@github.com/chaos/slurm.git <local_dir> |
| Build using the following syntax: |
| > build --snapshot -s <local_dir> OR |
| > build --nosnapshot -s <local_dir> |
| --nosnapshot will name the tar-ball and RPMs based upon the META file |
| --snapshot will name the tar-ball and RPMs based upon the META file plus a |
| timestamp. Do this to make a tar-ball for a non-tagged release. |
| NOTE: <local_dir> should be a fully-qualified pathname |
| 4. scp the files to schedmd.com in to ~/www/download/latest or |
| ~/www/download/development. Move the older files to ~/www/download/archive, |
| login to schedmd.com, cd to ~/download, and execute "php process.php" to |
| update the web pages. |
| |
| BlueGene build notes: |
| 0. If on a bgp system and you want sview export these variables |
| > export CFLAGS="-I/opt/gnome/lib/gtk-2.0/include -I/opt/gnome/lib/glib-2.0/include $CFLAGS" |
| > export LIBS="-L/usr/X11R6/lib64 $LIBS" |
| > export CMD_LDFLAGS='-L/usr/X11R6/lib64' |
| > export PKG_CONFIG_PATH="/opt/gnome/lib64/pkgconfig/:$PKG_CONFIG_PATH" |
| 1. Use the rpm make target to create the new RPMs. This requires a .rpmmacros |
| (.rpmrc for newer versions of rpmbuild) file containing: |
| %_prefix /usr |
| %_slurm_sysconfdir /etc/slurm |
| %_with_bluegene 1 |
| %_without_pam 1 |
| %_with_debug 1 |
| Build on Service Node with using the following syntax |
| > rpmbuild -ta slurm-...bz2 |
| The RPM files get written to the directory |
| /usr/src/packages/RPMS/ppc64 |
| |
| To build and run on AIX: |
| 0. Get current copies of SLURM and buildfarm |
| > git clone https://<user_name>@github.com/chaos/slurm.git |
| > svn co https://eris.llnl.gov/svn/chaos/private/buildfarm/trunk buildfarm |
| put the buildfarm directory in your search path |
| > export PATH=~/buildfarm:$PATH |
| |
| Put the buildfarm directory in your search path |
| Also, you will need several commands to appear FIRST in your PATH: |
| |
| /usr/local/tools/gnu/aix_5_64_fed/bin/install |
| /usr/local/gnu/bin/tar |
| /usr/bin/gcc |
| |
| I do this by making symlinks to those commands in the buildfarm directory, |
| then making the buildfarm directory the first one in my PATH. |
| Also, make certain that the "proctrack" rpm is installed. |
| 1. Export some environment variables |
| > export OBJECT_MODE=32 |
| > export PKG_CONFIG="/usr/bin/pkg-config" |
| 2. Build with: |
| > ./configure --enable-debug --prefix=/opt/freeware \ |
| --sysconfdir=/opt/freeware/etc/slurm \ |
| --with-ssl=/opt/freeware --with-munge=/opt/freeware \ |
| --with-proctrack=/opt/freeware |
| make |
| make uninstall # remove old shared libraries, aix caches them |
| make install |
| 3. To build RPMs (NOTE: GNU tools early in PATH as described above in #0): |
| Create a .rpmmacros file specifying system specific files: |
| # |
| # RPM Macros for use with SLURM on AIX |
| # The system-wide macros for RPM are in /usr/lib/rpm/macros |
| # and this overrides a few of them |
| # |
| %_prefix /opt/freeware |
| %_slurm_sysconfdir %{_prefix}/etc/slurm |
| %_defaultdocdir %{_prefix}/doc |
| %_with_debug 1 |
| %_with_aix 1 |
| %with_ssl "--with-ssl=/opt/freeware" |
| %with_munge "--with-munge=/opt/freeware" |
| %with_proctrack "--with-proctrack=/opt/freeware" |
| Log in to the machine "uP". uP is currently the lowest-common-denominator |
| AIX machine. |
| NOTE: build will make a tar-ball based upon ALL of the files in your current |
| local directory. If that includes scratch files, everyone will get those |
| files in the tar-ball. For that reason, it is a good idea to clone a clean |
| copy of the repository and build from that |
| > git clone https://<user_name>@github.com/chaos/slurm.git <local_dir> |
| Build using the following syntax: |
| > export CC=/usr/bin/gcc |
| > build --snapshot -s <local_dir> OR |
| > build --nosnapshot -s <local_dir> |
| --nosnapshot will name the tar-ball and RPMs based upon the META file |
| --snapshot will name the tar-ball and RPMs based upon the META file plus a |
| timestamp. Do this to make a tar-ball for a non-tagged release. |
| 4. Test POE after telling POE where to find SLURM's LoadLeveler wrapper. |
| > export MP_RMLIB=./slurm_ll_api.so |
| > export CHECKPOINT=yes |
| 5. > poe hostname -rmpool debug |
| 6. To debug, set SLURM_LL_API_DEBUG=3 before running poe - will create a file |
| /tmp/slurm.* |
| It can also be helpful to use poe options "-ilevel 6 -pmdlog yes" |
| There will be a log file create named /tmp/mplog.<jobid>.<taskid> |
| 7. If you update proctrack, be sure to run "slibclean" to clear cached |
| version. |
| 8. Remove the RPMs that we don't want: |
| rm -f slurm-perlapi*rpm slurm-torque*rpm |
| and install the other RPMs into /usr/admin/inst.images/slurm/aix5.3 on an |
| OCF AIX machine (pdev is a good choice). |
| |
| Debian build notes: |
| Since Debian doesn't have PRMs, the rpmbuild program can not locate |
| dependencies, so build without them by patching the build program: |
| Index: build |
| =================================================================== |
| --- build (revision 173) |
| +++ build (working copy) |
| @@ -798,6 +798,7 @@ |
| $cmd .= " --define \"_tmppath $rpmdir/TMP\""; |
| $cmd .= " --define \"_topdir $rpmdir\""; |
| $cmd .= " --define \"build_bin_rpm 1\""; |
| + $cmd .= " --nodeps"; |
| if (defined $conf{rpm_dist}) { |
| my $dist = length $conf{rpm_dist} ? $conf{rpm_dist} : "%{nil}"; |
| $cmd .= " --define \"dist $dist\""; |
| |
| AIX/Federation switch window problems |
| To clean switch windows: ntblclean =w 8 -a sni0 |
| To get switch window status: ntblstatus |
| |
| BlueGene bglblock boot problem diagnosis |
| - Logon to the Service Node (bglsn, ubglsn) |
| - Execute /admin/bglscripts/fatalras |
| This will produce a list of failures including Rack and Midplane number |
| <date> R<rack> M<midplane> <failure details> |
| - Translate the Rack and Midplane to SLURM node id: smap -R r<rack><midplane> |
| - Drain only the bad SLURM node, return others to service using scontrol |
| |
| Configuration file update procedures: |
| - cd /usr/bgl/dist/slurm (on bgli) |
| - co -l <filename> |
| - vi <filename> |
| - ci -u <filename> |
| - make install |
| - then run "dist_local slurm" on SN and FENs to update /etc/slurm |
| |
| Some RPM commands: |
| rpm -qa | grep slurm (determine what is installed) |
| rpm -qpl slurm-1.1.9-1.rpm (check contents of an rpm) |
| rpm -e slurm-1.1.8-1 (erase an rpm) |
| rpm --upgrade slurm-1.1.9-1.rpm (replace existing rpm with new version) |
| rpm -i --ignoresize slurm-1.1.9-1.rpm (install a new rpm) |
| For main SLURM plugin installation on BGL service node: |
| rpm -i --force --nodeps --ignoresize slurm-1.1.9-1.rpm |
| rpm -U --force --nodeps --ignoresize slurm-1.1.9-1.rpm (upgrade option) |
| |
| To clear a wedged job: |
| /bgl/startMMCSconsole |
| > delete bgljob #### |
| > free RMP### |
| |
| Starting and stopping daemons on Linux: |
| /etc/init.d/slurm stop |
| /etc/init.d/slurm start |
| |
| Patches: |
| - cd to the top level src directory |
| - Run the patch command with epilog_complete.patch as stdin: |
| patch -p[path_level_to_filter] [--dry-run] < epilog_complete.patch |
| |
| To get the process and job IDs with proctrack/sgi_job: |
| - jstat -p |
| |
| CVS and gnats: |
| Include "gnats:<id> e.g. "(gnats:123)" as part of cvs commit to |
| automatically record that update in gnats database. NOTE: Does |
| not change gnats bug state, but records source files associated |
| with the bug. |
| |
| For memory leaks (for AIX use zerofault, zf; for linux use valgrind) |
| - Run configure with the option "--enable-memory-leak-debug" to completely |
| release allocated memory when the daemons exit |
| - valgrind --tool=memcheck --leak-check=yes --num-callers=8 --leak-resolution=med \ |
| ./slurmctld -Dc >valg.ctld.out 2>&1 |
| - valgrind --tool=memcheck --leak-check=yes --num-callers=8 --leak-resolution=med \ |
| ./slurmd -Dc >valg.slurmd.out 2>&1 (Probably only one one node of cluster) |
| - Run the regression test. In the globals.local file include: |
| "set enable_memory_leak_debug 1" |
| - Shutdown the daemons using "scontrol shutdown" |
| - Examine the end of the log files for leaks. pthread_create() and dlopen() |
| have small memory leaks on some systems, which do not grow over time |
| - Functions in the plugins will not have their symbols preserved when the |
| plugin is unloaded and the function names will appear as "???" in the |
| valgrind backtrace after shutdown. Rebuilding the daemons without the |
| configure option of "--enable-memory-leak-debug" typically prevents the |
| plugin from being unloaded so the symbols will be properly reported. However |
| many memory leaks will be reported due to not unloading plugins. |
| |
| Job profiling: |
| - "export CFLAGS=-pg", then run "configure" and "make install" as usual. |
| - Run the slurm daemons through a stress test and exit normally |
| - Run "gprof [executable-file] >outfile" |
| |
| Before new major release: |
| - Test on ia64, i386, x86_64, BGL, AIX, OSX, XCPU |
| - Test on Elan and IB switches |
| - Test fail-over of slurmctld |
| - Test for memory leaks in slurmctld, slurmd and slurmdbd with various plugins |
| - Change API version number |
| - Review and release web pages |
| - Review and release code |
| - Run "make check" |
| - Test that the prolog and epilog run |
| - Run the test suite with SlurmUser NOT being self |
| - Test for errors reported by CLANG tool: |
| scan-build -k -v make >m.sb.out 2>&1 |
| # and look for output in /tmp/scan-build-<DATE> |