Internal change PiperOrigin-RevId: 377331747 Change-Id: I010b4f96bf180449f78a03663708cb80ecff11d4
diff --git a/CHANGES b/CHANGES new file mode 100644 index 0000000..b7258f8 --- /dev/null +++ b/CHANGES
@@ -0,0 +1,407 @@ +[Note: newer entries are at the bottom] + +The actual patches and their series are in the tarball: patches.tar.gz + +Early prehistory: + +- Fix some warnings +- Add const to global variables and prototypes +- Document numa_warn +- Add numa_distance() support to read topology +- remove internal alias from numa_{warn,error} to allow overwriting + again +- Various bug fixes +- add -f option to memhog to map file +- more bugfixes +- replace forward backward with STREAM test in numademo + +0.6.1 + +- make headers C++ clean + +0.6.2 + +- use more accurate buffer length for cpumask +- add --cpubind to test suite + +0.6.3 + +- fix cpumask parser for large number of cpus. +Note that you need a kernel patch (as of 2.6.7-pre) for that too +if the cpumask is longer than 99 characters. + +0.6.4 + +- Add Copyright headers + +0.6.5 + +- Reduce unneeded DSO relocations (Arjan van de Ven) +- Add -r option to memhog (repeat walk) +- some manpage fixes +- Use syscall numbers from asm/unistd.h if possible +- Add numa_node_size64 to handle large nodes on 32bit architectures +- Fix numactl to use it (report from Rajan Ravindran) +- Use ln -sf in make install (Rajan Ravindran) +- Add syscall numbers for ppc/ppc64 +- Add private syscall6 for i386 since the glibc version is broken +- Remove STUB +- Change numactl --show to use cpubind instead of nodebind for CPU affinity. +- Fix make install into examples directory +- Work around broken sched_set_affinity API. This adds a 32768 CPUs limit. +- Fix segfault with /sys not mounted. +- Clean up Makefile +- Make numactl --show more clever + +0.7-pre1 + +- add test/regress2 and some fixes to test programs +- Fix DSO relocation patch for global variables +- Change nodeset sizes back to be binary compatible with SLES9 +- Cosmetic changes to manpages (pointed out by Eric S. Raymond) +- Make numa_run_on_node etc. 
act on current thread only even on NPTL systems + (Dinakar Guniguntala) +- Make numa_no_nodes / numa_all_nodes const (Werner Almesberger) +- Fix up the warnings caused by above change +- Add numa_distance() on systems with ACPI +- remove some obsolete code +- add rdtsc for ppc64 +- fix unsigned/unsigned long confusion in cpumasks (Matt Dobson) +- fix CPU_BYTES and rename CPU_WORDS to CPU_LONGS (Matt Dobson) +- Print node distances in numactl + +[0.7 skipped] + +0.8 +- hardened numactl command line parsing against bad arguments in some cases +- remove cpumask/nodemask confusion which has become a FAQ: + --cpubind deprecated, added --cpunodebind and --physcpubind= options + print both in --show, old cpumask kept for compatibility +- Fix --show problems +- various fixes for bugs noted by Mike Stroyan (thanks!) +- install set_mempolicy manpage +- various smaller fixes + +0.9 +- Get rid of bogus distance.o that broke compilation on !x86-64 (sorry) +- Handle CFLAGS overriding without OPT_CFLAGS (Ian Wienand) +- Fix up section of get/set_mempolicy (Ian Wienand) +- When no NUMA available fall back to one global node instead of one node +per CPU (Samuel Thibault) +- Don't rely on architecture symbols for dependency generation +- Use __powerpc__ to detect PPC/PPC64 +- numastat: + * wrap display properly with many nodes + * display nodes in forward order + * install manpage in `make install'. +- remove bogus numamemcpy.c +- numademo: + * allow standalone compile, make streamlib optional + * clean up output + * change output unit to standard MB/s + * compile with more optimization + * add random pass to fool any prefetching (slow) +- make numademo compilable outside source tree +- use gettimeofday instead of time stamp counters in benchmarks +- support valgrind in testsuite +- other minor changes + +0.9.1 +- Make automatic selection of lib64 vs lib more robust. Now should work +even on ppc32 with a lib64 directory. Architecture lists are hardcoded now +unfortunately.
+ +0.9.2 +- Fix compilation on architectures with gcc 3.3+ but without TLS +(MIPS, Alpha, Sparc) +- Add warning against use of MPOL_F_NODE +- numa.3 improvements from Michael Kerrisk +- Support page migration (migratepages, manpages) from Christoph +Lameter. Requires 2.6.16+ kernels + +0.9.3 +- Some more manpage fixes +- install migratepages manpage in make install +- Build fix for Debian make from Ian Wienand + +0.9.4: +- Remove syscall manpages. They're in main man-pages now. +- More migrate fixes from C.Lameter + +0.9.5: +- Fix parsing of cpumap in sysfs from Doug Chapman + +0.9.6: +- Fix make install again + +0.9.7 +- Fix cpumap parsing fix to not corrupt memory (Doug Chapman) +- Small optimization to cpumap parsing +- Create target directories for Debian (Ian Wienand) + +0.9.8 +- Fix cpumap parsing again (Doug Chapman) + +0.9.9 (aka "Will 1.0 ever happen?") +- Fix sizing of cpu buffers for affinity syscalls +- Don't corrupt errno in numa_run_on_node_mask. This fixes +numactl cpubind issues on some systems. +- Print cpus belonging to nodes in numactl --hardware +- Rewrite cpumap parser to be simpler and hopefully finally work +in all cases +- add testcases for cpu affinity and topology discovery +- Add make test target to run regression test easier and fix up +test/README + +Lots of fixes thanks to thorough testing by Noriyuki Taniuchi: +- Better command line parsing in numactl and fix various documentation bugs +- Wrong arguments to --preferred don't crash numactl anymore +- Fix --cpunodebind=all +- Auto collect short option list in numactl - a couple were missing. +- Fix documentation of numa_set_localalloc. It doesn't have a flag.
+- Fix numa_run_on_node(-1) +- numa_get_run_node_mask(): Fix documentation, don't warn + +0.9.10: +- Fix cpumap parsing bug when NR_CPUS < 32 (dean gaudet) + +0.9.11 +- Fix usage output for --shmid (Noriyuki Taniuchi) +- Use correct syscall number for migrate_pages() on PPC + +1.0 +- Add sleep to regression test to work with delayed statistic +updating in newer kernels (Mel Gorman) +- Default to -O2 + +1.0.1 +- Fix build on powerpc + +1.0.2 +- Fix parallel Makefile build (Andreas Herrmann) +- Fix target command argument parsing for numactl (no -- needed again anymore) +- Clarify numa_node_to_cpus() manpage + +1.0.3 + +- Add the migspeed test program to test the speed of migrating pages from + one node to another +- Support for move_pages(2) to numactl, and numa_move_pages() to libnuma +- Add the move_pages test command to exercise the move_pages(2) system call +- Add the mbind_mig_pages test command to verify the moving of a task's + pages with move_pages(2) +- Add the migrate_pages test command to test that a task's pages can be + moved with move_pages(2) +- Support numactl +nn syntax for cpuset_relative cpu and node numbers +- Modify libnuma to use variable-length bit masks (the bitmask structure) +- Modify numactl, migspeed, memhog, migratepages, numademo and stream_main + to use variable-length bit masks +- Modify the test/ programs to use the libnuma that uses variable size bit masks +- Version 2 (symbol versioning) +- Man page changes with the change to variable-length bit masks, move_pages + migrate_pages and others + +2.0.0 +- Added API version 2 and symbol versioning. This provides binary + compatibility with old codes that use libnuma. +- Brought the man page in line with the version 2 changes. +- Provide numacompat1.h and additions to numa.h, which provide source code + compatibility to libnuma version 1. (The application programming interface + changes, but the ABI is preserved through the use of symbol versioning.
So + the library stays libnuma.so.1) +- Added variable-length bit masks to libnuma. These are struct bitmask. + This allows libnuma to be independent of ever increasing cpu counts. + o Modified the test/ programs to use variable size bit masks. + o Modified numactl, migspeed, memhog, migratepages, numademo and stream_main + to use variable-length bit masks. +- Added support for move_pages(2) (sys_move_pages()) to numactl. Adds + numa_move_pages() to libnuma. + o Added the mbind_mig_pages test command to verify the moving of a + task's pages with move_pages(2). + o Added the move_pages test command to exercise the move_pages(2) system call. + o Added the migrate_pages test command to test that a task's pages can + be moved with move_pages(2). + o Added the migspeed test program. It tests the speed of migrating pages + from one node to another. +- Allow a numactl +nn syntax for cpuset_relative cpu and node numbers. +- General cleanup of man page. +- Return nodes allowed by the application's current cpuset context via new + API numa_get_mems_allowed(). +- Change numa_alloc_local() to use MPOL_PREFERRED with NULL nodemask to + effect local allocation. +- Man page for numactl: numa_maps man page clarifications and cleanup +- Minor cleanups of numademo.c +- Fix numastat sysfs scanning in numactl +- Reorganize the regress test script. +- Fix mempolicy regression test for asymmetric platforms and memoryless nodes. +- Fix checkaffinity and checktopology regression tests. +- Fix the __NR_migrate_pages system call number. +- Fix the way numactl handles the building of the mask when executing + the --physbind option, and the way Cpus_allowed mask is created.
+ +2.0.1 +- Fix bug in dombind (when passed a null) +- Make 4 fixes from Debian: MIPS support, MIPS hppa fix for syscalls, + make sure -lm for numademo, build a static library +- Fix parsing of /proc/self/status for 2.6.25 additions to it + +2.0.2 +- Various numademo improvements: + * Fix random benchmark to use all specified memory + * Rename to random2 to signify it's different + * Optimize random benchmark by inlining random number generator fast path. + * Clear caches between runs for more stable results + * Add new random pointer chaser benchmark + * Compile benchmarks with gcc vectorization if available + * run numademo in regression test +- Add numa_exit_on_warn +- Fix no cpuset fallback in libnuma.c +- Install symlinks for the manpages of all new functions +- Make internal libnuma functions static +- Add copy_bitmask_to_bitmask() to numa.h +- Some cleanups +- Fix line reading in proc +- Add versions.ldscript to libnuma.so dependencies +- Remove the non-"numa_" functions from numacompat1.h and numa.h +- Add ia64 clearcache() to numademo +- Add -t to numademo for regression testing +- Remove "numa_all_cpus" from numa.h +- Changed VERSION1_COMPATIBILITY to NUMA_VERSION1_COMPATIBILITY +- Defined LIBNUMA_API_VERSION 2 in numa.h +- Fix numaif.h and numaint.h (migrate_pages; from Masatake Yamato) +- Fixes to numademo (min/max, and array index; from Kent Liu) +- Fixes to Makefile and permissions; from Bernhard Walle + +2.0.3-rc1 - rc3 +- Fixes to libnuma.c numa.h numacompat1.h by Daniel Gollub to fix v1 compatibility +- Restore nodemask_zero() and nodemask_equal() +- Drops a warning message about this not being a NUMA system +- Remove the numa_maps.5 man page (it's in Linux now) (by Bernhard Walle) +- Fix makefiles in tests (Andi) +- Fix off-by-ones in test mbind_mig_pages (Andi) +- Fix test/prefered.c (Andi) +- Fix to print_node_cpus() (Arnd/Bill Buros) +- Fixes to read_mask() (Arnd's on top of cpw's) +- Fix to makefile (LDFLAGS/LDLIBS/AR/RANLIB) (Mike Frysinger) +- 
Fix numactl for noncontiguous nodes (Amit Arora) +- Fix bitmask memory leaks, numa_alloc_onnode/numa_preferred (Kornilios Kourtis) +- Add numa_node_of_cpu() to retrieve local node of a cpu (Kornilios Kourtis) +- Fix parsing of /proc/self/status (Brice Goglin/Lee Schermerhorn) +- Small reorganization of numa_alloc_local() (L.S.) +- Fixes of bitmask memory leaks in about eight functions (L.S.) +- Make library always return allocated masks that user can free (L.S.) +- Fix to numademo memtest (allocation overhead) (L.S.) +- Fix to checkaffinity test (possible shell errors) (L.S.) +- Fix a printf in migspeed.c (Frederik Himpe) +- Fix test/regress grep of node number (Cliff W.) +- Change numademo to finish in a timely manner on large machines (Cliff W.) +- tested on 96p 48nodes + +2.0.3-rc1 - rc4 +- Add --dump-nodes option to numactl (Andi) + +2.0.3 released in June, 2009 + +2.0.4-rc3 +- Fix numactl for a machine with sparse cpu ids (Anton Blanchard) +- Fix makefile to remove move_pages on make clean (Andi) +- Fix numa_node_to_cpus() (Sharyathi Nagesh) +- Rename 'thread' to 'task' (L.S.) +- Remove other trailing spaces (Cliff W.) 
+- Man page correction/clarification for numa_node_to_cpus() (Ian Wienand) +- Man page clarification for numactl (Mike MacCana) +- Fix numactl --hardware for cpu-less nodes (Thomas Renninger) +- Fix set_configured_cpus() (Jan Beulich) +- Fix memory corrupting use of strlen (Jan Beulich) +- Add a DSO destructor for memory deallocation (Neil Horman) + +2.0.4 released in July, 2010 + +2.0.5 released in July, 2010 (about 2 days after 2.0.4) +- Fix numactl calls to set_mempolicy, get_mempolicy and mbind + +newer: +- include stat.h in shm.c (Mike Frysinger) + +2.0.6-rc1 +- Correct numa_max_node() use of broken numa_num_configured_nodes() (Tim Pepper) +- Use numa_max_node() not numa_num_configured_nodes() (Tim Pepper) +- Fix numa_num_configured_nodes() to match man page description (Tim Pepper) +- Clarify comment for numa_all_nodes_ptr extern (Tim Pepper) +- numactl --hardware should handle sparse node numbering (Tim Pepper) +- Maintain compatibility with 2.0.3 numa_num_thread...()'s (Cliff W.) +2.0.6-rc2 +- numa_num_task_cpus()/..nodes() to return actual counts (Cliff W.) +2.0.6-rc3 +- Fix numa_get_run_node_mask() to return a cpuset-aware node mask (Cliff W.) + (replaced 110112) +- Add a better warning to numa_node_to_cpus() +2.0.6-rc4 +- Fix numa_get_mems_allowed() to use MPOL_F_MEMS_ALLOWED (Michael Spiegel) + +2.0.6 released Dec, 2010 + +2.0.7-rc1 +- 110111 Add numa_realloc() (and realloc_test) (Vasileios Karakasis) +- 110112 Re-fix numa_get_run_node_mask() and fix numa_get_run_node_mask (Cliff) +- 110112 Fix the numa_get_run_node_mask() man page (cpus vs nodes) (Cliff W.) +2.0.7-rc2 +- 110112 Fix the cpu and node parsing to be cpuset aware (Cliff W.) +- 110112 Fix test/checkaffinity to be cpuset aware (Cliff W.) +- 110302 Fix two typos in numactl.8 (John Bradshaw) + +2.0.7 released Apr, 2011 + +2.0.8-rc1 +- 110818 Checking of successful allocations in numademo (Petr Holasek) +2.0.8-rc2 +- 110823 Fix of numactl (--touch) warnings and man page (Cliff W.) 
+2.0.8-rc3 +- 111214 Add "same" nodemask alias to numactl (Andi Kleen) +- 111214 Add constructors for numa_init/exit (Andi Kleen) +- 111214 Add use of glibc syscall stub where possible (Andi Kleen) +- 111214 Fix regress1 to show all the problems before exiting (Andi Kleen) +- 111214 Add IO affinity support (Andi Kleen) +- 111214 Clean regression test temp files (Andi Kleen) +- 111214 Add an option to memhog to disable transparent huge pages (Andi Kleen) +- 111214 Fix the test suite on systems that force THP, disable them (Andi Kleen) +2.0.8-rc4 +- 120106 Install man pages migspeed, migratepages and numastat (Petr Holasek) +- 120106 Warnings in numa_node_to_cpus_v1 to be more verbose (Petr Holasek) +- 120216 Fix for numademo: msize check for ptrchase test (Petr Holasek) +2.0.8-rc5 +- 120823 Fix calculation of maxconfiguredcpu (Petr Holasek) +- 120823 Fix: do not recalculate maxconfiguredcpu (Petr Holasek) +- 120823 v.1.3 numa_num_possible_cpus symbol is exported (Petr Holasek) +- 120823 Add all versions of numa_parse_{cpu,node}string() (Petr Holasek) +- 120823 Add numa_parse_cpustring take a const char* parameter (Petr Holasek) +- 120823 Fix unused bufferlen variable (Petr Holasek) +- 120823 Fix warnings when there are holes in numbering of nodes (Petr Holasek) +2.0.8-rc6 +- 120911 Show distances on machines without a node 0 (Petr Holasek) +- 121007 Replace perl numastat with a C command (Bill Gray) +- 121011 Allow an install location PREFIX in the Makefile (Frank Tampe) + +2.0.8 released Oct, 2012 + +2.0.9-rc1 +- 130207 Add a prototype for numa_bitmask_weight (Cliff W.) +2.0.9-rc2 +- 130725 Fix numastat huge pages bug, version number, man page (Bill Gray) +- 130726 Disable the regress-io test (Cliff W.) 
+- 130730 Fix typos in numactl man page; add short opts to --help (Petr Holasek) +2.0.9-rc3 +- 130906 numactl: option --all/-a added for policy settings (Petr Holasek) +- 130906 libnuma: new function numa_run_on_node_mask_all (Petr Holasek) + +2.0.9 released Oct, 2013 + +2.0.10-rc1 +- 131123 numactl: numactl check for NUMA available (Elena Ufimtseva) +- 140715 numactl: fix numactl --show and preferred node (Bill Gray) +2.0.10-rc2 +- 140722 makefile: remove warning about missing .depend (Filipe Brandenburger) +- 140820 convert the build procedure to automake (Filipe Brandenburger) + +2.0.10 released Oct, 2014 +
diff --git a/DESIGN b/DESIGN new file mode 100644 index 0000000..fee72c5 --- /dev/null +++ b/DESIGN
@@ -0,0 +1,2 @@ + +[ old description removed because it was too out of date ]
diff --git a/INSTALL b/INSTALL new file mode 100644 index 0000000..04d403b --- /dev/null +++ b/INSTALL
@@ -0,0 +1,61 @@ +Building numactl +---------------- + + $ ./autogen.sh + $ ./configure + $ make + # make install + + Start by configuring the build by running the configure script: + + $ ./autogen.sh + $ ./configure + + You can pass options to configure to define build options, to pass it + compiler paths, compiler flags and to define the installation layout. Run + "./configure --help" for more details on how to customize the build. + + Once configuration is completed, build numactl with: + + $ make + + If you would like to increase verbosity by printing the full build command + lines, pass "make" the V=1 parameter: + + $ make V=1 + + You can run the tests included with numactl with the following command: + + $ make check + + The results will be saved in test/*.log files and a test-suite.log will be + generated with the summary of test passes and failures. + + Install numactl to the system by running the following command as root: + + # make install + + You can also install it to a staging directory, in which case it is not + required to be root while running the install steps. Just pass a DESTDIR + variable while running "make install" with the path to the staging + directory. + + $ make install DESTDIR=/path/to/staging/numactl + +Source code maintenance: + https://github.com/numactl/numactl/ + +Using a snapshot from the Git repository + + First, the build system files need to be generated using the ./autogen.sh + script, which calls autoreconf with the appropriate options to generate the + configure script and the templates for Makefile, config.h, etc. + + Once those files are generated, follow the normal steps to configure and + build numactl. + + In order to create a distribution tarball, use "make dist" from a + configured build tree. Use "make distcheck" to confirm that rebuilding from + the distribution archive works as expected, that building from out-of-tree + works, that test cases pass, etc. +
diff --git a/LICENSE b/LICENSE new file mode 100644 index 0000000..1ca6eb7 --- /dev/null +++ b/LICENSE
@@ -0,0 +1,851 @@ +Libraries: + +Retrieved from: http://www.gnu.org/licenses/old-licenses/lgpl-2.1.txt + + GNU LESSER GENERAL PUBLIC LICENSE + Version 2.1, February 1999 + + Copyright (C) 1991, 1999 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + +[This is the first released version of the Lesser GPL. It also counts + as the successor of the GNU Library Public License, version 2, hence + the version number 2.1.] + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +Licenses are intended to guarantee your freedom to share and change +free software--to make sure the software is free for all its users. + + This license, the Lesser General Public License, applies to some +specially designated software packages--typically libraries--of the +Free Software Foundation and other authors who decide to use it. You +can use it too, but we suggest you first think carefully about whether +this license or the ordinary General Public License is the better +strategy to use in any particular case, based on the explanations below. + + When we speak of free software, we are referring to freedom of use, +not price. Our General Public Licenses are designed to make sure that +you have the freedom to distribute copies of free software (and charge +for this service if you wish); that you receive source code or can get +it if you want it; that you can change the software and use pieces of +it in new free programs; and that you are informed that you can do +these things. + + To protect your rights, we need to make restrictions that forbid +distributors to deny you these rights or to ask you to surrender these +rights. 
These restrictions translate to certain responsibilities for +you if you distribute copies of the library or if you modify it. + + For example, if you distribute copies of the library, whether gratis +or for a fee, you must give the recipients all the rights that we gave +you. You must make sure that they, too, receive or can get the source +code. If you link other code with the library, you must provide +complete object files to the recipients, so that they can relink them +with the library after making changes to the library and recompiling +it. And you must show them these terms so they know their rights. + + We protect your rights with a two-step method: (1) we copyright the +library, and (2) we offer you this license, which gives you legal +permission to copy, distribute and/or modify the library. + + To protect each distributor, we want to make it very clear that +there is no warranty for the free library. Also, if the library is +modified by someone else and passed on, the recipients should know +that what they have is not the original version, so that the original +author's reputation will not be affected by problems that might be +introduced by others. + + Finally, software patents pose a constant threat to the existence of +any free program. We wish to make sure that a company cannot +effectively restrict the users of a free program by obtaining a +restrictive license from a patent holder. Therefore, we insist that +any patent license obtained for a version of the library must be +consistent with the full freedom of use specified in this license. + + Most GNU software, including some libraries, is covered by the +ordinary GNU General Public License. This license, the GNU Lesser +General Public License, applies to certain designated libraries, and +is quite different from the ordinary General Public License. We use +this license for certain libraries in order to permit linking those +libraries into non-free programs. 
+ + When a program is linked with a library, whether statically or using +a shared library, the combination of the two is legally speaking a +combined work, a derivative of the original library. The ordinary +General Public License therefore permits such linking only if the +entire combination fits its criteria of freedom. The Lesser General +Public License permits more lax criteria for linking other code with +the library. + + We call this license the "Lesser" General Public License because it +does Less to protect the user's freedom than the ordinary General +Public License. It also provides other free software developers Less +of an advantage over competing non-free programs. These disadvantages +are the reason we use the ordinary General Public License for many +libraries. However, the Lesser license provides advantages in certain +special circumstances. + + For example, on rare occasions, there may be a special need to +encourage the widest possible use of a certain library, so that it becomes +a de-facto standard. To achieve this, non-free programs must be +allowed to use the library. A more frequent case is that a free +library does the same job as widely used non-free libraries. In this +case, there is little to gain by limiting the free library to free +software only, so we use the Lesser General Public License. + + In other cases, permission to use a particular library in non-free +programs enables a greater number of people to use a large body of +free software. For example, permission to use the GNU C Library in +non-free programs enables many more people to use the whole GNU +operating system, as well as its variant, the GNU/Linux operating +system. + + Although the Lesser General Public License is Less protective of the +users' freedom, it does ensure that the user of a program that is +linked with the Library has the freedom and the wherewithal to run +that program using a modified version of the Library. 
+ + The precise terms and conditions for copying, distribution and +modification follow. Pay close attention to the difference between a +"work based on the library" and a "work that uses the library". The +former contains code derived from the library, whereas the latter must +be combined with the library in order to run. + + GNU LESSER GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License Agreement applies to any software library or other +program which contains a notice placed by the copyright holder or +other authorized party saying it may be distributed under the terms of +this Lesser General Public License (also called "this License"). +Each licensee is addressed as "you". + + A "library" means a collection of software functions and/or data +prepared so as to be conveniently linked with application programs +(which use some of those functions and data) to form executables. + + The "Library", below, refers to any such software library or work +which has been distributed under these terms. A "work based on the +Library" means either the Library or any derivative work under +copyright law: that is to say, a work containing the Library or a +portion of it, either verbatim or with modifications and/or translated +straightforwardly into another language. (Hereinafter, translation is +included without limitation in the term "modification".) + + "Source code" for a work means the preferred form of the work for +making modifications to it. For a library, complete source code means +all the source code for all modules it contains, plus any associated +interface definition files, plus the scripts used to control compilation +and installation of the library. + + Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. 
The act of +running a program using the Library is not restricted, and output from +such a program is covered only if its contents constitute a work based +on the Library (independent of the use of the Library in a tool for +writing it). Whether that is true depends on what the Library does +and what the program that uses the Library does. + + 1. You may copy and distribute verbatim copies of the Library's +complete source code as you receive it, in any medium, provided that +you conspicuously and appropriately publish on each copy an +appropriate copyright notice and disclaimer of warranty; keep intact +all the notices that refer to this License and to the absence of any +warranty; and distribute a copy of this License along with the +Library. + + You may charge a fee for the physical act of transferring a copy, +and you may at your option offer warranty protection in exchange for a +fee. + + 2. You may modify your copy or copies of the Library or any portion +of it, thus forming a work based on the Library, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) The modified work must itself be a software library. + + b) You must cause the files modified to carry prominent notices + stating that you changed the files and the date of any change. + + c) You must cause the whole of the work to be licensed at no + charge to all third parties under the terms of this License. + + d) If a facility in the modified Library refers to a function or a + table of data to be supplied by an application program that uses + the facility, other than as an argument passed when the facility + is invoked, then you must make a good faith effort to ensure that, + in the event an application does not supply such function or + table, the facility still operates, and performs whatever part of + its purpose remains meaningful. 
+ + (For example, a function in a library to compute square roots has + a purpose that is entirely well-defined independent of the + application. Therefore, Subsection 2d requires that any + application-supplied function or table used by this function must + be optional: if the application does not supply it, the square + root function must still compute square roots.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Library, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Library, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote +it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Library. + +In addition, mere aggregation of another work not based on the Library +with the Library (or with a work based on the Library) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may opt to apply the terms of the ordinary GNU General Public +License instead of this License to a given copy of the Library. To do +this, you must alter all the notices that refer to this License, so +that they refer to the ordinary GNU General Public License, version 2, +instead of to this License. (If a newer version than version 2 of the +ordinary GNU General Public License has appeared, then you can specify +that version instead if you wish.) Do not make any other change in +these notices. 
+ + Once this change is made in a given copy, it is irreversible for +that copy, so the ordinary GNU General Public License applies to all +subsequent copies and derivative works made from that copy. + + This option is useful when you wish to copy part of the code of +the Library into a program that is not a library. + + 4. You may copy and distribute the Library (or a portion or +derivative of it, under Section 2) in object code or executable form +under the terms of Sections 1 and 2 above provided that you accompany +it with the complete corresponding machine-readable source code, which +must be distributed under the terms of Sections 1 and 2 above on a +medium customarily used for software interchange. + + If distribution of object code is made by offering access to copy +from a designated place, then offering equivalent access to copy the +source code from the same place satisfies the requirement to +distribute the source code, even though third parties are not +compelled to copy the source along with the object code. + + 5. A program that contains no derivative of any portion of the +Library, but is designed to work with the Library by being compiled or +linked with it, is called a "work that uses the Library". Such a +work, in isolation, is not a derivative work of the Library, and +therefore falls outside the scope of this License. + + However, linking a "work that uses the Library" with the Library +creates an executable that is a derivative of the Library (because it +contains portions of the Library), rather than a "work that uses the +library". The executable is therefore covered by this License. +Section 6 states terms for distribution of such executables. + + When a "work that uses the Library" uses material from a header file +that is part of the Library, the object code for the work may be a +derivative work of the Library even though the source code is not. 
+Whether this is true is especially significant if the work can be +linked without the Library, or if the work is itself a library. The +threshold for this to be true is not precisely defined by law. + + If such an object file uses only numerical parameters, data +structure layouts and accessors, and small macros and small inline +functions (ten lines or less in length), then the use of the object +file is unrestricted, regardless of whether it is legally a derivative +work. (Executables containing this object code plus portions of the +Library will still fall under Section 6.) + + Otherwise, if the work is a derivative of the Library, you may +distribute the object code for the work under the terms of Section 6. +Any executables containing that work also fall under Section 6, +whether or not they are linked directly with the Library itself. + + 6. As an exception to the Sections above, you may also combine or +link a "work that uses the Library" with the Library to produce a +work containing portions of the Library, and distribute that work +under terms of your choice, provided that the terms permit +modification of the work for the customer's own use and reverse +engineering for debugging such modifications. + + You must give prominent notice with each copy of the work that the +Library is used in it and that the Library and its use are covered by +this License. You must supply a copy of this License. If the work +during execution displays copyright notices, you must include the +copyright notice for the Library among them, as well as a reference +directing the user to the copy of this License. 
Also, you must do one +of these things: + + a) Accompany the work with the complete corresponding + machine-readable source code for the Library including whatever + changes were used in the work (which must be distributed under + Sections 1 and 2 above); and, if the work is an executable linked + with the Library, with the complete machine-readable "work that + uses the Library", as object code and/or source code, so that the + user can modify the Library and then relink to produce a modified + executable containing the modified Library. (It is understood + that the user who changes the contents of definitions files in the + Library will not necessarily be able to recompile the application + to use the modified definitions.) + + b) Use a suitable shared library mechanism for linking with the + Library. A suitable mechanism is one that (1) uses at run time a + copy of the library already present on the user's computer system, + rather than copying library functions into the executable, and (2) + will operate properly with a modified version of the library, if + the user installs one, as long as the modified version is + interface-compatible with the version that the work was made with. + + c) Accompany the work with a written offer, valid for at + least three years, to give the same user the materials + specified in Subsection 6a, above, for a charge no more + than the cost of performing this distribution. + + d) If distribution of the work is made by offering access to copy + from a designated place, offer equivalent access to copy the above + specified materials from the same place. + + e) Verify that the user has already received a copy of these + materials or that you have already sent this user a copy. + + For an executable, the required form of the "work that uses the +Library" must include any data and utility programs needed for +reproducing the executable from it. 
However, as a special exception, +the materials to be distributed need not include anything that is +normally distributed (in either source or binary form) with the major +components (compiler, kernel, and so on) of the operating system on +which the executable runs, unless that component itself accompanies +the executable. + + It may happen that this requirement contradicts the license +restrictions of other proprietary libraries that do not normally +accompany the operating system. Such a contradiction means you cannot +use both them and the Library together in an executable that you +distribute. + + 7. You may place library facilities that are a work based on the +Library side-by-side in a single library together with other library +facilities not covered by this License, and distribute such a combined +library, provided that the separate distribution of the work based on +the Library and of the other library facilities is otherwise +permitted, and provided that you do these two things: + + a) Accompany the combined library with a copy of the same work + based on the Library, uncombined with any other library + facilities. This must be distributed under the terms of the + Sections above. + + b) Give prominent notice with the combined library of the fact + that part of it is a work based on the Library, and explaining + where to find the accompanying uncombined form of the same work. + + 8. You may not copy, modify, sublicense, link with, or distribute +the Library except as expressly provided under this License. Any +attempt otherwise to copy, modify, sublicense, link with, or +distribute the Library is void, and will automatically terminate your +rights under this License. However, parties who have received copies, +or rights, from you under this License will not have their licenses +terminated so long as such parties remain in full compliance. + + 9. You are not required to accept this License, since you have not +signed it. 
However, nothing else grants you permission to modify or +distribute the Library or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Library (or any work based on the +Library), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Library or works based on it. + + 10. Each time you redistribute the Library (or any work based on the +Library), the recipient automatically receives a license from the +original licensor to copy, distribute, link with or modify the Library +subject to these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties with +this License. + + 11. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Library at all. For example, if a patent +license would not permit royalty-free redistribution of the Library by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Library. + +If any portion of this section is held invalid or unenforceable under any +particular circumstance, the balance of the section is intended to apply, +and the section as a whole is intended to apply in other circumstances. 
+ +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 12. If the distribution and/or use of the Library is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Library under this License may add +an explicit geographical distribution limitation excluding those countries, +so that distribution is permitted only in or among countries not thus +excluded. In such case, this License incorporates the limitation as if +written in the body of this License. + + 13. The Free Software Foundation may publish revised and/or new +versions of the Lesser General Public License from time to time. +Such new versions will be similar in spirit to the present version, +but may differ in detail to address new problems or concerns. + +Each version is given a distinguishing version number. If the Library +specifies a version number of this License which applies to it and +"any later version", you have the option of following the terms and +conditions either of that version or of any later version published by +the Free Software Foundation. If the Library does not specify a +license version number, you may choose any version ever published by +the Free Software Foundation. + + 14. 
If you wish to incorporate parts of the Library into other free +programs whose distribution conditions are incompatible with these, +write to the author to ask for permission. For software which is +copyrighted by the Free Software Foundation, write to the Free +Software Foundation; we sometimes make exceptions for this. Our +decision will be guided by the two goals of preserving the free status +of all derivatives of our free software and of promoting the sharing +and reuse of software generally. + + NO WARRANTY + + 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO +WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. +EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR +OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY +KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE +LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME +THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN +WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY +AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU +FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR +CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE +LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING +RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A +FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF +SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES. 
+ + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Libraries + + If you develop a new library, and you want it to be of the greatest +possible use to the public, we recommend making it free software that +everyone can redistribute and change. You can do so by permitting +redistribution under these terms (or, alternatively, under the terms of the +ordinary General Public License). + + To apply these terms, attach the following notices to the library. It is +safest to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least the +"copyright" line and a pointer to where the full notice is found. + + <one line to give the library's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This library is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; either + version 2.1 of the License, or (at your option) any later version. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should have received a copy of the GNU Lesser General Public + License along with this library; if not, write to the Free Software + Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + +Also add information on how to contact you by electronic and paper mail. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the library, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the + library `Frob' (a library for tweaking knobs) written by James Random Hacker. 
+ + <signature of Ty Coon>, 1 April 1990 + Ty Coon, President of Vice + +That's all there is to it! + +Commands: + +Retrieved from: http://www.gnu.org/licenses/gpl-2.0.txt + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. 
And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. 
The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. 
(Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. 
You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. 
If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. 
If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. 
+ + <one line to give the program's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License along + with this program; if not, write to the Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) year name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. 
+ + <signature of Ty Coon>, 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. +
diff --git a/Makefile.am b/Makefile.am new file mode 100644 index 0000000..03b0ab6 --- /dev/null +++ b/Makefile.am
@@ -0,0 +1,161 @@ + +ACLOCAL_AMFLAGS = -I m4 +CLEANFILES = + +AM_CPPFLAGS = -Wall + +bin_PROGRAMS = numactl numastat numademo migratepages migspeed memhog + +lib_LTLIBRARIES = libnuma.la + +include_HEADERS = numa.h numacompat1.h numaif.h + +noinst_HEADERS = numaint.h util.h + +dist_man_MANS = move_pages.2 numa.3 numactl.8 numastat.8 migratepages.8 migspeed.8 + +EXTRA_DIST = README.md INSTALL.md + +numactl_SOURCES = numactl.c util.c shm.c shm.h +numactl_LDADD = libnuma.la + +numastat_SOURCES = numastat.c +numastat_CFLAGS = $(AM_CFLAGS) -std=gnu99 + +numademo_SOURCES = numademo.c stream_lib.c stream_lib.h mt.c mt.h clearcache.c clearcache.h +numademo_CPPFLAGS = $(AM_CPPFLAGS) -DHAVE_STREAM_LIB -DHAVE_MT -DHAVE_CLEAR_CACHE +numademo_CFLAGS = -O3 -ffast-math -funroll-loops +if HAVE_TREE_VECTORIZE +numademo_CFLAGS += -ftree-vectorize +endif +numademo_LDADD = libnuma.la -lm + +migratepages_SOURCES = migratepages.c util.c +migratepages_LDADD = libnuma.la + +migspeed_SOURCES = migspeed.c util.c +migspeed_LDADD = libnuma.la -lrt + +memhog_SOURCES = memhog.c util.c +memhog_LDADD = libnuma.la + +libnuma_la_SOURCES = libnuma.c syscall.c distance.c affinity.c affinity.h sysfs.c sysfs.h rtnetlink.c rtnetlink.h versions.ldscript +libnuma_la_LDFLAGS = -version-info 1:0:0 -Wl,--version-script,$(srcdir)/versions.ldscript -Wl,-init,numa_init -Wl,-fini,numa_fini + +check_PROGRAMS = \ + test/distance \ + test/ftok \ + test/mbind_mig_pages \ + test/migrate_pages \ + test/move_pages \ + test/mynode \ + test/node-parse \ + test/nodemap \ + test/pagesize \ + test/prefered \ + test/randmap \ + test/realloc_test \ + test/tbitmap \ + test/tshared + +EXTRA_DIST += \ + test/README \ + test/bind_range \ + test/checkaffinity \ + test/checktopology \ + test/numademo \ + test/printcpu \ + test/regress \ + test/regress2 \ + test/regress-io \ + test/runltp \ + test/shmtest + +test_distance_SOURCES = test/distance.c +test_distance_LDADD = libnuma.la + +test_ftok_SOURCES = test/ftok.c +test_ftok_LDADD 
= libnuma.la + +test_mbind_mig_pages_SOURCES = test/mbind_mig_pages.c +test_mbind_mig_pages_LDADD = libnuma.la + +test_migrate_pages_SOURCES = test/migrate_pages.c +test_migrate_pages_LDADD = libnuma.la + +test_move_pages_SOURCES = test/move_pages.c +test_move_pages_LDADD = libnuma.la + +test_mynode_SOURCES = test/mynode.c +test_mynode_LDADD = libnuma.la + +test_node_parse_SOURCES = test/node-parse.c util.c +test_node_parse_LDADD = libnuma.la + +test_nodemap_SOURCES = test/nodemap.c +test_nodemap_LDADD = libnuma.la + +test_pagesize_SOURCES = test/pagesize.c +test_pagesize_LDADD = libnuma.la + +test_prefered_SOURCES = test/prefered.c +test_prefered_LDADD = libnuma.la + +test_randmap_SOURCES = test/randmap.c +test_randmap_LDADD = libnuma.la + +test_realloc_test_SOURCES = test/realloc_test.c +test_realloc_test_LDADD = libnuma.la + +test_tbitmap_SOURCES = test/tbitmap.c util.c +test_tbitmap_LDADD = libnuma.la + +test_tshared_SOURCES = test/tshared.c +test_tshared_LDADD = libnuma.la + +# Legacy make rules for test cases. +# These will be superceded by "make check". + +regress1: $(check_PROGRAMS) + cd test && ./regress + +regress2: $(check_PROGRAMS) + cd test && ./regress2 + +test_numademo: numademo + ./numademo -t -e 10M + +test: all $(check_PROGRAMS) regress1 regress2 test_numademo + +TESTS_ENVIRONMENT = builddir='$(builddir)'; export builddir; + +TESTS = \ + test/bind_range \ + test/checkaffinity \ + test/checktopology \ + test/distance \ + test/nodemap \ + test/numademo \ + test/regress \ + test/tbitmap + +# These are known to be broken: +# test/prefered +# test/randmap + +SED_PROCESS = \ + $(AM_V_GEN)$(SED) \ + -e 's,@VERSION\@,$(VERSION),g' \ + -e 's,@prefix\@,$(prefix),g' \ + -e 's,@exec_prefix\@,$(exec_prefix),g' \ + -e 's,@libdir\@,$(libdir),g' \ + -e 's,@includedir\@,$(includedir),g' \ + < $< > $@ || rm $@ + +%.pc: %.pc.in Makefile + $(SED_PROCESS) + +pkgconfigdir = $(libdir)/pkgconfig +pkgconfig_DATA = numa.pc +EXTRA_DIST += numa.pc.in +CLEANFILES += numa.pc
diff --git a/README b/README new file mode 100644 index 0000000..6d77543 --- /dev/null +++ b/README
@@ -0,0 +1,44 @@ + +Simple NUMA policy support. It consists of a numactl program to run +other programs with a specific NUMA policy and a libnuma shared library +("NUMA API") to set NUMA policy in applications. + +The libnuma binary interface is supposed to stay binary compatible. +Incompatible changes will use new symbol version numbers. + +In addition there are various test and utility programs, like +numastat to display NUMA allocation statistics and memhog. + +In test there is a small regression test suite. +Note that regress assumes an unloaded machine with memory free on each +node. Otherwise you will get spurious failures in the non-strict +policies (preferred, interleave). + +See the manpages numactl.8 and numa.3 for details. + +Source code maintenance: + https://github.com/numactl/numactl/ + +Copyright: + +numactl and the demo programs are under the GNU General Public License, v.2 +libnuma is under the GNU Lesser General Public License, v2.1. + +The manpages are under the same license as the Linux manpages (see the files) + +numademo links with a library derived from the C version of STREAM +by John D. McCalpin and Joe R. Zagar for one sub-benchmark. See stream_lib.c +for the license. In particular when you publish numademo output +you might need to pay attention there or filter out the STREAM results. + +It also uses a public domain Mersenne Twister implementation from +Michael Brundage. + +Version 2.0.10-rc2: (C)2014 SGI + +Author: +Andi Kleen, SUSE Labs + +Version 2.0.0 by Cliff Wickman, Christoph Lameter and Lee Schermerhorn +cpw@sgi.com clameter@sgi.com lee.schermerhorn@hp.com +
diff --git a/TODO b/TODO new file mode 100644 index 0000000..1af588d --- /dev/null +++ b/TODO
@@ -0,0 +1,23 @@ +last update Aug 16 2007: + +need to fix hugetlbfs to allow holey files (for numactl) +numademo numbers seem to be unstable. investigate. +need more test programs + +Replace unreliable counters in numamon with supported ones. +According to Alex Tomas: +seems ht bus utilization can be found by +((cmd+data+bufrelease) / (cmd+data+bufrelease+nop)) +according to chm in codeanalyst. +quote from .chm: +Buses 0, 1, 2: The number of Dwords transmitted (or unused, in the case ++of Nops) on the outgoing side of the HyperTransport links. The sum of all ++four sub-events (all four unit mask bits set) directly reflects the ++transmission rate of the link. Calculate link utilization by dividing the ++combined Command, Data, and Buffer Release count (unit mask 07h) by that ++value plus the Nop count (unit mask 08h). +Bandwidth in terms of bytes per unit-time for any one component, or ++combination of components, is calculated by multiplying the count by four and ++dividing by elapsed time. +and +0xE9 is also documented
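Editor's note: the link-utilization and bandwidth formulas quoted in the TODO can be sketched in a few lines of C. This is only an illustration of the arithmetic; the struct `ht_counts`, its field names, and both helper functions are hypothetical and are not part of numamon or libnuma.

```c
/* Hypothetical container for the four HyperTransport sub-event counts
   described in the TODO; the names are illustrative only. */
struct ht_counts {
	unsigned long long cmd;        /* Command Dwords */
	unsigned long long data;       /* Data Dwords */
	unsigned long long bufrelease; /* Buffer Release Dwords */
	unsigned long long nop;        /* Nop Dwords (unused slots) */
};

/* Utilization = (cmd + data + bufrelease) / (cmd + data + bufrelease + nop).
   Returns 0.0 when no Dwords were counted at all. */
static double ht_link_utilization(const struct ht_counts *c)
{
	unsigned long long busy = c->cmd + c->data + c->bufrelease;
	unsigned long long total = busy + c->nop;

	return total ? (double)busy / (double)total : 0.0;
}

/* Bandwidth in bytes per second: each counted Dword is four bytes.
   Returns 0.0 for a non-positive interval. */
static double ht_bandwidth(unsigned long long dwords, double seconds)
{
	return seconds > 0.0 ? (double)dwords * 4.0 / seconds : 0.0;
}
```

With counts of 100 command, 600 data, 50 buffer-release and 250 Nop Dwords, the utilization works out to 750/1000 = 0.75, and 600 data Dwords over two seconds give 1200 bytes per second.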
diff --git a/affinity.c b/affinity.c new file mode 100644 index 0000000..6f69a9b --- /dev/null +++ b/affinity.c
@@ -0,0 +1,347 @@ +/* Support for specifying IO affinity by various means. + Copyright 2010 Intel Corporation + Author: Andi Kleen + + libnuma is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; version + 2.1. + + libnuma is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should find a copy of v2.1 of the GNU Lesser General Public License + somewhere on your Linux system; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ + +/* Notebook: + - Separate real errors from no NUMA with fallback + - Infiniband + - FCoE? + - Support for other special IO devices + - Specifying cpu subsets inside the IO node? + - Handle multiple IO nodes (needs kernel changes) + - Better support for multi-path IO? + */ +#define _GNU_SOURCE 1 +#include <string.h> +#include <errno.h> +#include <sys/stat.h> +#include <netdb.h> +#include <unistd.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/socket.h> +#include <sys/ioctl.h> +#include <net/if.h> +#include <dirent.h> +#include <linux/rtnetlink.h> +#include <linux/netlink.h> +#include <sys/types.h> +#include <sys/sysmacros.h> +#include <ctype.h> +#include <assert.h> +#include <regex.h> +#include <sys/sysmacros.h> +#include "numa.h" +#include "numaint.h" +#include "sysfs.h" +#include "affinity.h" +#include "rtnetlink.h" + +static int badchar(const char *s) +{ + if (strpbrk(s, "/.")) + return 1; + return 0; +} + +static int node_parse_failure(int ret, char *cls, const char *dev) +{ + if (!cls) + cls = ""; + if (ret == -2) + numa_warn(W_node_parse1, + "Kernel does not know node mask for%s%s device `%s'", + *cls ? 
" " : "", cls, dev); + else + numa_warn(W_node_parse2, + "Cannot read node mask for %s device `%s'", + cls, dev); + return -1; +} + +/* Generic sysfs class lookup */ +static int +affinity_class(struct bitmask *mask, char *cls, const char *dev) +{ + int ret; + while (isspace(*dev)) + dev++; + if (badchar(dev)) { + numa_warn(W_badchar, "Illegal characters in `%s' specification", + dev); + return -1; + } + + /* Somewhat hackish: extract device from symlink path. + Better would be a direct backlink. This knows slightly too + much about the actual sysfs layout. */ + char path[1024]; + char *fn = NULL; + if (asprintf(&fn, "/sys/class/%s/%s", cls, dev) > 0 && + readlink(fn, path, sizeof path) > 0) { + regex_t re; + regmatch_t match[2]; + char *p; + + regcomp(&re, "(/devices/pci[0-9a-fA-F:/]+\\.[0-9]+)/", + REG_EXTENDED); + ret = regexec(&re, path, 2, match, 0); + regfree(&re); + if (ret == 0) { + free(fn); + assert(match[0].rm_so > 0); + assert(match[0].rm_eo > 0); + path[match[1].rm_eo + 1] = 0; + p = path + match[0].rm_so; + ret = sysfs_node_read(mask, "/sys/%s/numa_node", p); + if (ret < 0) + return node_parse_failure(ret, NULL, p); + return ret; + } + } + free(fn); + + ret = sysfs_node_read(mask, "/sys/class/%s/%s/device/numa_node", + cls, dev); + if (ret < 0) + return node_parse_failure(ret, cls, dev); + return 0; +} + +/* Turn file (or device node) into class name */ +static int affinity_file(struct bitmask *mask, char *cls, const char *file) +{ + struct stat st; + DIR *dir; + int n; + unsigned maj = 0, min = 0; + dev_t d; + struct dirent *dep; + + cls = "block"; + char fn[sizeof("/sys/class/") + strlen(cls)]; + if (stat(file, &st) < 0) { + numa_warn(W_blockdev1, "Cannot stat file %s", file); + return -1; + } + d = st.st_dev; + if (S_ISCHR(st.st_mode)) { + /* Better choice than misc? Most likely misc will not work + anyways unless the kernel is fixed. 
*/ + cls = "misc"; + d = st.st_rdev; + } else if (S_ISBLK(st.st_mode)) + d = st.st_rdev; + + sprintf(fn, "/sys/class/%s", cls); + dir = opendir(fn); + if (!dir) { + numa_warn(W_blockdev2, "Cannot enumerate %s devices in sysfs", + cls); + return -1; + } + while ((dep = readdir(dir)) != NULL) { + char *name = dep->d_name; + int ret; + + if (*name == '.') + continue; + char *dev; + char fn2[sizeof("/sys/class/block//dev") + strlen(name)]; + + n = -1; + if (sprintf(fn2, "/sys/class/block/%s/dev", name) < 0) + break; + dev = sysfs_read(fn2); + if (dev) { + n = sscanf(dev, "%u:%u", &maj, &min); + free(dev); + } + if (n != 2) { + numa_warn(W_blockdev3, "Cannot parse sysfs device %s", + name); + continue; + } + + if (major(d) != maj || minor(d) != min) + continue; + + ret = affinity_class(mask, "block", name); + closedir(dir); + return ret; + } + closedir(dir); + numa_warn(W_blockdev5, "Cannot find block device %x:%x in sysfs for `%s'", + maj, min, file); + return -1; +} + +/* Look up interface of route using rtnetlink. */ +static int find_route(struct sockaddr *dst, int *iifp) +{ + struct rtattr *rta; + const int hdrlen = NLMSG_LENGTH(sizeof(struct rtmsg)); + struct { + struct nlmsghdr msg; + struct rtmsg rt; + char buf[256]; + } req = { + .msg = { + .nlmsg_len = hdrlen, + .nlmsg_type = RTM_GETROUTE, + .nlmsg_flags = NLM_F_REQUEST, + }, + .rt = { + .rtm_family = dst->sa_family, + }, + }; + struct sockaddr_nl adr = { + .nl_family = AF_NETLINK, + }; + + if (rta_put_address(&req.msg, RTA_DST, dst) < 0) { + numa_warn(W_netlink1, "Cannot handle network family %x", + dst->sa_family); + return -1; + } + + if (rtnetlink_request(&req.msg, sizeof req, &adr) < 0) { + numa_warn(W_netlink2, "Cannot request rtnetlink route: %s", + strerror(errno)); + return -1; + } + + /* Fish the interface out of the netlink soup. 
*/ + rta = NULL; + while ((rta = rta_get(&req.msg, rta, hdrlen)) != NULL) { + if (rta->rta_type == RTA_OIF) { + memcpy(iifp, RTA_DATA(rta), sizeof(int)); + return 0; + } + } + + numa_warn(W_netlink3, "rtnetlink query did not return interface"); + return -1; +} + +static int iif_to_name(int iif, struct ifreq *ifr) +{ + int n; + int sk = socket(PF_INET, SOCK_DGRAM, 0); + if (sk < 0) + return -1; + ifr->ifr_ifindex = iif; + n = ioctl(sk, SIOCGIFNAME, ifr); + close(sk); + return n; +} + +/* Resolve an IP address to the nodes of a network device. + This generally only attempts to handle simple cases: + no multi-path, no bounding etc. In these cases only + the first interface or none is chosen. */ +static int affinity_ip(struct bitmask *mask, char *cls, const char *id) +{ + struct addrinfo *ai; + int n; + int iif; + struct ifreq ifr; + + if ((n = getaddrinfo(id, NULL, NULL, &ai)) != 0) { + numa_warn(W_net1, "Cannot resolve %s: %s", + id, gai_strerror(n)); + return -1; + } + + if (find_route(&ai->ai_addr[0], &iif) < 0) + goto out_ai; + + if (iif_to_name(iif, &ifr) < 0) { + numa_warn(W_net2, "Cannot resolve network interface %d", iif); + goto out_ai; + } + + freeaddrinfo(ai); + return affinity_class(mask, "net", ifr.ifr_name); + +out_ai: + freeaddrinfo(ai); + return -1; +} + +/* Look up affinity for a PCI device */ +static int affinity_pci(struct bitmask *mask, char *cls, const char *id) +{ + unsigned seg, bus, dev, func; + int n, ret; + + /* Func is optional. 
*/ + if ((n = sscanf(id, "%x:%x:%x.%x",&seg,&bus,&dev,&func)) == 4 || n == 3) { + if (n == 3) + func = 0; + } + /* Segment is optional too */ + else if ((n = sscanf(id, "%x:%x.%x",&bus,&dev,&func)) == 3 || n == 2) { + seg = 0; + if (n == 2) + func = 0; + } else { + numa_warn(W_pci1, "Cannot parse PCI device `%s'", id); + return -1; + } + ret = sysfs_node_read(mask, + "/sys/devices/pci%04x:%02x/%04x:%02x:%02x.%x/numa_node", + seg, bus, seg, bus, dev, func); + if (ret < 0) + return node_parse_failure(ret, cls, id); + return 0; +} + +static struct handler { + char first; + char *name; + char *cls; + int (*handler)(struct bitmask *mask, char *cls, const char *desc); +} handlers[] = { + { 'n', "netdev:", "net", affinity_class }, + { 'i', "ip:", NULL, affinity_ip }, + { 'f', "file:", NULL, affinity_file }, + { 'b', "block:", "block", affinity_class }, + { 'p', "pci:", NULL, affinity_pci }, + {} +}; + +hidden int resolve_affinity(const char *id, struct bitmask *mask) +{ + struct handler *h; + + for (h = &handlers[0]; h->first; h++) { + int len; + if (id[0] != h->first) + continue; + len = strlen(h->name); + if (!strncmp(id, h->name, len)) { + int ret = h->handler(mask, h->cls, id + len); + if (ret == -2) { + numa_warn(W_nonode, "Kernel does not know node for %s\n", + id + len); + } + return ret; + } + } + return NO_IO_AFFINITY; +}
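Editor's note: the `pci:` handler above accepts an identifier with an optional segment and an optional function number, then builds a sysfs path from it. The sketch below isolates that parsing step so it can be exercised without libnuma. The names `struct pci_addr`, `parse_pci_id` and `pci_numa_node_path` are hypothetical helpers written for this illustration; only the two `sscanf` formats and the path format string are taken from `affinity_pci()`.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical holder for the four components of a PCI address. */
struct pci_addr {
	unsigned seg, bus, dev, func;
};

/* Parse "seg:bus:dev.func", "seg:bus:dev", "bus:dev.func" or "bus:dev"
   (all hexadecimal), trying the same two formats in the same order as
   affinity_pci(). Returns 0 on success, -1 if the string does not parse. */
static int parse_pci_id(const char *id, struct pci_addr *a)
{
	int n;

	a->seg = a->bus = a->dev = a->func = 0;

	/* Full form first; the function number is optional. */
	n = sscanf(id, "%x:%x:%x.%x", &a->seg, &a->bus, &a->dev, &a->func);
	if (n == 4 || n == 3) {
		if (n == 3)
			a->func = 0;
		return 0;
	}

	/* The segment is optional too; it then defaults to 0. */
	n = sscanf(id, "%x:%x.%x", &a->bus, &a->dev, &a->func);
	if (n == 3 || n == 2) {
		a->seg = 0;
		if (n == 2)
			a->func = 0;
		return 0;
	}
	return -1;
}

/* Format the sysfs numa_node path using the format string from the patch.
   Returns the snprintf() result. */
static int pci_numa_node_path(const struct pci_addr *a, char *buf, size_t len)
{
	return snprintf(buf, len,
			"/sys/devices/pci%04x:%02x/%04x:%02x:%02x.%x/numa_node",
			a->seg, a->bus, a->seg, a->bus, a->dev, a->func);
}
```

For the input `0000:03:00.1` this yields the path `/sys/devices/pci0000:03/0000:03:00.1/numa_node`. Whether that file exists depends on the machine and kernel.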
diff --git a/affinity.h b/affinity.h new file mode 100644 index 0000000..6fbd364 --- /dev/null +++ b/affinity.h
@@ -0,0 +1,5 @@ +enum { + NO_IO_AFFINITY = -2 +}; + +int resolve_affinity(const char *id, struct bitmask *mask);
diff --git a/autogen.sh b/autogen.sh new file mode 100755 index 0000000..a36375b --- /dev/null +++ b/autogen.sh
@@ -0,0 +1,5 @@ +#!/bin/sh + +set -e + +autoreconf --install --symlink
diff --git a/clearcache.c b/clearcache.c new file mode 100644 index 0000000..82469c1 --- /dev/null +++ b/clearcache.c
@@ -0,0 +1,77 @@ +/* Clear the CPU cache for benchmark purposes. Pretty simple minded. + * Might not work in some complex cache topologies. + * When you switch CPUs it's a good idea to clear the cache after testing + * too. + */ +#include <unistd.h> +#include <stdio.h> +#include <stdlib.h> +#include "clearcache.h" + +unsigned cache_size(void) +{ + unsigned cs = 0; +#ifdef _SC_LEVEL1_DCACHE_SIZE + cs += sysconf(_SC_LEVEL1_DCACHE_SIZE); +#endif +#ifdef _SC_LEVEL2_DCACHE_SIZE + cs += sysconf(_SC_LEVEL2_DCACHE_SIZE); +#endif +#ifdef _SC_LEVEL3_DCACHE_SIZE + cs += sysconf(_SC_LEVEL3_DCACHE_SIZE); +#endif +#ifdef _SC_LEVEL4_DCACHE_SIZE + cs += sysconf(_SC_LEVEL4_DCACHE_SIZE); +#endif + if (cs == 0) { + static int warned; + if (!warned) { + printf("Cannot determine CPU cache size\n"); + warned = 1; + } + cs = 64*1024*1024; + } + cs *= 2; /* safety factor */ + + return cs; +} + +void fallback_clearcache(void) +{ + static unsigned char *clearmem; + unsigned cs = cache_size(); + unsigned i; + + if (!clearmem) + clearmem = malloc(cs); + if (!clearmem) { + printf("Warning: cannot allocate %u bytes of clear cache buffer\n", cs); + return; + } + for (i = 0; i < cs; i += 32) + clearmem[i] = 1; +} + +void clearcache(unsigned char *mem, unsigned size) +{ +#if defined(__i386__) || defined(__x86_64__) + unsigned i, cl, eax, feat; + /* get clflush unit and feature */ + asm("cpuid" : "=a" (eax), "=b" (cl), "=d" (feat) : "0" (1) : "cx"); + if (!(feat & (1 << 19))) + fallback_clearcache(); + cl = ((cl >> 8) & 0xff) * 8; + for (i = 0; i < size; i += cl) + asm("clflush %0" :: "m" (mem[i])); +#elif defined(__ia64__) + unsigned long cl, endcl; + // flush probable 128 byte cache lines (but possibly 64 bytes) + cl = (unsigned long)mem; + endcl = (unsigned long)(mem + (size-1)); + for (; cl <= endcl; cl += 64) + asm ("fc %0" :: "r"(cl) : "memory" ); +#else +#warning "Consider adding a clearcache implementation for your architecture" + fallback_clearcache(); +#endif +}
diff --git a/clearcache.h b/clearcache.h new file mode 100644 index 0000000..258b576 --- /dev/null +++ b/clearcache.h
@@ -0,0 +1 @@ +void clearcache(unsigned char *mem, unsigned size);
diff --git a/config.h b/config.h new file mode 100644 index 0000000..f42d6b1 --- /dev/null +++ b/config.h
@@ -0,0 +1,66 @@ +/* config.h. Generated from config.h.in by configure. */ +/* config.h.in. Generated from configure.ac by autoheader. */ + +/* Define to 1 if you have the <dlfcn.h> header file. */ +#define HAVE_DLFCN_H 1 + +/* Define to 1 if you have the <inttypes.h> header file. */ +#define HAVE_INTTYPES_H 1 + +/* Define to 1 if you have the <memory.h> header file. */ +#define HAVE_MEMORY_H 1 + +/* Define to 1 if you have the <stdint.h> header file. */ +#define HAVE_STDINT_H 1 + +/* Define to 1 if you have the <stdlib.h> header file. */ +#define HAVE_STDLIB_H 1 + +/* Define to 1 if you have the <strings.h> header file. */ +#define HAVE_STRINGS_H 1 + +/* Define to 1 if you have the <string.h> header file. */ +#define HAVE_STRING_H 1 + +/* Define to 1 if you have the <sys/stat.h> header file. */ +#define HAVE_SYS_STAT_H 1 + +/* Define to 1 if you have the <sys/types.h> header file. */ +#define HAVE_SYS_TYPES_H 1 + +/* Define to 1 if you have the <unistd.h> header file. */ +#define HAVE_UNISTD_H 1 + +/* Define to the sub-directory in which libtool stores uninstalled libraries. + */ +#define LT_OBJDIR ".libs/" + +/* Name of package */ +#define PACKAGE "numactl" + +/* Define to the address where bug reports for this package should be sent. */ +#define PACKAGE_BUGREPORT "" + +/* Define to the full name of this package. */ +#define PACKAGE_NAME "numactl" + +/* Define to the full name and version of this package. */ +#define PACKAGE_STRING "numactl 2.0.10" + +/* Define to the one symbol short name of this package. */ +#define PACKAGE_TARNAME "numactl" + +/* Define to the home page for this package. */ +#define PACKAGE_URL "" + +/* Define to the version of this package. */ +#define PACKAGE_VERSION "2.0.10" + +/* Define to 1 if you have the ANSI C header files. */ +#define STDC_HEADERS 1 + +/* If the compiler supports a TLS storage class define it to that here */ +#define TLS __thread + +/* Version number of package */ +#define VERSION "2.0.10"
diff --git a/configure.ac b/configure.ac new file mode 100644 index 0000000..572b53d --- /dev/null +++ b/configure.ac
@@ -0,0 +1,27 @@ +AC_PREREQ([2.64]) +AC_INIT([numactl], [2.0.12]) + +AC_CONFIG_SRCDIR([numactl.c]) +AC_CONFIG_MACRO_DIR([m4]) +AC_CONFIG_AUX_DIR([build-aux]) +AC_CONFIG_HEADERS([config.h]) + +AM_INIT_AUTOMAKE([foreign 1.11 silent-rules subdir-objects parallel-tests]) +AM_SILENT_RULES([yes]) + +LT_PREREQ([2.2]) +LT_INIT + +AC_PROG_CC + +# Override CFLAGS so that we can specify custom CFLAGS for numademo. +AX_AM_OVERRIDE_VAR([CFLAGS]) + +AX_TLS + +AX_CHECK_COMPILE_FLAG([-ftree-vectorize], [tree_vectorize="true"]) +AM_CONDITIONAL([HAVE_TREE_VECTORIZE], [test x"${tree_vectorize}" = x"true"]) + +AC_CONFIG_FILES([Makefile]) + +AC_OUTPUT
diff --git a/distance.c b/distance.c new file mode 100644 index 0000000..8d472af --- /dev/null +++ b/distance.c
@@ -0,0 +1,120 @@ +/* Discover distances + Copyright (C) 2005 Andi Kleen, SuSE Labs. + + libnuma is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; version + 2.1. + + libnuma is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should find a copy of v2.1 of the GNU Lesser General Public License + somewhere on your Linux system; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + All calls are undefined when numa_available returns an error. */ +#define _GNU_SOURCE 1 +#include <stdio.h> +#include <stdlib.h> +#include <errno.h> +#include "numa.h" +#include "numaint.h" + +static int distance_numnodes; +static int *distance_table; + +static void parse_numbers(char *s, int *iptr) +{ + int i, d, j; + char *end; + int maxnode = numa_max_node(); + int numnodes = 0; + + for (i = 0; i <= maxnode; i++) + if (numa_bitmask_isbitset(numa_nodes_ptr, i)) + numnodes++; + + for (i = 0, j = 0; i <= maxnode; i++, j++) { + d = strtoul(s, &end, 0); + /* Skip unavailable nodes */ + while (j<=maxnode && !numa_bitmask_isbitset(numa_nodes_ptr, j)) + j++; + if (s == end) + break; + *(iptr+j) = d; + s = end; + } +} + +static int read_distance_table(void) +{ + int nd, len; + char *line = NULL; + size_t linelen = 0; + int maxnode = numa_max_node() + 1; + int *table = NULL; + int err = -1; + + for (nd = 0;; nd++) { + char fn[100]; + FILE *dfh; + sprintf(fn, "/sys/devices/system/node/node%d/distance", nd); + dfh = fopen(fn, "r"); + if (!dfh) { + if (errno == ENOENT) + err = 0; + if (!err && nd<maxnode) + continue; + else + break; + } + len = getdelim(&line, &linelen, '\n', dfh); + fclose(dfh); + if (len <= 0) + break; + + if (!table) { + 
table = calloc(maxnode * maxnode, sizeof(int)); + if (!table) { + errno = ENOMEM; + break; + } + } + + parse_numbers(line, table + nd * maxnode); + } + free(line); + if (err) { + numa_warn(W_distance, + "Cannot parse distance information in sysfs: %s", + strerror(errno)); + free(table); + return err; + } + /* Update the global table pointer. Race window here with + other threads, but in the worst case we leak one distance + array one time, which is tolerable. This avoids a + dependency on pthreads. */ + if (distance_table) { + free(table); + return 0; + } + distance_numnodes = maxnode; + distance_table = table; + return 0; +} + +int numa_distance(int a, int b) +{ + if (!distance_table) { + int err = read_distance_table(); + if (err < 0) + return 0; + } + if ((unsigned)a >= distance_numnodes || (unsigned)b >= distance_numnodes) + return 0; + return distance_table[a * distance_numnodes + b]; +}
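Editor's note: `read_distance_table()` reads one line per node from sysfs and hands it to `parse_numbers()`, which walks the line with `strtoul()` until no further number can be converted. The sketch below shows only that tokenizing loop, under the simplifying assumption that every node is present. The real `parse_numbers()` additionally skips slots for unavailable nodes. The name `parse_distance_line` is a hypothetical helper, not a libnuma function.

```c
#include <stdlib.h>

/* Parse up to max whitespace-separated integers from s into out.
   Stops when strtoul() makes no progress (end of line or junk).
   Returns the number of values stored. */
static int parse_distance_line(const char *s, int *out, int max)
{
	int n = 0;
	char *end;

	while (n < max) {
		unsigned long d = strtoul(s, &end, 0);

		if (s == end)	/* nothing converted: end of input */
			break;
		out[n++] = (int)d;
		s = end;
	}
	return n;
}
```

A line such as `10 21 21 10` parses into four values. In the patch, each such row is stored at `table + nd * maxnode`, so `numa_distance(a, b)` becomes a row-major lookup at `a * distance_numnodes + b`.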
diff --git a/libnuma.c b/libnuma.c new file mode 100755 index 0000000..700b30c --- /dev/null +++ b/libnuma.c
@@ -0,0 +1,2047 @@ +/* Simple NUMA library. + Copyright (C) 2003,2004,2005,2008 Andi Kleen,SuSE Labs and + Cliff Wickman,SGI. + + libnuma is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; version + 2.1. + + libnuma is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should find a copy of v2.1 of the GNU Lesser General Public License + somewhere on your Linux system; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + All calls are undefined when numa_available returns an error. */ +#define _GNU_SOURCE 1 +#include <stdlib.h> +#include <stdio.h> +#include <unistd.h> +#include <string.h> +#include <sched.h> +#include <dirent.h> +#include <errno.h> +#include <stdarg.h> +#include <ctype.h> + +#include <sys/mman.h> +#include <limits.h> + +#ifdef MEMORY_SANITIZER +#include <sanitizer/msan_interface.h> +#endif + +#include "config.h" +#include "numa.h" +#include "numaif.h" +#include "numaint.h" +#include "util.h" +#include "affinity.h" + +#define WEAK __attribute__((weak)) + +#define CPU_BUFFER_SIZE 4096 /* This limits you to 32768 CPUs */ + +/* these are the old (version 1) masks */ +nodemask_t numa_no_nodes; +nodemask_t numa_all_nodes; +/* these are now the default bitmask (pointers to) (version 2) */ +struct bitmask *numa_no_nodes_ptr = NULL; +struct bitmask *numa_all_nodes_ptr = NULL; +struct bitmask *numa_possible_nodes_ptr = NULL; +struct bitmask *numa_all_cpus_ptr = NULL; +struct bitmask *numa_possible_cpus_ptr = NULL; +/* I would prefer to use symbol versioning to create v1 and v2 versions + of numa_no_nodes and numa_all_nodes, but the loader does not correctly + handle versioning of BSS versus small data 
items */ + +struct bitmask *numa_nodes_ptr = NULL; +static struct bitmask *numa_memnode_ptr = NULL; +static unsigned long *node_cpu_mask_v1[NUMA_NUM_NODES]; +static struct bitmask **node_cpu_mask_v2; + +WEAK void numa_error(const char *where); + +#ifndef TLS +#warning "not threadsafe" +#define __thread +#endif + +static __thread int bind_policy = MPOL_BIND; +static __thread unsigned int mbind_flags = 0; +static int sizes_set=0; +static int maxconfigurednode = -1; +static int maxconfiguredcpu = -1; +static int numprocnode = -1; +static int numproccpu = -1; +static int nodemask_sz = 0; +static int cpumask_sz = 0; + +int numa_exit_on_error = 0; +int numa_exit_on_warn = 0; +static void set_sizes(void); + +/* + * There are two special functions, _init(void) and _fini(void), which + * are called automatically by the dynamic loader whenever a library is loaded. + * + * The v1 library depends upon nodemask_t's of all nodes and no nodes. + */ +void __attribute__((constructor)) +numa_init(void) +{ + int max,i; + + if (sizes_set) + return; + + set_sizes(); + /* numa_all_nodes should represent existing nodes on this system */ + max = numa_num_configured_nodes(); + for (i = 0; i < max; i++) + nodemask_set_compat((nodemask_t *)&numa_all_nodes, i); + memset(&numa_no_nodes, 0, sizeof(numa_no_nodes)); +} + +#define FREE_AND_ZERO(x) if (x) { \ + numa_bitmask_free(x); \ + x = NULL; \ + } + +void __attribute__((destructor)) +numa_fini(void) +{ + FREE_AND_ZERO(numa_all_cpus_ptr); + FREE_AND_ZERO(numa_possible_cpus_ptr); + FREE_AND_ZERO(numa_all_nodes_ptr); + FREE_AND_ZERO(numa_possible_nodes_ptr); + FREE_AND_ZERO(numa_no_nodes_ptr); + FREE_AND_ZERO(numa_memnode_ptr); + FREE_AND_ZERO(numa_nodes_ptr); +} + +/* + * The following bitmask declarations, bitmask_*() routines, and associated + * _setbit() and _getbit() routines are: + * Copyright (c) 2004_2007 Silicon Graphics, Inc. (SGI) All rights reserved. 
+ * SGI publishes it under the terms of the GNU General Public License, v2, + * as published by the Free Software Foundation. + */ +static unsigned int +_getbit(const struct bitmask *bmp, unsigned int n) +{ + if (n < bmp->size) + return (bmp->maskp[n/bitsperlong] >> (n % bitsperlong)) & 1; + else + return 0; +} + +static void +_setbit(struct bitmask *bmp, unsigned int n, unsigned int v) +{ + if (n < bmp->size) { + if (v) + bmp->maskp[n/bitsperlong] |= 1UL << (n % bitsperlong); + else + bmp->maskp[n/bitsperlong] &= ~(1UL << (n % bitsperlong)); + } +} + +int +numa_bitmask_isbitset(const struct bitmask *bmp, unsigned int i) +{ + return _getbit(bmp, i); +} + +struct bitmask * +numa_bitmask_setall(struct bitmask *bmp) +{ + unsigned int i; + for (i = 0; i < bmp->size; i++) + _setbit(bmp, i, 1); + return bmp; +} + +struct bitmask * +numa_bitmask_clearall(struct bitmask *bmp) +{ + unsigned int i; + for (i = 0; i < bmp->size; i++) + _setbit(bmp, i, 0); + return bmp; +} + +struct bitmask * +numa_bitmask_setbit(struct bitmask *bmp, unsigned int i) +{ + _setbit(bmp, i, 1); + return bmp; +} + +struct bitmask * +numa_bitmask_clearbit(struct bitmask *bmp, unsigned int i) +{ + _setbit(bmp, i, 0); + return bmp; +} + +unsigned int +numa_bitmask_nbytes(struct bitmask *bmp) +{ + return longsperbits(bmp->size) * sizeof(unsigned long); +} + +/* where n is the number of bits in the map */ +/* This function should not exit on failure, but right now we cannot really + recover from this. 
*/ +struct bitmask * +numa_bitmask_alloc(unsigned int n) +{ + struct bitmask *bmp; + + if (n < 1) { + errno = EINVAL; + numa_error("request to allocate mask for invalid number"); + exit(1); + } + bmp = malloc(sizeof(*bmp)); + if (bmp == 0) + goto oom; + bmp->size = n; + bmp->maskp = calloc(longsperbits(n), sizeof(unsigned long)); + if (bmp->maskp == 0) { + free(bmp); + goto oom; + } + return bmp; + +oom: + numa_error("Out of memory allocating bitmask"); + exit(1); +} + +void +numa_bitmask_free(struct bitmask *bmp) +{ + if (bmp == 0) + return; + free(bmp->maskp); + bmp->maskp = (unsigned long *)0xdeadcdef; /* double free tripwire */ + free(bmp); + return; +} + +/* True if two bitmasks are equal */ +int +numa_bitmask_equal(const struct bitmask *bmp1, const struct bitmask *bmp2) +{ + unsigned int i; + for (i = 0; i < bmp1->size || i < bmp2->size; i++) + if (_getbit(bmp1, i) != _getbit(bmp2, i)) + return 0; + return 1; +} + +/* Hamming Weight: number of set bits */ +unsigned int numa_bitmask_weight(const struct bitmask *bmp) +{ + unsigned int i; + unsigned int w = 0; + for (i = 0; i < bmp->size; i++) + if (_getbit(bmp, i)) + w++; + return w; +} + +/* *****end of bitmask_ routines ************ */ + +/* Next two can be overwritten by the application for different error handling */ +WEAK void numa_error(const char *where) +{ + int olde = errno; + perror(where); + if (numa_exit_on_error) + exit(1); + errno = olde; +} + +WEAK void numa_warn(int num, const char *fmt, ...) 
+{ + static unsigned warned; + va_list ap; + int olde = errno; + + /* Give each warning only once */ + if ((1<<num) & warned) + return; + warned |= (1<<num); + + va_start(ap,fmt); + fprintf(stderr, "libnuma: Warning: "); + vfprintf(stderr, fmt, ap); + fputc('\n', stderr); + va_end(ap); + + errno = olde; +} + +static void setpol(int policy, struct bitmask *bmp) +{ + if (set_mempolicy(policy, bmp->maskp, bmp->size + 1) < 0) + numa_error("set_mempolicy"); +} + +static void getpol(int *oldpolicy, struct bitmask *bmp) +{ + if (get_mempolicy(oldpolicy, bmp->maskp, bmp->size + 1, 0, 0) < 0) + numa_error("get_mempolicy"); +} + +static void dombind(void *mem, size_t size, int pol, struct bitmask *bmp) +{ + if (mbind(mem, size, pol, bmp ? bmp->maskp : NULL, bmp ? bmp->size + 1 : 0, + mbind_flags) < 0) + numa_error("mbind"); +} + +/* (undocumented) */ +/* gives the wrong answer for hugetlbfs mappings. */ +int numa_pagesize(void) +{ + static int pagesize; + if (pagesize > 0) + return pagesize; + pagesize = getpagesize(); + return pagesize; +} + +make_internal_alias(numa_pagesize); + +/* + * Find nodes (numa_nodes_ptr), nodes with memory (numa_memnode_ptr) + * and the highest numbered existing node (maxconfigurednode). + */ +static void +set_configured_nodes(void) +{ + DIR *d; + struct dirent *de; + long long freep; + + numa_memnode_ptr = numa_allocate_nodemask(); + numa_nodes_ptr = numa_allocate_nodemask(); + + d = opendir("/sys/devices/system/node"); + if (!d) { + maxconfigurednode = 0; + } else { + while ((de = readdir(d)) != NULL) { + int nd; + if (strncmp(de->d_name, "node", 4)) + continue; + nd = strtoul(de->d_name+4, NULL, 0); + numa_bitmask_setbit(numa_nodes_ptr, nd); + if (numa_node_size64(nd, &freep) > 0) + numa_bitmask_setbit(numa_memnode_ptr, nd); + if (maxconfigurednode < nd) + maxconfigurednode = nd; + } + closedir(d); + } +} + +/* + * Convert the string length of an ascii hex mask to the number + * of bits represented by that mask. 
+ */ +static int s2nbits(const char *s) +{ + return strlen(s) * 32 / 9; +} + +/* Is string 'pre' a prefix of string 's'? */ +static int strprefix(const char *s, const char *pre) +{ + return strncmp(s, pre, strlen(pre)) == 0; +} + +static const char *mask_size_file = "/proc/self/status"; +static const char *nodemask_prefix = "Mems_allowed:\t"; +/* + * (do this the way Paul Jackson's libcpuset does it) + * The nodemask values in /proc/self/status are in an + * ascii format that uses 9 characters for each 32 bits of mask. + * (this could also be used to find the cpumask size) + */ +static void +set_nodemask_size(void) +{ + FILE *fp; + char *buf = NULL; + size_t bufsize = 0; + + if ((fp = fopen(mask_size_file, "r")) == NULL) + goto done; + + while (getline(&buf, &bufsize, fp) > 0) { +#ifdef MEMORY_SANITIZER + __msan_unpoison_string(buf); +#endif + if (strprefix(buf, nodemask_prefix)) { + nodemask_sz = s2nbits(buf + strlen(nodemask_prefix)); + break; + } + } + free(buf); + fclose(fp); +done: + if (nodemask_sz == 0) {/* fall back on error */ + int pol; + unsigned long *mask = NULL; + nodemask_sz = 16; + do { + nodemask_sz <<= 1; + mask = realloc(mask, nodemask_sz / 8); + if (!mask) + return; + } while (get_mempolicy(&pol, mask, nodemask_sz + 1, 0, 0) < 0 && errno == EINVAL && + nodemask_sz < 4096*8); + free(mask); + } +} + +/* + * Read a mask consisting of a sequence of hexadecimal longs separated by + * commas. Order them correctly and return the number of bits set. + */ +static int +read_mask(char *s, struct bitmask *bmp) +{ + char *end = s; + int tmplen = (bmp->size + bitsperint - 1) / bitsperint; + unsigned int tmp[tmplen]; + unsigned int *start = tmp; + unsigned int i, n = 0, m = 0; + + if (!s) + return 0; /* shouldn't happen */ + + i = strtoul(s, &end, 16); + + /* Skip leading zeros */ + while (!i && *end++ == ',') { + i = strtoul(end, &end, 16); + } + + if (!i) + /* End of string. 
No mask */ + return -1; + + start[n++] = i; + /* Read sequence of ints */ + while (*end++ == ',') { + /* stop before writing past tmp[] */ + if (n >= tmplen) + return -1; + + i = strtoul(end, &end, 16); + start[n++] = i; + } + + /* + * Invert sequence of ints if necessary since the first int + * is the highest and we put it first because we read it first. + */ + while (n) { + int w; + unsigned long x = 0; + /* read into long values in an endian-safe way */ + for (w = 0; n && w < bitsperlong; w += bitsperint) + x |= ((unsigned long)start[n-- - 1] << w); + + bmp->maskp[m++] = x; + } + /* + * Return the number of bits set + */ + return numa_bitmask_weight(bmp); +} + +/* + * Read a process's constraints in terms of nodes and cpus from + * /proc/self/status. + */ +static void +set_task_constraints(void) +{ + int hicpu = maxconfiguredcpu; + int i; + char *buffer = NULL; + size_t buflen = 0; + FILE *f; + + numa_all_cpus_ptr = numa_allocate_cpumask(); + numa_possible_cpus_ptr = numa_allocate_cpumask(); + numa_all_nodes_ptr = numa_allocate_nodemask(); + numa_possible_nodes_ptr = numa_allocate_cpumask(); + numa_no_nodes_ptr = numa_allocate_nodemask(); + + f = fopen(mask_size_file, "r"); + if (!f) { + //numa_warn(W_cpumap, "Cannot parse %s", mask_size_file); + return; + } + + while (getline(&buffer, &buflen, f) > 0) { +#ifdef MEMORY_SANITIZER + __msan_unpoison_string(buffer); +#endif + /* mask starts after [last] tab */ + char *mask = strrchr(buffer,'\t') + 1; + + if (strncmp(buffer,"Cpus_allowed:",13) == 0) + numproccpu = read_mask(mask, numa_all_cpus_ptr); + + if (strncmp(buffer,"Mems_allowed:",13) == 0) { + numprocnode = read_mask(mask, numa_all_nodes_ptr); + } + } + fclose(f); + free(buffer); + + for (i = 0; i <= hicpu; i++) + numa_bitmask_setbit(numa_possible_cpus_ptr, i); + for (i = 0; i <= maxconfigurednode; i++) + numa_bitmask_setbit(numa_possible_nodes_ptr, i); + + /* + * Cpus_allowed in the kernel can be defined to all f's + * i.e.
it may be a superset of the actual available processors. + * As such let's reduce numproccpu to the number of actual + * available cpus. + */ + if (numproccpu <= 0) { + for (i = 0; i <= hicpu; i++) + numa_bitmask_setbit(numa_all_cpus_ptr, i); + numproccpu = hicpu+1; + } + + if (numproccpu > hicpu+1) { + numproccpu = hicpu+1; + for (i=hicpu+1; i<numa_all_cpus_ptr->size; i++) { + numa_bitmask_clearbit(numa_all_cpus_ptr, i); + } + } + + if (numprocnode <= 0) { + for (i = 0; i <= maxconfigurednode; i++) + numa_bitmask_setbit(numa_all_nodes_ptr, i); + numprocnode = maxconfigurednode + 1; + } + + return; +} + +/* + * Find the highest cpu number possible (in other words the size + * of a kernel cpumask_t (in bits) - 1) + */ +static void +set_numa_max_cpu(void) +{ + int len = 4096; + int n; + int olde = errno; + struct bitmask *buffer; + + do { + buffer = numa_bitmask_alloc(len); + n = numa_sched_getaffinity_v2_int(0, buffer); + /* on success, returns size of kernel cpumask_t, in bytes */ + if (n < 0) { + if (errno == EINVAL) { + if (len >= 1024*1024) + break; + len *= 2; + numa_bitmask_free(buffer); + continue; + } else { + numa_warn(W_numcpus, "Unable to determine max cpu" + " (sched_getaffinity: %s); guessing...", + strerror(errno)); + n = sizeof(cpu_set_t); + break; + } + } + } while (n < 0); + numa_bitmask_free(buffer); + errno = olde; + cpumask_sz = n*8; +} + +/* + * get the total (configured) number of cpus - both online and offline + */ +static void +set_configured_cpus(void) +{ + maxconfiguredcpu = sysconf(_SC_NPROCESSORS_CONF) - 1; + if (maxconfiguredcpu == -1) + numa_error("sysconf(NPROCESSORS_CONF) failed"); +} + +/* + * Initialize all the sizes. 
+ */ +static void +set_sizes(void) +{ + sizes_set++; + set_nodemask_size(); /* size of kernel nodemask_t */ + set_configured_nodes(); /* configured nodes listed in /sys */ + set_numa_max_cpu(); /* size of kernel cpumask_t */ + set_configured_cpus(); /* cpus listed in /sys/devices/system/cpu */ + set_task_constraints(); /* cpus and nodes for current task */ +} + +int +numa_num_configured_nodes(void) +{ + /* + * NOTE: this function's behavior matches the documentation (ie: it + * returns a count of nodes with memory) despite the poor function + * naming. We also cannot use the similarly poorly named + * numa_all_nodes_ptr as it only tracks nodes with memory from which + * the calling process can allocate. Think sparse nodes, memory-less + * nodes, cpusets... + */ + int memnodecount=0, i; + + for (i=0; i <= maxconfigurednode; i++) { + if (numa_bitmask_isbitset(numa_memnode_ptr, i)) + memnodecount++; + } + return memnodecount; +} + +int +numa_num_configured_cpus(void) +{ + + return maxconfiguredcpu+1; +} + +int +numa_num_possible_nodes(void) +{ + return nodemask_sz; +} + +int +numa_num_possible_cpus(void) +{ + return cpumask_sz; +} + +int +numa_num_task_nodes(void) +{ + return numprocnode; +} + +/* + * for backward compatibility + */ +int +numa_num_thread_nodes(void) +{ + return numa_num_task_nodes(); +} + +int +numa_num_task_cpus(void) +{ + return numproccpu; +} + +/* + * for backward compatibility + */ +int +numa_num_thread_cpus(void) +{ + return numa_num_task_cpus(); +} + +/* + * Return the number of the highest node in this running system, + */ +int +numa_max_node(void) +{ + return maxconfigurednode; +} + +make_internal_alias(numa_max_node); + +/* + * Return the number of the highest possible node in a system, + * which for v1 is the size of a numa.h nodemask_t(in bits)-1. + * but for v2 is the size of a kernel nodemask_t(in bits)-1. 
+ */ +int +numa_max_possible_node_v1(void) +{ + return ((sizeof(nodemask_t)*8)-1); +} +backward_symver(numa_max_possible_node_v1,numa_max_possible_node); + +int +numa_max_possible_node_v2(void) +{ + return numa_num_possible_nodes()-1; +} +symver(numa_max_possible_node_v2,numa_max_possible_node); + +make_internal_alias(numa_max_possible_node_v1); +make_internal_alias(numa_max_possible_node_v2); + +/* + * Allocate a bitmask for cpus, of a size large enough to + * match the kernel's cpumask_t. + */ +struct bitmask * +numa_allocate_cpumask() +{ + int ncpus = numa_num_possible_cpus(); + + return numa_bitmask_alloc(ncpus); +} + +/* + * Allocate a bitmask the size of a libnuma nodemask_t + */ +static struct bitmask * +allocate_nodemask_v1(void) +{ + int nnodes = numa_max_possible_node_v1_int()+1; + + return numa_bitmask_alloc(nnodes); +} + +/* + * Allocate a bitmask for nodes, of a size large enough to + * match the kernel's nodemask_t. + */ +struct bitmask * +numa_allocate_nodemask(void) +{ + struct bitmask *bmp; + int nnodes = numa_max_possible_node_v2_int() + 1; + + bmp = numa_bitmask_alloc(nnodes); + return bmp; +} + +/* (cache the result?) */ +long long numa_node_size64(int node, long long *freep) +{ + size_t len = 0; + char *line = NULL; + long long size = -1; + FILE *f; + char fn[64]; + int ok = 0; + int required = freep ? 
2 : 1; + + if (freep) + *freep = -1; + sprintf(fn,"/sys/devices/system/node/node%d/meminfo", node); + f = fopen(fn, "r"); + if (!f) + return -1; + while (getdelim(&line, &len, '\n', f) > 0) { + char *end; + char *s = strcasestr(line, "kB"); + if (!s) + continue; + --s; + while (s > line && isspace(*s)) + --s; + while (s > line && isdigit(*s)) + --s; + if (strstr(line, "MemTotal")) { + size = strtoull(s,&end,0) << 10; + if (end == s) + size = -1; + else + ok++; + } + if (freep && strstr(line, "MemFree")) { + *freep = strtoull(s,&end,0) << 10; + if (end == s) + *freep = -1; + else + ok++; + } + } + fclose(f); + free(line); + if (ok != required) + numa_warn(W_badmeminfo, "Cannot parse sysfs meminfo (%d)", ok); + return size; +} + +make_internal_alias(numa_node_size64); + +long numa_node_size(int node, long *freep) +{ + long long f2; + long sz = numa_node_size64_int(node, &f2); + if (freep) + *freep = f2; + return sz; +} + +int numa_available(void) +{ + if (get_mempolicy(NULL, NULL, 0, 0, 0) < 0 && errno == ENOSYS) + return -1; + return 0; +} + +void +numa_interleave_memory_v1(void *mem, size_t size, const nodemask_t *mask) +{ + struct bitmask bitmask; + + bitmask.size = sizeof(nodemask_t) * 8; + bitmask.maskp = (unsigned long *)mask; + dombind(mem, size, MPOL_INTERLEAVE, &bitmask); +} +backward_symver(numa_interleave_memory_v1,numa_interleave_memory); + +void +numa_interleave_memory_v2(void *mem, size_t size, struct bitmask *bmp) +{ + dombind(mem, size, MPOL_INTERLEAVE, bmp); +} +symver(numa_interleave_memory_v2,numa_interleave_memory); + +void numa_tonode_memory(void *mem, size_t size, int node) +{ + struct bitmask *nodes; + + nodes = numa_allocate_nodemask(); + numa_bitmask_setbit(nodes, node); + dombind(mem, size, bind_policy, nodes); + numa_bitmask_free(nodes); +} + +void +numa_tonodemask_memory_v1(void *mem, size_t size, const nodemask_t *mask) +{ + struct bitmask bitmask; + + bitmask.maskp = (unsigned long *)mask; + bitmask.size = sizeof(nodemask_t); + 
dombind(mem, size, bind_policy, &bitmask); +} +backward_symver(numa_tonodemask_memory_v1,numa_tonodemask_memory); + +void +numa_tonodemask_memory_v2(void *mem, size_t size, struct bitmask *bmp) +{ + dombind(mem, size, bind_policy, bmp); +} +symver(numa_tonodemask_memory_v2,numa_tonodemask_memory); + +void numa_setlocal_memory(void *mem, size_t size) +{ + dombind(mem, size, MPOL_PREFERRED, NULL); +} + +void numa_police_memory(void *mem, size_t size) +{ + int pagesize = numa_pagesize_int(); + unsigned long i; + for (i = 0; i < size; i += pagesize) + ((volatile char*)mem)[i] = ((volatile char*)mem)[i]; +} + +make_internal_alias(numa_police_memory); + +void *numa_alloc(size_t size) +{ + char *mem; + mem = mmap(0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, + 0, 0); + if (mem == (char *)-1) + return NULL; + numa_police_memory_int(mem, size); + return mem; +} + +void *numa_realloc(void *old_addr, size_t old_size, size_t new_size) +{ + char *mem; + mem = mremap(old_addr, old_size, new_size, MREMAP_MAYMOVE); + if (mem == (char *)-1) + return NULL; + /* + * The memory policy of the allocated pages is preserved by mremap(), so + * there is no need to (re)set it here. If the policy of the original + * allocation is not set, the new pages will be allocated according to the + * process' mempolicy. Trying to allocate explicitly the new pages on the + * same node as the original ones would require changing the policy of the + * newly allocated pages, which violates the numa_realloc() semantics. 
+ */ + return mem; +} + +void *numa_alloc_interleaved_subset_v1(size_t size, const nodemask_t *mask) +{ + char *mem; + struct bitmask bitmask; + + mem = mmap(0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, + 0, 0); + if (mem == (char *)-1) + return NULL; + bitmask.maskp = (unsigned long *)mask; + bitmask.size = sizeof(nodemask_t); + dombind(mem, size, MPOL_INTERLEAVE, &bitmask); + return mem; +} +backward_symver(numa_alloc_interleaved_subset_v1,numa_alloc_interleaved_subset); + +void *numa_alloc_interleaved_subset_v2(size_t size, struct bitmask *bmp) +{ + char *mem; + + mem = mmap(0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, + 0, 0); + if (mem == (char *)-1) + return NULL; + dombind(mem, size, MPOL_INTERLEAVE, bmp); + return mem; +} +symver(numa_alloc_interleaved_subset_v2,numa_alloc_interleaved_subset); + +make_internal_alias(numa_alloc_interleaved_subset_v1); +make_internal_alias(numa_alloc_interleaved_subset_v2); + +void * +numa_alloc_interleaved(size_t size) +{ + return numa_alloc_interleaved_subset_v2_int(size, numa_all_nodes_ptr); +} + +/* + * given a user node mask, set memory policy to use those nodes + */ +void +numa_set_interleave_mask_v1(nodemask_t *mask) +{ + struct bitmask *bmp; + int nnodes = numa_max_possible_node_v1_int()+1; + + bmp = numa_bitmask_alloc(nnodes); + copy_nodemask_to_bitmask(mask, bmp); + if (numa_bitmask_equal(bmp, numa_no_nodes_ptr)) + setpol(MPOL_DEFAULT, bmp); + else + setpol(MPOL_INTERLEAVE, bmp); + numa_bitmask_free(bmp); +} + +backward_symver(numa_set_interleave_mask_v1,numa_set_interleave_mask); + +void +numa_set_interleave_mask_v2(struct bitmask *bmp) +{ + if (numa_bitmask_equal(bmp, numa_no_nodes_ptr)) + setpol(MPOL_DEFAULT, bmp); + else + setpol(MPOL_INTERLEAVE, bmp); +} +symver(numa_set_interleave_mask_v2,numa_set_interleave_mask); + +nodemask_t +numa_get_interleave_mask_v1(void) +{ + int oldpolicy = -1; + struct bitmask *bmp; + nodemask_t mask; + + bmp = allocate_nodemask_v1(); + getpol(&oldpolicy, 
bmp); + if (oldpolicy == MPOL_INTERLEAVE) + copy_bitmask_to_nodemask(bmp, &mask); + else + copy_bitmask_to_nodemask(numa_no_nodes_ptr, &mask); + numa_bitmask_free(bmp); + return mask; +} +backward_symver(numa_get_interleave_mask_v1,numa_get_interleave_mask); + +struct bitmask * +numa_get_interleave_mask_v2(void) +{ + int oldpolicy = -1; + struct bitmask *bmp; + + bmp = numa_allocate_nodemask(); + getpol(&oldpolicy, bmp); + if (oldpolicy != MPOL_INTERLEAVE) + copy_bitmask_to_bitmask(numa_no_nodes_ptr, bmp); + return bmp; +} +symver(numa_get_interleave_mask_v2,numa_get_interleave_mask); + +/* (undocumented) */ +int numa_get_interleave_node(void) +{ + int nd; + if (get_mempolicy(&nd, NULL, 0, 0, MPOL_F_NODE) == 0) + return nd; + return 0; +} + +void *numa_alloc_onnode(size_t size, int node) +{ + char *mem; + struct bitmask *bmp; + + bmp = numa_allocate_nodemask(); + numa_bitmask_setbit(bmp, node); + mem = mmap(0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, + 0, 0); + if (mem == (char *)-1) + mem = NULL; + else + dombind(mem, size, bind_policy, bmp); + numa_bitmask_free(bmp); + return mem; +} + +void *numa_alloc_local(size_t size) +{ + char *mem; + mem = mmap(0, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, + 0, 0); + if (mem == (char *)-1) + mem = NULL; + else + dombind(mem, size, MPOL_PREFERRED, NULL); + return mem; +} + +void numa_set_bind_policy(int strict) +{ + if (strict) + bind_policy = MPOL_BIND; + else + bind_policy = MPOL_PREFERRED; +} + +void +numa_set_membind_v1(const nodemask_t *mask) +{ + struct bitmask bitmask; + + bitmask.maskp = (unsigned long *)mask; + bitmask.size = sizeof(nodemask_t); + setpol(MPOL_BIND, &bitmask); +} +backward_symver(numa_set_membind_v1,numa_set_membind); + +void +numa_set_membind_v2(struct bitmask *bmp) +{ + setpol(MPOL_BIND, bmp); +} +symver(numa_set_membind_v2,numa_set_membind); + +make_internal_alias(numa_set_membind_v2); + +/* + * copy a bitmask map body to a numa.h nodemask_t structure + */ +void 
+copy_bitmask_to_nodemask(struct bitmask *bmp, nodemask_t *nmp) +{ + int max, i; + + memset(nmp, 0, sizeof(nodemask_t)); + max = (sizeof(nodemask_t)*8); + for (i=0; i<bmp->size; i++) { + if (i >= max) + break; + if (numa_bitmask_isbitset(bmp, i)) + nodemask_set_compat((nodemask_t *)nmp, i); + } +} + +/* + * copy a bitmask map body to another bitmask body + * fill a larger destination with zeroes + */ +void +copy_bitmask_to_bitmask(struct bitmask *bmpfrom, struct bitmask *bmpto) +{ + int bytes; + + if (bmpfrom->size >= bmpto->size) { + memcpy(bmpto->maskp, bmpfrom->maskp, CPU_BYTES(bmpto->size)); + } else if (bmpfrom->size < bmpto->size) { + bytes = CPU_BYTES(bmpfrom->size); + memcpy(bmpto->maskp, bmpfrom->maskp, bytes); + memset(((char *)bmpto->maskp)+bytes, 0, + CPU_BYTES(bmpto->size)-bytes); + } +} + +/* + * copy a numa.h nodemask_t structure to a bitmask map body + */ +void +copy_nodemask_to_bitmask(nodemask_t *nmp, struct bitmask *bmp) +{ + int max, i; + + numa_bitmask_clearall(bmp); + max = (sizeof(nodemask_t)*8); + if (max > bmp->size) + max = bmp->size; + for (i=0; i<max; i++) { + if (nodemask_isset_compat(nmp, i)) + numa_bitmask_setbit(bmp, i); + } +} + +nodemask_t +numa_get_membind_v1(void) +{ + int oldpolicy = -1; + struct bitmask *bmp; + nodemask_t nmp; + + bmp = allocate_nodemask_v1(); + getpol(&oldpolicy, bmp); + if (oldpolicy == MPOL_BIND) { + copy_bitmask_to_nodemask(bmp, &nmp); + } else { + /* copy the body of the map to numa_all_nodes */ + copy_bitmask_to_nodemask(bmp, &numa_all_nodes); + nmp = numa_all_nodes; + } + numa_bitmask_free(bmp); + return nmp; +} +backward_symver(numa_get_membind_v1,numa_get_membind); + +struct bitmask * +numa_get_membind_v2(void) +{ + int oldpolicy = -1; + struct bitmask *bmp; + + bmp = numa_allocate_nodemask(); + getpol(&oldpolicy, bmp); + if (oldpolicy != MPOL_BIND) + copy_bitmask_to_bitmask(numa_all_nodes_ptr, bmp); + return bmp; +} +symver(numa_get_membind_v2,numa_get_membind); + +//TODO: do we need a v1 nodemask_t 
version? +struct bitmask *numa_get_mems_allowed(void) +{ + struct bitmask *bmp; + + /* + * can change, so query on each call. + */ + bmp = numa_allocate_nodemask(); + if (get_mempolicy(NULL, bmp->maskp, bmp->size + 1, 0, + MPOL_F_MEMS_ALLOWED) < 0) + numa_error("get_mempolicy"); + return bmp; +} +make_internal_alias(numa_get_mems_allowed); + +void numa_free(void *mem, size_t size) +{ + munmap(mem, size); +} + +int +numa_parse_bitmap_v1(char *line, unsigned long *mask, int ncpus) +{ + int i; + char *p = strchr(line, '\n'); + if (!p) + return -1; + + for (i = 0; p > line;i++) { + char *oldp, *endp; + oldp = p; + if (*p == ',') + --p; + while (p > line && *p != ',') + --p; + /* Eat two 32bit fields at a time to get longs */ + if (p > line && sizeof(unsigned long) == 8) { + oldp--; + memmove(p, p+1, oldp-p+1); + while (p > line && *p != ',') + --p; + } + if (*p == ',') + p++; + if (i >= CPU_LONGS(ncpus)) + return -1; + mask[i] = strtoul(p, &endp, 16); + if (endp != oldp) + return -1; + p--; + } + return 0; +} +backward_symver(numa_parse_bitmap_v1,numa_parse_bitmap); + +int +numa_parse_bitmap_v2(char *line, struct bitmask *mask) +{ + int i, ncpus; + char *p = strchr(line, '\n'); + if (!p) + return -1; + ncpus = mask->size; + + for (i = 0; p > line;i++) { + char *oldp, *endp; + oldp = p; + if (*p == ',') + --p; + while (p > line && *p != ',') + --p; + /* Eat two 32bit fields at a time to get longs */ + if (p > line && sizeof(unsigned long) == 8) { + oldp--; + memmove(p, p+1, oldp-p+1); + while (p > line && *p != ',') + --p; + } + if (*p == ',') + p++; + if (i >= CPU_LONGS(ncpus)) + return -1; + mask->maskp[i] = strtoul(p, &endp, 16); + if (endp != oldp) + return -1; + p--; + } + return 0; +} +symver(numa_parse_bitmap_v2,numa_parse_bitmap); + +void +static init_node_cpu_mask_v2(void) +{ + int nnodes = numa_max_possible_node_v2_int() + 1; + node_cpu_mask_v2 = calloc (nnodes, sizeof(struct bitmask *)); +} + +/* This would be better with some locking, but I don't want to 
make libnuma + dependent on pthreads right now. The races are relatively harmless. */ +int +numa_node_to_cpus_v1(int node, unsigned long *buffer, int bufferlen) +{ + int err = 0; + char fn[64]; + FILE *f; + char *line = NULL; + size_t len = 0; + struct bitmask bitmask; + int buflen_needed; + unsigned long *mask; + int ncpus = numa_num_possible_cpus(); + int maxnode = numa_max_node_int(); + + buflen_needed = CPU_BYTES(ncpus); + if ((unsigned)node > maxnode || bufferlen < buflen_needed) { + errno = ERANGE; + return -1; + } + if (bufferlen > buflen_needed) + memset(buffer, 0, bufferlen); + if (node_cpu_mask_v1[node]) { + memcpy(buffer, node_cpu_mask_v1[node], buflen_needed); + return 0; + } + + mask = malloc(buflen_needed); + if (!mask) + mask = (unsigned long *)buffer; + memset(mask, 0, buflen_needed); + + sprintf(fn, "/sys/devices/system/node/node%d/cpumap", node); + f = fopen(fn, "r"); + if (!f || getdelim(&line, &len, '\n', f) < 1) { + if (numa_bitmask_isbitset(numa_nodes_ptr, node)) { + numa_warn(W_nosysfs2, + "/sys not mounted or invalid. Assuming one node: %s", + strerror(errno)); + numa_warn(W_nosysfs2, + "(cannot open or correctly parse %s)", fn); + } + bitmask.maskp = (unsigned long *)mask; + bitmask.size = buflen_needed * 8; + numa_bitmask_setall(&bitmask); + err = -1; + } + if (f) + fclose(f); + + if (line && (numa_parse_bitmap_v1(line, mask, ncpus) < 0)) { + numa_warn(W_cpumap, "Cannot parse cpumap. 
Assuming one node"); + bitmask.maskp = (unsigned long *)mask; + bitmask.size = buflen_needed * 8; + numa_bitmask_setall(&bitmask); + err = -1; + } + + free(line); + memcpy(buffer, mask, buflen_needed); + + /* slightly racy, see above */ + if (node_cpu_mask_v1[node]) { + if (mask != buffer) + free(mask); + } else { + node_cpu_mask_v1[node] = mask; + } + return err; +} +backward_symver(numa_node_to_cpus_v1,numa_node_to_cpus); + +/* + * test whether a node has cpus + */ +/* This would be better with some locking, but I don't want to make libnuma + dependent on pthreads right now. The races are relatively harmless. */ +/* + * deliver a bitmask of cpus representing the cpus on a given node + */ +int +numa_node_to_cpus_v2(int node, struct bitmask *buffer) +{ + int err = 0; + int nnodes = numa_max_node(); + char fn[64], *line = NULL; + FILE *f; + size_t len = 0; + struct bitmask *mask; + + if (!node_cpu_mask_v2) + init_node_cpu_mask_v2(); + + if (node > nnodes) { + errno = ERANGE; + return -1; + } + numa_bitmask_clearall(buffer); + + if (node_cpu_mask_v2[node]) { + /* have already constructed a mask for this node */ + if (buffer->size < node_cpu_mask_v2[node]->size) { + errno = EINVAL; + numa_error("map size mismatch"); + return -1; + } + copy_bitmask_to_bitmask(node_cpu_mask_v2[node], buffer); + return 0; + } + + /* need a new mask for this node */ + mask = numa_allocate_cpumask(); + + /* this is a kernel cpumask_t (see node_read_cpumap()) */ + sprintf(fn, "/sys/devices/system/node/node%d/cpumap", node); + f = fopen(fn, "r"); + if (!f || getdelim(&line, &len, '\n', f) < 1) { + if (numa_bitmask_isbitset(numa_nodes_ptr, node)) { + numa_warn(W_nosysfs2, + "/sys not mounted or invalid. Assuming one node: %s", + strerror(errno)); + numa_warn(W_nosysfs2, + "(cannot open or correctly parse %s)", fn); + } + numa_bitmask_setall(mask); + err = -1; + } + if (f) + fclose(f); + + if (line && (numa_parse_bitmap_v2(line, mask) < 0)) { + numa_warn(W_cpumap, "Cannot parse cpumap. 
Assuming one node"); + numa_bitmask_setall(mask); + err = -1; + } + + free(line); + copy_bitmask_to_bitmask(mask, buffer); + + /* slightly racy, see above */ + /* save the mask we created */ + if (node_cpu_mask_v2[node]) { + /* how could this be? */ + if (mask != buffer) + numa_bitmask_free(mask); + } else { + /* we don't want to cache faulty result */ + if (!err) + node_cpu_mask_v2[node] = mask; + else + numa_bitmask_free(mask); + } + return err; +} +symver(numa_node_to_cpus_v2,numa_node_to_cpus); + +make_internal_alias(numa_node_to_cpus_v1); +make_internal_alias(numa_node_to_cpus_v2); + +/* report the node of the specified cpu */ +int numa_node_of_cpu(int cpu) +{ + struct bitmask *bmp; + int ncpus, nnodes, node, ret; + + ncpus = numa_num_possible_cpus(); + if (cpu > ncpus){ + errno = EINVAL; + return -1; + } + bmp = numa_bitmask_alloc(ncpus); + nnodes = numa_max_node(); + for (node = 0; node <= nnodes; node++){ + if (numa_node_to_cpus_v2_int(node, bmp) < 0) { + /* It's possible for the node to not exist */ + continue; + } + if (numa_bitmask_isbitset(bmp, cpu)){ + ret = node; + goto end; + } + } + ret = -1; + errno = EINVAL; +end: + numa_bitmask_free(bmp); + return ret; +} + +int +numa_run_on_node_mask_v1(const nodemask_t *mask) +{ + int ncpus = numa_num_possible_cpus(); + int i, k, err; + unsigned long cpus[CPU_LONGS(ncpus)], nodecpus[CPU_LONGS(ncpus)]; + memset(cpus, 0, CPU_BYTES(ncpus)); + for (i = 0; i < NUMA_NUM_NODES; i++) { + if (mask->n[i / BITS_PER_LONG] == 0) + continue; + if (nodemask_isset_compat(mask, i)) { + if (numa_node_to_cpus_v1_int(i, nodecpus, CPU_BYTES(ncpus)) < 0) { + numa_warn(W_noderunmask, + "Cannot read node cpumask from sysfs"); + continue; + } + for (k = 0; k < CPU_LONGS(ncpus); k++) + cpus[k] |= nodecpus[k]; + } + } + err = numa_sched_setaffinity_v1(0, CPU_BYTES(ncpus), cpus); + + /* The sched_setaffinity API is broken because it expects + the user to guess the kernel cpuset size. Do this in a + brute force way. 
*/ + if (err < 0 && errno == EINVAL) { + int savederrno = errno; + char *bigbuf; + static int size = -1; + if (size == -1) + size = CPU_BYTES(ncpus) * 2; + bigbuf = malloc(CPU_BUFFER_SIZE); + if (!bigbuf) { + errno = ENOMEM; + return -1; + } + errno = savederrno; + while (size <= CPU_BUFFER_SIZE) { + memcpy(bigbuf, cpus, CPU_BYTES(ncpus)); + memset(bigbuf + CPU_BYTES(ncpus), 0, + CPU_BUFFER_SIZE - CPU_BYTES(ncpus)); + err = numa_sched_setaffinity_v1_int(0, size, (unsigned long *)bigbuf); + if (err == 0 || errno != EINVAL) + break; + size *= 2; + } + savederrno = errno; + free(bigbuf); + errno = savederrno; + } + return err; +} +backward_symver(numa_run_on_node_mask_v1,numa_run_on_node_mask); + +/* + * Given a node mask (size of a kernel nodemask_t) (probably populated by + * a user argument list) set up a map of cpus (map "cpus") on those nodes. + * Then set affinity to those cpus. + */ +int +numa_run_on_node_mask_v2(struct bitmask *bmp) +{ + int ncpus, i, k, err; + struct bitmask *cpus, *nodecpus; + + cpus = numa_allocate_cpumask(); + ncpus = cpus->size; + nodecpus = numa_allocate_cpumask(); + + for (i = 0; i < bmp->size; i++) { + if (bmp->maskp[i / BITS_PER_LONG] == 0) + continue; + if (numa_bitmask_isbitset(bmp, i)) { + /* + * numa_all_nodes_ptr is cpuset aware; use only + * these nodes + */ + if (!numa_bitmask_isbitset(numa_all_nodes_ptr, i)) { + numa_warn(W_noderunmask, + "node %d not allowed", i); + continue; + } + if (numa_node_to_cpus_v2_int(i, nodecpus) < 0) { + numa_warn(W_noderunmask, + "Cannot read node cpumask from sysfs"); + continue; + } + for (k = 0; k < CPU_LONGS(ncpus); k++) + cpus->maskp[k] |= nodecpus->maskp[k]; + } + } + err = numa_sched_setaffinity_v2_int(0, cpus); + + numa_bitmask_free(cpus); + numa_bitmask_free(nodecpus); + + /* used to have to consider that this could fail - it shouldn't now */ + if (err < 0) { + numa_error("numa_sched_setaffinity_v2_int() failed; abort\n"); + } + + return err; +} 
+symver(numa_run_on_node_mask_v2,numa_run_on_node_mask); + +make_internal_alias(numa_run_on_node_mask_v2); + +/* + * Given a node mask (size of a kernel nodemask_t) (probably populated by + * a user argument list) set up a map of cpus (map "cpus") on those nodes + * without any cpuset awareness. Then set affinity to those cpus. + */ +int +numa_run_on_node_mask_all(struct bitmask *bmp) +{ + int ncpus, i, k, err; + struct bitmask *cpus, *nodecpus; + + cpus = numa_allocate_cpumask(); + ncpus = cpus->size; + nodecpus = numa_allocate_cpumask(); + + for (i = 0; i < bmp->size; i++) { + if (bmp->maskp[i / BITS_PER_LONG] == 0) + continue; + if (numa_bitmask_isbitset(bmp, i)) { + if (!numa_bitmask_isbitset(numa_possible_nodes_ptr, i)) { + numa_warn(W_noderunmask, + "node %d not allowed", i); + continue; + } + if (numa_node_to_cpus_v2_int(i, nodecpus) < 0) { + numa_warn(W_noderunmask, + "Cannot read node cpumask from sysfs"); + continue; + } + for (k = 0; k < CPU_LONGS(ncpus); k++) + cpus->maskp[k] |= nodecpus->maskp[k]; + } + } + err = numa_sched_setaffinity_v2_int(0, cpus); + + numa_bitmask_free(cpus); + numa_bitmask_free(nodecpus); + + /* With possible nodes freedom it can happen easily now */ + if (err < 0) { + numa_error("numa_sched_setaffinity_v2_int() failed"); + } + + return err; +} + +nodemask_t +numa_get_run_node_mask_v1(void) +{ + int ncpus = numa_num_configured_cpus(); + int i, k; + int max = numa_max_node_int(); + struct bitmask *bmp, *cpus, *nodecpus; + nodemask_t nmp; + + cpus = numa_allocate_cpumask(); + if (numa_sched_getaffinity_v2_int(0, cpus) < 0){ + nmp = numa_no_nodes; + goto free_cpus; + } + + nodecpus = numa_allocate_cpumask(); + bmp = allocate_nodemask_v1(); /* the size of a nodemask_t */ + for (i = 0; i <= max; i++) { + if (numa_node_to_cpus_v2_int(i, nodecpus) < 0) { + /* It's possible for the node to not exist */ + continue; + } + for (k = 0; k < CPU_LONGS(ncpus); k++) { + if (nodecpus->maskp[k] & cpus->maskp[k]) + numa_bitmask_setbit(bmp, i); + } 
+ } + copy_bitmask_to_nodemask(bmp, &nmp); + numa_bitmask_free(bmp); + numa_bitmask_free(nodecpus); +free_cpus: + numa_bitmask_free(cpus); + return nmp; +} +backward_symver(numa_get_run_node_mask_v1,numa_get_run_node_mask); + +struct bitmask * +numa_get_run_node_mask_v2(void) +{ + int i, k; + int ncpus = numa_num_configured_cpus(); + int max = numa_max_node_int(); + struct bitmask *bmp, *cpus, *nodecpus; + + bmp = numa_allocate_cpumask(); + cpus = numa_allocate_cpumask(); + if (numa_sched_getaffinity_v2_int(0, cpus) < 0){ + copy_bitmask_to_bitmask(numa_no_nodes_ptr, bmp); + goto free_cpus; + } + + nodecpus = numa_allocate_cpumask(); + for (i = 0; i <= max; i++) { + /* + * numa_all_nodes_ptr is cpuset aware; show only + * these nodes + */ + if (!numa_bitmask_isbitset(numa_all_nodes_ptr, i)) { + continue; + } + if (numa_node_to_cpus_v2_int(i, nodecpus) < 0) { + /* It's possible for the node to not exist */ + continue; + } + for (k = 0; k < CPU_LONGS(ncpus); k++) { + if (nodecpus->maskp[k] & cpus->maskp[k]) + numa_bitmask_setbit(bmp, i); + } + } + numa_bitmask_free(nodecpus); +free_cpus: + numa_bitmask_free(cpus); + return bmp; +} +symver(numa_get_run_node_mask_v2,numa_get_run_node_mask); + +int +numa_migrate_pages(int pid, struct bitmask *fromnodes, struct bitmask *tonodes) +{ + int numa_num_nodes = numa_num_possible_nodes(); + + return migrate_pages(pid, numa_num_nodes + 1, fromnodes->maskp, + tonodes->maskp); +} + +int numa_move_pages(int pid, unsigned long count, + void **pages, const int *nodes, int *status, int flags) +{ + return move_pages(pid, count, pages, nodes, status, flags); +} + +int numa_run_on_node(int node) +{ + int numa_num_nodes = numa_num_possible_nodes(); + int ret = -1; + struct bitmask *cpus; + + if (node >= numa_num_nodes){ + errno = EINVAL; + goto out; + } + + cpus = numa_allocate_cpumask(); + + if (node == -1) + numa_bitmask_setall(cpus); + else if (numa_node_to_cpus_v2_int(node, cpus) < 0){ + numa_warn(W_noderunmask, "Cannot read node 
cpumask from sysfs"); + goto free; + } + + ret = numa_sched_setaffinity_v2_int(0, cpus); +free: + numa_bitmask_free(cpus); +out: + return ret; +} + +int numa_preferred(void) +{ + int policy; + int ret; + struct bitmask *bmp; + + bmp = numa_allocate_nodemask(); + getpol(&policy, bmp); + if (policy == MPOL_PREFERRED || policy == MPOL_BIND) { + int i; + int max = numa_num_possible_nodes(); + for (i = 0; i < max ; i++) + if (numa_bitmask_isbitset(bmp, i)){ + ret = i; + goto end; + } + } + /* could read the current CPU from /proc/self/status. Probably + not worth it. */ + ret = 0; /* or random one? */ +end: + numa_bitmask_free(bmp); + return ret; +} + +void numa_set_preferred(int node) +{ + struct bitmask *bmp; + + bmp = numa_allocate_nodemask(); + if (node >= 0) { + numa_bitmask_setbit(bmp, node); + setpol(MPOL_PREFERRED, bmp); + } else + setpol(MPOL_DEFAULT, bmp); + numa_bitmask_free(bmp); +} + +void numa_set_localalloc(void) +{ + setpol(MPOL_DEFAULT, numa_no_nodes_ptr); +} + +void numa_bind_v1(const nodemask_t *nodemask) +{ + struct bitmask bitmask; + + bitmask.maskp = (unsigned long *)nodemask; + bitmask.size = sizeof(nodemask_t); + numa_run_on_node_mask_v2_int(&bitmask); + numa_set_membind_v2_int(&bitmask); +} +backward_symver(numa_bind_v1,numa_bind); + +void numa_bind_v2(struct bitmask *bmp) +{ + numa_run_on_node_mask_v2_int(bmp); + numa_set_membind_v2_int(bmp); +} +symver(numa_bind_v2,numa_bind); + +void numa_set_strict(int flag) +{ + if (flag) + mbind_flags |= MPOL_MF_STRICT; + else + mbind_flags &= ~MPOL_MF_STRICT; +} + +/* + * Extract a node or processor number from the given string. 
+ * Allow a relative node / processor specification within the allowed + * set if "relative" is nonzero. + */ +static unsigned long get_nr(const char *s, char **end, struct bitmask *bmp, int relative) +{ + long i, nr; + + if (!relative) + return strtoul(s, end, 0); + + nr = strtoul(s, end, 0); + if (s == *end) + return nr; + /* Find the nth set bit */ + for (i = 0; nr >= 0 && i <= bmp->size; i++) + if (numa_bitmask_isbitset(bmp, i)) + nr--; + return i-1; +} + +/* + * __numa_parse_nodestring() is called to create a node mask, given + * an ascii string such as 25 or 12-15 or 1,3,5-7 or +6-10. + * (the + indicates that the numbers are nodeset-relative) + * + * The nodes may be specified as absolute, or relative to the current nodeset. + * The list of available nodes is in a map pointed to by "allowed_nodes_ptr", + * which may represent all nodes or the nodes in the current nodeset. + * + * The caller must free the returned bitmask. + */ +static struct bitmask * +__numa_parse_nodestring(const char *s, struct bitmask *allowed_nodes_ptr) +{ + int invert = 0, relative = 0; + int conf_nodes = numa_num_configured_nodes(); + char *end; + struct bitmask *mask; + + mask = numa_allocate_nodemask(); + + if (s[0] == 0){ + copy_bitmask_to_bitmask(numa_no_nodes_ptr, mask); + return mask; /* return freeable mask */ + } + if (*s == '!') { + invert = 1; + s++; + } + if (*s == '+') { + relative++; + s++; + } + do { + unsigned long arg; + int i; + if (isalpha(*s)) { + int n; + if (!strcmp(s,"all")) { + copy_bitmask_to_bitmask(allowed_nodes_ptr, + mask); + s+=4; + break; + } + n = resolve_affinity(s, mask); + if (n != NO_IO_AFFINITY) { + if (n < 0) + goto err; + s += strlen(s) + 1; + break; + } + } + arg = get_nr(s, &end, allowed_nodes_ptr, relative); + if (end == s) { + numa_warn(W_nodeparse, "unparseable node description `%s'\n", s); + goto err; + } + if (!numa_bitmask_isbitset(allowed_nodes_ptr, arg)) { + numa_warn(W_nodeparse, "node argument %lu is out of range\n", arg); + goto err; +
} + i = arg; + numa_bitmask_setbit(mask, i); + s = end; + if (*s == '-') { + char *end2; + unsigned long arg2; + arg2 = get_nr(++s, &end2, allowed_nodes_ptr, relative); + if (end2 == s) { + numa_warn(W_nodeparse, "missing node argument %s\n", s); + goto err; + } + if (!numa_bitmask_isbitset(allowed_nodes_ptr, arg2)) { + numa_warn(W_nodeparse, "node argument %d out of range\n", arg2); + goto err; + } + while (arg <= arg2) { + i = arg; + if (numa_bitmask_isbitset(allowed_nodes_ptr,i)) + numa_bitmask_setbit(mask, i); + arg++; + } + s = end2; + } + } while (*s++ == ','); + if (s[-1] != '\0') + goto err; + if (invert) { + int i; + for (i = 0; i < conf_nodes; i++) { + if (numa_bitmask_isbitset(mask, i)) + numa_bitmask_clearbit(mask, i); + else + numa_bitmask_setbit(mask, i); + } + } + return mask; + +err: + numa_bitmask_free(mask); + return NULL; +} + +/* + * numa_parse_nodestring() is called to create a bitmask from nodes available + * for this task. + */ + +struct bitmask * numa_parse_nodestring(const char *s) +{ + return __numa_parse_nodestring(s, numa_all_nodes_ptr); +} + +/* + * numa_parse_nodestring_all() is called to create a bitmask from all nodes + * available. + */ + +struct bitmask * numa_parse_nodestring_all(const char *s) +{ + return __numa_parse_nodestring(s, numa_possible_nodes_ptr); +} + +/* + * __numa_parse_cpustring() is called to create a bitmask, given + * an ascii string such as 25 or 12-15 or 1,3,5-7 or +6-10. + * (the + indicates that the numbers are cpuset-relative) + * + * The cpus may be specified as absolute, or relative to the current cpuset. + * The list of available cpus for this task is in the map pointed to by + * "allowed_cpus_ptr", which may represent all cpus or the cpus in the + * current cpuset. + * + * The caller must free the returned bitmask. 
+ */ +static struct bitmask * +__numa_parse_cpustring(const char *s, struct bitmask *allowed_cpus_ptr) +{ + int invert = 0, relative=0; + int conf_cpus = numa_num_configured_cpus(); + char *end; + struct bitmask *mask; + + mask = numa_allocate_cpumask(); + + if (s[0] == 0) + return mask; + if (*s == '!') { + invert = 1; + s++; + } + if (*s == '+') { + relative++; + s++; + } + do { + unsigned long arg; + int i; + + if (!strcmp(s,"all")) { + copy_bitmask_to_bitmask(allowed_cpus_ptr, mask); + s+=4; + break; + } + arg = get_nr(s, &end, allowed_cpus_ptr, relative); + if (end == s) { + numa_warn(W_cpuparse, "unparseable cpu description `%s'\n", s); + goto err; + } + if (!numa_bitmask_isbitset(allowed_cpus_ptr, arg)) { + numa_warn(W_cpuparse, "cpu argument %s is out of range\n", s); + goto err; + } + i = arg; + numa_bitmask_setbit(mask, i); + s = end; + if (*s == '-') { + char *end2; + unsigned long arg2; + int i; + arg2 = get_nr(++s, &end2, allowed_cpus_ptr, relative); + if (end2 == s) { + numa_warn(W_cpuparse, "missing cpu argument %s\n", s); + goto err; + } + if (!numa_bitmask_isbitset(allowed_cpus_ptr, arg2)) { + numa_warn(W_cpuparse, "cpu argument %s out of range\n", s); + goto err; + } + while (arg <= arg2) { + i = arg; + if (numa_bitmask_isbitset(allowed_cpus_ptr, i)) + numa_bitmask_setbit(mask, i); + arg++; + } + s = end2; + } + } while (*s++ == ','); + if (s[-1] != '\0') + goto err; + if (invert) { + int i; + for (i = 0; i < conf_cpus; i++) { + if (numa_bitmask_isbitset(mask, i)) + numa_bitmask_clearbit(mask, i); + else + numa_bitmask_setbit(mask, i); + } + } + return mask; + +err: + numa_bitmask_free(mask); + return NULL; +} + +/* + * numa_parse_cpustring() is called to create a bitmask from cpus available + * for this task. + */ + +struct bitmask * numa_parse_cpustring(const char *s) +{ + return __numa_parse_cpustring(s, numa_all_cpus_ptr); +} + +/* + * numa_parse_cpustring_all() is called to create a bitmask from all cpus + * available. 
+ */ + +struct bitmask * numa_parse_cpustring_all(const char *s) +{ + return __numa_parse_cpustring(s, numa_possible_cpus_ptr); +}
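Note on the parsers above: the node and cpu string grammar ("25", "12-15", "1,3,5-7", and "+6-10" for set-relative numbering) can be sketched in a few lines of standalone C. This is only an illustration under stated assumptions, not libnuma's implementation. The names parse_list and nth_set_bit are hypothetical, the mask is a plain 64-bit word rather than a struct bitmask, and every member of a range must be allowed, whereas the code above checks only the endpoints and skips disallowed members. The "!" and "all" forms are left out, and an empty string is rejected instead of yielding an empty mask.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: index of the nth (0-based) set bit in mask, or -1. */
static int nth_set_bit(uint64_t mask, unsigned long n)
{
	for (int i = 0; i < 64; i++) {
		if (mask & (UINT64_C(1) << i)) {
			if (n == 0)
				return i;
			n--;
		}
	}
	return -1;
}

/*
 * Hypothetical helper: parse "1,3,5-7" (or "+0-1" for positions relative
 * to the allowed set) into *out. Returns 0 on success, -1 on any error.
 */
static int parse_list(const char *s, uint64_t allowed, uint64_t *out)
{
	int relative = 0;
	uint64_t mask = 0;

	if (*s == '+') {
		relative = 1;
		s++;
	}
	for (;;) {
		char *end;
		unsigned long lo = strtoul(s, &end, 10), hi;

		if (end == s)
			return -1;	/* no digits where a number was expected */
		hi = lo;
		s = end;
		if (*s == '-') {
			s++;
			hi = strtoul(s, &end, 10);
			if (end == s || hi < lo)
				return -1;
			s = end;
		}
		for (unsigned long v = lo; v <= hi; v++) {
			int bit = relative ? nth_set_bit(allowed, v)
					   : (v < 64 ? (int)v : -1);

			if (bit < 0 || !(allowed & (UINT64_C(1) << bit)))
				return -1;	/* outside the allowed set */
			mask |= UINT64_C(1) << bit;
		}
		if (*s == '\0')
			break;
		if (*s != ',')
			return -1;
		s++;
	}
	*out = mask;
	return 0;
}
```

With an allowed set of nodes 4-7 (0xF0), parse_list("+0-1", 0xF0, &m) selects the first two allowed nodes, 4 and 5, which mirrors the "find the nth set bit" step in get_nr().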
diff --git a/m4/ax_am_override_var.m4 b/m4/ax_am_override_var.m4 new file mode 100644 index 0000000..21803fa --- /dev/null +++ b/m4/ax_am_override_var.m4
@@ -0,0 +1,155 @@ +# =========================================================================== +# http://www.gnu.org/software/autoconf-archive/ax_am_override_var.html +# =========================================================================== +# +# SYNOPSIS +# +# AX_AM_OVERRIDE_VAR([varname1 varname ... ]) +# AX_AM_OVERRIDE_FINALIZE +# +# DESCRIPTION +# +# This autoconf macro generalizes the approach given in +# <http://lists.gnu.org/archive/html/automake/2005-09/msg00108.html> which +# moves user specified values for variable 'varname' given at configure +# time into the corresponding AM_${varname} variable and clears out +# 'varname', allowing further manipulation by the configure script so that +# target specific variables can be given specialized versions. 'varname +# may still be specified on the make command line and will be appended as +# usual. +# +# As an example usage, consider a project which might benefit from +# different compiler flags for different components. Typically this is +# done via target specific flags, e.g. +# +# libgtest_la_CXXFLAGS = \ +# -I $(top_srcdir)/tests \ +# -I $(top_builddir)/tests \ +# $(GTEST_CXXFLAGS) +# +# automake will automatically append $(CXXFLAGS) -- provided by the user +# -- to the build rule for libgtest_la. That might be problematic, as +# CXXFLAGS may contain compiler options which are inappropriate for +# libgtest_la. +# +# The approach laid out in the referenced mailing list message is to +# supply a base value for a variable during _configure_ time, during which +# it is possible to amend it for specific targets. The user may +# subsequently specify a value for the variable during _build_ time, which +# make will apply (via the standard automake rules) to all appropriate +# targets. +# +# For example, +# +# AX_AM_OVERRIDE_VAR([CXXFLAGS]) +# +# will store the value of CXXFLAGS specified at configure time into the +# AM_CXXFLAGS variable, AC_SUBST it, and clear CXXFLAGS. 
configure may +# then create a target specific set of flags based upon AM_CXXFLAGS, e.g. +# +# # googletest uses variadic macros, which g++ -pedantic-errors +# # is very unhappy about +# AC_SUBST([GTEST_CXXFLAGS], +# [`AS_ECHO_N(["$AM_CXXFLAGS"]) \ +# | sed s/-pedantic-errors/-pedantic/` +# ] +# ) +# +# which would be used in a Makefile.am as above. Since CXXFLAGS is +# cleared, the configure time value will not affect the build for +# libgtest_la. +# +# Prior to _any other command_ which may set ${varname}, call +# +# AX_AM_OVERRIDE_VAR([varname]) +# +# This will preserve the value (if any) passed to configure in +# AM_${varname} and AC_SUBST([AM_${varname}). You may pass a space +# separated list of variable names, or may call AX_AM_OVERRIDE_VAR +# multiple times for the same effect. +# +# If any subsequent configure commands set ${varname} and you wish to +# capture the resultant value into AM_${varname} in the case where +# ${varname} was _not_ provided at configure time, call +# +# AX_AM_OVERRIDE_FINALIZE +# +# after _all_ commands which might affect any of the variables specified +# in calls to AX_AM_OVERRIDE_VAR. This need be done only once, but +# repeated calls will not cause harm. +# +# There is a bit of trickery required to allow further manipulation of the +# AM_${varname} in a Makefile.am file. If AM_CFLAGS is used as is in a +# Makefile.am, e.g. +# +# libfoo_la_CFLAGS = $(AM_CFLAGS) +# +# then automake will emit code in Makefile.in which sets AM_CFLAGS from +# the configure'd value. +# +# If however, AM_CFLAGS is manipulated (i.e. appended to), you will have +# to explicitly arrange for the configure'd value to be substituted: +# +# AM_CFLAGS = @AM_CFLAGS@ +# AM_CFLAGS += -lfoo +# +# or else automake will complain about using += before =. 
+# +# LICENSE +# +# Copyright (c) 2013 Smithsonian Astrophysical Observatory +# Copyright (c) 2013 Diab Jerius <djerius@cfa.harvard.edu> +# +# Copying and distribution of this file, with or without modification, are +# permitted in any medium without royalty provided the copyright notice +# and this notice are preserved. This file is offered as-is, without any +# warranty. + +#serial 1 + +AC_DEFUN([_AX_AM_OVERRIDE_INITIALIZE], +[ + m4_define([_mst_am_override_vars],[]) +]) + + +# _AX_AM_OVERRIDE_VAR(varname) +AC_DEFUN([_AX_AM_OVERRIDE_VAR], +[ + m4_define([_mst_am_override_vars], m4_defn([_mst_am_override_vars]) $1 ) + _mst_am_override_$1_set=false + + AS_IF( [test "${$1+set}" = set], + [AC_SUBST([AM_$1],["$$1"]) + $1= + _mst_am_override_$1_set=: + ] + ) +]) # _AX_AM_OVERRIDE_VAR + +# _AX_AM_OVERRIDE_FINALIZE(varname) +AC_DEFUN([_AX_AM_OVERRIDE_FINALIZE], +[ + AS_IF([$_mst_am_override_$1_set = :], + [], + [AC_SUBST([AM_$1],["$$1"]) + $1= + _mst_am_override_$1_set= + ] + ) + AC_SUBST($1) +]) # _AX_AM_OVERRIDE_FINALIZE + +AC_DEFUN([AX_AM_OVERRIDE_VAR], +[ + AC_REQUIRE([_AX_AM_OVERRIDE_INITIALIZE]) + m4_map_args_w([$1],[_AX_AM_OVERRIDE_VAR(],[)]) +])# AX_OVERRIDE_VAR + + +# AX_AM_OVERRIDE_FINALIZE +AC_DEFUN([AX_AM_OVERRIDE_FINALIZE], +[ + AC_REQUIRE([_AX_AM_OVERRIDE_INITIALIZE]) + m4_map_args_w(_mst_am_override_vars,[_AX_AM_OVERRIDE_FINALIZE(],[)]) +]) # AX_AM_OVERRIDE_FINALIZE
diff --git a/m4/ax_check_compile_flag.m4 b/m4/ax_check_compile_flag.m4 new file mode 100644 index 0000000..51df0c0 --- /dev/null +++ b/m4/ax_check_compile_flag.m4
@@ -0,0 +1,74 @@ +# =========================================================================== +# http://www.gnu.org/software/autoconf-archive/ax_check_compile_flag.html +# =========================================================================== +# +# SYNOPSIS +# +# AX_CHECK_COMPILE_FLAG(FLAG, [ACTION-SUCCESS], [ACTION-FAILURE], [EXTRA-FLAGS], [INPUT]) +# +# DESCRIPTION +# +# Check whether the given FLAG works with the current language's compiler +# or gives an error. (Warnings, however, are ignored) +# +# ACTION-SUCCESS/ACTION-FAILURE are shell commands to execute on +# success/failure. +# +# If EXTRA-FLAGS is defined, it is added to the current language's default +# flags (e.g. CFLAGS) when the check is done. The check is thus made with +# the flags: "CFLAGS EXTRA-FLAGS FLAG". This can for example be used to +# force the compiler to issue an error when a bad flag is given. +# +# INPUT gives an alternative input source to AC_COMPILE_IFELSE. +# +# NOTE: Implementation based on AX_CFLAGS_GCC_OPTION. Please keep this +# macro in sync with AX_CHECK_{PREPROC,LINK}_FLAG. +# +# LICENSE +# +# Copyright (c) 2008 Guido U. Draheim <guidod@gmx.de> +# Copyright (c) 2011 Maarten Bosmans <mkbosmans@gmail.com> +# +# This program is free software: you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the +# Free Software Foundation, either version 3 of the License, or (at your +# option) any later version. +# +# This program is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General +# Public License for more details. +# +# You should have received a copy of the GNU General Public License along +# with this program. If not, see <http://www.gnu.org/licenses/>. 
+# +# As a special exception, the respective Autoconf Macro's copyright owner +# gives unlimited permission to copy, distribute and modify the configure +# scripts that are the output of Autoconf when processing the Macro. You +# need not follow the terms of the GNU General Public License when using +# or distributing such scripts, even though portions of the text of the +# Macro appear in them. The GNU General Public License (GPL) does govern +# all other use of the material that constitutes the Autoconf Macro. +# +# This special exception to the GPL applies to versions of the Autoconf +# Macro released by the Autoconf Archive. When you make and distribute a +# modified version of the Autoconf Macro, you may extend this special +# exception to the GPL to apply to your modified version as well. + +#serial 3 + +AC_DEFUN([AX_CHECK_COMPILE_FLAG], +[AC_PREREQ(2.59)dnl for _AC_LANG_PREFIX +AS_VAR_PUSHDEF([CACHEVAR],[ax_cv_check_[]_AC_LANG_ABBREV[]flags_$4_$1])dnl +AC_CACHE_CHECK([whether _AC_LANG compiler accepts $1], CACHEVAR, [ + ax_check_save_flags=$[]_AC_LANG_PREFIX[]FLAGS + _AC_LANG_PREFIX[]FLAGS="$[]_AC_LANG_PREFIX[]FLAGS $4 $1" + AC_COMPILE_IFELSE([m4_default([$5],[AC_LANG_PROGRAM()])], + [AS_VAR_SET(CACHEVAR,[yes])], + [AS_VAR_SET(CACHEVAR,[no])]) + _AC_LANG_PREFIX[]FLAGS=$ax_check_save_flags]) +AS_IF([test x"AS_VAR_GET(CACHEVAR)" = xyes], + [m4_default([$2], :)], + [m4_default([$3], :)]) +AS_VAR_POPDEF([CACHEVAR])dnl +])dnl AX_CHECK_COMPILE_FLAGS
diff --git a/m4/ax_tls.m4 b/m4/ax_tls.m4 new file mode 100644 index 0000000..033e3b1 --- /dev/null +++ b/m4/ax_tls.m4
@@ -0,0 +1,76 @@ +# =========================================================================== +# http://www.gnu.org/software/autoconf-archive/ax_tls.html +# =========================================================================== +# +# SYNOPSIS +# +# AX_TLS([action-if-found], [action-if-not-found]) +# +# DESCRIPTION +# +# Provides a test for the compiler support of thread local storage (TLS) +# extensions. Defines TLS if it is found. Currently knows about GCC/ICC +# and MSVC. I think SunPro uses the same as GCC, and Borland apparently +# supports either. +# +# LICENSE +# +# Copyright (c) 2008 Alan Woodland <ajw05@aber.ac.uk> +# Copyright (c) 2010 Diego Elio Petteno` <flameeyes@gmail.com> +# +# This program is free software: you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the +# Free Software Foundation, either version 3 of the License, or (at your +# option) any later version. +# +# This program is distributed in the hope that it will be useful, but +# WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General +# Public License for more details. +# +# You should have received a copy of the GNU General Public License along +# with this program. If not, see <http://www.gnu.org/licenses/>. +# +# As a special exception, the respective Autoconf Macro's copyright owner +# gives unlimited permission to copy, distribute and modify the configure +# scripts that are the output of Autoconf when processing the Macro. You +# need not follow the terms of the GNU General Public License when using +# or distributing such scripts, even though portions of the text of the +# Macro appear in them. The GNU General Public License (GPL) does govern +# all other use of the material that constitutes the Autoconf Macro. +# +# This special exception to the GPL applies to versions of the Autoconf +# Macro released by the Autoconf Archive. 
When you make and distribute a +# modified version of the Autoconf Macro, you may extend this special +# exception to the GPL to apply to your modified version as well. + +#serial 10 + +AC_DEFUN([AX_TLS], [ + AC_MSG_CHECKING(for thread local storage (TLS) class) + AC_CACHE_VAL(ac_cv_tls, [ + ax_tls_keywords="__thread __declspec(thread) none" + for ax_tls_keyword in $ax_tls_keywords; do + AS_CASE([$ax_tls_keyword], + [none], [ac_cv_tls=none ; break], + [AC_TRY_COMPILE( + [#include <stdlib.h> + static void + foo(void) { + static ] $ax_tls_keyword [ int bar; + exit(1); + }], + [], + [ac_cv_tls=$ax_tls_keyword ; break], + ac_cv_tls=none + )]) + done + ]) + AC_MSG_RESULT($ac_cv_tls) + + AS_IF([test "$ac_cv_tls" != "none"], + AC_DEFINE_UNQUOTED([TLS], $ac_cv_tls, [If the compiler supports a TLS storage class define it to that here]) + m4_ifnblank([$1], [$1]), + m4_ifnblank([$2], [$2]) + ) +])
diff --git a/manlinks b/manlinks new file mode 100644 index 0000000..811c925 --- /dev/null +++ b/manlinks
@@ -0,0 +1,5 @@ +#!/bin/sh +# print names of all functions listed in numa.3 +# no globals + +grep '^\.BI.*numa.*(' numa.3 | sed -e 's/.*\(numa_.*\)(.*/\1/'
diff --git a/memhog.c b/memhog.c new file mode 100644 index 0000000..361a2ed --- /dev/null +++ b/memhog.c
@@ -0,0 +1,145 @@ +/* Copyright (C) 2003,2004 Andi Kleen, SuSE Labs. + Allocate memory with policy for testing. + + numactl is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; version + 2. + + numactl is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should find a copy of v2 of the GNU General Public License somewhere + on your Linux system; if not, write to the Free Software Foundation, + Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ + +#include <stdlib.h> +#include <stdio.h> +#include <sys/mman.h> +#include <sys/fcntl.h> +#include <string.h> +#include <stdbool.h> +#include "numa.h" +#include "numaif.h" +#include "util.h" + +#define terr(x) perror(x) + +enum { + UNIT = 10*1024*1024, +}; + +#ifndef MADV_NOHUGEPAGE +#define MADV_NOHUGEPAGE 15 +#endif + +int repeat = 1; + +void usage(void) +{ + printf("memhog [-rNUM] size[kmg] [policy [nodeset]]\n"); + printf("-rNUM repeat memset NUM times\n"); + printf("-H disable transparent hugepages\n"); + print_policies(); + exit(1); +} + +long length; + +void hog(void *map) +{ + long i; + for (i = 0; i < length; i += UNIT) { + long left = length - i; + if (left > UNIT) + left = UNIT; + putchar('.'); + fflush(stdout); + memset(map + i, 0xff, left); + } + putchar('\n'); +} + +int main(int ac, char **av) +{ + char *map; + struct bitmask *nodes, *gnodes; + int policy, gpolicy; + int ret = 0; + int loose = 0; + int i; + int fd = -1; + bool disable_hugepage = false; + + nodes = numa_allocate_nodemask(); + gnodes = numa_allocate_nodemask(); + + while (av[1] && av[1][0] == '-') { + switch (av[1][1]) { + case 'f': + fd = open(av[1]+2, O_RDWR); + if (fd < 0) + perror(av[1]+2); + break; + case 'r': + repeat = atoi(av[1] + 2); + break; + 
case 'H': + disable_hugepage = true; + break; + default: + usage(); + } + av++; + } + + if (!av[1]) usage(); + + length = memsize(av[1]); + if (av[2] && numa_available() < 0) { + printf("Kernel doesn't support NUMA policy\n"); + exit(1); + } else + loose = 1; + policy = parse_policy(av[2], av[3]); + if (policy != MPOL_DEFAULT) + nodes = numa_parse_nodestring(av[3]); + if (!nodes) { + printf ("<%s> is invalid\n", av[3]); + exit(1); + } + + if (fd >= 0) + map = mmap(NULL,length,PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + else + map = mmap(NULL, length, PROT_READ|PROT_WRITE, + MAP_PRIVATE|MAP_ANONYMOUS, + 0, 0); + if (map == (char*)-1) + err("mmap"); + + if (mbind(map, length, policy, nodes->maskp, nodes->size, 0) < 0) + terr("mbind"); + + if (disable_hugepage) + madvise(map, length, MADV_NOHUGEPAGE); + + gpolicy = -1; + if (get_mempolicy(&gpolicy, gnodes->maskp, gnodes->size, map, MPOL_F_ADDR) < 0) + terr("get_mempolicy"); + if (!loose && policy != gpolicy) { + ret = 1; + printf("policy %d gpolicy %d\n", policy, gpolicy); + } + if (!loose && !numa_bitmask_equal(gnodes, nodes)) { + printf("nodes differ %lx, %lx!\n", + gnodes->maskp[0], nodes->maskp[0]); + ret = 1; + } + + for (i = 0; i < repeat; i++) + hog(map); + exit(ret); +}
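Note on memhog's size argument: the usage text gives the form size[kmg], and main() hands the string to memsize() from util.h, which is not part of this file. A minimal sketch of such a parser follows. The name parse_size is hypothetical, the suffixes are assumed to be binary multiples (1k = 1024), and overflow is not checked, so this should not be read as the behaviour of the real memsize().

```c
#include <stdlib.h>

/*
 * Hypothetical helper (not util.c's memsize): parse "size[kmg]" into bytes.
 * Returns -1 on malformed input. Overflow is not checked in this sketch.
 */
static long long parse_size(const char *s)
{
	char *end;
	unsigned long long n = strtoull(s, &end, 0);

	if (end == s)
		return -1;		/* no leading number */
	switch (*end) {
	case 'g': case 'G':
		n <<= 10;
		/* fall through */
	case 'm': case 'M':
		n <<= 10;
		/* fall through */
	case 'k': case 'K':
		n <<= 10;
		end++;
		break;
	default:
		break;
	}
	if (*end != '\0')
		return -1;		/* trailing junk after the suffix */
	return (long long)n;
}
```

Under these assumptions "4k" yields 4096 and "1g" yields 1073741824, while a bare number such as "10" is taken as a byte count.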
diff --git a/migratepages.8 b/migratepages.8 new file mode 100644 index 0000000..0f7a1d0 --- /dev/null +++ b/migratepages.8
@@ -0,0 +1,75 @@ +.\" t +.\" Copyright 2005-2006 Christoph Lameter, Silicon Graphics, Inc. +.\" +.\" based on Andi Kleen's numactl manpage +.\" +.TH MIGRATEPAGES 8 "Jan 2005" "SGI" "Linux Administrator's Manual" +.SH NAME +migratepages \- Migrate the physical location of a process's pages +.SH SYNOPSIS +.B migratepages +pid from-nodes to-nodes +.SH DESCRIPTION +.B migratepages +moves the physical location of a process's pages without any changes to the +virtual address space of the process. Moving the pages allows one to change +the distance of a process to its memory. Performance may be optimized by moving +a process's pages to the node where it is executing. + +If multiple nodes are specified for from-nodes or to-nodes then +an attempt is made to preserve the relative location of +each page in each nodeset. + +For example if we move from nodes 2-5 to 7,9,12-13 then the preferred mode of +operation is to move pages from 2->7, 3->9, 4->12 and 5->13. However, this +is only possible if enough memory is available. +.TP +Valid node specifiers +.TS +tab(:); +l l. +all:All nodes +number:Node number +number1{,number2}:Node number1 and Node number2 +number1-number2:Nodes from number1 to number2 +! nodes:Invert selection of the following specification. +.TE +.SH NOTES +Requires a NUMA policy aware kernel with support for page migration +(Linux 2.6.16 and later). + +migratepages will only move pages that are not shared with other +processes if called by a user without administrative privileges (but +with the right to modify the process). + +migratepages will move all pages if invoked by root (or a user with +administrative privileges). + +.SH FILES +.I /proc/<pid>/numa_maps +for information about the NUMA memory use of a process. +.SH COPYRIGHT +Copyright 2005-2006 Christoph Lameter, Silicon Graphics, Inc. 
+migratepages is under the GNU General Public License, v.2 + +.SH SEE ALSO +.I numactl(8) +, +.I set_mempolicy(2) +, +.I get_mempolicy(2) +, +.I mbind(2) +, +.I sched_setaffinity(2) +, +.I sched_getaffinity(2) +, +.I proc(5) +, +.I ftok(3) +, +.I shmat(2) +, +.I taskset(1) +
diff --git a/migratepages.c b/migratepages.c new file mode 100644 index 0000000..61ba6cb --- /dev/null +++ b/migratepages.c
@@ -0,0 +1,105 @@ +/* + * Copyright (C) 2005 Christoph Lameter, Silicon Graphics, Incorporated. + * based on Andi Kleen's numactl.c. + * + * Manual process migration + * + * migratepages is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; version 2. + * + * migratepages is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should find a copy of v2 of the GNU General Public License somewhere + * on your Linux system; if not, write to the Free Software Foundation, + * Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + */ + +#define _GNU_SOURCE +#include <getopt.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <stdarg.h> +#include "numa.h" +#include "numaif.h" +#include "numaint.h" +#include "util.h" + +struct option opts[] = { + {"help", 0, 0, 'h' }, + { 0 } +}; + +void usage(void) +{ + fprintf(stderr, + "usage: migratepages pid from-nodes to-nodes\n" + "\n" + "nodes is a comma delimited list of node numbers or A-B ranges or all.\n" +); + exit(1); +} + +void checknuma(void) +{ + static int numa = -1; + if (numa < 0) { + if (numa_available() < 0) + complain("This system does not support NUMA functionality"); + } + numa = 0; +} + +int main(int argc, char *argv[]) +{ + int c; + char *end; + int rc; + int pid; + struct bitmask *fromnodes; + struct bitmask *tonodes; + + while ((c = getopt_long(argc,argv,"h", opts, NULL)) != -1) { + switch (c) { + default: + usage(); + } + } + + argv += optind; + argc -= optind; + + if (argc != 3) + usage(); + + checknuma(); + + pid = strtoul(argv[0], &end, 0); + if (*end || end == argv[0]) + usage(); + + fromnodes = numa_parse_nodestring(argv[1]); + if (!fromnodes) { 
+ printf ("<%s> is invalid\n", argv[1]); + exit(1); + } + tonodes = numa_parse_nodestring(argv[2]); + if (!tonodes) { + printf ("<%s> is invalid\n", argv[2]); + exit(1); + } + + rc = numa_migrate_pages(pid, fromnodes, tonodes); + + if (rc < 0) { + perror("migrate_pages"); + return 1; + } + return 0; +}
diff --git a/migspeed.8 b/migspeed.8 new file mode 100644 index 0000000..6f4176e --- /dev/null +++ b/migspeed.8
@@ -0,0 +1,31 @@ +.\" t +.\" Copyright 2005-2007 Christoph Lameter, Silicon Graphics, Inc. +.\" +.\" based on Andi Kleen's numactl manpage +.\" +.TH MIGSPEED 8 "April 2005" "SGI" "Linux Administrator's Manual" +.SH NAME +migspeed \- Test the speed of page migration +.SH SYNOPSIS +.B migspeed +-p pages from-nodes to-nodes +.SH DESCRIPTION +.B migspeed +attempts to move a sample of pages from the indicated node to the target node +and measures the time it takes to perform the move. + +.B -p pages + +The default sample is 1000 pages. Override that with another number. + +.SH NOTES +Requires a NUMA policy aware kernel with support for page migration +(Linux 2.6.16 and later). + +.SH COPYRIGHT +Copyright 2007 Christoph Lameter, Silicon Graphics, Inc. +migspeed is under the GNU General Public License, v.2 + +.SH SEE ALSO +.I numactl(8) +
diff --git a/migspeed.c b/migspeed.c new file mode 100644 index 0000000..6260c24 --- /dev/null +++ b/migspeed.c
@@ -0,0 +1,187 @@ +/* + * Migration test program + * + * (C) 2007 Silicon Graphics, Inc. Christoph Lameter <clameter@sgi.com> + * + */ +#include <stdio.h> +#include <stdlib.h> +#include "numa.h" +#include "numaif.h" +#include <time.h> +#include <errno.h> +#include <malloc.h> +#include <unistd.h> +#include "util.h" + +char *memory; + +unsigned long pages = 1000; + +unsigned long pagesize; + +const char *optstr = "hvp:"; + +char *cmd; + +int verbose; +struct timespec start,end; + +void usage(void) +{ + printf("usage %s [-p pages] [-h] [-v] from-nodes to-nodes\n", cmd); + printf(" from and to nodes may specified in form N or N-N\n"); + printf(" -p pages number of pages to try (defaults to %ld)\n", + pages); + printf(" -v verbose\n"); + printf(" -h usage\n"); + exit(1); +} + +void displaymap(void) +{ + FILE *f = fopen("/proc/self/numa_maps","r"); + + if (!f) { + printf("/proc/self/numa_maps not accessible.\n"); + exit(1); + } + + while (!feof(f)) + { + char buffer[2000]; + + if (!fgets(buffer, sizeof(buffer), f)) + break; + if (!strstr(buffer, "bind")) + continue ; + printf("%s", buffer); + + } + fclose(f); +} + +int main(int argc, char *argv[]) +{ + char *p; + int option; + struct timespec result; + unsigned long bytes; + double duration, mbytes; + struct bitmask *from; + struct bitmask *to; + + pagesize = getpagesize(); + + /* Command line processing */ + opterr = 1; + cmd = argv[0]; + + while ((option = getopt(argc, argv, optstr)) != EOF) + switch (option) { + case 'h' : + case '?' 
: + usage(); + case 'v' : + verbose++; + break; + case 'p' : + pages = strtoul(optarg, &p, 0); + if (p == optarg || *p) + usage(); + break; + } + + if (!argv[optind]) + usage(); + + if (verbose > 1) + printf("numa_max_node = %d\n", numa_max_node()); + + numa_exit_on_error = 1; + + from = numa_parse_nodestring(argv[optind]); + if (!from) { + printf ("<%s> is invalid\n", argv[optind]); + exit(1); + } + if (errno) { + perror("from mask"); + exit(1); + } + + if (verbose) + printmask("From", from); + + if (!argv[optind+1]) + usage(); + + to = numa_parse_nodestring(argv[optind+1]); + if (!to) { + printf ("<%s> is invalid\n", argv[optind+1]); + exit(1); + } + if (errno) { + perror("to mask"); + exit(1); + } + + if (verbose) + printmask("To", to); + + bytes = pages * pagesize; + + if (verbose) + printf("Allocating %lu pages of %lu bytes of memory\n", + pages, pagesize); + + memory = memalign(pagesize, bytes); + + if (!memory) { + printf("Out of Memory\n"); + exit(2); + } + + if (mbind(memory, bytes, MPOL_BIND, from->maskp, from->size, 0) < 0) + numa_error("mbind"); + + if (verbose) + printf("Dirtying memory....\n"); + + for (p = memory; p < memory + bytes; p += pagesize) + *p = 1; + + if (verbose) + printf("Starting test\n"); + + displaymap(); + clock_gettime(CLOCK_REALTIME, &start); + + if (mbind(memory, bytes, MPOL_BIND, to->maskp, to->size, MPOL_MF_MOVE) <0) + numa_error("memory move"); + + clock_gettime(CLOCK_REALTIME, &end); + displaymap(); + + result.tv_sec = end.tv_sec - start.tv_sec; + result.tv_nsec = end.tv_nsec - start.tv_nsec; + + if (result.tv_nsec < 0) { + result.tv_sec--; + result.tv_nsec += 1000000000; + } + + if (result.tv_nsec >= 1000000000) { + result.tv_sec++; + result.tv_nsec -= 1000000000; + } + + duration = result.tv_sec + result.tv_nsec / 1000000000.0; + mbytes = bytes / (1024*1024.0); + + printf("%1.1f Mbyte migrated in %1.2f secs. %3.1f Mbytes/second\n", + mbytes, + duration, + mbytes / duration); + return 0; +}
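Note on the timing code at the end of migspeed.c: the elapsed time is computed by subtracting two struct timespec values and borrowing a second when the nanosecond field goes negative. A standalone sketch of that arithmetic follows. The helper names timespec_sub and timespec_seconds are hypothetical, and both inputs are assumed to be normalized with end not earlier than start.

```c
#include <time.h>

#define NSEC_PER_SEC 1000000000L

/*
 * Hypothetical helper: return end - start, normalized so that
 * 0 <= tv_nsec < 1e9. Assumes both inputs are normalized and end >= start.
 */
static struct timespec timespec_sub(struct timespec end, struct timespec start)
{
	struct timespec d;

	d.tv_sec = end.tv_sec - start.tv_sec;
	d.tv_nsec = end.tv_nsec - start.tv_nsec;
	if (d.tv_nsec < 0) {		/* borrow one second */
		d.tv_sec--;
		d.tv_nsec += NSEC_PER_SEC;
	}
	return d;
}

/* Hypothetical helper: elapsed time as seconds in floating point. */
static double timespec_seconds(struct timespec t)
{
	return (double)t.tv_sec + (double)t.tv_nsec / 1e9;
}
```

For normalized inputs the difference of the nanosecond fields is always below one second, so only the borrow case needs handling. When measuring intervals, CLOCK_MONOTONIC is usually a safer choice than CLOCK_REALTIME, since the latter can jump if the system time is adjusted.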
diff --git a/mkolddemo b/mkolddemo new file mode 100644 index 0000000..4bf4baf --- /dev/null +++ b/mkolddemo
@@ -0,0 +1,9 @@ +# test the numacompat1.h stuff by compiling an old version of numademo.c + +cc -L. -lnuma -I. -DNUMA_VERSION1_COMPATIBILITY -o olddemo olddemo.c + + +export LD_LIBRARY_PATH=. +echo "executing olddemo:" +./olddemo +
diff --git a/move_pages.2 b/move_pages.2 new file mode 100644 index 0000000..6c98fa8 --- /dev/null +++ b/move_pages.2
@@ -0,0 +1,155 @@ +.\" Hey Emacs! This file is -*- nroff -*- source. +.\" +.\" This manpage is Copyright (C) 2006 Silicon Graphics, Inc. +.\" Christoph Lameter +.\" +.\" Permission is granted to make and distribute verbatim copies of this +.\" manual provided the copyright notice and this permission notice are +.\" preserved on all copies. +.\" +.\" Permission is granted to copy and distribute modified versions of this +.\" manual under the conditions for verbatim copying, provided that the +.\" entire resulting derived work is distributed under the terms of a +.\" permission notice identical to this one. +.\" +.TH MOVE_PAGES 2 2006-10-31 "Linux 2.6.18" "Linux Programmer's Manual" +.SH NAME +move_pages \- Move individual pages of a process to another node +.SH SYNOPSIS +.B #include <numaif.h> +.sp +.BI "long move_pages(int " pid ", unsigned long count, void ** " pages ", const int * " nodes ", int * " status ", int " flags ); +.SH DESCRIPTION +.BR move_pages () +moves +.I count +pages to the +.I nodes. +The result of the move is reflected in +.I status. +The +.I flags +indicate constraints on the pages to be moved. + +.I pid +is the ID of the process whose pages are to be moved. Sufficient rights +must exist to move pages of another process. This means the moving +process either has root privileges, has SYS_NICE administrative rights, or +has the same owner. If pid is 0 then we move pages of the current process. + +.I count +is the number of pages to move. It defines the size of the three +arrays +.I pages, +.I nodes +and +.I status. + +.I pages +is an array of pointers to the pages that should be moved. These are pointers +that should be aligned to page boundaries. Addresses are specified as seen by +the process specified by +.I pid. + +.I nodes +is either an array of integers that specify the desired location for each +page or it is NULL. Each integer is a node number. 
If NULL is specified then +move_pages will not move any pages but return the node of each page in +the +.I status +array. Having the status of each page may be necessary to determine +pages that need to be moved. + +.I status +is an array of integers that return the status of each page. The array +only contains valid values if +.I move_pages +did not return an error code. + +.I flags +specify what types of pages to move. +.B MPOL_MF_MOVE +means that only pages that are in exclusive use by the process +are to be moved. +.B MPOL_MF_MOVE_ALL +means that pages shared between multiple processes can also be moved. +The process must have root privileges or SYS_NICE privileges. + +.SH Page states in the status array + +.TP +.B 0..MAX_NUMNODES +Indicates that the location of the page is on this node. +.TP +.B -ENOENT +The page is not present. +.TP +.B -EACCES +The page is mapped by multiple processes and can only be moved +if +.I MPOL_MF_MOVE_ALL +is specified. +.TP +.B -EBUSY +The page is currently busy and cannot be moved. Try again later. +This occurs if a page is undergoing I/O or another kernel subsystem +is holding a reference to the page. +.TP +.B -EFAULT +This is a zero page or the memory area is not mapped by the process. +.TP +.B -ENOMEM +Unable to allocate memory on target node. +.TP +.B -EIO +Unable to write back a page. The page has to be written back +in order to move it since the page is dirty and the filesystem +does not provide a migration function that would allow the move +of dirty pages. +.TP +.B -EINVAL +A dirty page cannot be moved. The filesystem does not +provide a migration function and has no ability to write back pages. + +.SH "RETURN VALUE" +On success +.B move_pages +returns zero. +.SH ERRORS +.TP +.B -ENOENT +No pages were found that require moving. All pages are either already +on the target node, not present, had an invalid address or could not be +moved because they were mapped by multiple processes. 
+.TP +.B -EINVAL +Flags other than +.I MPOL_MF_MOVE +and +.I MPOL_MF_MOVE_ALL +were specified or an attempt was made to migrate pages of a kernel thread. +.TP +.B -EPERM +.I MPOL_MF_MOVE_ALL +specified without sufficient privileges or an attempt to move a process +belonging to another user. +.TP +.B -EACCES +One of the target nodes is not allowed by the current cpuset. +.TP +.B -ENODEV +One of the target nodes is not online. +.TP +.B -ESRCH +Process does not exist. +.TP +.B -E2BIG +Too many pages to move. +.TP +.B -EFAULT +Parameter array could not be accessed. +.SH "SEE ALSO" +.BR numa_maps (5), +.BR migratepages (8), +.BR numa_stat (8), +.BR numa (3)
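Reviewer note on move_pages.2: the query mode described in the page (passing NULL for nodes so that nothing is moved and status receives the node of each page) can be sketched as below. This is a minimal sketch rather than code from this patch. It assumes a Linux system whose headers define SYS_move_pages and calls the raw system call through syscall(2), so a failure shows up as a return value of -1 with errno set rather than as the negative codes listed under ERRORS. The helper names page_node and touched_page_node are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: ask the kernel which node holds the page at addr.
 * Returns the node number (>= 0), the negative per-page status code, or a
 * negative errno value if the system call itself failed. */
static int page_node(void *addr)
{
    void *pages[1] = { addr };
    int status[1] = { -ENOENT };

    /* pid 0 = calling process; nodes == NULL = query only, nothing moves. */
    long rc = syscall(SYS_move_pages, 0, 1UL, pages, NULL, status, 0);
    if (rc < 0)
        return -errno;
    return status[0];
}

/* Hypothetical demo: allocate one page-aligned page, write to it so that
 * it is present and is not the zero page, then report its node. */
static int touched_page_node(void)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    void *mem = NULL;
    int node;

    assert(pagesize > 0);
    if (posix_memalign(&mem, (size_t)pagesize, (size_t)pagesize) != 0)
        return -ENOMEM;
    *(volatile char *)mem = 1;
    node = page_node(mem);
    free(mem);
    return node;
}
```

On a kernel built without NUMA support, or in a sandbox that filters the system call, touched_page_node() is expected to return a negative errno value such as -ENOSYS or -EPERM instead of a node number.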
diff --git a/mt.c b/mt.c new file mode 100644 index 0000000..112399c --- /dev/null +++ b/mt.c
@@ -0,0 +1,46 @@ +/* Mersenne twister implementation from Michael Brundage. Public Domain. + MT is a very fast pseudo random number generator. This version works + on 32bit words. Changes by AK. */ +#include <stdlib.h> +#include "mt.h" + +int mt_index; +unsigned int mt_buffer[MT_LEN]; + +void mt_init(void) +{ + int i; + srand(1); + for (i = 0; i < MT_LEN; i++) + mt_buffer[i] = rand(); + mt_index = 0; +} + +#define MT_IA 397 +#define MT_IB (MT_LEN - MT_IA) +#define UPPER_MASK 0x80000000 +#define LOWER_MASK 0x7FFFFFFF +#define MATRIX_A 0x9908B0DF +#define TWIST(b,i,j) (((b)[i] & UPPER_MASK) | ((b)[j] & LOWER_MASK)) +#define MAGIC(s) (((s)&1)*MATRIX_A) + +void mt_refill(void) +{ + int i; + unsigned int s; + unsigned int * b = mt_buffer; + + mt_index = 0; + i = 0; + for (; i < MT_IB; i++) { + s = TWIST(b, i, i+1); + b[i] = b[i + MT_IA] ^ (s >> 1) ^ MAGIC(s); + } + for (; i < MT_LEN-1; i++) { + s = TWIST(b, i, i+1); + b[i] = b[i - MT_IB] ^ (s >> 1) ^ MAGIC(s); + } + + s = TWIST(b, MT_LEN-1, 0); + b[MT_LEN-1] = b[MT_IA-1] ^ (s >> 1) ^ MAGIC(s); +}
diff --git a/mt.h b/mt.h new file mode 100644 index 0000000..ffbf1c9 --- /dev/null +++ b/mt.h
@@ -0,0 +1,20 @@ +#define MT_LEN 624 + +extern void mt_init(void); +extern void mt_refill(void); + +extern int mt_index; +extern unsigned int mt_buffer[MT_LEN]; + +static inline unsigned int mt_random(void) +{ + unsigned int * b = mt_buffer; + int idx = mt_index; + + if (idx == MT_LEN*sizeof(unsigned int)) { + mt_refill(); + idx = 0; + } + mt_index += sizeof(unsigned int); + return *(unsigned int *)((unsigned char *)b + idx); +}
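Reviewer note on mt.h: mt_random() keeps mt_index as a byte offset into mt_buffer rather than as an element index, and regenerates the buffer once the offset reaches the buffer size in bytes. The cursor scheme on its own can be sketched as follows. This is a self-contained illustration under stated assumptions, not the Mersenne twister: BUF_LEN, buf, cursor, refills, refill and next_word are hypothetical names, and the stand-in refill() writes a simple counter sequence so that the refill points are easy to see.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_LEN 4                 /* hypothetical; mt.h uses MT_LEN 624 */

static unsigned int buf[BUF_LEN];
static size_t cursor;             /* byte offset into buf, like mt_index */
static unsigned int refills;      /* number of times buf was regenerated */

/* Stand-in for mt_refill(): writes a counter sequence, not MT output,
 * and rewinds the cursor to the start of the buffer. */
static void refill(void)
{
    size_t i;

    for (i = 0; i < BUF_LEN; i++)
        buf[i] = refills * BUF_LEN + (unsigned int)i;
    refills++;
    cursor = 0;
}

/* Same shape as mt_random(): refill when every word has been consumed,
 * then read the word at the current byte offset and advance by one word. */
static unsigned int next_word(void)
{
    unsigned int v;

    if (cursor == sizeof(buf))
        refill();
    v = *(unsigned int *)((unsigned char *)buf + cursor);
    cursor += sizeof(unsigned int);
    return v;
}
```

As with mt_init() in mt.c, the buffer has to be filled once before the first read; in this sketch that means calling refill() before the first next_word().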
diff --git a/numa.3 b/numa.3 new file mode 100644 index 0000000..ba00572 --- /dev/null +++ b/numa.3
@@ -0,0 +1,1058 @@ +.\" Copyright 2003,2004 Andi Kleen, SuSE Labs. +.\" +.\" Permission is granted to make and distribute verbatim copies of this +.\" manual provided the copyright notice and this permission notice are +.\" preserved on all copies. +.\" +.\" Permission is granted to copy and distribute modified versions of this +.\" manual under the conditions for verbatim copying, provided that the +.\" entire resulting derived work is distributed under the terms of a +.\" permission notice identical to this one. +.\" +.\" Since the Linux kernel and libraries are constantly changing, this +.\" manual page may be incorrect or out-of-date. The author(s) assume no +.\" responsibility for errors or omissions, or for damages resulting from +.\" the use of the information contained herein. +.\" +.\" Formatted or processed versions of this manual, if unaccompanied by +.\" the source, must acknowledge the copyright and authors of this work. +.TH NUMA 3 "December 2007" "SuSE Labs" "Linux Programmer's Manual" +.SH NAME +numa \- NUMA policy library +.SH SYNOPSIS +.B #include <numa.h> +.sp +.B cc ... 
\-lnuma +.sp +.B int numa_available(void); +.sp +.BI "int numa_max_possible_node(void);" +.br +.BI "int numa_num_possible_nodes();" +.sp +.B int numa_max_node(void); +.br +.BI "int numa_num_configured_nodes();" +.br +.B struct bitmask *numa_get_mems_allowed(void); +.sp +.BI "int numa_num_configured_cpus(void);" +.br +.BI "struct bitmask *numa_all_nodes_ptr;" +.br +.BI "struct bitmask *numa_no_nodes_ptr;" +.br +.BI "struct bitmask *numa_all_cpus_ptr;" +.sp +.BI "int numa_num_task_cpus();" +.br +.BI "int numa_num_task_nodes();" +.sp +.BI "int numa_parse_bitmap(char *" line " , struct bitmask *" mask "); +.br +.BI "struct bitmask *numa_parse_nodestring(const char *" string ); +.br +.BI "struct bitmask *numa_parse_nodestring_all(const char *" string ); +.br +.BI "struct bitmask *numa_parse_cpustring(const char *" string ); +.br +.BI "struct bitmask *numa_parse_cpustring_all(const char *" string ); +.sp +.BI "long numa_node_size(int " node ", long *" freep ); +.br +.BI "long long numa_node_size64(int " node ", long long *" freep ); +.sp +.B int numa_preferred(void); +.br +.BI "void numa_set_preferred(int " node ); +.br +.BI "int numa_get_interleave_node(void); +.br +.B struct bitmask *numa_get_interleave_mask(void); +.br +.BI "void numa_set_interleave_mask(struct bitmask *" nodemask ); +.br +.BI "void numa_interleave_memory(void *" start ", size_t " size ", struct bitmask *" nodemask ); +.br +.BI "void numa_bind(struct bitmask *" nodemask ); +.br +.BI "void numa_set_localalloc(void); +.br +.BI "void numa_set_membind(struct bitmask *" nodemask ); +.br +.B struct bitmask *numa_get_membind(void); +.sp +.BI "void *numa_alloc_onnode(size_t " size ", int " node ); +.br +.BI "void *numa_alloc_local(size_t " size ); +.br +.BI "void *numa_alloc_interleaved(size_t " size ); +.br +.BI "void *numa_alloc_interleaved_subset(size_t " size ", struct bitmask *" nodemask ); +.BI "void *numa_alloc(size_t " size ); +.br +.BI "void *numa_realloc(void *"old_addr ", size_t " old_size ", 
size_t " new_size ); +.br +.BI "void numa_free(void *" start ", size_t " size ); +.sp +.BI "int numa_run_on_node(int " node ); +.br +.BI "int numa_run_on_node_mask(struct bitmask *" nodemask ); +.br +.BI "int numa_run_on_node_mask_all(struct bitmask *" nodemask ); +.br +.B struct bitmask *numa_get_run_node_mask(void); +.sp +.BI "void numa_tonode_memory(void *" start ", size_t " size ", int " node ); +.br +.BI "void numa_tonodemask_memory(void *" start ", size_t " size ", struct bitmask *" nodemask ); +.br +.BI "void numa_setlocal_memory(void *" start ", size_t " size ); +.br +.BI "void numa_police_memory(void *" start ", size_t " size ); +.br +.BI "void numa_set_bind_policy(int " strict ); +.br +.BI "void numa_set_strict(int " strict ); +.sp +.\" should be undocumented ?? +.BI "int numa_distance(int " node1 ", int " node2 ); +.sp +.BI "int numa_sched_getaffinity(pid_t " pid ", struct bitmask *" mask ); +.br +.BI "int numa_sched_setaffinity(pid_t " pid ", struct bitmask *" mask ); +.br +.BI "int numa_node_to_cpus(int " node ", struct bitmask *" mask "); +.br +.BI "int numa_node_of_cpu(int " cpu "); +.sp +.BI "struct bitmask *numa_allocate_cpumask();" +.sp +.BI "void numa_free_cpumask();" +.br +.BI "struct bitmask *numa_allocate_nodemask();" +.sp +.BI "void numa_free_nodemask();" +.br +.BI "struct bitmask *numa_bitmask_alloc(unsigned int " n "); +.br +.BI "struct bitmask *numa_bitmask_clearall(struct bitmask *" bmp ); +.br +.BI "struct bitmask *numa_bitmask_clearbit(struct bitmask *" bmp ", unsigned int " n ); +.br +.BI "int numa_bitmask_equal(const struct bitmask *" bmp1 ", const struct bitmask *" bmp2 ); +.br +.BI "void numa_bitmask_free(struct bitmask *" bmp ); +.br +.BI "int numa_bitmask_isbitset(const struct bitmask *" bmp ", unsigned int " n ");" +.br +.BI "unsigned int numa_bitmask_nbytes(struct bitmask *" bmp ); +.br +.BI "struct bitmask *numa_bitmask_setall(struct bitmask *" bmp ); +.br +.BI "struct bitmask *numa_bitmask_setbit(struct bitmask *" bmp ", 
unsigned int " n ); +.br +.BI "void copy_bitmask_to_nodemask(struct bitmask *" bmp ", nodemask_t *" nodemask ) +.br +.BI "void copy_nodemask_to_bitmask(nodemask_t *" nodemask ", struct bitmask *" bmp ) +.br +.BI "void copy_bitmask_to_bitmask(struct bitmask *" bmpfrom ", struct bitmask *" bmpto ) +.br +.BI "unsigned int numa_bitmask_weight(const struct bitmask *bmp ) +.sp +.BI "int numa_move_pages(int " pid ", unsigned long " count ", void **" pages ", const int *" nodes ", int *" status ", int " flags ); +.br +.BI "int numa_migrate_pages(int " pid ", struct bitmask *" fromnodes ", struct bitmask *" tonodes ); +.sp +.BI "void numa_error(char *" where ); +.sp +.BI "extern int " numa_exit_on_error ; +.br +.BI "extern int " numa_exit_on_warn ; +.br +.BI "void numa_warn(int " number ", char *" where ", ...);" +.br + +.SH DESCRIPTION +The +.I libnuma +library offers a simple programming interface to the +NUMA (Non Uniform Memory Access) +policy supported by the +Linux kernel. On a NUMA architecture some +memory areas have different latency or bandwidth than others. + +Available policies are +page interleaving (i.e., allocate in a round-robin fashion from all, +or a subset, of the nodes on the system), +preferred node allocation (i.e., preferably allocate on a particular node), +local allocation (i.e., allocate on the node on which +the task is currently executing), +or allocation only on specific nodes (i.e., allocate on +some subset of the available nodes). +It is also possible to bind tasks to specific nodes. + +Numa memory allocation policy may be specified as a per-task attribute, +that is inherited by children tasks and processes, or as an attribute +of a range of process virtual address space. +Numa memory policies specified for a range of virtual address space are +shared by all tasks in the process. 
+Furthermore, memory policies specified for a range of a shared memory +attached using +.I shmat(2) +or +.I mmap(2) +from shmfs/hugetlbfs are shared by all processes that attach to that region. +Memory policies for shared disk backed file mappings are currently ignored. + +The default memory allocation policy for tasks and all memory range +is local allocation. +This assumes that no ancestor has installed a non-default policy. + +For setting a specific policy globally for all memory allocations +in a process and its children it is easiest +to start it with the +.BR numactl (8) +utility. For more finegrained policy inside an application this library +can be used. + +All numa memory allocation policy only takes effect when a page is actually +faulted into the address space of a process by accessing it. The +.B numa_alloc_* +functions take care of this automatically. + +A +.I node +is defined as an area where all memory has the same speed as seen from +a particular CPU. +A node can contain multiple CPUs. +Caches are ignored for this definition. + +Most functions in this library are only concerned about numa nodes and +their memory. +The exceptions to this are: +.IR numa_node_to_cpus (), +.IR numa_node_of_cpu (), +.IR numa_bind (), +.IR numa_run_on_node (), +.IR numa_run_on_node_mask (), +.IR numa_run_on_node_mask_all (), +and +.IR numa_get_run_node_mask (). +These functions deal with the CPUs associated with numa nodes. +See the descriptions below for more information. + +Some of these functions accept or return a pointer to struct bitmask. +A struct bitmask controls a bit map of arbitrary length containing a bit +representation of nodes. The predefined variable +.I numa_all_nodes_ptr +points to a bit mask that has all available nodes set; +.I numa_no_nodes_ptr +points to the empty set. + +Before any other calls in this library can be used +.BR numa_available () +must be called. If it returns \-1, all other functions in this +library are undefined. 
+ +.BR numa_max_possible_node() +returns the number of the highest possible node in a system. +In other words, the size of a kernel type nodemask_t (in bits) minus 1. +This number can be gotten by calling +.BR numa_num_possible_nodes() +and subtracting 1. + +.BR numa_num_possible_nodes() +returns the size of the kernel's node mask (kernel type nodemask_t). +In other words, large enough to represent the maximum number of nodes that +the kernel can handle. This will match the kernel's MAX_NUMNODES value. +This count is derived from /proc/self/status, field Mems_allowed. + +.BR numa_max_node () +returns the highest node number available on the current system. +(See the node numbers in /sys/devices/system/node/ ). Also see +.BR numa_num_configured_nodes(). + +.BR numa_num_configured_nodes() +returns the number of memory nodes in the system. This count +includes any nodes that are currently disabled. This count is derived from +the node numbers in /sys/devices/system/node. (Depends on the kernel being +configured with /sys (CONFIG_SYSFS)). + +.BR numa_get_mems_allowed() +returns the mask of nodes from which the process is allowed to allocate +memory in its current cpuset context. +Any nodes that are not included in the returned bitmask will be ignored +in any of the following libnuma memory policy calls. + +.BR numa_num_configured_cpus() +returns the number of cpus in the system. This count includes +any cpus that are currently disabled. This count is derived from the cpu +numbers in /sys/devices/system/cpu. If the kernel is configured without +/sys (CONFIG_SYSFS=n) then it falls back to using the number of online cpus. + +.BR numa_all_nodes_ptr +points to a bitmask that is allocated by the library with bits +representing all nodes on which the calling task may allocate memory. +This set may be up to all nodes on the system, or up to the nodes in +the current cpuset. 
+The bitmask is allocated by a call to +.BR numa_allocate_nodemask() +using size +.BR numa_max_possible_node(). +The set of nodes to record is derived from /proc/self/status, field +"Mems_allowed". The user should not alter this bitmask. + +.BR numa_no_nodes_ptr +points to a bitmask that is allocated by the library and left all +zeroes. The bitmask is allocated by a call to +.BR numa_allocate_nodemask() +using size +.BR numa_max_possible_node(). +The user should not alter this bitmask. + +.BR numa_all_cpus_ptr +points to a bitmask that is allocated by the library with bits +representing all cpus on which the calling task may execute. +This set may be up to all cpus on the system, or up to the cpus in +the current cpuset. +The bitmask is allocated by a call to +.BR numa_allocate_cpumask() +using size +.BR numa_num_possible_cpus(). +The set of cpus to record is derived from /proc/self/status, field +"Cpus_allowed". The user should not alter this bitmask. + +.BR numa_num_task_cpus() +returns the number of cpus that the calling task is allowed +to use. This count is derived from the map /proc/self/status, field +"Cpus_allowed". Also see the bitmask +.BR numa_all_cpus_ptr. + +.BR numa_num_task_nodes() +returns the number of nodes on which the calling task is +allowed to allocate memory. This count is derived from the map +/proc/self/status, field "Mems_allowed". +Also see the bitmask +.BR numa_all_nodes_ptr. + +.BR numa_parse_bitmap() +parses +.I line +, which is a character string such as found in +/sys/devices/system/node/nodeN/cpumap into a bitmask structure. +The string contains the hexadecimal representation of a bit map. +The bitmask may be allocated with +.BR numa_allocate_cpumask(). +Returns 0 on success. Returns -1 on failure. +This function is probably of little use to a user application, but +it is used by +.I libnuma +internally. + +.BR numa_parse_nodestring() +parses a character string list of nodes into a bit mask. 
+The bit mask is allocated by +.BR numa_allocate_nodemask(). +The string is a comma-separated list of node numbers or node ranges. +A leading ! can be used to indicate "not" this list (in other words, all +nodes except this list), and a leading + can be used to indicate that the +node numbers in the list are relative to the task's cpuset. The string can +be "all" to specify all ( +.BR numa_num_task_nodes() +) nodes. Node numbers are limited by the number in the system. See +.BR numa_max_node() +and +.BR numa_num_configured_nodes(). +.br +Examples: 1-5,7,10 !4-5 +0-3 +.br +If the string is of 0 length, bitmask +.BR numa_no_nodes_ptr +is returned. Returns 0 if the string is invalid. + +.BR numa_parse_nodestring_all() +is similar to +.BR numa_parse_nodestring +, but can parse all possible nodes, not only current nodeset. + +.BR numa_parse_cpustring() +parses a character string list of cpus into a bit mask. +The bit mask is allocated by +.BR numa_allocate_cpumask(). +The string is a comma-separated list of cpu numbers or cpu ranges. +A leading ! can be used to indicate "not" this list (in other words, all +cpus except this list), and a leading + can be used to indicate that the cpu +numbers in the list are relative to the task's cpuset. The string can be +"all" to specify all ( +.BR numa_num_task_cpus() +) cpus. +Cpu numbers are limited by the number in the system. See +.BR numa_num_task_cpus() +and +.BR numa_num_configured_cpus(). +.br +Examples: 1-5,7,10 !4-5 +0-3 +.br +Returns 0 if the string is invalid. + +.BR numa_parse_cpustring_all() +is similar to +.BR numa_parse_cpustring +, but can parse all possible cpus, not only current cpuset. + +.BR numa_node_size () +returns the memory size of a node. If the argument +.I freep +is not NULL, it used to return the amount of free memory on the node. +On error it returns \-1. + +.BR numa_node_size64 () +works the same as +.BR numa_node_size () +except that it returns values as +.I long long +instead of +.IR long . 
+This is useful on 32-bit architectures with large nodes. + +.BR numa_preferred () +returns the preferred node of the current task. +This is the node on which the kernel preferably +allocates memory, unless some other policy overrides this. +.\" TODO: results are misleading for MPOL_PREFERRED and may +.\" be incorrect for MPOL_BIND when Mel Gorman's twozonelist +.\" patches go in. In the latter case, we'd need to know the +.\" order of the current node's zonelist to return the correct +.\" node. Need to tighten this up with the syscall results. + +.BR numa_set_preferred () +sets the preferred node for the current task to +.IR node . +The system will attempt to allocate memory from the preferred node, +but will fall back to other nodes if no memory is available on +the preferred node. +Passing a +.I node +of \-1 specifies local allocation and is equivalent to +calling +.BR numa_set_localalloc (). + +.BR numa_get_interleave_mask () +returns the current interleave mask if the task's memory allocation policy +is page interleaved. +Otherwise, this function returns an empty mask. + +.BR numa_set_interleave_mask () +sets the memory interleave mask for the current task to +.IR nodemask . +All new memory allocations +are page interleaved over all nodes in the interleave mask. Interleaving +can be turned off again by passing an empty mask +.RI ( numa_no_nodes ). +The page interleaving only occurs on the actual page fault that puts a new +page into the current address space. It is also only a hint: the kernel +will fall back to other nodes if no memory is available on the interleave +target. +.\" NOTE: the following is not really the case. this function sets the +.\" task policy for all future allocations, including stack, bss, ... +.\" The functions specified in this sentence actually allocate a new memory +.\" range [via mmap()]. This is quite a different thing. Suggest we drop +.\" this. 
+.\" This is a low level +.\" function, it may be more convenient to use the higher level functions like +.\" .BR numa_alloc_interleaved () +.\" or +.\" .BR numa_alloc_interleaved_subset (). + +.BR numa_interleave_memory () +interleaves +.I size +bytes of memory page by page from +.I start +on nodes specified in +.IR nodemask . +The +.I size +argument will be rounded up to a multiple of the system page size. +If +.I nodemask +contains nodes that are externally denied to this process, +this call will fail. +This is a lower level function to interleave allocated but not yet faulted in +memory. Not yet faulted in means the memory is allocated using +.BR mmap (2) +or +.BR shmat (2), +but has not been accessed by the current process yet. The memory is page +interleaved to all nodes specified in +.IR nodemask . +Normally +.BR numa_alloc_interleaved () +should be used for private memory instead, but this function is useful to +handle shared memory areas. To be useful the memory area should be +several megabytes at least (or tens of megabytes of hugetlbfs mappings) +If the +.BR numa_set_strict () +flag is true then the operation will cause a numa_error if there were already +pages in the mapping that do not follow the policy. + +.BR numa_bind () +binds the current task and its children to the nodes +specified in +.IR nodemask . +They will only run on the CPUs of the specified nodes and only be able to allocate +memory from them. +This function is equivalent to calling +.\" FIXME checkme +.\" This is the case. --lts +.I numa_run_on_node_mask(nodemask) +followed by +.IR numa_set_membind(nodemask) . +If tasks should be bound to individual CPUs inside nodes +consider using +.I numa_node_to_cpus +and the +.I sched_setaffinity(2) +syscall. + +.BR numa_set_localalloc () +sets the memory allocation policy for the calling task to +local allocation. 
+In this mode, the preferred node for memory allocation is +effectively the node where the task is executing at the +time of a page allocation. + +.BR numa_set_membind () +sets the memory allocation mask. +The task will only allocate memory from the nodes set in +.IR nodemask . +Passing an empty +.I nodemask +or a +.I nodemask +that contains nodes other than those in the mask returned by +.IR numa_get_mems_allowed () +will result in an error. + +.BR numa_get_membind () +returns the mask of nodes from which memory can currently be allocated. +If the returned mask is equal to +.IR numa_all_nodes , +then memory allocation is allowed from all nodes. + +.BR numa_alloc_onnode () +allocates memory on a specific node. +The +.I size +argument will be rounded up to a multiple of the system page size. +if the specified +.I node +is externally denied to this process, this call will fail. +This function is relatively slow compared to the +.IR malloc (3), +family of functions. +The memory must be freed +with +.BR numa_free (). +On errors NULL is returned. + +.BR numa_alloc_local () +allocates +.I size +bytes of memory on the local node. +The +.I size +argument will be rounded up to a multiple of the system page size. +This function is relatively slow compared to the +.IR malloc (3) +family of functions. +The memory must be freed +with +.BR numa_free (). +On errors NULL is returned. + +.BR numa_alloc_interleaved () +allocates +.I size +bytes of memory page interleaved on all nodes. This function is relatively slow +and should only be used for large areas consisting of multiple pages. The +interleaving works at page level and will only show an effect when the +area is large. +The allocated memory must be freed with +.BR numa_free (). +On error, NULL is returned. + +.BR numa_alloc_interleaved_subset () +attempts to allocate +.I size +bytes of memory page interleaved on all nodes. +The +.I size +argument will be rounded up to a multiple of the system page size. 
+The nodes on which a process is allowed to allocate memory may +be constrained externally. +If this is the case, this function may fail. +This function is relatively slow compared to the +.IR malloc (3) +family of functions and should only be used for large areas consisting +of multiple pages. +The interleaving works at page level and will only show an effect when the +area is large. +The allocated memory must be freed with +.BR numa_free (). +On error, NULL is returned. + +.BR numa_alloc () +allocates +.I size +bytes of memory with the current NUMA policy. +The +.I size +argument will be rounded up to a multiple of the system page size. +This function is relatively slow compared to the +.IR malloc (3) +family of functions. +The memory must be freed +with +.BR numa_free (). +On errors NULL is returned. + +.BR numa_realloc () +changes the size of the memory area pointed to by +.I old_addr +from +.I old_size +to +.I new_size. +The memory area pointed to by +.I old_addr +must have been allocated with one of the +.BR numa_alloc* +functions. +The +.I new_size +will be rounded up to a multiple of the system page size. The contents of the +memory area will be unchanged to the minimum of the old and new sizes; newly +allocated memory will be uninitialized. The memory policy (and node bindings) +associated with the original memory area will be preserved in the resized +area. For example, if the initial area was allocated with a call to +.BR numa_alloc_onnode(), +then the new pages (if the area is enlarged) will be allocated on the same node. +However, if no memory policy was set for the original area, then +.BR numa_realloc () +cannot guarantee that the new pages will be allocated on the same node. On +success, the address of the resized area is returned (which might be different +from that of the initial area), otherwise NULL is returned and +.I errno +is set to indicate the error. The pointer returned by +.BR numa_realloc () +is suitable for passing to +.BR numa_free (). 
+ + +.BR numa_free () +frees +.I size +bytes of memory starting at +.IR start , +allocated by the +.B numa_alloc_* +functions above. +The +.I size +argument will be rounded up to a multiple of the system page size. + +.BR numa_run_on_node () +runs the current task and its children +on a specific node. They will not migrate to CPUs of +other nodes until the node affinity is reset with a new call to +.BR numa_run_on_node_mask (). +Passing \-1 +permits the kernel to schedule on all nodes again. +On success, 0 is returned; on error \-1 is returned, and +.I errno +is set to indicate the error. + +.BR numa_run_on_node_mask () +runs the current task and its children only on nodes specified in +.IR nodemask . +They will not migrate to CPUs of +other nodes until the node affinity is reset with a new call to +.BR numa_run_on_node_mask () +or +.BR numa_run_on_node (). +Passing +.I numa_all_nodes +permits the kernel to schedule on all nodes again. +On success, 0 is returned; on error \-1 is returned, and +.I errno +is set to indicate the error. + +.BR numa_run_on_node_mask_all () +runs the current task and its children only on nodes specified in +.IR nodemask +like +.I numa_run_on_node_mask +but without any cpuset awareness. + +.BR numa_get_run_node_mask () +returns a mask of CPUs on which the current task is allowed to run. + +.BR numa_tonode_memory () +put memory on a specific node. The constraints described for +.BR numa_interleave_memory () +apply here too. + +.BR numa_tonodemask_memory () +put memory on a specific set of nodes. The constraints described for +.BR numa_interleave_memory () +apply here too. + +.BR numa_setlocal_memory () +locates memory on the current node. The constraints described for +.BR numa_interleave_memory () +apply here too. + +.BR numa_police_memory () +locates memory with the current NUMA policy. The constraints described for +.BR numa_interleave_memory () +apply here too. 
+ +.BR numa_distance () +reports the distance in the machine topology between two nodes. +The factors are a multiple of 10. It returns 0 when the distance +cannot be determined. A node has distance 10 to itself. +Reporting the distance requires a Linux +kernel version of +.I 2.6.10 +or newer. + +.BR numa_set_bind_policy () +specifies whether calls that bind memory to a specific node should +use the preferred policy or a strict policy. +The preferred policy allows the kernel +to allocate memory on other nodes when there isn't enough free +on the target node. strict will fail the allocation in that case. +Setting the argument to 1 specifies strict, 0 preferred. +Note that specifying more than one node non strict may only use +the first node in some kernel versions. + +.BR numa_set_strict () +sets a flag that says whether the functions allocating on specific +nodes should use a strict policy. Strict means the allocation +will fail if the memory cannot be allocated on the target node. +Default operation is to fall back to other nodes. +This doesn't apply to interleave and default. + +.BR numa_get_interleave_node() +is used by +.I libnuma +internally. It is probably not useful for user applications. +It uses the MPOL_F_NODE flag of the get_mempolicy system call, which is +not intended for application use (its operation may change or be removed +altogether in future kernel versions). See get_mempolicy(2). + +.BR numa_pagesize() +returns the number of bytes in a page. This function is simply a fast +alternative to repeated calls to the getpagesize system call. +See getpagesize(2). + +.BR numa_sched_getaffinity() +retrieves a bitmask of the cpus on which a task may run. The task is +specified by +.I pid. +Returns the return value of the sched_getaffinity +system call. See sched_getaffinity(2). +The bitmask must be at least the size of the kernel's cpu mask structure. Use +.BR numa_allocate_cpumask() +to allocate it. 
+Test the bits in the mask by calling +.BR numa_bitmask_isbitset(). + +.BR numa_sched_setaffinity() +sets a task's allowed cpu's to those cpu's specified in +.I mask. +The task is specified by +.I pid. +Returns the return value of the sched_setaffinity system call. +See sched_setaffinity(2). You may allocate the bitmask with +.BR numa_allocate_cpumask(). +Or the bitmask may be smaller than the kernel's cpu mask structure. For +example, call +.BR numa_bitmask_alloc() +using a maximum number of cpus from +.BR numa_num_configured_cpus(). +Set the bits in the mask by calling +.BR numa_bitmask_setbit(). + +.BR numa_node_to_cpus () +converts a node number to a bitmask of CPUs. The user must pass a bitmask +structure with a mask buffer long enough to represent all possible cpu's. +Use numa_allocate_cpumask() to create it. If the bitmask is not long enough +.I errno +will be set to +.I ERANGE +and \-1 returned. On success 0 is returned. + +.BR numa_node_of_cpu () +returns the node that a cpu belongs to. If the user supplies an invalid cpu +.I errno +will be set to +.I EINVAL +and \-1 will be returned. + +.BR numa_allocate_cpumask +() returns a bitmask of a size equal to the kernel's cpu +mask (kernel type cpumask_t). In other words, large enough to represent +NR_CPUS cpus. This number of cpus can be gotten by calling +.BR numa_num_possible_cpus(). +The bitmask is zero-filled. + +.BR numa_free_cpumask +frees a cpumask previously allocated by +.I numa_allocate_cpumask. + +.BR numa_allocate_nodemask() +returns a bitmask of a size equal to the kernel's node +mask (kernel type nodemask_t). In other words, large enough to represent +MAX_NUMNODES nodes. This number of nodes can be gotten by calling +.BR numa_num_possible_nodes(). +The bitmask is zero-filled. + +.BR numa_free_nodemask() +frees a nodemask previously allocated by +.I numa_allocate_nodemask(). + +.BR numa_bitmask_alloc() +allocates a bitmask structure and its associated bit mask. 
+The memory allocated for the bit mask contains enough words (type unsigned +long) to contain +.I n +bits. The bit mask is zero-filled. The bitmask +structure points to the bit mask and contains the +.I n +value. + +.BR numa_bitmask_clearall() +sets all bits in the bit mask to 0. The bitmask structure +points to the bit mask and contains its size ( +.I bmp +->size). The value of +.I bmp +is always returned. Note that +.BR numa_bitmask_alloc() +creates a zero-filled bit mask. + +.BR numa_bitmask_clearbit() +sets a specified bit in a bit mask to 0. Nothing is done if +the +.I n +value is greater than the size of the bitmask (and no error is +returned). The value of +.I bmp +is always returned. + +.BR numa_bitmask_equal() +returns 1 if two bitmasks are equal. It returns 0 if they +are not equal. If the bitmask structures control bit masks of different +sizes, the "missing" trailing bits of the smaller bit mask are considered +to be 0. + +.BR numa_bitmask_free() +deallocates the memory of both the bitmask structure pointed +to by +.I bmp +and the bit mask. It is an error to attempt to free this bitmask twice. + +.BR numa_bitmask_isbitset() +returns the value of a specified bit in a bit mask. +If the +.I n +value is greater than the size of the bit map, 0 is returned. + +.BR numa_bitmask_nbytes() +returns the size (in bytes) of the bit mask controlled by +.I bmp. +The bit masks are always full words (type unsigned long), and the returned +size is the actual size of all those words. + +.BR numa_bitmask_setall() +sets all bits in the bit mask to 1. The bitmask structure +points to the bit mask and contains its size ( +.I bmp +->size). +The value of +.I bmp +is always returned. + +.BR numa_bitmask_setbit() +sets a specified bit in a bit mask to 1. Nothing is done if +.I n +is greater than the size of the bitmask (and no error is +returned). The value of +.I bmp +is always returned. 
+ +.BR copy_bitmask_to_nodemask() +copies the body (the bit map itself) of the bitmask structure pointed +to by +.I bmp +to the nodemask_t structure pointed to by the +.I nodemask +pointer. If the two areas differ in size, the copy is truncated to the size +of the receiving field or zero-filled. + +.BR copy_nodemask_to_bitmask() +copies the nodemask_t structure pointed to by the +.I nodemask +pointer to the body (the bit map itself) of the bitmask structure pointed +to by the +.I bmp +pointer. If the two areas differ in size, the copy is truncated to the size +of the receiving field or zero-filled. + +.BR copy_bitmask_to_bitmask() +copies the body (the bit map itself) of the bitmask structure pointed +to by the +.I bmpfrom +pointer to the body of the bitmask structure pointed to by the +.I bmpto +pointer. If the two areas differ in size, the copy is truncated to the size +of the receiving field or zero-filled. + +.BR numa_bitmask_weight() +returns a count of the bits that are set in the body of the bitmask pointed +to by the +.I bmp +argument. + +.br +.BR numa_move_pages() +moves a list of pages in the address space of the specified +task or the current task. +It simply uses the move_pages system call. +.br +.I pid +- ID of task. If not valid, use the current task. +.br +.I count +- Number of pages. +.br +.I pages +- List of pages to move. +.br +.I nodes +- List of nodes to which pages can be moved. +.br +.I status +- Field to which status is to be returned. +.br +.I flags +- MPOL_MF_MOVE or MPOL_MF_MOVE_ALL +.br +See move_pages(2). + +.BR numa_migrate_pages() +simply uses the migrate_pages system call to cause the pages of the calling +task, or a specified task, to be migrated from one set of nodes to another. +See migrate_pages(2). +The bit masks representing the nodes should be allocated with +.BR numa_allocate_nodemask() +, or with +.BR numa_bitmask_alloc() +using an +.I n +value returned from +.BR numa_num_possible_nodes(). 
+A task's current node set can be obtained by calling +.BR numa_get_membind(). +Bits in the +.I tonodes +mask can be set by calls to +.BR numa_bitmask_setbit(). + +.BR numa_error () +is a +.I libnuma +internal function that can be overridden by the +user program. +This function is called with a +.I char * +argument when a +.I libnuma +function fails. +Overriding the library internal definition +makes it possible to specify a different error handling strategy +when a +.I libnuma +function fails. It does not affect +.BR numa_available (). +The +.BR numa_error () +function defined in +.I libnuma +prints an error on +.I stderr +and terminates +the program if +.I numa_exit_on_error +is set to a non-zero value. +The default value of +.I numa_exit_on_error +is zero. + +.BR numa_warn () +is a +.I libnuma +internal function that can also be overridden +by the user program. +It is called to warn the user when a +.I libnuma +function encounters a non-fatal error. +The default implementation +prints a warning to +.IR stderr . +The first argument is a unique +number identifying each warning. After that there is a +.BR printf (3)-style +format string and a variable number of arguments. +.I numa_warn +exits the program when +.I numa_exit_on_warn +is set to a non-zero value. +The default value of +.I numa_exit_on_warn +is zero. + +.SH Compatibility with libnuma version 1 +Binaries that were compiled for libnuma version 1 need not be re-compiled +to run with libnuma version 2. +.br +Source code written for libnuma version 1 may be re-compiled without +change with version 2 installed. To do so, in the code's Makefile add +this option to CFLAGS: -DNUMA_VERSION1_COMPATIBILITY + +.SH THREAD SAFETY +.I numa_set_bind_policy +and +.I numa_exit_on_error +are process global. The other calls are thread safe. + +.SH COPYRIGHT +Copyright 2002, 2004, 2007, 2008 Andi Kleen, SuSE Labs. +.I libnuma +is under the GNU Lesser General Public License, v2.1. 
+ +.SH SEE ALSO +.BR get_mempolicy (2), +.BR set_mempolicy (2), +.BR getpagesize (2), +.BR mbind (2), +.BR mmap (2), +.BR shmat (2), +.BR numactl (8), +.BR sched_getaffinity (2), +.BR sched_setaffinity (2), +.BR move_pages (2), +.BR migrate_pages (2)
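Reviewer note on the bitmask semantics documented in numa.3 above: the following is a minimal, self-contained sketch of the out-of-range and comparison rules the page describes (set/test of a bit beyond the mask size is a silent no-op or 0, and comparison treats the missing trailing bits of the smaller mask as 0). It deliberately does not link against libnuma; every `sketch_` name is hypothetical and only models the documented behaviour, it is not the library's implementation.

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical stand-in for libnuma's struct bitmask (same two fields). */
struct sketch_bitmask {
    unsigned long size;   /* number of bits in the map */
    unsigned long *maskp; /* the bit map itself, stored in whole words */
};

#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Allocate a zero-filled mask able to hold n bits (models numa_bitmask_alloc). */
static struct sketch_bitmask *sketch_alloc(unsigned int n)
{
    unsigned long words = (n + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG;
    struct sketch_bitmask *bmp = malloc(sizeof(*bmp));

    if (bmp == NULL)
        return NULL;
    if (words == 0)
        words = 1; /* avoid a zero-sized allocation */
    bmp->size = n;
    bmp->maskp = calloc(words, sizeof(unsigned long));
    if (bmp->maskp == NULL) {
        free(bmp);
        return NULL;
    }
    return bmp;
}

/* Set bit n; silently do nothing when n is outside the mask. */
static struct sketch_bitmask *sketch_setbit(struct sketch_bitmask *bmp, unsigned int n)
{
    if (n < bmp->size)
        bmp->maskp[n / SKETCH_BITS_PER_LONG] |= 1UL << (n % SKETCH_BITS_PER_LONG);
    return bmp;
}

/* Return the value of bit n, or 0 when n is outside the mask. */
static int sketch_isbitset(const struct sketch_bitmask *bmp, unsigned int n)
{
    if (n >= bmp->size)
        return 0;
    return (int)((bmp->maskp[n / SKETCH_BITS_PER_LONG] >> (n % SKETCH_BITS_PER_LONG)) & 1UL);
}

/* Count the bits that are set. */
static unsigned int sketch_weight(const struct sketch_bitmask *bmp)
{
    unsigned int count = 0;

    for (unsigned long i = 0; i < bmp->size; i++)
        count += (unsigned int)sketch_isbitset(bmp, (unsigned int)i);
    return count;
}

/* Return 1 if equal; bits missing from the smaller mask count as 0. */
static int sketch_equal(const struct sketch_bitmask *a, const struct sketch_bitmask *b)
{
    unsigned long max = a->size > b->size ? a->size : b->size;

    for (unsigned long i = 0; i < max; i++)
        if (sketch_isbitset(a, (unsigned int)i) != sketch_isbitset(b, (unsigned int)i))
            return 0;
    return 1;
}

/* Free both the structure and its bit map. */
static void sketch_free(struct sketch_bitmask *bmp)
{
    free(bmp->maskp);
    free(bmp);
}
```

With the real library the corresponding calls would be numa_bitmask_alloc(), numa_bitmask_setbit(), numa_bitmask_isbitset(), numa_bitmask_weight(), numa_bitmask_equal() and numa_bitmask_free(), linked with -lnuma.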
diff --git a/numa.h b/numa.h new file mode 100644 index 0000000..6ab3121 --- /dev/null +++ b/numa.h
@@ -0,0 +1,511 @@ +/* Copyright (C) 2003,2004 Andi Kleen, SuSE Labs. + + libnuma is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; version + 2.1. + + libnuma is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should find a copy of v2.1 of the GNU Lesser General Public License + somewhere on your Linux system; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ + +#ifndef _NUMA_H +#define _NUMA_H 1 + +/* allow an application to test for the current programming interface: */ +#define LIBNUMA_API_VERSION 2 + +/* Simple NUMA policy library */ + +#include <stddef.h> +#include <string.h> +#include <sys/types.h> +#include <stdlib.h> + +#if defined(__x86_64__) || defined(__i386__) +#define NUMA_NUM_NODES 128 +#else +#define NUMA_NUM_NODES 2048 +#endif + +#ifdef GOOGLE3_LIBNUMA_BUILD +/* Google-local: + the version definitions below make sense when libnuma is built into + a shared library. In google3 builds that doesn't happen. + + The version definitions are mostly harmless when libnuma.a is linked + into an executable. + + But when libnuma.a is linked into a shared library (such as when there + is a dependency of some SWIG code on //third_party/libnuma:numa, + the version definitions cause actual errors: + http://sponge/2287a293-35c2-4a9a-80e5-1f0735320f02). 
*/ + +/* Don't provide backward symbol at all: */ +#define backward_symver(orig,new) /**/ + +/* Provide unversioned alias: */ +#define symver(orig,new) extern __typeof (orig) new __attribute((alias(#orig), visibility("default"))) + +#else /* GOOGLE3_LIBNUMA_BUILD */ + +#define backward_symver(orig,new) \ + __asm__(".symver " #orig "," #new "@libnuma_1.1") +#define symver(orig,new) __asm__(".symver " #orig "," #new "@@libnuma_1.2") + +#endif /* GOOGLE3_LIBNUMA_BUILD */ + + +#ifdef __cplusplus +extern "C" { +#endif + +typedef struct { + unsigned long n[NUMA_NUM_NODES/(sizeof(unsigned long)*8)]; +} nodemask_t; + +struct bitmask { + unsigned long size; /* number of bits in the map */ + unsigned long *maskp; +}; + +/* operations on struct bitmask */ +int numa_bitmask_isbitset(const struct bitmask *, unsigned int); +struct bitmask *numa_bitmask_setall(struct bitmask *); +struct bitmask *numa_bitmask_clearall(struct bitmask *); +struct bitmask *numa_bitmask_setbit(struct bitmask *, unsigned int); +struct bitmask *numa_bitmask_clearbit(struct bitmask *, unsigned int); +unsigned int numa_bitmask_nbytes(struct bitmask *); +unsigned int numa_bitmask_weight(const struct bitmask *); +struct bitmask *numa_bitmask_alloc(unsigned int); +void numa_bitmask_free(struct bitmask *); +int numa_bitmask_equal(const struct bitmask *, const struct bitmask *); +void copy_nodemask_to_bitmask(nodemask_t *, struct bitmask *); +void copy_bitmask_to_nodemask(struct bitmask *, nodemask_t *); +void copy_bitmask_to_bitmask(struct bitmask *, struct bitmask *); + +/* compatibility for codes that used them: */ + +static inline void nodemask_zero(nodemask_t *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = sizeof(nodemask_t) * 8; + numa_bitmask_clearall(&tmp); +} + +static inline void nodemask_zero_compat(nodemask_t *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = sizeof(nodemask_t) * 8; + numa_bitmask_clearall(&tmp); +} + +static 
inline void nodemask_set_compat(nodemask_t *mask, int node) +{ + mask->n[node / (8*sizeof(unsigned long))] |= + (1UL<<(node%(8*sizeof(unsigned long)))); +} + +static inline void nodemask_clr_compat(nodemask_t *mask, int node) +{ + mask->n[node / (8*sizeof(unsigned long))] &= + ~(1UL<<(node%(8*sizeof(unsigned long)))); +} + +static inline int nodemask_isset_compat(const nodemask_t *mask, int node) +{ + if ((unsigned)node >= NUMA_NUM_NODES) + return 0; + if (mask->n[node / (8*sizeof(unsigned long))] & + (1UL<<(node%(8*sizeof(unsigned long))))) + return 1; + return 0; +} + +static inline int nodemask_equal(const nodemask_t *a, const nodemask_t *b) +{ + struct bitmask tmp_a, tmp_b; + + tmp_a.maskp = (unsigned long *)a; + tmp_a.size = sizeof(nodemask_t) * 8; + + tmp_b.maskp = (unsigned long *)b; + tmp_b.size = sizeof(nodemask_t) * 8; + + return numa_bitmask_equal(&tmp_a, &tmp_b); +} + +static inline int nodemask_equal_compat(const nodemask_t *a, const nodemask_t *b) +{ + struct bitmask tmp_a, tmp_b; + + tmp_a.maskp = (unsigned long *)a; + tmp_a.size = sizeof(nodemask_t) * 8; + + tmp_b.maskp = (unsigned long *)b; + tmp_b.size = sizeof(nodemask_t) * 8; + + return numa_bitmask_equal(&tmp_a, &tmp_b); +} + +/* numa_init must be called before any other operations. + If libnuma is linked dynamically, it will be called automatically. */ +void numa_init(void); + +/* NUMA support available. If this returns a negative value all other function + in this library are undefined. */ +int numa_available(void); + +/* Basic NUMA state */ + +/* Get max available node */ +int numa_max_node(void); +int numa_max_possible_node(void); +/* Return preferred node */ +int numa_preferred(void); + +/* Return node size and free memory */ +long long numa_node_size64(int node, long long *freep); +long numa_node_size(int node, long *freep); + +int numa_pagesize(void); + +/* Set with all nodes from which the calling process may allocate memory. + Only valid after numa_available. 
*/ +extern struct bitmask *numa_all_nodes_ptr; + +/* Set with all nodes the kernel has exposed to userspace */ +extern struct bitmask *numa_nodes_ptr; + +/* For source compatibility */ +extern nodemask_t numa_all_nodes; + +/* Set with all cpus. */ +extern struct bitmask *numa_all_cpus_ptr; + +/* Set with no nodes */ +extern struct bitmask *numa_no_nodes_ptr; + +/* Source compatibility */ +extern nodemask_t numa_no_nodes; + +/* Only run and allocate memory from a specific set of nodes. */ +void numa_bind(struct bitmask *nodes); + +/* Set the NUMA node interleaving mask. 0 to turn off interleaving */ +void numa_set_interleave_mask(struct bitmask *nodemask); + +/* Return the current interleaving mask */ +struct bitmask *numa_get_interleave_mask(void); + +/* allocate a bitmask big enough for all nodes */ +struct bitmask *numa_allocate_nodemask(void); + +static inline void numa_free_nodemask(struct bitmask *b) +{ + numa_bitmask_free(b); +} + +/* Some node to preferably allocate memory from for task. */ +void numa_set_preferred(int node); + +/* Set local memory allocation policy for task */ +void numa_set_localalloc(void); + +/* Only allocate memory from the nodes set in mask. 0 to turn off */ +void numa_set_membind(struct bitmask *nodemask); + +/* Return current membind */ +struct bitmask *numa_get_membind(void); + +/* Return allowed memories [nodes] */ +struct bitmask *numa_get_mems_allowed(void); + +int numa_get_interleave_node(void); + +/* NUMA memory allocation. These functions always round to page size + and are relatively slow. */ + +/* Alloc memory page interleaved on nodes in mask */ +void *numa_alloc_interleaved_subset(size_t size, struct bitmask *nodemask); +/* Alloc memory page interleaved on all nodes. 
*/ +void *numa_alloc_interleaved(size_t size); +/* Alloc memory located on node */ +void *numa_alloc_onnode(size_t size, int node); +/* Alloc memory on local node */ +void *numa_alloc_local(size_t size); +/* Allocation with current policy */ +void *numa_alloc(size_t size); +/* Change the size of a memory area preserving the memory policy */ +void *numa_realloc(void *old_addr, size_t old_size, size_t new_size); +/* Free memory allocated by the functions above */ +void numa_free(void *mem, size_t size); + +/* Low level functions, primarily for shared memory. All memory + processed by these must not be touched yet */ + +/* Interleave a memory area. */ +void numa_interleave_memory(void *mem, size_t size, struct bitmask *mask); + +/* Allocate a memory area on a specific node. */ +void numa_tonode_memory(void *start, size_t size, int node); + +/* Allocate memory on a mask of nodes. */ +void numa_tonodemask_memory(void *mem, size_t size, struct bitmask *mask); + +/* Allocate a memory area on the current node. */ +void numa_setlocal_memory(void *start, size_t size); + +/* Allocate memory area with current memory policy */ +void numa_police_memory(void *start, size_t size); + +/* Run current task only on nodes in mask */ +int numa_run_on_node_mask(struct bitmask *mask); +/* Run current task on nodes in mask without any cpuset awareness */ +int numa_run_on_node_mask_all(struct bitmask *mask); +/* Run current task only on node */ +int numa_run_on_node(int node); +/* Return current mask of nodes the task can run on */ +struct bitmask * numa_get_run_node_mask(void); + +/* When strict fail allocation when memory cannot be allocated in target node(s). 
*/ +void numa_set_bind_policy(int strict); + +/* Fail when existing memory has incompatible policy */ +void numa_set_strict(int flag); + +/* maximum nodes (size of kernel nodemask_t) */ +int numa_num_possible_nodes(); + +/* maximum cpus (size of kernel cpumask_t) */ +int numa_num_possible_cpus(); + +/* nodes in the system */ +int numa_num_configured_nodes(); + +/* maximum cpus */ +int numa_num_configured_cpus(); + +/* maximum cpus allowed to current task */ +int numa_num_task_cpus(); +int numa_num_thread_cpus(); /* backward compatibility */ + +/* maximum nodes allowed to current task */ +int numa_num_task_nodes(); +int numa_num_thread_nodes(); /* backward compatibility */ + +/* allocate a bitmask the size of the kernel cpumask_t */ +struct bitmask *numa_allocate_cpumask(); + +static inline void numa_free_cpumask(struct bitmask *b) +{ + numa_bitmask_free(b); +} + +/* Convert node to CPU mask. -1/errno on failure, otherwise 0. */ +int numa_node_to_cpus(int, struct bitmask *); + +/* report the node of the specified cpu. -1/errno on invalid cpu. */ +int numa_node_of_cpu(int cpu); + +/* Report distance of node1 from node2. 0 on error.*/ +int numa_distance(int node1, int node2); + +/* Error handling. */ +/* This is an internal function in libnuma that can be overwritten by a user + program. Default is to print an error to stderr and exit if numa_exit_on_error + is true. */ +void numa_error(const char *where); + +/* When true exit the program when a NUMA system call (except numa_available) + fails */ +extern int numa_exit_on_error; +/* Warning function. Can also be overwritten. Default is to print on stderr + once. 
*/ +void numa_warn(int num, const char *fmt, ...); + +/* When true exit the program on a numa_warn() call */ +extern int numa_exit_on_warn; + +int numa_migrate_pages(int pid, struct bitmask *from, struct bitmask *to); + +int numa_move_pages(int pid, unsigned long count, void **pages, + const int *nodes, int *status, int flags); + +int numa_sched_getaffinity(pid_t, struct bitmask *); +int numa_sched_setaffinity(pid_t, struct bitmask *); + +/* Convert an ascii list of nodes to a bitmask */ +struct bitmask *numa_parse_nodestring(const char *); + +/* Convert an ascii list of nodes to a bitmask without current nodeset + * dependency */ +struct bitmask *numa_parse_nodestring_all(const char *); + +/* Convert an ascii list of cpu to a bitmask */ +struct bitmask *numa_parse_cpustring(const char *); + +/* Convert an ascii list of cpu to a bitmask without current taskset + * dependency */ +struct bitmask *numa_parse_cpustring_all(const char *); + +/* + * The following functions are for source code compatibility + * with releases prior to version 2. + * Such codes should be compiled with NUMA_VERSION1_COMPATIBILITY defined. 
+ */ + +static inline void numa_set_interleave_mask_compat(nodemask_t *nodemask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)nodemask; + tmp.size = sizeof(nodemask_t) * 8; + numa_set_interleave_mask(&tmp); +} + +static inline nodemask_t numa_get_interleave_mask_compat() +{ + struct bitmask *tp; + nodemask_t mask; + + tp = numa_get_interleave_mask(); + copy_bitmask_to_nodemask(tp, &mask); + numa_bitmask_free(tp); + return mask; +} + +static inline void numa_bind_compat(nodemask_t *mask) +{ + struct bitmask *tp; + + tp = numa_allocate_nodemask(); + copy_nodemask_to_bitmask(mask, tp); + numa_bind(tp); + numa_bitmask_free(tp); +} + +static inline void numa_set_membind_compat(nodemask_t *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = sizeof(nodemask_t) * 8; + numa_set_membind(&tmp); +} + +static inline nodemask_t numa_get_membind_compat() +{ + struct bitmask *tp; + nodemask_t mask; + + tp = numa_get_membind(); + copy_bitmask_to_nodemask(tp, &mask); + numa_bitmask_free(tp); + return mask; +} + +static inline void *numa_alloc_interleaved_subset_compat(size_t size, + const nodemask_t *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = sizeof(nodemask_t) * 8; + return numa_alloc_interleaved_subset(size, &tmp); +} + +static inline int numa_run_on_node_mask_compat(const nodemask_t *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = sizeof(nodemask_t) * 8; + return numa_run_on_node_mask(&tmp); +} + +static inline nodemask_t numa_get_run_node_mask_compat() +{ + struct bitmask *tp; + nodemask_t mask; + + tp = numa_get_run_node_mask(); + copy_bitmask_to_nodemask(tp, &mask); + numa_bitmask_free(tp); + return mask; +} + +static inline void numa_interleave_memory_compat(void *mem, size_t size, + const nodemask_t *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = sizeof(nodemask_t) * 8; + numa_interleave_memory(mem, size, &tmp); +} + 
+static inline void numa_tonodemask_memory_compat(void *mem, size_t size, + const nodemask_t *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = sizeof(nodemask_t) * 8; + numa_tonodemask_memory(mem, size, &tmp); +} + +static inline int numa_sched_getaffinity_compat(pid_t pid, unsigned len, + unsigned long *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = len * 8; + return numa_sched_getaffinity(pid, &tmp); +} + +static inline int numa_sched_setaffinity_compat(pid_t pid, unsigned len, + unsigned long *mask) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)mask; + tmp.size = len * 8; + return numa_sched_setaffinity(pid, &tmp); +} + +static inline int numa_node_to_cpus_compat(int node, unsigned long *buffer, + int buffer_len) +{ + struct bitmask tmp; + + tmp.maskp = (unsigned long *)buffer; + tmp.size = buffer_len * 8; + return numa_node_to_cpus(node, &tmp); +} + +/* end of version 1 compatibility functions */ + +/* + * To compile an application that uses libnuma version 1: + * add -DNUMA_VERSION1_COMPATIBILITY to your Makefile's CFLAGS + */ +#ifdef NUMA_VERSION1_COMPATIBILITY +#include <numacompat1.h> +#endif + +#ifdef __cplusplus +} +#endif + +#endif
diff --git a/numacompat1.h b/numacompat1.h new file mode 100644 index 0000000..d5af142 --- /dev/null +++ b/numacompat1.h
@@ -0,0 +1,18 @@ +#define numa_set_interleave_mask(m) numa_set_interleave_mask_compat(m) +#define numa_get_interleave_mask() numa_get_interleave_mask_compat() +#define numa_bind(m) numa_bind_compat(m) +#define numa_get_membind(m) numa_get_membind_compat(m) +#define numa_set_membind(m) numa_set_membind_compat(m) +#define numa_alloc_interleaved_subset(s,m) numa_alloc_interleaved_subset_compat(s,m) +#define numa_run_on_node_mask(m) numa_run_on_node_mask_compat(m) +#define numa_get_run_node_mask() numa_get_run_node_mask_compat() +#define numa_interleave_memory(st,si,m) numa_interleave_memory_compat(st,si,m) +#define numa_tonodemask_memory(st,si,m) numa_tonodemask_memory_compat(st,si,m) +#define numa_sched_getaffinity(p,l,m) numa_sched_getaffinity_compat(p,l,m) +#define numa_sched_setaffinity(p,l,m) numa_sched_setaffinity_compat(p,l,m) +#define numa_node_to_cpus(n,b,bl) numa_node_to_cpus_compat(n,b,bl) +#define nodemask_zero(m) nodemask_zero_compat(m) +#define nodemask_set(m, n) nodemask_set_compat(m, n) +#define nodemask_clr(m, n) nodemask_clr_compat(m, n) +#define nodemask_isset(m, n) nodemask_isset_compat(m, n) +#define nodemask_equal(a, b) nodemask_equal_compat(a, b)
diff --git a/numactl.8 b/numactl.8 new file mode 100644 index 0000000..7a001c0 --- /dev/null +++ b/numactl.8
@@ -0,0 +1,354 @@ +.\" t +.\" Copyright 2003,2004 Andi Kleen, SuSE Labs. +.\" +.\" Permission is granted to make and distribute verbatim copies of this +.\" manual provided the copyright notice and this permission notice are +.\" preserved on all copies. +.\" +.\" Permission is granted to copy and distribute modified versions of this +.\" manual under the conditions for verbatim copying, provided that the +.\" entire resulting derived work is distributed under the terms of a +.\" permission notice identical to this one. +.\" +.\" Since the Linux kernel and libraries are constantly changing, this +.\" manual page may be incorrect or out-of-date. The author(s) assume no +.\" responsibility for errors or omissions, or for damages resulting from +.\" the use of the information contained herein. +.\" +.\" Formatted or processed versions of this manual, if unaccompanied by +.\" the source, must acknowledge the copyright and authors of this work. +.TH NUMACTL 8 "Mar 2004" "SuSE Labs" "Linux Administrator's Manual" +.SH NAME +numactl \- Control NUMA policy for processes or shared memory +.SH SYNOPSIS +.B numactl +[ +.B \-\-all +] [ +.B \-\-interleave nodes +] [ +.B \-\-preferred node +] [ +.B \-\-membind nodes +] [ +.B \-\-cpunodebind nodes +] [ +.B \-\-physcpubind cpus +] [ +.B \-\-localalloc +] [\-\-] command {arguments ...} +.br +.B numactl \-\-show +.br +.B numactl \-\-hardware +.br +.B numactl +[ +.B \-\-huge +] [ +.B \-\-offset offset +] [ +.B \-\-shmmode shmmode +] [ +.B \-\-length length +] [ +.B \-\-strict +] +.br +[ +.B \-\-shmid id +] +.B \-\-shm shmkeyfile +| +.B \-\-file tmpfsfile +.br +[ +.B \-\-touch +] [ +.B \-\-dump +] [ +.B \-\-dump-nodes +] +memory policy +.SH DESCRIPTION +.B numactl +runs processes with a specific NUMA scheduling or memory placement policy. +The policy is set for command and inherited by all of its children. +In addition it can set persistent policy for shared memory segments or files. 
+.PP +Use -- before command if using command options that could be confused +with numactl options. +.PP +.I nodes +may be specified as N,N,N or N-N or N,N-N or N-N,N-N and so forth. +Relative +.I nodes +may be specified as +N,N,N or +N-N or +N,N-N and so forth. The + indicates that +the node numbers are relative to the process' set of allowed nodes in its +current cpuset. +A !N-N notation indicates the inverse of N-N, in other words all nodes +except N-N. If used with + notation, specify !+N-N. When +.I same +is specified the previous nodemask specified on the command line is used. +all means all nodes in the current cpuset. +.PP +Instead of a number a node can also be: +.TS +tab(|); +l l. +netdev:DEV|The node connected to network device DEV. +file:PATH |The node of the block device of PATH. +ip:HOST |The node of the network device of HOST +block:PATH|The node of block device PATH +pci:[seg:]bus:dev[:func]|The node of a PCI device. +.TE + +Note that block resolves kernel block device names only; +for udev names in /dev use +.I file: +.TP +Policy settings are: +.TP +.B \-\-all, \-a +Unset default cpuset awareness, so user can use all possible CPUs/nodes +for following policy settings. +.TP +.B \-\-interleave=nodes, \-i nodes +Set a memory interleave policy. Memory will be allocated using round robin +on +.I nodes. +When memory cannot be allocated on the current interleave target fall back +to other nodes. +Multiple nodes may be specified on --interleave, --membind and --cpunodebind. +.TP +.B \-\-membind=nodes, \-m nodes +Only allocate memory from nodes. Allocation will fail when there +is not enough memory available on these nodes. +.I nodes +may be specified as noted above. +.TP +.B \-\-cpunodebind=nodes, \-N nodes +Only execute +.I command +on the CPUs of +.I nodes. +Note that nodes may consist of multiple CPUs. +.I nodes +may be specified as noted above. +.TP +.B \-\-physcpubind=cpus, \-C cpus +Only execute +.I process +on +.I cpus. 
+This accepts cpu numbers as shown in the +.I processor +fields of +.I /proc/cpuinfo, +or relative cpus as in relative to the current cpuset. +You may specify "all", which means all cpus in the current cpuset. +Physical +.I cpus +may be specified as N,N,N or N-N or N,N-N or N-N,N-N and so forth. +Relative +.I cpus +may be specified as +N,N,N or +N-N or +N,N-N and so forth. The + indicates that +the cpu numbers are relative to the process' set of allowed cpus in its +current cpuset. +A !N-N notation indicates the inverse of N-N, in other words all cpus +except N-N. If used with + notation, specify !+N-N. +.TP +.B \-\-localalloc, \-l +Always allocate on the current node. +.TP +.B \-\-preferred=node +Preferably allocate memory on +.I node, +but if memory cannot be allocated there fall back to other nodes. +This option takes only a single node number. +Relative notation may be used. +.TP +.B \-\-show, \-s +Show NUMA policy settings of the current process. +.TP +.B \-\-hardware, \-H +Show inventory of available nodes on the system. +.TP 0 +Numactl can set up policy for a SYSV shared memory segment or a file in shmfs/hugetlbfs. + +This policy is persistent and will be used by +all mappings from that shared memory. The order of options matters here. +The specification must at least include either of +.I \-\-shm, +.I \-\-shmid, +.I \-\-file +to specify the shared memory segment or file and a memory policy as described +above ( +.I \-\-interleave, +.I \-\-localalloc, +.I \-\-preferred, +.I \-\-membind +). +.TP +.B \-\-huge +When creating a SYSV shared memory segment use huge pages. +Only valid before \-\-shmid or \-\-shm +.TP +.B \-\-offset +Specify offset into the shared memory segment. Default 0. +Valid units are +.I m +(for MB), +.I g +(for GB), +.I k +(for KB), +otherwise it specifies bytes. +.TP +.B \-\-strict +Give an error when a page in the policied area in the shared memory +segment already was faulted in with a conflicting policy. 
Default +is to silently ignore this. +.TP +.B \-\-shmmode shmmode +Only valid before \-\-shmid or \-\-shm. +When creating a shared memory segment set it to numeric mode +.I shmmode. +.TP +.B \-\-length length +Apply policy to +.I length +range in the shared memory segment or make +the segment length long. +Default is to use the remaining length. +Required when a shared memory segment is created and specifies the length +of the new segment then. Valid units are +.I m +(for MB), +.I g +(for GB), +.I k +(for KB), +otherwise it specifies bytes. +.TP +.B \-\-shmid id +Create or use a shared memory segment with numeric ID +.I id +.TP +.B \-\-shm shmkeyfile +Create or use a shared memory segment, with the ID generated +using +.I ftok(3) +from shmkeyfile +.TP +.B \-\-file tmpfsfile +Set policy for a file in tmpfs or hugetlbfs +.TP +.B \-\-touch +Touch pages to enforce policy early. Default is to not touch them; the policy +is applied when an application maps and accesses a page. +.TP +.B \-\-dump +Dump policy in the specified range. +.TP +.B \-\-dump-nodes +Dump all nodes of the specified range (very verbose!) +.TP +Valid node specifiers +.TS +tab(:); +l l. +all:All nodes +number:Node number +number1{,number2}:Node number1 and Node number2 +number1-number2:Nodes from number1 to number2 +! nodes:Invert selection of the following specification. +.TE +.SH EXAMPLES +numactl \-\-physcpubind=+0-4,8-12 myapplic arguments +Run myapplic on cpus 0-4 and 8-12 of the current cpuset. + +numactl \-\-interleave=all bigdatabase arguments +Run big database with its memory interleaved on all nodes. + +numactl \-\-cpunodebind=0 \-\-membind=0,1 process +Run process on node 0 with memory allocated on node 0 and 1. + +numactl \-\-cpunodebind=0 \-\-membind=0,1 -- process -l +Run process as above, but with an option (-l) that would be confused with +a numactl option. 
+ +numactl \-\-cpunodebind=netdev:eth0 \-\-membind=netdev:eth0 network-server +Run network-server on the node of network device eth0 with its memory +also in the same node. + +numactl \-\-preferred=1 numactl \-\-show +Set preferred node 1 and show the resulting state. + +numactl --interleave=all --shm /tmp/shmkey +Interleave all of the sysv shared memory region specified by +/tmp/shmkey over all nodes. + +Place a tmpfs file on 2 nodes: + numactl --membind=2 dd if=/dev/zero of=/dev/shm/A bs=1M count=1024 + numactl --membind=3 dd if=/dev/zero of=/dev/shm/A seek=1024 bs=1M count=1024 + + +numactl --localalloc /dev/shm/file +Reset the policy for the shared memory file +.I file +to the default localalloc policy. +.SH NOTES +Requires a NUMA policy aware kernel. + +Command is not executed using a shell. If you want to use shell metacharacters +in the child use sh -c as wrapper. + +Setting policy for a hugetlbfs file currently does not work because +it cannot be extended by truncate. + +Shared memory segments larger than numactl's address space cannot +be completely policied. This could be a problem on 32bit architectures. +Changing it piece by piece may work. + +The old +.I --cpubind +which accepts node numbers, not cpu numbers, is deprecated +and replaced with the new +.I --cpunodebind +and +.I --physcpubind +options. + +.SH FILES +.I /proc/cpuinfo +for the listing of active CPUs. See +.I proc(5) +for details. + +.I /sys/devices/system/node/node*/numastat +for NUMA memory hit statistics. + +.SH COPYRIGHT +Copyright 2002,2004 Andi Kleen, SuSE Labs. +numactl and the demo programs are under the GNU General Public License, v.2 + +.SH SEE ALSO +.I set_mempolicy(2) +, +.I get_mempolicy(2) +, +.I mbind(2) +, +.I sched_setaffinity(2) +, +.I sched_getaffinity(2) +, +.I proc(5) +, +.I ftok(3) +, +.I shmat(2) +, +.I migratepages(8) +
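Reviewer note on the node and cpu list syntax documented in numactl.8 above: the sketch below shows how the plain "N", "N-N" and comma-separated forms (for example 0-4,8-12) expand into a set of bits. It is an illustration only, assuming a 64-bit mask is large enough; `sketch_parse_list` is a hypothetical helper, not numactl's parser, and it does not handle the "+", "!", "all", "same" or device forms (netdev:, file:, and so on). In libnuma the real entry points are numa_parse_nodestring() and numa_parse_cpustring().

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: parse "N", "N-N" and comma lists of those into a
 * 64-bit mask. Returns 0 on success, -1 on malformed input, a reversed
 * range, or a number above 63.
 */
static int sketch_parse_list(const char *s, uint64_t *mask)
{
    *mask = 0;
    for (;;) {
        char *end;
        unsigned long lo, hi;

        /* Every item must start with a digit (rejects "", ",," and signs). */
        if (!isdigit((unsigned char)*s))
            return -1;
        lo = strtoul(s, &end, 10);
        hi = lo;
        s = end;

        /* Optional "-N" turns the single number into an inclusive range. */
        if (*s == '-') {
            s++;
            if (!isdigit((unsigned char)*s))
                return -1;
            hi = strtoul(s, &end, 10);
            s = end;
        }
        if (lo > hi || hi > 63)
            return -1;

        for (unsigned long i = lo; i <= hi; i++)
            *mask |= (uint64_t)1 << i;

        if (*s == '\0')
            return 0;
        if (*s != ',')
            return -1;
        s++; /* skip the comma and parse the next item */
    }
}
```

For the list 0-4,8-12 from the EXAMPLES section this sets bits 0 through 4 and 8 through 12 and leaves every other bit clear.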
diff --git a/numactl.c b/numactl.c new file mode 100644 index 0000000..a2b2d9d --- /dev/null +++ b/numactl.c
@@ -0,0 +1,662 @@ +/* Copyright (C) 2003,2004,2005 Andi Kleen, SuSE Labs. + Command line NUMA policy control. + + numactl is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; version + 2. + + numactl is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should find a copy of v2 of the GNU General Public License somewhere + on your Linux system; if not, write to the Free Software Foundation, + Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#define _GNU_SOURCE +#include <getopt.h> +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <unistd.h> +#include <stdarg.h> +#include <ctype.h> +#include "numa.h" +#include "numaif.h" +#include "numaint.h" +#include "util.h" +#include "shm.h" + +#define CPUSET 0 +#define ALL 1 + +int exitcode; + +struct option opts[] = { + {"all", 0, 0, 'a'}, + {"interleave", 1, 0, 'i' }, + {"preferred", 1, 0, 'p' }, + {"cpubind", 1, 0, 'c' }, + {"cpunodebind", 1, 0, 'N' }, + {"physcpubind", 1, 0, 'C' }, + {"membind", 1, 0, 'm'}, + {"show", 0, 0, 's' }, + {"localalloc", 0,0, 'l'}, + {"hardware", 0,0,'H' }, + + {"shm", 1, 0, 'S'}, + {"file", 1, 0, 'f'}, + {"offset", 1, 0, 'o'}, + {"length", 1, 0, 'L'}, + {"strict", 0, 0, 't'}, + {"shmmode", 1, 0, 'M'}, + {"dump", 0, 0, 'd'}, + {"dump-nodes", 0, 0, 'D'}, + {"shmid", 1, 0, 'I'}, + {"huge", 0, 0, 'u'}, + {"touch", 0, 0, 'T'}, + {"verify", 0, 0, 'V'}, /* undocumented - for debugging */ + { 0 } +}; + +void usage(void) +{ + fprintf(stderr, + "usage: numactl [--all | -a] [--interleave= | -i <nodes>] [--preferred= | -p <node>]\n" + " [--physcpubind= | -C <cpus>] [--cpunodebind= | -N <nodes>]\n" + " [--membind= | -m <nodes>] [--localalloc | -l] command args ...\n" + " 
numactl [--show | -s]\n" + " numactl [--hardware | -H]\n" + " numactl [--length | -L <length>] [--offset | -o <offset>] [--shmmode | -M <shmmode>]\n" + " [--strict | -t]\n" + " [--shmid | -I <id>] --shm | -S <shmkeyfile>\n" + " [--shmid | -I <id>] --file | -f <tmpfsfile>\n" + " [--huge | -u] [--touch | -T] \n" + " memory policy | --dump | -d | --dump-nodes | -D\n" + "\n" + "memory policy is --interleave | -i, --preferred | -p, --membind | -m, --localalloc | -l\n" + "<nodes> is a comma delimited list of node numbers or A-B ranges or all.\n" + "Instead of a number a node can also be:\n" + " netdev:DEV the node connected to network device DEV\n" + " file:PATH the node the block device of path is connected to\n" + " ip:HOST the node of the network device host routes through\n" + " block:PATH the node of block device path\n" + " pci:[seg:]bus:dev[:func] The node of a PCI device\n" + "<cpus> is a comma delimited list of cpu numbers or A-B ranges or all\n" + "all ranges can be inverted with !\n" + "all numbers and ranges can be made cpuset-relative with +\n" + "the old --cpubind argument is deprecated.\n" + "use --cpunodebind or --physcpubind instead\n" + "<length> can have g (GB), m (MB) or k (KB) suffixes\n"); + exit(1); +} + +void usage_msg(char *msg, ...)
+{ + va_list ap; + va_start(ap,msg); + fprintf(stderr, "numactl: "); + vfprintf(stderr, msg, ap); + fputc('\n', stderr); + usage(); +} + +void show_physcpubind(void) +{ + int ncpus = numa_num_configured_cpus(); + + for (;;) { + struct bitmask *cpubuf; + + cpubuf = numa_bitmask_alloc(ncpus); + + if (numa_sched_getaffinity(0, cpubuf) < 0) { + if (errno == EINVAL && ncpus < 1024*1024) { + ncpus *= 2; + continue; + } + err("sched_get_affinity"); + } + printmask("physcpubind", cpubuf); + break; + } +} + +void show(void) +{ + unsigned long prefnode; + struct bitmask *membind, *interleave, *cpubind; + unsigned long cur; + int policy; + + if (numa_available() < 0) { + show_physcpubind(); + printf("No NUMA support available on this system.\n"); + exit(1); + } + + cpubind = numa_get_run_node_mask(); + + prefnode = numa_preferred(); + interleave = numa_get_interleave_mask(); + membind = numa_get_membind(); + cur = numa_get_interleave_node(); + + policy = 0; + if (get_mempolicy(&policy, NULL, 0, 0, 0) < 0) + perror("get_mempolicy"); + + printf("policy: %s\n", policy_name(policy)); + + printf("preferred node: "); + switch (policy) { + case MPOL_PREFERRED: + if (prefnode != -1) { + printf("%ld\n", prefnode); + break; + } + /*FALL THROUGH*/ + case MPOL_DEFAULT: + printf("current\n"); + break; + case MPOL_INTERLEAVE: + printf("%ld (interleave next)\n",cur); + break; + case MPOL_BIND: + printf("%d\n", find_first(membind)); + break; + } + if (policy == MPOL_INTERLEAVE) { + printmask("interleavemask", interleave); + printf("interleavenode: %ld\n", cur); + } + show_physcpubind(); + printmask("cpubind", cpubind); // for compatibility + printmask("nodebind", cpubind); + printmask("membind", membind); +} + +char *fmt_mem(unsigned long long mem, char *buf) +{ + if (mem == -1L) + sprintf(buf, "<not available>"); + else + sprintf(buf, "%llu MB", mem >> 20); + return buf; +} + +static void print_distances(int maxnode) +{ + int i,k; + int fst = 0; + + for (i = 0; i <= maxnode; i++) + if
(numa_bitmask_isbitset(numa_nodes_ptr, i)) { + fst = i; + break; + } + if (numa_distance(maxnode,fst) == 0) { + printf("No distance information available.\n"); + return; + } + printf("node distances:\n"); + printf("node "); + for (i = 0; i <= maxnode; i++) + if (numa_bitmask_isbitset(numa_nodes_ptr, i)) + printf("% 3d ", i); + printf("\n"); + for (i = 0; i <= maxnode; i++) { + if (!numa_bitmask_isbitset(numa_nodes_ptr, i)) + continue; + printf("% 3d: ", i); + for (k = 0; k <= maxnode; k++) + if (numa_bitmask_isbitset(numa_nodes_ptr, i) && + numa_bitmask_isbitset(numa_nodes_ptr, k)) + printf("% 3d ", numa_distance(i,k)); + printf("\n"); + } +} + +void print_node_cpus(int node) +{ + int i, err; + struct bitmask *cpus; + + cpus = numa_allocate_cpumask(); + err = numa_node_to_cpus(node, cpus); + if (err >= 0) { + for (i = 0; i < cpus->size; i++) + if (numa_bitmask_isbitset(cpus, i)) + printf(" %d", i); + } + putchar('\n'); +} + +void hardware(void) +{ + int i; + int numnodes=0; + int prevnode=-1; + int skip=0; + int maxnode = numa_max_node(); + + if (numa_available() < 0) { + printf("No NUMA available on this system\n"); + exit(1); + } + + for (i=0; i<=maxnode; i++) + if (numa_bitmask_isbitset(numa_nodes_ptr, i)) + numnodes++; + printf("available: %d nodes (", numnodes); + for (i=0; i<=maxnode; i++) { + if (numa_bitmask_isbitset(numa_nodes_ptr, i)) { + if (prevnode == -1) { + printf("%d", i); + prevnode=i; + continue; + } + + if (i > prevnode + 1) { + if (skip) { + printf("%d", prevnode); + skip=0; + } + printf(",%d", i); + prevnode=i; + continue; + } + + if (i == prevnode + 1) { + if (!skip) { + printf("-"); + skip=1; + } + prevnode=i; + } + + if ((i == maxnode) && skip) + printf("%d", prevnode); + } + } + printf(")\n"); + + for (i = 0; i <= maxnode; i++) { + char buf[64]; + long long fr; + unsigned long long sz = numa_node_size64(i, &fr); + if (!numa_bitmask_isbitset(numa_nodes_ptr, i)) + continue; + + printf("node %d cpus:", i); + print_node_cpus(i); + printf("node 
%d size: %s\n", i, fmt_mem(sz, buf)); + printf("node %d free: %s\n", i, fmt_mem(fr, buf)); + } + print_distances(maxnode); +} + +void checkerror(char *s) +{ + if (errno) { + perror(s); + exit(1); + } +} + +void checknuma(void) +{ + static int numa = -1; + if (numa < 0) { + if (numa_available() < 0) + complain("This system does not support NUMA policy"); + } + numa = 0; +} + +int set_policy = -1; + +void setpolicy(int pol) +{ + if (set_policy != -1) + usage_msg("Conflicting policies"); + set_policy = pol; +} + +void nopolicy(void) +{ + if (set_policy >= 0) + usage_msg("specify policy after --shm/--file"); +} + +int did_cpubind = 0; +int did_strict = 0; +int do_shm = 0; +int do_dump = 0; +int shmattached = 0; +int did_node_cpu_parse = 0; +int parse_all = 0; +char *shmoption; + +void check_cpubind(int flag) +{ + if (flag) + usage_msg("cannot do --cpubind on shared memory\n"); +} + +void noshm(char *opt) +{ + if (shmattached) + usage_msg("%s must be before shared memory specification", opt); + shmoption = opt; +} + +void dontshm(char *opt) +{ + if (shmoption) + usage_msg("%s shm option is not allowed before %s", shmoption, opt); +} + +void needshm(char *opt) +{ + if (!shmattached) + usage_msg("%s must be after shared memory specification", opt); +} + +void check_all_parse(int flag) +{ + if (did_node_cpu_parse) + usage_msg("--all/-a option must be before all cpu/node specifications"); +} + +void get_short_opts(struct option *o, char *s) +{ + *s++ = '+'; + while (o->name) { + if (isprint(o->val)) { + *s++ = o->val; + if (o->has_arg) + *s++ = ':'; + } + o++; + } + *s = '\0'; +} + +void check_shmbeyond(char *msg) +{ + if (shmoffset >= shmlen) { + fprintf(stderr, + "numactl: region offset %#llx beyond its length %#llx at %s\n", + shmoffset, shmlen, msg); + exit(1); + } +} + +static struct bitmask *numactl_parse_nodestring(char *s, int flag) +{ + static char *last; + + if (s[0] == 's' && !strcmp(s, "same")) { + if (!last) + usage_msg("same needs previous node 
specification"); + s = last; + } else { + last = s; + } + + if (flag == ALL) + return numa_parse_nodestring_all(s); + else + return numa_parse_nodestring(s); +} + +int main(int ac, char **av) +{ + int c, i, nnodes=0; + long node=-1; + char *end; + char shortopts[array_len(opts)*2 + 1]; + struct bitmask *mask = NULL; + + get_short_opts(opts,shortopts); + while ((c = getopt_long(ac, av, shortopts, opts, NULL)) != -1) { + switch (c) { + case 's': /* --show */ + show(); + exit(0); + case 'H': /* --hardware */ + nopolicy(); + hardware(); + exit(0); + case 'i': /* --interleave */ + checknuma(); + if (parse_all) + mask = numactl_parse_nodestring(optarg, ALL); + else + mask = numactl_parse_nodestring(optarg, CPUSET); + if (!mask) { + printf ("<%s> is invalid\n", optarg); + usage(); + } + + errno = 0; + did_node_cpu_parse = 1; + setpolicy(MPOL_INTERLEAVE); + if (shmfd >= 0) + numa_interleave_memory(shmptr, shmlen, mask); + else + numa_set_interleave_mask(mask); + checkerror("setting interleave mask"); + break; + case 'N': /* --cpunodebind */ + case 'c': /* --cpubind */ + dontshm("-c/--cpubind/--cpunodebind"); + checknuma(); + if (parse_all) + mask = numactl_parse_nodestring(optarg, ALL); + else + mask = numactl_parse_nodestring(optarg, CPUSET); + if (!mask) { + printf ("<%s> is invalid\n", optarg); + usage(); + } + errno = 0; + check_cpubind(do_shm); + did_cpubind = 1; + did_node_cpu_parse = 1; + numa_run_on_node_mask_all(mask); + checkerror("sched_setaffinity"); + break; + case 'C': /* --physcpubind */ + { + struct bitmask *cpubuf; + dontshm("-C/--physcpubind"); + if (parse_all) + cpubuf = numa_parse_cpustring_all(optarg); + else + cpubuf = numa_parse_cpustring(optarg); + if (!cpubuf) { + printf ("<%s> is invalid\n", optarg); + usage(); + } + errno = 0; + check_cpubind(do_shm); + did_cpubind = 1; + did_node_cpu_parse = 1; + numa_sched_setaffinity(0, cpubuf); + checkerror("sched_setaffinity"); + free(cpubuf); + break; + } + case 'm': /* --membind */ + checknuma(); + 
setpolicy(MPOL_BIND); + if (parse_all) + mask = numactl_parse_nodestring(optarg, ALL); + else + mask = numactl_parse_nodestring(optarg, CPUSET); + if (!mask) { + printf ("<%s> is invalid\n", optarg); + usage(); + } + errno = 0; + did_node_cpu_parse = 1; + numa_set_bind_policy(1); + if (shmfd >= 0) { + numa_tonodemask_memory(shmptr, shmlen, mask); + } else { + numa_set_membind(mask); + } + numa_set_bind_policy(0); + checkerror("setting membind"); + break; + case 'p': /* --preferred */ + checknuma(); + setpolicy(MPOL_PREFERRED); + if (parse_all) + mask = numactl_parse_nodestring(optarg, ALL); + else + mask = numactl_parse_nodestring(optarg, CPUSET); + if (!mask) { + printf ("<%s> is invalid\n", optarg); + usage(); + } + for (i=0; i<mask->size; i++) { + if (numa_bitmask_isbitset(mask, i)) { + node = i; + nnodes++; + } + } + if (nnodes != 1) + usage(); + numa_bitmask_free(mask); + errno = 0; + did_node_cpu_parse = 1; + numa_set_bind_policy(0); + if (shmfd >= 0) + numa_tonode_memory(shmptr, shmlen, node); + else + numa_set_preferred(node); + checkerror("setting preferred node"); + break; + case 'l': /* --local */ + checknuma(); + setpolicy(MPOL_DEFAULT); + errno = 0; + if (shmfd >= 0) + numa_setlocal_memory(shmptr, shmlen); + else + numa_set_localalloc(); + checkerror("local allocation"); + break; + case 'S': /* --shm */ + check_cpubind(did_cpubind); + nopolicy(); + attach_sysvshm(optarg, "--shm"); + shmattached = 1; + break; + case 'f': /* --file */ + check_cpubind(did_cpubind); + nopolicy(); + attach_shared(optarg, "--file"); + shmattached = 1; + break; + case 'L': /* --length */ + noshm("--length"); + shmlen = memsize(optarg); + break; + case 'M': /* --shmmode */ + noshm("--shmmode"); + shmmode = strtoul(optarg, &end, 8); + if (end == optarg || *end) + usage(); + break; + case 'd': /* --dump */ + if (shmfd < 0) + complain( + "Cannot do --dump without shared memory.\n"); + dump_shm(); + do_dump = 1; + break; + case 'D': /* --dump-nodes */ + if (shmfd < 0) + complain( 
+ "Cannot do --dump-nodes without shared memory.\n"); + dump_shm_nodes(); + do_dump = 1; + break; + case 't': /* --strict */ + did_strict = 1; + numa_set_strict(1); + break; + case 'I': /* --shmid */ + shmid = strtoul(optarg, &end, 0); + if (end == optarg || *end) + usage(); + break; + + case 'u': /* --huge */ + noshm("--huge"); + shmflags |= SHM_HUGETLB; + break; + + case 'o': /* --offset */ + noshm("--offset"); + shmoffset = memsize(optarg); + break; + + case 'T': /* --touch */ + needshm("--touch"); + check_shmbeyond("--touch"); + numa_police_memory(shmptr, shmlen); + break; + + case 'V': /* --verify */ + needshm("--verify"); + if (set_policy < 0) + complain("Need a policy first to verify"); + check_shmbeyond("--verify"); + numa_police_memory(shmptr, shmlen); + if (!mask) + complain("Need a mask to verify"); + else + verify_shm(set_policy, mask); + break; + + case 'a': /* --all */ + check_all_parse(did_node_cpu_parse); + parse_all = 1; + break; + default: + usage(); + } + } + + av += optind; + ac -= optind; + + if (shmfd >= 0) { + if (*av) + usage(); + exit(exitcode); + } + + if (did_strict) + fprintf(stderr, + "numactl: warning. Strict flag for process ignored.\n"); + + if (do_dump) + usage_msg("cannot do --dump|--dump-nodes for process"); + + if (shmoption) + usage_msg("shm related option %s for process", shmoption); + + if (*av == NULL) + usage(); + execvp(*av, av); + complain("execution of `%s': %s\n", av[0], strerror(errno)); + return 0; /* not reached */ +}
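Reviewer note on numactl.c: the short-option string handed to getopt_long() is derived from the long-option table by get_short_opts() rather than maintained by hand. A minimal standalone sketch of that idea follows. The name build_short_opts and the demo table mentioned afterwards are hypothetical and are not part of numactl; the caller is assumed to supply a buffer of at least two bytes per table entry plus two.

```c
#include <ctype.h>
#include <getopt.h>

/*
 * Hypothetical standalone variant of get_short_opts() from numactl.c.
 * Walks a NULL-terminated long-option table and writes the matching
 * short-option string into s. The caller must provide a buffer of at
 * least 2 * (number of entries) + 2 bytes.
 */
void build_short_opts(const struct option *o, char *s)
{
	*s++ = '+';	/* GNU getopt: stop at the first non-option argument */
	for (; o->name; o++) {
		if (isprint(o->val)) {
			*s++ = (char)o->val;
			if (o->has_arg)
				*s++ = ':';	/* option takes an argument */
		}
	}
	*s = '\0';
}
```

For a table holding {"all", 0, 0, 'a'} and {"interleave", 1, 0, 'i'} followed by a zeroed terminator, the buffer ends up as "+ai:". The leading '+' matters for a wrapper like numactl because the flags of the wrapped command must be left untouched.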
diff --git a/numademo.c b/numademo.c new file mode 100644 index 0000000..b01e995 --- /dev/null +++ b/numademo.c
@@ -0,0 +1,570 @@ +/* Copyright (C) 2003,2004 Andi Kleen, SuSE Labs. + Test/demo program for libnuma. This is also a more or less useful benchmark + of the NUMA characteristics of your machine. It benchmarks most possible + NUMA policy memory configurations with various benchmarks. + Compile standalone with cc -O2 numademo.c -o numademo -lnuma -lm + + numactl is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; version + 2. + + numactl is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should find a copy of v2 of the GNU General Public License somewhere + on your Linux system; if not, write to the Free Software Foundation, + Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#define _GNU_SOURCE 1 +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <ctype.h> +#include <sys/time.h> +#include "numa.h" +#ifdef HAVE_STREAM_LIB +#include "stream_lib.h" +#endif +#ifdef HAVE_MT +#include "mt.h" +#endif +#ifdef HAVE_CLEAR_CACHE +#include "clearcache.h" +#else +static inline void clearcache(void *a, unsigned size) {} +#endif +#define FRACT_NODES 8 +#define FRACT_MASKS 32 +int fract_nodes; +int *node_to_use; +unsigned long msize; + +/* Should get this from cpuinfo, but on !x86 it's not there */ +enum { + CACHELINESIZE = 64, +}; + +enum test { + MEMSET = 0, + MEMCPY, + FORWARD, + BACKWARD, + STREAM, + RANDOM2, + PTRCHASE, +} thistest; + +char *delim = " "; +int force; +int regression_testing=0; + +char *testname[] = { + "memset", + "memcpy", + "forward", + "backward", +#ifdef HAVE_STREAM_LIB + "stream", +#endif +#ifdef HAVE_MT + "random2", +#endif + "ptrchase", + NULL, +}; + +void output(char *title, char *result) +{ + if (!isspace(delim[0])) + 
printf("%s%s%s\n", title,delim, result); + else + printf("%-42s%s\n", title, result); +} + +#ifdef HAVE_STREAM_LIB +void do_stream(char *name, unsigned char *mem) +{ + int i; + char title[100], buf[100]; + double res[STREAM_NRESULTS]; + stream_verbose = 0; + clearcache(mem, msize); + stream_init(mem); + stream_test(res); + sprintf(title, "%s%s%s", name, delim, "STREAM"); + buf[0] = '\0'; + for (i = 0; i < STREAM_NRESULTS; i++) { + if (buf[0]) + strcat(buf,delim); + sprintf(buf+strlen(buf), "%s%s%.2f%sMB/s", + stream_names[i], delim, res[i], delim); + } + output(title, buf); + clearcache(mem, msize); +} +#endif + +/* Set up a randomly distributed list to fool prefetchers */ +union node { + union node *next; + struct { + unsigned nexti; + unsigned val; + }; +}; + +static int cmp_node(const void *ap, const void *bp) +{ + union node *a = (union node *)ap; + union node *b = (union node *)bp; + return a->val - b->val; +} + +void **ptrchase_init(unsigned char *mem) +{ + long i; + union node *nodes = (union node *)mem; + long nmemb = msize / sizeof(union node); + srand(1234); + for (i = 0; i < nmemb; i++) { + nodes[i].val = rand(); + nodes[i].nexti = i + 1; + } + qsort(nodes, nmemb, sizeof(union node), cmp_node); + for (i = 0; i < nmemb; i++) { + union node *n = &nodes[i]; + n->next = n->nexti >= nmemb ? NULL : &nodes[n->nexti]; + } + return (void **)nodes; +} + +static inline unsigned long long timerfold(struct timeval *tv) +{ + return tv->tv_sec * 1000000ULL + tv->tv_usec; +} + +#define LOOPS 10 + +void memtest(char *name, unsigned char *mem) +{ + long k; + struct timeval start, end, res; + unsigned long long max, min, sum, r; + int i; + char title[128], result[128]; + + if (!mem) { + fprintf(stderr, + "Failed to allocate %lu bytes of memory. 
Test \"%s\" exits.\n", + msize, name); + return; + } + +#ifdef HAVE_STREAM_LIB + if (thistest == STREAM) { + do_stream(name, mem); + goto out; + } +#endif + + max = 0; + min = ~0UL; + sum = 0; + + /* + * Note: 0th pass allocates the pages, don't measure + */ + for (i = 0; i < LOOPS+1; i++) { + clearcache(mem, msize); + switch (thistest) { + case PTRCHASE: + { + void **ptr; + ptr = ptrchase_init(mem); + gettimeofday(&start,NULL); + while (*ptr) + ptr = (void **)*ptr; + gettimeofday(&end,NULL); + /* Side effect to trick the optimizer */ + *ptr = "bla"; + break; + } + + case MEMSET: + gettimeofday(&start,NULL); + memset(mem, 0xff, msize); + gettimeofday(&end,NULL); + break; + + case MEMCPY: + gettimeofday(&start,NULL); + memcpy(mem, mem + msize/2, msize/2); + gettimeofday(&end,NULL); + break; + + case FORWARD: + /* simple kernel to just fetch cachelines and write them back. + will trigger hardware prefetch */ + gettimeofday(&start,NULL); + for (k = 0; k < msize; k+=CACHELINESIZE) + mem[k]++; + gettimeofday(&end,NULL); + break; + + case BACKWARD: + gettimeofday(&start,NULL); + for (k = msize-5; k > 0; k-=CACHELINESIZE) + mem[k]--; + gettimeofday(&end,NULL); + break; + +#ifdef HAVE_MT + case RANDOM2: + { + unsigned * __restrict m = (unsigned *)mem; + unsigned max = msize / sizeof(unsigned); + unsigned mask; + + mt_init(); + mask = 1; + while (mask < max) + mask = (mask << 1) | 1; + /* + * There's no guarantee all memory is touched, but + * we assume (hope) that the distribution of the MT + * is good enough to touch most. 
+ */ + gettimeofday(&start,NULL); + for (k = 0; k < max; k++) { + unsigned idx = mt_random() & mask; + if (idx >= max) + idx -= max; + m[idx]++; + } + gettimeofday(&end,NULL); + } + +#endif + default: + break; + } + + if (!i) + continue; /* don't count allocation pass */ + + timersub(&end, &start, &res); + r = timerfold(&res); + if (r > max) max = r; + if (r < min) min = r; + sum += r; + } + sprintf(title, "%s%s%s", name, delim, testname[thistest]); +#define H(t) (((double)msize) / ((double)t)) +#define D3 delim,delim,delim + sprintf(result, "Avg%s%.2f%sMB/s%sMax%s%.2f%sMB/s%sMin%s%.2f%sMB/s", + delim, + H(sum/LOOPS), + D3, + H(min), + D3, + H(max), + delim); +#undef H +#undef D3 + output(title,result); + +#ifdef HAVE_STREAM_LIB + out: +#endif + /* Just to make sure that when we switch CPUs that the old guy + doesn't still keep it around. */ + clearcache(mem, msize); + + numa_free(mem, msize); +} + +int popcnt(unsigned long val) +{ + int i = 0, cnt = 0; + while (val >> i) { + if ((1UL << i) & val) + cnt++; + i++; + } + return cnt; +} + +int max_node, numnodes; + +void get_node_list() +{ + int a, got_nodes = 0; + long free_node_sizes; + + numnodes = numa_num_configured_nodes(); + node_to_use = (int *)malloc(numnodes * sizeof(int)); + max_node = numa_max_node(); + for (a = 0; a <= max_node; a++) { + if(numa_node_size(a, &free_node_sizes) != -1) + node_to_use[got_nodes++] = a; + } +} + +void test(enum test type) +{ + unsigned long mask; + int i, k; + char buf[512]; + struct bitmask *nodes; + + nodes = numa_allocate_nodemask(); + thistest = type; + + if (regression_testing) { + printf("\nTest %s doing 1 of %d nodes and 1 of %d masks.\n", + testname[thistest], fract_nodes, FRACT_MASKS); + } + + memtest("memory with no policy", numa_alloc(msize)); + memtest("local memory", numa_alloc_local(msize)); + + memtest("memory interleaved on all nodes", numa_alloc_interleaved(msize)); + for (i = 0; i < numnodes; i++) { + if (regression_testing && (node_to_use[i] % fract_nodes)) { 
+ /* for regression testing (-t) do only every eighth node */ + continue; + } + sprintf(buf, "memory on node %d", node_to_use[i]); + memtest(buf, numa_alloc_onnode(msize, node_to_use[i])); + } + + for (mask = 1, i = 0; mask < (1UL<<numnodes); mask++, i++) { + int w; + char buf2[20]; + if (popcnt(mask) == 1) + continue; + if (regression_testing && (i > 50)) { + break; + } + if (regression_testing && (i % FRACT_MASKS)) { + /* for regression testing (-t) + do only every 32nd mask permutation */ + continue; + } + numa_bitmask_clearall(nodes); + for (w = 0; mask >> w; w++) { + if ((mask >> w) & 1) + numa_bitmask_setbit(nodes, w); + } + + sprintf(buf, "memory interleaved on"); + for (k = 0; k < numnodes; k++) + if ((1UL<<node_to_use[k]) & mask) { + sprintf(buf2, " %d", node_to_use[k]); + strcat(buf, buf2); + } + memtest(buf, numa_alloc_interleaved_subset(msize, nodes)); + } + + for (i = 0; i < numnodes; i++) { + if (regression_testing && (node_to_use[i] % fract_nodes)) { + /* for regression testing (-t) do only every eighth node */ + continue; + } + printf("setting preferred node to %d\n", node_to_use[i]); + numa_set_preferred(node_to_use[i]); + memtest("memory without policy", numa_alloc(msize)); + } + + numa_set_interleave_mask(numa_all_nodes_ptr); + memtest("manual interleaving to all nodes", numa_alloc(msize)); + + if (numnodes > 0) { + numa_bitmask_clearall(nodes); + numa_bitmask_setbit(nodes, 0); + numa_bitmask_setbit(nodes, 1); + numa_set_interleave_mask(nodes); + memtest("manual interleaving on node 0/1", numa_alloc(msize)); + printf("current interleave node %d\n", numa_get_interleave_node()); + } + + numa_set_interleave_mask(numa_no_nodes_ptr); + + nodes = numa_allocate_nodemask(); + + for (i = 0; i < numnodes; i++) { + int oldhn = numa_preferred(); + + if (regression_testing && (node_to_use[i] % fract_nodes)) { + /* for regression testing (-t) do only every eighth node */ + continue; + } + numa_run_on_node(node_to_use[i]); + printf("running on node %d, 
preferred node %d\n",node_to_use[i], oldhn); + + memtest("local memory", numa_alloc_local(msize)); + + memtest("memory interleaved on all nodes", + numa_alloc_interleaved(msize)); + + if (numnodes >= 2) { + numa_bitmask_clearall(nodes); + numa_bitmask_setbit(nodes, 0); + numa_bitmask_setbit(nodes, 1); + memtest("memory interleaved on node 0/1", + numa_alloc_interleaved_subset(msize, nodes)); + } + + for (k = 0; k < numnodes; k++) { + if (node_to_use[k] == node_to_use[i]) + continue; + if (regression_testing && (node_to_use[k] % fract_nodes)) { + /* for regression testing (-t) + do only every eighth node */ + continue; + } + sprintf(buf, "alloc on node %d", node_to_use[k]); + numa_bitmask_clearall(nodes); + numa_bitmask_setbit(nodes, node_to_use[k]); + numa_set_membind(nodes); + memtest(buf, numa_alloc(msize)); + numa_set_membind(numa_all_nodes_ptr); + } + + numa_set_localalloc(); + memtest("local allocation", numa_alloc(msize)); + + numa_set_preferred((node_to_use[i]+1) % numnodes ); + memtest("setting wrong preferred node", numa_alloc(msize)); + numa_set_preferred(node_to_use[i]); + memtest("setting correct preferred node", numa_alloc(msize)); + numa_set_preferred(-1); + if (!delim[0]) + printf("\n\n\n"); + } + + /* numa_run_on_node_mask is not tested */ +} + +void usage(void) +{ + int i; + printf("usage: numademo [-S] [-f] [-c] [-e] [-t] msize[kmg] {tests}\nNo tests means run all.\n"); + printf("-c output CSV data. -f run even without NUMA API. -S run stupid tests. 
-e exit on error\n"); + printf("-t regression test; do not run all node combinations\n"); + printf("valid tests:"); + for (i = 0; testname[i]; i++) + printf(" %s", testname[i]); + putchar('\n'); + exit(1); +} + +/* duplicated to make numademo standalone */ +long memsize(char *s) +{ + char *end; + long length = strtoul(s,&end,0); + switch (toupper(*end)) { + case 'G': length *= 1024; /*FALL THROUGH*/ + case 'M': length *= 1024; /*FALL THROUGH*/ + case 'K': length *= 1024; break; + } + return length; +} + +int main(int ac, char **av) +{ + int simple_tests = 0; + + while (av[1] && av[1][0] == '-') { + ac--; + switch (av[1][1]) { + case 'c': + delim = ","; + break; + case 'f': + force = 1; + break; + case 'S': + simple_tests = 1; + break; + case 'e': + numa_exit_on_error = 1; + numa_exit_on_warn = 1; + break; + case 't': + regression_testing = 1; + break; + default: + usage(); + break; + } + ++av; + } + + if (!av[1]) + usage(); + + if (numa_available() < 0) { + printf("your system does not support the numa API.\n"); + if (!force) + exit(1); + } + get_node_list(); + printf("%d nodes available\n", numnodes); + fract_nodes = (((numnodes-1)/8)*2) + FRACT_NODES; + + if (numnodes <= 3) + regression_testing = 0; /* set -t auto-off for small systems */ + + msize = memsize(av[1]); + + if (!msize) + usage(); + +#ifdef HAVE_STREAM_LIB + stream_setmem(msize); +#endif + + if (av[2] == NULL) { + test(MEMSET); + test(MEMCPY); + if (simple_tests) { + test(FORWARD); + test(BACKWARD); + } +#ifdef HAVE_MT + test(RANDOM2); +#endif +#ifdef HAVE_STREAM_LIB + test(STREAM); +#endif + if (msize >= sizeof(union node)) { + test(PTRCHASE); + } else { + fprintf(stderr, "You must set msize at least %lu bytes for ptrchase test.\n", + sizeof(union node)); + exit(1); + } + } else { + int k; + for (k = 2; k < ac; k++) { + int i; + int found = 0; + for (i = 0; testname[i]; i++) { + if (!strcmp(testname[i],av[k])) { + test(i); + found = 1; + break; + } + } + if (!found) { + fprintf(stderr,"unknown test 
`%s'\n", av[k]); + usage(); + } + } + } + return 0; +}
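Reviewer note on numademo.c: memsize() applies the k/m/g suffix through a switch whose cases fall through on purpose, so each larger unit multiplies by 1024 once more than the one below it. The sketch below isolates that technique. The name parse_size is hypothetical, and the use of strtoull with an unsigned long long result is an assumption made here to keep the example independent of the width of long; the original returns long.

```c
#include <ctype.h>
#include <stdlib.h>

/*
 * Hypothetical helper mirroring the suffix handling of memsize().
 * Parses a number (base auto-detected by strtoull) followed by an
 * optional K, M or G suffix in either case.
 */
unsigned long long parse_size(const char *s)
{
	char *end;
	unsigned long long length = strtoull(s, &end, 0);

	switch (toupper((unsigned char)*end)) {
	case 'G': length *= 1024; /* fall through */
	case 'M': length *= 1024; /* fall through */
	case 'K': length *= 1024; break;
	}
	return length;
}
```

With this helper, "4k" yields 4096 and "2M" yields 2097152, while a bare "123" is returned unchanged. As in the original, anything after the suffix character is ignored and an unknown suffix leaves the number unscaled.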
diff --git a/numaif.h b/numaif.h new file mode 100644 index 0000000..d81d458 --- /dev/null +++ b/numaif.h
@@ -0,0 +1,48 @@ +#ifndef NUMAIF_H +#define NUMAIF_H 1 + +#ifdef __cplusplus +extern "C" { +#endif + +/* Kernel interface for NUMA API */ + +/* System calls */ +extern long get_mempolicy(int *policy, unsigned long *nmask, + unsigned long maxnode, void *addr, unsigned flags); +extern long mbind(void *start, unsigned long len, int mode, + const unsigned long *nmask, unsigned long maxnode, unsigned flags); +extern long set_mempolicy(int mode, const unsigned long *nmask, + unsigned long maxnode); +extern long migrate_pages(int pid, unsigned long maxnode, + const unsigned long *frommask, + const unsigned long *tomask); + +extern long move_pages(int pid, unsigned long count, + void **pages, const int *nodes, int *status, int flags); + +/* Policies */ +#define MPOL_DEFAULT 0 +#define MPOL_PREFERRED 1 +#define MPOL_BIND 2 +#define MPOL_INTERLEAVE 3 + +#define MPOL_MAX MPOL_INTERLEAVE + +/* Flags for get_mempolicy */ +#define MPOL_F_NODE (1<<0) /* return next interleave node or node of address */ + /* Warning: MPOL_F_NODE is unsupported and + subject to change. Don't use. */ +#define MPOL_F_ADDR (1<<1) /* look up vma using address */ +#define MPOL_F_MEMS_ALLOWED (1<<2) /* query nodes allowed in cpuset */ + +/* Flags for mbind */ +#define MPOL_MF_STRICT (1<<0) /* Verify existing pages in the mapping */ +#define MPOL_MF_MOVE (1<<1) /* Move pages owned by this process to conform to mapping */ +#define MPOL_MF_MOVE_ALL (1<<2) /* Move every page to conform to mapping */ + +#ifdef __cplusplus +} +#endif + +#endif
diff --git a/numaint.h b/numaint.h new file mode 100644 index 0000000..e9cd385 --- /dev/null +++ b/numaint.h
@@ -0,0 +1,57 @@ +/* Internal interfaces of libnuma */ + +extern int numa_sched_setaffinity_v1(pid_t pid, unsigned len, const unsigned long *mask); +extern int numa_sched_getaffinity_v1(pid_t pid, unsigned len, const unsigned long *mask); +extern int numa_sched_setaffinity_v1_int(pid_t pid, unsigned len,const unsigned long *mask); +extern int numa_sched_getaffinity_v1_int(pid_t pid, unsigned len,const unsigned long *mask); +extern int numa_sched_setaffinity_v2(pid_t pid, struct bitmask *mask); +extern int numa_sched_getaffinity_v2(pid_t pid, struct bitmask *mask); +extern int numa_sched_setaffinity_v2_int(pid_t pid, struct bitmask *mask); +extern int numa_sched_getaffinity_v2_int(pid_t pid, struct bitmask *mask); + +#define SHM_HUGETLB 04000 /* segment will use huge TLB pages */ + +#define BITS_PER_LONG (sizeof(unsigned long) * 8) +#define CPU_BYTES(x) (round_up(x, BITS_PER_LONG)/8) +#define CPU_LONGS(x) (CPU_BYTES(x) / sizeof(long)) + +#define make_internal_alias(x) extern __typeof (x) x##_int __attribute((alias(#x), visibility("hidden"))) +#define hidden __attribute__((visibility("hidden"))) + +enum numa_warn { + W_nosysfs, + W_noproc, + W_badmeminfo, + W_nosysfs2, + W_cpumap, + W_numcpus, + W_noderunmask, + W_distance, + W_memory, + W_cpuparse, + W_nodeparse, + W_blockdev1, + W_blockdev2, + W_blockdev3, + W_blockdev4, + W_blockdev5, + W_netlink1, + W_netlink2, + W_netlink3, + W_net1, + W_net2, + W_class1, + W_class2, + W_pci1, + W_pci2, + W_node_parse1, + W_node_parse2, + W_nonode, + W_badchar, +}; + +#define howmany(x,y) (((x)+((y)-1))/(y)) +#define bitsperlong (8 * sizeof(unsigned long)) +#define bitsperint (8 * sizeof(unsigned int)) +#define longsperbits(n) howmany(n, bitsperlong) +#define bytesperbits(x) ((x+7)/8)
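Reviewer note on numaint.h: howmany, longsperbits and bytesperbits are all ceiling divisions that answer "how many storage units does a mask of n bits need". The sketch below restates them as functions so the rounding can be checked in isolation. The names how_many, longs_for_bits and bytes_for_bits are hypothetical and do not exist in libnuma.

```c
#include <stddef.h>

/* Ceiling division: number of y-sized units needed to hold x items. */
static size_t how_many(size_t x, size_t y)
{
	return (x + (y - 1)) / y;
}

/* Hypothetical function form of longsperbits(n). */
size_t longs_for_bits(size_t nbits)
{
	return how_many(nbits, 8 * sizeof(unsigned long));
}

/* Hypothetical function form of bytesperbits(x). */
size_t bytes_for_bits(size_t nbits)
{
	return (nbits + 7) / 8;
}
```

On a platform where unsigned long is 64 bits wide, a 65-bit mask needs two longs, and on any platform a 9-bit mask needs two bytes. Writing the helpers as functions also avoids the unparenthesized argument in the bytesperbits macro, which would misbehave if it were passed an expression built from lower-precedence operators.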
diff --git a/numamon.c b/numamon.c new file mode 100644 index 0000000..a118c5f --- /dev/null +++ b/numamon.c
@@ -0,0 +1,331 @@ +/* Copyright (C) 2003,2004 Andi Kleen, SuSE Labs. + + numamon is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; version + 2. + + numamon is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should find a copy of v2 of the GNU General Public License somewhere + on your Linux system; if not, write to the Free Software Foundation, + Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + + Display some numa statistics collected by the CPU. + Opteron specific. Also not reliable because the counters + are not quite correct in hardware. */ + +#define _LARGE_FILE_SOURCE 1 +#define _GNU_SOURCE 1 +#include <string.h> +#include <errno.h> +#include <stdio.h> +#include <unistd.h> +#include <dirent.h> +#include <getopt.h> +#include <stdarg.h> +#include <stdlib.h> +#include <sys/fcntl.h> + +enum { LOCALLOCAL = 0, LOCALREMOTE = 1, REMOTELOCAL = 2 }; +static int mem[] = { [LOCALLOCAL] = 0xa8, [LOCALREMOTE] = 0x98, [REMOTELOCAL] = 0x68 }; +static int io[] = { [LOCALLOCAL] = 0xa4, [LOCALREMOTE] = 0x94, [REMOTELOCAL] = 0x64 }; +static int *masks = mem; + +#define err(x) perror(x),exit(1) + +#define PERFEVTSEL0 0xc0010000 +#define PERFEVTSEL1 0xc0010001 +#define PERFEVTSEL2 0xc0010002 +#define PERFEVTSEL3 0xc0010003 + +#define PERFCTR0 0xc0010004 +#define PERFCTR1 0xc0010005 +#define PERFCTR2 0xc0010006 +#define PERFCTR3 0xc0010007 + +#define EVENT 0xe9 +#define PERFEVTSEL_EN (1 << 22) +#define PERFEVTSEL_OS (1 << 17) +#define PERFEVTSEL_USR (1 << 16) + +#define BASE (EVENT | PERFEVTSEL_EN | PERFEVTSEL_OS | PERFEVTSEL_USR) + +#define MAXCPU 8 + +int force = 0; +int msrfd[MAXCPU]; +int delay; +int absolute; +char *cfilter; +int verbose; + +void usage(void); + +void 
Vprintf(char *fmt, ...) +{ + va_list ap; + va_start(ap,fmt); + if (verbose) + vfprintf(stderr,fmt,ap); + va_end(ap); +} + +unsigned long long rdmsr(int cpu, unsigned long msr) +{ + unsigned long long val; + if (pread(msrfd[cpu], &val, 8, msr) != 8) { + fprintf(stderr, "rdmsr of %lx failed: %s\n", msr, strerror(errno)); + exit(1); + } + return val; +} + +void wrmsr(int cpu, unsigned long msr, unsigned long long value) +{ + if (pwrite(msrfd[cpu], &value, 8, msr) != 8) { + fprintf(stderr, "wrmsr of %lx failed: %s\n", msr, strerror(errno)); + exit(1); + } +} + +int cpufilter(int cpu) +{ + long num; + char *end; + char *s; + + if (!cfilter) + return 1; + for (s = cfilter;;) { + num = strtoul(s, &end, 0); + if (end == s) + usage(); + if (cpu == num) + return 1; + if (*end == ',') + s = end+1; + else if (*end == 0) + break; + else + usage(); + } + return 0; +} + +void checkcounter(int cpu, int clear) +{ + int i; + for (i = 1; i < 4; i++) { + int clear_this = clear; + unsigned long long evtsel = rdmsr(cpu, PERFEVTSEL0 + i); + Vprintf("%d: %x %Lx\n", cpu, PERFEVTSEL0 + i, evtsel); + if (!(evtsel & PERFEVTSEL_EN)) { + Vprintf("reinit %d\n", cpu); + wrmsr(cpu, PERFEVTSEL0 + i, BASE | (masks[i - 1] << 8)); + clear_this = 1; + } else if (evtsel == (BASE | (masks[i-1] << 8))) { + /* everything fine */ + } else if (force) { + Vprintf("reinit force %d\n", cpu); + wrmsr(cpu, PERFEVTSEL0 + i, BASE | (masks[i - 1] << 8)); + clear_this = 1; + } else { + fprintf(stderr, "perfctr %d cpu %d already used with %Lx\n", + i, cpu, evtsel); + fprintf(stderr, "Consider using -f if you know what you're doing.\n"); + exit(1); + } + if (clear_this) { + Vprintf("clearing %d\n", cpu); + wrmsr(cpu, PERFCTR0 + i, 0); + } + } +} + +void setup(int clear) +{ + DIR *dir; + struct dirent *d; + int numcpus = 0; + + memset(msrfd, -1, sizeof(msrfd)); + dir = opendir("/dev/cpu"); + if (!dir) + err("cannot open /dev/cpu"); + while ((d = readdir(dir)) != NULL) { + char buf[64]; + char *end; + long cpunum =
strtoul(d->d_name, &end, 0); + if (*end != 0) + continue; + if (cpunum >= MAXCPU) { + fprintf(stderr, "too many cpus %ld %s\n", cpunum, d->d_name); + continue; + } + if (!cpufilter(cpunum)) + continue; + snprintf(buf, 63, "/dev/cpu/%ld/msr", cpunum); + msrfd[cpunum] = open64(buf, O_RDWR); + if (msrfd[cpunum] < 0) + continue; + numcpus++; + checkcounter(cpunum, clear); + } + closedir(dir); + if (numcpus == 0) { + fprintf(stderr, "No CPU found using MSR driver.\n"); + exit(1); + } +} + +void printf_padded(int pad, char *fmt, ...) +{ + char buf[pad + 1]; + va_list ap; + va_start(ap, fmt); + vsnprintf(buf, pad, fmt, ap); + printf("%-*s", pad, buf); + va_end(ap); +} + +void print_header(void) +{ + printf_padded(4, "CPU "); + printf_padded(16, "LOCAL"); + printf_padded(16, "LOCAL->REMOTE"); + printf_padded(16, "REMOTE->LOCAL"); + putchar('\n'); +} + +void print_cpu(int cpu) +{ + int i; + /* previous counter values, kept per CPU so deltas are not mixed across CPUs */ + static unsigned long long lastval[MAXCPU][4]; + printf_padded(4, "%d", cpu); + for (i = 1; i < 4; i++) { + unsigned long long val = rdmsr(cpu, PERFCTR0 + i); + if (absolute) + printf_padded(16, "%Lu", val); + else + printf_padded(16, "%Lu", val - lastval[cpu][i]); + lastval[cpu][i] = val; + } + putchar('\n'); +} + +void dumpall(void) +{ + int cnt = 0; + int cpu; + print_header(); + for (;;) { + for (cpu = 0; cpu < MAXCPU; ++cpu) { + if (msrfd[cpu] < 0) + continue; + print_cpu(cpu); + } + if (!delay) + break; + sleep(delay); + if (++cnt > 40) { + cnt = 0; + print_header(); + } + } +} + +void checkk8(void) +{ + char *line = NULL; + size_t size = 0; + int bad = 0; + FILE *f = fopen("/proc/cpuinfo", "r"); + if (!f) + return; + while (getline(&line, &size, f) > 0) { + if (!strncmp("vendor_id", line, 9)) { + if (!strstr(line, "AMD")) + bad++; + } + if (!strncmp("cpu family", line, 10)) { + char *s = line + strcspn(line,":"); + int family; + if (*s == ':') ++s; + family = strtoul(s, NULL, 0); + if (family != 15) + bad++; + } + } + if (bad) { + printf("not an Opteron CPU\n"); + exit(1); + } + free(line); + 
fclose(f); +} + +void usage(void) +{ + fprintf(stderr, "usage: numamon [args] [delay]\n"); + fprintf(stderr, " -f forcibly overwrite counters\n"); + fprintf(stderr, " -i count IO (default memory)\n"); + fprintf(stderr, " -a print absolute counter values (with delay)\n"); + fprintf(stderr, " -s setup counters and exit\n"); + fprintf(stderr, " -c clear counters and exit\n"); + fprintf(stderr, " -m Print memory traffic (default)\n"); + fprintf(stderr, " -C cpu{,cpu} only print for cpus\n"); + fprintf(stderr, " -v Be verbose\n"); + exit(1); +} + +int main(int ac, char **av) +{ + int opt; + checkk8(); + while ((opt = getopt(ac,av,"ifscmaC:v")) != -1) { + switch (opt) { + case 'f': + force = 1; + break; + case 'c': + setup(1); + exit(0); + case 's': + setup(0); + exit(0); + case 'm': + masks = mem; + break; + case 'i': + masks = io; + break; + case 'a': + absolute = 1; + break; + case 'C': + cfilter = optarg; + break; + case 'v': + verbose = 1; + break; + default: + usage(); + } + } + if (av[optind]) { + char *end; + delay = strtoul(av[optind], &end, 10); + if (*end) + usage(); + if (av[optind+1]) + usage(); + } + + setup(0); + dumpall(); + return 0; +}
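A minimal sketch of the value that checkcounter() compares each event-select register against: BASE with the unit mask shifted into bits 8-15. The constants are copied from numamon.c; the helper name `evtsel_value` is hypothetical and is not part of the program.

```c
#include <stdint.h>

/* Constants copied from numamon.c. */
#define EVENT           0xe9
#define PERFEVTSEL_EN   (1 << 22)
#define PERFEVTSEL_OS   (1 << 17)
#define PERFEVTSEL_USR  (1 << 16)
#define BASE (EVENT | PERFEVTSEL_EN | PERFEVTSEL_OS | PERFEVTSEL_USR)

/* Hypothetical helper (not in numamon.c): compose the event-select
   value for one unit mask, e.g. mem[LOCALLOCAL] = 0xa8. */
static uint64_t evtsel_value(unsigned unit_mask)
{
    /* The unit mask occupies bits 8-15 of the register. */
    return (uint64_t)BASE | ((uint64_t)(unit_mask & 0xff) << 8);
}
```

With the memory masks from the `mem[]` table, counters 1 to 3 would be programmed with `evtsel_value(0xa8)`, `evtsel_value(0x98)` and `evtsel_value(0x68)` respectively.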
diff --git a/numastat.8 b/numastat.8 new file mode 100644 index 0000000..4dcddf3 --- /dev/null +++ b/numastat.8
@@ -0,0 +1,158 @@ +.TH "numastat" "8" "1.0.0" "Bill Gray" "Administration" +.SH NAME +.LP +\fBnumastat\fP \- Show per-NUMA-node memory statistics for processes and the operating system +.SH "SYNTAX" +.LP +\fBnumastat\fP +.br +.LP +\fBnumastat\fP [\fI\-V\fP] +.br +.LP +\fBnumastat\fP [\fI\<PID>|<pattern>...\fP] +.br +.LP +\fBnumastat\fP [\fI\-c\fP] [\fI\-m\fP] [\fI\-n\fP] [\fI\-p <PID>|<pattern>\fP] [\fI\-s[<node>]\fP] [\fI\-v\fP] [\fI\-z\fP] [\fI\<PID>|<pattern>...\fP] +.br +.SH "DESCRIPTION" +.LP +.B numastat +with no command options or arguments at all, displays per-node NUMA hit and +miss system statistics from the kernel memory allocator. This default +\fBnumastat\fP behavior is strictly compatible with the previous long-standing +\fBnumastat\fP perl script, written by Andi Kleen. The default \fBnumastat\fP +statistics shows per-node numbers (in units of pages of memory) in these categories: +.LP +.B numa_hit +is memory successfully allocated on this node as intended. +.LP +.B numa_miss +is memory allocated on this node despite the process preferring some different node. Each +.I numa_miss +has a +.I numa_foreign +on another node. +.LP +.B numa_foreign +is memory intended for this node, but actually allocated on some different node. Each +.I numa_foreign +has a +.I numa_miss +on another node. +.LP +.B interleave_hit +is interleaved memory successfully allocated on this node as intended. +.LP +.B local_node +is memory allocated on this node while a process was running on it. +.LP +.B other_node +is memory allocated on this node while a process was running on some other node. +.LP +Any supplied options or arguments with the \fBnumastat\fP command will +significantly change both the content and the format of the display. Specified +options will cause display units to change to megabytes of memory, and will +change other specific behaviors of \fBnumastat\fP as described below. 
+.SH "OPTIONS" +.LP +.TP +\fB\-c\fR +Minimize table display width by dynamically shrinking column widths based on +data contents. With this option, amounts of memory will be rounded to the +nearest megabyte (rather than the usual display with two decimal places). +Column width and inter-column spacing will be somewhat unpredictable with this +option, but the more dense display will be very useful on systems with many +NUMA nodes. +.TP +\fB\-m\fR +Show the meminfo-like system-wide memory usage information. This option +produces a per-node breakdown of memory usage information similar to that found +in /proc/meminfo. +.TP +\fB\-n\fR +Show the original \fBnumastat\fP statistics info. This will show the same +information as the default \fBnumastat\fP behavior but the units will be megabytes of +memory, and there will be other formatting and layout changes versus the +original \fBnumastat\fP behavior. +.TP +\fB\-p\fR <\fBPID\fP> or <\fBpattern\fP> +Show per-node memory allocation information for the specified PID or pattern. +If the \-p argument is only digits, it is assumed to be a numerical PID. If +the argument characters are not only digits, it is assumed to be a text +fragment pattern to search for in process command lines. For example, +\fBnumastat -p qemu\fP will attempt to find and show information for processes +with "qemu" in the command line. Any command line arguments remaining after +\fBnumastat\fP option flag processing is completed, are assumed to be +additional <\fBPID\fP> or <\fBpattern\fP> process specifiers. In this sense, +the \fB\-p\fP option flag is optional: \fBnumastat qemu\fP is equivalent to +\fBnumastat -p qemu\fP +.TP +\fB\-s[<node>]\fR +Sort the table data in descending order before displaying it, so the biggest +memory consumers are listed first. With no specified <node>, the table will be +sorted by the total column. If the optional <node> argument is supplied, the +data will be sorted by the <node> column. 
Note that <node> must follow the +\fB\-s\fP immediately with no intermediate white space (e.g., \fBnumastat +\-s2\fP). Because \fB\-s\fP can allow an optional argument, it must always be +the last option character in a compound option character string. For example, +instead of \fBnumastat \-msc\fP (which probably will not work as you expect), +use \fBnumastat \-mcs\fP. +.TP +\fB\-v\fR +Make some reports more verbose. In particular, process information for +multiple processes will display detailed information for each process. +Normally when per-node information for multiple processes is displayed, only +the total lines are shown. +.TP +\fB\-V\fR +Display \fBnumastat\fP version information and exit. +.TP +\fB\-z\fR +Skip display of table rows and columns of only zero values. This can be used +to greatly reduce the amount of uninteresting zero data on systems with many +NUMA nodes. Note that when rows or columns of zeros are still displayed with +this option, that probably means there is at least one value in the row or +column that is actually non-zero, but rounded to zero for display. +.SH NOTES +\fBnumastat\fP attempts to fold each table display so it will be conveniently +readable on the output terminal. Normally a terminal width of 80 characters is +assumed. When the \fBresize\fP command is available, \fBnumastat\fP attempts +to dynamically determine and fine tune the output tty width from \fBresize\fP +output. If \fBnumastat\fP output is not to a tty, very long output lines can +be produced, depending on how many NUMA nodes are present. In all cases, +output width can be explicitly specified via the \fBNUMASTAT_WIDTH\fP +environment variable. For example, \fBNUMASTAT_WIDTH=100 numastat\fP. On +systems with many NUMA nodes, \fBnumastat \-c \-z ....\fP can be very helpful +to selectively reduce the amount of displayed information. 
+.SH "ENVIRONMENT VARIABLES" +.LP +.TP +NUMASTAT_WIDTH +.SH "FILES" +.LP +\fI/proc/*/numa_maps\fP +.br +\fI/sys/devices/system/node/node*/meminfo\fP +.br +\fI/sys/devices/system/node/node*/numastat\fP +.SH "EXAMPLES" +.I numastat \-c \-z \-m \-n +.br +.I numastat \-czs libvirt kvm qemu +.br +.I watch \-n1 numastat +.br +.I watch \-n1 \-\-differences=cumulative numastat +.SH "AUTHORS" +.LP +The original \fBnumastat\fP perl script was written circa 2003 by Andi Kleen +<andi.kleen@intel.com>. The current \fBnumastat\fP program was written in 2012 +by Bill Gray <bgray@redhat.com> to be compatible by default with the original, +and to add options to display per-node system memory usage and per-node process +memory allocation. +.SH "SEE ALSO" +.LP +.BR numactl (8), +.BR set_mempolicy (2), +.BR numa (3)
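The per-node counters described in the man page above are reported in pages, and any option switches the display to megabytes. A minimal sketch of both steps, assuming each line of a per-node numastat file has the form "<counter> <pages>"; the helper names `parse_numastat_line` and `pages_to_megabytes` are hypothetical and are not taken from numastat.c.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: split one "<counter> <pages>" line.
   Returns 1 on success, 0 if the line does not match or the
   counter name does not fit in the caller's buffer. */
static int parse_numastat_line(const char *line, char *name,
                               size_t name_len, unsigned long long *pages)
{
    char key[64];
    unsigned long long value;

    if (sscanf(line, "%63s %llu", key, &value) != 2)
        return 0;
    if (strlen(key) >= name_len)
        return 0;
    strcpy(name, key);
    *pages = value;
    return 1;
}

/* Hypothetical helper: convert a page count to megabytes,
   given the page size in bytes (for example 4096). */
static double pages_to_megabytes(unsigned long long pages, long page_size_bytes)
{
    return (double)pages * (double)page_size_bytes / (1024.0 * 1024.0);
}
```

The page size is passed in rather than hard-coded because it differs between systems; a caller would typically obtain it from `sysconf(_SC_PAGESIZE)`.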
diff --git a/numastat.c b/numastat.c new file mode 100644 index 0000000..92d8496 --- /dev/null +++ b/numastat.c
@@ -0,0 +1,1480 @@ +/* + +numastat - NUMA monitoring tool to show per-node usage of memory +Copyright (C) 2012 Bill Gray (bgray@redhat.com), Red Hat Inc + +numastat is free software; you can redistribute it and/or modify it under the +terms of the GNU Lesser General Public License as published by the Free +Software Foundation; version 2.1. + +numastat is distributed in the hope that it will be useful, but WITHOUT ANY +WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A +PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details. + +You should find a copy of v2.1 of the GNU Lesser General Public License +somewhere on your Linux system; if not, write to the Free Software Foundation, +Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA + +*/ + +/* + +Historical note: From approximately 2003 to 2012, numastat was a perl script +written by Andi Kleen to display the /sys/devices/system/node/node<N>/numastat +statistics. In 2012, numastat was rewritten as a C program by Red Hat to +display per-node memory data for applications and the system in general, +while also remaining strictly compatible by default with the original numastat. +A copy of the original numastat perl script is included for reference at the +end of this file. + +*/ + +// Compile with: gcc -O -std=gnu99 -Wall -o numastat numastat.c + +#define __USE_MISC +#include <ctype.h> +#include <dirent.h> +#include <errno.h> +#include <getopt.h> +#include <stdint.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/types.h> +#include <unistd.h> + +#define STRINGIZE(s) #s +#define STRINGIFY(s) STRINGIZE(s) + +#define KILOBYTE (1024) +#define MEGABYTE (1024 * 1024) + +#define BUF_SIZE 2048 +#define SMALL_BUF_SIZE 128 + +// Don't assume nodes are sequential or contiguous. +// Need to discover and map node numbers. 
+ +int *node_ix_map = NULL; +char **node_header; + +// Structure to organize memory info from /proc/<PID>/numa_maps for a specific +// process, or from /sys/devices/system/node/node?/meminfo for system-wide +// data. Tables are defined below for each process and for system-wide data. + +typedef struct meminfo { + int index; + char *token; + char *label; +} meminfo_t, *meminfo_p; + +#define PROCESS_HUGE_INDEX 0 +#define PROCESS_PRIVATE_INDEX 3 + +meminfo_t process_meminfo[] = { + { PROCESS_HUGE_INDEX, "huge", "Huge" }, + { 1, "heap", "Heap" }, + { 2, "stack", "Stack" }, + { PROCESS_PRIVATE_INDEX, "N", "Private" } +}; + +#define PROCESS_MEMINFO_ROWS (sizeof(process_meminfo) / sizeof(process_meminfo[0])) + +meminfo_t numastat_meminfo[] = { + { 0, "numa_hit", "Numa_Hit" }, + { 1, "numa_miss", "Numa_Miss" }, + { 2, "numa_foreign", "Numa_Foreign" }, + { 3, "interleave_hit", "Interleave_Hit" }, + { 4, "local_node", "Local_Node" }, + { 5, "other_node", "Other_Node" }, +}; + +#define NUMASTAT_MEMINFO_ROWS (sizeof(numastat_meminfo) / sizeof(numastat_meminfo[0])) + +meminfo_t system_meminfo[] = { + { 0, "MemTotal", "MemTotal" }, + { 1, "MemFree", "MemFree" }, + { 2, "MemUsed", "MemUsed" }, + { 3, "HighTotal", "HighTotal" }, + { 4, "HighFree", "HighFree" }, + { 5, "LowTotal", "LowTotal" }, + { 6, "LowFree", "LowFree" }, + { 7, "Active", "Active" }, + { 8, "Inactive", "Inactive" }, + { 9, "Active(anon)", "Active(anon)" }, + { 10, "Inactive(anon)", "Inactive(anon)" }, + { 11, "Active(file)", "Active(file)" }, + { 12, "Inactive(file)", "Inactive(file)" }, + { 13, "Unevictable", "Unevictable" }, + { 14, "Mlocked", "Mlocked" }, + { 15, "Dirty", "Dirty" }, + { 16, "Writeback", "Writeback" }, + { 17, "FilePages", "FilePages" }, + { 18, "Mapped", "Mapped" }, + { 19, "AnonPages", "AnonPages" }, + { 20, "Shmem", "Shmem" }, + { 21, "KernelStack", "KernelStack" }, + { 22, "PageTables", "PageTables" }, + { 23, "NFS_Unstable", "NFS_Unstable" }, + { 24, "Bounce", "Bounce" }, + { 25, 
"WritebackTmp", "WritebackTmp" }, + { 26, "Slab", "Slab" }, + { 27, "SReclaimable", "SReclaimable" }, + { 28, "SUnreclaim", "SUnreclaim" }, + { 29, "AnonHugePages", "AnonHugePages" }, + { 30, "HugePages_Total", "HugePages_Total" }, + { 31, "HugePages_Free", "HugePages_Free" }, + { 32, "HugePages_Surp", "HugePages_Surp" } +}; + +#define SYSTEM_MEMINFO_ROWS (sizeof(system_meminfo) / sizeof(system_meminfo[0])) + +// To allow re-ordering the meminfo memory categories in system_meminfo and +// numastat_meminfo relative to order in /proc, etc., a simple hash index is +// used to look up the meminfo categories. The allocated hash table size must +// be bigger than necessary to reduce collisions (and because these specific +// hash algorithms depend on having some unused buckets. + +#define HASH_TABLE_SIZE 151 +int hash_collisions = 0; + +struct hash_entry { + char *name; + int index; +} hash_table[HASH_TABLE_SIZE]; + +void init_hash_table() { + memset(hash_table, 0, sizeof(hash_table)); +} + +int hash_ix(char *s) { + unsigned int h = 17; + while (*s) { + // h * 33 + *s++ + h = ((h << 5) + h) + *s++; + } + return (h % HASH_TABLE_SIZE); +} + +int hash_lookup(char *s) { + int ix = hash_ix(s); + while (hash_table[ix].name) { // Assumes big table with blank entries + if (!strcmp(s, hash_table[ix].name)) { + return hash_table[ix].index; // found it + } + ix += 1; + if (ix >= HASH_TABLE_SIZE) { + ix = 0; + } + } + return -1; +} + +int hash_insert(char *s, int i) { + int ix = hash_ix(s); + while (hash_table[ix].name) { // assumes no duplicate entries + hash_collisions += 1; + ix += 1; + if (ix >= HASH_TABLE_SIZE) { + ix = 0; + } + } + hash_table[ix].name = s; + hash_table[ix].index = i; + return ix; +} + +// To decouple details of table display (e.g. column width, line folding for +// display screen width, et cetera) from acquiring the data and populating the +// tables, this semi-general table handling code is used. 
There are various +// routines to set table attributes, assign and test some cell contents, +// initialize and actually display the table. + +#define CELL_TYPE_NULL 0 +#define CELL_TYPE_LONG 1 +#define CELL_TYPE_DOUBLE 2 +#define CELL_TYPE_STRING 3 +#define CELL_TYPE_CHAR8 4 +#define CELL_TYPE_REPCHAR 5 + +#define CELL_FLAG_FREEABLE (1 << 0) +#define CELL_FLAG_ROWSPAN (1 << 1) +#define CELL_FLAG_COLSPAN (1 << 2) + +#define COL_JUSTIFY_LEFT (1 << 0) +#define COL_JUSTIFY_RIGHT (1 << 1) +#define COL_JUSTIFY_CENTER 3 +#define COL_JUSTIFY_MASK 0x3 +#define COL_FLAG_SEEN_DATA (1 << 2) +#define COL_FLAG_NON_ZERO_DATA (1 << 3) +#define COL_FLAG_ALWAYS_SHOW (1 << 4) + +#define ROW_FLAG_SEEN_DATA COL_FLAG_SEEN_DATA +#define ROW_FLAG_NON_ZERO_DATA COL_FLAG_NON_ZERO_DATA +#define ROW_FLAG_ALWAYS_SHOW COL_FLAG_ALWAYS_SHOW + +typedef struct cell { + uint32_t type; + uint32_t flags; + union { + char *s; + double d; + int64_t l; + char c[8]; + }; +} cell_t, *cell_p; + +typedef struct vtab { + int header_rows; + int header_cols; + int data_rows; + int data_cols; + cell_p cell; + int *row_ix_map; + uint8_t *row_flags; + uint8_t *col_flags; + uint8_t *col_width; + uint8_t *col_decimal_places; +} vtab_t, *vtab_p; + +#define ALL_TABLE_ROWS (table->header_rows + table->data_rows) +#define ALL_TABLE_COLS (table->header_cols + table->data_cols) +#define GET_CELL_PTR(row, col) (&table->cell[(row * ALL_TABLE_COLS) + col]) + +#define USUAL_GUTTER_WIDTH 1 + +void set_row_flag(vtab_p table, int row, int flag) { + table->row_flags[row] |= (uint8_t)flag; +} + +void set_col_flag(vtab_p table, int col, int flag) { + table->col_flags[col] |= (uint8_t)flag; +} + +void clear_row_flag(vtab_p table, int row, int flag) { + table->row_flags[row] &= (uint8_t)~flag; +} + +void clear_col_flag(vtab_p table, int col, int flag) { + table->col_flags[col] &= (uint8_t)~flag; +} + +int test_row_flag(vtab_p table, int row, int flag) { + return ((table->row_flags[row] & (uint8_t)flag) != 0); +} + +int 
test_col_flag(vtab_p table, int col, int flag) { + return ((table->col_flags[col] & (uint8_t)flag) != 0); +} + +void set_col_justification(vtab_p table, int col, int justify) { + table->col_flags[col] &= (uint8_t)~COL_JUSTIFY_MASK; + table->col_flags[col] |= (uint8_t)(justify & COL_JUSTIFY_MASK); +} + +void set_col_width(vtab_p table, int col, uint8_t width) { + if (width >= SMALL_BUF_SIZE) { + width = SMALL_BUF_SIZE - 1; + } + table->col_width[col] = width; +} + +void set_col_decimal_places(vtab_p table, int col, uint8_t places) { + table->col_decimal_places[col] = places; +} + +void set_cell_flag(vtab_p table, int row, int col, int flag) { + cell_p c_ptr = GET_CELL_PTR(row, col); + c_ptr->flags |= (uint32_t)flag; +} + +void clear_cell_flag(vtab_p table, int row, int col, int flag) { + cell_p c_ptr = GET_CELL_PTR(row, col); + c_ptr->flags &= (uint32_t)~flag; +} + +int test_cell_flag(vtab_p table, int row, int col, int flag) { + cell_p c_ptr = GET_CELL_PTR(row, col); + return ((c_ptr->flags & (uint32_t)flag) != 0); +} + +void string_assign(vtab_p table, int row, int col, char *s) { + cell_p c_ptr = GET_CELL_PTR(row, col); + c_ptr->type = CELL_TYPE_STRING; + c_ptr->s = s; +} + +void repchar_assign(vtab_p table, int row, int col, char c) { + cell_p c_ptr = GET_CELL_PTR(row, col); + c_ptr->type = CELL_TYPE_REPCHAR; + c_ptr->c[0] = c; +} + +void double_assign(vtab_p table, int row, int col, double d) { + cell_p c_ptr = GET_CELL_PTR(row, col); + c_ptr->type = CELL_TYPE_DOUBLE; + c_ptr->d = d; +} + +void long_assign(vtab_p table, int row, int col, int64_t l) { + cell_p c_ptr = GET_CELL_PTR(row, col); + c_ptr->type = CELL_TYPE_LONG; + c_ptr->l = l; +} + +void double_addto(vtab_p table, int row, int col, double d) { + cell_p c_ptr = GET_CELL_PTR(row, col); + c_ptr->type = CELL_TYPE_DOUBLE; + c_ptr->d += d; +} + +void long_addto(vtab_p table, int row, int col, int64_t l) { + cell_p c_ptr = GET_CELL_PTR(row, col); + c_ptr->type = CELL_TYPE_LONG; + c_ptr->l += l; +} + +void 
clear_assign(vtab_p table, int row, int col) { + cell_p c_ptr = GET_CELL_PTR(row, col); + memset(c_ptr, 0, sizeof(cell_t)); +} + +void zero_table_data(vtab_p table, int type) { + // Sets data area of table to zeros of specified type + for (int row = table->header_rows; (row < ALL_TABLE_ROWS); row++) { + for (int col = table->header_cols; (col < ALL_TABLE_COLS); col++) { + cell_p c_ptr = GET_CELL_PTR(row, col); + memset(c_ptr, 0, sizeof(cell_t)); + c_ptr->type = type; + } + } +} + +void sort_rows_descending_by_col(vtab_p table, int start_row, int stop_row, int col) { + // Rearrange row_ix_map[] indices so the rows will be in + // descending order by the value in the specified column + for (int ix = start_row; (ix <= stop_row); ix++) { + int biggest_ix = ix; + cell_p biggest_ix_c_ptr = GET_CELL_PTR(table->row_ix_map[ix], col); + for (int iy = ix + 1; (iy <= stop_row); iy++) { + cell_p iy_c_ptr = GET_CELL_PTR(table->row_ix_map[iy], col); + if (biggest_ix_c_ptr->d < iy_c_ptr->d) { + biggest_ix_c_ptr = iy_c_ptr; + biggest_ix = iy; + } + } + if (biggest_ix != ix) { + int tmp = table->row_ix_map[ix]; + table->row_ix_map[ix] = table->row_ix_map[biggest_ix]; + table->row_ix_map[biggest_ix] = tmp; + } + } +} + +void span(vtab_p table, int first_row, int first_col, int last_row, int last_col) { + // FIXME: implement row / col spanning someday? 
+} + +void init_table(vtab_p table, int header_rows, int header_cols, int data_rows, int data_cols) { + // init table sizes + table->header_rows = header_rows; + table->header_cols = header_cols; + table->data_rows = data_rows; + table->data_cols = data_cols; + // allocate memory for all the cells + int alloc_size = ALL_TABLE_ROWS * ALL_TABLE_COLS * sizeof(cell_t); + table->cell = malloc(alloc_size); + if (table->cell == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + memset(table->cell, 0, alloc_size); + // allocate memory for the row map vector + alloc_size = ALL_TABLE_ROWS * sizeof(int); + table->row_ix_map = malloc(alloc_size); + if (table->row_ix_map == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + for (int row = 0; (row < ALL_TABLE_ROWS); row++) { + table->row_ix_map[row] = row; + } + // allocate memory for the row flags vector + alloc_size = ALL_TABLE_ROWS * sizeof(uint8_t); + table->row_flags = malloc(alloc_size); + if (table->row_flags == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + memset(table->row_flags, 0, alloc_size); + // allocate memory for the column flags vector + alloc_size = ALL_TABLE_COLS * sizeof(uint8_t); + table->col_flags = malloc(alloc_size); + if (table->col_flags == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + memset(table->col_flags, 0, alloc_size); + // allocate memory for the column width vector + alloc_size = ALL_TABLE_COLS * sizeof(uint8_t); + table->col_width = malloc(alloc_size); + if (table->col_width == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + memset(table->col_width, 0, alloc_size); + // allocate memory for the column precision vector + alloc_size = ALL_TABLE_COLS * sizeof(uint8_t); + table->col_decimal_places = malloc(alloc_size); + if (table->col_decimal_places == NULL) { + perror("malloc failed line: " 
STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + memset(table->col_decimal_places, 0, alloc_size); +} + +void free_cell(vtab_p table, int row, int col) { + cell_p c_ptr = GET_CELL_PTR(row, col); + if ((c_ptr->type == CELL_TYPE_STRING) + && (c_ptr->flags & CELL_FLAG_FREEABLE) + && (c_ptr->s != NULL)) { + free(c_ptr->s); + } + memset(c_ptr, 0, sizeof(cell_t)); +} + +void free_table(vtab_p table) { + if (table->cell != NULL) { + for (int row = 0; (row < ALL_TABLE_ROWS); row++) { + for (int col = 0; (col < ALL_TABLE_COLS); col++) { + free_cell(table, row, col); + } + } + free(table->cell); + } + if (table->row_ix_map != NULL) { + free(table->row_ix_map); + } + if (table->row_flags != NULL) { + free(table->row_flags); + } + if (table->col_flags != NULL) { + free(table->col_flags); + } + if (table->col_width != NULL) { + free(table->col_width); + } + if (table->col_decimal_places != NULL) { + free(table->col_decimal_places); + } +} + +char *fmt_cell_data(cell_p c_ptr, int max_width, int decimal_places) { + // Returns pointer to a static buffer, expecting caller to + // immediately use or copy the contents before calling again. 
int rep_width = max_width - USUAL_GUTTER_WIDTH; + static char buf[SMALL_BUF_SIZE]; + switch (c_ptr->type) { + case CELL_TYPE_NULL: + buf[0] = '\0'; + break; + case CELL_TYPE_LONG: + // int64_t is not "long" on 32-bit targets, so print via long long + snprintf(buf, SMALL_BUF_SIZE, "%lld", (long long)c_ptr->l); + break; + case CELL_TYPE_DOUBLE: + snprintf(buf, SMALL_BUF_SIZE, "%.*f", decimal_places, c_ptr->d); + break; + case CELL_TYPE_STRING: + snprintf(buf, SMALL_BUF_SIZE, "%s", c_ptr->s); + break; + case CELL_TYPE_CHAR8: + strncpy(buf, c_ptr->c, 8); + buf[8] = '\0'; + break; + case CELL_TYPE_REPCHAR: + memset(buf, c_ptr->c[0], rep_width); + buf[rep_width] = '\0'; + break; + default: + strcpy(buf, "Unknown"); + break; + } + buf[max_width] = '\0'; + return buf; +} + +void auto_set_col_width(vtab_p table, int col, int min_width, int max_width) { + int width = min_width; + for (int row = 0; (row < ALL_TABLE_ROWS); row++) { + cell_p c_ptr = GET_CELL_PTR(row, col); + if (c_ptr->type == CELL_TYPE_REPCHAR) { + continue; + } + char *p = fmt_cell_data(c_ptr, max_width, (int)(table->col_decimal_places[col])); + int l = strlen(p); + if (width < l) { + width = l; + } + } + width += USUAL_GUTTER_WIDTH; + if (width > max_width) { + width = max_width; + } + table->col_width[col] = (uint8_t)width; +} + +void display_justified_cell(cell_p c_ptr, int row_flags, int col_flags, int width, int decimal_places) { + char *p = fmt_cell_data(c_ptr, width, decimal_places); + int l = strlen(p); + char buf[SMALL_BUF_SIZE]; + switch (col_flags & COL_JUSTIFY_MASK) { + case COL_JUSTIFY_LEFT: + memcpy(buf, p, l); + if (l < width) { + memset(&buf[l], ' ', width - l); + } + break; + case COL_JUSTIFY_RIGHT: + if (l < width) { + memset(buf, ' ', width - l); + } + memcpy(&buf[width - l], p, l); + break; + case COL_JUSTIFY_CENTER: + default: + memset(buf, ' ', width); + memcpy(&buf[(width - l + 1) / 2], p, l); + break; + } + buf[width] = '\0'; + printf("%s", buf); +} + +void display_table(vtab_p table, + int screen_width, + int show_unseen_rows, + int show_unseen_cols, + int 
show_zero_rows, + int show_zero_cols) +{ + // Set row and column flags according to whether data in rows and cols + // has been assigned, and is currently non-zero. + int some_seen_data = 0; + int some_non_zero_data = 0; + for (int row = table->header_rows; (row < ALL_TABLE_ROWS); row++) { + for (int col = table->header_cols; (col < ALL_TABLE_COLS); col++) { + cell_p c_ptr = GET_CELL_PTR(row, col); + // Currently, "seen data" includes not only numeric data, but also + // any strings, etc -- anything non-NULL (other than repchars). + if ((c_ptr->type != CELL_TYPE_NULL) && (c_ptr->type != CELL_TYPE_REPCHAR)) { + some_seen_data = 1; + set_row_flag(table, row, ROW_FLAG_SEEN_DATA); + set_col_flag(table, col, COL_FLAG_SEEN_DATA); + // Currently, "non-zero data" includes not only numeric data, + // but also any strings, etc -- anything non-zero (other than + // repchars, which are already excluded above). So, note a + // valid non-NULL pointer to an empty string would still be + // counted as non-zero data. 
+ if (c_ptr->l != (int64_t)0) { + some_non_zero_data = 1; + set_row_flag(table, row, ROW_FLAG_NON_ZERO_DATA); + set_col_flag(table, col, COL_FLAG_NON_ZERO_DATA); + } + } + } + } + if (!some_seen_data) { + printf("Table has no data.\n"); + return; + } + if (!some_non_zero_data && !show_zero_rows && !show_zero_cols) { + printf("Table has no non-zero data.\n"); + return; + } + // Start with first data column and try to display table, + // folding lines as necessary per screen_width + int col = -1; + int data_col = table->header_cols; + while (data_col < ALL_TABLE_COLS) { + // Skip data columns until we have one to display + if ((!test_col_flag(table, data_col, COL_FLAG_ALWAYS_SHOW)) && + (((!show_unseen_cols) && (!test_col_flag(table, data_col, COL_FLAG_SEEN_DATA))) || + ((!show_zero_cols) && (!test_col_flag(table, data_col, COL_FLAG_NON_ZERO_DATA))))) { + data_col += 1; + continue; + } + // Display blank line between table sections + if (col > 0) { + printf("\n"); + } + // For each row, display as many columns as possible + for (int row_ix = 0; (row_ix < ALL_TABLE_ROWS); row_ix++) { + int row = table->row_ix_map[row_ix]; + // If past the header rows, conditionally skip rows + if ((row >= table->header_rows) && (!test_row_flag(table, row, ROW_FLAG_ALWAYS_SHOW))) { + // Optionally skip row if no data seen or if all zeros + if (((!show_unseen_rows) && (!test_row_flag(table, row, ROW_FLAG_SEEN_DATA))) || + ((!show_zero_rows) && (!test_row_flag(table, row, ROW_FLAG_NON_ZERO_DATA)))) { + continue; + } + } + // Begin a new row... 
+ int cur_line_width = 0; + // All lines start with the left header columns + for (col = 0; (col < table->header_cols); col++) { + display_justified_cell(GET_CELL_PTR(row, col), + (int)(table->row_flags[row]), + (int)(table->col_flags[col]), + (int)(table->col_width[col]), + (int)(table->col_decimal_places[col])); + cur_line_width += (int)(table->col_width[col]); + } + // Reset column index to starting data column for each new row + col = data_col; + // Try to display as many data columns as possible in every section + for (;;) { + // See if we should print this column + if (test_col_flag(table, col, COL_FLAG_ALWAYS_SHOW) || + (((show_unseen_cols) || (test_col_flag(table, col, COL_FLAG_SEEN_DATA))) && + ((show_zero_cols) || (test_col_flag(table, col, COL_FLAG_NON_ZERO_DATA))))) { + display_justified_cell(GET_CELL_PTR(row, col), + (int)(table->row_flags[row]), + (int)(table->col_flags[col]), + (int)(table->col_width[col]), + (int)(table->col_decimal_places[col])); + cur_line_width += (int)(table->col_width[col]); + } + col += 1; + // End the line if no more columns or next column would exceed screen width + if ((col >= ALL_TABLE_COLS) || + ((cur_line_width + (int)(table->col_width[col])) > screen_width)) { + break; + } + } + printf("\n"); + } + // Remember next starting data column for next section + data_col = col; + } +} + +int verbose = 0; +int num_pids = 0; +int num_nodes = 0; +int screen_width = 0; +int show_zero_data = 1; +int compress_display = 0; +int sort_table = 0; +int sort_table_node = -1; +int compatibility_mode = 0; +int pid_array_max_pids = 0; +int *pid_array = NULL; +char *prog_name = NULL; +double page_size_in_bytes = 0; +double huge_page_size_in_bytes = 0; + +void display_version_and_exit() { + char *version_string = "20130723"; + printf("%s version: %s: %s\n", prog_name, version_string, __DATE__); + exit(EXIT_SUCCESS); +} + +void display_usage_and_exit() { + fprintf(stderr, "Usage: %s [-c] [-m] [-n] [-p <PID>|<pattern>] [-s[<node>]] [-v] [-V] [-z] 
[ <PID>|<pattern>... ]\n", prog_name); + fprintf(stderr, "-c to minimize column widths\n"); + fprintf(stderr, "-m to show meminfo-like system-wide memory usage\n"); + fprintf(stderr, "-n to show the numastat statistics info\n"); + fprintf(stderr, "-p <PID>|<pattern> to show process info\n"); + fprintf(stderr, "-s[<node>] to sort data by total column or <node>\n"); + fprintf(stderr, "-v to make some reports more verbose\n"); + fprintf(stderr, "-V to show the %s code version\n", prog_name); + fprintf(stderr, "-z to skip rows and columns of zeros\n"); + exit(EXIT_FAILURE); +} + +int get_screen_width() { + int width = 80; + char *p = getenv("NUMASTAT_WIDTH"); + if (p != NULL) { + width = atoi(p); + if ((width < 1) || (width > 10000000)) { + width = 80; + } + } else if (isatty(fileno(stdout))) { + FILE *fs = popen("resize 2>/dev/null", "r"); + if (fs != NULL) { + char buf[72]; + char *columns; + // Read into the whole buffer; sizeof(columns) would only be the pointer size + columns = fgets(buf, sizeof(buf), fs); + pclose(fs); + if (columns && strncmp(columns, "COLUMNS=", 8) == 0) { + width = atoi(&columns[8]); + if ((width < 1) || (width > 10000000)) { + width = 80; + } + } + } + } else { + // Not a tty, so allow a really long line + width = 10000000; + } + if (width < 32) { + width = 32; + } + return width; +} + +char *command_name_for_pid(int pid) { + // Get the PID command name field from /proc/PID/status file. Return + // pointer to a static buffer, expecting caller to immediately copy result. 
+ static char buf[SMALL_BUF_SIZE]; + char fname[64]; + snprintf(fname, sizeof(fname), "/proc/%d/status", pid); + FILE *fs = fopen(fname, "r"); + if (!fs) { + return NULL; + } else { + while (fgets(buf, SMALL_BUF_SIZE, fs)) { + if (strstr(buf, "Name:") == buf) { + char *p = &buf[5]; + while (isspace(*p)) { + p++; + } + if (p[strlen(p) - 1] == '\n') { + p[strlen(p) - 1] = '\0'; + } + fclose(fs); + return p; + } + } + fclose(fs); + } + return NULL; +} + +void show_info_from_system_file(char *file, meminfo_p meminfo, int meminfo_rows, int tok_offset) { + // Setup and init table + vtab_t table; + int header_rows = 2 - compatibility_mode; + int header_cols = 1; + // Add an extra data column for a total column + init_table(&table, header_rows, header_cols, meminfo_rows, num_nodes + 1); + int total_col_ix = header_cols + num_nodes; + // Insert token mapping in hash table and assign left header column label for each row in table + init_hash_table(); + for (int row = 0; (row < meminfo_rows); row++) { + hash_insert(meminfo[row].token, meminfo[row].index); + if (compatibility_mode) { + string_assign(&table, (header_rows + row), 0, meminfo[row].token); + } else { + string_assign(&table, (header_rows + row), 0, meminfo[row].label); + } + } + // printf("There are %d table hash collisions.\n", hash_collisions); + // Set left header column width and left justify it + set_col_width(&table, 0, 16); + set_col_justification(&table, 0, COL_JUSTIFY_LEFT); + // Open /sys/devices/system/node/node?/<file> for each node and store data + // in table. If not compatibility_mode, do approximately first third of + // this loop also for (node_ix == num_nodes) to get "Total" column header. + for (int node_ix = 0; (node_ix < (num_nodes + (1 - compatibility_mode))); node_ix++) { + int col = header_cols + node_ix; + // Assign header row label and horizontal line for this column... 
+ string_assign(&table, 0, col, node_header[node_ix]); + if (!compatibility_mode) { + repchar_assign(&table, 1, col, '-'); + int decimal_places = 2; + if (compress_display) { + decimal_places = 0; + } + set_col_decimal_places(&table, col, decimal_places); + } + // Set column width and right justify data + set_col_width(&table, col, 16); + set_col_justification(&table, col, COL_JUSTIFY_RIGHT); + if (node_ix == num_nodes) { + break; + } + // Open /sys/.../node<N>/<file> for this node... + char buf[SMALL_BUF_SIZE]; + char fname[64]; + snprintf(fname, sizeof(fname), "/sys/devices/system/node/node%d/%s", node_ix_map[node_ix], file); + FILE *fs = fopen(fname, "r"); + if (!fs) { + sprintf(buf, "cannot open %s", fname); + perror(buf); + exit(EXIT_FAILURE); + } + // Get table values for this node... + while (fgets(buf, SMALL_BUF_SIZE, fs)) { + char *tok[64]; + int tokens = 0; + const char *delimiters = " \t\r\n:"; + char *p = strtok(buf, delimiters); + if (p == NULL) { + continue; // Skip blank lines + } + while (p) { + tok[tokens++] = p; + p = strtok(NULL, delimiters); + } + // example line from numastat file: "numa_miss 16463" + // example line from meminfo file: "Node 3 Inactive: 210680 kB" + int index = hash_lookup(tok[0 + tok_offset]); + if (index < 0) { + printf("Token %s not in hash table.\n", tok[0 + tok_offset]); + } else { + double value = (double)atol(tok[1 + tok_offset]); + if (!compatibility_mode) { + double multiplier = 1.0; + if (tokens < 4) { + multiplier = page_size_in_bytes; + } else if (!strncmp("HugePages", tok[2], 9)) { + multiplier = huge_page_size_in_bytes; + } else if (!strncmp("kB", tok[4], 2)) { + multiplier = KILOBYTE; + } + value *= multiplier; + value /= (double)MEGABYTE; + } + double_assign(&table, header_rows + index, col, value); + double_addto(&table, header_rows + index, total_col_ix, value); + } + } + fclose(fs); + } + // Compress display column widths, if requested + if (compress_display) { + for (int col = 0; (col < header_cols + num_nodes + 
1); col++) { + auto_set_col_width(&table, col, 4, 16); + } + } + // Optionally sort the table data + if (sort_table) { + int sort_col; + if ((sort_table_node < 0) || (sort_table_node >= num_nodes)) { + sort_col = total_col_ix; + } else { + sort_col = header_cols + node_ix_map[sort_table_node]; + } + sort_rows_descending_by_col(&table, header_rows, header_rows + meminfo_rows - 1, sort_col); + } + // Actually display the table now, doing line-folding as necessary + display_table(&table, screen_width, 0, 0, show_zero_data, show_zero_data); + free_table(&table); +} + +void show_numastat_info() { + if (!compatibility_mode) { + printf("\nPer-node numastat info (in MBs):\n"); + } + show_info_from_system_file("numastat", numastat_meminfo, NUMASTAT_MEMINFO_ROWS, 0); +} + +void show_system_info() { + printf("\nPer-node system memory usage (in MBs):\n"); + show_info_from_system_file("meminfo", system_meminfo, SYSTEM_MEMINFO_ROWS, 2); +} + +void show_process_info() { + vtab_t table; + int header_rows = 2; + int header_cols = 1; + int data_rows; + int show_sub_categories = (verbose || (num_pids == 1)); + if (show_sub_categories) { + data_rows = PROCESS_MEMINFO_ROWS; + } else { + data_rows = num_pids; + } + // Add two extra rows for a horizontal rule followed by a total row + // Add one extra data column for a total column + init_table(&table, header_rows, header_cols, data_rows + 2, num_nodes + 1); + int total_col_ix = header_cols + num_nodes; + int total_row_ix = header_rows + data_rows + 1; + string_assign(&table, total_row_ix, 0, "Total"); + if (show_sub_categories) { + // Assign left header column label for each row in table + for (int row = 0; (row < PROCESS_MEMINFO_ROWS); row++) { + string_assign(&table, (header_rows + row), 0, process_meminfo[row].label); + } + } else { + string_assign(&table, 0, 0, "PID"); + repchar_assign(&table, 1, 0, '-'); + printf("\nPer-node process memory usage (in MBs)\n"); + } + // Set left header column width and left justify it + 
set_col_width(&table, 0, 16); + set_col_justification(&table, 0, COL_JUSTIFY_LEFT); + // Set up "Node <N>" column headers over data columns, plus "Total" column + for (int node_ix = 0; (node_ix <= num_nodes); node_ix++) { + int col = header_cols + node_ix; + // Assign header row label and horizontal line for this column... + string_assign(&table, 0, col, node_header[node_ix]); + repchar_assign(&table, 1, col, '-'); + // Set column width, decimal places, and right justify data + set_col_width(&table, col, 16); + int decimal_places = 2; + if (compress_display) { + decimal_places = 0; + } + set_col_decimal_places(&table, col, decimal_places); + set_col_justification(&table, col, COL_JUSTIFY_RIGHT); + } + // Initialize data in table to all zeros + zero_table_data(&table, CELL_TYPE_DOUBLE); + // If (show_sub_categories), show individual process tables for each PID, + // Otherwise show one big table of process total lines from all the PIDs. + for (int pid_ix = 0; (pid_ix < num_pids); pid_ix++) { + int pid = pid_array[pid_ix]; + if (show_sub_categories) { + printf("\nPer-node process memory usage (in MBs) for PID %d (%s)\n", pid, command_name_for_pid(pid)); + if (pid_ix > 0) { + // Re-initialize show_sub_categories table, because we re-use it for each PID. 
+ zero_table_data(&table, CELL_TYPE_DOUBLE); + } + } else { + // Put this row's "PID (cmd)" label in left header column for this PID total row + char tmp_buf[64]; + snprintf(tmp_buf, sizeof(tmp_buf), "%d (%s)", pid, command_name_for_pid(pid)); + char *p = strdup(tmp_buf); + if (p == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + string_assign(&table, header_rows + pid_ix, 0, p); + set_cell_flag(&table, header_rows + pid_ix, 0, CELL_FLAG_FREEABLE); + } + // Open numa_map for this PID to get per-node data + char fname[64]; + snprintf(fname, sizeof(fname), "/proc/%d/numa_maps", pid); + char buf[BUF_SIZE]; + FILE *fs = fopen(fname, "r"); + if (!fs) { + sprintf(buf, "Can't read /proc/%d/numa_maps", pid); + perror(buf); + continue; + } + // Add up sub-category memory used from each node. Must go line by line + // through the numa_map figuring out which category memory, node, and the + // amount. + while (fgets(buf, BUF_SIZE, fs)) { + int category = PROCESS_PRIVATE_INDEX; // init category to the catch-all... + const char *delimiters = " \t\r\n"; + char *p = strtok(buf, delimiters); + while (p) { + // If the memory category for this line is still the catch-all + // (i.e. private), then see if the current token is a special + // keyword for a specific memory sub-category. + if (category == PROCESS_PRIVATE_INDEX) { + for (int ix = 0; (ix < PROCESS_PRIVATE_INDEX); ix++) { + if (!strncmp(p, process_meminfo[ix].token, strlen(process_meminfo[ix].token))) { + category = ix; + break; + } + } + } + // If the current token is a per-node pages quantity, parse the + // node number and accumulate the number of pages in the specific + // category (and also add to the total). 
+ if (p[0] == 'N') { + int node_num = (int)strtol(&p[1], &p, 10); + if (p[0] != '=') { + perror("node value parse error"); + exit(EXIT_FAILURE); + } + double value = (double)strtol(&p[1], &p, 10); + double multiplier = page_size_in_bytes; + if (category == PROCESS_HUGE_INDEX) { + multiplier = huge_page_size_in_bytes; + } + value *= multiplier; + value /= (double)MEGABYTE; + // Add value to data cell, total_col, and total_row + int tmp_row; + if (show_sub_categories) { + tmp_row = header_rows + category; + } else { + tmp_row = header_rows + pid_ix; + } + // Don't assume nodes are sequential or contiguous. + // Need to find correct tmp_col from node_ix_map + int i = 0; + while(node_ix_map[i++] != node_num) + ; + int tmp_col = header_cols + i - 1; + double_addto(&table, tmp_row, tmp_col, value); + double_addto(&table, tmp_row, total_col_ix, value); + double_addto(&table, total_row_ix, tmp_col, value); + double_addto(&table, total_row_ix, total_col_ix, value); + } + // Get next token on the line + p = strtok(NULL, delimiters); + } + } + // Currently, a non-root user can open some numa_map files successfully + // without error, but can't actually read the contents -- despite the + // 444 file permissions. So, use ferror() to check here to see if we + // actually got a read error, and if so, alert the user so they know + // not to trust the zero in the table. + if (ferror(fs)) { + sprintf(buf, "Can't read /proc/%d/numa_maps", pid); + perror(buf); + } + fclose(fs); + // If showing individual tables, or we just added the last total line, + // prepare the table for display and display it... + if ((show_sub_categories) || (pid_ix + 1 == num_pids)) { + // Compress display column widths, if requested + if (compress_display) { + for (int col = 0; (col < header_cols + num_nodes + 1); col++) { + auto_set_col_width(&table, col, 4, 16); + } + } else { + // Since not compressing the display, allow the left header + // column to be wider. 
Otherwise, sometimes process command + // name instance numbers can be truncated in an annoying way. + auto_set_col_width(&table, 0, 16, 24); + } + // Put dashes above Total line... + set_row_flag(&table, total_row_ix - 1, COL_FLAG_ALWAYS_SHOW); + for (int col = 0; (col < header_cols + num_nodes + 1); col++) { + repchar_assign(&table, total_row_ix - 1, col, '-'); + } + // Optionally sort the table data + if (sort_table) { + int sort_col; + if ((sort_table_node < 0) || (sort_table_node >= num_nodes)) { + sort_col = total_col_ix; + } else { + sort_col = header_cols + node_ix_map[sort_table_node]; + } + sort_rows_descending_by_col(&table, header_rows, header_rows + data_rows - 1, sort_col); + } + // Actually show the table + display_table(&table, screen_width, 0, 0, show_zero_data, show_zero_data); + } + } // END OF FOR_EACH-PID loop + free_table(&table); +} // show_process_info() + +int node_and_digits(const struct dirent *dptr) { + char *p = (char *)(dptr->d_name); + if (*p++ != 'n') return 0; + if (*p++ != 'o') return 0; + if (*p++ != 'd') return 0; + if (*p++ != 'e') return 0; + do { + if (!isdigit(*p++)) return 0; + } while (*p != '\0'); + return 1; +} + +void init_node_ix_map_and_header(int compatibility_mode) { + // Count directory names of the form: /sys/devices/system/node/node<N> + struct dirent **namelist; + num_nodes = scandir("/sys/devices/system/node", &namelist, node_and_digits, NULL); + if (num_nodes < 1) { + if (compatibility_mode) { + perror("sysfs not mounted or system not NUMA aware"); + } else { + perror("Couldn't open /sys/devices/system/node"); + } + exit(EXIT_FAILURE); + } else { + node_ix_map = malloc(num_nodes * sizeof(int)); + if (node_ix_map == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + // For each "node<N>" filename present, save <N> in node_ix_map + for (int ix = 0; (ix < num_nodes); ix++) { + node_ix_map[ix] = atoi(&namelist[ix]->d_name[4]); + free(namelist[ix]); + } + free(namelist); + // 
Now, sort the node map in increasing order. Use a simplistic sort + // since we expect a relatively short (and maybe pre-ordered) list. + for (int ix = 0; (ix < num_nodes); ix++) { + int smallest_ix = ix; + for (int iy = ix + 1; (iy < num_nodes); iy++) { + if (node_ix_map[smallest_ix] > node_ix_map[iy]) { + smallest_ix = iy; + } + } + if (smallest_ix != ix) { + int tmp = node_ix_map[ix]; + node_ix_map[ix] = node_ix_map[smallest_ix]; + node_ix_map[smallest_ix] = tmp; + } + } + // Construct vector of "Node <N>" and "Total" column headers. Allocate + // one for each NUMA node, plus one on the end for the "Total" column + node_header = malloc((num_nodes + 1) * sizeof(char *)); + if (node_header == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + for (int node_ix = 0; (node_ix <= num_nodes); node_ix++) { + char node_label[64]; + if (node_ix == num_nodes) { + strcpy(node_label, "Total"); + } else if (compatibility_mode) { + snprintf(node_label, sizeof(node_label), "node%d", node_ix_map[node_ix]); + } else { + snprintf(node_label, sizeof(node_label), "Node %d", node_ix_map[node_ix]); + } + char *s = strdup(node_label); + if (s == NULL) { + perror("malloc failed line: " STRINGIFY(__LINE__)); + exit(EXIT_FAILURE); + } + node_header[node_ix] = s; + } + } +} + +void free_node_ix_map_and_header() { + if (node_ix_map != NULL) { + free(node_ix_map); + node_ix_map = NULL; + } + if (node_header != NULL) { + for (int ix = 0; (ix <= num_nodes); ix++) { + free(node_header[ix]); + } + free(node_header); + node_header = NULL; + } +} + +double get_huge_page_size_in_bytes() { + double huge_page_size = 0; + FILE *fs = fopen("/proc/meminfo", "r"); + if (!fs) { + perror("Can't open /proc/meminfo"); + exit(EXIT_FAILURE); + } + char buf[SMALL_BUF_SIZE]; + while (fgets(buf, SMALL_BUF_SIZE, fs)) { + if (!strncmp("Hugepagesize", buf, 12)) { + char *p = &buf[12]; + while ((!isdigit(*p)) && (p < buf + SMALL_BUF_SIZE)) { + p++; + } + huge_page_size = 
strtod(p, NULL); + break; + } + } + fclose(fs); + return huge_page_size * KILOBYTE; +} + +int all_digits(char *p) { + if (p == NULL) { + return 0; + } + while (*p != '\0') { + if (!isdigit(*p++)) return 0; + } + return 1; +} + +int starts_with_digit(const struct dirent *dptr) { + return (isdigit(dptr->d_name[0])); +} + +void add_pid_to_list(int pid) { + if (num_pids < pid_array_max_pids) { + pid_array[num_pids++] = pid; + } else { + if (pid_array_max_pids == 0) { + pid_array_max_pids = 32; + } + int *tmp_int_ptr = realloc(pid_array, 2 * pid_array_max_pids * sizeof(int)); + if (tmp_int_ptr == NULL) { + char buf[SMALL_BUF_SIZE]; + sprintf(buf, "Too many PIDs, skipping %d", pid); + perror(buf); + } else { + pid_array = tmp_int_ptr; + pid_array_max_pids *= 2; + pid_array[num_pids++] = pid; + } + } +} + +int ascending(const void *p1, const void *p2) { + return *(int *)p1 - *(int *) p2; +} + +void sort_pids_and_remove_duplicates() { + if (num_pids > 1) { + qsort(pid_array, num_pids, sizeof(int), ascending); + int ix1 = 0; + for (int ix2 = 1; (ix2 < num_pids); ix2++) { + if (pid_array[ix2] == pid_array[ix1]) { + continue; + } + ix1 += 1; + if (ix2 > ix1) { + pid_array[ix1] = pid_array[ix2]; + } + } + num_pids = ix1 + 1; + } +} + +void add_pids_from_pattern_search(char *pattern) { + // Search all /proc/<PID>/cmdline files and /proc/<PID>/status:Name fields + // for matching patterns. Show the memory details for matching PIDs. + int num_matches_found = 0; + struct dirent **namelist; + int files = scandir("/proc", &namelist, starts_with_digit, NULL); + if (files < 0) { + perror("Couldn't open /proc"); + } + for (int ix = 0; (ix < files); ix++) { + char buf[BUF_SIZE]; + // First get Name field from status file + int pid = atoi(namelist[ix]->d_name); + char *p = command_name_for_pid(pid); + if (p) { + strcpy(buf, p); + } else { + buf[0] = '\0'; + } + // Next copy cmdline file contents onto end of buffer. Do it a + // character at a time to convert nulls to spaces. 
+ char fname[272]; + snprintf(fname, sizeof(fname), "/proc/%s/cmdline", namelist[ix]->d_name); + FILE *fs = fopen(fname, "r"); + if (fs) { + p = buf; + while (*p != '\0') { + p++; + } + *p++ = ' '; + int c; + while (((c = fgetc(fs)) != EOF) && (p < buf + BUF_SIZE - 1)) { + if (c == '\0') { + c = ' '; + } + *p++ = c; + } + *p++ = '\0'; + fclose(fs); + } + if (strstr(buf, pattern)) { + if (pid != getpid()) { + add_pid_to_list(pid); + num_matches_found += 1; + } + } + free(namelist[ix]); + } + free(namelist); + if (num_matches_found == 0) { + printf("Found no processes containing pattern: \"%s\"\n", pattern); + } +} + +int main(int argc, char **argv) { + prog_name = argv[0]; + int show_the_system_info = 0; + int show_the_numastat_info = 0; + static struct option long_options[] = { + {"help", 0, 0, '?'}, + {0, 0, 0, 0} + }; + int long_option_index = 0; + int opt; + while ((opt = getopt_long(argc, argv, "cmnp:s::vVz?", long_options, &long_option_index)) != -1) { + switch (opt) { + case 0: + printf("Unexpected long option %s", long_options[long_option_index].name); + if (optarg) { + printf(" with arg %s", optarg); + } + printf("\n"); + display_usage_and_exit(); + break; + case 'c': + compress_display = 1; + break; + case 'm': + show_the_system_info = 1; + break; + case 'n': + show_the_numastat_info = 1; + break; + case 'p': + if ((optarg) && (all_digits(optarg))) { + add_pid_to_list(atoi(optarg)); + } else { + add_pids_from_pattern_search(optarg); + } + break; + case 's': + sort_table = 1; + if ((optarg) && (all_digits(optarg))) { + sort_table_node = atoi(optarg); + } + break; + case 'v': + verbose = 1; + break; + case 'V': + display_version_and_exit(); + break; + case 'z': + show_zero_data = 0; + break; + default: + case '?': + display_usage_and_exit(); + break; + } + } + // Figure out the display width, which is used to format the tables + // and limit the output columns per row + screen_width = get_screen_width(); + // Any remaining arguments are assumed to be 
additional process specifiers + while (optind < argc) { + if (all_digits(argv[optind])) { + add_pid_to_list(atoi(argv[optind])); + } else { + add_pids_from_pattern_search(argv[optind]); + } + optind += 1; + } + // If there are no program options or arguments, be extremely compatible + // with the old numastat perl script (which is included at the end of this + // file for reference) + compatibility_mode = (argc == 1); + init_node_ix_map_and_header(compatibility_mode); // enumerate the NUMA nodes + if (compatibility_mode) { + show_numastat_info(); + free_node_ix_map_and_header(); + exit(EXIT_SUCCESS); + } + // Figure out page sizes + page_size_in_bytes = (double)sysconf(_SC_PAGESIZE); + huge_page_size_in_bytes = get_huge_page_size_in_bytes(); + // Display the info for the process specifiers + if (num_pids > 0) { + sort_pids_and_remove_duplicates(); + show_process_info(); + } + if (pid_array != NULL) { + free(pid_array); + } + // Display the system-wide memory usage info + if (show_the_system_info) { + show_system_info(); + } + // Display the numastat statistics info + if ((show_the_numastat_info) || ((num_pids == 0) && (!show_the_system_info))) { + show_numastat_info(); + } + free_node_ix_map_and_header(); + exit(EXIT_SUCCESS); +} + +#if 0 +/* + +#!/usr/bin/perl +# Print numa statistics for all nodes +# Copyright (C) 2003,2004 Andi Kleen, SuSE Labs. +# +# numastat is free software; you can redistribute it and/or +# modify it under the terms of the GNU General Public +# License as published by the Free Software Foundation; version +# 2. +# +# numastat is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +# General Public License for more details. 
+ +# You should find a copy of v2 of the GNU General Public License somewhere +# on your Linux system; if not, write to the Free Software Foundation, +# Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA +# +# Example: NUMASTAT_WIDTH=80 watch -n1 numastat +# + +# output width +$WIDTH=80; +if (defined($ENV{'NUMASTAT_WIDTH'})) { + $WIDTH=$ENV{'NUMASTAT_WIDTH'}; +} else { + use POSIX; + if (POSIX::isatty(fileno(STDOUT))) { + if (open(R, "resize |")) { + while (<R>) { + $WIDTH=$1 if /COLUMNS=(\d+)/; + } + close R; + } + } else { + # don't split it up for easier parsing + $WIDTH=10000000; + } +} +$WIDTH = 32 if $WIDTH < 32; + +if (! -d "/sys/devices/system/node" ) { + print STDERR "sysfs not mounted or system not NUMA aware\n"; + exit 1; +} + +%stat = (); +$title = ""; +$mode = 0; +opendir(NODES, "/sys/devices/system/node") || exit 1; +foreach $nd (readdir(NODES)) { + next unless $nd =~ /node(\d+)/; + # On newer kernels, readdir may enumerate the 'node(\d+)' subdirs + # in opposite order from older kernels--e.g., node{0,1,2,...} + # as opposed to node{N,N-1,N-2,...}. Accommodate this by + # switching to new mode so that the stats get emitted in + # the same order. + #print "readdir(NODES) returns $nd\n"; + if (!$title && $nd =~ /node0/) { + $mode = 1; + } + open(STAT, "/sys/devices/system/node/$nd/numastat") || + die "cannot open $nd: $!\n"; + if (! $mode) { + $title = sprintf("%16s",$nd) . $title; + } else { + $title = $title . sprintf("%16s",$nd); + } + @fields = (); + while (<STAT>) { + ($name, $val) = split; + if (! $mode) { + $stat{$name} = sprintf("%16u", $val) . $stat{$name}; + } else { + $stat{$name} = $stat{$name} . 
sprintf("%16u", $val); + } + push(@fields, $name); + } + close STAT; +} +closedir NODES; + +$numfields = int(($WIDTH - 16) / 16); +$l = 16 * $numfields; +for ($i = 0; $i < length($title); $i += $l) { + print "\n" if $i > 0; + printf "%16s%s\n","",substr($title,$i,$l); + foreach (@fields) { + printf "%-16s%s\n",$_,substr($stat{$_},$i,$l); + } +} + +*/ +#endif
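Note on the numa_maps token handling in show_process_info() above: the per-node tokens have the form "N<node>=<pages>", and the column lookup walks node_ix_map without an upper bound. The following is a minimal sketch, not part of numastat, of how the same parsing and lookup could be done with explicit failure paths. The helper names parse_node_pages and find_node_index are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: parse one numa_maps token of the form
 * "N<node>=<pages>". Returns 1 and fills *node and *pages on success,
 * 0 if the token does not have that shape. */
static int parse_node_pages(const char *tok, int *node, long *pages)
{
    char *end;

    assert(node != NULL && pages != NULL);
    if (tok == NULL || tok[0] != 'N')
        return 0;
    long n = strtol(tok + 1, &end, 10);
    if (end == tok + 1 || *end != '=')
        return 0;               /* no digits after 'N', or no '=' */
    const char *val = end + 1;
    long p = strtol(val, &end, 10);
    if (end == val)
        return 0;               /* no digits after '=' */
    *node = (int)n;
    *pages = p;
    return 1;
}

/* Hypothetical helper: find the position of `node` in a node map of
 * `count` entries. Returns the index, or -1 if the node is not listed,
 * so the caller never reads past the end of the map. */
static int find_node_index(const int *map, int count, int node)
{
    for (int i = 0; i < count; i++) {
        if (map[i] == node)
            return i;
    }
    return -1;
}
```

A caller would skip any token for which parse_node_pages() returns 0, and report an error instead of indexing the table when find_node_index() returns -1.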
diff --git a/rtnetlink.c b/rtnetlink.c new file mode 100644 index 0000000..985f74a --- /dev/null +++ b/rtnetlink.c
@@ -0,0 +1,89 @@ +/* Simple LGPLed rtnetlink library */ +#include <sys/socket.h> +#include <linux/rtnetlink.h> +#include <linux/netlink.h> +#include <netinet/in.h> +#include <errno.h> +#include <unistd.h> +#define hidden __attribute__((visibility("hidden"))) +#include "rtnetlink.h" + +hidden void *rta_put(struct nlmsghdr *m, int type, int len) +{ + struct rtattr *rta = (void *)m + NLMSG_ALIGN(m->nlmsg_len); + int rtalen = RTA_LENGTH(len); + + rta->rta_type = type; + rta->rta_len = rtalen; + m->nlmsg_len = NLMSG_ALIGN(m->nlmsg_len) + RTA_ALIGN(rtalen); + return RTA_DATA(rta); +} + +hidden struct rtattr *rta_get(struct nlmsghdr *m, struct rtattr *p, int offset) +{ + struct rtattr *rta; + + if (p) { + rta = RTA_NEXT(p, m->nlmsg_len); + if (!RTA_OK(rta, m->nlmsg_len)) + return NULL; + } else { + rta = (void *)m + NLMSG_ALIGN(offset); + } + return rta; +} + +hidden int +rta_put_address(struct nlmsghdr *msg, int type, struct sockaddr *adr) +{ + switch (adr->sa_family) { + case AF_INET: { + struct in_addr *i = rta_put(msg, type, 4); + *i = ((struct sockaddr_in *)adr)->sin_addr; + break; + } + case AF_INET6: { + struct in6_addr *i6 = rta_put(msg, type, 16); + *i6 = ((struct sockaddr_in6 *)adr)->sin6_addr; + break; + } + default: + return -1; + } + return 0; +} + +/* Assumes no truncation. Make the buffer large enough. */ +hidden int +rtnetlink_request(struct nlmsghdr *msg, int buflen, struct sockaddr_nl *adr) +{ + int rsk; + int n; + int e; + + /* Use a private socket to avoid having to keep state + for a sequence number. */ + rsk = socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE); + if (rsk < 0) + return -1; + n = sendto(rsk, msg, msg->nlmsg_len, 0, (struct sockaddr *)adr, + sizeof(struct sockaddr_nl)); + if (n >= 0) { + socklen_t adrlen = sizeof(struct sockaddr_nl); + n = recvfrom(rsk, msg, buflen, 0, (struct sockaddr *)adr, + &adrlen); + } + e = errno; + close(rsk); + errno = e; + if (n < 0) + return -1; + /* Assume we only get a single reply back. This is (hopefully?) 
+ safe because it's a single use socket. */ + if (msg->nlmsg_type == NLMSG_ERROR) { + struct nlmsgerr *err = NLMSG_DATA(msg); + errno = -err->error; + return -1; + } + return 0; +}
diff --git a/rtnetlink.h b/rtnetlink.h new file mode 100644 index 0000000..f73d909 --- /dev/null +++ b/rtnetlink.h
@@ -0,0 +1,5 @@ +hidden int +rta_put_address(struct nlmsghdr *msg, int type, struct sockaddr *adr); +hidden struct rtattr *rta_get(struct nlmsghdr *m, struct rtattr *p, int offset); +hidden void *rta_put(struct nlmsghdr *m, int type, int len); +hidden int rtnetlink_request(struct nlmsghdr *msg, int buflen, struct sockaddr_nl *adr);
diff --git a/shm.c b/shm.c new file mode 100644 index 0000000..260eeff --- /dev/null +++ b/shm.c
@@ -0,0 +1,325 @@ +/* Copyright (C) 2003,2004 Andi Kleen, SuSE Labs. + Manage shared memory policy for numactl. + The actual policy is set in numactl itself, this just sets up and maps + the shared memory segments and dumps them. + + numactl is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; version + 2. + + numactl is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should find a copy of v2 of the GNU General Public License somewhere + on your Linux system; if not, write to the Free Software Foundation, + Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ + +#define _GNU_SOURCE 1 +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/mman.h> +#include <sys/ipc.h> +#include <sys/shm.h> +#include <sys/fcntl.h> +#include <sys/stat.h> +#include <stdarg.h> +#include <errno.h> +#include <unistd.h> +#include "numa.h" +#include "numaif.h" +#include "numaint.h" +#include "util.h" +#include "shm.h" + +int shmfd = -1; +long shmid = 0; +char *shmptr; +unsigned long long shmlen; +mode_t shmmode = 0600; +unsigned long long shmoffset; +int shmflags; +static int shm_pagesize; + +long huge_page_size(void) +{ + size_t len = 0; + char *line = NULL; + FILE *f = fopen("/proc/meminfo", "r"); + if (f != NULL) { + while (getdelim(&line, &len, '\n', f) > 0) { + int ps; + if (sscanf(line, "Hugepagesize: %d kB", &ps) == 1) + return ps * 1024; + } + free(line); + fclose(f); + } + return getpagesize(); +} + +static void check_region(char *opt) +{ + if (((unsigned long)shmptr % shm_pagesize) || (shmlen % shm_pagesize)) { + fprintf(stderr, "numactl: policy region not page aligned\n"); + exit(1); + } + if (!shmlen) { + fprintf(stderr, + "numactl: policy region length not 
specified before %s\n", + opt); + exit(1); + } +} + +static key_t sysvkey(char *name) +{ + int fd; + key_t key = ftok(name, shmid); + if (key >= 0) + return key; + + fprintf(stderr, "numactl: Creating shm key file %s mode %04o\n", + name, shmmode); + fd = creat(name, shmmode); + if (fd < 0) + nerror("cannot create key for shm %s\n", name); + key = ftok(name, shmid); + if (key < 0) + nerror("cannot get key for newly created shm key file %s", + name); + return key; +} + +/* Attach a sysv style shared memory segment. */ +void attach_sysvshm(char *name, char *opt) +{ + struct shmid_ds s; + key_t key = sysvkey(name); + + shmfd = shmget(key, shmlen, shmflags); + if (shmfd < 0 && errno == ENOENT) { + if (shmlen == 0) + complain( + "need a --length to create a sysv shared memory segment"); + fprintf(stderr, + "numactl: Creating shared memory segment %s id %ld mode %04o length %.fMB\n", + name, shmid, shmmode, ((double)shmlen) / (1024*1024) ); + shmfd = shmget(key, shmlen, IPC_CREAT|shmmode|shmflags); + if (shmfd < 0) + nerror("cannot create shared memory segment"); + } + + if (shmlen == 0) { + if (shmctl(shmfd, IPC_STAT, &s) < 0) + err("shmctl IPC_STAT"); + shmlen = s.shm_segsz; + } + + shmptr = shmat(shmfd, NULL, SHM_RDONLY); + if (shmptr == (void*)-1) + err("shmat"); + shmptr += shmoffset; + + shm_pagesize = (shmflags & SHM_HUGETLB) ? huge_page_size() : getpagesize(); + + check_region(opt); +} + +/* Attach a shared memory file. */ +void attach_shared(char *name, char *opt) +{ + struct stat64 st; + + shmfd = open(name, O_RDONLY); + if (shmfd < 0) { + errno = 0; + if (shmlen == 0) + complain("need a --length to create a shared file"); + shmfd = open(name, O_RDWR|O_CREAT, shmmode); + if (shmfd < 0) + nerror("cannot create file %s", name); + } + if (fstat64(shmfd, &st) < 0) + err("shm stat"); + if (shmlen > st.st_size) { + if (ftruncate64(shmfd, shmlen) < 0) { + /* XXX: we could do it by hand, but it + would be impossible to apply policy then. 
+ need to fix that in the kernel. */ + perror("ftruncate"); + } + } + + shm_pagesize = st.st_blksize; + + check_region(opt); + + /* RED-PEN For shmlen > address space may need to map in pieces. + Left for some poor 32bit soul. */ + shmptr = mmap64(NULL, shmlen, PROT_READ, MAP_SHARED, shmfd, shmoffset); + if (shmptr == (char*)-1) + err("shm mmap"); + +} + +static void +dumppol(unsigned long long start, unsigned long long end, int pol, struct bitmask *mask) +{ + if (pol == MPOL_DEFAULT) + return; + printf("%016llx-%016llx: %s ", + shmoffset+start, + shmoffset+end, + policy_name(pol)); + printmask("", mask); +} + +/* Dump policies in a shared memory segment. */ +void dump_shm(void) +{ + struct bitmask *nodes, *prevnodes; + int prevpol = -1, pol; + unsigned long long c, start; + + start = 0; + if (shmlen == 0) { + printf("nothing to dump\n"); + return; + } + + nodes = numa_allocate_nodemask(); + prevnodes = numa_allocate_nodemask(); + + for (c = 0; c < shmlen; c += shm_pagesize) { + if (get_mempolicy(&pol, nodes->maskp, nodes->size, c+shmptr, + MPOL_F_ADDR) < 0) + err("get_mempolicy on shm"); + if (pol == prevpol) + continue; + if (prevpol != -1) + dumppol(start, c, prevpol, prevnodes); + prevnodes = nodes; + prevpol = pol; + start = c; + } + dumppol(start, c, prevpol, prevnodes); +} + +static void dumpnode(unsigned long long start, unsigned long long end, int node) +{ + printf("%016llx-%016llx: %d\n", shmoffset+start, shmoffset+end, node); +} + +/* Dump nodes in a shared memory segment. 
*/ +void dump_shm_nodes(void) +{ + int prevnode = -1, node; + unsigned long long c, start; + + start = 0; + if (shmlen == 0) { + printf("nothing to dump\n"); + return; + } + + for (c = 0; c < shmlen; c += shm_pagesize) { + if (get_mempolicy(&node, NULL, 0, c+shmptr, + MPOL_F_ADDR|MPOL_F_NODE) < 0) + err("get_mempolicy on shm"); + if (node == prevnode) + continue; + if (prevnode != -1) + dumpnode(start, c, prevnode); + prevnode = node; + start = c; + } + dumpnode(start, c, prevnode); +} + +static void vwarn(char *ptr, char *fmt, ...) +{ + va_list ap; + unsigned long off = (unsigned long)ptr - (unsigned long)shmptr; + va_start(ap,fmt); + printf("numactl verify %lx(%lx): ", (unsigned long)ptr, off); + vprintf(fmt, ap); + va_end(ap); + exitcode = 1; +} + +static unsigned interleave_next(unsigned cur, struct bitmask *mask) +{ + int numa_num_nodes = numa_num_possible_nodes(); + + ++cur; + while (!numa_bitmask_isbitset(mask, cur)) { + cur = (cur+1) % numa_num_nodes; + } + return cur; +} + +/* Verify policy in a shared memory segment */ +void verify_shm(int policy, struct bitmask *nodes) +{ + char *p; + int ilnode, node; + int pol2; + struct bitmask *nodes2; + + nodes2 = numa_allocate_nodemask(); + + if (policy == MPOL_INTERLEAVE) { + if (get_mempolicy(&ilnode, NULL, 0, shmptr, + MPOL_F_ADDR|MPOL_F_NODE) + < 0) + err("get_mempolicy"); + } + + for (p = shmptr; p - (char *)shmptr < shmlen; p += shm_pagesize) { + if (get_mempolicy(&pol2, nodes2->maskp, nodes2->size, p, + MPOL_F_ADDR) < 0) + err("get_mempolicy"); + if (pol2 != policy) { + vwarn(p, "wrong policy %s, expected %s\n", + policy_name(pol2), policy_name(policy)); + return; + } + if (memcmp(nodes2, nodes, numa_bitmask_nbytes(nodes))) { + vwarn(p, "mismatched node mask\n"); + printmask("expected", nodes); + printmask("real", nodes2); + } + + if (get_mempolicy(&node, NULL, 0, p, MPOL_F_ADDR|MPOL_F_NODE) < 0) + err("get_mempolicy"); + + switch (policy) { + case MPOL_INTERLEAVE: + if (node < 0 || 
!numa_bitmask_isbitset(nodes2, node)) + vwarn(p, "interleave node out of range %d\n", node); + if (node != ilnode) { + vwarn(p, "expected interleave node %d, got %d\n", + ilnode,node); + return; + } + ilnode = interleave_next(ilnode, nodes2); + break; + case MPOL_PREFERRED: + case MPOL_BIND: + if (!numa_bitmask_isbitset(nodes2, node)) { + vwarn(p, "unexpected node %d\n", node); + printmask("expected", nodes2); + } + break; + + case MPOL_DEFAULT: + break; + + } + } + +}
diff --git a/shm.h b/shm.h new file mode 100644 index 0000000..68166a7 --- /dev/null +++ b/shm.h
@@ -0,0 +1,17 @@ + +extern int shmfd; +extern long shmid; +extern char *shmptr; +extern unsigned long long shmlen; +extern mode_t shmmode; +extern unsigned long long shmoffset; +extern int shmflags; + +extern void dump_shm(void); +extern void dump_shm_nodes(void); +extern void attach_shared(char *, char *); +extern void attach_sysvshm(char *, char *); +extern void verify_shm(int policy, struct bitmask *); + +/* in numactl.c */ +extern int exitcode;
diff --git a/stream_lib.c b/stream_lib.c new file mode 100644 index 0000000..a392bdd --- /dev/null +++ b/stream_lib.c
@@ -0,0 +1,266 @@ +#include <stdio.h> +#include <math.h> +#include <float.h> +#include <limits.h> +#include <sys/time.h> +#include <stdlib.h> +#include "stream_lib.h" + +static inline double mysecond() +{ + struct timeval tv; + gettimeofday(&tv, NULL); + return tv.tv_sec + tv.tv_usec * 1.e-6; +} + +/* + * Program: Stream + * Programmer: Joe R. Zagar + * Revision: 4.0-BETA, October 24, 1995 + * Original code developed by John D. McCalpin + * + * This program measures memory transfer rates in MB/s for simple + * computational kernels coded in C. These numbers reveal the quality + * of code generation for simple uncacheable kernels as well as showing + * the cost of floating-point operations relative to memory accesses. + * + * INSTRUCTIONS: + * + * 1) Stream requires a good bit of memory to run. Adjust the + * value of 'N' (below) to give a 'timing calibration' of + * at least 20 clock-ticks. This will provide rate estimates + * that should be good to about 5% precision. + * + * Hacked by AK to be a library + */ + +long N = 8000000; +#define NTIMES 10 +#define OFFSET 0 + +/* + * 3) Compile the code with full optimization. Many compilers + * generate unreasonably bad code before the optimizer tightens + * things up. If the results are unreasonably good, on the + * other hand, the optimizer might be too smart for me! + * + * Try compiling with: + * cc -O stream_d.c second_wall.c -o stream_d -lm + * + * This is known to work on Cray, SGI, IBM, and Sun machines. + * + * + * 4) Mail the results to mccalpin@cs.virginia.edu + * Be sure to include: + * a) computer hardware model number and software revision + * b) the compiler flags + * c) all of the output from the test case. + * Thanks! 
+ * + */ + +int checktick(); + +# define HLINE "-------------------------------------------------------------\n" + +# ifndef MIN +# define MIN(x,y) ((x)<(y)?(x):(y)) +# endif +# ifndef MAX +# define MAX(x,y) ((x)>(y)?(x):(y)) +# endif + +static double *a, *b, *c; + +static double rmstime[4] = { 0 }, maxtime[4] = { +0}, mintime[4] = { +FLT_MAX, FLT_MAX, FLT_MAX, FLT_MAX}; + +static char *label[4] = { "Copy: ", "Scale: ", + "Add: ", "Triad: " +}; +char *stream_names[] = { "Copy","Scale","Add","Triad" }; + +static double bytes[4]; + +int stream_verbose = 1; + +#define Vprintf(x...) do { if (stream_verbose) printf(x); } while(0) + +void stream_check(void) +{ + int quantum; + int BytesPerWord; + register int j; + double t; + + /* --- SETUP --- determine precision and check timing --- */ + + Vprintf(HLINE); + BytesPerWord = sizeof(double); + Vprintf("This system uses %d bytes per DOUBLE PRECISION word.\n", + BytesPerWord); + + Vprintf(HLINE); + Vprintf("Array size = %lu, Offset = %d\n", N, OFFSET); + Vprintf("Total memory required = %.1f MB.\n", + (3 * N * BytesPerWord) / 1048576.0); + Vprintf("Each test is run %d times, but only\n", NTIMES); + Vprintf("the *best* time for each is used.\n"); + + /* Get initial value for system clock. 
*/ + + for (j = 0; j < N; j++) { + a[j] = 1.0; + b[j] = 2.0; + c[j] = 0.0; + } + + Vprintf(HLINE); + + if ((quantum = checktick()) >= 1) + Vprintf("Your clock granularity/precision appears to be " + "%d microseconds.\n", quantum); + else + Vprintf("Your clock granularity appears to be " + "less than one microsecond.\n"); + + t = mysecond(); + for (j = 0; j < N; j++) + a[j] = 2.0E0 * a[j]; + t = 1.0E6 * (mysecond() - t); + + Vprintf("Each test below will take on the order" + " of %d microseconds.\n", (int) t); + Vprintf(" (= %d clock ticks)\n", (int) (t / quantum)); + Vprintf("Increase the size of the arrays if this shows that\n"); + Vprintf("you are not getting at least 20 clock ticks per test.\n"); + + Vprintf(HLINE); + + Vprintf("WARNING -- The above is only a rough guideline.\n"); + Vprintf("For best results, please be sure you know the\n"); + Vprintf("precision of your system timer.\n"); + Vprintf(HLINE); +} + +void stream_test(double *res) +{ + register int j, k; + double scalar, times[4][NTIMES]; + + /* --- MAIN LOOP --- repeat test cases NTIMES times --- */ + + scalar = 3.0; + for (k = 0; k < NTIMES; k++) { + times[0][k] = mysecond(); + for (j = 0; j < N; j++) + c[j] = a[j]; + times[0][k] = mysecond() - times[0][k]; + + times[1][k] = mysecond(); + for (j = 0; j < N; j++) + b[j] = scalar * c[j]; + times[1][k] = mysecond() - times[1][k]; + + times[2][k] = mysecond(); + for (j = 0; j < N; j++) + c[j] = a[j] + b[j]; + times[2][k] = mysecond() - times[2][k]; + + times[3][k] = mysecond(); + for (j = 0; j < N; j++) + a[j] = b[j] + scalar * c[j]; + times[3][k] = mysecond() - times[3][k]; + } + + /* --- SUMMARY --- */ + + for (k = 0; k < NTIMES; k++) { + for (j = 0; j < 4; j++) { + rmstime[j] = + rmstime[j] + (times[j][k] * times[j][k]); + mintime[j] = MIN(mintime[j], times[j][k]); + maxtime[j] = MAX(maxtime[j], times[j][k]); + } + } + + Vprintf + ("Function Rate (MB/s) RMS time Min time Max time\n"); + for (j = 0; j < 4; j++) { + double speed = 1.0E-06 * bytes[j] / 
mintime[j]; + + rmstime[j] = sqrt(rmstime[j] / (double) NTIMES); + + Vprintf("%s%11.4f %11.4f %11.4f %11.4f\n", label[j], + speed, + rmstime[j], mintime[j], maxtime[j]); + + if (res) + res[j] = speed; + + } +} + +# define M 20 + +int checktick() +{ + int i, minDelta, Delta; + double t1, t2, timesfound[M]; + +/* Collect a sequence of M unique time values from the system. */ + + for (i = 0; i < M; i++) { + t1 = mysecond(); + while (((t2 = mysecond()) - t1) < 1.0E-6); + timesfound[i] = t1 = t2; + } + +/* + * Determine the minimum difference between these M values. + * This result will be our estimate (in microseconds) for the + * clock granularity. + */ + + minDelta = 1000000; + for (i = 1; i < M; i++) { + Delta = + (int) (1.0E6 * (timesfound[i] - timesfound[i - 1])); + minDelta = MIN(minDelta, MAX(Delta, 0)); + } + + return (minDelta); +} + +void stream_setmem(unsigned long size) +{ + N = (size - OFFSET) / (3*sizeof(double)); +} + +long stream_memsize(void) +{ + return 3*(sizeof(double) * (N+OFFSET)) ; +} + +long stream_init(void *mem) +{ + int i; + + for (i = 0; i < 4; i++) { + rmstime[i] = 0; + maxtime[i] = 0; + mintime[i] = FLT_MAX; + } + + bytes[0] = 2 * sizeof(double) * N; + bytes[1] = 2 * sizeof(double) * N; + bytes[2] = 3 * sizeof(double) * N; + bytes[3] = 3 * sizeof(double) * N; + + a = mem; + b = (double *)mem + (N+OFFSET); + c = (double *)mem + 2*(N+OFFSET); + stream_check(); + return 0; +}
diff --git a/stream_lib.h b/stream_lib.h new file mode 100644 index 0000000..4d749bd --- /dev/null +++ b/stream_lib.h
@@ -0,0 +1,8 @@ +long stream_memsize(void); +long stream_init(void *mem); +#define STREAM_NRESULTS 4 +void stream_test(double *res); +void stream_check(void); +void stream_setmem(unsigned long size); +extern int stream_verbose; +extern char *stream_names[];
diff --git a/stream_main.c b/stream_main.c new file mode 100644 index 0000000..c46cae0 --- /dev/null +++ b/stream_main.c
@@ -0,0 +1,43 @@ +#include <stdio.h> +#include <sys/mman.h> +#include <stdlib.h> +#include "numa.h" +#include "numaif.h" +#include "util.h" +#include "stream_lib.h" + +void usage(void) +{ + exit(1); +} + +char *policy = "default"; + +/* Run STREAM with a numa policy */ +int main(int ac, char **av) +{ + struct bitmask *nodes; + char *map; + long size; + int policy; + + policy = parse_policy(av[1], av[2]); + + nodes = numa_allocate_nodemask(); + + if (av[1] && av[2]) + nodes = numa_parse_nodestring(av[2]); + if (!nodes) { + printf ("<%s> is invalid\n", av[2]); + exit(1); + } + size = stream_memsize(); + map = mmap(NULL, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, + 0, 0); + if (map == (char*)-1) exit(1); + if (mbind(map, size, policy, nodes->maskp, nodes->size, 0) < 0) + perror("mbind"), exit(1); + stream_init(map); + stream_test(NULL); + return 0; +}
diff --git a/syscall.c b/syscall.c new file mode 100644 index 0000000..31cf005 --- /dev/null +++ b/syscall.c
@@ -0,0 +1,266 @@ +/* Copyright (C) 2003,2004 Andi Kleen, SuSE Labs. + + libnuma is free software; you can redistribute it and/or + modify it under the terms of the GNU Lesser General Public + License as published by the Free Software Foundation; version + 2.1. + + libnuma is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + Lesser General Public License for more details. + + You should find a copy of v2.1 of the GNU Lesser General Public License + somewhere on your Linux system; if not, write to the Free Software + Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include <unistd.h> +#include <sys/types.h> +#include <asm/unistd.h> +#include <errno.h> +#include "numa.h" +#include "numaif.h" +#include "numaint.h" + +#define WEAK __attribute__((weak)) + +#if !defined(__NR_mbind) || !defined(__NR_set_mempolicy) || \ + !defined(__NR_get_mempolicy) || !defined(__NR_migrate_pages) || \ + !defined(__NR_move_pages) + +#if defined(__x86_64__) + +#define __NR_sched_setaffinity 203 +#define __NR_sched_getaffinity 204 + +/* Official allocation */ + +#define __NR_mbind 237 +#define __NR_set_mempolicy 238 +#define __NR_get_mempolicy 239 +#define __NR_migrate_pages 256 +#define __NR_move_pages 279 + +#elif defined(__ia64__) +#define __NR_sched_setaffinity 1231 +#define __NR_sched_getaffinity 1232 +#define __NR_migrate_pages 1280 +#define __NR_move_pages 1276 + +/* Official allocation */ + +#define __NR_mbind 1259 +#define __NR_get_mempolicy 1260 +#define __NR_set_mempolicy 1261 + +#elif defined(__i386__) + +#define __NR_mbind 274 +#define __NR_get_mempolicy 275 +#define __NR_set_mempolicy 276 +#define __NR_migrate_pages 294 +#define __NR_move_pages 317 + +#elif defined(__powerpc__) + +#define __NR_mbind 259 +#define __NR_get_mempolicy 260 +#define __NR_set_mempolicy 261 +#define __NR_migrate_pages 258 +/* FIXME: powerpc is 
missing move pages!!! +#define __NR_move_pages xxx +*/ + +#elif defined(__mips__) + +#if _MIPS_SIM == _ABIO32 +/* + * Linux o32 style syscalls are in the range from 4000 to 4999. + */ +#define __NR_Linux 4000 +#define __NR_mbind (__NR_Linux + 268) +#define __NR_get_mempolicy (__NR_Linux + 269) +#define __NR_set_mempolicy (__NR_Linux + 270) +#define __NR_migrate_pages (__NR_Linux + 287) +#endif + +#if _MIPS_SIM == _ABI64 +/* + * Linux 64-bit syscalls are in the range from 5000 to 5999. + */ +#define __NR_Linux 5000 +#define __NR_mbind (__NR_Linux + 227) +#define __NR_get_mempolicy (__NR_Linux + 228) +#define __NR_set_mempolicy (__NR_Linux + 229) +#define __NR_migrate_pages (__NR_Linux + 246) +#endif + +#if _MIPS_SIM == _ABIN32 +/* + * Linux N32 syscalls are in the range from 6000 to 6999. + */ +#define __NR_Linux 6000 +#define __NR_mbind (__NR_Linux + 231) +#define __NR_get_mempolicy (__NR_Linux + 232) +#define __NR_set_mempolicy (__NR_Linux + 233) +#define __NR_migrate_pages (__NR_Linux + 250) +#endif + +#elif defined(__hppa__) + +#define __NR_migrate_pages 272 + +#elif defined(__arm__) +/* https://bugs.debian.org/796802 */ +#warning "ARM does not implement the migrate_pages() syscall" + +#elif !defined(DEPS_RUN) +#error "Add syscalls for your architecture or update kernel headers" +#endif + +#endif + +#ifndef __GLIBC_PREREQ +# define __GLIBC_PREREQ(x,y) 0 +#endif + +#if defined(__GLIBC__) && __GLIBC_PREREQ(2, 11) + +/* glibc 2.11 seems to have a working 6-argument syscall. Use the + glibc-supplied syscall in this case. + The version cut-off is rather arbitrary and could probably be + earlier. */ + +#define syscall6 syscall +#elif defined(__x86_64__) +/* 6 argument calls on x86-64 are often buggy in both glibc and + asm/unistd.h. Add a working version here.
*/ +long syscall6(long call, long a, long b, long c, long d, long e, long f) +{ + long res; + asm volatile ("movq %[d],%%r10 ; movq %[e],%%r8 ; movq %[f],%%r9 ; syscall" + : "=a" (res) + : "0" (call),"D" (a),"S" (b), "d" (c), + [d] "g" (d), [e] "g" (e), [f] "g" (f) : + "r11","rcx","r8","r10","r9","memory" ); + if (res < 0) { + errno = -res; + res = -1; + } + return res; +} +#elif defined(__i386__) + +/* i386 has buggy syscall6 in glibc too. This is tricky to do + in inline assembly because it clobbers so many registers. Do it + out of line. */ +asm( +"__syscall6:\n" +" pushl %ebp\n" +" pushl %edi\n" +" pushl %esi\n" +" pushl %ebx\n" +" movl (0+5)*4(%esp),%eax\n" +" movl (1+5)*4(%esp),%ebx\n" +" movl (2+5)*4(%esp),%ecx\n" +" movl (3+5)*4(%esp),%edx\n" +" movl (4+5)*4(%esp),%esi\n" +" movl (5+5)*4(%esp),%edi\n" +" movl (6+5)*4(%esp),%ebp\n" +" int $0x80\n" +" popl %ebx\n" +" popl %esi\n" +" popl %edi\n" +" popl %ebp\n" +" ret" +); +extern long __syscall6(long n, long a, long b, long c, long d, long e, long f); + +long syscall6(long call, long a, long b, long c, long d, long e, long f) +{ + long res = __syscall6(call,a,b,c,d,e,f); + if (res < 0) { + errno = -res; + res = -1; + } + return res; +} + +#else +#define syscall6 syscall +#endif + +long WEAK get_mempolicy(int *policy, unsigned long *nmask, + unsigned long maxnode, void *addr, + unsigned flags) +{ + return syscall(__NR_get_mempolicy, policy, nmask, + maxnode, addr, flags); +} + +long WEAK mbind(void *start, unsigned long len, int mode, + const unsigned long *nmask, unsigned long maxnode, unsigned flags) +{ + return syscall6(__NR_mbind, (long)start, len, mode, (long)nmask, + maxnode, flags); +} + +long WEAK set_mempolicy(int mode, const unsigned long *nmask, + unsigned long maxnode) +{ + long i; + i = syscall(__NR_set_mempolicy,mode,nmask,maxnode); + return i; +} + +long WEAK migrate_pages(int pid, unsigned long maxnode, + const unsigned long *frommask, const unsigned long *tomask) +{ +#if 
defined(__NR_migrate_pages) + return syscall(__NR_migrate_pages, pid, maxnode, frommask, tomask); +#else + errno = ENOSYS; + return -1; +#endif +} + +long WEAK move_pages(int pid, unsigned long count, + void **pages, const int *nodes, int *status, int flags) +{ + return syscall(__NR_move_pages, pid, count, pages, nodes, status, flags); +} + +/* SLES8 glibc doesn't define those */ +int numa_sched_setaffinity_v1(pid_t pid, unsigned len, const unsigned long *mask) +{ + return syscall(__NR_sched_setaffinity,pid,len,mask); +} +backward_symver(numa_sched_setaffinity_v1,numa_sched_setaffinity); + +int numa_sched_setaffinity_v2(pid_t pid, struct bitmask *mask) +{ + return syscall(__NR_sched_setaffinity, pid, numa_bitmask_nbytes(mask), + mask->maskp); +} +symver(numa_sched_setaffinity_v2,numa_sched_setaffinity); + +int numa_sched_getaffinity_v1(pid_t pid, unsigned len, const unsigned long *mask) +{ + return syscall(__NR_sched_getaffinity,pid,len,mask); + +} +backward_symver(numa_sched_getaffinity_v1,numa_sched_getaffinity); + +int numa_sched_getaffinity_v2(pid_t pid, struct bitmask *mask) +{ + /* len is length in bytes */ + return syscall(__NR_sched_getaffinity, pid, numa_bitmask_nbytes(mask), + mask->maskp); + /* sched_getaffinity returns sizeof(cpumask_t) */ + +} +symver(numa_sched_getaffinity_v2,numa_sched_getaffinity); + +make_internal_alias(numa_sched_getaffinity_v1); +make_internal_alias(numa_sched_getaffinity_v2); +make_internal_alias(numa_sched_setaffinity_v1); +make_internal_alias(numa_sched_setaffinity_v2);
diff --git a/sysfs.c b/sysfs.c new file mode 100644 index 0000000..f1cdcdc --- /dev/null +++ b/sysfs.c
@@ -0,0 +1,73 @@ +/* Utility functions for reading sysfs values */ +#define _GNU_SOURCE 1 +#include <stdio.h> +#include <fcntl.h> +#include <stdlib.h> +#include <unistd.h> +#include <stdarg.h> +#include <ctype.h> +#include "numa.h" +#include "numaint.h" + +#define SYSFS_BLOCK 4096 + +hidden char *sysfs_read(char *name) +{ + char *buf; + int n; + int fd; + + fd = open(name, O_RDONLY); + if (fd < 0) + return NULL; + buf = malloc(SYSFS_BLOCK); + if (!buf) { + close(fd); + return NULL; + } + n = read(fd, buf, SYSFS_BLOCK - 1); + close(fd); + if (n <= 0) { + free(buf); + return NULL; + } + buf[n] = 0; + return buf; +} + +hidden int sysfs_node_read(struct bitmask *mask, char *fmt, ...) +{ + int n; + va_list ap; + char *p, *fn, *m, *end; + int num, ret = 0; + + va_start(ap, fmt); + n = vasprintf(&fn, fmt, ap); + va_end(ap); + if (n < 0) + return -1; + p = sysfs_read(fn); + free(fn); + if (!p) + return -1; + + m = p; + do { + num = strtol(m, &end, 0); + if (m == end || num >= numa_num_task_nodes()) + ret = -1; + else if (num < 0) + ret = -2; + if (ret < 0) + break; /* leave the loop so that p is freed below */ + numa_bitmask_setbit(mask, num); + + /* Continuation not supported by kernel yet. */ + m = end; + while (isspace(*m) || *m == ',') + m++; + } while (isdigit(*m)); + free(p); + return ret; +}
diff --git a/sysfs.h b/sysfs.h new file mode 100644 index 0000000..0574ab1 --- /dev/null +++ b/sysfs.h
@@ -0,0 +1,3 @@ +struct bitmask; +hidden char *sysfs_read(char *name); +hidden int sysfs_node_read(struct bitmask *mask, char *fmt, ...);
diff --git a/test-libnuma.c b/test-libnuma.c new file mode 100644 index 0000000..0234350 --- /dev/null +++ b/test-libnuma.c
@@ -0,0 +1,21 @@ +#include <sched.h> +#include <sys/types.h> +#include <unistd.h> +#include <stdlib.h> +#include <stdio.h> + +#include "third_party/libnuma/numa.h" + +void numa_init(void); + +int main(int argc, char *argv[]) +{ + numa_init(); + if (numa_available() == 0) { + printf("numa max node: %d\n", numa_max_node()); + } else { + printf("numa not available\n"); + } + + return (0); +}
diff --git a/test/README b/test/README new file mode 100644 index 0000000..c584c6f --- /dev/null +++ b/test/README
@@ -0,0 +1,21 @@ + +Various simple test scripts to verify some parts of the NUMA API. + +To do a full regression test, run "make test". + +You should have at least two nodes on a NUMA system for the test suite. + +The tests in regress assume that there is enough free memory on nodes 0/1. +They treat PREFERRED/INTERLEAVE not hitting the first-choice node as an +error. + +They also require a relatively idle machine to avoid too much +noise from memory allocations by other processes. Without +that, regress1 might fail. + +You can run the tests under valgrind with "VALGRIND=valgrind make test". +Older valgrind versions incorrectly report an uninitialized byte error +on set_mempolicy. That is a false positive. + +TBD: more detailed unit tests for mbind / shm / {get,set}_mempolicy. +Currently everything is tested using numactl only.
diff --git a/test/bind_range b/test/bind_range new file mode 100644 index 0000000..70c0df6 --- /dev/null +++ b/test/bind_range
@@ -0,0 +1,109 @@ +#!/bin/bash + +# This simple script checks the --all/-a option, which is used for +# suppressing the default cpuset awareness of the options --cpunodebind, +# --physcpubind, --interleave, --preferred and --membind. + +# NOTE: The test needs at least two nodes and two CPUs + +testdir=`dirname "$0"` +: ${srcdir:=${testdir}/..} +: ${builddir:=${srcdir}} +export PATH=${builddir}:$PATH + +export old_mask + +eval_test() { + # echo "Running $1.." + $1 + if [ $? == 1 ] ; then + echo -e "$1 FAILED!" + reset_mask + exit 1 + fi + echo -e "$1 PASSED" +} + +function check_arg_order +{ + numactl --all --physcpubind=$HIGHESTCPU ls > /dev/null 2>&1 + if [ $? == 1 ] ; then + return 1; + fi + numactl --physcpubind=$HIGHESTCPU --all ls > /dev/null 2>&1 + if [ $? == 0 ] ; then + return 1; + fi + + return 0 +} + +function check_physcpubind +{ + reset_mask + set_cpu_affinity 0 + numactl --physcpubind=$HIGHESTCPU ls > /dev/null 2>&1 + if [ $? == 0 ] ; then # shouldn't pass so easily + return 1; + fi + numactl --all --physcpubind=$HIGHESTCPU ls > /dev/null 2>&1 + if [ $? == 1 ] ; then # shouldn't fail + return 1; + fi + + return 0 +} + +function check_cpunodebind +{ + local low_cpu_range + local high_cpu + + reset_mask + low_cpu_range=$(cat /sys/devices/system/node/node$LOWESTNODE/cpulist) + set_cpu_affinity $low_cpu_range + numactl --cpunodebind=$HIGHESTNODE ls > /dev/null 2>&1 + if [ $? == 1 ] ; then # should pass + return 1; + fi + numactl --all --cpunodebind=$HIGHESTNODE ls > /dev/null 2>&1 + if [ $? 
== 1 ] ; then # should pass for sure + return 1; + fi + + return 0 +} + +function set_cpu_affinity +{ + taskset -p -c $1 $$ > /dev/null + #echo -e "\taffinity of shell was set to" $1 +} + +function get_mask +{ + old_mask=$(taskset -p $$ | cut -f2 -d: | sed -e 's/^[ \t]*//') +} + +function reset_mask +{ + taskset -p $old_mask $$ > /dev/null + #echo -e "\taffinity of shell was reset to" $old_mask +} + +HIGHESTCPU=$(grep 'processor' /proc/cpuinfo | tail -n1 | cut -f2 -d':') +HIGHESTCPU=$(echo $HIGHESTCPU | cut -f2 -d' ') +HIGHESTNODE=$(numactl -H | grep -e 'node [0-9]* cpus' | tail -n1 | cut -f2 -d' ') +LOWESTNODE=$(numactl -H | grep -e 'node [0-9]* cpus' | head -n1 | cut -f2 -d' ') + +get_mask + +eval_test check_arg_order +eval_test check_physcpubind +eval_test check_cpunodebind + +reset_mask + +exit 0 + +
diff --git a/test/checkaffinity b/test/checkaffinity new file mode 100755 index 0000000..8b96ed9 --- /dev/null +++ b/test/checkaffinity
@@ -0,0 +1,31 @@ +#!/bin/bash +# check if affinity works + +testdir=`dirname "$0"` +: ${srcdir:=${testdir}/..} +: ${builddir:=${srcdir}} +export PATH=${builddir}:$PATH + +S=`numactl --show | grep nodebind:` +NODES=`echo $S | sed -e "s/nodebind://"` + +S=`numactl --show | grep physcpubind:` +CPUS=`echo $S | sed -e "s/physcpubind://"` + +for i in $CPUS ; do + if [ "$(numactl --physcpubind=$i "${testdir}"/printcpu)" != "$i" ] ; then + echo "--physcpubind for $i doesn't work" + exit 1 + fi + if [ "$(numactl --physcpubind=$i numactl --show | awk '/^physcpubind/ { print $2 }' )" != "$i" ] ; then + echo "--show doesn't agree with physcpubind for cpu $i" + exit 1 + fi +done + +for i in $NODES ; do + if [ $(numactl --cpunodebind=$i numactl --show | awk '/nodebind/ { print $2 }' ) != $i ] ; then + echo "--show doesn't agree with cpunodebind for node $i" + exit 1 + fi +done
diff --git a/test/checktopology b/test/checktopology new file mode 100755 index 0000000..63403b5 --- /dev/null +++ b/test/checktopology
@@ -0,0 +1,42 @@ +#!/bin/bash +# check numactl --hardware output +# this checks most of the topology discovery in libnuma + +testdir=`dirname "$0"` +: ${srcdir:=${testdir}/..} +: ${builddir:=${srcdir}} +export PATH=${builddir}:$PATH + +numcpus=$(grep -c processor /proc/cpuinfo) +numnodes=$(ls -1d /sys/devices/system/node/node[0-9]* | wc -l) + +nccpus=$(numactl --hardware | grep cpus | sed 's/node.*cpus://' | wc -w ) +ncnodes=$(numactl --hardware | grep -c 'node.*size' ) + +if [ $numnodes != $ncnodes ] ; then + echo "numactl --hardware doesn't report all nodes" + exit 1 +fi + +if [ $numcpus != $nccpus -a \( $[$nccpus / $numnodes] != $numcpus \) ] ; then + echo "numactl --hardware cpus look bogus" + exit 1 +fi + +numactl --hardware | grep cpus | while read n ; do + node=${n/ cpus*/} + node=${node/ /} + cpus=${n/*: /} + k=0 + for i in $cpus ; do + if [ ! -h "/sys/devices/system/node/$node/cpu$i" ] ; then + echo "$node doesn't have cpu $i" + exit 1 + fi + k=$[$k+1] + done + if [ $k != $(echo $cpus | wc -w) ] ; then + echo "$node missing cpu" + exit 1 + fi +done
diff --git a/test/distance.c b/test/distance.c new file mode 100644 index 0000000..fca109f --- /dev/null +++ b/test/distance.c
@@ -0,0 +1,42 @@ +/* Test numa_distance */ +#include <numa.h> +#include <stdio.h> +#include <stdlib.h> + +int main(void) +{ + int numnodes, maxnode, a, b, got_nodes = 0; + int *node_to_use; + long size, free_node_sizes; + if (numa_available() < 0) { + printf("no numa support in kernel\n"); + exit(1); + } + numnodes = numa_num_configured_nodes(); + maxnode = numa_max_node(); + node_to_use = (int *)malloc(numnodes * sizeof(int)); + for (a = 0; a <= maxnode; a++) { + size = numa_node_size(a, &free_node_sizes); + if(size != -1) + node_to_use[got_nodes++] = a; + } + for (a = 0; a < got_nodes; a++){ + printf("%03d: ", node_to_use[a]); + if (numa_distance(node_to_use[a], node_to_use[a]) != 10) { + printf("%d: self distance is not 10 (%d)\n", + node_to_use[a], numa_distance(node_to_use[a],node_to_use[a])); + exit(1); + } + for (b = 0; b < got_nodes; b++) { + int d1 = numa_distance(node_to_use[a], node_to_use[b]); + int d2 = numa_distance(node_to_use[b], node_to_use[a]); + printf("%03d ", d1); + if (d1 != d2) { + printf("\n(%d,%d)->(%d,%d) wrong!\n",node_to_use[a],node_to_use[b],d1,d2); + exit(1); + } + } + printf("\n"); + } + return 0; +}
diff --git a/test/ftok.c b/test/ftok.c new file mode 100644 index 0000000..736a42e --- /dev/null +++ b/test/ftok.c
@@ -0,0 +1,8 @@ +#include <sys/ipc.h> +#include <stdio.h> +int main(int ac, char **av) +{ + while (*++av) + printf("0x%x\n", ftok(*av, 0)); + return 0; +}
diff --git a/test/getnodemask.c b/test/getnodemask.c new file mode 100644 index 0000000..7f39be9 --- /dev/null +++ b/test/getnodemask.c
@@ -0,0 +1,28 @@ +#include <sched.h> +#include <sys/types.h> +#include <unistd.h> +#include <stdlib.h> +#include <stdio.h> +#include <numa.h> + +int main(int argc, char *argv[]) +{ + nodemask_t nodemask; + int rc, i; + + rc = numa_available(); + printf("numa_available returns %d\n", rc); + if (rc < 0) exit(1); + + nodemask_zero(&nodemask); + + nodemask = numa_get_run_node_mask(); + for (i = 0; i < 4; i++) { + printf("numa_get_run_node_mask nodemask_isset returns=0x%lx\n", nodemask_isset(&nodemask, i)); + } + + rc = numa_run_on_node_mask(&nodemask); + printf("rc=%d from numa_run_on_node_mask\n", rc); + + return (0); +}
diff --git a/test/mbind_mig_pages.c b/test/mbind_mig_pages.c new file mode 100644 index 0000000..31c73e4 --- /dev/null +++ b/test/mbind_mig_pages.c
@@ -0,0 +1,129 @@ +/* + * Test program to test the moving of pages using mbind. + * + * (C) 2006 Silicon Graphics, Inc. + * Christoph Lameter <clameter@sgi.com> + */ +#include <stdio.h> +#include <stdlib.h> +#include <numa.h> +#include <numaif.h> +#include <unistd.h> +#include <asm/unistd.h> + +unsigned int pagesize; +unsigned int page_count = 32; + +char *page_base; +char *pages; + +void **addr; +int *status; +int *nodes; +int errors; +int nr_nodes; + +struct bitmask *old_nodes; +struct bitmask *new_nodes; + +int main(int argc, char **argv) +{ + int i, rc; + + pagesize = getpagesize(); + + nr_nodes = numa_max_node()+1; + + old_nodes = numa_bitmask_alloc(nr_nodes); + new_nodes = numa_bitmask_alloc(nr_nodes); + numa_bitmask_setbit(old_nodes, 0); + numa_bitmask_setbit(new_nodes, 1); + + if (nr_nodes < 2) { + printf("A minimum of 2 nodes is required for this test.\n"); + exit(1); + } + + setbuf(stdout, NULL); + printf("mbind migration test ......\n"); + if (argc > 1) + sscanf(argv[1], "%u", &page_count); + + page_base = malloc(pagesize * (page_count + 1)); /* one spare page for alignment */ + addr = malloc(sizeof(void *) * page_count); + status = malloc(sizeof(int) * page_count); + nodes = malloc(sizeof(int) * page_count); + if (!page_base || !addr || !status || !nodes) { + printf("Unable to allocate memory\n"); + exit(1); + } + + pages = (void *) ((((long)page_base) & ~((long)(pagesize - 1))) + pagesize); + + for (i = 0; i < page_count; i++) { + if (i != 2) + /* We leave page 2 unallocated */ + pages[ i * pagesize ] = (char) i; + addr[i] = pages + i * pagesize; + nodes[i] = 0; + status[i] = -123; + } + + /* Move pages to node zero */ + numa_move_pages(0, page_count, addr, nodes, status, 0); + + printf("\nPage status before page migration\n"); + printf("---------------------------------\n"); + rc = numa_move_pages(0, page_count, addr, NULL, status, 0); + if (rc < 0) { + perror("move_pages"); + exit(1); + } + + for (i = 0; i < page_count; i++) { + printf("Page %d vaddr=%p node=%d\n", i, pages + i * 
pagesize, status[i]); + if (i != 2 && status[i]) { + printf("Bad page state. Page %d status %d\n",i, status[i]); + exit(1); + } + } + + /* Move to node zero */ + printf("\nMoving pages via mbind to node 0 ...\n"); + rc = mbind(pages, page_count * pagesize, MPOL_BIND, old_nodes->maskp, + old_nodes->size + 1, MPOL_MF_MOVE | MPOL_MF_STRICT); + if (rc < 0) { + perror("mbind"); + errors++; + } + + printf("\nMoving pages via mbind from node 0 to 1 ...\n"); + rc = mbind(pages, page_count * pagesize, MPOL_BIND, new_nodes->maskp, + new_nodes->size + 1, MPOL_MF_MOVE | MPOL_MF_STRICT); + if (rc < 0) { + perror("mbind"); + errors++; + } + + numa_move_pages(0, page_count, addr, NULL, status, 0); + for (i = 0; i < page_count; i++) { + printf("Page %d vaddr=%lx node=%d\n", i, + (unsigned long)(pages + i * pagesize), status[i]); + if (i != 2) { + if (pages[ i* pagesize ] != (char) i) { + printf("*** Page content corrupted.\n"); + errors++; + } else if (status[i] != 1) { + printf("*** Page on wrong node.\n"); + errors++; + } + } + } + + if (!errors) + printf("Test successful.\n"); + else + printf("%d errors.\n", errors); + + return errors > 0 ? 1 : 0; +}
diff --git a/test/migrate_pages.c b/test/migrate_pages.c new file mode 100644 index 0000000..37c4e97 --- /dev/null +++ b/test/migrate_pages.c
@@ -0,0 +1,124 @@ +/* + * Test program to test the moving of a process's pages. + * + * (C) 2006 Silicon Graphics, Inc. + * Christoph Lameter <clameter@sgi.com> + */ +#include <stdio.h> +#include <stdlib.h> +#include <numa.h> +#include <unistd.h> +#include <errno.h> + +unsigned int pagesize; +unsigned int page_count = 32; + +char *page_base; +char *pages; + +void **addr; +int *status; +int *nodes; +int errors; +int nr_nodes; + +struct bitmask *old_nodes; +struct bitmask *new_nodes; + +int main(int argc, char **argv) +{ + int i, rc; + + pagesize = getpagesize(); + + nr_nodes = numa_max_node()+1; + + old_nodes = numa_bitmask_alloc(nr_nodes); + new_nodes = numa_bitmask_alloc(nr_nodes); + numa_bitmask_setbit(old_nodes, 1); + numa_bitmask_setbit(new_nodes, 0); + + if (nr_nodes < 2) { + printf("A minimum of 2 nodes is required for this test.\n"); + exit(1); + } + + setbuf(stdout, NULL); + printf("migrate_pages() test ......\n"); + if (argc > 1) + sscanf(argv[1], "%u", &page_count); + + page_base = malloc(pagesize * (page_count + 1)); /* one spare page for alignment */ + addr = malloc(sizeof(void *) * page_count); + status = malloc(sizeof(int) * page_count); + nodes = malloc(sizeof(int) * page_count); + if (!page_base || !addr || !status || !nodes) { + printf("Unable to allocate memory\n"); + exit(1); + } + + pages = (void *) ((((long)page_base) & ~((long)(pagesize - 1))) + pagesize); + + for (i = 0; i < page_count; i++) { + if (i != 2) + /* We leave page 2 unallocated */ + pages[ i * pagesize ] = (char) i; + addr[i] = pages + i * pagesize; + nodes[i] = 1; + status[i] = -123; + } + + /* Move to starting node */ + rc = numa_move_pages(0, page_count, addr, nodes, status, 0); + if (rc < 0 && errno != ENOENT) { + perror("move_pages"); + exit(1); + } + + /* Verify correct startup locations */ + printf("Page location at the beginning of the test\n"); + printf("------------------------------------------\n"); + + numa_move_pages(0, page_count, addr, NULL, status, 0); + for (i = 0; i < page_count; i++) { + 
printf("Page %d vaddr=%p node=%d\n", i, pages + i * pagesize, status[i]); + if (i != 2 && status[i] != 1) { + printf("Bad page state before migrate_pages. Page %d status %d\n",i, status[i]); + exit(1); + } + } + + /* Move to node zero */ + numa_move_pages(0, page_count, addr, nodes, status, 0); + + printf("\nMigrating the current processes pages ...\n"); + rc = numa_migrate_pages(0, old_nodes, new_nodes); + + if (rc < 0) { + perror("numa_migrate_pages failed"); + errors++; + } + + /* Get page state after migration */ + numa_move_pages(0, page_count, addr, NULL, status, 0); + for (i = 0; i < page_count; i++) { + printf("Page %d vaddr=%lx node=%d\n", i, + (unsigned long)(pages + i * pagesize), status[i]); + if (i != 2) { + if (pages[ i* pagesize ] != (char) i) { + printf("*** Page contents corrupted.\n"); + errors++; + } else if (status[i]) { + printf("*** Page on the wrong node\n"); + errors++; + } + } + } + + if (!errors) + printf("Test successful.\n"); + else + printf("%d errors.\n", errors); + + return errors > 0 ? 1 : 0; +}
diff --git a/test/move_pages.c b/test/move_pages.c new file mode 100644 index 0000000..87d9b3e --- /dev/null +++ b/test/move_pages.c
@@ -0,0 +1,101 @@ +/* + * Test program to test the moving of individual pages in a process. + * + * (C) 2006 Silicon Graphics, Inc. + * Christoph Lameter <clameter@sgi.com> + */ +#include <stdio.h> +#include <stdlib.h> +#include "numa.h" +#include <unistd.h> +#include <asm/unistd.h> + +unsigned int pagesize; +unsigned int page_count = 32; + +char *page_base; +char *pages; + +void **addr; +int *status; +int *nodes; +int errors; +int nr_nodes; + +int main(int argc, char **argv) +{ + int i, rc; + + pagesize = getpagesize(); + + nr_nodes = numa_max_node()+1; + + if (nr_nodes < 2) { + printf("A minimum of 2 nodes is required for this test.\n"); + exit(1); + } + + setbuf(stdout, NULL); + printf("move_pages() test ......\n"); + if (argc > 1) + sscanf(argv[1], "%d", &page_count); + + printf("pages=%d (%s)\n", page_count, argc > 1 ? argv[1] : "default"); + + page_base = malloc((pagesize + 1) * page_count); + addr = malloc(sizeof(char *) * page_count); + status = malloc(sizeof(int *) * page_count); + nodes = malloc(sizeof(int *) * page_count); + if (!page_base || !addr || !status || !nodes) { + printf("Unable to allocate memory\n"); + exit(1); + } + + pages = (void *) ((((long)page_base) & ~((long)(pagesize - 1))) + pagesize); + + for (i = 0; i < page_count; i++) { + if (i != 2) + /* We leave page 2 unallocated */ + pages[ i * pagesize ] = (char) i; + addr[i] = pages + i * pagesize; + nodes[i] = (i % nr_nodes); + status[i] = -123; + } + + printf("\nMoving pages to start node ...\n"); + rc = numa_move_pages(0, page_count, addr, NULL, status, 0); + if (rc < 0) + perror("move_pages"); + + for (i = 0; i < page_count; i++) + printf("Page %d vaddr=%p node=%d\n", i, pages + i * pagesize, status[i]); + + printf("\nMoving pages to target nodes ...\n"); + rc = numa_move_pages(0, page_count, addr, nodes, status, 0); + + if (rc < 0) { + perror("move_pages"); + errors++; + } + + for (i = 0; i < page_count; i++) { + if (i != 2) { + if (pages[ i* pagesize ] != (char) i) + errors++; + else if (nodes[i] != (i %
nr_nodes)) + errors++; + } + } + + for (i = 0; i < page_count; i++) { + printf("Page %d vaddr=%lx node=%d\n", i, + (unsigned long)(pages + i * pagesize), status[i]); + } + + if (!errors) + printf("Test successful.\n"); + else + printf("%d errors.\n", errors); + + return errors > 0 ? 1 : 0; +}
diff --git a/test/mynode.c b/test/mynode.c new file mode 100644 index 0000000..f728a70 --- /dev/null +++ b/test/mynode.c
@@ -0,0 +1,15 @@ +#include <numa.h> +#include <numaif.h> +#include <stdio.h> + +int main(void) +{ + int nd; + char *man = numa_alloc(1000); + *man = 1; + if (get_mempolicy(&nd, NULL, 0, man, MPOL_F_NODE|MPOL_F_ADDR) < 0) + perror("get_mempolicy"); + else + printf("my node %d\n", nd); + return 0; +}
diff --git a/test/node-parse.c b/test/node-parse.c new file mode 100644 index 0000000..85858d7 --- /dev/null +++ b/test/node-parse.c
@@ -0,0 +1,26 @@ +/* Test wrapper for the nodemask parser */ +#include <stdio.h> +#include "numa.h" +#include "util.h" + +/* For util.c. Fixme. */ +void usage(void) +{ + exit(1); +} + +int main(int ac, char **av) +{ + int err = 0; + while (*++av) { + struct bitmask *mask = numa_parse_nodestring(*av); + if (!mask) { + printf("Failed to convert `%s'\n", *av); + err |= 1; + continue; + } + printmask("result", mask); + numa_bitmask_free(mask); + } + return err; +}
diff --git a/test/nodemap.c b/test/nodemap.c new file mode 100644 index 0000000..cd2cffa --- /dev/null +++ b/test/nodemap.c
@@ -0,0 +1,30 @@ +#include "numa.h" +#include <stdio.h> +#include <stdlib.h> + +int main(void) +{ + int i, k, w, ncpus; + struct bitmask *cpus; + int maxnode = numa_num_configured_nodes()-1; + + if (numa_available() < 0) { + printf("no numa\n"); + exit(1); + } + cpus = numa_allocate_cpumask(); + ncpus = cpus->size; + + for (i = 0; i <= maxnode ; i++) { + if (numa_node_to_cpus(i, cpus) < 0) { + printf("node %d failed to convert\n",i); + } + printf("%d: ", i); + w = 0; + for (k = 0; k < ncpus; k++) + if (numa_bitmask_isbitset(cpus, k)) + printf(" %s%d", w++>0?",":"", k); + putchar('\n'); + } + return 0; +}
diff --git a/test/numademo b/test/numademo new file mode 100755 index 0000000..5cdd3a0 --- /dev/null +++ b/test/numademo
@@ -0,0 +1,8 @@ +#!/bin/sh + +testdir=`dirname "$0"` +: ${srcdir:=${testdir}/..} +: ${builddir:=${srcdir}} +export PATH=${builddir}:$PATH + +exec "${builddir}"/numademo -t -e 10M
diff --git a/test/pagesize.c b/test/pagesize.c new file mode 100644 index 0000000..9f5dec7 --- /dev/null +++ b/test/pagesize.c
@@ -0,0 +1,8 @@ +#include <unistd.h> +#include <stdio.h> + +int main(void) +{ + printf("%d\n", getpagesize()); + return 0; +}
diff --git a/test/prefered.c b/test/prefered.c new file mode 100644 index 0000000..d273bec --- /dev/null +++ b/test/prefered.c
@@ -0,0 +1,63 @@ +/* Test prefer policy */ +#include "numa.h" +#include "numaif.h" +#include <sys/mman.h> +#include <stdio.h> +#include <assert.h> +#include <unistd.h> +#include <stdlib.h> +#include <errno.h> + +#define err(x) perror(x),exit(1) + +int main(void) +{ + int max = numa_max_node(); + int maxmask = numa_num_possible_nodes(); + struct bitmask *nodes, *mask; + int pagesize = getpagesize(); + int i; + int pol; + int node; + int err = 0; + nodes = numa_bitmask_alloc(maxmask); + mask = numa_bitmask_alloc(maxmask); + + for (i = max; i >= 0; --i) { + char *mem = mmap(NULL, pagesize*(max+1), PROT_READ|PROT_WRITE, + MAP_PRIVATE|MAP_ANONYMOUS, 0, 0); + char *adr = mem; + + if (mem == (char *)-1) + err("mmap"); + + printf("%d offset %lx\n", i, (long)(adr - mem)); + + numa_bitmask_clearall(nodes); + numa_bitmask_clearall(mask); + numa_bitmask_setbit(nodes, i); + + if (mbind(adr, pagesize, MPOL_PREFERRED, nodes->maskp, + nodes->size, 0) < 0) + err("mbind"); + + ++*adr; + + if (get_mempolicy(&pol, mask->maskp, mask->size, adr, MPOL_F_ADDR) < 0) + err("get_mempolicy"); + + assert(pol == MPOL_PREFERRED); + assert(numa_bitmask_isbitset(mask, i)); + + node = 0x123; + + if (get_mempolicy(&node, NULL, 0, adr, MPOL_F_ADDR|MPOL_F_NODE) < 0) + err("get_mempolicy2"); + + printf("got node %d expected %d\n", node, i); + + if (node != i) + err = 1; + } + return err; +}
diff --git a/test/printcpu b/test/printcpu new file mode 100755 index 0000000..661fa5a --- /dev/null +++ b/test/printcpu
@@ -0,0 +1,5 @@ +#!/bin/bash +#print cpu it is running on +declare -a arr +arr=( $(< /proc/self/stat) ) +echo ${arr[38]}
diff --git a/test/randmap.c b/test/randmap.c new file mode 100644 index 0000000..df8a666 --- /dev/null +++ b/test/randmap.c
@@ -0,0 +1,175 @@ +/* Randomly change policy */ +#include <stdio.h> +#include "numa.h" +#include "numaif.h" +#include <sys/mman.h> +#include <sys/shm.h> +#include <sys/ipc.h> +#include <stdlib.h> +#include <time.h> +#include <unistd.h> +#include <string.h> +#include <errno.h> + +#define SIZE (100*1024*1024) +#define PAGES (SIZE/pagesize) + +#define perror(x) printf("%s: %s\n", x, strerror(errno)) +#define err(x) perror(x),exit(1) + +struct page { + unsigned long mask; + int policy; +}; + +struct page *pages; +char *map; +int pagesize; + +void setpol(unsigned long offset, unsigned long length, int policy, unsigned long nodes) +{ + long i, end; + + printf("off:%lx length:%lx policy:%d nodes:%lx\n", + offset, length, policy, nodes); + + if (mbind(map + offset*pagesize, length*pagesize, policy, + &nodes, 8, 0) < 0) { + printf("mbind: %s offset %lx length %lx policy %d nodes %lx\n", + strerror(errno), + offset*pagesize, length*pagesize, + policy, nodes); + return; + } + + for (i = offset; i < offset+length; i++) { + pages[i].mask = nodes; + pages[i].policy = policy; + } + + i = offset - 20; + if (i < 0) + i = 0; + end = offset+length+20; + if (end > PAGES) + end = PAGES; + for (; i < end; i++) { + int pol2; + unsigned long nodes2; + if (get_mempolicy(&pol2, &nodes2, sizeof(long)*8, map+i*pagesize, + MPOL_F_ADDR) < 0) + err("get_mempolicy"); + if (pol2 != pages[i].policy) { + printf("%lx: got policy %d expected %d, nodes got %lx expected %lx\n", + i, pol2, pages[i].policy, nodes2, pages[i].mask); + } + if (policy != MPOL_DEFAULT && nodes2 != pages[i].mask) { + printf("%lx: nodes %lx, expected %lx, policy %d\n", + i, nodes2, pages[i].mask, policy); + } + } +} + +static unsigned char pop4[16] = { + 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4 +}; + +int popcnt(unsigned long val) +{ + int count = 0; + while (val) { + count += pop4[val & 0xf]; + val >>= 4; + } + return count; +} + +void testmap(void) +{ + pages = calloc(1, PAGES * sizeof(struct page)); + if (!pages) + 
exit(100); + + printf("simple tests\n"); +#define MB ((1024*1024)/pagesize) + setpol(0, PAGES, MPOL_INTERLEAVE, 3); + setpol(0, MB, MPOL_BIND, 1); + setpol(MB, MB, MPOL_BIND, 1); + setpol(MB, MB, MPOL_DEFAULT, 0); + setpol(MB, MB, MPOL_PREFERRED, 2); + setpol(MB/2, MB, MPOL_DEFAULT, 0); + setpol(MB+MB/2, MB, MPOL_BIND, 2); + setpol(MB/2+100, 100, MPOL_PREFERRED, 1); + setpol(100, 200, MPOL_PREFERRED, 1); + printf("done\n"); + + for (;;) { + unsigned long offset = random() % PAGES; + int policy = random() % (MPOL_MAX+1); + unsigned long nodes = random() % 4; + long length = random() % (PAGES - offset); + + /* validate */ + switch (policy) { + case MPOL_DEFAULT: + nodes = 0; + break; + case MPOL_INTERLEAVE: + case MPOL_BIND: + if (nodes == 0) + continue; + break; + case MPOL_PREFERRED: + if (popcnt(nodes) != 1) + continue; + break; + } + + setpol(offset, length, policy, nodes); + + } +} + +int main(int ac, char **av) +{ + unsigned long seed; + + pagesize = getpagesize(); + +#if 0 + map = mmap(NULL, SIZE, PROT_READ, MAP_ANONYMOUS|MAP_PRIVATE, 0, 0); + if (map == (char*)-1) + err("mmap"); +#else + int shmid = shmget(IPC_PRIVATE, SIZE, IPC_CREAT|0666); + if (shmid < 0) err("shmget"); + map = shmat(shmid, NULL, SHM_RDONLY); + shmctl(shmid, IPC_RMID, NULL); + if (map == (char *)-1) err("shmat"); + printf("map %p\n", map); +#endif + + if (av[1]) { + char *end; + unsigned long timeout = strtoul(av[1], &end, 0); + switch (*end) { + case 'h': timeout *= 3600; break; + case 'm': timeout *= 60; break; + } + printf("running for %lu seconds\n", timeout); + alarm(timeout); + } else + printf("running forever\n"); + + if (av[1] && av[2]) + seed = strtoul(av[2], 0, 0); + else + seed = time(0); + + printf("random seed %lu\n", seed); + srandom(seed); + + testmap(); + /* test shm etc. */ + return 0; +}
diff --git a/test/realloc_test.c b/test/realloc_test.c new file mode 100644 index 0000000..db80820 --- /dev/null +++ b/test/realloc_test.c
@@ -0,0 +1,108 @@ +#include <assert.h> +#include <errno.h> +#include <limits.h> +#include <unistd.h> +#include <stdlib.h> +#include <stdio.h> +#include <sys/mman.h> +#include "numa.h" +#include "numaif.h" + +#define DEFAULT_NR_PAGES 1024 + +static int parse_int(const char *str) +{ + char *endptr; + long ret = strtol(str, &endptr, 0); + if (*endptr != '\0') { + fprintf(stderr, "[error] strtol() failed: parse error: %s\n", endptr); + exit(1); + } + + if (errno == ERANGE) + fprintf(stderr, "[warning] strtol() out of range\n"); + + if (ret > INT_MAX || ret < INT_MIN) { + fprintf(stderr, "[warning] parse_int() out of range\n"); + ret = (ret > 0) ? INT_MAX : INT_MIN; + } + + return (int) ret; +} + +int main(int argc, char **argv) +{ + char *mem; + int page_size = numa_pagesize(); + int node = 0; + int nr_pages = DEFAULT_NR_PAGES; + + if (numa_available() < 0) { + fprintf(stderr, "numa is not available"); + exit(1); + } + + if (argc > 1) + node = parse_int(argv[1]); + if (argc > 2) + nr_pages = parse_int(argv[2]); + + mem = numa_alloc_onnode(page_size, node); + + /* Store the policy of the newly allocated area */ + unsigned long nodemask; + int mode; + int nr_nodes = numa_num_possible_nodes(); + if (get_mempolicy(&mode, &nodemask, nr_nodes, mem, + MPOL_F_NODE | MPOL_F_ADDR) < 0) { + perror("get_mempolicy() failed"); + exit(1); + } + + /* Print some info */ + printf("Page size: %d\n", page_size); + printf("Pages realloc'ed: %d\n", nr_pages); + printf("Allocate data in node: %d\n", node); + + int i; + int nr_inplace = 0; + int nr_moved = 0; + for (i = 0; i < nr_pages; i++) { + /* Enlarge mem with one more page */ + char *new_mem = numa_realloc(mem, (i+1)*page_size, (i+2)*page_size); + if (!new_mem) { + perror("numa_realloc() failed"); + exit(1); + } + + if (new_mem == mem) + ++nr_inplace; + else + ++nr_moved; + mem = new_mem; + + /* Check the policy of the realloc'ed area */ + unsigned long realloc_nodemask; + int realloc_mode; + if (get_mempolicy(&realloc_mode, 
&realloc_nodemask, + nr_nodes, mem, MPOL_F_NODE | MPOL_F_ADDR) < 0) { + perror("get_mempolicy() failed"); + exit(1); + } + + assert(realloc_nodemask == nodemask && + realloc_mode == mode && "policy changed"); + } + + /* Shrink to the original size */ + mem = numa_realloc(mem, (nr_pages + 1)*page_size, page_size); + if (!mem) { + perror("numa_realloc() failed"); + exit(1); + } + + numa_free(mem, page_size); + printf("In-place reallocs: %d\n", nr_inplace); + printf("Moved reallocs: %d\n", nr_moved); + return 0; +}
diff --git a/test/regress b/test/regress new file mode 100755 index 0000000..c0cf6d7 --- /dev/null +++ b/test/regress
@@ -0,0 +1,220 @@ +#!/bin/bash +# simple regression test for numactl/numaapi +# must be run from 'test' directory of numactl source package, +# after build [just use 'make test'] +# note the statistics checks may fail when the system is under +# memory pressure +# Copyright 2003,2004 Andi Kleen, SuSE Labs. + +testdir=`dirname "$0"` +: ${srcdir:=${testdir}/..} +: ${builddir:=${srcdir}} +export PATH=${builddir}:$PATH + +: ${NUMACTL:=${builddir}/numactl} +VALGRIND=${VALGRIND:-} + +MB=$[1024*1024] +SIZE=$[15 * $MB] +DEMOSIZE=$[10 * $MB] +STAT_INTERVAL=5 + +PAGESIZE=$("${builddir}/test/pagesize") +PAGES=$[ $SIZE / $PAGESIZE ] +HALFPAGES=$[ $PAGES / 2 ] +HALFPAGES=$[ $HALFPAGES - 100 ] +DOUBLEPAGES=$[ $PAGES * 2 ] +DOUBLEPAGES=$[ $DOUBLEPAGES - 200 ] +NEEDPAGES=$[ $DOUBLEPAGES + $DOUBLEPAGES / 5 ] # 20% spare + +EXIT=0 + +declare -i maxnode +declare -a node +declare -a nlist + +# ===================================================================== +numactl() { + $VALGRIND $NUMACTL "$@" +} + +failed() { + echo '=======FAILED' + echo "Check if machine doesn't have background jobs and try again" + EXIT=1 +} + +# nstat statname node +nstat() { + sleep $STAT_INTERVAL + declare -a fields + numastat | grep $1 | while read -a fields ; do + echo ${fields[$[1 + $2]]} + done +} + +probe_hardware() +{ + declare -i n=0 + + numnodes=$(numactl --hardware | awk '/^available/ { print $2 }') + maxnode=$(expr $numnodes - 1) + nlist=( $(numactl --hardware | grep "^node" | tail -1 |awk '{$1=""; print }') ) + + # find nodes with at least NEEDPAGES of free memory + for i in $(seq 0 $maxnode) ; do + free=$(numactl --hardware | fgrep " ${nlist[$i]} free" | awk '{print $4}') + free=$(( free * MB )) + if [[ $((free / PAGESIZE)) -ge $NEEDPAGES ]]; then + node[$n]=${nlist[$i]} + n=$((n + 1 )) + fi + done + numnodes=$n + maxnode=$(expr $numnodes - 1) + + if [ $numnodes -lt 2 ] ; then + echo "need at least two nodes with at least $NEEDPAGES each of" + echo "free memory for mempolicy regression tests" 
+ exit 77 # Skip test + fi +} + +# ========================================================================= +_test_process_state() { + echo '=>testing numactl' "$@" "memhog -H $SIZE" + numactl "$@" memhog -H $SIZE || failed +} + +test_process_state() +{ + declare -i n0=${node[0]} n1=${node[1]} + + _test_process_state --interleave=$n1 + + a0=`nstat interleave_hit 0` + a1=`nstat interleave_hit 1` + _test_process_state --interleave=$n0,$n1 + b0=`nstat interleave_hit 0` + b1=`nstat interleave_hit 1` + if [ $(expr $b1 - $a1) -lt $HALFPAGES ]; then + echo "interleaving test failed $n1 $b1 $a1" + failed + fi + if [ $(expr $b0 - $a0) -lt $HALFPAGES ]; then + echo "interleaving test failed $n0 $b0 $a0" + failed + fi + + _test_process_state --interleave=all + _test_process_state --membind=all + + a=$(expr $(nstat numa_hit 0) + $(nstat numa_hit 1)) + _test_process_state --membind=$n0,$n1 + b=$(expr $(nstat numa_hit 0) + $(nstat numa_hit 1)) + if [ $(expr $b - $a) -lt $PAGES ]; then + echo "membind test failed $n1 $b $a ($PAGES)" + failed + fi + + for i in $(seq 0 $maxnode) ; do + declare -i ni=${node[$i]} + a=`nstat numa_hit $i` + _test_process_state --membind=$ni + _test_process_state --preferred=$ni + b=`nstat numa_hit $i` + if [ $(expr $b - $a) -lt $DOUBLEPAGES ]; then + echo "membind/preferred on node $ni failed $b $a" + failed + fi + done + _test_process_state --localalloc +} + +# ========================================================================= +# test mbind + +_test_mbind() { + echo '=>testing memhog -H' "$@" + memhog -H $SIZE "$@" || failed +} + +test_mbind() +{ + declare -i n0=${node[0]} n1=${node[1]} + + a0=`nstat interleave_hit 0` + a1=`nstat interleave_hit 1` + _test_mbind interleave $n0,$n1 + b0=`nstat interleave_hit 0` + b1=`nstat interleave_hit 1` + if [ $(expr $b1 - $a1) -lt $HALFPAGES ]; then + echo "interleaving test 2 failed $n1 $b1 $a1 expected $HALFPAGES" + failed + fi + if [ $(expr $b0 - $a0) -lt $HALFPAGES ]; then + echo "interleaving test 2 
failed $n0 $b0 $a0" + failed + fi + + _test_mbind interleave all + + a=$(expr $(nstat numa_hit 0) + $(nstat numa_hit 1)) + _test_mbind membind $n0,$n1 + b=$(expr $(nstat numa_hit 0) + $(nstat numa_hit 1)) + if [ $(expr $b - $a) -lt $PAGES ]; then + echo "membind test 2 failed $b $a ($PAGES)" + failed + fi + + for i in $(seq 0 $maxnode) ; do + declare -i ni=${node[$i]} + a=`nstat numa_hit $i` + _test_mbind membind $ni + _test_mbind preferred $ni + b=`nstat numa_hit $i` + if [ $(expr $b - $a) -lt $DOUBLEPAGES ]; then + echo "membind/preferred test 2 on node $ni failed $b $a" + failed + fi + done +} + +# ========================================================================= +main() +{ + + # Get the interval vm statistics refresh at + if [ -e /proc/sys/vm/stat_interval ]; then + STAT_INTERVAL=`cat /proc/sys/vm/stat_interval` + STAT_INTERVAL=`expr $STAT_INTERVAL \* 2` + fi + + probe_hardware + + numactl --cpubind=${node[0]} /bin/true + numactl --cpubind=${node[1]} /bin/true + + numactl -s + numactl --hardware + + numastat > A + + test_process_state + + test_mbind + + numastat > B + diff -u A B + rm A B + + if [ "$EXIT" = 0 ] ; then + echo '========SUCCESS' + else + echo '========FAILURE' + exit 1 + fi +} + +# ========================================================================= +main
diff --git a/test/regress-io b/test/regress-io new file mode 100644 index 0000000..3e6e789 --- /dev/null +++ b/test/regress-io
@@ -0,0 +1,47 @@ +#!/bin/bash +# test IO affinity parsing +# tests may fail depending on machine setup + +testdir=`dirname "$0"` +: ${srcdir:=${testdir}/..} +: ${builddir:=${srcdir}} +export PATH=${builddir}:$PATH + +E=0 + +check() { + echo testing $@ + if "$@" ; then + true + else + echo failed + E=1 + fi + +} + +fail() { + echo testing failure of $@ + if "$@" ; then + echo failed + E=1 + else + true + fi +} + +check "${builddir}/test/node-parse" file:. +check "${builddir}/test/node-parse" ip:8.8.8.8 +fail "${builddir}/test/node-parse" ip:127.0.0.1 + +IF=$(ip link ls | grep eth | cut -d: -f2 | head -1) +check "${builddir}/test/node-parse" "netdev:$IF" +fail "${builddir}/test/node-parse" netdev:lo +DEV=$(df | awk '/\/$/ { print $1 }') +check "${builddir}/test/node-parse" file:$DEV +check "${builddir}/test/node-parse" block:$(basename $DEV) +check "${builddir}/test/node-parse" pci:0:0.0 + +if [ "$E" = 0 ] ; then echo SUCCESS ; else echo FAILURE ; fi + +exit $E
diff --git a/test/regress2 b/test/regress2 new file mode 100755 index 0000000..aa6ea41 --- /dev/null +++ b/test/regress2
@@ -0,0 +1,28 @@ +#!/bin/sh +# More regression tests for libnuma/numa api + +VALGRIND=${VALGRIND:-} + +testdir=`dirname "$0"` +: ${srcdir:=${testdir}/..} +: ${builddir:=${srcdir}} +export PATH=${builddir}:$PATH + +T() { + echo "$@" + if ! $VALGRIND "$@" ; then + echo $1 FAILED!!!! + exit 1 + fi + echo +} + +# still broken +#T "${builddir}/test/prefered" +T "${builddir}/test/distance" +T "${builddir}/test/nodemap" +T "${srcdir}/test/checkaffinity" +T "${srcdir}/test/checktopology" +T "${builddir}/test/tbitmap" +T "${srcdir}/test/bind_range" +#T "${builddir}/test/randmap"
diff --git a/test/runltp b/test/runltp new file mode 100755 index 0000000..4e7d979 --- /dev/null +++ b/test/runltp
@@ -0,0 +1,21 @@ +#!/bin/sh +# run the Linux Test Project with various numactl settings. will run for a few hours. +# must run as root +# You can download LTP from http://ltp.sourceforge.net +# Change LTP below to the source directory of a compiled LTP distribution + +LTP=/src/ltp +LEN=2h + +LTPOPT="-q -p -t $LEN" +export PATH=`pwd`/..:$PATH + +cd $LTP +for i in 1 2 3 ; do + numactl --interleave=all ./runltp $LTPOPT -l n.interleave.all.$i + numactl --interleave=0,1 ./runltp $LTPOPT -l n.interleave.01.$i + numactl --preferred=0 --cpubind=1 ./runltp $LTPOPT -l n.preferred.$i +# the VM test that allocates all memory may fail + numactl --membind=1 --cpubind=0 ./runltp $LTPOPT -l n.membind1.$i + numactl --membind=0,1 ./runltp $LTPOPT -l n.membind01.$i +done
diff --git a/test/shmtest b/test/shmtest new file mode 100755 index 0000000..18c66ee --- /dev/null +++ b/test/shmtest
@@ -0,0 +1,104 @@ +#!/bin/sh +# basic shared memory policy test + + +# hugetlbfs and tmpfs must be mounted on these mount points +TMPFS=/dev/shm +HUGE=/huge + +#valgrind 3.0.1 doesn't implement mbind() yet on x86-64 +#VALGRIND="valgrind --tool=memcheck" +VALGRIND= + +set -e + +export PATH=`pwd`/..:$PATH + +numactl() { + $VALGRIND ../numactl "$@" +} + +failure() { + numastat > after + set +e + diff -u before after + echo + echo TEST FAILED + exit 1 +} + +success() { + echo test succeeded +} + +checkpoint() { + numastat > before +} + +trap failure EXIT + +basictest() { +echo initial +checkpoint +numactl --length=20m $1 --dump +echo interleave +checkpoint +numactl --offset=2m --length=2m $1 --strict --interleave=0,1 --verify --dump +echo interleave verify +checkpoint +numactl $1 --dump +echo membind setup +checkpoint +numactl --offset 4m --length=2m $1 --strict --membind=1 --verify --dump +echo membind verify +checkpoint +numactl $1 --dump +echo preferred setup +checkpoint +numactl --offset 6m --length 2m $1 --strict --preferred=1 --verify --dump +echo preferred verify +checkpoint +numactl $1 --dump + +# check overlaps here +} + +cleanupshm() { + if [ -f $1 ] ; then + ipcrm -M `./ftok $1` || true + rm $1 + fi +} + + +banner() { +echo +echo ++++++++++++ $1 +++++++++++++++ +echo +} + +banner shm +cleanupshm A +basictest --shm=A +cleanupshm A + +banner hugeshm +cleanupshm B +basictest "--huge --shm=B" +cleanupshm B + +banner tmpfs +basictest "--file $TMPFS/B" +rm $TMPFS/B + +# first need a way to create holey hugetlbfs files. + +#banner hugetlbfs +#basictest "--file $HUGE/B" +#rm /hugetlbfs/B + +rm before + +trap success EXIT + +
diff --git a/test/tbitmap.c b/test/tbitmap.c new file mode 100644 index 0000000..4ab48c0 --- /dev/null +++ b/test/tbitmap.c
@@ -0,0 +1,103 @@ +/* Unit test bitmap parser */ +#define _GNU_SOURCE 1 +//#include <asm/bitops.h> +#include <stdio.h> +#include <string.h> +#include <assert.h> +#include <stdlib.h> +#include <ctype.h> +#include "numa.h" +#include "util.h" + +/* For util.c. Fixme. */ +void usage(void) +{ + exit(1); +} + +#define ALIGN(x,a) (((x)+(a)-1)&~((a)-1)) + +#define test_bit(i,p) ((p)[(i) / BITS_PER_LONG] & (1UL << ((i)%BITS_PER_LONG))) +#define set_bit(i,p) ((p)[(i) / BITS_PER_LONG] |= (1UL << ((i)%BITS_PER_LONG))) +#define clear_bit(i,p) ((p)[(i) / BITS_PER_LONG] &= ~(1UL << ((i)%BITS_PER_LONG))) + +typedef unsigned u32; +#define BITS_PER_LONG (sizeof(long)*8) + +#define round_up(x,y) (((x) + (y) - 1) & ~((y)-1)) + +#define CPU_BYTES(x) (round_up(x, BITS_PER_LONG)/8) +#define CPU_LONGS(x) (CPU_BYTES(x) / sizeof(long)) + +/* Following routine extracted from Linux 2.6.16 */ + +#define CHUNKSZ 32 +#define nbits_to_hold_value(val) fls(val) +#define unhex(c) (isdigit(c) ? (c - '0') : (toupper(c) - 'A' + 10)) +#define BASEDEC 10 /* fancier cpuset lists input in decimal */ + +/** + * bitmap_scnprintf - convert bitmap to an ASCII hex string. + * @buf: byte buffer into which string is placed + * @buflen: reserved size of @buf, in bytes + * @mask: pointer to struct bitmask to convert + * + * Hex digits are grouped into comma-separated sets of eight digits per set. 
+ */ +int bitmap_scnprintf(char *buf, unsigned int buflen, struct bitmask *mask) +{ + int i, word, bit, len = 0; + unsigned long val; + const char *sep = ""; + int chunksz; + u32 chunkmask; + + chunksz = mask->size & (CHUNKSZ - 1); + if (chunksz == 0) + chunksz = CHUNKSZ; + + i = ALIGN(mask->size, CHUNKSZ) - CHUNKSZ; + for (; i >= 0; i -= CHUNKSZ) { + chunkmask = ((1ULL << chunksz) - 1); + word = i / BITS_PER_LONG; + bit = i % BITS_PER_LONG; + val = (mask->maskp[word] >> bit) & chunkmask; + len += snprintf(buf+len, buflen-len, "%s%0*lx", sep, + (chunksz+3)/4, val); + chunksz = CHUNKSZ; + sep = ","; + } + return len; +} + +extern int numa_parse_bitmap(char *buf, struct bitmask *mask); +#define MASKSIZE 300 + +int main(void) +{ + char buf[1024]; + struct bitmask *mask, *mask2; + int i; + + mask = numa_bitmask_alloc(MASKSIZE); + mask2 = numa_bitmask_alloc(MASKSIZE); + + printf("Testing bitmap functions\n"); + for (i = 0; i < MASKSIZE; i++) { + numa_bitmask_clearall(mask); + numa_bitmask_clearall(mask2); + numa_bitmask_setbit(mask, i); + assert(find_first(mask) == i); + bitmap_scnprintf(buf, sizeof(buf), mask); + strcat(buf,"\n"); + if (numa_parse_bitmap(buf, mask2) < 0) + assert(0); + if (memcmp(mask->maskp, mask2->maskp, numa_bitmask_nbytes(mask))) { + bitmap_scnprintf(buf, sizeof(buf), mask2); + printf("mask2 differs: %s\n", buf); + assert(0); + } + } + printf("Passed\n"); + return 0; +}
diff --git a/test/tshared.c b/test/tshared.c new file mode 100644 index 0000000..7ff80e8 --- /dev/null +++ b/test/tshared.c
@@ -0,0 +1,50 @@ +#include <numa.h> +#include <numaif.h> +#include <stdio.h> +#include <stdlib.h> +#include <sys/wait.h> +#include <unistd.h> + +#define err(x) perror(x),exit(1) + +enum SZ { + MEMSZ = 100<<20, + NTHR = 10, +}; + +/* test if shared interleaving state works. */ +int main(void) +{ + int i, k; + char *mem; + int pagesz = getpagesize(); + int max_node; + + if (numa_available() < 0) { + printf("no NUMA API available\n"); + exit(1); + } + max_node = numa_max_node(); + mem = numa_alloc_interleaved(MEMSZ); + for (i = 0; i < NTHR; i++) { + if (fork() == 0) { + for (k = i*pagesz; k < MEMSZ; k += pagesz * NTHR) { + mem[k] = 1; + } + _exit(0); + } + } + for (i = 0; i < NTHR; i++) + wait(NULL); + k = 0; + for (i = 0; i < MEMSZ; i += pagesz) { + int nd; + if (get_mempolicy(&nd, NULL, 0, mem + i, MPOL_F_NODE|MPOL_F_ADDR) < 0) + err("get_mempolicy"); + if (nd != k) + printf("offset %d node %d expected %d\n", i, nd, k); + k = (k+1)%(max_node+1); + } + + return 0; +}
diff --git a/test/tshm.c b/test/tshm.c new file mode 100644 index 0000000..79ca02c --- /dev/null +++ b/test/tshm.c
@@ -0,0 +1,117 @@ +#include <sys/shm.h> +#include <sys/ipc.h> +#include <sys/fcntl.h> +#include <stdio.h> +#include <numaif.h> + +#define err(x) perror(x),exit(1) + +enum { + MEMSZ = 10*1024*1024, +}; + +struct req { + enum cmd { + SET = 1, + CHECK, + REPLY, + EXIT, + } cmd; + long offset; + long len; + int policy; + nodemask_t nodes; +}; + +void worker(void) +{ + struct req req; + while (read(0, &req, sizeof(struct req)) > 0) { + switch (req.cmd) { + case SET: + if (mbind(map + req.offset, req.len, req.policy, &req.nodes, + NUMA_MAX_NODES+1, 0) < 0) + err("mbind"); + break; + case CHECK: + req.cmd = REPLY; + if (get_mempolicy(&req.policy, &req.nodes, NUMA_MAX_NODES+1, + map + req.offset, MPOL_F_ADDR) < 0) + err("get_mempolicy"); + write(1, &req, sizeof(struct req)); + break; + case EXIT: + return; + default: + abort(); + } + } +} + +void sendreq(int fd, enum cmd cmd, int policy, long offset, long len, nodemask_t nodes) +{ + struct req req = { + .cmd = cmd, + .offset = offset, + .len = len, + .policy = policy, + .nodes = nodes + }; + if (write(fd, &req, sizeof(struct req)) != sizeof(struct req)) + err("bad req write"); +} + +void readreq(int fd, int *policy, nodemask_t *nodes, long offset, long len) +{ + struct req req; + if (read(fd, &req, sizeof(struct req)) != sizeof(struct req)) + err("bad req read"); + if (req.cmd != REPLY) + abort(); + *policy = req.policy; + *nodes = req.nodes; +} + +int main(void) +{ + int tfd = open("tshm", O_CREAT, 0600); + close(tfd); + key_t key = ftok("tshm", 1); + int shm = shmget(key, MEMSZ, IPC_CREAT|0600); + if (shm < 0) err("shmget"); + char *map = shmat(shm, NULL, 0); + printf("map = %p\n", map); + + unsigned long nmask = 0x3; + if (mbind(map, MEMSZ, MPOL_INTERLEAVE, &nmask, 4, 0) < 0) err("mbind1"); + + int fd[2]; + if (pipe(fd) < 0) err("pipe"); + if (fork() == 0) { + close(0); + close(1); + dup2(fd[0], 0); + dup2(fd[1], 1); + worker(); + _exit(0); + } + + int pagesz = getpagesize(); + int i; + + srand(1); + for (;;) { + + /*
choose a random offset */ + + /* either in child or here */ + + /* change policy */ + + /* ask other guy to check */ + + } + + shmdt(map); + shmctl(shm, IPC_RMID, 0); +}
diff --git a/threadtest.c b/threadtest.c new file mode 100644 index 0000000..33fa7bd --- /dev/null +++ b/threadtest.c
@@ -0,0 +1,3 @@ +/* Test if the compiler supports thread-local storage (__thread) */ +int __thread x; +int main(void) { return x; }
diff --git a/util.c b/util.c new file mode 100644 index 0000000..96fa1aa --- /dev/null +++ b/util.c
@@ -0,0 +1,131 @@ +/* Copyright (C) 2003,2004 Andi Kleen, SuSE Labs. + + numactl is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public + License as published by the Free Software Foundation; version + 2. + + numactl is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should find a copy of v2 of the GNU General Public License somewhere + on your Linux system; if not, write to the Free Software Foundation, + Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA */ +#include "numa.h" +#include "numaif.h" +#include "util.h" +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <stdarg.h> +#include <ctype.h> +#include <errno.h> +#include <unistd.h> + +void printmask(char *name, struct bitmask *mask) +{ + int i; + printf("%s: ", name); + for (i = 0; i < mask->size; i++) + if (numa_bitmask_isbitset(mask, i)) + printf("%d ", i); + putchar('\n'); +} + +int find_first(struct bitmask *mask) +{ + int i; + for (i = 0; i < mask->size; i++) + if (numa_bitmask_isbitset(mask, i)) + return i; + return -1; +} + +void complain(const char *fmt, ...) +{ + va_list ap; + va_start(ap, fmt); + fprintf(stderr, "numactl: "); + vfprintf(stderr,fmt,ap); + fputc('\n', stderr); + va_end(ap); + exit(1); +} + +void nerror(const char *fmt, ...)
+{ + int err = errno; + va_list ap; + va_start(ap,fmt); + fprintf(stderr, "numactl: "); + vfprintf(stderr, fmt, ap); + va_end(ap); + if (err) + fprintf(stderr,": %s\n", strerror(err)); + else + fputc('\n', stderr); + exit(1); +} + +long memsize(char *s) +{ + char *end; + long length = strtoul(s,&end,0); + switch (toupper(*end)) { + case 'G': length *= 1024; /*FALL THROUGH*/ + case 'M': length *= 1024; /*FALL THROUGH*/ + case 'K': length *= 1024; break; + } + return length; +} + +static struct policy { + char *name; + int policy; + int noarg; +} policies[] = { + { "interleave", MPOL_INTERLEAVE, }, + { "membind", MPOL_BIND, }, + { "preferred", MPOL_PREFERRED, }, + { "default", MPOL_DEFAULT, 1 }, + { NULL }, +}; + +static char *policy_names[] = { "default", "preferred", "bind", "interleave" }; + +char *policy_name(int policy) +{ + static char buf[32]; + if (policy >= array_len(policy_names)) { + sprintf(buf, "[%d]", policy); + return buf; + } + return policy_names[policy]; +} + +int parse_policy(char *name, char *arg) +{ + int k; + struct policy *p = NULL; + if (!name) + return MPOL_DEFAULT; + for (k = 0; policies[k].name; k++) { + p = &policies[k]; + if (!strcmp(p->name, name)) + break; + } + if (!p || !policies[k].name || (!arg && !p->noarg)) + usage(); + return p->policy; +} + +void print_policies(void) +{ + int i; + printf("Policies:"); + for (i = 0; policies[i].name; i++) + printf(" %s", policies[i].name); + printf("\n"); +}
diff --git a/util.h b/util.h new file mode 100644 index 0000000..8443086 --- /dev/null +++ b/util.h
@@ -0,0 +1,20 @@ +extern void printmask(char *name, struct bitmask *mask); +extern int find_first(struct bitmask *mask); +extern struct bitmask *nodemask(char *s); +extern struct bitmask *cpumask(char *s, int *ncpus); +extern int read_sysctl(char *name); +extern void complain(const char *fmt, ...); +extern void nerror(const char *fmt, ...); + +/* defined in main module, but called by util.c */ +extern void usage(void); + +extern long memsize(char *s); +extern int parse_policy(char *name, char *arg); +extern void print_policies(void); +extern char *policy_name(int policy); + +#define err(x) (perror("numactl: " x),exit(1)) +#define array_len(x) (sizeof(x)/sizeof(*(x))) + +#define round_up(x,y) (((x) + (y) - 1) & ~((y)-1)) /* y must be a power of two */
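Review note on round_up(): the macro rounds with a bit mask, which gives the right answer only when y is a power of two (a page size, for example). The sketch below restates that trick as a function next to a division-based form that works for any non-zero multiple. Both function names are hypothetical and exist only for this illustration; util.h itself defines just the macro.

```c
#include <assert.h>
#include <stddef.h>

/* Same mask trick as the round_up macro in util.h; correct only when
   align is a power of two. (Illustrative name, not part of the patch.) */
size_t round_up_pow2(size_t x, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (x + align - 1) & ~(align - 1);
}

/* General form for any non-zero multiple, using division instead of a mask. */
size_t round_up_any(size_t x, size_t multiple)
{
	assert(multiple != 0);
	return (x + multiple - 1) / multiple * multiple;
}
```

With a non-power-of-two argument such as 6, the mask form would silently return a wrong value, which is why the sketch asserts on it.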
diff --git a/versions.ldscript b/versions.ldscript new file mode 100644 index 0000000..eaddc7e --- /dev/null +++ b/versions.ldscript
@@ -0,0 +1,175 @@ +# Symbols defined in the library which aren't specifically bound to a +# version node are effectively bound to an unspecified base version of +# the library. It is possible to bind all otherwise unspecified symbols +# to a given version node using `global: *' somewhere in the version script. +# +# The interfaces at the "v1" level. +# At this level we present these functions to the linker (and thus to an +# application). +# Any functions not defined in the global list (i.e. "local") will be internal +# to the library (i.e. not exported but used within the library). +# Thus the real function names, "numa_bind_v1" etc, are local and won't +# be known to the linker. + +# the first 16 have v1 aliases +# 3 of the 5 system calls that libnuma provides are common to all versions: +libnuma_1.1 { + global: + set_mempolicy; + get_mempolicy; + mbind; + numa_all_nodes; + numa_alloc; + numa_alloc_interleaved; + numa_alloc_interleaved_subset; + numa_alloc_local; + numa_alloc_onnode; + numa_available; + numa_bind; + numa_distance; + numa_error; + numa_exit_on_error; + numa_free; + numa_get_interleave_mask; + numa_get_interleave_node; + numa_get_membind; + numa_get_run_node_mask; + numa_interleave_memory; + numa_max_node; + numa_migrate_pages; + numa_no_nodes; + numa_node_size64; + numa_node_size; + numa_node_to_cpus; + numa_pagesize; + numa_parse_bitmap; + numa_police_memory; + numa_preferred; + numa_run_on_node; + numa_run_on_node_mask; + numa_sched_getaffinity; + numa_sched_setaffinity; + numa_set_bind_policy; + numa_set_interleave_mask; + numa_set_localalloc; + numa_set_membind; + numa_set_preferred; + numa_set_strict; + numa_setlocal_memory; + numa_tonode_memory; + numa_tonodemask_memory; + numa_warn; + numa_exit_on_warn; + local: + *; +}; + +# The interfaces at the "v2" level. 
+# The first 17 have v2 aliases +# We add the bitmask_ functions +# and the move_pages and migrate_pages system calls +# 1.2 depends on 1.1 + +libnuma_1.2 { + global: + copy_bitmask_to_nodemask; + copy_nodemask_to_bitmask; + copy_bitmask_to_bitmask; + set_mempolicy; + get_mempolicy; + mbind; + move_pages; + migrate_pages; + numa_all_cpus_ptr; + numa_all_nodes_ptr; + numa_alloc; + numa_alloc_interleaved; + numa_alloc_interleaved_subset; + numa_alloc_local; + numa_alloc_onnode; + numa_realloc; + numa_allocate_cpumask; + numa_allocate_nodemask; + numa_available; + numa_bind; + numa_bitmask_alloc; + numa_bitmask_clearall; + numa_bitmask_clearbit; + numa_bitmask_equal; + numa_bitmask_free; + numa_bitmask_isbitset; + numa_bitmask_nbytes; + numa_bitmask_setall; + numa_bitmask_setbit; + numa_bitmask_weight; + numa_distance; + numa_error; + numa_exit_on_error; + numa_free; + numa_get_interleave_mask; + numa_get_interleave_node; + numa_get_membind; + numa_get_mems_allowed; + numa_get_run_node_mask; + numa_interleave_memory; + numa_max_node; + numa_max_possible_node; + numa_migrate_pages; + numa_move_pages; + numa_no_nodes_ptr; + numa_node_size64; + numa_node_size; + numa_node_to_cpus; + numa_node_of_cpu; + numa_nodes_ptr; + numa_num_configured_cpus; + numa_num_configured_nodes; + numa_num_possible_nodes; + numa_num_task_cpus; + numa_num_task_nodes; + numa_num_thread_cpus; + numa_num_thread_nodes; + numa_pagesize; + numa_parse_bitmap; + numa_parse_cpustring; + numa_parse_nodestring; + numa_police_memory; + numa_preferred; + numa_run_on_node; + numa_run_on_node_mask; + numa_sched_getaffinity; + numa_sched_setaffinity; + numa_set_bind_policy; + numa_set_interleave_mask; + numa_set_localalloc; + numa_set_membind; + numa_set_preferred; + numa_set_strict; + numa_setlocal_memory; + numa_tonode_memory; + numa_tonodemask_memory; + numa_warn; + local: + *; +} libnuma_1.1; + +# New parsing interface for cpu/numastrings +# was added into version 1.3 + +libnuma_1.3 { + global: + 
numa_parse_cpustring_all; + numa_parse_nodestring_all; + numa_num_possible_cpus; + local: + *; +} libnuma_1.2; + +# New interface with customizable cpuset awareness +# was added into version 1.4 +libnuma_1.4 { + global: + numa_run_on_node_mask_all; + local: + *; +} libnuma_1.3;