Thursday, February 6, 2014

Running MPI tests in Fedora packages

Getting Started

I'm excited about getting started building my packages for Fedora EPEL 7.  Most of them are scientific in nature, and near the bottom of the stack is the HDF5 library.  Unfortunately I quickly ran into a problem running the mpich MPI tests in the package - the test would hang immediately:

make[4]: Entering directory `/builddir/build/BUILD/hdf5-1.8.12/mpich/testpar'
Testing  t_mpi 

Time to reproduce this on my local mock builder...

More Problems...

So I fired off a local mock build and, lo and behold, hit a different issue.  The build made it past the hang above, which made it impossible to debug that one locally, but then hung later:

Testing   t_cache
Parallel metadata cache tests
        mpi_size     = 4
        express_test = 1
*** Hint ***
You can use environment variable HDF5_PARAPREFIX to run parallel test
files in a
different directory or to add file type prefix. E.g.,
*** End of Hint ***
0:setup_rand(): seed = 138071.
3:setup_rand(): seed = 149196.
2:setup_rand(): seed = 160135.
1:setup_rand(): seed = 180134.
Testing server smoke check
Testing smoke check #1 -- process 0 only md write strategy

I spent a lot of (too much) time poking around with strace, gdb, etc. trying to pin this down.  Finally, though, I sent off a message to some helpful folks, and word came back about a known issue when running with more MPI processes than physical cores.  That was indeed the case here.  So I'm now working around this by limiting the number of MPI processes in the tests to 4, which I'm told is the minimum number of cores on the Fedora builders.
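
As a rough sketch of that workaround, the process count can be capped in shell before the tests run.  The helper name mpi_procs and the NPROCS variable are assumptions for illustration only; check which variable (if any) your package's test scripts actually read.

```shell
# Hypothetical helper: print how many MPI processes a test run should use.
# Usage: mpi_procs [cap]   (cap defaults to 4, the builder minimum mentioned above)
mpi_procs() {
    cap=${1:-4}
    # nproc (coreutils) reports the processing units available to this process,
    # which may count hardware threads rather than physical cores; fall back to 1.
    cores=$(nproc 2>/dev/null || echo 1)
    if [ "$cores" -lt "$cap" ]; then
        echo "$cores"
    else
        echo "$cap"
    fi
}

# NPROCS is an assumed variable name, not something confirmed by the package.
NPROCS=$(mpi_procs 4)
export NPROCS
echo "Running MPI tests with $NPROCS processes"
```

Capping against the local core count as well as the fixed limit of 4 also keeps a local mock build on a smaller machine from oversubscribing.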

But I was back to where I started.  I decided to see if this was just an HDF5 issue, or if it would show up elsewhere, so I picked on the poor, hapless BLACS package...


BLACS is fairly simple MPI code.  But the package didn't (yet) have a %check section.  However the code did have a couple of tests that could be built and run directly, which would prove helpful later.  So starting with the rawhide package I built and ran the tests, and lo - a hang!  But this time only on the 32-bit build, which didn't match my original issue.

gdb fairly quickly pinpointed it getting stuck in a BLACS test routine - specifically a subroutine copied from the LAPACK library.   BLACS is pretty old (1997!) and I noticed that the current LAPACK library had a pretty reworked version of that routine, so I built against the system LAPACK and that fixed it.  Yet another strike against bundled libraries!

Now the mpich tests were completing and we were on to the openmpi version.  But that failed with:

 It looks like orte_init failed for some reason; your parallel process is
 likely to abort.  There are many reasons that a parallel process can
 fail during orte_init; some of which are due to configuration or
 environment problems.  This failure appears to be an internal failure;
 here's some additional information (which may only be relevant to an
 Open MPI developer):

   orte_plm_base_select failed
   --> Returned value Not found (-13) instead of ORTE_SUCCESS

What a lovely error message!  Luckily Google quickly pointed to the lack of an ssh or rsh binary, which is indeed missing from the minimal build roots on Fedora.  A BuildRequires (BR) on openssh-clients took care of that for now, though I've filed an issue to see if we can remove this requirement.
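
A quick pre-flight check along these lines would have made the cause obvious much sooner.  This is only a sketch; have_any_command is a hypothetical helper, not part of any package.

```shell
# Hypothetical helper: succeed if at least one of the named commands is on PATH.
have_any_command() {
    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            return 0
        fi
    done
    return 1
}

# Report whether a launcher binary is present before starting the openmpi tests.
if have_any_command ssh rsh; then
    echo "ssh or rsh found; openmpi tests should be able to start"
else
    echo "no ssh or rsh binary; add BuildRequires: openssh-clients" >&2
fi
```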


We were building okay on rawhide, so now for the smoke test again on EPEL 7.  And there it was again, an immediate hang in the mpich test!  I still couldn't reproduce it locally, so I added a BR on strace and did a scratch build:

+ strace -f mpirun -np 4 ./xCbtest_MPI-LINUX-0
execve("/usr/lib64/mpich/bin/mpirun", ["mpirun", "-np", "4",
"./xCbtest_MPI-LINUX-0"], [/* 46 vars */]) = 0
[pid  7662] execve("/usr/bin/ssh", ["/usr/bin/ssh", "-x",
"buildvm-11.phx2.fedoraproject.or"..., "\"/usr/lib64/mpich/bin/hydra_pmi_"...,
"--control-port", "buildvm-11.phx2.fedoraproject.or"..., "--rmk", "user",
"--launcher", "ssh", "--demux", "poll", "--pgid", "0", "--retries", "10",
...], [/* 46 vars */]) = 0
[pid  7662] write(2, "ssh: Could not resolve hostname "..., 94) = 94
[pid  7661] <... poll resumed> )        = 1 ([{fd=10, revents=POLLIN}])
[pid  7662] exit_group(255)             = ?
[pid  7661] fcntl(10, F_GETFL)          = 0 (flags O_RDONLY)
[pid  7661] fcntl(10, F_SETFL, O_RDONLY|O_NONBLOCK) = 0
[pid  7661] fcntl(2, F_GETFL)           = 0x1 (flags O_WRONLY)
[pid  7661] fcntl(2, F_SETFL, O_WRONLY|O_NONBLOCK) = 0
[pid  7661] read(10, "ssh: Could not resolve hostname "..., 65536) = 94
[pid  7661] write(2, "ssh: Could not resolve hostname "..., 94ssh: Could not
resolve hostname Name or service not known
) = 94
[pid  7661] gettimeofday({1391730785, 32416}, NULL) = 0
[pid  7661] poll([{fd=3, events=POLLIN}, {fd=5, events=POLLIN}, {fd=8,
events=POLLIN}, {fd=10, events=POLLIN}], 4, 4294967295 <unfinished ...>
[pid  7662] +++ exited with 255 +++
<... poll resumed> )                    = 2 ([{fd=8, revents=POLLHUP}, {fd=10,
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=7662, si_status=255,
si_utime=0, si_stime=0} ---
brk(0)                                  = 0x1e0b000
brk(0x1e3a000)                          = 0x1e3a000
fcntl(8, F_GETFL)                       = 0 (flags O_RDONLY)
fcntl(1, F_GETFL)                       = 0x1 (flags O_WRONLY)
read(8, "", 65536)                      = 0
close(8)                                = 0
read(10, "", 65536)                     = 0
close(10)                               = 0
gettimeofday({1391730785, 33070}, NULL) = 0

And there we're stuck.  So it looks like mpich is making use of the ssh binary now provided for the openmpi tests, but because networking is completely disabled on the Fedora builders, the hostname lookup fails, which seems to trigger a bug in mpich.  The trace also suggested a workaround - specifying "-host localhost" - which worked!
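
For the BLACS test above, the adjusted invocation can be sketched as follows.  The helper mpich_test_cmd is hypothetical; the test binary name is the one from the strace run.

```shell
# Hypothetical helper: print the mpich test command line with the workaround applied.
# Usage: mpich_test_cmd NPROCS PROGRAM [ARGS...]
mpich_test_cmd() {
    nprocs=$1
    shift
    echo "mpirun -host localhost -np $nprocs $*"
}

# Print the command so it shows up in the build log, then run it
# only if mpirun and the test binary are actually present.
cmd=$(mpich_test_cmd 4 ./xCbtest_MPI-LINUX-0)
echo "+ $cmd"
if command -v mpirun >/dev/null 2>&1 && [ -x ./xCbtest_MPI-LINUX-0 ]; then
    $cmd
fi
```

Naming the host explicitly keeps everything on the local machine, so the failing hostname lookup seen in the trace never comes into play.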


So, to summarize necessary steps for running MPI tests in Fedora packages:

  • Add BuildRequires: openssh-clients for openmpi tests.
  • Add "-host localhost" to mpich test runs.
  • Limit MPI processes to 4 or fewer.

HTH - Orion