Parallel HDF5 test failures (power9 / lustre / openmpi / ucx)


#1

Hi all,

I’m trying to get HDF5’s parallel tests to complete and am having difficulty. I’m on the following platform:

  • RHEL7
  • GCC 4.8.5 (OS supplied)
  • openmpi 4.0.5 (built against ucx 1.9.0)
  • Lustre 2.12.5
  • IBM power9
  • MOFED 4.7.3

If I run using openmpi’s ROMIO support, the tests time out during testphdf5, even with the alarm set to an hour. The last line printed is “Testing – multi-chunk collective chunk io (cchunk3)”.

If I run using openmpi’s (default) ompio support, testphdf5 completes (good), but testph5diff.sh fails because of about 800 extra lines in the test output of the form:

[1605284505.746115] [login2:139887:0] tag_match.c:61 UCX WARN unexpected tag-receive descriptor 0x20001e594000 was not matched
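
To check whether these warnings are the only difference from the expected output, one option is to filter them out of the captured output before comparing. The following is a minimal Python sketch, not part of HDF5 or UCX: the function name `strip_ucx_tag_warnings` is hypothetical, and the pattern assumes every warning has the same shape as the single line quoted above.

```python
import re

# Assumed shape, taken from the one warning line quoted in this post:
# [timestamp] [host:pid:thread] tag_match.c:NN UCX WARN unexpected
# tag-receive descriptor 0x... was not matched
UCX_TAG_WARNING = re.compile(
    r"^\[\d+\.\d+\]\s+\[[^\]]+\]\s+tag_match\.c:\d+\s+UCX\s+WARN\s+"
    r"unexpected tag-receive descriptor 0x[0-9a-fA-F]+ was not matched\s*$"
)


def strip_ucx_tag_warnings(lines):
    """Return (kept_lines, dropped_count) with matching UCX warnings removed."""
    kept = []
    dropped = 0
    for line in lines:
        if UCX_TAG_WARNING.match(line):
            dropped += 1
        else:
            kept.append(line)
    return kept, dropped
```

If the filtered output matches the expected output, the warnings are the only cause of the testph5diff.sh failure. This hides the symptom only; it does not address whatever produces the warnings.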

Has anyone seen these UCX warnings before when testing HDF5, please?

Is ompio better than romio for HDF5 with openmpi on Lustre these days?

Thanks,

Mark


#2

There is at least one known and still open bug with OpenMPI + romio + lustre + HDF5:


The workaround is to switch from romio to ompio for this particular bug at least.
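
As a rough sketch of one way to make that switch for a test run: Open MPI reads MCA parameters from environment variables named `OMPI_MCA_<param>`, so setting `OMPI_MCA_io` selects the I/O component. The helper name `env_with_io_component` below is hypothetical. The ROMIO component name depends on the Open MPI version (I believe it is `romio321` in the 4.0 series, but check the output of `ompi_info` on your install).

```python
import os


def env_with_io_component(component, base_env=None):
    """Return a copy of the environment that selects an Open MPI io component.

    `component` is an MCA io component name such as "ompio". The ROMIO
    component name is version-specific, so verify it with ompi_info.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Environment-variable form of the MCA parameter "io".
    env["OMPI_MCA_io"] = component
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when launching the tests, so the setting applies to that run only rather than to the whole shell session.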


#3

I’m seeing the same thing when building HDF5 1.10.7 on Fedora Rawhide with openmpi 4.1.1rc1 and ucx 1.9.0. According to https://github.com/openucx/ucx/issues/6331, these messages mean that the program did not receive all of the messages sent to it, which points to an issue with the tests or with the HDF5 library itself.