I’m trying to get HDF5’s parallel tests to complete and am having difficulty. I’m on the following platform:
- GCC 4.8.5 (OS supplied)
- Open MPI 4.0.5 (built against UCX 1.9.0)
- Lustre 2.12.5
- IBM POWER9
- MOFED 4.7.3
If I run using Open MPI’s ROMIO support, the tests time out (even with the alarm set to an hour) during testphdf5; the last line printed is “Testing – multi-chunk collective chunk io (cchunk3)”.
If I run using Open MPI’s (default) OMPIO support, testphdf5 completes (good), but testph5diff.sh fails because the test output contains about 800 extra lines of the form:
[1605284505.746115] [login2:139887:0] tag_match.c:61 UCX WARN unexpected tag-receive descriptor 0x20001e594000 was not matched
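To illustrate the kind of extra line involved, here is a minimal Python sketch of how such lines could be filtered out of captured output before it is compared with the expected output. It is only a sketch under assumptions: the function name strip_ucx_warnings is hypothetical, the regular expression is modeled on the single warning line quoted above, and nothing here is part of the HDF5 test scripts.

```python
import re

# Pattern modeled on the one warning line quoted above (an assumption);
# other UCX messages may be formatted differently.
UCX_WARN = re.compile(r"^\[\d+\.\d+\]\s+\[[^\]]+\]\s+\S+:\d+\s+UCX\s+WARN\b")


def strip_ucx_warnings(text):
    """Return text with every line that matches UCX_WARN removed."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not UCX_WARN.match(line)
    ]
    return "".join(kept)


# Placeholder output lines (hypothetical) around the real warning line.
sample = (
    "first line of expected output\n"
    "[1605284505.746115] [login2:139887:0] tag_match.c:61 UCX WARN "
    "unexpected tag-receive descriptor 0x20001e594000 was not matched\n"
    "second line of expected output\n"
)
cleaned = strip_ucx_warnings(sample)
```

Filtering like this only hides the symptom in the comparison; it does not explain why UCX emits the warning in the first place.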
Has anyone seen these UCX warnings before when testing HDF5, please?
Is OMPIO better than ROMIO for HDF5 with Open MPI on Lustre these days?