Hi all,
I’m trying to get HDF5’s parallel tests to complete and am having difficulty. I’m on the following platform:
- RHEL7
- GCC 4.8.5 (OS supplied)
- Open MPI 4.0.5 (built against UCX 1.9.0)
- Lustre 2.12.5
- IBM POWER9
- MOFED 4.7.3
If I run using Open MPI’s ROMIO support, the tests time out (even with the alarm set to an hour) during testphdf5; the last line printed is “Testing – multi-chunk collective chunk io (cchunk3)”.
If I run using Open MPI’s default OMPIO support, testphdf5 completes (good), but testph5diff.sh fails because the test output contains about 800 extra lines of the form:
[1605284505.746115] [login2:139887:0] tag_match.c:61 UCX WARN unexpected tag-receive descriptor 0x20001e594000 was not matched
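For illustration only, here is a minimal Python sketch of how lines of this shape could be filtered out of captured test output before it is compared against the expected output. It is not part of the HDF5 test scripts. The regular expression is an assumption based on the single example line above, and the function name and the placeholder sample lines are hypothetical.

```python
import re

# Assumed shape of the extra lines, inferred from the one example above:
#   [timestamp] [host:pid:tid] source_file.c:line UCX WARN message
UCX_WARN_LINE = re.compile(
    r"^\[\d+\.\d+\]\s+\[[^\]]+\]\s+\S+:\d+\s+UCX\s+WARN\b"
)


def strip_ucx_warnings(text):
    """Return text with every line that looks like a UCX warning removed."""
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not UCX_WARN_LINE.match(line)
    ]
    return "".join(kept)


# Placeholder output lines (hypothetical) surrounding the warning quoted above.
sample = (
    "expected output line 1\n"
    "[1605284505.746115] [login2:139887:0] tag_match.c:61 UCX WARN "
    "unexpected tag-receive descriptor 0x20001e594000 was not matched\n"
    "expected output line 2\n"
)

filtered = strip_ucx_warnings(sample)
print(filtered, end="")
```

Filtering like this only hides the symptom so the output comparison can proceed; it does not explain why UCX emits the warnings in the first place.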
Has anyone seen these UCX warnings before when testing HDF5?
Is OMPIO better than ROMIO for HDF5 with Open MPI on Lustre these days?
Thanks,
Mark