Struggling to build HDF5 with parallel libraries

I’m trying to build HDF5 with the --enable-parallel option. I’ve installed zlib (ZDIR=/home/chris/zlib). Using the MPICH libraries (built from source), I get as far as:

CC=/home/chris/mpich/install/bin/mpicc FC=/home/chris/mpich/install/bin/mpifort ./configure --with-zlib=${ZDIR} --prefix=${H5DIR} --enable-hl --enable-fortran --enable-parallel
make check

gets as far as

[…]

PHDF5 tests finished with no errors

Finished testing testphdf5

make[4]: Leaving directory ‘/home/nemo/hdf5/hdf5-1.10.5/testpar’
make[4]: Entering directory ‘/home/nemo/hdf5/hdf5-1.10.5/testpar’

Testing t_cache

…and then just hangs. If I kill the job and rerun make check it thinks it has already run the tests:

No need to test t_cache again.
make[4]: Leaving directory ‘/home/nemo/hdf5/hdf5-1.10.5/testpar’
make[4]: Entering directory ‘/home/nemo/hdf5/hdf5-1.10.5/testpar’

Testing t_cache_image

But this needs to run in a script, so that isn’t an acceptable solution!
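For the scripted case, one option is to put a time limit around make check so that a hung test fails the script instead of blocking it. This is only a minimal sketch: run_with_limit is a hypothetical helper name, it assumes GNU coreutils timeout is installed, and the 30-minute limit in the usage comment is an arbitrary guess rather than a recommended value.

```shell
#!/bin/sh
# Sketch: run a command under a time limit so a hang becomes a failure.
# Assumption: GNU coreutils `timeout` is available; it exits with 124
# when the limit is reached.
run_with_limit() {
    limit="$1"; shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "timed out after $limit: $*" >&2
    fi
    return "$status"
}

# Intended use from the HDF5 build directory (not run here):
#   run_with_limit 30m make check || exit 1
```

This does not explain why t_cache hangs; it only stops the hang from stalling the script. MPI processes started by the test may also need cleaning up separately after a timeout.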

In case anyone is wondering, I’m using MPICH because Open MPI doesn’t even get that far:

Makefile:1444: recipe for target ‘t_mpi.chkexe_’ failed
make[4]: *** [t_mpi.chkexe_] Error 1
make[4]: Leaving directory ‘/nemo/hdf5-1.10.5/testpar’
Makefile:1553: recipe for target ‘build-check-p’ failed
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory ‘/nemo/hdf5-1.10.5/testpar’
Makefile:1424: recipe for target ‘test’ failed
make[2]: *** [test] Error 2
make[2]: Leaving directory ‘/nemo/hdf5-1.10.5/testpar’
Makefile:1225: recipe for target ‘check-am’ failed
make[1]: *** [check-am] Error 2
make[1]: Leaving directory ‘/nemo/hdf5-1.10.5/testpar’
Makefile:654: recipe for target ‘check-recursive’ failed
make: *** [check-recursive] Error 1

Hi, I can’t speak for MPICH, but I have compiled several versions of Parallel HDF5 with OpenMPI 4.0.1. In fact, I’ve compiled one commit for each day from 2003 to 2019 and linked each against the IOR performance measurement tool with good results.

What OS are you using?

It’s in a singularity container, but running Ubuntu; lsb_release -a:

LSB Version: core-9.20170808ubuntu1-noarch:security-9.20170808ubuntu1-noarch
Distributor ID: Ubuntu
Description: Ubuntu 18.04 LTS
Release: 18.04
Codename: bionic

Speaking from personal experience: Ubuntu 18.04 LTS is my preferred OS for building clusters on AWS EC2, and it should work fine. The only glitch I usually hit is when compiling from a shared or attached drive: for some reason it messes up the time stamps. I usually resolve this by doing all system compiles on instances with ephemeral/local drives. However, this problem is only OpenMPI specific; PHDF5 compiles fine on a parallel FS.

I did manage to compile PHDF5 from 2003 to 2019 against OpenMPI 4.0.1 on an Ubuntu 18.04 LTS based custom cluster running on AWS EC2.

best: steve

So when I use OpenMPI, I get the error

Makefile:1444: recipe for target ‘t_mpi.chkexe_’ failed

do you have any idea how to resolve that?

No, did you try googling it? I got the following:

7280inputs+64outputs (54major+1677minor)pagefaults 0swaps
make[4]: *** [t_mpi.chkexe_] Error 1
make[4]: Leaving directory '/home/hwu/Downloads/hdf5-1.10.5/testpar'
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory '/home/hwu/Downloads/hdf5-1.10.5/testpar'
make[2]: *** [test] Error 2
make[2]: Leaving directory '/home/hwu/Downloads/hdf5-1.10.5/testpar'
make[1]: *** [check-am] Error 2
make[1]: Leaving directory '/home/hwu/Downloads/hdf5-1.10.5/testpar'
make: *** [check-recursive] Error 1

This message basically states that you do not have enough hardware resources to run the 6 processes you requested (OMPI assumes you are running for performance and refuses by default to oversubscribe your hardware resources). You can find more information in our FAQ.

best

Yeh, I saw that, but I couldn’t find a (simple) way to add the oversubscribe flag to anything relevant…

You should be able to do:

make check RUNPARALLEL='mpirun -oversubscribe '

I have used the RUNPARALLEL macro in other contexts, but not for your specific case.

Hope this helps.

Hmm, even that isn’t working:

$ CC=/opt/openmpi-4.0.1/bin/mpicc FC=/opt/openmpi-4.0.1/bin/mpifort ./configure --with-zlib=${ZDIR} --prefix=${H5DIR} --enable-hl --with-fortran --enable-parallel
$ /opt/openmpi-4.0.1/bin/mpicc --version
gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0
Copyright (C) 2017 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
$ make check
<...>
make[4]: Entering directory '/home/chris/hdf5-1.10.5/testpar'
============================
Testing  t_mpi 
============================
 t_mpi  Test Log
============================
[mpiexec@vagrant] match_arg (utils/args/args.c:163): unrecognized argument oversubscribe
[mpiexec@vagrant] HYDU_parse_array (utils/args/args.c:178): argument matching returned error
[mpiexec@vagrant] parse_args (ui/mpich/utils.c:1642): error parsing input array
[mpiexec@vagrant] HYD_uii_mpx_get_parameters (ui/mpich/utils.c:1694): unable to parse user arguments
[mpiexec@vagrant] main (ui/mpich/mpiexec.c:148): error parsing parameters
Makefile:1444: recipe for target 't_mpi.chkexe_' failed
make[4]: *** [t_mpi.chkexe_] Error 1
make[4]: Leaving directory '/home/chris/hdf5-1.10.5/testpar'
Makefile:1553: recipe for target 'build-check-p' failed
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory '/home/chris/hdf5-1.10.5/testpar'
Makefile:1424: recipe for target 'test' failed
make[2]: *** [test] Error 2
make[2]: Leaving directory '/home/chris/hdf5-1.10.5/testpar'
Makefile:1225: recipe for target 'check-am' failed
make[1]: *** [check-am] Error 2
make[1]: Leaving directory '/home/chris/hdf5-1.10.5/testpar'
Makefile:654: recipe for target 'check-recursive' failed
make: *** [check-recursive] Error 1

I’ve tried --oversubscribe too
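The [mpiexec@vagrant] lines in the log above point into ui/mpich/ source files, so they come from MPICH’s Hydra launcher, which rejects the oversubscribe argument. Below is a minimal sketch of checking which launcher is in use before adding the flag. mpi_flavor and runparallel_for are hypothetical helper names, and the version strings matched ("OpenRTE", "Open MPI", "HYDRA") are assumptions about what each launcher prints for --version.

```shell
#!/bin/sh
# Classify an MPI launcher from its --version text.
# Assumption: Open MPI prints "OpenRTE" or "Open MPI";
# MPICH's Hydra prints "HYDRA build details".
mpi_flavor() {
    case "$1" in
        *OpenRTE*|*"Open MPI"*) echo openmpi ;;
        *HYDRA*|*Hydra*)        echo mpich ;;
        *)                      echo unknown ;;
    esac
}

# Build a RUNPARALLEL value: only Open MPI gets --oversubscribe.
runparallel_for() {
    launcher="$1"; nprocs="$2"
    flavor=$(mpi_flavor "$("$launcher" --version 2>&1)")
    if [ "$flavor" = openmpi ]; then
        echo "$launcher --oversubscribe -n $nprocs"
    else
        echo "$launcher -n $nprocs"
    fi
}

# Intended use (not run here; the mpiexec path is an assumption based on
# the mpicc location used in the configure line):
#   make check RUNPARALLEL="$(runparallel_for /opt/openmpi-4.0.1/bin/mpiexec 6)"
```

Giving the full path to the launcher avoids picking up whichever mpiexec happens to be first on PATH.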

I had missed running make clean before this, so it was still building against a version of MPICH I’d also installed. However, having fixed that, I still have issues:

h5repack tests failed with 1 errors.
Makefile:1451: recipe for target 'h5repack.sh.chkexe_' failed

The threads “H5repack error - make check” and “HDF5 make test in error” suggest that these tests are unreliable and recommend using the -i flag. This does work, but it would be good to have some confirmation from an HDF5 developer on whether there are plans to fix the tests. I’m really not comfortable with hiding errors rather than fixing them, particularly for the community I’m supporting.
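One way to use -i without losing sight of the failures is to keep the log and count what make skipped over. This is a rough sketch rather than anything the HDF5 build provides: count_ignored_errors is a hypothetical helper, check.log is an arbitrary file name, and it assumes GNU make marks skipped failures with "(ignored)".

```shell
#!/bin/sh
# Count the failures that `make -i` skipped over in a saved log.
# Assumption: GNU make reports them as, for example,
#   make[4]: [t_mpi.chkexe_] Error 1 (ignored)
count_ignored_errors() {
    # grep -c still prints 0 when nothing matches, but exits non-zero;
    # `|| true` keeps that from aborting a script run with `set -e`.
    grep -c 'Error [0-9]* (ignored)' "$1" || true
}

# Intended use from the HDF5 build directory (not run here):
#   make -i check > check.log 2>&1
#   n=$(count_ignored_errors check.log)
#   [ "$n" -eq 0 ] || { echo "$n targets failed, see check.log"; exit 1; }
```

That way the whole suite runs, and the script still exits non-zero if anything failed.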

Did you try this on a plain computer or laptop running Ubuntu 18.04 instead of an IaaS platform?

I know for a fact that OMPI works. You may run into problems with Slurm, but OMPI 4.0.1 with recent HDF5 is solidly green.

Yes, I have on Ubuntu 16, and it still fails. I still need to pass RUNPARALLEL='-oversubscribe' -i for make check to complete successfully.

I get the following errors, first with no flags:

Makefile:1444: recipe for target 't_mpi.chkexe_' failed
make[4]: *** [t_mpi.chkexe_] Error 1
make[4]: Leaving directory '/home/chris/hdf5test/hdf5-1.10.5/testpar'
Makefile:1553: recipe for target 'build-check-p' failed
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory '/home/chris/hdf5test/hdf5-1.10.5/testpar'
Makefile:1424: recipe for target 'test' failed
make[2]: *** [test] Error 2
make[2]: Leaving directory '/home/chris/hdf5test/hdf5-1.10.5/testpar'
Makefile:1225: recipe for target 'check-am' failed
make[1]: *** [check-am] Error 2
make[1]: Leaving directory '/home/chris/hdf5test/hdf5-1.10.5/testpar'
Makefile:654: recipe for target 'check-recursive' failed
make: *** [check-recursive] Error 1

And with RUNPARALLEL='-oversubscribe':

ORTE_ERROR_LOG: Out of resource in file util/show_help.c at line 501
Makefile:1444: recipe for target 't_cache.chkexe_' failed
make[4]: *** [t_cache.chkexe_] Error 1
make[4]: Leaving directory '/home/chris/hdf5test/hdf5-1.10.5/testpar'
Makefile:1553: recipe for target 'build-check-p' failed
make[3]: *** [build-check-p] Error 1
make[3]: Leaving directory '/home/chris/hdf5test/hdf5-1.10.5/testpar'
Makefile:1424: recipe for target 'test' failed
make[2]: *** [test] Error 2
make[2]: Leaving directory '/home/chris/hdf5test/hdf5-1.10.5/testpar'
Makefile:1225: recipe for target 'check-am' failed
make[1]: *** [check-am] Error 2
make[1]: Leaving directory '/home/chris/hdf5test/hdf5-1.10.5/testpar'
Makefile:654: recipe for target 'check-recursive' failed
make: *** [check-recursive] Error 1

Let’s break the problem into smaller pieces, setting aside HDF5’s make check:

  1. Did OpenMPI 4.0.2 compile?
  2. Is a suitable parallel file system running, such as Lustre, BeeGFS, or OrangeFS?
  3. Did pHDF5 compile?
  4. Which job scheduler is in place: SLURM, GridEngine, …?
  5. Are there enough resources available to run the job?

make check can be controlled in various ways, and it can indeed be finicky. I suggest setting the correct number of processes; by default I think it is set to 4 (not certain).

If your company is interested in quality on-demand IaaS clusters that match the setups used on supercomputers, my consulting company provides such services directly or through The HDF Group.
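As a rough illustration of matching the process count to the machine, the sketch below caps the requested count at the number of cores reported by nproc. capped_nprocs is a hypothetical helper, and the usage line assumes the generated Makefiles read the count from an NPROCS environment variable, which is worth confirming against the RUNPARALLEL setting in your own build.

```shell
#!/bin/sh
# Pick a process count no larger than the cores available.
# Assumption: `nproc` (GNU coreutils) is installed. It may count hardware
# threads, which an MPI launcher does not necessarily treat as slots.
capped_nprocs() {
    wanted="$1"
    cores=$(nproc)
    if [ "$wanted" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$wanted"
    fi
}

# Intended use (not run here; NPROCS being honoured is an assumption):
#   NPROCS=$(capped_nprocs 6) make check
```

Some parallel tests may expect a minimum number of processes, so a very small count could cause failures of its own.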

best:
steve

  1. Yes
  2. Docs say that a parallel file system isn’t needed, so no. (e.g. https://www.unidata.ucar.edu/software/netcdf/docs/getting_and_building_netcdf.html#build_hdf4 just specifies that MPI libraries must be made available)
  3. Yes

Yes, I’ve changed the resources needed using the RUNPARALLEL flag

That documentation link is for NetCDF; is that what you are building? I am not familiar with NetCDF.

The question was tagged as parallel HDF5; here is the link to the installation requirements. Somewhere near the top you will see a POSIX-compliant parallel file system listed.

FYI. Some relaxation can be made from strict POSIX compliance.

Best: steve

Yes, HDF5 is a dependency for NetCDF, so I was following their instructions for building. Given my use case, I’ll ignore make check for now.