About NaN and Inf in HDF5 files

Hello all,

I have a lot of HDF5 files with many layers and parameters.

I’m struggling to find a simple way to check whether any of
these files have NaN or Inf values for any of their parameters.

If you have any good ideas, please let me know.

I would be grateful for any advice.

Best,
Yuichiro


HAGIHARA Yuichiro // yu.hagihara@gmail.com

You can generate an XML map of the HDF5 contents using DMR++ (OPeNDAP).
Then you can examine the length (or size) of each dataset.
This approach is particularly useful if you want to build a database (e.g., Elasticsearch) of XML maps from millions of HDF5 files.
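For example, here is a minimal sketch in Python that pulls the stored chunk sizes out of a DMR++ file. It assumes the usual DMR++ layout, where each floating-point variable element (Float32, Float64) contains dmrpp:chunk children with an nBytes attribute; check the element names against your own files.

import xml.etree.ElementTree as ET

FLOAT_TYPES = {'Float32', 'Float64'}  # DAP4 variable element names

def stored_bytes_per_variable(dmrpp_path):
    # Walk the DMR++ XML, ignoring namespaces, and sum the on-disk
    # size (nBytes) of every chunk under each floating-point variable.
    root = ET.parse(dmrpp_path).getroot()
    for var in root.iter():
        if var.tag.rsplit('}', 1)[-1] in FLOAT_TYPES:
            nbytes = sum(int(c.get('nBytes', 0))
                         for c in var.iter()
                         if c.tag.rsplit('}', 1)[-1] == 'chunk')
            print(var.get('name'), nbytes)

A variable whose total nBytes is tiny compared to its declared shape is a good candidate for a mostly-fill-value (or all-NaN) dataset.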

For a single HDF5 file, you can do a quick check with h5dump.
Pay attention to SIZE in the following output.

$ h5dump -H -p chunked_gzipped_fourD.h5
HDF5 "chunked_gzipped_fourD.h5" {
GROUP "/" {
  DATASET "d_16_gzipped_chunks" {
     DATATYPE  H5T_IEEE_F32LE
     DATASPACE  SIMPLE { ( 40, 40, 40, 40 ) / ( 40, 40, 40, 40 ) }
     STORAGE_LAYOUT {
        CHUNKED ( 20, 20, 20, 20 )
        SIZE 2863311 (3.576:1 COMPRESSION)
     }

If SIZE is very small relative to the uncompressed size of the dataset, the dataset very likely consists mostly of the fill value (or of one repeated value such as -999).
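You can automate the same check. Here is a minimal sketch using h5py (my tool choice, not the only option) that flags datasets whose on-disk size is a tiny fraction of their logical size; the 1% threshold is an arbitrary choice.

import h5py

def flag_highly_compressed(path, ratio=0.01):
    # Print datasets whose stored size is below `ratio` of their
    # logical (uncompressed) size -- a hint of mostly-constant data.
    with h5py.File(path, 'r') as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset) and obj.size:
                stored = obj.id.get_storage_size()       # bytes on disk
                logical = obj.size * obj.dtype.itemsize  # bytes if uncompressed
                if stored < ratio * logical:
                    print(f'{name}: {stored} of {logical} bytes stored')
        f.visititems(visit)

flag_highly_compressed('chunked_gzipped_fourD.h5')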

Run h5dump in a shell loop over all your files and grep the output for nan and inf. This is computationally very inefficient, but you asked for “simple”, not fast. If you want speed, you will need something that scans the files in binary; a sketch of that follows the example below.

> set f = hdf5-1.14.5/tools/test/h5diff/testfiles/h5diff_basic1.h5
> h5dump $f | grep -e nan -e inf
               nan,
               nan
               nan,
               nan
               nan,
               nan
         (0): nan, 1, nan, 1, 1, 1
         (0): nan, nan, 1, 1, 1, 1
         (0): nan, 1, nan, 1, 1, 1
         (0): nan, nan, 1, 1, 1, 1
         (0): nan, nan, 1, 1, 1, 1
         (0): -inf, -inf, -inf, inf, inf, inf
         (0): -inf, -inf, -inf, inf, inf, inf
         (0): -inf, -inf, -inf, inf, inf, inf
         (0): -inf, -inf, -inf, inf, inf, inf
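For the fast path, here is a minimal sketch using h5py and NumPy (again my tool choice) that reads each floating-point dataset directly and reports NaN/Inf counts:

import glob
import h5py
import numpy as np

def report_nonfinite(path):
    # Visit every dataset in the file; for floating-point data,
    # count NaN and Inf values and print any dataset that has them.
    with h5py.File(path, 'r') as f:
        def visit(name, obj):
            if isinstance(obj, h5py.Dataset) and obj.dtype.kind == 'f':
                data = obj[...]  # loads the whole dataset; read chunk-wise if it is huge
                if not np.isfinite(data).all():
                    print(f'{path}:{name}: {np.isnan(data).sum()} NaN, '
                          f'{np.isinf(data).sum()} Inf')
        f.visititems(visit)

for path in glob.glob('*.h5'):
    report_nonfinite(path)

Reading the data through the library skips the text formatting that makes h5dump slow, and NumPy does the NaN/Inf test in one vectorized pass.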

To search attributes only, use h5dump -A $f. This prints the headers and attribute values but no dataset data, so it is a lot faster than a full h5dump $f for large data files.