HDF Tools: Why is h5dump so slow? Is parallelization available?


#1

HDFView has already been discussed here, so I will ask a question about a similar HDF tool: h5dump. When you extract data with this otherwise nice tool (which to me looks vastly superior to HDFView), you will notice one unfortunate drawback: it is very slow at extracting data. For example, extracting a data array called “particles” from the HDF5 file data.h5 into a text file textFile like this

h5dump.exe -d /particles -o textFile data.h5

runs at a terrible speed of around 2 MB/s. Extraction into a binary file binaryFile like this

h5dump.exe -d /particles -o binaryFile -b MEMORY data.h5

is faster, but the speed is still awful: 8 MB/s. In comparison, any Fortran compiler saves or loads files 20-50 times faster in ASCII text format and 100 times faster in binary (the Silverfrost Fortran compiler saves binary data up to 500x faster, around 4 GB/s).

  1. Which language was used for the output I/O? Was it C, Fortran, or something else?
  2. Is there an option for h5dump to extract data in parallel? 7-Zip and pigz compress and decompress much faster when they are allowed to use multiple cores in parallel.
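For comparison, the same binary extraction can be done directly from a scripting language. Below is a rough, non-authoritative sketch using the Python h5py and NumPy packages (both assumed to be installed; the file and dataset names simply mirror the example above). A bulk read followed by a raw `tofile()` dump skips the per-value text formatting that dominates the cost of an ASCII dump.

```python
# Sketch (assumes h5py and numpy are installed; file/dataset names
# are taken from the example in this post).
import numpy as np
import h5py

# Create a small stand-in for data.h5 so the example is self-contained.
with h5py.File("data.h5", "w") as f:
    f.create_dataset("particles", data=np.arange(1_000_000, dtype=np.float64))

# Binary extraction: one bulk HDF5 read, then a raw memory dump.
with h5py.File("data.h5", "r") as f:
    particles = f["/particles"][...]
particles.tofile("binaryFile")            # no per-value formatting

# Text extraction: per-value formatting is where the time goes,
# just as with an ASCII dump from h5dump.
np.savetxt("textFile", particles[:1000])  # only a sample; a full text dump is slow
```

Timing the two writes separately makes the formatting overhead visible on any machine.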

#2

Maybe a simpler question to ask would be whether h5dump is the right tool for high-performance data extraction from HDF5 files. The tool’s name might be misleading here, and our documentation / communication may have done the rest. It is not.

Is there room for an h5extract tool along the lines you are describing? Possibly.
Would you like to lead an effort to define and vet such a tool with the community?

In the meantime, there are powerful toolkits available that will get you there with a modest effort on your part. For example, take a look at H5CPP (http://sandbox.h5cpp.org/), where practically all of the HDF5 plumbing is done automatically and performance will be close to that of the underlying runtime and hardware.

G.


#3

Essentially, all that is needed is to tell the developer of h5dump to fix it by trying a different compiler, or by using Fortran for the I/O. I suspect C was used, which is notorious for its slowness.

Thanks for the suggestion about the sandbox, but I’d comment that it looks like a strategically wrong suggestion. Such new tools appear like mushrooms after the rain and disappear just as quickly. It is hard to imagine what new this will bring to anyone. The ultimate simplicity is already realized in mass-market software, for example in MATLAB, where reading the data takes a single short line

data = hdf5read(h5file, name)

Plus, MATLAB will not go belly-up as quickly.


#4

Thank you for the encouraging words. H5CPP is not planning to go away soon, and it incorporates bleeding-edge features of both the HDF5 C API and the C++ language itself.

This mushroom-like development is coordinated by The HDF Group, and you are welcome to join our C++ working group meetings. There you can openly criticise the community effort, or just write your own, un-mushroom-like implementation.

As for your statement “I suspect C was used, which is notorious for its slowness”: please post it to the C and C++ user groups, where industry experts can examine your test cases across a wide range of programming languages and hardware platforms.
As an aside: if it is C, C++, and Fortran code that makes systems slow, why are they the number one choice for libraries and operating systems?

What do I know? I moved to Julia 5 years ago… many others followed.

I think the first requirement is to create the funds and the drive, then hire the expertise to do the job. Indeed, The HDF Group provides consulting services; however, I don’t speak for them. This is just a community mailing list, where we follow the ‘be nice to others’ principle.

hope it helps:
steven varga
the author of H5CPP


#5

Do I need to convince anyone further after exposing the crazy 1980s floppy-disk extraction speeds of h5dump? This clearly has to be fixed.


#6

Please note that no one argues that h5dump is a fast, high-throughput solution. Instead, we tried to point you to free alternatives that can get you the result. BTW: you can also get good results with Julia, Python, or MATLAB if you are not familiar with C or C++.

IMHO you do have to convince others for the following reasons:

  • Talented software writers want to get paid: it is a profession, not a hobby.
  • ‘Fixing’ an old code base may be more expensive than writing a new one.
  • The problem is performance: writing some code will not do it; profiling is a tedious task leading to sleepless nights. Someone has to pick up the tab.
  • Writing a general solution that works for all cases, including unseen ones, preferably with a flat profile, is every client’s dream. The reality: accomplish what can be done, then maintain the solution over the years. This adds more cost.

Because of the above points, I think you do need to convince people with money to fund a new HPC version of h5dump.

You could also argue: if h5dump doesn’t provide the I/O throughput, why is it even there? Shouldn’t it have been retired a long time ago?
I love h5dump! If it didn’t exist, I would have had to write something similar. I am thankful to The HDF Group, and to the author, for distributing it free of charge.

You see, I’ve been working in terminals, vi(m) is my best buddy, and being able to print the contents of HDF5 files to the screen is vital for me. This is how I can verify visually that I did the right thing: that a matrix is rotated the correct way, or that I grabbed the right memory location and the I/O transfer is correct.
Therefore I have found this small, imperfect utility growing on me over time, and I can’t imagine an initial debugging session without it.

best wishes:
steven


#7

Guys, just help find the author of h5dump. We are not asking for a parallel version of it, just to use Fortran for the I/O or to try a different C compiler. The current one is a disaster.

Yes, Python is 4x faster than h5dump, but still nowhere near as fast as it should be. And “should be” is 100x faster.

If H5CPP is as good a tool as its authors think it is, let them create an h5dump clone and show how efficient it is in comparison. Everyone will be happy, and may look at H5CPP after that.
What is mostly needed are a few major operations on HDF5 files:

  1. show the h5 file header structure: h5dump.exe -H h5file
  2. extract a specific dataset into a binary or ASCII file

One day of work
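For what it’s worth, those two operations can indeed be sketched quickly. Here is a rough, non-authoritative Python version using the h5py and NumPy packages (both assumed to be installed; the demo file and dataset names are invented for illustration, not taken from any real tool):

```python
# Minimal "h5extract" sketch (assumes h5py and numpy are installed;
# demo.h5 and /particles are made-up names for the demo).
import numpy as np
import h5py

def show_header(path):
    """Rough analogue of `h5dump -H`: print the object tree."""
    with h5py.File(path, "r") as f:
        f.visititems(lambda name, obj: print(name, obj))

def extract(path, dataset, out, binary=True):
    """Rough analogue of dumping one dataset to a binary or ASCII file."""
    with h5py.File(path, "r") as f:
        data = f[dataset][...]          # one bulk read into memory
    if binary:
        data.tofile(out)                # raw bytes, fastest path
    else:
        np.savetxt(out, data)           # formatted ASCII, much slower
    return data

# Demo: build a tiny file, show its structure, extract the dataset.
with h5py.File("demo.h5", "w") as f:
    f.create_dataset("particles", data=np.arange(10.0))
show_header("demo.h5")
extract("demo.h5", "/particles", "particles.bin")
```

This is a sketch, not a drop-in replacement: a real clone would also need to handle compound types, attributes, and chunked/compressed layouts, which is where the remaining effort hides.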