Python -> C++ file loader


#1

Hello All,
I wrote a data visualization program in Python. The data are stored in *.h5 format, and right now the program reads them with the h5py Python package. When I run my code, however, loading the data is the most time-consuming part, so I was thinking about moving this part to C++ (with which I have basic experience).

My first question would be: is it actually worth the effort? Or is the h5py package already well optimized, so that I won't gain much?

Second issue: I already made an attempt, but hit a wall at the very beginning. I was trying to write code that just opens a file, based on an example here, but I have a problem with includes and libraries.
I installed the HDF5 package using MacPorts (hdf5 version 1.12.0) and when I’m compiling I’m using:

clang++ -g -I<path_to_include> -c main.cpp -o <project_path>/main.o (for building)

clang++ -L<path_to_libs> -o <project_path>/bin/dl <project_path>/main.o <path_to_libs>/libhdf5_cpp.200.dylib (for linking)

The above works when the program does nothing, but as soon as I try to open a file using:
H5File file( FILE_NAME, H5F_ACC_RDONLY );
the linker reports:
Undefined symbols for architecture x86_64:
"_H5check_version", referenced from:
_main in main.o
"_H5open", referenced from:
_main in main.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)

Does anyone know where the mistake is, or have a better example to follow?

Thanks in advance for all help!


#2

Hi @wojciech.pudelko

On the speed front:

I do not know about the HDF Group's Python optimisations, but here is my anecdotal experience:

  • I used the HDF5 C API to gather data (long high-speed imaging runs, 100+ GB); it's nice and quick.
  • I use h5py in Python to visualise the data: I am able to seamlessly scroll through the images of a 100+ GB file, with no lag and low memory use.

For my application the C API writing was ~3 times faster than the h5py reading… but that is incredibly anecdotal - please ignore this!

It all depends on the chunking

The chunk size should be optimised so that as little loading and unloading as possible is needed, and so that all of the HDF5 chunk cache is used (1 MB by default, but this can be expanded).
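To put rough numbers on that, here is a small sketch in plain Python (no HDF5 needed) that estimates how many uncompressed chunks fit in the cache. The helper names and the 512 x 512 16-bit frame layout are my own assumptions for illustration, not anything from your file. As far as I know, h5py lets you enlarge the cache through the rdcc_nbytes argument of h5py.File, but check the docs for your version.

```python
import math

# HDF5's default raw data chunk cache size (1 MiB), as mentioned above.
DEFAULT_CACHE_BYTES = 1024 * 1024


def chunk_nbytes(chunk_shape, itemsize):
    """Size in bytes of one uncompressed chunk."""
    return math.prod(chunk_shape) * itemsize


def chunks_that_fit(chunk_shape, itemsize, cache_bytes=DEFAULT_CACHE_BYTES):
    """Number of whole chunks the cache can hold at the same time."""
    return cache_bytes // chunk_nbytes(chunk_shape, itemsize)


# Hypothetical layout: one 512 x 512 frame of 16-bit pixels per chunk.
frame_chunk = (1, 512, 512)
print(chunk_nbytes(frame_chunk, itemsize=2))     # 524288
print(chunks_that_fit(frame_chunk, itemsize=2))  # 2
```

If the result is 0, a single chunk is larger than the cache, which (as I understand it) means chunks cannot be kept around between reads; either shrink the chunks or grow the cache.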

If your chunks are rows and you're processing in columns, you need to load multiple chunks to get the data making up one column, and lots of time is wasted reading in whole chunks for a single value…

There is also plenty of further tinkering one can do, and it should be exposed in the h5py Python API, but chunk design is the main thing, as far as I know.
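The row-versus-column effect is easy to count. This is only a sketch in plain Python: chunks_touched is a hypothetical helper I made up, and the 10,000 x 10,000 dataset is an assumed example. It counts how many chunks a rectangular selection overlaps for a given chunk shape.

```python
def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapped by a rectangular selection.

    selection: one (start, stop) pair per axis, stop exclusive.
    chunk_shape: chunk length along each axis.
    """
    count = 1
    for (start, stop), chunk_len in zip(selection, chunk_shape):
        first = start // chunk_len       # index of first chunk hit on this axis
        last = (stop - 1) // chunk_len   # index of last chunk hit on this axis
        count *= last - first + 1
    return count


n = 10_000                      # assumed dataset shape: n x n
one_column = ((0, n), (3, 4))   # every row, column 3 only

print(chunks_touched(one_column, (1, n)))      # row chunks: 10000
print(chunks_touched(one_column, (n, 1)))      # column chunks: 1
print(chunks_touched(one_column, (100, 100)))  # square chunks: 100
```

With row-shaped chunks, reading a single column touches every chunk in the file. Square-ish chunks are a common compromise when you need both access patterns.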

Linker errors, symbols not found

Not sure if this helps, but on Windows with HDF5 1.12.0, to use the dynamic libraries:

  1. Use the import libraries, labelled like: hdf5.lib in the libs folder.
  2. Include the H5_BUILT_AS_DYNAMIC_LIB compiler flag

To add to the confusion, the libs folder also contains the static libraries, labelled with lib prefixes like: libhdf5.lib.

Hope this helps.