Multithreaded writing of multiple files in Java


#1

In https://forum.image.sc/t/hdf5-java-parallel/25560/1 a user was looking to write to multiple HDF5 files in parallel. While looking into the issue, I came across https://stackoverflow.com/questions/33040652/how-to-create-multiple-instances-of-the-same-library-with-jna and was wondering whether you would consider that the best workaround for the synchronized methods (short of using multiprocessing).
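
Concretely, I imagine the trick would look something like this with JNA (the copied library paths are made up for illustration):

import com.sun.jna.NativeLibrary;

public class TwoHdf5Copies {
    public static void main(String[] args) {
        // JNA keys its library cache on the path, so two copies of the
        // same binary under different names load as fully independent
        // native instances, each with its own global state.
        NativeLibrary h5a = NativeLibrary.getInstance("/tmp/hdf5_copy_0.so");
        NativeLibrary h5b = NativeLibrary.getInstance("/tmp/hdf5_copy_1.so");
        System.out.println(h5a != h5b); // true: two separate handles
    }
}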

~Josh


#2

Perhaps a better starting point: is there any known way to write to multiple files simultaneously from a single Java process using HDF5?


#3

Hi Josh, I can describe what we did (Bitplane Imaris developer here) to achieve parallel, thread-safe reading in C/C++ before SWMR was added to HDF5, in the hope that it helps.

The principle is to dynamically load multiple instances of hdf5.dll. In C/C++, the same binary cannot be loaded more than once into a process (all load attempts return handles to the same image in memory, and I assume the same happens in Java), but dynamic loading lets you load simple copy-pasted copies of the original DLL independently of one another.

Pseudo-code for parallel writing would look like this:

// returns a structure containing hdf5 function pointers
function load_hdf5_lib( dll_name )
    dll = LoadLibrary( dll_name ) // system-dependent call
    lib.H5Fopen = GetProcAddress( dll , "H5Fopen" ) // again system-dependent
    lib.H5Dopen = GetProcAddress( dll , "H5Dopen1" ) // careful with macro definitions
    lib.H5Dwrite = GetProcAddress( dll , "H5Dwrite" )
    lib.H5Dclose = GetProcAddress( dll , "H5Dclose" )
    lib.H5Fclose = GetProcAddress( dll , "H5Fclose" )
    // assign here all other required functions. yes, this is tedious
    return lib

// writes data_array into file_name using the functions in dll_name
function write_hdf5( dll_name , file_name , data_array )
    lib = load_hdf5_lib( dll_name )
    file = lib.H5Fopen( file_name , RW )
    data_set = lib.H5Dopen( file , "data_set" )
    lib.H5Dwrite( data_set , data_array )
    lib.H5Dclose( data_set ) // close the handles so the file is flushed properly
    lib.H5Fclose( file )

// writes data_array into 3 files, one thread per file
function write_hdf5_in_parallel( data_array , file_name_0 , file_name_1 , file_name_2 )
    t0 = run_on_thread( write_hdf5( "hdf5_copy_0.dll" , file_name_0 , data_array ) )
    t1 = run_on_thread( write_hdf5( "hdf5_copy_1.dll" , file_name_1 , data_array ) )
    t2 = run_on_thread( write_hdf5( "hdf5_copy_2.dll" , file_name_2 , data_array ) )
    join( t0 , t1 , t2 ) // wait until all writers are done

#4

Hi Igor,

Thanks for the details, @igor. Reproducing what Bitplane did in C++ is certainly what we'd like to see in Java. If there are no other suggestions, I'll give your approach a try.
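
Roughly, this is the kind of untested sketch I have in mind for the Java side (JNA 5.x; the copy naming, the hid_t-to-long mapping for HDF5 >= 1.10, and the minimal set of bound functions are my own assumptions, and the files are assumed to exist already):

import com.sun.jna.Library;
import com.sun.jna.Native;

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

// Minimal subset of the HDF5 C API; hid_t is mapped to long (it is
// 64-bit in HDF5 >= 1.10; use int on 1.8.x instead).
interface Hdf5 extends Library {
    long H5Fopen(String name, int flags, long faplId);
    int H5Fclose(long fileId);
    // H5Dopen2 / H5Dwrite / H5Dclose would be declared the same way
}

public class ParallelHdf5Write {
    static final int H5F_ACC_RDWR = 1;  // from H5Fpublic.h
    static final long H5P_DEFAULT = 0;  // from H5Ppublic.h

    // Copy the original library to a unique name and load the copy.
    // JNA keys its cache on the path, so each copy is an independent
    // native instance with its own global lock.
    static Hdf5 loadHdf5Copy(Path original, int i) throws Exception {
        Path copy = original.resolveSibling("hdf5_copy_" + i + suffix());
        Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
        return Native.load(copy.toAbsolutePath().toString(), Hdf5.class);
    }

    static String suffix() {
        return System.getProperty("os.name").startsWith("Windows") ? ".dll" : ".so";
    }

    public static void main(String[] args) throws Exception {
        Path original = Paths.get(args[0]); // path to hdf5.dll / libhdf5.so
        String[] files = { "file_0.h5", "file_1.h5", "file_2.h5" };
        Thread[] threads = new Thread[files.length];
        for (int i = 0; i < files.length; i++) {
            Hdf5 lib = loadHdf5Copy(original, i); // one library copy per thread
            String file = files[i];
            threads[i] = new Thread(() -> {
                long fid = lib.H5Fopen(file, H5F_ACC_RDWR, H5P_DEFAULT);
                // ... open the dataset and write through this lib only ...
                lib.H5Fclose(fid);
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
    }
}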

Cheers,
~Josh


#5

The slow part of reading and writing is usually the compression. You can use H5Dread_chunk and H5Dwrite_chunk (since hdf5 1.10) to access the compressed streams without any filter processing, and compress/decompress them separately with the gzip (zlib) library, which can be done in parallel. There is a little overhead inside hdf5, but H5Dread/write_chunk is mostly direct disk access. You can implement a pipeline of reading-decompressing and/or compressing-writing.

If the bottleneck is the multi-threaded compression or decompression (CPU-bound), or if the single-threaded reading or writing already saturates the disk (I/O-bound), you would not gain anything more by parallelizing the file accesses. This solution may be simpler and more elegant than dynamically loading multiple copies of the library, and it would also provide a boost for a single (or a few) large files.
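
To illustrate, a compress-in-parallel / write-sequentially pipeline might look like the following untested Java/JNA sketch. It assumes HDF5 >= 1.10.2 for H5Dwrite_chunk, a 64-bit JVM (so hid_t and size_t map to long), an existing chunked dataset with the gzip filter, and chunk offsets aligned to the chunk grid:

import com.sun.jna.Library;
import com.sun.jna.Native;

import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.Deflater;

// Only the direct-chunk entry point is bound here.
interface Hdf5Direct extends Library {
    Hdf5Direct INSTANCE = Native.load("hdf5", Hdf5Direct.class);
    // herr_t H5Dwrite_chunk(hid_t dset, hid_t dxpl, uint32_t filter_mask,
    //                       const hsize_t *offset, size_t size, const void *buf)
    int H5Dwrite_chunk(long dsetId, long dxplId, int filterMask,
                       long[] offset, long dataSize, byte[] buf);
}

public class ChunkPipeline {
    static final long H5P_DEFAULT = 0;

    // zlib-compress one chunk; this is the same stream format the HDF5
    // gzip filter stores on disk.
    static byte[] deflateChunk(byte[] raw) {
        Deflater d = new Deflater(Deflater.DEFAULT_COMPRESSION);
        d.setInput(raw);
        d.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream(raw.length / 2 + 64);
        byte[] tmp = new byte[8192];
        while (!d.finished()) out.write(tmp, 0, d.deflate(tmp));
        d.end();
        return out.toByteArray();
    }

    // Compress all chunks on a thread pool, then write them sequentially.
    // Filter mask 0 tells HDF5 that all filters were already applied, so
    // the buffer lands on disk as-is, bypassing the filter pipeline.
    static void writeChunks(long dsetId, byte[][] rawChunks, long[][] offsets)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        List<Future<byte[]>> compressed = new ArrayList<>();
        for (byte[] raw : rawChunks)
            compressed.add(pool.submit(() -> deflateChunk(raw)));
        for (int i = 0; i < rawChunks.length; i++) {
            byte[] buf = compressed.get(i).get(); // wait for this chunk
            Hdf5Direct.INSTANCE.H5Dwrite_chunk(dsetId, H5P_DEFAULT, 0,
                    offsets[i], buf.length, buf);
        }
        pool.shutdown();
    }
}

The read side would be symmetric: H5Dread_chunk on a single thread feeding java.util.zip.Inflater jobs on the pool.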