Multithreaded writing of multiple files in Java

In https://forum.image.sc/t/hdf5-java-parallel/25560/1 a user was looking to write to multiple different HDF5 files in parallel. In looking into the issue, I came across https://stackoverflow.com/questions/33040652/how-to-create-multiple-instances-of-the-same-library-with-jna and was wondering if you would consider this the best workaround for the synchronized methods (without using multiprocessing).

~Josh

Perhaps a better starting point is: is there any known way to write to multiple files simultaneously from a single Java HDF process?

Hi Josh, I can describe what we did (Bitplane Imaris developer here) to achieve parallel thread-safe reading in C/C++ before SWMR was added to hdf5, hoping that it can help.

The principle is to dynamically load multiple instances of hdf5.dll. In C/C++, loading the same binary several times does not give independent instances (all loads share the same memory space; I assume the same happens in Java), but dynamic linking allows simple copy-pasted copies of the original dll to be loaded independently.

Pseudo-code for parallel writing would look like:

// returns a structure containing hdf5 function pointers
function load_hdf5_lib( dll_name )
    dll = LoadLibrary( dll_name ) // system dependent call
    lib.H5Fopen = GetProcAddress( dll , "H5Fopen" ) // again system dependent
    lib.H5Dopen = GetProcAddress( dll , "H5Dopen1" ) // careful with macro definitions
    lib.H5Dwrite = GetProcAddress( dll , "H5Dwrite" )
    // assign here all other required functions. yes, this is tedious
    return lib

// writes data_array into file_name using the functions in dll_name
function write_hdf5( dll_name , file_name , data_array )
    lib = load_hdf5_lib( dll_name )
    file = lib.H5Fopen( file_name , RW )
    data_set = lib.H5Dopen( file , "data_set" )
    lib.H5Dwrite( data_set, data_array )

// writes data_array into 3 files, one thread per file
function write_hdf5_in_parallel( data_array , file_name_0 , file_name_1 , file_name_2 )
    run_on_thread( write_hdf5( "hdf5_copy_0.dll" , file_name_0 , data_array) )
    run_on_thread( write_hdf5( "hdf5_copy_1.dll" , file_name_1 , data_array) )
    run_on_thread( write_hdf5( "hdf5_copy_2.dll" , file_name_2 , data_array) )
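
On the Java side, the first step of the pseudo-code above (producing independently loadable copies of the library) could be sketched roughly as follows. This is only a sketch under assumptions: the class name `LibraryCopies` and the `_copy_N` naming are made up for illustration, and the actual loading step is left as a comment because it depends on the binding used (with JNA it would presumably be a `Native.load` call on each copy's absolute path, with a hypothetical interface declaring the required H5 functions).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

class LibraryCopies {

    /**
     * Copies the native library `original` into `targetDir` `count` times,
     * e.g. hdf5.dll -> hdf5_copy_0.dll, hdf5_copy_1.dll, ...
     * Returns the paths of the copies in order.
     */
    static List<Path> makeCopies(Path original, Path targetDir, int count) {
        String name = original.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String stem = dot < 0 ? name : name.substring(0, dot);
        String ext = dot < 0 ? "" : name.substring(dot);

        List<Path> copies = new ArrayList<>();
        try {
            for (int i = 0; i < count; i++) {
                Path copy = targetDir.resolve(stem + "_copy_" + i + ext);
                Files.copy(original, copy, StandardCopyOption.REPLACE_EXISTING);
                copies.add(copy);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return copies;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: LibraryCopies <path-to-native-hdf5-library> <count>");
            return;
        }
        Path original = Paths.get(args[0]);
        int count = Integer.parseInt(args[1]);
        for (Path copy : makeCopies(original, original.toAbsolutePath().getParent(), count)) {
            // Assumption: each copy would now be loaded as its own library instance,
            // e.g. with JNA: Native.load(copy.toAbsolutePath().toString(), Hdf5Api.class)
            // where Hdf5Api is a hypothetical interface listing the needed H5 functions.
            System.out.println("created " + copy);
        }
    }
}
```

Each thread would then use exactly one of the copies, mirroring the one-dll-per-thread layout of the pseudo-code. Whether the operating system really treats the copies as independent libraries should be verified on each target platform.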

Hi Igor,

Thanks for the details, @igor. Reproducing in Java what Bitplane did in C++ is certainly what we’d like to see. If there are no other suggestions, I’ll give yours a try.

Cheers,
~Josh


The slow part of reading and writing is usually the compression. You can use H5Dread_chunk and H5Dwrite_chunk (since hdf5 1.10) to access the compressed streams without processing, and compress/decompress them separately using the gzip library, which can be done in parallel. There is a little overhead in hdf5, but most of H5Dread/write_chunk is direct disk access. You can implement a pipeline of reading-decompressing and/or compressing-writing. If the bottleneck is the multi-threaded compression or decompression (CPU-bound), or the single-threaded reading or writing is already saturating the disk capacity (I/O-bound), you would not gain anything more by parallelizing the file accesses. This solution may be simpler and more elegant than dynamically loading multiple copies of the library, and it would provide a boost for single (or few), large files as well.
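
A rough Java sketch of the compressing half of such a pipeline is shown below. It only illustrates the idea and rests on assumptions: the class and method names are invented, `java.util.zip.Deflater` is used as the compressor (its default output is a zlib stream, which I believe is what the HDF5 deflate filter stores per chunk, but this should be checked), and the actual H5Dwrite_chunk call is left as a comment because it depends on which Java binding exposes it.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class ParallelChunkCompression {

    /** Compresses one raw chunk into a zlib stream. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION);
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Decompresses a zlib stream whose uncompressed size is known (the chunk size). */
    static byte[] inflate(byte[] compressed, int rawLength) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[rawLength];
        int offset = 0;
        try {
            while (offset < rawLength && !inflater.finished()) {
                int n = inflater.inflate(out, offset, rawLength - offset);
                if (n == 0) break; // no progress possible: truncated or invalid input
                offset += n;
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException(e);
        } finally {
            inflater.end();
        }
        return out;
    }

    /** Compresses all chunks on a thread pool; results keep the input order. */
    static List<byte[]> compressAll(List<byte[]> chunks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<byte[]>> futures = new ArrayList<>();
            for (byte[] chunk : chunks) {
                Callable<byte[]> task = () -> deflate(chunk);
                futures.add(pool.submit(task));
            }
            List<byte[]> compressed = new ArrayList<>();
            for (Future<byte[]> future : futures) {
                // Assumption: this single consumer thread is where each compressed
                // chunk would be handed to H5Dwrite_chunk, one call at a time.
                compressed.add(future.get());
            }
            return compressed;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

Only the compression runs in parallel here; all HDF5 calls would stay on one thread, so the synchronized library methods are never contended.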


Multithreaded Java file writing has various uses. Developers may speed up their apps by writing to numerous files simultaneously utilising multiple threads.

Multithreaded writing in Java involves several threads writing concurrently, each to its own file. This lets a program write to many files simultaneously, which can improve throughput when the work is not already limited by the disk.

Java multithreaded writing of several files is often used with databases. Multiple threads can update database tables efficiently. This improves application performance and reliability.

Web applications use multithreaded Java file writing. Each thread writing to a distinct page efficiently updates pages. This improves user experience and application reliability.

Thread synchronisation is crucial for multithreaded Java file writing. When threads share data, access to it must be coordinated so that no thread reads or produces a partial update. This synchronisation ensures data consistency.

Writing data requires correct buffering and synchronisation. In any application, improper buffering, such as closing a file before its buffer has been flushed, can corrupt data. Buffering also helps optimise data writing.

Finally, Java multithreaded file writing speed and scalability must be considered. Different methods can boost performance efficiently. Scalability can also be improved to make the software more accessible to additional users.

Hi, I have been through your query.

Here’s a high-level outline of how you can implement parallel HDF5 file writing using JNA:

Initialize the HDF5 Library Instances:

Create multiple instances of the HDF5 library using JNA. Each instance will be responsible for working with a different HDF5 file.

Parallelize the Writing Process:

Use Java’s concurrency mechanisms (e.g., ExecutorService and Callable) to parallelize the writing process. Divide the data or tasks among different threads, each associated with a specific HDF5 library instance.

Ensure Synchronization:

If the HDF5 library is not inherently thread-safe, implement synchronization mechanisms to prevent concurrent access to the same HDF5 file. You can use synchronized blocks or other thread synchronization constructs to control access.

Write to HDF5 Files:

Within each thread, use the corresponding HDF5 library instance to write data to the associated file. Make sure to handle any exceptions that may occur during the writing process.

Cleanup and Finalization:

Properly close and release resources associated with each HDF5 library instance once the parallel writing is complete.
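
The threading part of the outline above could look roughly like the sketch below. It is not an HDF5 implementation: `ParallelFileWriter`, `writeAll` and `writeOne` are hypothetical names, and plain `java.nio` file output stands in for the HDF5 calls that each task would make through its own library instance. It is meant only to show one task per file, exception propagation, and cleanup.

```java
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelFileWriter {

    /** Writes each payload to its own file, one task per file; returns bytes written per file. */
    static Map<Path, Long> writeAll(Map<Path, byte[]> payloads, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            Map<Path, Future<Long>> futures = new LinkedHashMap<>();
            for (Map.Entry<Path, byte[]> entry : payloads.entrySet()) {
                Callable<Long> task = () -> writeOne(entry.getKey(), entry.getValue());
                futures.put(entry.getKey(), pool.submit(task));
            }
            Map<Path, Long> written = new LinkedHashMap<>();
            for (Map.Entry<Path, Future<Long>> entry : futures.entrySet()) {
                // get() waits for the task and rethrows its failure, if any
                written.put(entry.getKey(), entry.getValue().get());
            }
            return written;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a write task failed", e.getCause());
        } finally {
            pool.shutdown(); // release the worker threads in every case
        }
    }

    /**
     * Stand-in for the per-file work. In the HDF5 case this is where the task
     * would open the file, write the dataset and close the file again, using
     * the library instance assigned to it (assumption, not shown here).
     */
    static long writeOne(Path file, byte[] data) throws IOException {
        try (OutputStream out = new BufferedOutputStream(Files.newOutputStream(file))) {
            out.write(data);
        } // try-with-resources flushes and closes the stream even on failure
        return data.length;
    }
}
```

Because every task touches only its own file, no locking between tasks is needed in this sketch; a lock would only be required if two tasks could target the same file or share a library instance.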

Thanks
Raavikant