Parallel write on VAST filesystem (HDF5 with ADIOS2)

Dear community,

We are currently optimizing our GPU-parallel code, which writes its output via the ADIOS2 library, to run on TACC Vista, which uses a VAST filesystem. We are experiencing either extreme performance regressions during the HDF5 output or file accessibility errors during the write.

More details: The main issue is severe performance degradation during ADIOS2 calls, and the degree of slowdown varies with our MPI decomposition. Profiling with NVIDIA Nsight Systems (nsys) shows a slowdown factor of approximately 300x for decomposition along one specific direction. We have tried various compiler combinations (GNU and NVCC) as well as both the provided parallel HDF5 module and a self-built HDF5, but the issue persists.

So far, we have narrowed the issue down to parallel writes by multiple processes into a single HDF5 file on VAST (everything works fine on Lustre). As mentioned, performance also depends heavily on the mesh decomposition: for a 2D test array, performance is acceptable when the MPI decomposition is along one direction, but slows down substantially when it is along the other direction. VAST does not support setting a stripe count or stripe size.
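To make the access pattern concrete, here is a minimal sketch of the kind of collective write we are testing (array size, file and dataset names are illustrative only; our production code goes through ADIOS2):

```c
/* Minimal reproducer sketch: every rank writes its block of a global 2D
 * array into one shared HDF5 file via MPI-IO (collective write).
 * Decomposing along dimension 0 gives each rank a contiguous slab in the
 * file; decomposing along dimension 1 instead gives a strided, non-contiguous
 * pattern -- the case that is dramatically slower for us on VAST. */
#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Global 4096x4096 array, decomposed along dimension 0.
     * (Assumes NX is divisible by nprocs for brevity.) */
    const hsize_t NX = 4096, NY = 4096;
    hsize_t count[2]  = { NX / nprocs, NY };
    hsize_t offset[2] = { rank * count[0], 0 };

    double *buf = malloc(count[0] * count[1] * sizeof(double));
    for (hsize_t i = 0; i < count[0] * count[1]; i++) buf[i] = (double)rank;

    /* One shared file, opened with the MPI-IO driver. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("test.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[2] = { NX, NY };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, filespace,
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects its hyperslab and all ranks write collectively. */
    hid_t memspace = H5Screate_simple(2, count, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Dclose(dset);
    H5Sclose(filespace); H5Fclose(file); H5Pclose(fapl);
    free(buf);
    MPI_Finalize();
    return 0;
}
```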

We have already done substantial troubleshooting together with the TACC support team, including several environment variables that control file locking, all without success.

We would be very interested in hearing about others' experiences with the VAST filesystem and what could be done to mitigate these performance regressions on VAST compared to writing the same data to Lustre.

Thank you very much!
Jens

Hi Jens,
thanks for mentioning this. I'm with VAST Data and can't find any record of someone from TACC reaching out to us about this, but I will be very happy to help you, including jumping on a call to look at it live. Please feel free to reach out to me via email (sven at vastdata.com), or have the TACC support team contact me via our normal support channels.

For HDF5 writes through MPI-IO, the typical pitfall is that Open MPI (in contrast to e.g. MPICH) by default locks the entire file for each write operation, for historical reasons. This isn't necessary with VAST, so the simple solution is to add the mpirun parameter `--mca fs_ufs_lock_algorithm 3` to change this (0 = auto, 1 = never lock, 2 = lock the entire file, 3 = lock byte ranges of the file).
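To make it concrete, the parameter can be passed on the mpirun command line or exported as an environment variable (the executable name and rank count below are just placeholders):

```sh
# On the mpirun command line:
mpirun --mca fs_ufs_lock_algorithm 3 -np 64 ./my_app

# Or as an environment variable picked up by Open MPI's MCA system:
export OMPI_MCA_fs_ufs_lock_algorithm=3
mpirun -np 64 ./my_app
```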

For some MPI-IO based applications it also gives a nice extra boost to disable caching on the client side. The POSIX interface has the O_DIRECT flag for this, but MPI-IO doesn't have a flag to request it. Thus, we provide an LD_PRELOAD library that brings the O_DIRECT benefits to MPI-IO based applications as well. It actually works with any filesystem, but best with VAST of course :wink: You can find it here, and I'll be happy to show you how it works if you want: https://github.com/vast-data/vast-preload-lib (LD_PRELOAD library to inject O_DIRECT into file I/O)
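Usage is a one-liner once the library is built; the library path/name and executable below are illustrative (see the repo README for the exact name and build steps):

```sh
# Preload the library on all ranks; Open MPI's -x exports the variable to the MPI processes.
mpirun -np 64 -x LD_PRELOAD=/path/to/libvast_preload.so ./my_app
```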

All the best
Sven Breuner
