Many small writes to a large dataset

Hi all, I would like to learn about your experience with performing many small writes to a large dataset.

The task is converting data from many small files into one big file.

Source files:
52000 files, each with shape nblock * nz * ny * nx * ncomps,
where nblock = 15 is the number of blocks per file, ncomps is the number of variables, and nz = 64, ny = 100, nx = 100 is the block size.
The number of blocks along z, y, x is 78, 100, 100, so there are 78 * 100 * 100 = 7.8e5 blocks in total.

The target HDF5 dataset for one variable has shape Nz * Ny * Nx,
where Nz = nz * nblock_z = 64 * 78 = 4992, Ny = ny * nblock_y = 100 * 100 = 1e4, and Nx = nx * nblock_x = 100 * 100 = 1e4.
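
For concreteness, here is a minimal h5py sketch of the target layout I have in mind (the file name, dataset name, and dtype are just placeholders):

```
import h5py

# Block size and number of blocks along each axis (from the description above)
nz, ny, nx = 64, 100, 100
nblock_z, nblock_y, nblock_x = 78, 100, 100

# Full target dataset for one variable: 4992 x 10000 x 10000
Nz, Ny, Nx = nz * nblock_z, ny * nblock_y, nx * nblock_x

with h5py.File("target.h5", "w") as f:
    # One large dataset per variable (placeholder name and dtype)
    f.create_dataset("var0", shape=(Nz, Ny, Nx), dtype="f4")
```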

My approach is to iterate over all files, read one block of data (nz * ny * nx * ncomps), and write it to the corresponding contiguous location in the target HDF5 dataset, as in the sketch below.
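
In code, the loop looks roughly like this (a sketch; read_block and file_index are placeholders for my source-format reader and file list):

```
import h5py
import numpy as np

nz, ny, nx = 64, 100, 100

def read_block(path, iblock, ivar):
    """Placeholder: return one (nz, ny, nx) block of one variable from one source file."""
    return np.zeros((nz, ny, nx), dtype="f4")

# Placeholder: for each source file, the list of (block index, block coordinates)
file_index = []  # e.g. [("file_00000.dat", [(0, (0, 0, 0)), ...]), ...]

with h5py.File("target.h5", "a") as f:
    dset = f["var0"]  # dataset created as in the sketch above
    for path, blocks in file_index:
        for iblock, (kz, ky, kx) in blocks:
            data = read_block(path, iblock, ivar=0)
            z0, y0, x0 = kz * nz, ky * ny, kx * nx
            # One small hyperslab write per block
            dset[z0:z0 + nz, y0:y0 + ny, x0:x0 + nx] = data
```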

If I do parallel writing to the target file, it is extremely slow. For example:
if I use only 1 processor, writing each block of data takes, say, 5 seconds;
if I use 24 processors, it takes, say, 200 seconds.
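
For reference, the parallel version is essentially the same loop with the file opened through the MPI-IO driver, roughly like this (a sketch; it needs an MPI-enabled h5py build, and the block assignment per rank is a placeholder):

```
from mpi4py import MPI
import h5py
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nz, ny, nx = 64, 100, 100

# Placeholder: block coordinates assigned to this rank, disjoint across ranks
my_blocks = []  # e.g. all_blocks[rank::comm.Get_size()]

with h5py.File("target.h5", "a", driver="mpio", comm=comm) as f:
    dset = f["var0"]  # dataset created as in the sketch above
    for kz, ky, kx in my_blocks:
        data = np.zeros((nz, ny, nx), dtype="f4")  # placeholder for the real block read
        z0, y0, x0 = kz * nz, ky * ny, kx * nx
        # Each rank issues independent small hyperslab writes
        dset[z0:z0 + nz, y0:y0 + ny, x0:x0 + nx] = data
```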

Of course, the writing speed depends on many factors, such as the disk setup and striping (I’m using a Lustre file system).
Something I’m working on is using chunks, but on my Lustre system it fails with ADIOI_Set_lock or ADIO_Offset errors that I need to talk to our tech experts about.
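
The chunked variant I’m trying simply creates the dataset with the chunk shape equal to one block, something like this (again a sketch with placeholder names):

```
import h5py

nz, ny, nx = 64, 100, 100
Nz, Ny, Nx = 64 * 78, 100 * 100, 100 * 100

with h5py.File("target_chunked.h5", "w") as f:
    # Chunk shape = one block, so each block write maps onto exactly one chunk
    f.create_dataset("var0", shape=(Nz, Ny, Nx), dtype="f4",
                     chunks=(nz, ny, nx))
```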

Do you have any ideas on how I could improve the concurrent writing of such small data? Maybe I am missing a simple solution? Thank you very much.