Slow Performance with increasing number of nodes on Lustre with Parallel HDF5

I’m trying to write a chunked, compressed file (O(100) datasets, most with dimensions 100x1950x1800) to a Lustre filesystem using parallel HDF5 1.10.6, which I compiled with HPE/SGI MPI/MPT 2.21. The write works, but the write speed to Lustre is relatively slow and drops as the number of processors increases (~300 MB/s at 840 cores/30 nodes; ~100 MB/s at 2800 cores/100 nodes). I have the Lustre stripe count set to 4 with a stripe size of 1M, and my application writes from all nodes. Unfortunately, the chunks don’t line up exactly with the subdomain sizes, but they should be pretty close. The final output size is ~150 GB uncompressed/~3-19 GB compressed.
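As a rough sanity check on the numbers above, the arithmetic can be sketched in a few lines of C. This is only a back-of-the-envelope sketch: the 4-byte element size, the count of exactly 100 datasets, and the 65-element chunk and subdomain extents are assumptions for illustration, and the helper names (`dataset_bytes`, `write_seconds`, `chunks_touched`) are hypothetical, not part of HDF5.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Bytes in one nz x ny x nx dataset. elem_bytes is an assumption:
   the element type is not stated in the post. */
uint64_t dataset_bytes(uint64_t nz, uint64_t ny, uint64_t nx, uint64_t elem_bytes)
{
    return nz * ny * nx * elem_bytes;
}

/* Seconds needed to move total_bytes at a sustained rate in MB/s
   (1 MB taken as 10^6 bytes). */
double write_seconds(uint64_t total_bytes, double mb_per_s)
{
    return (double)total_bytes / (mb_per_s * 1.0e6);
}

/* Number of chunks along one dimension that the index range
   [offset, offset + count) intersects, for a given chunk extent. */
uint64_t chunks_touched(uint64_t offset, uint64_t count, uint64_t chunk)
{
    assert(chunk > 0);
    if (count == 0)
        return 0;
    uint64_t first = offset / chunk;
    uint64_t last = (offset + count - 1) / chunk;
    return last - first + 1;
}

/* Print the estimates for the sizes quoted in the post. */
void print_estimates(void)
{
    const uint64_t one = dataset_bytes(100, 1950, 1800, 4); /* 4-byte elements assumed */
    const uint64_t total = 100 * one;                       /* "O(100)" taken as 100 */

    printf("one dataset : %" PRIu64 " bytes\n", one);
    printf("all datasets: %" PRIu64 " bytes\n", total);
    printf("time at 300 MB/s: %.0f s\n", write_seconds(total, 300.0));
    printf("time at 100 MB/s: %.0f s\n", write_seconds(total, 100.0));

    /* Hypothetical extents: a 65-element subdomain shifted by 60 against
       65-element chunks straddles two chunks instead of one. */
    printf("chunks touched: %" PRIu64 "\n", chunks_touched(60, 65, 65));
}
```

Under these assumptions one dataset is 1.404 GB and 100 of them come to about 140 GB, which is consistent with the ~150 GB quoted above; that is roughly 468 s at 300 MB/s and 1404 s at 100 MB/s. The `chunks_touched` helper illustrates why misaligned subdomains matter: any chunk straddled by two subdomains has to be assembled from more than one rank before it can be compressed.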

We are opening our HDF5 files like:
access_id = H5Pcreate (H5P_FILE_ACCESS);
H5Pset_fapl_mpio (access_id,MPI_COMM_WORLD,MPI_INFO_NULL);
flags = H5F_ACC_RDWR;
fileid_h5 = H5Fopen (locfn,flags,access_id);
hdferr = H5Pclose (access_id);

and writing our datasets like:
fspcid = H5Screate_simple (*f_ndims,f_dims,f_dims);
propid = H5Pcreate (H5P_DATASET_CREATE); // dataset creation property list
herr = H5Pset_chunk (propid,*f_ndims,chunk_size);
dsetid = H5Dcreate (fileid_h5,dname,memtype,fspcid,H5P_DEFAULT,propid,H5P_DEFAULT);
H5Pclose (propid);
H5Sclose (fspcid);
fspcid = H5Dget_space (dsetid);
mspcid = H5Screate_simple (*m_ndims,m_dims,NULL); // memory data space
herr = H5Sselect_hyperslab (fspcid,H5S_SELECT_SET,f_offset,f_stride,f_count,f_block);
herr = H5Sselect_hyperslab (mspcid,H5S_SELECT_SET,m_offset,m_stride,m_count,m_block);
propid = H5Pcreate (H5P_DATASET_XFER);
H5Pset_dxpl_mpio (propid,H5FD_MPIO_COLLECTIVE);
herr = H5Dwrite (dsetid,memtype,mspcid,fspcid,propid,buf);
H5Pclose (propid);
herr = H5Sclose (mspcid);
herr = H5Sclose (fspcid);
herr = H5Dclose (dsetid);

Is there anything obvious that I’m missing? I’ve tried adjusting the MPI-IO parameters as suggested here: https://support.hdfgroup.org/HDF5/faq/parallel.html , but most settings gave me worse performance than leaving them unset. Is there documentation somewhere on suggested tuning, or an open-source project that gets good HDF5 write performance on Lustre that we could compare against?