Parallel HDF5 write with irregular size in one dimension


#1

Hi,

I am about to implement a parallel writer using the HDF5 C-API.

The data I need to write is distributed over the different partitions in contiguous memory (C arrays). To keep things simple, let the rank be 1 (1D/vector). Each process’ array has a different size (small deviation, less than 10%).
So the data is irregular in just one dimension.
The data can be sparse or dense, for the sparse case I wish to use compression (chunking implied).
From this example I started experimenting and managed to get compression/chunking working with same-sized arrays. Without chunking I even got different-sized arrays running.
Because I am using collective writes, I assume that the chunk size must be the same for all processes. The whole data may fit into memory, so the chunk size could cover the whole data set.

Could you please provide some links to examples with chunking and collective writes, and give hints on how they could be adapted to different-sized arrays?


#2

Let’s maybe take a step back and revisit a few basics:

An HDF5 dataset with chunked layout has a chunk size (or shape in > 1D) that is fixed at dataset creation time. There is only one chunk size (shape) per dataset, but different datasets tend to have different chunk sizes. This has nothing to do with doing sequential or parallel I/O. H5D[read,write] work with dataspace selections, which are logical and have nothing to do with chunks (which are physical).

Does performance vary depending on how (mis-)matched array partitioning and chunking are? Yes, but H5D[read,write] will do the right thing regardless. Any rank can read or write any portion in the dataset in the HDF5 file. If you can arrange it that every process writes the same portion of your array and those portions align with chunks, great! But don’t make your life miserable/code complex for a few seconds of runtime.
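To make the logical/physical distinction concrete, here is a minimal C sketch of which chunks a contiguous 1D selection touches. The type `chunk_range` and the function `chunks_touched` are hypothetical names made up for illustration; they are not part of the HDF5 API, and the library does this bookkeeping internally.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [first, last) of chunk indices touched by a 1D selection.
 * Hypothetical helper type for illustration, not an HDF5 type. */
typedef struct {
    size_t first; /* index of the first touched chunk */
    size_t last;  /* one past the index of the last touched chunk */
} chunk_range;

/* Map the element range [offset, offset + count) onto chunks of chunk_size
 * elements. An empty selection yields an empty range (first == last). */
chunk_range chunks_touched(size_t offset, size_t count, size_t chunk_size)
{
    chunk_range r;

    assert(chunk_size > 0);
    r.first = offset / chunk_size;
    if (count == 0) {
        r.last = r.first;
    } else {
        r.last = (offset + count - 1) / chunk_size + 1;
    }
    return r;
}
```

With a chunk size of 8, a rank that writes 2 elements at offset 2 touches only chunk 0, and so does any other rank writing within the first 8 elements. Several ranks sharing one chunk is therefore a normal situation, not an error.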

I don’t understand your distinction between same-sized and different-sized arrays. Do you mean MPI-process local arrays, the portions that each rank reads or writes? Again, that’s fine as long as the numbers (lengths and selections) add up and it has nothing to do with chunking. If two processes happen to read/write from/to the same chunk, H5D[read,write] will take care of that for you.
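As a rough illustration of what "the numbers add up" means for unevenly sized local arrays: each rank's selection starts where the previous ranks' elements end, so the offsets are an exclusive prefix sum of the local lengths, and the dataset length is their total. The sketch below does this serially in plain C; `exclusive_offsets` is a hypothetical helper name. In an actual MPI program one would presumably obtain the same numbers with `MPI_Exscan` (whose result is undefined on rank 0, so that rank sets its offset to 0) and `MPI_Allreduce`, and then pass offset and local length as `start` and `count` to `H5Sselect_hyperslab`.

```c
#include <stddef.h>

/* Serial sketch of an exclusive prefix sum over per-rank element counts.
 * On return, offsets[r] is the sum of counts[0..r-1], i.e. the position in
 * the global 1D dataset where rank r's elements begin. The return value is
 * the total number of elements, i.e. the length of the dataset.
 * Hypothetical helper for illustration only. */
size_t exclusive_offsets(const size_t *counts, size_t nranks, size_t *offsets)
{
    size_t total = 0;

    for (size_t r = 0; r < nranks; ++r) {
        offsets[r] = total;
        total += counts[r];
    }
    return total;
}
```

For local lengths 10, 11, 9 and 10 on four ranks, this gives offsets 0, 10, 21 and 30 and a dataset length of 40. None of these numbers depends on the chunk size.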

What’s a typical number of MPI ranks? How much data is each rank reading or writing?

Do you want to give us an MWE or reformulate your problem?

G.


#3

Thank you for your answer.

I have modified this example to Hyperslab_by_custom_chunk.cpp (4.0 KB).

I tried to change as little as possible; all lines I changed are marked with "//HDFFORUM".

  1. The data set dimension is changed to 1D.
  2. ‘my_chunk_dims’ is added as a data-independent input for H5Pset_chunk.

Experiment 1:
my_chunk_dims = 2; (same as the data dimension, CH_NX=2)
call “mpirun -np 4 hyperslab_custom” -> h5 is fine
call “LD_PRELOAD=libdarshan.so mpirun -np 4 ./hyperslab_custom” -> h5 is fine

Experiment 2:
my_chunk_dims = 8; (can be 1 or 3-8; NX=8)
call “mpirun -np 4 hyperslab_custom” -> h5 is fine
call “LD_PRELOAD=libdarshan.so mpirun -np 4 ./hyperslab_custom” -> no h5 file, crash
errorreport (6.7 KB)

I need to reformulate and split the original problem.
As a first step, I need to find out whether instrumentation with Darshan itself has a problem, or whether it reveals a problem that would otherwise go undiscovered.

Is there something I have done wrong in experiment 2? Is it supposed to work?

Darshan-runtime 3.2.1 (with --with-hdf5=1)
hdf5-openmpi-1.12.0-2
openmpi-4.0.5-2


#4

My apologies for us putting such crappy examples online. It’s embarrassing that something so simple takes almost 150 lines of code. Forget about the code. Can you describe in words what you are trying to achieve? In other words, I’m looking for a description like this:

I have a 1D array A of N 32-bit integers unevenly (N_1,…,N_P) spread across P MPI processes. I would like to create a chunked dataset D w/ chunk size C and (collectively) write A to D. How do I do that?

Is that accurate? G.

P.S. Are you confident that the Darshan runtime was built with the same HDF5 version?


#5

Darshan:
To make sure Darshan was built with the same version, I rebuilt Darshan. But I guess it should also have worked if the versions just met this requirement:

NOTE: HDF5 instrumentation only works on HDF5 library versions >=1.8, and further requires that the HDF5 library used to build Darshan and the HDF5 library being linked in either both be version >=1.10 or both be version <1.10

–> Nothing has changed, I get the same error message.
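The version rule from the note quoted above can be written down as a small predicate. This is only a sketch of the rule as worded in that note; the function names are hypothetical and are not part of Darshan or HDF5. The version of the HDF5 library actually linked at runtime can be queried with `H5get_libversion`.

```c
#include <stdbool.h>

/* True if version major.minor is at least ref_major.ref_minor. */
static bool version_at_least(unsigned major, unsigned minor,
                             unsigned ref_major, unsigned ref_minor)
{
    return major > ref_major || (major == ref_major && minor >= ref_minor);
}

/* Sketch of the rule in the quoted note: both HDF5 versions must be >= 1.8,
 * and they must be on the same side of 1.10 (both >= 1.10 or both < 1.10).
 * "build" is the HDF5 used to build Darshan, "link" the HDF5 linked into
 * the application. Hypothetical helper, not a Darshan function. */
bool darshan_hdf5_compatible(unsigned build_major, unsigned build_minor,
                             unsigned link_major, unsigned link_minor)
{
    bool build_new = version_at_least(build_major, build_minor, 1, 10);
    bool link_new = version_at_least(link_major, link_minor, 1, 10);

    if (!version_at_least(build_major, build_minor, 1, 8) ||
        !version_at_least(link_major, link_minor, 1, 8)) {
        return false;
    }
    return build_new == link_new;
}
```

By this rule, a Darshan built against any 1.10 or 1.12 release would be acceptable for an application linked against 1.12.0, while a 1.8 build would not.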

Question:

I have a 1D array A of N 32-bit integers unevenly (N_1,…,N_P) spread across P MPI processes. I would like to create a chunked dataset D w/ chunk size C and (collectively) write A to D. How do I do that?

That’s accurate!
Additionally, it would be great if you could point to a “non-crappy” example.


#6

example.cc (2.4 KB)

OK, here’s an example that I believe is neither trivial nor misleading, and that has some nutritional value. I haven’t tried to LD_PRELOAD Darshan, but I don’t see why that would cause any issues. Let me know how that goes! G.