Parallel HDF5 write with irregular size in one dimension

tobias.meisel · March 16, 2021, 2:32pm

Hi,

I am about to implement a parallel writer using the HDF5 C-API.

The data I need to write is distributed across the partitions in contiguous memory (C arrays). To keep things simple, let the rank be 1 (1D/vector). Each process’s array has a different size (small deviation, less than 10%).
So the data is irregular in just one dimension.
The data can be sparse or dense; for the sparse case I wish to use compression (chunking implied).
Starting from this example, I experimented and got compression/chunking working with same-sized arrays. Without chunking I even got different-sized arrays working.
Because I am using collective writes, I assume that the chunk size must be the same for all processes. The whole data may fit into memory, so the chunk size could cover the whole dataset.

Could you please provide some links to examples with chunking and collective writes and give hints how that could be adapted to different sized arrays?

gheber · March 17, 2021, 1:54am

Let’s maybe take a step back and revisit a few basics:

An HDF5 dataset with chunked layout has a chunk size (or shape in > 1D) that is fixed at dataset creation time. There is only one chunk size (shape) per dataset, but different datasets tend to have different chunk sizes. This has nothing to do with doing sequential or parallel I/O. H5D[read,write] work with dataspace selections, which are logical and have nothing to do with chunks (which are physical).

Does performance vary depending on how (mis-)matched array partitioning and chunking are? Yes, but H5D[read,write] will do the right thing regardless. Any rank can read or write any portion in the dataset in the HDF5 file. If you can arrange it that every process writes the same portion of your array and those portions align with chunks, great! But don’t make your life miserable/code complex for a few seconds of runtime.

I don’t understand your distinction between same-sized and different-sized arrays. Do you mean MPI-process local arrays, the portions that each rank reads or writes? Again, that’s fine as long as the numbers (lengths and selections) add up and it has nothing to do with chunking. If two processes happen to read/write from/to the same chunk, H5D[read,write] will take care of that for you.
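To make "the numbers add up" concrete, here is a minimal sketch in plain C, with no MPI or HDF5 calls. It takes uneven per-rank element counts and a chunk size, and works out each rank's offset into the 1D dataset and which chunks its slice overlaps. The names slab_t, plan_slabs, and chunk_count are made up for illustration. In a real program each rank would more likely obtain its own offset with something like MPI_Exscan over the local counts and then hand offset and count to H5Sselect_hyperslab.

```c
#include <assert.h>
#include <stddef.h>

/* One rank's slice of the global 1D dataset, plus the chunks it overlaps.
   (Hypothetical helper type, not part of HDF5 or MPI.) */
typedef struct {
    size_t offset;      /* first global element owned by this rank        */
    size_t count;       /* number of elements owned by this rank          */
    size_t first_chunk; /* first chunk index touched (meaningful if count > 0) */
    size_t last_chunk;  /* last chunk index touched  (meaningful if count > 0) */
} slab_t;

/* Fill slabs[0..nranks-1] from the per-rank counts and the chunk size.
   Offsets are an exclusive prefix sum of the counts.
   Returns the total element count, i.e. the dataset extent. */
size_t plan_slabs(const size_t *counts, size_t nranks, size_t chunk,
                  slab_t *slabs)
{
    assert(chunk > 0);
    size_t running = 0;
    for (size_t r = 0; r < nranks; ++r) {
        slabs[r].offset = running;
        slabs[r].count = counts[r];
        slabs[r].first_chunk = running / chunk;
        /* An empty slice touches no chunk; report first_chunk for both. */
        slabs[r].last_chunk = counts[r] > 0
            ? (running + counts[r] - 1) / chunk
            : running / chunk;
        running += counts[r];
    }
    return running;
}

/* Number of chunks needed to cover n elements: ceil(n / chunk). */
size_t chunk_count(size_t n, size_t chunk)
{
    assert(chunk > 0);
    return (n + chunk - 1) / chunk;
}
```

With counts {3, 2, 4, 1} and a chunk size of 4, ranks 0 and 1 both land in chunk 0, and rank 1's slice straddles chunks 0 and 1. That is exactly the shared-chunk situation described above, which the library handles without every rank's slice having to align with chunk boundaries.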

What’s a typical number of MPI ranks? How much data is each rank reading or writing?

Do you want to give us an MWE or reformulate your problem?

G.

tobias.meisel · March 18, 2021, 1:52pm

Thank you for your answer.

I have modified this example to Hyperslab_by_custom_chunk.cpp (4.0 KB).

I tried to change as little as possible; all lines I changed are marked with "//HDFFORUM".

The dataset dimension is changed to 1D.
‘my_chunk_dims’ is added as a data-independent input for H5Pset_chunk.

Experiment 1:
my_chunk_dims = 2; (same as the data dimension, CH_NX=2)
call “mpirun -np 4 hyperslab_custom” → h5 is fine
call “LD_PRELOAD=libdarshan.so mpirun -np 4 ./hyperslab_custom” → h5 is fine

Experiment 2:
my_chunk_dims = 8; (can be 1 or 3-8; NX=8)
call “mpirun -np 4 hyperslab_custom” → h5 is fine
call “LD_PRELOAD=libdarshan.so mpirun -np 4 ./hyperslab_custom” → no h5 file, crash
errorreport (6.7 KB)

I need to reformulate and split the original problem.
As a first step I need to find out whether instrumentation with Darshan itself has a problem, or whether it reveals a problem that would otherwise go undiscovered.

Did I do something wrong in experiment 2, or is it supposed to work?

Darshan-runtime 3.2.1 (with --with-hdf5=1)
hdf5-openmpi-1.12.0-2
openmpi-4.0.5-2

gheber · March 23, 2021, 12:59pm

My apologies for us putting such crappy examples online. It’s embarrassing that something so simple takes almost 150 lines of code. Forget about the code. Can you describe in words what you are trying to achieve? In other words, I’m looking for a description like this:

I have a 1D array A of N 32-bit integers unevenly (N_1,…,N_P) spread across P MPI processes. I would like to create a chunked dataset D w/ chunk size C and (collectively) write A to D. How do I do that?

Is that accurate? G.

P.S. Are you confident that the Darshan runtime was built with the same HDF5 version?

tobias.meisel · March 31, 2021, 6:26am

Darshan:
To make sure Darshan was built with the same version, I rebuilt Darshan. But I guess it should also have worked if the versions used just met this requirement:

NOTE: HDF5 instrumentation only works on HDF5 library versions >=1.8, and further requires that the HDF5 library used to build Darshan and the HDF5 library being linked in either both be version >=1.10 or both be version <1.10

→ Nothing has changed, I get the same error message.

Question:

I have a 1D array A of N 32-bit integers unevenly (N_1,…,N_P) spread across P MPI processes. I would like to create a chunked dataset D w/ chunk size C and (collectively) write A to D. How do I do that?

That’s accurate!
Additionally, it would be great if you could point to a “non-crappy” example.

gheber · April 1, 2021, 12:48pm

example.cc (2.4 KB)

OK, here’s an example that I believe is neither trivial nor misleading, and that has some nutritional value. I haven’t tried to LD_PRELOAD Darshan, but I don’t see why that would cause any issues. Let me know how that goes! G.

tobias.meisel · April 12, 2021, 9:55am

Thank you for this nice example. Unfortunately I ran into exactly the same problem with Darshan. Without Darshan everything works fine. Good to know that the problem is not within the example script.
The next step would be to find out whether there is a problem with my setup (I was already able to reproduce it on other machines). It would be great if you could try to reproduce it with LD_PRELOAD Darshan.

gheber · April 14, 2021, 10:02pm

OK, I think I can see a possible (Darshan HDF5 module?) issue. My setup is:

OpenMPI 4.1.0
HDF5 1.10.7
Darshan 3.2.1

If I build Darshan without the HDF5 module, LD_PRELOAD=.../libdarshan.so mpiexec ... works just fine and I get a happy Darshan output file.

If I build the Darshan runtime with --enable-hdf5-mod=..., then I get this error:

...
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 3:
HDF5-DIAG: Error detected in HDF5 (1.10.7) MPI-process 1:
...
  #000: H5Shyper.c line 12066 in H5Sis_regular_hyperslab(): not a hyperslab selection
    major: Invalid arguments to routine
    minor: Inappropriate type
HDF5-DIAG: Error detected in HDF5  #000: H5Shyper.c line 12066 in H5Sis_regular_hyperslab(): not a hyperslab selection
    major: Invalid arguments to routine
    minor: Inappropriate type
 (1.10.7) MPI-process HDF5-DIAG: Error detected1:
  #000: H5Shyper.c line 12116 in H5Sget_regular_hyperslab(): not a hyperslab selection
    major: in HDF5 (1.10.7) MPI-process 3 Invalid arguments to routine
    minor: Inappropriate type
:
  #000: H5Shyper.c line 12116 in H5Sget_regular_hyperslab(): not a hyperslab selection
    major: Invalid arguments to routine
    minor: Inappropriate type

This makes sense to a degree, because, in the sample app, the odd ranks don’t write any data (H5Sselect_none), and that’s not a regular hyperslab selection in HDF5 parlance. Why Darshan would be unhappy about that I don’t know. Have you reached out to someone on the Darshan team?

What does your error message look like?

G.

tobias.meisel · April 19, 2021, 9:57am

My Setup is:

OpenMPI (OpenRTE) 4.0.5
HDF5 1.12.0
Darshan 3.2.1

I get this error: output.txt (3.2 KB)

I have not contacted the Darshan team yet. I will post this question to the Darshan users mailing list and give a link here as soon as possible.

Thank you very much for your help!

robl · April 20, 2021, 5:42pm

I don’t know why I cannot log into the HDF5 discourse right now.

I’ll repeat what I said on the Darshan list: we knew of one problem
with Darshan and hyperslab selections, but we did not know this would
cause a “divide by zero” problem with OpenMPI.

Workarounds, keeping Darshan in the picture:

Enable the ROMIO MCA for OpenMPI:
mpirun --mca io romio314 or mpirun --mca io romio321
Use MPICH instead of OpenMPI.

We’re aware of the hyperslab selection error and working on it. Did
not know about the bad interaction with OpenMPI-IO in that case,
though, so thanks for the bug report

robl · April 22, 2021, 7:12pm

Wow, this was a fun one.

In ‘example2.c’ some processes create a zero-length dataspace.

That zero-length dataspace becomes a zero-byte MPI file view.

In Darshan, we query the offset so we can record some statistics. We
call MPI_File_get_byte_offset.

In OpenMPI’s OMPIO, there is a spot where the code divides by the size
of the datatype. Blammo, we have a divide-by-zero error.

github.com/open-mpi/ompi
segfault when using MPI_File_get_byte_offset on a 0-sized file view
opened 06:49PM - 22 Apr 21 UTC
closed 07:16PM - 29 Jun 21 UTC
shanedsnyder
bug

## Background information

### What version of Open MPI are you using? (e.g., …v3.0.5, v4.0.2, git branch name and hash, etc.)

v4.0.5

### Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

via Spack

### If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.

### Please describe the system on which you are running

My laptop, but have also triggered the problem on Summit's Spectrum MPI using OMPIO backend

* Operating system/version: Ubuntu 20.04.02
* Computer hardware: Intel® Core™ i7-8650U CPU @ 1.90GHz
* Network type:

-----------------------------

## Details of the problem

MPI codes that set a 0-sized file view and then call MPI_File_get_byte_offset can trigger a segfault in OpenMPI, ultimately caused by a divide by zero. Here's an example code snippet to trigger the problem:

```
#include <mpi.h>
#include <stdio.h>

#define CHECK(fn) {int errcode; errcode = (fn); if (errcode != MPI_SUCCESS) handle_error(errcode, NULL); }

static void handle_error(int errcode, char *str)
{
    char msg[MPI_MAX_ERROR_STRING];
    int resultlen;
    MPI_Error_string(errcode, msg, &resultlen);
    fprintf(stderr, "%s: %s\n", str, msg);
    MPI_Abort(MPI_COMM_WORLD, 1);
}

int main(int argc, char** argv)
{
    MPI_File fh;
    MPI_Datatype t;
    MPI_Aint lb, extent;
    MPI_Count size;
    MPI_Offset offset;

    MPI_Init(&argc, &argv);

    CHECK(MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_CREATE|MPI_MODE_RDWR, MPI_INFO_NULL, &fh));
    CHECK(MPI_Type_contiguous(0, MPI_INT, &t));
    CHECK(MPI_Type_commit(&t));
    CHECK(MPI_Type_get_extent(t, &lb, &extent));
    CHECK(MPI_Type_size_x(t, &size));
    printf("type stats: lb: %ld extent %ld size %lld\n", lb, extent, size);

    CHECK(MPI_File_set_view(fh, 0, MPI_BYTE, t, "native", MPI_INFO_NULL));
    CHECK(MPI_File_get_byte_offset(fh, 100, &offset));

    MPI_File_close(&fh);
    MPI_Type_free(&t);
    MPI_Finalize();
}
```

Digging through the code, it looks like there aren't safety checks in the OMPIO `mca_io_ompio_file_get_byte_offset()` routine to protect against a 0-value for the `f_view_size` parameter (which is set to 0 as part of `MPI_File_set_view()`), allowing for an eventual divide by zero condition. The code in question runs fine using ROMIO backend.
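The failure mode can be shown without MPI. The sketch below is a deliberately simplified model, not OMPIO's actual code: it assumes a file view made of repeating tiles of `extent` bytes, each starting with `size` contiguous data bytes, and converts a view-relative offset (in etype units) to an absolute byte offset. The names view_t and view_byte_offset are hypothetical. Returning the displacement together with an error code for an empty view is likewise just one assumed way to handle the case.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified file view (hypothetical model): the file is tiled with blocks
   of `extent` bytes, and each block begins with `size` bytes of data. */
typedef struct {
    int64_t disp;       /* byte displacement where the view starts        */
    int64_t etype_size; /* size in bytes of one elementary type           */
    int64_t size;       /* data bytes per tile; 0 means an empty view     */
    int64_t extent;     /* total bytes spanned by one tile                */
} view_t;

/* Convert `offset` (counted in etypes, relative to the view) into an
   absolute byte offset. Returns 0 on success. For an empty view it
   returns -1 and stores the displacement, instead of dividing by zero. */
int view_byte_offset(const view_t *v, int64_t offset, int64_t *byte_off)
{
    assert(offset >= 0 && v->etype_size > 0);
    if (v->size == 0) {
        /* The guard: without it, the division below would trap. */
        *byte_off = v->disp;
        return -1;
    }
    int64_t k = offset * v->etype_size;           /* data bytes to skip   */
    *byte_off = v->disp
              + (k / v->size) * v->extent         /* whole tiles skipped  */
              + (k % v->size);                    /* position inside tile */
    return 0;
}
```

For a view with disp 100, etype size 4, 8 data bytes per 16-byte tile, offset 5 skips 20 data bytes: two whole tiles (32 bytes) plus 4 bytes, giving byte offset 136. With size 0 the function takes the guarded branch rather than evaluating k / 0.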

The standard says this about file views:

[...] be negative, and they must be monotonically nondecreasing.

"zero" meets those restrictions.

We (speaking as an MPI-IO implementor) might prefer HDF set a file view
of MPI_BYTE and leave the "I did not read/write anything" logic to the
H5Dwrite path.

In Darshan we are going to disable the MPI_File_get_byte_offset call if
we detect OpenMPI, and I'm sure the OpenMPI folks will stick the
appropriate check in their code soon.  In the meantime, you can still
use OpenMPI, but use the alternate I/O module: with "--mca io romio321"

==rob

koziol · April 22, 2021, 7:47pm

Sounds like a new regression test for you!

Passing count=0 instead of a 0-byte MPI datatype for the file view would be preferable?

Quincey

tobias.meisel · April 23, 2021, 1:01pm

The workarounds with romio314 (error_romio314 (10.2 KB)) and romio321 (error_romio321 (9.1 KB)) do not work for me, with or without Darshan. (Most likely I am doing something wrong, but I did not investigate because my workaround is to not use Darshan for the time being.)

tobias.meisel · April 23, 2021, 1:43pm

To add an example that is very close to example.h5 but in which the odd-ranked processes also write data:

example_noemptyodds.c (2.2 KB)

With Darshan, it gives this error:
error_withdarshan_allprocesseswrite (6.1 KB)

robl · April 23, 2021, 9:30pm

tobias.meisel, April 23:
The workarounds with romio314 error_romio314 (10.2 KB)

This log message says OpenMPI did not build a ‘romio314’
module. That’s fine: OpenMPI usually removes the old one when they
update, keeping just one instance of ROMIO.

and romio 321
(error_romio321 (9.1 KB)

Ugh, ok, yeah. Good(?) news: this error is not a Darshan error. This
error happens in ROMIO long before Darshan gets involved.

ROMIO-3.2.1 is five years old and is probably missing a fix in the way
we handle RESIZED datatypes.

Thanks for trying it out. Sorry it’s not working out for you. I’m
glad to have two new test cases, though.

==rob
