Poor performance with PHDF5

Hi,
I am trying to write a ~24 GB array of floats to a file with PHDF5. I am running on a Lustre parallel filesystem with InfiniBand networking. The software runs on 128 processes, spread across 16 nodes with 8 cores each. The MPI implementation is OpenMPI 1.6.3, and HDF5 is 1.8.10.

Each process writes one regular hyperslab at a different offset. The hyperslabs are not all exactly the same size, but they are close, so each process writes around 192 MB of data.
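For reference, the write path looks roughly like the sketch below (the file/dataset names, the 1-D layout, and the equal-size decomposition are simplified placeholders; the actual hyperslab sizes differ slightly between ranks):

#include <stdlib.h>
#include <mpi.h>
#include <hdf5.h>

/* Simplified sketch: one 1-D dataset of ~24 GB of floats, each rank writing
   one contiguous hyperslab of ~192 MB at its own offset. */
int rank, nprocs;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

hsize_t total  = (hsize_t)6 * 1024 * 1024 * 1024;  /* 6G floats = ~24 GB */
hsize_t count  = total / nprocs;                   /* ~48M floats = ~192 MB per rank */
hsize_t offset = (hsize_t)rank * count;
float  *buf    = malloc(count * sizeof(float));

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

hid_t filespace = H5Screate_simple(1, &total, NULL);
hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, filespace,
                       H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, &count, NULL);
hid_t memspace = H5Screate_simple(1, &count, NULL);

hid_t plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);  /* or H5FD_MPIO_INDEPENDENT */
H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, plist_id, buf);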

For some reason, it seems that if I set
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

only the master node writes anything into the resulting file (and it takes ~10 minutes to write it).

If instead I set
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);

all nodes write data, and it takes ~4-5 minutes to write the whole file.

I expected two things that I don't see happening:
1) With collective I/O, I would expect all ranks to write.
2) With our Lustre filesystem, I would expect much more than 100 MB/s for such collective I/O (at least around 1 GB/s).

Any tips on what might be going on?

Thanks,


--
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
Ph.D. in physics

Hi,
I am trying to write a ~24 GB array of floats to a file with PHDF5. I am running on a Lustre parallel filesystem with InfiniBand networking. The software runs on 128 processes, spread across 16 nodes with 8 cores each. The MPI implementation is OpenMPI 1.6.3, and HDF5 is 1.8.10.

Each process writes one regular hyperslab at a different offset. The hyperslabs are not all exactly the same size, but they are close, so each process writes around 192 MB of data.

For some reason, it seems that if I set
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);

only the master node writes anything into the resulting file (and it takes ~10 minutes to write it).

Are you saying that the master node writes its own data and all the data of the other ranks? Or are you saying that there is a bug where only the master node writes its data and the other ranks' data never get written to the file? (I assume it is the former.)

But yes, that shouldn't happen. Is the default number of aggregators that OpenMPI sets in ROMIO equal to 1?
By the way, how did you determine that only the master node is writing data? Did you add printfs in MPI_File_write_at_all?

HDF5 just calls into MPI-I/O with the data to be written, so it is the MPI-I/O library that selects the number of aggregators (writers).
Could you set cb_nodes to something like 128 and try that? (You can vary it to better tune your I/O.) You can set it through the info object you pass to H5Pset_fapl_mpio().
Also set cb_buffer_size to something like your Lustre stripe size.
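Something along these lines (an untested sketch; the hint values and the file name are just examples to tune for your system):

MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "cb_nodes", "128");            /* number of aggregators (writers) */
MPI_Info_set(info, "cb_buffer_size", "1048576");  /* collective buffer size, e.g. 1 MB to match the stripe size */

hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, info);
hid_t file = H5Fcreate("output.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
/* ... create the dataset and call H5Dwrite with the collective dxpl as before ... */
MPI_Info_free(&info);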

If instead I set
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_INDEPENDENT);

all nodes write data, and it takes ~4-5 minutes to write the whole file.

OK, this behavior is normal; independent is, well, independent :slight_smile:

I expected two things that I don't see happening:
1) With collective I/O, I would expect all ranks to write.

This is not correct. All ranks issue writes at the HDF5 level, but not all ranks write at the MPI-I/O level. Depending on the collective algorithm (such as two-phase I/O), only a subset of the ranks (the cb_nodes aggregators) actually writes the data.

2) With our Lustre filesystem, I would expect much more than 100 MB/s for such collective I/O (at least around 1 GB/s).

I have to ask this, but are you sure your stripe size and count are set to something large? The default stripe count is usually 1 or 2, which kills performance when writing large amounts of data.
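You can set the striping from the shell with lfs setstripe on the output directory before the run, or, if your MPI-I/O layer has ROMIO's Lustre driver built in, request it through the same info object (a sketch; these hints only take effect when the file is created, and only if the driver honors them):

MPI_Info_set(info, "striping_factor", "8");       /* Lustre stripe count (example value) */
MPI_Info_set(info, "striping_unit", "1048576");   /* Lustre stripe size in bytes (1 MB) */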

Thanks,
Mohamad


Hi Mohamad,
I did further testing yesterday, varying the stripe count of my output file and switching between collective and independent I/O, but first let me answer your questions.

Are you saying that the master node writes its own data and all the data of the other ranks? Or are you saying that there is a bug where only the master node writes its data and the other ranks' data never get written to the file? (I assume it is the former.)

But yes, that shouldn't happen. Is the default number of aggregators that OpenMPI sets in ROMIO equal to 1?
By the way, how did you determine that only the master node is writing data? Did you add printfs in MPI_File_write_at_all?

I am saying that the master node writes the data for all the other ranks. I determined this by monitoring each node's IOPS and read/write rates through our Ganglia monitoring.

HDF5 just calls into MPI-I/O with the data to be written, so it is the MPI-I/O library that selects the number of aggregators (writers).
Could you set cb_nodes to something like 128 and try that? (You can vary it to better tune your I/O.) You can set it through the info object you pass to H5Pset_fapl_mpio().
Also set cb_buffer_size to something like your Lustre stripe size.

I will look at this further and do some testing with those parameters.

I expected two things that I don't see happening:
1) With collective I/O, I would expect all ranks to write.

This is not correct. All ranks issue writes at the HDF5 level, but not all ranks write at the MPI-I/O level. Depending on the collective algorithm (such as two-phase I/O), only a subset of the ranks (the cb_nodes aggregators) actually writes the data.

What I meant, rather, is that I would expect all nodes to write (though maybe not all ranks).

2) With our Lustre filesystem, I would expect much more than 100 MB/s for such collective I/O (at least around 1 GB/s).

The initial numbers were with a stripe count of 1. I did some more testing while varying the stripe count, on two different filesystems:
- One has 8 targets and is idle (our test filesystem).
- One has 64 targets and is more or less busy.

I was writing with 16 nodes and 128 MPI ranks. With collective I/O, I obtained the following rates:
FS with 64 targets:
sc = 1 : 171 ± 13 MB/s
sc = 8 : 937 ± 34 MB/s
sc = -1 : 1102 ± 19 MB/s

FS with 8 targets:
sc = 1 : 249 ± 4 MB/s
sc = 8 : 1218 ± 47 MB/s

With independent I/O, I obtained the following rates:
FS with 64 targets:
sc = 1 : 240 ± 12 MB/s
sc = 8 : 1362 ± 79 MB/s
sc = -1 : 948 ± 48 MB/s

FS with 8 targets:
sc = 1 : 581 ± 7 MB/s
sc = 8 : 2700 ± 200 MB/s

The error bars I give are the standard deviation over 3 runs. The stripe size was left at 1 MB, which is aligned with our RAID blocks. I also did the testing with 8 nodes (64 MPI ranks) and obtained very similar rates.

What puzzles me is that independent I/O performs either as well as or much better than collective I/O. Maybe this has to do with the cb_nodes parameter.

Thanks again for your reply.

Best regards,


--
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
Ph.D. in physics

Hi Maxime,

Hi Mohamad,
I did further testing yesterday, varying the stripe count of my output file and switching between collective and independent I/O, but first let me answer your questions.

Are you saying that the master node writes its own data and all the data of the other ranks? Or are you saying that there is a bug where only the master node writes its data and the other ranks' data never get written to the file? (I assume it is the former.)

But yes, that shouldn't happen. Is the default number of aggregators that OpenMPI sets in ROMIO equal to 1?
By the way, how did you determine that only the master node is writing data? Did you add printfs in MPI_File_write_at_all?

I am saying that the master node writes the data for all the other ranks. I determined this by monitoring each node's IOPS and read/write rates through our Ganglia monitoring.

HDF5 just calls into MPI-I/O with the data to be written, so it is the MPI-I/O library that selects the number of aggregators (writers).
Could you set cb_nodes to something like 128 and try that? (You can vary it to better tune your I/O.) You can set it through the info object you pass to H5Pset_fapl_mpio().
Also set cb_buffer_size to something like your Lustre stripe size.

I will look at this further and do some testing with those parameters.

OK. I would vary cb_nodes among 16, 32, 64, and 128, just to see which is ideal for your application/filesystem combination.

I expected two things that I don't see happening:
1) With collective I/O, I would expect all ranks to write.

This is not correct. All ranks issue writes at the HDF5 level, but not all ranks write at the MPI-I/O level. Depending on the collective algorithm (such as two-phase I/O), only a subset of the ranks (the cb_nodes aggregators) actually writes the data.

What I meant, rather, is that I would expect all nodes to write (though maybe not all ranks).

It doesn't have to be that either; it depends entirely on the access pattern of your ranks in the application. I don't think the current two-phase implementation in ROMIO takes rank placement on nodes into account; it only considers how much each rank is writing and how many ranks there are.

2) With our Lustre filesystem, I would expect much more than 100 MB/s for such collective I/O (at least around 1 GB/s).

The initial numbers were with a stripe count of 1.

Yes, I would definitely increase that.

I did some more testing while varying the stripe count, on two different filesystems:
- One has 8 targets and is idle (our test filesystem).
- One has 64 targets and is more or less busy.

I was writing with 16 nodes and 128 MPI ranks. With collective I/O, I obtained the following rates:
FS with 64 targets:
sc = 1 : 171 ± 13 MB/s
sc = 8 : 937 ± 34 MB/s
sc = -1 : 1102 ± 19 MB/s

What is -1 here?

FS with 8 targets:
sc = 1 : 249 ± 4 MB/s
sc = 8 : 1218 ± 47 MB/s

OK, this sounds more reasonable now (with a larger stripe count).

With independent I/O, I obtained the following rates:
FS with 64 targets:
sc = 1 : 240 ± 12 MB/s
sc = 8 : 1362 ± 79 MB/s
sc = -1 : 948 ± 48 MB/s

FS with 8 targets:
sc = 1 : 581 ± 7 MB/s
sc = 8 : 2700 ± 200 MB/s

The error bars I give are the standard deviation over 3 runs. The stripe size was left at 1 MB, which is aligned with our RAID blocks. I also did the testing with 8 nodes (64 MPI ranks) and obtained very similar rates.

What puzzles me is that independent I/O performs either as well as or much better than collective I/O. Maybe this has to do with the cb_nodes parameter.

Yes. If only one rank is chosen as an aggregator in ROMIO for collective I/O, that is definitely the issue you are seeing. Increasing it should give you better results.

Thanks,
Mohamad


I did some more testing while varying the stripe count, on two different filesystems:
- One has 8 targets and is idle (our test filesystem).
- One has 64 targets and is more or less busy.

I was writing with 16 nodes and 128 MPI ranks. With collective I/O, I obtained the following rates:
FS with 64 targets:
sc = 1 : 171 ± 13 MB/s
sc = 8 : 937 ± 34 MB/s
sc = -1 : 1102 ± 19 MB/s

What is -1 here?

-1 means all targets (i.e., sc = 64 in this case).


Thanks again for your reply.

Best regards,

--
---------------------------------
Maxime Boissonneault
Computing analyst - Calcul Québec, Université Laval
Ph.D. in physics