H5Dwrite is for writing raw data, and unlike HDF5 metadata operations, the library does not require raw data writes to be collective unless you ask for it.
For a list of HDF5 function calls which are required to be collective, look here:
For raw data, the library does not detect whether you are writing to the same position in the file or not; it simply passes the data down to MPI to write collectively or independently. The concept of collective I/O in MPI is to have all processes work together to write different portions of the file, not the same portion. So if you have two processes writing values X and Y to the same offset in the file, both writes will happen, and the result is undefined under MPI semantics. If, instead, you do collective I/O with two processes writing X and Y to two adjacent positions in the file, MPI-IO can internally have one rank combine the two writes and execute the combined write itself, rather than issuing two smaller writes from different processes to the parallel file system. This is a very simple example of what collective I/O does; of course, things get more complicated with more processes and more data.
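To make that concrete, here is a minimal sketch of the "different portions" pattern: each rank selects a disjoint hyperslab of one dataset and writes through a collective transfer property list. The file name, dataset name, and slab size are made up for illustration; this assumes an HDF5 build with MPI (parallel) support.

```c
#include <mpi.h>
#include <hdf5.h>

/* Sketch: each of N ranks writes a disjoint 4-element slab of a 1-D
 * dataset. With a collective dxpl, MPI-IO is free to combine the
 * writes internally instead of issuing N small independent writes. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Open the file with the MPI-IO file driver (all ranks). */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    hsize_t dims[1]  = { (hsize_t)nprocs * 4 };
    hsize_t count[1] = { 4 };
    hsize_t start[1] = { (hsize_t)rank * 4 };   /* disjoint offsets */

    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* Each rank selects only its own slab of the file dataspace. */
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    /* Request collective raw data I/O (the default is independent). */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);

    double buf[4] = { rank + 0.0, rank + 0.1, rank + 0.2, rank + 0.3 };
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);

    H5Pclose(dxpl); H5Sclose(memspace); H5Sclose(filespace);
    H5Dclose(dset); H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```

Note the single line that flips the behavior: changing H5FD_MPIO_COLLECTIVE to H5FD_MPIO_INDEPENDENT on the dxpl gives you independent I/O instead.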
Not that it matters here, but note that in your code, you set your dxpl to use independent I/O and not collective:
As for the attribute, it depends on what you need; I don't know what sort of metadata you want to store. If that metadata is related to every large dataset that you write, then you should create that attribute on every large dataset. If it is metadata for the entire file, then you can just create it on the root group "/" (note that this is not a dataset but a group object; those are two different HDF5 objects, so look into the HDF5 User's Guide if you need more information). Note that attribute operations, unlike H5Dread and H5Dwrite, are regarded as HDF5 metadata operations: they are always required to be collective and should be called with the same parameters and values from all processes. HDF5 internally manages the metadata cache operations to the file system in that case, so you don't end up writing multiple times to the file, as happened with your raw data writes through H5Dwrite.
Note also that if you call H5LTset_attribute_string() twice with the same attribute name, the older value is overwritten. So it really depends on what you want to store as metadata and how.
From: Hdf-forum [mailto:firstname.lastname@example.org] On Behalf Of Maxime Boissonneault
Sent: Wednesday, January 28, 2015 9:43 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] File keeps being updated long after the dataset is closed
Le 2015-01-28 10:32, Mohamad Chaarawi a écrit :
Ha.. I bet that writeMetaDataDataset is the culprit here...
So you are saying that you create a scalar dataset (with one element), and then write that same element n times (n being the number of processes), at the same time, from all processes? Why would you need to do such a thing in the first place? If you need to write that element, you should just call writeMetaDataDataset from rank 0. If you don't need that float, then you should not write it at all.
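The rank-0-only pattern could look roughly like this (a sketch, not your code; the file and dataset names are hypothetical, and it assumes an HDF5 build with MPI support):

```c
#include <mpi.h>
#include <hdf5.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("run.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* Object creation is an HDF5 metadata operation: all ranks must
     * call it with the same arguments. */
    hsize_t one = 1;
    hid_t space = H5Screate_simple(1, &one, NULL);
    hid_t dset  = H5Dcreate2(file, "meta_value", H5T_NATIVE_FLOAT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* The raw data write is not collective by default, so only rank 0
     * issues it: one write instead of n identical ones. */
    if (rank == 0) {
        float v = 1.0f;
        H5Dwrite(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL,
                 H5P_DEFAULT, &v);
    }

    H5Dclose(dset); H5Sclose(space);
    H5Fclose(file); H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```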
I was under the impression that HDF5 (or MPI-IO) managed under the hood which process actually wrote data, and that such a small dataset would end up being written by only one rank. I actually thought that H5Dwrite and H5*close *needed* to be called by all processes, i.e. that they were collective.
I guess that at least H5Fclose is collective, since all processes need to close the file. Are the other ones not collective?
You called the metadata dataset an empty dataset essentially, so I understand that you don't need it? If that is the case, then why not create the attributes on the root group, or a different sub group for the current run, or even the large dataset?
I did not know that there was a default, root, dataset. So you are saying that I can simply call H5LTset_attribute_string(file_id, "root", key, value) without creating a dataset first?
I do not attach the metadata to the large dataset, because it is collective metadata and there may be more than one large dataset in the same file.
What you are causing is having every process grab a lock on that file system OST block, write that element, and then release the lock. This is happening 960 times in your case, which is what I interpret as causing this performance degradation.
This makes sense. I will test and make changes accordingly.