File keeps being updated long after the dataset is closed


#1

Hi,
I am writing a very large (572GB) file with HDF5, on a Lustre filesystem, using 960 MPI processes spread over 120 nodes.

I am monitoring the I/O on the filesystem while this happens. I see a very large peak of around 2 GB/s for roughly 3-4 minutes. My internal timers (covering creating the dataset, selecting the memory hyperslab, writing the dataset, and closing the dataset) tell me that writing takes 180 s, which corresponds to the peak I see on our Lustre servers.

After writing the big dataset, I write a small "metadata" dataset, containing details of the run. This dataset is very small, and contains various data types, while the big dataset contains only doubles.

My problem: the H5 file keeps being updated (I watch the last-modified date) long after the big dataset is written, for some 10-15 minutes afterwards.

Is it possible that writing the small dataset at the end takes so much time, while the big dataset is so quick to write? In the 10-15 minutes after writing the big dataset, I see next to nothing happening on our Lustre filesystem.

Any idea what may be going on?

--
---------------------------------
Maxime Boissonneault
Computing Analyst - Calcul Québec, Université Laval
Ph.D. in Physics


#2

Hi Maxime,

If I understand you correctly, you are getting pretty good performance when writing the large dataset, but something else is slowing your I/O at the end?

Let me first mention that HDF5 manages a metadata cache in the background (not to be confused with your metadata dataset). Any update to the HDF5 file may trigger a flush of the metadata cache at some point (usually at file close, if you are not doing a lot of metadata updates). This metadata is internal to HDF5 and holds information about the file, such as object headers, the file superblock, and many other things that are transparent to the application. This explains why you are seeing I/O after your dataset is closed. You will stop seeing updates to the file after you call H5Fclose().

The small "metadata" dataset is something I don't understand. What do you mean by "it contains various data types"? The dataset is created with only 1 HDF5 datatype and can't have multiple datatypes. Variable length datatypes are not permitted in parallel. So please explain more what you mean by that.
Also how large is the small dataset and how are you writing to it? Do all 960 MPI ranks write different hyperslabs to this small dataset (I wouldn't imagine it would be small then), or does only 1 rank write your metadata to that dataset? Are you using collective I/O if all processes are writing? Can you use an attribute instead of a small dataset?

It would be great if you can share the application so we can try it out, but I understand that it might not always be possible.

Thanks,
Mohamad

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Maxime Boissonneault
Sent: Tuesday, January 27, 2015 2:49 PM
To: hdf-forum@lists.hdfgroup.org
Subject: [Hdf-forum] File keeps being updated long after the dataset is closed

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5


#3

> Hi Maxime,
>
> If I understand you correctly, you are getting pretty good performance when writing the large dataset, but something else is slowing your I/O at the end?

Yes. I am wondering if it is a matter of caching (i.e. HDF5 is waiting for a lock to come back, which would come back only after the data is actually written), or if it is what I do after writing the main dataset.

> Let me first mention that HDF5 manages a metadata cache in the background (not to be confused with your metadata dataset). Any update to the HDF5 file may trigger a flush of the metadata cache at some point (usually at file close, if you are not doing a lot of metadata updates). This metadata is internal to HDF5 and holds information about the file, such as object headers, the file superblock, and many other things that are transparent to the application. This explains why you are seeing I/O after your dataset is closed. You will stop seeing updates to the file after you call H5Fclose().
> The small "metadata" dataset is something I don't understand. What do you mean by "it contains various data types"? The dataset is created with only 1 HDF5 datatype and can't have multiple datatypes. Variable length datatypes are not permitted in parallel. So please explain more what you mean by that.
> Also how large is the small dataset and how are you writing to it? Do all 960 MPI ranks write different hyperslabs to this small dataset (I wouldn't imagine it would be small then), or does only 1 rank write your metadata to that dataset? Are you using collective I/O if all processes are writing? Can you use an attribute instead of a small dataset?

The "metadata" dataset is created with
H5Dcreate(...,H5T_NATIVE_FLOAT,...)
It is kept open for the whole duration of the run, and has some attributes set on it before being written. I guess that qualifies it as an empty dataset with multiple attributes. All ranks call the attribute-setting functions. Here is the code:
http://pastebin.com/sq5ygQyM

The functions createMetaDataDataset, setProperty, writeMetaDataDataset and closeMetaDataDataset are called by all ranks.

> It would be great if you can share the application so we can try it out, but I understand that it might not always be possible.

I think the above pastebin is all the code that you need, but I can paste the code for the whole class if you want. Sharing the whole application would be problematic, not because of the code itself, but because of the data required as input.

Maxime

On 2015-01-27 16:14, Mohamad Chaarawi wrote:


#4

Ha.. I bet that writeMetaDataDataset is the culprit here...
So you are saying that you create a scalar dataset (with 1 element), and then write that same element n times (n being the number of processes), at the same time, from all processes? Why would you need to do such a thing in the first place? If you need to write that element, you should just call writeMetaDataDataset from rank 0. If you don't need that float, then you should just not write it at all.

You essentially called the metadata dataset an empty dataset, so I understand that you don't need it? If that is the case, then why not create the attributes on the root group, or on a different subgroup for the current run, or even on the large dataset?

What you are causing is that every process grabs a lock on that file system OST block, writes the element, and then releases the lock. This happens 960 times in your case, which I believe is what is causing this performance degradation.

Thanks,
Mohamad

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Maxime Boissonneault
Sent: Wednesday, January 28, 2015 9:11 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] File keeps being updated long after the dataset is closed



#5

Hi Mohamad,

> Ha.. I bet that writeMetaDataDataset is the culprit here...
> So you are saying that you create a scalar dataset (with 1 element), and then write that same element n times (n being the number of processes), at the same time, from all processes? Why would you need to do such a thing in the first place? If you need to write that element, you should just call writeMetaDataDataset from rank 0. If you don't need that float, then you should just not write it at all.

I was under the impression that HDF5 (or MPI IO) managed under the hood which process actually wrote data, and that such a small dataset would end up being written by only one rank. I actually thought that H5Dwrite and H5*close *needed* to be called by all processes, i.e. that they were collective.

I guess that at least H5Fclose is collective, since all processes need to close the file. Are the other ones not collective?

> You essentially called the metadata dataset an empty dataset, so I understand that you don't need it? If that is the case, then why not create the attributes on the root group, or on a different subgroup for the current run, or even on the large dataset?

I did not know that there was a default, root, dataset. So you are saying that I can simply call
H5LTset_attribute_string(file_id, "root", key, value)
without creating a dataset first?

I do not attach the metadata to the large dataset, because it is collective metadata and there may be more than one large dataset in the same file.

> What you are causing is that every process grabs a lock on that file system OST block, writes the element, and then releases the lock. This happens 960 times in your case, which I believe is what is causing this performance degradation.

This makes sense. I will test and make changes accordingly.

Maxime

On 2015-01-28 10:32, Mohamad Chaarawi wrote:


#6

Hi Maxime,

H5Dwrite is for writing raw data and, unlike HDF5 metadata operations, the library does not require it to be collective unless you ask for it.
For a list of HDF5 function calls which are required to be collective, look here:
http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html

For raw data, we do not detect whether you are writing to the same position in the file or not, so we just pass the data down to MPI to write collectively or independently. The concept of collective I/O in MPI is to have all processes work together to write different portions of the file, not the same portion. So if you have 2 processes writing values X and Y to the same offset in the file, both writes will happen and the result is really undefined in MPI semantics. Now if you do collective I/O with 2 processes writing X and Y to two adjacent positions in the file, MPI-IO would internally have 1 rank combine the two writes and execute the combined write itself, rather than issue 2 smaller writes from different processes to the parallel file system. This is a very simple example of what collective I/O is. Of course things get more complicated with more processes and more data :)
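The adjacent-writes case described above can be mimicked with a toy model. The sketch below is not how MPI-IO is implemented; it only merges byte ranges that touch, the way a single aggregating rank could, and reports ranges that overlap, where the outcome would be undefined. The function name `coalesce_writes` and the `(offset, data)` request format are assumptions made purely for illustration.

```python
def coalesce_writes(requests):
    """Merge per-rank (offset, data) write requests that are exactly adjacent.

    Returns (merged, conflicts): `merged` is the list of (offset, bytes)
    writes a single aggregator would issue; `conflicts` lists the offsets of
    requests that overlap an earlier one (undefined result in MPI semantics).
    """
    merged = []
    conflicts = []
    for offset, data in sorted(requests, key=lambda r: r[0]):
        if merged:
            last_offset, last_data = merged[-1]
            last_end = last_offset + len(last_data)
            if offset < last_end:
                # Overlap: two ranks target the same bytes, so this toy
                # model refuses to guess which value would win.
                conflicts.append(offset)
                continue
            if offset == last_end:
                # Adjacent: extend the previous write instead of issuing
                # a second small one.
                merged[-1] = (last_offset, last_data + data)
                continue
        merged.append((offset, data))
    return merged, conflicts


# Two ranks writing X and Y to adjacent positions become one write.
adjacent, _ = coalesce_writes([(0, b"X"), (1, b"Y")])
# Two ranks writing X and Y to the same offset are reported as a conflict.
_, clash = coalesce_writes([(0, b"X"), (0, b"Y")])
print(adjacent, clash)
```

With 960 ranks all targeting the same single element, every request after the first falls into the conflict branch of this model, which is the situation the independent H5Dwrite calls in this thread were in.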

Not that it matters here, but note that in your code, you set your dxpl to use independent I/O and not collective:
H5Pset_dxpl_mpio(md_plist_id, H5FD_MPIO_INDEPENDENT);

As for the attribute, it depends on what you need. I don’t know what sort of metadata you want to store. If that metadata is related to every large dataset that you write, then you should create that attribute on every large dataset. If it is metadata for the entire file, then you can just create it on the root group "/" (note that this is not a dataset but a group object; those are 2 different HDF5 objects. Look into the HDF5 user guide if you need more information). Note that attribute operations are regarded as HDF5 metadata operations, unlike H5Dread and H5Dwrite, and are always required to be collective; they should be called with the same parameters and values from all processes. HDF5 internally manages the metadata cache operations to the file system in that case, so you don't end up writing multiple times to the file, as was the case with your raw data writes through H5Dwrite.
Note also that if you call H5LTset_attribute_string() twice with the same attribute name, the older one is overwritten. So it really depends on what you want to store as metadata and how.

Thanks,
Mohamad

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Maxime Boissonneault
Sent: Wednesday, January 28, 2015 9:43 AM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] File keeps being updated long after the dataset is closed



#7

Hi Mohamad,
First, thank you for the info. Below are a few followup questions.

> Hi Maxime,
>
> H5Dwrite is for writing raw data and, unlike HDF5 metadata operations, the library does not require it to be collective unless you ask for it.
> For a list of HDF5 function calls which are required to be collective, look here:
> http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html

Thanks. Is there such a list for the "lite interface"?

> For raw data, we do not detect whether you are writing to the same position in the file or not, so we just pass the data down to MPI to write collectively or independently. The concept of collective I/O in MPI is to have all processes work together to write different portions of the file, not the same portion. So if you have 2 processes writing values X and Y to the same offset in the file, both writes will happen and the result is really undefined in MPI semantics. Now if you do collective I/O with 2 processes writing X and Y to two adjacent positions in the file, MPI-IO would internally have 1 rank combine the two writes and execute the combined write itself, rather than issue 2 smaller writes from different processes to the parallel file system. This is a very simple example of what collective I/O is. Of course things get more complicated with more processes and more data :)
>
> Not that it matters here, but note that in your code, you set your dxpl to use independent I/O and not collective:
> H5Pset_dxpl_mpio(md_plist_id, H5FD_MPIO_INDEPENDENT);

For the big datasets, I do use H5FD_MPIO_COLLECTIVE. I also create the file with these MPI info parameters:

void HDF5DataStore::createMPIInfo()
{
     MPI_Info_create(&info);
     int comm_size;
     MPI_Comm_size(MPI_COMM_WORLD,&comm_size);

     // Ask for the file to be striped across comm_size/2 I/O devices.
     MPI_Info_set(info,"striping_factor",const_cast<char *>(std::to_string(comm_size/2).c_str()));
     // ROMIO Lustre collective threshold: 32 MiB, expressed in bytes.
     MPI_Info_set(info,"romio_lustre_coll_threshold",const_cast<char *>(std::to_string(32*1024*1024).c_str()));
}

Maybe I should revisit those parameters, but they resulted in good enough performance for the main dataset during my tests.
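For the 960-rank run described at the start of the thread, the hint strings built by createMPIInfo() work out as shown below. This is only a sketch of the arithmetic in plain Python; `lustre_hints` is a hypothetical helper, not part of MPI or HDF5, and the sketch says nothing about whether the file system honours the requested values.

```python
def lustre_hints(comm_size):
    """Return the hint strings the C++ snippet would pass for `comm_size` ranks."""
    return {
        # Integer division, as in the C++ expression comm_size/2.
        "striping_factor": str(comm_size // 2),
        # 32 MiB expressed in bytes.
        "romio_lustre_coll_threshold": str(32 * 1024 * 1024),
    }


hints = lustre_hints(960)
print(hints)
```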

> As for the attribute, it depends on what you need. I don’t know what sort of metadata you want to store. If that metadata is related to every large dataset that you write, then you should create that attribute on every large dataset. If it is metadata for the entire file, then you can just create it on the root group "/" (note that this is not a dataset but a group object; those are 2 different HDF5 objects. Look into the HDF5 user guide if you need more information). Note that attribute operations are regarded as HDF5 metadata operations, unlike H5Dread and H5Dwrite, and are always required to be collective; they should be called with the same parameters and values from all processes. HDF5 internally manages the metadata cache operations to the file system in that case, so you don't end up writing multiple times to the file, as was the case with your raw data writes through H5Dwrite.
> Note also that if you call H5LTset_attribute_string() twice with the same attribute name, the older one is overwritten. So it really depends on what you want to store as metadata and how.

I will create the attributes on the root group from now on, and eliminate the fake dataset. The metadata basically contains the values of input parameters, that is, mostly the name of the input file as well as a few integers.

Thanks a lot. Everything makes much more sense now.

Maxime

On 2015-01-28 11:24, Mohamad Chaarawi wrote:


#8

You know, having struggled with etypes, ftypes, and datatype equivalence, it turns out that HDF5 (or pnetcdf or an application) has a much easier time determining if everyone is reading the same data.

Writing the same data is actually "undefined" (by the strict letter of the MPI standard), and while no one does this it would be fun to detect this condition and write out 0xDEADBEEF and see how many applications break....

==rob

On 01/28/2015 10:24 AM, Mohamad Chaarawi wrote:

> Hi Maxime,
>
> H5Dwrite is for writing raw data and, unlike HDF5 metadata operations, the library does not require it to be collective unless you ask for it.
> For a list of HDF5 function calls which are required to be collective, look here:
> http://www.hdfgroup.org/HDF5/doc/RM/CollectiveCalls.html
>
> For raw data, we do not detect whether you are writing to the same position in the file or not, so we just pass the data down to MPI to write collectively or independently.

--
Rob Latham
Mathematics and Computer Science Division
Argonne National Lab, IL USA


#9

> You know, having struggled with etypes, ftypes, and datatype equivalence, it turns out that HDF5 (or pnetcdf or an application) has a much easier time determining if everyone is reading the same data.

[msc] I agree. Improving this is in my bucket of things to do, at least for the HDF5 metadata reads. For raw data, it is very easy to detect, but the check for overlap would introduce unnecessary communication in many cases. Maybe an additional property that tells HDF5 to look for overlaps on reads would be useful here.

> Writing the same data is actually "undefined" (by the strict letter of the MPI standard), and while no one does this it would be fun to detect this condition and write out 0xDEADBEEF and see how many applications break....

[msc] yes!

Mohamad

> ==rob

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Rob Latham
Sent: Wednesday, January 28, 2015 1:13 PM
To: hdf-forum@lists.hdfgroup.org
Subject: Re: [Hdf-forum] File keeps being updated long after the dataset is closed



#10

I’m currently writing a program in Python on a Linux system. The goal is to read a log file and execute a bash command upon finding a specific string. The log file is constantly being written to by another program.

My question: if I open the file using the open() method, will my Python file object be updated as the actual file gets written to by the other program, or will I have to reopen the file at timed intervals?

UPDATE: Thanks for the answers so far. I should perhaps have mentioned that the file is being written to by a Java EE app, so I have no control over when data gets written there. I currently have a program that reopens the file every 10 seconds and tries to read from the byte position in the file that it last read up to. For the moment, it just prints out the string that is returned.
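#!/usr/bin/env python3
import time

fileBytePos = 0
while True:
    # Reopen the log, skip what was already read, and print anything new.
    with open('./server.log', 'r') as inFile:
        inFile.seek(fileBytePos)
        data = inFile.read()
        print(data)
        # Remember where this pass stopped so the next one resumes there.
        fileBytePos = inFile.tell()
        print(fileBytePos)
    time.sleep(10)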

I was hoping that the file would not need to be reopened, and that the read command would somehow have access to the data written to the file by the Java app.
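On Linux, a file object that stays open does see data appended later by another process: a read that returned nothing at end of file can simply be retried on the same object. The sketch below illustrates that alternative to the reopen-every-10-seconds loop above. The `follow` generator and its `poll_interval` and `max_idle_polls` parameters are hypothetical names chosen for this example, and the sketch does not handle log rotation or truncation.

```python
import os
import tempfile
import time


def follow(path, poll_interval=1.0, max_idle_polls=None):
    """Yield complete lines appended to `path`, keeping one file object open.

    `max_idle_polls` exists only so the loop can stop: the generator ends
    after that many consecutive polls without new data (None = never stop).
    """
    idle = 0
    pending = ""
    with open(path, "r") as log:
        while True:
            chunk = log.readline()
            if chunk:
                idle = 0
                pending += chunk
                # Only hand out a line once its newline has arrived, in
                # case the writer was caught in the middle of a line.
                if pending.endswith("\n"):
                    yield pending.rstrip("\n")
                    pending = ""
                continue
            # End of file for now: wait, then retry on the same file object.
            idle += 1
            if max_idle_polls is not None and idle >= max_idle_polls:
                return
            time.sleep(poll_interval)


# Demo: a second handle appends while the reader keeps its handle open.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "server.log")
    with open(path, "w") as writer:
        writer.write("first\n")
    lines = follow(path, poll_interval=0.1, max_idle_polls=2)
    seen = [next(lines)]
    with open(path, "a") as writer:
        writer.write("second\n")
    seen.extend(lines)
print(seen)
```

Searching each yielded line for the target string and launching the bash command would slot in where the demo collects lines into `seen`.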