Default H5D_FILL_TIME_IFSET writes fill data even when no fill value is set

Am I misinterpreting the documentation, or is this fill behavior buggy?

After enabling chunking of my multi-GB rank-8 array, I'm finding that when I set the fill time property to H5D_FILL_TIME_IFSET but do not set a fill value, the number of bytes written is double what I expect. Setting the property to H5D_FILL_TIME_NEVER instead results in the expected number of bytes written. It seems to me the "if set" condition is not being honored.
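
For context, the creation-property setup looks roughly like this (a minimal serial sketch, not my exact code; the real run is parallel, and the dimensions match the h5dump output further down):

    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[8]  = {24, 24, 1, 16, 16, 16, 729, 2};
        hsize_t chunk[8] = { 1,  1, 1, 16, 16, 16, 729, 2};

        hid_t file  = H5Fcreate("repro.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(8, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 8, chunk);

        /* Fill time "if set" -- and no H5Pset_fill_value() call anywhere,
         * so I expected no fill writes at all. */
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_IFSET);

        hid_t dset = H5Dcreate2(file, "data", H5T_IEEE_F64LE, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
        return 0;
    }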

This is for parallel collective I/O on a Lustre filesystem with:

cray-mpich/8.1.26
cray-hdf5-parallel/1.12.2.3

There was a 2014 thread related to this matter, but I don't see anything in it that clarifies what I'm observing.

Some instrumented observations:

Original (no chunking, no fill property specification):

+--------------------------------------------------------+
| MPIIO write access patterns for simulation_checkpoint_candidate.h5
|   ranks in communicator   = 64
|   independent writes      = 281
|   collective writes       = 5
|   independent writers     = 64
|   aggregators             = 8
|   stripe count            = 8
|   stripe size             = 140737488294032
|   system writes           = 26533
|   stripe sized writes     = 0
|   aggregators active      = 320,0,0,0 (1, <= 4, > 4, 8)
|   total bytes for writes  = 27521443760 = 26246 MiB = 25 GiB
|   ave system write size   = 1037253
|   read-modify-write count = 0
|   read-modify-write bytes = 0
|   number of write gaps    = 0
|   ave write gap size      = NA
+--------------------------------------------------------+
finished step: checkpoint_write Seconds elapsed: 19.207

Resulting actual file size: 27521443824 bytes.

Enabled chunking, no fill property specification (expecting documented default):

+--------------------------------------------------------+
| MPIIO write access patterns for simulation_checkpoint_candidate.h5
|   ranks in communicator   = 64
|   independent writes      = 292
|   collective writes       = 6
|   independent writers     = 64
|   aggregators             = 8
|   stripe count            = 8
|   stripe size             = 140737488294096
|   system writes           = 52807
|   stripe sized writes     = 0
|   aggregators active      = 384,0,0,0 (1, <= 4, > 4, 8)
|   total bytes for writes  = 55040335424 = 52490 MiB = 51 GiB
|   ave system write size   = 1042292
|   read-modify-write count = 0
|   read-modify-write bytes = 0
|   number of write gaps    = 0
|   ave write gap size      = NA
+--------------------------------------------------------+
finished step: checkpoint_write Seconds elapsed: 39.851

Resulting actual file size: 27521506904 bytes.

Added fill time property specification: H5D_FILL_TIME_NEVER

+--------------------------------------------------------+
| MPIIO write access patterns for simulation_checkpoint_candidate.h5
|   ranks in communicator   = 64
|   independent writes      = 292
|   collective writes       = 5
|   independent writers     = 64
|   aggregators             = 8
|   stripe count            = 8
|   stripe size             = 140737488294016
|   system writes           = 26553
|   stripe sized writes     = 0
|   aggregators active      = 320,0,0,0 (1, <= 4, > 4, 8)
|   total bytes for writes  = 27521506880 = 26246 MiB = 25 GiB
|   ave system write size   = 1036474
|   read-modify-write count = 0
|   read-modify-write bytes = 0
|   number of write gaps    = 14
|   ave write gap size      = 40
+--------------------------------------------------------+
EndOnly completed in 20.930 seconds

Resulting actual file size: 27521506904 bytes.

Enabled compression at level 6:

+--------------------------------------------------------+
| MPIIO write access patterns for simulation_checkpoint_candidate.h5
|   ranks in communicator   = 64
|   independent writes      = 297
|   collective writes       = 5
|   independent writers     = 64
|   aggregators             = 8
|   stripe count            = 8
|   stripe size             = 140737488294016
|   system writes           = 9995
|   stripe sized writes     = 0
|   aggregators active      = 320,0,0,0 (1, <= 4, > 4, 8)
|   total bytes for writes  = 10163583817 = 9692 MiB = 9 GiB
|   ave system write size   = 1016866
|   read-modify-write count = 0
|   read-modify-write bytes = 0
|   number of write gaps    = 13
|   ave write gap size      = 40
+--------------------------------------------------------+
finished step: checkpoint_write Seconds elapsed: 14.786

Resulting actual file size: 10160651983 bytes.

It is worth noting that h5dump confirmed the default fill time property was H5D_FILL_TIME_IFSET when I did not specify anything. This only deepens my confusion: H5D_FILL_TIME_IFSET doesn't behave the way I think it should.

         DATASET "data" {
            DATATYPE  H5T_IEEE_F64LE
            DATASPACE  SIMPLE { ( 24, 24, 1, 16, 16, 16, 729, 2 ) / ( 24, 24, 1, 16, 16, 16, 729, 2 ) }
            STORAGE_LAYOUT {
               CHUNKED ( 1, 1, 1, 16, 16, 16, 729, 2 )
               SIZE 27518828544
            }
            FILTERS {
               NONE
            }
            FILLVALUE {
               FILL_TIME H5D_FILL_TIME_IFSET
               VALUE  H5D_FILL_VALUE_DEFAULT
            }
            ALLOCATION_TIME {
               H5D_ALLOC_TIME_EARLY
            }
            ATTRIBUTE "startIndices" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 8 ) / ( 8 ) }
            }
         }
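
(For anyone checking this without h5dump: the same fill information can be read back through the API. A sketch, assuming dset is an open dataset handle:)

    #include <stdio.h>
    #include "hdf5.h"

    /* Print the fill time and fill-value status of an open dataset. */
    static void print_fill_info(hid_t dset)
    {
        hid_t dcpl = H5Dget_create_plist(dset);

        H5D_fill_time_t  ftime;
        H5D_fill_value_t fstatus;
        H5Pget_fill_time(dcpl, &ftime);        /* H5D_FILL_TIME_IFSET here    */
        H5Pfill_value_defined(dcpl, &fstatus); /* H5D_FILL_VALUE_DEFAULT here */

        printf("fill time=%d, fill value status=%d\n", (int)ftime, (int)fstatus);
        H5Pclose(dcpl);
    }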

By default, the fill value is "set" to 0, which is why H5D_FILL_TIME_IFSET writes the fill value even when you never call H5Pset_fill_value() yourself. To leave the fill value unset, pass NULL as the value to H5Pset_fill_value().
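
In code, that looks something like the following (a sketch; dcpl is your dataset creation property list):

    #include <assert.h>
    #include "hdf5.h"

    /* Passing NULL to H5Pset_fill_value() marks the fill value as
     * undefined, so H5D_FILL_TIME_IFSET then writes nothing. */
    static void unset_fill_value(hid_t dcpl)
    {
        H5Pset_fill_value(dcpl, H5T_NATIVE_DOUBLE, NULL);

        H5D_fill_value_t status;
        H5Pfill_value_defined(dcpl, &status);
        assert(status == H5D_FILL_VALUE_UNDEFINED);
    }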

OK, I look forward to testing a NULL fill value.
If this is indeed the case, I will suggest some clarifying improvements to the documentation.
But if this is the case, I have to wonder why the double-write behavior only occurs when I chunk my dataset. Without chunking, the number of bytes written matches the size of the dataset.

Are you writing the entire dataset in a single call to H5Dwrite()?

Yes, I am, in both the unchunked and chunked cases.
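
For reference, the write itself is a single collective call per rank, along these lines (a sketch; the selections and buffer are placeholders):

    #include "hdf5.h"

    /* One collective H5Dwrite() covering each rank's full selection. */
    static void write_all(hid_t dset, hid_t memspace, hid_t filespace,
                          const double *buf)
    {
        hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
        H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE); /* collective MPI-IO */

        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, buf);
        H5Pclose(dxpl);
    }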

It appears the contiguous allocation code actually behaves the other way: if the fill value is the default and the fill time is H5D_FILL_TIME_IFSET, it does not write the fill value. I'll have to think about whether this should be changed for consistency, or whether the chunked case should be changed instead.

Since you're writing in parallel (and without filters), it actually doesn't matter that you're writing the whole thing in one call, because the library needs to allocate the entire dataset as soon as it's created.
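
If the goal is simply to avoid the extra pass over the file, something like this on the creation property list should be enough (sketch):

    #include "hdf5.h"

    /* With parallel, unfiltered I/O the dataset is fully allocated at
     * creation time, so skipping fill writes avoids the second pass. */
    static void skip_fill_writes(hid_t dcpl)
    {
        H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY); /* forced in parallel anyway */
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);
    }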

Thank you for thinking about this. I agree consistency is best. I can work with either behavior. I just need to know which one to code for.
