Allocation time of chunked dataset (collective calls)


#1

Hi,

When I create a filtered, chunked dataset, the filter runs twice: once when the empty dataset is created, and again when it is written.

For efficiency, I would like to remove the first filter call (dataset creation) and if possible, avoid the initial disk allocation with empty data.
I’ve tried options such as H5D_ALLOC_TIME_INCR, but without success so far.

I’m using the C++ wrappers of HDF5 1.10.3 with collective calls.

Thanks,
Hans


#2

Hans, you should consider pre-creating the dataset (in serial) on rank 0, and then reopening it in parallel.

You can avoid writing fill values via H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER ).

G.


#3

Hi Gerd,

Thanks for your reply.

I’ll give it a try. I have another question, though: the file is created collectively. Is it OK to create a dataset sequentially in a file that was created collectively?

Regards,
Hans


#4

Hi Gerd,

I’m not sure how to do this.

The file is being created collectively. I create the dataset like this:
H5::DSetCreatPropList ds_creatplist;
ds_creatplist.setFillTime(H5D_FILL_TIME_NEVER);
ds_creatplist.setFilter((H5Z_filter_t)32768, H5Z_FLAG_OPTIONAL);
ds_creatplist.setAllocTime(H5D_ALLOC_TIME_LATE);
createDataSet("ds", H5::PredType::NATIVE_INT, dataspace, ds_creatplist);

As far as I can see, DSetCreatPropList has no options related to a sequential or collective call.

Do you mean create the dataset and the file sequentially on rank 0, and reopen them collectively?

Regards,
Hans


#5

No, if the file is opened via an MPI communicator, all processes in that communicator need to “witness” the dataset creation (or attribute or group creation, etc.). That’s part of the parallel HDF5 etiquette.


#6

Yes: create the dataset and the file sequentially on rank 0, and reopen them collectively.