I am working on an early-stage research project. Currently, I need to create two datasets, each about 2.5 TB, in which I can accumulate results over multiple runs. The slowest part of my application is allocating the datasets on disk. I have a few questions about this process.
Is the allocation/fill process for datasets performed in parallel?
Are there any tricks that I can use to speed up this process?
I am on a Lustre system, and it is just the allocation part that is slow. Reading and writing this entire dataset takes less than an hour after the allocation. However, the allocation itself is currently taking 3 hours.
Sadly, that doesn't really help; at some point the data space needs to be allocated, and that takes 3 hours to complete anyway.
Will
To provide some more details, the lifetime of the HDF5 file is:

1. A first application initializes the datasets.
2. Independent applications read data from the file and write their results to individual files.
3. A final application merges (through accumulation) the results from step 2 into the file (sketched below).
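To make step 3 concrete, here is roughly what the merge looks like. This is a minimal sketch assuming 1-D double datasets; the function, file, and dataset names are placeholders for illustration:

```c
#include <stdlib.h>
#include <hdf5.h>

/* Stream a partial-result dataset block by block and accumulate it
 * into the matching dataset in the main file. 1-D double datasets
 * and the names passed in are illustrative assumptions. */
void accumulate(const char *main_file, const char *part_file,
                const char *dset, hsize_t total, hsize_t block)
{
    hid_t mf = H5Fopen(main_file, H5F_ACC_RDWR,   H5P_DEFAULT);
    hid_t pf = H5Fopen(part_file, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t md = H5Dopen2(mf, dset, H5P_DEFAULT);
    hid_t pd = H5Dopen2(pf, dset, H5P_DEFAULT);
    hid_t mspaceF = H5Dget_space(md);   /* file-side dataspaces */
    hid_t pspaceF = H5Dget_space(pd);

    double *acc  = malloc(block * sizeof *acc);
    double *part = malloc(block * sizeof *part);

    for (hsize_t off = 0; off < total; off += block) {
        hsize_t n = (total - off < block) ? total - off : block;
        hid_t mem = H5Screate_simple(1, &n, NULL);
        H5Sselect_hyperslab(mspaceF, H5S_SELECT_SET, &off, NULL, &n, NULL);
        H5Sselect_hyperslab(pspaceF, H5S_SELECT_SET, &off, NULL, &n, NULL);

        /* read current totals and the new partial results ... */
        H5Dread(md, H5T_NATIVE_DOUBLE, mem, mspaceF, H5P_DEFAULT, acc);
        H5Dread(pd, H5T_NATIVE_DOUBLE, mem, pspaceF, H5P_DEFAULT, part);

        /* ... accumulate, then write the updated totals back */
        for (hsize_t i = 0; i < n; i++)
            acc[i] += part[i];
        H5Dwrite(md, H5T_NATIVE_DOUBLE, mem, mspaceF, H5P_DEFAULT, acc);
        H5Sclose(mem);
    }

    free(part); free(acc);
    H5Sclose(pspaceF); H5Sclose(mspaceF);
    H5Dclose(pd); H5Dclose(md);
    H5Fclose(pf); H5Fclose(mf);
}
```

Because step 3 reads existing values before adding to them, the dataset has to be initialized (to zero) somewhere before the first merge.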
If I set the fill time to H5D_FILL_TIME_NEVER, step 3 spends 3 hours allocating and initializing the dataset. If I use early allocation with fill-at-allocation, step 1 spends the three hours doing it instead.
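In HDF5 terms, the two configurations I tried look roughly like this; this is my reading of the settings as H5Pset_* calls, a sketch rather than the exact property list I used:

```c
#include <hdf5.h>

/* The two dataset-creation-property configurations discussed above. */
hid_t make_dcpl(int fill_early)
{
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    if (fill_early) {
        /* Allocate and fill at H5Dcreate time: the cost lands in step 1. */
        H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_ALLOC);
    } else {
        /* Never write fill values: initialization is deferred, and the
         * cost landed in step 3 instead. */
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);
    }
    return dcpl;  /* pass to H5Dcreate2(), then H5Pclose() */
}
```

Either way, one application ends up paying the full 3-hour cost.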
This is what led me to ask about parallel initialization. If the application is only using one node to write the fill values, then using multiple nodes should improve the aggregate bandwidth of the process.
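For illustration, this is the sort of thing I mean: with a parallel build of HDF5, the file could be created through the MPI-IO driver so that every rank shares the initialization work. A minimal sketch (the file name is a placeholder):

```c
#include <mpi.h>
#include <hdf5.h>

/* Create the file collectively through the MPI-IO driver
 * (requires a parallel HDF5 build; "results.h5" is a placeholder). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* All ranks participate in file creation; dataset creation and
     * fill-value writes can then also be performed collectively. */
    hid_t file = H5Fcreate("results.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```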
Sadly, it is required as part of the accumulation algorithm if I want to avoid a lot of extra data checking, since I loop over steps 2 and 3. I did find a solution (see my next post).
This pushed dataset initialization into step 1, where it belonged. Next, to deal with the file-system issues of creating 5.5 TB at once, I created a chunked dataset. This resolved everything; now it only takes an hour, which is what I expected.
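For anyone who hits this later, the creation code ended up along these lines; a sketch with illustrative shapes, chunk sizes, and names rather than my exact values:

```c
#include <hdf5.h>

/* The fix: a chunked dataset with early allocation, so step 1 pays
 * the (now much cheaper) initialization cost. The 2-D shape, chunk
 * size, and dataset name below are illustrative only. */
hid_t create_result_dset(hid_t file)
{
    hsize_t dims[2]  = {1048576, 327680};  /* ~2.5 TiB of doubles */
    hsize_t chunk[2] = {1024, 1024};       /* 8 MiB chunks */

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);

    hid_t dset = H5Dcreate2(file, "results", H5T_NATIVE_DOUBLE,
                            space, H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```

With chunking, the allocation is spread over many small chunk writes instead of one enormous contiguous reservation, which seems to be what the file system handled so poorly before.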