Allocating large datasets - parallel HDF5 1.10

I am working on an early-stage research project. Currently, I need to create two datasets, each about 2.5 TB, in which I can accumulate results over multiple runs. The slowest part of my application is allocating the datasets on disk. I have a few questions about this process:

  1. Is the allocation process/filling performed in parallel for datasets?
  2. Are there any tricks that I can use to speed up this process?

I am on a Lustre system, and it is only the allocation that is slow. Reading and writing the entire dataset takes less than an hour once it is allocated; the allocation itself, however, currently takes 3 hours.

Any ideas?

Hi!
Try H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER)
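That call goes on the dataset creation property list before H5Dcreate; roughly like this (the "file" and "space" handles here are placeholders):

/* Sketch: disable writing of fill values on the dataset creation
 * property list; "file" and "space" are placeholder handles. */
hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);    /* never write fill values */

hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_DOUBLE, space,
                       H5P_DEFAULT, dcpl, H5P_DEFAULT);
H5Pclose(dcpl);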

	Quincey

Hi,

Sadly that doesn’t really help; at some point the dataset’s space needs to be allocated, and that still takes 3 hours.

Will

To provide some more details, the lifetime of the HDF file is:

  1. A first application initializes the datasets
  2. Independent applications read data from the file and write their results to individual files
  3. A final application merges (through accumulation) the results from step 2 into the file

If I set the fill time to H5D_FILL_TIME_NEVER, step 3 spends 3 hours allocating and initializing the dataset. If the fill is instead done early, step 1 spends the 3 hours doing it.

This is what led me to ask about parallel initialization. If the application only uses a single node to write the fill values, then using multiple nodes should improve the fill bandwidth.
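(If I were to do it by hand, I imagine something like the sketch below, where every rank opens the dataset and collectively writes its own hyperslab of fill values. The dataset name, the 1-D layout, and the even division of the extent are placeholders, and in practice each rank would loop over smaller blocks rather than allocate one huge buffer.)

#include <hdf5.h>
#include <stdlib.h>

/* Each MPI rank writes its own slab of fill values collectively.
 * The file must have been opened with an MPI-IO file access
 * property list; "results" and the 1-D layout are placeholders. */
void fill_in_parallel(hid_t file, hsize_t total, int rank, int nranks)
{
    hsize_t per_rank = total / nranks;            /* assume it divides evenly */
    hsize_t start    = (hsize_t)rank * per_rank;

    hid_t dset   = H5Dopen(file, "results", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &per_rank, NULL);

    hid_t mspace = H5Screate_simple(1, &per_rank, NULL);
    double *buf  = calloc(per_rank, sizeof(double));   /* zeroed fill values */

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);       /* collective transfer */

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, mspace, fspace, dxpl, buf);

    free(buf);
    H5Pclose(dxpl);
    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
}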

Hi Will,
If you turn off writing fill values, there’s no time spent doing it in either step 1 or 3.

	Quincey

Sadly, the fill values are required by the accumulation algorithm (without them I would need a lot of extra data checking), since I may loop over steps 2 and 3. I did find a solution, though (see my next post).

The solution I found was the following. The first step was to set these properties on the dataset creation property list:

H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);
H5Pset_fill_time(dcpl, H5D_FILL_TIME_ALLOC);

This pushed dataset initialization into step 1, where it belonged. Next, to deal with the file-system issues of creating 5.5 TB at once, I created a chunked dataset. That resolved everything; it now takes only about an hour, which is what I expected.
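For reference, the creation code in step 1 ends up looking roughly like this (the shape, chunk size, and dataset name are placeholders rather than my actual values, and "file" is an already-open file handle):

/* Sketch of the dcpl setup described above; dims, chunk shape, and the
 * dataset name are placeholders, "file" is an already-open file handle. */
hsize_t dims[2]  = {1048576, 327680};            /* placeholder, roughly 2.5 TB of doubles */
hsize_t chunk[2] = {1024, 1024};                 /* placeholder chunk shape */

hid_t space = H5Screate_simple(2, dims, NULL);

hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);   /* allocate space at create time (step 1) */
H5Pset_fill_time(dcpl, H5D_FILL_TIME_ALLOC);     /* write fill values when space is allocated */
H5Pset_chunk(dcpl, 2, chunk);                    /* chunked layout instead of contiguous */

hid_t dset = H5Dcreate(file, "results", H5T_NATIVE_DOUBLE, space,
                       H5P_DEFAULT, dcpl, H5P_DEFAULT);

H5Pclose(dcpl);
H5Sclose(space);
H5Dclose(dset);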