I am working on an early-stage research project. Currently, I need to create two datasets, each about 2.5 TB, in which I can accumulate results over multiple runs. The slowest part of my application is allocating the datasets on disk. I have a few questions about this process.
Is the allocation/fill process for datasets performed in parallel?
Are there any tricks that I can use to speed up this process?
I am on a Lustre system, and it is just the allocation part that is slow. Reading and writing this entire dataset takes less than an hour after the allocation. However, the allocation itself is currently taking 3 hours.
Sadly, that doesn't really help; at some point the data space needs to be allocated, and that takes 3 hours to complete anyway.
Will
To provide some more details, the lifetime of the HDF5 file is:

1. A first application initializes the datasets.
2. Independent applications read data from the file and write their results to individual files.
3. A final application merges (through accumulation) the results from step 2 into the file (sketched below).
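To make step 3 concrete, here is roughly what the merge looks like. This is a minimal sketch assuming 1-D double datasets; the function, file, and dataset names are placeholders for illustration:

```c
#include <stdlib.h>
#include <hdf5.h>

/* Stream a partial-result dataset block by block and accumulate it
 * into the matching dataset in the main file. 1-D double datasets
 * and the names passed in are illustrative assumptions. */
void accumulate(const char *main_file, const char *part_file,
                const char *dset, hsize_t total, hsize_t block)
{
    hid_t mf = H5Fopen(main_file, H5F_ACC_RDWR,   H5P_DEFAULT);
    hid_t pf = H5Fopen(part_file, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t md = H5Dopen2(mf, dset, H5P_DEFAULT);
    hid_t pd = H5Dopen2(pf, dset, H5P_DEFAULT);
    hid_t mspaceF = H5Dget_space(md);   /* file-side dataspaces */
    hid_t pspaceF = H5Dget_space(pd);

    double *acc  = malloc(block * sizeof *acc);
    double *part = malloc(block * sizeof *part);

    for (hsize_t off = 0; off < total; off += block) {
        hsize_t n = (total - off < block) ? total - off : block;
        hid_t mem = H5Screate_simple(1, &n, NULL);
        H5Sselect_hyperslab(mspaceF, H5S_SELECT_SET, &off, NULL, &n, NULL);
        H5Sselect_hyperslab(pspaceF, H5S_SELECT_SET, &off, NULL, &n, NULL);

        /* read current totals and the new partial results ... */
        H5Dread(md, H5T_NATIVE_DOUBLE, mem, mspaceF, H5P_DEFAULT, acc);
        H5Dread(pd, H5T_NATIVE_DOUBLE, mem, pspaceF, H5P_DEFAULT, part);

        /* ... accumulate, then write the updated totals back */
        for (hsize_t i = 0; i < n; i++)
            acc[i] += part[i];
        H5Dwrite(md, H5T_NATIVE_DOUBLE, mem, mspaceF, H5P_DEFAULT, acc);
        H5Sclose(mem);
    }

    free(part); free(acc);
    H5Sclose(pspaceF); H5Sclose(mspaceF);
    H5Dclose(pd); H5Dclose(md);
    H5Fclose(pf); H5Fclose(mf);
}
```

Because step 3 reads existing values before adding to them, the dataset has to be initialized (to zero) somewhere before the first merge.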
If I set the fill time to H5D_FILL_TIME_NEVER, step 3 spends 3 hours allocating and initializing the dataset. If I use early allocation with fill-at-allocation, step 1 spends the three hours doing it instead.
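In HDF5 terms, the two configurations I tried look roughly like this; this is my reading of the settings as H5Pset_* calls, a sketch rather than the exact property list I used:

```c
#include <hdf5.h>

/* The two dataset-creation-property configurations discussed above. */
hid_t make_dcpl(int fill_early)
{
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    if (fill_early) {
        /* Allocate and fill at H5Dcreate time: the cost lands in step 1. */
        H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_ALLOC);
    } else {
        /* Never write fill values: initialization is deferred, and the
         * cost landed in step 3 instead. */
        H5Pset_fill_time(dcpl, H5D_FILL_TIME_NEVER);
    }
    return dcpl;  /* pass to H5Dcreate2(), then H5Pclose() */
}
```

Either way, one application ends up paying the full 3-hour cost.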
This is what led me to ask about parallel initialization. If the application is only using one node to write the fill values, then using multiple nodes should improve the aggregate bandwidth of the process.
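For illustration, this is the sort of thing I mean: with a parallel build of HDF5, the file could be created through the MPI-IO driver so that every rank shares the initialization work. A minimal sketch (the file name is a placeholder):

```c
#include <mpi.h>
#include <hdf5.h>

/* Create the file collectively through the MPI-IO driver
 * (requires a parallel HDF5 build; "results.h5" is a placeholder). */
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);

    /* All ranks participate in file creation; dataset creation and
     * fill-value writes can then also be performed collectively. */
    hid_t file = H5Fcreate("results.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, fapl);

    H5Fclose(file);
    H5Pclose(fapl);
    MPI_Finalize();
    return 0;
}
```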
Sadly, it is required as part of the accumulation algorithm if I want to avoid a lot of extra data checking, since I loop over steps 2 and 3. I did find a solution (see my next post).
This pushed dataset initialization into step 1, where it belonged. Next, to deal with the file-system issues of creating 5.5 TB at once, I created a chunked dataset. This resolved everything; now it only takes an hour, which is what I expected.
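For anyone who hits this later, the creation code ended up along these lines; a sketch with illustrative shapes, chunk sizes, and names rather than my exact values:

```c
#include <hdf5.h>

/* The fix: a chunked dataset with early allocation, so step 1 pays
 * the (now much cheaper) initialization cost. The 2-D shape, chunk
 * size, and dataset name below are illustrative only. */
hid_t create_result_dset(hid_t file)
{
    hsize_t dims[2]  = {1048576, 327680};  /* ~2.5 TiB of doubles */
    hsize_t chunk[2] = {1024, 1024};       /* 8 MiB chunks */

    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);

    hid_t dset = H5Dcreate2(file, "results", H5T_NATIVE_DOUBLE,
                            space, H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```

With chunking, the allocation is spread over many small chunk writes instead of one enormous contiguous reservation, which seems to be what the file system handled so poorly before.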