I/O optimization when writing many datasets

Hello all,

In the image-processing problem I am currently working on, I need to write large HDF5 files (hundreds of GB) containing several million smallish (~KB to MB) extendible datasets. Although I/O performance has been encouraging so far (especially compared to writing individual binary files), as far as I can tell there are three main parameters open to tweaking that could further increase write performance: chunk size, metadata cache size, and buffer size.

Since I am working on high-performance servers with at least 128 GB of RAM, write performance is paramount and I could easily cope with a reasonable increase in memory usage and final file size. Being relatively new to HDF5, I am unsure how best to set the cache and buffer sizes as well as the chunk size, or whether the default settings are already adequate. I would be very grateful for any suggestions!

Thanks,

Patrick


Patrick,

If you are only writing to very small datasets, then the default chunk cache size (1 MB) is most likely large enough, since this limit applies to each dataset individually. However, if you are regularly rewriting or reading the same portions of a dataset, and it can grow beyond 1 MB, then you may see a benefit from increasing the cache size. Depending on your chunk size, you may also want to increase the number of chunk slots in the cache from the default of 521 (keep it a prime number). Be careful about having too many datasets open at once, though, since the cache is allocated per open dataset: with several million datasets open you could potentially have several million megabytes of cache.
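
For example, here is a minimal sketch (assuming the C API; the dataset name and cache numbers are illustrative, not values from this thread) of enlarging the per-dataset chunk cache when opening a dataset:

#include "hdf5.h"

/* Open one dataset with an enlarged chunk cache. The dataset name and the
 * numbers below are placeholders. */
hid_t open_with_larger_cache(hid_t file_id)
{
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);

    /* Raise this dataset's chunk cache from the 1 MB default to 8 MB.
     * The slot count should stay prime, ideally around 100x the number
     * of chunks that fit in the cache at once. */
    size_t nslots = 12421;            /* prime, illustrative       */
    size_t nbytes = 8 * 1024 * 1024;  /* 8 MB of chunk cache       */
    double w0     = 0.75;             /* default preemption policy */
    H5Pset_chunk_cache(dapl, nslots, nbytes, w0);

    hid_t dset = H5Dopen2(file_id, "/images/frame_000001", dapl);
    H5Pclose(dapl);
    return dset;
}

The same three chunk-cache values can also be set file-wide with H5Pset_cache on the file access property list, which then acts as the default for every dataset opened without its own setting.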

The chunk size should match your typical write (or read) selection as closely as possible. This minimizes costly scattering as well as wasted space in the cache. However, you should not make the chunks too small either, in order to avoid excessive per-chunk overhead.
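
As a sketch of that point (the 256x256 tile size and dataset name are assumptions, not something from this thread), an extendible dataset whose chunk shape matches the block written per call could be created like this:

#include "hdf5.h"

/* Create an extendible dataset whose chunks match the typical write block.
 * The 256x256 tile and the dataset name are placeholders. */
hid_t create_extendible(hid_t file_id)
{
    hsize_t dims[2]    = {0, 256};              /* start empty along dim 0 */
    hsize_t maxdims[2] = {H5S_UNLIMITED, 256};  /* extendible along dim 0  */
    hid_t   space      = H5Screate_simple(2, dims, maxdims);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    /* One chunk per written block, so each H5Dwrite fills whole chunks and
     * no partially filled chunk has to be read back, modified and rewritten. */
    hsize_t chunk[2] = {256, 256};
    H5Pset_chunk(dcpl, 2, chunk);

    hid_t dset = H5Dcreate2(file_id, "/images/frame_000001", H5T_NATIVE_FLOAT,
                            space, H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}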

Thanks,
-Neil Fortner


----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.
