I’m writing a compound datatype into HDF5 datasets and am wondering what the recommended approach is for setting chunk dimensions. Please let me know.
Here’s some sample code to write chunked datasets:
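(Simplified sketch of that code for the post; the struct fields, dataset name, and sizes below are placeholders for the real record layout, and error checking is omitted.)

```c
#include "hdf5.h"
#include <stdlib.h>

/* Placeholder record; the real compound has more fields. */
typedef struct {
    int    id;
    double value;
    float  weight;
} record_t;

int main(void)
{
    hsize_t dims[1]       = {1000000}; /* number of records */
    hsize_t chunk_dims[1] = {8192};    /* records per chunk  */

    /* Compound datatype matching the in-memory struct */
    hid_t rec_type = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(rec_type, "id",     HOFFSET(record_t, id),     H5T_NATIVE_INT);
    H5Tinsert(rec_type, "value",  HOFFSET(record_t, value),  H5T_NATIVE_DOUBLE);
    H5Tinsert(rec_type, "weight", HOFFSET(record_t, weight), H5T_NATIVE_FLOAT);

    hid_t file  = H5Fcreate("sample.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(1, dims, NULL);

    /* Chunking (and, optionally, compression) go on the dataset creation plist */
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk_dims);
    /* H5Pset_deflate(dcpl, 6); */     /* uncomment to add gzip compression */

    hid_t dset = H5Dcreate2(file, "records", rec_type, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    record_t *buf = calloc(dims[0], sizeof(record_t));
    H5Dwrite(dset, rec_type, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    free(buf);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(rec_type);
    H5Fclose(file);
    return 0;
}
```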
It depends… Without more context, all we can suggest is that chunks should be neither too small nor too large.
Here are a few questions to ponder:
Do you need chunking? Why?
What’s the size (in bytes) of an individual dataset element? (A record, in your example.)
How many elements do you expect in a typical dataset?
How do you access your data? Write-Once-Read-Many? How does your data change? Append-only? Etc.
Are read and write performances equally important? Is one more important than the other?
Do you write and read the data on the same or different systems? How do they differ?
Do you always access entire records (compound datatype)?
If you clue us in, we can get more specific.
The easiest way to play w/ chunking (and compression) is h5repack. You don’t have to write a single line of code. Just experiment. Observe the file sizes and repack times!
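For example, starting from an existing (contiguous) file, something along these lines should work; the chunk size and compression level are just starting points, and the file names are illustrative:

```
# chunked + gzip-compressed variant
h5repack -l CHUNK=8192 -f GZIP=6 original.h5 chunked_gzip.h5

# chunked but uncompressed, for comparison
h5repack -l CHUNK=8192 original.h5 chunked.h5

ls -l original.h5 chunked.h5 chunked_gzip.h5
```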
Do you need chunking? Why?
Chunking can help improve read performance, can’t it?
What’s the size (in bytes) of an individual dataset element? (A record, in your example.)
I’m using several compound datatypes (one for each dataset in the HDF5 file). The per-record byte count can vary from 50 to 200.
How many elements do you expect in a typical dataset?
Some datasets can contain millions of elements.
How do you access your data? Write-Once-Read-Many? How does your data change? Append-only? Etc.
Yes, write-once-read-many. I don’t expect the data to change after it’s written.
Are read and write performances equally important? Is one more important than the other?
In general, read performance is more important than write performance, since the write happens only once, in a batch application.
Do you write and read the data on the same or different systems? How do they differ?
The write will likely happen on a different system than the read, but all systems are very similar in their configurations (Red Hat 6/7, for example).
Do you always access entire records (compound datatype)?
Yes, we need to access the entire compound datatype.
Are your datasets one-dimensional or higher-dimensional? If they are one-dimensional, then read performance with contiguous layout will be about as fast as it can be. If they are higher-dimensional, then there are read patterns where chunking can be beneficial, but there is also slow-down potential if you are reading regions that cover multiple chunks. Do you always read entire datasets or just regions? (See the partial-read sketch at the end of this post for what I mean by a region.)
Are your applications sequential or do they use some form of parallelism?
Do you know the number of dataset elements in advance? If not, do you have a reasonable estimate of the maximum number of elements? Are you using/planning to use compression?
Given your numbers, the size of your datasets is in the range of 50 - 2,000 MB. Right?
What are the datatypes of the fields in your compounds? Just integers and floating-point numbers, or are there strings in the mix? If so, are they fixed-size, and what’s the typical length?
Can you send us the output of h5dump -pBH sample.h5 of a typical file?
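To clarify what I mean by “regions”: partial reads along these lines, where only a hyperslab of a 1-D dataset is selected. (This is just a sketch; the dataset name and record type are placeholders, and error checking is omitted.)

```c
#include "hdf5.h"

/* Read `count` records starting at `start` from a 1-D compound dataset.
   "records" and rec_type are placeholders for your actual names/types. */
static void read_region(hid_t file, hid_t rec_type,
                        hsize_t start, hsize_t count, void *buf)
{
    hid_t dset   = H5Dopen2(file, "records", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    /* Select the region in the file ... */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &start, NULL, &count, NULL);

    /* ... and a matching memory dataspace. */
    hid_t mspace = H5Screate_simple(1, &count, NULL);

    H5Dread(dset, rec_type, mspace, fspace, H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
}
```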
Are your datasets one-dimensional or higher-dimensional?
Mostly one-dimensional.
Do you always read entire datasets or just regions?
In some cases entire datasets are read, but in most cases the reads will be partial.
Are your applications sequential or do they use some form of parallelism?
We’re considering reading datasets from multiple threads, but the HDF5 file generation itself would be single-threaded.
Do you know the number of dataset elements in advance? If not, do you have a reasonable estimate of the maximum number of elements? Are you using/planning to use compression?
During the write, the number of dataset elements is known, and metadata is written to capture this information for each dataset. We also plan to use compression.
Given your numbers, the size of your datasets is in the range of 50 - 2,000 MB. Right?
Yes, that’s correct, although the upper limit can be higher.
What are the datatypes of the fields in your compounds? Just integers and floating-point numbers, or are there strings in the mix? If so, are they fixed-size, and what’s the typical length?
Most of the fields are int/double/float. There are one or two variable-length strings; their length can vary from tens of bytes to a few thousand.
I cannot provide h5dump due to the proprietary nature of our application.
I’d say start with contiguous layout as your baseline. Otherwise, how will you know whether there is any improvement or whether you’ve actually made matters worse? Unless you are dying to write code, you could then use h5repack to play with chunked or chunked-and-compressed variants. The code for your readers does not need to change for that.
Aside from the mixture of types (and perhaps value variability), you should be careful w/ variable-length strings, regardless of dataset storage layout. We had an instructional presentation by @steven the other day (see https://www.youtube.com/watch?v=jLUhmprV5kQ, about 17 min into the clinic) on the “devastating” impact that variable-length strings can have on performance. Unlike fixed-size strings, these values also won’t be compressed.
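If you can bound the string lengths, a fixed-size string field is worth testing against the variable-length version. A rough sketch of the two member types (the length cap is an assumption, not a recommendation):

```c
#include "hdf5.h"

/* Variable-length string member: each value lives in separate heap storage,
   which is what hurts I/O and is not touched by chunk compression. */
static hid_t make_vlen_string(void)
{
    hid_t t = H5Tcopy(H5T_C_S1);
    H5Tset_size(t, H5T_VARIABLE);
    return t;
}

/* Fixed-size alternative: pad/truncate to a known maximum length. These
   bytes sit inside each record, so they chunk and compress with the rest. */
static hid_t make_fixed_string(size_t max_len /* e.g. 256 */)
{
    hid_t t = H5Tcopy(H5T_C_S1);
    H5Tset_size(t, max_len);
    H5Tset_strpad(t, H5T_STR_NULLTERM);
    return t;
}
```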
For a first experiment, aim for (un)compressed chunks of about 1 MB and see how that compares to your baseline. Definitely try both uncompressed and compressed chunks!
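With 50-200-byte records, ~1 MB per chunk works out to roughly 5,000-20,000 records per chunk (e.g., 1,048,576 / 128 ≈ 8,192). A sketch of the corresponding dataset creation property list (the numbers are assumptions to adjust for your records):

```c
#include "hdf5.h"

/* Build a dataset creation plist targeting ~1 MiB chunks for a given record size. */
static hid_t make_1mib_chunk_dcpl(size_t record_size /* e.g. 128 */)
{
    hsize_t chunk_dims[1] = { (1024 * 1024) / record_size }; /* ~8192 records */
    hid_t   dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk_dims);
    H5Pset_deflate(dcpl, 6); /* drop this line for the uncompressed variant */
    return dcpl;
}
```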