My experience with the interconnection between editability, chunking, and szip

I’m happily using HDF5 now, after quite a few problems caused by the (IMO overly) complex interfaces. I’d like to share my experiences and maybe get some feedback. I mainly use C++ (read and write), but also some Python (reading).

In our software we have data in ArrayND objects, which hold the data in one flat array of one of a few selectable types. The dimension sizes are in an attached ArrayNDInfo object. HDF5 nicely supports putting this data into DataSets. We have really large data sets (think TBs) but also really tiny ones, and I want to store them, where needed and possible, chunked and zipped. On top of that, some data sets need to be resizable (usually just in the last dimension) when the data file is opened for edit.

I have struggled quite a bit with the entanglement of these requirements. For example, only chunked datasets are resizable. This means that if you want the ability to resize a dataset, you have to chunk it, whether or not chunking is in any way necessary otherwise.
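To illustrate, a minimal sketch using the C++ API (the file handle, dataset name and sizes are made up): a resizable dataset needs both unlimited maxdims on its dataspace and a chunked creation property list, otherwise createDataSet() throws.

// Sketch: create a resizable 1D dataset. 'file' is an open H5::H5File;
// the dataset name and the sizes are purely illustrative.
const int rank = 1;
hsize_t dims[1] = { 1000 };               // current size
hsize_t maxdims[1] = { H5S_UNLIMITED };   // resizable => unlimited maxdims ...
hsize_t chunkdims[1] = { 256 };           // ... and a chunked layout

H5::DataSpace dataspace( rank, dims, maxdims );
H5::DSetCreatPropList proplist;
proplist.setChunk( rank, chunkdims );     // leave this out and creation fails

H5::DataSet ds = file.createDataSet( "resizable_ds",
                        H5::PredType::NATIVE_FLOAT, dataspace, proplist );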

How big should the chunks be? Hard to tell. Say you want a settable chunk size, the same in all dimensions, so your users can experiment. But hold on: be careful that the chunk size you use for any dimension does not exceed that dimension’s actual size (hence the clamp in the per-dimension logic below). Fail to check this and it will cost you an exception.

But then, if resizability is required, you need to make sure the chunking actually does something in at least one dimension. So find the largest dimension and make sure its chunk size results in at least 2 chunks (that is what getChunkSz4TwoChunks() below is for). If you simply make all the chunk sizes equal to the dimension sizes, HDF5 will again cause trouble.

SZip can only be used if at least one of the dimensions is larger than szip_pixels_per_block. Oh, and you need to check whether SZip is available at all, using H5Zfilter_avail( H5Z_FILTER_SZIP ).
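A sketch of that availability check (the filter can be present but decode-only, so it seems safest to also look at the encoder flag via H5Zget_filter_info):

// Sketch: is the SZip filter available, and can it actually encode?
static bool szipEncodingPossible()
{
    if ( H5Zfilter_avail(H5Z_FILTER_SZIP) <= 0 )
        return false;                       // filter not present at all

    unsigned filterinfo = 0;
    if ( H5Zget_filter_info(H5Z_FILTER_SZIP,&filterinfo) < 0 )
        return false;                       // could not query the filter

    return (filterinfo & H5Z_FILTER_CONFIG_ENCODE_ENABLED) != 0;
}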

The rule is: HDF5 will relentlessly punish you if any of these requirements is not met, and then the data is simply not saved. All in all, I now have a lot of logic surrounding the writing of our data sets, all because HDF5 is so unforgiving about these things. The squeeze is between the users (who above all want their data saved; efficient and fast is important, but useless if the data is not saved at all) and HDF5, which forgives basically nothing.

With all this in mind, I arrived at the following basic logic, which seems to work in the situations I have tested:

I figure out:
mustchunk (set when, for some reason, the dataset needs to be editable/resizable)
havelargerdimthanchunksz (is any dimension larger than the configured chunk size?)
largestdim (the index of the largest dimension)
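Roughly like this (a sketch; dims, nrdims_, chunksz_ and createeditable_ are members of our writer, and the exact names are only illustrative):

const bool mustchunk = createeditable_;     // editable => must be chunked
bool havelargerdimthanchunksz = false;
int largestdim = 0;
for ( int idim=0; idim<nrdims_; idim++ )
{
    if ( dims[idim] > chunksz_ )
        havelargerdimthanchunksz = true;
    if ( dims[idim] > dims[largestdim] )
        largestdim = idim;
}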

then for each dim:
hsize_t chunkdim = dimsz < chunksz_ ? dimsz : chunksz_;
if ( mustchunk && !havelargerdimthanchunksz && idim == largestdim )
chunkdim = getChunkSz4TwoChunks( chunkdim );

then:
wantchunk = maxdim > maxchunkdim
canzip = szip_encoding_possible && maxdim >= szip_pixels_per_block
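Spelled out, that per-dimension loop plus the two decisions might look like this (again a sketch; I assume chunkdims is an hsize_t array with room for nrdims_ entries):

// Build chunkdims, track the largest data and chunk dimensions, then
// decide whether chunking and SZip compression are wanted/possible.
hsize_t maxdim = 0, maxchunkdim = 0;
for ( int idim=0; idim<nrdims_; idim++ )
{
    const hsize_t dimsz = dims[idim];
    hsize_t chunkdim = dimsz < chunksz_ ? dimsz : chunksz_;
    if ( mustchunk && !havelargerdimthanchunksz && idim == largestdim )
        chunkdim = getChunkSz4TwoChunks( chunkdim );

    chunkdims[idim] = chunkdim;
    if ( dimsz > maxdim )
        maxdim = dimsz;
    if ( chunkdim > maxchunkdim )
        maxchunkdim = chunkdim;
}

const bool wantchunk = maxdim > maxchunkdim;
const bool canzip = szip_encoding_possible
                 && maxdim >= szip_pixels_per_block;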

then:

try
{
    H5::DataSpace dataspace( nrdims_, dims, mDSResizing );
    H5::DSetCreatPropList proplist;
    if ( mustchunk || wantchunk )
    {
        proplist.setChunk( nrdims_, chunkdims );
        if ( canzip )
            proplist.setSzip( szip_options_mask, szip_pixels_per_block );
    }
    dataset_ = group_.createDataSet( dsky.dataSetName(),
                                     h5dt, dataspace, proplist );
}
catch ( const H5::Exception& exc )
{
    // creation failed (chunking/SZip/dataspace problem): report the error
}

with:

static hsize_t getChunkSz4TwoChunks( hsize_t dimsz )
{
    hsize_t ret = dimsz / 2;
    if ( dimsz % 2 )
        ret++;
    if ( ret < 1 )
        ret = 1; // what can I do?
    return ret;
}

and:

#define mUnLim4 H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED, H5S_UNLIMITED
static hsize_t resizablemaxdims_[24] =
{ mUnLim4, mUnLim4, mUnLim4, mUnLim4, mUnLim4, mUnLim4 };
#define mDSResizing (createeditable_ ? resizablemaxdims_ : 0)
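For completeness, this is what those unlimited maxdims buy you later, when the file is opened for edit (a sketch; nrnewsamples is just an illustrative amount to append):

// Sketch: grow the last dimension of the existing, chunked dataset.
// This only succeeds because it was created with H5S_UNLIMITED maxdims.
H5::DataSpace filespace = dataset_.getSpace();
hsize_t newdims[ 24 ];
const int nrdims = filespace.getSimpleExtentDims( newdims );
newdims[nrdims-1] += nrnewsamples;
dataset_.extend( newdims );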

I use:

static unsigned szip_options_mask = H5_SZIP_EC_OPTION_MASK; // entropy coding
                     // nearest neighbour coding: H5_SZIP_NN_OPTION_MASK
static unsigned szip_pixels_per_block = 16;
                    // must be an even number in [2,32]