Optimizing writes for unlimited-time acquisition and SZIP compression

Hi forum,

I apologize ahead of time for the verbosity of this message, but the devil
is in the details. I've scoured the archives and have had trouble finding
something similar to my problem in terms of scale.

I am using an analog-to-digital acquisition device which delivers 16-bit
integers in 2-D, row-major, contiguous blocks. The row (time) dimension can
be 128, 256, 512, or 1024, and the column (space) dimension can be any
integer between 1000 and 50000. This means our largest block delivered from
the device is 1024 * 50000 * 16 / 8 / 1024 / 1024 ≈ 97.66 megabytes to be
written to disk per block. I receive roughly 3 of these blocks per second,
which translates to approximately 300 MB/s. For the smaller blocks, I simply
receive more of them per second, again equating to roughly 300 MB/s.

Since we are acquiring data for an unknown amount of time, the row
dimension is unlimited; say the space dimension is 50000:

    const std::string name = data_name;
    const hsize_t dims[2] = {0, 50000};
    const hsize_t maxdims[2] = {H5S_UNLIMITED, 50000};

    const hsize_t time_count = 1024;
    const hsize_t chunk_dims[2] = {time_count, 50000};
    const size_t chunk_size = chunk_dims[0] * chunk_dims[1];

    auto create_plist = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(create_plist, 2, chunk_dims);

    auto type = get_type(elementDesc, 0); // helper function to get the data type we are dealing with

    auto access_plist = H5Pcreate(H5P_DATASET_ACCESS);
    const size_t rdcc_nbytes = 0;
    const size_t rdcc_nslots = 0;
    const double rdcc_w0 = 1;
    H5Pset_chunk_cache(access_plist, rdcc_nslots, rdcc_nbytes, rdcc_w0);

    if (use_szip_filter()) {
        const size_t options_mask = H5_SZIP_NN_OPTION_MASK;
        const size_t pixels_per_block = 16u;
        H5Pset_szip(create_plist, options_mask, pixels_per_block);
    }

    auto datatype = H5Tcopy(std::get<1>(type));
    H5Pset_fill_value(create_plist, datatype, NULL);

    auto dataspace = H5Screate_simple(2, dims, maxdims);
    auto dataset = H5Dcreate(file, name.c_str(), datatype, dataspace,
                             H5P_DEFAULT, create_plist, access_plist);
    if (dataset < 0)
        throw FileWriterError("Unable to create data var");

    H5Sclose(dataspace);
    H5Pclose(access_plist);
    H5Pclose(create_plist);
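
As an aside (a sketch on my part, not something in the code above): for a pure
append workload it may also be worth disabling fill-value writes on the same
dataset-creation property list, so HDF5 never spends time initializing chunks
that are about to be fully overwritten. Whether this matters at 300 MB/s is an
assumption I haven't measured:

    // Sketch: skip fill-value writes and allocate chunk space as it is written.
    // Applied to the same create_plist before H5Dcreate; assumes every chunk
    // is fully overwritten by the acquisition writes.
    H5Pset_fill_time(create_plist, H5D_FILL_TIME_NEVER);
    H5Pset_alloc_time(create_plist, H5D_ALLOC_TIME_INCR); // incremental is already the default for chunked layouts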

Every time a block comes in from the acquisition device, the dataset is
extended along the row dimension, i.e. by 1024, and a write is performed:

    const hsize_t size[2] = {some_previous_multiple_of_1024 + 1024, 50000};
    H5Dextend(data_var.id, size);

    const hsize_t dims[2] = {1024, 50000};
    const hsize_t offset[2] = {some_previous_multiple_of_1024, 0};
    auto filespace = H5Dget_space(data_var.id);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, dims, NULL);

    auto memspace = H5Screate_simple(2, dims, NULL);
    if (H5Dwrite(data_var.id, data_var.type, memspace, filespace, H5P_DEFAULT, data) < 0)
        return 0;

    H5Sclose(memspace);
    H5Sclose(filespace);
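
A side note: H5Dextend is deprecated in the 1.8 API in favor of H5Dset_extent,
which takes the new total size in the same way. A minimal sketch using the
same names as above:

    // Sketch: extend the unlimited time dimension with the non-deprecated call.
    const hsize_t new_size[2] = {some_previous_multiple_of_1024 + 1024, 50000};
    if (H5Dset_extent(data_var.id, new_size) < 0)
        throw FileWriterError("Unable to extend data var");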

I have experimented with a variety of chunking and caching strategies, both
in time and space. For example, since the acquired data would typically be
processed sequentially in time (row by row), the literature suggested using
a single row as a chunk, with dimensions 1 x 50000, which is approximately
0.1 MB per chunk. However, every strategy I tried only hindered write
performance. The only way I was able to achieve a write speed of 300 MB/s
was by disabling caching completely. I also disabled strip mining, though
this may have had a negligible impact. In this configuration the chunk
dimensions match the acquisition blocks exactly, one chunk per write. This
seemed counterintuitive to me, at least the portion about caching.
Furthermore, I take a considerable I/O performance hit when SZIP compression
is enabled, and that's using experimental values for chunking in both the
time and space dimensions. Does anyone have any thoughts on this, or
suggestions for appropriate cache and chunking parameters? Or any other
configurable parameters I've missed?
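
For reference, this is roughly the shape of cache parameters I would expect
to be appropriate if the cache were re-enabled, sized so a single 1024 x 50000
chunk of 16-bit samples fits entirely (the slot count is an arbitrary prime,
not a tested value):

    // Sketch: cache large enough for one 1024 x 50000 chunk of int16
    // (1024 * 50000 * 2 bytes = 102,400,000 bytes).
    const size_t rdcc_nbytes = 1024 * 50000 * sizeof(int16_t);
    const size_t rdcc_nslots = 12421;  // prime, well above the number of chunks held in cache
    const double rdcc_w0 = 1.0;        // evict fully written chunks first
    H5Pset_chunk_cache(access_plist, rdcc_nslots, rdcc_nbytes, rdcc_w0);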

Also, I have been trying to understand the B-tree usage. Since in my
application the data is only ever written or read sequentially in time, it
would seem to me that the fastest layout would be a root node with a child
whose children only ever have one child themselves; in other words, the tree
would be something more akin to a linked list. Does this make logical sense,
and is there any way to take advantage of this assumption?
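
The only related knob I know of is the chunk B-tree rank on the file-creation
property list; I haven't measured whether it helps a purely sequential index,
so treat this as a sketch:

    // Sketch: raise the chunk B-tree 1/2 rank so index nodes hold more entries
    // and the tree stays shallower; 32 is the default, 64 is an arbitrary example.
    hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
    H5Pset_istore_k(fcpl, 64);
    hid_t file = H5Fcreate("acquisition.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fcpl); // filename is a placeholder
    H5Pclose(fcpl);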

I appreciate any input.

Best regards,
Brock Hargreaves

Hi Brock,

Have you investigated the HDF5 Packet Table API? It was created precisely for data acquisition problems. I've never used it and thus can't provide any personal experience, but that would be my starting point.

Cheers,
--dan

--
Daniel Kahn
Science Systems and Applications Inc.
301-867-2162

Hi Daniel,

Thanks for the response. Perhaps I misunderstood HDF5 Packet Tables when I
was reading about them a week or so ago. For example, examining their
creation signature:

    hid_t H5PTcreate_fl( hid_t loc_id, const char *dset_name, hid_t dtype_id, hsize_t chunk_size, int compression )

Versus a traditional HDF5 dataset, which can take various property lists, in
particular dcpl_id:

    hid_t H5Dcreate( hid_t loc_id, const char *name, hid_t dtype_id, hid_t space_id, hid_t lcpl_id, hid_t dcpl_id, hid_t dapl_id )

This gives me the impression that Packet Tables do not support filters,
such as the ones used for lossless compression. One of the main reasons I'm
looking into HDF5 is its ability to incorporate such filters.

Cheers,
Brock

Hi forum,

Could anyone confirm my claim about packet tables not supporting filters?
It would seem a natural thing to support, since filters require chunked data
and chunking is already a requirement of packet tables.

Cheers,
Brock

I use compression, and they do indeed support filters and type conversion.
Be careful, though, because filters or conversions may add significant
overhead in a data acquisition loop. H5PTcreate_fl has a compression
parameter (for the deflate/zip filter), which will clue you in on its support:
https://www.hdfgroup.org/HDF5/doc/HL/RM_H5PT.html#H5PTcreate_fl - you can
also take a look at H5PT.c to answer some of your questions, maybe.
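
For illustration, a minimal call with the upstream signature might look like
this (the file handle, dataset name, chunk size, and compression level are
placeholders; for a whole 50000-sample row per packet you would use an HDF5
array datatype rather than a plain int16):

    // Sketch: packet table of 16-bit samples, chunked every 1024 packets,
    // with the built-in deflate filter at level 6 (-1 disables compression).
    hid_t pt = H5PTcreate_fl(file, "acq_packets", H5T_NATIVE_INT16, 1024, 6);
    if (pt >= 0)
        H5PTclose(pt);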

However, there is no version of this function upstream that allows you to
specify your own compression. I have patched one in and intend to submit
these patches upstream (again? I believe I submitted this one in the past,
but could be wrong). The patched variant accepts a dataset-creation property
list, so you can specify whatever you want for compression/filters:

    H5PTcreate_fl2(hid_t loc_id, const char *dset_name, hid_t dtype_id, hsize_t chunk_size, hid_t plist_id)

I'll try and remember to submit my latest patches in the next few days.
But your answer is: technically yes, but the exposure of filters is very
restricted - it's not clear why this was ever the case.
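
With the patched variant, passing a dcpl carrying SZIP might look something
like the following; I won't swear from memory to exactly how the plist
interacts with the chunk_size argument, so treat this as a sketch:

    // Sketch (requires the patched H5PTcreate_fl2, not upstream HDF5):
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_szip(dcpl, H5_SZIP_NN_OPTION_MASK, 16);
    hid_t pt = H5PTcreate_fl2(file, "acq_packets_szip", H5T_NATIVE_INT16, 1024, dcpl);
    H5Pclose(dcpl);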

-Jason

I was asked privately about patches more than a few times - most recently
from Brock here. I pushed my git repo upstream to
https://github.com/nevion/hdf5 so you can have the patches as I use them,
and in case I get lazy and forget to remind the HDF Group to review/accept
them again - I can lose more than a few weeks before starting this process
again. Then I get to lose a few hours trying to [re]integrate them on the
next release, which is work I'd like to stop doing.

I encourage someone from the HDF Group to try and incorporate the patches
from git rather than me submitting them, half ignored, through the help desk
without any commit messages... Almost all of these are nearing 2 years old
now, although we have whittled them down a little in the last 2-3 releases.
There shouldn't be anything controversial in these patches. I can't wait,
and hope we have better luck, if we ever see hdf5 under
https://github.com/HDFGroup with pull requests. In the meantime, if there's
anything I can do to help get them in, please let me know.

-Jason
