I need to implement a storage for data with the following properties:
1) multi-dimensional, unlimited-size dataset of variable-length records
2) may be highly sparse
3) usually randomly accessed one record at a time
4) each record may vary in size from tens of kilobytes to tens of megabytes
I am thinking of an unlimited chunked dataspace. However, to make it efficient in terms of both disk space and access time, I would need chunks as small as one element (a rough sketch of what I have in mind is below). Could you please save me a performance test and tell me whether such a configuration is practical with HDF5?
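For concreteness, here is roughly the layout I have in mind, sketched with h5py (the binding, file name, and element type are only placeholders for the question):

import h5py
import numpy as np

# Sketch only: a 2-D, unlimited, chunked dataset of variable-length
# records with one element per chunk. Names and dtype are placeholders.
vlen_f32 = h5py.vlen_dtype(np.float32)

with h5py.File("records.h5", "w") as f:
    dset = f.create_dataset(
        "records",
        shape=(1, 1),
        maxshape=(None, None),   # unlimited in both dimensions
        chunks=(1, 1),           # chunks as small as one element
        dtype=vlen_f32,
    )
    # one record; in practice tens of kilobytes to tens of megabytes
    dset[0, 0] = np.random.rand(10_000).astype(np.float32)
    record = dset[0, 0]          # random access, one record at a time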
Thanks,
Efim
Unfortunately, chunking + compression doesn't really help much with variable-length datatypes. A variable-length dataset stores an array of heap pointers, and the actual record data lives on the heap, so the bulk of the data doesn't participate in any compression.
On the other hand, your record size is large enough that you could set up your storage as a collection of scalar datasets. Since there is just one element per dataset, you can make the datatype be whatever the size of the record is and use a compression filter. So rather than accessing a row via an index into a dataset, you'd access a dataset via a link name (which could just be a stringified version of a numeric index).
If you go this route, use the "libver=latest" option when opening the file. Recent changes in the file format make accessing objects in a group with a very large number of links much more efficient.
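Roughly, in h5py terms (just a sketch: the naming scheme and gzip level are arbitrary, and I use a plain 1-D dataset per record rather than a scalar dataset with a record-sized type, but the layout is the same, one record per dataset, one link per record):

import h5py
import numpy as np

def record_name(i, j):
    # stringified version of the numeric index, used as the link name
    return "r%08d_%08d" % (i, j)

def write_record(f, i, j, data):
    # one dataset per record, so the compression filter applies to the
    # record itself rather than to an array of heap pointers
    f.create_dataset(record_name(i, j), data=data,
                     compression="gzip", compression_opts=4)

def read_record(f, i, j):
    name = record_name(i, j)
    return f[name][...] if name in f else None   # sparse: absent is fine

# libver="latest" switches groups to the newer, more scalable format
with h5py.File("records.h5", "w", libver="latest") as f:
    write_record(f, 3, 7, np.random.rand(50_000).astype(np.float32))
    rec = read_record(f, 3, 7)

With one link per record, the group lookup does the random access for you, and records that were never written simply have no dataset, which handles the sparsity for free.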