Reading a subset from a dataset with contiguous storage


#1

Hi. Someone told me that if I read a subset of a dataset whose storage layout is contiguous, the HDF5 API loads the entire dataset into memory, which is bad for performance. Is this the case?


#2

Yes, indeed, it loads it in one shot. The performance question is more complex: for small datasets this can actually be faster than partial I/O. You have to make this call based on measurements on your particular architecture, as the library is extensive not only feature-wise but also in the platforms it supports.

What OS and what programming language are you using?


#3

Thanks for the reply. I’m going to write a C program that reads large one-dimensional arrays of floating-point numbers stored as datasets in HDF5 files and calculates quantities like averages, standard deviations, etc. The storage layout is always contiguous.

The sizes of the datasets are not known in advance. I was thinking about the case where a dataset is so large that malloc() fails to allocate a buffer big enough to load it into RAM completely. So I thought that maybe the program should read from the file gradually as the calculations of averages, etc. proceed. But as you seem to suggest, the HDF5 API doesn’t support that kind of I/O for datasets with contiguous storage.

By the way, the program is planned to run under Linux.


#4


For very large datasets it is common practice to compute statistics on a random sample. This usually reduces memory requirements to a “trivial” amount, and makes it easy to avoid numerical problems with simple single-pass variance calculations. There are quite sophisticated sampling methods for dealing with “difficult” datasets, such as stratified sampling. See:

http://faculty.franklin.uga.edu/amandal/sites/faculty.franklin.uga.edu.amandal/files/Effective_Statistical_Methods_for_Big_Data_Analytics.pdf


#5

If C++ is an option for you, you may be interested in H5CPP, where small datasets are saved with a compact layout by default; for larger ones you can control the chunk size with the h5::chunk property. When the chunk property is not set, you get a contiguous layout. Zero copy, no overhead over the C calls.

In the examples folder you will find code for the most common use cases, including typed memory I/O.

A good design considers the available features of the platform it runs on and matches them with the library’s features. In my opinion, HDF5 is the most feature-rich, and often misunderstood, storage system. Sticking with one layout regardless of dataset size is suboptimal.

Best:

Steve


#6

You are right. The HDF5 files contain raw data. First of all, I have to read from them; then I can do sampling and some calculations.

The thing is that the HDF5 files are produced by a complex scientific simulation code, and at the moment I have no control over them (storage layout, etc.). I am supposed to just read them and do some post-processing.


#7

Hi Arham,

Hope you can solve the issue that brought you here.

Would you mind enumerating the post-processing functions you are thinking of executing? I am asking because we intend to implement post-processing functions in HDFql. Last year, we contacted some people to better understand which functions they use most to post-process HDF5 data. It would be great to get your feedback too. Thanks!

The idea that we have in mind is to enable users to do things like these in HDFql (as examples):

 SELECT FROM MIN(dataset0) ---> get minimum value stored in 'dataset0'

 SELECT FROM AVG(dataset1) ---> get average value of all the values stored in 'dataset1'

 SELECT FROM COUNT(dataset2, 12) ---> get number of times 12 is stored in 'dataset2'

To greatly boost performance, we are planning to have HDFql execute post-processing functions using all the cores available on the machine. Moreover, if running in a parallel environment (i.e. MPI), HDFql will spread the computational cost across all participating nodes (i.e. machines). You can think of it as a map-reduce operation provided by HDFql (and completely transparent from the user’s point of view).

Cheers!

PS: if other people also use functions to post-process HDF5 data, feel free to drop us a line or two about them here (or post in this thread if you find it more appropriate)! :slight_smile:


#8

Isn’t it more logical to have something like:

SELECT MIN(dataset) FROM HDF5-file-or-group

I doubt that using multiple cores helps for functions like MIN, as I/O will be the dominant factor. Only compute-intensive functions will benefit.

Cheers,

Ger


#9

Thanks for the suggestion @Ger_van_Diepen!

The post-processing MIN function was just a simple example (and, granted, it is probably not the best candidate for parallelism). It would be interesting to learn which functions you (or your organization) use to post-process HDF5 data, if that is applicable. Thanks.


#10

I work at a radio-astronomical observatory.
Somebody wanted to subtract the background from a 2-dimensional image by subtracting the median of a box around each pixel, which I implemented in casacore’s TaQL. Thus:

dataset - runningmedian(dataset, half_boxwidth)

This is quite a compute-expensive operation.

BTW, I did not find a way to take advantage of the previously computed running median. In the 1-dimensional case you can, but in higher dimensions I found that recalculating the median for each box was fastest.

Why is the operation in the FROM clause instead of the SELECT clause as SQL does?


#11

Thanks for describing a post-processing function that you had to implement at your organization. This feedback will help us implement post-processing functions in HDFql better!

Concerning the syntax of the HDFql SELECT operation: while we could have it as you suggest (SELECT dset FROM HDF5-file-or-group), this could be an issue for the HDFql INSERT operation, since it accepts input-redirecting options that use the keyword FROM. Example:

INSERT INTO dset VALUES FROM FILE input.txt

As you can see, if we had to implement a FROM HDF5-file-or-group syntax in the INSERT operation (so that it is consistent with SELECT), it could be confusing.