Large attribute or dataset: pros and cons

kerim.khemraev · August 11, 2020, 5:39pm

Hi,

I’m trying to make the right decision between:

- a column attribute holding from 100 to 1,000,000 elements (float type);
- a separate dataset.

Using attributes is more convenient, but my gut tells me that performance may suffer this way. And performance is much more important than convenience.

Maybe somebody could give me advice?

Regards,
Kerim

steven · August 11, 2020, 7:03pm

Hi Kerim,
At first it seemed like a bad idea; then again, according to this note, there is no difference between the two in terms of functionality. Yet I am not quite certain whether that statement holds across all versions and across the cross product of parallel and serial HDF5. According to this page, attributes must be in a collective call for PHDF5. So I wonder what others with more experience have to say on this matter.

To break down your question, I see the following:

most recent HDF5:
- any performance difference between attributes and datasets
- any functional difference considering the cross product of: serial, parallel, attribute, dataset
historical versions of HDF5:
- any difference in functionality from 1.6 to 1.10; you might be interested in this if your software may be linked against HDF5 libraries provided by a somewhat outdated OS distribution

A possible approach is to write a quick test case for both, measure the difference (if any), and possibly post your results here for review.

best: steve

kerim.khemraev · August 11, 2020, 9:32pm

Hi Steven,

I use the HighFive wrapper on Windows 10 x64 with MSVC x64, Qt, a Release build, and HDF5 1.12.0 (the C-language library).

First of all, I noticed that I can’t create an attribute with more than 16366 float numbers (65,464 bytes).
I also can’t create two attributes of 16366 float numbers each on one dataset.
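
A quick sanity check on these numbers, as a minimal sketch. A likely explanation (an assumption, not confirmed in this thread) is that with default file settings HDF5 stores an attribute inside the object header, where a single message has to stay below 64 KiB. The constant and helper names below are illustrative only and are not part of HDF5 or HighFive.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 64 KiB is the ceiling for an attribute kept in the object header
// with default file settings. The usable payload is somewhat smaller because
// the message also carries its own metadata.
constexpr std::size_t kHeaderMessageLimit = 64 * 1024;

// Raw payload size of an attribute holding `count` elements.
constexpr std::size_t attribute_payload_bytes(std::size_t count, std::size_t element_size) {
    return count * element_size;
}

// Bytes left under the assumed ceiling (0 if the payload already exceeds it).
constexpr std::size_t headroom_bytes(std::size_t payload) {
    return payload < kHeaderMessageLimit ? kHeaderMessageLimit - payload : 0;
}

// 16366 four-byte floats give the 65,464 bytes reported above.
static_assert(attribute_payload_bytes(16366, 4) == 65464, "payload matches the reported size");
```

Under that assumption, `headroom_bytes(65464)` leaves only 72 bytes for the attribute's own metadata, which would be consistent with the cutoff observed at 16366 floats. Storing larger attributes is usually described as requiring dense attribute storage, which depends on the file's format-version bounds (`H5Pset_libver_bounds` in the C API); that path is not exercised here.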

Then, in my experiment, I create a dataset and write data (16366 float numbers) to it in a loop, and measure the average time over 100 iterations. It takes 0.039 milliseconds.

Then, for each of these datasets, I create and write an attribute (16366 float numbers) in a loop (also 100 iterations). The average time is 0.255 milliseconds.

But here is one important thing that I can’t understand.
On average, my code takes less time per dataset (attribute) when creating 100 of them than when creating only one.
Here is my code:

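#include <iostream>
#include <string>

#include <QString>
#include <QList>

#include <H5File.hpp>
#include <H5Group.hpp>
#include <H5DataSet.hpp>
#include <H5DataSpace.hpp>
#include <H5Attribute.hpp>

#include <armadillo>
using namespace arma;

using namespace HighFive;

int main(void)
{
    File file("names.h5", File::ReadWrite | File::Create | File::Truncate);

    Group group = file.createGroup("group");

    wall_clock timer;  // arma::wall_clock; toc() returns elapsed seconds

    size_t N = 16366;  // largest attribute size (in floats) that could be created
    // NOTE: vec holds doubles; see the cast in the attribute loop below
    vec data(N, fill::randu);

    int I = 1;  // number of iterations (1 or 100 in the experiments)
    vec datasetTime(I);
    QList<DataSet> datasetList;
    for (int i = 0; i < I; i++){
        // times dataset creation only; no data is written to the dataset here
        timer.tic();
        DataSet dataset = group.createDataSet<float>(std::to_string(i), DataSpace({N}));
        datasetTime(i) = timer.toc();
        datasetList.push_back(dataset);
    }

    vec attrTime(I);
    for (int i = 0; i < I; i++){
        // times attribute creation plus the write of N floats
        timer.tic();
        // NOTE: the cast reinterprets the double buffer as float instead of converting the values
        datasetList[i].createAttribute<float>("Attr", DataSpace({N})).write((float*)data.memptr());
        attrTime(i) = timer.toc();
    }

    // mean time per iteration
    std::cout << mean(datasetTime) << std::endl;
    std::cout << mean(attrTime) << std::endl;

    file.flush();
}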

To create 100 such datasets and write data it takes 0.039 milliseconds
To create 1 dataset and write data it takes 0.173 milliseconds

To create 100 such attributes and write data it takes 0.255 milliseconds
To create 1 attribute and write data it takes 0.158 milliseconds

I don’t understand why this happens
Here is the screenshot of HdfViewer:

steven · August 12, 2020, 4:12am

Interesting observation… I made some modifications to your example (replaced the Qt container with std::vector), then recompiled it on Linux using HighFive and then H5CPP.
I got similar results… Attachment: high5.cpp (1.1 KB)

best: steve

P.S.: I changed arma::vec to arma::fvec so that the attribute is 32-bit float.

steven · August 12, 2020, 1:00pm

The reworked experiment is uploaded to this github page, where I keep interesting forum topics together.

The first section contains the dataset and attribute creation experiment, run only once on a Lenovo X250 (Linux 18.04); the posted time values are in microseconds.

num	type	h5cpp	highfive
1	data	227.9	119.39
1	attr	87.7	98.14

A single run, with 100 iterations:

num	type	h5cpp	highfive
100	data	54.91	22.71
100	attr	92.07	79.61

Please take it with a grain of salt: at this microsecond granularity, a single run of the experiment is not convincing enough. If you are interested in taking this further, you might want to add a batch file (not quite sure what it is called in the Windows world) to execute the tests many times and produce a histogram (or even normalise it to a probability distribution).
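
The repeated-run idea can also be sketched inside the program rather than in a batch file. This is only an illustration under stated assumptions: `time_runs`, `median`, and `histogram` are hypothetical helper names, and `work` stands in for whichever dataset or attribute operation is being measured.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Run `work` `runs` times and return each run's wall time in microseconds.
std::vector<double> time_runs(const std::function<void()>& work, std::size_t runs) {
    using clock = std::chrono::steady_clock;
    std::vector<double> samples;
    samples.reserve(runs);
    for (std::size_t i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();
        const auto stop = clock::now();
        samples.push_back(std::chrono::duration<double, std::micro>(stop - start).count());
    }
    return samples;
}

// Median of the samples; takes a copy so the caller's ordering is preserved.
double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    return samples.size() % 2 ? samples[mid] : 0.5 * (samples[mid - 1] + samples[mid]);
}

// Equal-width histogram over [lo, hi); out-of-range values are clamped
// into the first or last bin so that outliers stay visible.
std::vector<std::size_t> histogram(const std::vector<double>& samples,
                                   double lo, double hi, std::size_t bins) {
    assert(bins > 0 && hi > lo);
    std::vector<std::size_t> counts(bins, 0);
    const double width = (hi - lo) / static_cast<double>(bins);
    for (double s : samples) {
        double pos = (s - lo) / width;
        pos = std::max(0.0, std::min(pos, static_cast<double>(bins - 1)));
        counts[static_cast<std::size_t>(pos)]++;
    }
    return counts;
}
```

A call such as `histogram(time_runs(work, 1000), 0.0, 500.0, 50)` would bin 1000 timings into 10-microsecond buckets. Looking at the median and the shape of the distribution, rather than a single mean, makes a slow first iteration or an occasional outlier easier to spot.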

best: steve
Note: I also reduced the attribute size to 8000; for some reason it failed with higher numbers on my system.

gheber · August 12, 2020, 8:21pm

There is definitely a difference in functionality. There is no partial I/O for attribute values, i.e., in the 1M float case, you’ll have to read/write four or eight MB, even if you care only about element 15. With a dataset, you’d just read/write what you need. G.
