Large attribute or dataset: pros and cons


#1

Hi,

I’m trying to make the right decision between two options:

  1. use a column attribute holding from 100 to 1,000,000 elements (float type);
  2. use a separate dataset

Using attributes is the more convenient way, but my gut tells me that performance may suffer. And performance is much more important than convenience :slight_smile:

Maybe somebody could give me advice?

Regards,
Kerim


#2

Hi Kerim,
At first it seemed like a bad idea, but according to this note there is no difference between the two in terms of functionality. I am not certain, though, that this holds across all versions and for both parallel and serial HDF5. According to this page, attribute operations must be made in collective calls for parallel HDF5 (phdf5). I wonder what others with more experience have to say on this matter.

To break down your question, I see the following:

  • most recent HDF5
    • any performance difference between attributes and datasets
    • any functional difference considering the cross product of: serial, parallel, attribute, dataset
  • historical versions of HDF5
    • any difference in functionality across versions 1.6 to 1.10; this matters when your software may be linked against HDF5 libraries provided by a somewhat outdated OS distribution

A possible approach is to write a quick test case for both, measure the difference (if any), and post your results here for review.

best: steve


#3

Hi Steven,

I use the HighFive wrapper, Windows 10 x64, MSVC x64, Qt, Release build, HDF5 1.12.0, C-language library.

First of all, I noticed that I can’t create an attribute with more than 16366 float numbers (65 464 bytes).
I also can’t create two attributes of 16366 float numbers each on one dataset.

Then, in my experiment, I create a dataset and write data (16366 float numbers) to it in a loop, and measure the average time over 100 iterations. It takes 0.039 milliseconds.

Then, for each of these datasets, I create and write an attribute (16366 float numbers) in a loop (also 100 iterations). The average time is 0.255 milliseconds.

But there is one important thing that I can’t understand.
On average, my code takes less time per dataset when creating 100 datasets than when creating only one (for attributes it is the other way round).
Here is my code:

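#include <iostream>
#include <string>

#include <QString>
#include <QList>

#include <H5File.hpp>
#include <H5Group.hpp>
#include <H5DataSet.hpp>
#include <H5DataSpace.hpp>
#include <H5Attribute.hpp>

#include <armadillo>
using namespace arma;

using namespace HighFive;

int main(void)
{
    // Create (or truncate) the output file and a group to hold the datasets.
    File file("names.h5", File::ReadWrite | File::Create | File::Truncate);

    Group group = file.createGroup("group");

    wall_clock timer;  // arma::wall_clock; toc() returns elapsed seconds

    size_t N = 16366;  // largest attribute size that could be created
    // NOTE: arma::vec holds doubles, so the float* cast further down
    // reinterprets the bytes instead of converting the values
    // (post #4 switches to arma::fvec).
    vec data(N, fill::randu);

    int I = 1;  // number of iterations; set to 100 for the 100-loop runs
    vec datasetTime(I);
    QList<DataSet> datasetList;
    // Time dataset creation (no data is written to the dataset here).
    for (int i = 0; i < I; i++){
        timer.tic();
        DataSet dataset = group.createDataSet<float>(std::to_string(i), DataSpace({N}));
        datasetTime(i) = timer.toc();
        datasetList.push_back(dataset);
    }

    // Time attribute creation plus the write of N values, one attribute per dataset.
    vec attrTime(I);
    for (int i = 0; i < I; i++){
        timer.tic();
        datasetList[i].createAttribute<float>("Attr", DataSpace({N})).write((float*)data.memptr());
        attrTime(i) = timer.toc();
    }

    // Print the mean time per dataset and per attribute.
    std::cout << mean(datasetTime) << std::endl;
    std::cout << mean(attrTime) << std::endl;

    file.flush();
}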

Creating 100 such datasets and writing data takes 0.039 milliseconds on average
Creating 1 dataset and writing data takes 0.173 milliseconds

Creating 100 such attributes and writing data takes 0.255 milliseconds on average
Creating 1 attribute and writing data takes 0.158 milliseconds

I don’t understand why this happens :smile:
Here is the screenshot of HdfViewer:


#4

Interesting observation… I made some modifications to your example (replaced the Qt container with std::vector), then recompiled it on Linux using HighFive and then H5CPP.
I got similar results… (attachment: high5.cpp, 1.1 KB)

best: steve

P.S.: I changed arma::vec to arma::fvec so the attribute data is 32-bit float.


#5

The reworked experiment is uploaded to this GitHub page, where I keep interesting forum topics together.

The first section contains the dataset and attribute creation experiment, run only once on a Lenovo X250 with Linux 18.04; the posted time values are in microseconds.

num   type   h5cpp    highfive
1     data   227.9    119.39
1     attr   87.7     98.14

A single run, with 100 iterations:

num   type   h5cpp    highfive
100   data   54.91    22.71
100   attr   92.07    79.61

Please take this with a grain of salt: at microsecond granularity, a single run of the experiment is not convincing enough. If you are interested in taking this further, you might want to add a batch file (not quite sure what it is called in the Windows world) to execute the tests many times and produce a histogram (or even normalise it to a probability distribution).

best: steve
Note: I also reduced the attribute size to 8000; for some reason it failed with higher numbers on my system.


#6

There is definitely a difference in functionality. There is no partial I/O for attribute values, i.e., in the 1M float case, you’ll have to read/write the whole four or eight MB (for 4-byte or 8-byte elements), even if you care only about element 15. With a dataset, you’d just read/write what you need. G.