Performance issues, any suggestions on what I can do?

I've got to store anything from ten to possibly millions of time series. At the moment, I create a simple dataspace and store each time series, which all works. But once I get to a few thousand time series, performance drops off so badly that HDF5 is no longer an option for me.

Can anyone suggest anything I can try? The function that writes the data to the HDF5 file is below; it's pretty simple.

Any suggestions at all are more than welcome.

All the best,

Tony.

int AddTimeSeriesToHDF5File(int file_id, charutf8* pStrVarName, double *pVars, int nVals, BOOL bCompress)
{
    hsize_t dims[1];
    hid_t dataspace_id;
    hid_t dcpl, aid2;
    hid_t dataset_id;
    hid_t attr2;
    herr_t status;
    hid_t plist_id;   // dataset creation property list (compression)
    hsize_t cdims[1];

    // NOTE: compression is forced on here for testing, so the bCompress argument is effectively ignored.
    //bCompress = FALSE;
    bCompress = TRUE;

    dims[0] = nVals;
    dataspace_id = H5Screate_simple(1, dims, NULL);

    if (bCompress)
    {
        plist_id = H5Pcreate(H5P_DATASET_CREATE);
        cdims[0] = nVals;   // a single chunk holding the whole series
        status = H5Pset_chunk(plist_id, 1, cdims);
        status = H5Pset_deflate(plist_id, 9);
    }
    else
    {
        plist_id = H5P_DEFAULT;
    }

    //Compact dataset test start
    //dcpl = H5Pcreate(H5P_DATASET_CREATE);
    //status = H5Pset_layout(dcpl, H5D_COMPACT);
    //Compact dataset test end

    // Create the dataset for this time series.
    dataset_id = H5Dcreate2(file_id, pStrVarName, H5T_NATIVE_DOUBLE, dataspace_id, H5P_DEFAULT, plist_id, H5P_DEFAULT);

    // Attach a scalar attribute recording the number of points.
    aid2 = H5Screate(H5S_SCALAR);
    attr2 = H5Acreate2(dataset_id, "Number of points", H5T_NATIVE_INT, aid2, H5P_DEFAULT, H5P_DEFAULT);
    status = H5Awrite(attr2, H5T_NATIVE_INT, &nVals);

    // Write the data and close the handles.
    status = H5Dwrite(dataset_id, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, pVars);
    status = H5Sclose(dataspace_id);
    status = H5Dclose(dataset_id);
    status = H5Pclose(plist_id);

    return status;
}


If I read correctly, you do not close the attribute attr2 or its dataspace aid2.
This may take up memory and decrease performance.
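Something along these lines right after the H5Awrite call should release them (untested, but it matches the close calls you already make for the other handles):

    status = H5Aclose(attr2);   /* release the attribute          */
    status = H5Sclose(aid2);    /* ...and its scalar dataspace    */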

Regards,

Pierre


--
-----------------------------------------------------------
Pierre de Buyl
KU Leuven - Polymer Chemistry and Materials
T +32 16 327355
W http://pdebuyl.be/
-----------------------------------------------------------

So each time series is a dataset in the file's root group, meaning you end up with a million datasets in the same group?

Does performance also slow down when you use the HDF5 tools, such as h5ls or h5dump, on such a file?

It might make sense to rearrange your datasets hierarchically so that you have only, say, 1000 datasets per group: create 1000 groups, each covering a range of time series, and you still get a million datasets but only 1000 per group.
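For example, something along these lines (an untested sketch; the "/bucket_%04d" naming scheme and the helper function are just illustrations, not anything from your code):

/* Pick a group for a series based on its index so that
 * no group holds more than ~1000 datasets. */
#include <stdio.h>
#include <hdf5.h>

hid_t OpenBucketGroup(hid_t file_id, int series_index)
{
    char name[32];
    snprintf(name, sizeof(name), "/bucket_%04d", series_index / 1000);

    /* Create the group on first use, reopen it afterwards. */
    if (H5Lexists(file_id, name, H5P_DEFAULT) > 0)
        return H5Gopen2(file_id, name, H5P_DEFAULT);
    return H5Gcreate2(file_id, name, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
}

The returned group id would then be passed to H5Dcreate2 in place of file_id, and closed with H5Gclose once the dataset has been written.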

If all time series are of the same length, it might also be an option to put them all into the same dataset but leave one dimension open and extend it as new time series come in, so each time series is just one chunk of such a 2-dimensional dataset.

Cheers,
            Werner


--
___________________________________________________________________________
Dr. Werner Benger, Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 / Fax.: +1 225 578 5362

Pierre ("If I read correctly, you do not close the attribute attr2 or its dataspace aid2. This may take up memory and decrease performance."): thank you, I had missed those. Unfortunately, closing them made no noticeable difference to the performance.

>> Werner Benger: "If all time series are of the same length, it might also be an option to put
>> them all into the same dataset but leave one dimension open and extend that one as
>> new datasets come in, so each time series is just one chunk of such a 2-dimensional dataset."
I like the sound of this (the majority of the data is the same length). I haven't noticed a way yet of adding columns to a matrix; is this straightforward? If it is, I can store the column names and numbers in a separate dataset and the actual values as one large 2-D matrix.

Pretty much, check for H5S_UNLIMITED in the dataspace description:

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5S.html

The dataset needs to be chunked, but it can be extended beyond its initial size if that dimension's maximum is set to H5S_UNLIMITED rather than a constant value at creation.
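Roughly like this (an untested sketch; the dataset name "all_series" and the helper functions are placeholders for illustration, each series of length nVals becomes one row/chunk):

/* Create one chunked 2-D dataset with an unlimited row count. */
#include <hdf5.h>

hid_t CreateSeriesMatrix(hid_t file_id, hsize_t nVals)
{
    hsize_t dims[2]    = { 0, nVals };              /* start with no rows        */
    hsize_t maxdims[2] = { H5S_UNLIMITED, nVals };  /* rows may grow without limit */
    hsize_t chunk[2]   = { 1, nVals };              /* one chunk per time series */

    hid_t space = H5Screate_simple(2, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);                   /* required for extensible datasets */
    H5Pset_deflate(dcpl, 9);                        /* optional compression      */

    hid_t dset = H5Dcreate2(file_id, "all_series", H5T_NATIVE_DOUBLE,
                            space, H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}

/* Append one time series as row number 'row'. */
herr_t AppendSeries(hid_t dset, const double *pVars, hsize_t nVals, hsize_t row)
{
    hsize_t newdims[2] = { row + 1, nVals };
    H5Dset_extent(dset, newdims);                   /* grow the dataset by one row */

    hid_t   filespace = H5Dget_space(dset);
    hsize_t start[2]  = { row, 0 };
    hsize_t count[2]  = { 1, nVals };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t  memspace = H5Screate_simple(2, count, NULL);
    herr_t status   = H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace,
                               H5P_DEFAULT, pVars);
    H5Sclose(memspace);
    H5Sclose(filespace);
    return status;
}

The series names and their row numbers could then live in a small separate dataset or in attributes, as you suggested.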

     Werner


--
___________________________________________________________________________
Dr. Werner Benger, Visualization Research
Center for Computation & Technology at Louisiana State University (CCT/LSU)
2019 Digital Media Center, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809 / Fax.: +1 225 578 5362