Efficient Way to Write Compound Data

*Version:* 1.8.1 but using -D H5_USE_16_API as I couldn't find examples that
are compatible with 1.8
*Hardware:* 3.2GHz Xeon with 1GB RAM
*OS:* Linux 2.6.24-19 (64-bit)
*Compiler:* gcc (compiled using h5cc script)

I am just starting out with HDF5 and I would like to know the most efficient
way to write a large number of rows of compound data. I combined the examples
for a compound dataset and an extendible dataset to create a program (given
below) that writes one row of data at a time. Its performance is very poor:
it takes about 3 real seconds to process 100K records, and when I try to run
the same program on 1MM rows, it brings down my machine. I also tried the
example based on the Table high-level API; it takes slightly more than a
minute (90 seconds) to write 1MM rows, but at least it succeeds. I also
created a similar program using PyTables, and that one finishes in 1.8
seconds. I looked through the Table API code for PyTables and it appears to
be somewhat similar to the H5TB code, but PyTables' comments say it is a
stripped-down version, so I am not sure why its performance is so much
better. Could someone suggest the best way to construct my program (provided
below) so that it performs at least slightly better than PyTables? I read
about chunking and its impact on I/O performance but could not figure out
what I need to change - I did play with chunk_dims, but with no impact.

My dataset has the following properties:
1. It has a fixed structure (number of fields is fixed)
2. The number of rows is unknown, and rows will be read one at a time.
3. There is no requirement to read/write more than one row at any given time.

Thanks very much in advance for your help!

-SK

*Code:*
#include "hdf5.h"

#define FILE "SDScompound.h5"
#define DATASETNAME "ArrayOfStructures"
#define LENGTH 1
#define RANK 1
#define ITER 100000

int main(void) {

    /* First structure and dataset*/
    typedef struct s1_t {
      int a;
      float b;
      double c;
    } s1_t;

    s1_t s1[LENGTH];
    hid_t s1_tid; /* File datatype identifier */

    int i;
    hid_t file, dataset, space, filespace, cparms; /* Handles */
    herr_t status;
    hsize_t dim[] = { LENGTH }; /* Dataspace dimensions */
    hsize_t offset[LENGTH], size[LENGTH]; /* Dataspace dimensions */
    hsize_t maxdims[] = { H5S_UNLIMITED };
    hsize_t chunk_dims[] = { 5 }; // what is the best number if I am reading/writing one row at a time?

    /*
     * Initialize the data
     */
    s1[i].a = 10;
    s1[i].b = 15;
    s1[i].c = 99.99;

    /*
     * Create the data space.
     */
    space = H5Screate_simple (RANK, dim, maxdims);

    /*
     * Create the file.
     */
    file = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Modify dataset creation properties, i.e. enable chunking */
    cparms = H5Pcreate (H5P_DATASET_CREATE);
    status = H5Pset_chunk ( cparms, RANK, chunk_dims);

    /*
     * Create the memory data type.
     */
    s1_tid = H5Tcreate (H5T_COMPOUND, sizeof(s1_t));
    H5Tinsert(s1_tid, "a_name", HOFFSET(s1_t, a), H5T_NATIVE_INT);
    H5Tinsert(s1_tid, "c_name", HOFFSET(s1_t, c), H5T_NATIVE_DOUBLE);
    H5Tinsert(s1_tid, "b_name", HOFFSET(s1_t, b), H5T_NATIVE_FLOAT);

    /*
     * Create the dataset.
     */
    dataset = H5Dcreate(file, DATASETNAME, s1_tid, space, cparms);

    /* Extend the dataset to the orig dimension */
    size[0] = dim[0];
    status = H5Dextend (dataset, size);

    /* Select a hyperslab */
    filespace = H5Dget_space (dataset);
    offset[0] = 0;
    status = H5Sselect_hyperslab (filespace, H5S_SELECT_SET, offset, NULL, dim, NULL);

    /* Write the data to the hyperslab */
    status = H5Dwrite (dataset, s1_tid, space, filespace, H5P_DEFAULT, s1);

    for (i = 0; i < ITER; ++i) {

        /* Extend the dataset. Add one more row */
        ++size[0]; // increase the row size by 1
        status = H5Dextend (dataset, size);

        /* Select a hyperslab */
        filespace = H5Dget_space (dataset);
        offset[0] = size[0] - 1; // offset starts at 0
        status = H5Sselect_hyperslab (filespace, H5S_SELECT_SET, offset, NULL, dim, NULL);

        space = H5Screate_simple (RANK, dim, NULL);
        status = H5Dwrite (dataset, s1_tid, space, filespace, H5P_DEFAULT, s1);
        // status = H5Fflush(file, H5F_SCOPE_GLOBAL); // program still brings down the system w/ or w/o flush
    }

    /*
     * Release resources
     */
    H5Tclose(s1_tid);
    H5Sclose(space);
    H5Dclose(dataset);
    H5Fclose(file);

    return 0;
}

Hello,

There are several factors that could degrade performance. The chunk size was too small. We also recommend closing the dataspace handle inside the loop to reduce memory usage.

I modified your program to use bigger chunks, to initialize the data correctly (the index i was not defined), and to use the 1.8 H5Dcreate call. I also moved H5Sclose inside the loop.

Please see if performance is better now (you do not need the -DH5_USE_16_API flag to build the program).
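
In case the attached file does not come through, the main changes look roughly like the sketch below. It is not the exact contents of compound-example.c; the chunk size of 1000 is just an illustration of "bigger chunks":

    /* a much larger chunk size (tune to taste) */
    hsize_t chunk_dims[] = { 1000 };

    /* initialize the row without the undefined index i */
    s1[0].a = 10;
    s1[0].b = 15;
    s1[0].c = 99.99;

    /* 1.8-style H5Dcreate: the link-creation, dataset-creation and
       dataset-access property lists are passed explicitly */
    dataset = H5Dcreate(file, DATASETNAME, s1_tid, space,
                        H5P_DEFAULT, cparms, H5P_DEFAULT);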

Elena

compound-example.c (2.9 KB)

···



You will never reach decent performance by writing a row at a time.
If PyTables reaches much higher performance than your test program, it
is because it implements buffered I/O on top of HDF5, i.e. the writes
(and reads) are made against a memory buffer, and when this buffer is
full, it is sent to disk via HDF5.

You may want to inspect the PyTables code more closely in order to see
how this is done, and how the sizes of the I/O buffers and HDF5 chunks
are chosen in order to reach maximum performance (provided that you
know the approximate number of rows that will end up going into your
table).

Hope that helps,

···

--
Francesc Alted
Freelance developer
Tel +34-964-282-249


Thanks very much Elena and Francesc!

Elena, your code did improve performance by a factor of 3 for the 100K test!
However, it still fails for 1MM rows (it still locks up the system). After
Francesc mentioned the memory buffer, I read a bit more about how the file is
structured, and I think I understand why it fails for a large number of rows
when I write one row at a time -- my guess is that it is mainly the B-tree
memory requirements.

Francesc, I have started looking at the PyTables code a little more. I think
it will give me more insight into how to read and write H5 files more
efficiently.

Thanks again,
SK

···


On Tuesday 26 August 2008, SK wrote:

Thanks very much Elena and Francesc!

Elena, your code did improve performance by a factor of 3 for the 100K
test! However, it still fails for 1MM rows (it still locks up the system).
After Francesc mentioned the memory buffer, I read a bit more about how
the file is structured, and I think I understand why it fails for a large
number of rows when I write one row at a time -- my guess is that it is
mainly the B-tree memory requirements.

I don't think so. For relatively small tables like 1 Mrow and
reasonable chunk sizes, the B-tree shouldn't take too much memory. By
looking at the example posted by Elena, I think she missed closing the
``filespace`` handle, so the program is basically developing a leak.
Try adding:

H5Sclose(filespace);

at the end of your loop and the leak should go away.
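
For clarity, the loop would then look something like this (just a sketch of
the loop from the program above; I also reuse the one-row memory dataspace
created before the loop instead of creating a new one every iteration):

    for (i = 0; i < ITER; ++i) {
        ++size[0];
        status = H5Dextend (dataset, size);

        filespace = H5Dget_space (dataset);
        offset[0] = size[0] - 1;
        status = H5Sselect_hyperslab (filespace, H5S_SELECT_SET, offset, NULL, dim, NULL);

        /* reuse the memory dataspace `space` created before the loop */
        status = H5Dwrite (dataset, s1_tid, space, filespace, H5P_DEFAULT, s1);

        H5Sclose (filespace);  /* close the file dataspace -- this was the leak */
    }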

Francesc, I have started looking at the PyTables code a little more. I
think it will give me more insight into how to read and write H5 files
more efficiently.

The basic idea is that, if you want to achieve the best performance for
sequential access, you should make your algorithms read and write data
in blocks, not single elements. Of course, if you also want decent
performance for random access to rows, the blocks should not be too
large. By experimenting a bit with buffer and chunk sizes, you should
be able to find the sizes that best suit your needs.

Good luck!

···

--
Francesc Alted
Freelance developer
Tel +34-964-282-249


Repeat after me...
"HDF5 is not a database..."

In all seriousness, I would personally not like the HDF developers getting distracted trying to recreate SQL-like database capabilities, or make HDF a general persistent store. If you need to store millions of rows in tables, perhaps a real SQL database would be better; and if you need to maintain complex relationships between large numbers of objects, with multithreaded access and object-level locking, then an OODB would be the tool of choice.

As a newbie to HDF, I find it complicated as it is, with a large and hard-to-grasp API - in my opinion there are too many combinations of obscure API calls to do the same thing, and lots of "knobs to twiddle".

Andrew

···


Andrew,

for the sake of it, I agree with you. HDF5 is a good performer when it comes
to a few large transactions. For many small transactions, a real database
system is preferable. The only thing I see as useful for the HDF5 people to
elaborate on (with respect to database systems) is somewhat better support
for ACID properties.

I would disagree, though, that the API is large and hard to grasp. What is
important is to understand the basic concepts behind the HDF5 format, and
then the API really makes sense. You will soon find yourself asking for more.
If you're looking for some quick action, though, you can always use the
high-level API. Parts of it are really useful.

HTH

-- dimitris

···


I would consider myself probably representative of a typical math/sci/eng programmer starting with HDF - I have written 4-5 non-trivial HDF programs to test its use with our data, and I did find the API "large and hard to grasp" and somewhat confusing. Nor did I find the HL API that helpful, as it only goes so far and then you are trawling the full API to do what you need.
This is just my experience. Either the API is hard to grasp, or I am an atypical case and should try something other than programming for a living (:

···


However, it fails for 1MM (still locks up the system).

     Here's a test program that I've used for playing with HDF5 compound
data. It basically writes out data in blocks, where a block consists of
a fixed number of rows. Each row consists of a struct:

  typedef struct {
      int var1;
      char *str;
      double d1;
      int var2;
      int var4;
      long long var3;
      double d2;
  } ROWDATA;

Note, however, that the presence of the string will distort the file
size (strings are slightly inefficient, and are not compressible). You
also want to arrange the struct data for optimal data packing; if you
want to see what I mean, try moving one of the doubles up or down, and
then look at the size of the resulting HDF5 file. (Yeah, I think I left an
HDF5 hole in the above struct.)

     See the #define's for the parameters that you can twiddle. You can
see how performance suffers as you shrink the value of ROWS_PER_BLOCK.

     As given, the test program writes out a total of around 10 million
rows. On my 64-bit Linux box (older 3GHz Xeons), it takes around 13
seconds to write the file (~92MB in size) onto a local disk. The
process size is generally sane, and typically does not consume all RAM
and swap.

h5_test.c (4.32 KB)
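
     (In case the attachment doesn't come through, here's a minimal sketch of
the same blocked-write idea. It is *not* the attached h5_test.c: the struct is
simplified (no string member), and ROWS_PER_BLOCK/NUM_BLOCKS are just
illustrative values to tune.)

#include "hdf5.h"

#define BLK_FILE       "blocked.h5"
#define BLK_DSET       "rows"
#define ROWS_PER_BLOCK 10000   /* rows buffered in memory per write */
#define NUM_BLOCKS     100     /* 1 million rows total              */

typedef struct {
    int    a;
    float  b;
    double c;
} row_t;

int main(void) {
    static row_t buf[ROWS_PER_BLOCK];
    hsize_t dims[1]    = { 0 };
    hsize_t maxdims[1] = { H5S_UNLIMITED };
    hsize_t chunk[1]   = { ROWS_PER_BLOCK };
    hsize_t start[1];
    hsize_t count[1]   = { ROWS_PER_BLOCK };
    hid_t   file, dset, tid, fspace, mspace, dcpl;
    int     blk, i;

    /* compound type matching row_t */
    tid = H5Tcreate(H5T_COMPOUND, sizeof(row_t));
    H5Tinsert(tid, "a", HOFFSET(row_t, a), H5T_NATIVE_INT);
    H5Tinsert(tid, "b", HOFFSET(row_t, b), H5T_NATIVE_FLOAT);
    H5Tinsert(tid, "c", HOFFSET(row_t, c), H5T_NATIVE_DOUBLE);

    file = H5Fcreate(BLK_FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* extendible dataset, chunked by one block of rows */
    fspace = H5Screate_simple(1, dims, maxdims);
    dcpl   = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    dset   = H5Dcreate(file, BLK_DSET, tid, fspace,
                       H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Sclose(fspace);

    /* memory dataspace describing one block of rows */
    mspace = H5Screate_simple(1, count, NULL);

    for (blk = 0; blk < NUM_BLOCKS; blk++) {
        /* fill the in-memory buffer (stand-in for real row production) */
        for (i = 0; i < ROWS_PER_BLOCK; i++) {
            buf[i].a = blk;
            buf[i].b = (float)i;
            buf[i].c = 99.99;
        }

        /* grow the dataset by one block and write the buffer in one call */
        start[0] = dims[0];
        dims[0] += ROWS_PER_BLOCK;
        H5Dset_extent(dset, dims);

        fspace = H5Dget_space(dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Dwrite(dset, tid, mspace, fspace, H5P_DEFAULT, buf);
        H5Sclose(fspace);               /* don't leak the file dataspace */
    }

    H5Sclose(mspace);
    H5Pclose(dcpl);
    H5Tclose(tid);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}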

···

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.

If you do a pack on the structure once you've defined the compound datatype,
you might be able to get rid of the structure-size variability (which is due
to how the compiler byte-aligns the fields).

    CompType *pdt = DD::DefineDataType(); // my data types self-define themselves
    pdt->pack();

    DataSpace *pds = new H5::DataSpace( H5S_SIMPLE );
    hsize_t curSize = 0;
    hsize_t maxSize = H5S_UNLIMITED;
    pds->setExtentSimple( 1, &curSize, &maxSize );

    DSetCreatPropList pl;
    hsize_t sizeChunk = CHDF5DataManager::H5ChunkSize(); // chunk size is a constant defined elsewhere
    pl.setChunk( 1, &sizeChunk );
    pl.setShuffle();
    pl.setDeflate(5); // has the documentation been updated with what the various levels mean?

    dataset = new H5::DataSet( dm.GetH5File()->createDataSet( sPathName, *pdt, *pds, pl ) );
    dataset->close();
    pds->close();
    pdt->close();
    delete pds;
    delete pdt;
    delete dataset;
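
For anyone doing this through the C API rather than the C++ wrappers, the
equivalent of the pack() step is H5Tpack(). A minimal sketch, reusing the
s1_tid compound type from the first example in this thread:

    /* Keep the memory type (built with HOFFSET offsets) for H5Dwrite, and
       use a packed copy as the file type so no compiler padding reaches disk. */
    hid_t file_tid = H5Tcopy(s1_tid);
    H5Tpack(file_tid);
    /* pass file_tid to H5Dcreate; keep s1_tid as the memory type in H5Dwrite */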

···


Thanks Darryl/Ray. The code that writes using the blocked approach is very
efficient - I need to play with it a bit more to see what configuration
produces the best result. I compared it against just writing the struct out
as a plain binary file, and that code is twice as fast, but that's
understandable. Given that H5 allows me to store a lot of sophisticated
metadata, I think the performance I am seeing with the new approach is well
worth it! I am sure that as I learn more about H5, I can probably improve
this further.

Darryl, if you don't mind, I would recommend that the HDF Group publish your
example as a great way to write compound data. I think it can complement the
examples they already have on the Table API. It will be helpful for people
who don't mind using the lower-level APIs to get very good performance.

Thanks very much!
-SK

···


Unless I'm missing something (which is certainly possible), string
data isn't compressed. You can see this by turning on compression and
then running strings(1) upon the HDF5 file. It's even more obvious if
you use the split file driver and examine the metadata file; the data
file is compressed, but the metadata file is not compressed (strings are
stored in the metadata file).

···

Francesc Alted <faltet@pytables.com> wrote:

You may be surprised at the number of people who are using HDF5 to store
string data too. Its capability for transparently compressing this type of
data (all data types, in general) is much appreciated out there.

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


Variable-length string data is not compressed. There is a small
structure in the dataset itself which points to the variable-length
data, stored elsewhere in the file. Those structures themselves will be
compressed if you apply the filter, but the strings are stored as-is.

As far as I know, the only way to compress string data is to store
fixed-length strings. You can pad strings to a given maximum size, or
concatenate many strings into a single large string.
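
A minimal sketch of the fixed-length route (the 32-byte size is arbitrary):

    /* Fixed-length string member: pad/truncate to 32 bytes so it can be
       compressed along with the rest of the chunk. */
    hid_t str_tid = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_tid, 32);
    H5Tset_strpad(str_tid, H5T_STR_NULLTERM);
    /* then insert str_tid into the compound type, with the corresponding
       struct member declared as char str[32] rather than char * */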

George Lewandowski
(314)777-7890
Mail Code S270-2204
Building 270-E Level 2E Room 20E
P-8A

···



  Yes, this is all true (variable vs. fixed-length string compression, etc.) and good advice - thanks! I also wanted to mention that it's a fixable problem that we may be able to move forward on for the 1.10.0 release, with some funding. Generally speaking, I'd like to change the way that variable-sized information for dataset elements is stored and put that information in a container that is private to the dataset (or possibly even to each chunk of a dataset) and also allow it to be compressed when the chunks for the dataset are compressed. Hopefully, this will be a high enough priority for some funding provider to support, or else someone in the community could provide a patch that tackles it.

  Quincey

···


I was thinking about adding a new compression filter.

Could you give me some direction on how to plan and execute this?

thanks,

Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA

···


Hi Matt,

I was thinking about adding a new compression filter.

Could you give me some direction on how to plan and execute this?

  Take a look at this bzip compression study that Kent Yang did:

http://www.hdfgroup.uiuc.edu/papers/papers/bzip2/

  It should have enough information to get you started, along with example code for writing a pipeline filter that's "outside the library".

  Quincey

···


Hi Quincey,

I see three things in the code that catch my eye

1) H5Zregister(H5Z_BZIP2,"bzip2",bzip2_filter); is the only additional element compared to regular code that links in the filter.

2) /* set up the property list to make data buffer to be equal to the data size,
     this step is critial since HDF5 library will be much slower to handle data
     type conversion (for example, from big-endian to little-endian) without
     setting this buffer. The future HDF5 library should handle this case
     better. */

  Has the last sentence come to pass?

3) size_t XXXXX_filter(unsigned flags,
          size_t cd_nelmts,
          const unsigned cd_values[],
          size_t nbytes,
          size_t *buf_size,
          void **buf) {

  is the required form of every filter callback, which provides the interface to the compression library,
  as detailed in http://hdfgroup.com/HDF5/doc/RM/RM_H5Z.html

Can I run multiple filters in sequence, or must I use only one filter? E.g. a compression filter followed by an encryption filter.

Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA

···


I am a little confused by the example:

H5Zregister(H5Z_BZIP2,"bzip2",bzip2_filter); has three parameters.

The docs say a single parameter is passed, which is the structure

typedef struct H5Z_class_t {
    H5Z_filter_t filter_id;
    const char *comment;
    H5Z_can_apply_func_t can_apply_func;
    H5Z_set_local_func_t set_local_func;
    H5Z_func_t filter_func;
} H5Z_class_t;
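
If I am reading that right, the single struct parameter would presumably pack
the same three values plus the optional callbacks, something like this (just
my guess, reusing the names from the example above):

    const H5Z_class_t H5Z_BZIP2_CLASS = {
        H5Z_BZIP2,      /* filter_id                          */
        "bzip2",        /* comment                            */
        NULL,           /* can_apply_func (optional)          */
        NULL,           /* set_local_func (optional)          */
        bzip2_filter    /* filter_func                        */
    };
    H5Zregister(&H5Z_BZIP2_CLASS);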

Matthew Dougherty
713-433-3849
National Center for Macromolecular Imaging
Baylor College of Medicine/Houston Texas USA

···


Hi Matt,

Hi Quincey,

I see three things in the code that catch my eye

1) H5Zregister(H5Z_BZIP2,"bzip2",bzip2_filter); is the only additional element compared to regular code that links in the filter.

  Yes.

2) /* set up the property list to make data buffer to be equal to the data size,
     this step is critial since HDF5 library will be much slower to handle data
     type conversion (for example, from big-endian to little-endian) without
     setting this buffer. The future HDF5 library should handle this case
     better. */

  Has the last sentence come to pass?

  Nope, sorry. :-(

3) size_t XXXXX_filter(unsigned flags,
                                        size_t cd_nelmts,
                                        const unsigned cd_values[],
                                        size_t nbytes,
                                        size_t *buf_size,
                                        void **buf) {

  is the required organization of all filter subroutine calls; which provides the interface to the compression library.
  detailed in http://hdfgroup.com/HDF5/doc/RM/RM_H5Z.html

Can I run multiple filters in sequence? or must I only use one filter? e.g. compression filter followed by encryption filter

  Yes, you can push data through multiple filters in sequence. The library pushes the data through the filters in the order you make the property "set" calls (e.g. if you call H5Pset_shuffle then H5Pset_deflate, the shuffle filter is first, then the deflate filter) when writing (and in reverse when reading :-). There's a limit of 32 filters for a sequence right now.
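
  A quick sketch of what that ordering looks like on a dataset creation property list (the chunk size and compression level here are arbitrary):

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[1] = { 1000 };
    H5Pset_chunk(dcpl, 1, chunk);   /* filters require a chunked layout     */
    H5Pset_shuffle(dcpl);           /* filter 1: byte shuffle               */
    H5Pset_deflate(dcpl, 6);        /* filter 2: deflate (gzip) at level 6  */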

  Quincey
