H5py compound data set with arrays


#1

I’m experiencing the perfect storm as I’m a relative newbie to both Python and HDF5.

I have a situation where I’d like to store image data. I’ve had success producing results when storing it as a simple 2D array. However, I’d like to store additional metadata for each data set, including a timestamp. So the data would look something like:
Image1, timestamp1,
Image2, timestamp2,

I’ve seen examples of using compound data for scalars and strings but have not found anything for compound types storing arrays along with scalar values. I DID find a solution that uses pytables - is that the best solution?

In short, I’m trying to figure out how to create the correct dtype. In this case, how does ‘???’ translate to a 2D byte array?
dt = np.dtype([???, ('timestamp', np.float64)])


#2

You should use Attributes to store that type of metadata.
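For example, if each image lives in its own dataset, a minimal h5py sketch (file and dataset names here are made up for illustration) might look like:

```python
import time

import h5py
import numpy as np

# Minimal sketch: one image per dataset, with the timestamp attached
# as an attribute on that dataset.
with h5py.File("images.h5", "w") as f:
    img = np.zeros((964, 1292), dtype=np.uint8)  # placeholder image
    dset = f.create_dataset("image1", data=img)
    dset.attrs["timestamp"] = time.time()        # per-dataset metadata

# Reading the metadata back:
with h5py.File("images.h5", "r") as f:
    ts = f["image1"].attrs["timestamp"]
```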

Mark


#3

Hello Mark.

Thanks for the quick response. I’m slowly figuring it out.

I wouldn’t think attributes could work since the value of the attribute changes with each dataset. I have an image group and within that group there is a single data set reference with many instances. Each instance has a unique timestamp.

Seth


#4

Hi Seth,

You are correct that if you have multiple images in a dataset then attributes won’t work.

Here is part of an h5dump on a file where I am streaming images to the HDF5 file. I create a 3-D dataset for the images, one dataset for the timestamp, and another for the UniqueId of each image. In this case I am streaming 1024x1024 images to the file. I told it the maximum size would be 100 images, but I stopped it when only 34 images had been collected. The image dataset (/entry/data/data) is thus [34,1024,1024], while the /entry/instrument/NDAttributes/NDArrayTimeStamp and /entry/instrument/NDAttributes/NDArrayUniqueId datasets are both [34].

HDF5 "hdf5_test_002.h5" {
GROUP "/" {
   GROUP "entry" {
      ATTRIBUTE "NX_class" {
         DATATYPE  H5T_STRING {
            STRSIZE 7;
            STRPAD H5T_STR_NULLTERM;
            CSET H5T_CSET_ASCII;
            CTYPE H5T_C_S1;
         }
         DATASPACE  SCALAR
      }
      GROUP "data" {
         ATTRIBUTE "NX_class" {
            DATATYPE  H5T_STRING {
               STRSIZE 6;
               STRPAD H5T_STR_NULLTERM;
               CSET H5T_CSET_ASCII;
               CTYPE H5T_C_S1;
            }
            DATASPACE  SCALAR
         }
         DATASET "data" {
            DATATYPE  H5T_STD_U8LE
            DATASPACE  SIMPLE { ( 34, 1024, 1024 ) / ( 100, 1024, 1024 ) }
            ATTRIBUTE "NDArrayDimBinning" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
            }
         }
      }
      GROUP "instrument" {
         GROUP "NDAttributes" {
            DATASET "NDArrayTimeStamp" {
               DATATYPE  H5T_IEEE_F64LE
               DATASPACE  SIMPLE { ( 34 ) / ( H5S_UNLIMITED ) }
               ATTRIBUTE "NDAttrDescription" {
                  DATATYPE  H5T_STRING {
                     STRSIZE 39;
                     STRPAD H5T_STR_NULLTERM;
                     CSET H5T_CSET_ASCII;
                     CTYPE H5T_C_S1;
                  }
                  DATASPACE  SCALAR
               }
            }
            DATASET "NDArrayUniqueId" {
               DATATYPE  H5T_STD_I32LE
               DATASPACE  SIMPLE { ( 34 ) / ( H5S_UNLIMITED ) }
            }
         }
      }
   }
}
}
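In h5py, a rough sketch of this layout might look like the following (the group/dataset names mirror the dump; the frame count is shrunk to 3 for illustration):

```python
import h5py
import numpy as np

ny, nx = 1024, 1024
with h5py.File("stream.h5", "w") as f:
    # 3-D image stack plus parallel 1-D metadata datasets,
    # all extendable along the first axis.
    images = f.create_dataset("entry/data/data",
                              shape=(0, ny, nx), maxshape=(None, ny, nx),
                              dtype=np.uint8, chunks=(1, ny, nx))
    stamps = f.create_dataset("entry/instrument/NDAttributes/NDArrayTimeStamp",
                              shape=(0,), maxshape=(None,), dtype=np.float64)
    ids = f.create_dataset("entry/instrument/NDAttributes/NDArrayUniqueId",
                           shape=(0,), maxshape=(None,), dtype=np.int32)

    for i in range(3):                         # pretend 3 frames arrive
        frame = np.full((ny, nx), i, dtype=np.uint8)
        n = images.shape[0]
        images.resize(n + 1, axis=0)
        images[n] = frame
        stamps.resize(n + 1, axis=0)
        stamps[n] = float(i)                   # placeholder timestamp
        ids.resize(n + 1, axis=0)
        ids[n] = i
```

The key point is that the image stack and the per-image metadata stay in step by sharing the same first-axis index.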

Mark


#5

Thanks again.

That’s a good idea and I may try that approach. But while I have your attention :wink: here’s what I was able to do with a compound data type. It works, but I have to figure out how to properly handle the chunks because performance is awful (I should also probably use a byte value for the image data rather than a 64-bit integer).

  DATASET "ebic.avt24:imageM2" {
     DATATYPE  H5T_COMPOUND {
        H5T_ARRAY { [1292][964] H5T_STD_I64LE } "values";
        H5T_IEEE_F64LE "timestamp";
        H5T_STD_I64LE "width";
        H5T_STD_I64LE "height";
     }
     DATASPACE  SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
     DATA {
     (0): {
            [ 0, 7, 8, 9, 66, 12, 43, 10, 8, 11, 22, 12, 16, 8, 10, 11, 8, 13, 10, 16, 31, 9, 8, 9, 12, 11, 17, 7, 8, 8, 8, 16, 7, 9, 8, 14, 9, 12, 14, 7, 6, 12, 43, 10, 7, 9, 12, 8, 9, 8, 6, 8, 8, 10, 7, 9, 11, 24, 9…
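For reference, a compound dtype like the one in this dump can be built in NumPy/h5py roughly as follows (a sketch answering the ‘???’ from the original question; the field names and shape follow the dump, and the illustrative values are made up):

```python
import h5py
import numpy as np

# Compound dtype whose first field is a 2-D array. Using np.int64 to
# match the dump above; np.uint8 would match actual byte image data.
dt = np.dtype([("values", np.int64, (1292, 964)),  # 2-D array field
               ("timestamp", np.float64),
               ("width", np.int64),
               ("height", np.int64)])

with h5py.File("compound.h5", "w") as f:
    dset = f.create_dataset("image", shape=(0,), maxshape=(None,), dtype=dt)
    rec = np.zeros(1, dtype=dt)
    rec["timestamp"] = 123.456       # placeholder metadata
    rec["width"], rec["height"] = 1292, 964
    dset.resize((1,))
    dset[0:1] = rec                  # append one compound record
```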


#6

Just one follow-up – it’s not the chunking that causes the performance problem; I’m setting chunks=True instead of a specific size in h5py. Rather, it’s the gzip compression level that’s slowing things down. I found that a level above 6 or 7 has quite a big impact.
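To illustrate the difference (dataset name and sizes are made up): with chunks=True, h5py guesses a chunk shape, whereas an explicit per-image chunk plus a moderate gzip level keeps writes fast:

```python
import h5py
import numpy as np

data = np.random.randint(0, 255, size=(10, 256, 256), dtype=np.uint8)

with h5py.File("tuned.h5", "w") as f:
    # One image per chunk; gzip level 4 trades a little size for speed.
    f.create_dataset("fast", data=data,
                     chunks=(1, 256, 256),
                     compression="gzip", compression_opts=4)
```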

Seth


#7

Hi Mark.

My apologies if I’m spamming you with this email. I could try and post to the forum but I have a feeling my question/concern is something that has already been addressed. I’ll try to be concise.

I’ve been exploring the use of HDF5 as a replacement for the SDDS file structure that we’ve been using for years to store logged data from our controls system. In the past week I’ve come across several roadblocks that are leading me to the conclusion that HDF5 is just not the right solution. My original enthusiasm for HDF5 came from the fact that the stored data was much more compact than the compressed binary SDDS data. For example, I found that 5 minutes’ worth of 1 kHz data for 20 different data sets took up 3.5 MB on our file system, which meant that we could store hours of high-frequency data in a single HDF5 file. The fact that the files are randomly accessible would be a huge win.

Unfortunately, since we would like to use HDF5 for our logging system, access to the data needs to be current so storing hours of data in a single file means that SWMR mode must be turned on. Two problems.


#8

Hi Seth,
I’m aware this is a Python-related post, but is C++ an option for you? If so, I would like to hear more about your problem.
best:
steve


#9

Thanks for the quick reply. I’m not sure why the rest of my message is not showing up on the forum but here’s an attempt at sharing the full content of my message again (if it didn’t make it the first time). The following is a continuation from my previous post.

Unfortunately, since we would like to use HDF5 for our logging system, access to the data needs to be current, so storing hours of data in a single file means that SWMR mode must be turned on. Two problems:

1. SWMR does not work over NFS – sigh.
2. SWMR produces so much ‘unaccounted’ space that even if we write to a local disk, that 3.5 MB file is now 100 MB!

So SWMR is not the answer, but how can I write data to a file while allowing other processes to access that file? I thought about closing the file between writes and opening in append mode when ready to add new data sets, but that causes other problems:

1. Adding data in append mode causes that ‘unaccounted’ space to blow up again. One workaround is to h5repack the file after closing it; while inelegant, it works.
2. If another process attempts to read that data and doesn’t close it in time, the writer process will throw an exception when it tries to write. So this is not a good solution either.

I would find it surprising if I’m the only one who’s experiencing these obstacles. Is there a simple solution to deal with the use case I described above?

I’ve been using the h5py 2.10 module for testing but have also confirmed that the C++ implementation has possibly different but similar shortcomings. I’m going to do more testing in the C++ world next week, but it would be a shame if the Python implementation were not as robust as C/C++.


#10

Thanks for getting back to me,
here is the link to H5CPP, a new approach to persistence for modern C++. You may be interested in the h5::append operator, which is an efficient – near filesystem throughput – custom implementation of packet tables. These ISC’19 presentation slides give you a quick overview.

I would recommend combining h5::append with ZeroMQ, perhaps with some wire protocol such as Protocol Buffers. This kind of setup is indeed in a different ballpark than SWMR: robust and distributed. A single event recorder – or many – consumes and then persists the events independently; no SWMR needed. For multithreaded use you have to follow some rules to be able to read back data blocks while recording them. This is like rolling your own SWMR, which is much easier because you are solving a specific case.

The previous sketch, with some refinements, has been used in the high-frequency trading industry, with the difference that UDP or SCP packets are used. (In fact I based H5CPP on my previous work: reliable low-latency logging of high-count events, approx. 10^6 to 10^8 events/sec.)
If you have UDP packets, or perhaps a TCP stream, with C structs, I highly recommend checking out my other project: the h5cpp compiler, which solves compile-time reflection with an LLVM-based tool.

let me know if you find this interesting.
best wishes:
steve


#11

There’s a lot here to digest. I will be sure to explore your recommendations and let you know how it works out.

Thanks again for your follow-up.

Seth


#12

Hi Seth,

You say you are currently using SDDS. I know that is mainly used at the APS at Argonne. Is that where you are? I am there at sector 13. If you are perhaps we can talk in person?

I have just made a test streaming 100 1024x1024 UInt8 images to an HDF5 file, including metadata as I described previously. I am using the EPICS areaDetector HDF5 file writer (C interface to HDF5). I tested both with SWMR and without. The files are very nearly the same size (in fact the SWMR file is slightly smaller), and are just a little bigger than the expected 100 MB because of the metadata.

-rw-rw-r-- 1 epics domain users 105154704 Mar 9 09:15 hdf5_test_noswmr_023.h5

-rw-rw-r-- 1 epics domain users 105078764 Mar 9 09:16 hdf5_test_swmr_025.h5

Mark


#13

Hi Mark.

I’m at BNL.

Just a very brief history: we inherited the SDDS libraries in the late ’90s and have (probably) forked a version that we have been using here at RHIC to store logged data. It has worked mostly well for us, but we are planning for the EIC (Electron-Ion Collider) and I would like to modernize our logging system.

About two years ago I started a proof-of-principle exploration of HDF5, and there was a lot of promise (random access, better compression, community support). That project got put on the back burner until a couple of weeks ago, when I finally had time to revisit it.

I decided to look into Python instead of C++ as that’s where our development momentum was heading. As I’ve alluded to in my previous messages, I’ve been a bit disappointed in the Python implementation (h5py v1.10). I could go back to C++ development strategies, but I’m not sure that it’s the right thing to do at this time.

A couple of things I’ve noticed


#14

Hi Seth,

Your message seems to have been truncated again.

You can e-mail me directly at rivers at cars.uchicago.edu.