Performance problem writing small datasets hdf5.8.0.2


My apologies if this comes out wrong: a) it is my first time posting, and b) I only get the digest, so I had to copy and paste.

We too are developing software for the control of a scientific instrument, and frankly I am nervous that my selection of HDF5 might backfire, so I look for problems like this one as things to test for. We too log to a binary file first and then load the data into HDF5. We capture the time stamps at source, so this problem would be less of an issue for us.

My suspicion is that the increasing delays are due to chunking and perhaps compression.

You have 18-bit samples at 250 kHz, sampled for 2.8 ms = 700 samples.
You are binning this data down to 18 or 20 values.
You mention an overall rate of 6 Hz, so I am guessing you write at around 120 values/sec/dataset.
Your 9-second interval then corresponds to about every 1080 values.
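
For anyone who wants to check the arithmetic, the figures above work out as in the short Python sketch below. The 20 bins per acquisition (the upper end of the quoted range) and the 1000-value chunk are my guesses, not numbers from the original post.

```python
# Back-of-envelope rates from the figures quoted in this thread.
SAMPLE_RATE_HZ = 250_000   # 250 kHz sampling rate (from the original post)
WINDOW_MS = 2.8            # acquisition window in ms (from the original post)
ACQ_RATE_HZ = 6            # overall acquisition rate (from the original post)
DELAY_PERIOD_S = 9         # observed interval between delays (from the post)
BINS_PER_ACQ = 20          # assumption: upper end of "18 to 20 time bins"
ASSUMED_CHUNK = 1000       # assumption: hypothetical chunk size, in values

raw_samples = round(SAMPLE_RATE_HZ * WINDOW_MS / 1000)  # samples per window
values_per_sec = ACQ_RATE_HZ * BINS_PER_ACQ             # per dataset
values_per_period = values_per_sec * DELAY_PERIOD_S     # written between delays
chunk_fill_s = ASSUMED_CHUNK / values_per_sec           # time to fill one chunk

print(raw_samples, values_per_sec, values_per_period, round(chunk_fill_s, 1))
# prints: 700 120 1080 8.3
```

With these guesses a 1000-value chunk fills roughly every 8.3 seconds, which is close enough to the observed 9 seconds to make chunking worth ruling out.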

Your datasets are presumably extendible, so they must be chunked (you don't mention chunk size or compression). Depending on whether caching is enabled and whether you are using compression, this might be the source of your delays every 9 seconds. If you are using a chunk size of 1000, for example, then about every 9 seconds a chunk fills up, is compressed, and is written to disk. The effect will be magnified by the number of channels you are recording (24 or so), since I am guessing all those datasets fill a chunk at the same time because of the uniform sample rates. I am unclear why the time to write would get longer as time goes on, but you might want to try varying the chunk size and turning compression off to see whether the behaviour changes.
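
To make the "all datasets fill a chunk at the same time" guess concrete, here is a toy Python model. It does not touch HDF5 at all. It only counts how many datasets would complete a chunk on each write cycle. The 20 values per write, the chunk sizes, and the function name `burst_schedule` are all assumptions of mine for illustration.

```python
# Toy model only: no HDF5 calls, just bookkeeping of how full each chunk is.
N_DATASETS = 24        # "24 or so" channels, one dataset each (assumption)
VALUES_PER_WRITE = 20  # assumption: 20 binned values appended per acquisition
WRITES_PER_SEC = 6     # overall rate from the original post

def burst_schedule(chunk, n_writes=200):
    """Return {write_index: number of datasets completing a chunk then}."""
    filled = [0] * N_DATASETS   # values sitting in each dataset's open chunk
    bursts = {}
    for i in range(1, n_writes + 1):
        count = 0
        for d in range(N_DATASETS):
            filled[d] += VALUES_PER_WRITE
            if filled[d] >= chunk:   # chunk complete: would be flushed here
                filled[d] -= chunk
                count += 1
        if count:
            bursts[i] = count
    return bursts

for chunk in (1000, 500):   # hypothetical chunk sizes, in values
    bursts = burst_schedule(chunk)
    first = min(bursts)
    print(f"chunk={chunk}: {bursts[first]} chunks complete together "
          f"every {first / WRITES_PER_SEC:.1f} s")
```

In this model, halving the chunk size halves the interval between bursts. If the real 9-second period moves the same way when the chunk size is changed, that would point at chunk flushing. If it does not move, the cause is probably elsewhere.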

Can I ask why you are not capturing the time stamps at the point of measurement?



On Mon, May 9, 2016 at 1:58 PM, Karl Hoover wrote:

We're developing software for the control of a scientific instrument. At
an overall rate of about 6 Hz, it acquires 2.8 milliseconds' worth of 18-bit
samples at 250 kHz on up to 24 channels. These data are shipped back over gigabit
Ethernet to a Linux PC running a simple Java program. These data can
reliably be written as a byte stream to disk at full speed with extremely
regular timing. Thus we are certain that our data acquisition, Ethernet
transport, Linux PC software and file system are working fine.

However, the users want data in a more portable, summarized format and we
selected HDF5. The 700 or so 18-bit samples of each channel are integrated
into 18 to 20 time bins. The resulting data sets are thus not very large at
all. I've attached a screen shot of a region of a typical file of typical
size and example data (much smaller than a typical file.)

The instrument operates in two distinct modes. In one mode the instrument
is stationary over the region of interest. This is working flawlessly. In
the other mode, the instrument is moved around and about in arbitrary
paths. In this mode the precise time of the data acquisition obviously is
critical. What we observe is that the system is very stable at 6 Hz,
except that every 9 seconds a delay occurs, starting at about 10 ms and
growing without bound to hundreds of milliseconds. There
is nothing in my software that knows anything about a 9 second interval.
And I've found that this delay only occurs when I *write* the HDF5 file.
All other processing including creating the HDF5 file can be performed
without any performance problem. It makes no difference whether I keep the
HDF5 file open or close it each time. I'm using HDF5.8.02 and the JNI /
Java library. Any suggestions about how to fix this problem would be appreciated.

Best regards,
Karl Hoover

Senior Software Engineer