HDF5 SWMR Document design question

Hello HDF Forum members -

I'm working on a project here at SSEC that provides community software to
process GOES-R Rebroadcast data, and I'm wondering if you could help with a
technical concern we have.

Some users want us to serve this data in real time, as it is being written.
We believe this should be possible if we pre-allocate a dataset in an HDF5
file and initialize it with fill data, which will simply be replaced as the
real data comes in.

So, there will be a single process writing, and multiple other processes
opening, reading, and exiting. We found documentation essentially saying
"don't do this":

https://www.hdfgroup.org/HDF5/docNewFeatures/SWMR/Design-HDF5-SWMR-20130629.v5.2.pdf

"The SWMR Problem" on page 5.

However, it seems to us this would only be a problem if we were taking
advantage of HDF5 functionality such as creating new datasets or extending
existing ones, which we won't be. Offsets, it seems, should never change.

Also, we are not concerned about whether data has been flushed yet or not:
if readers come along and find fill data while the writer is still flushing,
that's fine.

The question is whether this concurrency issue truly still applies in our
case.

Any help with this is greatly appreciated. We are in pretty good shape but
if we could optimize this before launch that would be nice. Thank you!

···

--
Tommy Jasmin
Space Science and Engineering Center
University of Wisconsin, Madison
1225 West Dayton Street, Madison, WI 53706

Hi,

It is still a problem and you'll need to use the SWMR feature in HDF5 1.10.0.

The problem is that file metadata (chunk indexes, etc.) is cached by the library, so the full state of the file is split between on-disk and in-memory structures. If an on-disk structure refers to a structure that has not yet been flushed from the cache, a reader following that reference will try to read from disk something that exists only in the writer's memory, and will fail. One of the key changes SWMR makes under the hood is to order metadata flushes so this situation cannot arise: a structure is flushed only after everything it points to is already on disk.

Dana Robinson
Software Engineer
The HDF Group

···

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Tommy Jasmin
Sent: Monday, March 14, 2016 4:29 PM
To: hdf-forum@lists.hdfgroup.org
Cc: Tommy Jasmin <tommy.jasmin@ssec.wisc.edu>
Subject: [Hdf-forum] HDF5 SWMR Document design question

