interesting issues with split vfd file naming

Hi All,

I've been adding support for HDF5's split vfd to Silo, a library that
runs on top of HDF5 to read and write scientific mesh data.

First, split vfd is great! Works like a charm. Reduced I/O requests by 3
orders of magnitude by using core for meta and sec2 for raw. That's
awesome!

Nonetheless, I've run into a number of peculiar issues with the 'file
splitting' aspect of this vfd and wanted to mention them and get feedback.

The overriding issue is that the HDF5 library itself presently does not
know of a file's 'splitness'. An application using HDF5 has to tell it
so before opening the file, and also has to supply the extensions used for
the meta and raw parts. I think almost all of my problems would disappear
if HDF5 'knew' a file was split via some kind of magic information
contained in either file (or in any one of the files, if you are using the
multi vfd) or, at the very least, in the meta file. That way, it would be
possible to pass to H5Fopen a string that is the actual name of a file on
disk and HDF5 could just 'figure it out'.

A consequence of this is that few if any of the HDF5 tools will be able
to operate on files that have been generated via split vfd. I am told
h5dump might work but I haven't looked into what's involved yet. I have a
solution for cases where only one file is opened at a time.

My Silo library sitting on top of HDF5 interacts with files NOT ONLY
through the HDF5 library but also through system calls (stat, access, ...).
But, the string one must pass to HDF5 to correctly open a split file is
NOT NECESSARILY the name of any actual file. So, all of Silo's system
calls can fail even though there is really a split file there for HDF5
to open. Worse, it could be the name of an actual file, just not the
ACTUAL file the split vfd will open. Its properties may be entirely
different from those of the file(s) you really want to open.
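
One way around the mismatch described above is to resolve the logical name to the real files before making any system call. A minimal sketch, assuming each part's name is simply the logical name with the extension appended; `split_stat` is a hypothetical helper, not a Silo or HDF5 function:

```c
#include <stdio.h>
#include <sys/stat.h>

/* Hypothetical helper (not part of Silo or HDF5): stat the two real files
 * behind a split-vfd logical name. ASSUMPTION: each part's name is the
 * logical name with the extension appended, e.g. "foobar" + ".meta".
 * Returns 0 if both parts could be stat'ed, -1 otherwise. */
static int split_stat(const char *logical,
                      const char *meta_ext, const char *raw_ext,
                      struct stat *meta_st, struct stat *raw_st)
{
    char path[4096];
    int n;

    /* Build and stat the meta part. */
    n = snprintf(path, sizeof path, "%s%s", logical, meta_ext);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;                      /* name too long */
    if (stat(path, meta_st) != 0)
        return -1;                      /* meta part missing or unreadable */

    /* Build and stat the raw part. */
    n = snprintf(path, sizeof path, "%s%s", logical, raw_ext);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;
    if (stat(path, raw_st) != 0)
        return -1;                      /* raw part missing or unreadable */

    return 0;
}
```

A caller would use `split_stat("foobar", ".meta", ".raw", &m, &r)` in place of `stat("foobar", &s)`, so that sizes and permissions come from the files the vfd will really open rather than from an unrelated file that happens to be named "foobar".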

Finally, I have software on top of Silo that may open multiple files
generated by different user communities. And, each may use a different
convention for extension names for the meta and raw files of the split
vfd. That means the Silo library has to manage multiple extension
conventions. That isn't too bad by itself. However, if you have
"foobar.meta" and "foobar.raw" generated by application A using one set
of split vfd extensions and then another "foobar.aaa" and "foobar.bbb"
generated by application B using a different set of split vfd extension
conventions, you can no longer ask Silo to open "foobar", because the name
is now ambiguous. But, you also can't open any of the individual foobar.*
files by name, as each is only one piece of a split pair of files.
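
The ambiguity can at least be detected up front. A rough sketch, assuming a library keeps a table of known extension conventions and that part names are formed by appending the extension to the base name; the struct and function names here are made up for illustration and do not exist in Silo or HDF5:

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical table entry: one meta/raw extension convention. */
struct split_convention {
    const char *meta_ext;
    const char *raw_ext;
};

/* Returns 1 if both parts exist on disk for this convention, else 0.
 * ASSUMPTION: part name = base name + extension. */
static int both_parts_exist(const char *base, const struct split_convention *c)
{
    char meta[4096], raw[4096];
    int m = snprintf(meta, sizeof meta, "%s%s", base, c->meta_ext);
    int r = snprintf(raw, sizeof raw, "%s%s", base, c->raw_ext);

    if (m < 0 || (size_t)m >= sizeof meta) return 0;
    if (r < 0 || (size_t)r >= sizeof raw) return 0;
    return access(meta, F_OK) == 0 && access(raw, F_OK) == 0;
}

/* Counts how many conventions in the table match files on disk.
 * *first receives the index of the first match, or -1 if none.
 * A return value above 1 means "base" is ambiguous. */
static int match_conventions(const char *base,
                             const struct split_convention *table, int n,
                             int *first)
{
    int i, hits = 0;

    *first = -1;
    for (i = 0; i < n; i++) {
        if (both_parts_exist(base, &table[i])) {
            if (hits == 0)
                *first = i;
            hits++;
        }
    }
    return hits;
}
```

With a table holding {".meta", ".raw"} and {".aaa", ".bbb"}, a result of 1 tells the caller which extensions to hand to the split vfd, while a result of 2 flags the "foobar" collision so it can be reported instead of silently opening the wrong pair.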

Puzzling, puzzling...

Mark

···

--
Mark C. Miller, Lawrence Livermore National Laboratory
================!!LLNL BUSINESS ONLY!!================
miller86@llnl.gov urgent: miller86@pager.llnl.gov
T:8-6 (925)-423-5901 M/W/Th:7-12,2-7 (530)-753-851

Hi Mark,

···

On Feb 12, 2010, at 12:44 AM, Mark Miller wrote:

  Yes, the split, multi & family file drivers create an interesting batch of problems. I'm not certain exactly how to solve the problems that arise when dealing with an HDF5 "file" that maps onto multiple system files. If you've got any suggestions for solving some of the problems exhibited by the use cases you outline above, that would be great. :)

  Quincey

Well, I haven't read the VFL docs in detail but I did see something in
there regarding a USER_BLOCK. I wonder if the split vfd could somehow
use that to help solve the fundamental problem here; that is, ensuring
that the HDF5 library proper 'knows' that a given set of files represents
a 'split' file. I was thinking that if the file-naming convention is stored
in this USER_BLOCK then an attempt to open the 'meta' part could be made
to 'automagically' work correctly; that is, identify that it's a 'split'
file AND set up the correct extensions for the set_fapl call to succeed
in opening it. Such an approach may NOT work if the HDF5 library is asked
to open some other part of the split stream in which the USER_BLOCK does
not exist. But, it's a step in the right direction.

···

On Tue, 2010-02-16 at 06:40 -0600, Quincey Koziol wrote:

--
Mark C. Miller, Lawrence Livermore National Laboratory

Hi Mark,

···

On Feb 16, 2010, at 11:21 AM, Mark Miller wrote:

  I think you mean the super block here, not the user block. With that in mind, yes, it is possible to "probe" for the correct VFD to use for the file and then re-open the file with the information from the super block. I think we even have an existing issue in our tracker for this, so it's mostly a matter of priorities and funding at this point (which could be routed around with a well-written patch, also).

  Quincey
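
As a rough illustration of the first step of such a "probe", the sketch below only looks for the 8-byte HDF5 format signature, which the file format places at offset 0 or at 512, 1024, 2048, and so on. It does not decode anything in the super block beyond that, and whether every part of a split file carries the signature is not something this sketch assumes (the raw part may well not). `find_hdf5_signature` is a made-up name, not an HDF5 API call:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical probe (not an HDF5 function): return the byte offset of the
 * HDF5 format signature in the file at "path", or -1 if it is not found at
 * any candidate offset up to and including max_offset. Candidate offsets
 * are 0, 512, 1024, 2048, ... (doubling each time). */
static long find_hdf5_signature(const char *path, long max_offset)
{
    static const unsigned char sig[8] =
        { 0x89, 'H', 'D', 'F', '\r', '\n', 0x1a, '\n' };
    unsigned char buf[8];
    long off = 0;
    FILE *f = fopen(path, "rb");

    if (f == NULL)
        return -1;

    while (off <= max_offset) {
        if (fseek(f, off, SEEK_SET) != 0)
            break;
        if (fread(buf, 1, sizeof buf, f) != sizeof buf)
            break;                      /* ran past the end of the file */
        if (memcmp(buf, sig, sizeof sig) == 0) {
            fclose(f);
            return off;
        }
        off = (off == 0) ? 512 : off * 2;
    }

    fclose(f);
    return -1;
}
```

A caller could run this on a name the user supplied and, on a hit, go on to read whatever driver information the super block holds; on a miss, it could fall back to treating the name as a split-vfd base name.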