multi-/split- file examples or advice for controlling file layout

j.rowe · February 22, 2016, 7:45pm

Hello- we are using some block-level deduping infrastructure that allows us to synchronize files around our enterprise. To make this most effective, we need the beginning of files to be as stable as possible.

We have HDF5 files that range from 0.5 to 20 GB, and we generally alter only 5% of the data, in specific datasets, after the initial creation. We would like to structure these files so that we can take the most advantage of the aforementioned deduping. Questions:

1) It appears that H5Pset_fapl_split() is the direction to look to separate data from metadata. Is this fully supported? Any performance issues with these drivers over the single-file type?

2) Is there a way to specify where a particular dataset is stored? E.g., in my ideal scenario, I would have 3 files: 1) for my metadata which is potentially most volatile as blocks change; 2) for data sets where I am altering data which would be somewhat volatile; 3) the last file for my most static data.

Any other advice or practical experience in this regard?

Best regards,
--Jim

miller86 · February 22, 2016, 9:31pm

1) It appears that H5Pset_fapl_split() is the direction to look to separate data from metadata. Is this fully supported?

Yes. Note that there is a similar driver called 'multi' that will be discontinued. That is NOT relevant to your use of the split driver, however. The HDF Group will continue to support the split driver.

Any performance issues with these drivers over the single-file type?

Not in the cases I have tested. In fact, it *can* improve performance in many cases. That said, there is a logistical issue to keep in mind. Every 'file' is really two files on disk, the meta file and the raw file, so all the software (and users) in your workflows need to be 'hip' to this. Which file do users click on? Which file do they pass in an open call? You have to make sure that your workflows call H5Fopen on the correct filesystem object. If a user wants to give some file(s) to another user (say by tar'ing them up), does that user know to get *both* the raw and meta files? Worse, the HDF5 library presently is not smart enough to know that H5Pset_fapl_split is needed to open such a file, so your software needs to have the smarts to make that happen.

2) Is there a way to specify where a particular dataset is stored? E.g., in my ideal scenario, I would have 3 files: 1) for my metadata which is potentially most volatile as blocks change; 2) for data sets where I am altering data which would be somewhat volatile; 3) the last file for my most static data.

Hmmm. I am confused. You say 'dataset' here, but we're talking about the split file setting. With the split driver, all datasets go into the 'raw' file. Well, that isn't entirely true: datasets with storage type 'compact' will probably go into the meta file. However, compact datasets are limited to 64 KB in size and are probably not relevant to your case. It sounds like you might really be looking for a swizzle on the family file case.

But I think there are two ways you could go here while still using the split driver for raw/meta files. First, you could define a 3rd HDF5 file that you 'mount' into the raw/meta file after you open it (see H5Fmount()). Maybe you use the mounted file for your class-3 stuff, and the raw/meta split files for your class-2/1 stuff respectively.

Another option is to store your non-volatile stuff as 'external' datasets in external (non-HDF5) files. See H5Pset_external() for that.

I'd go with the mount option because the 3rd file would still be a valid HDF5 file and you can put any number of datasets in any organization you desire into that 3rd file. The external dataset option is very limited in functionality.

Hope that helps.
