I am planning to store potentially tens of thousands of 2-D datasets, along with their relevant attributes, in an HDF5 file. Each dataset will have the same number of rows but a different number of columns. I am rather new to HDF5, so I thought I'd ask about any potential pitfalls before diving into coding. Are there any memory or performance issues I should be concerned about due to the large number of datasets involved? And what about the file size? Currently, the data is stored in native Fortran sequential binary format. The file sizes range from tens of gigabytes to over 100 GB, depending on the application that generates them. Should I expect an HDF5 file that is much larger than its Fortran binary counterpart, or about the same size?
Any information would be greatly appreciated.
Thanks,
Jon
**************************************************
Emin C. Dogrul, Ph.D., P.E.
Water Resources Engineer
Hydrologic Models Development Unit
California Department of Water Resources Bay-Delta Office
1416 9th Street, Rm 252A
Sacramento, CA 95814
> I am planning to store potentially tens of thousands of 2-D datasets, along with their relevant attributes, in an HDF5 file. Each dataset will have the same number of rows but a different number of columns. I am rather new to HDF5, so I thought I'd ask about any potential pitfalls before diving into coding. Are there any memory or performance issues I should be concerned about due to the large number of datasets involved?
If it's practical, I think you would want to distribute the datasets among several groups in a group hierarchy of some modest depth, maybe 2-6 levels depending on the dataset count. Putting all the datasets in a single group is probably not the best approach, since the library then has to maintain a single, rather large structure to manage all of that group's members. A sketch of the idea follows.
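For illustration, here is a minimal Fortran sketch of that layout, spreading datasets over a two-level hierarchy. All names and sizes (results.h5, batch_NNNN, dset_NNNN, 500 rows, 100 groups of 100 datasets) are made up; adapt them to your data:

```fortran
! Minimal sketch: spread 10,000 datasets over 100 groups of 100
! instead of one flat group. Names and sizes are hypothetical.
program group_hierarchy
  use hdf5
  implicit none

  integer(hid_t)    :: file_id, grp_id, space_id, dset_id
  integer(hsize_t)  :: dims(2)
  integer           :: ierr, ig, id
  character(len=32) :: gname, dname
  real(kind=8), allocatable :: data(:,:)

  call h5open_f(ierr)
  call h5fcreate_f('results.h5', H5F_ACC_TRUNC_F, file_id, ierr)

  do ig = 1, 100                        ! 100 groups ...
     write(gname, '(a,i4.4)') 'batch_', ig
     call h5gcreate_f(file_id, trim(gname), grp_id, ierr)
     do id = 1, 100                     ! ... of 100 datasets each
        ! same row count everywhere; column count varies per dataset
        dims = [ int(500, hsize_t), int(10 + id, hsize_t) ]
        allocate(data(dims(1), dims(2)))
        data = 0.0d0                    ! placeholder payload
        write(dname, '(a,i4.4)') 'dset_', id
        call h5screate_simple_f(2, dims, space_id, ierr)
        call h5dcreate_f(grp_id, trim(dname), H5T_NATIVE_DOUBLE, &
                         space_id, dset_id, ierr)
        call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, data, dims, ierr)
        call h5dclose_f(dset_id, ierr)
        call h5sclose_f(space_id, ierr)
        deallocate(data)
     end do
     call h5gclose_f(grp_id, ierr)
  end do

  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program group_hierarchy
```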
> And what about the file size? Currently, the data is stored in native Fortran sequential binary format. The file sizes range from tens of gigabytes to over 100 GB, depending on the application that generates them. Should I expect an HDF5 file that is much larger than its Fortran binary counterpart, or about the same size?
That depends. Is this a *lot* of tiny datasets or a lot of large-ish datasets? I think dataset header overhead is on the order of half a kilobyte per dataset. It's worse if you chunk the datasets (i.e., create them with the chunked storage layout via H5Pset_chunk, or h5pset_chunk_f from Fortran), since each chunk also carries index overhead. If your average dataset size is, say, 20x that header overhead (i.e., >= 10 KB), then I think the file size difference will NOT be significant.
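For reference, here is a sketch of the same dataset created once with the default contiguous layout and once chunked. The file name, dataset names, data sizes, and chunk dimensions are illustrative only, not a recommendation:

```fortran
! Sketch: the same data stored contiguously vs. chunked.
! Chunk dimensions are illustrative; each chunk adds index
! overhead, so chunks should not be tiny relative to the data.
program chunking_demo
  use hdf5
  implicit none

  integer(hid_t)   :: file_id, space_id, dset_id, dcpl_id
  integer(hsize_t) :: dims(2), chunk_dims(2)
  integer          :: ierr
  real(kind=8)     :: data(500, 200)

  data = 0.0d0
  dims = [ int(500, hsize_t), int(200, hsize_t) ]

  call h5open_f(ierr)
  call h5fcreate_f('layouts.h5', H5F_ACC_TRUNC_F, file_id, ierr)
  call h5screate_simple_f(2, dims, space_id, ierr)

  ! Contiguous (the default layout): lowest metadata overhead.
  call h5dcreate_f(file_id, 'contiguous', H5T_NATIVE_DOUBLE, &
                   space_id, dset_id, ierr)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, data, dims, ierr)
  call h5dclose_f(dset_id, ierr)

  ! Chunked: required for compression or extensible datasets,
  ! but the chunk index adds overhead on top of the dataset header.
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierr)
  chunk_dims = [ int(500, hsize_t), int(50, hsize_t) ]
  call h5pset_chunk_f(dcpl_id, 2, chunk_dims, ierr)
  call h5dcreate_f(file_id, 'chunked', H5T_NATIVE_DOUBLE, &
                   space_id, dset_id, ierr, dcpl_id)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, data, dims, ierr)
  call h5dclose_f(dset_id, ierr)

  call h5pclose_f(dcpl_id, ierr)
  call h5sclose_f(space_id, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program chunking_demo
```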
Hope that helps.