I am planning to store potentially tens of thousands of 2-D datasets, along with their relevant attributes, in an HDF5 file. Each dataset will have the same number of rows but a different number of columns. I am rather new to HDF5, so I thought I'd ask about any potential pitfalls before diving into coding. Are there any memory or performance issues I should be concerned about due to the large number of datasets involved? And what about the file size? Currently, the data is stored in native Fortran sequential binary format. The file sizes range from tens of gigabytes to over 100 GB, depending on the application that generates them. Should I expect an HDF5 file that is much larger than its Fortran binary counterpart, or about the same size?
Any information would be greatly appreciated.
Thanks,
Jon
**************************************************
Emin C. Dogrul, Ph.D., P.E.
Water Resources Engineer
Hydrologic Models Development Unit
California Department of Water Resources Bay-Delta Office
1416 9th Street, Rm 252A
Sacramento, CA 95814
> I am planning to store potentially tens of thousands of 2-D datasets, along with their relevant attributes, in an HDF5 file. Each dataset will have the same number of rows but a different number of columns. I am rather new to HDF5, so I thought I'd ask about any potential pitfalls before diving into coding. Are there any memory or performance issues I should be concerned about due to the large number of datasets involved?
If it's practical, I think you would want to distribute the datasets among several groups in a group hierarchy of some modest depth, maybe 2-6 levels depending on the dataset count. Putting all the datasets in a single group is probably not the best approach, since the library then has to maintain a single, rather large structure to manage all of that group's members. A sketch of the idea follows.
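For illustration, here is a minimal Fortran sketch of that layout, spreading datasets over a two-level hierarchy. All names and sizes (results.h5, batch_NNNN, dset_NNNN, 500 rows, 100 groups of 100 datasets) are made up; adapt them to your data:

```fortran
! Minimal sketch: spread 10,000 datasets over 100 groups of 100
! instead of one flat group. Names and sizes are hypothetical.
program group_hierarchy
  use hdf5
  implicit none

  integer(hid_t)    :: file_id, grp_id, space_id, dset_id
  integer(hsize_t)  :: dims(2)
  integer           :: ierr, ig, id
  character(len=32) :: gname, dname
  real(kind=8), allocatable :: data(:,:)

  call h5open_f(ierr)
  call h5fcreate_f('results.h5', H5F_ACC_TRUNC_F, file_id, ierr)

  do ig = 1, 100                        ! 100 groups ...
     write(gname, '(a,i4.4)') 'batch_', ig
     call h5gcreate_f(file_id, trim(gname), grp_id, ierr)
     do id = 1, 100                     ! ... of 100 datasets each
        ! same row count everywhere; column count varies per dataset
        dims = [ int(500, hsize_t), int(10 + id, hsize_t) ]
        allocate(data(dims(1), dims(2)))
        data = 0.0d0                    ! placeholder payload
        write(dname, '(a,i4.4)') 'dset_', id
        call h5screate_simple_f(2, dims, space_id, ierr)
        call h5dcreate_f(grp_id, trim(dname), H5T_NATIVE_DOUBLE, &
                         space_id, dset_id, ierr)
        call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, data, dims, ierr)
        call h5dclose_f(dset_id, ierr)
        call h5sclose_f(space_id, ierr)
        deallocate(data)
     end do
     call h5gclose_f(grp_id, ierr)
  end do

  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program group_hierarchy
```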
> And what about the file size? Currently, the data is stored in native Fortran sequential binary format. The file sizes range from tens of gigabytes to over 100 GB, depending on the application that generates them. Should I expect an HDF5 file that is much larger than its Fortran binary counterpart, or about the same size?
That depends. Is this a *lot* of tiny datasets or a lot of large-ish datasets? I think dataset header overhead is on the order of half a kilobyte per dataset. It's worse if you chunk the datasets (i.e., create them with the chunked storage layout via H5Pset_chunk, or h5pset_chunk_f from Fortran), since each chunk also carries index overhead. If your average dataset size is, say, 20x that header overhead (i.e., >= 10 KB), then I think the file size difference will NOT be significant.
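For reference, here is a sketch of the same dataset created once with the default contiguous layout and once chunked. The file name, dataset names, data sizes, and chunk dimensions are illustrative only, not a recommendation:

```fortran
! Sketch: the same data stored contiguously vs. chunked.
! Chunk dimensions are illustrative; each chunk adds index
! overhead, so chunks should not be tiny relative to the data.
program chunking_demo
  use hdf5
  implicit none

  integer(hid_t)   :: file_id, space_id, dset_id, dcpl_id
  integer(hsize_t) :: dims(2), chunk_dims(2)
  integer          :: ierr
  real(kind=8)     :: data(500, 200)

  data = 0.0d0
  dims = [ int(500, hsize_t), int(200, hsize_t) ]

  call h5open_f(ierr)
  call h5fcreate_f('layouts.h5', H5F_ACC_TRUNC_F, file_id, ierr)
  call h5screate_simple_f(2, dims, space_id, ierr)

  ! Contiguous (the default layout): lowest metadata overhead.
  call h5dcreate_f(file_id, 'contiguous', H5T_NATIVE_DOUBLE, &
                   space_id, dset_id, ierr)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, data, dims, ierr)
  call h5dclose_f(dset_id, ierr)

  ! Chunked: required for compression or extensible datasets,
  ! but the chunk index adds overhead on top of the dataset header.
  call h5pcreate_f(H5P_DATASET_CREATE_F, dcpl_id, ierr)
  chunk_dims = [ int(500, hsize_t), int(50, hsize_t) ]
  call h5pset_chunk_f(dcpl_id, 2, chunk_dims, ierr)
  call h5dcreate_f(file_id, 'chunked', H5T_NATIVE_DOUBLE, &
                   space_id, dset_id, ierr, dcpl_id)
  call h5dwrite_f(dset_id, H5T_NATIVE_DOUBLE, data, dims, ierr)
  call h5dclose_f(dset_id, ierr)

  call h5pclose_f(dcpl_id, ierr)
  call h5sclose_f(space_id, ierr)
  call h5fclose_f(file_id, ierr)
  call h5close_f(ierr)
end program chunking_demo
```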
Hope that helps.