Reading (part of) a dataset from a large HDF5 file on a remote server


#1

Dear all,
I need to read part of a large HDF5 file stored on a remote server.
I am looking for a solution that does not require downloading the file
entirely, and I am not allowed to run code on the server, so I cannot,
for instance, pipe the data I need from the server side.

More details about the use case:

  • the remote file is accessible either through a Windows network drive
    or as a FUSE file system with SSH underneath (but I am open to
    suggestions if a better option exists)
  • I need either an entire dataset or part of one; in any case less than
    10% of the entire file
  • ideally, I would work in Python

Thank you for any suggestion!

Marco


#2

Marco,

HDF5 can access files on a Windows network drive, and one can use sub-setting to read partial data. Sub-setting is available in h5py. Have you used HDF5 before? If not, the book “Python and HDF5” is a good place to start.
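For example, here is a minimal sketch of sub-setting with h5py. The file path and dataset name are made up, and the demo writes a small local file as a stand-in for your remote one; the point is that the same slicing syntax, applied to a file opened from the network drive, reads only the selected part rather than the whole dataset:

```python
# Demonstrates h5py sub-setting: only the selected hyperslab is read
# from storage. On a network drive or FUSE mount, the same call fetches
# just the chunks covering the selection.
import os
import tempfile

import h5py
import numpy as np

# Stand-in for the remote file; in your case this would be a path on
# the mounted drive (names here are made up for the example).
path = os.path.join(tempfile.mkdtemp(), "demo.h5")

with h5py.File(path, "w") as f:
    f.create_dataset(
        "measurements",
        data=np.arange(10_000).reshape(100, 100),
        chunks=(10, 100),  # chunking controls the granularity of I/O
    )

with h5py.File(path, "r") as f:
    # Reads only rows 20..29; the rest of the dataset stays on disk.
    part = f["measurements"][20:30, :]
    print(part.shape)  # (10, 100)
```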

Could you please try and report if you encounter any problems?

Word of caution: the SWMR feature will not work with files created on network drives.

Thank you!

Elena


#3

Dear Elena,

thank you for the reply. I have worked with HDF5 in the past, but memory and
bandwidth were never a concern. I think “sub-setting” is the keyword here;
indeed, my first tests show that it works out of the box.

One more point on this: I have the choice between storing the various data in
different datasets or as members of a single user-defined type. My understanding
is that this choice will not affect the performance of reading small parts of
the data, since in either case I am only fetching the chunks of the data I need:
is this correct? (Let us assume that I will need different slices of these data
at different times, so there is no layout that is optimal in terms of grouping
the data together.)
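For concreteness, here is a sketch of the two layouts I mean (all names are made up), reading the same slice from each:

```python
# The two layouts under consideration: separate datasets vs. a single
# dataset with a compound (user-defined) type. Names are illustrative.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "layouts.h5")

with h5py.File(path, "w") as f:
    # Option A: one dataset per quantity.
    f.create_dataset("temperature", data=np.zeros(1000), chunks=(100,))
    f.create_dataset("pressure", data=np.ones(1000), chunks=(100,))

    # Option B: one dataset whose elements are a compound type.
    comp = np.dtype([("temperature", "f8"), ("pressure", "f8")])
    rec = np.zeros(1000, dtype=comp)
    rec["pressure"] = 1.0
    f.create_dataset("record", data=rec, chunks=(100,))

with h5py.File(path, "r") as f:
    # Option A: slice a plain dataset directly.
    a = f["temperature"][200:300]
    # Option B: slice the compound dataset, then pick one member
    # from the resulting NumPy structured array.
    b = f["record"][200:300]["pressure"]
    print(a.shape, b.shape)  # (100,) (100,)
```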

Thank you,
Marco