Random access large dataset

Hi all,

I am having trouble sampling random elements from a large HDF5 array of size 6401100*100.

Right now I am writing C code to do this, hoping it will be fast. I first compute the random indices and store them in an array, then read all of the elements in a single H5Dread call. I am also running this in parallel with H5FD_MPIO_COLLECTIVE, but I suspect that will help little, since my problem is essentially bounded by the single-read time.
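Roughly, the read currently looks like this (a simplified sketch; the dataset name "data", the function name, and the lack of error handling are just placeholders for illustration):

```c
#include <hdf5.h>

/* Simplified sketch of the current approach: read npts scattered
 * elements from a 4D double dataset with a single H5Dread call.
 * "data" and the function name are placeholders.                  */
int read_points(const char *fname, const hsize_t *coords /* npts*4 values */,
                size_t npts, double *out)
{
    hid_t file   = H5Fopen(fname, H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "data", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    /* Select the npts points (each coordinate is 4 hsize_t values). */
    H5Sselect_elements(fspace, H5S_SELECT_SET, npts, coords);

    /* Memory space: a flat 1D buffer of npts doubles. */
    hsize_t mdims[1] = { npts };
    hid_t mspace = H5Screate_simple(1, mdims, NULL);

    herr_t status = H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace,
                            H5P_DEFAULT, out);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return (status < 0) ? -1 : 0;
}
```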

The run is very slow. I am thinking about the following ways to speed it up:

  • Sort the indices before accessing the data.
    • However, each index entry is 4D and all entries are stored in a flattened 1D array, which makes the sorting not straightforward.
    • Can I create a reshaped view of the dataset as a 1D array without actually touching the source file? Then I would only need to access the flattened 1D data through a sorted 1D index array. In Python's h5py a simple reshape call would work, but a quick search did not show me how to do this in C.
  • Use a smaller cache, which I naively imagine might reduce the time per read, since I really don't need caching here.

Any suggestion is appreciated.

Best,
Liang

Hmm. You don't mean one H5Dread call for each element you want to read, do you? Why not construct a single data selection? If a single selection is in fact what you meant, then my apologies for misunderstanding. In theory, the HDF5 library does a lot to optimize this kind of I/O operation via data sieving. There are other properties that control HDF5's behavior in this regard as well, including H5Pset_sieve_buf_size and maybe H5Pset_cache.
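For example, a rough sketch of setting those on the file access property list (the sizes below are only guesses to experiment with, not recommendations):

```c
#include <hdf5.h>

/* Open a file with a larger data-sieve buffer and raw-data chunk cache.
 * The sizes are illustrative; tune them for your access pattern.        */
hid_t open_with_tuned_buffers(const char *fname)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

    /* Data-sieve buffer used for partial I/O on contiguous datasets
     * (default is 64 KiB); bumped to 4 MiB here as an experiment.   */
    H5Pset_sieve_buf_size(fapl, 4 * 1024 * 1024);

    /* Raw-data chunk cache: 521 slots, 16 MiB, default preemption policy.
     * (This only matters for chunked datasets.)                          */
    H5Pset_cache(fapl, 0, 521, 16 * 1024 * 1024, 0.75);

    hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    return file;
}
```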

Hi Miller,

Thank you for the reply.

I meant that I store all of the indices in one array so that I call H5Dread only once to read all of the elements.

It is still slow.

I am not familiar with the concept of data sieving, but I do think the buffer and cache settings you suggested would help. It is not clear, though, what sizes I should set (my data is 8-byte doubles). I will need to try and see. Should I set both the cache and the sieve buffer to 8 bytes, for example?

At the same time, the strategy I am currently trying is this:

  • Since my data is stored sequentially (no chunking), I can compute npts * 4 indices (my data is 4D and I want to sample npts points) and then sort them according to their global indices in the flattened 1D order. I have done this.

  • Then I call H5Dread to read the data using the sorted npts * 4 indices. I hope this minimizes cache misses.

  • After reading, I need to restore the retrieved data to the order it had before sorting (a rough sketch of the whole procedure is below).
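Roughly, this sort / read / un-sort step looks like the following (a simplified sketch; the dataset handle, dimensions, and variable names are placeholders):

```c
#include <hdf5.h>
#include <stdlib.h>

/* Pair each point with its original position so the result can be
 * scattered back into the caller's order after the sorted read.   */
typedef struct {
    hsize_t coord[4];   /* 4D index of the point               */
    hsize_t flat;       /* global index in flattened 1D order  */
    size_t  orig_pos;   /* position in the caller's ordering   */
} point_t;

static int cmp_flat(const void *a, const void *b)
{
    const point_t *pa = a, *pb = b;
    return (pa->flat > pb->flat) - (pa->flat < pb->flat);
}

/* dims[4] are the dataset dimensions; coords holds npts*4 indices. */
void read_sorted(hid_t dset, const hsize_t dims[4],
                 const hsize_t *coords, size_t npts, double *out)
{
    point_t *pts = malloc(npts * sizeof *pts);
    for (size_t i = 0; i < npts; i++) {
        for (int d = 0; d < 4; d++)
            pts[i].coord[d] = coords[4 * i + d];
        /* Row-major flattening, matching HDF5's contiguous layout. */
        pts[i].flat = ((pts[i].coord[0] * dims[1] + pts[i].coord[1]) * dims[2]
                       + pts[i].coord[2]) * dims[3] + pts[i].coord[3];
        pts[i].orig_pos = i;
    }
    qsort(pts, npts, sizeof *pts, cmp_flat);

    /* Build the point selection in storage order and read once. */
    hsize_t *sorted = malloc(npts * 4 * sizeof *sorted);
    for (size_t i = 0; i < npts; i++)
        for (int d = 0; d < 4; d++)
            sorted[4 * i + d] = pts[i].coord[d];

    hid_t fspace = H5Dget_space(dset);
    H5Sselect_elements(fspace, H5S_SELECT_SET, npts, sorted);
    hsize_t mdims[1] = { npts };
    hid_t mspace = H5Screate_simple(1, mdims, NULL);

    double *tmp = malloc(npts * sizeof *tmp);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, tmp);

    /* Undo the sort: scatter values back to their original positions. */
    for (size_t i = 0; i < npts; i++)
        out[pts[i].orig_pos] = tmp[i];

    H5Sclose(mspace);
    H5Sclose(fspace);
    free(tmp); free(sorted); free(pts);
}
```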

Your suggestions are very helpful; I was overwhelmed by the number of functions HDF5 provides for so many different things. I will try the buffer and cache settings soon.

Best,

Liang

Well, unfortunately, that may be your problem. HDF5's partial I/O operations are kinda sorta predicated on chunked storage. I think partial I/O will still "work" on CONTIGUOUS storage, but I don't think it will or can be fast. If you are using CONTIGUOUS storage for the dataset layout, then doing partial I/O on it (e.g. reading only some of it) probably winds up being very slow. I would try chunked storage (H5Pset_chunk) and then play with or design a chunk size appropriate for your application. I think it is only supported in parallel in versions >= 1.10, and it may hit scaling issues above maybe 10,000 MPI ranks.
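For example, a minimal sketch of creating a chunked dataset (the chunk dimensions and dataset name below are placeholders; you would want to design the chunks around your read pattern):

```c
#include <hdf5.h>

/* Create a chunked 4D double dataset.  The chunk dimensions are only
 * placeholders; choose them so each random read touches few chunks.  */
hid_t create_chunked_dataset(hid_t file, const hsize_t dims[4])
{
    hsize_t chunk[4] = { 1, 1, 100, 100 };   /* example only */

    hid_t space = H5Screate_simple(4, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 4, chunk);

    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```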

Thank you for the explanation. I will try chunked storage for this and keep you and other users posted on what I learn.

(Initially, I was too lazy to convert the data to chunked storage because the source data files I received are generated by external codes that I have no control over. I have been trying to convince my colleagues to use chunked storage for large data files.)

Have a look at h5repack's -l (layout) option. You don't need to write any code to convert to chunked storage; h5repack can do it for you :wink:
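For example, something like this (the chunk dimensions are only illustrative, and the file names are placeholders):

```
h5repack -l CHUNK=1x1x100x100 input.h5 output_chunked.h5
```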

Hi Miller,

My test speed is acceptable now with chunked storage. I did not use h5repack in the end, though, since it was very slow; I ended up writing Python code to fill the data chunk by chunk, which seems fast enough for me (~10 min for 30 GB of data on an old Cray cluster with Lustre storage). Thank you for the suggestions.

On the other hand, at least in the cases I tested, sorting the indices according to storage order somehow did not give me an obvious speedup in sampling, which is a little surprising. This is perhaps case-dependent and complicated by things like buffering that I am not very familiar with. Anyway, right now I am happy with the results.

Liang

Cool! Glad you got it improved.