Hi all,
I have a problem sampling random elements from a large HDF5 array of size 6401100*100.
Right now I am writing C code to do this, hoping it will be fast. I first compute random indices, store them in an array, and then read all the elements in one H5Dread call (roughly as in the sketch at the end of this post). I am also running this in parallel with H5FD_MPIO_COLLECTIVE, but I suspect that will help little, since my problem is essentially bounded by the single-read time.
The run is very slow. I am thinking about the following ways to speed it up:
- Sort the indices before accessing the data.
- However, each index entry is 4D and all entries are stored in a flattened 1D array, which makes the sorting not straightforward.
- Can I create a reshaped copy of the dataset as a 1D array without actually touching the source file? Then I would only need to access the data in this flattened 1D view using a sorted 1D index array. In Python's h5py a simple reshape call would work, but a quick search did not show me how to do it in C.
- Use a smaller cache, which I naively imagine might reduce the time per read, since I really don't need caching.
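For reference, here is roughly what my current read looks like (a simplified, untested sketch; the file and dataset names are placeholders, and error checking is omitted):

#include "hdf5.h"

#define RANK 4

/* Select npts random points in the 4D dataset and read them with a single
 * H5Dread call. coords holds npts * RANK indices, flattened row by row.
 * "data.h5" and "/dset" are placeholders for my real file and dataset. */
void read_random_points(hsize_t npts, const hsize_t *coords, double *out)
{
    hid_t file   = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "/dset", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    /* point selection over all sampled elements at once */
    H5Sselect_elements(fspace, H5S_SELECT_SET, npts, coords);

    /* memory dataspace: a flat 1D buffer of npts doubles */
    hid_t mspace = H5Screate_simple(1, &npts, NULL);

    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, out);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
}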
Any suggestion is appreciated.
Best,
Liang
Hmm. You don’t mean one H5Dread call for each element you want to read, do you? Why not construct a single selection? If that is what you meant, then my apologies for misunderstanding. In theory, the HDF5 library does a lot to optimize this kind of I/O operation via data sieving. There are other properties that control HDF5's behavior in this regard as well, including H5Pset_sieve_buf_size and maybe H5Pset_cache.
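For example, something along these lines (an untested sketch; the buffer sizes are just illustrative, not recommendations):

#include "hdf5.h"

/* Enlarge the sieve buffer and raw-data chunk cache on a file access
 * property list before opening the file. Sizes here are made up. */
hid_t open_with_tuned_fapl(const char *path)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

    /* data sieve buffer, used to batch small reads on contiguous data */
    H5Pset_sieve_buf_size(fapl, 4 * 1024 * 1024);        /* 4 MiB */

    /* raw-data chunk cache: mdc_nelmts (ignored), nslots, nbytes, w0 */
    H5Pset_cache(fapl, 0, 521, 16 * 1024 * 1024, 0.75);  /* 16 MiB */

    hid_t file = H5Fopen(path, H5F_ACC_RDONLY, fapl);
    H5Pclose(fapl);
    return file;
}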
Hi Miller,
Thank you for your reply.
I meant to say that I store all the indices in one array, so that I only call H5Dread once to read all the elements.
It is still slow.
I am not familiar with the concept of data sieving, but I do think the buffer and cache settings you suggested would help. It is not clear what sizes I should set, though (my data is 8-byte double floats). I will need to try and see. Might I need to set both the cache and the sieve buffer to 8 bytes, for example?
At the same time, the strategy I am currently trying is the following (see the sketch after this list):
- Since my data is stored contiguously (no chunking), I can compute npts * 4 indices (my data is 4D and I want to sample npts points) and then sort them according to their global indices in the flattened 1D view. I have done this.
- Then I call H5Dread to read the data using the sorted npts * 4 indices. I hope this will minimize the cache misses.
- After reading, I need to restore the retrieved data to its order before the sorting.
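Here is a rough sketch of the sort-and-restore bookkeeping I mean (simplified; the struct and function names are mine, and the actual H5Dread call is elided):

#include <stdlib.h>

typedef struct {
    unsigned long long global; /* flattened 1D offset in storage order */
    size_t             orig;   /* position in the original sample order */
} idx_t;

static int cmp_idx(const void *a, const void *b)
{
    unsigned long long ga = ((const idx_t *)a)->global;
    unsigned long long gb = ((const idx_t *)b)->global;
    return (ga > gb) - (ga < gb);
}

/* ind holds npts 4D index tuples; dims[4] are the dataset dimensions */
void sample_sorted(size_t npts, const unsigned long long ind[][4],
                   const unsigned long long dims[4], double *out)
{
    idx_t  *tab  = malloc(npts * sizeof *tab);
    double *vals = calloc(npts, sizeof *vals);

    /* flatten each 4D index to a global offset in row-major order */
    for (size_t i = 0; i < npts; i++) {
        tab[i].global = ((ind[i][0] * dims[1] + ind[i][1]) * dims[2]
                         + ind[i][2]) * dims[3] + ind[i][3];
        tab[i].orig = i;
    }
    qsort(tab, npts, sizeof *tab, cmp_idx);

    /* ... H5Dread with the coordinates in sorted order fills vals[] ... */

    /* undo the sort: put each value back at its original position */
    for (size_t i = 0; i < npts; i++)
        out[tab[i].orig] = vals[i];

    free(vals);
    free(tab);
}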
Your suggestions are very helpful, because I was overwhelmed by the number of functions HDF5 provides to do so many things. I will try the buffer and cache settings soon.
Best,
Liang
Well, unfortunately, I think that may be your problem. HDF5’s partial I/O operations are kinda sorta predicated on chunked storage. I think partial I/O will still “work” on CONTIGUOUS storage, but I don’t think it will or can be fast. If you are using CONTIGUOUS layout for the dataset, then doing partial I/O on it (e.g. reading only some of it) probably winds up being very slow. I would try chunked storage (H5Pset_chunk) and then play with, or design, a chunk size appropriate for your application. I think chunking is only supported in parallel in versions >= 1.10, and it may hit scaling issues above maybe 10,000 MPI ranks.
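E.g., when creating the dataset, something like this (an untested sketch; the extents and chunk dims are made up, and you'd want to tune the chunk shape to your access pattern):

#include "hdf5.h"

/* Create a 4D dataset with chunked layout in an already-open file. */
hid_t create_chunked(hid_t file)
{
    hsize_t dims[4]  = {64, 100, 100, 100};  /* example dataset extent */
    hsize_t chunk[4] = {1, 100, 100, 100};   /* one "slab" per chunk   */

    hid_t space = H5Screate_simple(4, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 4, chunk);

    hid_t dset = H5Dcreate2(file, "/dset", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}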
Thank you for the explanation. I will try chunked storage for this and keep you and other users updated on what I learn.
(Initially, I was too lazy to convert the data to chunked storage, because the source data files I received are generated by external codes that I have no control over. I have been trying to convince my colleagues to use chunked storage for large data files.)
Have a look at h5repack’s -l (layout) option. You don’t need to write any code to do the conversion to chunked storage; h5repack can do it for you.
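For example (the chunk dims here are just a placeholder; check h5repack --help for the exact syntax in your version):

h5repack -l CHUNK=1x100x100x100 input.h5 output.h5

If I remember right, omitting a dataset name applies the layout to all datasets, and you can target a single dataset with something like -l /dset:CHUNK=... instead.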
Hi Miller,
My test speed is acceptable now with chunked storage. I did not use h5repack in the end, though, since it was very slow; I ended up writing Python code to fill the data chunk by chunk, which seems fast enough for me (~10 min for 30 GB of data on an old Cray cluster with Lustre storage). Thank you for the suggestions.
On the other hand, at least in the cases I tested, sorting the indices according to storage order somehow did not give me an obvious speedup in sampling, which is a little surprising. This is perhaps case-dependent and complicated by things like buffering that I am not very familiar with. Anyway, I am happy with the results now.
Liang
Cool! Glad you got it improved.