Efficient reading of HDF5 files

Hi all,

I'm working on a project to read data from multiple HDF5 files for analysis. Each file contains 5 floating point datasets (each 2000x2200 in size), and there are between 100 and 120 files to read.
At the moment my code reads all the data from all the files into memory at once, which is nice and simple, but because of memory constraints I end up using a lot of virtual memory, which is rather slow.

I tried reading a hyperslab of each dataset (corresponding to 2000 elements) from each file, but that turned out to be even slower than reading all the data at once.

So, do you have any suggestions as to the best way to read this data? Aside from getting more memory for the computer!

All the best,
Simon.

Hi all,

I'm working on a project to read data from multiple HDF5 files for analysis. Each file contains 5 floating point datasets (each 2000x2200 in size), and there are between 100 and 120 files to read.
At the moment my code reads all the data from all the files into memory at once, which is nice and simple, but because of memory constraints I end up using a lot of virtual memory, which is rather slow.

So that's roughly 34 MB each (2000 x 2200 x 8 bytes, assuming double
precision) for 500-600 arrays, or well over 16 GB. At one time I was
running a calculation on O(50) such arrays stored in hdf4 on a
system with 0.5 GB of RAM (i.e., much smaller than the data). We had
a version of the hdf4 library that used memory mapping, and each
calculation only needed a small part of each array, so by raising
the limit on the maximum virtual memory that could be allocated, the
calculation ran with very modest physical I/O and modest run times.

Are you seeing a lot of disk activity after the data have been loaded
into memory? That would indicate
excessive swapping. Low CPU usage (CPU is waiting on I/O) is another
indicator. There are usually some OS-specific tools to gather
statistics on vm usage and swapping. Are the data on a local disk or
a network server?

I tried reading a hyperslab of each dataset (corresponding to 2000 elements) from each file, but that turned out to be even slower than reading all the data at once.

So, do you have any suggestions as to the best way to read this data? Aside from getting more memory for the computer!

If your data are larger than real memory you need to arrange things so
memory accesses don't jump around too much.
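To make that concrete, here is a toy C fragment (nothing to do with
your actual code, just the general point): with a row-major array,
the order of the two loops decides whether memory is touched
sequentially or in long strides, and the strided version is what
makes a larger-than-RAM working set thrash.

#include <stddef.h>

#define NROWS 2000
#define NCOLS 2200

/* Sequential walk: each page of the array is used completely
   before the next one has to be faulted in. */
double sum_sequential(const double a[NROWS][NCOLS])
{
    double s = 0.0;
    for (size_t i = 0; i < NROWS; i++)
        for (size_t j = 0; j < NCOLS; j++)
            s += a[i][j];
    return s;
}

/* Strided walk: jumps NCOLS*sizeof(double) bytes per access, so
   once the data exceed physical memory almost every access can
   land on a page that has been swapped out. */
double sum_strided(const double a[NROWS][NCOLS])
{
    double s = 0.0;
    for (size_t j = 0; j < NCOLS; j++)
        for (size_t i = 0; i < NROWS; i++)
            s += a[i][j];
    return s;
}

The same idea applies one level up: visit files, datasets and
hyperslabs in an order that matches how the bytes are laid out on
disk.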

You need to tell us more about how the data are used. One common
example is a calculation that is repeated for each (i,j) coordinate
across all 100+ files, so there is no need to store complete arrays,
but you do want the corresponding parts of all the arrays in memory
at the same time. Another is a calculation that uses data from one
array at a time, so you never need to hold more than one array in
memory.
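For the first pattern, something along these lines might work: read
only a band of rows from each dataset instead of the whole thing.
This is an untested sketch against the HDF5 C API; the dataset name
"data0", the band size, and the use of H5T_NATIVE_DOUBLE (swap in
H5T_NATIVE_FLOAT for 4-byte data) are placeholders for whatever your
files actually contain.

#include "hdf5.h"

/* Read rows [row0, row0+nrows) of a 2000x2200 dataset named "data0"
   into buf, which must hold nrows*2200 doubles.
   Returns 0 on success, -1 on failure. */
int read_band(const char *fname, hsize_t row0, hsize_t nrows, double *buf)
{
    hsize_t start[2] = { row0, 0 };
    hsize_t count[2] = { nrows, 2200 };
    herr_t  status;

    hid_t file = H5Fopen(fname, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (file < 0) return -1;

    hid_t dset   = H5Dopen2(file, "data0", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    /* Select the band of rows in the file ... */
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    /* ... and describe the matching shape in memory. */
    hid_t mspace = H5Screate_simple(2, count, NULL);

    status = H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace,
                     H5P_DEFAULT, buf);

    H5Sclose(mspace);
    H5Sclose(fspace);
    H5Dclose(dset);
    H5Fclose(file);
    return (status < 0) ? -1 : 0;
}

Looping that over the 5 datasets in each of the 100+ files, and only
advancing the band once every file has been visited, keeps the
resident set at nrows x 2200 doubles per dataset rather than the
full arrays. How fast it is depends on how the datasets were written
(contiguous vs. chunked, and the chunk shape); very small selections
mean many small reads, which may be why the 2000-element hyperslabs
you tried were slower than reading everything at once.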

···

On Thu, Jan 27, 2011 at 11:47 AM, Simon R. Proud <Srp@geo.ku.dk> wrote:

--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia