Efficient reading of HDF5 files

Thanks for the reply!

Are you seeing a lot of disk activity after the data have been loaded
into memory? That would indicate
excessive swapping. Low CPU usage (CPU is waiting on I/O) is another
indicator. There are usually some OS-specific tools to gather
statistics on vm usage and swapping. Are the data on a local disk or
a network server?

The entire thing is being run on a cluster, so I can't check disk activity - but the data is local to the program.
However, I can see that the program is fast at loading the first 60ish files, and then slows down. As soon as that slowdown occurs I also see virtual memory usage increase, so I assume it's loading data into VM rather than physical RAM.

You need to tell us more about how the data are used. One common
example is where the calculation is repeated for each (i,j) coordinate
across all 100+ files, so there is no need to store complete arrays, but you want
parts of all arrays to be stored at the same time. Another is a
calculation that uses data from one array at a time, so there is no
need to store more than one array at a time.

Yes, I'm performing the former - processing each (i,j) element individually. It is remote sensing data, with each file being a separate observation, so what I'm doing is processing a time series on a per-pixel basis.
As you say, there's no need to store the complete arrays, but my attempts at loading only a small hyperslab (corresponding to one row of the input images) have not been successful.

Hope that makes sense, and thanks again.
Simon.

Your data arrays have shape [2000,2200] and I understand you read
[2000,1] hyperslabs.
In HDF5 the indices are in C-order, thus the first axis varies slowest.
I think it would be much better to read [1,2200] hyperslabs.
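For what it's worth, here is a minimal sketch of the difference in h5py (the file and dataset names, and the small 20x22 shape, are stand-ins for the real [2000,2200] files):

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for one observation file; the dataset name "data" is
# hypothetical and the real arrays are [2000,2200].
path = os.path.join(tempfile.mkdtemp(), "obs.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(20 * 22).reshape(20, 22))

with h5py.File(path, "r") as f:
    dset = f["data"]
    row = dset[5, :]  # [1,2200]-style slab: contiguous in C-order, one read
    col = dset[:, 5]  # [2000,1]-style slab: strided, touches every row
```

The row read maps to one contiguous byte range of the file, while the column read has to visit every row, which is where the seek overhead comes from.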

Another approach might be to store the data in a single tiled (chunked)
cube in which time is an extendible axis.
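In h5py that might look like the sketch below (the dataset name, chunk shape, and small image size are illustrative, not from the thread):

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "cube.h5")
ny, nx = 20, 22  # stand-ins for the real 2000 x 2200 image size

with h5py.File(path, "w") as f:
    # Time is the unlimited (extendible) axis.  Each chunk spans several
    # time slots for a small spatial tile, so a per-pixel time series
    # read touches only a handful of chunks.
    cube = f.create_dataset("cube", shape=(0, ny, nx),
                            maxshape=(None, ny, nx),
                            chunks=(8, 5, 11), dtype="f4")
    for t in range(10):  # append one observation per time step
        cube.resize(t + 1, axis=0)
        cube[t] = np.full((ny, nx), t, dtype="f4")

with h5py.File(path, "r") as f:
    series = f["cube"][:, 3, 4]  # whole time series for one pixel
```

The price is an up-front re-write of all the input files into the cube, which, as noted below, can still be cheaper than leapfrogging through them per pixel.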

As an aside: we have (radio-astronomical) datasets of tens of GBytes
ordered in time,baseline,freq, while a particular application needs to
access the data in baseline,time,freq order. Even though the amount of
data to read per baseline,time is only about 8 KBytes, the seek times on
the disk were killing us (even when reading multiple baselines at a
time). It turned out that first re-sorting the data was cheaper than
leapfrogging through them.
Eventually the algorithm was changed so that as many time slots as fit
in physical memory are read at once, making the data access sequential
again.
Although HDF5 was not used in this case, I think the same holds for
HDF5.
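The "read as many time slots as fit in memory" idea amounts to a simple blocking scheme; a sketch in Python (all sizes here are made up for illustration):

```python
def iter_time_blocks(n_times, bytes_per_slot, mem_budget_bytes):
    """Yield (start, stop) ranges so each block of time slots fits in memory.

    A sketch of the blocking scheme from the radio-astronomy example:
    read a block of consecutive time slots (sequential I/O), process all
    baselines within it, then move on to the next block.
    """
    slots_per_block = max(1, mem_budget_bytes // bytes_per_slot)
    for start in range(0, n_times, slots_per_block):
        yield start, min(start + slots_per_block, n_times)

# e.g. 1000 time slots, 300 baselines at ~8 KB each per slot,
# and a 256 MB memory budget (all illustrative numbers)
blocks = list(iter_time_blocks(1000, 8 * 1024 * 300, 256 * 1024 * 1024))
```

Within each block the data can then be transposed in memory to the order the algorithm wants, so the disk only ever sees sequential reads.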

Cheers,
Ger

"Simon R. Proud" <srp@geo.ku.dk> 1/27/2011 9:43 PM >>>


Ger van Diepen's suggestions make sense to me. I know that some other
sites that offer time-series views of RS data create a separate copy
of the data organized as he suggests. What I don't know is whether
it is still possible, on a modern cluster and using HDF5, to take
advantage of memory-mapped I/O for this use case. Real life is more
complicated, as we want to do this with "binned" (integerized
sinusoidal grid) data, so we don't have regular arrays.



--
George N. White III <aa056@chebucto.ns.ca>
Head of St. Margarets Bay, Nova Scotia

We used mmap-ed I/O in that application, but that didn't help: the
physical disk seeks still have to be done.
I must say that we did this on a single RAID array. On a large disk
subsystem the data might be spread over many more disks, and
leapfrogging through the data may be less painful.

Cheers,
Ger

"George N. White III" <gnwiii@gmail.com> 1/28/2011 2:47 PM >>>


_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Hi again,
Thanks to Ger and George for their replies. It turns out that my memory problems were caused by not closing a dataspace: I was creating a new dataspace for each hyperslab but never closing the old one, hence the massive memory use!
I fixed that, but the program was still slow, so I took the advice and switched to reading [1,2200] hyperslabs, which helped significantly. Playing about with the chunk sizes also helped, so now I have a nice, fast program for loading all the data.
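In C-API terms the leak was a missing H5Sclose per iteration; in h5py the per-row loop can be written so that nothing accumulates, since the slicing call creates and releases its dataspaces internally and the `with` block closes the file. A sketch (file and dataset names are hypothetical):

```python
import os
import tempfile

import h5py
import numpy as np

# Small stand-in for one observation file ("data" is a hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "obs.h5")
with h5py.File(path, "w") as f:
    f.create_dataset("data", data=np.arange(20 * 22).reshape(20, 22))

totals = []
with h5py.File(path, "r") as f:
    dset = f["data"]
    for i in range(dset.shape[0]):
        row = dset[i, :]  # one [1, ncols] hyperslab per iteration;
        totals.append(int(row.sum()))  # no dataspace handles are leaked
```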

Thanks again, you've been a great help - and saved me from a lot of problems in getting this working nicely.
Simon.

"Ger van Diepen" <diepen@astron.nl> 1/28/2011 2:55 pm >>>
