H5 read function timing discrepancy

From: Gopal, Anupam (GE Infra, Energy) [mailto:anupam.gopal@ge.com]
Sent: Monday, June 02, 2008 18:09

Hi,

I have observed an anomaly in the HDF5 read functionality for which I do not have any explanation. I was hoping someone here might have an answer or might have run into this before.

Here it goes:

I have two datasets that are identical in chunking, compression, allocation time, rank, and data type. The only difference is in the dimension sizes: one is 130*5*12404 and the other is 3151*5*5162. I read the same hyperslab (same offset and number of elements) from each, but one read takes 450 milliseconds on average and the other takes 6000 milliseconds. Any idea why?

Please let me know if this makes sense to anyone!

Regards,

Anupam Gopal

Energy Application & Systems Engineering

GE Energy

T 518-385-4586

F 518-385-5703

E anupam.gopal@ge.com

http://www.gepsec.com

General Electric International, Inc.
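
For reference, a read of the kind described here looks roughly like this in the HDF5 C API. The file and dataset names, the float type, and the 1x1x5162 Z-line selection (described later in the thread) are illustrative assumptions; the original code is not shown.

```c
#include "hdf5.h"
#include <stdio.h>
#include <time.h>

int main(void)
{
    /* Illustrative names/dims; the thread never shows the actual code. */
    hid_t file   = H5Fopen("data.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset   = H5Dopen2(file, "/dataset_b", H5P_DEFAULT);  /* 3151 x 5 x 5162 */
    hid_t fspace = H5Dget_space(dset);

    /* One line in the Z direction: a 1 x 1 x 5162 hyperslab. */
    hsize_t start[3] = {0, 0, 0};
    hsize_t count[3] = {1, 1, 5162};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t mspace = H5Screate_simple(3, count, NULL);
    static float buf[5162];                     /* assuming 4-byte floats */

    clock_t t0 = clock();
    H5Dread(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, buf);
    clock_t t1 = clock();
    printf("read took %.0f ms\n", 1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC);

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
    return 0;
}
```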

From: Ray Burkholder [mailto:ray@oneunified.net]
Sent: Monday, June 02, 2008 5:17 PM

Have you tried taking compression off and comparing that way? One set appears to be 10x bigger than the other, which is approximately your time difference. Perhaps decompression, depending upon how you've chunked and distributed values, may be the culprit.
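
If it helps, a sketch of what such an A/B test could look like at creation time. The dimensions, the chunk-per-plane shape, and the deflate level are assumptions at this point in the thread:

```c
#include "hdf5.h"

/* Create the same chunked dataset with or without the deflate filter so
 * the two read times can be compared.  Dims and chunk shape are assumed. */
hid_t make_dset(hid_t file, const char *name, int compress)
{
    hsize_t dims[3]  = {3151, 5, 5162};
    hsize_t chunk[3] = {3151, 5, 1};        /* one XY plane per chunk */
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    if (compress)
        H5Pset_deflate(dcpl, 6);            /* gzip level 6; pick your own */
    hid_t dset = H5Dcreate2(file, name, H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```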


From: Gopal, Anupam (GE Infra, Energy) [mailto:anupam.gopal@ge.com]
Sent: Monday, June 02, 2008 18:30

Thanks for your reply. To answer your question: no, I haven't. I did not see any point in doing that, since both datasets are generated by the same code and hence have identical properties. If compression or chunking makes one of them slower to read, it should have the same effect on the other. The reason one is 10 times bigger than the other is that it holds 10 times more data; it has nothing to do with compression. Any other ideas?


From: Ray Burkholder [mailto:ray@oneunified.net]
Sent: Monday, June 02, 2008 5:45 PM

I don't know if it has anything to do with it, but it depends on how things are stored and which points your hyperslab retrieves. If your hyperslab retrieves points scattered across a number of different compressed chunks, each of those chunks has to be decompressed before the data can be accessed.

Hence, one way to confirm whether it is a compression issue or something else is to remove compression, if you can.

If you are running this on Windows with Visual Studio, you could use the profiling utility to see which routine most of the processing time is spent in. That may help track down the culprit.

Unix/Linux have profilers of one fashion or another as well. And which version of HDF5 are you using?
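
On Linux, for instance, a first pass with gprof might look like this (file names are illustrative):

```
cc -O2 -pg read_test.c -o read_test -lhdf5 -lz
./read_test                          # writes gmon.out
gprof ./read_test gmon.out | head -n 20
```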


Do both datasets have the same chunk size?
If they are the same, I find the timing difference strange.
If they are different, that might explain it, because the second dataset might need to read more chunks to get the hyperslab (though a factor of 13 is a lot).
It is hard to say more because you do not say how the data are chunked or which hyperslab you are reading.

Ger
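
For reference, the chunk shape can be read back from an existing dataset's creation property list; a small sketch (rank assumed at most 3, error checking omitted):

```c
#include "hdf5.h"
#include <stdio.h>

/* Print the chunk dimensions of an open dataset. */
void print_chunk_shape(hid_t dset)
{
    hid_t dcpl = H5Dget_create_plist(dset);
    if (H5Pget_layout(dcpl) == H5D_CHUNKED) {
        hsize_t chunk[3];
        int rank = H5Pget_chunk(dcpl, 3, chunk);
        for (int i = 0; i < rank; i++)
            printf("%s%llu", i ? " x " : "chunk: ", (unsigned long long)chunk[i]);
        printf("\n");
    }
    H5Pclose(dcpl);
}
```

`h5dump -p -H file.h5` prints the same storage-layout information without writing any code.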

"Gopal, Anupam (GE Infra, Energy)" <anupam.gopal@ge.com> 06/02/08 11:29 PM >>>

Thanks for your reply. To answer ur question, No I havent, I did not see
any point in doing that, as both are generated using the same code. and
hence have identical properties. If compression or chunking makes one of
them slower to read, then it should have the same effect on the other
one. The reason one is 10 times bigger than the other is because it has
10 times more data than the other. It has nothing to do with
compression. any other ideas ??
   
Anupam Gopal

Energy Application & Systems Engineering

GE Energy

T518-385-4586

F 518-385-5703

E anupam.gopal@ge.com

http://www.gepsec.com <http://www.gepsec.com/>

General Electric International, Inc.

···

________________________________

From: Ray Burkholder [mailto:ray@oneunified.net]
Sent: Monday, June 02, 2008 5:17 PM
To: hdf-forum@hdfgroup.org
Subject: RE: [hdf-forum] H5 read function timing discrepancy

Have you tried taking compression off and comparing that way? One set
appears to be 10x bigger than the other, which is approx. your time
difference. Perhaps decompression, depending upon how you've chunked
and distributed values, may be the culprit.
________________________________

From: Gopal, Anupam (GE Infra, Energy) [mailto:anupam.gopal@ge.com]
Sent: Monday, June 02, 2008 18:09
To: hdf-forum@hdfgroup.org
Subject: [hdf-forum] H5 read function timing discrepancy

  Hi,
   
  I have observed some anomalies in the HDF5 read functionality
for which I do not have any explanation, I was hoping if anyone of have
an answer to my question or have experienced it before.
   
  Here it goes:
   
  I have two datasets. They are identical ( chunking, compression,
allocation time, dimension, data type). The only difference is in the
dimension size one of them is (130*5*12404) and the other one is
(3151*5*5162). I read the same hyperslab (same offset and number of
elements). But one takes on an average 450 mili seconds and the other
one 6000 mili sec. Any idea why ??
   
  Please let me know if it makes sense to anyone !!
   
  Regards,
   
  Anupam Gopal

  Energy Application & Systems Engineering

  GE Energy

  T518-385-4586

  F 518-385-5703

  E anupam.gopal@ge.com

  http://www.gepsec.com <http://www.gepsec.com/>

  General Electric International, Inc.

  --
  Scanned for viruses & dangerous content at One Unified
<http://www.oneunified.net> and is believed to be clean.

--
Scanned for viruses & dangerous content at One Unified
<http://www.oneunified.net> and is believed to be clean.

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

From: Gopal, Anupam (GE Infra, Energy) [mailto:anupam.gopal@ge.com]
Sent: Tuesday, June 03, 2008 10:08

Yes, that's a good point; let me try that. But just so you know: I write data to the XY plane, and each plane is one chunk. So for the (3151*5*5162) dataset I write 5162 times, each time adding a chunk of 3151*5. That is how the dataset is written. When retrieving, I read a single line in the Z direction, i.e. an array of size (1*5162). I assume this means I am accessing data scattered across 5162 chunks, right? But the real question is that I do the same for both datasets, so if there is a delay in reading, it should show up in both cases.
Yesterday I also plotted the initial size of each dataset against the time it takes to retrieve the data, and it appears that as the initial size of a dataset increases, the time taken to retrieve the same amount of data increases proportionally.
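
As a sketch, the write pattern described above would look roughly like this (names illustrative, 4-byte float data assumed):

```c
#include "hdf5.h"

/* Write one XY plane at index z, as described above.
 * 'plane' holds 3151*5 values; names and float type are assumptions. */
void write_plane(hid_t dset, hsize_t z, const float *plane)
{
    hsize_t start[3] = {0, 0, z};
    hsize_t count[3] = {3151, 5, 1};        /* exactly one chunk */
    hid_t fspace = H5Dget_space(dset);
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
    hid_t mspace = H5Screate_simple(3, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, plane);
    H5Sclose(mspace);
    H5Sclose(fspace);
}
```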



From: Quincey Koziol [mailto:koziol@hdfgroup.org]
Sent: Tuesday, June 03, 2008 9:21 AM

Hi Anupam,

I agree with an earlier comment in this thread: it's probably related to your chunk sizes. What are the chunk dimensions for each of these datasets? Since HDF5 [generally] accesses entire chunks at a time (i.e. bringing each chunk that contains elements to access into memory and extracting the necessary elements from it), if the chunks are different sizes, then the I/O times will be different.

Quincey


From: Ray Burkholder [mailto:ray@oneunified.net]
Sent: Tuesday, June 03, 2008

Does it make sense that if you are chunking at 3151*5, then accessing a single line across 5162 chunks means you have to access and decompress each and every one of the 5162 chunks (which amounts to decompressing the whole file anyway)? That gives you the proportionality you found.

One solution is to reduce the size of the chunks so that there are multiple chunks within each 3151*5 plane. Ensure the line you are accessing is not scattered across those multiple chunks but sits in one smaller chunk. Your access times should then go down.
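
To put rough numbers on that proportion, assuming 4-byte values: the small dataset is 130*5*12404 ≈ 8.1 million elements, about 32 MB; the large one is 3151*5*5162 ≈ 81.3 million elements, about 325 MB. If a Z-line read has to decompress every plane chunk, each read effectively decompresses the whole dataset, so read time should scale with total dataset size, and the roughly 10x size ratio lines up with the observed 450 ms vs. 6000 ms.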


From: Gopal, Anupam (GE Infra, Energy) [mailto:anupam.gopal@ge.com]
Sent: Tuesday, June 03, 2008 10:19

OK, got the point; it makes a lot of sense to me now. Our proprietary software that generates this data has certain limitations, because of which we add one chunk at a time, each chunk being a plane (as the data gets generated). For post-processing, we need to retrieve data in the Z direction, which turns out not to be aligned with the chunked planes and rips right through them.
Given these limitations, is it possible to reorganize the dataset once it is created? I mean, once the dataset is created (chunked as planes), can we decompress it and then chunk it differently, so that we can query it faster? Or is there a better way to do it? Thanks again for your help.



From: Quincey Koziol [mailto:koziol@hdfgroup.org]
Sent: Tuesday, June 03, 2008 9:46 PM

Hi Anupam,

The h5repack tool that ships with the HDF5 distribution will re-chunk datasets in a file.

Quincey
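
For example, something along these lines would rewrite a dataset with Z-friendly chunking. The dataset path, chunk shape, and compression level are placeholders; check `h5repack -h` for the exact option syntax of your version:

```
h5repack -l /dset:CHUNK=1x1x5162 -f GZIP=6 input.h5 output.h5
```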


Hi Anupam,

h5repack will make a copy of your file. If you rechunk from [3151,5,1] to [1,1,5162], I'm afraid it will take quite a long time to make the copy; in effect it has to re-sort the data. You have 'only' about 300 MB of data, so it can probably do it in memory (assuming you use 4-byte floats). For larger arrays it might be undoable.

I think a better approach is to choose a chunk size that suits both writing in X and reading in Z. That is basically the idea of chunking: semi-optimal access in all directions. A chunk size of, say, [23,5,70] gives you [137,1,74] chunks, each about 32 KB in size.
So reading a line in Z means reading 74 chunks.

Note that when writing, it is wise to define the chunk cache size appropriately, which has to be done before opening the file. The cache needs to be able to hold 137 chunks; otherwise HDF5 has to 'page' chunks in and out of its cache. I think the default cache size is too small to hold 137 chunks.
When reading a single line in Z, there is no point in enlarging the cache. However, when reading multiple lines, it might be wise to do so as well (to hold at least 74 chunks).

Quincey, please correct me if I'm saying anything incorrect; I don't know all the HDF5 details.

Cheers,
Ger
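
To make the cache suggestion concrete, a minimal sketch with the C API. The sizes follow Ger's 137-chunk example; the slot count and byte total are assumptions, not tuned values:

```c
#include "hdf5.h"

/* Open a file with a raw-data chunk cache big enough for one plane's worth
 * of [23,5,70] chunks (137 chunks of ~32 KB each, per Ger's example). */
hid_t open_with_cache(const char *path, unsigned flags)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* 1201 hash slots (a prime comfortably above the chunk count),
     * 8 MB of cache, default 0.75 preemption policy.  The mdc_nelmts
     * argument is ignored by HDF5 1.8. */
    H5Pset_cache(fapl, 0, 1201, 8 * 1024 * 1024, 0.75);
    hid_t file = H5Fopen(path, flags, fapl);
    H5Pclose(fapl);
    return file;
}
```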
