H5Dread and array organization

Hello

I have a chunked dataset of size 20 20 10.

The chunk size is 10 10 1.

Let's say I read a hyperslab of data defined by:
start={0 0 0}
count={4 4 4}

I read it into a 1-D array, and I get:
X Y Z
0 0 0
0 0 1
0 0 2
0 0 3
0 0 4
0 1 0
0 1 1
0 1 2
0 1 3
0 1 4
0 2 0
0 2 1
...
0 4 3
0 4 4
1 0 0
1 0 1
...

If I understand correctly, a chunked dataset is read chunk by chunk, so I cannot see how I can get this order unless the data is completely reordered. A single chunk cannot contain two different Z values.

So:
Is this normal? Does HDF5 reorder the data (wasting time and resources)? Is there any way to control this order (not row-major but, say, Z-major)?

Thanks for helping.

Mathieu

Mathieu, you should bear in mind that reading a dataset is logically a
mapping between dataspaces. The underlying physical layout in the file is
irrelevant for this mapping. Users may not appreciate getting different
answers when reading what is nominally the same dataset with different physical
layouts.
Of course, not all layouts may give you the same performance.
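
To make the "mapping between dataspaces" point concrete, here is a minimal toy model in plain Python. It is not HDF5 code and not how the library is implemented: `chunk_store` and `read_hyperslab` are hypothetical helpers, and each element simply stores its own (x, y, z) coordinate. The sketch reads the same hyperslab from two different chunk layouts and shows that the flat result is identical and row-major (last index fastest) in both cases.

```python
from itertools import product

def chunk_store(shape, chunk):
    """Toy 'file': a dict of chunks keyed by chunk coordinates.
    Each element stores its own (x, y, z) so the read order is visible."""
    store = {}
    for coord in product(*(range(n) for n in shape)):
        key = tuple(c // k for c, k in zip(coord, chunk))
        store.setdefault(key, {})[coord] = coord
    return store

def read_hyperslab(store, chunk, start, count):
    """Return the selection as a flat list in row-major order,
    regardless of which chunk each element lives in."""
    out = []
    for offset in product(*(range(n) for n in count)):
        coord = tuple(s + o for s, o in zip(start, offset))
        key = tuple(c // k for c, k in zip(coord, chunk))
        out.append(store[key][coord])
    return out

shape = (20, 20, 10)
start, count = (0, 0, 0), (4, 4, 4)

# Layout A: 10 x 10 x 1 chunks, as in the question.
a = read_hyperslab(chunk_store(shape, (10, 10, 1)), (10, 10, 1), start, count)
# Layout B: one chunk covering the whole dataset.
b = read_hyperslab(chunk_store(shape, (20, 20, 10)), (20, 20, 10), start, count)

print(a[:5])   # [(0, 0, 0), (0, 0, 1), (0, 0, 2), (0, 0, 3), (0, 1, 0)]
print(a == b)  # True
```

With count={4 4 4} the toy read returns 64 elements, and the order depends only on the selection, never on the chunk shape.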

> If I understand well, a chunked dataset is read chunk by chunk

That's a misunderstanding. Sometimes that's the case, but not always. Have a
look at

http://www.hdfgroup.org/HDF5/doc/Advanced/DataFlow_H5Dread/DataFlow_H5Dread.pdf

Best, G.

-----Original Message-----
From: hdf-forum-bounces@hdfgroup.org [mailto:hdf-forum-bounces@hdfgroup.org]
On Behalf Of mathieu.westphal@obs.ujf-grenoble.fr
Sent: Tuesday, May 15, 2012 5:23 AM
To: hdf-forum@hdfgroup.org
Subject: [Hdf-forum] H5Dread and array organization

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

Hello Mathieu,

AFAIK HDF5 will always reorder the data. I think this is what most
users want as they don't care/know how the data are chunked. Note that
the main purpose of chunking is to have reasonably fast access
independent of the order the array's data are traversed. The chunk shape
should be such that it serves the expected access patterns reasonably
well.

Using chunking for a sparse data array is very valid.
However, if you don't want the data to be reordered, you have to access
them chunk-wise. Thus define your hyperslab such that it matches the
chunk boundaries.
Getting data in Z-major order can also be achieved by defining your
hyperslabs correctly.
Note that HDF5 has a lot of overhead when using many calls with small
hyperslabs.
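
As a rough illustration of getting Z-major order by choosing the hyperslabs, the sketch below issues one read per Z plane and appends the results. It is plain Python, not HDF5: `read_hyperslab` is a hypothetical stand-in for a single read call that returns the selected coordinates in row-major order, and `read_z_major` is an assumed helper name.

```python
from itertools import product

def read_hyperslab(start, count):
    """Stand-in for one read call: returns the selected coordinates
    in row-major order (last index fastest)."""
    return [tuple(s + o for s, o in zip(start, off))
            for off in product(*(range(n) for n in count))]

def read_z_major(start, count):
    """One read per Z plane; every selection is one element thick in Z,
    so with 10 x 10 x 1 chunks each call touches a single chunk."""
    out = []
    for z in range(start[2], start[2] + count[2]):
        out.extend(read_hyperslab((start[0], start[1], z),
                                  (count[0], count[1], 1)))
    return out

buf = read_z_major((0, 0, 0), (4, 4, 4))
print(buf[:5])  # [(0, 0, 0), (0, 1, 0), (0, 2, 0), (0, 3, 0), (1, 0, 0)]
```

Z now varies slowest, at the price of count[2] separate calls instead of one, which is where the per-call overhead mentioned above comes in.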

Cheers,
Ger

Hi Gerd,

Thanks for the link to the document describing how chunked data are
read. It gives some good insights, but leaves me with a few questions.

1. I cannot imagine that the first step is reading a chunk from disk.
Doesn't it look in the chunk cache first? If not, what is the purpose of
the chunk cache?

2. I would like to know in more detail what reading the chunk means. I
assume it is doing a B-tree lookup to find out where the chunk is
located. What is involved in that step?

3. The diagram does not tell me why reading many small hyperslabs is so
much slower than reading a large hyperslab. Can it be that the B-tree
lookup is done over and over again, even if the chunk is in the cache?

Cheers,
Ger

Hi Ger!

Hi Gerd,

Thanks for the link to the document describing how chunked data are read. It gives some good insights, but leaves me with a few questions.

1. I cannot imagine that the first step is reading a chunk from disk. Doesn't it look in the chunk cache first? If not, what is the purpose of the chunk cache?

  Yes, it does look in the chunk cache first.

2. I would like to know in more detail what reading the chunk means. I assume it is doing a B-tree lookup to find out where the chunk is located. What is involved in that step?

  Yes, if the chunk isn't in the cache, the library does an index lookup on the coordinates of the chunk, finding the address of the chunk in the file. (I say "index lookup" because, although we currently use a B-tree, we are moving to more types of indices in the next major release (1.10.0), which will give a constant-time lookup in many cases.) Once the address of the chunk is found, the chunk is (usually) brought into the cache and I/O is performed on it.

3. The diagram does not tell me why reading many small hyperslabs is so much slower than reading a large hyperslab. Can it be that the B-tree lookup is done over and over again, even if the chunk is in the cache?

  No, it's just slower due to some inefficiencies in the hyperslabbing code. I made some progress speeding this up after we talked 2 years ago, but didn't have the time to finish the job. It's probably only a few weeks of work to knock out the remaining slowness...
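
  The cache-first, index-second sequence described in answers 1 and 2 can be sketched as follows. This is a toy model in plain Python under stated assumptions, not the library's code: `ChunkReader`, `chunk_of` and the file addresses are hypothetical, a dict stands in for the chunk index (the B-tree), and there is no cache eviction.

```python
CHUNK = (10, 10, 1)  # chunk shape from the original question

def chunk_of(coord, chunk=CHUNK):
    """Chunk coordinates of the chunk that holds an element."""
    return tuple(c // k for c, k in zip(coord, chunk))

class ChunkReader:
    def __init__(self, index):
        self.index = index        # chunk coords -> file address (made up)
        self.cache = {}           # chunk coords -> chunk contents
        self.index_lookups = 0

    def get_chunk(self, key):
        # 1. Look in the chunk cache first.
        if key in self.cache:
            return self.cache[key]
        # 2. Cache miss: index lookup on the chunk coordinates.
        self.index_lookups += 1
        address = self.index[key]
        # 3. Bring the chunk into the cache (stand-in for the real file read).
        data = f"chunk at {address}"
        self.cache[key] = data
        return data

# A 20 x 20 x 10 dataset with 10 x 10 x 1 chunks has a 2 x 2 x 10 chunk grid.
index = {(i, j, k): 1000 + 100 * ((i * 2 + j) * 10 + k)
         for i in range(2) for j in range(2) for k in range(10)}

reader = ChunkReader(index)
for coord in [(0, 0, 0), (3, 3, 0), (0, 0, 1), (3, 3, 0)]:
    reader.get_chunk(chunk_of(coord))

print(reader.index_lookups)  # 2
```

  Four element accesses hit only two distinct chunks, so the index is consulted twice and the other two accesses are served from the cache.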

    Quincey

I can easily read the data chunk by chunk if necessary, but I'm afraid it will seriously degrade performance.

I was looking for a way to describe a dataspace that lets me control how the data is ordered, in a single read.

Anyway, thank you for this information.

Mathieu

Hi Mathieu,

  We considered having support for alternate "coordinate permutations" in the file and I believe I still have some notes about how this should be done. But there didn't seem to be much demand for it (yet :-). If you'd like to work on this, we could talk about funding the effort, or working with you to get a code contribution into shape.
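
  Until something like that exists, one possible workaround is to do a single row-major read and permute the buffer in memory afterwards. The sketch below is plain Python and only illustrates the index arithmetic; it is not the "coordinate permutations" feature described above, and `permute_flat` is a hypothetical helper name.

```python
from itertools import product

def permute_flat(buf, dims, order):
    """Reorder a flat row-major buffer with extents `dims` so that
    axis order[0] varies slowest and axis order[-1] varies fastest."""
    # Row-major strides of the original buffer.
    strides = [1] * len(dims)
    for i in range(len(dims) - 2, -1, -1):
        strides[i] = strides[i + 1] * dims[i + 1]
    out = []
    for idx in product(*(range(dims[a]) for a in order)):
        out.append(buf[sum(i * strides[a] for i, a in zip(idx, order))])
    return out

dims = (4, 4, 4)
# Stand-in for the result of one row-major read: each element is its (x, y, z).
buf = [(x, y, z) for x in range(4) for y in range(4) for z in range(4)]

z_major = permute_flat(buf, dims, order=(2, 1, 0))
print(z_major[:5])  # [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (0, 1, 0)]
```

  This keeps the file access to a single read and moves the reordering cost into memory; whether that is faster than several plane-wise reads would have to be measured.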

  Quincey
