How does chunked reading/writing interact with setting hyperslabs in datasets?

I'm implementing an HDF5 image reader for the ITK library. ITK has a set
of classes for image I/O that hide the implementation details of reading
and writing images. One of the itk::ImageIO features is streamable
reading and writing.

Streaming is meant to allow reading subsets of large images to reduce
the memory footprint. With a properly implemented ImageIO class, you can
set up a processing pipeline that reads parts of an image, processes them,
and writes the parts out, requiring memory only for the current part
of the image.

This is a big win when processing, for example, large time-series.

It looks as though to read or write subsets of an image, you specify the
desired hyperslab. But according to the documentation, the
hyperslab-based partitioning of datasets is a 'scatter/gather' process
that occurs in memory. This leads me to believe that when you create or
open a DataSet, memory for the entire dataset is allocated.

So my question is this: What is the 'HDF5 Way' to implement streaming of
smaller chunks of a dataset?

···

--
Kent Williams norman-k-williams@uiowa.edu

________________________________
Notice: This UI Health Care e-mail (including attachments) is covered by the Electronic Communications Privacy Act, 18 U.S.C. 2510-2521, is confidential and may be legally privileged. If you are not the intended recipient, you are hereby notified that any retention, dissemination, distribution, or copying of this communication is strictly prohibited. Please reply to the sender that you have received the message in error, then delete it. Thank you.
________________________________

Hi Kent,


  Hyperslabs are the correct way to select a smaller region of a dataset in the file. You can read that smaller region into an appropriately sized memory buffer, without allocating a memory buffer that is the size of the dataset in the file. Search for "H5Sselect_hyperslab" in the examples subdirectory of the HDF5 distribution and you will find many use cases to draw from.

  Quincey

···

On May 2, 2011, at 9:19 AM, Williams, Norman K wrote:


_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org

I am already using the H5Sselect_hyperslab method for both reading and
writing. Additionally, when I write out an image, I 'chunk' the output
and compress it.

The question I have is what I can do to minimize the memory footprint.
Based on my reading of the documentation, the hyperslab interface actually
scatters/gathers to/from an in-memory dataset, leading me to believe the
entire dataset will be allocated in system memory.

So the question is how would I use HDF5 in such a way as to minimize the
memory footprint in this context?

···

On 5/3/11 7:30 AM, "Quincey Koziol" <koziol@hdfgroup.org> wrote:


Hi Kent,


  You are allowed to create a memory dataspace that is different from the dataset's dataspace in the file. That will allow you to tune the memory footprint.

  Quincey

···

On May 3, 2011, at 10:16 AM, Williams, Norman K wrote:


OK.

I'm not entirely clear how to coordinate different memory and disk
footprints, particularly when using the C++ interface.

The steps I know about:

1. Open the file.
2. Open the DataSet.
3. Obtain the DataSpace from the Dataset.
4. Select a hyperslab to read data into a voxel buffer.

I select a hyperslab from the dataset to read based on the image region
specified by the user. It's a little more complicated than that in ITK,
but that summarizes the process.

At what stage in the process do I specify a different in-memory dataspace?
And how do I specify its position in the larger on-disk dataspace?

If there's example code in the manual, just point me there.

···

On 5/3/11 10:56 AM, "Quincey Koziol" <koziol@hdfgroup.org> wrote:


Hi Norman,

From: hdf-forum-bounces@hdfgroup.org [mailto:hdf-forum-bounces@hdfgroup.org] On Behalf Of Williams, Norman K
Sent: Tuesday, May 03, 2011 12:21 PM
To: HDF Users Discussion List
Subject: Re: [Hdf-forum] How does chunked reading/writing interact with setting hyperslabs in datasets?


The C++ examples can be found at
http://www.hdfgroup.org/HDF5/doc/cpplus_RM/examples.html, but you're
probably already aware of that, so my next suggestion is to look at the C
examples for an appropriate pattern.

···


Hi Kent,


At what stage in the process do I specify a different in-memory dataspace?

  The voxel buffer is the buffer that you want to describe with the in-memory dataspace. So you'd define a new dataspace (with H5Screate_simple) that specifies the correct dimensions for your buffer.

And how do I specify its position in the larger on-disk dataspace?

  You specify a selection in the on-disk dataspace (probably with H5Sselect_all, H5Sselect_hyperslab, H5Sselect_elements, etc.) and a selection in the in-memory dataspace (in the same way), and when you call H5Dread, the elements are transferred from the on-disk dataset into the memory buffer.

If there's example code in the manual, just point me there.

  I think Binh-Minh already pointed you to the C++ examples, and you can search for H5Sselect_hyperslab in the 'examples' subdirectory of the HDF5 distribution.

  Quincey

···

On May 3, 2011, at 11:20 AM, Williams, Norman K wrote:
