Optimizing the Chunk Size of a Dataset

I am developing some storage for a scientific instrument and want to get an idea of how best to "optimize" the chunk size.

The basics of the setup are as follows. The instrument collects an image point by point. There are 10 different quantities at each point. Most are scalar quantities that are easy to deal with and pack into a single dataset. There is one quantity, however, that is actually another 2D image in and of itself. The size of this image can be as small as 80 x 60 and as large as 1024 x 1024. The instrument can "scan" an image of up to 2048 x 2048 points. So, to be clear, I am going to end up with a dataset that is:

2048 x 2048 x 1024 x 1024 bytes in size (worst case), which works out to 2^42 bytes, or roughly 4 TiB.

My initial thought was to simply chunk it at 1024 x 1024 (one chunk per sub-image), which makes striding through the data easy and natural for this application. Will having that many chunks in a file (over four million in the worst case) impact the I/O performance at some point? Are there any general guidelines for setting the chunk size?
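
For reference, here is a minimal sketch of the layout I have in mind, assuming a 4D dataset of unsigned bytes chunked one sub-image at a time (the file name "scan.h5" and dataset name "pattern" are just placeholders):

    /* Minimal sketch: 4D dataset of unsigned bytes, one 1024 x 1024 chunk per
       scan point.  "scan.h5" and "pattern" are placeholder names. */
    #include "hdf5.h"

    int main(void)
    {
        hsize_t dims[4]  = {2048, 2048, 1024, 1024};  /* worst-case extent        */
        hsize_t chunk[4] = {1, 1, 1024, 1024};        /* one chunk per sub-image  */

        hid_t file  = H5Fcreate("scan.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        hid_t space = H5Screate_simple(4, dims, NULL);

        hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
        H5Pset_chunk(dcpl, 4, chunk);

        hid_t dset = H5Dcreate2(file, "pattern", H5T_NATIVE_UCHAR, space,
                                H5P_DEFAULT, dcpl, H5P_DEFAULT);

        /* ... per-point writes go here (hyperslab selection + H5Dwrite) ... */

        H5Dclose(dset);
        H5Pclose(dcpl);
        H5Sclose(space);
        H5Fclose(file);
        return 0;
    }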

Thanks


___________________________________________________________
Mike Jackson, Principal Software Engineer
BlueQuartz Software, Dayton, Ohio
mike.jackson@bluequartz.net www.bluequartz.net

Hi,

IMHO, you should set your chunk size to a multiple of your filesystem
stripe size. That way, each chunk write maps onto whole stripes and
ends up as the smallest possible number of physical writes.
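
As a rough sketch of one way to act on that in HDF5, you can set an alignment on the file access property list with H5Pset_alignment, so that allocations (chunks included) start on stripe boundaries; note this is a related knob rather than the chunk dimensions themselves, and the 1 MiB stripe size below is only an assumed example:

    /* Sketch: align file allocations (chunks included) to an assumed 1 MiB
       stripe size.  Replace STRIPE with the real stripe size of the filesystem. */
    #include "hdf5.h"

    #define STRIPE ((hsize_t)1024 * 1024)   /* assumed stripe size in bytes */

    hid_t create_aligned_file(const char *name)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);

        /* Any allocation larger than STRIPE / 2 is placed on a STRIPE boundary. */
        H5Pset_alignment(fapl, STRIPE / 2, STRIPE);

        hid_t file = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
        H5Pclose(fapl);
        return file;
    }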

Cheers,

Matthieu



--
Information System Engineer, Ph.D.
Blog: http://matt.eifelle.com
LinkedIn: http://www.linkedin.com/in/matthieubrucher
Music band: http://liliejay.com/

Also, I currently have convenience functions that write each sub-image to the dataset, which involves opening the file and dataset, expanding the dataset, selecting a hyperslab, writing, and then closing everything again. Would it be faster to design the code so that all of those HDF5 handles are left open and only closed once the instrument has completed the acquisition?
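
To make the question concrete, something along these lines is what I'm considering: a sketch that opens everything once, writes one sub-image per scan point, and closes only at the end (acquire_point() and the dataset name "pattern" are made up for illustration):

    /* Sketch: open the file and dataset once, write one 1024 x 1024 sub-image
       per scan point, and close everything only after the acquisition finishes.
       acquire_point() and "pattern" are illustrative placeholders. */
    #include <stdlib.h>
    #include "hdf5.h"

    extern void acquire_point(unsigned char *buf);   /* fills one sub-image */

    void write_scan(const char *filename, hsize_t rows, hsize_t cols)
    {
        hsize_t count[4] = {1, 1, 1024, 1024};
        unsigned char *image = malloc((size_t)1024 * 1024);

        hid_t file     = H5Fopen(filename, H5F_ACC_RDWR, H5P_DEFAULT);
        hid_t dset     = H5Dopen2(file, "pattern", H5P_DEFAULT);
        hid_t memspace = H5Screate_simple(4, count, NULL);

        for (hsize_t r = 0; r < rows; r++) {
            for (hsize_t c = 0; c < cols; c++) {
                acquire_point(image);

                hsize_t start[4] = {r, c, 0, 0};
                hid_t filespace = H5Dget_space(dset);
                H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
                H5Dwrite(dset, H5T_NATIVE_UCHAR, memspace, filespace, H5P_DEFAULT, image);
                H5Sclose(filespace);
            }
        }

        H5Sclose(memspace);
        H5Dclose(dset);
        H5Fclose(file);
        free(image);
    }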

Thanks


___________________________________________________________
Mike Jackson, Principal Software Engineer
BlueQuartz Software, Dayton, Ohio
mike.jackson@bluequartz.net www.bluequartz.net
