Advice on storing data with efficient access

I have a project coming up where we are going to store multiple modalities of data acquired from Scanning Electron Microscopes (EBSD, EDS and ISE**).

For each modality the data is acquired on a grid. The ISE data is just a gray-scale image, so it is easy to store and think about: if the acquisition is 2048 x 2048, you have a gray-scale image of that same size (unsigned char). This is where things start getting interesting. For the EBSD data there are several signals collected at *each* pixel. Some of the signals are simple scalar values (floats), and we have been storing those in a 2D array just like the ISE image. But one of the signals is itself a 2D image (60x80 pixels). So, for example, the EBSD sampling grid dimensions are 100 x 75, and at each grid point there is a 60x80 array of data.

The EDS data is much the same, except that we have a 2048-element 1D array at each pixel, and the dimensions of the EDS sampling grid are 512 x 384.
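To make the layouts concrete, here is a rough numpy sketch of the shapes involved (the variable names, the dtypes, and the row/column ordering are just guesses, not part of the real acquisition format):

    import numpy as np

    # ISE: a single gray-scale image over the field of view
    ise = np.zeros((2048, 2048), dtype=np.uint8)

    # EBSD: each scalar signal is one value per grid point...
    ebsd_scalar = np.zeros((100, 75), dtype=np.float32)            # e.g. one scalar signal
    # ...and the pattern signal is a full image at every grid point
    ebsd_patterns = np.zeros((100, 75, 80, 60), dtype=np.float32)  # dtype is a guess

    # EDS: a 2048-bin spectrum per grid point (~1.6 GB as float32, so this really
    # wants to live on disk rather than in memory)
    eds_spectra = np.zeros((512, 384, 2048), dtype=np.float32)     # dtype is a guess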

I am trying to figure out a balance between efficient storage and easy access. One thought was to store each grid point as its own "group", but that would mean hundreds of thousands of groups, and I don't think HDF5 is going to react well to that. The other end of the spectrum would be to continue to think of each modality as an "image" and store all the data under a group such as "EDS" as one large multi-dimensional array. So, for the EBSD acquisition above, I would have a 4D array (100x75x80x60). What attributes should I store with the data set so that later, when we are reading through the data, we can efficiently grab hyperslabs without having to read the entire data set into memory?
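For what it's worth, here is a rough h5py sketch of the "one big array per modality" idea, chunked so that one grid point's pattern is a single chunk. The file, dataset, and attribute names, the chunk shape, and the dtype are all placeholders:

    import h5py

    with h5py.File("slice_000.h5", "w") as f:           # hypothetical file name
        ebsd = f.create_group("EBSD")
        # 4D array: (scan dim 0, scan dim 1, pattern rows, pattern cols)
        patterns = ebsd.create_dataset(
            "Patterns",
            shape=(100, 75, 80, 60),
            dtype="float32",                             # detector dtype is a guess
            chunks=(1, 1, 80, 60),                       # one chunk per grid point
            compression="gzip",
        )
        # attributes that make later hyperslab reads self-describing
        patterns.attrs["GridDimensions"] = (100, 75)
        patterns.attrs["PatternDimensions"] = (80, 60)
        patterns.attrs["DimensionLabels"] = "scan_y,scan_x,pattern_y,pattern_x"

    # later: grab the pattern at grid point (10, 20) without reading the whole array
    with h5py.File("slice_000.h5", "r") as f:
        one_pattern = f["EBSD/Patterns"][10, 20, :, :]   # hyperslab read

The chunk shape is probably the main knob: per-point chunks keep single-pattern reads cheap, while coarser chunks (one grid row per chunk, say) tend to be better for whole-map reads. HDF5 dimension scales are another way to label the axes if plain attributes feel too ad hoc.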

I hope all of that was clear enough to elicit some advice on storage. Just for clarification, the data set sizes above are for our "experimental" data sets, where we are just trying to figure this out. The real data sets will likely be multiple gigabytes per "slice" of data, and we may have 250 slices.

** EBSD - Electron Backscatter Diffraction
   EDS  - Energy Dispersive Spectra
   ISE  - Ion Induced Secondary Electron Image

Thanks for any help or advice.


___________________________________________________________
Mike Jackson Principal Software Engineer
BlueQuartz Software Dayton, Ohio
mike.jackson@bluequartz.net www.bluequartz.net

I am sure many people with more experience will have better advice (which I would also like to learn from), but for our data I decided to store everything as matrix-like as possible and then use some form of indexing to access it; PyTables has indexing functionality I plan to rely on. It also depends on the application you will use to process the data (Python is the primary one in my case). There are also proprietary bitmap-indexing schemes to make data access faster, or you may end up building your own index.
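In case a concrete starting point helps, here is a minimal PyTables sketch of the "as matrix-like as possible" idea, using a chunked array so slices can be read and written without loading everything (the names, shapes, and compression settings are just illustrative):

    import numpy as np
    import tables as tb

    with tb.open_file("eds.h5", "w") as f:
        # one big chunked array: a 2048-bin spectrum at every grid point
        spectra = f.create_carray(
            f.root, "eds_spectra",
            atom=tb.Float32Atom(),
            shape=(512, 384, 2048),
            filters=tb.Filters(complevel=5, complib="blosc"),
        )
        # writes go straight to disk, one hyperslab at a time
        spectra[0, 0, :] = np.arange(2048, dtype=np.float32)

    with tb.open_file("eds.h5", "r") as f:
        # read just the spectra for one grid column
        column = f.root.eds_spectra[100, :, :]

Note that PyTables' column indexing (Table.cols.<name>.create_index()) applies to Table objects rather than plain arrays, so for searching on scalar signals a Table with one row per grid point may be the better fit.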


dashesy


On Tue, Nov 27, 2012 at 5:33 PM, Michael Jackson <mike.jackson@bluequartz.net> wrote:

Hi Michael,


  Actually, groups with hundreds of thousands of links should be fine.

  However, I would lean toward keeping the image structure and either using an array datatype (80x60, in the case you gave) or a compound datatype for the "pixels". Another useful option is to create a group for each "image" and then store a separate dataset for each field in the array.
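As a rough sketch of the compound-datatype route in h5py (the field names below are invented and the pattern dtype is a guess):

    import numpy as np
    import h5py

    # one record per grid point: a couple of scalar signals plus the 80x60 pattern
    ebsd_point = np.dtype([
        ("confidence_index", "f4"),
        ("image_quality",    "f4"),
        ("pattern",          "f4", (80, 60)),
    ])

    with h5py.File("ebsd_compound.h5", "w") as f:
        scan = f.create_dataset("EBSD/Scan", shape=(100, 75),
                                dtype=ebsd_point, chunks=(1, 75))

    with h5py.File("ebsd_compound.h5", "r") as f:
        # pull one grid row and look at a single field
        row = f["EBSD/Scan"][10, :]
        patterns = row["pattern"]          # shape (75, 80, 60)

The trade-off versus separate datasets is that reading any field tends to touch whole records on disk, so if the scalar signals are often read without the patterns, separate datasets (or the group-per-image layout) keep those reads smaller.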

  Quincey


Thanks for all the feedback. I'm going to start experimenting with some variations over the next few weeks to see which ones are efficient from an IO standpoint and reasonably "easy" to get at from a coding perspective.
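One simple way to compare the variations is to time the same hyperslab read against files written with different chunk shapes, something like the following (the file and dataset names are placeholders for whatever the experiments produce):

    import time
    import h5py
    import numpy as np

    def time_read(path, dataset, region):
        """Open the file, read one hyperslab, and return the elapsed seconds."""
        start = time.perf_counter()
        with h5py.File(path, "r") as f:
            _ = f[dataset][region]
        return time.perf_counter() - start

    # e.g. per-grid-point chunks vs. per-grid-row chunks for the EBSD patterns
    for path in ("ebsd_chunk_point.h5", "ebsd_chunk_row.h5"):   # hypothetical files
        elapsed = time_read(path, "EBSD/Patterns", np.s_[10, 20, :, :])
        print(f"{path}: {elapsed * 1000:.2f} ms")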

Thanks


___________________________________________________________
Mike Jackson Principal Software Engineer
BlueQuartz Software Dayton, Ohio
mike.jackson@bluequartz.net www.bluequartz.net
