Create large file using Java

Hello,

I am trying to create a large file (~155 MB) using the HDF Java API.
The idea is to create a 3D integer dataset from a set of 2D images. The total dataset size will be 512x512x304 (16 bits per value).

The original data is a set of 2D DICOM images. Each DICOM image is a 512x512 (16-bit) slice.

My code is:

//Create the 512x512x304 image
H5File testFile = ...
//Obtain the list of DICOM files
File[] list = ...
//Copy the data to the HDF image
for (int i = 0; i < list.length; i++) {
    short[] array = readDicomFile(list[i]);
    long[] start = dataset.getStartDims();
    long[] stride = dataset.getStride();
    long[] sizes = dataset.getSelectedDims();
    // select the subset: starting at (0, 0, i)
    start[0] = 0;
    start[1] = 0;
    start[2] = i;
    // select the subset: subset size (512, 512, 1)
    sizes[0] = 512;
    sizes[1] = 512;
    sizes[2] = 1;
    // select the subset: set stride to (1, 1, 1)
    stride[0] = 1;
    stride[1] = 1;
    stride[2] = 1;
    Object data = dataset.read();
    short[] buffer = (short[]) data;
    for (int j = 0; j < buffer.length; j++) {
        buffer[j] = array[j];
    }
    dataset.write(buffer);
}
//Close the file
testFile.close();

The problem is that this code is VERY slow. Is there a faster way to
create the HDF file?

Thank you all,

Ramon Moreno

···


Hi Ramon,

From what I know, when a dataset is created it is filled with a default value of
all zeroes. This requires one extra pass through the dataset.

In HDF4 there was a call, "SDsetfillmode", with which the fill mode could be set
to no-fill, saving one pass through the file and therefore some time. However,
I don't know what the equivalent call in HDF5 is, and I am also looking for it.
If you find out anything about it, please share it with me; I will do the same.

Thanks and Regards,
Nikhil

···


Hi Nikhil,

On Jun 19, 2008, at 4:27 PM, Nikhil Laghave wrote: [...]

  The equivalent call for avoiding writing fill values in HDF5 is H5Pset_fill_time(dcpl_id, H5D_FILL_TIME_NEVER).

  Quincey

P.S. - It's not obvious to me that this is what is slowing down Ramon's code though...
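
On the Java side this has to go through the low-level JNI wrapper at dataset
creation time; the object-level H5File API does not appear to expose the fill
time. A minimal sketch, assuming the ncsa.hdf.hdf5lib wrapper of that era
(identifier types and the exact H5Dcreate signature vary between wrapper
versions, and the file and dataset names here are just placeholders):

import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;

public class NoFillCreate {
    public static void main(String[] args) throws Exception {
        int fid = H5.H5Fcreate("volume.h5", HDF5Constants.H5F_ACC_TRUNC,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);

        // Dataset creation property list that never writes fill values.
        int dcpl = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
        H5.H5Pset_fill_time(dcpl, HDF5Constants.H5D_FILL_TIME_NEVER);

        // 512x512x304 dataspace of 16-bit integers.
        long[] dims = { 512, 512, 304 };
        int sid = H5.H5Screate_simple(3, dims, null);

        // Pass dcpl when creating the dataset (pre-1.8 wrapper signature
        // assumed: loc, name, type, space, creation property list).
        int did = H5.H5Dcreate(fid, "/volume",
                HDF5Constants.H5T_NATIVE_SHORT, sid, dcpl);

        // ... write the slices here ...

        H5.H5Dclose(did);
        H5.H5Sclose(sid);
        H5.H5Pclose(dcpl);
        H5.H5Fclose(fid);
    }
}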

···


Nikhil,

I measured the time spent in the conversion:

...
long tWrite = 0;
long t1 = System.currentTimeMillis();
for (int i = 0; i < list.length; i++) {
    ...
    long aux1 = System.currentTimeMillis();
    dataset.write(buffer);
    long aux2 = System.currentTimeMillis();
    tWrite += (aux2 - aux1);
}
long t2 = System.currentTimeMillis();
long tTotal = (t2 - t1);
...

And the results obtained were:

Total time (tTotal) = 2481 s = 41 min.
Time writing (tWrite) = 2380 s = 39 min (96%!!)

So the problem is the call to "write(buffer)", which takes too long. I agree with
Koziol that it is not the fill time that makes the code slow.

Thanks,
Ramon


···


One thing going on here is that you're repeatedly calling .write on the
dataset, once per item in the list.

It looks like your algorithm is:

For each image:
  1. read the image into "array"
  2. read the various metadata of the dataset
  3. set the values of the start, sizes, and stride arrays
  4. read the dataset into "data"
  5. cast the data just read from the file to a short[] ("buffer")
  6. copy the data from "array" to "buffer" (which at that point contains
the data from the file), one element at a time
  7. write the buffer to disk

Although HDF5 will do some caching (up to maybe 1 MB), this is very
inefficient. Steps 2 and 3 can be done once, before the loop. Between
iterations of the loop the reference to the data in "buffer" is lost, so
there will be a lot of garbage collection for this 155 MB piece of data;
the same goes for "data". A normal JVM only makes a heap of about 60 MB, so
this 300 MB of space will be GC'd over the iterations of the loop. Also,
you are writing the data in "buffer" to the dataset and then just reading
it right back in the next iteration!

Instead, create an in-memory buffer, read the data from all of the images
into the buffer, and then write the whole buffer at once. If further
optimization is necessary, you could read some subset of the images at a
time into a smaller buffer (e.g., 512x512x50) and then use hyperslabs to
write each portion to the specific part of the dataset where the
data belongs.
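
A minimal restructuring of the original loop along those lines might look
like the sketch below (reusing dataset, list, and readDicomFile from the
first post): the selection arrays are fetched once, and the dataset.read()
round-trip disappears entirely.

// Fetch the selection arrays once; they are live references, so
// updating them updates the dataset's selection.
long[] start = dataset.getStartDims();
long[] stride = dataset.getStride();
long[] sizes = dataset.getSelectedDims();
start[0] = 0;
start[1] = 0;
sizes[0] = 512;
sizes[1] = 512;
sizes[2] = 1;
stride[0] = 1;
stride[1] = 1;
stride[2] = 1;

for (int i = 0; i < list.length; i++) {
    start[2] = i;                           // move the slab to slice i
    short[] slice = readDicomFile(list[i]); // 512x512 of 16-bit samples
    dataset.write(slice);                   // write directly; no read-back
}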
  
Dave McCloskey

···


Dave,

McCloskey, David L. wrote:

One thing going on here is that you're repeatedly calling .write on the
dataset, once per item in the list. [...] Also, you are writing the data in
"buffer" to the dataset and then just reading it right back in the next
iteration!

The code is, in fact, inefficient. That's why I measured the time spent in
each piece of the code and emphasized that the call to "write(buffer)" alone
took 96% of the time.

The problem is that I don't want to put all the data in memory before
writing it to the file. Some DICOM exams can have 1 GB of data or more.

Instead, create an in-memory buffer, read the data from all of the images
into the buffer, and then write the whole buffer at once. [...]

I suppose this is a solution. I will try it and see how it works.
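
For the record, a rough sketch of that batched variant, under the same
assumptions as the earlier sketch (the slab depth of 50 is arbitrary, and
the last batch is trimmed to the slices that remain):

final int DEPTH = 50; // slices per batch; tune to the available heap

long[] start = dataset.getStartDims();
long[] stride = dataset.getStride();
long[] sizes = dataset.getSelectedDims();
start[0] = 0;
start[1] = 0;
sizes[0] = 512;
sizes[1] = 512;
stride[0] = 1;
stride[1] = 1;
stride[2] = 1;

for (int z = 0; z < list.length; z += DEPTH) {
    int n = Math.min(DEPTH, list.length - z);
    short[] buffer = new short[512 * 512 * n];
    for (int k = 0; k < n; k++) {
        short[] slice = readDicomFile(list[z + k]);
        // With dims (512, 512, 304) the slice index is the fastest-varying
        // dimension, so the n slices have to be interleaved in memory.
        for (int p = 0; p < 512 * 512; p++) {
            buffer[p * n + k] = slice[p];
        }
    }
    start[2] = z;   // slab starts at slice z
    sizes[2] = n;   // and is n slices deep
    dataset.write(buffer);
}

Were the dataset laid out as 304x512x512 instead (slice index first), each
batch would be a plain System.arraycopy and each written slab would be
contiguous on disk.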

  
  

Thanks,
Ramon
