Writing a large Compound dataset - slow [Java API]


I tried writing a 2D compound dataset with dimensions [10000 x 150].

        int DIM_X = 10000;
        int DIM_Y = 150;

        int MAX_DIMX = -1;
        int MAX_DIMY = -1;

        int CHUNK_X = 1000;
        int CHUNK_Y = 10;

        long[] dims = {DIM_X, DIM_Y};
        long[] maxdims = {MAX_DIMX, MAX_DIMY}; // -1 = unlimited (extendible) dimensions
        long[] chunks = {CHUNK_X, CHUNK_Y};    // not used as of now
        int gzip = 9;

        String[] Column1 = new String[DIM_X * DIM_Y];
        String[] Column2 = new String[DIM_X * DIM_Y];
        String[] Column3 = new String[DIM_X * DIM_Y];

        /* Column1, Column2, Column3 are String arrays of size DIM_X * DIM_Y */

        Vector data = new Vector();

        data.add(0, Column1);
        data.add(1, Column2);
        data.add(2, Column3);

        long b = System.currentTimeMillis();
        System.out.println("<<< Creating HDF5 File >>>");

        Dataset d = file.createCompoundDS(FILENAME, null, dims, maxdims,
                null, gzip, memberNames, memberDatatypes, memberSizes, data);

        System.out.println("Time taken for Writing all cells to H5 file >> "
                + (System.currentTimeMillis() - b));

Writing the above H5 file (10000 x 150) takes around 38393 ms (SLOW), and the
file size is around 8 MB (too big). If DIM_X is 10000 and DIM_Y is 10, writing
the file takes only 2543 ms (quite quick) and the file size is around 600 KB.

Is there a better way to reduce the time taken to write a huge compound
dataset? Also, how can I reduce the file size? Would chunking help in any way?
Please throw some light on this.

Thanks in advance,

kalpa

Kalpa,

If I read your example correctly, your ‘SLOW’ example is 15 times bigger than your ‘quite quick’ example. So, assuming everything scales linearly:

2.5 seconds * 15 = 37.5 seconds… which is pretty close to the 38.4 seconds you measured and called ‘SLOW’.

Also,
600 KB * 15 ≈ 8.8 MB… so 8 MB is doing pretty well.

What were you expecting?

Scott


Kalpa,
Furthermore, you can dial down your gzip compression level from 9 to 4 or 5 to gain some speed. Beyond level 4 or 5 you don't gain much more compression, but you do lose a lot of time.
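
For example, here is a minimal sketch of the adjusted call, reusing the dims, maxdims, chunks and member variables from your code and the same createCompoundDS overload; the gzip level of 4 is just an illustrative starting point to experiment with:

        int gzipLevel = 4;  // lower deflate level: faster writes, similar size

        // pass the chunks array ({CHUNK_X, CHUNK_Y}) explicitly instead of null
        Dataset d = file.createCompoundDS(FILENAME, null, dims, maxdims,
                chunks, gzipLevel, memberNames, memberDatatypes, memberSizes, data);
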
-Corey


--
Corey Bettenhausen
Science Systems and Applications, Inc
NASA Goddard Space Flight Center
301 614 5383
corey.bettenhausen@ssaihq.com