Great to have some more feedback!
Did the attached spreadsheet make its way to the forum? Or was it
filtered out?
On Monday 12 April 2010 13:58:24, Stamminger, Johannes wrote:
> I did some detailed performance tests this morning (I'll try to attach the
> spreadsheet - I don't know whether this forum allows attachments).
>
> For my test data (~1,100,000 variable-length strings, totalling 488 MB in
> size) I found
>
> i) it *very* surprising to see the variance in the performance results: on
> my (otherwise idle) development machine the best and worst measured runs
> always differed by 25-40%! No explanation for this has come to my mind so
> far ...
> This means that differences below 5% may be caused by randomness alone (I
> ran each configuration 6-11 times - but given the variance this does not
> seem enough to me).
That could be a consequence of the disk cache subsystem of the OS you are
working on. If you want better reproducibility in your results, try flushing
the OS cache (sync on UNIX-like OSes) before taking time measurements. Of
course, you may be interested in measuring not your disk I/O but only the
throughput of the disk cache subsystem, but that is always tricky to do.
Maybe. Though I have never noticed such variance before (and never thought
of any explicit sync'ing) ...
Btw: I'm running a 64-bit Linux (latest Ubuntu) with a RAID0 filesystem,
but I use the 32-bit version of the HDF library.
And additionally, please note that I run the tests from a Java unit test!
>
> ii) one cannot really speak of compression: the difference from level -1/0
> to level 9 is just 1.39% in the resulting HDF file's size (~970 MB)
As far as I know, compression of variable-length types is not supported by
HDF5 yet. By forcing the use of a compression filter there, you are only
compressing the *pointers* to your variable-length values, not the values
themselves.
I had already read something like this.
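Just for reference, this is roughly the kind of setup being discussed (an
untested C sketch; file name, dataset name and chunk size are made up). The
deflate filter attached via the last argument of H5PTcreate_fl only ever sees
the small per-record descriptors, not the string bytes themselves:

/* Untested sketch: a packet table of variable-length strings with gzip
 * level 9.  The filter only compresses the small per-record descriptors;
 * the string characters themselves go to the global heap uncompressed,
 * which matches the ~1% "compression" reported above. */
#include "hdf5.h"
#include "hdf5_hl.h"

int main(void)
{
    hid_t fid  = H5Fcreate("strings.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t strt = H5Tcopy(H5T_C_S1);
    H5Tset_size(strt, H5T_VARIABLE);          /* variable-length string type */
    hid_t tbl  = H5PTcreate_fl(fid, "strings", strt, /*chunk*/ 4096, /*gzip*/ 9);

    const char *recs[2] = { "first string", "a somewhat longer second string" };
    H5PTappend(tbl, 2, recs);                 /* in memory: one char* per record */

    H5PTclose(tbl);
    H5Tclose(strt);
    H5Fclose(fid);
    return 0;
}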
> iii) "compression" level 5 seems best choice taking into account
> additionally performance (4% overhead compared to 10% using level 9).
> But regard i) reading this - maybe only a random ...
>
> iv) my strings are much shorter than yours. With mine I observe that it
> is best to write ~350 of them per block with a chunk size of 16K. The
> number of strings written per block makes the biggest difference: 1 =>
> 425s, 10 => 55s, 100 => 21s, 350 => 20s, 600 => 22s, 1000 => 23s.
>
> v) when always writing 100 strings per block, the chunk size makes a
> difference of at most 10% (tested from 128 bytes up to 64K). But with a
> chunk size of 128K, performance degraded by a factor of 10, to 682s for a
> single run.
Don't know about this one, but this dramatic loss in performance when going
from a 64 KB to a 128 KB chunk size is certainly strange. It would be nice if
you could build a small benchmark showing this performance problem and send
it to the HDF Group for further analysis.
I could extract this test with little effort. But it would be Java then,
wrapping the native shared libraries. And it is *not* hdf-java, as that does
not support H5PT - I use JNA for that purpose instead.
Still interested?
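Otherwise, a standalone C benchmark might look roughly like the sketch below
(untested; it simplifies to fixed-size 450-byte records instead of
variable-length strings, all sizes are placeholders, and it assumes that
H5PTcreate_fl counts its chunk_size in records rather than bytes):

/* Untested sketch of a standalone benchmark for the 64 KB vs 128 KB chunk
 * effect.  It uses fixed-size records (not variable-length strings) to keep
 * the code short; RECSIZE, BLOCK and NRECORDS are placeholders. */
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include "hdf5.h"
#include "hdf5_hl.h"

#define NRECORDS 1000000UL      /* placeholder: number of records to append */
#define BLOCK    100            /* records per H5PTappend call              */
#define RECSIZE  450            /* placeholder: bytes per fixed-size record */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static double run_once(hsize_t chunk_records)
{
    hid_t fid  = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t rect = H5Tcopy(H5T_C_S1);
    H5Tset_size(rect, RECSIZE);                 /* fixed-size string record */
    hid_t tbl  = H5PTcreate_fl(fid, "data", rect, chunk_records, /*gzip*/ 5);

    static char block[BLOCK * RECSIZE];
    memset(block, 'x', sizeof block);           /* dummy payload */

    double t0 = now();
    for (unsigned long i = 0; i < NRECORDS / BLOCK; i++)
        H5PTappend(tbl, BLOCK, block);
    double elapsed = now() - t0;

    H5PTclose(tbl);
    H5Tclose(rect);
    H5Fclose(fid);
    return elapsed;
}

int main(void)
{
    /* assuming chunk_size counts records: scale 64 KB and 128 KB to records */
    hsize_t chunks[2] = { 65536 / RECSIZE, 131072 / RECSIZE };
    for (int i = 0; i < 2; i++)
        printf("chunk = %llu records: %.1f s\n",
               (unsigned long long)chunks[i], run_once(chunks[i]));
    return 0;
}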
> Next I will try to use an array type of fixed length to see some working
> compression.
IMO, this is your best bet if you are after compressing your data.
I'm still measuring - but I was surprised by the findings again. E.g. with
arrays of size 16384 it seems best to use chunk size 32, compression level 4,
and to write as many arrays as possible (maybe there is an upper limit that I
did not reach yet) with a single call to H5PTappend. With that I get the data
written in 217s to a file of size 160 MB.
The data is the same as I used for writing the strings, but now without the
conversion to hex strings: 468M bytes in total. With the overhead of the
fixed-length arrays, the total data written to the file amounts to 16.2G (the
overhead bytes are zeroed). With the latter in mind, the resulting file size
of 160M is quite respectable. But compared with writing the same data to a
zip with on-the-fly deflation it is not, as that leads to 50M in 65s (with no
performance tuning like writing data in blocks etc.) ...
With a big chunk size, both performance and file size degrade by a large
factor. The worst example was 819K of data (50 arrays of 16384 bytes each,
compression 0, chunk size 32K) leading to a file of 513M.
If my attachment made it to the list, I will provide a table again.
Maybe I should try something like using multiple packet tables in parallel,
each with a different array size ... ?
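For reference, the fixed-length array setup described above would look
roughly like this in plain C, stripped of the Java/JNA layer (an untested
sketch; file and dataset names are made up, and the 16384-byte records, chunk
size 32 and gzip level 4 simply mirror the values quoted above):

/* Untested sketch: packet table whose record type is a fixed-length byte
 * array (HDF5 1.8 API). */
#include <string.h>
#include "hdf5.h"
#include "hdf5_hl.h"

int main(void)
{
    hsize_t dims[1] = { 16384 };                /* bytes per record */
    hid_t fid  = H5Fcreate("arrays.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t arrt = H5Tarray_create2(H5T_NATIVE_UCHAR, 1, dims);
    hid_t tbl  = H5PTcreate_fl(fid, "blocks", arrt, /*chunk*/ 32, /*gzip*/ 4);

    unsigned char rec[16384];
    memset(rec, 0, sizeof rec);                 /* zeroed padding after payload */
    /* ... memcpy the real payload into the front of rec ... */
    H5PTappend(tbl, 1, rec);                    /* batch many records per call
                                                   in practice */
    H5PTclose(tbl);
    H5Tclose(arrt);
    H5Fclose(fid);
    return 0;
}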
BTW, when sending strings to HDF5 containers be sure to zero the memory
buffer area after the end of the string: this could improve the compression
ratio quite a lot.
I'm using H5PT - I do not see any method for doing such a thing there.
Which method did you have in mind?
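If I understand the suggestion correctly (my interpretation - there seems to
be no dedicated H5PT call for it), the zeroing would simply be done on the
user-side buffer before each append, roughly like this untested sketch:

/* Untested sketch: the zeroing happens on the user-side buffer before each
 * H5PTappend, so the padding after the string is all zeros and compresses
 * very well. */
#include <string.h>
#include "hdf5.h"
#include "hdf5_hl.h"

#define REC_BYTES 16384                  /* fixed record size, as above */

static void append_string(hid_t table, const char *s)
{
    static unsigned char rec[REC_BYTES]; /* `table` comes from a fixed-length
                                            setup like the one sketched above */
    size_t len = strlen(s);
    if (len > REC_BYTES)
        len = REC_BYTES;

    memset(rec, 0, REC_BYTES);           /* zero the tail after the payload */
    memcpy(rec, s, len);                 /* copy the string itself          */
    H5PTappend(table, 1, rec);
}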
Thanks for every hint!
Johannes Stamminger
On Mon, 2010-04-12 at 16:39 +0200, Francesc Alted wrote: