On Monday 12 April 2010 18:40:33, Stamminger, Johannes wrote:
> Fine to have some more feedback! 
>
>
> Did the attached spreadsheet make its way to the forum? Or was it
> filtered?
Yes, it made it into the list.
Great, and good to know - this makes things much easier than having to
provide attachments through some different channel!
It is a small file and OpenOffice can open it easily, so I suppose that it is
fine if you send more of these (although if you could come up with a PDF file
that would be better).
... as it is created using OO - remember I'm running Linux.
In principle you are absolutely correct. But the sheet contains calculations,
and in a PDF no one could verify them any longer (I have already found wrong
cell references inside them several times - until now only *before* having
posted it). This way anyone wondering can check them ... so I personally
would prefer to stay with the OO .xls for best compatibility ... ?
> > Don't know about this one, but this dramatic loss in performance when
> > passing from a 64 KB to a 128 KB chunksize is certainly strange. It would
> > be nice if you could build a small benchmark showing this performance
> > problem and send it to the HDF Group for further analysis.
>
> I may be able to extract this test with some small effort. But it is Java
> then, wrapping the native shared libraries. And it is *not* hdf-java, as
> that does not support H5PT; it uses JNA for that purpose.
>
> Still interested?
I suppose so, but you should ask the THG helpdesk just to be sure 
I will contact them about this then, too.
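For what it's worth, a standalone C version of such a benchmark can stay
quite small. A minimal sketch, assuming a fixed 16384-byte array per packet
and taking chunk size and deflate level from the command line (file and
dataset names are just placeholders, error checking omitted):

/* Minimal H5PT benchmark sketch: append NRECORDS fixed-size packets for a
   given chunk size / deflate level and report the elapsed time.
   Link against the HDF5 high-level library (-lhdf5_hl -lhdf5). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "hdf5.h"
#include "hdf5_hl.h"

#define RECORD_BYTES 16384   /* fixed packet payload, as in the tests above */
#define NRECORDS     50000   /* total packets to append */
#define BATCH        1000    /* packets handed to each H5PTappend call */

int main(int argc, char **argv)
{
    hsize_t chunk = (argc > 1) ? (hsize_t)atoll(argv[1]) : 32;  /* in packets */
    int     level = (argc > 2) ? atoi(argv[2]) : 4;  /* -1 = none, 0-9 = deflate */

    hid_t fid = H5Fcreate("pt_bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* one packet = a fixed array of RECORD_BYTES bytes */
    hsize_t dims[1] = { RECORD_BYTES };
    hid_t ptype = H5Tarray_create2(H5T_NATIVE_UCHAR, 1, dims);

    hid_t table = H5PTcreate_fl(fid, "packets", ptype, chunk, level);

    /* zeroed dummy payload; swap in real data to mimic real compressibility */
    unsigned char *buf = calloc(BATCH, RECORD_BYTES);
    time_t t0 = time(NULL);
    for (int i = 0; i < NRECORDS / BATCH; i++)
        H5PTappend(table, BATCH, buf);
    H5PTclose(table);                       /* flush before taking the time */
    printf("chunk=%lld level=%d -> %.0f s\n",
           (long long)chunk, level, difftime(time(NULL), t0));

    free(buf);
    H5Tclose(ptype);
    H5Fclose(fid);
    return 0;
}

Timing the same appends for a couple of chunk sizes on either side of the
problematic value should show whether the drop comes from the library itself
rather than from the JNA layer.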
> I'm still measuring - but I was surprised again by the findings. E.g. with
> arrays of size 16384 it seems best to use chunksize 32, compression level 4
> and to write as many arrays as possible (maybe there is an upper limit that
> I did not reach yet) with a single call to H5PTappend. With that I get the
> data written in 217 s to a file of size 160 MB.
>
> The data is the same as I used for writing the strings, but now without the
> conversion to hex strings: 468 MB in sum. With the overhead of the fixed
> length arrays the total data written to the file is 16.2 GB in size (the
> overhead bytes are zeroed). With the latter in mind the resulting file size
> of 160 MB is quite understandable. But compared with writing the same data
> to a zip with on-the-fly compression it is not, as that leads to 50 MB in
> 65 s (with no performance tuning like writing data in blocks etc.) ...
Well, 160 MB wrt 468 MB is quite fine. Indeed, zip compresses better here for
a number of reasons. The first is that zip is probably using larger block
sizes. In addition, HDF5 is designed to be able to access each chunk
directly, not sequentially, so it has to add some overhead (in the form of a
B-tree) to quickly locate chunks; you cannot (as far as I know) do the same
with zip. Finally, keep in mind that you are actually compressing 16.2 GB
instead of 468 MB. And although most of the 16.2 GB are zeros, the compressor
still has to walk, chew and code them. So you can never expect to get the
same speed/compression ratio as zip in this scenario.
You are absolutely correct, and I'm aware of the benefits of having the
improved data access.
Unfortunately the app user does not see this and will complain if the
application runs 10 times slower and creates 10 times bigger files :-(.
Therefore I *must* reach comparable times and sizes. The files may grow
"a bit", but not by whole factors.
And as I already wrote later in this thread, it does seem realistic: by using
multiple packet tables with different fixed array sizes in parallel (to
reduce the zeros overhead and to save the time spent compressing and writing
the zeros). But then I will additionally have to maintain a dataset with
references to those arrays in the correct order. Hopefully this does not
increase the file size that much ...
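One possible shape for that ordering dataset - just a sketch of the idea, not
necessarily the best one: a small compound record per blob, kept in its own
compressed packet table (the names and field widths below are made up):

#include "hdf5.h"
#include "hdf5_hl.h"

/* One record of the ordering table: which size-class packet table a blob
   went to, and at which row inside that table. */
typedef struct {
    unsigned char      table;   /* 0, 1, 2, ... = index of the size-class table */
    unsigned long long row;     /* row of the blob inside that table */
} order_rec_t;

int main(void)
{
    hid_t fid = H5Fcreate("order_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hid_t rec = H5Tcreate(H5T_COMPOUND, sizeof(order_rec_t));
    H5Tinsert(rec, "table", HOFFSET(order_rec_t, table), H5T_NATIVE_UCHAR);
    H5Tinsert(rec, "row",   HOFFSET(order_rec_t, row),   H5T_NATIVE_ULLONG);

    /* its own compressed packet table; chunk size is again counted in packets */
    hid_t order = H5PTcreate_fl(fid, "blob_order", rec, 4096, 4);

    /* append one record per blob, in the original blob order */
    order_rec_t r = { 1, 0 };       /* e.g. blob 0 landed in table 1, row 0 */
    H5PTappend(order, 1, &r);

    H5PTclose(order);
    H5Tclose(rec);
    H5Fclose(fid);
    return 0;
}

At a few bytes per blob (versus a mean blob size of about 465 bytes) the
bookkeeping itself should stay small; the main cost is one more chunked
dataset with its own B-tree.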
But I'm curious when you say that you were converting data to hex strings.
Why were you doing so?
This was just a workaround for my problems with writing variable-length
binary data. In fact I have a series of variable-length binary blobs (max
possible size 2^16 bytes; in my test data the mean size is 465 bytes and the
max size 1435 bytes). As I failed to write them as variable-length binary
data (see the thread "Varying length binary data"), I used the hex strings as
a workaround - just for the performance measurements.
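Just for reference, the plain C-level pattern for such blobs would be a
dataset whose element type is a vlen of uint8 - whether that maps cleanly
through JNA is of course a separate question (names below are placeholders,
error checking omitted):

/* Sketch: a chunked, extendible 1-D dataset whose elements are
   variable-length byte blobs (a vlen of uint8).  Note that, as far as I
   know, dataset filters never see the blob bytes themselves (vlen payloads
   live in the global heap), so deflate would only compress the small
   per-element descriptors and is left out here. */
#include "hdf5.h"

int main(void)
{
    hid_t fid = H5Fcreate("vlen_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hid_t blob_t = H5Tvlen_create(H5T_NATIVE_UCHAR);   /* variable-length bytes */

    hsize_t dims[1]    = { 2 };
    hsize_t maxdims[1] = { H5S_UNLIMITED };
    hsize_t chunk[1]   = { 1024 };
    hid_t space = H5Screate_simple(1, dims, maxdims);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);                 /* required for extendible data */

    hid_t dset = H5Dcreate2(fid, "blobs", blob_t, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    /* two blobs of different length, written in one call */
    unsigned char a[3] = { 1, 2, 3 }, b[5] = { 9, 8, 7, 6, 5 };
    hvl_t buf[2] = { { 3, a }, { 5, b } };
    H5Dwrite(dset, blob_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space);
    H5Tclose(blob_t); H5Fclose(fid);
    return 0;
}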
If your data are typically ints or floats, you may want to use the shuffle
filter in combination with zlib. In many circumstances, shuffle may buy you a
significant additional compression ratio. This is something that zip cannot
do (it can only compress streams of bytes, as it has no notion of
ints/floats).
Good to know. Do you know whether this is shown in an example?
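The pattern itself is small - a minimal sketch, with shuffle set on the
dataset creation property list before deflate so the filters run in that
order (the names and sizes below are placeholders):

/* Sketch: chunked dataset of ints with shuffle + deflate.  The shuffle
   filter regroups the bytes of the fixed-size elements before zlib sees
   them, which often improves the compression ratio for ints/floats. */
#include "hdf5.h"

int main(void)
{
    hid_t fid = H5Fcreate("shuffle_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1]  = { 16384 };
    hsize_t chunk[1] = { 4096 };
    hid_t space = H5Screate_simple(1, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_shuffle(dcpl);          /* byte-shuffle each chunk ... */
    H5Pset_deflate(dcpl, 4);       /* ... then compress it with zlib level 4 */

    hid_t dset = H5Dcreate2(fid, "ints", H5T_NATIVE_INT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    int buf[16384];
    for (int i = 0; i < 16384; i++) buf[i] = i % 100;   /* some dummy data */
    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(fid);
    return 0;
}

Note that shuffle helps for fixed-size multi-byte elements like ints and
floats; for raw byte blobs there is no element structure to regroup, so it
will likely not buy much there.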
> With a big chunksize both performance and file size degrade by a large
> factor. The worst example was 819 KB of data leading to a file of 513 MB
> (50 arrays of 16384 bytes each, compression 0, chunk size 32K).
Uh, you lost me. What is 819 KB, the chunksize?
Oh no, that is the size of the data written to the HDF5 file: 50 arrays of
16384 bytes each (50 x 16384 bytes = 819 KB), written with compression
level 0 and chunk size 32K.
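If I read the H5PT API correctly, the chunk_size argument of H5PTcreate_fl is
counted in packets, not bytes. With 16384-byte packets a "32K" chunk would
then be 32768 x 16384 bytes = 512 MiB per chunk, and with no effective
compression the single chunk touched by the 50 appends is allocated in full,
which would roughly match the 513 MB file. A sketch of the two settings side
by side (names are placeholders; note that running it writes a ~512 MB file):

/* Sketch: why chunk "32K" blows the file up when the packets are 16384-byte
   arrays.  The chunk_size argument of H5PTcreate_fl counts packets, so a
   32768-packet chunk of 16 KiB packets is a 512 MiB chunk on disk; with no
   compression the one chunk touched by 50 appends is allocated in full. */
#include "hdf5.h"
#include "hdf5_hl.h"

int main(void)
{
    hid_t fid = H5Fcreate("chunk_sketch.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = { 16384 };
    hid_t ptype = H5Tarray_create2(H5T_NATIVE_UCHAR, 1, dims);

    /* 32 packets/chunk = 512 KiB chunks ... */
    hid_t small = H5PTcreate_fl(fid, "chunk_32",  ptype, 32,    -1);
    /* ... versus 32768 packets/chunk = 512 MiB chunks (-1 = no compression) */
    hid_t big   = H5PTcreate_fl(fid, "chunk_32k", ptype, 32768, -1);

    unsigned char rec[16384] = { 0 };
    for (int i = 0; i < 50; i++) {        /* 50 arrays, ~819 KB of payload */
        H5PTappend(small, 1, rec);
        H5PTappend(big,   1, rec);
    }

    H5PTclose(small); H5PTclose(big);
    H5Tclose(ptype);  H5Fclose(fid);
    return 0;
}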
Thanks for all the hints,
Johannes Stamminger
···
On Mon, 2010-04-12 at 20:46 +0200, Francesc Alted wrote: