Compression in variable length datasets

Andrea_Bedini · January 16, 2014, 9:37pm

Hi,

In a post a few years ago [1], Quincey Koziol explained that VL data is
stored in a "global heap" in the file, which is not compressed. He also
mentioned that a new "fractal heap" code was being developed (which, I
assume, would allow compression of VL data).

Is there any news on this front? Is there a way to compress VL data?

Thanks,
Andrea

[1] http://hdf-forum.184993.n3.nabble.com/hdf-forum-Compression-in-variable-length-datasets-not-working-td194091.html

--
Andrea Bedini <andrea.bedini@gmail.com>

nyama · January 16, 2014, 10:12pm

Hi,
I am trying to use zlib compression on my compound array for the first time.
My compound array is approximately 12 MB. After compression, it is still 11 MB,
so I think something is wrong. I tried changing CHUNK, but apparently CHUNK does not have any influence.
I chose RANK as 1. Is that correct for compound arrays? Also, in HDFView I cannot see any indication of whether the dataset I am viewing is compressed.
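A result like 12 MB to 11 MB may say more about the bytes than about the settings. Deflate only finds repetition that sits close together in the byte stream, and in a compound record the slowly changing bytes of each field are interleaved with fast-changing ones. HDF5 offers a shuffle filter (`H5Pset_shuffle`) that regroups the bytes of each element before deflate runs. The sketch below imitates that idea in plain Python so the effect can be seen without HDF5. The record layout, the sample values, and the `shuffle`/`unshuffle` helpers are assumptions made for illustration; real data may behave differently, and noisy floating-point fields may not compress much either way. As for RANK, 1 is the natural choice for a one-dimensional array of compound records.

```python
import struct
import zlib

# Hypothetical compound record: int32 id, float64 time, int16 flag (14 bytes, no padding).
RECORD = struct.Struct("<idh")


def shuffle(buf, itemsize):
    """Byte-shuffle: gather byte 0 of every item, then byte 1, and so on."""
    return b"".join(buf[i::itemsize] for i in range(itemsize))


def unshuffle(buf, itemsize):
    """Inverse of shuffle(): put every byte back into its original record."""
    n = len(buf) // itemsize
    out = bytearray(len(buf))
    for i in range(itemsize):
        out[i::itemsize] = buf[i * n:(i + 1) * n]
    return bytes(out)


# Made-up sample data: a counter, a regularly spaced time stamp, a small flag.
records = b"".join(RECORD.pack(i, i * 0.5, i % 3) for i in range(50_000))

plain_size = len(zlib.compress(records, 6))
shuffled_size = len(zlib.compress(shuffle(records, RECORD.size), 6))

print(f"raw: {len(records)}  deflate: {plain_size}  shuffle+deflate: {shuffled_size}")
```

To check whether a dataset in a file really is compressed, `h5dump -p -H file.h5` prints each dataset's storage layout and filter list.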

Thank you

With respect
Nyamtulga Shaandar

epourmal · January 17, 2014, 4:46pm

Andrea,

On Jan 16, 2014, at 3:37 PM, Andrea Bedini <andrea.bedini@gmail.com> wrote:

Is there any news on this front? Is there a way to compress VL data?

No news. We need funding to implement compression of VL data. If any organization is willing to sponsor the feature, please contact us at info@hdfgroup.org
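Until the library can compress VL data, one workaround that is sometimes used is to avoid VL datatypes altogether: flatten the ragged rows into one fixed-type array of values plus an array of row offsets, and store both as ordinary chunked, compressed datasets. The sketch below only illustrates that layout in plain Python. `pack_ragged` and `unpack_ragged` are hypothetical helper names, and `zlib` stands in for HDF5's deflate filter; no HDF5 API is called here.

```python
import zlib
from array import array


def pack_ragged(rows):
    """Flatten variable-length rows into one values buffer plus row offsets."""
    values = array("d")          # every element of every row, back to back
    offsets = array("q", [0])    # row i occupies values[offsets[i]:offsets[i + 1]]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def unpack_ragged(values, offsets):
    """Rebuild the list of rows from the flattened representation."""
    return [list(values[offsets[i]:offsets[i + 1]])
            for i in range(len(offsets) - 1)]


rows = [[1.0, 2.0, 3.0], [], [4.0], [5.0, 6.0]]
values, offsets = pack_ragged(rows)

# Both buffers hold fixed-size elements, so each can be compressed like any
# ordinary dataset (zlib is used here in place of the deflate filter).
compressed_values = zlib.compress(values.tobytes())
compressed_offsets = zlib.compress(offsets.tobytes())

# Round trip: decompress, reinterpret the bytes, and split back into rows.
restored_values = array("d")
restored_values.frombytes(zlib.decompress(compressed_values))
restored_offsets = array("q")
restored_offsets.frombytes(zlib.decompress(compressed_offsets))

print(unpack_ragged(restored_values, restored_offsets))
```

The trade-off is that readers must know about the offsets convention, since the file no longer describes the rows as a VL type.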

Thank you!

Elena

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Elena Pourmal
Director of Technical Services and Operations
The HDF Group
1800 So. Oak St., Suite 203,
Champaign, IL 61820
www.hdfgroup.org
(217)531-6112 (office)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

nevion · January 17, 2014, 6:17pm

I hear you guys need funding regularly, but I thought DOE/Whamcloud/Cray
gave you guys a healthy chunk of change to fix a lot of long-standing issues
that would be very beneficial to get done.

Things I've been noting during my usage of HDF5:

   - No support for filters on vlen types, in particular compression
   - No atomic transactions for writing new data (i.e. if you crash, your
   document can be corrupted because of metadata issues)
   - Usage of a global mutex for almost any routine, which can cause severe
   performance degradation if multiple threads are concurrently doing IO... (I
   don't want to hear about studies saying that IO itself is the bigger
   problem... there are RAM filesystems, SSDs, and different storage locations
   to invalidate these claims, all easy things to come by in HPC)
   - Mediocre examples that don't really show you how to get things done or
   clarify all that much. Reading the source is often the only way to get an
   answer, I've found. It makes the learning curve appear huge to the newbies
   I've introduced to HDF.
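On the crash-safety point, a user-level mitigation that is sometimes applied when a program regenerates a whole file (rather than appending to it in place) is to write to a temporary file and rename it over the target only once the write has finished. The sketch below shows that generic pattern with the Python standard library. It is not an HDF5 feature, `atomic_write` is a hypothetical helper name, and the byte strings merely stand in for file contents that an HDF5 writer would produce.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path so that readers see either the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the bytes to disk before the rename
        os.replace(tmp_path, path)   # swap the finished file into place
    except BaseException:
        # Remove the temporary file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "results.h5")
    atomic_write(target, b"first version")
    atomic_write(target, b"second version")
    with open(target, "rb") as f:
        print(f.read())
```

This does not help with in-place updates of an existing file, which is the case the bullet above is really about.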

I love HDF as it really addresses write once, use anywhere and gives me
something to deal with long-term storage of custom binary data formats. I
hope it continues to spread throughout the world as the de facto data
storage format for pretty much anything between embedded systems and HPC in
all its spaces. I also hope that in the future more things will leave the
PowerPoint slides / R&D and become production ready. I hope you guys get
everything you need to improve the areas HDF is still lacking in.

-Jason


epourmal · January 17, 2014, 8:56pm

Jason,

On Jan 17, 2014, at 12:17 PM, Jason Newton <nevion@gmail.com> wrote:

I hear you guys need funding regularly, but I thought DOE/Whamcloud/Cray gave you guys a healthy chunk of change to fix a lot of long-standing issues that would be very beneficial to get done.

Unfortunately, support for better handling of the VL datatypes (including compression) was not among the features they decided to fund. I'll let Quincey elaborate on what DOE and Co. sponsored.

Things I've been noting during my usage of HDF5:
No support for filters on vlen types, in particular compression

The architecture is already in place. We just need resources.

No atomic transactions for writing new data (i.e. if you crash, your document can be corrupted because of metadata issues)

The good news is that this will be addressed by metadata journaling. Also, we are wrapping up the SWMR (single-writer/multiple-reader) feature (for append-only data), which addresses the issue too. The bad news is that I cannot tell you the exact date of the HDF5 1.10.0 release. We are targeting the end of the year, and the features should be available in the snapshots during the year, but at this point no dates have been set. The append-only SWMR prototype will be ready for users to try in early March 2014.

Usage of a global mutex for almost any routine, which can cause severe performance degradation if multiple threads are concurrently doing IO... (I don't want to hear about studies saying that IO itself is the bigger problem... there are RAM filesystems, SSDs, and different storage locations to invalidate these claims, all easy things to come by in HPC)

Currently there are no plans to address this issue.

Mediocre examples that don't really show you how to get things done or clarify all that much. Reading the source is often the only way to get an answer, I've found. It makes the learning curve appear huge to the newbies I've introduced to HDF.

Please share with us your ideas on how we should improve, which important features are missing, etc. We also encourage this community to share knowledge and code.

I love HDF as it really addresses write once, use anywhere and gives me something to deal with long-term storage of custom binary data formats. I hope it continues to spread throughout the world as the de facto data storage format for pretty much anything between embedded systems and HPC in all its spaces. I also hope that in the future more things will leave the PowerPoint slides / R&D and become production ready.

Well… Your wishes coincide with our desire.

I hope you guys get everything you need to improve the areas HDF is still lacking in.

Thank you! We hope so too!

Elena

Andrea_Bedini · January 17, 2014, 11:47pm

Elena,

No news. We need funding to implement compression of VL data. If any
organization is willing to sponsor the feature, please contact us at
info@hdfgroup.org

Thanks for your reply. If more developer time is what you need, have you
considered a more open development process? You could put the repository on
GitHub, keep the issue database public, and let people submit pull requests.
My guess is that plenty of people in this community would be able/willing
to contribute. And of course, as the repository owner, you would still have
the last word on any commit.

my (hopefully not unwelcome) 2c,
Andrea


nevion · January 18, 2014, 12:02am

Here's to a public issue tracker and a public GitHub/Gitorious/Bitbucket
repo... As I've said before, contributing to the HDF project is a puzzle in
itself... I still don't know the status of my own submissions...

OpenCV used to be like this, but look at it now after it moved to GitHub...
it's bursting at the seams with life and progress.

-Jason


epourmal · January 18, 2014, 2:02am

Andrea,

Thanks for your reply. If more developer time is what you need, have you considered a more open development process? You could put the repository on GitHub, keep the issue database public, and let people submit pull requests. My guess is that plenty of people in this community would be able/willing to contribute. And of course, as the repository owner, you would still have the last word on any commit.

my (hopefully not unwelcome) 2c,

Of course not! We really appreciate and value suggestions and open discussions.

We will be moving to git in the very near future. GitHub is also under consideration.

All,

If you have a patch or would like to contribute code or documentation, please send it to help@hdfgroup.org and follow the instructions you receive. One of our goals this year is to simplify our current process for accepting patches.

Thank you!

Elena

