Efficient Way to Write Compound Data

Hi Matt,


On Aug 29, 2008, at 2:16 PM, Dougherty, Matthew T. wrote:

I am a little confused about the example:

H5Zregister(H5Z_BZIP2, "bzip2", bzip2_filter); has three parameters.

The docs say one parameter is passed, which is the structure:

typedef struct H5Z_class_t {
    H5Z_filter_t         filter_id;
    const char          *comment;
    H5Z_can_apply_func_t can_apply_func;
    H5Z_set_local_func_t set_local_func;
    H5Z_func_t           filter_func;
} H5Z_class_t;

  We updated the filter registration API routine (H5Zregister), so the three parameters that used to be passed to it separately are now fields of the class structure that is passed in.

  Quincey
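
For illustration, here is a minimal sketch of the updated single-parameter call, assuming a filter callback named bzip2_filter and a filter ID macro H5Z_BZIP2 as in the example above. The exact field list of H5Z_class_t differs between HDF5 1.6 and 1.8 (1.8 adds a version number and encoder/decoder flags), so check H5Zpublic.h for the release you build against:

    /* Sketch only: field order follows the H5Z_class_t definition quoted
     * above; adjust to match the headers of your HDF5 release. */
    const H5Z_class_t bzip2_class = {
        H5Z_BZIP2,       /* filter_id:      the old first argument  */
        "bzip2",         /* comment:        the old second argument */
        NULL,            /* can_apply_func: optional, may be NULL   */
        NULL,            /* set_local_func: optional, may be NULL   */
        bzip2_filter     /* filter_func:    the old third argument  */
    };

    if (H5Zregister(&bzip2_class) < 0) {
        /* registration failed: handle the error here */
    }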


On Friday, 29 August 2008, you wrote:

Variable length string data is not compressed. There is a small
structure in the dataset itself which points to the variable length
data, stored elsewhere in the file. Those structures themselves will
be compressed if you apply the filter, but the strings are stored as
is.

As far as I know, the only way to compress string data is to store
fixed-length strings. You can pad strings to a given maximum size,
or concatenate many strings into a single large string.
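
As a quick illustration of the fixed-length approach, here is a hedged sketch (the dataset name, string length, chunk size, and the already-open file handle file_id are all invented for the example; H5Dcreate2 is the 1.8 spelling). A fixed-size string type goes through the deflate filter like any other datatype:

    /* Sketch: 32-byte fixed-length strings in a chunked, gzip-compressed
     * dataset.  Compression filters require a chunked layout. */
    hsize_t dims[1]  = {1000};
    hsize_t chunk[1] = {100};

    hid_t strtype = H5Tcopy(H5T_C_S1);        /* fixed-length string type */
    H5Tset_size(strtype, 32);                 /* pad/truncate to 32 bytes */
    H5Tset_strpad(strtype, H5T_STR_NULLTERM);

    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);                  /* gzip, level 6 */

    hid_t dset = H5Dcreate2(file_id, "/strings", strtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    /* ... H5Dwrite() with a char[1000][32] buffer, then close the handles ... */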

That's good to know. I normally use fixed-length strings, which is why I
thought that every string type was compressible.

Cheers,


--
Francesc Alted
Freelance developer
Tel +34-964-282-249


I just want to say that you have done nothing wrong, and so there
is no need to be sorry. You made a simple mistake, and it is one that
most people here have probably made. I have made this mistake, and I'll
probably make it again in the future.


Elena Pourmal <epourmal@hdfgroup.org> wrote:

I am very sorry that I missed H5Sclose for the filespace handle. The
point was that one HAS to close dataspace handles, and this should be
done inside the loop to prevent memory growth.

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


If you need to store millions of rows in tables perhaps a real SQL
database would be better - then if you need to maintain complex
relationships between large numbers of objects with multithreaded
access and object level locking then an OODB would be the tool of
choice.

     No, there's a very real need for storing large amounts of numeric
data, and SQL databases aren't good for that. My example just happened
to contain a string, as part of some "worst-case" evaluation tests. In
practice, I'd guess that most people using HDF5 won't be storing large
amounts of strings (and you'd have a definite point that SQL databases
would be better for string-type data, in general).

     However, with large amounts of numeric data, you might want to use
different numeric data types, to save disk space (yeah, this is still a
concern, even with multi-terabyte disk systems). Still, being able to
store strings (in a separate dataset) is nice, because, sometimes, you
need to annotate your ginormous pile of numeric data.
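
To illustrate that point with a hedged sketch (the dataset name, sizes, and the open file handle file_id are invented): the in-memory type and the on-disk type do not have to match, so data computed as double can be stored as 32-bit float, and HDF5 converts during the write:

    /* Sketch: values held as double in memory, stored as 32-bit IEEE
     * floats on disk, halving the raw storage for this dataset. */
    hsize_t dims[1] = {1000};
    double  buf[1000];                        /* filled elsewhere */

    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate2(file_id, "/samples", H5T_IEEE_F32LE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset);
    H5Sclose(space);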


--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


On Wednesday, 27 August 2008, Darryl Okahata wrote:

> If you need to store millions of rows in tables perhaps a real SQL
> database would be better - then if you need to maintain complex
> relationships between large numbers of objects with multithreaded
> access and object level locking then an OODB would be the tool of
> choice.

     No, there's a very real need for storing large amounts of
numeric data, and SQL databases aren't good for that. My example
just happened to contain a string, as part of some "worst-case"
evaluation tests. In practice, I'd guess that most people using HDF5
won't be storing large amounts of strings (and you'd have a definite
point that SQL databases would be better for string-type data, in
general).

     However, with large amounts of numeric data, you might want to
use different numeric data types, to save disk space (yeah, this is
still a concern, even with multi-terabyte disk systems). Still,
being able to store strings (in a separate dataset) is nice, because,
sometimes, you need to annotate your ginormous pile of numeric data.

You might be surprised by the number of people who are using HDF5 to store
string data too. Its capability for transparently compressing this kind of
data (all data types, in general) is much appreciated out there.

Cheers,


--
Francesc Alted
Freelance developer
Tel +34-964-282-249


Thanks everyone for your input. Closing the filespace definitely solves the
issue of the system locking up.

Regarding Andrew's comment about using DBs to store data - it does not quite
work out well for many of us in Financial Services. We pull a lot of data
from the DB, store it as a flat file (sometimes even as a CSV), and then we
build statistical models. The process of building a model is very iterative,
and sometimes we will go through 100 - 250 iterations (i.e. 100 different
combinations of variables, modeling techniques, etc.). Most of our modeling
tools are built to deal with large volumes of data and do not store all the
data in memory - they process it line by line or in blocks, and that's why
I was reading/writing line by line. In this situation, reading the data from
the DB would be very slow, as there is no really complicated query - we just
read all the rows.

As for the API, I always start by cobbling together examples and learning
from them. Of course, I am not really a professional developer, so I do not
know what is ideal. That said, people have been very helpful here and I am
sure they will help out :-)

SK


On Wed, Aug 27, 2008 at 4:32 PM, Francesc Alted <faltet@pytables.com> wrote:

On Wednesday, 27 August 2008, Darryl Okahata wrote:
> > If you need to store millions of rows in tables perhaps a real SQL
> > database would be better - then if you need to maintain complex
> > relationships between large numbers of objects with multithreaded
> > access and object level locking then an OODB would be the tool of
> > choice.
>
> No, there's a very real need for storing large amounts of
> numeric data, and SQL databases aren't good for that. My example
> just happened to contain a string, as part of some "worst-case"
> evaluation tests. In practice, I'd guess that most people using HDF5
> won't be storing large amounts of strings (and you'd have a definite
> point that SQL databases would be better for string-type data, in
> general).
>
> However, with large amounts of numeric data, you might want to
> use different numeric data types, to save disk space (yeah, this is
> still a concern, even with multi-terabyte disk systems). Still,
> being able to store strings (in a separate dataset) is nice, because,
> sometimes, you need to annotate your ginormous pile of numeric data.

You might be surprised by the number of people who are using HDF5 to store
string data too. Its capability for transparently compressing this kind of
data (all data types, in general) is much appreciated out there.

Cheers,

--
Francesc Alted
Freelance developer
Tel +34-964-282-249


All,

Thank you for the kind words and very constructive critique.

I would like to point any new users to examples that were recently posted on our Web site:

http://www.hdfgroup.org/HDF5/index.html

Information
     Example programs by API (1.8 and 1.6)

More examples can be found under

http://www.hdfgroup.uiuc.edu/UserSupport/code-examples/

Elena


On Aug 28, 2008, at 11:16 PM, SK wrote:

Thanks everyone for your input. Closing the filespace definitely solves the issue of the system locking up.

Regarding Andrew's comment about using DBs to store data - it does not quite work out well for many of us in Financial Services. We pull a lot of data from the DB, store it as a flat file (sometimes even as a CSV), and then we build statistical models. The process of building a model is very iterative, and sometimes we will go through 100 - 250 iterations (i.e. 100 different combinations of variables, modeling techniques, etc.). Most of our modeling tools are built to deal with large volumes of data and do not store all the data in memory - they process it line by line or in blocks, and that's why I was reading/writing line by line. In this situation, reading the data from the DB would be very slow, as there is no really complicated query - we just read all the rows.

As for the API, I always start by cobbling together examples and learning from them. Of course, I am not really a professional developer, so I do not know what is ideal. That said, people have been very helpful here and I am sure they will help out :-)

SK

On Wed, Aug 27, 2008 at 4:32 PM, Francesc Alted <faltet@pytables.com> wrote:
On Wednesday, 27 August 2008, Darryl Okahata wrote:
> > If you need to store millions of rows in tables perhaps a real SQL
> > database would be better - then if you need to maintain complex
> > relationships between large numbers of objects with multithreaded
> > access and object level locking then an OODB would be the tool of
> > choice.
>
> No, there's a very real need for storing large amounts of
> numeric data, and SQL databases aren't good for that. My example
> just happened to contain a string, as part of some "worst-case"
> evaluation tests. In practice, I'd guess that most people using HDF5
> won't be storing large amounts of strings (and you'd have a definite
> point that SQL databases would be better for string-type data, in
> general).
>
> However, with large amounts of numeric data, you might want to
> use different numeric data types, to save disk space (yeah, this is
> still a concern, even with multi-terabyte disk systems). Still,
> being able to store strings (in a separate dataset) is nice, because,
> sometimes, you need to annotate your ginormous pile of numeric data.

You might be surprised by the number of people who are using HDF5 to store
string data too. Its capability for transparently compressing this kind of
data (all data types, in general) is much appreciated out there.

Cheers,

--
Francesc Alted
Freelance developer
Tel +34-964-282-249


On Friday, 29 August 2008, Elena Pourmal wrote:

All,

Thank you for the kind words and very constructive critique.

I would like to point any new users to examples that were recently
posted on our Web site:

http://www.hdfgroup.org/HDF5/index.html

Wow, a very good tutorial in my opinion. I think this is a huge step
forward in introducing HDF5 concepts to naive users. Just a couple of
suggestions:

- It would be nice if you could add navigation buttons for going forward
and backward, instead of requiring the user to go up and click the next
section.

- I've seen HDF5 referred to, several times throughout the tutorial (and
maybe in other HDF5 documents too), as a library to deal with "a variety
of scientific data". Why add the "scientific" qualifier? Given that there
already exist many users in the financial field, and its potential for
other apps (like logfile management, indexing web content, etc.), I think
you could perfectly well replace "scientific data" with "binary data", or
just "data".

Information
     Example programs by API (1.8 and 1.6)

More examples can be found under

http://www.hdfgroup.uiuc.edu/UserSupport/code-examples/

Why do you keep these examples on a different site from hdfgroup.org?
Just curious.

Cheers,


--
Francesc Alted
Freelance developer
Tel +34-964-282-249


On Friday, 29 August 2008, Elena Pourmal wrote:
> All,
>
> Thank you for the kind words and very constructive critique.
>
> I would like to point any new users to examples that were recently
> posted on our Web site:
>
> http://www.hdfgroup.org/HDF5/index.html

Wow, a very good tutorial in my opinion. I think this is a huge step
forward in introducing HDF5 concepts to naive users. Just a couple of
suggestions:

IMHO it should be under the two big fields on the right for downloading HDF5
and HDFView. I never noticed it before; my eye always jumps to the bold
letters in the middle of the page.

Also, I would like to once again stress that most examples (at least those
under http://www.hdfgroup.uiuc.edu/UserSupport/code-examples)
do not work with HDF5 1.8.

BR

-- dimitris


2008/8/29 Francesc Alted <faltet@pytables.com>

On Friday, 29 August 2008, Dimitris Servis wrote:


2008/8/29 Francesc Alted <faltet@pytables.com>

> On Friday, 29 August 2008, Elena Pourmal wrote:
> > All,
> >
> > Thank you for the kind words and very constructive critique.
> >
> > I would like to point any new users to examples that were
> > recently posted on our Web site:
> >
> > http://www.hdfgroup.org/HDF5/index.html
>
> Wow, a very good tutorial in my opinion. I think this is a huge
> step forward in introducing HDF5 concepts to naive users. Just a
> couple of suggestions:

IMHO it should be under the two big fields on the right for downloading
HDF5 and HDFView. I never noticed it before; my eye always
jumps to the bold letters in the middle of the page.

I hadn't seen it before either. Maybe because it is very recent?

--
Francesc Alted
Freelance developer
Tel +34-964-282-249


This is just my experience. Either the API is hard to grasp or I am an
atypical case and should try something other than programming for a
living (:

     I think this is really an issue of perspective. The basic HDF5 API
isn't hard, as APIs go, but the documentation does leave much to be
desired. Even though there's a huge amount documented, there's still
missing information. Because of this, writing programs that use HDF5
isn't necessarily simple (it can look downright difficult, depending
upon what you want to do). It's easiest if beginners start with the
tutorials, but the tutorials don't always do a good job of explaining
the examples.


--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


Darryl, Andrew,

I am not arguing about whether the API is easy or difficult to learn. Everybody
has his own pace, and programming is more a craft than a science. I assure you
it takes some time to really take off with it, and I doubt I am already
flying ;-) However, I do agree that the documentation is rather weak and
the examples rarely work out of the box. Since my early programming days,
when I started with OSS, the libraries I used had no documentation but the
sources (as is the practice in many software houses as well). Therefore I am
used to working with two monitors: one with my sources and the other with the
sources of the libraries I use...

Best Regards

-- dimitris


2008/8/27 Darryl Okahata <darrylo@soco.agilent.com>

> This is just my experience. Either the API is hard to grasp or I am an
> atypical case and should try something other than programming for a
> living (:

     I think this is really an issue of perspective. The basic HDF5 API
isn't hard, as APIs go, but the documentation does leave much to be
desired. Even though there's a huge amount documented, there's still
missing information. Because of this, writing programs that use HDF5
isn't necessarily simple (it can look downright difficult, depending
upon what you want to do). It's easiest if beginners start with the
tutorials, but the tutorials don't always do a good job of explaining
the examples.

--
       Darryl Okahata
       darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


Hi,

Well, I have to admit that one of my first attempts to work with a
relatively large API in C was with HDF5, and believe me, I've found
the documentation incredibly useful and pretty accurate. Perhaps you
are right that more introductory material that is easy to find and
browse would be desirable (I know that there are *lots* of examples
scattered through the THG website, but maybe they could be better
organized).

On the other hand, and in my opinion, HDF5 is not for the casual user,
but is mainly oriented to application developers. This is because it
offers you lots of flexibility for tweaking the different knobs of its
underlying machinery, like chunk sizes, the data cache, the metadata
cache, hyperslabs, rich datatype definitions, and so on and so forth.
And believe me, understanding all the possibilities that HDF5 offers
is the most daunting task facing the programmer who starts with it
(and much harder than fighting with the API, which is relatively
trivial in comparison ;-).
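
To make two of those knobs concrete, here is a hedged sketch (the sizes are arbitrary examples, not recommendations): the chunk shape lives on the dataset creation property list, while the raw-data chunk cache lives on the file access property list.

    /* Sketch: chunk shape via the dataset creation property list,
     * chunk cache via the file access property list. */
    hsize_t chunk[2] = {256, 256};

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);             /* 256x256-element chunks      */

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_cache(fapl, 0,                     /* mdc_nelmts (ignored in 1.8) */
                 521,                         /* chunk-cache hash slots      */
                 16 * 1024 * 1024,            /* chunk cache size: 16 MB     */
                 0.75);                       /* preemption policy           */

    /* pass fapl to H5Fcreate/H5Fopen and dcpl to H5Dcreate2 ... */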

Casual programmers who want to leverage the advantages that HDF5
offers should look first at other tools that are based on it and
that are easier to work with. In particular, a look at:

http://www.hdfgroup.org/tools5desc.html

should be worth the effort for such users.

My two cents,

On Wednesday, 27 August 2008, Dimitris Servis wrote:


Darryl, Andrew,

I am not arguing about whether the API is easy or difficult to learn.
Everybody has his own pace, and programming is more a craft than a
science. I assure you it takes some time to really take off with it,
and I doubt I am already flying ;-) However, I do agree that the
documentation is rather weak and the examples rarely work out of
the box. Since my early programming days, when I started with OSS,
the libraries I used had no documentation but the sources (as is the
practice in many software houses as well). Therefore I am used to working
with two monitors: one with my sources and the other with the sources
of the libraries I use...

Best Regards

-- dimitris

2008/8/27 Darryl Okahata <darrylo@soco.agilent.com>

> > This is just my experience. Either the API is hard to grasp or I
> > am an atypical case and should try something other than
> > programming for a living (:
>
> I think this is really an issue of perspective. The basic
> HDF5 API isn't hard, as APIs go, but the documentation does leave
> much to be desired. Even though there's a huge amount documented,
> there's still missing information. Because of this, writing
> programs that use HDF5 isn't necessarily simple (it can look
> downright difficult, depending upon what you want to do). It's
> easiest if beginners start with the tutorials, but the tutorials
> don't always do a good job of explaining the examples.
>
> --
> Darryl Okahata
> darrylo@soco.agilent.com
>
> DISCLAIMER: this message is the author's personal opinion and does
> not constitute the support, opinion, or policy of Agilent
> Technologies, or of the little green men that have been following
> him all day.

--
Francesc Alted
Freelance developer
Tel +34-964-282-249


You should see much less memory usage if you close the filespace ID.

     Ah, yes. Francesc Alted was right -- see my other message.


--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.

All,

I am very sorry that I missed H5Sclose for the filespace handle. The point was that one HAS to close dataspace handles, and this should be done inside the loop to prevent memory growth.

Elena


On Aug 26, 2008, at 12:58 PM, Darryl Okahata wrote:

You should see much less memory usage if you close the filespace ID.

    Ah, yes. Francesc Alted was right -- see my other message.

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


I wrote:

     That's what I thought, too. However, amazingly enough, it's not
leaking, and I've verified this with two different tools: purify and
valgrind. The HDF5 code just appears to be using unbelievable amounts
of memory when you change ITER from 100000 to 1000000.

     OK, I take that back. While Elena's program is NOT leaking memory
in the "memory leak" sense, it is using up unbelievable amounts of
memory, and it is doing so because the filespace handle isn't being
released (as mentioned by Francesc Alted). Once you start closing the
filespace handle, the memory usage drops down to sane levels. Setting
ITER == 1000000 is then possible, and that runs in around 4 seconds, on
my system.
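
For readers following along, here is a hedged sketch of the pattern under discussion, using 1.8 API names (the record layout, dataset name, chunk size, and the open file handle file_id are invented; ITER stands for the iteration count in the test program): one record is appended per iteration, and the per-iteration filespace handle is closed so memory stays flat.

    /* Sketch: append one compound record per iteration.  The key detail
     * from this thread is the H5Sclose(filespace) inside the loop. */
    typedef struct { int id; double value; char name[16]; } record_t;

    hsize_t dims[1]  = {0}, maxdims[1] = {H5S_UNLIMITED};
    hsize_t chunk[1] = {1024}, count[1] = {1}, offset[1], size[1];

    hid_t rtype = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(rtype, "id",    HOFFSET(record_t, id),    H5T_NATIVE_INT);
    H5Tinsert(rtype, "value", HOFFSET(record_t, value), H5T_NATIVE_DOUBLE);
    hid_t stype = H5Tcopy(H5T_C_S1);
    H5Tset_size(stype, 16);
    H5Tinsert(rtype, "name",  HOFFSET(record_t, name),  stype);

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    hid_t dset  = H5Dcreate2(file_id, "/records", rtype, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    hid_t memspace = H5Screate_simple(1, count, NULL);

    for (hsize_t i = 0; i < ITER; i++) {
        record_t rec = { (int)i, i * 0.5, "row" };

        size[0] = i + 1;
        H5Dset_extent(dset, size);            /* grow by one record     */

        hid_t filespace = H5Dget_space(dset); /* fresh handle each pass */
        offset[0] = i;
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);
        H5Dwrite(dset, rtype, memspace, filespace, H5P_DEFAULT, &rec);

        H5Sclose(filespace);                  /* close it every pass, or
                                                 memory keeps growing    */
    }

    H5Sclose(memspace); H5Sclose(space);
    H5Pclose(dcpl); H5Tclose(stype); H5Tclose(rtype); H5Dclose(dset);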


--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


That's what I thought, too. However, amazingly enough, it's not
leaking, and I've verified this with two different tools: purify and
valgrind. The HDF5 code just appears to be using unbelievable amounts
of memory when you change ITER from 100000 to 1000000.


Francesc Alted <faltet@pytables.com> wrote:

By looking at the example posted by Elena, I think she missed closing
the ``filespace`` handle, so the program is basically developing a
leak.

--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.


Actually, it _is_ leaking, but we hook an "atexit" handler to clean up all the outstanding objects when the application is shutting down. You should see much less memory usage if you close the filespace ID.

  Quincey


On Aug 26, 2008, at 12:46 PM, Darryl Okahata wrote:

Francesc Alted <faltet@pytables.com> wrote:

By looking at the example posted by Elena, I think she missed closing
the ``filespace`` handle, so the program is basically developing a
leak.

    That's what I thought, too. However, amazingly enough, it's not
leaking, and I've verified this with two different tools: purify and
valgrind. The HDF5 code just appears to be using unbelievable amounts
of memory when you change ITER from 100000 to 1000000.


Darryl, if you don't mind, I would recommend the HDF Group publish your
example as a great way to write compound data. I think it could complement
the examples they already have on the Table API. This would be helpful for
people who don't mind using the lower-level APIs to get very good performance.

     While they're welcome to do that, I don't think it's anything
special. I just wrote that program to evaluate HDF5 under different
circumstances.


--
  Darryl Okahata
  darrylo@soco.agilent.com

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion, or policy of Agilent Technologies, or
of the little green men that have been following him all day.
