beginner help, Design of a HDF file

David, Ashwin,
Your questions are almost identical, so I will try to answer them both. It
sounds to me like you most likely both want to use compound data types.
Compound data types store objects on disk that are analogous to C structs and
Fortran derived types. But there are some considerations if you
want to quickly read and index your data. I'll try to answer all your
questions inline below. Documentation for compound data types can be found
here: http://www.hdfgroup.org/HDF5/doc/UG/UG_frame11Datatypes.html

From: David <evolvability@vt.edu>
To: hdf-forum@hdfgroup.org
Cc:
Date: Thu, 19 Jan 2012 18:14:04 -0500
Subject: [Hdf-forum] beginner help
I am sorry to ask such a basic question but I have simple object with a
couple properties and I would like to encode this in HDF5 format. Suppose
object X with properties a,b, and c. I will encode in the HDF5 file several
instances of this object. Like several records in a database.

So, is the object a group in HDF5 and are the properties the attributes?
How does the dataset fit it in?

HDF5 (and HDF4, I think) is designed to work with large collections
of data as efficiently as possible, but this requires some foresight on the
part of the user. A large number of users are in earth sciences, aerospace,
etc. A data set is a (usually very large) array of a certain data type.
Often in sci-comp this will just mean very large arrays of floats, or some
other intrinsic or user defined atomic data type. The trick for you is to
figure out how critical IO performance is. If you have only a few object Xs
and only read them in a few times, your implementation approach won't matter
much.

Either way I would probably go with the compound data type though, and then
you can declare the data set as extendible (an unlimited maximum dimension,
which requires chunked storage) so you can enlarge it as necessary.
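
As a rough illustration of the suggestion above, here is a minimal sketch using the h5py Python bindings and NumPy (my choice for the example; the same idea applies to the C and Fortran APIs). The field names a, b, c come from David's question, but their types, the data set name "X" and the file name are assumptions made up for the example. The file is kept in memory so nothing is written to disk.

```python
import numpy as np
import h5py

# Assumed layout for "object X with properties a, b and c"; the field types
# are hypothetical. A NumPy structured dtype maps to an HDF5 compound type.
record_dtype = np.dtype([("a", np.int32), ("b", np.float64), ("c", "S16")])

first_two = np.array([(1, 0.5, b"first"), (2, 1.5, b"second")],
                     dtype=record_dtype)

# In-memory file (core driver, no backing store), for illustration only.
with h5py.File("records.h5", "w", driver="core", backing_store=False) as f:
    # maxshape=(None,) makes the data set extendible along its only axis;
    # h5py then picks a chunk shape automatically.
    dset = f.create_dataset("X", shape=(2,), maxshape=(None,),
                            dtype=record_dtype)
    dset[0:2] = first_two

    # Enlarge the data set later and append one more record.
    dset.resize((3,))
    dset[2:3] = np.array([(3, 2.5, b"third")], dtype=record_dtype)

    records = dset[...]   # read whole records
    column_b = dset["b"]  # read a single field of every record
```

Each element of the data set is one record, much like a row in a database table, and a single field can be read without pulling in the others.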

To optimize performance one needs to have an idea of typical data access
patterns. One can then use chunking to group locations in the data set
together in logically contiguous chunks. In most use cases I can think of
(on typical magnetic disk drives) the really painful expense is the
latency/seek time. Chunking lets you physically group portions of your data
set together in a different way than the data set array is indexed (i.e.
into smaller chunks). It should then be much faster to read in an entire
chunk than to read in the equivalent portion of a non-chunked data set
(which will incur more latency expense associated with the data stride on
disk).
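
To make the chunking argument a little more concrete, the sketch below counts how many chunks a rectangular read overlaps for two different chunk shapes. It is pure arithmetic and does not touch HDF5 at all. The function name chunks_touched, the data set size and the tile size are all made up for the example.

```python
def chunks_touched(sel_start, sel_shape, chunk_shape):
    """Count the chunks overlapped by a rectangular selection.

    All three arguments are per-dimension tuples of ints.
    """
    count = 1
    for start, extent, chunk in zip(sel_start, sel_shape, chunk_shape):
        first = start // chunk                # index of first chunk hit
        last = (start + extent - 1) // chunk  # index of last chunk hit
        count *= last - first + 1
    return count

# Hypothetical 10000 x 10000 data set; read a 100 x 100 tile at (0, 0).
tile_start, tile_shape = (0, 0), (100, 100)

# Chunks shaped like the tile: the whole read lands in one chunk.
square = chunks_touched(tile_start, tile_shape, (100, 100))

# Chunks shaped like full rows (1 x 10000): each tile row is its own chunk.
row_wise = chunks_touched(tile_start, tile_shape, (1, 10000))

print(square, row_wise)
```

Fewer chunks touched generally means fewer separate reads, which is where the latency saving comes from. The chunk shape is fixed when the data set is created, so it pays to pick it with the expected reads in mind.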

Attributes are typically used for meta data, but there are no strict
restrictions on how you use them. In our group we use them to store the
simulation parameters needed to recreate and analyze the data. Attributes
are themselves stored much like data sets, but they are usually small ones.
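
A small sketch of attaching meta data as attributes, again assuming the h5py bindings. The data set name and the attribute names and values are invented for illustration and are not taken from any real simulation.

```python
import h5py

# In-memory file (core driver, no backing store), for illustration only.
with h5py.File("sim.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("velocity", shape=(4, 4), dtype="f8")

    # Attributes can hang off a data set ...
    dset.attrs["reynolds_number"] = 5000.0
    dset.attrs["units"] = "m/s"
    # ... or off a group, including the root group of the file.
    f.attrs["code_version"] = "1.2.0"

    reynolds = float(dset.attrs["reynolds_number"])
    units = dset.attrs["units"]
    version = f.attrs["code_version"]
```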

Again sorry for the probably obvious questions but I am not a computer
scientist but I would very much like to take advantage of the file format.

Thanks,
David

From: ashwin <acharya.ash@gmail.com>
To: hdf-forum@hdfgroup.org
Cc:
Date: Fri, 20 Jan 2012 06:16:02 -0800 (PST)
Subject: [Hdf-forum] Design of a HDF file
Hi,

I am new to HDF, so please be kind.

If I have a CAR object which looks like this :

CAR
color : String
type : int
model : String
aboutMe : String -- "more than 1000 characters long"

I have 100 million of these objects.

What is the best way to store this as an HDF object?

It depends on what the anticipated data access patterns are.

1. Does each car object have to go as a compound dataset?

No, but it's possible that this will be the most efficient way to store
them especially if all fields of the record are likely to be consumed at
once.

2. Is it good to have each car object as an array and store 100 million of
them? Or, for performance, is it required to limit the number of arrays
stored in a single group?

In my (limited) experience I would say that it's better to try to limit the
number of objects you create in an HDF5 file. It's probably a much better
idea to store the records as a large array, especially if you can sort them
into the order in which they will be read.

3. Is it OK to create a million groups - maybe for each type or color ?

I would avoid this. Again, the anticipated data access pattern dictates
what will be the most efficient way of storing the data (from an IO
throughput perspective). If you know you want to operate on cars by color
first and then by model (e.g. loop through all red cars, first Pontiacs then
Fords etc.), you could have a data set for each color that is sorted by
model.
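
The "one data set per color, sorted by model" layout could look roughly like the sketch below, assuming h5py and NumPy. The four sample cars, the fixed string widths and the file name are invented for the example, and the long aboutMe field is left out to keep it short.

```python
import numpy as np
import h5py

# Hypothetical fixed-width version of the CAR record (aboutMe omitted).
car_dtype = np.dtype([("color", "S8"), ("type", np.int32), ("model", "S16")])

cars = np.array([
    (b"red", 1, b"Pontiac"),
    (b"blue", 2, b"Ford"),
    (b"red", 3, b"Ford"),
    (b"blue", 1, b"Pontiac"),
], dtype=car_dtype)

# In-memory file (core driver, no backing store), for illustration only.
with h5py.File("cars.h5", "w", driver="core", backing_store=False) as f:
    for color in np.unique(cars["color"]):
        subset = cars[cars["color"] == color]
        # Sort each color's records by model before writing them out.
        subset = subset[np.argsort(subset["model"], kind="stable")]
        f.create_dataset(color.decode(), data=subset)

    names = sorted(f.keys())
    red_models = f["red"]["model"].tolist()
```

This gives a handful of large data sets (one per color) rather than millions of small objects, and a loop over all red cars reads one data set from start to end.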

I'm not a CS or HDF5 expert so if I have misinterpreted anything anywhere
I'm sure other users will correct me. I hope I was somewhat helpful.

Izaak Beekman


On Mon, Jan 23, 2012 at 4:00 AM, <hdf-forum-request@hdfgroup.org> wrote:

---------- Forwarded message ----------

===================================
(301)244-9367
Princeton University Doctoral Candidate
Mechanical and Aerospace Engineering
ibeekman@princeton.edu

UMD-CP Visiting Graduate Student
Aerospace Engineering
ibeekman@umiacs.umd.edu
ibeekman@umd.edu