FW: Writing a stream of data/Opaque Data type

FYI

···

-----Original Message-----
From: Ramakrishnan Iyer
Sent: Monday, January 28, 2008 2:48 PM
To: 'hdf-forum-help@hdfgroup.org'
Subject: Writing a stream of data/Opaque Data type

Dear HelpDesk Team,

This query is with HDF5 1.6.5.

I want to write a structure Data, which is defined as follows:

struct Data {
    Data1 *d1;
    unsigned int id;
    Data2 *d2;
};

Data1 and Data2 are themselves complicated data structures.

Now, if I am to write Data into HDF5, I can adopt the following approaches:

1st approach - Create a corresponding data structure for the Data struct in HDF5 and write the instances. But for this I will have to create corresponding data structures for Data1 and Data2. Since these are quite complicated, it will be very difficult to create appropriate structures for them, and it will require a lot of effort. Also, for the pointers I will have to use references.

2nd approach - Write the entire data structure as a blob using an opaque datatype. Since the Data structure contains pointers, it won't be safe to assume that the addresses I have written will be correct when I read the data back. The right approach would be to write the actual values pointed to by the Data1 and Data2 pointers as a stream of data. To simplify further, I can write the dataset with a datatype which contains an unsigned int and a stream of bytes (for Data1 and Data2).
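To make that second approach concrete, here is a rough sketch (hypothetical names, a fixed blob size assumed purely for illustration, and the 1.6-style five-argument H5Dcreate) of a compound datatype holding the id plus an opaque byte field:

#include "hdf5.h"
#include <string.h>

#define BLOB_SIZE 256   /* assumed fixed size of the serialized Data1+Data2 bytes */

typedef struct {
    unsigned int  id;
    unsigned char blob[BLOB_SIZE];   /* serialized contents of *d1 and *d2 */
} SerializedData;

int main(void)
{
    hid_t file = H5Fcreate("data.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Opaque type for the serialized byte stream */
    hid_t blob_t = H5Tcreate(H5T_OPAQUE, BLOB_SIZE);
    H5Tset_tag(blob_t, "serialized Data1+Data2");

    /* Compound type: one unsigned int followed by the opaque blob */
    hid_t cmp_t = H5Tcreate(H5T_COMPOUND, sizeof(SerializedData));
    H5Tinsert(cmp_t, "id",   HOFFSET(SerializedData, id),   H5T_NATIVE_UINT);
    H5Tinsert(cmp_t, "blob", HOFFSET(SerializedData, blob), blob_t);

    hsize_t dims[1] = {1};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate(file, "Data", cmp_t, space, H5P_DEFAULT);  /* 1.6-style call */

    SerializedData rec;
    rec.id = 42;
    memset(rec.blob, 0, BLOB_SIZE);   /* would hold the serialized Data1/Data2 bytes */
    H5Dwrite(dset, cmp_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, &rec);

    H5Dclose(dset); H5Sclose(space); H5Tclose(cmp_t); H5Tclose(blob_t); H5Fclose(file);
    return 0;
}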

My queries:

What is your opinion on these two approaches? Which one is better/faster?

Is there any other way by which I can write a stream of data in HDF5 other than the opaque datatype?

Is it possible to write a stream of data in HDF5 without knowing the size of the data to be written until the time of actual writing?

Regards

Ramakrishnan

If you use option 1, are you saying you'd create 3 compound datatypes, Data, Data1, and Data2? If so, you could make 3 datasets, one for each of these. You can make pointer-type objects (references) within HDF5 that link each entry in Data to the corresponding entries in the Data1 and Data2 datasets. Then, when you read an entry from Data, you can create new Data1 and Data2 objects, populate them from those datasets, and point the in-memory *d1 and *d2 at the newly created objects. Note that these compound datatypes will only be efficient if they are of fixed size.
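A rough sketch of that linking idea (hypothetical names; it assumes datasets "/Data1" and "/Data2" already exist in the file and uses plain object references; dataset-region references, H5R_DATASET_REGION, could point at individual rows instead):

#include "hdf5.h"

typedef struct {
    hobj_ref_t   d1_ref;   /* HDF5 object reference standing in for the d1 pointer */
    unsigned int id;
    hobj_ref_t   d2_ref;   /* HDF5 object reference standing in for the d2 pointer */
} DataRecord;

/* Build one Data record whose "pointers" are references to /Data1 and /Data2. */
herr_t write_data_record(hid_t file)
{
    hid_t rec_t = H5Tcreate(H5T_COMPOUND, sizeof(DataRecord));
    H5Tinsert(rec_t, "d1", HOFFSET(DataRecord, d1_ref), H5T_STD_REF_OBJ);
    H5Tinsert(rec_t, "id", HOFFSET(DataRecord, id),     H5T_NATIVE_UINT);
    H5Tinsert(rec_t, "d2", HOFFSET(DataRecord, d2_ref), H5T_STD_REF_OBJ);

    DataRecord rec;
    rec.id = 1;
    H5Rcreate(&rec.d1_ref, file, "/Data1", H5R_OBJECT, -1);   /* reference to an existing object */
    H5Rcreate(&rec.d2_ref, file, "/Data2", H5R_OBJECT, -1);

    hsize_t dims[1] = {1};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate(file, "Data", rec_t, space, H5P_DEFAULT);  /* 1.6-style call */
    herr_t status = H5Dwrite(dset, rec_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, &rec);

    H5Dclose(dset); H5Sclose(space); H5Tclose(rec_t);
    return status;
}

On read, H5Rdereference(dset, H5R_OBJECT, &rec.d1_ref) would reopen the referenced object so the in-memory *d1 and *d2 can be rebuilt.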

For option 2, if you write opaque data, it won't be readable cross-platform if the other platforms have different byte ordering or word size. Also, you won't be able to store the data very efficiently in HDF5 and your performance will suffer.

Dave McCloskey

···


Errr...

I think Ramakrishnan confuses serialization with persistence...

-- dimitris

···


Dear Dimitris,

I do understand the difference between serialization and persistence.

Serialization - a process of saving an object into a storage medium.

Persistence - the object stored has the necessary info so that it can be created back while deserializing.

My question relates to serialization only.

I want to know whether it is advisable to have three types of datasets which contain an opaque datatype as one of their components. How will the performance be affected while writing and reading the datasets? Will it be slower when I compare it with datasets which do not have an opaque datatype?

Also, does compression apply to opaque datatypes? (I know it is not effective on variable-length datatypes.)

Given below is your reply in the "Opening datasets expensive?" thread, wherein you mentioned that writing opaque datatypes is faster.

···

-----Original Message-----
From: Dimitris Servis [mailto:servisster@gmail.com]
Sent: Friday, January 18, 2008 4:31 PM
To: hdf-forum@hdfgroup.org
Subject: Re: Opening datasets expensive?

Hi all,

I've only tested with large datasets using:
1) raw files
2) opaque datatypes (serialize with another lib and save data as opaque type)
3) native HDF structures with variable arrays
4) In memory buffered native HDF structures
5) Breakdown of structures to HDF5 native arrays

These are my results for writing files ranging in total from 1GB to 100GB:

Writing the opaque datatype is always faster than a raw file, but that does not account for the time to serialize.
Writing HDF5 native structures, especially with variable arrays, is always slower, by up to a factor of 2.
Rearranging data into fixed-size arrays is usually about 20% slower.

On NFS things seem to be different, with HDF5 outperforming raw file in most cases.

Reading a dataset is usually faster with the raw file but faster with HDF5 after the first time!

Overwriting a dataset is always significantly faster with HDF5.

In all cases, writing opaque datatypes is always faster.

HTH

Regards

Ramakrishnan


Hi Ramakrishnan,

No offense; I did not say you don't understand the difference, I said you probably confuse them. I see it the other way around:

serialization - The object stored has necessary info so that it can be
created back while deserializing.

persistence - A process of saving an object into a storage medium.

There, I think that fixes it ;-) My point is that you first have to decide what you want to do: sure, you can save an opaque data type into an HDF5 file, but is that what you really want to do? My rule of thumb goes like this:

- If you want different instances of the same application to access a volatile state of one of them, you are closer to serialization. Note, the serialized object tree can be stored in any resource: flat file, database, HDF5, network, etcetera.
- If you want (a) to share data between clients and platforms (hw, sw, OSes, VMs) at any point in space and time (no need to version, share business, etcetera) and (b) to have random access to this data, you should look for persistence.

Now whether that's HDF5, MySQL, SQLite and so on will depend on the way your clients are designed to work with the data and on the data itself. Many of us are more than willing to design a toolkit that persists the collection of objects and pointed-to arrays you described, because this effort pays off under one or more of the above-mentioned circumstances. Mix-and-match strategies exist in cases where you might need to store both "native" HDF5 datasets and opaque data, for a plethora of reasons. Still, even though the two might coexist, it's difficult for me to say whether one can replace the other under the same requirements. That's why I am saying you confuse the two: they're not interchangeable. Or maybe your requirements are not clear.

HTH

-- dimitris

···


Dear Dimitris,

I will put my requirement in a different way. I have to write into an HDF5 file a stream of bytes (the values written can be an int, a double, a char, anything). If I am to write a stream of bytes, I will just define an opaque datatype with the number of bytes that need to be written.

In case I do not know the number of bytes to be written into the file beforehand, I can read all the values (bytes) into a buffer and then finally create the opaque datatype with the required number of bytes and write the byte stream into the dataset.

But what if I want to write into the file as I read in the byte stream? Can I do that in HDF5 without knowing the actual size of the bytes I am writing? Do I bring in a variable-length opaque datatype (can it exist?)?

Regards
Ramakrishnan

···


That would work for a single stream of bytes, yes. If Ramakrishnan wanted a dataset with elements of unlimited length, then he's right, he could use a variable-length datatype with a base type of a 1-byte opaque datatype.

We've intended the difference between opaque datatypes (especially 1-byte sized ones) and native unsigned chars to be that opaque datatypes are "blobs of bytes" and aren't interpreted by the HDF5 library, while native unsigned chars are numeric values and can be converted to other datatypes. Opaque datatypes are probably the best to use for uninterpretable sequences of bytes.

  Quincey
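A minimal sketch of that variable-length idea (hypothetical names; 1.6-style H5Dcreate), in which each dataset element is an arbitrary-length blob of uninterpreted bytes:

#include "hdf5.h"

/* Write one variable-length blob of raw bytes as a single dataset element. */
herr_t write_blob(hid_t file, const void *bytes, size_t nbytes)
{
    hid_t byte_t = H5Tcreate(H5T_OPAQUE, 1);   /* 1-byte opaque base type */
    H5Tset_tag(byte_t, "raw byte");
    hid_t vlen_t = H5Tvlen_create(byte_t);     /* variable-length sequence of those bytes */

    hsize_t dims[1] = {1};
    hid_t space = H5Screate_simple(1, dims, NULL);
    hid_t dset  = H5Dcreate(file, "blob", vlen_t, space, H5P_DEFAULT);

    hvl_t elem;
    elem.len = nbytes;
    elem.p   = (void *)bytes;                  /* the library copies from this buffer */
    herr_t status = H5Dwrite(dset, vlen_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, &elem);

    H5Dclose(dset); H5Sclose(space); H5Tclose(vlen_t); H5Tclose(byte_t);
    return status;
}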

···

On Feb 5, 2008, at 1:19 AM, Francesc Altet wrote:


If you want to write a byte stream without knowing how long it is going
to be, my advice is to use a 1-D dataset of bytes (for example,
H5T_NATIVE_UCHAR) with unlimited dimension (H5D_UNLIMITED). I think
that would be the easiest path for you.
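A rough sketch of that suggestion (hypothetical names; the chunk size is an arbitrary assumption; H5Dextend is the 1.6-era call, newer releases prefer H5Dset_extent): an extendible 1-D dataset of H5T_NATIVE_UCHAR that grows each time another piece of the stream arrives.

#include "hdf5.h"

/* Append n bytes from buf to an extendible 1-D byte dataset, tracking its current size. */
static void append_bytes(hid_t dset, const unsigned char *buf, hsize_t n, hsize_t *cur_size)
{
    hsize_t new_size[1] = { *cur_size + n };
    H5Dextend(dset, new_size);                  /* grow the dataset */

    hid_t filespace = H5Dget_space(dset);
    hsize_t offset[1] = { *cur_size };
    hsize_t count[1]  = { n };
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, NULL, count, NULL);

    hid_t memspace = H5Screate_simple(1, count, NULL);
    H5Dwrite(dset, H5T_NATIVE_UCHAR, memspace, filespace, H5P_DEFAULT, buf);

    H5Sclose(memspace); H5Sclose(filespace);
    *cur_size = new_size[0];
}

int main(void)
{
    hid_t file = H5Fcreate("stream.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* Start empty; if your HDF5 build rejects a zero-sized initial extent,
     * start at the first buffer's size instead. */
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    hid_t space = H5Screate_simple(1, dims, maxdims);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);  /* unlimited dimensions require chunking */
    hsize_t chunk[1] = {64 * 1024};              /* chunk size is a tuning assumption */
    H5Pset_chunk(dcpl, 1, chunk);

    hid_t dset = H5Dcreate(file, "stream", H5T_NATIVE_UCHAR, space, dcpl);  /* 1.6-style call */

    unsigned char part[3] = {1, 2, 3};
    hsize_t size = 0;
    append_bytes(dset, part, 3, &size);          /* call again as more bytes arrive */

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);
    return 0;
}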


···


Hi Ramakrishnan

a 1D dataset of unlimited dimension and H5T_NATIVE_UCHAR type, as Francesc already mentioned, is the best way. Actually, using this approach I have seen no difference between H5T_NATIVE_UCHAR and H5T_NATIVE_OPAQUE. However, be warned that the stream is not platform independent, and therefore possibly unreadable by clients other than the one that wrote it. Plus, there is a big performance penalty with this approach: I've noticed that, compared to optimized direct IO, HDF5 tends to heavily underperform with such arrays, with considerably reduced performance especially for reading. This tends to become a problem especially for larger blobs (>1GB), so it also depends on the size of the data you want to write. However, if writing takes place only once and reading very often, you might want to try chunking and compression, in which case writing is significantly slower but reading is significantly faster; so if you write once but read more than 10 times, the overall performance is better. So it also depends on how you want to use the data. Plus, I've noticed that serializing in memory (for example using XDR) and then writing a sized blob to the disk is always MUCH faster overall than using sizeable arrays. So all in all, with HDF5 you can do almost everything. But again, you should consider where, how, how much and when your data will be used.

HTH

-- dimitris
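For the chunking-plus-compression variant mentioned above, a sketch of the dataset creation property list (the chunk size and compression level are assumptions to be tuned for your data and access pattern):

#include "hdf5.h"

/* Dataset creation property list for a chunked, deflate-compressed 1-D byte dataset. */
hid_t make_compressed_dcpl(void)
{
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[1] = {1024 * 1024};   /* 1 MiB chunks: an assumption, not a recommendation */
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);            /* gzip level 6; compression is applied per chunk */
    return dcpl;                        /* pass to H5Dcreate in place of H5P_DEFAULT */
}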

···


A Tuesday 05 February 2008, Quincey Koziol escrigué:

We've intended the difference between opaque datatypes (especially 1-byte sized ones) and native unsigned chars to be that opaque datatypes are "blobs of bytes" and aren't interpreted by the HDF5 library, while native unsigned chars are numeric values and can be converted to other datatypes. Opaque datatypes are probably the best to use for uninterpretable sequences of bytes.

I wasn't aware of this interpretation. I'll keep it in mind.

Cheers,

···


A Tuesday 05 February 2008, Dimitris Servis escrigué:

Hi Ramakrishnan

a 1D dataset of unlimited dimension and H5T_NATIVE_UCHAR type, as
Francesc already mentioned, is the best way. Actually using this
approach, I have seen no difference between using H5T_NATIVE_UCHAR
and H5T_NATIVE_OPAQUE. However, be warned that the stream is not
platform independent, therefore possibly unreadable from other
clients than the one that wrote it. Plus, there is a big performance
penalty using this approach. I've noticed that compared to optimized
direct IO, HDF5 tends to heavily underperform with such arrays.
I've noticed considerably reduced performance especially for reading.
This tends to become a problem especially for larger blobs (>1GB). So
it also depends on the size of the data you want to write. However,
if writing takes place only once and reading very often, you might
want to try chunking and compression, in which case writing is
significantly slower, but reading is significantly faster, so if you
write once but read more than 10 times the overall performance is
better.

But when using unlimited dimensions you need chunking, right? So, provided that you choose a sensible chunksize, I'd say that you can reach fairly decent performance (in both writing and reading) with chunked datasets, unless I'm missing something.

> So it also depends on how you want to use the data. Plus I've
> noticed that serializing in memory even for example using XDR and
> then writing a sized blob to the disk is always MUCH faster overall
> than using sizeable arrays.

That's pretty interesting. Can you say how much speed-up you could typically attain by doing this?

Thanks,

···


Hi Francesc,

> But when using unlimited dimensions you need chunking, right? So, provided that you choose a sensible chunksize, I'd say that you can reach fairly decent performance (in both writing and reading) with chunked datasets, unless I'm missing something.

I am not sure I got you. In my experience, chunking exhibits reduced performance compared to fixed-size arrays, and this is because the library has to create and access the chunks when performing IO. A fixed-size array is allocated early, and writing a blob of H5T_NATIVE_OPAQUE type to it is almost as fast as writing to a flat file. My tests show that reading that array takes as much time as writing it with HDF5, at least up to 1GB. However, a nicely tweaked flat file can be read in an order of magnitude less time (WXPx64). But let me note that writing - say - a blob of 100MB 10 times in a loop to a flat file appears to be slower than writing 10 chunks of 100MB using HDF5, by about 5-6%. This shows that HDF5 probably does a better job of managing the address space.

> > So it also depends on how you want to use the data. Plus I've noticed that serializing in memory even for example using XDR and then writing a sized blob to the disk is always MUCH faster overall than using sizeable arrays.
>
> That's pretty interesting. Can you say how much speed-up you could typically attain by doing this?

About 20%, growing with the blob size, and it can get up to 100%. That is, when I compare writing a 1D dataset of structures to writing an in-memory serialized blob. Note, though, that I do most tests on WXPx64 using HDF5 1.6.6, and I care about really large file sizes. On Linux the picture may be different, but I don't have such extensive tests yet.

> Thanks,

My pleasure

BR

-- dimitris

Hi Dimitris,


Thanks for your response. I presume that, in your benchmarks, you have
tried to find an optimized chunksize for your needs, haven't you?

···


Yup, I can usually calculate the chunk size that best fits my data.
