HDF5 vs. XML

We are trying to better understand the relative merits of the XML and HDF5 file formats for a new project. Does anyone know of papers or studies, qualitative or quantitative, that looked at parameters that might affect such a decision?

The project needs to store equipment sensor data covering specified time periods, along with metadata about the data and equipment. There will be many thousands of files, which may contain binary data and matrices.

XML is the default selection, chiefly because it is ubiquitous and there is a rich toolset supporting it. This translates directly to lower development and maintenance costs. But as file size, the amount of binary data, and the number of matrices increase, XML becomes less efficient to work with.

NOTE 1: Because XML can be compressed, resulting in much smaller files, we are treating compressed XML as a separate file format, cXML, for the purposes of our investigation.

NOTE 2: We plan to use Base64 encoding for binary data in XML.
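To put a rough number on the two notes above: Base64 emits 4 output bytes for every 3 input bytes, so the encoded payload is about a third larger than the raw binary before any compression is applied. A minimal sketch for checking this on your own data follows; the function name `encoded_sizes` and the random test payload are made up for illustration, and random bytes are a worst case for compression, so real sensor data may behave differently.

```python
import base64
import gzip
import os


def encoded_sizes(raw: bytes) -> dict:
    """Return byte counts for the raw, Base64, and gzip-compressed Base64 forms."""
    b64 = base64.b64encode(raw)
    return {
        "raw": len(raw),
        "base64": len(b64),
        "gzip_base64": len(gzip.compress(b64)),
    }


if __name__ == "__main__":
    # Random bytes stand in for sensor samples (an assumption, not real data).
    print(encoded_sizes(os.urandom(3_000_000)))
```

Running it against a sample of real sensor payloads would show how much of the Base64 overhead the cXML variant wins back.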

Parameters we feel are important include:

1. Time to create the files.
2. File sizes.
3. Time to read the files.

Our plan is to generate fictitious but representative data files of various sizes, with varying amounts of binary data and numbers of matrices, and record the above parameters. Mapping this information to our use cases should then give us usable empirical data with which to make a better-informed decision about file formats.
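As a rough idea of what such a measurement harness could look like, here is a sketch in Python using only the standard library. It covers the XML and cXML cases; an HDF5 writer/reader pair would plug into the same `measure` function. The element names (`record`, `data`), the helper names, and the matrix size are all assumptions chosen for illustration, not part of any existing schema.

```python
import base64
import gzip
import os
import struct
import tempfile
import time
import xml.etree.ElementTree as ET


def write_xml(path, matrix, compress=False):
    """Write a matrix as Base64-encoded little-endian float64 inside a small XML document."""
    rows, cols = len(matrix), len(matrix[0])
    flat = [v for row in matrix for v in row]
    payload = struct.pack(f"<{rows * cols}d", *flat)
    root = ET.Element("record", rows=str(rows), cols=str(cols))  # hypothetical element names
    ET.SubElement(root, "data", encoding="base64").text = base64.b64encode(payload).decode("ascii")
    opener = gzip.open if compress else open
    with opener(path, "wb") as fh:
        fh.write(ET.tostring(root))


def read_xml(path, compress=False):
    """Read the matrix back; the whole document is parsed and decoded."""
    opener = gzip.open if compress else open
    with opener(path, "rb") as fh:
        root = ET.fromstring(fh.read())
    rows, cols = int(root.get("rows")), int(root.get("cols"))
    flat = struct.unpack(f"<{rows * cols}d", base64.b64decode(root.find("data").text))
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def measure(label, writer, reader, matrix, path):
    """Record write time, file size, and read time for one format."""
    t0 = time.perf_counter()
    writer(path, matrix)
    t1 = time.perf_counter()
    result = reader(path)
    t2 = time.perf_counter()
    assert result == matrix  # round-trip check
    return {"format": label, "write_s": t1 - t0,
            "size_bytes": os.path.getsize(path), "read_s": t2 - t1}


if __name__ == "__main__":
    matrix = [[float(r * 500 + c) for c in range(500)] for r in range(200)]  # arbitrary size
    with tempfile.TemporaryDirectory() as tmp:
        for label, compress in (("XML", False), ("cXML", True)):
            path = os.path.join(tmp, label + ".dat")
            print(measure(label,
                          lambda p, m: write_xml(p, m, compress),
                          lambda p: read_xml(p, compress),
                          matrix, path))
```

Timings from a single run are noisy, so repeating each measurement and varying the matrix size would be needed before drawing conclusions.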

The study should also give us some insight into the technical issues of supporting an HDF5 capability, which will need to be factored in.

Comments/thoughts on the above are appreciated.

Tim, Happy New Year! I'm not aware of any comparative study.
(It'd be comparing apples and oranges: HDF5 is a smart data container.
XML is a document/message format.) Please add it to the Mendeley HDF group
(http://www.mendeley.com/groups/3317921/hdf/papers/) if you happen to come
across something.

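Have you considered a hybrid approach, e.g., XDMF or SDCubes?

http://www.mendeley.com/catalog/enhancements-extensible-data-model-format-xdmf/

http://www.mendeley.com/catalog/adaptive-informatics-multifactorial-highcontent-biological-data/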

My main concern would be that a pure XML approach will force you to
reinvent (and maintain!) a lot of infrastructure in XML that's built into HDF5
and that's transparent to end users. Not only will it not perform at the level HDF5 does,
it'll also confuse your users. E.g., using base64-encoded, compressed binary values is OK,
as long as you always want to decompress the entire value and not just
subsets of it. Would you really want to mimic chunking/tiling in XML?
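To make the chunking question concrete, here is a rough sketch of the kind of infrastructure a pure XML approach would have to carry itself in order to read a subset without decoding everything. It compresses fixed-size chunks independently and decodes only the ones a slice touches. The names `encode_chunks`, `read_slice`, and `CHUNK_VALUES` are hypothetical, and in a real file each string would sit in its own XML element, with the chunk layout described in a schema you would have to define and maintain.

```python
import base64
import struct
import zlib

CHUNK_VALUES = 1024  # values per independently compressed chunk (arbitrary choice)


def encode_chunks(values):
    """Split values into fixed-size chunks; compress and Base64-encode each one separately."""
    chunks = []
    for start in range(0, len(values), CHUNK_VALUES):
        part = values[start:start + CHUNK_VALUES]
        packed = struct.pack(f"<{len(part)}d", *part)
        chunks.append(base64.b64encode(zlib.compress(packed)).decode("ascii"))
    return chunks


def read_slice(chunks, start, stop):
    """Return values[start:stop], decoding only the chunks that overlap the range.

    Assumes 0 <= start <= stop <= total number of values.
    """
    out = []
    first, last = start // CHUNK_VALUES, (stop - 1) // CHUNK_VALUES
    for index in range(first, last + 1):
        packed = zlib.decompress(base64.b64decode(chunks[index]))
        part = struct.unpack(f"<{len(packed) // 8}d", packed)
        lo = max(start - index * CHUNK_VALUES, 0)
        hi = min(stop - index * CHUNK_VALUES, len(part))
        out.extend(part[lo:hi])
    return out
```

This handles only one dimension and one data type; tiling matrices, choosing chunk sizes, and keeping readers in sync with the layout would all be additional work.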

Best, G.

-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Tim
Sent: Tuesday, December 31, 2013 5:06 PM
To: HDF Forum
Subject: [Hdf-forum] HDF5 vs. XML

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://mail.lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org

Tim,

I would agree with Gerd that this comparison is a bit of apples and oranges…

I do a lot of XML and, in fact, many people consider me to be an XML zealot, so I would agree that there are a lot of tools out there in XML Land. However, I am not familiar with any tools for dealing with binary data packed in XML (they may be there, but I am not familiar with them). The “available tools” point is, therefore, a bit hard to understand in this context…

You mention compression of XML. Gerd is correct that this is whole-file compression: you need to uncompress the whole file in order to do anything with it. The compression approach used in HDF is much more intelligent. It compresses different datasets in the file independently and uncompresses only what you need. This optimizes file sizes and access speeds.
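For comparison, a minimal sketch of the HDF5 side using the h5py and NumPy Python packages (both assumed to be installed). The dataset is stored in gzip-compressed chunks, and reading a range of rows only touches the chunks that cover it. The dataset path `sensor/readings`, the attribute names, and the chunk shape are made up for illustration.

```python
import os
import tempfile

import h5py
import numpy as np


def write_sensor_file(path, readings, attrs):
    """Store a 2-D array as a chunked, gzip-compressed dataset with metadata attributes."""
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "sensor/readings",  # hypothetical group/dataset name
            data=readings,
            chunks=(min(100, readings.shape[0]), readings.shape[1]),
            compression="gzip",
        )
        for key, value in attrs.items():
            dset.attrs[key] = value


def read_rows(path, start, stop):
    """Read rows start..stop-1; HDF5 decompresses only the chunks covering that range."""
    with h5py.File(path, "r") as f:
        return f["sensor/readings"][start:stop]


if __name__ == "__main__":
    data = np.random.default_rng(0).normal(size=(10_000, 16))  # stand-in sensor data
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.h5")
        write_sensor_file(path, data, {"units": "volts", "sensor_id": "hypothetical-01"})
        print(read_rows(path, 5000, 5010).shape)
```

The chunk shape is a tuning parameter; it should roughly match how the data will be read back.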

You also mention thousands of files. HDF would almost certainly give you many more aggregation options than XML, with groups and potentially virtual datasets that provide an access framework for groups of files…

XML is really great for metadata and we are doing quite a bit of work with XML representations of the metadata in HDF files. This involves an HDF tool for extracting the metadata in XML for processing independent of the data. Gerd mentioned a couple of similar projects. I would add NeXus, which is doing quite a bit with XML and HDF (see http://download.nexusformat.org/doc/html/design.html and other related pages)…

Jim Collins has written about the “Tyranny of the Or” where organizations decide between X and Y. This contrasts with the “Power of the And”. I would encourage you to think about how XML and HDF can most effectively be used together rather than trying to choose between them…

Ted

By the way, you mentioned that you are storing sensor data. I worked with many sensor projects at NOAA and am curious about whether you are considering SensorML (http://www.opengeospatial.org/standards/sensorml) for your metadata.

This might be a better link on the NeXus XML: http://download.nexusformat.org/doc/html/nxdl.html

Ted,

Thanks for your thoughts, especially on the HDF5 metadata extraction projects and the point about avoiding a false dichotomy between XML and HDF5.

Tim.

Hi,

Sorry for the late reply, but I have been on holiday. The link for NeXus is www.nexusformat.org
NeXus started out using HDF-4, added HDF-5 when it became available, and added XML at the request of people
who wish to edit their data in emacs... Nowadays we focus on HDF-5. For your use case, it depends
on the size of the arrays. With the NeXus API we can handle something like 400x2000, and probably
up to 10x more, without problems. But there is a caveat: in order to get there we hacked the XML parser.
And this is the general thing with XML: it was never intended for array data. Everything you do for array
data in XML has to be maintained by you. Thus, if your focus is arrays, you are better off with HDF-5.
C, F77, and Java come out of the box, Python works nicely, and you can load HDF-5 into MATLAB with a single call...

Best Regards,

Mark Koennecke, for the NeXus International Advisory Committee
