Efficient serialization of HDF5 data

Dear HDF experts,

I am building an application that operates on NetCDF data using Big Data
technologies.

My design aims at avoiding unnecessarily writing data to disk. Instead, I
want to operate as much as possible in memory. The challenge is data
(de)serialization for distributed communications between computing nodes.

Since NetCDF4 and HDF5 already provide a portable data format, a simple and
efficient design would access the raw binary data directly and exchange it
over the network.

So far, I have not managed to access this buffer without creating files. I
am investigating the Apache Commons VFS RAM file system as a way to trick
NetCDF into working in memory.

However, a suggestion on the NetCDF Java mailing list (see ticket MQO-415619)
was to build an alternative to the core driver. I feel this is the more
desirable course of action, since it improves the existing solutions
instead of working around their limitations.

Do you think this approach is feasible? Any starting pointers would be
appreciated!

Kind regards,

Michaël

Hello Michaël!

In reply to Michaël Melchiore's message of 04.12.2017, 21:23:

I am probably not a distinguished HDF5 expert, but I would suggest taking a look at the HDF5 Spark connector:
https://www.hdfgroup.org/downloads/spark-connector/

It would be great if you could share your experience and whether the Spark connector helped you implement in-memory processing.

Best wishes,
Andrey Paramonov

Have you looked at "diskless" files in netCDF? They are created in memory.

Also have a look at netCDF's support for DAP. Perhaps what you want is to
read a diskless file through DAP. I'm not sure if that is possible...

Ed Hartnett

_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5

Dear Andrey,

While Apache Spark does aim to work in memory when possible, my need is
not specific to Spark. There are many alternatives to Spark that can be
used for in-memory processing (Apache Storm, Apache Flink, Google
Dataflow...).
I have registered for more information about the Spark connector, but I
am not sure it is what I am looking for.

Kind regards,

Michaël

Dear Michaël,
Have you tried using the core driver with a file image? Seems to me that
this is what you want to do, see H5Pset_file_image
<https://support.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetFileImage>.
This enables you to "open" the file data in memory and then retrieve it
again after you've finished operations, using H5Fget_file_image.

We have previously used this for networked HDF5-based data transfer;
admittedly with small data instead of big data, but the disk access
overhead was unacceptable in that case too.

Cheers,
Martijn
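
A minimal sketch of the file-image round trip suggested above, assuming Python with the h5py and NumPy packages (neither is mentioned in this thread). h5py's get_file_image and set_file_image wrap H5Fget_file_image and H5Pset_file_image. The function names serialize and deserialize, the file-name labels, and the flat layout with only root-level datasets are hypothetical choices made for illustration.

```python
import h5py
import numpy as np


def serialize(arrays):
    """Write arrays into an in-memory (core driver) HDF5 file and return its bytes."""
    # backing_store=False: the file name is only a label, nothing is written to disk.
    with h5py.File("sender.h5", "w", driver="core", backing_store=False) as f:
        for name, data in arrays.items():
            f.create_dataset(name, data=data)
        f.flush()  # make sure the image reflects everything written so far
        return f.id.get_file_image()  # wraps H5Fget_file_image


def deserialize(image):
    """Open an HDF5 file image held in memory and read its root-level datasets."""
    fapl = h5py.h5p.create(h5py.h5p.FILE_ACCESS)
    fapl.set_fapl_core(backing_store=False)  # core driver, no file on disk
    fapl.set_file_image(image)               # wraps H5Pset_file_image
    fid = h5py.h5f.open(b"receiver.h5", flags=h5py.h5f.ACC_RDONLY, fapl=fapl)
    with h5py.File(fid) as f:
        # Assumes every root-level member is a dataset (no groups).
        return {name: f[name][...] for name in f}
```

The bytes returned by serialize can be sent over any transport (socket, message queue, RPC payload) and passed to deserialize on the receiving node. Note that set_file_image copies the buffer, so the image is held in memory twice for a short time.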

Dear Martijn,

Yes, this is very promising. Thank you for bringing this to my attention.

Michaël
