Pablo, how are you? There's nothing we could share with you at the moment.
Have you read the specification? Do you have any comments/concerns?
(We are currently revising the spec...)
A related effort is OPeNDAP (http://www.opendap.org/).
Would you mind describing your use case?
Best, G.
···
-----Original Message-----
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Pablo Rozas Larraondo
Sent: Thursday, March 20, 2014 6:20 PM
To: hdf-forum
Subject: [Hdf-forum] HDF5 RESTful API
I found out about the project on creating a RESTful API to interact with HDF5 from a question on Stack Overflow:
which points to this entry in this mailing list from early 2013:
Thanks for your answer. I'm going to explain my view of an HDF5 RESTful
API and why we are looking into it at the moment.
I know about the OPeNDAP project, but it is focused only on publishing
data. The idea of a RESTful API would be to extend these capabilities and
create a mapping from HDF5 CRUD operations (create, retrieve, update,
delete) onto URLs. This is more or less like treating HDF5 files as
databases, which is quite a challenge in terms of keeping the files
consistent in a concurrent-access environment. (I think the next HDF5
version will implement single-writer/multiple-reader (SWMR) functionality,
which will make this task much easier and more efficient.)
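For illustration, the kind of mapping I have in mind is roughly this (the paths below are only hypothetical, nothing is decided):

  POST   /groups/{group-id}/datasets       -> create a new dataset
  GET    /datasets/{dataset-id}/value      -> retrieve (a slice of) the data
  PUT    /datasets/{dataset-id}/value      -> update values
  DELETE /datasets/{dataset-id}            -> delete a dataset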
One of the first questions I would ask myself is: Why do we want to wrap
HDF5, one of the neatest and most efficient systems for data I/O,
in a RESTful API, which is maybe the slowest and most inefficient way
of transmitting data? (For RESTful APIs it's good practice to encode
all client/server data interchange as JSON.)
Answer 1: Because it gives great visibility/usability to our data.
Making use of RESTful technologies, we can easily create web
applications or high-level APIs to play with the data. We can also
provide some basic functionality to append new data or manage
datasets.
Answer 2: This is completely silly. If you really want to do that,
storing your data in flat text files wouldn't make any difference in
performance. HDF5 files normally store raw data and nobody wants to
consume gigabytes of raw data that need to be transmitted and
processed on the client side.
For me, both answers are right; that's why my approach would be to
create some kind of high-level RESTful API which adds more
intelligence on the server side, such as subsetting, aggregation or
statistical processing. As this is highly dependent on the type of
data stored in the HDF5 files, it would be a challenge to do something
generic that covers most cases. At the moment we are trying to come
up with a solution to implement all this in our project. So far,
Python seems to be the way to go for us, and we are starting to
implement something mainly based on h5py, pandas and Flask.
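To make that a bit more concrete, here is the kind of minimal sketch we are experimenting with (the file name, dataset path and query parameters below are only placeholders, nothing about this is settled):

# rest_subset_sketch.py -- illustrative only
from flask import Flask, jsonify, request
import h5py
import numpy as np

app = Flask(__name__)
HDF5_FILE = "data.h5"  # placeholder path

@app.route("/datasets/<path:dset_path>/mean", methods=["GET"])
def dataset_mean(dset_path):
    # Server-side aggregation: compute the mean of a 1-D slice
    # instead of shipping the raw values to the client.
    start = int(request.args.get("start", 0))
    stop = request.args.get("stop")
    with h5py.File(HDF5_FILE, "r") as f:
        dset = f[dset_path]
        stop = int(stop) if stop is not None else dset.shape[0]
        value = float(np.mean(dset[start:stop]))
    return jsonify({"dataset": dset_path, "start": start, "stop": stop, "mean": value})

if __name__ == "__main__":
    app.run()

A GET on /datasets/temperature/mean?start=0&stop=1000 would then return a small JSON document rather than the raw slice itself.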
I hope I've been clear in my explanation. Sorry about the length of
the post; any feedback on this topic will be highly appreciated.
···
... a RESTful API, which is maybe the slowest and most inefficient way of transmitting data? (For RESTful APIs it's good practice to encode all client/server data interchange as JSON.) ...
There are a variety of binary protocols these days, binprot and b-json among them, which would
be more efficient, and good enough for medium-sized data applications.
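A quick back-of-the-envelope check of the size difference (plain Python, numbers are approximate):

import json
import numpy as np

a = np.random.random(100000)       # 100,000 float64 values
raw = a.tobytes()                  # binary representation: exactly 800,000 bytes
text = json.dumps(a.tolist())      # JSON text: typically 2-3x larger

print(len(raw), len(text))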
... Answer 2: This is completely silly. ... nobody wants to consume gigabytes of raw data that need to be transmitted and processed on the client side. ...
The HDF5 files may contain results which need to be visualized on the
client machine. IMHO, interactive visualization of data of any significant size is still
an area where current web technology leaves something to be desired, and a fat
client is necessary.
···
Hi
I am totally new to HDF5. Lately, I replicated a database in HDF5 format and created an API in C for it. But when I read a huge compound array from my database for the first time, it is slower than the original API (also in C), which reads the same information by parsing a hierarchical text database. When I run my software a second time to do the same thing, the performance gets a little better, but it is still slower than the original one. So my question is: is this expected behaviour?
···
Pablo, thanks for the explanation. Here are a few comments:
... concurrent access environment ...
I agree with you. In a REST context, this is a lot easier, because of statelessness.
And who knows if there are any HDF5 files at all? All a user/application sees is resources
and representations; there is no way of telling whether they came from an HDF5 file...
Why HDF5/REST?
Answer 1: Because it gives great visibility/usability to our data.
I'm with you on that one.
Answer 2: This is completely silly.
Yes, if pushed to the extreme; but that's true for anything, isn't it?
One of the nice things about HTTP is content negotiation. Yes, we should support
the mainstream formats (JSON, XML), but there's room for others, such as Avro.
Another option is connection upgrade, which could be used to stream data over a WebSocket.
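As a rough illustration of how that could look on the server (Flask here, and the media types are just examples, not a final list):

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/datasets/<path:dset_path>/value")
def dataset_value(dset_path):
    # Pick a representation based on the client's Accept header.
    best = request.accept_mimetypes.best_match(
        ["application/json", "application/xml"], default="application/json")
    if best == "application/json":
        return jsonify({"dataset": dset_path, "value": [1, 2, 3]})  # dummy payload
    # An XML (or Avro, ...) representation would be produced here instead.
    return app.response_class("<value>1 2 3</value>", mimetype="application/xml")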
I'd be interested in your thoughts on the resource and URI structure,
i.e., how (in-)convenient it is for the kinds of applications you're
thinking about. The main challenge for us is to come up with something that's
general, i.e., not specific to a particular application domain, and useful
at the same time.
···
What platform were you on?
If you were on a Linux system, it is possible that the second run, if done right after the first,
was faster because Linux may have cached all the data files in memory, provided the memory is larger
than all the file data read. If you run the same thing 3 times in a row and there is little change
in performance between the second and third runs, kernel memory caching could be
the reason.
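One quick way to check (a sketch in Python with h5py; the file and dataset names are placeholders, and your API is in C, but the pattern is the same):

import time
import h5py

def time_read(path, dset_name):
    # Time one full read of the dataset into memory.
    t0 = time.time()
    with h5py.File(path, "r") as f:
        _ = f[dset_name][...]
    return time.time() - t0

# Run it three times in a row; if runs 2 and 3 are both much faster than
# run 1 and close to each other, the kernel page cache is the likely cause.
for i in range(3):
    print(i, time_read("mydata.h5", "mytable"))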
Note that what I offered is a possible reason. Without studying the design and implementation
of your software, I can't say for sure that is the reason for the improvement.
As for why your HDF5-based design is slower than the original API, it is hard to determine
without studying your implementation. How much slower did you observe?
Is it just 10-20% slower, or is it orders of magnitude slower?