MongoDB compared to HDF5?

Hi everyone, I'm trying to find the best fit for time series data (a lot:
say, 1 sample every 10 ms for 10 hours, never updated, only appended and
then read back), and I'd like your opinion on MongoDB compared to HDF5.
Which one is the better fit? Which one is more performant? Any other
pros/cons for one or the other? Thanks a lot, Guillaume.

···

--
View this message in context: http://hdf-forum.184993.n3.nabble.com/mongodb-compared-to-HDF5-tp4025922.html
Sent from the hdf-forum mailing list archive at Nabble.com.

Guillaume, how are you? This is an interesting question, but there are
several omissions and assumptions that make it rather ill-posed.

The omissions have to do with what you didn't tell us (I'll come back to
that in a moment).
The assumptions have to do with an unspecified basis on which HDF5 and
MongoDB are supposed to be comparable.
(I won't spend time on this second point, except to state that, apart from
trivial situations, there is no basis for such a comparison. HDF5 and
MongoDB are two very different animals, which raises several interesting
possibilities for using them together. More on that soon...)

In any event, I suggest you spend some quality time with both candidates.
Have a look at PyTables, install MongoDB, and kick the tires. For
prototyping,
both are fun to play with. For a production solution, you need to ask and
answer
many more questions.

My first question for you would be, 'What's the data life cycle of your
data?'
You told us something about the acquisition, then what? (cleaning,
transformation,
products, distribution, (re-)use, archival, any of those?) What about the
underlying
model and the metadata that go with that?

At the indicated rate, you'll acquire about 3.6 million samples in 10 hours.
What's the size of an individual sample? How similar are individual samples?
By 'similar' I mean structure and value, i.e., how compressible are they?
Are they strings, or numbers disguised as strings?
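A quick back-of-envelope on those numbers, using the 4-to-256-byte sample
sizes you mention (any fixed sample size here is, of course, an assumption):

```python
# Back-of-envelope sample count and raw payload at the stated rate.
samples_per_second = 1000 // 10             # one sample every 10 ms
n_samples = samples_per_second * 10 * 3600  # over 10 hours
print(n_samples)                            # 3600000

# Raw payload for hypothetical fixed sample sizes of 4 and 256 bytes:
for size_bytes in (4, 256):
    print(size_bytes, "bytes/sample ->", n_samples * size_bytes / 1e6, "MB")
```

So the raw payload sits somewhere between roughly 14 MB and 1 GB per
10-hour run, before any compression or per-record overhead.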

How many JSON/BSON documents were you thinking about?
(MongoDB's current BSON document size limit is 16MB.)

Do you need MongoDB sharding across instances on EC2?

How will your acquisition rate change in the future? (It will surely go
up...)
How do you access the data? What are the interface constraints of your
clients?

In terms of raw read/write performance, I don't see a scenario where MongoDB
has a chance
to beat HDF5. This doesn't mean that MongoDB couldn't be sufficient for your
purposes.

MongoDB lets you create indexes out-of-the-box. Plain HDF5 has no such
mechanism built-in.
(PyTables does and there are add-ons for HDF5 such as FastBit.)
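To make that concrete, here is a sketch of an indexed condition query with
PyTables (the table layout, column names, and file name are hypothetical;
plain HDF5 has no equivalent call):

```python
import tables as tb

# Hypothetical two-column sample table: a timestamp and a value "b".
class Sample(tb.IsDescription):
    t = tb.Float64Col(pos=0)
    b = tb.Int32Col(pos=1)

# In-memory HDF5 file so the sketch leaves nothing on disk.
with tb.open_file("series.h5", "w", driver="H5FD_CORE",
                  driver_core_backing_store=0) as f:
    table = f.create_table("/", "samples", Sample)
    table.append([(i * 0.01, i % 50) for i in range(1000)])
    table.flush()

    table.cols.b.create_index()          # out-of-core index on column b
    hits = table.read_where("b == 12")   # "every row where column B equals 12"
    print(len(hits))
```

The query works with or without the index; the index only makes it faster
on large tables.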

These are just a few pointers for your homework. Keep us posted on how
you're getting on!

My parting comment would be this: if you're after building a long-term
archive of large time series data, the idea of using MongoDB strikes me as
rather silly. It wasn't made for that; it's a document database, remember?
On the other hand, using MongoDB as the catalog for metadata and to publish
time series excerpts
and aggregates is a perfectly sensible and efficient solution.
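As a sketch of that division of labor (all field names, paths, and values
below are hypothetical): the bulk samples stay in HDF5, while MongoDB holds
one small catalog document per acquisition run plus published excerpts or
aggregates, each far below the 16 MB BSON limit. Shown as plain Python
dicts rather than live pymongo calls:

```python
import json

# Hypothetical catalog entry: one small MongoDB document per run,
# pointing at the HDF5 file that holds the bulk time series.
run_doc = {
    "run_id": "run_0042",
    "hdf5_path": "/archive/2013/run_0042.h5",
    "dataset": "/samples",
    "sample_period_ms": 10,
    "n_samples": 3_600_000,
}

# A published aggregate over a one-minute window: tiny, queryable, and
# indexable in MongoDB without ever touching the raw archive.
window_doc = {
    "run_id": run_doc["run_id"],
    "window_ms": [0, 60_000],
    "stats": {"count": 6_000},  # 100 samples/s over 60 s
}

# Both serialize to documents of a few hundred bytes:
print(len(json.dumps(run_doc)), len(json.dumps(window_doc)))
```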

Best, G.


Hi Gerd,

I did take a look at PyTables and MongoDB; it is indeed fun!

Well, the life cycle of my data would mainly be archival and data lookup on conditions (like "retrieve every row where column B equals 12", much like what PyTables and MongoDB can do with queries).
A sample can be anywhere from 4 bytes to 256 bytes (strings); it can be an int, a double, a string, or something else.

Sharding is surely a plus that MongoDB offers out-of-the-box, but it's not really mandatory for me.

The acquisition rate may not really go up; it's mainly samples from sensors, and capturing at a higher rate would not really make sense.

If I understand you correctly, you're saying that using MongoDB for time series makes about as little sense as using HDF5 for storing documents?

Where can I find examples for FastBit?

Thanks,
Guillaume.


Guillaume, how are you?

> Well, the life cycle of my data would mainly be archival and data lookup
> on conditions (like "retrieve every row where column B equals 12", much
> like what PyTables and MongoDB can do with queries).
> A sample can be anywhere from 4 bytes to 256 bytes (strings); it can be
> an int, a double, a string, or something else.

It's tempting to use the same storage layout for acquisition, archival, and
retrieval, and in simple cases this might even work. In general, though,
it's not such a great idea.

> If I understand you correctly, you're saying that using MongoDB for time
> series makes about as little sense as using HDF5 for storing documents?

That's one way of putting it. You can obviously mimic storing documents in
HDF5,
the same way you can mimic storing time series in MongoDB. And mimicking is
good
enough, sometimes. It really depends on what your expectations for quality
are.

> Where can I find examples for FastBit?

Here's a quotation from John Wu's earlier posting:

"Both FastQuery and FastBit are available in source code form

FastQuery http://codeforge.lbl.gov/projects/fastquery
FastBit http://codeforge.lbl.gov/projects/fastbit

Feel free to join FastBit mailing list
<https://hpcrdm.lbl.gov/pipermail/fastbit-users> to post your questions
regarding FastBit and FastQuery."

Best, G.


_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org