Optimising HDF5 data structure

Dear all,

we are using HDF5 in our collaboration to store large event data of neutrino interactions. The data itself has a very simple structure, but I still could not find an acceptable way to design the structure of the HDF5 format. It would be great if some HDF5 experts could give me a hint on how to optimise it.

The data I want to store are basically events, which are simply groups of hits. A hit is a simple structure with the following fields:

Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)
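
As a structured NumPy dtype (which PyTables/h5py turn into an HDF5 compound type) this is roughly the following; just a sketch, everything beyond the field list above is illustrative:

import numpy as np

# the hit record as a structured dtype; the wrappers derive the HDF5 compound type from it
hit_dtype = np.dtype([
    ("dom_id", np.int32),
    ("time", np.int32),
    ("tot", np.int16),
    ("triggered", np.bool_),
    ("pmt_id", np.int16),
])

hits = np.zeros(3000, dtype=hit_dtype)  # one event's worth of hits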

As already mentioned, an event is simply a list of a few thousand hits, and the number of hits changes from event to event.

I tried different approaches to store the information of a few thousand events (thus a couple of million hits), and the final two structures which kind of work but still have poor performance are:

Approach #1: a single "table" to store all hits (basically one array for each hit-field) with an additional "column" (again, an array) to store the event_id they belong to.

This is of course nice if I want to do analysis on the whole file, including all the events, but it is slow when I want to iterate through each event_id, since I need to select the corresponding hits by looking at the event_ids. In pytables or the Pandas framework, this works using binary search index trees, but it's still a bit slow.
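
For reference, the per-event lookup in this approach boils down to something like this (a sketch with pytables; file and table names are made up):

import tables as tb

with tb.open_file("events.h5", "r") as f:
    hits = f.root.hits                            # the one big hit table
    # in-kernel selection; an index on event_id can speed this up
    event_23 = hits.read_where("event_id == 23")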

Approach #2: using a hierarchical structure to store and group the events. The events can then be accessed by reading "/hits/event_id", like "/hits/23", which is a table similar to the one used in the first approach. To iterate through the events, I need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.
It seems that it is only a tiny bit faster to access a specific event, which may be related to the fact that HDF5 stores the nodes in a b-tree, much like pandas does with its index table.

The slowness is compared to a ROOT structure which is also used in parallel. If I compare some basic event-by-event analysis, the same code run on a ROOT file is almost an order of magnitude faster.

I also tried variable-length arrays but I ran into compression issues. Some other approaches involved creating meta tables to keep track of the indices of the hits for faster lookup, but this was kind of awkward and not self-explanatory enough in my opinion.

So my question is: how would an experienced HDF5 user structure this simple data to maximise the performance of the event-by-event readout?

Best regards,
Tamas

Hi Tamas,

So my question is: how would an experienced HDF5 user structure this simple data to maximise the performance of the event-by-event readout?

I see two solutions for your purposes.
First - try to switch from Python to C++ - it's much faster.

http://benchmarksgame.alioth.debian.org/u64q/compare.php?lang=python3&lang2=gpp

Second - I know this is an HDF5 forum, but for such a huge but simple set of data, I would suggest using some SQL engine as a backend.
MySQL or PostgreSQL would be a good choice if you need a full set of relational database engine features for your data analysis, but file-based solutions (SQLite) could also be taken into consideration.
In your case the data would be stored in two tables (hits and events) with a proper key-based join between them.
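
A rough sketch of that layout with the file-based option (Python's stdlib sqlite3; table and column names are only illustrative):

import sqlite3

con = sqlite3.connect("events.sqlite")
con.executescript("""
CREATE TABLE IF NOT EXISTS events (event_id INTEGER PRIMARY KEY);
CREATE TABLE IF NOT EXISTS hits (
    event_id  INTEGER REFERENCES events(event_id),
    dom_id    INTEGER,
    time      INTEGER,
    tot       INTEGER,
    triggered INTEGER,
    pmt_id    INTEGER
);
CREATE INDEX IF NOT EXISTS idx_hits_event_id ON hits(event_id);
""")
# one event's hits via the key-based join / index
rows = con.execute("SELECT * FROM hits WHERE event_id = ?", (23,)).fetchall()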

Regards,
Rafal

Dear Tamas,

we are using HDF5 in our collaboration to store large event data of neutrino interactions. The data itself has a very simple structure, but I still could not find an acceptable way to design the structure of the HDF5 format. It would be great if some HDF5 experts could give me a hint on how to optimise it.

It is a pleasure to see some HEP people here.

The slowness is compared to a ROOT structure which is also used in parallel. If I compare some basic event-by-event analysis, the same code run on a ROOT file is almost an order of magnitude faster.

If I remember properly, ROOT can only read in parallel, not write. Does that matter for you?

Approach #2: using a hierarchical structure to store and group the events. The events can then be accessed by reading "/hits/event_id", like "/hits/23", which is a table similar to the one used in the first approach. To iterate through the events, I need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.
It seems that it is only a tiny bit faster to access a specific event, which may be related to the fact that HDF5 stores the nodes in a b-tree, much like pandas does with its index table.

This approach would create a large number of datasets (one per id), which is, from my experience, a bad idea in HDF5.

I would use Approach #1 and store all your events in a "column" fashion
similar to what ROOT does.

For the fast querying problem, you can post-process your file and add a
separate column acting as an ordered index / associative array with a
layout of the type "event_id" -> "range row"
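
Something like this post-processing step, assuming the hits are already stored contiguously per event (dataset names made up):

import numpy as np
import h5py

with h5py.File("events.h5", "a") as f:
    event_ids = f["/hits/event_id"][:]
    # first row and row count of each event block, in storage order
    ids, first, counts = np.unique(event_ids, return_index=True, return_counts=True)
    index = np.column_stack([ids, first, first + counts]).astype(np.int64)
    # one row per event: event_id, first row, one-past-last row
    f.create_dataset("/hits_index", data=index)

Reading one event then becomes a single contiguous slice of the hit table instead of a scattered selection.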

Best Regards,
Adrien


Hello Tamas,

I use HDF5 to store streams of irregular time series (IRTS) from the financial
sector. The events are organised per day into a dataset, where each
dataset is a variable-length stream/vector with a custom datatype.
The custom record type is created to increase density, and an iterator in
C/C++ walks through the event stream; this is linked against Julia, R and
Python code.
Because the custom datatype is saved into the file, it is readily accessible
through HDFView.

The access pattern to this database is write once/read many, sequential, and
I have got good results over the past 5 years. I use it in an MPI cluster
environment with C++/Julia/Rcpp.

Custom datatype in my case:
[event id, asset, ....]

You see, having optimised access both to all events sequentially and to
only some events is a bi-objective problem that you can mitigate by using
more space to gain time.

As others pointed out, chunk size matters.

hope it helps,
steve



Hello Tamas!

My experience suggests that simply indexing the data is not enough to achieve top performance. The actual layout of information on disk (primary index) should be well-suited for your typical queries. For example, if you need to query by event_id, all values with the same event_id have to be closely located to minimize the number of disk seeks.
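
In other words, sort by the query key before writing. A minimal sketch (column names and sizes are made up):

import numpy as np

# the event_id column of the big hit table (stand-in values for illustration)
event_id = np.random.randint(0, 5000, size=2_000_000).astype(np.int32)

order = np.argsort(event_id, kind="stable")   # on-disk order = query order
# apply the same permutation to every hit column before writing, e.g.
#   dom_id = dom_id[order]; time = time[order]; ...
# a per-event read then touches one contiguous run of rows and few chunks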

If you have several types of typical queries, it might be worth duplicating the information using different physical layouts. This philosophy is utilized to great success in e.g.
http://cassandra.apache.org/

From my experience HDF5 is almost as fast as direct disk read, and even *faster* when using fast compression (LZ4, blosc). On my data HDF5 proved to be much faster compared to SQLite and local PostgreSQL databases.

Best wishes,
Andrey Paramonov


Dear Rafal,

thanks for your reply.

I see two solutions for your purposes.
First - try to switch from Python to C++ - it's much faster.

I am of course aware of the fact that Python is in general much slower than a statically typed compiled language, however pytables (http://www.pytables.org) and h5py (http://www.h5py.org) are thin wrappers and are tightly bound to the numpy library (http://www.numpy.org), which is totally competitive. I also use Julia to access the HDF5 content and I did not notice better performance. So I am not sure if this is a real bottleneck in our case...

Second - I know this is HDF5 forum, but for such a huge but simple set of data, I would suggest to use some SQL engine as a backend.

We definitely need a file-based approach, so a centralised database engine is not an option. I also tried SQLite, however the performance was very poor compared to our HDF5 solution.

So maybe our data structure is not that bad overall, yet our expectations might be a bit too high?

Cheers,
Tamas


Dear Steven,

On 31. Mar 2017, at 17:13, Steven Varga <steven.varga@gmail.com> wrote:

I use HDF5 to store streams of irregular time series (IRTS) from the financial sector. The events are organised per day into a dataset, where each dataset is a variable-length stream/vector with a custom datatype.
The custom record type is created to increase density, and an iterator in C/C++ walks through the event stream; this is linked against Julia, R and Python code.
Because the custom datatype is saved into the file, it is readily accessible through HDFView.

this sounds very interesting. Do you have some public code to look at the implementation details?

Cheers,
Tamas

Hi Tamas,

I'd say that there should be a layout in which you can store your data in HDF5 that is competitive with ROOT; it is just that finding it may require some more experimentation. Things like the compressor used, the chunksizes and the index level that you are using might be critical for achieving more performance. Could you send us some links to your codebases and perhaps elaborate more on the performance figures that you are getting on each of your approaches?

Best,

Francesc Alted


I'd say that there should be a layout in which you can store your data in HDF5 that is competitive with ROOT; it is just that finding it may require some more experimentation.

alright, there is hope :wink:

Things like the compressor used, the chunksizes and the index level that you are using might be critical for achieving more performance.

We experimented with compression levels and libs and ended up using blosc. This is what we used:

tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

We also pass the number of expected rows when creating the tables, however this is a pytables feature, so there is some magic in the background.
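
For context, the table creation looks roughly like this (a sketch; the column definitions and numbers here are illustrative, not our production values):

import tables as tb

description = {
    "event_id": tb.Int32Col(), "dom_id": tb.Int32Col(), "time": tb.Int32Col(),
    "tot": tb.Int16Col(), "triggered": tb.BoolCol(), "pmt_id": tb.Int16Col(),
}
filters = tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

with tb.open_file("events.h5", "w") as f:
    f.create_table("/", "hits", description=description, filters=filters,
                   expectedrows=5_000_000,   # lets pytables pick a chunk size
                   chunkshape=(16384,))      # or override it explicitly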

Could you send us some links to your codebases and perhaps elaborate more on the performance figures that you are getting on each of your approaches?

The chunksizes had no significant impact on the performance, but I admit I need to rerun all the performance scripts to show some actual values. The index level is new to me, I need to read up on that, but I think pytables takes care of it.

Here are some examples comparing the ROOT and HDF5 file formats, reading both with thin C wrappers in Python:

ROOT_readout.py 5.27s user 3.33s system 153% cpu 5.609 total
HDF5_big_table_readout.py 17.88s user 4.29s system 105% cpu 21.585 total

My experience suggests that simply indexing the data is not enough to achieve top performance. The actual layout of information on disk (primary index) should be well-suited for your typical queries. For example, if you need to query by event_id, all values with the same event_id have to be closely located to minimize the number of disk seeks.

OK, this was also my thought. It seems we went in the wrong direction with this indexing and big table thing.

If you have several types of typical queries, it might be worth to duplicate the information using different physical layouts. This philosophy is utilized to great success in e.g.
http://cassandra.apache.org/

Thanks, I will have a look!

From my experience HDF5 is almost as fast as direct disk read, and even *faster* when using fast compression (LZ4, blosc). On my data HDF5 proved to be much faster compared to SQLite and local PostgreSQL databases.

Sounds good :wink:

Cheers,
Tamas

See below:


this sounds very interesting. Do you have some public code to look at the
implementation details?

here is something similar I've done for a compound data set, iterators, and random search:

<https://github.com/rburkholder/trade-frame/tree/master/lib/TFHDF5TimeSeries>


The code currently works on Linux, and should work on Windows.


Oh sure, there is always hope :wink:

Ok, so based on your report, I'd suggest using other codecs inside Blosc. By default the "blosc" filter translates to the "blosc:blosclz" codec internally, but you can also specify "blosc:lz4", "blosc:snappy", "blosc:zlib" and "blosc:zstd". Each codec has its strong and weak points, so my first advice is that you experiment with them (especially with "blosc:zstd", which is a surprisingly good newcomer).

Then, you should experiment with different chunksizes. If you are using PyTables, then make sure to pass them in the `chunkshape` parameter, whereas h5py uses `chunks`.

Indexing can make lookups much faster too. Make sure that you create the index with maximum optimization (look for http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_csindex) before deciding that it is not for you. Also, using blosc (+ a suitable codec) when creating the index can usually accelerate things quite a bit.
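
A sketch of that, assuming the big-table layout with an event_id column:

import tables as tb

with tb.open_file("events.h5", "a") as f:
    col = f.root.hits.cols.event_id
    # completely sorted index (CSI) = maximum optimization level, compressed with blosc+zstd
    col.create_csindex(filters=tb.Filters(complevel=5, complib="blosc:zstd"))
    # read_where("event_id == 23") can then use the index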

In general, for understanding how chunksizes, compression and indexing can affect your lookup performance, it is worth having a look at the _Optimization Tips_ chapter of the PyTables User's Guide.

Good luck,

Francesc Alted


We experimented with compression levels and libs and ended up using blosc. This is what we used:

tb.Filters(complevel=5, shuffle=True, fletcher32=True, complib='blosc')

Typing on my phone so can't say much. Just wanted to react to this. I
haven't used pytables, but if the shuffle parameter here refers to the HDF5
library's built in shuffle filter, I think you want to turn it off when
using blosc, since the blosc compressor does its own shuffling, and I think
the two may interfere.
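
So it may be worth benchmarking the two variants side by side, e.g. (just a sketch; I don't know how pytables maps this flag internally):

import tables as tb

# identical settings apart from the shuffle flag
filters_shuffle    = tb.Filters(complevel=5, complib='blosc', shuffle=True)
filters_no_shuffle = tb.Filters(complevel=5, complib='blosc', shuffle=False)
# write the same table with each Filters object and compare file size and read/write times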

Cheers,
Elvis


Thanks for all the feedback so far!

Then, you should experiment with different chunksizes. If you are using PyTables, then make sure to pass them in the `chunkshape` parameter, whereas h5py uses `chunks`.
[...]
Indexing can make lookups much faster too. Make sure that you create the index with maximum optimization (look for http://www.pytables.org/usersguide/libref/structured_storage.html?highlight=create_index#tables.Column.create_csindex) before deciding that it is not for you. Also, using blosc (+ a suitable codec) when creating the index can usually accelerate things quite a bit.

Alright, I will study that extensively. :slight_smile:

I am just curious whether I am tying the HDF5 format too much to the pytables framework. We also use other languages and, as far as I understand, pytables creates some hidden tables to do all the magic behind the scenes. Or are these commonly supported HDF5 features? (sorry for the dumb question)

It is a pleasure to see some HEP people here.

Thanks, glad to hear :wink:

The slowness is compared to a ROOT structure which is also used in parallel. If I compare some basic event-by-event analysis, the same code run on a ROOT file is almost an order of magnitude faster.

If I remember properly, ROOT can only read in parallel, no write. Does
it matter for you ?

Ahm, with "used in parallel" I was referring to the fact that we use both ROOT and HDF5 files in our collaboration. There is a generational conflict between the two "frameworks", as you may imagine. Younger people refuse to use ROOT (for good reasons, but that's another story). That's why I maintain this branch in parallel.

This approach would create a large number of dataset ( one per id ),
which is from my experience, a bad idea in HDF5

Yes, this is kind of the problem with the second approach. h5py is extremely fast when iterating whereas pytables takes 50 times longer using the very same code (a for loop and direct access to the nodes). And there are people using other frameworks so there might be some huge performance variations I fear, which of course is not user friendly at all.


I would use Approach #1 and store all your events in a "column" fashion
similar to what ROOT does.

For the fast querying problem, you can post-process your file and add a
separate column acting as an ordered index / associative array with a
layout of the type "event_id" -> "range row"

I see... So there might be a well suited set of chunk/index parameters which could improve the speed of that structure. I need to dig deeper then.

Cheers,
Tamas

Indeed, indexing is a PyTables feature, so if you want to use HDF5 with other interfaces, then better not to rely on this.

Francesc Alted


Dear Tamas,

My instinct in your situation would be to define a compound data structure to represent one hit (it sounds as if you have done that) and then write a dataset per event.

You could use event-ID for the dataset name, and any other event metadata could be stored as attributes on the dataset.
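
Roughly like this, as a sketch with h5py (names, dtype and the attribute are only illustrative):

import numpy as np
import h5py

hit_dtype = np.dtype([("dom_id", "i4"), ("time", "i4"), ("tot", "i2"),
                      ("triggered", "i1"),   # bool stored as int8 in this sketch
                      ("pmt_id", "i2")])

with h5py.File("events.h5", "w") as f:
    hits = np.zeros(2500, dtype=hit_dtype)           # the hits of one event
    ds = f.create_dataset("/hits/23", data=hits)     # dataset named by event id
    ds.attrs["n_hits"] = len(hits)                   # event metadata as attributes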

The events can then be accessed by reading "/hits/event_id", like "/hits/23", which is a table similar to the one used in the first approach.

It sounds as if you have already tried this approach.

To iterate through the events, I need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.

I believe you can directly get the number of rows in each dataset and so I am confused by the attribute suggestion. It seems performance was still an issue?

Generally I find that performance is all about the chunk size - HDF will generally read a whole chunk at a time and cache those chunks - have you tried different chunk sizes?

rgds
Ewan


Sorry Ewan, I nearly missed your message!

My instinct in your situation would be to define a compound data structure to represent one hit (it sounds as if you have done that) and then write a dataset per event.

Yes, we use compound data structures for the hits right now.

To iterate through the events, I need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.

I believe you can directly get the number of rows in each dataset and so I am confused by the attribute suggestion. It seems performance was still an issue?

That was referring to the one-big-table approach with an event_id array. Kind of mimicking the pytables indexing feature without having a strict pytables dependency, so something like an extra dataset which stores the "from-to" index values for each event. Which is of course ugly :wink:

Generally I find that performance is all about the chunk size - HDF will generally read a whole chunk at a time and cache those chunks - have you tried different chunk sizes?

I tried but obviously I did something wrong... :wink:

Note that the HDF5 chunk cache size can be very important. HDF5 does
not look at the access pattern to estimate the optimal cache size. If
your access pattern is not sequential, you need to set a cache size that
minimizes I/O for that access pattern.
I've noted that accessing small hyperslabs is quite slow in HDF5,
probably due to B-tree lookup overhead.
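
With recent h5py versions, for example, the chunk cache can be tuned when opening the file (the values here are just an illustration):

import h5py

f = h5py.File("events.h5", "r",
              rdcc_nbytes=64 * 1024 * 1024,  # raw-data chunk cache of 64 MiB
              rdcc_nslots=100_003,           # hash slots, ideally a prime well above the number of cached chunks
              rdcc_w0=0.75)                  # how eagerly fully-read chunks are evicted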

Some colleagues at sister institutes have used the ADIOS data system
developed at Oak Ridge and said it was much faster than HDF5. However,
AFAIK it can use a lot of memory to achieve it. But a large chunk cache
is not much different.

BTW. I assume that in your tests both ROOT and HDF5 used cold data,
thus no data was already available in the system file buffers.

- Ger
