On 31 Mar 2017, at 9:30 AM, hdf-forum-request@lists.hdfgroup.org wrote:
Send Hdf-forum mailing list submissions to
hdf-forum@lists.hdfgroup.org
To subscribe or unsubscribe via the World Wide Web, visit
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
or, via email, send a message with subject or body 'help' to
hdf-forum-request@lists.hdfgroup.org
You can reach the person managing the list at
hdf-forum-owner@lists.hdfgroup.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of Hdf-forum digest..."
Today's Topics:
1. Optimising HDF5 data structure (Tamas Gal)
2. Re: Optimising HDF5 data structure (Rafal Lichwala)
3. Re: Optimising HDF5 data structure (Tamas Gal)
4. Re: Optimising HDF5 data structure (Francesc Alted)
----------------------------------------------------------------------
Message: 1
Date: Thu, 30 Mar 2017 21:33:11 +0200
From: Tamas Gal <tamas.gal@me.com>
To: hdf-forum@lists.hdfgroup.org
Subject: [Hdf-forum] Optimising HDF5 data structure
Message-ID: <57368E28-2DC4-4564-B45C-C477B914EE95@me.com>
Content-Type: text/plain; charset=us-ascii
Dear all,
we are using HDF5 in our collaboration to store large volumes of neutrino interaction event data. The data itself has a very simple structure, but I still could not find an acceptable way to design the layout of our HDF5 files. It would be great if some HDF5 experts could give me a hint on how to optimise it.
The data I want to store are basically events, which are simply groups of hits. A hit is a simple structure with the following fields:
Hit: dom_id (int32), time (int32), tot (int16), triggered (bool), pmt_id (int16)
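For concreteness, here is the Hit record expressed as a numpy structured dtype (a sketch, not our production code; field names as listed above):

```python
import numpy as np

# Structured dtype mirroring the Hit fields described above.
hit_dtype = np.dtype([
    ("dom_id",    np.int32),
    ("time",      np.int32),
    ("tot",       np.int16),
    ("triggered", np.bool_),
    ("pmt_id",    np.int16),
])

# A toy event: three hits stored as one structured array.
hits = np.zeros(3, dtype=hit_dtype)
hits["dom_id"]    = [101, 102, 103]
hits["time"]      = [1500, 1510, 1523]
hits["triggered"] = [True, False, True]

print(hit_dtype.itemsize)  # 13 bytes per packed hit record
```

Such a dtype maps directly onto an HDF5 compound type (a pytables Table row or an h5py compound dataset).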
As already mentioned, an event is simply a list of a few thousand hits, and the number of hits varies from event to event.
I tried different approaches to store the information of a few thousand events (thus a couple of million hits). The final two structures, which kind of work but still have poor performance, are:
Approach #1: a single "table" to store all hits (basically one array per hit field) with an additional "column" (again, an array) to store the event_id each hit belongs to.
This is of course nice if I want to do analysis on the whole file, including all events, but it is slow when I want to iterate event by event, since I need to select the corresponding hits by looking at the event_ids. In pytables or the pandas framework this works using binary-search index trees, but it is still a bit slow.
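To illustrate the flat-table selection (toy numpy data standing in for the HDF5 datasets; all values made up):

```python
import numpy as np

# Flat-table layout: one array per hit field plus an event_id column,
# here kept sorted by event_id as it would be when written event by event.
event_id = np.array([0, 0, 1, 1, 1, 2], dtype=np.int32)
time     = np.array([10, 12, 5, 7, 9, 3], dtype=np.int32)

# Selecting one event means scanning the event_id column.
mask = event_id == 1
hits_of_event_1 = time[mask]

# If the table is sorted by event_id, searchsorted gives the slice
# bounds without a full scan, similar to what an index tree provides.
lo = int(np.searchsorted(event_id, 1, side="left"))
hi = int(np.searchsorted(event_id, 1, side="right"))
print(time[lo:hi])  # [5 7 9]
```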
Approach #2: using a hierarchical structure to group the hits by event. An event can then be accessed by reading "/hits/event_id", like "/hits/23", which is a table similar to the one used in the first approach. To iterate through the events, I either need to create a list of nodes and walk over them, or I store the number of events as an attribute and simply use an iterator.
It seems that this is only a tiny bit faster for accessing a specific event, which may be related to the fact that HDF5 stores the nodes in a B-tree, much like pandas stores its index table.
The slowness is relative to a ROOT structure which we also use in parallel: the same basic event-by-event analysis run on a ROOT file is almost an order of magnitude faster.
I also tried variable-length arrays, but I ran into compression issues. Another approach was creating meta tables to keep track of the hit indices for faster lookup, but this was kind of awkward and not self-explanatory enough in my opinion.
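For completeness, a minimal sketch of that meta-table idea (all names and numbers are made up): a small per-event index of offsets turns the lookup into a pure slice.

```python
import numpy as np

# Flat hit array, sorted by event (toy data).
time  = np.array([10, 12, 5, 7, 9, 3], dtype=np.int32)

# Hypothetical "event index" meta table: for each event, the offset of
# its first hit in the flat array and the number of hits it contains.
start = np.array([0, 2, 5], dtype=np.int64)   # first hit of each event
count = np.array([2, 3, 1], dtype=np.int64)   # hits per event

def hits_for(event_id):
    """Return the time values of one event via a pure slice (O(1) lookup)."""
    s = int(start[event_id])
    return time[s : s + int(count[event_id])]

print(hits_for(1))  # [5 7 9]
```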
So my question is: how would an experienced HDF5 user structure this simple data to maximise the performance of the event-by-event readout?
Best regards,
Tamas
------------------------------
Message: 2
Date: Fri, 31 Mar 2017 09:52:55 +0200
From: Rafal Lichwala <syriusz@man.poznan.pl>
To: hdf-forum@lists.hdfgroup.org
Subject: Re: [Hdf-forum] Optimising HDF5 data structure
Message-ID: <271c3831-e80d-9e21-ba92-0b701aed2ac6@man.poznan.pl>
Content-Type: text/plain; charset=utf-8; format=flowed
Hi Tamas,
> So my question is: how would an experienced HDF5 user structure this simple data to maximise the performance of the event-by-event readout?
I see two solutions for your purposes.
First - try to switch from Python to C++ - it's much faster.
http://benchmarksgame.alioth.debian.org/u64q/compare.php?lang=python3&lang2=gpp
Second - I know this is an HDF5 forum, but for such a huge yet simple set of data I would suggest using an SQL engine as a backend. MySQL or PostgreSQL would be a good choice if you need the full feature set of a relational database engine for your data analysis, but file-based solutions (SQLite) could also be taken into consideration.
In your case the data would be stored in two tables (hits and events) with a proper key-based join between them.
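A minimal sketch of that two-table layout with the stdlib sqlite3 module (schema and column names are illustrative, taken from the Hit fields Tamas listed):

```python
import sqlite3

# In-memory SQLite database with an events table, a hits table,
# and an index on the join key for fast per-event selection.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE events (event_id INTEGER PRIMARY KEY);
    CREATE TABLE hits (
        hit_id    INTEGER PRIMARY KEY,
        event_id  INTEGER REFERENCES events(event_id),
        dom_id    INTEGER,
        time      INTEGER,
        tot       INTEGER,
        triggered INTEGER,
        pmt_id    INTEGER
    );
    CREATE INDEX idx_hits_event ON hits(event_id);
""")
con.executemany("INSERT INTO events VALUES (?)", [(1,), (2,)])
con.executemany(
    "INSERT INTO hits (event_id, dom_id, time) VALUES (?, ?, ?)",
    [(1, 101, 10), (1, 102, 12), (2, 103, 5)],
)

# Event-by-event readout is a single indexed query.
rows = con.execute(
    "SELECT dom_id, time FROM hits WHERE event_id = ? ORDER BY time", (1,)
).fetchall()
print(rows)  # [(101, 10), (102, 12)]
```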
Regards,
Rafal
------------------------------
Message: 3
Date: Fri, 31 Mar 2017 10:20:37 +0200
From: Tamas Gal <tamas.gal@me.com>
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Optimising HDF5 data structure
Message-ID: <6405B83B-001D-4476-A845-58E91AA20BAB@me.com>
Content-Type: text/plain; charset="us-ascii"
Dear Rafal,
thanks for your reply.
On 31. Mar 2017, at 09:52, Rafal Lichwala <syriusz@man.poznan.pl> wrote:
> I see two solutions for your purposes.
> First - try to switch from Python to C++ - it's much faster.
I am of course aware that Python is in general much slower than a statically typed compiled language; however, pytables (http://www.pytables.org) and h5py (http://www.h5py.org) are thin wrappers tightly bound to the numpy library (http://www.numpy.org), which is totally competitive. I also use Julia to access HDF5 content and did not notice better performance. So I am not sure this is the real bottleneck in our case...
> Second - I know this is HDF5 forum, but for such a huge but simple set of data, I would suggest to use some SQL engine as a backend.
We definitely need a file-based approach, so a centralised database engine is not an option. I also tried SQLite, but its performance was very poor compared to our HDF5 solution.
So maybe our data structure is not that bad overall, yet our expectations might be a bit too high?
Cheers,
Tamas
------------------------------
Message: 4
Date: Fri, 31 Mar 2017 08:29:50 +0000
From: Francesc Alted <faltet@hdfgroup.org>
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Subject: Re: [Hdf-forum] Optimising HDF5 data structure
Message-ID:
<SN2PR17MB08325034F8F122D27AACDE3EB5370@SN2PR17MB0832.namprd17.prod.outlook.com>
Content-Type: text/plain; charset="us-ascii"
Hi Tamas,
I'd say there should be a layout in which you can store your data in HDF5 that is competitive with ROOT; finding it may just require some more experimentation. Things like the compressor used, the chunk sizes, and the indexing level might be critical for achieving more performance. Could you send us some links to your codebases, and perhaps elaborate on the performance figures you are getting for each of your approaches?
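As a starting point for that experimentation, here is a toy h5py sketch of the knobs in question (chunk shape, compressor, shuffle filter); the dataset name and sizes are made up, and PyTables additionally offers faster codecs such as Blosc:

```python
import numpy as np
import h5py

# Toy hit payload: one million int32 "time" values.
times = np.arange(1_000_000, dtype=np.int32)

# In-memory HDF5 file (no disk I/O) to try chunk/compression settings.
with h5py.File("toy.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset(
        "hits/time",
        data=times,
        chunks=(65536,),      # chunk of 64 Ki int32s (~256 KiB)
        compression="gzip",   # built-in codec; tune level via opts
        compression_opts=4,
        shuffle=True,         # byte-shuffle usually helps numeric data
    )
    read_back = dset[100:110]

print(read_back)
```

The right chunk size depends on the access pattern: per-event reads favour chunks close to one event's worth of hits, whole-file scans favour larger chunks.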
Best,
Francesc Alted
------------------------------
Subject: Digest Footer
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
------------------------------
End of Hdf-forum Digest, Vol 93, Issue 29
*****************************************