Hi Ger,
On Wednesday 10 September 2008, Ger van Diepen wrote:
We are thinking of storing the data observed with our radio
telescopes in HDF5. The amount of data can be ten to a few hundred
GBytes. The data arrives in order of time. The data have basically 4
axes: polarisation, frequency, baseline, and time. Depending on the
application a slice of data along one or more of those axes is
needed. So a chunked dataset seems like a good candidate. However,
the axes are not regular. E.g. for longer baselines the integration
times can be shorter. So we cannot use a simple 4-dim dataset of
float values which would allow for easy access in all directions.
An option would be to store the data in a hierarchical way. E.g. a
group per time, then a group per baseline and finally a dataset
containing an array of data for the pol/freq axes. However, I fear
that in that way it is expensive to get, say, a slice containing all
data for a given baseline and frequency.
Another option is to store it like groups, but then in a dataset with
variable length entries. However, I guess I cannot chunk such a
dataset. So again it would be expensive to get the slice mentioned
above.
I'm not sure I understand you, but it seems that what you are calling
"chunking" is what is called 'hyperslicing' in HDF5 jargon.
So I'm wondering what is the best way to store such data while having
reasonable access times along all axes?
One possibility would be to use a table as in a traditional database.
In terms of HDF5 that can be implemented as a compound, chunked (in the
sense of HDF5) dataset with one field for each irregular axis, plus an
additional field for keeping the actual float values. The length of
such a dataset would be the product of the lengths for each of the
axes. This would arguably take much more space on disk than other
solutions (the entries are not made only of actual values, but also of
*axes values*), but as the axes information would have relatively low
entropy, the compressor+shuffle filters could greatly reduce the amount
of space needed (bringing it reasonably close to what your original
values alone would take).
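
To make the table idea a bit more concrete, here is a small sketch in
plain Python. It is not HDF5 code: the axis sizes, the rule that short
baselines are sampled only every second time step, and the random
values are all made up for illustration, and the shuffle() helper only
imitates what a byte-shuffle filter does before a deflate compressor.

```python
import random
import struct
import zlib

# Hypothetical, tiny axis sizes; real observations are far larger.
N_TIME, N_BASELINE, N_FREQ, N_POL = 20, 6, 16, 4

# One table row: 4 int32 axis fields + 1 float32 value, no padding.
ROW = struct.Struct("<iiiif")


def build_rows(seed=0):
    """Return one (time, baseline, freq, pol, value) tuple per sample."""
    rng = random.Random(seed)
    rows = []
    for t in range(N_TIME):
        for b in range(N_BASELINE):
            # Irregular axis (assumption): short baselines (b < 3) are
            # only sampled at every second time step.
            if b < 3 and t % 2:
                continue
            for f in range(N_FREQ):
                for p in range(N_POL):
                    rows.append((t, b, f, p, rng.gauss(0.0, 1.0)))
    return rows


def shuffle(raw, itemsize):
    """Byte shuffle: put the k-th byte of every item next to each other."""
    return b"".join(raw[k::itemsize] for k in range(itemsize))


rows = build_rows()
raw = b"".join(ROW.pack(*r) for r in rows)
values_only = struct.pack("<%df" % len(rows), *(r[4] for r in rows))

plain = zlib.compress(raw)
shuffled = zlib.compress(shuffle(raw, ROW.size))

print("rows:                ", len(rows))
print("table, uncompressed: ", len(raw), "bytes")
print("values only:         ", len(values_only), "bytes")
print("table, deflate:      ", len(plain), "bytes")
print("table, shuffle+defl.:", len(shuffled), "bytes")
```

The axis fields change slowly and most of their bytes are zero, so
after shuffling they form long runs that compress very well; the sizes
printed for a toy table like this one say nothing about real data.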
To access the values as slices of your axes, you should add some
logic to your app that allows you to select the information you are
interested in. For example, if you want the values within a range
of 'polarization' and 'frequency', you can traverse the dataset and
select those values.
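
The traversal described above could look roughly like the following
sketch. The Row layout, the integer axis codes and the select() helper
are hypothetical; a real application would read the rows from the HDF5
dataset rather than from a Python list.

```python
from collections import namedtuple

Row = namedtuple("Row", "time baseline freq pol value")

# Hypothetical toy table with integer axis codes and made-up values.
table = [
    Row(t, b, f, p, 0.5 * t + b + 0.01 * f + 0.001 * p)
    for t in range(4)
    for b in range(3)
    for f in range(8)
    for p in range(4)
]


def select(rows, pol_range, freq_range):
    """Full scan: yield the values whose pol and freq fall inside the
    given half-open [low, high) ranges."""
    pol_lo, pol_hi = pol_range
    freq_lo, freq_hi = freq_range
    for row in rows:
        if pol_lo <= row.pol < pol_hi and freq_lo <= row.freq < freq_hi:
            yield row.value


values = list(select(table, pol_range=(1, 3), freq_range=(2, 5)))
print(len(values), "values selected out of", len(table), "rows")
```

Every row is visited once, so the cost grows with the size of the whole
table no matter how small the selection is.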
However, in order to avoid traversing the complete table, you may want
to index all the fields that are treated as axes, so as to speed up the
lookups (as a matter of fact, this is what traditional databases do).
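
As a rough illustration of why an index helps, here is a generic
sorted-key index built with the standard bisect module. This is only a
sketch of the general idea, not the way PyTables Pro or any particular
database implements its indexes; the toy rows and the lookup() helper
are made up.

```python
import bisect

# Hypothetical toy table: (freq, value) rows in arrival (time) order,
# so the freq column is not sorted.
rows = [(f, 100 * t + f) for t in range(50) for f in range(64)]

# Index on freq: the sorted freq keys plus the row numbers they belong to.
order = sorted(range(len(rows)), key=lambda i: rows[i][0])
keys = [rows[i][0] for i in order]


def lookup(freq_lo, freq_hi):
    """Row numbers with freq_lo <= freq < freq_hi, via binary search."""
    start = bisect.bisect_left(keys, freq_lo)
    stop = bisect.bisect_left(keys, freq_hi)
    return sorted(order[start:stop])


hits = lookup(10, 12)

# The same selection with a full scan, for comparison.
scan = [i for i, (f, _) in enumerate(rows) if 10 <= f < 12]
print(len(hits), "rows found; same as full scan:", hits == scan)
```

The two binary searches touch only a handful of keys, whereas the scan
has to look at every row.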
<blurb-mode>
In case you are using the Python language for your analysis job, you
may want to use PyTables Pro [1] for this. It implements an indexing
engine that can cope with very large datasets, and lets you do
operations like:
    slice = table.readWhere('(pol>10) & (pol<20) | (pres<1.3)',
                            field="actual_value")
where 'slice' has the data that you are interested in. Of course, if
the 'pol' or 'pres' fields are indexed, then the need of traversing the
complete dataset is avoided.
In addition to using HDF5 as a container for all of its data, the indexing
engine behind PyTables Pro scales much better than the ones in
traditional databases, as can be seen in [2].
</blurb-mode>
[1] http://www.pytables.org/moin/PyTablesPro
[2] http://www.pytables.org/docs/OPSI-indexes.pdf
Hope that helps,
--
Francesc Alted
Freelance developer
Tel +34-964-282-249