Selection and crosscuts in HDF5 files

We are thinking of storing the data observed with our radio telescopes in HDF5. The amount of data can be ten to a few hundred GBytes. The data arrives in order of time.
The data have basically 4 axes: polarisation, frequency, baseline, and time. Depending on the application a slice of data along one or more of those axes is needed. So a chunked dataset seems like a good candidate.
However, the axes are not regular. E.g. for longer baselines the integration times can be shorter. So we cannot use a simple 4-dim dataset of float values which would allow for easy access in all directions.

An option would be to store the data in a hierarchical way, e.g. a group per time, then a group per baseline, and finally a dataset containing an array of data for the pol/freq axes. However, I fear that with such a layout it is expensive to get, say, a slice containing all data for a given baseline and frequency.
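[Editor's note: to see why this layout makes cross-cuts expensive, here is a minimal pure-Python sketch (no HDF5; the group and dataset names are invented for illustration) of the group-per-time / group-per-baseline idea:]

```python
# Hypothetical model of the hierarchical layout: one group per time
# step, one subgroup per baseline, and a small pol/freq dataset at the
# bottom.  Path names like "t0003/b012" are invented for illustration.
ntime, nbaseline = 100, 30

# The "file": a dict mapping group paths to (npol x nfreq) datasets.
layout = {f"t{t:04d}/b{b:03d}": object()   # stand-in for a dataset
          for t in range(ntime) for b in range(nbaseline)}

def slice_baseline_freq(baseline):
    """Paths that must be opened to read one baseline at one frequency
    across all times -- one dataset per time step."""
    suffix = f"/b{baseline:03d}"
    return [p for p in layout if p.endswith(suffix)]

print(len(slice_baseline_freq(12)))   # 100 -- one open per time step
```

Reading one baseline at one frequency across all times means opening one small dataset per time step, so the cost of that cross-cut grows linearly with the number of time steps.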

Another option is to store the data like in the group approach, but in a dataset with variable-length entries. However, I guess I cannot chunk such a dataset, so again it would be expensive to get the slice mentioned above.

So I'm wondering what is the best way to store such data while having reasonable access times along all axes?

Regards,
Ger


----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Hi Ger,

We are thinking of storing the data observed with our radio telescopes in HDF5. The amount of data can be ten to a few hundred GBytes. The data arrives in order of time.
The data have basically 4 axes: polarisation, frequency, baseline, and time. Depending on the application a slice of data along one or more of those axes is needed. So a chunked dataset seems like a good candidate.
However, the axes are not regular. E.g. for longer baselines the integration times can be shorter. So we cannot use a simple 4-dim dataset of float values which would allow for easy access in all directions.

  Just to be certain I know what we're talking about here, are you thinking that you want the dimensions of your dataset to be "ragged" in one dimension while expanding another dimension? (And holding the other two dimensions fixed)

An option would be to store the data in a hierarchical way. E.g. a group per time, then a group per baseline and finally a dataset containing an array of data for the pol/freq axes. However, I fear that in that way it is expensive to get, say, a slice containing all data for a given baseline and frequency.

Another option is to store it like groups, but then in a dataset with variable length entries. However, I guess I cannot chunk such a dataset. So again it would be expensive to get the slice mentioned above.

  You can chunk datasets that have variable-length datatype elements.

So I'm wondering what is the best way to store such data while having reasonable access times along all axes?

  Hmm, HDF5 doesn't tackle the case I mentioned above (ragged dims, etc) in "big" ways, just "small" ways with variable-length datatypes. The downside of using variable-length datatypes is that there's no way to subset along that "dimension" currently.

  Quincey


On Sep 10, 2008, at 7:23 AM, Ger van Diepen wrote:


Hi Ger,

On Wednesday 10 September 2008, Ger van Diepen wrote:

We are thinking of storing the data observed with our radio
telescopes in HDF5. The amount of data can be ten to a few hundred
GBytes. The data arrives in order of time. The data have basically 4
axes: polarisation, frequency, baseline, and time. Depending on the
application a slice of data along one or more of those axes is
needed. So a chunked dataset seems like a good candidate. However,
the axes are not regular. E.g. for longer baselines the integration
times can be shorter. So we cannot use a simple 4-dim dataset of
float values which would allow for easy access in all directions.

An option would be to store the data in a hierarchical way. E.g. a
group per time, then a group per baseline and finally a dataset
containing an array of data for the pol/freq axes. However, I fear
that in that way it is expensive to get, say, a slice containing all
data for a given baseline and frequency.

Another option is to store it like groups, but then in a dataset with
variable length entries. However, I guess I cannot chunk such a
dataset. So again it would be expensive to get the slice mentioned
above.

I'm not sure I understand you, but it seems that what you are calling
"chunking" is what is called 'hyperslicing' in HDF5 jargon.

So I'm wondering what is the best way to store such data while having
reasonable access times along all axes?

One possibility would be to use a table as in a traditional database.
In terms of HDF5 that can be implemented as a compound, chunked (in the
HDF5 sense) dataset with one field for each irregular axis, plus an
additional field holding the actual float values. The length of such a
dataset would be the product of the lengths of the axes. This would
arguably take much more space on disk than other solutions (the entries
hold not only the actual values, but also the *axis values*), but as
the axis information has relatively low entropy, the shuffle+compressor
filters could greatly reduce the space needed (to something reasonably
close to what your original values alone would take).

For accessing the values as slices of your axes, you should add some
logic in your app that lets you select the information you are
interested in. For example, if you want the values within a range
of 'polarization' and 'frequency', you can traverse the dataset and
select those values.

However, in order to avoid traversing the complete table, you may want
to index all the fields that are treated as axes, so as to speed up the
lookups (as a matter of fact, this is what traditional databases do).
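[Editor's note: the table-plus-index idea can be sketched in plain Python (no HDF5 or PyTables; the row layout, field order, and sizes are invented for illustration):]

```python
# Hypothetical sketch of the database-like layout: one row per sample,
# carrying its axis values alongside the float, plus a simple index on
# the 'baseline' column so lookups avoid a full table scan.
from collections import defaultdict

# rows: (time, baseline, freq, pol, value) -- axis values stored per row
rows = [(t, b, f, p, float(t * b))
        for t in range(4) for b in range(3)
        for f in range(2) for p in range(2)]

# Build an index once: baseline value -> list of row numbers.
baseline_index = defaultdict(list)
for i, (t, b, f, p, v) in enumerate(rows):
    baseline_index[b].append(i)

def select(baseline, freq):
    """All values for one baseline and frequency, touching only the
    rows the index points at instead of traversing the whole table."""
    return [rows[i][4] for i in baseline_index[baseline]
            if rows[i][2] == freq]

print(len(select(1, 0)))   # 8 -- 4 times x 2 polarisations
```

This is the same trick a database performs: the index trades some extra storage and build time for selective reads along the indexed axis.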

<blurb-mode>
In case you were using the Python language for your analysis job, you
may want to use PyTables Pro [1] for this. It implements an indexing
engine that can cope with very large datasets, and lets you do
operations like:

slice = table.readWhere('(pol>10) & (pol<20) | (pres<1.3)',
                        field="actual_value")

where 'slice' holds the data that you are interested in. Of course, if
the 'pol' or 'pres' fields are indexed, then traversing the complete
dataset is avoided.

Besides using HDF5 as the container for all of its data, the indexing
engine behind PyTables Pro scales much better than the ones in
traditional databases, as can be seen in [2].
</blurb-mode>

[1] http://www.pytables.org/moin/PyTablesPro
[2] http://www.pytables.org/docs/OPSI-indexes.pdf

Hope that helps,


--
Francesc Alted
Freelance developer
Tel +34-964-282-249


Hi Ger,

Hi Quincey and Francesc,

Thanks for your answers.
Indeed it expands in the time dimension and it is ragged. For instance, for baseline A we'll have a cube of [ntimeA,nfreq,npol], while for baseline B we'll have [ntimeB,nfreq,npol]. Ragging is very much desired to save a factor of at least 2 in storage, so we cannot have a cube [ntime,nbaseline,nfreq,npol].
We have 3 main applications with different access patterns.
- RFI detection needs a sliding window in time and freq per baseline.
- Calibration needs all data in chunks of time.
- Imaging needs all data in chunks of frequency.
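[Editor's note: the trade-off between these three access patterns and the chunk shape can be sketched with simple chunk arithmetic; the sizes below are small stand-ins, not the real dimensions:]

```python
# Hypothetical chunk arithmetic for a regular [ntime, nbaseline,
# nfreq, npol] cube: count how many chunks each access pattern reads.
from math import ceil

dims  = (64, 8, 32, 4)           # ntime, nbaseline, nfreq, npol
chunk = (8, 2, 16, 4)            # assumed chunk shape

nchunks = [ceil(d / c) for d, c in zip(dims, chunk)]   # chunks per axis

# RFI detection: one baseline, sliding over all times and frequencies.
rfi = nchunks[0] * nchunks[2] * nchunks[3]
# Calibration: one block of times, all baselines/freqs/pols.
calib = nchunks[1] * nchunks[2] * nchunks[3]
# Imaging: one block of frequencies, all times/baselines/pols.
imaging = nchunks[0] * nchunks[1] * nchunks[3]

print(rfi, calib, imaging)       # 16 8 32
```

Whichever axis gets the smallest chunk extent favours slices along that axis, so the chunk shape has to balance all three patterns.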

When chunking a dataset with variable-length datatypes, I cannot see how it can still chunk all 4 dimensions. What does it chunk?

  I was thinking of making a chunked 3-D dataset with a variable-length datatype for the ragged dimension.
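[Editor's note: that idea can be modelled in plain Python, with nested lists standing in for a 3-D dataset of variable-length elements; all sizes are invented:]

```python
# Hypothetical model of a [nbaseline, nfreq, npol] dataset whose
# elements are variable-length time series.  The three fixed axes can
# be subset normally; the ragged time axis can only be read whole.
nbaseline, nfreq, npol = 3, 4, 2
ntime_for = [10, 5, 5]                 # ragged: time length per baseline

data = [[[list(range(ntime_for[b]))    # one vlen time series per cell
          for _ in range(npol)]
         for _ in range(nfreq)]
        for b in range(nbaseline)]

# Subsetting baseline/freq/pol is cheap ...
cell = data[1][2][0]
print(len(cell))                       # 5 -- the whole series comes back

# ... whereas "times 0..2 only" still requires reading full elements
# and cutting client-side:
window = [data[b][0][0][:3] for b in range(nbaseline)]
print([len(w) for w in window])        # [3, 3, 3]
```

This illustrates Quincey's caveat: the variable-length "dimension" cannot be subset inside the file, only after reading each full element.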

I guess I can use multiple non-ragged chunked datasets, for instance one per baseline, or combine baselines with the same integration time. I have to think more about that.
I assume the chunk cache is shared by all datasets, so it should be large enough when doing, say, the imaging.

  I think you may get reasonable results with a [ntime,nbaseline,nfreq,npol] cube as long as you set the chunk dimensions relatively small and add a compression filter (like deflate). You will benefit from the fact that chunks without any data elements aren't instantiated in the file, and chunks that are only partially filled with [ragged] elements will compress well. It won't be as good as having a fully supported ragged dimension, but it will still give you better subsetting capabilities than multiple datasets.
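[Editor's note: the remark above about unwritten chunks can be made concrete with a small back-of-the-envelope; the chunk length along time is an arbitrary stand-in:]

```python
# Hypothetical count of allocated chunks along the padded time axis:
# chunks lying entirely in the never-written region of a short
# baseline are simply not stored in the file.
from math import ceil

ntime_full = 6400          # padded time extent of the cube
ctime = 100                # assumed chunk length along time
ntime_per_baseline = {"long": 6400, "short": 800}

for kind, nt in ntime_per_baseline.items():
    written = ceil(nt / ctime)                      # chunks touched
    skipped = ceil(ntime_full / ctime) - written    # never allocated
    print(kind, written, skipped)   # long 64 0 / short 8 56
```

So for a short baseline 56 of the 64 time chunks never hit the disk at all, which recovers much of the space a ragged layout would save.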

Maybe Francesc's idea of a database-like approach is feasible, but I hesitate to index billions of values that way. Typical values are:
npol=4
nfreq=1024
nbaseline=900
ntime=6400 for long baselines and 800 for short baselines (and something like 3200 or 1600 for intermediate baselines)
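[Editor's note: purely for illustration, a quick estimate from these figures; the split of the 900 baselines over the four integration times is not given, so an equal split is assumed:]

```python
# Rough size comparison: ragged storage vs. a full rectangular cube,
# using the figures above.  The equal split over the four baseline
# classes is an assumption made only for this estimate.
npol, nfreq, nbaseline = 4, 1024, 900
ntimes = [6400, 3200, 1600, 800]            # per baseline class
per_class = nbaseline // len(ntimes)        # assumed: 225 each

ragged = sum(nt * per_class * nfreq * npol for nt in ntimes)
padded = max(ntimes) * nbaseline * nfreq * npol   # rectangular cube

print(padded / ragged)   # ~2.13 -- the "factor of at least 2" above
```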

  *ick* :-)

    Quincey


On Sep 11, 2008, at 1:29 AM, Ger van Diepen wrote:

Cheers,
Ger

Quincey Koziol <koziol@hdfgroup.org> 09/10/08 8:20 PM >>>

Hi Ger,

On Sep 10, 2008, at 7:23 AM, Ger van Diepen wrote:

We are thinking of storing the data observed with our radio
telescopes in HDF5. The amount of data can be ten to a few hundred
GBytes. The data arrives in order of time.
The data have basically 4 axes: polarisation, frequency, baseline,
and time. Depending on the application a slice of data along one
or more of those axes is needed. So a chunked dataset seems like a
good candidate.
However, the axes are not regular. E.g. for longer baselines the
integration times can be shorter. So we cannot use a simple 4-dim
dataset of float values which would allow for easy access in all
directions.

  Just to be certain I know what we're talking about here, are you
thinking that you want the dimensions of your dataset to be "ragged"
in one dimension while expanding another dimension? (And holding the
other two dimensions fixed)

An option would be to store the data in a hierarchical way. E.g. a
group per time, then a group per baseline and finally a dataset
containing an array of data for the pol/freq axes. However, I fear
that in that way it is expensive to get, say, a slice containing all
data for a given baseline and frequency.

Another option is to store it like groups, but then in a dataset
with variable length entries. However, I guess I cannot chunk such a
dataset. So again it would be expensive to get the slice mentioned
above.

  You can chunk datasets that have variable-length datatype elements.

So I'm wondering what is the best way to store such data while
having reasonable access times along all axes?

  Hmm, HDF5 doesn't tackle the case I mentioned above (ragged dims,
etc) in "big" ways, just "small" ways with variable-length datatypes.
The downside of using variable-length datatypes is that there's no way
to subset along that "dimension" currently.

  Quincey


Hi Ger,

On Thursday 11 September 2008, Quincey Koziol wrote:
[clip]

> Maybe Francesc's idea of a database-like approach is feasible, but
> I hesitate to index billions of values that way. Typical values
> are: npol=4
> nfreq=1024
> nbaseline=900
> ntime=6400 for long baselines and 800 for short baselines (and
> something like 3200 or 1600 for intermediate baselines)

No problem. Tables with 5 billion entries (and more) are typical
figures for PyTables Pro, as you can see in the OPSI white paper
(that I mentioned in a previous message).

Sure, the indexes would take quite a bit of space, but much less than,
for example, PostgreSQL (typically 3x less, and up to 15x less in the
forthcoming Pro 2.1). Also, the creation of indexes is around 10x
faster. For example, creating an index for a table with 5 billion rows
would take just a couple of hours (using a machine with an Opteron64
processor at 2 GHz and a regular SATA disk), so the complete indexing
of all 5 columns required for your case (4 if you don't want to index
the values) would take just 10 hours. All in all, that is not much for
getting first-class access times to your data on any of your axes.

Finally, I must say that the main drawback of OPSI indexes (but also
the reason behind their high efficiency and compactness) is that
updating values in indexed tables is far slower than in other databases
(10x slower or more). However, if you are going to use them for mostly
read-only or append-only tables, then that is no problem at all. Also,
the speed of updating values in non-indexed columns is not affected by
indexes on other columns.

At any rate, whether or not this is a good solution for you depends
largely on your requirements.


On Sep 11, 2008, at 1:29 AM, Ger van Diepen wrote:

--
Francesc Alted
Freelance developer
Tel +34-964-282-249
