Save related data

Hi there,

I’m completely new to HDF5 and I’m trying to find out which layout is best for saving this kind of data:

For example I have:

angle  chan1  chan2  chan3  ...  chan32768
  0      10     22     3    ...     35
  1      10     22     3    ...     35
  2      10     26     3    ...    755
  3      10     23     3    ...     35
  4      10     21     3    ...     31
  5      10     22     3    ...     34
...
359      10     22     3    ...     35
  0      10     22     3    ...    311
  1      10     22     3    ...     35
  2      10     22     3    ...     35
...

such a table. It can have tens of thousands of rows and tens of thousands of columns.

I can easily save it in HDF5. Later I want to run selections like:

get all rows whose angle is between 216 and 220 and sum over those rows, so that I get back a list (one value per channel).

or

get all rows whose angle is between 216 and 220 and sum over the channels, so that I also get back a list (one value per row).

My first idea was to save the table as a 2-D array and the angles as a separate list.
For the selection I would read the angle list, find the matching rows, and then pull just those rows from the main table.

This would have a big memory/CPU consumption, because I would have to go through the main table many times.

Is there a trick for storing the data so that I can extract such selections easily?

This is an example of how I read the whole spectrum (maybe there is a better way):

np.array(self.hdf5file["spectrum"][()][startkanal:endkanal].sum(axis=0))

I want something like:

np.array(self.hdf5file["spectrum"][()][startkanal:endkanal]["SORT OUT WRONG LINES"].sum(axis=0))
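One way to get that "sort out wrong lines" step, assuming the angles live in their own 1-D array or dataset next to the 2-D table: build a boolean mask on the small angle array first, and only then touch the big table. A minimal NumPy sketch (the names `angle`, `table`, `startkanal`, `endkanal` are stand-ins; with h5py you can pass the row indices straight to the dataset, e.g. `f["spectrum"][rows, startkanal:endkanal]`, so only the matching rows are read from disk):

```python
import numpy as np

# Toy stand-ins for the real data (assumption: angles in a 1-D array,
# measurements in a 2-D table with one row per angle reading).
angle = np.array([215, 216, 218, 220, 221])
table = np.array([[1, 2, 3],
                  [4, 5, 6],
                  [7, 8, 9],
                  [10, 11, 12],
                  [13, 14, 15]])

startkanal, endkanal = 0, 3   # channel range to keep (illustrative)

# Select rows whose angle is in [216, 220]:
mask = (angle >= 216) & (angle <= 220)
rows = np.flatnonzero(mask)   # row indices -> [1, 2, 3]

sel = table[rows, startkanal:endkanal]

per_channel = sel.sum(axis=0)  # sum over the selected rows -> one value per channel
per_row     = sel.sum(axis=1)  # sum over the channels      -> one value per row
```

Because the angle array is tiny compared to the table, the mask itself is cheap; the expensive part (reading the table) happens only once, for the matching rows.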

any ideas?


You are moving on a Pareto front of code complexity, space used, and latency. You could organise the data as a stream of records ordered by angle; that way data access is efficient row-wise but costly column-wise.
This is the non-homogeneous, or record-based, storage.

Alternatively, you could use homogeneous storage, where each cell has the same datatype; then with chunks you can control the efficiency of access with respect to the pattern: row-wise, column-wise, or block.
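For the homogeneous case, a minimal h5py sketch of a chunked dataset; the shape, chunk shape, and dataset name are illustrative assumptions, and the chunk shape should be tuned to your real access pattern:

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")

with h5py.File(path, "w") as f:
    # Homogeneous 2-D dataset. The chunk shape (16, 32768) holds 16 full
    # rows per chunk, which favours row-wise reads; for column-wise reads
    # you would choose something like (10000, 64) instead.
    dset = f.create_dataset(
        "spectrum",
        shape=(10000, 32768),
        dtype="f4",
        chunks=(16, 32768),
        compression="gzip",   # unwritten chunks take no space on disk
    )
    dset[0, :] = np.arange(32768, dtype="f4")  # write one row as a demo

with h5py.File(path, "r") as f:
    row0 = f["spectrum"][0, :8]  # reading a row slice touches one chunk
```

Reading along the chunked direction then touches few chunks; reading against it touches one chunk per element, which is where the latency cost shows up.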

In some cases it is advisable to build indices to locate the data rows or blocks of interest. This happens, for example, when you dump data into a packet table from a sensor network and process it later.

Hope it helps

Do you have a main environment? Multiple environments? For Python, R, Julia, etc. there are ready-made solutions, e.g. PyTables, or in the data-frame line of thinking pandas, rhdf5, etc.

Rolling your own makes sense only once you want to scale up and out, and multiple languages/environments are involved.

G.