Compression filter that employs multiple datasets

I may have a compression filter that works for UNstructured data: not a nice 2D/3D array, but a linearized list of values representing the nodes of a 2D/3D unstructured mesh, for example. The knowledge of which points in the linearized list to be compressed are “next to” each other comes from a second list (such as an integer nodelist).

My first thought is that the compression filter would need to interrogate another dataset in the file (the dataset with the integer nodelist) to do its work. Is that possible?
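(For context, the sketch below shows roughly what a filter callback is handed; the filter name and ID are hypothetical. The only per-dataset side channel is the cd_values[] array of unsigned ints supplied via H5Pset_filter() on a chunked dataset’s creation property list, which can carry a few small parameters but not a whole nodelist, and certainly not a handle back into the file.)

```c
/* Sketch of a custom filter skeleton (hypothetical name and ID).
 * The callback receives only the chunk buffer and the cd_values[]
 * integers -- no file or dataset handle. */
#include <hdf5.h>

#define MY_NODELIST_FILTER ((H5Z_filter_t)32768)  /* hypothetical filter ID */

static size_t
nodelist_filter(unsigned flags, size_t cd_nelmts, const unsigned cd_values[],
                size_t nbytes, size_t *buf_size, void **buf)
{
    (void)cd_nelmts; (void)cd_values; (void)buf;
    if (flags & H5Z_FLAG_REVERSE) {
        /* decompress *buf (reallocating if needed), return decompressed size */
    } else {
        /* compress *buf; cd_values[] can carry a few integer parameters,
         * but nothing like a whole nodelist */
    }
    *buf_size = nbytes;    /* placeholder: pass the chunk through unchanged */
    return nbytes;
}

static const H5Z_class2_t nodelist_filter_class = {
    H5Z_CLASS_T_VERS,            /* H5Z interface version */
    MY_NODELIST_FILTER,          /* filter ID */
    1, 1,                        /* encoder/decoder present */
    "nodelist-aware compressor (sketch)",
    NULL,                        /* can_apply */
    NULL,                        /* set_local */
    nodelist_filter              /* the callback above */
};
```

It would be registered with H5Zregister(&nodelist_filter_class) and attached with H5Pset_filter(dcpl, MY_NODELIST_FILTER, H5Z_FLAG_MANDATORY, cd_nelmts, cd_values).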

Another thought I had is that the memory type in an H5Dwrite call could include both the data to be written and the companion (integer nodelist) data; the compressor would do its work on the data and toss the companion data. But how would readback work in that case?
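To make that second idea concrete, here is a sketch in which the dataset’s stored type is itself the compound (names and layout are my own assumptions). A filter attached to the dataset’s chunked creation property list would then see both members in every chunk, which also makes the readback problem visible: on decompression the filter would have to reproduce the node members it discarded, or the chunk no longer matches the stored datatype.

```c
#include <hdf5.h>

typedef struct {
    double value;   /* the data we actually want to keep */
    int    node;    /* companion nodelist entry riding along */
} elem_t;

/* Write n interleaved (value, node) pairs into a new dataset "field".
 * dcpl is assumed to be a chunked dataset creation property list with
 * the nodelist-aware filter already attached. */
static herr_t write_with_companion(hid_t fid, hid_t dcpl,
                                   const elem_t *elems, hsize_t n)
{
    hid_t memtype = H5Tcreate(H5T_COMPOUND, sizeof(elem_t));
    H5Tinsert(memtype, "value", HOFFSET(elem_t, value), H5T_NATIVE_DOUBLE);
    H5Tinsert(memtype, "node",  HOFFSET(elem_t, node),  H5T_NATIVE_INT);

    hid_t space = H5Screate_simple(1, &n, NULL);
    hid_t dset  = H5Dcreate2(fid, "field", memtype, space,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);
    herr_t ret  = H5Dwrite(dset, memtype, H5S_ALL, H5S_ALL, H5P_DEFAULT, elems);

    H5Dclose(dset);
    H5Sclose(space);
    H5Tclose(memtype);
    return ret;
}
```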

I suppose the data to be compressed and the companion data could also be treated as a single, aggregate dataset. That might be ok. But I might have many datasets all using the same companion data, and I wouldn’t want to store that companion data multiple times, once per dataset.
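One way to avoid that duplication, sketched below with made-up names, would be to store the nodelist once and have each data dataset carry only a small attribute naming it; the filter or the reading application can follow that pointer instead of carrying the companion data per dataset.

```c
#include <hdf5.h>
#include <string.h>

/* Attach a string attribute "companion_nodelist" to dataset dset, naming
 * the single shared nodelist dataset (path is hypothetical). */
static herr_t tag_companion(hid_t dset, const char *nodelist_path)
{
    hid_t strtype = H5Tcopy(H5T_C_S1);
    H5Tset_size(strtype, strlen(nodelist_path) + 1);

    hid_t space = H5Screate(H5S_SCALAR);
    hid_t attr  = H5Acreate2(dset, "companion_nodelist", strtype, space,
                             H5P_DEFAULT, H5P_DEFAULT);
    herr_t ret  = H5Awrite(attr, strtype, nodelist_path);

    H5Aclose(attr);
    H5Sclose(space);
    H5Tclose(strtype);
    return ret;
}

/* usage with hypothetical paths:
 *   tag_companion(dset_pressure, "/mesh/nodelist");
 *   tag_companion(dset_velocity, "/mesh/nodelist");                     */
```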

Just curious if anyone else has tackled this issue?

Hi Mark,
I have incomplete C++ work on sparse matrix representations covering the most popular formats. Taking Compressed Sparse Row (CSR) from the collection, you have {values, column_idx, row_idx}, with the following possible layouts:

  • some_group with 3 datasets, plus possible extra data as group attributes. Easy to implement and understand; Python and friends have already established the pattern (see the sketch after this list).
  • a single dataset with a filter. This delivers a message: it is one dataset, meant to be accessed in a given way, and nothing good comes of meddling with its internals.
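A minimal sketch of the first layout, with illustrative names only; extra metadata (matrix shape, nnz) would go on the group as attributes, as mentioned above:

```c
#include <hdf5.h>

/* Layout 1: a group holding the three CSR arrays as separate datasets. */
static void write_csr(hid_t fid,
                      const double *values, hsize_t nnz,
                      const long   *col_idx,
                      const long   *row_ptr, hsize_t nrows_plus_1)
{
    hid_t grp = H5Gcreate2(fid, "csr_matrix",
                           H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t s_nnz = H5Screate_simple(1, &nnz, NULL);
    hid_t s_row = H5Screate_simple(1, &nrows_plus_1, NULL);

    hid_t d_val = H5Dcreate2(grp, "values", H5T_NATIVE_DOUBLE, s_nnz,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t d_col = H5Dcreate2(grp, "column_idx", H5T_NATIVE_LONG, s_nnz,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t d_row = H5Dcreate2(grp, "row_idx", H5T_NATIVE_LONG, s_row,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(d_val, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, values);
    H5Dwrite(d_col, H5T_NATIVE_LONG,   H5S_ALL, H5S_ALL, H5P_DEFAULT, col_idx);
    H5Dwrite(d_row, H5T_NATIVE_LONG,   H5S_ALL, H5S_ALL, H5P_DEFAULT, row_ptr);

    H5Dclose(d_val); H5Dclose(d_col); H5Dclose(d_row);
    H5Sclose(s_nnz); H5Sclose(s_row);
    H5Gclose(grp);
}
```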

I find the second version more interesting, but also more work. The idea is to pack the related data into a single chunk and use direct chunk write/read to do the I/O. Is this something you have in mind (abstractly speaking)?
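Roughly, the packed variant could look like the sketch below (names are illustrative, and it assumes HDF5 >= 1.10.2 for the direct chunk calls). Readback would go through H5Dget_chunk_storage_size() to size the buffer and H5Dread_chunk() to fetch the raw bytes, again bypassing the filter pipeline.

```c
#include <hdf5.h>

/* Layout 2 sketch: the packed CSR blob lives in the single chunk of a
 * 1-D byte dataset and is moved with the direct chunk I/O calls.
 * Packing/unpacking the blob is up to the application (or a filter
 * registered on the dataset). */
static void write_packed_chunk(hid_t fid, const void *blob, hsize_t blob_size)
{
    hid_t space = H5Screate_simple(1, &blob_size, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, &blob_size);          /* one chunk holds everything */

    hid_t dset = H5Dcreate2(fid, "csr_packed", H5T_NATIVE_UINT8, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    hsize_t offset = 0;
    /* write the raw bytes straight into the chunk, bypassing the pipeline */
    H5Dwrite_chunk(dset, H5P_DEFAULT, 0 /* filter mask */, &offset,
                   (size_t)blob_size, blob);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
}
```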

best wishes: steven

Mark, I read the UN… I would still like to start from the simple case of structured data. In that case, Lucas’ HDF5-UDF (a.k.a. computational storage) might be the way to go. It would also cover block meshes, where you have structured blocks stitched together. A virtual dataset, which can be a mixture of local and non-local data, could perhaps be used to stitch a few computational datasets and the seams (which may or may not compress well) together. How far this idea can be pushed, I don’t know. It appears, though, that if you had a dictionary of a few simple reference blocks (meshes) and a standard set of operators (cheaply computable!) to map them to regions of the real mesh, then this looks to me like a case for some form of “computational storage.”
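For the stitching part, here is a minimal sketch of what a virtual dataset mapping could look like (file and dataset names are made up); each source block could itself be a compressed or UDF-backed dataset.

```c
#include <hdf5.h>

/* Stitch two structured blocks, stored in block0.h5 and block1.h5 under
 * /data (hypothetical names), side by side into one 1-D virtual dataset. */
static void build_virtual(hid_t fid, hsize_t n0, hsize_t n1)
{
    hsize_t total = n0 + n1;
    hid_t vspace  = H5Screate_simple(1, &total, NULL);
    hid_t dcpl    = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t start, count;

    /* first block maps onto elements [0, n0) of the virtual dataset */
    hid_t src0 = H5Screate_simple(1, &n0, NULL);
    start = 0;  count = n0;
    H5Sselect_hyperslab(vspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    H5Pset_virtual(dcpl, vspace, "block0.h5", "/data", src0);

    /* second block maps onto elements [n0, n0 + n1) */
    hid_t src1 = H5Screate_simple(1, &n1, NULL);
    start = n0; count = n1;
    H5Sselect_hyperslab(vspace, H5S_SELECT_SET, &start, NULL, &count, NULL);
    H5Pset_virtual(dcpl, vspace, "block1.h5", "/data", src1);

    hid_t vdset = H5Dcreate2(fid, "stitched", H5T_NATIVE_DOUBLE, vspace,
                             H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(vdset);
    H5Sclose(src1); H5Sclose(src0);
    H5Pclose(dcpl); H5Sclose(vspace);
}
```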

Best, G.


I forgot to mention that workflow-style multi-dataset computations are a standard use case for HDF5-UDF:
See the example of C = A.foo + B.bar near the bottom of HDF5-UDF,
where A and B are datasets of compound datatypes, and we derive C as the sum of the foo and bar fields.

G.

@miller86, it is possible to open another dataset from your I/O filter implementation, but it’s a bit tricky: the filter API doesn’t give you the file handle, so you won’t be able to simply H5Dopen() and call it a day.

The workarounds I use in HDF5-UDF to identify the underlying file are to scan /proc/self/fd (on Linux), look up entries on /proc/fd (on macOS), and resort to Win32 APIs (on Windows). Once I have a set of candidate files, I inspect each one for the dataset I’d like to read.
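For the Linux case, the scan boils down to something like the sketch below (error handling trimmed, and the helper name is mine); macOS and Windows need their own enumeration, but the H5Fis_hdf5() / H5Lexists() probing is the same.

```c
#include <hdf5.h>
#include <dirent.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/* Walk this process's open file descriptors, keep the HDF5 files, and
 * return a handle to the first one containing dset_path (e.g. the
 * nodelist dataset).  Caller closes the returned file with H5Fclose(). */
static hid_t find_file_with_dataset(const char *dset_path)
{
    DIR *dir = opendir("/proc/self/fd");
    if (!dir)
        return H5I_INVALID_HID;

    struct dirent *entry;
    char link[64], target[PATH_MAX];
    hid_t found = H5I_INVALID_HID;

    while (found < 0 && (entry = readdir(dir)) != NULL) {
        snprintf(link, sizeof link, "/proc/self/fd/%s", entry->d_name);
        ssize_t len = readlink(link, target, sizeof target - 1);
        if (len <= 0)
            continue;
        target[len] = '\0';

        if (target[0] != '/' || H5Fis_hdf5(target) <= 0)
            continue;                        /* not a plain HDF5 file */

        hid_t fid = H5Fopen(target, H5F_ACC_RDONLY, H5P_DEFAULT);
        if (fid < 0)
            continue;
        if (H5Lexists(fid, dset_path, H5P_DEFAULT) > 0)
            found = fid;                     /* candidate has the dataset */
        else
            H5Fclose(fid);
    }
    closedir(dir);
    return found;
}
```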

As @gheber says, most use cases I have around HDF5-UDF involve producing values for an output dataset given values from other existing datasets. If you think that UDFs might be a good fit for you, please let me know. I’ll be happy to give you pointers to get started 🙂

Best regards,
Lucas