Resizing Virtual Datasets

I am working on a dataset that we receive in real time as a list of 10-second HDF5 files. I have been using the virtual dataset feature of h5py with great success to access the dataset seamlessly. But as the data comes in, I’ve reached a number of files where rebuilding the virtual dataset takes so long (several hours) that we can no longer access the latest updates in a reasonable amount of time with this method. Is there a way to resize a virtual dataset and update it with new virtual sources without having to recompute everything? Most of the time goes into opening all the files to retrieve the metadata needed to build the virtual dataset correctly, so I’m thinking of keeping that metadata in a file and building the virtual dataset from there. But it would be great if this could be avoided. I looked in the documentation but couldn’t find anything, and all my attempts have failed so far…

Can you describe your use case in a little more detail? Using the VDS RFC as a reference, which of the use cases in section 3 resembles your use case? Is the number of frames/samples in new files predictable/controllable? The sources of virtual datasets can be virtual datasets. I’m not suggesting going overboard here, but come up w/ a strategy to control the redefinition overhead. Also, are you keeping the VDS in memory? (E.g., use the core driver on the file containing the VDS, at least during acquisition? You can still copy it later to another file.)
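For the core driver, something along these lines (untested sketch; file name is a placeholder):

```python
import h5py

# Keep the master file in memory while acquisition is running; with
# backing_store=True the in-memory image is written to "master.h5" on close,
# so it can still be handed to readers or copied elsewhere afterwards.
f = h5py.File("master.h5", "w", libver="latest",
              driver="core", backing_store=True)
# ... define the VirtualLayout and call f.create_virtual_dataset() here ...
f.close()
```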

G.

Sure,

My case looks like case 8.5, “Fixed-dimension Full Source and Unlimited-dimension Virtual Dataset Mapping, Blocked Pattern in the Virtual Dataset, printf-named Source Datasets with an Unlimited Dimension.”

We have a set of files containing 2D datasets and metadata. The 2D datasets are generally the same size (but exceptions can occur) and we are interested in the full source (but exceptions can occur). We want to concatenate all the datasets along the first axis. The number of these files is growing over time (we already have ~1 million after a month and plan to record for years…). I’m looking for a way to easily access all these files, ideally without rewriting them and with a quick way to update the virtual dataset where the files are linked.

Right now, the script I’m using has to go through all the files to check their dimensions and retrieve some of the metadata, which I put together in a file that contains a summary of the metadata and a huge virtual dataset. This takes hours, but that’s not a problem if I only have to run it once and can then update it quickly with the latest new files. Users can then easily access the dataset from there.

I discovered that there is a dataset.virtual_sources() function that would allow me to retrieve the sources and potentially recreate the virtual dataset, but I wonder if there is a better way. (Also, it already takes 10 seconds to access the sources in my case, so I’m afraid it will be too slow when I have several million files).
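For example, just listing the existing mappings looks roughly like this (dataset and file names simplified):

```python
import h5py

with h5py.File("master.h5", "r") as f:
    dset = f["data"]
    # Each entry is a VDSmap namedtuple: (vspace, file_name, dset_name, src_space)
    for vs in dset.virtual_sources():
        print(vs.file_name, vs.dset_name)
```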

Thanks for your help.

A.

Thanks for the explanation. I think there are several strategies to consider.

Depending on the variability of the dataset shapes, you could define a virtual dataset that covers a certain minimum subset that you know to be present in all datasets/files. The printf-formatted definition works for dataset names and file names. So you could define this VDS without touching/opening any source files.
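(For the record, the printf-style “%b” naming from the RFC is set through the HDF5 dataset-creation property list; the h5py sketch below simply spells out each mapping from the expected name pattern instead, so it still never opens a source file. The pattern, dataset name, and per-file shape are stand-ins for yours, and the code is untested.)

```python
import h5py

N_FILES = 1_000_000                     # expected number of source files
N_SAMPLES, N_CHANNELS = 1000, 3         # assumed minimum per-file shape
file_names = [f"acq_{i:08d}.h5" for i in range(N_FILES)]  # assumed naming pattern

layout = h5py.VirtualLayout(shape=(N_FILES * N_SAMPLES, N_CHANNELS), dtype="f8")
for i, name in enumerate(file_names):
    # The mapping is defined from the expected name and shape only; the source
    # file is never opened here (it may not even exist yet). Missing or short
    # sources just read back as the fill value.
    layout[i * N_SAMPLES:(i + 1) * N_SAMPLES] = h5py.VirtualSource(
        name, "data", shape=(N_SAMPLES, N_CHANNELS))

with h5py.File("master.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)
```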

How, and at what granularity, will the data be accessed? Daily? Weekly? Monthly? It might make more sense to define VDS at the minimum granularity level and then define additional VDS on top of those as needed. I think you won’t get around some form of consolidation, because the more VDS you have in your data path, the more overhead/indirection/fragmentation you have. That will kill performance eventually.
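A daily VDS stacked into a master VDS could look something like this (a sketch; file and dataset names are made up, and only the daily files, not the raw ones, are opened to get their lengths):

```python
import h5py

daily_files = ["day_2023-01-01.h5", "day_2023-01-02.h5"]  # hypothetical daily VDS files

# Read only the daily VDS files (hundreds, not millions) to get their shapes.
shapes = []
for name in daily_files:
    with h5py.File(name, "r") as f:
        shapes.append(f["data"].shape)

total = sum(s[0] for s in shapes)
layout = h5py.VirtualLayout(shape=(total,) + shapes[0][1:], dtype="f8")

offset = 0
for name, shape in zip(daily_files, shapes):
    layout[offset:offset + shape[0]] = h5py.VirtualSource(name, "data", shape=shape)
    offset += shape[0]

with h5py.File("master.h5", "w", libver="latest") as f:
    f.create_virtual_dataset("data", layout, fillvalue=0)
```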

What’s the storage/file system? There is plenty of room here for parallelization, i.e., you can process all those files in parallel (e.g., mpi4py), collect and process the metadata in memory, and then create the “master file” in one go.
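With mpi4py, the metadata scan could be split across ranks along these lines (a sketch; the file pattern and the metadata collected are placeholders):

```python
from mpi4py import MPI
import glob
import h5py

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

files = sorted(glob.glob("raw_*.h5"))   # hypothetical file pattern
my_files = files[rank::size]            # round-robin split across ranks

# Each rank opens its share of the files and records (name, shape).
local_meta = []
for name in my_files:
    with h5py.File(name, "r") as f:
        local_meta.append((name, f["data"].shape))

# Gather everything on rank 0, which then builds the master/VDS file in one go.
all_meta = comm.gather(local_meta, root=0)
if rank == 0:
    meta = sorted(m for chunk in all_meta for m in chunk)
    # ... build the VirtualLayout from `meta` and write the master file ...
```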

Finally, there might be better solutions than VDS. I don’t know your bandwidth/latency requirements, but have you looked at the HSDS implementation of HDF5? As long as you have a reasonable upper bound on the dataset shape, you can create that large dataset in HSDS with no VDS needed. Why create a gazillion files in the first place, and condemn yourself to “death by fragmentation?” You can also do a hybrid approach where you acquire data in files and then load, say, a daily batch into HSDS, and keep the original files for archival purposes but run your analytics against HSDS.
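With h5pyd (the h5py-like client for HSDS), and assuming it mirrors the h5py calls used here, loading a batch into one large, extendable dataset could look like this (domain path, shapes, and file names are made up):

```python
import h5py
import h5pyd

# One large, extendable dataset in HSDS instead of a VDS over many files.
with h5pyd.File("/home/acq/monitoring.h5", "a") as f:
    if "data" not in f:
        f.create_dataset("data", shape=(0, 3), maxshape=(None, 3), dtype="f8")
    dset = f["data"]

    # Load one of the small acquisition files and append its block.
    with h5py.File("raw_000001.h5", "r") as src:
        block = src["data"][...]
    n = dset.shape[0]
    dset.resize(n + block.shape[0], axis=0)
    dset[n:] = block
```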

G.

We are doing real-time monitoring. We set up the instrument to write small files that are sent over the internet to our servers. The time lag we get is directly linked to the length of those files. The files are processed at reception, but we also need simple access to the whole dataset for further investigation. Yes, it would be better to consolidate those files at some point, but because the final storage format is still under debate, I was looking for a convenient temporary solution to work directly with the gazillion files we have. A few other points:

  • Because we are dealing with continuous time series, we cannot afford to miss any part of the data.
  • The idea of making one VDS per day and then linking those VDS together is not bad.
  • I started to explore parallelization, but it still looks suboptimal to recompute everything every time I want to add a few files to the VDS.
  • I will take a look at HSDS.

Reading your message, I arrived at the conclusion that there is no simple way to add a new source to an existing VDS. That was my main question. Thank you for the proposed strategies; I will explore those options.

A.

You’d do that at a granularity that makes sense, e.g., daily.

Not if everything that comes along is unlike what’s come before. If the file and dataset names follow a pattern, you can use printf-formatted dataset and file names, and there is no need to modify or add anything to an existing VDS. It’s like a simple regular expression.

The challenge is finding the right balance between your acquisition constraints and your analytics needs. Since the gazillion files are an artifact of the acquisition, you need a consolidation strategy to limit the knock-on effect on analytics performance. I would not consider VDS a consolidation strategy if there is too much variability in the dataset shapes. How big are those files? You might keep a window, e.g., the current day’s or hour’s data, in memory and consolidate it in the background. All of this would be unnecessary w/ HSDS.
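A background consolidation pass over, say, one day’s files could be as simple as this sketch (file patterns and dataset names are placeholders):

```python
import glob
import h5py

# Consolidate one day's small files into a single chunked, extendable dataset.
day_files = sorted(glob.glob("raw_20230101_*.h5"))

with h5py.File("day_20230101.h5", "w") as out:
    dset = None
    for name in day_files:
        with h5py.File(name, "r") as src:
            block = src["data"][...]
        if dset is None:
            # Create the output dataset lazily, once the block shape is known.
            dset = out.create_dataset("data", shape=(0,) + block.shape[1:],
                                      maxshape=(None,) + block.shape[1:],
                                      dtype=block.dtype, chunks=True)
        n = dset.shape[0]
        dset.resize(n + block.shape[0], axis=0)
        dset[n:] = block
```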

G.