Memory usage with many datasets

Hello

First, a short explanation of how I use libhdf5: I have an application which produces a good amount of data that is stored in multiple HDF5 files. The number of files is around 300, and each file contains 30 datasets. Those are always either 1- or 2-dimensional, and only the first dimension is allowed to grow. So overall ~9000 datasets are written. Also, inside one file all datasets have the same first dimension (= data is always written to all of them).
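
To make the layout concrete, here is a minimal sketch of one of these growing datasets (not my actual code; the name, datatype and chunk size are just illustrative):

```c
/* Sketch: one 1-D dataset whose first (and only) dimension is unlimited.
 * Chunking is required so that the dataset can grow. */
#include "hdf5.h"

hid_t create_growing_vector(hid_t file_id, const char *name, hsize_t chunk_rows)
{
    hsize_t dims[1]    = {0};                /* start empty              */
    hsize_t maxdims[1] = {H5S_UNLIMITED};    /* first dimension may grow */
    hid_t   space      = H5Screate_simple(1, dims, maxdims);

    hid_t   dcpl       = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[1]   = {chunk_rows};
    H5Pset_chunk(dcpl, 1, chunk);

    hid_t dset = H5Dcreate2(file_id, name, H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```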

All those files are rotated after one hour: existing data is flushed, the HDF5 file is closed, and finally a new HDF5 file is opened. It’s important that data is only ever written / appended.
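
The rotation is conceptually just flush, close and reopen, roughly like this (simplified sketch; `make_new_filename` is a made-up helper, and recreating the ~30 datasets in the new file is omitted):

```c
/* Sketch of the hourly rotation. */
#include <stddef.h>
#include "hdf5.h"

const char *make_new_filename(void);   /* hypothetical helper */

void rotate_file(hid_t *file_id, hid_t *dsets, size_t n_dsets)
{
    H5Fflush(*file_id, H5F_SCOPE_GLOBAL);          /* flush buffered data */
    for (size_t i = 0; i < n_dsets; i++)
        H5Dclose(dsets[i]);                        /* close all datasets  */
    H5Fclose(*file_id);                            /* close the old file  */

    *file_id = H5Fcreate(make_new_filename(), H5F_ACC_TRUNC,
                         H5P_DEFAULT, H5P_DEFAULT);
}
```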

Due to the global lock in libhdf5, it is not possible to write the data as fast as it is produced. My solution is to simply load the library multiple times with dlopen, so that each library instance handles 50 files (= 1500 datasets). Spawning multiple processes would also be possible, but that has significant downsides, because the application shares a lot of internal state which would have to be duplicated (= higher memory and CPU usage). Because of that, the dlopen “trick” is used.
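
To sketch the idea (simplified, and not necessarily the exact mechanism in my code): on glibc, dlmopen with LM_ID_NEWLM loads a separate copy of the shared object into its own link-map namespace, so each copy gets its own globals and therefore its own global lock:

```c
/* Sketch only: obtain an independent libhdf5 instance via dlmopen (glibc).
 * The constants below mirror the HDF5 header values so that nothing needs
 * to be linked against the library directly. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>

typedef int64_t hid_t;                 /* matches hid_t in HDF5 1.10+ */
#define MY_H5F_ACC_TRUNC 0x0002u       /* value of H5F_ACC_TRUNC      */
#define MY_H5P_DEFAULT   ((hid_t)0)    /* value of H5P_DEFAULT        */

typedef hid_t (*H5Fcreate_fn)(const char *, unsigned, hid_t, hid_t);

int main(void)
{
    /* Each dlmopen() call yields an independent library instance. */
    void *inst = dlmopen(LM_ID_NEWLM, "libhdf5.so", RTLD_NOW | RTLD_LOCAL);
    if (!inst) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    H5Fcreate_fn my_H5Fcreate = (H5Fcreate_fn)dlsym(inst, "H5Fcreate");
    hid_t file = my_H5Fcreate("instance0.h5", MY_H5F_ACC_TRUNC,
                              MY_H5P_DEFAULT, MY_H5P_DEFAULT);
    /* ...this instance then handles its share of ~50 files... */
    (void)file;
    return 0;
}
```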

Until today I’ve used libhdf5 1.10.0, but for this test I have now also tried 1.14.4.1.

Now my problem:
The memory usage seems to grow a lot with the number of datasets. In general it seems to grow slowly over the first hour and then stays roughly the same. I reduced both the (already not very big) chunk size and the cache size considerably, but the memory usage still stays high (roughly unchanged). Also, after rotation I call H5garbage_collect and would then expect the memory usage to be roughly what it was initially; however, it stays high. If I reduce the number of datasets for testing, I can see that the memory consumption also drops quite a lot.
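
To illustrate the kind of tuning I mean (simplified, and not necessarily the exact route my code takes; the values are illustrative): the chunk-cache defaults can be lowered on the file access property list, and H5garbage_collect can be called after rotation:

```c
#include "hdf5.h"

/* Sketch: create the hourly file with a smaller default chunk cache
 * for all datasets opened through it. */
hid_t create_file_with_small_cache(const char *path)
{
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    /* 2nd argument (mdc_nelmts) is ignored by recent versions; here a
     * 64 KiB chunk cache per dataset instead of the 1 MiB default. */
    H5Pset_cache(fapl, 0, 521, 64 * 1024, 0.75);

    hid_t file = H5Fcreate(path, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);
    return file;
}

/* Sketch: called after the old file has been closed during rotation. */
void after_rotation(void)
{
    H5garbage_collect();   /* ask the library to release free-list memory */
}
```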

One test I did was to “merge” most of the datasets: e.g. instead of having 10 vectors of size n, I merged them into one matrix with dimensions n x 10, and used a chunk and cache size 10 times larger (so overall identical to the sum of the chunk sizes of the vectors). In the end I only kept two datasets, but they contained all the data of the initial 30 datasets. I would expect a similar memory usage, but with this it was already much lower. Unfortunately, I cannot do this merging in the final code.
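
For clarity, the merged variant looked roughly like this (sketch only; the name, datatype and chunk size are again illustrative):

```c
#include "hdf5.h"

/* Sketch: one n x 10 matrix instead of ten vectors of length n, with the
 * chunk spanning all 10 columns so that its total size matches the sum of
 * the ten per-vector chunks. */
hid_t create_merged_matrix(hid_t file_id, const char *name, hsize_t chunk_rows)
{
    hsize_t dims[2]    = {0, 10};
    hsize_t maxdims[2] = {H5S_UNLIMITED, 10};   /* only the rows may grow */
    hid_t   space      = H5Screate_simple(2, dims, maxdims);

    hid_t   dcpl       = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[2]   = {chunk_rows, 10};
    H5Pset_chunk(dcpl, 2, chunk);

    hid_t dset = H5Dcreate2(file_id, name, H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;
}
```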

Is there some internal memory in libhdf5 which gets allocated and never freed (but reused)? And is this amount maybe roughly proportional to the maximum number of datasets that were open at the same time? If something like that exists, can it be freed somehow? Or reduced?

Thanks a lot

Benjamin Meier


While it’s common for programs running on Linux to see their memory usage appear to grow even if unused blocks are freed appropriately, it’s surprising that it’s much worse in 1.14. Do you have a reproducer program?

Thanks for the response 😊

Sorry, I wrote a bit much (and probably not very clearly), but the problem is actually neither worse nor better with 1.14; it’s identical to 1.10. I only tried the new version because I thought it might be some old issue.

Overall the symptom is just that the memory usage seems to be much higher if I create many small datasets (with few “columns”) than if I create two large datasets (with many “columns”). The cache / chunk size is configured to be identical in both cases, and the amount of data is identical as well, but it just seems like there is some memory-usage penalty for having many datasets.

I’ll try to create a minimal program tomorrow. Unfortunately, the code is currently embedded in a large application, so it might take a bit of time.

Actually, thanks to creating a minimal example, I was able to figure out an issue in my code 😅. The chunk cache for the datasets was set on the wrong handle (dcpl_id instead of dapl_id). The nice thing is that libhdf5 1.14 complains about this; unfortunately, 1.10 silently accepted it (but I guess it triggered some possibly weird behavior 😅).
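
The gist of the fix, as a sketch (the cache values and the dataset name are made up):

```c
#include "hdf5.h"

/* H5Pset_chunk_cache belongs on a dataset *access* property list (dapl),
 * not on the dataset *creation* property list (dcpl). */
hid_t create_with_tuned_cache(hid_t file_id, hid_t space, hsize_t chunk_rows)
{
    hid_t   dcpl     = H5Pcreate(H5P_DATASET_CREATE);
    hsize_t chunk[1] = {chunk_rows};
    H5Pset_chunk(dcpl, 1, chunk);

    /* Wrong -- what my old code effectively did (silently accepted by 1.10):
     *   H5Pset_chunk_cache(dcpl, 521, 64 * 1024, 0.75);
     */

    /* Right -- the per-dataset chunk cache is an access property: */
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl, 521, 64 * 1024, 0.75);

    hid_t dset = H5Dcreate2(file_id, "signal", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, dapl);
    H5Pclose(dapl);
    H5Pclose(dcpl);
    return dset;
}
```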

The problem is now solved, and the underlying issue was simply that the caches were way too big because they used the default values (which are too big in my case with so many datasets).
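
For anyone hitting the same thing: the documented default raw-data chunk cache is 1 MiB per open dataset, so with ~1500 open datasets per library instance the defaults alone allow on the order of 1.5 GiB of cache. A quick sanity check of the settings actually in effect (sketch; `dset` is assumed to be an open dataset handle):

```c
#include <stdio.h>
#include "hdf5.h"

/* Sketch: print the chunk-cache settings in effect for an open dataset. */
void print_chunk_cache(hid_t dset)
{
    size_t nslots = 0, nbytes = 0;
    double w0     = 0.0;

    hid_t dapl = H5Dget_access_plist(dset);
    H5Pget_chunk_cache(dapl, &nslots, &nbytes, &w0);
    printf("chunk cache: %zu slots, %zu bytes, w0 = %.2f\n",
           nslots, nbytes, w0);
    H5Pclose(dapl);
}
```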

Thanks for the help &
Have a nice weekend😊


Great! Glad you were able to get it sorted out!