Question about memory


#1

Good day everyone,

I have a bit of confusion about how HDF5 file space on disk is handled when writing and deleting datasets.
This is the situation: I wrote a small program to test how the file size changes after:

  1. writing data to datasets
  2. removing said datasets with H5Ldelete and
  3. repacking the file.

Step 1: As expected, the file size increases after writing, so all good. For example, the file size is now 4972 KB.
(Between steps 1 and 2 I close and reopen the file to make sure the data has been written to disk.)

Step 2: After calling H5Ldelete on the datasets I would expect the file size NOT to change, since, as far as I have understood so far, the datasets have only been made unreachable, not actually removed from the file. Instead I see that the file size is now 53 KB.

Step 3: After repacking, the size of the new repacked file is 29 KB (there are still some very small datasets left untouched by design).

Points 1 and 3 are fine by me, but I don't understand what is happening in point 2, and I hope someone can enlighten me about it :sweat_smile:
My (wild) guess is that in point 2 the dataset's data is cleared and freed while the header remains part of the file.
This would be consistent with a similar result I got when shrinking a dataset of length X to 0 with H5Dset_extent(): in that case too I observed a decrease in the size of the file containing the dataset.
Still, this seems to be in total contrast with what I read in the documentation, which states that no space is freed until a repack (or similar action) is performed.

I am using HDF5 version 1.8.18, and all these sizes are the ones reported by Windows File Explorer (which might not be the most reliable source of information).

Edit: the final file sizes are also confirmed by H5Fget_filesize().

Edit 2: adding sample code that performs steps 1 and 2.
codeSample.txt (2.4 KB)


#2

Have you considered uploading your simple, self-contained, compilable example to GitHub or a similar platform? That way others can take a closer look as well.
When you speak of memory, do you mean disk space or RAM?


#3

By memory I mean disk space, sorry.

I haven't uploaded it because it still uses structures and classes from a bigger C# program, so I would need to rewrite it to make it portable (the HDF5 functions live in a C++ part of the code).

Edit: I'm attaching a small C++ program that performs the relevant actions. I'm running it in Visual Studio 2017.
codeSample.txt (2.4 KB)


#4

Did you check the return code of your H5write call? (Or run h5dump on your file?)
According to your code sample, there is no selection on mem_space, and this should
yield an H5write error because of an element count mismatch between the memory and file
selections. (Maybe H5Sselect_all on mem_space was your intention?)
Deleting the (presumably last) link will drop the dataset's reference count to 0 and
mark the whole thing as free space. Since there is no free-space tracking/reclamation in
1.8.x beyond H5Fopen/H5Fclose, the file size should change only marginally
(the link data is freed), if at all.


#5

I'm not sure I understand what you mean…
While I don't have an H5write call in the code, the return value of H5Dwrite is 0, which should mean the write operation completed successfully, right?
Also, I can't seem to find h5write and h5dump: are they part of the API for C++? (I'm using C++.)

Anyway, here I'm performing the write only once, but normally this action is performed X times (hence the dataset is chunked with unlimited max size). Every cycle I create a mem_space that just fits the data (whose size can vary each time) and write it to the dataset…

Deleting the (presumably last) link will drop the dataset's reference count to 0

This is exactly what I'm going for, and indeed what I expect to see is the size remaining (roughly) the same, which does not seem to happen.

there is no free-space tracking/reclamation in 1.8.x beyond H5Fopen/H5Fclose

Wait, so calling H5Fclose reclaims free space?


#6

https://support.hdfgroup.org/HDF5/docNewFeatures/NewFeaturesFileSpaceMgmtDocs.html


#7

file_size.c (1.9 KB)

With HDF5 1.10.5 the output looks like this:

Initial file size: 2048 [bytes]
File size after dataset creation/append: 242768 [bytes]
File size after dataset unlink: 241224 [bytes]
Final file size after close/re-open: 800 [bytes]

What’s the HDF5 1.8.x output?


#8

Thank you for the code! And sorry for the late reply, but I was out of the office…

With HDF5 1.8 I get the same values as you do!

However, since we normally work with much bigger datasets, I tested the code with more data by adding some extra extend/write cycles, and what I observe is that beyond a certain size it does not recover all of the space after deleting the dataset, only a fraction. Using the same milestones:

Initial file size: 2048
File size after dataset creation/append (performing 3 appends): 715176
File size after dataset unlink: 714632
Final file size after close/re-open: 478904

It seems that it went back to the dataset size before the last append (the file sizes after each append were 242768, 478344 and 715176). What is happening here, then?


#9

I don't know much about how HDF5 works, but could it be that the data for the last append is allocated at the end of the file, and when the file is written out to disk the unused space at the end is reclaimed, while internal “holes” of unused space surrounded by live data objects are not? (That is, no reorganization of the file layout is done to reclaim such areas.)

Just a guess. To really know I guess you need to dive into the guts of HDF5.


#10

Hi,
I am not getting the same result as you… I am using HDF.PInvoke with HDF5 1.10.5 in .NET, but the file size comes out the same in bytes…
long fileId = H5F.open(@"NEW_PRO_1.h5", H5F.ACC_RDWR, H5P.DEFAULT);

long MapGroupId = H5G.open(fileId, "/NEW_PRO_1/Sample/Area/Map/");

int status = H5L.delete(MapGroupId, "DATA");
Hdf5.CloseGroup(MapGroupId);

ulong fileSizeAfterLinkDelete = 0;
H5F.get_filesize(fileId, ref fileSizeAfterLinkDelete);
H5F.close(fileId);
Debug.Write($"File size after link delete in bytes {fileSizeAfterLinkDelete}");

long readOnlyFileId = H5F.open(@"NEW_PRO_1.h5", H5F.ACC_RDONLY, H5P.DEFAULT);
ulong fileSizeAfterOpen = 0;
H5F.get_filesize(readOnlyFileId, ref fileSizeAfterOpen);
Debug.Write($"File size after Open in bytes {fileSizeAfterOpen}");
H5F.close(readOnlyFileId);