HDF5 Memory Usage High for Writes

I am developing a multi-threaded C++ application that uses the C HDF5 library, version 1.10.2, on Red Hat Enterprise Linux 7.7. The application is long-running and uses a lot of memory (~2 GB while writing ~100 MB HDF5 files). Valgrind does not show any memory leaks.

The main thread spawns up to 16 threads to gather data from other processes. Then 15 of the threads complete, and the remaining thread spawns another thread to write the data (which is 1-D) to an HDF5 file. My application uses more and more system memory, up to ~2 GB after about 2 hours. For another run with more points, system memory fills up and the application starts swapping to disk.

The previous version of this application, which was also multi-threaded, did not use HDF5 and did not use as much memory. I tried massif with the option set to show page allocations, and the largest page allocation was not from HDF5 but from another library used in the application. That other library is also used in the previous version of my application, but there it did not use as much memory.

I experimented with various chunk sizes. The original chunk size was 1x10 items (usually 40 B); I then tried increasing the chunk size to 1x1000 (4 kB), and the memory usage stayed the same.

Hi Glenn,

To start, I’d update your HDF5 version to something more recent than 1.10.2. Early versions of the 1.10 series of releases had some performance issues that were fixed in later versions. Also, the larger chunk sizes will probably be better, as they’ll result in a smaller chunk index, more efficient reads, and better performance.
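For reference, here is a minimal sketch of setting a larger chunk on the dataset creation property list (the dataset name, element type, and the chunk of 4096 elements are placeholders, not taken from your code):

#include "hdf5.h"

/* Sketch: a 1-D, extendible, chunked dataset with a larger chunk
   (4096 elements is roughly 16 KiB for 4-byte elements). */
hid_t make_chunked_dataset(hid_t file_id)
{
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    hsize_t chunk[1]   = {4096};

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);

    hid_t dset = H5Dcreate2(file_id, "signal", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Pclose(dcpl);
    H5Sclose(space);
    return dset;    /* caller extends/writes and closes with H5Dclose() */
}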

Also, are you using the thread-safe version of the library?
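If you’re not sure, the library can tell you at run time whether it was built with thread safety enabled; a quick sketch:

#include <stdio.h>
#include "hdf5.h"

/* Sketch: query whether this libhdf5 build is thread-safe. */
int main(void)
{
    hbool_t is_ts = 0;
    if (H5is_library_threadsafe(&is_ts) < 0) {
        fprintf(stderr, "H5is_library_threadsafe failed\n");
        return 1;
    }
    printf("HDF5 thread-safe build: %s\n", is_ts ? "yes" : "no");
    return 0;
}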

I have seen applications whose memory footprint grows steadily without valgrind complaining. Usually this turns out to be an effect of how the OS assigns and reclaims memory rather than something inherent to HDF5. The kernel may decide that pages allocated to your application are not reclaimable and won’t recycle them, so your application’s memory footprint grows, or it may simply be lazy about reclaiming memory.

You can find more info on how the Linux kernel allocates and recycles pages here:
https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html
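One diagnostic you could try, which is glibc-specific and has nothing to do with HDF5 itself: the allocator often keeps freed memory around for reuse instead of returning it to the kernel, and asking it to release free heap pages can show whether the growth is allocator retention rather than a real leak. A hedged sketch:

#include <malloc.h>   /* glibc-specific header */

/* Sketch (glibc only, not an HDF5 call): ask the allocator to return
   free heap pages to the kernel. If RSS drops noticeably after this,
   the growth was allocator retention, not a leak. */
void release_free_heap(void)
{
    malloc_trim(0);   /* 0 = keep no extra padding at the top of the heap */
}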

Does the amount of memory growth vary with the size of the I/O? If so, what’s the life cycle of the buffers you are allocating for I/O? Are you inserting buffers into a data structure that holds references to the buffers until the end of the program?

Also, what is the life cycle of HDF5 objects in your application? How many files and datasets do you have open at any given time? And are you closing them when you are done with them?
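If it helps, the library can report how many handles are still open against a file; a steadily growing count usually means objects aren’t being closed. A quick sketch using H5Fget_obj_count:

#include <stdio.h>
#include "hdf5.h"

/* Sketch: report open handle counts for one file. */
void report_open_objects(hid_t file_id)
{
    ssize_t n_dsets = H5Fget_obj_count(file_id, H5F_OBJ_DATASET);
    ssize_t n_attrs = H5Fget_obj_count(file_id, H5F_OBJ_ATTR);
    ssize_t n_all   = H5Fget_obj_count(file_id, H5F_OBJ_ALL);

    printf("open objects: %zd datasets, %zd attributes, %zd total\n",
           n_dsets, n_attrs, n_all);
}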

Touching back on this, I’m going to ask what may be a stupid question.

If you create a pair of array data types using H5Tarray_create with basically the same attributes (same base type, same dimensions), do they end up getting saved to your file as two distinct types?

Thanks for asking, @brian.appel. We are all here to help and learn from one another, so no stupid questions…

The correct answer is, as so often, “Yes, but…” If you define those datatypes as part of “normal” dataset or attribute creation, separate copies of the corresponding metadata will be stored with the respective datasets or attributes. In fact, you don’t need to call H5Tarray_create twice; you could just use the same datatype handle many times, as H5[A,D]create will create (object-)private copies of the datatype definition.
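To illustrate, a minimal sketch of reusing one array-datatype handle for several datasets (names and dimensions are made up); each H5Dcreate2 call stores its own private copy of the type definition in the file:

#include "hdf5.h"

/* Sketch: one datatype handle, two datasets, two copies of the type
   definition in the file. */
void create_two_datasets(hid_t file_id)
{
    hsize_t adims[1] = {3};
    hid_t   atype    = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, adims);

    hsize_t dims[1] = {10};
    hid_t   space   = H5Screate_simple(1, dims, NULL);

    hid_t d1 = H5Dcreate2(file_id, "ds1", atype, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    hid_t d2 = H5Dcreate2(file_id, "ds2", atype, space,
                          H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dclose(d1);
    H5Dclose(d2);
    H5Sclose(space);
    H5Tclose(atype);
}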

Now, let’s say that the metadata associated with a datatype were substantial, or you wanted to be certain that several datasets or attributes use the same datatype object (in the file). You can achieve this with a so-called committed or linked datatype. The idea is that before using your datatype, you store it in the file as a datatype object using H5Tcommit. That way you can reach it via a path name (like groups and datasets), which is also great for documentation purposes. You can then open such a datatype object via H5Topen and use the corresponding handle (hid_t) in the creation of new datasets or attributes, and all of them will refer to the same datatype object in their metadata rather than include separate copies of the datatype description. OK?
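A minimal sketch of the committed-datatype pattern (the path name "/vec3_t" and the dataset name are placeholders):

#include "hdf5.h"

/* Sketch: commit an array datatype under a path name, then open it by
   name and use it, so datasets share one datatype object in the file. */
void use_committed_type(hid_t file_id)
{
    hsize_t adims[1] = {3};
    hid_t   atype    = H5Tarray_create2(H5T_NATIVE_DOUBLE, 1, adims);

    H5Tcommit2(file_id, "/vec3_t", atype,
               H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Tclose(atype);

    /* Later: open the committed type by name and create datasets with it. */
    hid_t shared    = H5Topen2(file_id, "/vec3_t", H5P_DEFAULT);
    hsize_t dims[1] = {10};
    hid_t space     = H5Screate_simple(1, dims, NULL);
    hid_t dset      = H5Dcreate2(file_id, "positions", shared, space,
                                 H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dclose(dset);
    H5Sclose(space);
    H5Tclose(shared);
}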

G.

This is exactly what I was looking for! Thanks!

Following up again on this. How much space might you save by using common data types like this, both in file size and in cache size?

What’s the difference between a file with 10000 datasets defined, each with a specific data type, such as:

     DATASET "state" {
        DATATYPE  H5T_STRING {
           STRSIZE H5T_VARIABLE;
           STRPAD H5T_STR_NULLTERM;
           CSET H5T_CSET_UTF8;
           CTYPE H5T_C_S1;
        }
        DATASPACE  SIMPLE { ( 2 ) / ( H5S_UNLIMITED ) }
        DATA {
        (0): "Pos1", "Pos2"
        }
     }

Where each of the 10000 datasets defines its datatype as shown, versus the same file where you define the specifics of the “string” data type as a committed data type and all 10000 datasets simply reference that?

For “primitive” datatypes such as the one shown (UTF-8 encoded variable-length string), I’d expect the savings to be minimal. In a few cases, storing the reference to the shared datatype object might actually use more space than the datatype definition.

The datatype metadata is encoded in binary form and doesn’t resemble the verbose h5dump output:

DATATYPE  H5T_STRING {
  STRSIZE H5T_VARIABLE;
  STRPAD H5T_STR_NULLTERM;
  CSET H5T_CSET_UTF8;
  CTYPE H5T_C_S1;
}

That’s 115 characters or bytes. The in-file datatype message (see section IV.A.2.d, “The Datatype Message,” in the file format spec) is, I believe, just 8 bytes. That’s less than 80 KiB for 10,000 datasets. Saving 10,000 references to a shared datatype object would take more space.

Is there a maximum string length for the strings that you want to store in your dataset? If so, I’d recommend using a fixed-size string type, because then I/O will be faster and you can apply compression to your dataset. If your strings are variable-length and write-once, you could also store them as two arrays: one array stores the bytes of all strings, one after the other, and another array stores offsets into the character array indicating the beginning and end positions of the byte sequences representing individual strings. You can apply compression to these two arrays as well. (See the section on strings in the User’s Guide.)
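To make the fixed-size suggestion concrete, here is a rough sketch of a 16-byte, UTF-8, chunked and gzip-compressed string dataset (the element size, chunk size, compression level, and names are assumptions):

#include "hdf5.h"

/* Sketch: fixed-size strings allow chunking plus compression. */
void write_fixed_strings(hid_t file_id)
{
    hid_t str_t = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_t, 16);                  /* fixed 16-byte elements */
    H5Tset_cset(str_t, H5T_CSET_UTF8);
    H5Tset_strpad(str_t, H5T_STR_NULLPAD);

    hsize_t dims[1]  = {2};
    hsize_t maxd[1]  = {H5S_UNLIMITED};
    hsize_t chunk[1] = {1024};
    hid_t   space    = H5Screate_simple(1, dims, maxd);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);                 /* gzip works on fixed-size strings */

    hid_t dset = H5Dcreate2(file_id, "state", str_t, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    char buf[2][16] = {"Pos1", "Pos2"};
    H5Dwrite(dset, str_t, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(str_t);
}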

G.

Okay, interesting.

I’m currently analyzing how we’re using HDF5, and to be honest, the actual number of string types getting created is much higher than 10000.

We’re talking more like in excess of 100,000. Our code is creating individual string types (using H5Tcopy) to set up fixed-length string types of varying sizes.

Regarding string types of fixed length: looking at the file spec again, I see I misread it the first time.

Based on your reply I guess that a reference to another data type (committed) would not be any more efficient, given that only 8 bytes are being used per entity that defines a data type.

Correct.

What is the metadata-to-data ratio per dataset? If the type definition were 8 bytes (and we ignore the rest of the object metadata), 8 bytes would store as few as two UTF-8 encoded Unicode code points. If your string lengths and element sizes were in the single-digit range, that would be concerning, because then the full object metadata would be comparable in size to the “raw” data.
If that were the case, I’d be looking into aggregating smaller datasets into larger datasets, and, depending on use case, there are several ways to achieve that.
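As one example of such aggregation (a sketch with made-up names and sizes): instead of 100,000 one-string datasets, a single 1-D fixed-size string dataset holds all the strings, so the per-dataset object metadata is paid only once:

#include "hdf5.h"

/* Sketch: one dataset for all strings instead of one dataset per string. */
hid_t create_aggregated_strings(hid_t file_id, hsize_t n_strings, size_t max_len)
{
    hid_t str_t = H5Tcopy(H5T_C_S1);
    H5Tset_size(str_t, max_len);             /* size of the longest string */

    hsize_t dims[1]  = {n_strings};
    hsize_t chunk[1] = {n_strings < 4096 ? n_strings : 4096};
    hid_t   space    = H5Screate_simple(1, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);
    H5Pset_deflate(dcpl, 6);

    hid_t dset = H5Dcreate2(file_id, "all_states", str_t, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Pclose(dcpl);
    H5Sclose(space);
    H5Tclose(str_t);
    return dset;    /* caller writes rows into it and closes it */
}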

Best, G.