Memory usage increase when writing data sets (without closing file)

Hello,

I have written a simple C++ program that creates an HDF5 file with one group and one attribute. The program then writes 100 datasets under the group. I wrote it to check whether memory usage increases when writing data sets. I am using Windows 11, Visual Studio 2019, and HDF5 library version 1.14.2.

Looking at the memory usage in Visual Studio, the memory slowly increases from 3 MB to 4 MB after writing about 46 data sets. I tried doing a global flush and explicitly calling garbage collection, but it did not seem to help. The only way I found to keep memory constant is to close and reopen the file after each write. My question: is there a way to keep the memory constant without having to close and reopen the file? Here is a screenshot of the memory increase, and I have posted the code below it.

#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <thread>
#include <H5Cpp.h>

int main(int argc, char** argv)
{
    std::string filename = "test.h5";

    H5::H5File* localFileHandel = nullptr;

    // Try block to detect exceptions raised by any of the calls inside it
    try
    {
        localFileHandel = new H5::H5File(filename, H5F_ACC_TRUNC);
    }
    catch (H5::FileIException& error)
    {
        std::cout << "ERROR: Unable to create a new H5 file with the name = " << filename << std::endl;
        std::cout << error.getDetailMsg() << std::endl;
        return -1;
    }

    std::cout << "Created file " << filename << std::endl;

    // Create a group
    H5::Group group = localFileHandel->createGroup("GroupTest");

    if (false == group.isValid(group.getId()))
    {
        std::cout << "Error: unable to create group " << std::endl;
        return -1;
    }
    else
    {
        std::cout << "Group created OK" << std::endl;
    }

    // Create an attribute within the group
    H5::DataSpace dataSpace(H5S_SCALAR);
    H5::Attribute attribute = group.createAttribute("GroupTest", H5::PredType::NATIVE_INT, dataSpace);

    // Create and write one data set every 0.5 s, 100 times
    std::vector<float> data(256 * 256 * 4, 10.0f);

    for (size_t ii = 0; ii < 100; ++ii)
    {
        //hdf5Io.readWrite(filename);
        std::string dataName = std::to_string(ii);
        std::vector<hsize_t> dims = { (hsize_t)data.size() };

        // 1-D dataspace whose maximum size equals its current size
        H5::DataSpace dataSpace((hsize_t)dims.size(), &dims[0], &dims[0]);
        H5::DataSet dataSet = group.createDataSet(dataName,
            H5::FloatType(H5::PredType::NATIVE_FLOAT), dataSpace);

        dataSet.write(&data[0], H5::FloatType(H5::PredType::NATIVE_FLOAT));
        //hdf5Io.close();
        std::cout << "Sleep 0.5 sec" << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(500));
        std::cout << "Wake up, write data set " << ii + 1 << std::endl;

        dataSet.close();
        // Attempts to bound memory: flush everything, then ask the library to free unused memory
        H5Fflush(localFileHandel->getId(), H5F_SCOPE_GLOBAL);
        H5garbage_collect();
    }

    delete localFileHandel; // destroys the H5File wrapper, which closes the file handle
    return 0;
}

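There is no single simple reason why HDF5 can leak memory when a file is flushed without being closed, but there are several contributing causes.

Internal cache and flush:

  • When you write data to HDF5, it is first stored in an internal cache for efficiency.
  • Calling flush pushes this cached data to disk, but the memory used by the cache itself is not necessarily released.
  • This cached data can remain allocated until the file is closed, which may result in a memory leak.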
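
The distinction between "data pushed out" and "memory released" can be illustrated with a small analogy in plain C++. This is only a sketch: ToyCache and its members are hypothetical names, and nothing here calls HDF5 or reflects how its cache is actually implemented. It shows that emptying a buffer and returning its storage are two separate steps.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a write-back cache; this is NOT HDF5's cache.
struct ToyCache {
    std::vector<float> pending;   // values waiting to be "written"
    std::size_t flushed = 0;      // how many values have been pushed out

    void put(float v) { pending.push_back(v); }

    // Push everything out, but keep the backing storage for reuse.
    void flush() {
        flushed += pending.size();
        pending.clear();          // size becomes 0, capacity is unchanged
    }

    // Hand the backing storage back by swapping with an empty vector.
    void release() {
        std::vector<float>().swap(pending);
    }
};

// Fill the cache with n values, flush it, and report the capacity still held.
std::size_t capacity_after_flush(std::size_t n) {
    ToyCache cache;
    for (std::size_t i = 0; i < n; ++i) cache.put(1.0f);
    cache.flush();
    return cache.pending.capacity();
}

// Same as above, but also release the storage after flushing.
std::size_t capacity_after_release(std::size_t n) {
    ToyCache cache;
    for (std::size_t i = 0; i < n; ++i) cache.put(1.0f);
    cache.flush();
    cache.release();
    return cache.pending.capacity();
}
```

Calling capacity_after_flush(1000) reports storage for at least 1000 floats still held after the flush, while capacity_after_release(1000) reports none on the common standard library implementations.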

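Resource allocation and finalization:

  • HDF5 allocates not only data buffers but also various other resources, such as file handles, object references, and metadata structures.
  • These resources are released only when the file is closed.
  • If the file is left open, this allocated memory is unavailable to other processes, which can look like a leak.

Garbage collection limitations:

  • HDF5 has a garbage collection facility, but it mainly acts only on closed objects.
  • flush may trigger partial garbage collection, but it does not reclaim memory associated with open files and their internal structures.

Other factors:

  • Certain configuration options and usage patterns can make memory leak problems worse.
  • For example, using chunking (writing data in small pieces) increases the size of the internal cache and the likelihood of leaks.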

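Alternatives and best practices:

  • To prevent memory leaks and ensure resources are released properly, always close HDF5 files explicitly after use.
  • Use context managers such as with h5py.File(...). They close the file automatically even if an exception occurs, improving robustness and resource management.
  • Unless you need a specific performance optimization, keep the use of flush to a minimum.
  • Adjust cache settings as needed to balance performance and memory use.

Remember: flush writes data to disk, but it does not guarantee that all resources are released. Always closing files is an essential habit for using HDF5 efficiently and without bugs.

If you are seeing a specific memory leak, consult the HDF5 documentation and forum to troubleshoot, and check for potential bug reports. Sharing code snippets and usage patterns helps with diagnosing the problem and finding a solution.

I hope the explanation above clarifies why HDF5 can leak memory if the file is not closed, even when flush is used. Be sure to close your files.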

Credit: Google AI

Thank you for the reply @hyoklee
Just to leave an English answer here: the best practice is not to use flush, because it can cause a memory leak in which the data is written to disk but not necessarily released from memory.
Instead, the best practice is to close the file after the write is complete.

I will rewrite my example once with no flush and once with a close and reopen, compare the memory usage, and post the results so that the comparison is complete. I hope it will help someone else who has the same question. I will do this later this week.


Hello,
So I worked on my example and changed the code to make it easier to provide more output. I have added a flag to choose between multiple writes without closing the file (no flush) and closing the file between writes. The memory increase still happens, and it does not seem to matter whether the file is closed between writes or not. I am not sure if I am doing anything wrong or if I am missing something. I would appreciate any further comments.

Here are screenshots again from the memory usage monitor in Visual Studio 2019:

[Screenshot: initial memory]

[Screenshot: after 50 or more writes]

HDF LIBRARY VERSION FROM H5get_libversion() = 1.14.2
Sleep between writes =500 milli-seconds
Size of data to be written per dataset =1 MB
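
As a quick check on the 1 MB figure, the arithmetic can be written out as below. This is a small sketch, not part of the posted program: kDataSize and kNumberOfWrites are illustrative names that mirror the program's dataSize and numberOfWrites, and the code assumes a 4-byte float, which it verifies at compile time.

```cpp
#include <cstddef>

// Assumption made explicit: float is 4 bytes on this platform.
static_assert(sizeof(float) == 4, "this sketch assumes a 4-byte float");

// Bytes held by `count` single-precision values.
constexpr std::size_t payload_bytes(std::size_t count) {
    return count * sizeof(float);
}

// Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes).
constexpr double to_mib(std::size_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Values taken from the program in this post.
constexpr std::size_t kDataSize = 256 * 256 * 4;   // elements per data set
constexpr std::size_t kNumberOfWrites = 100;       // data sets written
```

With these values, to_mib(payload_bytes(kDataSize)) is 1.0 per data set and the full run writes 100 times that, which gives a baseline to compare against whatever growth the memory monitor shows.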

Here is my code. Please note that there is a boolean flag called “closeReopen” . When the flag is set to true the file will be closed and reopened after each write. If the flag is false the file will not be closed.

#include "H5Cpp.h"
#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <thread>


int main(int argc, char** argv)
{
    std::string filename = "test.h5";
    size_t numberOfWrites = 100; // number of writes to perform
    size_t waitInMilliSec = 500; // wait time between writes in milliseconds
    bool closeReopen = false; // whether to close and reopen the file between writes (true) or keep it open (false)
    size_t dataSize = 256 * 256 * 4; // number of floats per data set (each element is 10.0f)

    H5::H5File* localFileHandel = nullptr;

    /*
    *  OPEN FILE
    */
    try
    {
        localFileHandel = new H5::H5File(filename, H5F_ACC_TRUNC);
    }
    catch (H5::FileIException& error)
    {
        std::cout << "ERROR: po::io::HDF5::createNew() Unable to create a new HDF5 file with the name = " << filename << std::endl;
        std::cout << error.getDetailMsg() << std::endl;

        return -1;
    }

    std::cout << "Created HDF file:" << filename << std::endl;
    
    /*
    *  CREATE GROUP
    */
 
    // First create a group and call it Group Test  
    H5::Group group = localFileHandel->createGroup("GroupTest");

    /*
    *   Check if the group creation was OK
    */
    if (false == group.isValid(group.getId()))
    {
        std::cout << "Error: unable to create group " << std::endl;
        return -1;
    }
    else
    {
        std::cout << "Group created OK" << std::endl;
    }

    /*
    *  CREATE ATTRIBUTE
    */
    // Create an integer scalar attribute and call it AttributeTest
    H5::DataSpace dataSpace(H5S_SCALAR);
    H5::Attribute attribute = group.createAttribute("AttributeTest", H5::PredType::NATIVE_INT, dataSpace);

    /*
    * Get the library version and print it 
    */
    unsigned int h5MajorVersion;
    unsigned int h5MinorVersion;
    unsigned int h5ReleaseNumber;
    H5get_libversion(&h5MajorVersion, &h5MinorVersion, &h5ReleaseNumber);
   
    /* Create the data that we will be writing multiple times.
    *  The data size is dataSize, the data is initialized to 10.0f
    *  (single-precision real data).
    */
    std::vector<float> data(dataSize, 10.0f);
    // Payload size in MB; size() is the number of elements actually written
    float vectorSizeInMem = (float)(data.size() * sizeof(float)) / 1024 / 1024;

    std::cout << "Testing HDF5 multiple writes." << std::endl;
    std::cout << "Writing = " << numberOfWrites << " Times" << std::endl;
    std::cout << "HDF LIBRARY VERSION FROM H5get_libversion() = " << h5MajorVersion << "." << h5MinorVersion << "." << h5ReleaseNumber << std::endl;
    std::cout << "Sleep between writes =" << waitInMilliSec << " milli-seconds" << std::endl;
    std::cout << "Size of data to be written per dataset =" << vectorSizeInMem  << " MB" << std::endl;
    
    if (true == closeReopen)
    {
        std::cout << "File will be closed and reopened after each data set" << std::endl;
    }
    else
    {
        std::cout << "File will not be closed between writes" << std::endl;
    }
    std::cout << std::endl << std::endl;

    /*
    *  Loop and write the data 
    */
    for (size_t ii = 0; ii < numberOfWrites; ++ii)
    {  
        // Group close / reopen does not have an effect on the memory!
        // group = localFileHandel->openGroup("GroupTest");
        // The data set name will be the loop index number 
        std::string dataName = std::to_string(ii);
        std::vector<hsize_t> dims = { (hsize_t)data.size() };

        /*
        *  Create the dataspace and the data set with GroupTest as its parent
        */
        H5::DataSpace dataSpace((hsize_t)dims.size(), &dims[0], &dims[0]);
        H5::DataSet dataSet = group.createDataSet(dataName, H5::FloatType(H5::PredType::NATIVE_FLOAT), dataSpace);

        /*
        *  Write the data set 
        */
        dataSet.write(&data[0],H5::FloatType(H5::PredType::NATIVE_FLOAT));
        /*
        *  Close the data set
        */
        dataSet.close();

        // Group close / reopen does not have an effect on the memory!
        //group.close();

        /*
        *  Adding sleep to wait between writes 
        */
        std::cout << "Wrote data set" << ii << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(waitInMilliSec));
        

        if (true == closeReopen)
        {
            // close the file 
            //std::cout << "closing file with close() function " << std::endl;
            localFileHandel->close();

            // re-open the file with read write access to append data to it 
            try
            {
                std::cout << "open file with H5F_ACC_RDWR attribute " << std::endl;
                localFileHandel->openFile(filename, H5F_ACC_RDWR);
            }
            catch (H5::FileIException& error)
            {
                std::cout << "ERROR: po::io::HDF5::createNew() Unable to reopen the HDF5 file with the name = " << filename << std::endl;
                std::cout << error.getDetailMsg() << std::endl;

                return -1;
            }
        }
    }

    /*
    *  Print the run information one more time, since the per-write console output
    *  makes it difficult to scroll back to the top to see it
    */
    std::cout << "Wrote = " << numberOfWrites << " Times" << std::endl;
    std::cout << "HDF LIBRARY VERSION FROM H5get_libversion() = " << h5MajorVersion << "." << h5MinorVersion << "." << h5ReleaseNumber << std::endl;
    std::cout << "Sleep between writes =" << waitInMilliSec << " milli-seconds" << std::endl;
    std::cout << "Size of data to be written per dataset =" << vectorSizeInMem << " MB" << std::endl;


    if (true == closeReopen)
    {
        std::cout << "File will be closed and reopened after each data set" << std::endl;
    }
    else
    {
        std::cout << "File will not be closed between writes" << std::endl;
    }

    delete localFileHandel; // destroys the H5File wrapper, which closes the file handle
    return 0;
}

Hello, I have finally figured out what is going on, with the help of my colleagues. I am posting a modified version of the code with what I believe is the correct way to handle H5 file writes over time.
It is important to close all attributes, groups, and data sets after each use, and also to close the file. Closing the file alone still leaks memory.

In my previous post, I did not close the attribute and did not close and reopen the group, which was causing the issue.

In my new modified program there are no memory leaks when the closeReopen flag is set to true (which closes and reopens the file for each data set write). If the closeReopen flag is set to false, the HDF5 file is opened only once, and this creates a memory leak.

Hope this code will help others

#include "H5Cpp.h"
#include <iostream>
#include <vector>
#include <string>
#include <chrono>
#include <thread>


int main(int argc, char** argv)
{

    /*
    *  To avoid memory leaks:
    *  close all open groups / attributes / data sets after each use
    *  close the file after each data set write and reopen it again
    */

    std::string filename = "test.h5";
    size_t numberOfWrites = 100; // number of writes to perform
    size_t waitInMilliSec = 500; // wait time between writes in milliseconds
    bool closeReopen = true; // whether to close and reopen the file between writes (true) or keep it open (false)
    size_t dataSize = 256 * 256 * 4; // number of floats per data set (each element is 10.0f)

    H5::H5File* localFileHandel = nullptr;

    /*
    *  OPEN FILE
    */
    try
    {
        localFileHandel = new H5::H5File(filename, H5F_ACC_TRUNC);
    }
    catch (H5::FileIException& error)
    {
        std::cout << "ERROR: po::io::HDF5::createNew() Unable to create a new HDF5 file with the name = " << filename << std::endl;
        std::cout << error.getDetailMsg() << std::endl;

        return -1;
    }

    std::cout << "Created HDF file:" << filename << std::endl;
    
    /*
    *  CREATE GROUP
    */
 
    // First create a group and call it Group Test  
    H5::Group group = localFileHandel->createGroup("GroupTest");
    

    /*
    *   Check if the group creation was OK
    */
    if (false == group.isValid(group.getId()))
    {
        std::cout << "Error: unable to create group " << std::endl;
        return -1;
    }
    else
    {
        std::cout << "Group created OK" << std::endl;
    }

    /*
    *  CREATE ATTRIBUTE
    */
    // Create an integer scalar attribute and call it AttributeTest
    H5::DataSpace dataSpace(H5S_SCALAR);
    H5::Attribute attribute = group.createAttribute("AttributeTest", H5::PredType::NATIVE_INT, dataSpace);

    /*
    *  NOTE: it is important that you close groups / attributes / data sets after they have been used.
    *  If not, this creates a memory leak even if you close and reopen the file.
    */
    attribute.close();
    group.close();

    /*
    * Get the library version and print it 
    */
    unsigned int h5MajorVersion;
    unsigned int h5MinorVersion;
    unsigned int h5ReleaseNumber;
    H5get_libversion(&h5MajorVersion, &h5MinorVersion, &h5ReleaseNumber);
   
    /* Create the data that we will be writing multiple times.
    *  The data size is dataSize, the data is initialized to 10.0f
    *  (single-precision real data).
    */
    std::vector<float> data(dataSize, 10.0f);
    // Payload size in MB; size() is the number of elements actually written
    float vectorSizeInMem = (float)(data.size() * sizeof(float)) / 1024 / 1024;

    std::cout << "Testing HDF5 multiple writes." << std::endl;
    std::cout << "Writing = " << numberOfWrites << " Times" << std::endl;
    std::cout << "HDF LIBRARY VERSION FROM H5get_libversion() = " << h5MajorVersion << "." << h5MinorVersion << "." << h5ReleaseNumber << std::endl;
    std::cout << "Sleep between writes =" << waitInMilliSec << " milli-seconds" << std::endl;
    std::cout << "Size of data to be written per dataset =" << vectorSizeInMem  << " MB" << std::endl;
    
    if (true == closeReopen)
    {
        std::cout << "File will be closed and reopened after each data set" << std::endl;
    }
    else
    {
        std::cout << "File will not be closed between writes" << std::endl;
    }
    std::cout << std::endl << std::endl;

    /*
    *  Loop and write the data 
    */
    for (size_t ii = 0; ii < numberOfWrites; ++ii)
    {  
        // Open the group.
        // Again, it is important to open / close the group around each use to avoid memory leaks.
        group = localFileHandel->openGroup("GroupTest");
        // The data set name will be the loop index number 
        std::string dataName = std::to_string(ii);
        std::vector<hsize_t> dims = { (hsize_t)data.size() };

        /*
        *  Create the dataspace and the data set with GroupTest as its parent
        */
        H5::DataSpace dataSpace((hsize_t)dims.size(), &dims[0], &dims[0]);
        H5::DataSet dataSet = group.createDataSet(dataName, H5::FloatType(H5::PredType::NATIVE_FLOAT), dataSpace);

        /*
        *  Write the data set 
        */
        dataSet.write(&data[0],H5::FloatType(H5::PredType::NATIVE_FLOAT));
        /*
        *  Close the data set
        */
        dataSet.close();

        // Group close
        group.close();

        /*
        *  Adding sleep to wait between writes 
        */
        std::cout << "Wrote data set" << ii << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(waitInMilliSec));
        

        if (true == closeReopen)
        {
            // close the file 
            //std::cout << "closing file with close() function " << std::endl;
            localFileHandel->close();

            // re-open the file with read write access to append data to it 
            try
            {
                std::cout << "open file with H5F_ACC_RDWR attribute " << std::endl;
                localFileHandel->openFile(filename, H5F_ACC_RDWR);
            }
            catch (H5::FileIException& error)
            {
                std::cout << "ERROR: po::io::HDF5::createNew() Unable to reopen the HDF5 file with the name = " << filename << std::endl;
                std::cout << error.getDetailMsg() << std::endl;

                return -1;
            }
        }
    }

    /*
    *  Print the run information one more time, since the per-write console output
    *  makes it difficult to scroll back to the top to see it
    */
    std::cout << "Wrote = " << numberOfWrites << " Times" << std::endl;
    std::cout << "HDF LIBRARY VERSION FROM H5get_libversion() = " << h5MajorVersion << "." << h5MinorVersion << "." << h5ReleaseNumber << std::endl;
    std::cout << "Sleep between writes =" << waitInMilliSec << " milli-seconds" << std::endl;
    std::cout << "Size of data to be written per dataset =" << vectorSizeInMem << " MB" << std::endl;


    if (true == closeReopen)
    {
        std::cout << "File will be closed and reopened after each data set" << std::endl;
    }
    else
    {
        std::cout << "File will not be closed between writes" << std::endl;
    }

    delete localFileHandel; // destroys the H5File wrapper, which closes the file handle
    return 0;
}
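
The finding above, that closing the file alone is not enough while group, attribute, or data set handles are still open, can be modelled with a toy example that does not use HDF5 at all. As far as I understand the library's default close behavior, a file's resources stay alive until every object opened from it has been closed, and H5Fget_obj_count can report how many handles remain open on a real file. The sketch below only imitates that bookkeeping; ToyFile and ToyHandle are hypothetical names, not library classes.

```cpp
#include <cstddef>

// Toy model, NOT the HDF5 API: a file whose resources are released only
// after close was requested AND every child handle has been closed.
class ToyFile {
public:
    void requestClose() { closeRequested_ = true; }
    bool released() const { return closeRequested_ && openChildren_ == 0; }
    std::size_t openChildren() const { return openChildren_; }
private:
    friend class ToyHandle;
    bool closeRequested_ = false;
    std::size_t openChildren_ = 0;
};

// Toy child handle (group, attribute, or data set) that registers itself
// on construction and unregisters in its destructor (RAII).
class ToyHandle {
public:
    explicit ToyHandle(ToyFile& f) : file_(f) { ++file_.openChildren_; }
    ~ToyHandle() { --file_.openChildren_; }
    ToyHandle(const ToyHandle&) = delete;
    ToyHandle& operator=(const ToyHandle&) = delete;
private:
    ToyFile& file_;
};

// Case 1: a group handle outlives the close request.
bool released_with_lingering_group() {
    ToyFile file;
    ToyHandle group(file);
    {
        ToyHandle dataSet(file);   // closed at the end of this scope
    }
    file.requestClose();
    return file.released();        // false: the group is still open
}

// Case 2: every handle lives in a scope that ends before the close request.
bool released_with_scoped_handles() {
    ToyFile file;
    {
        ToyHandle group(file);
        ToyHandle dataSet(file);
    }                              // both handles closed here
    file.requestClose();
    return file.released();        // true: nothing is left open
}
```

In the toy model, released_with_lingering_group() returns false and released_with_scoped_handles() returns true, which mirrors why the final program closes the attribute, the group, and each data set before closing the file.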