Working with lots of HDF5 files

We're looking into replacing our custom storage design for time-series data with HDF5, and we're looking mainly at HDF5 version 1.10 for its SWMR capability, since we already do single-writer/multiple-reader access with our custom storage.
To find the best layout, we drafted a few test cases and started from a tutorial code sample in C++, adjusting it to replicate our current database structure of one file per signal. So we are creating new empty files in a loop (sketched below, after the list), and there we already ran into problems:

- the HDF5 garbage collector appears to allocate a lot of memory as soon as files are created; we tried to tune it with setGcReferences(), but without success;

- once memory use reaches about 2 GB, the HDF5 create function throws the exception "no space available for allocation" (we're running 64-bit Windows 8 with 16 GB of RAM).
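
For reference, here is a minimal sketch of the kind of creation loop we are running, written against the HDF5 C API for brevity (our actual test code uses the C++ wrapper, and the signal count and file names here are just placeholders); the files are kept open, as they would be for writing:

#include "hdf5.h"
#include <stdio.h>
#include <stdlib.h>

#define NUM_SIGNALS 5000            /* placeholder: a few thousand signals */

int main(void)
{
    hid_t *files = malloc(NUM_SIGNALS * sizeof(hid_t));
    char   name[64];
    int    created = 0;

    if (files == NULL)
        return 1;

    /* One file per signal, mirroring our current database structure. */
    for (int i = 0; i < NUM_SIGNALS; i++) {
        snprintf(name, sizeof(name), "signal_%05d.h5", i);
        files[i] = H5Fcreate(name, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        if (files[i] < 0) {
            fprintf(stderr, "H5Fcreate failed for file %d\n", i);
            break;
        }
        created++;
    }

    /* Memory use grows steadily while the files above are being created. */
    for (int i = 0; i < created; i++)
        H5Fclose(files[i]);
    free(files);
    return 0;
}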
I have a few questions at this point:

- Can we reduce the amount of memory used by the garbage collector? If yes - how?

- Taking a step back: is the HDF5 API designed to handle thousands of files in practice?

- Or would it be better to have a single file with the same number of datasets in it? (We're talking about a few thousand datasets, each with several million rows.)

Thanks for your kind support

···


Some Windows comments:

* SWMR is not well-tested on Windows since the current SWMR test harness is based on shell scripts and makes use of fork(). There's no reason why it shouldn't work on Windows, though. It's just not tested there at this time. We'll be trying to add at least some minimal testing in the near future, but it might be a little bit before we have a full test suite.

* NTFS and parallel file systems like GPFS should support SWMR. SMB-style network access (e.g.: Windows file shares) will NOT support SWMR, however, since we can't guarantee write ordering. This is not unlike NFS, which is also not supported for SWMR access.

Also, if you are using HDF5 1.10.0, be sure to use H5Pset_libver_bounds() to use the latest file format. The newer data structures are much more efficient than the backward-compatible defaults. You'll lose HDF5 1.8 compatibility, though, so keep that in mind.

https://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetLibverBounds
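
For example, a minimal sketch of setting the bounds on the file access property list used for creation (hypothetical file name, error checking omitted):

#include "hdf5.h"

int main(void)
{
    /* Request the latest (1.10) file format for everything created below. */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

    /* Files created with this fapl use the newer, more efficient data
       structures but can no longer be opened by HDF5 1.8.x. */
    hid_t file = H5Fcreate("signals.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    H5Fclose(file);
    H5Pclose(fapl);
    return 0;
}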

Dana Robinson
Software Engineer
The HDF Group

···


Hi,

See some comments embedded below. . .

···

From: Hdf-forum <hdf-forum-bounces@lists.hdfgroup.org> on behalf of SOLTYS Radoslaw <radoslaw.soltys@power.alstom.com>
Reply-To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Date: Tuesday, January 26, 2016 2:55 AM
To: "hdf-forum@lists.hdfgroup.org" <hdf-forum@lists.hdfgroup.org>
Subject: [Hdf-forum] Working with lots of HDF5 files

We’re looking into replacing our custom storage design for time-series data with HDF5, and we’re looking mainly at HDF5 version 1.10 for its SWMR capability, since we already do single-writer/multiple-reader access with our custom storage.
To find the best layout, we drafted a few test cases and started from a tutorial code sample in C++, adjusting it to replicate our current database structure of one file per signal –

Hmm. A "file per signal" could be a poor choice. It depends on how "big" a signal is and whether your workflows can easily be re-tooled for a "many signals in one file" paradigm. But I would think you'd want to write the many time series to the same HDF5 file, each as its own 'dataset', perhaps in its own 'group' within the file. You can create meaningful group/folder hierarchies *within* an HDF5 file (kind of like directories in Linux or folders in Windows/OS X), which makes it very convenient to organize data.
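
As a rough illustration (C API, with hypothetical group/dataset names and a made-up chunk size), one file holding many signals might be laid out like this, with one chunked, extendible dataset per time series:

#include "hdf5.h"

int main(void)
{
    hid_t file = H5Fcreate("signals.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    /* A group per subsystem, much like a directory. */
    hid_t grp = H5Gcreate2(file, "/unit_1", H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* An empty, unlimited, chunked 1-D dataset for one time series;
       it can be extended as new samples arrive. */
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    hsize_t chunk[1]   = {8192};

    hid_t space = H5Screate_simple(1, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);

    hid_t dset = H5Dcreate2(grp, "pressure", H5T_NATIVE_DOUBLE, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Gclose(grp);
    H5Fclose(file);
    return 0;
}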

so we are creating new empty files in a loop – and there we already ran into problems:

- the HDF5 garbage collector appears to allocate a lot of memory as soon as files are created; we tried to tune it with setGcReferences(), but without success;

Hmmm. I'm not sure the 'garbage collector' routines actually allocate anything; I think their purpose is to free up unused memory. Maybe you want to set free-list limits instead. I use the C interface and so am familiar with these only via that interface: https://www.hdfgroup.org/HDF5/doc/RM/RM_H5.html
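
In the C interface that would look something like the sketch below; the limit values are purely illustrative:

#include "hdf5.h"

int main(void)
{
    /* Cap each free-list category at roughly 1 MB overall and 64 KB per
       list (a value of -1 means "no limit"); these numbers are examples only. */
    H5set_free_list_limits(1024 * 1024, 64 * 1024,   /* regular free lists */
                           1024 * 1024, 64 * 1024,   /* array free lists   */
                           1024 * 1024, 64 * 1024);  /* block free lists   */

    /* ... create, write, and close files here ... */

    /* Explicitly release whatever is currently held on the free lists. */
    H5garbage_collect();
    return 0;
}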

- once memory use reaches about 2 GB, the HDF5 create function throws the exception “no space available for allocation” (we’re running 64-bit Windows 8 with 16 GB of RAM)

Are you running on a FAT32 filesystem there? Probably not, but it doesn't hurt to ask.

I have a few questions at this point:

- Can we reduce the amount of memory used by the garbage collector? If yes - how?

(see above regarding freelist limits)

- Taking a step back: is the HDF5 API designed to handle thousands of files in practice?

We often use it with this number of files. But, generally, the application has only a handful open at any one time. If you mean having *all* files open simultaneously, I think that could present problems. I've never tested it that way.

- Or would it be better to have a single file with the same number of datasets in it? (We’re talking about a few thousand datasets, each with several million rows.)

Much better!

Hope that helps.

Thanks for your kind support

···

@Mark: The file system is NTFS.

Based on the heap dump, my colleague doing the coding suggested it was the garbage collector that allocated the memory, but I tend to think the allocation actually took place outside of the garbage collector, and that the garbage collector only holds pointers so it can release the memory later:

[Id]                                                   [count]          [size]
• Heap (...)
 • [External Frame] (...)
  • _scrt_common_main_seh                              565,146   1,378,173,390
   • main                                              568,970   1,378,173,390
    • H5::H5File::H5File(const std::basic_string...)   307,613   1,348,254,716
     • H5::H5File::p_get_file(const char *, unsigned int, ...)   307,613   1,348,254,716
      • _H5Fcreate                                     307,611   1,348,254,656
       • _H5F_open                                     293,862   1,347,974,068
        > _H5G_mkroot                                  146,273      12,165,176
        • _H5F_new                                     112,499   1,328,411,148
         > _H5P_copy_plist                              57,495       1,091,116
         • _H5AC_create                                 15,001   1,318,220,008
          • _H5C_create                                 15,001   1,318,220,008
           > _H5SL_create                               10,000         300,000
           • _H5FL_reg_calloc                            5,001   1,317,920,008
            • _H5FL_reg_malloc                           5,001   1,317,920,008
             • _H5FL_garbage_coll                        5,000   1,317,920,000
               [External Frame]                          5,000   1,317,920,000
             > _H5FL_reg_free                                1               8

Dana, Mark -
Thank you for your valuable advice. I will test your ideas soon and post the results on the forum.

Radoslaw
