Virtual datasets and open file limit

Linux limits the number of files a single process can have open at once. The current (soft) limit can be checked with the command ulimit -n. 1024 is a common default, and the hard limit (ulimit -Hn), the maximum that a non-admin user can raise it to, is 4096 on our institutional cluster.
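The same limits can be inspected from inside a Python process with the standard library's resource module. This is a minimal sketch; the helper names nofile_limits and raise_soft_limit_to_hard are made up for illustration, and the values printed depend on the system.

```python
import resource

def nofile_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def raise_soft_limit_to_hard():
    """Raise the soft limit up to the hard limit (no admin rights needed).

    Assumes the hard limit is a finite number, as is usual on Linux.
    """
    soft, hard = nofile_limits()
    if soft != hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return nofile_limits()

soft, hard = nofile_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
```

Raising the soft limit only postpones the problem: a virtual dataset mapping more source files than the hard limit allows will still run out of descriptors.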

When opening files to read data from a virtual dataset, HDF5 does not appear to close any of them to stay under this limit. It keeps opening files and reading data until it hits the limit, and then returns the fill value as if no further data existed. As in another thread I started, this is particularly problematic because the failure is silently treated as missing data; it doesn’t show up as an error or warning at all. And while HDF5 is keeping all of its files open, anything else that tries to open a file in that process will fail.
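Since nothing is reported when the limit is hit, one way to notice the condition is to count the process's open descriptors yourself. The sketch below is Linux-only (it lists /proc/self/fd), and the helper names open_fd_count and fd_headroom are hypothetical. It assumes the soft limit is a finite number.

```python
import os
import resource

def open_fd_count():
    """Count this process's open file descriptors (Linux only, via /proc).

    The count includes the descriptor briefly used to list the directory,
    so treat it as an estimate that is consistent between calls.
    """
    return len(os.listdir('/proc/self/fd'))

def fd_headroom():
    """Estimate how many more descriptors can be opened before the soft limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft - open_fd_count()

print(f"open descriptors: {open_fd_count()}, headroom: {fd_headroom()}")
```

Comparing fd_headroom() with the number of source files a selection touches, before reading, gives a rough warning that the result may contain fill values instead of data.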

Here’s an example. I’m using Python for convenience, but h5py doesn’t do anything special when you read from a virtual dataset:

In [1]: import h5py

In [2]: import numpy as np

In [3]: vlayout = h5py.VirtualLayout((2048, 10), dtype='i8')

In [4]: for i in range(2048):  # Create 2048 individual files
   ...:     with h5py.File(f'{i}.h5', 'w') as f:
   ...:         f['a'] = np.arange(10)
   ...:         vlayout[i] = h5py.VirtualSource(f['a'])  # Map into a VDS

In [5]: vf = h5py.File('vds.h5', 'w')

In [6]: vds = vf.create_virtual_dataset('a', vlayout)

In [7]: vds[:10, 5]  # Read from 10 files - fine
Out[7]: array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])

In [8]: vds[:, 5]  # Read from all 2048 files - empty data at the end
Out[8]: array([5, 5, 5, ..., 0, 0, 0])

In [9]: vds[1100:1110, 5]  # Still nothing here, and now IPython has a problem
The history saving thread hit an unexpected error
(OperationalError('unable to open database file')).
History will not be written to the database.
Out[9]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

This was using HDF5 1.10.6.


Hi Thomas,

I entered bug HDFFV-11082 for this issue.


Is there any progress on that issue? I’m facing the same problem.

Hello! Is there a method to close the open file descriptors while the process loads all data? I’m encountering the same issue myself.

From the issue Barbara mentioned:

The VDS code currently holds files open until the dataset is closed. Source files are opened in the following cases:
Static mapping: When selected for I/O
Unlimited, non “printf” mapping: When selected for I/O, and when the VDS extent is checked
“printf” mapping: If the source dataset’s mapping is in the bounding box of the VDS selection for I/O

We should implement an option to limit the number of open files, probably using a simple LRU cache list. We should investigate if we can integrate this into the existing external file cache code.
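To illustrate the LRU idea at the application level, here is a minimal sketch in Python. It is not how HDF5 does or would implement it; the class name LRUFileCache and its parameters are invented for this example. The cache keeps at most max_open handles and closes the least recently used one before opening another.

```python
from collections import OrderedDict

class LRUFileCache:
    """Keep at most `max_open` file handles open, closing the least recently used."""

    def __init__(self, max_open, opener=open):
        if max_open < 1:
            raise ValueError("max_open must be at least 1")
        self.max_open = max_open
        self.opener = opener          # any callable returning an object with .close()
        self._handles = OrderedDict() # path -> handle, oldest first

    def get(self, path):
        """Return an open handle for `path`, evicting old handles if needed."""
        if path in self._handles:
            self._handles.move_to_end(path)  # mark as most recently used
            return self._handles[path]
        while len(self._handles) >= self.max_open:
            _old_path, old_handle = self._handles.popitem(last=False)
            old_handle.close()
        handle = self.opener(path)
        self._handles[path] = handle
        return handle

    def close(self):
        """Close every handle still held by the cache."""
        while self._handles:
            _path, handle = self._handles.popitem()
            handle.close()

    def __len__(self):
        return len(self._handles)
```

Until something like this exists in the library, a workaround along these lines is to read the source files directly rather than through the virtual dataset, for example with opener=lambda p: h5py.File(p, 'r'), so that the number of simultaneously open files stays bounded.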

This is certainly something we’d like to implement if we can find funding for it.