Virtual datasets and open file limit



Linux limits the number of files a process can have open at once; the current (soft) limit can be checked with the command ulimit -n. A common default is 1024, and the hard limit (ulimit -Hn), which is the maximum a non-admin user can raise it to, is 4096 on our institutional cluster.
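The same limits can be read, and the soft limit raised, from inside a Python process using the standard library's resource module. This is only a minimal sketch for Unix-like systems; the helper name raise_soft_limit_to_hard is made up for illustration, and raising the limit only postpones the problem described below until the hard limit is reached.

```python
import resource

# The soft limit is what `ulimit -n` reports; the hard limit is `ulimit -Hn`.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft limit: {soft}, hard limit: {hard}")


def raise_soft_limit_to_hard():
    """Raise this process's soft open-file limit up to the hard limit.

    Hypothetical helper, not part of h5py or HDF5. Returns the soft
    limit in effect afterwards.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Leave things alone if the hard limit is unlimited or already reached.
    if hard != resource.RLIM_INFINITY and soft < hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

Calling raise_soft_limit_to_hard() before opening the virtual dataset should have the same effect as running ulimit -n with the hard limit's value in the shell before starting Python.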

When reading data from a virtual dataset, HDF5 does not appear to close any of the source files it has opened in order to stay under this limit. It keeps opening files and reading data until it hits the limit, and then returns the fill value as if no further data existed. As in another thread I started, this is particularly problematic because the failure is silently treated as missing data; it doesn’t show up as an error or warning at all. And while HDF5 is keeping all of its files open, anything else that tries to open a file in that process will fail.

Here’s an example. I’m using Python for convenience, but h5py doesn’t do anything special when you read from a virtual dataset:

In [1]: import h5py

In [2]: import numpy as np

In [3]: vlayout = h5py.VirtualLayout((2048, 10), dtype='i8')

In [4]: for i in range(2048):  # Create 2048 individual files
   ...:     with h5py.File(f'{i}.h5', 'w') as f:
   ...:         f['a'] = np.arange(10)
   ...:         vlayout[i] = h5py.VirtualSource(f['a'])  # Map into a VDS
   ...:

In [5]: vf = h5py.File('vds.h5', 'w')

In [6]: vds = vf.create_virtual_dataset('a', vlayout)

In [7]: vds[:10, 5]  # Read from 10 files - fine
Out[7]: array([5, 5, 5, 5, 5, 5, 5, 5, 5, 5])

In [8]: vds[:, 5]  # Read from all 2048 files - empty data at the end
Out[8]: array([5, 5, 5, ..., 0, 0, 0])

In [9]: vds[1100:1110, 5]  # Still nothing here, and now IPython has a problem
The history saving thread hit an unexpected error
(OperationalError('unable to open database file')).
History will not be written to the database.
Out[9]: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

This was using HDF5 1.10.6.
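To check whether zeros like those above come from running out of file descriptors rather than from genuinely missing data, one option is to compare the number of descriptors the process has open with the soft limit. The sketch below is Linux-only, since it relies on /proc/self/fd, and the helper names open_fd_count and fd_headroom are made up for illustration:

```python
import os
import resource


def open_fd_count():
    """Number of file descriptors currently open in this process (Linux only)."""
    # Listing the directory briefly opens one extra descriptor, which shows
    # up in the listing itself, so subtract it.
    return len(os.listdir("/proc/self/fd")) - 1


def fd_headroom():
    """How many more files can be opened before hitting the soft limit.

    Returns None if the soft limit is unlimited.
    """
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft - open_fd_count()
```

Printing fd_headroom() before and after a read such as vds[:, 5] should show whether the process has used up its descriptors. If the headroom is at or near zero after the read, the trailing fill values are suspect.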