Large number of incrementally growing extendible datasets

Bjorn_Andres · June 18, 2009, 12:40pm

Hello!

In a computational geometry application, I am dealing with sets of points in 3D space. Each point set can be represented as a matrix (e.g. a 2D dataset in HDF5) having as many rows as there are points, and three columns for the three coordinates of each point.

The exact number of sets (about 10^6) is known at the initialization of an algorithm while the number of points in each set is unknown. Incrementally, new points are computed and have to be appended to the datasets that have already been constructed.

I have written code (C++, HDF5 1.8.3) which creates 10^6 datasets in one HDF5 file. These datasets are made extendible in the first dimension such that new rows of coordinates can be appended.

There are on average 10 appends per dataset and each append consists on average of 250 points (3 kBytes). After having written about 7 GB to the hard drive, the performance goes down to almost zero. Note that at most one dataset is open at any time.
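
As a rough check on the numbers above, the expected data volume can be worked out directly. The sketch below assumes about 10^6 datasets, 1 kByte = 1000 bytes, and three 4-byte coordinates per point (which matches 250 points per 3 kBytes); the constant and function names are illustrative only.

```cpp
#include <cstdint>

// Figures taken from the description above.
constexpr std::uint64_t kDatasets          = 1000000;  // about 10^6 point sets
constexpr std::uint64_t kAppendsPerDataset = 10;       // average
constexpr std::uint64_t kPointsPerAppend   = 250;      // average

// Assumption: each point is three 4-byte (32-bit float) coordinates.
constexpr std::uint64_t kBytesPerPoint = 3 * 4;

// Bytes written by one average append.
constexpr std::uint64_t bytes_per_append() {
    return kPointsPerAppend * kBytesPerPoint;
}

// Bytes of raw point data written over the whole run.
constexpr std::uint64_t total_bytes() {
    return kDatasets * kAppendsPerDataset * bytes_per_append();
}

// Number of complete average-sized appends contained in `bytes_written`.
constexpr std::uint64_t appends_until(std::uint64_t bytes_written) {
    return bytes_written / bytes_per_append();
}
```

Under these assumptions the full run would write about 30 GB of raw data, so the slowdown at 7 GB sets in after roughly 2.3 million appends, less than a quarter of the way through.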

I am now wondering whether the introduction of 10^6 extendible datasets is a bad idea overall.

- Does HDF5 in its internal organization move around data such that extendible datasets (after appending new data) are contiguous?

- Does HDF5 require that the file is contiguous on the hard drive? Can the file system cause the problem?

- Can caching be a problem? Note that at most one dataset is open at any time. I close it right after having appended data.

I would appreciate any hints!

Kind regards,
Bjoern

----------------------------------------------------------------------
This mailing list is for HDF software users discussion.
To subscribe to this list, send a message to hdf-forum-subscribe@hdfgroup.org.
To unsubscribe, send a message to hdf-forum-unsubscribe@hdfgroup.org.

Quincey_Koziol · June 18, 2009, 1:02pm

Hi Bjoern,

> - Does HDF5 in its internal organization move around data such that extendible datasets (after appending new data) are contiguous?

No, the data isn't moved around in the file. What chunk size have you chosen for your datasets?
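
The chunk size matters here because a chunked (extendible) dataset is stored in fixed-size chunks, so a partly filled chunk still takes up a whole chunk in the file when no compression filter is applied. A minimal sketch of that trade-off, assuming 12-byte rows (three 4-byte coordinates) and about 2500 rows per dataset (10 appends of 250 points); the helper names are hypothetical and this is plain arithmetic, not HDF5 API code:

```cpp
#include <cstdint>

// Number of chunks needed to hold `rows` rows when one chunk holds
// `chunk_rows` rows (rounded up, since a partial chunk is a whole chunk).
constexpr std::uint64_t chunks_needed(std::uint64_t rows,
                                      std::uint64_t chunk_rows) {
    return (rows + chunk_rows - 1) / chunk_rows;
}

// Bytes occupied in the file by those chunks, assuming no filters,
// so every chunk has the full size chunk_rows * row_bytes.
constexpr std::uint64_t allocated_bytes(std::uint64_t rows,
                                        std::uint64_t chunk_rows,
                                        std::uint64_t row_bytes) {
    return chunks_needed(rows, chunk_rows) * chunk_rows * row_bytes;
}
```

With these assumed figures, 16-row chunks give 157 chunks per dataset (about 1.57 x 10^8 chunks across 10^6 datasets, each of which the library has to keep track of), while 4096-row chunks give one chunk per dataset but allocate 49152 bytes for 30000 bytes of data.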

> - Does HDF5 require that the file is contiguous on the hard drive?

No.

> Can the file system cause the problem?

It's always possible, but seems unlikely here.

> - Can caching be a problem? Note that at most one dataset is open at any time. I close it right after having appended data.

The HDF5 library does a lot of caching internally, but it's possible that you have created a situation which is unusual enough that the default caching algorithms aren't working correctly. What does the group structure of your file look like?

Quincey

jazzcat81 · June 18, 2009, 9:43pm

Hi Bjoern, Quincey,

The issue you describe sounds very similar to a problem in HDF5 1.8.2, where writing many small (albeit non-extendible) datasets to a single group led to the program freezing after a certain number of datasets (around 700,000 in my case). At the time, I was able to work around this by introducing subgroups in order to limit the number of datasets per group. Although the issue I am referring to was resolved in 1.8.3, could this also be a valid workaround in your case?
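
The subgroup workaround can be sketched as a mapping from a dataset index to a path, so that no single group holds more than a fixed number of datasets. The function name, the path layout and the limit of 1000 datasets per group are assumptions made for illustration, not something taken from the HDF5 library or from the code discussed here:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the path of dataset number `index` when
// datasets are spread over subgroups of at most `per_group` datasets each.
// Example layout: /group_0/set_0 ... /group_0/set_999, /group_1/set_1000 ...
std::string dataset_path(std::size_t index, std::size_t per_group) {
    assert(per_group > 0);  // a group must be able to hold at least one dataset
    const std::size_t group = index / per_group;  // which subgroup this index falls in
    return "/group_" + std::to_string(group) + "/set_" + std::to_string(index);
}
```

With 10^6 datasets and 1000 per group this gives 1000 groups of 1000 datasets each; the intermediate groups would have to be created before the datasets are.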

Regards,

Patrick
