speeding up write of chunked HDF

Hi all,

I need to fill a huge 3D array that is chunked along its second dimension. My data
arrive as slices with a fixed index in the third dimension, so the layout has to be
re-ordered on write. The chunks are uncompressed. When the data is read, the access
pattern sweeps along the second dimension, so the chunking layout makes sense. The
file is stored on an SSD, so random access should be relatively fast. I cannot
change the index order of the incoming data.

If the data were stored in a raw file, it could in theory be written continuously
while the array fills up. With HDF5, however, this becomes painfully slow. The only
way I have found to speed it up somewhat is to read as many slices as I can into
memory and then write them out in batches, but I still see write transfers of under
2 MB/s on average.
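(For reference, each such batched write amounts in HDF5 C API terms to roughly the
fragment below; my actual code is in MATLAB, and the names, dimensions and element
type here are only placeholders.)

/* Sketch of one batched write: nslices consecutive slices along the third
 * dimension, already assembled in memory, go out in a single H5Dwrite call. */
#include "hdf5.h"

herr_t write_slice_batch(hid_t dset, const float *buf,
                         hsize_t dim1, hsize_t dim2,
                         hsize_t first_slice, hsize_t nslices)
{
    hsize_t start[3] = {0, 0, first_slice};     /* where the batch begins  */
    hsize_t count[3] = {dim1, dim2, nslices};   /* full extent in dims 1-2 */

    hid_t filespace = H5Dget_space(dset);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);

    hid_t memspace = H5Screate_simple(3, count, NULL);
    herr_t status  = H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace,
                              H5P_DEFAULT, buf);

    H5Sclose(memspace);
    H5Sclose(filespace);
    return status;
}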

The file grows gradually as the slices are added. If this expansion requires
re-ordering the entire dataset, that could explain the slow write speed. I was
wondering whether pre-allocating the entire file somehow could help, and what the
best way to do that would be; I could not find any related API function. I know the
total data size before the collection starts.

The only idea I have so far is to fill the array with some dummy value (not the
fill value) by sweeping along the chunked dimension before adding the slices. This
would probably grow the file to its final size quickly, but I am not sure it would
help at all, and it is definitely ugly.

I am using MATLAB 2007a with the HDF5 1.6.5 library it ships with.

Thank you in advance for your comments.

Regards,

Balint

set_alloc_time is exactly what I am looking for. Thank you all!

Unfortunately I do not have the luxury of a newer MATLAB, so I am probably stuck with the HDF5 1.6.5 version; I doubt it is possible to upgrade the HDF library independently. I am fairly sure this combination has some nasty bugs: right now I am struggling with incorrect all-zero reads on some machines after changing the initial fill pattern. The failure happens without any error message, and because MATLAB swallows the API return codes, I can only guess what is going on. Maybe the 0.5 GB buffer cache I am using is too large for some of the hardware.

BTW, is parallel access safe when the file is opened read-only? I have already learnt the hard way that opening read-write is not safe even when only reading (the file was once truncated to its header). I read somewhere that this is a confirmed bug in this HDF5 version, but I would expect parallel reads to be safe with read-only opening, even if the library was compiled without thread support.

Once more, thank you for your answers.

Regards,
Balint

Balint,
I am not sure whether pre-allocation will help performance, but there is a good
chance it will, since the default for chunked datasets is to allocate space
incrementally (chunk by chunk) as data is written to the dataset, which hurts
especially when the chunks are small and there are a lot of them. If MATLAB has
access to the low-level HDF5 APIs (which I believe it does), you can use
H5Pset_alloc_time with H5D_ALLOC_TIME_EARLY as the alloc_time argument on a dataset
creation property list. There should be no need to mess with the fill value or do
any filling as far as I can tell.

You will need to create a property list first, then set this property, then pass
the list to H5Dcreate. Also, I think MATLAB splits the HDF5 API into classes, so the
function might look like H5P.set_alloc_time or something like that. It might also be
worthwhile to check that your MATLAB is a recent version, so that it is
compiled/linked against a recent HDF5 build.

Documentation for H5Pset_alloc_time may be found here:
http://www.hdfgroup.org/HDF5/doc/RM/RM_H5P.html#Property-SetAllocTime
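
In plain C the whole setup is only a handful of calls. A minimal sketch, using the
1.6-style H5Dcreate since that is the version in question (file name, dataset name,
sizes and element type are made up for illustration):

/* Create a chunked dataset whose chunks are all allocated up front, so later
 * writes only fill existing space instead of growing the file. */
#include "hdf5.h"

int main(void)
{
    hsize_t dims[3]  = {64, 1024, 256};   /* full dataset size (example)   */
    hsize_t chunk[3] = {64, 64, 256};     /* chunked along dim 2 (example) */

    hid_t file  = H5Fcreate("preallocated.h5", H5F_ACC_TRUNC,
                            H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, NULL);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);  /* allocate at create time */

    hid_t dset = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, space, dcpl);

    /* ... H5Sselect_hyperslab / H5Dwrite the slices here ... */

    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}

If MATLAB exposes the low-level interface, its calls should map onto the same
sequence.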

Good luck,
Izaak Beekman


===================================
(301)244-9367
Princeton University Doctoral Candidate
Mechanical and Aerospace Engineering
ibeekman@princeton.edu

UMD-CP Visiting Graduate Student
Aerospace Engineering
ibeekman@umiacs.umd.edu
ibeekman@umd.edu


Hi Balint,
  Zaak's suggestion is the correct way to allocate all the chunks for the dataset at creation time. However, I'm more concerned about the version of the HDF5 library you are using - we are currently at release 1.8.8 and there have been _many_ performance improvements in the chunked dataset I/O code since 1.6.5 (BTW, the final release of the 1.6.x branch was 1.6.10). I would suggest writing a short C program to benchmark your access pattern and then test it against both the latest 1.6.x release and the 1.8.x release. If there is still a performance problem, we can look into it.
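
A minimal version of such a benchmark might look like the program below. All sizes
are placeholders, one slice is written per H5Dwrite to mirror the pattern described,
and the H5D_ALLOC_TIME_EARLY line can be toggled to compare the two allocation
strategies. It uses the 1.6-style H5Dcreate, so add -DH5_USE_16_API when building
against 1.8.x.

/* Rough benchmark of the access pattern in question: a 3D dataset chunked
 * along the second dimension, filled slice by slice along the third. */
#include "hdf5.h"
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define D1 64
#define D2 1024
#define D3 256

static double now(void)                      /* wall-clock seconds (POSIX) */
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

int main(void)
{
    hsize_t dims[3]  = {D1, D2, D3};
    hsize_t chunk[3] = {D1, 64, D3};         /* chunked along dim 2 */
    hsize_t slice[3] = {D1, D2, 1};          /* one incoming slice  */

    float *buf = malloc((size_t)D1 * D2 * sizeof *buf);
    for (size_t i = 0; i < (size_t)D1 * D2; i++)
        buf[i] = (float)i;

    hid_t file  = H5Fcreate("bench.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(3, dims, NULL);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 3, chunk);
    H5Pset_alloc_time(dcpl, H5D_ALLOC_TIME_EARLY);   /* comment out to compare */

    hid_t dset     = H5Dcreate(file, "data", H5T_NATIVE_FLOAT, space, dcpl);
    hid_t memspace = H5Screate_simple(3, slice, NULL);

    double t0 = now();
    for (hsize_t k = 0; k < D3; k++) {       /* one slice per iteration */
        hsize_t start[3] = {0, 0, k};
        hid_t filespace  = H5Dget_space(dset);
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, slice, NULL);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, memspace, filespace, H5P_DEFAULT, buf);
        H5Sclose(filespace);
    }
    double secs = now() - t0;
    double mb   = (double)D1 * D2 * D3 * sizeof(float) / (1024.0 * 1024.0);
    printf("wrote %.0f MB in %.2f s (%.1f MB/s)\n", mb, secs, mb / secs);

    H5Sclose(memspace);
    H5Dclose(dset);
    H5Pclose(dcpl);
    H5Sclose(space);
    H5Fclose(file);
    free(buf);
    return 0;
}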

  Quincey
