poor h5dump performance dumping to binary

I am getting really awful performance using 'h5dump' to dump a scalar field to a binary file. It takes literally hours, whereas 'h5copy -f ref' takes just under 2 minutes, and a plain 'cp' is a bit quicker still.

My file header is reproduced below [1]. I am using

   h5dump -b LE -d /C00 -o outfile.raw infile.h5

to convert. For comparison, 'h5copy' is run thusly:

   h5copy -s C00 -d C00 -i infile.h5 -o ./testing.h5 -v -f ref

While running, h5dump pegs a core at 98+% CPU usage.

I've tried attaching gdb to the running process to get some poor-man's profiling. The most common stack trace is appended below [2]; my guess is that it is constantly locking and unlocking a mutex. Other traces I have seen repeatedly: H5I_object_verify called from H5Tequal; __pthread_setcancelstate from H5TS_cancel_count_inc from H5Tequal; __pthread_mutex_lock from H5TS_mutex_lock from H5open; and, rarely, H5T_cmp from H5Tequal.
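For reference, the sampling amounted to repeating something along these lines (the pgrep lookup is just for illustration; I attached to the PID by hand):

   # attach, print one backtrace, detach; repeat a handful of times
   gdb -batch -ex "bt" -ex "detach" -p "$(pgrep h5dump)"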

If locking is indeed the problem, can I disable it at runtime somehow? These files are only accessed by one process at a time, h5dump isn't even multithreaded, and the access is purely read-only.

I am using HDF5 1.8.4. Please enlighten me as to how I can get reasonable performance out of these files.

Thanks,

-tom

[1]
$ h5dump -p -H TS_2011_12_26/TS_C00_0_16.h5
HDF5 "TS_C00_0_16.h5" {
GROUP "/" {
    DATASET "C00" {
       DATATYPE H5T_STD_U16LE
       DATASPACE SIMPLE { ( 301, 2550, 2550 ) / ( 301, 2550, 2550 ) }
       STORAGE_LAYOUT {
          CONTIGUOUS
          SIZE 3914505000
          OFFSET 1400
       }
       FILTERS {
          NONE
       }
       FILLVALUE {
          FILL_TIME H5D_FILL_TIME_IFSET
          VALUE 0
       }
       ALLOCATION_TIME {
          H5D_ALLOC_TIME_LATE
       }
    }
}
}

[2]
(gdb) bt
#0 __pthread_mutex_lock (mutex=0x7ff2d1e7fac8) at pthread_mutex_lock.c:47
#1 0x00007ff2d1bf67d6 in H5TS_mutex_unlock () from /usr/lib/libhdf5.so.6
#2 0x00007ff2d19105b8 in H5open () from /usr/lib/libhdf5.so.6
#3 0x0000000000420501 in ?? ()
#4 0x000000000041fbf6 in ?? ()
#5 0x0000000000416e7f in ?? ()
#6 0x000000000041c856 in ?? ()
#7 0x000000000041cea9 in ?? ()
#8 0x000000000040abaf in ?? ()
#9 0x000000000040a20d in ?? ()
#10 0x000000000040d3c6 in ?? ()
#11 0x000000000040f387 in ?? ()
#12 0x00007ff2d156530d in __libc_start_main (main=0x40eae4, argc=8,
     ubp_av=0x7fffa529e648, init=<optimized out>, fini=<optimized out>,
     rtld_fini=<optimized out>, stack_end=0x7fffa529e638) at libc-start.c:226
#13 0x0000000000405349 in ?? ()

Hi Tom,
  Looks like you are working with a thread-safe build of HDF5, which is unnecessary for the command-line tools. You could rebuild the HDF5 distribution (I would suggest moving up to 1.8.8 or the 1.8.9 prerelease) without the thread-safe configure flag, and that should get rid of the mutex issues.
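  For example, something along these lines (the version, paths, and prefix below are only placeholders; the key point is simply not passing --enable-threadsafe to configure):

   # prints a line only if the installed build is thread-safe (header path may vary)
   grep '#define H5_HAVE_THREADSAFE' /usr/include/H5pubconf.h
   # rebuild from source without thread safety and use that copy of h5dump
   tar xzf hdf5-1.8.8.tar.gz && cd hdf5-1.8.8
   ./configure --prefix=$HOME/hdf5-serial --enable-production
   make -j4 && make install
   $HOME/hdf5-serial/bin/h5dump -b LE -d /C00 -o outfile.raw infile.h5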

  Quincey

Hi Quincey,

Thanks for your reply.

This helped considerably. I can dump one of my files in 16.5 minutes now, down from the 4+ hours it took before. However, this is still the slowest part of my pipeline. Another order of magnitude improvement would be welcome, of course ;), but I'd be really happy if we could just halve my current runtime for it. Any other ideas?

Secondly, my previous HDF5 was simply the one packaged with Ubuntu, and I imagine pre-installed HDF5 builds are common among users. Could I request that "no thread safety" be made a runtime option, which the command-line tools could set implicitly? As shown, it provides a huge performance benefit, and it would be significantly easier on users, who then wouldn't need to compile their own HDF5.

Thanks,

-tom

Hi Tom,

This helped considerably. I can dump one of my files in 16.5 minutes now, down from the 4+ hours it took before. However, this is still the slowest part of my pipeline. Another order of magnitude improvement would be welcome, of course ;), but I'd be really happy if we could just halve my current runtime for it. Any other ideas?

  Dunno, can you run with gprof?
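  Something like this should work, assuming you can rebuild the library and tools with profiling enabled (the prefix and flags below are just an example):

   # build static, with -pg, so the HDF5 internals show up in the profile
   CFLAGS="-pg -O2" ./configure --prefix=$HOME/hdf5-prof --enable-production --disable-shared
   make -j4 && make install
   # run the slow case; the instrumented h5dump writes gmon.out on exit
   $HOME/hdf5-prof/bin/h5dump -b LE -d /C00 -o outfile.raw infile.h5
   gprof $HOME/hdf5-prof/bin/h5dump gmon.out > h5dump-profile.txt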

Secondly, my previous HDF5 was simply installed as part of Ubuntu. I imagine pre-installed HDF5 versions are common for many users. Could I request that "no thread safety" be made a runtime option, which the command line tools could set implicitly? As shown, it provides a huge performance benefit, and is significantly easier to use, then, because users won't need to compile their own HDF5.

  Hmm, that could be done, yes. I'll file an issue for it.

    Quincey
