**** Slow Chunked I/O and H5S_ALL ****


#1

TL;DR

Making my data set chunked (instead of contiguous) has slowed down writing massively!!

The Solution

When defining the dataspace for the memory buffer being written make sure the rank matches that of the target storage array, even if that means having a depth of 1 in a particular direction.

Furthermore aim for a chunk size of approximately 1MiB to match the default cache size of 1MB. If it is expedient to use larger chunks the cache size should be altered to accommodate that.

Example: chunky.cpp (2.4 KB)

( From the final post courtesy of @gheber. )

Full story:

I have a chunked data set with dims 128 x 128 x 4, elements are uchar.

I would like to write a series of 128x128 images into the data set.

Initially I used a contiguous data set and wrote only 4 images, this yielded a write speed of approx. 200us per image.

I wish to record a variable number of frames, so I wish to make the data set extendable. As part of that I need to make the data set chunked. The chunks are 128x128x1.

This is the only change I make to my code to enable chunking:

dcpl_id = H5Pcreate(H5P_DATASET_CREATE); // dcpl_id is the Dataset Creation Property List which is used later by H5Dcreate2 to create the dataset.
ret = H5Pset_chunk(dcpl_id, RANK, chunk_dims);
assert(ret >= 0);

Here is my problem:

With this chunking the write speed per frame is approximately 2500us per image.

Note: I am not extending the data set, I am still only filling an existing data set.

Here is where it gets weird:

When I use the H5S_ALL flag to define the memory data space for H5Dwrite the speed is returned, even increased slightly to approximately 150us.

Defining the memory data space using H5S_ALL uses the data set data space and hyperslab for the memory data space. This seems to make it sufficiently fast but obviously this is impractical when the planned dataset is larger than memory (TB rather than GB).

Failed remedies

I have tried:

Defining the memory data space hyperslab using count size or using block size (no change detected between the two).

I have tried not defining a hyperslab within the memory data space. The memory data space already defines the entire memory buffer I wish to write:

  • 2500 us becomes 2300 us per image, so only an incremental improvement.
  • This is what I was doing when using the C++ high-level API.

C++ API

This is made more confusing because I have this working using the C++ high-level API and I am able to attain speeds of 150-200us per image.

What is the C++ api doing that I am not?? (probably quite a lot…)

Property lists

I am using the default property lists for everything apart from defining the chunk size.

Is there some property which could potentially be changed to return my desired speeds?

In particular I am thinking of the xfer_plist_id for H5Dwrite

UPDATE 1.0:

I have modified the dataset creation property list to set the storage space allocation time to early.

H5Pset_alloc_time(dcpl_id, H5D_ALLOC_TIME_EARLY);

This reduces the write time, maybe slightly, to approximately 2200us per frame…


#2

How did you obtain those timings? How are you measuring time?
Which “C++ highlevel API”?

How many images are you planning to write? A chunk size of 128x128 (16 KiB) is tiny.
Aim for the vicinity of 1 MiB. If you are writing full images, there’s no need to initialize the chunks:

H5Pset_fill_time(dcpl_id, H5D_FILL_TIME_NEVER)

G.


#3

Hi @gheber

Thank you for your response.

The C++ API I used was the default HDF5 API. My apologies if I misled, it is just slightly less verbose than the C API.

How am I timing?

Using the std chrono high res clock and only on the call to write:

tStart = std::chrono::high_resolution_clock::now();

ret = H5Dwrite(dset_id, H5T_NATIVE_UCHAR, mem_space_id, dset_space_id, H5P_DEFAULT, write_buf);
assert(ret >= 0);

elapsed = std::chrono::high_resolution_clock::now() - tStart;
auto elapsed_count_us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();

I run this a couple of times to get an approximate idea. Since I am only trying to detect the order of magnitude this seems to suffice; I appreciate this is not the most effective instrumentation.
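A slightly more repeatable way to take such a measurement is to run the timed call several times and report the median. The following is only a sketch: `median_microseconds` is a hypothetical helper (not part of HDF5 or the original code), and the callable passed in would wrap the `H5Dwrite` call shown above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `op` `repeats` times and returns the median wall-clock time of a
// single run, in microseconds. steady_clock is used because it is monotonic.
long long median_microseconds(const std::function<void()>& op, int repeats)
{
    assert(repeats > 0);
    std::vector<long long> samples;
    samples.reserve(static_cast<std::size_t>(repeats));
    for (int r = 0; r < repeats; ++r) {
        const auto start = std::chrono::steady_clock::now();
        op();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];  // upper median when the count is even
}
```

Usage would look like `median_microseconds([&] { H5Dwrite(/* arguments as above */); }, 11)`. The median is less sensitive than the mean to a single slow first write.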

Using the call to never fill:

Time is now ~2000us to write.

Most major improvement yet but still an order of magnitude off.

Chunk Size

I appreciate this is on the small side.

So I changed the size of the array to 128 x (128x60) x 4 and the chunk to 128 x (128x60) x 1,
as if I was writing 60 stitched frames into a single slice of my array.
With uchar elements that makes each chunk 960KiB.
This isn't how I would do it long term, but just to check the write speed I did it this way.
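For reference, the chunk-size arithmetic can be sketched in a few lines. The helper name `chunk_bytes` is made up for illustration and is not an HDF5 function; it simply multiplies the chunk dimensions by the element size (1 byte for uchar).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes occupied by one chunk: the product of the chunk dimensions
// multiplied by the size of a single element.
std::size_t chunk_bytes(const std::vector<std::size_t>& chunk_dims, std::size_t elem_size)
{
    std::size_t n = elem_size;
    for (std::size_t d : chunk_dims) {
        n *= d;
    }
    return n;
}
```

With these definitions, `chunk_bytes({128, 128, 1}, 1)` is 16384 bytes (16 KiB) and `chunk_bytes({128, 128 * 60, 1}, 1)` is 983040 bytes (960 KiB), which sits just under 1 MiB.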

However this didn’t improve performance. Write time was 130000us, or approximately 2200us per 128*128 frame.

Summary

I am still at a loss as to why the C++ API, using the original block size, is so much faster. Not filling the array might have saved some time.

I didn’t change any options for C++ API like alloc time or fill time.

And, as mentioned, when I use H5S_ALL I somehow reach the desired speeds…


#4

Can you give us a small working example? Use C++ ( is the gold standard!), but don’t use the HDF5 C++ API. Go straight to C. I’m not saying that that’s a problem, it just obscures the issue and doesn’t really buy you anything. If you can’t, I will write an example, but you’ll have to wait a couple of days.

We need to see your selections and how you iterate over dimensions. Reading between the lines, your layout should be Inf x 128 x 128 with a chunk size of 60 x 128 x 128, i.e., time or image sequence number is the slowest changing dimension. You’ll write/read 60 images in one go. You could make the chunks bigger or introduce a fourth dimension (batch no.), but then you’d have to think about adjusting the chunk cache size, which is 1 MB by default.
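To make "slowest changing dimension" concrete, here is a small sketch of row-major (C order) indexing, which is the element order the HDF5 C API uses. `linear_offset` is a hypothetical helper written for this illustration. It only shows element ordering; it does not by itself predict write times, and with chunks that are one frame deep each chunk holds a single frame in either layout.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major (C order) linear offset of `index` inside an array of shape `dims`.
// The last dimension changes fastest, the first dimension changes slowest.
std::size_t linear_offset(const std::vector<std::size_t>& dims,
                          const std::vector<std::size_t>& index)
{
    assert(dims.size() == index.size());
    std::size_t offset = 0;
    for (std::size_t d = 0; d < dims.size(); ++d) {
        assert(index[d] < dims[d]);
        offset = offset * dims[d] + index[d];
    }
    return offset;
}
```

For a frames-first shape {4, 128, 128}, neighbouring pixels of one frame are 1 element apart and consecutive frames start 16384 elements apart, so each frame is one contiguous run. For a frames-last shape {128, 128, 4}, neighbouring pixels of one frame are 4 elements apart, so the frames are interleaved.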

There is no technical reason why you can’t make your original performance goal.

Which library version are you using on what platform?

G.


#5

Hi @gheber

I am using:

  • HDF5 v1.12
  • C++ (MSVC 16)

Nice to hear I am on the right track by using the C API.

Below is my example.

This is actually an edited version of h5_ex1.c from the parallel examples (obviously this is just a serial example).

On extendable dimensions

I notice you specify the array shape to be inf x 128 x 128.
But I used the reverse: 128 x 128 x inf.

Could that be impacting things?

Contiguous implementation

By simply removing the chunk specification from the dataset creation property list (disabling the H5Pset_chunk call on line 85) and removing the maximum dimensions from the dataset dataspace (the max_dims argument to H5Screate_simple on line 75) we are back to contiguous format.

This yields a 10x speed up…

/* 
HDF5 Dataset example, writing by columns 

Traditional serial writing

*/

/* System header files */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#include "hdf5.h"   /* HDF5 header file */

/* Predefined names and sizes */
#define FILENAME "h5_ex1.h5"
#define DATASETNAME "Dataset 1"
#define RANK    3
#define DIM0    128
#define DIM1    (128*60)  /* 60 frames of 128 columns stitched side by side */
#define DIM2    4         /* Number of slices written */


/* Global variables */
//uint8_t* write_buf;

int main(int argc, char* argv[])
{
   /* HDF5 */
   hid_t file_id;              /* File ID */
   hid_t fapl_id;              /* File access property list */
   hid_t dset_id;              /* Dataset ID */
   hid_t dset_space_id;        /* Dataset dataspace ID */
   hid_t dcpl_id;              /* Dataset creation property list */
   hid_t mem_space_id;         /* Memory dataspace ID */
   hsize_t dset_dims[RANK];    /* Dataset dimensions at creation time */
   hsize_t max_dims[RANK];     /* Dataset maximum dimensions */
   hsize_t chunk_dims[RANK] = { DIM0, DIM1, 1 }; /* Chunk dimensions of dataset */
   hsize_t mem_dims[2];        /* Memory buffer dimension sizes */
   hsize_t dset_start[RANK];   /* Dataset selection start coordinates (for hyperslab setting) */
   hsize_t dset_count[RANK];   /* Dataset selection count coordinates (for hyperslab setting) */
   hsize_t mem_start[2];       /* Memory buffer selection start coordinates (for hyperslab setting) */
   hsize_t mem_count[2];       /* Memory buffer selection count coordinates (for hyperslab setting) */
   herr_t ret;                 /* Generic return value */
   int i, j;                   /* Loop index */

   /* Timing stuff */
   std::chrono::steady_clock::time_point tStart;
   std::chrono::steady_clock::duration elapsed;

   /* Initialize buffer of dataset to write */
   /* (in a real application, this would be your science data) */
   /* <SKIPPED> */

   /* Create an HDF5 file access property list */
   fapl_id = H5Pcreate(H5P_FILE_ACCESS);
   assert(fapl_id > 0);


   /* Create the file */
   file_id = H5Fcreate(FILENAME, H5F_ACC_TRUNC, H5P_DEFAULT, fapl_id);
   assert(file_id > 0);

   /* Release file access property list */
   ret = H5Pclose(fapl_id);
   assert(ret >= 0);

   /* Define the dataspace dimensions of the dataset (with an unlimited maximum in the last dimension) */
   dset_dims[0] = DIM0;
   dset_dims[1] = DIM1;
   dset_dims[2] = DIM2;
   max_dims[0] = DIM0;
   max_dims[1] = DIM1;
   max_dims[2] = H5S_UNLIMITED;
   dset_space_id = H5Screate_simple(RANK, dset_dims, max_dims);
   assert(dset_space_id > 0);

   
   // Modify dataset creation properties, i.e. set alloc time to early, enable chunking
   dcpl_id = H5Pcreate(H5P_DATASET_CREATE);
   ret = H5Pset_alloc_time(dcpl_id, H5D_ALLOC_TIME_EARLY);
   assert(ret >= 0);
   ret = H5Pset_fill_time(dcpl_id, H5D_FILL_TIME_NEVER);
   assert(ret >= 0);
   ret = H5Pset_chunk(dcpl_id, RANK, chunk_dims);
   assert(ret >= 0);
   
   /* Create the dataset */
   dset_id = H5Dcreate2(file_id, DATASETNAME, H5T_NATIVE_UCHAR,dset_space_id, H5P_DEFAULT, dcpl_id, H5P_DEFAULT);
   assert(dset_id > 0);

   /* Create memory dataspace for write buffer */
   mem_dims[0] = DIM0;
   mem_dims[1] = DIM1;
   mem_space_id = H5Screate_simple(2, mem_dims, NULL);
   assert(mem_space_id > 0);

   
   // Select all elements in the memory buffer
   mem_start[0] = 0;
   mem_start[1] = 0;
   mem_count[0] = 1;
   mem_count[1] = 1;
   //ret = H5Sselect_hyperslab(mem_space_id, H5S_SELECT_SET, mem_start, NULL, mem_count, NULL);
   //assert(ret >= 0);
   

   /* Allocate the write buffer; each slice is filled with its slice number below */
   uint8_t* write_buf = new uint8_t[DIM0 * DIM1];

   for (j = 0; j < DIM2; j++)
   {
       /* Select column of elements in the file dataset */
       dset_start[0] = 0;
       dset_start[1] = 0;
       dset_start[2] = j;
       dset_count[0] = DIM0;
       dset_count[1] = DIM1;
       dset_count[2] = 1;

       ret = H5Sselect_hyperslab(dset_space_id, H5S_SELECT_SET, dset_start, NULL, dset_count, NULL);
       assert(ret >= 0);


       for (i = 0; i < DIM0; i++)
       {
           for (int k = 0; k < DIM1; k++)
           {
               write_buf[(i * DIM1) + k] = j;
           }
       }

       tStart = std::chrono::steady_clock::now();

       /* Write one slice */
       ret = H5Dwrite(dset_id, H5T_NATIVE_UCHAR, mem_space_id, dset_space_id, H5P_DEFAULT, write_buf);
       //assert(ret >= 0);
       
       elapsed = std::chrono::steady_clock::now() - tStart;
       auto elapsed_count_us = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();

       printf("Write finished in %lld us\n", static_cast<long long>(elapsed_count_us));
   }

   delete[] write_buf;

   /* Close memory dataspace */
   ret = H5Sclose(mem_space_id);
   assert(ret >= 0);

   /* Close dset dataspace */
   ret = H5Sclose(dset_space_id);
   assert(ret >= 0);

   /* Close dataset */
   ret = H5Dclose(dset_id);
   assert(ret >= 0);

   /* Release dataset creation property list */
   ret = H5Pclose(dcpl_id);
   assert(ret >= 0);

   /* Close the file */
   ret = H5Fclose(file_id);
   assert(ret >= 0);

   return(0);
}

#6

Apparent solution

I have up to this point been writing 2D frames to slices of a 3D array. I therefore neglected to make the dataspace describing the memory array match the rank of the storage array.

I changed the memory dataspace definition to be 3 dimensional like the data set, even if only 1 in the third dimension.

( the chunk size was always 3 dimensional )
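A minimal sketch of that change, kept free of HDF5 calls so it can be compiled on its own: `pad_to_rank` is a hypothetical helper, and `unsigned long long` stands in for HDF5's `hsize_t` (an assumption; cast or change the type as your HDF5 version requires).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Extends `dims` with trailing extents of 1 until it has `rank` entries,
// so a 2-D frame {128, 128} becomes {128, 128, 1} for a rank-3 dataset.
// The number of elements described is unchanged.
std::vector<unsigned long long> pad_to_rank(std::vector<unsigned long long> dims,
                                            std::size_t rank)
{
    assert(dims.size() <= rank);
    dims.resize(rank, 1ULL);
    return dims;
}
```

The padded dimensions would then be handed to `H5Screate_simple(3, dims.data(), NULL)` in place of the rank-2 `mem_dims` used in the listing above. For a dataset laid out as Inf x 128 x 128 the extra extent of 1 would go at the front instead of the back.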

Following this change the speed is back to being like lightning even when doing individual frames.

On Chunk Size

As predicted, using chunk sizes closer to 1MiB yields significant improvements over smaller chunks.

Average write time per image:
16KiB: 53.5us
960KiB: 17.5us
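
As a rough conversion of those per-image times into throughput, assuming each image is 128 x 128 uchar (16 KiB) as described at the start of the thread. `mib_per_second` is a hypothetical helper written for this calculation.

```cpp
#include <cassert>

// Converts a payload size and an elapsed time into MiB per second.
double mib_per_second(double bytes, double microseconds)
{
    assert(microseconds > 0.0);
    const double seconds = microseconds / 1.0e6;
    return (bytes / (1024.0 * 1024.0)) / seconds;
}
```

With the figures above, `mib_per_second(16384, 53.5)` comes to roughly 292 MiB/s and `mib_per_second(16384, 17.5)` to roughly 893 MiB/s, i.e. about a 3x difference between the two chunk sizes.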

Thank you @gheber

:slight_smile:


#7

Glad to hear that. Just for the record, I’ve denoised the example a bit. Is that what you’ve settled on? chunky.cpp (2.4 KB)

G.


#8

Thank you once again @gheber.

The example is much neater and I have added it to a “Solution” section in the original question for future readers.

In the end I have gone with a 128 x 128 x inf array shape; nevertheless, the example still neatly illustrates the core of the solution.

Quick Question on chunk cache

You state that we could use larger chunks (> 1MiB).

  1. Would this increase performance? (assuming a suitable chunk cache size)
  2. What is the limiting factor to chunk cache size? (is it physical? my CPU has 20MB cache size…)

#9

  1. Like most caches, it’s there for spatial or temporal locality in chunk access. If there is such locality in your application and it can be exploited, the cache size can be adjusted per dataset (H5Pset_chunk_cache https://portal.hdfgroup.org/display/HDF5/H5P_SET_CHUNK_CACHE) to hold multiple chunks, and the eviction policy can be adjusted.
  2. The chunk cache is managed by the HDF5 library and lives in RAM, which is the limiting factor. There is no direct connection with CPU caches, but depending on your access (read/write) pattern there might be a sweet range. (Also, don’t forget about all those caches on the storage end of things.) You can go to extremes and consider pinning threads to cores, but I would caution against over-optimization for a particular setup. Many of those assumptions won’t be valid the moment you share the file or application with someone, or you upgrade your system. It’s easy to play with different chunk sizes, either programmatically or using a tool such as h5repack. Pick something that works with modest to middle-of-the-road RAM assumptions. If your data lends itself to compression, you can explore that as well (w/ h5repack).

G.
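
A small sketch of the cache-size arithmetic discussed above. Both helper names are made up for illustration, and the 1 MiB default is taken from this thread rather than verified here. The byte count from `cache_bytes_for` is the kind of value one would pass as the `rdcc_nbytes` argument of `H5Pset_chunk_cache`.

```cpp
#include <cassert>
#include <cstddef>

// How many whole chunks of `chunk_bytes` fit in a cache of `cache_bytes`.
std::size_t chunks_that_fit(std::size_t cache_bytes, std::size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return cache_bytes / chunk_bytes;
}

// Cache size in bytes needed to hold `n_chunks` chunks of `chunk_bytes` each.
std::size_t cache_bytes_for(std::size_t n_chunks, std::size_t chunk_bytes)
{
    return n_chunks * chunk_bytes;
}
```

With a 1 MiB cache, 64 of the 16 KiB chunks fit but only one 960 KiB chunk does; holding four 960 KiB chunks would need 3932160 bytes (3.75 MiB).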