H5 Map - an HDF5 chunk caching API

Dear HDF community

Please find a document detailing an HDF5 API developed at ORNL regarding chunk caching.

At this moment I would be very happy to receive comments, questions, and suggestions.

H5Map.pdf (337 KB)


----------------------
Pedro Vicente
pedro.vicente@space-research.org

Hi Pedro,

Ok, I am not sure I fully understand what you are proposing or
requesting but I certainly won't let that stop me from sharing my
opinions ;-)

It sounds like you are dealing with the fundamental 'read-modify-write'
performance problems that often occur in caching algorithms where
operations 'above' the cache span multiple cache blocks.

If I read your document correctly, it proposes a couple of solutions.
One is to add your own temporary storage layer to act as a sort of
impedance matcher thereby enforcing that any actual I/O requests to HDF5
match on cache block boundaries.

The other solution involves a 'new API' I think. Is this a proposed new
API for HDF5 Lib or HDF5 HL or HDF5 lite interfaces? Is the main issue
that you need to be able to adjust chunk cache size based on your
application's needs at the time the data is being read or written? If
so, I think the HDF5 library has all the functions you'd need to do that
already. So, it's really not clear to me what added value you are
proposing. If your datasets are 1 dimensional and you are processing
more or less sequentially through them, as your pictures suggest, then I'd
think a cache size equal to a few blocks should be sufficient to avoid
the read-modify-write behavior. If it's 2D, and your access is more or
less a sliding 2D window, then I'd think a cache size of about 9 blocks
would be sufficient to avoid read-modify-write behavior. If it's 3D, then
27 blocks. So, I figure I must be missing something that motivates this
because beyond manipulating the cache size, I am not seeing what else
your 'new API' solution provides.
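
For reference, here is a minimal sketch of adjusting the per-dataset chunk cache through the existing API (H5Pset_chunk_cache on a dataset access property list). The slot and byte counts below are placeholders, not recommendations:

    #include "hdf5.h"

    // Sketch only: enlarge the raw-data chunk cache for one dataset so a
    // few chunks stay resident and appends need not read-modify-write.
    hid_t open_with_big_cache(hid_t file_id, const char *path)
    {
        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        // nslots: a prime, ideally ~100x the number of cached chunks;
        // 64 MiB of cache; w0 = 0.75 is the library's default weight.
        H5Pset_chunk_cache(dapl, 521, 64 * 1024 * 1024, 0.75);
        hid_t dset = H5Dopen2(file_id, path, dapl);
        H5Pclose(dapl);
        return dset;   // caller closes with H5Dclose()
    }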

Sorry if my response is off target.

Mark


Hi Mark,

Sorry if my response is off target.

Not at all off target; it's right on target. [Like most subscribers to this list located outside of Champaign, IL, we have never met in person, but I have been reading your comments for something like 10 years, and they are always on target.]

If I read your document correctly, it proposes a couple of solutions.

Correct: 2 solutions for a problem that was basically a "bad" use of HDF5.

The HistoTool software saves data in real time from neutron experiments. The NeXus format (based on HDF5) is used:

http://www.nexusformat.org/Main_Page

One of the things HistoTool does is to append data as described in the PDF.
However, due to the way the NeXus API was used, performance was very slow. It was several orders of magnitude slower
than using a plain binary file to save the experiment results, so one question that came up was

"Why should I use HDF5 instead of a binary file, if it's several orders of magnitude slower?"

So, I implemented the 2 solutions explained.

This API can be used as a "use case" of how to work around the "problems" of the chunk design (by "problems" I mean the fact that, used as in the original software implementation, HDF5 was several orders of magnitude slower than a plain binary file; this has to do with the way chunks are designed).

One is to add your own temporary storage layer to act as a sort of
impedance matcher thereby enforcing that any actual I/O requests to HDF5
match on cache block boundaries.

Right. In this case the user had to modify the software just to avoid the "problem" mentioned. That should not have to happen; a user should not have to modify his code to work around that problem.
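
For readers who have not seen the PDF, here is a minimal sketch of that kind of impedance matcher (the class and its names are illustrative, not the actual H5Map code), assuming a 1D chunked dataset created with an unlimited dimension and initially empty:

    #include <vector>
    #include "hdf5.h"

    // Buffer appended values in memory and only issue an H5Dwrite when a
    // full chunk's worth is available, so every write lands exactly on a
    // chunk boundary (illustrative sketch, not the H5Map implementation).
    class ChunkAlignedAppender {
    public:
        ChunkAlignedAppender(hid_t dset, hsize_t chunk_len)
            : dset_(dset), chunk_len_(chunk_len), written_(0) {}

        void append(double v) {
            buf_.push_back(v);
            if (buf_.size() == chunk_len_) flush();
        }

        void flush() {                      // call once more at shutdown
            if (buf_.empty()) return;
            hsize_t count = buf_.size();
            hsize_t new_size = written_ + count;
            H5Dset_extent(dset_, &new_size);          // grow the 1D dataset
            hid_t fspace = H5Dget_space(dset_);
            H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &written_, NULL,
                                &count, NULL);        // chunk-aligned slab
            hid_t mspace = H5Screate_simple(1, &count, NULL);
            H5Dwrite(dset_, H5T_NATIVE_DOUBLE, mspace, fspace,
                     H5P_DEFAULT, &buf_[0]);
            H5Sclose(mspace);
            H5Sclose(fspace);
            written_ = new_size;
            buf_.clear();
        }

    private:
        hid_t dset_;
        hsize_t chunk_len_, written_;
        std::vector<double> buf_;
    };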

The other solution involves a 'new API' I think. Is this a proposed new
API for HDF5 Lib or HDF5 HL or HDF5 lite interfaces?

It is a new API. At this moment there are no plans for it to be incorporated into HDF5, but with some changes it could certainly be made more general.
The datasets are all 1D and I make extensive use of the STL (vectors), but this could be changed for general use (e.g. generic input/output data formats).

Is the main issue that you need to be able to adjust chunk cache size based on your
application's needs at the time the data is being read or written?

Yes. It allows the chunk cache size to be passed in as a parameter (currently 8 GB); on top of this, it keeps track of a multitude of *open* datasets in an STL map (path paired with the HDF5 dataset ID).
The purpose of this is to avoid closing and reopening datasets frequently, i.e. to do so as little as possible.
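
A minimal sketch of that map-of-open-datasets idea (illustrative only, not the actual H5Map code):

    #include <map>
    #include <string>
    #include "hdf5.h"

    // Look a dataset up by its path and call H5Dopen2 only the first time;
    // every dataset stays open until close_all() is called at shutdown.
    class DatasetCache {
    public:
        explicit DatasetCache(hid_t file) : file_(file) {}

        hid_t get(const std::string &path) {
            std::map<std::string, hid_t>::iterator it = open_.find(path);
            if (it != open_.end()) return it->second;   // already open
            hid_t dset = H5Dopen2(file_, path.c_str(), H5P_DEFAULT);
            open_[path] = dset;
            return dset;
        }

        void close_all() {                              // called once, at the end
            for (std::map<std::string, hid_t>::iterator it = open_.begin();
                 it != open_.end(); ++it)
                H5Dclose(it->second);
            open_.clear();
        }

    private:
        hid_t file_;
        std::map<std::string, hid_t> open_;
    };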

I think the HDF5 library has all the functions you'd need to do that already.

Correct. This API is a use case of those functions.
I posted it as general information for the community; hopefully it will help someone who at some point faces a similar problem.

Like I said, comments, questions, and suggestions are welcome (including "can I have it too?").

Thanks, Mark, for the comments.

Feel free to follow up with more questions, or with suggestions for how to make it more generally useful.

Pedro


Ok, thanks for context. That helps. If I recall though, the best of your
two solutions resulted in only about a 65% speedup. That's maybe a 2.2x
speedup, which is certainly a step in the right direction but hardly
solves the 'orders of magnitude' problem you mention.

Now, I've been using and comparing raw binary I/O and products like HDF5
for many years. In all my experience, and depending on the size of the
read/write requests, I have observed and have been willing to tolerate
at most a 2-3x performance hit for using a 'high level' I/O library over
what is achievable using raw binary I/O interfaces. And, honestly, the
performance hit is usually less than 20-25% of raw binary bandwidth.
That said, this assumes the HDF5 library is being used properly and, in turn,
the HDF5 library is using the underlying filesystem properly. I have
seen situations where one or both of those are not the case, and indeed I have
also seen orders of magnitude loss of performance. In fact, within the
last year, we needed to write our own specialized Virtual File Driver
(VFD) to get good performance on our BG/P system. Writing a new VFD for
HDF5 wasn't necessarily simple. But, it was possible and doing so
resulted in a 30-50x performance improvement. On top of that, getting the
applications to use HDF5 slightly differently can, I think, give us
another 2-3x performance improvement.
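
As an aside, the driver is selected on the file access property list, so swapping VFDs needs essentially no change to the application's read/write code. A small sketch, with the built-in 'core' (in-memory) driver standing in for a custom one; a custom VFD would be installed the same way, through its own registration call:

    #include "hdf5.h"

    // Sketch: choose the file driver on the file access property list.
    hid_t open_with_core_vfd(const char *name)
    {
        hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
        // 64 MiB growth increment; write the memory image to disk on close.
        H5Pset_fapl_core(fapl, 64 * 1024 * 1024, 1);
        hid_t file = H5Fopen(name, H5F_ACC_RDWR, fapl);
        H5Pclose(fapl);
        return file;   // caller closes with H5Fclose()
    }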

So, I guess what I am saying is that I've got to believe it is possible
to change the way NeXus uses HDF5 and/or the way HDF5 interacts with
underlying storage (maybe by writing your own VFD), to address the
'orders of magnitude' performance hit. Otherwise, I agree with you, why
should anyone pay that kind of a price to use it?

Mark


Thanks for the follow-up.

When I mentioned "orders of magnitude" I did not have precise numbers; to get them I would have to write a program that does exactly what the current program does, but without HDF5: that is, read up to 20 GB of input binary data (one file) and distribute it across several datasets (or several files, in the non-HDF5 case), appending a "slab" to them at some point. So the "orders of magnitude" figure is just my guess.

So, I guess what I am saying is that I've got to believe it is possible
to change the way NeXus uses HDF5 and/or the way HDF5 interacts with
underlying storage (maybe by writing your own VFD),

In this case the systems are Linux x86_64. The way NeXus is designed, I think the behavior described in the H5Map API is not possible. NeXus is designed so that one does not have to deal with "IDs" the way HDF5 does, where we assemble all of them together and get great flexibility in return.
NeXus is more "high-level" than HDF5; you have operations like

1. nxsfile.openData(dataset_name);
2. nxsfile.putSlab( data, start, size );
3. nxsfile.closeData();

These maintain internally a "current" HDF5 dataset ID (there is only one "current" at a time); these operations open and close the dataset (in HDF5 terms) each time, which is precisely what I wanted to avoid. In the H5Map API it is possible to keep all the datasets I want to append to open; they are closed only at the end, rather than being re-opened and re-closed a multitude of times. I can obtain the HDF5 dataset ID I want to append to by looking it up in the STL map by path.
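
As a rough usage illustration (hypothetical glue code, reusing the two illustrative classes sketched earlier in this thread), the append path then opens each dataset once, appends to it many times, and closes it once at the end:

    #include <map>
    #include <string>
    #include <vector>

    // Hypothetical glue: one chunk-aligned appender per dataset path.
    std::map<std::string, ChunkAlignedAppender*> appenders;

    void on_new_samples(DatasetCache &cache, const std::string &path,
                        const std::vector<double> &samples) {
        if (appenders.find(path) == appenders.end())
            appenders[path] = new ChunkAlignedAppender(cache.get(path),
                                                       1024 /* chunk length */);
        for (size_t i = 0; i < samples.size(); ++i)
            appenders[path]->append(samples[i]);
    }
    // At shutdown: flush() and delete each appender, then cache.close_all().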

Pedro
