Hi Mark,
> Sorry if my response is off target.
Not at all off target; it's right on target. [Like most subscribers to this list located outside of Champaign, IL, we have never met in person, but I have been reading your comments for something like 10 years, and they are always on target.]
> If I read your document correctly, it proposes a couple of solutions.
Correct: two solutions for a problem that was basically a "bad" use of HDF5.
The HistoTool software saves data in real time from neutron experiments. The NeXus format (based on HDF5) is used:
http://www.nexusformat.org/Main_Page
One of the things HistoTool does is to append data as described in the PDF.
However, due to the way the NeXus API was used, performance was very slow. It was several orders of magnitude slower
than saving the experiment results to a plain binary file, so one question that came up was:
"Why should I use HDF5 instead of a binary file, if it's several orders of magnitude slower?"
So, I implemented the two solutions explained in the document.
This API can be used as a "use case" of how to circumvent the "problems" of the chunk design. (By "problems" I mean the fact that, with HDF5 used as in the original software implementation, performance was several orders of magnitude slower than with a plain binary file; this has to do with the way chunks are designed.)
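For readers not familiar with the pattern, here is a minimal sketch of the append operation being discussed, written against the plain HDF5 C API (the helper name append_1d is mine, not from HistoTool or the PDF). It assumes a 1D dataset of native ints created with an unlimited maximum dimension and a chunked layout:

    #include "hdf5.h"

    // Append 'n' ints to the end of an extendible 1D dataset.
    // 'dset' must have been created with H5S_UNLIMITED max size and chunking.
    void append_1d(hid_t dset, const int *data, hsize_t n)
    {
        // query the current size of the dataset
        hid_t fspace = H5Dget_space(dset);
        hsize_t cur = 0;
        H5Sget_simple_extent_dims(fspace, &cur, NULL);
        H5Sclose(fspace);

        // grow the dataset and select the newly added tail region
        hsize_t newdim = cur + n;
        H5Dset_extent(dset, &newdim);
        fspace = H5Dget_space(dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, &cur, NULL, &n, NULL);

        // write the new data into the tail
        hid_t mspace = H5Screate_simple(1, &n, NULL);
        H5Dwrite(dset, H5T_NATIVE_INT, mspace, fspace, H5P_DEFAULT, data);
        H5Sclose(mspace);
        H5Sclose(fspace);
    }

Note that each small append touches the chunk spanning the end of the dataset, so if appends are much smaller than a chunk, that chunk gets read, modified, and written repeatedly unless it stays in the chunk cache; that is exactly the read-modify-write behavior Mark describes below.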
> One is to add your own temporary storage layer to act as a sort of
> impedance matcher thereby enforcing that any actual I/O requests to HDF5
> match on cache block boundaries.
Right. In this case the user had to modify the software just to avoid the "problem" mentioned. That should not happen; a user should not have to modify his code to work around that problem.
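As an illustration of that first solution, here is a minimal sketch of such an impedance-matching layer, reusing the hypothetical append_1d() helper above; the class name and details are mine, not the actual HistoTool code. It buffers incoming values in memory and hands them to HDF5 only in whole-chunk units, so every write lands on a chunk boundary:

    #include <vector>

    class ChunkAlignedWriter
    {
    public:
        ChunkAlignedWriter(hid_t dset, size_t chunk_elems)
            : m_dset(dset), m_chunk(chunk_elems) {}

        void write(const int *data, size_t n)
        {
            m_buf.insert(m_buf.end(), data, data + n);
            while (m_buf.size() >= m_chunk)
                flush(m_chunk);              // full chunks only
        }

        void close()
        {
            if (!m_buf.empty())
                flush(m_buf.size());         // final partial chunk, once
        }

    private:
        void flush(size_t n)
        {
            append_1d(m_dset, &m_buf[0], (hsize_t)n);
            m_buf.erase(m_buf.begin(), m_buf.begin() + n);
        }

        hid_t m_dset;
        size_t m_chunk;                      // chunk size in elements
        std::vector<int> m_buf;
    };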
> The other solution involves a 'new API' I think. Is this a proposed new
> API for HDF5 Lib or HDF5 HL or HDF5 lite interfaces?
It is a new API. At this moment it is not foreseen that it will be incorporated into HDF5.
But it would certainly be possible, with some changes, to make it more general-purpose.
The datasets are all 1D and I make extensive use of the STL (vectors), but this could be changed to be of general use (e.g., generic input/output data formats).
> Is the main issue that you need to be able to adjust chunk cache size based on your
> application's needs at the time the data is being read or written?
Yes. It allows the chunk cache size to be passed as a parameter (currently 8 GB); on top of this, it keeps track of a multitude of *open* datasets in an STL map (pairs of path and HDF5 dataset ID).
The purpose of this is to avoid closing and opening datasets frequently (that is, to do so as little as possible).
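To make that concrete, here is a minimal sketch of those two ingredients (the names are mine, not the actual H5 Map code): a per-dataset chunk cache size set through a dataset access property list with H5Pset_chunk_cache, and open dataset handles memoized in an STL map keyed by path:

    #include "hdf5.h"
    #include <map>
    #include <string>

    static std::map<std::string, hid_t> open_dsets;   // path -> dataset ID

    // Open (or reuse) a dataset with a caller-chosen chunk cache size,
    // e.g. get_dataset(file, "/entry/data", (size_t)1 << 30) for 1 GB.
    hid_t get_dataset(hid_t file, const std::string &path, size_t cache_bytes)
    {
        std::map<std::string, hid_t>::iterator it = open_dsets.find(path);
        if (it != open_dsets.end())
            return it->second;                        // already open: no H5Dopen

        hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
        // rdcc_nslots should be a prime, roughly 100x the number of chunks
        // that fit in the cache; 0.75 is the default preemption policy
        H5Pset_chunk_cache(dapl, 12421, cache_bytes, 0.75);

        hid_t dset = H5Dopen2(file, path.c_str(), dapl);
        H5Pclose(dapl);
        open_dsets[path] = dset;
        return dset;
    }

    // At shutdown, close everything once.
    void close_all()
    {
        for (std::map<std::string, hid_t>::iterator it = open_dsets.begin();
             it != open_dsets.end(); ++it)
            H5Dclose(it->second);
        open_dsets.clear();
    }

H5Pset_chunk_cache (available since HDF5 1.8.3) is the per-dataset knob Mark refers to below; since the chunk cache lives with the open dataset, keeping the handles in the map also means the cache is not thrown away between writes.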
> I think the HDF5 library has all the functions you'd need to do that already.
Correct. This API is a use case of it.
I posted it as general information for the community. Hopefully it will help someone who at some point is faced with a similar problem.
Like I said, comments/questions/suggestions are welcome (like "can I have it too?").
Thanks, Mark, for the comments.
Feel free to follow up with more questions or suggestions on how to improve it to be of more general use.
Pedro
----- Original Message -----
From: Mark C. Miller
To: HDF Users Discussion List
Sent: Monday, July 25, 2011 5:35 PM
Subject: Re: [Hdf-forum] H5 Map - a HDF5 chunk caching API
Hi Pedro,
Ok, I am not sure I fully understand what you are proposing or
requesting, but I certainly won't let that stop me from sharing my
opinions.
It sounds like you are dealing with the fundamental 'read-modify-write'
performance problems that often occur in caching algorithms where
operations 'above' the cache span multiple cache blocks.
If I read your document correctly, it proposes a couple of solutions.
One is to add your own temporary storage layer to act as a sort of
impedance matcher thereby enforcing that any actual I/O requests to HDF5
match on cache block boundaries.
The other solution involves a 'new API' I think. Is this a proposed new
API for HDF5 Lib or HDF5 HL or HDF5 lite interfaces? Is the main issue
that you need to be able to adjust chunk cache size based on your
application's needs at the time the data is being read or written? If
so, I think the HDF5 library has all the functions you'd need to do that
already. So, it's really not clear to me what added value you are
proposing. If your datasets are 1-dimensional and you are processing
more or less sequentially through them, as your pictures suggest, then I'd
think a cache size equal to a few blocks should be sufficient to avoid
the read-modify-write behavior. If it's 2D, and your access is more or
less a sliding 2D window, then I'd think a cache size of about 9 blocks
would be sufficient to avoid read-modify-write behavior. If it's 3D, then
27 blocks. So, I figure I must be missing something that motivates this,
because beyond manipulating the cache size, I am not seeing what else
your 'new API' solution provides.
Sorry if my response is off target.
Mark
On Mon, 2011-07-25 at 14:36 -0700, Pedro Silva Vicente wrote:
>
> Dear HDF community
>
> Please find a document detailing an HDF5 API developed at ORNL regarding
> chunk caching.
>
> At this moment I would be very happy to receive comments/questions/
> suggestions.
>
>
>
> ----------------------
> Pedro Vicente
> pedro.vicente@space-research.org
> http://www.space-research.org/
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@hdfgroup.org
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org