Memory management in H5FD_class_t


#1

Hi, folks!

The VFD interface provides a good abstraction for hardware-accelerated I/O routines. John Ravi’s GDS VFD is one such example: it uses NVIDIA’s GPUDirect Storage to initiate I/O to a cudaMalloc'd buffer provided by the caller.

Different VFDs may deal with memory managed by other specialized allocators: a VFD optimized to read from Samsung’s SmartSSD would need to perform I/O to a buffer allocated via OpenCL’s clCreateBuffer(), for instance.

However, I don’t want to bloat every application that needs to support such VFDs by introducing #ifdefs and explicit calls to device-specific memory allocators. For this reason, I started looking around to find out what kind of abstraction the VFD interface provides for memory management.

I have noticed that H5FD_class_t provides two members named alloc and free, which I initially presumed to be in charge of interfacing with specialized memory allocators (which would be just perfect):

  • H5FD__alloc_real(H5FD_t *file, ...) calls the VFD alloc handler
  • H5FD__free_real(H5FD_t *file, ...) calls the VFD free handler

However, code comments (“allocate space in the file”) and the APIs below suggest that alloc and free relate to fallocate() / ftruncate() instead:

  • H5FD_alloc(H5FD_t *file, ..., H5F_t *f, ...) marks the file f end-of-allocated flag dirty
  • H5FDalloc(H5FD_t *file, ..., H5F_t *f, ...) modifies the dataset access property list
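To make those file-space semantics concrete, here is a rough sketch of the alloc callback together with a toy driver implementation (the typedefs are simplified stand-ins for the real HDF5 types, and demo_alloc is my own illustration, not HDF5 code). Note that alloc returns an haddr_t file offset, never a memory pointer:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-ins for the real HDF5 types (sketch only): */
typedef uint64_t haddr_t;   /* an offset within the file, not a pointer */
typedef uint64_t hsize_t;
typedef int64_t  hid_t;
typedef int      H5FD_mem_t;
typedef struct H5FD_t { haddr_t eoa; } H5FD_t;

/* The alloc/free members of H5FD_class_t have roughly these shapes:
 *   haddr_t (*alloc)(H5FD_t *file, H5FD_mem_t type, hid_t dxpl, hsize_t size);
 *   herr_t  (*free) (H5FD_t *file, H5FD_mem_t type, hid_t dxpl,
 *                    haddr_t addr, hsize_t size);
 * A toy driver implementation just extends the end-of-allocated address: */
static haddr_t demo_alloc(H5FD_t *file, H5FD_mem_t type, hid_t dxpl,
                          hsize_t size)
{
    (void)type; (void)dxpl;
    haddr_t addr = file->eoa;
    file->eoa += size;   /* grow the file, fallocate()-style */
    return addr;         /* a file offset -- no buffer is allocated */
}
```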

In the absence of memory-management APIs at the VFD layer, I’d like to ask whether this is something that has caught somebody else’s attention before. Is anyone working on this topic already? If not, I’d be happy to prototype an extension to H5FD_class_t, as long as that’s useful to the community.

Such an API would be useful for two primary reasons. First, applications could be as vendor/VFD-neutral as possible. Second, I remember having seen code in the I/O filter path (H5Z*.c) that assumed the output buffers had been allocated with malloc. If that code were to call into the VFD to reallocate buffers, that would no longer be a problem.
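To illustrate what I have in mind, here is a minimal sketch of such an extension; the mem_alloc/mem_free member names are entirely hypothetical and do not exist in H5FD_class_t today:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct H5FD_t H5FD_t;   /* opaque, as in the real VFD layer */

/* Hypothetical new H5FD_class_t members for buffer management; the
 * names mem_alloc/mem_free are invented for this sketch: */
typedef struct {
    void *(*mem_alloc)(H5FD_t *file, size_t size);
    void  (*mem_free)(H5FD_t *file, void *buf);
} H5FD_mem_ops_sketch_t;

/* A plain POSIX driver would wrap malloc/free, while a GDS driver
 * would call cudaMalloc/cudaFree and a SmartSSD driver would use
 * OpenCL buffers: */
static void *posix_mem_alloc(H5FD_t *file, size_t size)
{
    (void)file;
    return malloc(size);
}
static void posix_mem_free(H5FD_t *file, void *buf)
{
    (void)file;
    free(buf);
}
```

Applications would then allocate I/O buffers through the driver instead of hard-coding a device-specific allocator.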

Thanks!
Lucas


#2

Lucas, how are you? I believe the community is very interested in this work. I’ve asked a few colleagues, and there’s definitely been some thinking along similar lines, but no specific proposal has been produced. Here’s what we can do: if you can sketch out a proposal for a revised H5FD_class_t, we could use one of our Friday webinars (11 PM Central) for you to present your ideas and start the discussion. Everyone interested would be welcome to join.

Food for thought: different device-specific memory types may have different capabilities in terms of what operations can or cannot be performed on them w.r.t. the different stages of the HDF5 I/O flow. For example, some may support gather/scatter-type operations. Others may or may not be usable for datatype conversion. It’d be OK to impose restrictions, define capabilities, etc., but there’s more to it than just (de-)allocation.
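As a sketch of what I mean (all flag names below are invented for illustration), each driver could report a capability mask for its memory, and the library would gate each pipeline stage on it:

```c
#include <assert.h>

/* Hypothetical capability flags a driver could report for its memory
 * (all names invented for illustration): */
enum {
    H5FD_MEMCAP_SCATTER_GATHER = 1u << 0, /* iovec-style I/O supported  */
    H5FD_MEMCAP_TYPE_CONV      = 1u << 1, /* datatype conversion is OK  */
    H5FD_MEMCAP_FILTER_INPUT   = 1u << 2  /* may feed an I/O filter     */
};

/* The library would then gate each pipeline stage on the reported mask: */
static int can_convert_in_place(unsigned caps)
{
    return (caps & H5FD_MEMCAP_TYPE_CONV) != 0;
}
```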

What do you think?
Best, G.


#3

Hi @lucasvr,

It might be worth looking at this commit and this commit, where I started to flesh this out a little while working with the GDS VFD. The idea is to have a catch-all H5FDctl operation that you provide with an op_code and arguments to perform essentially arbitrary operations in a file driver. In the latter commit, I added some convenient op_codes for memory copy, allocation, and free that drivers can use to perform these operations directly from an application via H5FDctl.

I also introduced the H5FD_FEAT_MEMMANAGE feature flag for drivers. A driver that needs to do its own memory management should set this flag in its query callback to instruct HDF5 to avoid direct stdlib memcpy of application-provided buffers in certain places within the library and to ask the driver to do the copy instead. This was initially a problem for compact dataset I/O with the GDS VFD; there may be (and likely are) other places that haven’t been fixed yet.

As a future improvement, it would be good for HDF5 to be able to ask the driver to allocate and free buffers itself, rather than the current approach of allocating them with malloc and then having the driver distinguish how to copy between stdlib-malloc’ed buffers and its own type of buffer.
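As a rough sketch of the driver side (self-contained stubs stand in for the HDF5 types, and the H5FD_CTL__MEM_ALLOC name/value mirrors the work-in-progress commits, so treat it as an assumption), a ctl handler just dispatches on the op code and reports failure for anything it doesn’t recognize:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Self-contained stand-ins; in a real build these come from the HDF5
 * headers, and the op code name/value below is an assumption: */
typedef struct H5FD_t H5FD_t;
#define H5FD_CTL__MEM_ALLOC ((uint64_t)0x0008)

/* A driver-side ctl handler dispatches on the op code and reports
 * failure for anything it doesn't recognize: */
static int demo_ctl(H5FD_t *file, uint64_t op_code, uint64_t flags,
                    const void *input, void **output)
{
    (void)file; (void)flags;
    switch (op_code) {
        case H5FD_CTL__MEM_ALLOC:
            /* input: requested size; output: the allocated buffer.
             * A GDS driver would call cudaMalloc here instead. */
            *output = malloc(*(const size_t *)input);
            return *output ? 0 : -1;
        default:
            return -1;   /* op code not recognized by this driver */
    }
}
```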

Happy to discuss more if this looks like it could be applicable for what you’d like to accomplish!


#4

Hi Gerd, hi Jordan! Thank you for your replies.

@jhenderson, your commits look very much like what I had in mind for the basic infrastructure. Given the number of specialized functions a driver may need to implement, I was afraid that H5FD_class_t would end up with too many new members. The ctl interface you introduced is just great; I will adopt H5FDctl in my working tree.

@gheber, @jhenderson, it feels like the I/O filter interface should also provide a capabilities interface to let the core HDF5 routines know if it’s OK to provide a buffer allocated by the VFD as input to an I/O filter, or if a data copy via H5FD_CTL__MEM_COPY is needed. What do you think?

The primary application I have is to enable parallel decompression of data chunks supported by GPUs: the GDS VFD would allocate device memory, and that memory would be provided as input to the I/O filter.

I’ve been looking into the interaction between the various modules and creating a diagram of the parts that could host the routines related to GPU decompression of data chunks. My idea is to mimic the steps used by the MPI-based I/O path (represented by the green arrows in the diagram below):

  1. introduce a variable such as io_info->using_gpu_vfd
  2. initialize H5D_layout_ops par_read member so it points to H5D__chunk_gpu_read()
  3. let H5D__chunk_gpu_read() read the compressed data via VFD
  4. let H5D__chunk_gpu_read() prepare an iovec-like array with the offset and length of each chunk
  5. let H5D__chunk_gpu_read() call into the I/O filter while providing the iovec as input (would need a change to the H5Z_class2_t APIs – possibly bumping to H5Z_class3_t)
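For step 4, the iovec-like array could be as simple as the sketch below (all names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iovec-like chunk descriptor for step 4 (all names are
 * invented for illustration): */
typedef struct {
    uint64_t offset;   /* file offset of the compressed chunk */
    size_t   length;   /* compressed size of the chunk        */
} chunk_iovec_t;

/* Step 4: gather per-chunk offsets/lengths so that step 5 can hand the
 * whole batch to the I/O filter for one parallel GPU decompression: */
static size_t build_chunk_iovec(const uint64_t *offsets,
                                const size_t *lengths,
                                size_t nchunks, chunk_iovec_t *out)
{
    for (size_t i = 0; i < nchunks; i++) {
        out[i].offset = offsets[i];
        out[i].length = lengths[i];
    }
    return nchunks;
}
```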

Does this sound like a good approach to you or would you recommend something different?

Thanks!
Lucas


#5

it feels like the I/O filter interface should also provide a capabilities interface to let the core HDF5 routines know if it’s OK to provide a buffer allocated by the VFD as input to an I/O filter, or if a data copy via H5FD_CTL__MEM_COPY is needed.

I think heading in this direction makes a lot of sense and is a good next step for opening up the HDF5 ecosystem for use with this type of hardware. It does seem as though the I/O filter interface may need to have better provisions for a filter to be able to inform HDF5 about the types of buffers it can handle (as well as inform the library about the types of buffers it might hand back). And as you noted in another thread, the I/O filter interface doesn’t currently give you the file handle, so one can’t make an HDF5 call to generically request memory management. I think this would be something nice to include if the I/O filter interface is to be revised.

  1. introduce a variable such as io_info->using_gpu_vfd
  2. initialize H5D_layout_ops par_read member so it points to H5D__chunk_gpu_read()
  3. let H5D__chunk_gpu_read() read the compressed data via VFD
  4. let H5D__chunk_gpu_read() prepare an iovec-like array with the offset and length of each chunk
  5. let H5D__chunk_gpu_read() call into the I/O filter while providing the iovec as input (would need a change to the H5Z_class2_t APIs – possibly bumping to H5Z_class3_t)

I feel like the approach to something like this might become more obvious once the basic infrastructure to support GPU I/O filters is in place, but I wonder if this is perhaps re-inventing more of the wheel than is necessary. Since the reads for chunk data go down to the file driver layer, it seems you should be able to re-use HDF5’s existing chunking support for accomplishing this, without needing to worry about creating your own write/read routines. Of course there will most likely need to be some refactoring to deal with places where the library assumes buffers are allocated with malloc, support for the changes described for the H5Z_class_t, and so on, but at least in theory it should be relatively straightforward to leverage HDF5’s existing chunking code.

One more thing on this topic: I remembered that H5FDctl is not really an application-level routine in HDF5 at the moment; it was added more as library-internal support for the GDS VFD. I think it would make sense for us to repurpose H5allocate_memory/H5resize_memory/H5free_memory for VFD-level memory management within HDF5 applications, rather than having those routines simply use malloc semantics. There may be some application-compatibility concerns there, though.
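A sketch of what those repurposed semantics might look like (hypothetical plumbing; today’s H5allocate_memory(size, clear) always has malloc/calloc semantics):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical plumbing: route the allocation through the driver when
 * one is available, else keep today's malloc behavior. */
typedef void *(*vfd_alloc_fn)(size_t size);

static void *demo_allocate_memory(size_t size, int clear,
                                  vfd_alloc_fn vfd_alloc)
{
    void *buf = vfd_alloc ? vfd_alloc(size)  /* driver-managed memory */
                          : malloc(size);    /* current behavior      */
    if (buf && clear)
        memset(buf, 0, size); /* note: invalid for device memory; a real
                               * driver would clear via its own API */
    return buf;
}
```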

Certainly a lot to think about here!


#6

Definitely – I’ll keep this in mind.

Yes, I was a bit concerned that my approach would add too many new elements unnecessarily. Thanks for your advice. I’ll see how far I can go with the VFD + revamped I/O filter interfaces first.

Thanks!
Lucas


#7

@jhenderson I noticed that a process calling H5FD_ctl cannot tell whether the driver failed to handle a request or the command was simply not recognized – both paths set major=H5E_VFL, minor=H5E_FCNTL, and ret_value=FAIL.

I’d like to catch situations in which the driver does implement ctl but doesn’t recognize H5FD_CTL__MEM_{ALLOC,FREE}; in such a case I’d like to fall back to allocating system memory.

There are at least a couple of ways to implement that: (1) adding an extra output argument, hbool_t *is_recognized, to both H5FD_ctl and the ctl callbacks, or (2) propagating a different minor error code from H5FD_ctl depending on the error path taken. The latter brings some overhead associated with inspecting the error stack, but the former isn’t pretty either. Do you have an opinion on this?
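To illustrate option (2) with self-contained stubs (the error codes and functions below are stand-ins, not the real H5E/H5FD APIs), the caller would fall back to system memory only when the minor code indicates an unrecognized op:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-ins for the real H5E/H5FD machinery (sketch only): */
enum { H5E_FCNTL = 1, H5E_UNSUPPORTED = 2 };
static int last_minor;   /* stand-in for inspecting the error stack */

static int stub_ctl_mem_alloc(int driver_knows_op, size_t size, void **out)
{
    if (!driver_knows_op) {
        last_minor = H5E_UNSUPPORTED;  /* op code not recognized */
        return -1;
    }
    *out = malloc(size);
    if (*out == NULL) {
        last_minor = H5E_FCNTL;        /* genuine driver failure */
        return -1;
    }
    return 0;
}

/* Fall back to system memory only when the op was unsupported: */
static void *alloc_with_fallback(int driver_knows_op, size_t size)
{
    void *buf = NULL;
    if (stub_ctl_mem_alloc(driver_knows_op, size, &buf) < 0) {
        if (last_minor == H5E_UNSUPPORTED)
            return malloc(size);   /* driver has no MEM_ALLOC op */
        return NULL;               /* real failure: propagate    */
    }
    return buf;
}
```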

Thanks!
Lucas


#8

@lucasvr IMO, the cleanest solution would be for HDF5 to return actual error codes via herr_t, but since we don’t yet do this, I think the latter solution you mentioned would be best, as long as the interpretation of the error stack isn’t too painful or significant in terms of overhead. I’d think returning H5E_UNSUPPORTED for the minor error code in the case where the driver doesn’t implement the request would be reasonable.


#9

Thanks! I’m already testing the error-stack lookup approach and it works. As a next step, I’ll adopt H5E_UNSUPPORTED to make things clearer.


#10

Lucas, which software did you use to create that diagram? G.


#11

@gheber I used Graphviz. This was not an automated process, though: since only a few functions and data structures were relevant to me, I had to manually group/link them and process the resulting .dot file to generate the figure (dot -Tpng file.dot -o file.png).

The “source code” of this graph is available here in case you’d like to use it as a reference. It helps to have the PNG opened by the side!


#12

Masterfully done & thank you for the source. G.


#13

Hi folks!

I’ve drafted a pair of functions for memory allocations via the VFD interface. Please find the pull request here.

A natural next step would be to create a similar wrapper for H5FD_CTL__MEM_COPY (i.e., one that attempts to invoke the operation via the VFD and falls back to the corresponding H5MM function if the opcode is not implemented by the driver), but I wanted to validate my approach with you first.

Note that I’ve also submitted a related pull request that introduces an API to retrieve the H5FD_t * handle associated with a valid file_id so that programs can easily consume functions from the H5FD family.

Thanks!
Lucas