This pattern is frequent with sensor networks, high-frequency trading, and the like. For C++, H5CPP offers the h5::append operator, which internally buffers packets up to the chunk size, runs full chunks through its own compression pipeline based on BLAS level-3 blocking, and finally issues an H5Dwrite_chunk call. Unfortunately C++17 is not an option for everyone, but you could compile and C-export a subroutine and link it from Fortran, C, etc.
The mechanism can saturate the IO bandwidth of a commodity computer using a single core, which becomes your IO thread.
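For illustration, a minimal sketch of that pattern, assuming the H5CPP headers are installed; the file name, dataset name, chunk size and compression level are made up for the example:

```cpp
#include <h5cpp/all>

int main() {
    // create a file and a chunked, unlimited-length dataset to stream into
    h5::fd_t fd = h5::create("events.h5", H5F_ACC_TRUNC);
    h5::pt_t pt = h5::create<double>(fd, "stream",
        h5::max_dims{H5S_UNLIMITED},        // extendable along the first dimension
        h5::chunk{1024} | h5::gzip{9});     // packets are buffered to chunk size, then compressed

    // h5::append buffers each element internally; full chunks go through the
    // compression pipeline and end up in an H5Dwrite_chunk call
    for (int i = 0; i < 100'000; ++i)
        h5::append(pt, static_cast<double>(i));
}   // pt and fd close on destruction, flushing any partially filled chunk
```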

To solve the entire problem, IPC, or interprocess communication, comes to your help. The simplest data structure is a queue, and I will get back to that later; middleware such as ZeroMQ, Kafka, or RabbitMQ is also an option.
ZeroMQ is popular where both latency and throughput matter and you want some wiggle room when it comes to deployment: [inter-thread | interprocess | UDP | TCP | multicast: pgm, epgm, etc.]. It is like a Swiss army knife; in fact ZeroMQ is so useful for this sort of problem that it deserves an article of its own.
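To make the ZeroMQ option concrete, here is a minimal PUSH/PULL sketch against the plain libzmq C API (cppzmq works equally well); the inproc endpoint name and the payload are arbitrary, and error checking is omitted:

```cpp
#include <zmq.h>
#include <cstdio>

int main() {
    void* ctx = zmq_ctx_new();

    // the writer/IO side pulls; switching transports is only a matter of the endpoint string:
    // "inproc://events" (inter-thread), "ipc:///tmp/events" (interprocess), "tcp://*:5555", ...
    void* pull = zmq_socket(ctx, ZMQ_PULL);
    zmq_bind(pull, "inproc://events");

    // the producer side pushes packets toward the writer
    void* push = zmq_socket(ctx, ZMQ_PUSH);
    zmq_connect(push, "inproc://events");

    const char packet[] = "sensor reading #42";
    zmq_send(push, packet, sizeof(packet), 0);

    char buffer[64];
    int n = zmq_recv(pull, buffer, sizeof(buffer), 0);
    if (n > 0) std::printf("received %d bytes: %s\n", n, buffer);

    zmq_close(push);
    zmq_close(pull);
    zmq_ctx_destroy(ctx);
}
```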
Let’s get to hand-rolled queues. A tough one indeed: this is not something you would normally write on your own, but if you are up for it, a threaded version with a mutex/lock is sketched below; there are other varieties as well, namely lock-free and wait-free queues.
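Here is a minimal sketch of such a mutex/lock protected queue in standard C++; the bounded capacity, element type and the trivial producer/consumer demo are arbitrary choices:

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

template <typename T>
class blocking_queue {                      // one mutex + two condition variables
public:
    explicit blocking_queue(size_t capacity) : capacity_(capacity) {}

    void push(T value) {                    // blocks while the queue is full
        std::unique_lock<std::mutex> lock(mutex_);
        not_full_.wait(lock, [this]{ return queue_.size() < capacity_; });
        queue_.push(std::move(value));
        not_empty_.notify_one();
    }
    T pop() {                               // blocks while the queue is empty
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this]{ return !queue_.empty(); });
        T value = std::move(queue_.front());
        queue_.pop();
        not_full_.notify_one();
        return value;
    }
private:
    std::mutex mutex_;
    std::condition_variable not_full_, not_empty_;
    std::queue<T> queue_;
    size_t capacity_;
};

int main() {                                // trivial two-thread demo
    blocking_queue<int> q(128);
    std::thread producer([&q]{ for (int i = 0; i < 1000; ++i) q.push(i); });
    std::thread consumer([&q]{ for (int i = 0; i < 1000; ++i) std::printf("%d\n", q.pop()); });
    producer.join();
    consumer.join();
}
```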
All of these solutions decouple software components, from single-process multithreaded setups to robust multi-computer, multi-process layouts. Where does HDF5 come into the picture? In the disk IO thread(s), of course; and since a single thread can saturate the IO bandwidth on commodity hardware, with this approach you get a robust event recorder that can scale from an intra-process setup to a multinode system.
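To wire the pieces together, here is a sketch of that layout under the stated assumptions: a producer thread feeds a simple mutex-guarded std::queue (a stand-in for any of the queues or ZeroMQ sockets above), and a single IO thread drains it and persists the events with h5::append; the file name, dataset name, chunk size and sentinel value are illustrative only:

```cpp
#include <h5cpp/all>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

int main() {
    std::queue<double> events;               // stand-in for a ring buffer, ZeroMQ socket, ...
    std::mutex mtx;
    std::condition_variable cv;
    const double sentinel = -1.0;            // marks the end of the stream

    // producer: e.g. the sensor-capture or market-data thread
    std::thread producer([&]{
        for (int i = 0; i <= 100'000; ++i) {
            double value = (i < 100'000) ? static_cast<double>(i) : sentinel;
            { std::lock_guard<std::mutex> lock(mtx); events.push(value); }
            cv.notify_one();
        }
    });

    // IO thread: the only thread touching HDF5, appending events as they arrive
    std::thread io([&]{
        h5::fd_t fd = h5::create("recorder.h5", H5F_ACC_TRUNC);
        h5::pt_t pt = h5::create<double>(fd, "events",
            h5::max_dims{H5S_UNLIMITED}, h5::chunk{4096} | h5::gzip{9});
        for (;;) {
            std::unique_lock<std::mutex> lock(mtx);
            cv.wait(lock, [&]{ return !events.empty(); });
            double value = events.front(); events.pop();
            lock.unlock();
            if (value == sentinel) break;    // clean shutdown
            h5::append(pt, value);
        }
    });

    producer.join();
    io.join();
}
```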
hope it helps: steve
(diagram is shamelessly stolen from the internet)