Reading and Writing Multiple Files Using Multiple Threads in C++

Greetings, I am a Ph.D. student who currently uses his own text-based file type (I call it the .cbcsv file) to read and write information, but it has severe limitations. The two main limitations I want to overcome are: (1) parsing large amounts of data takes a long time because the file is text, and (2) all of the information has to be read before the relevant information can be pulled from the files. I use this file type because my program can read 64 input files and write to 64 output files simultaneously. I was wondering if I could use HDF5 instead?

I noticed that HDF5 is thread-safe, but according to the documentation it is not multithreaded:

The above page then talks about how concurrent access is possible but this seems to be contradicted by this page:
https://support.hdfgroup.org/HDF5/faq/threadsafe.html

Digging further, I stumbled upon this webinar, which talks about a “global lock” that is taken around the whole library whenever a single thread accesses it. This would make it unusable for what I require. But this was ~18 months ago now, and I was wondering if this webinar is now out of date:

As such I’m a little confused as to whether HDF5 can do what I need it to do. For further context, I have attached an example .cpp program that reads and writes the file type that I’m currently using:
https://hep.ph.liv.ac.uk/~rcollins/Plots/weekly_work/2022_03_17/exampleCbcsv.cpp

Thanks in advance for any response.

@snillocdlanor writing 64 independent HDF5 files simultaneously from multiple threads will work out of the box; there is no need for any locking, since you have a disjoint set of graphs. This method is used in conjunction with merging the files later on a batch processor.

Generally, users are interested in writing multiple datasets in a single file simultaneously from multiple threads, which is covered in your posted link. However, in most cases I would recommend implementing a single writer thread fed by an IO queue that multiple threads push into. For optimal performance you need a zero-copy implementation of the queue and of most of the other components.

HDF5's scalability, IO performance, and overall properties for scientific computing are generally better than what one can implement oneself, unless of course you are getting your doctorate in filesystems.

You might find this link to the H5CPP library relevant; it can accelerate your progress in implementing the IO framework for your project.

Perhaps you could post the pseudo code you want to implement?
best wishes: steve

Can you tell us a little more about your data (.cbcsv doesn’t ring a bell, but I’m ignorant), and how you want to process them?

G.

Hi, thanks for the response. I admit that what I want is a little unusual: it is multithreaded, but a very basic form of multithreading. Each multithreaded program relies on churning through n files; I picked 64 files as that's the number of threads I have available on my machine. My files contain flat and vector/array information that I will need to pull from. Essentially I'm working just a single step up from multiple instances. I used to use .root files, but ROOT cannot write out to multiple files unless locked (thus ruining performance). I wasted a fair amount of time trying to square that circle, hence this question here to double-check I'm not about to waste any time using HDF5 instead. And from the sounds of it, HDF5 can do what I want.

Pseudo code essentially goes like this

  1. Parse in arguments -> What files do you want to analyse? (say 64 for the sake of argument)
  2. Set up the threads (again say 64 for sake of argument)
  3. Give each of these threads the files they need to analyse (so 1 file per thread with previous numbers)
  4. Each of these threads creates a new output file (so 64 new output files)
  5. Each thread analyses the data it has been given and fills the output files accordingly
  6. The threads are joined together and the program ends
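The six steps above can be sketched with nothing but C++11 threads. This is only a minimal illustration, not the poster's actual program: `analyse_file` and `run_all` are hypothetical names, the "analysis" is a placeholder that records line lengths, and plain `std::ifstream`/`std::ofstream` stand in for whatever file format (HDF5 or otherwise) each thread would really open.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical per-file job (steps 4 and 5): reads `in_path` line by line and
// writes one result line per input line to `out_path`. The "analysis" here
// (the line length) is a placeholder for the real work.
void analyse_file(const std::string& in_path, const std::string& out_path) {
    std::ifstream in(in_path.c_str());
    std::ofstream out(out_path.c_str());
    std::string line;
    while (std::getline(in, line))
        out << line.size() << '\n';
}

// Steps 2, 3 and 6: one thread per input file, each with its own output file.
// No lock is taken because no stream is shared between threads.
std::vector<std::string> run_all(const std::vector<std::string>& inputs) {
    std::vector<std::string> outputs;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        outputs.push_back(inputs[i] + ".out");   // assumed naming scheme

    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < inputs.size(); ++i)
        pool.push_back(std::thread(analyse_file, inputs[i], outputs[i]));

    for (std::size_t i = 0; i < pool.size(); ++i)
        pool[i].join();                          // wait for every worker
    return outputs;
}
```

Calling `run_all` with the list of file names parsed in step 1 returns the list of output files once every thread has finished.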

So from a file standpoint it should be simple. I now know I shouldn’t have used my own file type, as that only causes more problems, but I was inexperienced at the time and didn’t know better.

In case you’re wondering why I don’t use multiple instances, it’s because the inter-process communication between the various programs I use in my analysis chain (~10 programs) broke down and became a nightmare to fix. So, as strange as it sounds, multithreading is actually much easier for me to work with, as the analysis is handled by a single program, just using multiple threads.

I hope that clears everything up.

Currently I just store int, vector int, double, vector double, bool, vector bool. The .cbcsv is just something I made up in a hurry, as I was inexperienced at the time and didn’t know that other file types like JSON and HDF5 were widely used. It’s just a CSV file that can also store vector information in curly brackets (hence my naming it the curly bracket and comma-separated value file, .cbcsv). All I need is a file type that can write out to 64 files independently without me needing to lock when I write out. Steven’s post seems to indicate HDF5 can do this. I made this post because originally I was using .root files, and when it came to multithreading them I was required to lock when writing out to independent files, which ruined the performance. So I made this post to double-check that HDF5 wouldn’t be a similar experience.
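To make the format description concrete, here is a rough sketch of how one such line might be split. The exact .cbcsv syntax is not shown in this thread, so the layout assumed here (fields separated by commas, vectors written as `{a,b,c}`) and the names `split_fields` and `parse_vector` are illustrative guesses only.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Splits one line into fields on commas that are not inside {...}.
std::vector<std::string> split_fields(const std::string& line) {
    std::vector<std::string> fields;
    std::string current;
    int depth = 0;                       // current curly-bracket nesting
    for (std::size_t i = 0; i < line.size(); ++i) {
        const char c = line[i];
        if (c == '{') ++depth;
        if (c == '}') --depth;
        if (c == ',' && depth == 0) {    // field separator
            fields.push_back(current);
            current.clear();
        } else {
            current += c;
        }
    }
    fields.push_back(current);           // last field has no trailing comma
    return fields;
}

// Converts a field such as "{2,3,4}" into a vector of doubles.
std::vector<double> parse_vector(const std::string& field) {
    std::vector<double> values;
    if (field.size() < 2) return values;
    // Strip the brackets, then reuse the comma splitter on the inside.
    const std::string inner = field.substr(1, field.size() - 2);
    if (inner.empty()) return values;
    const std::vector<std::string> parts = split_fields(inner);
    for (std::size_t i = 0; i < parts.size(); ++i)
        values.push_back(std::strtod(parts[i].c_str(), 0));
    return values;
}
```

Under these assumptions, `split_fields("1,{2,3,4},0.5,true")` yields four fields, and every value has to pass through a text-to-number conversion, which is the parsing cost described in the original question.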

You can find this C++20 threaded IO framework, which uses a queue, on my GitHub page. The idea is to maintain a queue that you push data into and consume from a single thread. In order to compile the files, you need g++-10 or higher.

Here is the queue implementation, with the necessary synchronisation:

#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

namespace h5::exp {
    // minimal blocking queue: producers push, the consumer blocks in pop()
    template <typename T> struct queue {
        queue(): mutex(), cv(){
        }
        // running on different thread than CTOR
        void push(const T& value) {
            std::lock_guard<std::mutex> lock(mutex);
            data.push(value);
            cv.notify_one();
        }

        T pop(void){
            std::unique_lock<std::mutex> lock(mutex);
            // loop guards against spurious wake-ups
            while(data.empty())
                cv.wait(lock);
            T value = std::move(data.front());
            data.pop();
            return value;
        }

    private:
        mutable std::mutex mutex;
        std::condition_variable cv;
        std::queue<T> data;
    };
}

And notice how straightforward the HDF5 IO gets with H5CPP:

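#include <atomic>
#include <string>
#include <thread>
#include <vector>
#include <h5cpp/all>

// number_of_threads, dataset_size, dataset_name_min and dataset_name_max
// are assumed to be constants defined elsewhere in the example
struct task_t {
    std::string name;
    std::vector<double> data;
};

int main(){
    h5::exp::queue<task_t> io;

    auto n_threads = number_of_threads;
    std::atomic<unsigned> n_jobs = n_threads;
    std::vector<std::jthread> pool(n_threads);

    h5::fd_t fd = h5::create("example.h5", H5F_ACC_TRUNC);

    // single consumer: the only thread that touches the HDF5 file
    std::jthread io_thread([&]{
        while(n_jobs--){
            auto task = io.pop();
            h5::write(fd, task.name, task.data);
        }
    });

    // producers: each one generates a dataset and hands it to the IO thread
    for(auto& current_thread: pool)
        current_thread = std::jthread([&]{
            auto payload = h5::utils::get_test_data<double>(dataset_size);
            std::string path = h5::utils::get_random_string(dataset_name_min, dataset_name_max);
            io.push(
                task_t{path, payload});
        });
    // the jthread destructors join io_thread and the pool on scope exit
}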

best wishes: steve

This looks very interesting, thank you very much for all of this help. Currently, I don’t know if I can use this as at my university we’re still stuck on the prehistoric g++ 4.8.5-44. But it looks as though HDF5 is much better than I thought it was!

AFAIK: HDF5 is second to none – yes it is fast, it is scalable; but most importantly all statistical systems use it, and it makes a difference to learn something once and use it for decades.
Sorry to hear about the policy at your institution. The queue can be implemented using mutexes, while the H5CPP routines can be replaced with HDF5 C API calls. After all, you can verify the pattern, examine its properties, and if you like it, implement it in C. The actual coding is not too hard once you know what to do.
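As a rough illustration of that suggestion, the single-writer pattern from earlier in the thread can be restricted to C++11 features (`std::thread` with explicit `join()` instead of `std::jthread`). This is a sketch under assumptions rather than a port of the H5CPP example: `task_queue` and `run_single_writer` are made-up names, the dataset names and payloads are invented, and the writer stores each task in a `std::map` as a stand-in for the HDF5 C API calls that would actually write it.

```cpp
#include <condition_variable>
#include <cstddef>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct task_t {
    std::string name;
    std::vector<double> data;
};

// Blocking queue using only C++11: a mutex plus a condition variable.
class task_queue {
public:
    void push(task_t task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            data_.push(std::move(task));
        }
        cv_.notify_one();
    }
    task_t pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        while (data_.empty())            // guards against spurious wake-ups
            cv_.wait(lock);
        task_t task = std::move(data_.front());
        data_.pop();
        return task;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<task_t> data_;
};

// Runs n_workers producer threads and one writer thread. The map is a
// stand-in: real code would issue its HDF5 writes where the map is filled.
std::map<std::string, std::vector<double> > run_single_writer(unsigned n_workers) {
    task_queue io;
    std::map<std::string, std::vector<double> > written;

    std::thread writer([&io, &written, n_workers] {
        for (unsigned i = 0; i < n_workers; ++i) {
            task_t task = io.pop();
            written[task.name] = task.data;   // only this thread "writes"
        }
    });

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < n_workers; ++i)
        pool.push_back(std::thread([&io, i] {
            task_t task;
            task.name = "dataset_" + std::to_string(i);   // invented name
            task.data.assign(4, static_cast<double>(i));  // invented payload
            io.push(task);
        }));

    for (std::size_t i = 0; i < pool.size(); ++i)
        pool[i].join();
    writer.join();
    return written;
}
```

Because only the writer thread touches the output, the producers never need to coordinate around the file itself; the queue's mutex is the only synchronisation point.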

best wishes: steve