I am working on the following problem:
A code produces about 20000 datasets, which should be placed pairwise into groups, resulting in 10000 groups with 2 datasets each. Each pair of datasets is calculated by a single compute node, so only that node needs to write to those two datasets, without any data from other nodes. The size of one dataset is about 500 kByte.
My approach for doing so is the following (a stripped-down sketch follows the list):
1) I open the file in parallel mode.
2) All nodes loop over all groups and datasets, creating and closing them collectively.
3) Then I loop over the calculation of the datasets as for (int i = rank; i < 10000; i += size) { /* calculate the data of dataset i */ }.
4) One node writes exclusively into a given dataset, using a transfer property list set to INDEPENDENT.
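Stripped down to the I/O calls, it looks roughly like this (the file/dataset names, sizes and the dummy fill loop are placeholders, not my real code):

```c
/* Rough sketch of my current approach; NPAIRS, DSET_LEN, the names and the
 * dummy fill loop are placeholders, not the real code. */
#include <stdio.h>
#include <mpi.h>
#include <hdf5.h>

#define NPAIRS   10000
#define DSET_LEN 62500                       /* ~500 kByte of doubles */

static double buf[2][DSET_LEN];

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* 1) open the file in parallel mode */
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("pairs.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* 2) all ranks create all groups and datasets collectively */
    hsize_t dims[1] = { DSET_LEN };
    hid_t space = H5Screate_simple(1, dims, NULL);
    for (int i = 0; i < NPAIRS; i++) {
        char name[64];
        snprintf(name, sizeof(name), "/group_%05d", i);
        hid_t grp = H5Gcreate2(file, name, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        for (int j = 0; j < 2; j++) {
            snprintf(name, sizeof(name), "dset_%d", j);
            hid_t dset = H5Dcreate2(grp, name, H5T_NATIVE_DOUBLE, space,
                                    H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
            H5Dclose(dset);
        }
        H5Gclose(grp);
    }

    /* 3) + 4) round-robin over the pairs, one rank per pair, INDEPENDENT writes */
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);
    for (int i = rank; i < NPAIRS; i += size) {
        for (int j = 0; j < 2; j++)          /* dummy stand-in for the calculation */
            for (int k = 0; k < DSET_LEN; k++)
                buf[j][k] = (double)i + 0.5 * j;

        for (int j = 0; j < 2; j++) {
            char name[64];
            snprintf(name, sizeof(name), "/group_%05d/dset_%d", i, j);
            hid_t dset = H5Dopen2(file, name, H5P_DEFAULT);
            H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, dxpl, buf[j]);
            H5Dclose(dset);
        }
    }

    H5Pclose(dxpl);
    H5Sclose(space);
    H5Pclose(fapl);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```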
So far the idea. I am profiling the code with MPE, and it works fine for a small number of nodes, but with more nodes it gets much worse, up to the point where doing the calculation on a single node with serial writing, while the remaining nodes sit idle, would be no slower. I am stuck now on how to get good performance that scales nicely with the number of cores.
Any help or tips are appreciated.
500 kByte for 20000 datasets: you're moving ~10 GB of data.
Well, you're pumping 10 GB of data through one node; that's not going to scale.
I guess you could decompose your parallel writes over the datasets, but I'm not sure how HDF5 updates the free-blocks list in that case.
Could you, instead of 20000 datasets, produce one dataset with an additional dimension called, oh, "data id" maybe, of size 20000?
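Roughly like the sketch below; it's just to show the layout I mean, with made-up names and sizes. I've left the transfer mode INDEPENDENT here; collective would need every rank to make the same number of H5Dwrite calls.

```c
/* One 2-D dataset {20000, DSET_LEN}; each rank writes whole rows via
 * hyperslab selections.  Names, sizes and the fill loop are made up. */
#include <mpi.h>
#include <hdf5.h>

#define NDSETS   20000
#define DSET_LEN 62500                       /* ~500 kByte of doubles per "data id" */

static double row[DSET_LEN];

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, MPI_COMM_WORLD, MPI_INFO_NULL);
    hid_t file = H5Fcreate("pairs.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);

    /* one dataset, first dimension is the "data id" */
    hsize_t dims[2] = { NDSETS, DSET_LEN };
    hid_t filespace = H5Screate_simple(2, dims, NULL);
    hid_t dset = H5Dcreate2(file, "/data", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_INDEPENDENT);

    hsize_t count[2] = { 1, DSET_LEN };
    hid_t memspace = H5Screate_simple(2, count, NULL);

    for (int i = rank; i < NDSETS; i += size) {
        for (int k = 0; k < DSET_LEN; k++)   /* dummy stand-in for the calculation */
            row[k] = (double)i;

        /* select row i of the file dataset and write it from this rank */
        hsize_t start[2] = { (hsize_t)i, 0 };
        H5Sselect_hyperslab(filespace, H5S_SELECT_SET, start, NULL, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, row);
    }

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Pclose(fapl);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

That way you have one large dataset instead of tens of thousands of small objects, so the per-dataset metadata (object headers, group entries) shrinks to essentially one header.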
Some HDF5 users like "poor man's parallel I/O". I think it's a horrible architecture but one must be pragmatic: it's a good solution to defective file systems. Perhaps you have one of those file systems?
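For reference, "poor man's parallel I/O" just means every rank writes its own file with the plain serial HDF5 API, no MPI-IO involved; something like this sketch, with made-up file names:

```c
/* "Poor man's parallel I/O": one file per rank, serial HDF5, no MPI-IO.
 * File and object names here are made up. */
#include <stdio.h>
#include <mpi.h>
#include <hdf5.h>

#define NPAIRS   10000
#define DSET_LEN 62500

static double buf[DSET_LEN];

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* every rank opens its own, independent file */
    char fname[64];
    snprintf(fname, sizeof(fname), "pairs_rank%05d.h5", rank);
    hid_t file = H5Fcreate(fname, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    hsize_t dims[1] = { DSET_LEN };
    hid_t space = H5Screate_simple(1, dims, NULL);

    for (int i = rank; i < NPAIRS; i += size) {
        for (int k = 0; k < DSET_LEN; k++)   /* dummy stand-in for the calculation */
            buf[k] = (double)i;

        char name[64];
        snprintf(name, sizeof(name), "/group_%05d", i);
        hid_t grp = H5Gcreate2(file, name, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        hid_t dset = H5Dcreate2(grp, "dset_0", H5T_NATIVE_DOUBLE, space,
                                H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
        H5Dclose(dset);
        H5Gclose(grp);
    }

    H5Sclose(space);
    H5Fclose(file);
    MPI_Finalize();
    return 0;
}
```

The obvious downside is that you end up with as many files as ranks and have to merge or index them afterwards.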