Write many datasets at the same time using OpenMP

Hi,

I am a beginner with HDF5, and for my purpose I need to create many 2D datasets. To do it faster, I would like to use OpenMP so that each core of my computer writes its own dataset; that way several datasets could be written at the same time. Is there a way to do that? Here is an example of how I tried to do it, but I got errors:

wall_clock timer; // I use Armadillo math library
timer.tic();
#pragma omp parallel for private(DATA, trc_dataset_id)
for(int i = 0; i < 10000; i++){
    DATA.fill(i); // sets all values of DATA Armadillo matrix equal to i
    QString datasetName = QString::number(i, 10);
    trc_dataset_id = H5Dcreate(file_id, datasetName.toUtf8().constData(), H5T_NATIVE_INT, trc_dataspace_id, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
    H5Dwrite(trc_dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, DATA.mem);
    H5Dclose(trc_dataset_id);
}
H5Sclose(trc_dataspace_id);
H5Fclose(file_id);
double t = timer.toc();
std::cout << t << std::endl;

Here are the errors I get (the output of the two threads is interleaved; untangled, it is the same message printed twice):

HDF5-DIAG: Error detected in HDF5 (1.10.6) thread 0:
  #000: F:\Qt\Downloaded\CMake-hdf5-1.10.6\CMake-hdf5-1.10.6\hdf5-1.10.6\src\H5D.c line 151 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object

I use Windows 10 x64, Qt 5.14.0, the MSVC x64 compiler, and my computer has 2 cores.

The use case you are describing is currently not supported, in the following sense. For any kind of multithreading you need a thread-safe build of the library. If unsure, you can determine that by calling H5is_library_threadsafe (https://portal.hdfgroup.org/display/HDF5/H5_IS_LIBRARY_THREADSAFE). With a thread-safe build, your application should run fine, but you may not see any throughput gains, or it may even slow down. The main reason is that the current implementation of the HDF5 library does not use any multithreading internally. At any given time, only one of your threads will be able "to be in the library" and do things such as create new objects, do I/O, or close handles.
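
For example, a minimal check (just a sketch; it prints whether the linked library was built with thread safety):

#include <hdf5.h>
#include <cstdio>

int main()
{
    hbool_t is_threadsafe = 0;
    if (H5is_library_threadsafe(&is_threadsafe) >= 0)   // H5_HAVE_THREADSAFE build?
        std::printf("thread-safe HDF5 build: %s\n", is_threadsafe ? "yes" : "no");
    return 0;
}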

How big are your Armadillo objects (DATA)? There's conceptually nothing wrong with your example. However, you might be faster by pipelining/blocking the creation of your datasets, assuming throughput is the metric.

G.


You are not only creating datasets; you are doing a create and a write. Get a profiler and profile the code; you could be writing 1 TB through that arma::mat. There is a difference between writing large chunks, a single chunk, or a fragment of a chunk to HDF5, not to mention stressing the metadata structure.

In any event you may be interested in this Windows port of H5CPP, which natively supports Armadillo C++. Profiled and well behaved. To make this happen, break up your code into two parts (a minimal sketch follows the list):

  • create an IO thread with a queue; all your HDF5 IO should go through this layer, and use thread primitives to protect the queue.
  • the rest of your threads do GUI/Qt screen work, compute work, and schedule IO onto that dedicated queue.
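
A minimal sketch of that layout (the job type, the names, and the write_one_dataset() placeholder are illustrative, not part of any library):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct IoJob {                        // one unit of HDF5 work
    std::string dataset_name;
    std::vector<int> data;
};

class IoQueue {                       // mutex/condition-variable protected queue
    std::queue<IoJob> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(IoJob job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
    bool pop(IoJob& out) {            // blocks until a job arrives or the queue is closed
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&]{ return closed_ || !jobs_.empty(); });
        if (jobs_.empty()) return false;
        out = std::move(jobs_.front());
        jobs_.pop();
        return true;
    }
};

void io_worker(IoQueue& q) {          // the only thread that ever touches HDF5
    IoJob job;
    while (q.pop(job)) {
        // write_one_dataset(job);    // placeholder for H5Dcreate/H5Dwrite/H5Dclose
    }
}

Compute/GUI threads only ever call push(); since io_worker is the single thread that enters the HDF5 library, this layout works even without a thread-safe build.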

You could also just use h5::append, which has an internal buffer, and protect that with thread primitives. Before you do any of this, profile the code and establish an IO baseline by simulating similar activity with the available system IO. Compare the results, or even better: share them with us.

best: steve


Thank you for the explanation.

I just checked: I built HDF5 without thread safety. So you think there should not be any performance gain if I recompile the library with the thread-safety option?

The Armadillo objects (matrices) can be of different sizes. Sometimes a matrix is about 1000 rows by 100 columns of floats, but it may also be about 3000 rows by 10000 columns. The number of such datasets also varies, from a few hundred to tens of thousands, and the resulting HDF5 file may be anywhere from a few hundred megabytes to a few hundred gigabytes.

Steven,
Thank you for the information. I think you are saying something meaningful, but I need some time to understand it. I'm quite new to HDF5 and C/C++ programming.
I saw the Armadillo functions that do HDF5 I/O, but I'm also worried that Armadillo uses column-major ordering while HDF5 uses row-major I/O operations. Performance matters most to me, so I need to think about whether H5CPP will be useful for me.

Steven,
I've just built H5CPP with MPI enabled but have not tested it. It was built without errors, so I think it should work.
Also, I had used the profiling in Qt, but I googled some information and tried the Intel VTune Profiler. I think I understood the idea.

Could you please explain a little bit more? As I understood it, you are suggesting using one thread to read/write data and all the other threads to process data? If so, I'm going to do that when I begin work on the data-processing step. But for now I only need to read data from the SEGY format into HDF5 quickly, and no processing is needed.
Right now I use an SSD, so I/O operations are quite fast, but I would like to use all available resources to achieve the highest performance.

kerim

Hey Kerim,

Measure the total raw throughput for your SSD drive. If the sum of the IO reading the SEGY and writing the HDF5 exceeds this on a single SSD, be suspicious :slight_smile:

Basically I am trying to say that, according to my measurements, you saturate the SSD before the IO queues of the OS. These slides provide you with some numbers; be certain to activate the custom pipeline with the h5::high_throughput property, like this: h5::open(fd, "some_dataset", h5::high_throughput);
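
Roughly like this (a sketch around that call; the file and dataset names are placeholders, and the dataset is assumed to already exist with a matching extent):

#include <armadillo>
#include <h5cpp/all>

int main()
{
    h5::fd_t fd = h5::open("results.h5", H5F_ACC_RDWR);               // existing file
    h5::ds_t ds = h5::open(fd, "some_dataset", h5::high_throughput);  // custom pipeline
    arma::mat M(1000, 100, arma::fill::randu);                        // payload
    h5::write(ds, M);                                                 // single-shot write
}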

The above only works with serial HDF5. Parallel HDF5 is a different cup of tea and is mostly used on supercomputers and HPC clusters with a proper parallel filesystem. Is that what you intend to do?

The dedicated IO process/thread is there to pool IO requests in user space and then delegate them to the kernel when it is optimal. This way one can reach significantly higher IOPS. Put differently: if you have many threads but a single IO queue to schedule onto, and you saturate that IO queue, then your threads will line up in busy wait. Does this make sense?

Before engaging with threads/processes one must ask whether the task can be separated into independent sub-tasks: is it embarrassingly parallel? If not, then you have to deal with Amdahl's law.

best: steve


Hi Steve,

Thank you for the presentation. Am I right that on slide 21 you create 2 GB of data in RAM and write it with a single HDF5 write call, and that on slide 22 you create many 30 MB datasets in RAM and write them in a loop? If so, how can you outperform the I/O limit of the SSD (500 MB/sec)? :thinking:

I think that in my case the threads could run independently; that is why I have chosen to create many datasets, each of a different size (the number of rows of the Armadillo matrix is the same, but the number of columns varies from one dataset to another). So I thought that I could use OpenMP to create and write the datasets independently, and also compress each dataset independently. As for MPI, I would like to use it as well, but later. I come from Matlab, and there is a lot to learn about C/C++ :blush:

I also considered creating one big dataset (it may weigh 300 GB) and writing the data into it with hyperslabs, but the written blocks have different numbers of columns, which is inconvenient if I want to use chunking and compression. As far as I know, a single dataset can only have a constant chunk size? I cannot vary the chunk size within the dataset?
Actually, compression is not very effective for my data: after compression it only loses 10-15 percent of its size.
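
Roughly what I mean by the big-dataset variant, just as a sketch (the sizes and names are placeholders; as far as I understand, the chunk shape is chosen once at creation and cannot vary inside the dataset):

#include <hdf5.h>
#include <vector>

int main()
{
    const hsize_t nRow = 1000;
    hsize_t dims[2]    = {nRow, 0};               // start with zero columns
    hsize_t maxdims[2] = {nRow, H5S_UNLIMITED};   // grow along the column axis
    hsize_t chunk[2]   = {nRow, 64};              // fixed chunk shape for the dataset

    hid_t file  = H5Fcreate("big.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, maxdims);
    hid_t dcpl  = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    hid_t dset  = H5Dcreate(file, "traces", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    // append one block whose column count differs from block to block
    hsize_t nColBlock  = 37;
    hsize_t newdims[2] = {nRow, nColBlock};
    H5Dset_extent(dset, newdims);

    hid_t fspace = H5Dget_space(dset);
    hsize_t start[2] = {0, 0}, count[2] = {nRow, nColBlock};
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, nullptr, count, nullptr);
    hid_t mspace = H5Screate_simple(2, count, nullptr);

    std::vector<float> block(nRow * nColBlock, 0.0f);   // dummy payload
    H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, block.data());

    H5Sclose(mspace); H5Sclose(fspace); H5Pclose(dcpl);
    H5Dclose(dset);   H5Sclose(space);  H5Fclose(file);
    return 0;
}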

If we drop the compression and keep in mind the parallel I/O performance of the HDF5 file, would you recommend a single big dataset or many smaller datasets? The mathematical processing is done individually for each matrix, which I would retrieve either from a smaller dataset or from a hyperslab of the big dataset.

I made an experiment with I/O performance:

#include <iostream>
#include <armadillo>
#include <QString>
#include <hdf5.h>

using namespace arma;

int main()
{
    Mat<qint32> DATA;
    qint32 k = 0;
    hid_t file_id, trc_dataspace_id, trc_dataset_id;
    hsize_t nRow = 601, nCol = 110;
    hsize_t trc_dims[2] = {nCol, nRow}; // dims swapped: Armadillo is column-major, HDF5 is row-major

    // fill the matrix with running integers
    DATA.set_size(nRow, nCol);
    for (hsize_t i = 0; i < nRow; i++){
        for (hsize_t j = 0; j < nCol; j++){
            DATA(i,j) = k;
            k++;
        }
    }

    file_id = H5Fcreate("raw_le.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    trc_dataspace_id = H5Screate_simple(2, trc_dims, nullptr);

    wall_clock timer;
    timer.tic();
    for(int i = 0; i < 400; i++){ // 400 iterations give a file of about 100 MB, 4000 give about 1 GB
        QString datasetName = QString::number(i, 10); // 10 is the decimal base
        trc_dataset_id = H5Dcreate(file_id, datasetName.toUtf8().constData(), H5T_NATIVE_INT, trc_dataspace_id, H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
        H5Dwrite(trc_dataset_id, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, DATA.mem);
        H5Dclose(trc_dataset_id);
    }
    H5Sclose(trc_dataspace_id);
    H5Fclose(file_id);
    double t = timer.toc();
    std::cout << t << std::endl;

    return 0;
}

For 400 iterations (100 MB) it takes 0.15 seconds.
For 4000 iterations (1 GB) it varies a lot: from 8 to 16 seconds (I don't know why).
According to its specification, my SSD has an I/O speed of 500 MB/sec.

I'm going to try the same experiment with hyperslabs and with H5CPP.

kerim

Kerim,

the experiments are statistical: a median over some larger population, 100x2GB or 1000x30MB, to saturate the underlying caches. Naturally this comes with a distribution, and a single mean/median doesn't reflect any individual run, although statistically speaking, with a large enough population the residual error should go down.

It is customary to post a minimal working example stripped of additional libraries. Can you remove Qt and send a pull request adding it to this GitHub example?

In the posted example, which I upload here: main.cpp (1.5 KB),
I created for you 4000 Armadillo arma::Mat<int32_t> matrices, filled them with a normal distribution, then efficiently created the datasets, and finally did the IO. The results on a Lenovo X250 i5 SSD are:

CREATING 4000 h5::ds_t                    cost: 0.215566 s or 18555.8 datasets/sec
GENERATING 4000 random matrices           cost: 30.9288 s  or 129.329 matrices/sec
WRITING 4000 random matrices to datasets  cost: 1.46594 s  or 2728.62 matrices/sec
THROUGHPUT: 687.613 MB/s

best: steve


Steve,

I uploaded my C/C++ experiment here. Please check whether I've done it correctly.

I just tried to run your example from main.cpp, but since I use MSVC 2017, which doesn't support C++17, I cannot run it :frowning_face:
I need to find out whether libraries compiled with MSVC 2017 x64 would work with MSVC 2019 x64.
I'm going to look into this tomorrow.

kerim

Hi Kerim,

I made some changes to your uploaded C API based solution, making the two programs nearly identical except for the IO part, and I added a harness which, in the following case, takes 50 samples of the h5cpp-io version with N=400 datasets and then computes the mean throughput:

./io-test ./h5cpp-io 50 400

For the C API I got 1432 MB/sec, and for h5cpp-io 1361 MB/sec. The C API does a pretty good job here; its slightly better performance can be explained by the extra dataspace descriptors inside the H5CPP code.

NOTE: each dataset is indeed random, each test is a complete run on the target architecture, and the distribution is assumed to be normal, hence the mean.
best:
steve


Hi Steve,

Thanks for the tests! I don't understand why we can write data faster than our SSD allows (500 MB/s)?
I don't know how the HDF5 library works internally, but as far as I know an HDF5 file has a self-describing structure. I think an HDF5 file that consists of N 2D datasets has a description that I could investigate, and then I could write my own code for writing N 2D datasets using memory mapping (which is really fast) and threads. Do you think I would achieve a meaningful performance increase that way? I know the SSD I/O speed puts some limit on I/O performance, but somehow we have already got numbers that exceed it.
By the way, most of my ideas like this were wrong, so if I've said something nonsensical, feel free to point it out :blush:

By the way,
I've just installed the MSVC 2019 compiler, which supports C++17, but I get two new errors when running your h5cpp code (from main.cpp a few messages above):

  1. C:\apps\MSVC_apps_release\H5CPP_04_02_2020_MPI\include\h5cpp\H5Zall.hpp:52: error: C2664: 'int compress2(Bytef *,uLongf *,const Bytef *,uLong,int)': cannot convert argument 2 from 'size_t *' to 'uLongf *'
  2. C:\apps\MSVC_apps_release\H5CPP_04_02_2020_MPI\include\h5cpp\H5Pall.hpp:171: error: C2440: 'initializing': cannot convert from 'initializer list' to 'h5::impl::prop_t<h5::impl::detail::hid_t<h5::impl::fcpl_t,herr_t H5Pclose(hid_t),true,true,1>,hid_t h5::impl::default_fcpl(void),h5::impl::capi_t<hid_t,H5F_fspace_strategy_t,hbool_t,hsize_t>,herr_t H5Pset_file_space_strategy(hid_t,H5F_fspace_strategy_t,hbool_t,hsize_t)>'

If necessary I could post it somewhere on github.

kerim

Hi Kerim,

I am the one porting H5CPP to the Windows platform. If you post your H5CPP-based code somewhere, I will take a look.

FYI: VS 2017 does support C++17, and H5CPP has compiled fine for me after adapting it.

Thanks,
Chris Drozdowski


Hi Chris,

Actually it is code written by Steve: main_steve.cpp (1.8 KB)
I'm sorry, I just needed to add c++1z to the configuration of my Qt project to enable C++17 (I didn't know that).
Besides the errors I posted above, every 5 seconds I get this error:

  • 2020-02-05T22:53:53 Clang code model: Error: The clangbackend program terminated unexpectedly and was restarted.

Here is the error in a picture:

By the way, am I right that h5cpp doesn't contain any libraries? It compiles without errors, but the lib folder looks like this:

kerim

Hi,

First, are you targeting Windows, or Linux from Windows (which is a thing these days)? The rest of this post is for targeting Windows only.

  1. h5cpp cannot compile with Clang on Windows. A known limitation of h5cpp is that you cannot use Clang > 6.x (See: https://github.com/steven-varga/h5cpp/issues/45). Meanwhile, Clang for Windows uses the MS version of the STL which requires Clang >= 8.0. So no use of Clang unless I am missing something.

  2. Windows uses its own version of MPI, not the same MPI that the Linux world uses. I know nothing about the MPI on Windows except that h5cpp currently cannot use the MS implementation of MPI.

  3. You may not be using the proper Windows port of h5cpp (hence the errors). You can go to the following repo to get both the Windows h5cpp port and the port of Steve's examples, together and ready to compile "out of the box" with VS 2017 or 2019. Please read the README.md file.

For ease, you can install the prebuilt HDF5 1.10.6 binaries (hdf5-1.10.6-Std-win10_64-vs15).

The examples are a good place both to make sure you can compile and to learn more about using h5cpp in general.

Cheers,
Chris Drozdowski


Chris,

I read the README.
I just downloaded h5cpp-windows-master and tried to run the attribute example and the compound example… neither of them works :frowning_face:
The error I get:

  • C:\apps\h5cpp_windows\h5cpp\H5Dwrite.hpp:60: error: C2666: 'h5::write': 2 overloads have similar conversions

kerim

Please open an issue in the GitHub repo.

Thanks,
Chris


Hi community,

For anyone targeting Windows who wishes to try Steve's throughput examples, I have added them to the h5cpp-windows repo (https://github.com/ChrisDrozdowski/h5cpp-windows). They are the following example projects: 4k_datasets_throughput, capi_io, and h5cpp_io.

I myself have no issues building the examples solution using VS 2017 (another user uses VS 2019 without issue), so if anyone else has issues, please add an Issue to the repo.

Thanks,
Chris Drozdowski


I think I finally understood what you meant in your post.
So I read some information about OpenMP and found a way to load all my cores while reading the SEGY data and writing it to HDF5.
I use #pragma omp parallel for to read each part of the SEGY file in parallel in a loop, and to save each part as an HDF5 dataset I use #pragma omp critical placed before the HDF5 output commands, which effectively queues the threads. That gave me a performance increase, since less time is now wasted reading and preparing the SEGY.
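
In outline it looks like this (just a sketch; read_segy_block() is a dummy stand-in for my real SEGY reading, and the sizes are placeholders):

#include <hdf5.h>
#include <string>
#include <vector>

// placeholder for the real SEGY parsing done by each thread
static std::vector<int> read_segy_block(int i, hsize_t n) {
    return std::vector<int>(n, i);
}

int main()
{
    const int nBlocks = 100;
    hsize_t dims[2] = {110, 601};
    hid_t file_id  = H5Fcreate("segy_out.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space_id = H5Screate_simple(2, dims, nullptr);

    #pragma omp parallel for
    for (int i = 0; i < nBlocks; i++) {
        std::vector<int> block = read_segy_block(i, dims[0] * dims[1]); // parallel part
        std::string name = std::to_string(i);
        #pragma omp critical   // only one thread at a time inside the HDF5 library
        {
            hid_t dset = H5Dcreate(file_id, name.c_str(), H5T_NATIVE_INT, space_id,
                                   H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);
            H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, block.data());
            H5Dclose(dset);
        }
    }
    H5Sclose(space_id);
    H5Fclose(file_id);
    return 0;
}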
Thank you, Steve and Chris!

kerim
