Hi Mohamad,
Thanks for your reply. The reason I suspected Lustre of being the
culprit is simply that the error does not appear on my personal
computer. My guess was that the files are written/opened too fast, or
too many at the same time, for Lustre's synchronization to keep up.
I am inserting various pieces of code that show how I am calling the
HDF5 library. Any comment on proper ways of doing so is much appreciated!
To open the file, I use the following code:
==================================================================
int H5Interface::OpenFile (std::string filename, int flag) {
    bool tried_once = false;
    // Sleep 200 ms between retries.
    struct timespec timesp;
    timesp.tv_sec = 0;
    timesp.tv_nsec = 200000000;
    for (int tries = 0; tries < 300; tries++) {
        try {
            H5::Exception::dontPrint();
            if (flag == 0) {
                file = H5::H5File(filename, H5F_ACC_TRUNC);
            } else if (flag == 1) {
                file.openFile(filename, H5F_ACC_RDONLY);
            } else if (flag == 2) {
                file.openFile(filename, H5F_ACC_RDWR);
            }
            if (tried_once) {
                std::cout << "Opening " << filename << " succeeded after "
                          << tries << " tries" << std::endl;
            }
            return 0;
        } catch (H5::FileIException error) {
            tried_once = true;
        } catch (H5::DataSetIException error) {
            tried_once = true;
        } catch (H5::DataSpaceIException error) {
            tried_once = true;
        }
        nanosleep(&timesp, NULL);
    }
    std::cerr << "H5Interface:\tOpening " << filename << " failed";
    return -1;
}
It often happens that opening a file succeeds only after one or two retries.
I write and read strings like this:
==================================================================
int H5Interface::WriteString(std::string path, std::string value) {
    try {
        H5::Exception::dontPrint();
        H5::StrType str_t(H5::PredType::C_S1, H5T_VARIABLE);
        H5std_string str(value);
        hsize_t dims[1] = { 1 };
        H5::DataSpace str_space(1, dims, NULL);
        H5::DataSet str_set;
        // Open the dataset if the link already exists, otherwise create it.
        if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
            str_set = file.openDataSet(path);
        } else {
            str_set = file.createDataSet(path, str_t, str_space);
        }
        str_set.write(str, str_t);
        str_set.close();
    }
    catch (H5::FileIException error) {
        // error.printError();
        return -1;
    }
    catch (H5::DataSetIException error) {
        // error.printError();
        return -1;
    }
    catch (H5::DataSpaceIException error) {
        // error.printError();
        return -1;
    }
    return 0;
}
==================================================================
int H5Interface::ReadString(std::string path, std::string * data) {
    try {
        H5::Exception::dontPrint();
        if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
            H5::StrType str_t(H5::PredType::C_S1, H5T_VARIABLE);
            H5std_string str;
            H5::DataSet str_set = file.openDataSet(path);
            str_set.read(str, str_t);
            str_set.close();
            *data = std::string(str);
        }
    }
    catch (H5::FileIException error) {
        // error.printError();
        return -1;
    }
    catch (H5::DataSetIException error) {
        // error.printError();
        return -1;
    }
    catch (H5::DataSpaceIException error) {
        // error.printError();
        return -1;
    }
    return 0;
}
And finally for writing and reading boost::multi_arrays, for example:
==================================================================
int H5Interface::Read2IntMultiArray(std::string path,
                                    boost::multi_array<int,2>& data) {
    try {
        H5::DataSet v_set = file.openDataSet(path);
        H5::DataSpace space = v_set.getSpace();
        hsize_t dims[2];
        int rank = space.getSimpleExtentDims(dims);
        H5::DataSpace mspace(rank, dims);
        // Read into a temporary buffer, then copy into the multi_array.
        std::vector<int> data_out(dims[0] * dims[1]);
        data.resize(boost::extents[dims[0]][dims[1]]);
        v_set.read(&data_out[0], H5::PredType::NATIVE_INT, mspace, space);
        for (int i = 0; i < int(dims[0]); i++) {
            for (int j = 0; j < int(dims[1]); j++) {
                data[i][j] = data_out[i * dims[1] + j];
            }
        }
        v_set.close();
    }
[...]
==================================================================
int H5Interface::WriteIntMatrix(std::string path, uint rows,
                                uint cols, int * data) {
    try {
        H5::Exception::dontPrint();
        hsize_t dims_m[2] = { rows, cols };
        H5::DataSpace v_space(2, dims_m);
        H5::DataSet v_set;
        if (H5Lexists(file.getId(), path.c_str(), H5P_DEFAULT)) {
            v_set = file.openDataSet(path);
        } else {
            v_set = file.createDataSet(path, H5::PredType::NATIVE_INT,
                                       v_space);
        }
        v_set.write(data, H5::PredType::NATIVE_INT);
        v_set.close();
    }
[...]
As far as the workflow goes, a scheduler provides the basic HDF5 file with
all the parameters and tells the workers to load this file and then put
their measurements in, so the workers are enlarging the file as time goes by.
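To make the pattern concrete, here is a minimal sketch of what a single
worker ends up doing with the file the scheduler prepared (the filename,
dataset path and data below are placeholders, not the real names from my code):
==================================================================
// Minimal sketch of the worker-side pattern described above: open the
// scheduler's file in read-write mode and add one measurement dataset.
// Placeholder names; no retry logic is shown here.
#include <string>
#include <vector>
#include "H5Cpp.h"

void AppendMeasurement(const std::string& filename, const std::string& path,
                       const std::vector<int>& values) {
    H5::H5File file(filename, H5F_ACC_RDWR);   // file was created by the scheduler
    hsize_t dims[1] = { values.size() };
    H5::DataSpace space(1, dims);
    H5::DataSet set = file.createDataSet(path, H5::PredType::NATIVE_INT, space);
    set.write(&values[0], H5::PredType::NATIVE_INT);   // assumes values is non-empty
    set.close();
    file.close();                              // closing flushes the buffered data
}
==================================================================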
Have a nice day, Peter
On 11/19/2012 03:36 PM, Mohamad Chaarawi wrote:
Hi Peter,
The problem does sound strange.
I do not understand why file locking helped reduce errors. I thought
you said each process writes to its own file anyway, so locking the
file or having one process manage the reads/writes should not matter.
Is it possible you could send me a piece of code from your simulation
that performs the I/O, so I can look at it and diagnose further?
A program that I can run and that replicates the problem (on Lustre)
would be great. If that is not possible, then please just describe or
copy-paste how you are calling into the HDF5 library for your I/O.
Thanks,
Mohamad
On 11/18/2012 10:24 AM, Peter Boertz wrote:
Hello everyone,
I run simulations on a cluster (using OpenMPI) with a Lustre filesystem
and I use HDF5 1.8.9 for data output. Each process has its own file, so
I believe there is no need for the parallel HDF5 version; is this
correct?
When a larger number (> 4) of processes want to dump their data at the
same time, I get various errors about paths and objects not being found,
or some other operation failing. I can't really make out the reason for
it, as the code works fine on my personal workstation and runs for days
with writes/reads every 5 minutes without failing.
What I have tried so far is having one process manage all the read/write
operations, so that all other processes have to check whether anyone else
is already dumping their data. I also implemented
boost::interprocess::file_lock to prevent writing to the same file, which
is already excluded by the queuing system anyway, so this was more of a
paranoid move to be absolutely sure. All of that reduced the number of
fatal errors significantly, but did not completely get rid of them. The
biggest problem is that some of the files get corrupted when the program
crashes, which is especially inconvenient.
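For reference, the locking I added looks roughly like this (the lock-file
name is a placeholder and the actual HDF5 calls are omitted):
==================================================================
// Rough sketch of the boost::interprocess::file_lock guard mentioned above.
#include <boost/interprocess/sync/file_lock.hpp>
#include <boost/interprocess/sync/scoped_lock.hpp>
#include <fstream>

void GuardedDump() {
    // The lock file must exist before file_lock can attach to it.
    { std::ofstream touch("output.h5.lock", std::ios::app); }
    boost::interprocess::file_lock lock("output.h5.lock");
    boost::interprocess::scoped_lock<boost::interprocess::file_lock> guard(lock);
    // ... open the HDF5 file and dump the data while the lock is held ...
}
==================================================================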
My question is whether there is any obvious mistake I am making, and how
I would go about solving this issue. My initial guess is that the Lustre
filesystem plays some role in this, since it is the only difference from
my personal computer, where everything runs smoothly. As I said, neither
the error messages nor the traceback show any consistency.
bye, Peter