Using HDF5 1.8.5 and 1.8.6-pre2; OpenMPI 1.4.3 on Linux (RHEL4 and RHEL5).
This is a case where the HDF5 operations themselves aren't using MPI, but each MPI job/process builds its own exclusive .h5 file:
The create:
currentFileID = H5Fcreate(filePath.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
and many file operations using the high-level (HL) methods, including packet tables, tables, datasets, etc., all perform successfully.
Then, near the end of each individual process,
H5Fclose(currentFileID);
is called but never returns. A check for open objects reports that only one file object is open and no other objects (groups, datasets, etc.). No other software or process is acting on this .h5 file; it is named exclusively for the one job it is associated with.
This isn't an attempt at parallel HDF5 under MPI. In another scenario, parallel HDF5 works just fine the collective way. This issue is for people who don't have (or don't want) a parallel file system, so I made a coarse-grained MPI setup that runs independent jobs for these folks. Each job opens its own .h5 file with H5Fcreate(filePath.c_str(), H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
Where should I look?
I'll try to make a small example test case for show and tell.
Debugging with MemoryScape:
Reveals a segfault in H5SL.c (1.8.5) at line 1068:
...
H5SL_REMOVE(SCALAR, slist, x, const haddr_t, key, -) // H5SL_TYPE_HADDR case
...
Some of the variables hold bad addresses, such as x, which was set by "x = slist->header;" (the header of a skip list).
These appear to be internal API functions, and I'm wondering how I could be offending them from high-level API calls and file interfaces. What could be in the H5C cache when
H5Fget_obj_count(fileID, H5F_OBJ_ALL) == 1
and H5Fget_obj_count(fileID, H5F_OBJ_DATASET | H5F_OBJ_GROUP | H5F_OBJ_DATATYPE | H5F_OBJ_ATTR) == 0
for the file the code is trying to close?
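For reference, the per-process pattern and the pre-close diagnostic described above amount to roughly the following sketch (serial HDF5 build assumed; the file name "job_rank0.h5" is illustrative, not from the actual project):

```cpp
#include <hdf5.h>
#include <cassert>

int main() {
    // Each MPI rank writes its own exclusive file; no MPI-IO / parallel HDF5.
    hid_t file = H5Fcreate("job_rank0.h5", H5F_ACC_TRUNC,
                           H5P_DEFAULT, H5P_DEFAULT);
    assert(file >= 0);

    // ... high-level writes here: H5PT packet tables, H5TB tables, datasets ...

    // Before closing, confirm the file handle itself is the only open object:
    ssize_t all   = H5Fget_obj_count(file, H5F_OBJ_ALL);
    ssize_t other = H5Fget_obj_count(file, H5F_OBJ_DATASET | H5F_OBJ_GROUP |
                                           H5F_OBJ_DATATYPE | H5F_OBJ_ATTR);
    assert(all == 1 && other == 0);   // the state reported in the failing runs

    herr_t status = H5Fclose(file);   // the call that never returns in the bad runs
    assert(status >= 0);
    return 0;
}
```

With only the file handle open and nothing else, H5Fclose should flush and return promptly; a hang or segfault here points at corrupted internal state rather than leaked handles.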
···
On 12/03/2010 11:33 AM, Roger Martin wrote:
Yes, you are correct, that shouldn't happen. :-/ Do you have a simple C program you can send to show this failure?
Quincey
···
On Dec 7, 2010, at 2:06 PM, Roger Martin wrote:
I'll be pulling pieces out of the large C++ project into a small test C program to see if the segfault can be duplicated in a wieldy example; if so, I'll send it to you.
MemoryScape and gdb (NetBeans) don't show any memory issues in our library code or in HDF5. MemoryScape doesn't expand through the H5SL_REMOVE macro, so in another working copy I'm trying to treat a copy of it as a function.
···
On 12/07/2010 05:20 PM, Quincey Koziol wrote:
Yes, you are correct, that shouldn't happen. :-/ Do you have a simple C program you can send to show this failure?
The problem cannot be reproduced in a C example program, because the cause was upstream use of vectors with a -= operation. The failure only showed up in this MPI application, even though the same library and code are used in a single-process application.
...............
scores[qbChemicalShiftTask->getNMRAtomIndices()[index]] -= qbChemicalShiftTask->getNMRTrace()[index];
.............
where scores etc. are std::vector<double> and std::vector<int> typedefs. In certain runs the indices were incorrect, so this code was badly constructed: it needed to be rewritten to keep the indices aligned with the earlier construction of the scores vector, and with better checking.
These vectors were never passed to any HDF5 interface; HDF5 looks clean and stable in the file closing. The problem was entirely in non-HDF5 code, but it resulted in stepping on the H5SL skip list in my build of the integrated system.
Thank you for being ready to look into it had it been duplicated and shown to be in the H5SL remove area. The problem wasn't in HDF5 code.
···
On 12/08/2010 10:12 AM, Roger Martin wrote:
Ah, that's good to hear, thanks!
Quincey