"invisible" file corruption

Hi,

We've been occasionally seeing HDF5 read failures in our production
environment (using HDF5 1.8.4, C++ packet table API), so we are attempting
to upgrade to 1.8.10 in the hope that it might fix things. Unfortunately,
the problem now appears to be worse ...

To give you an example of the kind of weirdness we're seeing, we have a
particular file with the following header (as per h5dump):

HDF5 "HotSpot_FX_filtered_NZDUSD-TheoreticalQuote.h5" {
GROUP "/" {
   DATASET "TheoreticalQuote" {
      DATATYPE H5T_COMPOUND {
         H5T_STD_I64LE "TimeStamp";
         H5T_IEEE_F64LE "BidPrice";
         H5T_IEEE_F64LE "AskPrice";
         H5T_IEEE_F64LE "Volume";
         H5T_IEEE_F64LE "LastInputBidPrice";
         H5T_IEEE_F64LE "LastInputAskPrice";
      }
      DATASPACE SIMPLE { ( 28851988 ) / ( H5S_UNLIMITED ) }
   }
}
}

As you can see, this file (150 MB on disk, compressed) has ~28M records. If
we try to read a few records at the end, we succeed:

$ h5dump --dataset TheoreticalQuote -s 28851970 -c 5 HotSpot_FX_filtered_NZDUSD-TheoreticalQuote.h5 | tail -15
            0.83743,
            0.83745
         },
      (28851974): {
            3564222274822547,
            0.83743,
            0.83745,
            nan,
            0.83743,
            0.83745
         }
      }
   }
}
}

If we try to read a large set of records (300K) in the middle, we also
succeed, but only sometimes:

$ h5dump --dataset TheoreticalQuote -s 15000000 -c 300000 HotSpot_FX_filtered_NZDUSD-TheoreticalQuote.h5 | tail -15
            0.82127,
            0.82144
         },
      (15299999): {
            3558294916506950,
            0.82127,
            0.82144,
            nan,
            0.82127,
            0.82144
         }
      }
   }
}
}

Trying a different starting point, we don't get an error per se, but where
are the results?

$ h5dump --dataset TheoreticalQuote -s 14700000 -c 300000 HotSpot_FX_filtered_NZDUSD-TheoreticalQuote.h5 | tail -15
      H5T_IEEE_F64LE "Volume";
      H5T_IEEE_F64LE "LastInputBidPrice";
      H5T_IEEE_F64LE "LastInputAskPrice";
   }
   DATASPACE SIMPLE { ( 28851988 ) / ( H5S_UNLIMITED ) }
   SUBSET {
      START ( 14700000 );
      STRIDE ( 1 );
      COUNT ( 300000 );
      BLOCK ( 1 );
      DATA {
      }
   }
}
}

Finally, these peculiarities probably point to a subtly corrupt file, and
they explain why our application, which uses the packet table API, fails to
read this particular file at this offset, as per our log:

2012-Dec-12 08:18:58.656324[0x00007faae7fff700]: DEBUG:
dataStoreLib.BufferedFile(NZDUSD): reading from file
/home/ligerdemo/data/HotSpot/FX/filtered/NZDUSD/HotSpot_FX_filtered_NZDUSD-TheoreticalQuote.h5,
earliest first = true, page *start index = 14700000, page end index =
15000000*, start index = 14700000, end index = 28920777
2012-Dec-12 08:18:58.662190[0x00007faae7fff700]: ERROR: HDF5: seq: 0 file:
/home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Zdeflate.c function:
H5Z_filter_deflate line: 125 desc: inflate() failed
2012-Dec-12 08:18:58.662214[0x00007faae7fff700]: ERROR: HDF5: seq: 1 file:
/home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Z.c function:
H5Z_pipeline line: 1120 desc: filter returned failure during read
2012-Dec-12 08:18:58.662220[0x00007faae7fff700]: ERROR: HDF5: seq: 2 file:
/home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dchunk.c function:
H5D__chunk_lock line: 2766 desc: data pipeline read failed
2012-Dec-12 08:18:58.662225[0x00007faae7fff700]: ERROR: HDF5: seq: 3 file:
/home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dchunk.c function:
H5D__chunk_read line: 1735 desc: unable to read raw data chunk
2012-Dec-12 08:18:58.662229[0x00007faae7fff700]: ERROR: HDF5: seq: 4 file:
/home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dio.c function:
H5D__read line: 449 desc: can't read data
2012-Dec-12 08:18:58.662242[0x00007faae7fff700]: ERROR: HDF5: seq: 5 file:
/home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dio.c function:
H5Dread line: 174 desc: can't read data
2012-Dec-12 08:18:58.662257[0x00007faae7fff700]: CRITICAL: File::File:
Failed to get records between indexes *14700000, 14999999* from file
/home/ligerdemo/data/HotSpot/FX/filtered/NZDUSD/HotSpot_FX_filtered_NZDUSD-TheoreticalQuote.h5

Things to note:

   1. The "corrupt" file in question was originally created using the HDF5
   1.8.4 API, and is now being read/appended using HDF5 1.8.10.
   2. Our application tries to read this file using the 1.8.10 API.
   3. The h5dump utility used above is an old version (1.8.4), though I do
   not think this is relevant, since the application also fails to read the
   file.

My basic question is: has anyone seen this kind of invisible file
corruption before, and if so, do you know what might cause it? I'm also
wondering whether we're failing to shut down / close files correctly, and
whether that is what is causing these corruption problems. Right now, per
file, our code constructs an H5::CompType object, an H5::H5File object, and
an FL_PacketTable object, in that order, then destructs them in the reverse
order. Is that sufficient, or should we be calling a global shutdown
routine as well?

Any help on this would be very, very appreciated.

Thanks

Hi again,

More on this ...

I have just compiled a simple program (attached) that repeatedly opens, appends to, then closes a bunch of files, but it fails to run past a single iteration. This suggests to me that either we're calling something incorrectly or HDF5 is internally not closing things down after the first iteration. If things are not being closed properly, that could also explain why we're seeing problems in production with semi-corrupt files.

To compile (on Ubuntu 12.04 64-bit), assuming HDF5 1.8.10 is installed to /usr/local:

$ g++ -std=c++0x -I /usr/local/hdf5-1.8.10-linux-x86_64-static/include simple.cpp \
    /usr/local/hdf5-1.8.10-linux-x86_64-static/lib/libhdf5_hl_cpp.a \
    /usr/local/hdf5-1.8.10-linux-x86_64-static/lib/libhdf5_hl.a \
    /usr/local/hdf5-1.8.10-linux-x86_64-static/lib/libhdf5_cpp.a \
    /usr/local/hdf5-1.8.10-linux-x86_64-static/lib/libhdf5.a \
    /usr/local/hdf5-1.8.10-linux-x86_64-static/lib/libsz.a \
    /usr/local/hdf5-1.8.10-linux-x86_64-static/lib/libz.a

Output:

$ ./a.out
HDF5-DIAG: Error detected in HDF5 (1.8.10) thread 0:
   #000: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5D.c line 170 in H5Dcreate2(): unable to create dataset
     major: Dataset
     minor: Unable to initialize object
   #001: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Dint.c line 439 in H5D__create_named(): unable to create and link to dataset
     major: Dataset
     minor: Unable to initialize object
   #002: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
     major: Links
     minor: Unable to initialize object
   #003: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5L.c line 1882 in H5L_create_real(): can't insert link
     major: Symbol table
     minor: Unable to insert object
   #004: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
     major: Symbol table
     minor: Object not found
   #005: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5Gtraverse.c line 641 in H5G_traverse_real(): traversal operator failed
     major: Symbol table
     minor: Callback failed
   #006: /home/hdftest/snapshots-bin-hdf5_1_8_10/current/src/H5L.c line 1674 in H5L_link_cb(): name already exists
     major: Symbol table
     minor: Object already exists
AppendPackets failed on iteration 1

Can anyone provide some insight into why this might be failing?

Many thanks
Jess

simple.cpp (2.76 KB)

On 12/12/12 08:50, Jess Morecroft wrote: