Getting error "invalid dataset size, likely file corruption"

Hi, I am developing PureHDF, an independent library with write support for HDF5 files. I use h5dump to verify that the written files conform to the spec. With h5dump 1.14.1 I was able to create a file containing a dataset with a null dataspace, and h5dump dumped it without complaint:

HDF5 "<file-path>" {
GROUP "/" {
   DATASET "Null" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  NULL
      DATA {
      }
   }
}
}

I upgraded to version 1.14.4.2 and now I get the error message "invalid dataset size, likely file corruption". I traced it back to this commit (Fixes for file format security issues (#4283) · HDFGroup/hdf5@ce53bc0 · GitHub), where the following problematic check was introduced (line 408):

if (H5_addr_le((layout->storage.u.contig.addr + data_size), layout->storage.u.contig.addr))
    HGOTO_ERROR(H5E_DATASET, H5E_OVERFLOW, FAIL, "invalid dataset size, likely file corruption");

This check fails when the data size is 0, which is the case when the dataspace is a null dataspace.
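To illustrate the arithmetic, here is a minimal standalone sketch with made-up values (not the actual HDF5 macros): when data_size is 0, addr + data_size equals addr, so a less-than-or-equal comparison is true even though nothing overflowed.

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint64_t addr      = 2048; /* hypothetical contiguous storage address */
    uint64_t data_size = 0;    /* zero bytes of data for a null dataspace */

    /* mirrors the arithmetic of H5_addr_le(addr + data_size, addr) for defined addresses */
    if (addr + data_size <= addr)
        printf("<= check triggers: false positive for a 0-sized dataset\n");

    /* a strict comparison would only trigger on a real wrap-around */
    if (addr + data_size < addr)
        printf("< check triggers\n");

    return 0;
}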

Is the error on my side (maybe I misunderstood something), or is the newly introduced check the problem?

Thanks!

Hi @apollo3zehn-h5,

do you happen to have that file available somewhere? We do test NULL dataspaces, but generally the library shouldn’t have allocated space in the file for a 0-sized contiguous dataset, so the check should normally be skipped. For example, here’s one of our test files that has a dataset with a NULL dataspace:

h5dump -pH tnullspace.h5
GROUP "/" {
   ATTRIBUTE "attr" {
      DATATYPE  H5T_STD_U32LE
      DATASPACE  NULL
   }
   DATASET "dset" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  NULL
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 0
         OFFSET HADDR_UNDEF
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_IFSET
         VALUE  H5D_FILL_VALUE_DEFAULT
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_LATE
      }
   }
}
}

Notice the OFFSET HADDR_UNDEF part, which means that file space isn’t allocated and the check is skipped. What does the output of h5dump -pH look like for your file? It’s possible there’s a bug in either older versions or the current version of HDF5 with regard to file space allocation for 0-sized datasets, but I also suspect the use of H5_addr_le was unintentional and H5_addr_lt was intended instead.
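For comparison, here’s roughly how a file like this can be produced with the HDF5 C API (a sketch, not the exact code behind tnullspace.h5; the "attr" attribute from the dump above is omitted):

#include "hdf5.h"

int main(void) {
    hid_t file  = H5Fcreate("tnullspace.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate(H5S_NULL); /* null dataspace: no elements, no extent */
    hid_t dset  = H5Dcreate2(file, "dset", H5T_STD_I32BE, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    /* nothing to write: with no data, the library never allocates contiguous
     * storage, so the layout offset stays at the undefined address */

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}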


Thanks for your quick reply!

I now skip file space allocation when there is a null dataspace, and h5dump is happy again:

HDF5 "/tmp/tmpDgTBco.tmp" {
GROUP "/" {
   DATASET "Null" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  NULL
      STORAGE_LAYOUT {
         CONTIGUOUS
         SIZE 0
         OFFSET HADDR_UNDEF
      }
      FILTERS {
         NONE
      }
      FILLVALUE {
         FILL_TIME H5D_FILL_TIME_NEVER
         VALUE  H5D_FILL_VALUE_UNDEFINED
      }
      ALLOCATION_TIME {
         H5D_ALLOC_TIME_EARLY
      }
   }
}
}
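For other implementers, here is a rough sketch of the idea in C (illustrative only, not the actual PureHDF code, which is C#): reserve file space only when there is data, and otherwise store the undefined address (all bits set) in the data layout message.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define UNDEFINED_ADDRESS UINT64_MAX /* undefined address = all bits set, per the file format spec */

/* hypothetical helper: decide whether to reserve contiguous storage for a dataset */
static uint64_t allocate_contiguous_storage(int is_null_dataspace, uint64_t data_size,
                                            uint64_t *next_free_address)
{
    if (is_null_dataspace || data_size == 0)
        return UNDEFINED_ADDRESS;       /* skip allocation; h5dump then shows OFFSET HADDR_UNDEF */

    uint64_t addr = *next_free_address; /* trivial bump allocator, for illustration only */
    *next_free_address += data_size;
    return addr;
}

int main(void) {
    uint64_t next_free = 2048; /* made-up start of the file's free space */
    printf("null dataset:    %" PRIu64 "\n", allocate_contiguous_storage(1, 0, &next_free));
    printf("regular dataset: %" PRIu64 "\n", allocate_contiguous_storage(0, 16, &next_free));
    return 0;
}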

If space allocation is generally not allowed in combination with a null dataspace, it would be great to add this information to the spec document (HDF5: HDF5 File Format Specification Version 3.0) so that others do not run into the same issue.

Thanks again!
My issue is solved now :)

To avoid confusion: I am not creating files with the C library but with my own independent implementation. I was simply following the HDF5 spec and ran into this issue. To spare other library authors the same problem and to align the spec with reality, I would appreciate it if the condition "null dataspace = no allocation allowed" became part of the spec :)


It’s an interesting question for sure. Considering the behavior of previous versions of the library, I’m tempted to say there is no reason for this to be disallowed and that the library should be fixed. When you previously allocated file space for the dataset, was a particular fixed address given to the dataset (since you would have been allocating a 0-byte region)?

Yes, previously there was a specific address. It was produced by the simple FreeSpaceManager class: PureHDF/src/PureHDF/VOL/Native/Core.Writing/FreeSpaceManager.cs at 2c300909a30048cea79cecd91180ca10679aac92 · Apollo3zehn/PureHDF · GitHub

Previously it did not check the length and always returned a valid address.

Thanks for the pointer! This may have been intentionally left unspecified in the file format specification, but we plan to discuss it internally and determine a resolution. In the meantime, I’ll fix the newly added check in the library, because I believe it’s an ill-formed check in either case.
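For reference, the corrected condition would presumably use the strict comparison, along these lines (a sketch based on the H5_addr_lt suggestion above, not the actual patch):

if (H5_addr_lt((layout->storage.u.contig.addr + data_size), layout->storage.u.contig.addr))
    HGOTO_ERROR(H5E_DATASET, H5E_OVERFLOW, FAIL, "invalid dataset size, likely file corruption");

With data_size equal to 0 this no longer triggers, while a genuine address overflow still does.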