[time series data, Fortran example]


#1

Dear all,

I would like to store time series data in HDF5 format. Would you please help to make an informed choice and direct me to an example in Fortran? At the moment I see two alternatives: (i) to use a dataset with an extendible time dimension and append data to it and (ii) create a new dataset for every new time step and store the data in the same HDF5 file. What solution would be better in terms of performance? If possible would you please direct me to an example in Fortran?


#2

I do work with financial timeseries and authored H5CPP h5::append operator to handle packet tables, which is the HDF5 way to express incremental datasets.

The most efficient way is to go with “(i) to use a dataset with an extendible time dimension and append data to it.” In case of stock market data, I would create a compound h5::gzip{9} dataset for each trading day.

Rolling your own append operator is non trivial. The first generation was to use H5Dread | H5Dwrite calls, combined with internal chunk size custom buffer. This strategy lead me to resonable performance: ~80% ballpark of underlying filesystem throughput.
About a year ago Gerd Heber brought my attention to the newly introduced optimized functions and indeed the rewrite was spectacular:
The throughout was in the ballpark of the underlying filesystem.

To get that result I provided a custom filter chain based on BLAS level 3 cache aware blocking, and then dispatched the filtered chunks to H5DOwrite_chunk | H5DOread_chunk.

I am not familiar with FORTRAN, but if I were in your shoes I would see how H5Oappend does in terms of performance. And if it suits your needs, use it with your option (i).

best wishes: steven


#3

While I would defer to @steven given his experience with similar things, I will mention that there are cases where breaking the time series up into multiple datasets (and/or even files) may be advantageous even at the expense of addtional coding complexity in the reader. It has to do with file size issues and the ability to easily move subsets of the time series around to different file systems if you needed to. Related to this is the window of opportunity the existing/written data is vulnerable to failures perhaps causing the whole of the existing dataset to be unuesable. I think HDF5 goes to great lenghts to protect agains this issue but I don’t have any actual experience with stressing testing it.

With all the data in the same dataset, the file can grow quite large and become unweildy. But, by large, I think it would have to be many tens of gigabytes before this would be a potential issue.

Hope that helps.


#4

Dear all,

I have created a minimalistic example using an unlimited dimension for storing time series data. I have followed the Fortran example called h5_extend.f90. However, I’m struggling at the final step of writing (appending) an additional data portion to the dataset. The code crashes with the following error messages:

HDF5-DIAG: Error detected in HDF5 (1.10.5) thread 0:
#000: ../../src/H5Dio.c line 322 in H5Dwrite(): could not get a validated dataspace from file_space_id
major: Invalid arguments to routine
minor: Bad value
#001: ../../src/H5S.c line 254 in H5S_get_validated_dataspace(): selection + offset not within extent
major: Dataspace
minor: Out of range

The main data structure in my code is a 3D array of “beads”. The first dimension of an array is the Cartesian coordinate, the second—is the bead ID and the third—is the time frame index. I can successfully write this array into a dataset. The second part of the code is creating data for an additional one time frame, extends the existing dataset and attempts to write (append) the new single time frame data into the extended dataset. Would you please point me in the right direction? I suspect I have a mistake in the huperslab selection. Maybe something is wrong with the offset or the count of the elements?

Example code:

  PROGRAM H5_ARR_BEADS_EXT

  USE HDF5

  IMPLICIT NONE

  ! No. of coordinates, beads, time frames
  integer, parameter :: NX = 3, NB = 10, NF = 5

  ! Beads array
  real, allocatable :: beads(:,:,:)

  ! Coordinate arrays
  real, allocatable :: gxx(:), gxy(:), gxz(:)

  ! File
  integer(hid_t)   :: fileID
  character(len=8) :: fileName = "data.h5"

  ! Dataspace
  integer          :: spaceRank    =   2
  integer(size_t)  :: spaceDims(2) = [NF, NB]  ! frame, bead
  integer(hsize_t) :: maxSpaceDims(2)
  integer(hid_t)   :: spaceID

  ! Memory space
  integer          :: memRank
  integer(hsize_t) :: memDims(2)
  integer(hid_t)   :: memID

  ! Hyperslab
  integer(hsize_t) :: offset(2), count(2)

  ! Datatype
  integer         :: arrTypeRank    =  1
  integer(size_t) :: arrTypeDims(1) = [NX]
  integer(hid_t)  :: arrTypeID, arrTypeWriteID

  ! Dataset
  integer(size_t)   :: dataSetDims(3) = [NF, NB, NX]  ! frame, bead, coord
  integer(hid_t)    :: dataSetID
  character(len=64) :: dataSetName
  integer(hsize_t)  :: dataSetExtDims(3)

  ! Property
  integer(hid_t)   :: propID
  integer(hsize_t) :: chunkDims(2)

  integer :: info  ! I/O status
  integer :: b, f  ! counters


  !              coord,         bead,          frame
  allocate(beads(dataSetDims(3),dataSetDims(2),dataSetDims(1)),           &
           gxx(dataSetDims(2)), gxy(dataSetDims(2)), gxz(dataSetDims(2)), &
           stat=info)

  ! Initialise Fortran interface
  call H5open_f(info)

  ! Create new file (with default properties)
  call H5Fcreate_f(fileName, H5F_ACC_TRUNC_F, fileID, info)

  !-----------------------------------------------------------------------

  ! Initialise data
  !   for each time frame
  do f = 1, NF

    ! Initialise data arrays
    call random_number(gxx)
    call random_number(gxy)
    call random_number(gxz)

    !   for each bead
    do b = 1, NB

      beads(:,b,f) = [gxx(b), gxy(b), gxz(b)]
      print *, "Bead(",b,",",f,"):", beads(:,b,f)

    end do
  end do

  ! Create 2D array dataspace (frame, bead) with
  ! unlimited time frame dimension
  ! maxSpaceDims = [H5S_UNLIMITED_F, spaceDims(2)]
  maxSpaceDims = [H5S_UNLIMITED_F, H5S_UNLIMITED_F]
  call H5Screate_simple_f(spaceRank, spaceDims, spaceID, info, maxSpaceDims)
  
  ! Enable chunking
  ! chunkDims = [int(1,8), spaceDims(2)]
  chunkDims = [int(1,8), int(1,8)]
  call H5Pcreate_f(H5P_DATASET_CREATE_F, propID, info)
  call H5Pset_chunk_f(propID, spaceRank, chunkDims, info)

  ! Create 1D array datatype (coord)
  call H5Tarray_create_f(H5T_NATIVE_REAL, arrTypeRank, arrTypeDims, &
                         arrTypeID, info)

  ! Create dataset with chunking property
  dataSetName = "Beads"
  call H5Dcreate_f(fileID, dataSetName, arrTypeID, spaceID, dataSetID, &
                   info, propID)

  ! Write data into file
  call H5Dwrite_f(dataSetID, arrTypeID, beads, dataSetDims, info)

  !-----------------------------------------------------------------------

  deallocate(beads, stat=info)

  !              coord,         bead,          frame
  allocate(beads(dataSetDims(3),dataSetDims(2),1), &
           stat=info)

  print *, ""
  print *, "----- ----- -----"
  print *, ""

  ! Empty data array
  beads = 0.0

  ! Initialise new data portion
  ! Initialise data arrays
  call random_number(gxx)
  call random_number(gxy)
  call random_number(gxz)

  !   for each bead
  do b = 1, NB

    beads(:,b,1) = [gxx(b), gxy(b), gxz(b)]
    print *, "Bead(",b,",",1,"):", beads(:,b,1)

  end do

  ! Extend dataset
  dataSetExtDims = [NF+1, NB, NX]
  call H5Dset_extent_f(dataSetID, dataSetExtDims, info)

  ! Create 2D array memory space (frame, bead)
  memRank = 2;  memDims = [1, NB]
  call H5Screate_simple_f(memRank, memDims, memID, info)

  ! Write to extended part of dataset
  !   Select hyperslab in dataspace
  offset = [NF, 0];   count = [1, NB]
  call H5Sselect_hyperslab_f(spaceID, H5S_SELECT_SET_F, offset, &
                             count, info)

  !   Write data
  dataSetDims = [1, NB, NX]
  call H5Dwrite_f(dataSetID, arrTypeID, beads(:,:,1), dataSetDims, info, &
                  memID, spaceID)

  ! Close dataset
  call H5Dclose_f(dataSetID, info)

  ! Close datatype
  call H5Tclose_f(arrTypeID, info)

  ! Close dataspace
  call H5Sclose_f(spaceID, info)

  ! Close memory space
  call H5Sclose_f(memID, info)

  !-----------------------------------------------------------------------

  ! Close file
  call h5fclose_f(fileID, info)

  ! Close Fortran interface
  call h5close_f(info)

  ! Deallocate data arrays
  deallocate(beads, gxx, gxy, gxz, stat=info)

  END PROGRAM H5_ARR_BEADS_EXT

#5

From the error message: yes, your coordinates are wrong. It is s finicky process. Also adding one by one instead of chunk by chunk is sub optimal.

Did you consider writing the IO part in C++ then linking against it? (You may need a C export) I know the idea may seem unusual, but I profiled the C++ and getting this right is not the easiest.

Best: steve

@miller86 I think we are saying the same :slight_smile:


#6

I add one by one for simplicity to ensure that this is not a source of errors. What coordinates do you think are wrong? I would prefer to stay in Fortran domain at the moment. I have tried to append the additional beads with an offset of [NF-1,0] and it works by writing the data to the last frame of the original dataset. But I need to write into the first frame of the new data portion, i.e. offset has to be [NF,0].


#7

Maxim, try making all dimensions same size. Check if this works. – it should so check for typos. Change one of the dimensions in the chain, verify if it works. Useing consecutive prime numbers is a plus: 3,5,7 :slight_smile: recursively pick the next dimension, move on only with success.

In the end, you will know. Happy debugging!


#8

Maxim, What Fortran compiler do you use? Have you succeeded to read/write HDF5 in Fortran?


#9

In case it is still relevant:

  1. A “piece-wise” append routine: https://github.com/pdebuyl/fortran_h5md/blob/master/h5md_module.f90#L362

This does add a single time element of rank 1 (i.e. stack 1d vectors into a 2d
array for a “time x data index” array).

  1. A buffered append: https://github.com/pdebuyl/fortran_h5md/blob/master/h5md_module.f90#L1175

This does append several slices at once (for buffering i/o manually).

Both routines are in Fortran. There are many such overloaded routines in the
module.

The first routine has extra boilerplace to simultaneously append auxiliary
datasets with timestamping info.

HTH,

Pierre