Append data without a bigger malloc?

jayson.bonaudo · September 11, 2020, 12:50pm

Hello, I am new to the HDF5 format and I am having some problems appending data to an existing dataset in C.

I have to copy a lot of data from another file into an HDF5 file, so I am forced to create an unlimited dataset that I extend as I go. My problem is that for some large files I end up with too many values, and it becomes difficult to malloc an ever bigger array of doubles.

So is there a way to add to an existing dataset without having to malloc a bigger and bigger array?

I apologize if my question is difficult to understand; my English is not very good.

Regards,

miller86 · September 11, 2020, 4:47pm

Hmm… I’m not sure I understand your question, but with the appropriate dataspace (and selections) it is possible to allocate in memory only the data you need to append to the dataset in the file.

Also, just because you need to write data to the dataset at different times does not obligate you to use an extendible dataset. If you know (even approximately) the ultimate size of the required dataset, you can create a dataset of that size in the file and then write pieces to it as needed until everything has been written. If you use a chunked layout, I think that in all cases the file will only be as large as the data written so far, not the whole dataset (until, of course, the whole dataset has been written).

miller86 · September 11, 2020, 4:56pm

Depending on the underlying file system, even a partially written dataset with contiguous layout can produce a file whose disk usage (as reported by du or ls -s) covers only the data written, or the file blocks touched by it, rather than the full dataset size; ls -l may still report the full logical size. This is because some file systems permit holes in the file address space.
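The effect of holes can be observed without HDF5 at all. The sketch below, assuming a POSIX system, writes a single byte far into a new file and reports both the apparent size (`st_size`, what ls -l prints) and the space actually allocated (`st_blocks`, what du reports). The function name `make_sparse` is made up for this example, and whether the two numbers differ depends on the file system.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: writes one byte at `offset` in a fresh file, leaving a
   hole before it. Stores the apparent size and the allocated bytes.
   Returns 0 on success, -1 on failure. */
int make_sparse(const char *path, off_t offset, off_t *apparent, off_t *allocated)
{
    struct stat st;
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    /* seek past the end of the empty file, then write a single byte */
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1 ||
        write(fd, "x", 1) != 1 ||
        fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }
    close(fd);

    *apparent = st.st_size;
    /* st_blocks is counted in 512-byte units on Linux and most Unix systems */
    *allocated = (off_t)st.st_blocks * 512;
    return 0;
}
```

On a file system that supports holes, `allocated` will typically be far smaller than `apparent` for a large `offset`.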

contact · September 11, 2020, 6:55pm

Hi @jayson.bonaudo,

To avoid having to allocate an ever-increasing array, you need to use a (point or hyperslab) selection, as mentioned by @miller86.

If you know the number of doubles your dataset will ultimately store, just create it with the appropriate size up front (there is no need for the dataset to be extendable). Otherwise, if you do not know the size beforehand, it needs to be extendable. Either way, the dataset needs to be chunked. Afterwards, with a proper selection, you can write/append (or read, for that matter) a subset of the dataset with a fixed-size array, chunk by chunk, avoiding the need to allocate a big array or an ever-increasing array (which in either case can consume all the available memory).

To illustrate how this works in practice, here goes a small example in C using HDFql:

 // declare variables
 char script[100];
 double data[10];
 int number;
 int i;

 // create HDF5 file 'test.h5'
 hdfql_execute("CREATE FILE test.h5");   

 // use (i.e. open) file 'test.h5'
 hdfql_execute("USE FILE test.h5");

 // create chunked (size 10) dataset 'dset' of data type double (size 50)
 hdfql_execute("CREATE CHUNKED(10) DATASET dset AS DOUBLE(50)");

 // register variable 'data' for subsequent usage (by HDFql)
 number = hdfql_variable_register(&data);

 // loop 5 times
 for(i = 0; i < 5; i++)
 {
      // fill-up variable 'data' with some values
      // (...)

      // prepare script that writes the 10 values stored in variable 'data' into dataset 'dset' using a hyperslab selection (values already stored in the dataset will not be overwritten)
      snprintf(script, sizeof script, "INSERT INTO dset(%d:::10) VALUES FROM MEMORY %d", i * 10, number);

      // execute script
      hdfql_execute(script);
 }

Hope this helps!

jayson.bonaudo · September 14, 2020, 7:20am

Hi, thanks for your reply.

Sorry for the muddled explanations.
What I meant is this: suppose a dataset holds the double array [2001, 2002, 2003, 2004]. When I want to add values, I cannot do it with a smaller array that holds only the new values, for example [2005, 2006]. If I enlarge the dataset and write them, I often end up with [2001, 2002, 2003, 2004, 2001, 2002]. This is with an extendible, chunked dataset and a hyperslab; even so, the new values do not land after the existing ones. To get the right result I have to use a larger array like [0, 0, 0, 0, 2005, 2006]; only then are the values added at the end. But as I said, I would like to make it work with only the necessary values.
I’ll post the code I’m using tonight, because currently the computer I’m working on doesn’t have internet access, so I’ll wait until I’m on mine to send it to you.

contact · September 14, 2020, 11:28am

Hi @jayson.bonaudo,

In principle, there is no need to have a larger array like the one you wrote (i.e. [0, 0, 0, 0, 2005, 2006]) to be able to write values 2005 and 2006 in positions #4 and #5 of the dataset. Just have it as [2005, 2006] and please make sure the hyperslab points to the right position (in the dataset) and has the right size (in this case, 2).

Moreover, there is no need to resize the array to a smaller size; just reuse it by filling only the first positions that you need and write it into the dataset using a hyperslab with the correct size (this way, only the first elements of the array will be written to the dataset).

Cheers!
