HDF5 chunk cache

Hi all,

I'm new to HDF5 and so far I have managed to get everything I wanted from the
library working.
I'm using the C++ API and I already have parts of my target program running,
which uses a compound datatype of
integers, reals and variable-length types, all chosen dynamically. I manage
to write my data relatively fast with a
buffer maintained by my application and handed over ready for writing to the
library.

A little more about the task:
The data is stored in a one-dimensional table (vector), which consists of
records of a compound datatype.
The number of records is unknown, so the maximum size of the dataset is
unlimited x 1 with chunking enabled.
The application receives the data record by record.
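
To make the setup concrete, here is a rough sketch of how I create the
dataset (simplified: fixed int/double fields instead of my dynamic VarLen
compound, and the file/dataset names are placeholders):

#include "H5Cpp.h"

// Placeholder record layout; my real compound is chosen dynamically.
struct Record {
    int    id;
    double value;
};

int main() {
    H5::H5File file("records.h5", H5F_ACC_TRUNC);

    // Compound datatype matching the in-memory record layout.
    H5::CompType rec_type(sizeof(Record));
    rec_type.insertMember("id",    HOFFSET(Record, id),    H5::PredType::NATIVE_INT);
    rec_type.insertMember("value", HOFFSET(Record, value), H5::PredType::NATIVE_DOUBLE);

    // 1-D dataspace: current size 0, maximum size unlimited.
    hsize_t dims[1]    = {0};
    hsize_t maxdims[1] = {H5S_UNLIMITED};
    H5::DataSpace space(1, dims, maxdims);

    // Chunking is required for an unlimited dimension.
    hsize_t chunk_dims[1] = {4096};   // records per chunk -- placeholder value
    H5::DSetCreatPropList dcpl;
    dcpl.setChunk(1, chunk_dims);

    H5::DataSet dset = file.createDataSet("records", rec_type, space, dcpl);
    return 0;
}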

Currently my application allocates space and stores the received data, and
when the application's buffer is full,
the library's write function is called with a pointer to the buffer
containing the data (my chunk size is equal to the size of my buffer).
Then the process starts all over again, until the data runs out.
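
The write step looks roughly like this (a sketch using the placeholder names
from above; BUF_RECORDS equals my chunk size):

// Append one full application buffer to the end of the 1-D dataset.
void append_buffer(H5::DataSet& dset, const H5::CompType& rec_type,
                   const Record* buf, hsize_t n_written, hsize_t BUF_RECORDS) {
    // Grow the dataset by one buffer's worth of records.
    hsize_t new_dims[1] = {n_written + BUF_RECORDS};
    dset.extend(new_dims);

    // Select the newly added region in the file...
    H5::DataSpace file_space = dset.getSpace();
    hsize_t offset[1] = {n_written};
    hsize_t count[1]  = {BUF_RECORDS};
    file_space.selectHyperslab(H5S_SELECT_SET, count, offset);

    // ...describe the in-memory buffer, and write.
    H5::DataSpace mem_space(1, count);
    dset.write(buf, rec_type, mem_space, file_space);
}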

But here is the thing: I still can't understand how the chunk cache
works.
From everything I have read and seen so far, I thought that when I call the
dataset write function, the library
copies my data into its cache, and when a whole chunk has been written (written
from my app into HDF5's cache, not to the disk by the library)
or the cache is already full, the chunks in the cache are written to
the hard drive to make room for new incoming chunks.
And that this writing to the hard drive doesn't happen at the precise moment
the H5Dwrite function is called.
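
As I understand it, a flush to disk can also be forced explicitly without
closing anything; a one-line sketch, assuming the file handle from above:

// Push dirty cached chunks and metadata to disk without closing the file.
H5Fflush(file.getId(), H5F_SCOPE_LOCAL);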

And if I'm correct, then if I set the chunk cache size to some big value and
set the chunk size to 1, all those chunks would be written
together when the chunk cache is full, or perhaps when I close the dataset?
(Why do I want to use a chunk size of 1? See below, where I explain
how I want to read the data.)
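
For reference, this is how I set a bigger per-dataset chunk cache; I'm not
sure my version of the C++ wrappers exposes this, so the sketch uses the C
call H5Pset_chunk_cache on a dataset access property list (all values are
placeholders to tune):

#include "hdf5.h"

// Open a dataset with an enlarged chunk cache.
hid_t open_with_big_cache(hid_t file_id, const char* name) {
    hid_t dapl = H5Pcreate(H5P_DATASET_ACCESS);
    H5Pset_chunk_cache(dapl,
                       12421,             /* hash-table slots; ideally a prime well
                                             above the number of cached chunks */
                       64 * 1024 * 1024,  /* total cache size in bytes */
                       1.0);              /* w0 = 1.0: prefer evicting fully
                                             read/written chunks */
    hid_t dset = H5Dopen2(file_id, name, dapl);
    H5Pclose(dapl);
    return dset;
}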

Each chunk is read with one I/O operation (one per chunk), but is
each chunk also written with one separate I/O operation?
That would explain why my data is written so slowly when I set the chunk
size to 1, even with an explicitly bigger chunk cache.

I need to write the data row by row, or in other words I need to write a
one-dimensional dataset, which consists of
compound data, line by line, compound item by compound item.
If I set the chunk to, for example, 10 records and use H5Dwrite to write them
into HDF5's cache one by one, is that going to be 1 or 10 I/O operations?
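
In code, that record-by-record variant would look roughly like this (the
per-record version of append_buffer above, same placeholder names):

// Append a single record; with a 10-record chunk my hope is that the flush
// to disk happens once per chunk, not once per call.
void append_one(H5::DataSet& dset, const H5::CompType& rec_type,
                const Record& rec, hsize_t n_written) {
    hsize_t new_dims[1] = {n_written + 1};
    dset.extend(new_dims);

    H5::DataSpace file_space = dset.getSpace();
    hsize_t offset[1] = {n_written};
    hsize_t count[1]  = {1};
    file_space.selectHyperslab(H5S_SELECT_SET, count, offset);

    H5::DataSpace mem_space(1, count);
    dset.write(&rec, rec_type, mem_space, file_space);
}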

Why I want to use a chunk size of 1:
Afterwards I have to read the data again record by record, or in other words
line by line, in *random order*. That's why
it would be better for me to set the chunk size to 1 record, so I wouldn't
have to read more than I need.
I can't load all the data (all chunks) into RAM, because I would have to
allocate more than 16 GB, and I can't afford that.
I don't want to make the chunk size bigger, because then I would have to read
large amounts of currently unused data from the disk
and read it again later, when I really need it.
For example (with a chunk size of 5000 records):
read record 1        // chunk 1
read other records
...
read record 2000000  // chunk 400
read record 2        // same chunk as record 1, but that chunk is probably
                     // no longer in the cache, replaced by other chunks by
                     // now, so I have to read the same chunk from disk again...
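
The read pattern I have in mind is roughly this (placeholder names again):

// Fetch one record at a given index with a hyperslab selection. Note that
// HDF5 still reads the whole chunk containing the record from disk (or
// finds it in the chunk cache); only one record is copied out.
Record read_record(H5::DataSet& dset, const H5::CompType& rec_type,
                   hsize_t index) {
    H5::DataSpace file_space = dset.getSpace();
    hsize_t offset[1] = {index};
    hsize_t count[1]  = {1};
    file_space.selectHyperslab(H5S_SELECT_SET, count, offset);

    H5::DataSpace mem_space(1, count);
    Record rec;
    dset.read(&rec, rec_type, mem_space, file_space);
    return rec;
}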

That's why the best choice for me would be to read only the data I need.
I'm ready to make the chunk size bigger than 1 in order to improve the
writing performance, but only if there is no other way. That's
why I need more information about what happens after I call H5Dwrite,
how exactly the chunk cache is maintained internally, and under which
conditions I/O operations are initiated.

I will appreciate any help I can get.

--
*V.Daskalov*

Hi Vladimir,
  You almost certainly don't want to set your chunk size to 1, for (at least) two reasons:

- Each chunk needs a "chunk record" in the chunk index data structure (currently a B-tree). If you have too many chunks, the size (and I/O) of the chunk index will dominate that of the actual chunks.

- Also, performing I/O on 1 byte will be about as expensive as on 4KB on most machines (and sometimes as expensive as on 64KB, on larger systems), so you aren't saving anything by performing smaller I/O operations.
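
For example, you could size the chunks by bytes instead of records; a rough
sketch (the 64KB target is arbitrary and Record is the placeholder struct
from the snippets above):

// Derive records-per-chunk from a target chunk size in bytes.
H5::DSetCreatPropList make_dcpl() {
    const hsize_t TARGET_CHUNK_BYTES = 64 * 1024;   // about one "cheap" I/O
    const hsize_t records_per_chunk  = TARGET_CHUNK_BYTES / sizeof(Record);

    H5::DSetCreatPropList dcpl;
    hsize_t chunk_dims[1] = {records_per_chunk};
    dcpl.setChunk(1, chunk_dims);
    return dcpl;
}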

  Quincey
