Problem reading data from an HDF5 file using HDFql

Hello,

I'm using the latest HDFql Fortran version to read data from an HDF5 file generated by another application (Nastran).

My dataset is located under multiple groups, let's say group1/group2/mydataset.

Inside that dataset, I have multiple rows and columns, and the columns have names (let's say id, cl1, cl2, cl3, cl4).

The problem I have is that different columns have different types, and some of those types are not primitive types.
Hence, when I do a SELECT from group1/group2/mydataset, and then hdfql_cursor_first() and hdfql_cursor_get_int(), I correctly get my ID, but if I then try to access any of the other columns, I get totally random data.

For instance, in my case, cl1 and cl2 are arrays that hdf5 reports as "Java objects"; what I am interested in are cl3 and cl4, which are regular float numbers.

But if I do "hdfql_cursor_absolute(4)" to get to the right column, and then "hdfql_cursor_get_float()", I get a totally random number (like 3.2e-35) instead of the one in my dataset.

TL;DR: apart from the first and the last number of my dataset, I'm not able to access any of the other numbers; I get random values instead. (Probably because the cursor doesn't know how many bytes it has to skip to get past the Java objects?)

Thanks for your help

A small detail I forgot to add:

When I do hdfql_cursor_get_count(), it displays 80, but I don't have 80 elements in my dataset, even counting the Java arrays as multiple elements.

Hi @angelbillyguyon,

Would it be possible to share the HDF5 file you are trying to open with HDFql in Fortran, as well as the name of the dataset you wish to read?

Thanks!

Hello, the HDF5 file in question contains information that I'm not allowed to share. Accessing the group is not a problem; the problem is accessing each individual value in the named dataset.

I'm sorry I can't provide you with more than that, but basically you can see that this second column is not a primitive type; it's some kind of Java array containing multiple floats.

Same for the 3rd element; the 4th, 5th and onward look like basic floats, and ID is a regular int.

That's why I think that when I do hdfql_cursor_next(), it doesn't know how to move to the actual next element.

Best regards,

Hi @angelbillyguyon,

Thanks for the additional details.

Would you mind running h5dump against the HDF5 file and posting the result stripped of data (i.e. only the structure of the HDF5 file, not its data)?
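
For reference, a structure-only dump can typically be produced with h5dump's header option (file.h5 below is a placeholder for your actual file name):

h5dump -H file.h5 > structure.txt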

Thanks!

Hello, here is a sample of the file. I changed every group name and altered every value so that the types are the same as before but the values are different, so you can get a view of it.

sample.txt (2.7 KB)

Best regards,

Hi @angelbillyguyon,

Thanks for sharing the result of running h5dump against the HDF5 file.

The reason why function hdfql_cursor_get_count returns 80 is that the number of elements stored in dataset A3 is equal to 1+36+9+1+1+1+1+1+1+1+1+1+1+1+1+9+1+1+1+9+1. (Keep in mind that a cursor in HDFql always flattens the data that it stores.)

Therefore, to read the data stored in A3 and retrieve it correctly through a cursor afterwards, do the following:

PROGRAM Test

    USE HDFql

    INTEGER :: state
    INTEGER :: i

    ! read dataset A3 into HDFql's default cursor (the cursor stores the data flattened)
    state = hdfql_execute("SELECT FROM test.h5 A1/A2/A3")

    state = hdfql_cursor_next()
    WRITE(*, *) "ID=", hdfql_cursor_get_bigint()

    WRITE(*, *) "g1="
    DO i = 1, 36
        state = hdfql_cursor_next()
        WRITE(*, *) hdfql_cursor_get_double()
    END DO

    WRITE(*, *) "g2="
    DO i = 1, 9
        state = hdfql_cursor_next()
        WRITE(*, *) hdfql_cursor_get_double()
    END DO

    state = hdfql_cursor_next()
    WRITE(*, *) "g3=", hdfql_cursor_get_double()

    ! (...)

    state = hdfql_cursor_next()
    WRITE(*, *) "DOMAIN_ID=", hdfql_cursor_get_bigint()

END PROGRAM

Alternatively, instead of using a cursor, you could use a (user-defined) Fortran variable of type structure (https://en.wikibooks.org/wiki/Fortran/structures), which HDFql would populate with data when reading dataset A3 - this would greatly increase performance.
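
As a rough illustration of that approach (this is only a sketch: the structure name row_type, the member names id/g1/g2/g3/domain_id and their kinds are assumptions inferred from the counts above, not the actual compound layout of A3, so adjust them to match your h5dump output):

PROGRAM TestStructure

    USE, INTRINSIC :: ISO_C_BINDING
    USE HDFql

    ! hypothetical structure mimicking the members of dataset A3 (adjust to the real layout)
    TYPE, BIND(C) :: row_type
        INTEGER(C_INT64_T) :: id
        REAL(C_DOUBLE)     :: g1(36)
        REAL(C_DOUBLE)     :: g2(9)
        REAL(C_DOUBLE)     :: g3
        ! (...) remaining members of A3
        INTEGER(C_INT64_T) :: domain_id
    END TYPE row_type

    TYPE(row_type), DIMENSION(1) :: row   ! one row (the sample dataset A3 has a single row)
    INTEGER                      :: state

    ! register "row" as (transient) variable number 0 and let HDFql populate it when reading A3
    state = hdfql_variable_transient_register(row)
    state = hdfql_execute("SELECT FROM test.h5 A1/A2/A3 INTO MEMORY 0")

    WRITE(*, *) "ID=", row(1)%id
    WRITE(*, *) "g3=", row(1)%g3

END PROGRAM TestStructure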

Hope it helps!

Hi, thanks a lot for your response.

So if I understand correctly, your way of working is to look up the data types with h5dump; my problem was that I was reading the first int as a plain INTEGER in Fortran instead of a bigint.

You said I should use a structure instead of a cursor for performance reasons (and performance is one of the big reasons we're using HDF5), so I'm wondering something:

In this example my dataset has only 1 row, but in reality the datasets are going to have hundreds of thousands, if not millions, of rows.

How would I use a structure type to store each row of the dataset in our previous example, for instance?
Also, in my case only some of the columns of each row are of interest; is there a way to optimize that?

Hope I’m not asking for too much,
Thanks again,

Best regards,

Hi @angelbillyguyon,

To support reading more than one row, the code previously posted should be modified as follows:

PROGRAM Test

    USE HDFql

    INTEGER :: state
    INTEGER :: i

    state = hdfql_execute("SELECT FROM test.h5 A1/A2/A3")

    ! each iteration processes one row (the call in the loop condition positions
    ! the cursor on the row's first member, i.e. its ID)
    DO WHILE(hdfql_cursor_next() == HDFQL_SUCCESS)

        WRITE(*, *) "ID=", hdfql_cursor_get_bigint()

        WRITE(*, *) "g1="
        DO i = 1, 36
            state = hdfql_cursor_next()
            WRITE(*, *) hdfql_cursor_get_double()
        END DO

        WRITE(*, *) "g2="
        DO i = 1, 9
            state = hdfql_cursor_next()
            WRITE(*, *) hdfql_cursor_get_double()
        END DO

        state = hdfql_cursor_next()
        WRITE(*, *) "g3=", hdfql_cursor_get_double()

        ! (...)

        state = hdfql_cursor_next()
        WRITE(*, *) "DOMAIN_ID=", hdfql_cursor_get_bigint()

    END DO

END PROGRAM

Since dataset A3 can potentially grow to have millions of rows, an out-of-memory error may occur when trying to read it. To circumvent this, you need to read small chunks of A3 at a time using HDF5's hyperslab capabilities. Therefore, the code above will need to be modified to work explicitly with hyperslabs in case A3 grows considerably. The good news is that the upcoming release of HDFql (version 2.5.0) will introduce the concept of a sliding cursor, which (implicitly/seamlessly) uses hyperslabs in case a dataset does not fit in (RAM) memory due to its sheer dimension. Please see this post for additional details about sliding cursors.

On the other hand, if you wish to use a (user-defined) variable instead of a cursor, you need to create such a variable as a structure (that mimics the members of dataset A3) and pass it to HDFql when reading dataset A3 - HDFql will take care of populating the variable with data from this dataset. In addition, just like with cursors, you need to explicitly take care of reading A3 in chunks (using hyperslab capabilities) in case it does not fit in (RAM) memory.

Hope it helps!

Hello,

Thanks again for your response. I actually already managed to "optimize" it using a structure type, as you suggested, and hyperslabs as well.

I just initially didn’t think that HDFql would populate my user-defined structure type easily but it actually did.
Now everything is working perfectly.

Thanks a lot for the time you took helping me,

Best regards

HDFql is just plain magic. G.

Hello there,

Sorry for bothering you again, but I'm having some problems again.
Since last time, I've been able to read through my HDF5 files correctly with HDFql, storing my data and doing whatever I needed with it.

The problem is, I now have to work with pretty big files (multiple gigabytes), with datasets of more than 10 million rows.

The technique I'm using is iterating through each row using a hyperslab, so dataset(0), dataset(1), ... dataset(6000000).

The problem is that this technique seems highly inefficient, and it takes a while to read the file.

So what I started trying to do was to create an array of my custom type whose dimension is the dimension of the dataset; obviously, I ran into a stack overflow error.

The idea then would be to read the dataset, let's say, 10,000 rows at a time and populate a dimension(10000) array with it every time.

But I can't seem to wrap my head around hyperslabs, and whatever I try, my variable gets populated with bad values.

What I tried to do:
state = hdfql_variable_transient_register(dim) ! for later use when iterating
state = hdfql_execute("SHOW DIMENSION /PATH/TO/DATASET/DATASET1 INTO MEMORY 0")
BLOCK
type(custom_type), dimension(2) :: obj ! example with dimension 2, easier to read
state = hdfql_variable_transient_register(obj)
state = hdfql_execute("SELECT FROM /PATH/TO/DATASET/DATASET1(1:1:2) INTO MEMORY 0") ! doesn't work; I thought the first 1 would be the 2nd value of the dataset, the second 1 the stride, and the 2 the count

also tried:
state = hdfql_execute("SELECT FROM /PATH/TO/DATASET/DATASET1(1:2) INTO MEMORY 0") ! kind of a Pythonic way, doesn't work

and finally:
state = hdfql_execute("SELECT FROM /PATH/TO/DATASET/DATASET1(1,2) INTO MEMORY 0") ! populates the first element of the array correctly, but bad values afterwards

Sorry if this question seems really simple, but HDF5 hyperslabs really don't work for me lol

Speed is really an important factor in what I'm working on right now, which is why I would like to optimize it as much as possible.

Also, my custom type perfectly represents each row of the dataset (and I retrieve the data correctly when iterating through each row individually).

Thanks for your help,
Best regards

Hi @angelbillyguyon,

It seems that the hyperslab specified in the code you have posted is incorrect. Try this instead (as an example):

WRITE(start, "(I0)") 1000 * i   ! "start" is a CHARACTER variable holding the hyperslab start position as text
state = hdfql_execute("SELECT FROM /PATH/TO/DATASET/DATASET1(" // TRIM(start) // ":::1000) INTO MEMORY 0")

This instruction should be in a loop with its index (i) starting at 0. Every time it is executed, a chunk of 1000 rows is read from the dataset and the variable registered with number 0 is populated with it (make sure the variable has enough reserved space to store the chunk - otherwise, a segmentation fault may occur).
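
To tie it together, here is a rough sketch of such a loop (assumptions: a hypothetical row_type mirroring one row of the dataset, a fixed chunk size of 1000, that the dimension returned by SHOW DIMENSION can be stored in a 64-bit integer, and that any final partial chunk is handled separately):

PROGRAM ChunkedRead

    USE, INTRINSIC :: ISO_C_BINDING
    USE HDFql

    ! hypothetical structure mirroring one row of the dataset (adjust to the real layout)
    TYPE, BIND(C) :: row_type
        INTEGER(C_INT64_T) :: id
        REAL(C_DOUBLE)     :: g1(36)
        REAL(C_DOUBLE)     :: g2(9)
        REAL(C_DOUBLE)     :: g3
        ! (...)
        INTEGER(C_INT64_T) :: domain_id
    END TYPE row_type

    TYPE(row_type), DIMENSION(1000) :: chunk   ! must have room for one full chunk
    INTEGER(C_INT64_T)              :: rows
    CHARACTER(LEN=32)               :: start
    INTEGER                         :: state
    INTEGER                         :: i

    ! get the number of rows in the dataset
    state = hdfql_variable_transient_register(rows)
    state = hdfql_execute("SHOW DIMENSION /PATH/TO/DATASET/DATASET1 INTO MEMORY 0")

    ! read the dataset 1000 rows at a time (handling of a final partial chunk is omitted)
    DO i = 0, INT(rows / 1000) - 1

        ! register "chunk" as (transient) variable number 0 for the next read
        state = hdfql_variable_transient_register(chunk)

        ! read 1000 rows starting at row 1000*i into variable number 0
        WRITE(start, "(I0)") 1000 * i
        state = hdfql_execute("SELECT FROM /PATH/TO/DATASET/DATASET1(" // TRIM(start) // ":::1000) INTO MEMORY 0")

        ! ... process "chunk" here ...

    END DO

END PROGRAM ChunkedRead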

In addition, you may want to experiment with different cache parameters (to speed up reads of HDF5 files/datasets) and measure the performance. Setting cache parameters can be done either through HDFql operation SET CACHE (this sets the cache parameters used for subsequent HDF5 file/dataset read operations - e.g. SET DATASET CACHE SLOTS 521 SIZE 1048576 PREEMPTION 0.9) or through operation SELECT (this overrides the settings specified by SET CACHE when reading a dataset - e.g. SELECT FROM /PATH/TO/DATASET/DATASET1 CACHE SIZE 2097152).
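
For illustration, in Fortran that could look like the following (a minimal sketch reusing exactly the operations quoted above; /PATH/TO/DATASET/DATASET1 is the placeholder from your snippet):

! set the cache parameters used for subsequent dataset read operations
state = hdfql_execute("SET DATASET CACHE SLOTS 521 SIZE 1048576 PREEMPTION 0.9")

! ...or override those settings for one particular read (the result goes into HDFql's default cursor here)
state = hdfql_execute("SELECT FROM /PATH/TO/DATASET/DATASET1 CACHE SIZE 2097152")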

Hope it helps!

Thanks again for your previous advice; with it I managed to make it work, going from 10 minutes to 25 seconds to read my data file, using arrays of dimension 5000 (above that I get a stack overflow error with the SELECT method, but that isn't a problem, because at that point it's not the SELECT operations that take time (they take only 2 seconds) but my calculations on the data).

I tried playing a bit with the cache, but it didn't seem to matter in the end (which is not surprising if reading the data takes only 2 seconds).

I'll try playing with it a bit once I work with bigger files! For now I'm trying to understand what the SET CACHE arguments do by reading the documentation.

Thanks a lot again, and sorry for the bother.
