Access dataset divided into several files

rogez wrote:

Hello,

I'm new to HDF and have read tutorials and some parts of the user's guide
and reference manual but I can't find the best practice to implement the
following structure.

Basically, I need two datasets:

    one defining my result data (a one-dimensional array of a complex
compound datatype);
    one sorting this data in 3D space (a three-dimensional dataset whose
cells hold arrays of indices or references to items of the previous
dataset).
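For illustration, that layout might be sketched with the h5py Python bindings roughly as follows (the field names, shapes, and file name are invented; a variable-length integer type is one possible way to hold a per-cell index list):

```python
# Hypothetical sketch of the two-dataset layout: a 1-D compound "results"
# dataset plus a 3-D dataset of index lists pointing back into it.
import os
import tempfile

import numpy as np
import h5py

tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "node0.h5")  # invented file name

# 1-D dataset of result records with a compound datatype (fields invented).
record_t = np.dtype([("origin", "f8", (3,)),
                     ("direction", "f8", (3,)),
                     ("energy", "f8")])
records = np.zeros(4, dtype=record_t)
records["energy"] = [1.0, 2.0, 3.0, 4.0]

with h5py.File(path, "w") as f:
    f.create_dataset("results", data=records)
    # 3-D dataset where each cell stores a variable-length list of
    # indices into /results.
    idx_t = h5py.vlen_dtype(np.dtype("int64"))
    cells = f.create_dataset("index3d", shape=(2, 2, 2), dtype=idx_t)
    cells[0, 0, 0] = np.array([0, 2])  # cell (0,0,0) references records 0 and 2

# Lookup by 3D coordinate: read the index list, then fetch the records.
with h5py.File(path, "r") as f:
    hits = f["index3d"][0, 0, 0]
    arr = f["results"][...]
    energies = arr["energy"][hits]
```

Here `energies` holds the energies of the records referenced by cell (0,0,0).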

The data will amount to two or three hundred gigabytes.
They will be generated on a computing grid with thousands of nodes, each
one writing to its own file.
I would like to avoid using MPI, but it is not mandatory.

Is there a way to access the data through the second dataset (by passing a
3D coordinate) and retrieve, in a single call, all the matching data in the
first dataset, even though it is spread over the multiple generated files?

I've seen that I can create a wrapper file that stores external links to
each single file and walk all my links to browse my whole data, but I
wonder whether a ready-to-use solution for that already exists...
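The wrapper-file idea could look roughly like this with the h5py Python bindings (file and dataset names are invented for illustration):

```python
# Hypothetical sketch: a wrapper file holding one external link per
# per-node file, so browsing the wrapper reaches all the node data.
import os
import tempfile

import numpy as np
import h5py

tmpdir = tempfile.mkdtemp()

# Two "per-node" files, each with its own results dataset.
for i in range(2):
    with h5py.File(os.path.join(tmpdir, f"node{i}.h5"), "w") as f:
        f.create_dataset("results", data=np.arange(3) + 10 * i)

# A wrapper file containing one external link per node file.
wrapper = os.path.join(tmpdir, "wrapper.h5")
with h5py.File(wrapper, "w") as f:
    for i in range(2):
        f[f"node{i}"] = h5py.ExternalLink(
            os.path.join(tmpdir, f"node{i}.h5"), "/results")

# Opening the wrapper transparently resolves into the per-node files.
with h5py.File(wrapper, "r") as f:
    total = sum(int(f[f"node{i}"][:].sum()) for i in range(2))
```

The external links are resolved lazily when accessed, so the wrapper itself stays tiny regardless of how many node files it points at.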

Thanks a lot in advance,

Yves

I've also looked at the multi virtual file driver, but it seems to separate
storage by kind of data (metadata, raw data, etc.) rather than split the
raw data itself across files.

Is that right?


--
View this message in context: http://hdf-forum.184993.n3.nabble.com/Access-dataset-divided-into-several-files-tp3369462p3370407.html
Sent from the hdf-forum mailing list archive at Nabble.com.

Hello,

  it seems your complex compound datatype is related to points in the
3D dataset? If so, why not store the respective indices of these
points with each individual file? If everything is stored and formulated
in HDF5, then you can "mount" one file on another file, so that multiple
physical (HDF5) files appear as one logical file. See

http://www.hdfgroup.org/HDF5/doc/RM/RM_H5F.html#File-Mount

for documentation on how to mount HDF5 files. You may want to provide
one meta-file that is used to mount all those sub-files, but it should
also be possible to do it symmetrically.
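For Python users, the same H5Fmount call is exposed through h5py's low-level bindings; a minimal sketch (file names invented) might look like this:

```python
# Hypothetical sketch of mounting one HDF5 file onto another so that the
# two physical files appear as one logical file (h5py low-level API).
import os
import tempfile

import numpy as np
import h5py

tmpdir = tempfile.mkdtemp()
child_path = os.path.join(tmpdir, "child.h5")
parent_path = os.path.join(tmpdir, "parent.h5")

with h5py.File(child_path, "w") as f:
    f.create_dataset("data", data=np.array([1, 2, 3]))

parent = h5py.File(parent_path, "w")
parent.create_group("mnt")          # the mount point must already exist
child = h5py.File(child_path, "r")

# Mount child.h5 at /mnt in parent.h5 (wraps H5Fmount).
h5py.h5f.mount(parent.id, b"/mnt", child.id)
values = parent["/mnt/data"][:].tolist()   # resolved through the mount

h5py.h5f.unmount(parent.id, b"/mnt")       # wraps H5Funmount
child.close()
parent.close()
```

The mount exists only while both file handles are open; unlike external links, nothing about it is recorded in either file.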

Regards,

    Werner


On Mon, 26 Sep 2011 21:30:20 +0200, rogez <yves.rogez@obs.ujf-grenoble.fr> wrote:


--
___________________________________________________________________________
Dr. Werner Benger                               Visualization Research
Laboratory for Creative Arts and Technology (LCAT)
Center for Computation & Technology at Louisiana State University (CCT/LSU)
211 Johnston Hall, Baton Rouge, Louisiana 70803
Tel.: +1 225 578 4809                           Fax: +1 225 578 5362

Thank you for your quick reply.

In fact, what I have to store are rays, and indeed I expect to store their
indices in the 3D data structure.
The real issue is that I will produce one file per computing node (there
will be a few thousand computing nodes). Each file will contain one dataset
with the ray definitions and another dataset with the 3D data structure
holding the lists of ray indices per 3D volume (cubes in this case).
I can separate the two datasets into two different files and mount, or
create an external link between, the two. That works for one computing
node. What I really want is to merge the 3D structure data of all nodes
into a single dataset (or a virtually single dataset, if possible). I would
like to do that without merging the datasets from thousands of files, to
avoid a huge post-processing time...
I don't think (but I may have misunderstood something) that mounting files
provides such a feature...
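A virtually single dataset spanning many physical files is exactly what HDF5's virtual dataset (VDS) feature, introduced in HDF5 1.10 and therefore after this thread, provides: one logical dataset whose blocks map onto datasets in many source files, with no data copied. A rough h5py sketch, with invented file names:

```python
# Hypothetical sketch: stitch per-node datasets into one virtual dataset
# (requires HDF5 >= 1.10 and a recent h5py).
import os
import tempfile

import numpy as np
import h5py

tmpdir = tempfile.mkdtemp()
n_nodes, per_node = 3, 4

# Per-node files, as produced independently on each computing node.
for i in range(n_nodes):
    with h5py.File(os.path.join(tmpdir, f"node{i}.h5"), "w") as f:
        f.create_dataset("rays", data=np.full(per_node, i, dtype="f8"))

# One virtual layout mapping row i onto node i's dataset; no data is copied.
layout = h5py.VirtualLayout(shape=(n_nodes, per_node), dtype="f8")
for i in range(n_nodes):
    layout[i] = h5py.VirtualSource(
        os.path.join(tmpdir, f"node{i}.h5"), "rays", shape=(per_node,))

merged = os.path.join(tmpdir, "merged.h5")
with h5py.File(merged, "w", libver="latest") as f:
    f.create_virtual_dataset("all_rays", layout, fillvalue=-1)

# Reading the virtual dataset pulls from the source files transparently.
with h5py.File(merged, "r") as f:
    row_means = f["all_rays"][:].mean(axis=1).tolist()
```

The merged file stores only the mapping, so building it over thousands of node files is cheap compared with physically merging the data.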

Yves

