Learn more about the Hadoop Distributed File System HDF5 Connector (an Enterprise Support add-on)


#1

We’ve recently put together a deep dive into the Hadoop Distributed File System HDF5 Connector, one of the add-on modules that’s included with Enterprise Support. The HDFS HDF5 Connector is a virtual file driver that allows you to use HDF5 command-line tools to extract metadata and raw data from HDF5 and NetCDF4 files on HDFS, and to use Hadoop streaming to collect data from multiple HDF5 files.

At the link below, you’ll find a brief demo video, as well as text and code snippets for the video content. The installation guide and user’s guide are also posted on this page:

https://www.hdfgroup.org/solutions/enterprise-support/hadoop-hdfs-hdf5-connector

We will be developing similar content for some of the other modules currently included with Enterprise Support and will share it as it becomes available. If you have any questions about Enterprise Support, please feel free to get in touch with us.


#2

Cool!

With my training hat on, I am wondering what technologies you used to produce this training video. Was the audio scripted ahead of time? Sheesh, with the wrong font, HDF5 and HDFS are indistinguishable :wink:

Now to the point of the video…

  • How are very large HDF5 files split across HDFS blocks? Does the HDFS VFD handle this splitting?
  • Is the main parallel modality here multiple-file based? Or can you parallelize across a single, very large dataset, for example? Or across many datasets in a single large file?

#3

Mark, the VFD is read-only at the moment. A very large (or small) HDF5 file would get into HDFS via some form of hdfs -[put,copyFromLocal] -Ddfs.block.size=.... (The VFD is not involved in the copy operation.)

Re: parallelism, the HDF5 constraints are the same as with a “regular” file system. You can access datasets in multiple HDF5 files in parallel if the (read) requests come from different processes (JVM instances), which would be the case in a clustered setup. Accesses from multiple threads in the same JVM instance will be serialized, just as they would be in a regular C application. The same applies to a single dataset: you can access it in parallel as long as the requests come from different processes.
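
To make the "different processes" point concrete, here is a minimal Python sketch of the single-dataset case: split the rows of one dataset into contiguous ranges and hand each range to its own worker process. Everything in it is an assumption for illustration. The names row_slices and read_rows are hypothetical, and read_rows is only a stand-in that reports which process got which range; a real job would open the file with an HDF5 library and read that slice there. It is not the connector's API.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def row_slices(n_rows, n_workers):
    """Split [0, n_rows) into n_workers contiguous (start, stop) ranges."""
    base, extra = divmod(n_rows, n_workers)
    slices, start = [], 0
    for i in range(n_workers):
        # The first `extra` workers take one additional row each.
        stop = start + base + (1 if i < extra else 0)
        slices.append((start, stop))
        start = stop
    return slices


def read_rows(bounds):
    """Hypothetical stand-in for reading dataset[start:stop] from an HDF5 file.

    It only returns the worker's process id and the range it was given.
    """
    start, stop = bounds
    return os.getpid(), start, stop


if __name__ == "__main__":
    # Each range is served by a separate process, so the reads are not
    # serialized the way threads inside one process would be.
    with ProcessPoolExecutor(max_workers=3) as pool:
        for pid, start, stop in pool.map(read_rows, row_slices(10, 3)):
            print(f"process {pid} handled rows {start}:{stop}")
```

The same pattern covers the multiple-file case if the work items are file paths instead of row ranges. Swapping the process pool for a thread pool would still run, but per the above the reads would then be serialized.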

Best, G.


#4

It’s not often I get to answer a forum question, but I can answer your question on video production. Gerd Heber created the visual and recorded it with his commentary using GoToMeeting. We got a script by uploading Gerd’s video privately to YouTube and grabbing the transcript. Software developer Jake Smith cleaned up and simplified the script and recorded the voice-over, providing me with an audio file and some instructions on timing. I did a little editing on the video and added Jake’s audio using iMovie.


#5

Thanks for the info. It does sound like there is a bit of a production effort involved there. I might like to chat with you a bit further about this off-line, as I am going to be involved in similar work for other project(s).