Ingestion tools for an HDF5 cluster

Can we use Apache ingestion tools (e.g., Apache Flume, Apache Sqoop) to ingest HDF5 data into a data cluster?
Which ingestion tools are widely used for clusters built around the HDF5 file format?

I’ll be very interested to see how others in the HDF community respond. In terms of the approach that the HDF Group itself takes when we work on large HDF5 repositories of data (i.e., “piles of files”), we have two primary approaches:

  • HDF NoDB: a high-performance SQL query engine that works directly on petabytes of HDF5 files with no ingestion, ETL, or schema creation (it also works on XML, JSON, ASCII, CSV, etc.).
  • Hadoop Virtual File Driver (VFD): similar to our AWS S3 VFD, this allows HDF5 files to work natively within the Apache Hadoop (HDFS) ecosystem. It is currently in development, but we’re interested in beta clients. (For a feel of the in-place access pattern, see the sketch after this list.)

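The Hadoop VFD isn’t released yet, but the existing read-only S3 (ros3) VFD gives a feel for the pattern: HDF5 reads straight out of object storage, with no download or ingestion step. Here’s a minimal sketch using h5py, assuming an HDF5 build with ros3 support enabled; the URL and dataset path are hypothetical placeholders:

```python
import h5py

# Hypothetical public S3 object -- swap in a real bucket/key.
URL = "https://example-bucket.s3.us-east-1.amazonaws.com/sample.h5"

# The ros3 (read-only S3) virtual file driver reads HDF5 directly
# from object storage; only the byte ranges you touch are fetched.
with h5py.File(URL, "r", driver="ros3") as f:
    dset = f["/measurements"]      # hypothetical dataset path
    print(dset.shape, dset.dtype)
    block = dset[0:10, :]          # partial read -- no full download
    print(block.mean())
```

A Hadoop VFD would do the analogous thing over HDFS: the files stay as HDF5, and the driver translates HDF5 I/O calls onto the underlying storage.
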
We generally steer clear of any approach that “unpacks” an HDF5 file for ingestion into a traditional DBMS. You can absolutely do it, but why? You would be taking a binary object that supports rich multi-dimensional schemas, high-performance I/O, and excellent portability and preservation, and jamming it into something less powerful and less flexible. It also doesn’t make sense when petabytes or exabytes of data arrive every day: you will never be able to ingest or ETL fast enough.
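
To make the “less powerful” point concrete, here is a toy sketch (the file name, dataset name, and shapes are all made up): an HDF5 dataset is a chunked n-dimensional array you can subset along any axis in a single read, whereas a flattened row-per-value DBMS layout has to rebuild that structure at query time.

```python
import h5py

# Toy 3-D dataset: 365 daily grids of 1000 x 1000 values.
# (Names and shapes here are invented for illustration.)
with h5py.File("climate.h5", "w") as f:
    f.create_dataset(
        "temperature",
        shape=(365, 1000, 1000),
        dtype="f4",
        chunks=(1, 250, 250),   # chunking enables partial I/O
    )

with h5py.File("climate.h5", "r") as f:
    temp = f["temperature"]
    # One hyperslab read pulls a sub-cube along all three axes;
    # only chunks intersecting the selection are read from disk.
    region = temp[100:107, 400:600, 400:600]
    print(region.shape)         # (7, 200, 200)

# The relational equivalent is a (day, row, col, value) table --
# 365 million rows for this toy file, plus indexes -- just to
# answer the same range query.
```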

My two cents… interested in how others tackle this.

– Dave