Jake, are you using 100%, 80%, 60%, ... of the data that you'd be copying?
If you are using just a fraction (< 20%), copying all those files sounds like a waste.
[OK, I'm peddling HDF5/JDBC server here...]
With HDF5/JDBC server you could:
1. Limit (SELECT) the amount of data to be brought in over the network (see the sketch after this list)
2. With something like Sqoop, you could save the data in any BigData format you like.
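For point 1, here is a minimal sketch of what a pushed-down query from Spark over JDBC could look like; the JDBC URL, driver class, and table/column names below are placeholders, not a real HDF5/JDBC endpoint.

# Sketch only: pull a slice of the data over JDBC so that just the needed
# rows cross the network. URL, driver class, and table name are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hdf5-jdbc-subset").getOrCreate()

# The SELECT is pushed down to the server instead of copying whole files.
query = "(SELECT timestamp, sensor_id, value FROM measurements WHERE sensor_id = 42) AS subset"

df = (spark.read.format("jdbc")
      .option("url", "jdbc:hdf5://example-server:8080/mydata")    # placeholder URL
      .option("driver", "org.example.hdf5.jdbc.Driver")           # placeholder driver class
      .option("dbtable", query)
      .load())

rdd = df.rdd   # plain RDD if the DataFrame is not what you need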
G.
________________________________________
From: Smith, Jacob <J.Smith@questintegrity.com>
Sent: Monday, January 30, 2017 11:19:55 AM
To: Gerd Heber; HDF Users Discussion List
Subject: RE: Azure, DataLake, Spark, Hadoop suggestions....
Gerd,
Thanks for the response! My name is Jake Smith and I'll be working on this cloud solution. Our HDF5 files currently live in Data Lake, and we use a Python Jupyter notebook on Azure HDInsight with a Spark cluster. We want to load the HDF5 data into an H2O frame to build additional models, using Sparkling Water (the integration of H2O and Spark). Since h5py (the Python module) doesn't seem to support querying remote HDF5 files (I'm not sure whether that's a limitation of HDF5 itself or of this Python client), we are wondering whether it is a good idea to download the files to the Spark cluster before transforming them into RDDs.
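For reference, once a file has been copied to the cluster, the pipeline we have in mind looks roughly like the sketch below; the local path, dataset name, and column names are placeholders, and the exact pysparkling calls may differ between Sparkling Water versions.

# Sketch only: read a locally copied HDF5 file with h5py and hand it to
# Spark / Sparkling Water. Assumes a 2-D numeric dataset; names are placeholders.
import h5py
from pyspark.sql import SparkSession
from pysparkling import H2OContext

spark = SparkSession.builder.appName("hdf5-to-h2o").getOrCreate()
hc = H2OContext.getOrCreate(spark)

with h5py.File("/tmp/measurements.h5", "r") as f:         # placeholder local path
    data = f["/measurements"][:]                           # placeholder dataset name

rows = [tuple(float(x) for x in row) for row in data]
df = spark.createDataFrame(rows, schema=["x1", "x2", "x3"])    # assumed column names
h2o_frame = hc.as_h2o_frame(df)   # asH2OFrame(df) in newer Sparkling Water releases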
From: Gerd Heber [mailto:gheber@hdfgroup.org]
Sent: Monday, January 30, 2017 8:59 AM
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Cc: Smith, Jacob <J.Smith@questintegrity.com>
Subject: RE: Azure, DataLake, Spark, Hadoop suggestions....
Jim, do you need bare-bones RDDs or some of the more structured types (Spark DataFrame, Dataset)?
How about loading the data via HDF5/JDBC?
G.
From: Hdf-forum [mailto:hdf-forum-bounces@lists.hdfgroup.org] On Behalf Of Rowe, Jim
Sent: Monday, January 30, 2017 9:23 AM
To: HDF Users Discussion List <hdf-forum@lists.hdfgroup.org>
Cc: Smith, Jacob <J.Smith@questintegrity.com>
Subject: [Hdf-forum] Azure, DataLake, Spark, Hadoop suggestions....
Hello HDF Gurus,
We are doing some machine learning work against HDF5 data (several hundred files, 5-50GB each).
We are looking for others who may have blazed, or are blazing, this trail. We are in Azure, using Microsoft Data Lake storage, and are working on reading the data into RDDs for use in Spark.
We have been working with h5py, but we are running into issues where we cannot access files that Microsoft exposes via the "adl://" URI. Our assumption is that however that scheme is implemented, it does not translate to a filesystem the underlying HDF5 library can read (?). Our best option so far is to copy the files locally (roughly as in the sketch below), which introduces an extra step and delay in the process.
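For concreteness, the local-copy step currently looks roughly like the sketch below, using the azure-datalake-store Python package; the store name, credentials, paths, and dataset name are placeholders.

# Sketch only: copy one file out of Data Lake Store to local disk, then open
# it with h5py, since h5py cannot open adl:// URIs directly.
import h5py
from azure.datalake.store import core, lib, multithread

token = lib.auth(tenant_id="<tenant>", client_id="<app-id>", client_secret="<secret>")
adls = core.AzureDLFileSystem(token, store_name="<datalake-store-name>")

multithread.ADLDownloader(adls, rpath="/data/run01.h5", lpath="/tmp/run01.h5",
                          nthreads=16, overwrite=True)

with h5py.File("/tmp/run01.h5", "r") as f:
    dset = f["/some/dataset"]        # placeholder dataset path
    print(dset.shape, dset.dtype)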
If anyone has suggestions or insights on how to architect a cloud solution as roughly described, we would love to talk to you. We are also potentially looking for some paid consulting help in this area if anyone is interested.
Warm regards,
--Jim