Introducing HDF Kita: HDF5 looks even better in the cloud


#1

Announcing HDF Kita: Cloud optimized HDF5

Big news: we just launched HDF Kita, formerly known as HDF Cloud. HDF Kita is new, cloud-native software built to optimize the way you work with HDF5 data in the cloud. Whether you’re accessing, querying, computing on or collaborating on HDF5 data, HDF Kita changes the game.

It’s cheaper than your current cloud storage methods for HDF5 data. It’s about 500x faster than your existing setup. Most importantly, it’s perfectly compatible with whatever cloud environment you’re working in.

With one-click deployment, a breakthrough is closer than ever before.

Interested in learning more?
Visit our site to learn more about HDF Kita, or get started today with a 30-day free trial of HDF Kita Lab, a JupyterLab-enabled data exploration experience, fully hosted by The HDF Group.

Let us know if you have questions, or any thoughts on your experience with HDF Kita. Email product@hdfgroup.org or start a topic in the HDF Kita category.


#2

Hi all,

Here’s a quick video of how to start your free trial of HDF Kita. It’s a pretty quick process, and you don’t have to put in any financial info to start your trial.

We’d love to have you try out this new service, and let me know about your experiences either by email or in the HDF Kita category.

Thanks!

Lori Cooper
Product Manager
The HDF Group


#3

I’ve been googling around a bit, and I can’t find any simple overview of what Kita is, what it does, how it works …
The only things I can see are filled with marketing-speak. Yes, I gather it’s the best thing that ever happened to mankind. But still.

It’s a bit far-fetched to think I’m going to enroll in something that I don’t even know the basic principles of.


#4

Bert,

I’ve been googling around a bit, and I can’t find any simple overview
of what Kita is, what it does, how it works … Yes, I gather it’s the best
thing that ever happened to mankind. But still.

Thanks for the feedback. Personally, I think sliced bread is the best invention ever… maybe fire. :slight_smile:

As a result of your feedback, we’ve just posted more content and technical details to our website, e.g.

In addition, we’re going to be posting videos and hosting webinars in the coming weeks. Stay tuned.

We’re open to suggestions on the type / format of Kita content that folks are interested in, so please don’t hesitate to reach out.

– Dave Pearah


#5

I gave up after 10 minutes of trying to figure out what it is and what it does. Good to know you can explore a lot and that there are a lot of satisfied customers, and the video shows a lot of cost comparisons - but what good is that for me? Maybe this is something for other people. If only I had an idea on …

Is Kita a new cloud-based database system? If so, how does it operate?

If our users have Kita (I am not a candidate myself, I just make applications), how can I read and write stuff? Is it like all the HDF5 APIs, just that you have another ‘open()’?

Does Kita make it possible to reuse the code I already have? To what extent? Or is all the functionality another dimension that I don’t know of? Am I in any way target audience?


#6

I think your questions are addressed in the Architecture page (LINK), but I admittedly don’t have a fresh pair of eyes on this content. So let’s tackle your questions:

Is Kita a new cloud-based database system?

HDF Kita is a server application intended to facilitate fast/cheap access to large collections of HDF5-encoded data, particularly remotely stored data (e.g. Amazon S3). So if none of those things sound interesting, then Kita might not be for you.

If so, how does it operate?

We’ve created a server application that:

  1. Can slice through HDF5 files – without needing to download the entire file locally – which results in performance and cost benefits.
  2. Allows developers to use a new REST API and command-line interface, or stick with the existing h5py and HDF5 library interfaces (e.g. C/C++).

… how can I read and write stuff? Is it like all the HDF5
APIs, just that you have another ‘open()’?

You can think of it that way… you can use the new APIs or the existing HDF5 library.
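To make the “another open()” idea concrete, here is a minimal sketch rather than official Kita code. It assumes the h5pyd package (the h5py-style Python client for the HDF REST API) is installed and already configured with your server endpoint and credentials. The `hdf5://` prefix convention and the helper names `is_remote`, `to_domain` and `open_hdf5` are made up for illustration.

```python
REMOTE_PREFIX = "hdf5://"  # hypothetical marker for server-side paths


def is_remote(path):
    """Return True if the path should be opened through the server."""
    return path.startswith(REMOTE_PREFIX)


def to_domain(path):
    """Turn 'hdf5://home/bert/data.h5' into the domain path '/home/bert/data.h5'."""
    return "/" + path[len(REMOTE_PREFIX):]


def open_hdf5(path, mode="r"):
    """Open a local file with h5py or a remote domain with h5pyd."""
    if is_remote(path):
        # Imported lazily so local-only use does not need h5pyd installed.
        import h5pyd
        # Assumption: endpoint and credentials come from the h5pyd configuration.
        return h5pyd.File(to_domain(path), mode)
    import h5py
    return h5py.File(path, mode)
```

After the open, code that reads a slice of a dataset from the returned file object is intended to look the same in both cases.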

Does Kita make it possible to reuse the code I already have? To what extent?

Yes. Full compatibility with your current HDF5 code (with a few exceptions).

Or is all the functionality another dimension that I don’t know of?

In addition to basically providing the same functionality as HDF5, there are some additional features since typical Kita users are dealing with collections of HDF5 files (e.g. cross-file management).

Am I in any way target audience?

Maybe? :slight_smile: I’d be happy to connect 1:1 to discuss your current use and see if HDF Kita adds any value: david.pearah@hdfgroup.org

– Dave


#7

“HDF Kita is a server application intended to facilitate fast/cheap access to large collections of HDF5-encoded data, particularly remotely stored data (e.g. Amazon S3).”

This is a key thing. It separates the interested from the uninterested in one sentence. I may have read past it somewhere, but this should be phrase #1 in any ‘hey have you seen this’ communication.

OK. I am probably not interested in Kita, but some of my clients will be.

Now back to me :slight_smile: To get a picture in my head, consider an app. The user has an HDF5 file with data I need, I do some magic, and need to write new data. So I put up a file selector (*.h5, *.hdf, *.hdf5, …), do the open, read data, calculating/displaying stuff, and when the time comes, open another file selector for the resulting data.

What do I need to change to make this scenario cloud-ready? The user can make a local mount but HDF5 should smell out the situation and make a direct connection to the file in the cloud using S3, Azure, … calls. Otherwise, we’re bound to get duplication of data which is a bad idea for huge HDF5 files. Is there a solution for that?


#8

To get a picture in my head, consider an app. The user has an HDF5 file
with data I need, I do some magic, and need to write new data. So I put up
a file selector (*.h5, *.hdf, *.hdf5, …), do the open, read data, calculating/displaying
stuff, and when the time comes, open another file selector for the resulting data.
What do I need to change to make this scenario cloud-ready? The user can make
a local mount but HDF5 should smell out the situation and make a direct connection
to the file in the cloud using S3, Azure, … calls. Otherwise, we’re bound to get
duplication of data which is a bad idea for huge HDF5 files. Is there a solution
for that?

Yes, HDF Kita :slight_smile: Starting with your example, let’s create a hypothetical:

  • Large HDF5 file: 1 GB (or bigger)
  • Data you need to read from this file to do your “magic”: 1 MB
  • Data you then need to write back into this file: 1 MB

With HDF Kita:

  • The library would transparently convert (via the REST VOL) your HDF5 library calls into the corresponding REST API calls for reading and writing
  • The only data movement would be 1 MB download (read) and 1 MB upload (write)… i.e. you don’t need to download + upload the full 1 GB file
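The arithmetic behind that hypothetical can be sketched in a few lines. This is a back-of-the-envelope calculation only: it assumes 1 GB = 1024 MB and ignores request and metadata overhead, and the variable names are just for illustration.

```python
MB = 1
GB = 1024 * MB  # all sizes are expressed in MB

file_size = 1 * GB   # the whole HDF5 file
read_size = 1 * MB   # data needed for the "magic"
write_size = 1 * MB  # data written back afterwards

# Without partial access: download the whole file, modify it, upload it again.
full_copy_traffic = 2 * file_size

# With partial access: only the selected data crosses the network.
partial_traffic = read_size + write_size

print(f"full copy: {full_copy_traffic} MB, partial: {partial_traffic} MB")
print(f"reduction: {full_copy_traffic // partial_traffic}x")
```

The gap grows with file size and shrinks as the fraction of the file you actually touch goes up.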

You can try this out in Kita Lab (free trial) by uploading a “large” file and then trying these types of operations. Hope this helps. Of course, this all presupposes that you have a situation where you need to deal with lots of remote HDF5 data; otherwise if everything is local (e.g. local NFS mount), then HDF Kita isn’t a good fit.

– dave


#9

Ok got that. Multiply your example by 100-1000 BTW, and reading a lot more than a few percent of the data.

I’m not aware of file discovery tools from HDF. I’d need a simple file selector for cloud data. Or, calls to navigate the cloud ‘tree’. Can this be transparent whether on local disk or in a cloud store?

BTW I’ve always had this dream of HDF implementing something that would keep very small ‘link’ files on local disk but then under the hood stores data in nodes/buckets in the cloud. That would mean zero work for me; everything could be driven by setup files and/or environment variables. Just what mounting as local disks would do but then faster, without duplication and a lot cheaper.


#10

Ok got that. Multiply your example by 100-1000 BTW, and reading a
lot more than a few percent of the data.

HDF Kita also improves performance by caching data that you’ve already fetched from remote storage. So even if you’re pulling in a larger subset of data, if your analysis has any locality, the caching will provide a further speed boost. (However, if the use case is to pull in 100% of the data from an HDF5 file and each data point is accessed only once, then HDF Kita can’t help much in terms of speed.)
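A toy illustration of the locality point follows. It is not how Kita's cache is actually implemented: the chunk IDs, cache size and function names are invented, and the "remote fetch" is just a counter.

```python
from functools import lru_cache

remote_fetches = 0  # counts simulated trips to remote storage


@lru_cache(maxsize=128)
def get_chunk(chunk_id):
    """Pretend to fetch one chunk from remote storage; repeats hit the cache."""
    global remote_fetches
    remote_fetches += 1
    return f"data-for-chunk-{chunk_id}"


def run(access_pattern):
    """Replay an access pattern and return how many remote fetches it needed."""
    global remote_fetches
    get_chunk.cache_clear()
    remote_fetches = 0
    for chunk_id in access_pattern:
        get_chunk(chunk_id)
    return remote_fetches


with_locality = run([0, 1, 0, 1, 2, 0, 1, 2])  # revisits the same chunks
single_pass = run(list(range(8)))              # every chunk touched exactly once
print(with_locality, single_pass)
```

Both patterns make eight accesses, but the one that revisits chunks goes to remote storage only once per distinct chunk, while the single pass gets no benefit from the cache.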

I’m not aware of file discovery tools from HDF. I’d need a simple
file selector for cloud data. Or, calls to navigate the cloud ‘tree’.
Can this be transparent whether on local disk or in a cloud store?
BTW I’ve always had this dream of HDF implementing something
that would keep very small ‘link’ files on local disk but then under
the hood stores data in nodes/buckets in the cloud.

Check out the NREL data set and Python notebook in HDF Kita Lab. We basically did what you’re suggesting: we took many different HDF5 files and made them appear as a single HDF5 “file” that you can navigate seamlessly.

– Dave