How efficient is HDF5 for data retrieval as opposed to data storage?


#1

I would like to store a keyed 500GB table in HDF5, and then retrieve rows matching specific keys.

In an HDF5 file, all data access uses an integer "row" number, so it seems I would have to implement a key-to-row-number map outside of HDF5.

Isn’t retrieval more efficient with a distributed system like Spark which uses HDFS?


#2

Key/value store in HDF5: depending on the underlying datatype, you can store your typed data in hypercubes, or use the OPAQUE datatype and get behavior similar to an object store[^1]. For the KEY component I suggest a hash function: MurmurHash3 may be a choice, and when dealing with a static dataset, perfect hashing is an option. An associative container stores the KEY/index pairs, and the index is what you use to retrieve rows from the dataset. The chunk size is a function of the access pattern; picking the right strategy can get you near the throughput of the underlying filesystem.
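
To make the idea concrete, here is a minimal sketch of a key-to-row lookup followed by a single-row hyperslab read. The file name `store.h5`, the dataset `/table`, the record layout, and the use of a `std::unordered_map` as the associative container are all illustrative assumptions; the same pattern can be expressed with H5CPP or, as here, the plain HDF5 C API.

```cpp
// Sketch: key -> row index lookup, then a single-row hyperslab read.
// Assumes a 1-D dataset "/table" of fixed-width records in "store.h5";
// the file name, dataset name, and record type are illustrative only.
#include <hdf5.h>
#include <cstdint>
#include <string>
#include <unordered_map>

struct record_t {                 // example fixed-width record
    double value;
    std::int64_t timestamp;
};

int main() {
    // associative container: KEY -> row index, built when the table was written
    std::unordered_map<std::string, hsize_t> index = {
        {"key-0", 0}, {"key-1", 1} /* ... */
    };

    hid_t fd   = H5Fopen("store.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen(fd, "/table", H5P_DEFAULT);

    // compound memory type matching record_t
    hid_t mt = H5Tcreate(H5T_COMPOUND, sizeof(record_t));
    H5Tinsert(mt, "value",     HOFFSET(record_t, value),     H5T_NATIVE_DOUBLE);
    H5Tinsert(mt, "timestamp", HOFFSET(record_t, timestamp), H5T_NATIVE_INT64);

    // select the single row that belongs to the requested key
    hsize_t row = index.at("key-1");
    hid_t file_space = H5Dget_space(dset);
    hsize_t start[1] = {row}, count[1] = {1};
    H5Sselect_hyperslab(file_space, H5S_SELECT_SET, start, nullptr, count, nullptr);
    hid_t mem_space = H5Screate_simple(1, count, nullptr);

    record_t rec{};
    H5Dread(dset, mt, mem_space, file_space, H5P_DEFAULT, &rec);

    H5Sclose(mem_space); H5Sclose(file_space);
    H5Tclose(mt); H5Dclose(dset); H5Fclose(fd);
    return 0;
}
```

With a chunk size matched to the access pattern (for point lookups, small chunks close to one record), each such read touches only the chunks that contain the selected row.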

If C++ is an option for you, you might be interested in checking out H5CPP, a project that provides seamless, non-intrusive persistence for linear algebra and some STL containers.
Currently std::map and Abseil containers are not supported, but they are on the horizon. Gerd Heber and I put effort into the new, improved documentation, which you can check out here. Keep in mind that the documentation may describe features I have not yet committed to the main branch.

If you are interested in the project and have questions, or need help with the example, feel free to shoot us a line.
Here are some hints: store the keys in a std::vector<std::string> in row order, then keep them in memory. Their position encodes the row location; use this to recreate your hash map, as in the sketch below.
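
A small sketch of that hint, assuming the keys were written in row order; `load_keys()` is a stand-in for however you read the key dataset back (H5CPP or the C API), and the sample keys are illustrative only.

```cpp
// Rebuild the in-memory hash map from the ordered key vector:
// the position of each key in the vector is its row index.
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for reading the ordered key dataset back from the HDF5 file.
std::vector<std::string> load_keys() {
    return {"alpha", "bravo", "charlie"};      // illustrative keys only
}

std::unordered_map<std::string, std::size_t> rebuild_index() {
    std::vector<std::string> keys = load_keys();
    std::unordered_map<std::string, std::size_t> index;
    index.reserve(keys.size());
    for (std::size_t row = 0; row < keys.size(); ++row)
        index.emplace(keys[row], row);         // position encodes the row number
    return index;
}

int main() {
    auto index = rebuild_index();
    std::cout << "row of 'bravo': " << index.at("bravo") << "\n";   // prints 1
}
```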

This C++ library runs on serial and parallel filesystems, and if you are interested in the performance difference between Hadoop and PVFS, a study has been done. If I recall correctly, OrangeFS clocked in about 30% above Hadoop on the same platform. I personally measured 3 GB/sec per node on an AWS EC2-based cluster with a 25 Gbit/sec fabric, up to 100 GB/sec aggregate throughput.

As for Hadoop or Spark: they are not more efficient, only different. In complex distributed systems the throughput may be limited by the interconnect, the available computational power, and memory.
For example, the CPU may be used to compress data in order to use the available bandwidth better, at the cost of increased computation. In other words, you are always moving along a trade-off curve; by casting it as a convex optimization problem and solving it, you can answer the question of efficiency for a given set of parameters.

[^1]: Actually you can take it a notch higher by writing data directly into chunks, at the cost of either having to provide a custom filter for other HDF5-based systems, or only you being able to read the data back.

Hope it helps,
steven