How efficient is HDF5 for data retrieval as opposed to data storage?


#1

I would like to load a keyed 500GB table into HDF5, and then retrieve rows matching specific keys.

For an HDF5 file, all data access uses an integer “row” number, so it seems like I would have to implement a “key to row number” map outside of HDF5.

Isn’t retrieval more efficient with a distributed system like Spark, which uses HDFS?


#2

Key / Value store in HDF5: depending on the underlying datatype, you can store your typed data in hypercubes, or use the OPAQUE datatype and get behavior similar to an object store[^1]. For the KEY component I suggest using a hash function; murmur3 may be a choice, but when dealing with a static dataset, perfect hashing can be an option. The associative container stores the KEY / index pairs, and the index tells you where to retrieve the record from the dataset. The chunk size is a function of the access pattern; picking the right strategy can get you near the throughput of the underlying filesystem.
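
To make the layout concrete, here is a minimal sketch using the plain HDF5 C API (the same idea carries over to H5CPP); the file name, record size, row count, and the 4096-records-per-chunk figure are illustrative assumptions to be tuned to your own access pattern:

```cpp
#include <hdf5.h>

// Minimal sketch: fixed-size 64-byte records stored as an HDF5 OPAQUE type in a
// single 1-D dataset, chunked so that a keyed single-record read touches one chunk.
// All names and sizes below are placeholders, not a prescribed schema.
int main() {
    const hsize_t n_records     = 1000000;  // total rows (placeholder)
    const size_t  record_size   = 64;       // bytes per record (placeholder)
    const hsize_t chunk_records = 4096;     // tune to your access pattern

    hid_t file = H5Fcreate("kv.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

    // OPAQUE datatype: HDF5 treats each record as an uninterpreted blob
    hid_t dtype = H5Tcreate(H5T_OPAQUE, record_size);
    H5Tset_tag(dtype, "application/x-my-record");

    hsize_t dims[1]  = { n_records };
    hsize_t chunk[1] = { chunk_records };
    hid_t space = H5Screate_simple(1, dims, nullptr);

    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 1, chunk);           // chunked layout sets the I/O granularity

    hid_t dset = H5Dcreate2(file, "values", dtype, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);

    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Tclose(dtype); H5Fclose(file);
    return 0;
}
```

With a chunked layout, reading one record only touches the chunk that contains it, which is what keeps keyed lookups cheap once the index is known.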

If C++ is an option for you, you might be interested in checking out H5CPP, a project that provides seamless, non-intrusive persistence for linear algebra and some STL containers.
Currently std::map and abseil containers are not supported, but they are on the horizon. Gerd Heber and I put effort into the new, improved documentation, which you can check out here. Keep in mind this documentation may contain features I have not yet committed to the main branch.

If you are interested in the project and have questions, or need help with the example, feel free to drop us a line.
Here are some hints: store the keys in a std::vector<std::string> in order, and keep them in memory. Each key’s position encodes its location in the dataset; use this to recreate your hash map.
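
As a minimal sketch of that hint, assuming the ordered key vector has already been read from the file into memory (the function name build_index is just for illustration):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// The keys are kept in dataset order in a std::vector<std::string>; each key's
// position in that vector is its row index in the HDF5 dataset, so the lookup
// map can be rebuilt from the vector alone every time the file is opened.
std::unordered_map<std::string, std::size_t>
build_index(const std::vector<std::string>& keys) {
    std::unordered_map<std::string, std::size_t> index;
    index.reserve(keys.size());
    for (std::size_t row = 0; row < keys.size(); ++row)
        index.emplace(keys[row], row);   // key -> row number
    return index;
}

// usage: auto idx = build_index(keys);  std::size_t row = idx.at("some-key");
```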

This C++ library runs on serial and parallel filesystems, and if you are interested in the performance difference between Hadoop and PVFS, there is a study on it. If I recall correctly, OrangeFS came in 30% above Hadoop on the same platform. I personally measured 3GB/sec per node on an AWS EC2 based cluster with a 25MB/sec fabric, up to 100GB/sec throughput.

As for Hadoop or Spark: they are not more efficient, only different. With complex distributed systems the throughput may be limited by the interconnect, the available computational power, and memory.
For example, the CPU may be used to compress data so the available bandwidth is used better, at the cost of increased computation. In other words, you are always moving on a trade-off curve; casting it into a convex problem and solving it lets you answer the question of efficiency for a given set of parameters.

[^1]: actually you can take it a notch higher by storing data directly in chunks, at the cost of either having to provide a custom filter to other HDF5-based systems, or being the only one able to read the data back

Hope it helps,
steven


#3

Yes, using HDFS can be more efficient.
But the best way to make it more efficient is to use Hive over HDFS, because MapReduce does not provide any default way to partition data by key. If you want to separate data by key, you have to code it yourself.


#4

The Industry Foundation Classes (IFC) are a prevalent data model in which Building Information Models can be exchanged, typically in a file-based manner. Processing the full extent of these models can be time-consuming. Considering the multi-disciplinary nature of the construction industry, stakeholders will typically only be interested in a small subset, depending on the purpose of the exchange.



#6

A “keyed 500GB table” does not say how many rows you have in your table… That information would help in picking the best option. For instance, you might organize your file with two datasets, one for the keys and one for the rows. If the keys fit in memory, then load them into an unordered_map (the choice of hash function depends on the key) when opening the file. The map will then serve as your lookup table for the values. If the keys cannot fit in memory, then maybe a DBMS such as SQLite is a better solution, e.g. with a side-car file for the indexes. A file that you can always rebuild from the data if it’s missing.
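
To sketch the retrieval side of that two-dataset layout with the HDF5 C API: the example below assumes a 2-D double dataset named "rows" and a row index already obtained from the in-memory key map; the file name, dataset name, shape, and element type are assumptions, not part of any fixed schema.

```cpp
#include <hdf5.h>
#include <vector>

// Read a single row from the 2-D "rows" dataset via a hyperslab selection,
// so only the chunks containing that row are touched on disk.
std::vector<double> read_row(hid_t file, hsize_t row, hsize_t n_cols) {
    hid_t dset   = H5Dopen2(file, "rows", H5P_DEFAULT);
    hid_t fspace = H5Dget_space(dset);

    hsize_t start[2] = { row, 0 };
    hsize_t count[2] = { 1, n_cols };
    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, nullptr, count, nullptr);

    hsize_t mdims[1] = { n_cols };
    hid_t mspace = H5Screate_simple(1, mdims, nullptr);

    std::vector<double> out(n_cols);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, out.data());

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
    return out;
}

// usage: hid_t f = H5Fopen("table.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
//        auto record = read_row(f, index.at("some-key"), n_cols);
```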