I would like to dump a keyed 500GB table into HDF5, and then retrieve rows matching specific keys.
For an HDF5 file, all data access seems to use an integer “row” number, so it seems I would have to implement a “key to row number” map outside of HDF5.
Isn’t retrieval more efficient with a distributed system like Spark, which uses HDFS?
Key/value store in HDF5: depending on the underlying datatype, you can choose to store your typed data in hypercubes, or use the OPAQUE datatype and get behavior similar to object stores[^1]. For the KEY component I suggest a hash function; murmur3 may be a choice, but when dealing with a static dataset, perfect hashing can be an option. The associative container stores the KEY/index pairs, and the index tells you which row to retrieve from the dataset. Setting the chunk size is a function of the access pattern; picking the right strategy can get you near the throughput of the underlying filesystem.
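A minimal sketch of that pattern, assuming a chunked 2-D dataset of doubles named `"values"` and a plain `std::unordered_map` (with its default `std::hash`) as the associative container; the dataset name, column count, and hashing choice are illustrative assumptions, not anything prescribed by HDF5:

```cpp
// Sketch: in-memory KEY -> row-index map plus a single-row read by hyperslab,
// using the plain HDF5 C API. "values", NCOLS and std::hash are assumptions.
#include <hdf5.h>
#include <string>
#include <unordered_map>
#include <vector>

constexpr hsize_t NCOLS = 16;                        // assumed columns per row

// associative container: KEY -> row index inside the HDF5 dataset
using key_index_t = std::unordered_map<std::string, hsize_t>;

// read one row of the chunked 2-D dataset "values" by its row index
std::vector<double> read_row(hid_t file, hsize_t row) {
    hid_t   dset   = H5Dopen(file, "values", H5P_DEFAULT);
    hid_t   fspace = H5Dget_space(dset);
    hsize_t offset[2] = {row, 0}, count[2] = {1, NCOLS};

    H5Sselect_hyperslab(fspace, H5S_SELECT_SET, offset, nullptr, count, nullptr);
    hid_t mspace = H5Screate_simple(2, count, nullptr);

    std::vector<double> buf(NCOLS);
    H5Dread(dset, H5T_NATIVE_DOUBLE, mspace, fspace, H5P_DEFAULT, buf.data());

    H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset);
    return buf;
}

// the lookup itself: key -> index -> row
std::vector<double> lookup(hid_t file, const key_index_t& index, const std::string& key) {
    return read_row(file, index.at(key));
}
```

With the chunk size matched to the access pattern (for point lookups, a chunk spanning only a few rows), each lookup touches a single chunk.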
If C++ is an option for you, you might be interested in checking out H5CPP, a project that provides seamless, non-intrusive persistence for linear algebra and some STL containers.
Currently std::map and the abseil containers are not supported, but they are on the horizon. Gerd Heber and I put effort into the new, improved documentation, which you can check out here. Keep in mind this documentation may describe features I have not yet committed to the main branch.
If you are interested in the project and have questions, or need help with the example, feel free to shoot us a line. Here are some hints: store the keys in a std::vector<std::string> in order, then keep them in memory; their position encodes the row location, and you can use this to recreate your hash map, as in the sketch below.
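A sketch of that hint in plain C++, with no H5CPP-specific calls, since how you persist the key vector is up to you:

```cpp
// Sketch: the position of each key inside the ordered std::vector<std::string>
// is the row index, so the hash map is rebuilt in memory after loading the keys.
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

std::unordered_map<std::string, std::size_t>
rebuild_index(const std::vector<std::string>& keys) {
    std::unordered_map<std::string, std::size_t> index;
    index.reserve(keys.size());
    for (std::size_t row = 0; row < keys.size(); ++row)
        index.emplace(keys[row], row);        // position encodes the row number
    return index;
}
```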
This C++ library runs on serial and parallel filesystems, and if you are interested in the performance difference between Hadoop and PVFS, there is a study on that; if I recall correctly, OrangeFS clocked in about 30% above Hadoop on the same platform. I personally measured 3GB/sec per node on an AWS EC2 based cluster with a 25Gb/sec fabric, up to 100GB/sec throughput.
As for Hadoop or Spark: they are not more efficient, only different. With complex distributed systems the throughput may be limited by the interconnect, the available computational power, and memory.
For example, the CPU may be used to compress data in order to use the available bandwidth better, at the cost of increased computation. In other words, you are always moving along a trade-off curve; by casting it into a convex problem and solving it, you can answer the question of efficiency for a given set of parameters.
[^1]: Actually, you can take it a notch higher by storing data directly in chunks, at the cost that you either have to provide a custom filter to other HDF5-based systems or only you can read the data back.
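For the direct-chunk route mentioned in the footnote, a minimal sketch, assuming a recent HDF5 (1.10.3+) and a dataset created without filters; the chunk geometry and the packing scheme are your own assumptions, and any reader must know them to interpret the data:

```cpp
// Sketch: write a pre-packed buffer straight into one chunk, bypassing the
// HDF5 filter pipeline. The offset must land on a chunk boundary.
#include <hdf5.h>
#include <vector>

constexpr hsize_t CHUNK_ROWS = 1024;                 // assumed chunk height

void write_raw_chunk(hid_t dset, hsize_t chunk_index,
                     const std::vector<unsigned char>& packed) {
    hsize_t offset[2] = {chunk_index * CHUNK_ROWS, 0};   // chunk-aligned offset
    H5Dwrite_chunk(dset, H5P_DEFAULT, 0 /* filter mask: none defined */,
                   offset, packed.size(), packed.data());
}
```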
A “keyed 500GB table” does not say how many rows you have in your table… This info could help in finding the best option for you. For instance, you might organize your file with two datasets, one for the keys and one for the rows. If the keys can fit in memory, then load the keys into an unordered_map (the choice of the hash function depends on the key) when opening the file. The map will then serve as your lookup table for the values. If the keys cannot fit in memory, then maybe a DBMS such as SQLite is a better solution, e.g. with a side-car file for the indexes: a file that you can always rebuild from the data if it’s missing.
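A minimal sketch of the side-car idea, assuming a small SQLite file holding a `key_index(key TEXT PRIMARY KEY, row INTEGER)` table; the table name and schema are illustrative assumptions only:

```cpp
// Sketch: look a key up in the SQLite side-car index and return the HDF5
// row number, if present. The index file can be rebuilt from the "keys"
// dataset at any time.
#include <sqlite3.h>
#include <cstdint>
#include <optional>
#include <string>

std::optional<int64_t> find_row(sqlite3* db, const std::string& key) {
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "SELECT row FROM key_index WHERE key = ?1",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, key.c_str(), -1, SQLITE_TRANSIENT);

    std::optional<int64_t> row;
    if (sqlite3_step(stmt) == SQLITE_ROW)
        row = sqlite3_column_int64(stmt, 0);

    sqlite3_finalize(stmt);
    return row;
}
```

The returned row number then feeds the usual hyperslab read of the values dataset.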
> I would like to dump a keyed 500GB table into HDF5, and then retrieve rows matching specific keys. For an HDF5 file, all data access seems to use an integer “row” number, so it seems I would have to implement a “key to row number” map outside of HDF5. Isn’t retrieval more efficient with a distributed system like Hadoop or Spark, which uses HDFS? Should I be using a distributed system to implement the map/hash function?
This is perhaps a rather open-ended question. How many columns does the table have and what are the types of those columns? How compressible is the data? How many keys did you have in mind and how selective are they? How much time and space are you willing to invest in index creation and maintenance? Is the table write-once-read-many or will there be updates? What does your deployment look like (storage, number of clients, etc.)?