I want to be clear up front that HDF5 (the data model) has no concept of an associative array or matrix. (There is the concept of an HDF5 map, but this is supported only on certain storage types, such as Intel DAOS.) Hence, what I’m describing is a layout strategy for persisting an associative matrix in HDF5 using HDF5 primitives. The HDF5 API has no concept of associative structures either, so you’ll need to build your own API that provides the associative abstraction on top of this persistence layer. OK?
It’ll be an iterative process, but fundamentally, it’s a divide-and-conquer strategy. There might be hierarchies associated with the row or column namespaces that we could exploit, but I’m not going to make any assumptions about that for now. I will assume that there is no preferred direction (row or column), i.e., it’s equally likely that we will start from documents and move to words or vice versa. The obvious hierarchy is to block the matrix recursively with two goals:
- Create right-sized partitions for multi-processing or multi-threading
- Calibrate the density of the storage structures
In addition to that, I would maintain two “global” label sets, i.e., structures that map row and column IDs to labels. The mapping from labels to row/column IDs will exist only in memory (and is constructed at program/service start). These label sets are very simple: each set is represented as two HDF5 datasets (four in total). One dataset, [row,col]_labels, is just a large byte array, the concatenation of the respective labels (strings) in row or column order. The other dataset, [row,col]_offsets, stores the offsets at which a given label string begins in [row,col]_labels. (You can check how compressible these datasets are, but I wouldn’t worry about it early on.) The picture so far is this:
/row_labels -> concatenation of all row labels (in row order)
/row_offsets -> offsets into /row_labels
/col_labels -> concatenation of all column labels (in column order)
/col_offsets -> offsets into /col_labels
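To make the label sets concrete, here is a minimal h5py sketch. The dataset names match the layout above; the UTF-8 encoding, the uint64 offsets, and the trailing sentinel offset are my choices, not requirements:

```python
import h5py
import numpy as np

def write_label_set(f, name, labels):
    """Store labels as one concatenated byte blob plus a start-offset table."""
    encoded = [s.encode("utf-8") for s in labels]        # assumption: UTF-8
    blob = b"".join(encoded)
    f.create_dataset(f"/{name}_labels", data=np.frombuffer(blob, dtype=np.uint8))
    # N+1 offsets: offset[i] is where label i starts; the extra sentinel at the
    # end makes label lengths implicit (a design choice, not a requirement).
    starts = np.cumsum([0] + [len(b) for b in encoded], dtype=np.uint64)
    f.create_dataset(f"/{name}_offsets", data=starts)

def read_label_map(f, name):
    """Rebuild the in-memory label -> ID map at program/service start."""
    blob = f[f"/{name}_labels"][...].tobytes()
    off = f[f"/{name}_offsets"][...]
    return {blob[off[i]:off[i + 1]].decode("utf-8"): i for i in range(len(off) - 1)}

with h5py.File("assoc.h5", "w") as f:
    write_label_set(f, "row", ["doc1", "doc2"])
    write_label_set(f, "col", ["alpha", "beta", "gamma"])
    print(read_label_map(f, "col")["beta"])              # -> 1
```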
Partition time! In the simplest case, we will block our matrix into R row blocks of size rb and C column blocks of size cb. We could also use multiple nested levels of partitioning, of depth l.
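For a single level, the bookkeeping is plain integer arithmetic; a sketch (the locate helper and its name are mine, for illustration):

```python
def locate(i, j, rb, cb):
    """Map a global (row, col) index to block coordinates and local offsets."""
    return (i // rb, j // cb), (i % rb, j % cb)

# With rb = cb = 1000, global cell (2500, 17) lives in block (2, 0)
# at local position (500, 17) -- i.e., under a group like /partition2_0.
(bi, bj), (li, lj) = locate(2500, 17, 1000, 1000)
```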
What makes good values for these parameters depends on how many workers (processes, threads, or both) we are aiming for and, more importantly, how dense and what logical size we want these blocks to be. Here we also need to take into account whether we are dealing with a “very sparse” or a “relatively dense” case. In the very sparse case, we use a standard compressed sparse layout (row, column, coordinate, etc.), whereas in the relatively dense case, we won’t agonize over zeros and will treat the block as dense (and maybe compress it: GZIP, etc.).
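To illustrate the two regimes, here is a sketch that picks a representation per block using a density threshold; the 10% cutoff, the attribute names, and the use of SciPy are assumptions for illustration:

```python
import h5py
import numpy as np
from scipy import sparse

def write_block(parent, name, block, threshold=0.1):
    """Persist one partition as CSR triplets if very sparse, dense otherwise."""
    g = parent.create_group(name)
    g.attrs["shape"] = block.shape
    if np.count_nonzero(block) / block.size < threshold:  # "very sparse" case
        csr = sparse.csr_matrix(block)
        g.create_dataset("data", data=csr.data)
        g.create_dataset("indices", data=csr.indices)
        g.create_dataset("indptr", data=csr.indptr)
        g.attrs["kind"] = "csr"
    else:                                                 # "relatively dense" case
        g.create_dataset("dense", data=block, compression="gzip")
        g.attrs["kind"] = "dense"
```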
Because we assume there is no preferred direction, we can either 1) store one preferred representation (row-compressed or column-compressed) and transpose it on the fly, or 2) store both representations for faster access at the cost of a greater footprint. The good news is that, at the leaf level, we can have different representations for different partitions, depending on density.
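Option 1 is cheap to sketch if blocks are stored as CSR triplets, as in the previous snippet (again, the paths and names are illustrative):

```python
def read_csr(g):
    """Rebuild a SciPy CSR matrix from the triplet datasets written above."""
    return sparse.csr_matrix(
        (g["data"][...], g["indices"][...], g["indptr"][...]),
        shape=tuple(g.attrs["shape"]))

# Option 1: one stored orientation, converted on demand for column access.
block = read_csr(f["partition0_0/representation0_0"])
by_column = block.tocsc()
```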
In the simplest case, the picture could look like this:
/row_labels
/row_offsets
/col_labels
/col_offsets
/partition0_0/representation0_0/...
/partition0_1/representation0_1/...
...
In this picture, representation0_0 could be an HDF5 group storing the row- and column-compressed representations of that block, whereas representation0_1 could be a dense 2D array (decorated with the right kind of metadata) because that block happens to be on the denser side and we opted for transpose-on-the-fly.
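The “right kind of metadata” can be as little as a single attribute that the reader dispatches on; continuing the sketches above:

```python
def read_block(g):
    """Return a block in whatever form its metadata announces."""
    kind = g.attrs["kind"]
    if kind == "csr":
        return read_csr(g)       # sparse leaf (see sketch above)
    if kind == "dense":
        return g["dense"][...]   # HDF5 decompresses the gzip'd chunks transparently
    raise ValueError(f"unknown block kind: {kind}")
```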
Again, if your labels have a hierarchical structure, you could fold that into the group structure, but that’s speculation.
There are many other HDF5 layout options to explore, but it should be clear that HDF5 is not going to do the work for you in terms of building an access API around this. A layout is not an API. HDF5 doesn’t have a concept of an associative array or matrix.
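That said, the associative facade on top of this layout can start out quite thin. A sketch that ties the earlier helpers together (all of it illustrative, none of it HDF5 API):

```python
class AssocMatrix:
    """Label-addressed lookups layered over the HDF5 layout sketched above."""
    def __init__(self, f, rb, cb):
        self.f, self.rb, self.cb = f, rb, cb
        self.rows = read_label_map(f, "row")   # label -> row ID, built at start
        self.cols = read_label_map(f, "col")   # label -> column ID

    def __getitem__(self, key):
        row_label, col_label = key
        i, j = self.rows[row_label], self.cols[col_label]
        (bi, bj), (li, lj) = locate(i, j, self.rb, self.cb)
        g = self.f[f"partition{bi}_{bj}/representation{bi}_{bj}"]
        return read_block(g)[li, lj]

# m = AssocMatrix(h5py.File("assoc.h5", "r"), rb=1000, cb=1000)
# m["doc1", "beta"]
```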
Does that help?
G.