Implementing an associative matrix in HDF

I have a use case that is common in information/data science and AI. I’d like to use HDF to store an associative array, which is essentially a (float or integer) matrix, with some sort of code for each row and column. In the case of a term-document matrix (a “back of the book index”), the rows might each represent document IDs (“doc1234”, “doc9876”) and the columns English words. I need to:

  • Very efficiently map a row or column index to its key (e.g. row 123 corresponds to the document key “doc314159”) and
  • equally efficiently do the reverse (the column for “mother” is 8145).

In other use cases, the “row key” and “column key” would be different things (and could be almost anything).

What would be best practice for storing such associations?


Scott, what are typical counts for rows and columns? How sparse is the underlying matrix? Is this a read-only structure? G.

Great questions.

  1. Sizes typically range from 100 to 100,000 rows/columns.
  2. The matrices are typically either very sparse (<1%) or relatively dense (e.g. a similarity matrix where the row and column keys are the same).
  3. The structures are almost always write-once, read-many.

I want to be clear up front that HDF5 (the data model) has no concept of an associative array or matrix. (There is the concept of an HDF5 map, but this is supported only on certain storage types such as Intel DAOS.) Hence, what I’m describing is a layout strategy for persisting an associative matrix in HDF5 using HDF5 primitives. The HDF5 API has no concept of associative structures, and you’ll need to build your own API that provides the associative abstraction on top of this persistence layer. OK?

It’ll be an iterative process, but fundamentally, it’s a divide-and-conquer strategy. There might be hierarchies associated with the row or column namespaces that we could exploit, but I’m not going to make any assumptions about that for now. I will assume that there is no preferred direction (row or column), i.e., it’s equally likely that we will start from documents and move to words or vice versa. The obvious hierarchy is to block the matrix recursively with two goals:

  1. Create the right size partitions for multi-processing or multi-threading
  2. Calibrate the density of the storage structures.

In addition to that, I would maintain two “global” label sets, i.e., structures that map row and column IDs to labels. The mapping from labels to row/column IDs will only exist in memory (and is constructed at program/service start). These label sets are very simple. Each set is represented as two HDF5 datasets (four in total). One dataset, [row,col]_labels, is just a large byte array containing the concatenation of the respective labels (strings) in row or column order. The other dataset, [row,col]_offsets, stores the offset at which each label string begins in [row,col]_labels. (You can check how compressible these datasets are, but I wouldn’t worry about it early on.) The picture so far is this:

/row_labels -> concatenation of all row labels (in row order)
/row_offsets -> offsets into /row_labels
/col_labels -> concatenation of all column labels (in column order)
/col_offsets -> offsets into /col_labels
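
To make this concrete, here is a minimal sketch of these label sets in Python, assuming h5py and NumPy are available. The helper names and the file name are made up, and I store one extra trailing offset so that label i simply spans offsets[i]:offsets[i+1]:

```python
import numpy as np
import h5py

# Hypothetical helper: persist one label set ("row" or "col") as the two
# datasets described above -- a concatenated byte array plus start offsets.
def write_label_set(f, prefix, labels):
    blobs = [s.encode("utf-8") for s in labels]
    offsets = np.zeros(len(blobs) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(b) for b in blobs])  # label i spans offsets[i]:offsets[i+1]
    f.create_dataset(f"{prefix}_labels", data=np.frombuffer(b"".join(blobs), dtype=np.uint8))
    f.create_dataset(f"{prefix}_offsets", data=offsets)

# At program/service start: read both datasets once and build the in-memory
# label -> index dictionary (the reverse mapping lives only in RAM).
def load_label_index(f, prefix):
    raw = bytes(f[f"{prefix}_labels"][...])
    off = f[f"{prefix}_offsets"][...]
    labels = [raw[off[i]:off[i + 1]].decode("utf-8") for i in range(len(off) - 1)]
    return labels, {s: i for i, s in enumerate(labels)}

with h5py.File("assoc.h5", "w") as f:              # file name is just an example
    write_label_set(f, "row", ["doc1234", "doc9876", "doc314159"])
    write_label_set(f, "col", ["mother", "father"])

with h5py.File("assoc.h5", "r") as f:
    row_labels, row_index = load_label_index(f, "row")
    print(row_labels[2], row_index["doc314159"])   # doc314159 2
```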

Partition time! In the simplest case, we will block our matrix into R row blocks of size rb and C column blocks of size cb. We could also use multiple nested levels of partitions of depth l.

What makes good values for these parameters depends on how many workers (processes, threads, or both) we are aiming for, and, more importantly, how dense and what logical size we want these blocks to be. Here we also need to take into account whether we are dealing with a “very sparse” or a “relatively dense” case. In the very sparse case, we use a standard compressed sparse layout (row or column, or coordinate, etc.), whereas in the relatively dense case, we won’t agonize over zeros and treat it as dense (and maybe compress: GZIP, etc.).

Because we assume there is no preferred direction, we can either 1) store a preferred representation (row-compressed or column-compressed) and transpose it on the fly, or 2) store both preferred directions and have faster access but a greater footprint. The good news is that at the leaf level, we can have different representations for different partitions, depending on density.
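
As a rough illustration of option 1), assuming SciPy is available: given the three CSR arrays read back from a block, the column-compressed view can be derived on demand (the tiny arrays below are made up for the example):

```python
import numpy as np
from scipy.sparse import csr_matrix

# Made-up arrays standing in for a block read from storage in row-compressed (CSR) form.
data    = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])   # column index of each stored value
indptr  = np.array([0, 2, 3])   # row i owns data[indptr[i]:indptr[i+1]]

block = csr_matrix((data, indices, indptr), shape=(2, 3))

# Option 1): keep only the row-compressed copy on disk and derive the
# column-compressed view when a column-oriented query arrives.
block_csc = block.tocsc()
print(block_csc[:, 1].toarray().ravel())   # -> [0. 3.], i.e., column 1 of the block
```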

In the simplest case, the picture could look like this:

/row_labels
/row_offsets
/col_labels
/col_offsets
/partition0_0/representation0_0/...
/partition0_1/representation0_1
...

In this picture, representation0_0 could be an HDF5 group storing the row- and column-compressed representations of that block whereas representation0_1 could be a dense 2D array (decorated with the right kind of metadata) because that block happens to be on the denser side and we opted for transpose-on-the-fly.
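
Here is a hedged sketch of such a leaf writer/reader, again in Python with h5py and SciPy (write_block, read_block, the "representation" attribute, and the density threshold are all choices I made up for the example, not anything HDF5 prescribes):

```python
import numpy as np
import h5py
from scipy.sparse import csr_matrix, random as sparse_random

# Hypothetical writer: persist one block as CSR triplets if it is sparse,
# or as a plain 2D dataset if it is dense, and tag the group accordingly.
def write_block(f, name, block, density_threshold=0.25):
    g = f.create_group(name)
    if np.count_nonzero(block) / block.size < density_threshold:
        m = csr_matrix(block)
        g.attrs["representation"] = "csr"
        g.attrs["shape"] = block.shape
        g.create_dataset("data", data=m.data, compression="gzip")
        g.create_dataset("indices", data=m.indices)
        g.create_dataset("indptr", data=m.indptr)
    else:
        g.attrs["representation"] = "dense"
        g.create_dataset("values", data=block, compression="gzip")

def read_block(f, name):
    g = f[name]
    if g.attrs["representation"] == "csr":
        return csr_matrix((g["data"][...], g["indices"][...], g["indptr"][...]),
                          shape=tuple(g.attrs["shape"]))
    return g["values"][...]              # dense blocks come back as 2D arrays

with h5py.File("assoc.h5", "a") as f:    # append to the same example file
    write_block(f, "partition0_0/representation0_0",
                sparse_random(1000, 1000, density=0.01).toarray())
    write_block(f, "partition0_1/representation0_1", np.random.rand(1000, 1000))
    sparse_leaf = read_block(f, "partition0_0/representation0_0")  # a csr_matrix
```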

Again, if your labels have a hierarchical structure, you could fold that into the group structure, but that’s speculation.

There are many other HDF5 layout options to explore, but it should be clear that HDF5 is not going to do the work for you in terms of building an access API around this. A layout is not an API. HDF5 doesn’t have a concept of an associative array or matrix.

Does that help?

G.

Thank you so much for the informative and detailed response. If I could summarize, HDF5 is a great way to implement a backing store for things that would be used to construct or persist something like an associative data structure (in Python or C++ or whatever), but in that regard, it’s just a backing store, not the dict/map/whatever itself. Is that an accurate summary?

Yes, with an asterisk: I would add that, unlike many other backing stores, this one is intended for sharing between users across platforms and languages because it is self-describing and portable. It has distinct primitives in terms of structures (array variables and groupings) and operations (partial I/O), but it is “in the eye of the domain expert” how they are combined to solve a domain-specific problem.

Best, G.