I work on a generic data analysis application, which has to read HDF5 files. By “generic”, think along the lines of HDFView or Excel: I don’t know the layout of the file prior to reading. I need to inspect the contents and read from selected datasets. I’m using the Java object API. All of the datasets are of compound datatypes.
The datasets are one-dimensional, with flat compound types composed of floats, integers, chars, etc. I’m running into a performance problem for datasets whose compound datatype is fairly large, say 3000 bytes and up: a single read from the file takes roughly 20-30 seconds, even though I’m selecting only a subset.
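To put rough numbers on it (the record count of 1,000,000 below is hypothetical, just for illustration; the 3000-byte record size and 50-point selection are from my actual case):

```java
public class ReadCost {
    public static void main(String[] args) {
        long recordSize = 3000;        // bytes per compound record
        long totalRecords = 1_000_000; // hypothetical dataset length
        long selected = 50;            // records I actually ask for

        long wholeDataset = recordSize * totalRecords; // bytes if everything is read
        long subsetOnly = recordSize * selected;       // bytes for my selection

        System.out.println("whole dataset: " + wholeDataset / (1024 * 1024) + " MiB");
        System.out.println("my selection:  " + subsetOnly / 1024 + " KiB");
    }
}
```

So if the library is really reading the whole dataset rather than my 50-point selection, the difference is gigabytes versus kilobytes, which would explain the 20-30 second reads.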
// (dataset.init() has already been called, so the selection arrays are populated)
long[] start = dataset.getStartDims();
long[] selected = dataset.getSelectedDims();
long[] stride = dataset.getStride();
start[0] = startPoint;
selected[0] = 50; // try to read data for 50 points at once
stride[0] = 1;
Object data = dataset.getData();
Even though I’m selecting a subset, it appears to be reading data for every point in the dataset: subsequent getData() calls return instantaneously, even when startPoint is well beyond the previous selection’s end point.
I limit the number of points read at once because I don’t need all the data immediately, and there may be many datasets being viewed at the same time; I want to bound both the I/O time of a single read and memory usage. ‘50’ is an arbitrary number. For debugging, before reading any data I print dataset.getChunkSize() to see the dataset’s chunk size; it reports 1024. I’ve tried making my selected subset smaller, larger, and equal to the chunk size. It doesn’t seem to make any difference.
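In case chunk alignment is relevant, here is the arithmetic I’ve been using to reason about it (a pure-Java sketch with no HDF calls; the chunk size of 1024 is from my debugging output, while startPoint = 2500 is just an example value). Since HDF5 does chunked I/O in units of whole chunks, even a 50-point selection should pull in at least one full chunk:

```java
public class ChunkAlign {
    public static void main(String[] args) {
        long chunk = 1024;      // chunk size reported by dataset.getChunkSize()
        long startPoint = 2500; // example read offset
        long count = 50;        // points requested

        // Chunks touched by the selection [startPoint, startPoint + count):
        long firstChunk = startPoint / chunk;
        long lastChunk = (startPoint + count - 1) / chunk;
        long chunksRead = lastChunk - firstChunk + 1;

        // With 3000-byte records, even a single chunk is ~3 MB of raw data:
        long bytesPerChunk = chunk * 3000;

        System.out.println("chunks touched: " + chunksRead);
        System.out.println("bytes per chunk: " + bytesPerChunk);
    }
}
```

Reading one ~3 MB chunk should still be fast, so chunk granularity alone doesn’t seem to account for a 20-30 second read; that’s part of why I suspect the whole dataset is being read.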
If I read from a dataset with a smaller compound type (< 1000 bytes), reading is very fast.
If anyone has advice, I’d really appreciate some help. I realize the Java API isn’t as full-featured as the C API, which is generally preferred, but the analysis application is written in Java. We are currently using HDF-Java v2.7.