Finally developed things to a point where I can get useful performance numbers for my application. So far, things look good. But when I look at the performance numbers I see behavior I don't expect -- namely, that my write throughput is almost 2x greater than my read throughput.
My system: x64 Windows XP, NTFS file system, C++, HDF5 compiled with VS 2008 (thread-safe build).
My data: dummy GIS data scattered across a region. We have a known grid of geocells that the data should be split into, and we store the data in some number of HDF5 files such that a given file contains data for neighboring geocells. I take the GIS data, clip it to lat/lon boundaries (not really lat/lon, but it's sort of equivalent), determine which HDF5 file the clipped region should be stored in, and write each clipped GIS dataset as a separate HDF5 dataset (each dataset is converted to a 1D stream of 32-bit integer opcodes/data).
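To make the cell-to-file mapping concrete, here is a minimal sketch of one way that lookup could work, assuming each HDF5 file holds a square block of neighboring geocells. The block size and grid width here are made-up parameters, not my real configuration.

```cpp
// Hypothetical geocell-to-file mapping: geocells form a grid, and each
// HDF5 file holds a cellsPerFileSide x cellsPerFileSide block of
// neighboring cells. Both parameters are illustrative.
struct CellToFile {
    int cellsPerFileSide;   // e.g. 4 cells per file in each direction
    int gridWidthInCells;   // total columns of geocells in the grid

    // Index of the HDF5 file that should contain geocell (row, col).
    int fileIndex(int row, int col) const {
        int filesPerRow = gridWidthInCells / cellsPerFileSide;
        return (row / cellsPerFileSide) * filesPerRow
             + (col / cellsPerFileSide);
    }
};
```

The real mapping also has to handle the clipped-region boundaries, but the idea is the same: neighboring cells land in the same file.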
I've created a thin wrapper around HDF5 that retains an LRU cache of recently opened HDF5 files and datasets. It also hides the details of our HDF5 file hierarchy and the configuration details of our datasets from its client applications.
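Since I can't share the wrapper itself, here is a minimal sketch of the LRU handle-cache idea, assuming the real wrapper maps file paths to open HDF5 hid_t handles. The handle is just an int here, and the eviction return value stands in for where H5Fclose() would be called.

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Minimal LRU cache of open file handles. Not the real wrapper: handles
// are plain ints, and eviction just returns the handle instead of
// closing it (the real code would call H5Fclose there).
class LruHandleCache {
public:
    explicit LruHandleCache(size_t capacity) : capacity_(capacity) {}

    // On a hit, fills 'handle', promotes the entry to most-recently-used,
    // and returns true.
    bool get(const std::string& path, int& handle) {
        auto it = index_.find(path);
        if (it == index_.end()) return false;
        order_.splice(order_.begin(), order_, it->second);  // move to front
        handle = it->second->second;
        return true;
    }

    // Inserts a newly opened handle, evicting the least-recently-used
    // entry if full. Returns the evicted handle, or -1 if none.
    int put(const std::string& path, int handle) {
        int evicted = -1;
        if (order_.size() >= capacity_) {
            evicted = order_.back().second;   // real code: H5Fclose(evicted)
            index_.erase(order_.back().first);
            order_.pop_back();
        }
        order_.emplace_front(path, handle);
        index_[path] = order_.begin();
        return evicted;
    }

private:
    size_t capacity_;
    std::list<std::pair<std::string, int>> order_;  // front = most recent
    std::unordered_map<std::string,
        std::list<std::pair<std::string, int>>::iterator> index_;
};
```

The same structure is used for the dataset cache; only the open/close calls differ.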
Everything appears to be working great and I've been doing some performance testing to determine the effects of compression/chunksize/contiguous-vs-chunked/etc.
The attached images are the results of running some performance tests to look at read/write throughput versus chunksize. At each chunk size, I re-ran the test 8 times, throwing out the min/max. Each node in the graph is the mean of the remaining 6 runs, the error bars represent the stddev.
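For reference, the per-chunk-size aggregation described above amounts to this (a sketch; whether the real numbers use sample or population stddev I'm stating as an assumption -- sample stddev here):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// From 8 timed runs (throughput in MB/sec), drop the min and the max,
// then report the mean and sample standard deviation of the remaining 6.
void trimmedStats(std::vector<double> runs, double& mean, double& stddev) {
    std::sort(runs.begin(), runs.end());
    runs.erase(runs.begin());      // drop min
    runs.pop_back();               // drop max
    double sum = 0.0;
    for (double r : runs) sum += r;
    mean = sum / runs.size();
    double sq = 0.0;
    for (double r : runs) sq += (r - mean) * (r - mean);
    stddev = std::sqrt(sq / (runs.size() - 1));  // sample stddev (n-1)
}
```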
The test data was 2million randomly generated GIS points, split into a few hundred HDF5 datasets in about 25 HDF5 files.
None - chunked datasets w/out compression
NoneNoChunk - contiguous datasets
lzf - chunked w/ lzf compression
zlib1, 4, 9 - zlib at compression levels 1, 4, and 9
The compression ratios show what I expect: LZF doesn't compress as well as zlib, and there's minimal difference between the various zlib levels.
Not shown here are the runs I did with the shuffle filter, which for my data didn't help compression and just slowed things down. The compression ratio for NoneNoChunk threw me off for a bit until I realized I was seeing the increased file size due to the file space allocated for partially-used chunks.
The write throughput graph shows LZF considerably better for my data than the other options at every chunk size. zlib's MB/sec throughput is significantly worse, even worse than the contiguous or no-compression runs.
The read graph looks better for zlib -- it outperforms the no-compression options. But again, LZF has better throughput than zlib.
So, I confirmed what I had expected performance-wise. But then I compared the read & write graphs side by side.
On read throughput, my datasets w/ LZF average 70-80 MB/sec.
But, on write throughput, my datasets w/ LZF average 125 MB/sec.
It doesn't just seem to be related to a compression filter. The write throughput for my contiguous dataset runs (NoneNoChunk) was ~60 MB/sec, and its read throughput was ~45 MB/sec.
Unfortunately, I cannot share my code. Any ideas where to look for what might be causing this? Or, any hints for how to diagnose these differences myself?
Writing all this down, I'm starting to wonder whether comparing my read and write throughput is a valid comparison at all. The way my performance-testing application writes data out is different from the way it reads it back.
In both cases I read/write the same total amount of data and traverse the same datasets. However, the order of that dataset traversal is different.
My geocell datasets end up arranged like a 2D array. In the table below, each 2-digit number represents a dataset. The spacing shows how those datasets are grouped into separate HDF5 files -- e.g. datasets 00-03, 10-13, 20-23, and 30-33 are stored in a single file.
00 01 02 03   04 05 06 07   08 09
10 11 12 13   14 15 16 17   18 19
20 21 22 23   24 25 26 27   28 29
30 31 32 33   34 35 36 37   38 39

40 41 42 43   44 45 46 47   48 49
50 51 52 53   54 55 56 57   58 59
60 61 62 63   64 65 66 67   68 69
70 71 72 73   74 75 76 77   78 79
In my read test, I do a row-major traversal of the datasets (00-09, 10-19, 20-29, etc.). The write test doesn't do that -- every dataset is held in a hash map before being written to disk, so writes happen in whatever order the map iterates.
Maybe the unexpected throughput behavior is due to my wrapper library's LRU cache of file handles. The cache is small (<5 files), so depending on the length of a row, by the time the read test reaches the end of one row and moves to the first dataset of the next, the first file may have fallen out of the cache.
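That concern is easy to check with a replay: feed a sequence of file accesses through an LRU cache of a given capacity and count the misses (each miss would be an H5Fopen in the real wrapper). The file ids and access pattern below are illustrative, not my real layout.

```cpp
#include <algorithm>
#include <list>
#include <vector>

// Replay an access sequence of file ids against an LRU cache of the
// given capacity; return how many accesses miss (i.e. would need to
// re-open the file).
int countMisses(const std::vector<int>& accesses, size_t capacity) {
    std::list<int> lru;  // front = most recently used
    int misses = 0;
    for (int id : accesses) {
        auto it = std::find(lru.begin(), lru.end(), id);
        if (it != lru.end()) {
            lru.splice(lru.begin(), lru, it);   // hit: promote to front
        } else {
            ++misses;
            if (lru.size() >= capacity) lru.pop_back();  // evict LRU
            lru.push_front(id);
        }
    }
    return misses;
}
```

The point it demonstrates: the same traversal goes from all-hits to all-misses the moment the number of distinct files touched per row exceeds the cache capacity.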
Will have to fiddle with my cache configuration and see if that eliminates this behavior.