I have a bunch of fairly big HDF5 files (~30 GB compressed, around 15 of them). Each one contains 2 different datasets.
These datasets have the shape (600000, X, 32)
I would like to create a new HDF5 file that merges the datasets along the second axis, so the result should be a dataset of shape (600000, (X1 + X2 + X3…), 32).
I'm currently accomplishing this through NumPy:
import h5py
import numpy as np

data_datasets = []
for hdf5 in hdf5_files:
    with h5py.File(hdf5, "r") as f:
        # data structure initialised to zero
        data = np.zeros(f["Data"].shape, dtype=f["Data"].dtype)
        f["Data"].read_direct(data)
        data_datasets.append(data)
hdf5_file.create_dataset("Data", data=np.concatenate(data_datasets, 1), compression="gzip", compression_opts=5)
However, due to the size of the files, memory usage is an issue.
I haven't found a way to do the concatenation along the second axis without going through NumPy and using np.concatenate(…, 1).
Is there any way I can approach this in ‘blocks’ using h5py only, so I don't use that much memory? A pointer to documentation and/or an example would be great.
These files are produced by a tool that doesn't support parallelisation; a single run of this tool takes around 2 weeks. As a workaround, we split the input and launch the tool in parallel, reducing the runtime but producing multiple output files. However, these files need to be ‘grouped back’ into a single one, as if they had been generated by a single run.
If this can’t be done via h5py only, that’s a valid answer too. I know what I’m trying to do is misguided at best.
I asked Gemini to use NCO. It gave the following script:
#!/bin/bash
# --- User Configuration ---
# 1. Replace with the actual names from your HDF5 files (use ncdump -h)
DIM_1="dim_600k" # The name of the dimension with size 600000
DIM_X="X" # The name of the dimension you want to merge along
DIM_3="dim_32" # The name of the dimension with size 32
# 2. List all your input HDF5 files
# Using a wildcard is easiest if they are in the same directory
INPUT_FILES=(/path/to/your/files/*.h5)
# 3. Define the final output file name
OUTPUT_FILE="merged_dataset.h5"
# --- Script Logic ---
# Create a directory for temporary files
TEMP_DIR="nco_temp_$$" # $$ adds the process ID to make it unique
mkdir -p "$TEMP_DIR"
echo "Step 1: Permuting dimensions for each input file..."
TEMP_FILES=()
for F in "${INPUT_FILES[@]}"; do
BASENAME=$(basename "$F")
TEMP_FILE="${TEMP_DIR}/permuted_${BASENAME}"
echo " Processing ${F} -> ${TEMP_FILE}"
# Reorder dimensions to make DIM_X the record dimension (leftmost)
# -O: Overwrite existing temp files if script is re-run
# -a: Axis permutation
ncpdq -O -a ${DIM_X},${DIM_1},${DIM_3} "$F" "$TEMP_FILE"
TEMP_FILES+=("$TEMP_FILE")
done
echo
echo "Step 2: Concatenating all permuted files..."
# -h: Suppress appending of history attributes, which can become very large
CONCATENATED_PERMUTED_FILE="${TEMP_DIR}/concatenated_permuted.h5"
ncrcat -h "${TEMP_FILES[@]}" "$CONCATENATED_PERMUTED_FILE"
echo
echo "Step 3: Permuting dimensions back to the original order..."
# Reorder dimensions back to the desired final shape
ncpdq -O -a ${DIM_1},${DIM_X},${DIM_3} "$CONCATENATED_PERMUTED_FILE" "$OUTPUT_FILE"
echo
echo "Step 4: Cleaning up temporary files..."
rm -rf "$TEMP_DIR"
echo "Cleanup complete."
echo
echo "✅ Success! Merged dataset created at: ${OUTPUT_FILE}"
To stay within the h5py realm, you will have to process the data in chunks to limit memory consumption. It looks like the datasets in your files are compressed, which means they must be chunked.
Something like the following:
1. Decide on the chunk shape and compression settings for the combined file's datasets.
2. Set their chunk caches appropriately, large enough for at least one chunk to fit.
3. Start assembling these chunks from the input files' chunks as required.
I don't know how kludgy this will end up looking, but eventually you will arrive at a combined file.
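For instance, a minimal sketch of setting up the chunk cache and the chunked, compressed output dataset. The file names, the dataset name "Data", the dtype, and the cache size are placeholders/assumptions, not values from your files:

import h5py

hdf5_files = ["run_01.h5", "run_02.h5"]           # placeholder input paths

total_x = 0
for path in hdf5_files:
    with h5py.File(path, "r") as f:
        total_x += f["Data"].shape[1]

# rdcc_nbytes sets the chunk cache size used for datasets in this file;
# make it large enough to hold at least one chunk.
with h5py.File("merged.h5", "w", rdcc_nbytes=64 * 1024**2) as out:
    out.create_dataset(
        "Data",
        shape=(600000, total_x, 32),
        dtype="<i4",                              # assumed element type
        chunks=(1, total_x, 32),                  # one chunk per row of axis 0
        compression="gzip",
        compression_opts=5,
    )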
I know what I’m trying to do is misguided at best.
Not at all misguided. Break-up strategies like this are frequently used to handle constraints. It seems to me that your tool workaround was a good choice.
Is there any way I can approach this in ‘blocks’ using h5py only so I don’t use that much memory?
Yes, you can merge in blocks using h5py only. This will require more programming than the NCO suggestion, but it will be more efficient in the long run if you are making repeated runs.
Walk down the first dimension, which is the slowest-moving dimension. Let's start with block size 1 for simplicity. Make a temporary memory array TEMP of shape (1, (X1 + X2 + X3…), 32). For each of the 15 input files, read the first block into another memory array, then copy this input array into the correct position in TEMP. Do each block read and copy separately, because I can't guarantee that Python would not be stupid and break a combined operation down into highly inefficient element-by-element reads. Always use array syntax for every data-moving operation; never loop over single elements.
After the completed block is in TEMP, simply write TEMP into the first block in the output file. Repeat the whole process for all 600000 blocks.
Once you have an initial version working, you can improve efficiency by increasing the block size. The code becomes slightly trickier, because then you have to keep track of the start and end indices of each block.
When you first create the empty output file, be sure to select HDF5 chunk size such that your merge blocks do not straddle chunk boundaries. A good initial choice would be chunk shape of (1, (X1 + X2 + X3…), 32). This should work well for any block size 1 or greater.
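A rough, untested sketch of that block-size-1 scheme; the file names, the dataset name "Data", and the 32-element third dimension below are placeholders/assumptions:

import h5py
import numpy as np

hdf5_files = ["run_01.h5", "run_02.h5"]            # placeholder input paths
sources = [h5py.File(p, "r")["Data"] for p in hdf5_files]

n_rows = sources[0].shape[0]                       # 600000 in this case
x_sizes = [d.shape[1] for d in sources]            # X1, X2, X3, ...
total_x = sum(x_sizes)

with h5py.File("merged.h5", "w") as out:
    merged = out.create_dataset(
        "Data", (n_rows, total_x, 32), dtype=sources[0].dtype,
        chunks=(1, total_x, 32),                   # blocks never straddle chunks
        compression="gzip", compression_opts=5,
    )
    temp = np.empty((1, total_x, 32), dtype=sources[0].dtype)
    for row in range(n_rows):
        start = 0
        for d, x in zip(sources, x_sizes):
            # one array-syntax read per input file, copied into its slot in TEMP
            temp[0, start:start + x, :] = d[row, :, :]
            start += x
        merged[row:row + 1, :, :] = temp           # write the completed block

for d in sources:
    d.file.close()                                 # close the input files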
Thanks a lot for the pointers; I have managed to process the files in chunks.
To wrap up the thread a bit, here is the relevant code I ended up with:
file_descriptors = [h5py.File(hdf5_file, "r") for hdf5_file in hdf5_files]
target_shape = [
    file_descriptors[0]["Data"].shape[0],
    sum(fd["Data"].shape[1] for fd in file_descriptors),
    file_descriptors[0]["Data"].shape[2],
]
with h5py.File(output, "w") as hdf5_file:
    data_dataset = hdf5_file.create_dataset(
        "Data",
        target_shape,
        dtype=[
            ("field_1", "<i4"),
            ("field_2", "<i4"),
            ("field_3", "<i4"),
        ],
        chunks=(1, target_shape[1], target_shape[2]),
        compression="gzip",
        compression_opts=5,
    )
    data_tmp = np.zeros(
        (constants.HDF5_BLOCK_SIZE, target_shape[1], target_shape[2]),
        dtype=[
            ("field_1", "<i4"),
            ("field_2", "<i4"),
            ("field_3", "<i4"),
        ],
    )
    counter = 0
    while counter < target_shape[0]:
        # Assemble one block of rows from every input file into data_tmp
        for i in range(constants.HDF5_BLOCK_SIZE):
            shape_i = 0
            for fd in file_descriptors:
                shape = fd["Data"].shape
                data_tmp[i][shape_i : shape_i + shape[1]] = fd["Data"][counter + i][:]
                shape_i = shape_i + shape[1]
        # Write the completed block to the output dataset
        data_dataset[counter : counter + constants.HDF5_BLOCK_SIZE, :] = data_tmp
        counter = counter + constants.HDF5_BLOCK_SIZE
@dave.allured Your answer helped me a lot, thanks! I'm not entirely sure I've followed your advice about doing each ‘block read and copy separately’ so that Python doesn't do something stupid, as I'm fairly new to Python. Currently I don't see a speed improvement with bigger block sizes.
@hdf52 That looks about right. I am not much of a Python programmer either, but that copy strategy is generic. If you don't find a speed improvement with a larger block size, then stay with block size = 1, which has the smallest memory footprint.
Also consider creating the output dataset with an unlimited first dimension rather than fully allocating it up front. That way the file will grow as blocks are appended, and you can add more blocks later if there is ever a reason to do so.
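For example, a minimal sketch of that resizable variant; the dataset name "Data", total_x, the dtype, and the block size are placeholders/assumptions:

import h5py
import numpy as np

total_x = 1024        # placeholder for X1 + X2 + X3 + ...
block = 1

with h5py.File("merged.h5", "w") as out:
    merged = out.create_dataset(
        "Data",
        shape=(0, total_x, 32),             # start empty along the first axis
        maxshape=(None, total_x, 32),       # unlimited first dimension
        dtype="<i4",
        chunks=(1, total_x, 32),
        compression="gzip",
        compression_opts=5,
    )
    data_tmp = np.zeros((block, total_x, 32), dtype="<i4")
    # ... fill data_tmp from the input files as in the loop above ...
    merged.resize(merged.shape[0] + block, axis=0)   # grow, then append
    merged[-block:, :, :] = data_tmp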
You wanted to improve the run time beyond the estimated 2 weeks. Where does it stand now? Better than 1 week, perhaps?
@hdf52 Your loop and block structure is also compatible with @ajelenak’s earlier suggestion to make a virtual data set and break up the output file into smaller files.
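For reference, a rough sketch of how a virtual dataset could stitch the per-run files together without copying any data; the file names, the dataset name "Data", the shapes, and the dtype are placeholders/assumptions:

import h5py

hdf5_files = ["run_01.h5", "run_02.h5"]            # placeholder input paths

x_sizes = []
for path in hdf5_files:
    with h5py.File(path, "r") as f:
        x_sizes.append(f["Data"].shape[1])

layout = h5py.VirtualLayout(shape=(600000, sum(x_sizes), 32), dtype="<i4")
start = 0
for path, x in zip(hdf5_files, x_sizes):
    # Map each source dataset into its slice of the second axis.
    layout[:, start:start + x, :] = h5py.VirtualSource(path, "Data", shape=(600000, x, 32))
    start += x

with h5py.File("merged_virtual.h5", "w") as out:
    out.create_virtual_dataset("Data", layout)     # no data is copied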