Hi everyone,
I’m optimizing some performance-critical functions for handling AnnData objects by reimplementing them in C. In-memory CSR and CSC slicing already works. However, I’m stuck on CSR and CSC slicing from one h5ad file to another on disk.
The difficulty is that the destination file would have to be opened for writing twice: once by h5py in Python and once inside the C function. Here’s a simplified version of my approach:
- In Python, I open the source and destination files using h5py, acquiring a write lock on the destination file:
import h5py

def write_slice_h5ad(source_file_path, dest_file_path, rows: slice, cols: slice, memory_fraction: float = 0.1):
    """
    Slices specified rows and columns from an HDF5-stored AnnData object and writes them
    to a new HDF5 file.
    """
    with h5py.File(source_file_path, 'r') as f_src, h5py.File(dest_file_path, 'w') as f_dest:
        ...
        # Iterate through all items in the source file
        for key in f_src.keys():
            item = f_src[key]
            if isinstance(item, h5py.Group):
                if key == 'uns':
                    # Copy 'uns' group directly
                    f_src.copy(key, f_dest)
                else:
                    # Create or get the destination group
                    if key not in f_dest:
                        new_group = f_dest.create_group(key)
                    else:
                        new_group = f_dest[key]
                    copy_attrs(item, new_group)
                    write_process_group(item, new_group, row_indices, col_indices, batch_size)
            elif isinstance(item, h5py.Dataset):
                write_process_dataset(item, f_dest, row_indices, col_indices)
- In C, I attempted to pass the source and destination group IDs (interpreted as hid_t). However, opening a group based on these IDs without explicitly opening the file within the C function has been problematic. Additionally, opening the file again in the C function for writing is not possible due to the existing write lock:
void write_process_csr_matrix(const char* src_file_path, const char* dst_file_path,
                              const char* src_group_name,
                              int64_t* row_indices, int n_rows,
                              int64_t* col_indices, int n_cols, int batch_size) {
    // Open source and destination files
    hid_t src_file_id = H5Fopen(src_file_path, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (src_file_id < 0) {
        fprintf(stderr, "Error opening the source file.\n");
        return;
    }
    hid_t dst_file_id = H5Fopen(dst_file_path, H5F_ACC_RDWR, H5P_DEFAULT);
    if (dst_file_id < 0) {
        fprintf(stderr, "Error opening the destination file.\n");
        H5Fclose(src_file_id);
        return;
    }
- I tried passing the source and destination group IDs directly to the C function, but could not open groups or datasets based on these IDs without opening the file again.
- Using H5Fopen again in the C function for writing is not possible due to the existing write lock held by the parent Python function.
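One detail that may be worth ruling out when handing identifiers to C: since HDF5 1.10, hid_t is a 64-bit integer, so the value h5py exposes as .id.id has to reach the C side as an int64_t rather than a plain int. The sketch below only illustrates the truncation, using ctypes. The identifier value is made up, and whether ctypes is the binding in use here is an assumption:

```python
import ctypes

# Made-up value shaped like an HDF5 1.10+ identifier (it does not fit in 32 bits).
# In real code this would come from something like group.id.id.
example_hid = 0x0100000000000005

# Carried as a 64-bit integer, the identifier survives unchanged.
as_int64 = ctypes.c_int64(example_hid).value

# Carried as a 32-bit integer, only the low 32 bits remain, so the C side
# would receive a number that is not a valid identifier.
as_int32 = ctypes.c_int32(example_hid).value

print(as_int64 == example_hid)  # True
print(as_int32)                 # 5
```

With ctypes specifically, declaring argtypes with ctypes.c_int64 for each identifier parameter avoids the default conversion to a C int. The identifiers also stay valid only while the corresponding h5py objects remain open on the Python side.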
Here is my current, non-working approach, where I pre-create the datasets in the destination group and attempt to pass the IDs (here, the file IDs plus the source group name) to the C function:
def write_process_matrix(source_group, dest_group, row_indices, col_indices, batch_size, is_csr):
    if is_csr:
        dest_group.create_dataset(
            'data', shape=(0,), maxshape=(None,), dtype=source_group['data'].dtype,
            compression=source_group['data'].compression
        )
        dest_group.create_dataset(
            'indices', shape=(0,), maxshape=(None,), dtype=source_group['indices'].dtype,
            compression=source_group['indices'].compression
        )
        dest_group.create_dataset(
            'indptr', shape=(len(row_indices) + 1,), dtype=source_group['indptr'].dtype,
            compression=source_group['indptr'].compression
        )
        slicers_write.write_process_csr_matrix(source_group.file.id.id, dest_group.file.id.id, source_group.name, row_indices, col_indices, batch_size)
    copy_attrs(source_group, dest_group, shape=(len(row_indices), len(col_indices)))
I encounter the following error:
HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 1:
  #000: H5F.c line 1473 in H5Freopen(): unable to synchronously reopen file
    major: File accessibility
    minor: Unable to open file
  #001: H5F.c line 1425 in H5F__reopen_api_common(): invalid file identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
Error reopening source file.
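For reference, the shapes of the pre-created datasets follow from the CSR layout: the sliced indptr always has len(row_indices) + 1 entries, while the lengths of data and indices are only known once the selected rows have been filtered by column, hence the resizable datasets. Below is a minimal pure-Python sketch of that slice, with plain lists standing in for the HDF5 datasets. The name slice_csr_in_batches is hypothetical, this is not the C implementation, and it assumes col_indices is in increasing order:

```python
def slice_csr_in_batches(data, indices, indptr, row_indices, col_indices, batch_size):
    """Return (data, indices, indptr) of the CSR submatrix, built batch by batch."""
    # Map each kept source column to its position in the output matrix.
    col_map = {c: j for j, c in enumerate(col_indices)}

    out_data, out_indices = [], []  # grow like datasets with shape (0,), maxshape (None,)
    out_indptr = [0]                # ends with len(row_indices) + 1 entries

    for start in range(0, len(row_indices), batch_size):
        # One batch of selected rows; on disk this would be one read/append cycle.
        for r in row_indices[start:start + batch_size]:
            for k in range(indptr[r], indptr[r + 1]):
                j = col_map.get(indices[k])
                if j is not None:  # keep only entries in selected columns
                    out_data.append(data[k])
                    out_indices.append(j)
            out_indptr.append(len(out_data))
    return out_data, out_indices, out_indptr


# 3x3 example matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]] in CSR form.
data, indices, indptr = [1, 2, 3, 4, 5], [0, 2, 2, 0, 1], [0, 2, 3, 5]
sliced = slice_csr_in_batches(data, indices, indptr, [0, 2], [1, 2], batch_size=1)
print(sliced)  # ([2, 5], [1, 0], [0, 1, 2])
```

The result corresponds to rows 0 and 2 and columns 1 and 2 of the example matrix, that is [[0, 2], [5, 0]].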
My question:
How can I pass the write lock from Python’s h5py to the C extension to avoid reopening the destination file? Ideally, I want to directly access the destination group or datasets from the C code without reopening the file. Any suggestions or solutions would be greatly appreciated.
Thanks in advance for your help!