Open file to write in Python unable to reopen in C

gajdos.adam2002 · June 11, 2024, 1:30pm

Hi everyone,

I’m currently working on optimizing some critical functions for handling AnnData objects by reimplementing them in C. In-memory CSR and CSC slicing already works. However, I’m having trouble reimplementing CSR and CSC slicing from one h5ad file on disk to another.
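
For context, here is a minimal pure-Python sketch of what slicing a CSR matrix by rows and columns involves. It is not the C code from this post; the function name `slice_csr` and the use of plain lists are assumptions made only for illustration, and it assumes `col_indices` is given in ascending order so that column indices stay sorted within each row.

```python
from typing import Dict, List, Sequence, Tuple


def slice_csr(
    data: Sequence[float],
    indices: Sequence[int],
    indptr: Sequence[int],
    row_indices: Sequence[int],
    col_indices: Sequence[int],
) -> Tuple[List[float], List[int], List[int]]:
    """Select rows and columns from a CSR matrix stored as three flat arrays."""
    # Map each kept source column to its position in the sliced matrix.
    col_map: Dict[int, int] = {c: j for j, c in enumerate(col_indices)}

    new_data: List[float] = []
    new_indices: List[int] = []
    new_indptr: List[int] = [0]  # one entry more than the number of kept rows

    for r in row_indices:
        # indptr[r]:indptr[r + 1] is the extent of row r in data/indices.
        for k in range(indptr[r], indptr[r + 1]):
            j = col_map.get(indices[k])
            if j is not None:  # keep only entries in selected columns
                new_data.append(data[k])
                new_indices.append(j)
        new_indptr.append(len(new_data))

    return new_data, new_indices, new_indptr


# 3x3 matrix [[1, 0, 2], [0, 0, 3], [4, 5, 0]] in CSR form.
data = [1, 2, 3, 4, 5]
indices = [0, 2, 2, 0, 1]
indptr = [0, 2, 3, 5]

# Keep rows 0 and 2 and columns 1 and 2, i.e. [[0, 2], [5, 0]].
result = slice_csr(data, indices, indptr, [0, 2], [1, 2])
```

The sliced `indptr` always has one more entry than the number of selected rows, while the lengths of `data` and `indices` are only known once the rows have been scanned.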

The difficulty arises because I need to acquire a “write” lock on the destination file twice. Here’s a simplified version of my approach:

In Python, I open the source and destination files using h5py, acquiring a write lock on the destination file:

import h5py

def write_slice_h5ad(source_file_path, dest_file_path, rows: slice, cols: slice, memory_fraction: float = 0.1):
    """
    Slices specified rows and columns from an HDF5-stored AnnData object and writes them
    to a new HDF5 file.
    """
    with h5py.File(source_file_path, 'r') as f_src, h5py.File(dest_file_path, 'w') as f_dest:

        ...

        # Iterate through all items in the source file
        for key in f_src.keys():
            item = f_src[key]

            if isinstance(item, h5py.Group):
                if key == 'uns':
                    # Copy 'uns' group directly
                    f_src.copy(key, f_dest)
                else:
                    # Create or get the destination group
                    if key not in f_dest:
                        new_group = f_dest.create_group(key)
                    else:
                        new_group = f_dest[key]
                    copy_attrs(item, new_group)
                    write_process_group(item, new_group, row_indices, col_indices, batch_size)
            elif isinstance(item, h5py.Dataset):
                write_process_dataset(item, f_dest, row_indices, col_indices)

In C, I attempted to pass the source and destination group IDs (interpreted as hid_t). However, I could not open a group from these IDs without explicitly opening the file inside the C function, and opening the file again for writing in C fails because of the existing write lock:

void write_process_csr_matrix(const char* src_file_path, const char* dst_file_path,
                              const char* src_group_name,
                              int64_t* row_indices, int n_rows,
                              int64_t* col_indices, int n_cols,
                              int batch_size) {
    // Open source and destination files
    hid_t src_file_id = H5Fopen(src_file_path, H5F_ACC_RDONLY, H5P_DEFAULT);
    if (src_file_id < 0) {
        fprintf(stderr, "Error opening the source file.\n");
        return;
    }
    hid_t dst_file_id = H5Fopen(dst_file_path, H5F_ACC_RDWR, H5P_DEFAULT);
    if (dst_file_id < 0) {
        fprintf(stderr, "Error opening the destination file.\n");
        H5Fclose(src_file_id);
        return;
    }

To summarize what I tried: passing the source and destination group IDs directly to the C function did not work, because I could not open groups or datasets from them without opening the file again. Calling H5Fopen again in the C function for writing fails because the parent Python function already holds the write lock.

Here is my current, non-working approach, where I pre-create the datasets in the destination group and then try to pass the IDs to the C function:

def write_process_matrix(source_group, dest_group, row_indices, col_indices, batch_size, is_csr):
    if is_csr:
        dest_group.create_dataset(
            'data', shape=(0,), maxshape=(None,), dtype=source_group['data'].dtype,
            compression=source_group['data'].compression
        )
        dest_group.create_dataset(
            'indices', shape=(0,), maxshape=(None,), dtype=source_group['indices'].dtype,
            compression=source_group['indices'].compression
        )
        dest_group.create_dataset(
            'indptr', shape=(len(row_indices) + 1,), dtype=source_group['indptr'].dtype,
            compression=source_group['indptr'].compression
        )

        slicers_write.write_process_csr_matrix(
            source_group.file.id.id, dest_group.file.id.id, source_group.name,
            row_indices, col_indices, batch_size
        )
        copy_attrs(source_group, dest_group, shape=(len(row_indices), len(col_indices)))

I encounter the following error:

HDF5-DIAG: Error detected in HDF5 (1.14.3) thread 1:
  #000: H5F.c line 1473 in H5Freopen(): unable to synchronously reopen file
    major: File accessibility
    minor: Unable to open file
  #001: H5F.c line 1425 in H5F__reopen_api_common(): invalid file identifier
    major: Invalid arguments to routine
    minor: Inappropriate type
Error reopening source file.

My question:
How can I pass the write lock from Python’s h5py to the C extension to avoid reopening the destination file? Ideally, I want to directly access the destination group or datasets from the C code without reopening the file. Any suggestions or solutions would be greatly appreciated.

Thanks in advance for your help!

gajdos.adam2002 · June 12, 2024, 5:18pm

Turns out the saying along the lines of “keep things locked for as short a time as possible” exists for a reason. The thread can be closed.
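
For anyone landing here later, one way to read that advice is sketched below. It is not HDF5 code: it assumes a POSIX system and assumes the write lock behaves like an advisory, `flock`-style exclusive lock tied to each open file handle. It uses a plain temporary file, and `try_exclusive_lock` and `demo` are hypothetical names used only for illustration.

```python
import fcntl
import os
import tempfile
from typing import Tuple


def try_exclusive_lock(fd: int) -> bool:
    """Try to take a non-blocking exclusive advisory lock on fd."""
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return True
    except BlockingIOError:
        # Another open handle already holds a conflicting lock.
        return False


def demo() -> Tuple[bool, bool, bool]:
    tmp = tempfile.NamedTemporaryFile(delete=False)
    tmp.close()
    path = tmp.name
    try:
        # Two independent opens of the same file, standing in for the
        # Python-side handle and the handle opened inside the C function.
        first = os.open(path, os.O_RDWR)
        second = os.open(path, os.O_RDWR)

        held = try_exclusive_lock(first)          # first handle takes the lock
        blocked = not try_exclusive_lock(second)  # second handle is refused meanwhile

        os.close(first)                           # closing the handle releases its lock
        after_close = try_exclusive_lock(second)  # now the second handle can lock

        os.close(second)
        return held, blocked, after_close
    finally:
        os.unlink(path)


outcome = demo()
```

Under these assumptions, the second handle can only take the lock once the first one has been closed, which is the situation a Python function ends up in if it keeps the destination file open while calling into code that opens the same file again.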
