10x Genomics use of HDF5 and Python

robin073 · October 28, 2020, 2:32am

I’m a biomedical research scientist who wants to reanalyze some single cell RNA-seq data of Covid-19 patients deposited in NCBI Geo several months ago. The 10X Genomics company provided sample Python code on their web site for the particular version of their hardware and software used for this experiment, which they now regard as obsolete and hence no longer supported by the company. (But it’s been used for publicly archived data.)

Their sample code didn’t even conform to the very first guidelines published by Guido van Rossum. Also their notebook didn’t even successfully run to completion. I tried to clean up and simplify the code as much as possible based on my very limited experience with Python (most of my previous work has used R and Rstudio, which IMHO is much more suited to the analysis of gene expression data in biomedical science.)

I looked in the Andrew Collete book “Python and HDF5”, but it didn’t cover the particular style of Python coding used on the sample code. It’s almost like a different dialect of the language, with its extensive use of consecutive bracketed terms. I looked in quite a few different Python books, but never came across this particular dialect.

I realize this is far removed from the types of issues that the HDF group normally works, but I don;'t know who to turn to for assistance. (The Stack Overflow people can get pretty snarky!)

My revised code below:

‘’’ import modules define fns loading saving processing
gene-barcode matrix ‘’’
import os
import socket
import collections
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.sparse as sp_sparse
import h5py
from scipy.io import mmread

np.random.seed(0)

FeatureBCMatrix = collections.namedtuple(‘FeatureBCMatrix’,
[‘feature_ids’, ‘feature_names’, ‘barcodes’, ‘matrix’])
def get_matrix_from_h5(filename):
with h5py.File(filename, ‘r’) as f:
feature_ids = [x.decode(‘ascii’, ‘ignore’)
for x in f[‘matrix’][‘features’][‘id’]]
feature_names = [x.decode(‘ascii’, ‘ignore’)
for x in f[‘matrix’][‘features’][‘name’]]
barcodes = list(f[‘matrix’][‘barcodes’][:])
matrix = sp_sparse.csc_matrix((f[‘matrix’][‘data’],
f[‘matrix’][‘indices’], f[‘matrix’][‘indptr’]),
shape = f[‘matrix’][‘shape’])
return FeatureBCMatrix(feature_ids, feature_names,
barcodes, matrix)

I use the parquet Arrow format to move the data across to R for subsequent analysis. One complication is that even though the genetic data is sparse, it is almost at random, so the vast numerical analysis literature on geometric sparse matrices doesn’t apply.

Alan J. Robinson
robin073@umn.edu

aragilar · October 28, 2020, 11:50am

Hi Alan

You may find it easier if you’re familiar with R to use rhdf5, if you are not aware of it.

As for the Python code, can you paste it between ‘```’ so that the code appears as code to make it readable:

it should look like this

Based on the number of imports, I’m guessing this is only a small subset of the code? Is there a link to the actual notebook publicly available, which may help make the code make sense in context (as currently there appears to be nothing unusual about the code from what I can see)?

ajelenak · October 28, 2020, 7:23pm

Hi Alan,

As already stated, more context and some information about the content of your HDF5 file would be helpful here. For example: h5dump -n 1 your_file.h5.

I’ve tried to reformat your posted code and removed some syntax errors as well.

import os
import socket
import collections
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.sparse as sp_sparse
import h5py
from scipy.io import mmread

np.random.seed(0)

FeatureBCMatrix = collections.namedtuple(
    'FeatureBCMatrix', ['feature_ids', 'feature_names', 'barcodes', 'matrix'])

def get_matrix_from_h5(filename):
    with h5py.File(filename, 'r') as f:
        feature_ids = [
            x.decode('ascii', 'ignore') for x in f['matrix']['features']['id']
        ]            
        feature_names = [
            x.decode('ascii', 'ignore') for x in f['matrix']['features']['name']
        ]
        barcodes = list(f['matrix']['barcodes'][:])
        matrix = sp_sparse.csc_matrix(
            (f['matrix']['data'], f['matrix']['indices'], f['matrix']['indptr']), 
            shape = f['matrix']['shape'])

    return FeatureBCMatrix(feature_ids, feature_names, barcodes, matrix)

   -Aleksandar

thomas1 · October 29, 2020, 1:48pm

Do you mean the bits like f['matrix']['features']['id']? This is probably accessing the structure of the groups and datasets in the HDF5 file. If so, it can also be written with slashes separating the levels:

f['matrix/features/id']

robin073 · October 29, 2020, 8:14pm

I was hoping that someone knowledgeable in Python would recognize where in the official Python documentation this particular consecutive bracketed notation was defined. Also Aleksandar said that he corrected syntax errors in my posted code derived from the 10x Genomics web site. There are many code examples on the web site. Unfortunately I was unable to find the original web page
i used. (I note that the ordering of some of the statements differ. Is this an example of a Python 2 - Python 3 difference?)

The company appears to be trying to get out of the downstream analysis business all together by assuming that bioinformatics groups such as NYU Langone with their Seurat package is the way to go. i personally think that it should be possible to considerably improve on Seurat, which is why I want to reanalyze the Covid-19 experimental data from NCBI GEO GSE145926.

An h5dump follows of the h5 file for one of the Covid-19 patients:

HDF5 “GSM4339769_C141_filtered_feature_bc_matrix.h5” {
FILE_CONTENTS {
group /
attribute /chemistry_description
attribute /filetype
attribute /library_ids
attribute /original_gem_groups
attribute /version
group /matrix
dataset /matrix/barcodes
dataset /matrix/data
group /matrix/features
dataset /matrix/features/_all_tag_keys
dataset /matrix/features/feature_type
dataset /matrix/features/genome
dataset /matrix/features/id
dataset /matrix/features/name
dataset /matrix/indices
dataset /matrix/indptr
dataset /matrix/shape
}
}

Because of the random pattern of the data dropouts represented by zeros, I have to coerce the columns to the dense format with a “todense” call, before moving the matrix file by parquet to the R environment. By using a file backed representation of gene counts by the filematrix R package, it is just barely possible to handle this data on my 16GB Mac and Windows machines.

The cytokine storms that some of the patients experience as a result of a viral or bacterial infection is one of the most critical problems in the entire history of medical science.

AJR

ajelenak · November 3, 2020, 2:50pm

Hi Alan,

I overlooked some brackets in your original post so I updated my version of your code.

Thanks for providing the h5dump information. Based on the file content, you can simplify f['matrix']['features']['name'] to f['matrix/features/name']. The same applies to the other cases.

Aleksandar

Attention! https://support.hdfgroup.org is the NEW home for documentation from The HDF Group. (Details)

10x Genomics use of HDF5 and Python