I’m a biomedical research scientist who wants to reanalyze some single-cell RNA-seq data from COVID-19 patients that was deposited in NCBI GEO several months ago. 10x Genomics provided sample Python code on their website for the particular version of their hardware and software used in this experiment, which they now regard as obsolete and no longer support. (But it has been used for publicly archived data.)
Their sample code didn’t even conform to the very first guidelines published by Guido van Rossum, and their notebook didn’t run to completion. I tried to clean up and simplify the code as much as possible, given my very limited experience with Python. (Most of my previous work has used R and RStudio, which IMHO are much better suited to the analysis of gene expression data in biomedical science.)
I looked in Andrew Collette’s book “Python and HDF5”, but it didn’t cover the particular style of Python coding used in the sample code. It’s almost like a different dialect of the language, with its extensive use of consecutive bracketed terms. I looked in quite a few other Python books, but never came across this particular dialect.
I realize this is far removed from the types of issues that the HDF Group normally works on, but I don’t know who else to turn to for assistance. (The Stack Overflow people can get pretty snarky!)
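For readers unfamiliar with the notation, consecutive brackets such as f['matrix']['features']['id'] are ordinary Python indexing applied several times in a row: each bracket looks up one level and returns an object that can itself be indexed. A minimal sketch, using a plain nested dict as a hypothetical stand-in for the HDF5 file (the names fake_file and lookup, and the placeholder values, are made up for illustration and are not part of the 10x code):

```python
# Hypothetical stand-in for an open HDF5 file: nested dicts play the role
# of HDF5 groups, and the innermost lists play the role of datasets.
fake_file = {
    "matrix": {
        "features": {
            "id": [b"GENE_ID_1", b"GENE_ID_2"],
            "name": [b"GENE_NAME_1", b"GENE_NAME_2"],
        },
    },
}

# Chained form, as in the sample code: each bracket indexes the result
# of the previous one.
chained = fake_file["matrix"]["features"]["id"]

# The same lookup written one step at a time.
matrix_group = fake_file["matrix"]
features_group = matrix_group["features"]
stepwise = features_group["id"]


def lookup(container, path):
    """Follow a slash-separated path through nested containers."""
    node = container
    for key in path.split("/"):
        node = node[key]
    return node


# The same lookup again, written as a single path.
by_path = lookup(fake_file, "matrix/features/id")

# Byte strings are turned into ordinary text with .decode().
feature_ids = [x.decode("ascii", "ignore") for x in chained]
print(feature_ids)
```

With h5py itself, the path form f['matrix/features/id'] reaches the same dataset as the chained form.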
My revised code is below:
''' Import modules and define functions for loading, saving, and
processing the gene-barcode matrix. '''
import os
import socket
import collections
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.sparse as sp_sparse
import h5py
from scipy.io import mmread

np.random.seed(0)

# Lightweight container for the matrix and its row/column labels.
FeatureBCMatrix = collections.namedtuple(
    'FeatureBCMatrix',
    ['feature_ids', 'feature_names', 'barcodes', 'matrix'])


def get_matrix_from_h5(filename):
    ''' Load a feature-barcode matrix from an HDF5 file. '''
    with h5py.File(filename, 'r') as f:
        # h5py returns these strings as bytes; decode them to ordinary text.
        feature_ids = [x.decode('ascii', 'ignore')
                       for x in f['matrix']['features']['id']]
        feature_names = [x.decode('ascii', 'ignore')
                         for x in f['matrix']['features']['name']]
        barcodes = [x.decode('ascii', 'ignore')
                    for x in f['matrix']['barcodes']]
        # [:] reads each dataset into memory before the file is closed.
        matrix = sp_sparse.csc_matrix(
            (f['matrix']['data'][:],
             f['matrix']['indices'][:],
             f['matrix']['indptr'][:]),
            shape=tuple(f['matrix']['shape'][:]))
    return FeatureBCMatrix(feature_ids, feature_names,
                           barcodes, matrix)
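The three arrays passed to csc_matrix in the function above are the compressed sparse column (CSC) encoding: data holds the nonzero values, indices holds the row number of each value, and indptr holds the offsets at which each column starts. A minimal pure-Python sketch with made-up toy values (the function name csc_to_triplets is hypothetical) that unpacks this encoding into (row, column, value) triplets:

```python
def csc_to_triplets(data, indices, indptr):
    """Expand CSC arrays into a list of (row, column, value) triplets."""
    triplets = []
    n_cols = len(indptr) - 1
    for col in range(n_cols):
        # Entries for this column sit in data[indptr[col]:indptr[col + 1]].
        for k in range(indptr[col], indptr[col + 1]):
            triplets.append((indices[k], col, data[k]))
    return triplets


# Toy 3 x 3 matrix (rows = features, columns = barcodes), values made up:
#     [[1, 0, 2],
#      [0, 0, 3],
#      [4, 0, 0]]
data = [1, 4, 2, 3]
indices = [0, 2, 0, 1]
indptr = [0, 2, 2, 4]

triplets = csc_to_triplets(data, indices, indptr)
print(triplets)  # [(0, 0, 1), (2, 0, 4), (0, 2, 2), (1, 2, 3)]
```

The middle column is empty, which is why indptr repeats the value 2. A long table of this kind, with one row per nonzero entry, is one possible way to carry a sparse matrix through a columnar file format.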
I use the Apache Parquet format (via Arrow) to move the data across to R for subsequent analysis. One complication is that even though the genetic data is sparse, the nonzero entries are scattered almost at random, so the vast numerical analysis literature on sparse matrices with geometric structure doesn’t apply.
Alan J. Robinson
robin073@umn.edu