Compiled on Jan 6, 2021 with scanpy v.1.6

Scanpy tutorial using 10k PBMCs dataset

This notebook introduces some typical single-cell analysis tasks using the Scanpy ecosystem. Scanpy notebooks and tutorials are available here. An alternative to this vignette in R (Seurat) is also available; interconversion and exploration of datasets from Python to Seurat (and SCE) is described in a separate vignette.

The data consist of 10k PBMCs from a healthy donor and are freely available from 10x Genomics on this webpage. We will use the version of the data from which ambient RNA has been removed using SoupX, as described in a separate vignette.

Let's import the necessary packages and read in the Cellranger-formatted data folder of SoupX output.

PART 1. Basic quality control and filtering.

We start the analysis after two preliminary steps have been completed: 1) ambient RNA correction using SoupX; 2) doublet detection using Scrublet. Both vignettes can be found in this repository.

Reading the matrix with cache enabled saves time on I/O operations when the data is loaded again, which is particularly relevant for bigger datasets. SoupX output only has gene symbols available, so no additional options are needed. If starting from typical Cellranger output, you can choose whether to use Ensembl IDs (gene_ids) or gene symbols (gene_symbols) as the gene names of the expression matrix.

Let's make sure all gene names are unique. This is done with the var_names_make_unique function.

We can also explore the newly created AnnData object. The obs field will contain all of the per-cell metadata; for now, it only has the barcodes.

We can view selected row or column names as follows:

Let's remove all genes expressed in fewer than 3 cells:

Let's show those genes that yield the highest fraction of counts in each single cell, across all cells.
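In Scanpy this plot is produced by `sc.pl.highest_expr_genes`. To make the underlying quantity explicit, the sketch below computes it by hand with pandas on a made-up count table: each gene's share of the counts within a cell, averaged over all cells.

```python
import pandas as pd

# Toy counts: 3 cells x 3 genes (values are illustrative only).
counts = pd.DataFrame(
    [[8, 1, 1], [6, 2, 2], [9, 0, 1]],
    index=["cell1", "cell2", "cell3"],
    columns=["G0", "G1", "G2"],
)

# Fraction of each cell's total counts assigned to each gene.
frac = counts.div(counts.sum(axis=1), axis=0)

# Genes ranked by their mean fraction across cells.
mean_frac = frac.mean(axis=0).sort_values(ascending=False)
top_genes = list(mean_frac.index)
print(mean_frac)

# With a real AnnData object, the corresponding boxplot would be:
# sc.pl.highest_expr_genes(adata, n_top=20)
```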

Let's plot some information about mitochondrial genes, which is important for quality control (a high fraction of MT gene counts usually indicates dead or dying cells). Note that you can also retrieve mitochondrial gene identifiers from BioMart, e.g. for mouse, using sc.queries.mitochondrial_genes_biomart('', 'mmusculus'). We will also calculate the fraction of counts from ribosomal protein genes, which can serve as another useful indicator of cell state.

A violin plot of the computed quality measures can help us analyze the dataset in general and make decisions about cutoffs.