Compiled on Jan 6, 2021 with scanpy v.1.6

Scanpy tutorial using 10k PBMCs dataset

This notebook should introduce you to some typical tasks, using Scanpy eco-system. Scanpy notebooks and tutorials are available here. An alternative to this vignette in R (Seurat) is also available; interconversion and exploration of datasets from Python to Seurat (and SCE) is described in a separate vignette.

The data consists in 10k PBMCs from a Healthy Donor and is freely available from 10x Genomics from this webpage. We will use the data that had ambient RNA removed using SoupX, as described in yet another vignette.

Let's import the necessary packages and read in the Cellranger-formatted data folder of SoupX output.

PART 1. Basic quality control and filtering.

We start the analysis after two preliminary steps have been completed: 1) ambient RNA correction using soupX; 2) doublet detection using scrublet. Both vignettes can be found in this repository.

Reading the matrix with cache enabled helps save time on I/O operations, which is particularly relevant for bigger datasets. SoupX output only has gene symbols available, so no additional options are needed. If starting from typical Cellranger output, it's possible to choose if you want to use Ensemble ID (gene_ids) or gene symbols (gene_symbols) as expression matrix row names.

Let's make sure all gene names are unique. This is done with var_names_make_unique function.

We can also explore the newly created AnnData object. The obs field will contain all of per-cell metadata; for now, it only has the barcodes.

We can view selected row or column names as follows:

Let's remove all genes expressed in fewer than 3 cells:

Let's show those genes that yield the highest fraction of counts in each single cell, across all cells.

Let's plot some information about mitochondrial genes, which is important for quality control (high MT gene content usually means dead cells). Note that you can also retrieve mitochondrial gene identifiers using sc.queries.mitochondrial_genes_biomart('www.ensembl.org', 'mmusculus'). We also will calculate the fraction of ribosomal proteins which can be used as another useful identifier of cell state.

A violin plot of the computed quality measures can help us analyze the dataset in general and make decisions about cutoffs.

Take another look at per-cell metadata, which has many more columns now:

Finally, let's read in the table of doublet scores and doublet/singlet calls generated by scrublet in another vignette from this repository.

Number of rows matches the number of cells we saw before (10194), which is good. Let's take a quick look at the table:

Let's add the table the anndata.obs so that we could plot the values against each other, use them for filtering, etc:

Plot correlations and make sure that high mitochondrial content corresponds with low total counts (dead cells), and that total counts have strong positive correlation with the number of expressed genes:

High ribosomal protein content, however, strongly anti-correlates with MT, and seems to contain biological signal.

There's also a good correlation between the doublet score and number of expressed genes.

Let's set QC column in metadata and define it in an informative way. Our metadata table looks as follows.

The following command allows us to learn what is the data type of each column.

Let's add a new categorical column that would give us an approximate reason as to why a cell fails QC. Sometimes it would be more than one thing (e.g., low number of expressed genes often happens together with high MT gene content), but for simplicity we would let the order of comparison define the QC category.

After we set the values for QC, we can change the data type to 'category', and take a look at the result.

We can see that doublets don't often overlap with cell with low number of detected genes; at the same time, the latter often co-insides with high mitochondrial content. Let's plot metadata only for cells that pass tentative QC:

PART 2. Normalization and dimensionality reduction.

Total-count normalize (library-size correct) the data matrix 𝐗 to 10,000 reads per cell, so that counts become comparable among cells. The data are also log-transformed with a pseudocount of 1.

Let's take a look at per-gene annotations added by the performed operations:

Identify highly-variable genes to prioritize in further analysis.

Plot the highly variable genes, after and before the normalization.

See how many genes were determined to be highly variable:

Save the raw data in a separate slot. For this, we set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object. You can get back an AnnData of the object in .raw by calling .raw.to_adata().

Filter the genes, only keeping the most variable ones:

Scale each gene to unit variance. Clip values exceeding standard deviation 10.

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.

Plot the first two components of PCA, coloring them by expression of CST3, which is a tentative marker of monocyte- and macrophage-like cells.

Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. used in the clustering function sc.tl.louvain() or tSNE sc.tl.tsne(). In our experience, often, a rough estimate of the number of PCs does fine.

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use default values here. For the sake of similarity with Seurat, let’s use the following values:

UMAP and tSNE are popular methods used for dimensionality reduction and visualisation of very large datasets. UMAP is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories.

Let's plot some markers crudely defining large cell populations: monocytes (LYZ), B cells (MS4A1), and T cells (CD3D). Seems like general populations are nicely defined on our plot.

As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

Plot the data with UMAP, with coloring according to clustering. Some larger clusters are clearly identifiable as cell populations (see example marker plots above). Other smaller clusters can be attributed to doublets, cell cycle artefacts, etc.

Finally, let's calculate cell cycle scores, as described here. Seurat has a built-in list; in scanpy, we would have to add the lists explicitly (see below), or read them in from the file. Note the use of display that allows you to print the outputs of multiple operations from one cell.

It is important to note that cell cycle analysis has to be done after normalization and scaling, and on the full set of genes, not just highly variable ones (remember that the full expression matrix is stored in raw). Thus, we would first recover the genes.

Let's check how many of the defined cell cycle genes are present in our var_names:

Looks like all the gene names were found successfully. Let's run the following command:

Once this is done, we can do a more detailed exploration of our dataset, and filter the cells that do not pass the quality control. Let's visualize the values we use to define low quality cells:

We can see clear grouping of doublets, which is a sign that algorithm has worked correctly (see the vignette for more details). High MT content cells also form separate sub-clusters. Clusters clearly differ in the number of expressed genes, but this could be simply due to biological reasons.

Let's remove the cells that did not pass QC and compare plots. We can now see much more defined clusters. Our filtered dataset now contains 8824 cells - so approximately 12% of cells were removed for various reasons.

Ribosomal protein genes show very strong dependency on the putative cell type! Some cell clusters seem to have as much as 45%, and some as little as 15%. It seems like monocytes have lower content of ribosomal proteins. Mitochnondrial genes also show certain dependency on cluster - e.g. the MT percentage is much lower in clusters 1 and 15.

Finally, cell cycle score does not seem to depend on the cell type much - however, there are dramatic outliers in each group.

There are also differences in RNA content per cell type. Some clusters will probably disapper when we redo the clustering using the QC-filtered dataset.

PART 3. Normalization and careful clustering of the cleaned dataset.

Let's restart our analysis from the counts matrix again. First make sure we have all genes present in the active assay:

Let's redo HVG, PCA, UMAP, and clustering analysis, since now we have QC-filtered cells.

Much fewer (around 2000) genes are identified as HVG. Let's use these for subsequent analysis, keeping the full matrix in raw, as before:

Now, we shall regress out the confounding variables: total UMI count, MT percentage, and cell cycle scores. This step is somewhat slow (expect at least several minutes):

Scale the data again:

Calculate PCA and plot the first two components.

Evaluate how many PCs is it worth using in the analysis:

Calculate neighborhood graph, do UMAP, and do clustering the same way we did before:

Visualize the obtained clusters using UMAP representation:

If UMAP looks slightly less than perfect, we can use the following trick from the scanpy tutorial page:

After this, take another look at the updated UMAP. Looks better!

Let's visualize more detailed markers of PBMC cell types:

Marker Cell type
LYZ monocytes
MS4A1 B cells
CD8B CD8 T cells
CCR7 CD4 naive T cells
IL32 CD4 memory T cells
NKG7 natural killer cells
LILRA4 plasmacytoid dendritic cells
FCER1A myeloid dendritic cells
PPBP platelets

Cluster 10 is clearly heterogeneous and consists of both platelets (with markers such as PPBP) and myeloid cells, most probably mDCs (LYZ, FCER1A). We need to adjust clustering resolution in such way that we would have these subpopulations split.

Let's first try changing the resolution of Leiden clustering, saving new clusters as separate columns in adata column metadata (obs).

Let's visualize clusters with labels on clusters for clarity. It appears that we get too many cluster before we're able to split the original cluster (10) into sub-populations.

Let's try some other approaches. For example, we can try k-means clustering with 15, 20, or 25 expected clusters.

We can also try hierarchical clustering using Euclidean distance as a metric, and Ward's linkage method.

In general, k-means and hierarchical clusters seem to poorly represent the communities, easily shifting from under- to over-clustering. At the same time, one of the clusters from our initial Leiden clustering is always split into platelet/DC sub-populations. This indirectly confirms our previous conclusions. Thus, let's manually split this cluster using well-known markers of platelets.

One of the clusters is clearly divided into two by the following markers (PPBP is a marker of platelets, and FCER1A is a marker of myeloid DCs)

Let's find which cells of cluster 10 are platelets. The following expression shows a useful way to slice scanpy object on multiple gene expression values. The resulting subset should contain 25 cells.

We can get the indexes of putative platelets from cluster 10 as follows:

The following commands allow us to add another category to the 'leiden' column of the metadata, and set the selected cells to belong to the new cluster:

Finally, let's make sure the clustering looks right:

PART 4. Differential expression and marker selection.

After we have fixed the clusters, let's make gene markers using Wilcoxon test. Score plotting for each cluster allows to estimate the relative "goodness" of a marker in a quick fashion.

While some clusters have uninformative markers (e.g. cluster 0 mostly shows ribosomal proteins), most others identify what seems to be legitimate subpopulations of lymphocytes. Let's define a set of markers that are known from the literature or have been identified in our marker selection process:

Let's take a closer look at top 5 ranked markers for each cluster:

Seaborn (split violin) plots of gene expression in a given cluster vs the rest of cells can help easily identify more robust cluster markers. In the example below, ribosomal genes should not be used as markers, while CD8A and CD8B represent bona fide markers of CD8 T cells.

Simple violin plots grouped by cluster also offer a good visualisation:

Dotplot can generate a bird's eye view of expression of all selected marker genes in all clusters.

Yet another option is given by stacked violin plot.

PART 5. Cell ontology-based automatic cell type prediction with CellO

The following analysis is based on this tutorial provided for CellO package. The package is actively developed and is very useful for broad cell type annotation, as well as for annotation of cell types defined by Cell Ontology. To install CellO, we can run the following commands:

After this (we assume we already imported pandas, numpy, scanpy, and AnnData - see above), we run

First of all, let's "recall" the raw data to adata again; we'll need the lists of retained cells and genes for the downstream analysis, but we'll do the normalization and variable gene selection from the scratch.

Let's normalize the data and find the highly variable genes - note the unusually high number of HVGs (10k). This follows the suggestion in the CellO tutorial:

Now let's cluster the data. Over-clustering is recommended here, for the reasons that would become clear below:

Currently, CellO only supports dense matrix format, so we need to do the transformation below. Note that it needs to be done after all Scanpy operations.

Finally, let's set the CellO resource directory to the dir where this notebook is located. NOTE: You will need at least 5 Gb of free disk space for the following code to work.

Following this, we can train the model using CellO using the following command. This runs for about an hour, and will make a file called 10k_pbmc.model.dill.

model_prefix = "10k_pbmc"

cello.scanpy_cello(
    new_adata, clust_key='leiden', rsrc_loc=cello_resource_loc, 
    out_prefix=model_prefix, log_dir=os.getcwd()
)

Instead, we will download a pre-made model to save time:

Following this, we can run the cell type annotation command with a pre-made model:

Let's run UMAP and visualized the obtained clusters. Quite clearly this is too many, but it's OK:

The following example can be useful to save the plots in the pdf format:

Let's visualize a most specific cell type for each cluster:

We can see that some of these assignments are wrong; for example, what's marked as "dendiritic cell" in the plot above is actually a population of non-classical monocytes (as shown by expression of FCGR3A). At the same time, what's labelled as "mononuclear cell" is in fact two populations of dendritic cells - myeloid DCs (expressing FCER1A) and plasmacytoid DCs (expressing LILRA4):

Assigned probability can be used to visualize the probability of cell being a certain cell type:

We can also use more abstract ontologies, e.g. to identify nucleate cells (platelets don't have nuclei!):

Finally, we can use the result of binary classification:

Additionally, we can visualize cell type probabilities assigned to a specific cluster overlaid on the Cell Ontology graph: