Removal of ambient RNA using SoupX

This notebook shows how to remove ambient RNA counts from 10X expression matrix using SoupX package. The following code installs packages that will be required for the analysis. Normally you would only run these commands if you don’t have the packages installed.

install.packages("SoupX")

## Installing package into '/home/jovyan/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)

Let’s now load all the libraries that will be needed for the tutorial.

library(Seurat)
library(SoupX)
library(DropletUtils)
library(ggplot2)
library(DoubletFinder)
library(knitr)

In a common scenario, you start with the output of CellRanger or a similar tool. Here, we will download 10k of PBMC cells processed with CellRanger 4.0.0. CellRanger, STARsolo and other tools generate both raw and filtered matrices. The former contains empty droplets, while the latter is thought to contain mostly cells. Normally, filtered matrix is enough, but for ambient RNA removal we need both.

download.file("https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_NGSC3_DI_PBMC/Parent_NGSC3_DI_PBMC_filtered_feature_bc_matrix.h5",
              destfile = "pbmc10k_filt.h5")
download.file("https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_NGSC3_DI_PBMC/Parent_NGSC3_DI_PBMC_raw_feature_bc_matrix.h5",
              destfile = "pbmc10k_raw.h5")

These are HDF5 data files, which is an efficient way to pack single cell expression matrices. Let’s read them. Setting “use.names=F” will use Ensemble IDs instead of gene names for row identifiers. Few non-unique gene names are taken care of during the reading.

filt.matrix <- Read10X_h5("pbmc10k_filt.h5",use.names = T)

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

raw.matrix  <- Read10X_h5("pbmc10k_raw.h5",use.names = T)

## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you

This results in two objects of dgCMatrix class. Filtered matrix contains droplets that contain putative cells, while raw matrix also includes empty droplets with ambient RNA (soup). We can explore them as follows:

str(raw.matrix)

## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:29226842] 869 11613 27911 2092 5765 15701 21210 31575 36401 10179 ...
##   ..@ p       : int [1:6794881] 0 0 0 0 0 0 0 0 0 0 ...
##   ..@ Dim     : int [1:2] 36601 6794880
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
##   .. ..$ : chr [1:6794880] "AAACCCAAGAAACACT-1" "AAACCCAAGAAACCAT-1" "AAACCCAAGAAACCCA-1" "AAACCCAAGAAACCCG-1" ...
##   ..@ x       : num [1:29226842] 1 1 1 1 1 1 1 1 1 1 ...
##   ..@ factors : list()

str(filt.matrix)

## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   ..@ i       : int [1:24330253] 25 30 32 42 43 44 51 59 60 62 ...
##   ..@ p       : int [1:10195] 0 4803 7036 11360 11703 15846 18178 20413 22584 27802 ...
##   ..@ Dim     : int [1:2] 36601 10194
##   ..@ Dimnames:List of 2
##   .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
##   .. ..$ : chr [1:10194] "AAACCCACATAACTCG-1" "AAACCCACATGTAACC-1" "AAACCCAGTGAGTCAG-1" "AAACCCAGTGCTTATG-1" ...
##   ..@ x       : num [1:24330253] 1 2 1 1 1 3 1 1 1 1 ...
##   ..@ factors : list()

We can make a Seurat object from the sparce matrix as follows:

srat  <- CreateSeuratObject(counts = filt.matrix)
srat

## An object of class Seurat 
## 36601 features across 10194 samples within 1 assay 
## Active assay: RNA (36601 features, 0 variable features)

Let’s make a “SoupChannel”, the object needed to run SoupX. Detailed info is available in SoupX vignette (see link above).

soup.channel  <- SoupChannel(raw.matrix, filt.matrix)
soup.channel

## Channel with 36601 genes and 10194 cells

SoupX requires clusters in order to define marker genes. You can either use CellRanger clustering (see SoupX vignette), or quickly cluster using Seurat. We’ll go for option 2 here. All the following steps will be addressed in more detail in the separate Seurat vignette.

srat    <- SCTransform(srat, verbose = F)
srat    <- RunPCA(srat, verbose = F)
srat    <- RunUMAP(srat, dims = 1:30, verbose = F)
srat    <- FindNeighbors(srat, dims = 1:30, verbose = F)
srat    <- FindClusters(srat, verbose = T)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 10194
## Number of edges: 356652
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8950
## Number of communities: 22
## Elapsed time: 1 seconds

After clustering is obtained, it can be added to the channel using setClusters. setDR is useful for visualizations.

meta    <- srat@meta.data
umap    <- srat@reductions$umap@cell.embeddings
soup.channel  <- setClusters(soup.channel, setNames(meta$seurat_clusters, rownames(meta)))
soup.channel  <- setDR(soup.channel, umap)
head(meta)

##                       orig.ident nCount_RNA nFeature_RNA nCount_SCT
## AAACCCACATAACTCG-1 SeuratProject      22575         4803       8122
## AAACCCACATGTAACC-1 SeuratProject       7758         2233       7745
## AAACCCAGTGAGTCAG-1 SeuratProject      21733         4324       8260
## AAACCCAGTGCTTATG-1 SeuratProject        860          343       5421
## AAACGAACAGTCAGTT-1 SeuratProject      15311         4143       8877
## AAACGAACATTCGGGC-1 SeuratProject      10013         2332       8380
##                    nFeature_SCT SCT_snn_res.0.8 seurat_clusters
## AAACCCACATAACTCG-1         2677              14              14
## AAACCCACATGTAACC-1         2232              17              17
## AAACCCAGTGAGTCAG-1         2533               3               3
## AAACCCAGTGCTTATG-1         1301               6               6
## AAACGAACAGTCAGTT-1         3684              18              18
## AAACGAACATTCGGGC-1         2327               1               1

With defined clusters, run the main SoupX function, calculating ambient RNA profile.

soup.channel  <- autoEstCont(soup.channel)

## 1920 genes passed tf-idf cut-off and 381 soup quantile filter.  Taking the top 100.

## Using 1215 independent estimates of rho.

## Estimated global rho of 0.02

Genes with highest expression in background. These are often enriched for ribosomal proteins.

head(soup.channel$soupProfile[order(soup.channel$soupProfile$est, decreasing = T), ], n = 20)

##                 est counts
## MALAT1  0.041303498 191390
## MT-CO1  0.018444680  85468
## B2M     0.014246351  66014
## MT-CO3  0.014102623  65348
## MT-CO2  0.012735695  59014
## MT-ATP6 0.011987488  55547
## EEF1A1  0.009532025  44169
## MT-CYB  0.009128248  42298
## MT-ND4  0.008099491  37531
## TMSB4X  0.007674349  35561
## RPL13   0.007445808  34502
## TPT1    0.007187270  33304
## RPS12   0.006354683  29446
## RPL10   0.006289509  29144
## LYZ     0.005960402  27619
## FTL     0.005950475  27573
## RPL41   0.005889185  27289
## S100A9  0.005470733  25350
## RPLP1   0.005211548  24149
## TMSB10  0.005023579  23278

We will use roundToInt option to make sure we output integer matrix.

adj.matrix  <- adjustCounts(soup.channel, roundToInt = T)

## Expanding counts from 22 clusters to 10194 cells.

Finally, let’s write the directory with corrected read counts.

DropletUtils:::write10xCounts("soupX_pbmc10k_filt", adj.matrix)

This outputs the matrix of soup-corrected reads which we will use for all future analysis.