This notebook shows how to remove ambient RNA counts from 10X expression matrix using SoupX package. The following code installs packages that will be required for the analysis. Normally you would only run these commands if you don’t have the packages installed.
install.packages("SoupX")
## Installing package into '/home/jovyan/R/x86_64-pc-linux-gnu-library/4.0'
## (as 'lib' is unspecified)
Let’s now load all the libraries that will be needed for the tutorial.
library(Seurat)
library(SoupX)
library(DropletUtils)
library(ggplot2)
library(DoubletFinder)
library(knitr)
In a common scenario, you start with the output of CellRanger or a similar tool. Here, we will download 10k of PBMC cells processed with CellRanger 4.0.0. CellRanger, STARsolo and other tools generate both raw and filtered matrices. The former contains empty droplets, while the latter is thought to contain mostly cells. Normally, filtered matrix is enough, but for ambient RNA removal we need both.
download.file("https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_NGSC3_DI_PBMC/Parent_NGSC3_DI_PBMC_filtered_feature_bc_matrix.h5",
destfile = "pbmc10k_filt.h5")
download.file("https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_NGSC3_DI_PBMC/Parent_NGSC3_DI_PBMC_raw_feature_bc_matrix.h5",
destfile = "pbmc10k_raw.h5")
These are HDF5 data files, which is an efficient way to pack single cell expression matrices. Let’s read them. Setting “use.names=F” will use Ensemble IDs instead of gene names for row identifiers. Few non-unique gene names are taken care of during the reading.
filt.matrix <- Read10X_h5("pbmc10k_filt.h5",use.names = T)
## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
raw.matrix <- Read10X_h5("pbmc10k_raw.h5",use.names = T)
## Warning in sparseMatrix(i = indices[] + 1, p = indptr[], x = as.numeric(x =
## counts[]), : 'giveCsparse' has been deprecated; setting 'repr = "T"' for you
This results in two objects of dgCMatrix class. Filtered matrix contains droplets that contain putative cells, while raw matrix also includes empty droplets with ambient RNA (soup). We can explore them as follows:
str(raw.matrix)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:29226842] 869 11613 27911 2092 5765 15701 21210 31575 36401 10179 ...
## ..@ p : int [1:6794881] 0 0 0 0 0 0 0 0 0 0 ...
## ..@ Dim : int [1:2] 36601 6794880
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
## .. ..$ : chr [1:6794880] "AAACCCAAGAAACACT-1" "AAACCCAAGAAACCAT-1" "AAACCCAAGAAACCCA-1" "AAACCCAAGAAACCCG-1" ...
## ..@ x : num [1:29226842] 1 1 1 1 1 1 1 1 1 1 ...
## ..@ factors : list()
str(filt.matrix)
## Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## ..@ i : int [1:24330253] 25 30 32 42 43 44 51 59 60 62 ...
## ..@ p : int [1:10195] 0 4803 7036 11360 11703 15846 18178 20413 22584 27802 ...
## ..@ Dim : int [1:2] 36601 10194
## ..@ Dimnames:List of 2
## .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
## .. ..$ : chr [1:10194] "AAACCCACATAACTCG-1" "AAACCCACATGTAACC-1" "AAACCCAGTGAGTCAG-1" "AAACCCAGTGCTTATG-1" ...
## ..@ x : num [1:24330253] 1 2 1 1 1 3 1 1 1 1 ...
## ..@ factors : list()
We can make a Seurat object from the sparce matrix as follows:
srat <- CreateSeuratObject(counts = filt.matrix)
srat
## An object of class Seurat
## 36601 features across 10194 samples within 1 assay
## Active assay: RNA (36601 features, 0 variable features)
Let’s make a “SoupChannel”, the object needed to run SoupX. Detailed info is available in SoupX vignette (see link above).
soup.channel <- SoupChannel(raw.matrix, filt.matrix)
soup.channel
## Channel with 36601 genes and 10194 cells
SoupX requires clusters in order to define marker genes. You can either use CellRanger clustering (see SoupX vignette), or quickly cluster using Seurat. We’ll go for option 2 here. All the following steps will be addressed in more detail in the separate Seurat vignette.
srat <- SCTransform(srat, verbose = F)
srat <- RunPCA(srat, verbose = F)
srat <- RunUMAP(srat, dims = 1:30, verbose = F)
srat <- FindNeighbors(srat, dims = 1:30, verbose = F)
srat <- FindClusters(srat, verbose = T)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10194
## Number of edges: 356652
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8950
## Number of communities: 22
## Elapsed time: 1 seconds
After clustering is obtained, it can be added to the channel using setClusters. setDR is useful for visualizations.
meta <- srat@meta.data
umap <- srat@reductions$umap@cell.embeddings
soup.channel <- setClusters(soup.channel, setNames(meta$seurat_clusters, rownames(meta)))
soup.channel <- setDR(soup.channel, umap)
head(meta)
## orig.ident nCount_RNA nFeature_RNA nCount_SCT
## AAACCCACATAACTCG-1 SeuratProject 22575 4803 8122
## AAACCCACATGTAACC-1 SeuratProject 7758 2233 7745
## AAACCCAGTGAGTCAG-1 SeuratProject 21733 4324 8260
## AAACCCAGTGCTTATG-1 SeuratProject 860 343 5421
## AAACGAACAGTCAGTT-1 SeuratProject 15311 4143 8877
## AAACGAACATTCGGGC-1 SeuratProject 10013 2332 8380
## nFeature_SCT SCT_snn_res.0.8 seurat_clusters
## AAACCCACATAACTCG-1 2677 14 14
## AAACCCACATGTAACC-1 2232 17 17
## AAACCCAGTGAGTCAG-1 2533 3 3
## AAACCCAGTGCTTATG-1 1301 6 6
## AAACGAACAGTCAGTT-1 3684 18 18
## AAACGAACATTCGGGC-1 2327 1 1
With defined clusters, run the main SoupX function, calculating ambient RNA profile.
soup.channel <- autoEstCont(soup.channel)
## 1920 genes passed tf-idf cut-off and 381 soup quantile filter. Taking the top 100.
## Using 1215 independent estimates of rho.
## Estimated global rho of 0.02
Genes with highest expression in background. These are often enriched for ribosomal proteins.
head(soup.channel$soupProfile[order(soup.channel$soupProfile$est, decreasing = T), ], n = 20)
## est counts
## MALAT1 0.041303498 191390
## MT-CO1 0.018444680 85468
## B2M 0.014246351 66014
## MT-CO3 0.014102623 65348
## MT-CO2 0.012735695 59014
## MT-ATP6 0.011987488 55547
## EEF1A1 0.009532025 44169
## MT-CYB 0.009128248 42298
## MT-ND4 0.008099491 37531
## TMSB4X 0.007674349 35561
## RPL13 0.007445808 34502
## TPT1 0.007187270 33304
## RPS12 0.006354683 29446
## RPL10 0.006289509 29144
## LYZ 0.005960402 27619
## FTL 0.005950475 27573
## RPL41 0.005889185 27289
## S100A9 0.005470733 25350
## RPLP1 0.005211548 24149
## TMSB10 0.005023579 23278
We will use roundToInt
option to make sure we output integer matrix.
adj.matrix <- adjustCounts(soup.channel, roundToInt = T)
## Expanding counts from 22 clusters to 10194 cells.
Finally, let’s write the directory with corrected read counts.
DropletUtils:::write10xCounts("soupX_pbmc10k_filt", adj.matrix)
This outputs the matrix of soup-corrected reads which we will use for all future analysis.