This vignette should introduce you to some typical tasks, using Seurat eco-system. Seurat vignettes are available here. An alternative to this vignette in Python (using scanpy) is also available in the same repository. The interconversion and exploration of datasets from Python to Seurat (and SCE) is described in a separate vignette. Let’s install the packages currently missing in a default JupyterHub environment.
Only run the chunk below if you’re running the code for the first time, and do not have the packages installed.
remotes::install_github('chris-mcginnis-ucsf/DoubletFinder')
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("celldex")
BiocManager::install("SingleR")
BiocManager::install("glmGamPoi")
Let’s now load all the libraries that will be needed for the tutorial.
library(Seurat)
library(DropletUtils)
library(ggplot2)
library(DoubletFinder)
library(SingleR)
library(dplyr)
library(celldex)
library(knitr)
library(RColorBrewer)
We start the analysis after two preliminary steps have been completed: 1) ambient RNA correction using soupX; 2) doublet detection using scrublet. Both vignettes can be found in this repository.
To start the analysis, let’s read in the corrected matrices:
adj.matrix <- Read10X("soupX_pbmc10k_filt")
After this, we will make a Seurat
object. Seurat object summary shows us that 1) number of cells (“samples”) approximately matches the description of each dataset (10194); 2) there are 36601 genes (features) in the reference.
srat <- CreateSeuratObject(adj.matrix,project = "pbmc10k")
srat
## An object of class Seurat
## 36601 features across 10194 samples within 1 assay
## Active assay: RNA (36601 features, 0 variable features)
Let’s look at the combined object a bit closer. Str allows us to see all fields of the class:
str(srat)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:24330253] 25 30 32 42 43 44 51 59 60 62 ...
## .. .. .. .. .. ..@ p : int [1:10195] 0 4803 7036 11360 11703 15846 18178 20413 22584 27802 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 36601 10194
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
## .. .. .. .. .. .. ..$ : chr [1:10194] "AAACCCACATAACTCG-1" "AAACCCACATGTAACC-1" "AAACCCAGTGAGTCAG-1" "AAACCCAGTGCTTATG-1" ...
## .. .. .. .. .. ..@ x : num [1:24330253] 1 2 1 1 1 3 1 1 1 1 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:24330253] 25 30 32 42 43 44 51 59 60 62 ...
## .. .. .. .. .. ..@ p : int [1:10195] 0 4803 7036 11360 11703 15846 18178 20413 22584 27802 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 36601 10194
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:36601] "MIR1302-2HG" "FAM138A" "OR4F5" "AL627309.1" ...
## .. .. .. .. .. .. ..$ : chr [1:10194] "AAACCCACATAACTCG-1" "AAACCCACATGTAACC-1" "AAACCCAGTGAGTCAG-1" "AAACCCAGTGCTTATG-1" ...
## .. .. .. .. .. ..@ x : num [1:24330253] 1 2 1 1 1 3 1 1 1 1 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ scale.data : num[0 , 0 ]
## .. .. .. ..@ key : chr "rna_"
## .. .. .. ..@ assay.orig : NULL
## .. .. .. ..@ var.features : logi(0)
## .. .. .. ..@ meta.features:'data.frame': 36601 obs. of 0 variables
## .. .. .. ..@ misc : list()
## ..@ meta.data :'data.frame': 10194 obs. of 3 variables:
## .. ..$ orig.ident : Factor w/ 1 level "pbmc10k": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ nCount_RNA : num [1:10194] 22196 7630 21358 857 15007 ...
## .. ..$ nFeature_RNA: int [1:10194] 4734 2191 4246 342 4075 2285 2167 2151 5134 3037 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 1 level "pbmc10k": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:10194] "AAACCCACATAACTCG-1" "AAACCCACATGTAACC-1" "AAACCCAGTGAGTCAG-1" "AAACCCAGTGCTTATG-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "pbmc10k"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 4 0 0
## ..@ commands : list()
## ..@ tools : list()
Meta.data is the most important field for next steps. It can be acessed using both @ and [[]] operators. Right now it has 3 fields per celL: dataset ID, number of UMI reads detected per cell (nCount_RNA), and the number of expressed (detected) genes per same cell (nFeature_RNA).
meta <- srat@meta.data
dim(meta)
## [1] 10194 3
head(meta)
## orig.ident nCount_RNA nFeature_RNA
## AAACCCACATAACTCG-1 pbmc10k 22196 4734
## AAACCCACATGTAACC-1 pbmc10k 7630 2191
## AAACCCAGTGAGTCAG-1 pbmc10k 21358 4246
## AAACCCAGTGCTTATG-1 pbmc10k 857 342
## AAACGAACAGTCAGTT-1 pbmc10k 15007 4075
## AAACGAACATTCGGGC-1 pbmc10k 9855 2285
summary(meta$nCount_RNA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 499 5549 7574 8902 10730 90732
summary(meta$nFeature_RNA)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 47 1725 2113 2348 2948 7265
Let’s add several more values useful in diagnostics of cell quality. Michochondrial genes are useful indicators of cell state. For mouse datasets, change pattern to “Mt-”, or explicitly list gene IDs with the features = … option.
srat[["percent.mt"]] <- PercentageFeatureSet(srat, pattern = "^MT-")
Similarly, we can define ribosomal proteins (their names begin with RPS or RPL), which often take substantial fraction of reads:
srat[["percent.rb"]] <- PercentageFeatureSet(srat, pattern = "^RP[SL]")
Now, let’s add the doublet annotation generated by scrublet
to the Seurat object metadata.
doublets <- read.table("scrublet_calls.tsv",header = F,row.names = 1)
colnames(doublets) <- c("Doublet_score","Is_doublet")
srat <- AddMetaData(srat,doublets)
head(srat[[]])
## orig.ident nCount_RNA nFeature_RNA percent.mt percent.rb
## AAACCCACATAACTCG-1 pbmc10k 22196 4734 5.275725 25.067580
## AAACCCACATGTAACC-1 pbmc10k 7630 2191 8.833552 33.853211
## AAACCCAGTGAGTCAG-1 pbmc10k 21358 4246 6.283360 19.276149
## AAACCCAGTGCTTATG-1 pbmc10k 857 342 31.388565 1.750292
## AAACGAACAGTCAGTT-1 pbmc10k 15007 4075 7.916306 14.986340
## AAACGAACATTCGGGC-1 pbmc10k 9855 2285 7.762557 41.024860
## Doublet_score Is_doublet
## AAACCCACATAACTCG-1 0.30542385 True
## AAACCCACATGTAACC-1 0.01976120 False
## AAACCCAGTGAGTCAG-1 0.03139876 False
## AAACCCAGTGCTTATG-1 0.02755040 False
## AAACGAACAGTCAGTT-1 0.36867997 True
## AAACGAACATTCGGGC-1 0.09723644 False
Let’s make violin plots of the selected metadata features. Note that the plots are grouped by categories named identity class. Identity class can be seen in srat@active.ident, or using Idents()
function. Active identity can be changed using SetIdents()
.
VlnPlot(srat, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rb"),ncol = 4) &
theme(plot.title = element_text(size=10))
Let’s plot some of the metadata features against each other and see how they correlate. The number above each plot is a Pearson correlation coefficient.
FeatureScatter(srat, feature1 = "nCount_RNA", feature2 = "percent.mt")
FeatureScatter(srat, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
FeatureScatter(srat, feature1 = "nCount_RNA", feature2 = "percent.rb")
FeatureScatter(srat, feature1 = "percent.rb", feature2 = "percent.mt")
FeatureScatter(srat, feature1 = "nFeature_RNA", feature2 = "Doublet_score")
The plots above clearly show that high MT percentage strongly correlates with low UMI counts, and usually is interpreted as dead cells. High ribosomal protein content, however, strongly anti-correlates with MT, and seems to contain biological signal. There’s also a strong correlation between the doublet score and number of expressed genes.
Let’s set QC column in metadata and define it in an informative way.
srat[['QC']] <- ifelse(srat@meta.data$Is_doublet == 'True','Doublet','Pass')
srat[['QC']] <- ifelse(srat@meta.data$nFeature_RNA < 500 & srat@meta.data$QC == 'Pass','Low_nFeature',srat@meta.data$QC)
srat[['QC']] <- ifelse(srat@meta.data$nFeature_RNA < 500 & srat@meta.data$QC != 'Pass' & srat@meta.data$QC != 'Low_nFeature',paste('Low_nFeature',srat@meta.data$QC,sep = ','),srat@meta.data$QC)
srat[['QC']] <- ifelse(srat@meta.data$percent.mt > 15 & srat@meta.data$QC == 'Pass','High_MT',srat@meta.data$QC)
srat[['QC']] <- ifelse(srat@meta.data$nFeature_RNA < 500 & srat@meta.data$QC != 'Pass' & srat@meta.data$QC != 'High_MT',paste('High_MT',srat@meta.data$QC,sep = ','),srat@meta.data$QC)
table(srat[['QC']])
##
## Doublet High_MT
## 546 548
## High_MT,Low_nFeature High_MT,Low_nFeature,Doublet
## 275 1
## Pass
## 8824
We can see that doublets don’t often overlap with cell with low number of detected genes; at the same time, the latter often co-insides with high mitochondrial content. Let’s plot metadata only for cells that pass tentative QC:
VlnPlot(subset(srat, subset = QC == 'Pass'),
features = c("nFeature_RNA", "nCount_RNA", "percent.mt","percent.rb"), ncol = 4) &
theme(plot.title = element_text(size=10))
In order to do further analysis, we need to normalize the data to account for sequencing depth. Conventional way is to scale it to 10,000 (as if all cells have 10k UMIs overall), and log2-transform the obtained values. Normalized data are stored in srat[['RNA']]@data
of the ‘RNA’ assay.
srat <- NormalizeData(srat)
Next step discovers the most variable features (genes) - these are usually most interesting for downstream analysis.
srat <- FindVariableFeatures(srat, selection.method = "vst", nfeatures = 2000)
Identify the 10 most highly variable genes:
top10 <- head(VariableFeatures(srat), 10)
top10
## [1] "PTGDS" "IGLC3" "PPBP" "CXCL10" "GZMB" "GP1BB" "JCHAIN" "FCER1A"
## [9] "IGKC" "MZB1"
Plot variable features with and without labels:
plot1 <- VariableFeaturePlot(srat)
LabelPoints(plot = plot1, points = top10, repel = TRUE, xnudge = 0, ynudge = 0)
ScaleData converts normalized gene expression to Z-score (values centered at 0 and with variance of 1). It’s stored in srat[[‘RNA’]]@scale.data and used in following PCA. Default is to run scaling only on variable genes.
all.genes <- rownames(srat)
srat <- ScaleData(srat, features = all.genes)
## Centering and scaling data matrix
We can now do PCA, which is a common way of linear dimensionality reduction. By default we use 2000 most variable genes.
srat <- RunPCA(srat, features = VariableFeatures(object = srat))
## PC_ 1
## Positive: FCN1, CST3, LYZ, FGL2, IFI30, MNDA, CTSS, TYMP, TYROBP, SERPINA1
## TMEM176B, CYBB, S100A9, PSAP, NCF2, LST1, SPI1, AIF1, IGSF6, CD68
## CSTA, GRN, TNFSF13B, CFD, S100A8, MPEG1, MS4A6A, FCER1G, CFP, DUSP6
## Negative: LTB, IL32, TRAC, TRBC2, IL7R, CD7, LIME1, ARL4C, CD27, PRKCQ-AS1
## CCR7, FCMR, CD247, LEF1, CD69, TRBC1, GZMM, MAL, TRABD2A, BCL2
## SYNE2, ISG20, CTSW, IKZF3, ITM2A, RORA, AQP3, TRAT1, CD8B, KLRK1
## PC_ 2
## Positive: IGHM, MS4A1, CD79A, BANK1, SPIB, NIBAN3, BCL11A, IGKC, CD79B, LINC00926
## TCF4, RALGPS2, AFF3, TNFRSF13C, IGHD, HLA-DQA1, TSPAN13, BLNK, CD22, PAX5
## BLK, COBLL1, VPREB3, FCER2, JCHAIN, FCRLA, GNG7, HLA-DOB, TCL1A, LINC02397
## Negative: IL32, TRAC, CD7, IL7R, CD247, GZMM, ANXA1, CTSW, S100A4, PRKCQ-AS1
## TRBC1, S100A10, ITGB2, LEF1, KLRK1, RORA, GZMA, S100A6, NEAT1, MAL
## CST7, NKG7, ID2, TRBC2, ARL4C, MT2A, SAMD3, PRF1, LIME1, TRAT1
## PC_ 3
## Positive: GZMB, CLIC3, C12orf75, LILRA4, NKG7, CLEC4C, SERPINF1, CST7, GZMA, GNLY
## SCT, LRRC26, DNASE1L3, PRF1, TPM2, KLRD1, FGFBP2, CTSC, PACSIN1, IL3RA
## HOPX, CCL5, PLD4, LINC00996, GZMH, CCL4, MAP1A, SMPD3, TNFRSF21, PTCRA
## Negative: MS4A1, CD79A, LINC00926, BANK1, TNFRSF13C, IGHD, PAX5, RALGPS2, VPREB3, CD22
## FCER2, CD79B, HLA-DOB, FCRL1, P2RX5, CD24, ARHGAP24, ADAM28, CCR7, SWAP70
## FCRLA, LINC02397, CD19, IGHM, CD40, PKIG, FCRL2, BASP1, POU2AF1, BLK
## PC_ 4
## Positive: NKG7, GNLY, CST7, PRF1, KLRD1, GZMA, FGFBP2, HOPX, CCL4, FCGR3A
## KLRF1, GZMH, CCL5, SPON2, CD160, ADGRG1, PTGDR, LAIR2, TRDC, RHOC
## IFITM2, ABI3, MATK, TBX21, IL2RB, XCL2, PRSS23, FCRL6, CTSW, S1PR5
## Negative: LILRA4, SERPINF1, CLEC4C, SCT, LRRC26, DNASE1L3, TPM2, MAP1A, TNFRSF21, PACSIN1
## LINC00996, SCN9A, PTCRA, EPHB1, ITM2C, SMIM5, LAMP5, DERL3, CIB2, APP
## IL3RA, SMPD3, PLEKHD1, SCAMP5, PLD4, ZFAT, PPM1J, GAS6, LGMN, TLR9
## PC_ 5
## Positive: S100A12, VCAN, ITGAM, CES1, S100A8, CYP1B1, PADI4, MGST1, MEGF9, MCEMP1
## QPCT, GNLY, CD14, RNASE2, CSF3R, RBP7, NKG7, KLRD1, VNN2, CLEC4E
## CRISPLD2, THBS1, PRF1, CST7, CKAP4, BST1, CTSD, CR1, FGFBP2, PGD
## Negative: CDKN1C, HES4, CTSL, BATF3, TCF7L2, SIGLEC10, CSF1R, CKB, MS4A7, CALML4
## FCGR3A, CASP5, RRAS, AC064805.1, MS4A4A, NEURL1, AC104809.2, IFITM3, MTSS1, SMIM25
## CAMK1, GPBAR1, ABI3, HMOX1, ZNF703, FAM110A, RHOC, CXCL16, CALHM6, RNASET2
Prinicpal component “loadings” should match markers of distinct populations for well behaved datasets. Note that you can change many plot parameters using ggplot2 features - passing them with & operator.
VizDimLoadings(srat, dims = 1:9, reduction = "pca") &
theme(axis.text=element_text(size=5), axis.title=element_text(size=8,face="bold"))
Alternatively, one can do heatmap of each principal component or several PCs at once:
DimHeatmap(srat, dims = 1:6, nfeatures = 20, cells = 500, balanced = T)
DimPlot is used to visualize all reduced representations (PCA, tSNE, UMAP, etc). Identity is still set to “orig.ident”. DimPlot has built-in hiearachy of dimensionality reductions it tries to plot: first, it looks for UMAP, then (if not available) tSNE, then PCA.
DimPlot(srat, reduction = "pca")
It’s often good to find how many PCs can be used without much information loss. In our case a big drop happens at 10, so seems like a good initial choice:
ElbowPlot(srat)
We can now do clustering. Higher resolution leads to more clusters (default is 0.8). It would be very important to find the correct cluster resolution in the future, since cell type markers depends on cluster definition.
srat <- FindNeighbors(srat, dims = 1:10)
## Computing nearest neighbor graph
## Computing SNN
srat <- FindClusters(srat, resolution = 0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10194
## Number of edges: 337796
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9082
## Number of communities: 13
## Elapsed time: 1 seconds
For visualization purposes, we also need to generate UMAP reduced dimensionality representation:
srat <- RunUMAP(srat, dims = 1:10, verbose = F)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
Once clustering is done, active identity is reset to clusters (“seurat_clusters” in metadata). Let’s look at cluster sizes.
table(srat@meta.data$seurat_clusters)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 2854 1671 1163 1079 851 774 426 402 381 245 178 134 36
DimPlot uses UMAP by default, with Seurat clusters as identity:
DimPlot(srat,label.size = 4,repel = T,label = T)
In order to control for clustering resolution and other possible artifacts, we will take a close look at two minor cell populations: 1) dendritic cells (DCs), 2) platelets, aka thrombocytes. Let’s visualise two markers for each of this cell type: LILRA4 and TPM2 for DCs, and PPBP and GP1BB for platelets.
FeaturePlot(srat, features = c("LILRA4", "TPM2", "PPBP", "GP1BB"))
Let’s visualize other confounders:
FeaturePlot(srat, features = "Doublet_score") & theme(plot.title = element_text(size=10))
FeaturePlot(srat, features = "percent.mt") & theme(plot.title = element_text(size=10))
FeaturePlot(srat, features = "nFeature_RNA") & theme(plot.title = element_text(size=10))
Let’s remove the cells that did not pass QC and compare plots. We can now see much more defined clusters. Our filtered dataset now contains 8824 cells - so approximately 12% of cells were removed for various reasons.
DimPlot(srat,label.size = 4,repel = T,label = T)
srat <- subset(srat, subset = QC == 'Pass')
DimPlot(srat,label.size = 4,repel = T,label = T)
Finally, let’s calculate cell cycle scores, as described here. This has to be done after normalization and scaling. Seurat has a built-in list, cc.genes (older) and cc.genes.updated.2019 (newer), that defines genes involved in cell cycle. For CellRanger reference GRCh38 2.0.0 and above, use cc.genes.updated.2019 (three genes were renamed: MLF1IP, FAM64A and HN1 became CENPU, PICALM and JPT). For mouse cell cycle genes you can use the solution detailed here.
cc.genes.updated.2019
## $s.genes
## [1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM7" "MCM4"
## [7] "RRM1" "UNG" "GINS2" "MCM6" "CDCA7" "DTL"
## [13] "PRIM1" "UHRF1" "CENPU" "HELLS" "RFC2" "POLR1B"
## [19] "NASP" "RAD51AP1" "GMNN" "WDR76" "SLBP" "CCNE2"
## [25] "UBR7" "POLD3" "MSH2" "ATAD2" "RAD51" "RRM2"
## [31] "CDC45" "CDC6" "EXO1" "TIPIN" "DSCC1" "BLM"
## [37] "CASP8AP2" "USP1" "CLSPN" "POLA1" "CHAF1B" "MRPL36"
## [43] "E2F8"
##
## $g2m.genes
## [1] "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2" "TOP2A"
## [8] "NDC80" "CKS2" "NUF2" "CKS1B" "MKI67" "TMPO" "CENPF"
## [15] "TACC3" "PIMREG" "SMC4" "CCNB2" "CKAP2L" "CKAP2" "AURKB"
## [22] "BUB1" "KIF11" "ANP32E" "TUBB4B" "GTSE1" "KIF20B" "HJURP"
## [29] "CDCA3" "JPT1" "CDC20" "TTK" "CDC25C" "KIF2C" "RANGAP1"
## [36] "NCAPD2" "DLGAP5" "CDCA2" "CDCA8" "ECT2" "KIF23" "HMMR"
## [43] "AURKA" "PSRC1" "ANLN" "LBR" "CKAP5" "CENPE" "CTCF"
## [50] "NEK2" "G2E3" "GAS2L3" "CBX5" "CENPA"
s.genes <- cc.genes.updated.2019$s.genes
g2m.genes <- cc.genes.updated.2019$g2m.genes
srat <- CellCycleScoring(srat, s.features = s.genes, g2m.features = g2m.genes)
table(srat[[]]$Phase)
##
## G1 G2M S
## 5431 977 2416
Let’s see if we have clusters defined by any of the technical differences. Mitochnondrial genes show certain dependency on cluster, being much lower in clusters 2 and 12.
FeaturePlot(srat,features = "percent.mt",label.size = 4,repel = T,label = T) & theme(plot.title = element_text(size=10))
VlnPlot(srat,features = "percent.mt") & theme(plot.title = element_text(size=10))
Ribosomal protein genes show very strong dependency on the putative cell type! Some cell clusters seem to have as much as 45%, and some as little as 15%.
FeaturePlot(srat,features = "percent.rb",label.size = 4,repel = T,label = T) & theme(plot.title = element_text(size=10))
VlnPlot(srat,features = "percent.rb") & theme(plot.title = element_text(size=10))
There are also differences in RNA content per cell type. From earlier considerations, clusters 6 and 7 are probably lower quality cells that will disapper when we redo the clustering using the QC-filtered dataset.
VlnPlot(srat,features = c("nCount_RNA","nFeature_RNA")) & theme(plot.title = element_text(size=10))
Finally, cell cycle score does not seem to depend on the cell type much - however, there are dramatic outliers in each group.
FeaturePlot(srat,features = c("S.Score","G2M.Score"),label.size = 4,repel = T,label = T) & theme(plot.title = element_text(size=10))
VlnPlot(srat,features = c("S.Score","G2M.Score")) & theme(plot.title = element_text(size=10))
Since we have performed extensive QC with doublet and empty cell removal, we can now apply SCTransform normalization, that was shown to be beneficial for finding rare cell populations by improving signal/noise ratio. Single SCTransform command replaces NormalizeData(), ScaleData(), and FindVariableFeatures(). We will also correct for % MT genes and cell cycle scores.
srat <- SCTransform(srat, method = "glmGamPoi", ncells = 8824,
vars.to.regress = c("percent.mt","S.Score","G2M.Score"), verbose = F)
srat
## An object of class Seurat
## 56857 features across 8824 samples within 2 assays
## Active assay: SCT (20256 features, 3000 variable features)
## 1 other assay present: RNA
## 2 dimensional reductions calculated: pca, umap
After this let’s do standard PCA, UMAP, and clustering. Note that SCT is the active assay now. It is conventional to use more PCs with SCTransform; the exact number can be adjusted depending on your dataset.
srat <- RunPCA(srat, verbose = F)
srat <- RunUMAP(srat, dims = 1:30, verbose = F)
srat <- FindNeighbors(srat, dims = 1:30, verbose = F)
srat <- FindClusters(srat, verbose = F)
table(srat[[]]$seurat_clusters)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 1765 1560 1211 1060 918 380 336 309 302 268 248 132 108 107 102 18
DimPlot(srat, label = T)
It is very important to set clusters correctly. Let’s check the markers of smaller cell populations we have mentioned before - namely, platelets and dendritic cells. Let’s also try another color scheme - just to show how it can be done.
FeaturePlot(srat,"PPBP") & scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
FeaturePlot(srat,"LILRA4") & scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
We can see there’s a cluster of platelets located between clusters 6 and 14, that has not been identified. Increasing clustering resolution in FindClusters
to 2 would help separate the platelet cluster (try it!), but also generates too many clusters.
If need arises, we can separate some clusters manualy. There are also clustering methods geared towards indentification of rare cell populations. Let’s try using fewer neighbors in the KNN graph, combined with Leiden algorithm (now default in scanpy
) and slightly increased resolution:
srat <- FindNeighbors(srat, dims = 1:30, k.param = 15, verbose = F)
srat <- FindClusters(srat, verbose = F, algorithm = 4, resolution = 0.9)
table(srat[[]]$seurat_clusters)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1753 1551 1213 1052 944 380 322 321 301 266 250 132 112 105 86 18
## 17
## 18
DimPlot(srat, label = T)
We already know that cluster 16 corresponds to platelets, and cluster 15 to dendritic cells. Let’s get a very crude idea of what the big cell clusters are.
FeaturePlot(srat,"MS4A1") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("MS4A1: B cells")
FeaturePlot(srat,"LYZ") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("LYZ: monocytes")
FeaturePlot(srat,"NKG7") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("NKG7: natural killers")
FeaturePlot(srat,"CD8B") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("CD8B: CD8 T cells")
FeaturePlot(srat,"IL7R") +
scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral"))) + ggtitle("IL7R: CD4 T cells")
We can see better separation of some subpopulations. These will be further addressed in PART 5.
Differential expression allows us to define gene markers specific to each cluster. By definition it is influenced by how clusters are defined, so it’s important to find the correct resolution. If some clusters lack any notable markers, adjust the clustering. It is recommended to do differential expression on RNA assay, and not the SCTransform. Differential expression can be done between two clusters, as well as between a cluster and all other cells.
First, let’s set the active assay back to “RNA”, and re-do the normalization and scaling (since we removed a notable fraction of cells that failed QC):
DefaultAssay(srat) <- "RNA"
srat <- NormalizeData(srat)
srat <- FindVariableFeatures(srat, selection.method = "vst", nfeatures = 2000)
all.genes <- rownames(srat)
srat <- ScaleData(srat, features = all.genes)
## Centering and scaling data matrix
The following function allows to find markers for every cluster by comparing it to all remaining cells, while reporting only the positive ones. There are many tests that can be used to define markers, including a very fast and intuitive tf-idf. By default, Wilcoxon Rank Sum test is used. This takes a while - take few minutes to make coffee or a cup of tea!
all.markers <- FindAllMarkers(srat, only.pos = T, min.pct = 0.25, logfc.threshold = 0.25)
Let’s take a quick glance at the markers.
dim(all.markers)
## [1] 11983 7
table(all.markers$cluster)
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 1403 362 336 335 1308 467 1099 327 624 853 470 856 466 433 1565 567
## 17
## 512
top3_markers <- as.data.frame(all.markers %>% group_by(cluster) %>% top_n(n = 3, wt = avg_log2FC))
top3_markers
## p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene
## 1 0.000000e+00 4.553943 0.995 0.176 0.000000e+00 1 S100A8
## 2 0.000000e+00 4.308641 0.995 0.199 0.000000e+00 1 S100A9
## 3 0.000000e+00 3.287507 0.970 0.114 0.000000e+00 1 S100A12
## 4 0.000000e+00 1.271117 0.907 0.357 0.000000e+00 2 TCF7
## 5 0.000000e+00 1.210681 0.825 0.273 0.000000e+00 2 CCR7
## 6 2.450285e-209 1.281942 0.683 0.296 8.968287e-205 2 TRBC1
## 7 0.000000e+00 2.503712 0.981 0.069 0.000000e+00 3 CD8B
## 8 0.000000e+00 1.773465 0.676 0.023 0.000000e+00 3 LINC02446
## 9 0.000000e+00 1.639813 0.877 0.077 0.000000e+00 3 CD8A
## 10 0.000000e+00 1.636633 0.987 0.475 0.000000e+00 4 IL32
## 11 0.000000e+00 1.603725 0.991 0.653 0.000000e+00 4 LTB
## 12 4.487473e-273 1.443272 0.933 0.407 1.642460e-268 4 IL7R
## 13 0.000000e+00 2.184959 0.887 0.194 0.000000e+00 5 APOBEC3A
## 14 0.000000e+00 2.112094 0.979 0.324 0.000000e+00 5 IFITM3
## 15 0.000000e+00 2.104551 0.969 0.290 0.000000e+00 5 MARCKS
## 16 0.000000e+00 3.343097 1.000 0.051 0.000000e+00 6 MS4A1
## 17 0.000000e+00 3.167270 1.000 0.088 0.000000e+00 6 CD79A
## 18 0.000000e+00 3.065551 0.745 0.067 0.000000e+00 6 IGHM
## 19 0.000000e+00 3.736996 1.000 0.140 0.000000e+00 7 FCGR3A
## 20 0.000000e+00 3.595612 0.929 0.027 0.000000e+00 7 CDKN1C
## 21 1.096791e-238 2.832979 1.000 0.425 4.014363e-234 7 LST1
## 22 0.000000e+00 3.304739 0.835 0.033 0.000000e+00 8 GZMK
## 23 0.000000e+00 2.693310 0.922 0.120 0.000000e+00 8 CCL5
## 24 1.189505e-232 1.763085 0.798 0.125 4.353709e-228 8 GZMA
## 25 0.000000e+00 3.908568 0.973 0.034 0.000000e+00 9 GZMH
## 26 0.000000e+00 3.756204 0.997 0.119 0.000000e+00 9 CCL5
## 27 0.000000e+00 3.741082 1.000 0.172 0.000000e+00 9 NKG7
## 28 0.000000e+00 5.237652 1.000 0.073 0.000000e+00 10 GNLY
## 29 0.000000e+00 4.205646 1.000 0.176 0.000000e+00 10 NKG7
## 30 0.000000e+00 3.828515 0.989 0.103 0.000000e+00 10 PRF1
## 31 0.000000e+00 4.008059 1.000 0.070 0.000000e+00 11 IGHM
## 32 0.000000e+00 3.793061 1.000 0.072 0.000000e+00 11 IGKC
## 33 0.000000e+00 3.569792 0.968 0.021 0.000000e+00 11 TCL1A
## 34 0.000000e+00 2.782539 0.848 0.008 0.000000e+00 12 FCER1A
## 35 1.372912e-110 3.002976 0.992 0.299 5.024994e-106 12 HLA-DQA1
## 36 1.740697e-87 2.719363 1.000 0.495 6.371124e-83 12 HLA-DPB1
## 37 0.000000e+00 2.942011 0.982 0.051 0.000000e+00 13 GZMK
## 38 3.078269e-210 3.581390 0.991 0.113 1.126677e-205 13 KLRB1
## 39 5.445159e-140 2.242777 0.982 0.139 1.992983e-135 13 GZMA
## 40 0.000000e+00 4.480950 0.905 0.025 0.000000e+00 14 IGLC2
## 41 0.000000e+00 3.950251 0.686 0.018 0.000000e+00 14 IGLC3
## 42 1.608644e-238 3.807863 1.000 0.085 5.887798e-234 14 IGHM
## 43 0.000000e+00 5.023932 0.581 0.012 0.000000e+00 15 PTGDS
## 44 0.000000e+00 3.823238 1.000 0.040 0.000000e+00 15 JCHAIN
## 45 2.417095e-288 4.350077 0.977 0.055 8.846811e-284 15 GZMB
## 46 0.000000e+00 8.111661 1.000 0.006 0.000000e+00 16 PPBP
## 47 0.000000e+00 7.019204 1.000 0.004 0.000000e+00 16 GP1BB
## 48 2.544519e-27 7.170606 1.000 0.184 9.313194e-23 16 NRGN
## 49 2.349996e-124 7.970498 1.000 0.030 8.601222e-120 17 IGLC1
## 50 4.210624e-97 9.358249 0.944 0.036 1.541130e-92 17 IGHA1
## 51 6.692969e-27 8.013624 0.778 0.097 2.449694e-22 17 IGKC
Some markers are less informative than others. For detailed dissection, it might be good to do differential expression between subclusters (see below).
Given the markers that we’ve defined, we can mine the literature and identify each observed cell type (it’s probably the easiest for PBMC). However, we can try automaic annotation with SingleR is workflow-agnostic (can be used with Seurat, SCE, etc). Detailed signleR manual with advanced usage can be found here.
Let’s get reference datasets from celldex package. Note that there are two cell type assignments, label.main
and label.fine
:
hpca.ref <- celldex::HumanPrimaryCellAtlasData()
dice.ref <- celldex::DatabaseImmuneCellExpressionData()
monaco.ref <- celldex::MonacoImmuneData()
Let’s convert our Seurat object to single cell experiment (SCE) for convenience. After this, using SingleR
becomes very easy:
sce <- as.SingleCellExperiment(srat)
sce
## class: SingleCellExperiment
## dim: 36601 8824
## metadata(0):
## assays(2): counts logcounts
## rownames(36601): MIR1302-2HG FAM138A ... AC007325.4 AC007325.2
## rowData names(5): vst.mean vst.variance vst.variance.expected
## vst.variance.standardized vst.variable
## colnames(8824): AAACCCACATGTAACC-1 AAACCCAGTGAGTCAG-1 ...
## TTTGTTGTCGTTATCT-1 TTTGTTGTCTTTGCTA-1
## colData names(18): orig.ident nCount_RNA ... SCT_snn_res.0.9 ident
## reducedDimNames(2): PCA UMAP
## altExpNames(0):
hpca.main <- SingleR(test = sce,assay.type.test = 1,ref = hpca.ref,labels = hpca.ref$label.main)
hpca.fine <- SingleR(test = sce,assay.type.test = 1,ref = hpca.ref,labels = hpca.ref$label.fine)
dice.main <- SingleR(test = sce,assay.type.test = 1,ref = dice.ref,labels = dice.ref$label.main)
dice.fine <- SingleR(test = sce,assay.type.test = 1,ref = dice.ref,labels = dice.ref$label.fine)
monaco.main <- SingleR(test = sce,assay.type.test = 1,ref = monaco.ref,labels = monaco.ref$label.main)
monaco.fine <- SingleR(test = sce,assay.type.test = 1,ref = monaco.ref,labels = monaco.ref$label.fine)
Let’s see the summary of general cell type annotations. These match our expectations (and each other) reasonably well.
table(hpca.main$pruned.labels)
##
## B_cell CMP GMP HSC_-G-CSF
## 773 2 3 3
## Monocyte NK_cell Platelets Pre-B_cell_CD34-
## 3097 333 13 5
## T_cells
## 4477
table(dice.main$pruned.labels)
##
## B cells Monocytes NK cells T cells, CD4+ T cells, CD8+
## 728 3086 762 2482 1465
table(monaco.main$pruned.labels)
##
## B cells CD4+ T cells CD8+ T cells Dendritic cells Monocytes
## 740 2350 1402 200 2984
## NK cells Progenitors T cells
## 305 18 728
Finer cell type annotations are harder to achieve; as you can see, many of these do not match:
table(hpca.fine$pruned.labels)
##
## B_cell B_cell:immature
## 24 261
## B_cell:Memory B_cell:Naive
## 97 374
## B_cell:Plasma_cell CMP
## 9 2
## DC:monocyte-derived:anti-DC-SIGN_2h GMP
## 1 21
## Monocyte Monocyte:CD14+
## 19 1684
## Monocyte:CD16- Monocyte:CD16+
## 900 468
## Monocyte:leukotriene_D4 NK_cell
## 2 251
## NK_cell:CD56hiCD62L+ NK_cell:IL2
## 22 39
## Platelets Pre-B_cell_CD34-
## 13 4
## Pro-B_cell_CD34+ T_cell:CD4+
## 2 4
## T_cell:CD4+_central_memory T_cell:CD4+_effector_memory
## 1305 550
## T_cell:CD4+_Naive T_cell:CD8+
## 2102 320
## T_cell:CD8+_Central_memory T_cell:CD8+_effector_memory
## 2 188
## T_cell:CD8+_effector_memory_RA T_cell:CD8+_naive
## 7 6
## T_cell:gamma-delta
## 9
table(dice.fine$pruned.labels)
##
## B cells, naive Monocytes, CD14+
## 750 2758
## Monocytes, CD16+ NK cells
## 343 703
## T cells, CD4+, memory TREG T cells, CD4+, naive
## 133 1097
## T cells, CD4+, naive TREG T cells, CD4+, TFH
## 127 563
## T cells, CD4+, Th1 T cells, CD4+, Th1_17
## 276 142
## T cells, CD4+, Th17 T cells, CD4+, Th2
## 273 191
## T cells, CD8+, naive
## 1302
table(monaco.fine$pruned.labels)
##
## Central memory CD8 T cells Classical monocytes
## 158 2454
## Effector memory CD8 T cells Exhausted B cells
## 31 34
## Follicular helper T cells Intermediate monocytes
## 250 346
## MAIT cells Myeloid dendritic cells
## 131 138
## Naive B cells Naive CD4 T cells
## 354 1237
## Naive CD8 T cells Natural killer cells
## 1234 294
## Non classical monocytes Non-switched memory B cells
## 153 253
## Non-Vd2 gd T cells Plasmablasts
## 151 10
## Plasmacytoid dendritic cells Progenitor cells
## 78 15
## Switched memory B cells T regulatory cells
## 82 260
## Terminal effector CD4 T cells Terminal effector CD8 T cells
## 42 83
## Th1 cells Th1/Th17 cells
## 162 211
## Th17 cells Th2 cells
## 184 221
## Vd2 gd T cells
## 158
Let’s add the annotations to the Seurat object metadata so we can use them:
srat@meta.data$hpca.main <- hpca.main$pruned.labels
srat@meta.data$dice.main <- dice.main$pruned.labels
srat@meta.data$monaco.main <- monaco.main$pruned.labels
srat@meta.data$hpca.fine <- hpca.fine$pruned.labels
srat@meta.data$dice.fine <- dice.fine$pruned.labels
srat@meta.data$monaco.fine <- monaco.fine$pruned.labels
Finally, let’s visualize the fine-grained annotations.
srat <- SetIdent(srat, value = "hpca.fine")
DimPlot(srat, label = T , repel = T, label.size = 3) + NoLegend()
srat <- SetIdent(srat, value = "dice.fine")
DimPlot(srat, label = T , repel = T, label.size = 3) + NoLegend()
srat <- SetIdent(srat, value = "monaco.fine")
DimPlot(srat, label = T , repel = T, label.size = 3) + NoLegend()
## Warning: ggrepel: 5 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps
Comparing the labels obtained from the three sources, we can see many interesting discrepancies. Sorthing those out requires manual curation. However, many informative assignments can be seen.
For example, small cluster 17 is repeatedly identified as plasma B cells. This distinct subpopulation displays markers such as CD38 and CD59.
FeaturePlot(srat,"CD38") + scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
FeaturePlot(srat,"CD59") + scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
Similarly, cluster 13 is identified to be MAIT cells. Literature suggests that blood MAIT cells are characterized by high expression of CD161 (KLRB1), and chemokines like CXCR6. This indeed seems to be the case:
FeaturePlot(srat,"KLRB1") + scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
FeaturePlot(srat,"CXCR6") + scale_colour_gradientn(colours = rev(brewer.pal(n = 11, name = "Spectral")))
In general, even simple example of PBMC shows how complicated cell type assignment can be, and how much effort it requires. A detailed book on how to do cell type assignment / label transfer with singleR
is available.