scRNA-seq Dataset Integration using R-based tools

## Warning: package 'knitr' was built under R version 4.0.5

The code below will install some packages if you don’t have them already.

if (!require("RColorBrewer"))   install.packages("RColorBrewer")
if (!require("knitr"))     install.packages("knitr")
if (!require("dplyr"))     install.packages("dplyr")
if (!require("cowplot"))   install.packages("cowplot")
if (!require("harmony"))   install.packages("harmony")
if (!require("rliger"))    install.packages("rliger")
if (!require("remotes"))    install.packages("remotes")

if (!require("SeuratWrappers"))     remotes::install_github('satijalab/seurat-wrappers')
## sitmo        (2.0.1   -> 2.0.2 ) [CRAN]
## dqrng        (0.2.1   -> 0.3.0 ) [CRAN]
## irlba        (2.3.3   -> 2.3.5 ) [CRAN]
## RcppAnnoy    (0.0.18  -> 0.0.19) [CRAN]
## goftest      (1.2-2   -> 1.2-3 ) [CRAN]
## fs           (1.5.0   -> 1.5.2 ) [CRAN]
## digest       (0.6.28  -> 0.6.29) [CRAN]
## jquerylib    (0.1.3   -> 0.1.4 ) [CRAN]
## sass         (0.3.1   -> 0.4.0 ) [CRAN]
## cachem       (1.0.4   -> 1.0.6 ) [CRAN]
## bslib        (0.2.4   -> 0.3.1 ) [CRAN]
## glue         (1.4.2   -> 1.5.1 ) [CRAN]
## crayon       (1.4.1   -> 1.4.2 ) [CRAN]
## later        (1.1.0.1 -> 1.3.0 ) [CRAN]
## htmltools    (0.5.1.1 -> 0.5.2 ) [CRAN]
## fontawesome  (NA      -> 0.2.2 ) [CRAN]
## mime         (0.10    -> 0.12  ) [CRAN]
## httpuv       (1.5.5   -> 1.6.3 ) [CRAN]
## parallelly   (1.24.0  -> 1.29.0) [CRAN]
## future       (1.21.0  -> 1.23.0) [CRAN]
## future.apply (1.7.0   -> 1.8.1 ) [CRAN]
## bitops       (1.0-6   -> 1.0-7 ) [CRAN]
## gtools       (3.8.2   -> 3.9.2 ) [CRAN]
## here         (NA      -> 1.0.1 ) [CRAN]
## openssl      (1.4.3   -> 1.4.5 ) [CRAN]
## generics     (0.1.0   -> 0.1.1 ) [CRAN]
## crosstalk    (1.1.1   -> 1.2.0 ) [CRAN]
## htmlwidgets  (1.5.3   -> 1.5.4 ) [CRAN]
## shiny        (1.6.0   -> 1.7.1 ) [CRAN]
## igraph       (1.2.6   -> 1.2.9 ) [CRAN]
## reticulate   (1.18    -> 1.22  ) [CRAN]
## uwot         (0.1.10  -> 0.1.11) [CRAN]
## SeuratObject (4.0.0   -> 4.0.4 ) [CRAN]
## plotly       (4.9.3   -> 4.10.0) [CRAN]
## pbapply      (1.4-3   -> 1.5-0 ) [CRAN]
## lmtest       (0.9-38  -> 0.9-39) [CRAN]
## leiden       (0.3.7   -> 0.3.9 ) [CRAN]
## fitdistrplus (1.1-3   -> 1.1-6 ) [CRAN]
## Seurat       (4.0.1   -> 4.0.5 ) [CRAN]
## rsvd         (1.0.3   -> 1.0.5 ) [CRAN]
##   
   checking for file ‘/tmp/RtmpsTai4O/remotesc7b2357539a/satijalab-seurat-wrappers-8510069/DESCRIPTION’ ...
  
✓  checking for file ‘/tmp/RtmpsTai4O/remotesc7b2357539a/satijalab-seurat-wrappers-8510069/DESCRIPTION’
## 
  
─  preparing ‘SeuratWrappers’:
##    checking DESCRIPTION meta-information ...
  
✓  checking DESCRIPTION meta-information
## 
  
─  checking for LF line-endings in source and make files and shell scripts
## 
  
─  checking for empty or unneeded directories
## 
  
─  building ‘SeuratWrappers_0.3.0.tar.gz’
## 
  
   
## 
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
if (!require("glmGamPoi")) BiocManager::install("glmGamPoi")

Other packages required for this tutorial should already be installed in your JupyterHub by default:

library(Seurat)
library(SeuratWrappers)
library(DropletUtils)
library(ggplot2)
library(RColorBrewer)
library(knitr)
library(dplyr)
library(harmony)
library(rliger)
library(glmGamPoi)
library(patchwork)
library(cowplot)

Let’s also load a custom function, plot_integrated_clusters, that is stored in /data as custom_seurat_functions.R:

source("../data/custom_seurat_functions.R")

Introduction

As more and more scRNA-seq datasets become available, carrying merged_seurat comparisons between them is key. There are two main approaches to comparing scRNASeq datasets. The first approach is “label-centric” which is focused on trying to identify equivalent cell-types/states across datasets by comparing individual cells or groups of cells. The other approach is “cross-dataset normalization” which attempts to computationally remove experiment-specific technical/biological effects so that data from multiple experiments can be combined and jointly analyzed.

We will illustrate this process using two very different datasets: 1) 10k PBMC (3’ v3 chemistry) downloaded from 10x Genomics website; 2) major cell populations isolated from whole human blood using FACS, and sequenced using STRT-seq (GSE149938).

Although we already have all the necessary files in our /data folder, we can download the necessary files from GEO database:

download.file("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE149nnn/GSE149938/suppl/GSE149938_umi_matrix.csv.gz",
              destfile = "GSE149938_umi_matrix.csv.gz")
download.file("https://cf.10xgenomics.com/samples/cell-exp/4.0.0/Parent_NGSC3_DI_PBMC/Parent_NGSC3_DI_PBMC_filtered_feature_bc_matrix.h5",
              destfile = "pbmc10k_filt.h5")

Let’s read the processed file available via GEO, and the 10x file we’ve processed using SoupX (see another notebook in this repository):

umi_gz <- gzfile("../data/GSE149938_umi_matrix.csv.gz",'rt')  
umi <- read.csv(umi_gz,check.names = F,quote = "")
matrix_3p    <- Read10X("../data/soupX_pbmc10k_filt")

Next, let’s make Seurat objects and re-define some of the metadata columns (GEO dataset simply puts the cell type into the orig.ident slot, which will interfere with what we want to do next):

srat_wb <- CreateSeuratObject(t(umi),project = "whole_blood")
## Warning: Non-unique features (rownames) present in the input matrix, making
## unique
srat_3p <- CreateSeuratObject(matrix_3p,project = "pbmc10k_3p")
rm(umi_gz)
rm(umi)
rm(matrix_3p)
colnames(srat_wb@meta.data)[1] <- "cell_type"
srat_wb@meta.data$orig.ident <- "whole_blood"
srat_wb@meta.data$orig.ident <- as.factor(srat_wb@meta.data$orig.ident)
head(srat_wb[[]])
##                    cell_type nCount_RNA nFeature_RNA  orig.ident
## BNK_spBM1_L1_bar25       BNK      24494         1869 whole_blood
## BNK_spBM1_L1_bar26       BNK      61980         2051 whole_blood
## BNK_spBM1_L1_bar27       BNK     124382         3872 whole_blood
## BNK_spBM1_L1_bar28       BNK       8144         1475 whole_blood
## BNK_spBM1_L1_bar29       BNK      53612         2086 whole_blood
## BNK_spBM1_L1_bar30       BNK      33582         2038 whole_blood

Do basic quality control. STRT-Seq is quite different from 10x and has a lot more detected genes per cell. Also, for some reason we don’t have the MT genes in the quantified matrix of the whole blood dataset. That’s unfortunate, but not critical.

srat_wb <- SetIdent(srat_wb,value = "orig.ident")

srat_wb[["percent.mt"]] <- PercentageFeatureSet(srat_wb, pattern = "^MT-")
srat_wb[["percent.rbp"]] <- PercentageFeatureSet(srat_wb, pattern = "^RP[SL]")
srat_3p[["percent.mt"]] <- PercentageFeatureSet(srat_3p, pattern = "^MT-")
srat_3p[["percent.rbp"]] <- PercentageFeatureSet(srat_3p, pattern = "^RP[SL]")

VlnPlot(srat_wb, features = c("nFeature_RNA","nCount_RNA","percent.mt","percent.rbp"), ncol = 4)