monocle3-example.utf8

In order to run monocle3 a few steps must first be followed. Monocle3 requires R version 3.5 or higher, bioconductor 3.5 or higher and monocle3 0.1.0 to run the latest features. 1) Installing bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install()

A few bioconductor dependencies are not automatically installed so you should install the,. 2) Installing bioconductor dependencies:

BiocManager::install(c('BiocGenerics', 'DelayedArray', 'DelayedMatrixStats',
                   'limma', 'S4Vectors', 'SingleCellExperiment',
                   'SummarizedExperiment', 'batchelor'))

In order to install monocle3 a few preparations must be made. First devtools must be installed into your R environment via the R console. 3) Installing devtools:

install.packages('devtools')

If your rlang version is out of data this can cause a failure when trying to install monocle3 and garnett later on. 4) Updating rlang

install.packages('rlang')
library(rlang)

Next your must install libudunits2-dev in your terminal. 5) Installing libudunits2-dev: sudo apt-get update
sudo apt-get install libudunits2-dev Finally libgdal-dev must also be installed on your terminal. 6) Installing libgdal-dev: sudo apt-get update
sudo apt-get install –fix-missing libgdal-dev Now we can finally installmonocle3.

7)Installing monocle3:

devtools::install_github('cole-trapnell-lab/monocle3')

Installing garnett:

BiocManager::install('org.Mm.eg.db')
BiocManager::install('org.Hs.eg.db')
devtools::install_github("cole-trapnell-lab/garnett", ref = 'monocle3')

Run the following code chunk to check everything has been installed correctly, and to load all relevant packages for this tutorial.

library(monocle3)
library(shiny)
library(ggplot2)
library(dplyr)
library(garnett)
library(Matrix)
library(irlba)

#Clustering and classifying cells

The first thing we must do is read cellranger data into R. In order for the to correctly be done you must first ensure the data is in the right folder system. The folder system used depends on whether you have 10X version 2 or 10X version 3 data. Incase your aren’t sure which version of 10X data you have: v2 barcode and gene files end .tsv and matrix file ends .mtx while v3 barcode and gene files end .tsv.gz and matrix file ends .mtx.gz. 7) a) If 10X version 2 the folder system is: ~/data/monocle-data/cellranger-data/outs/filtered_gene_bc_matrices/genome(e.g ‘hg19’ or ‘mm10’ etc depending on genome used)/ b) If 10X version 3 the folder system is: ~/data/monocle-data/cellranger-data/outs/filtered_feature_bc_matrix/genome(e.g ‘hg19’ or ‘mm10’ etc depending on genome used)/ Next we preprocess the data. Firstly we state how may principle components (PC) we want monocle to use, this can be changed by changing the value num_dimis equal to. Then we plot the data set to see if increasing the PC number would capture significantly more variance. The elbow plot produced will then inform you whether increasing PC number will capture a significant amount of more variance. When visualising the data you can use t-SNE or UMAP, UMAP is used by default as the people at monocle believe it is both faster and more suited to clustering. When doing gene expression analysis it is important to remove batch effect. You should always check for batch effect when performing dimensionality reduction, batch effect is reduced by the preprocess_cds function. When order_cells() runs it will try to load a shiny web page but will fail. That is expected. Go to your Jupyter terminal and run ngrok http PORT (where PORT is the 4 digit number at the end of the http message ran by shiny). This will produce multiple lines of code, copy the http message on the penultimate line (should start with the word ‘fowarding’). Select from ‘http’ to before the arrow ‘->’. Enter this web address into a new tab and the then select the node on the graph you wish to use as a basis. You can add umap.fast_sgd=TRUE to the reduce dimensions command to speed it up but will produce slightly different output each time it runs.

# Load RDS data
expression_matrix <- readRDS("../data/monocle-data/cao_l2_expression.rds")
cell_metadata <- readRDS("../data/monocle-data/cao_l2_colData.rds")
gene_annotation <- readRDS("../data/monocle-data/cao_l2_rowData.rds")

# Make the cell dataset object from the RDS data
cds <- new_cell_data_set(expression_matrix,
                                   cell_metadata = cell_metadata,
                                   gene_metadata = gene_annotation)

# Reading in cellranger data
cdr <- load_cellranger_data("/home/jovyan/data/monocle-data/cellranger-data")

# Principle component analysis
cds = preprocess_cds(cds, num_dim = 100)

# Checking if 100 principle components captures enough variance
plot_pc_variance_explained(cds)

# Reducing dimensionality for visualisation of the cells
# can add umap.fast_sgd=TRUE to the reduce dimensions command to speed it up 
# but will produce slightly different output each time its run
cds = reduce_dimension(cds)

## No preprocess_method specified, using preprocess_method = 'PCA'

# Clustering cells using community detection
cds = cluster_cells(cds, resolution=c(10^seq(-6,-1)))

# Plotting the data colouring by partition
plot_cells(cds, color_cells_by="partition", group_cells_by="partition")

## No trajectory to plot. Has learn_graph() been called yet?

The data frame marker_test_res contains metrics which tell us how differently expressed each gene is in each partition. You can rank the cells by one or more specifity metrics and take the top genes for each cluster, the way in which is do this is pseudo_R2. The default number of marker genes in this alogrithm is 3 but if you want to change this edit the 3 in top_n(3, pseudo_R2) to however many marker genes you want. Then we can plot the expression and fraction of cells that express each marker in each group usng plot_genes_by_group().

pheatmap::pheatmap(log(table(clusters(cds), colData(cds)$cao_cell_type)+1),
                   clustering_method="ward.D2",
                   fontsize=6)

# Find marker genes expressed by each cluster
top_specific_markers = marker_test_res %>%
    filter(fraction_expressing >= 0.10) %>%
    group_by(cell_group) %>%
    top_n(3, pseudo_R2)

top_specific_marker_ids = unique(top_specific_markers %>% pull(gene_id))

# Plotting marker genes
plot_genes_by_group(cds,
                    top_specific_marker_ids,
                    group_cells_by="partition",
                    ordering_type="cluster_row_col",
                    max.size=3)

Monocle3 has developed some software which automates the the annotating of cells called Garnett. Garnett classifies cells based on marker genes. Firstly we must fnd the top marker that each annotated cell type expresses using top_markers(). Next we filter these markers based on the conditions required (in this example it is JS specificity > 0.5 and be significant in logistic test marker_test_q_value < 0.01). We then generate a marker file called ./marker_file.txt which contains each cell type and top 5 expressed markers.

# Require that markers have at least JS specificty score > 0.5 and be significant
# in the logistic test for identifying their cell type:
garnett_markers = assigned_type_marker_test_res %>%
  filter(marker_test_q_value < 0.01 & specificity >= 0.5) %>%
  group_by(cell_group) %>%
  top_n(5, marker_score)
# Exclude genes that are good markers for more than one cell type:
garnett_markers = garnett_markers %>% group_by(gene_short_name) %>%
  filter(n() == 1)
generate_garnett_marker_file(garnett_markers, file="./marker_file.txt")

## Garnett marker file written to ./marker_file.txt

plot_cells(cds,
           group_cells_by="cluster",
           color_cells_by="cluster")

#Constructing Single Cell Trajectories

Loading a different data set in for example purposes. If you are usin same data set as earlier the first 2 steps can be missed (assuming data has already been downloaded and preprocessed)

# Loading more RDS data
expression_matrix2 = readRDS("../data/monocle-data/packer_embryo_expression.rds")
cell_metadata2 = readRDS("../data/monocle-data/packer_embryo_colData.rds")
gene_annotation2 = readRDS("../data/monocle-data/packer_embryo_rowData.rds")

cdt <- new_cell_data_set(expression_matrix2,
                         cell_metadata = cell_metadata2,
                         gene_metadata = gene_annotation2)

cdt <- preprocess_cds(cdt, num_dim = 100, residual_model_formula_str = "~ bg.300.loading + bg.400.loading + bg.500.1.loading + bg.500.2.loading + bg.r17.loading + bg.b01.loading + bg.b02.loading")

cdt <- reduce_dimension(cdt)

cdt <- cluster_cells(cdt, reduction_method = 'UMAP')

A principal graph is fitted within each partition in the next component.

The next part is creating a list of genes to try and identify in the trajectory of the sample. If you have different genes you want to check then switch out the names in the ciliated genes list and add or remove genes as required.

ciliated_genes = c("che-1",
                   "hlh-17",
                   "nhr-6",
                   "dmd-6",
                   "ceh-36",
                   "ham-1")

The following is plotting the trajectory multiple times, one for each gene in ciliated genes. And identifying where on the trajectory of cells this gene is expressed.

# Visualise how individual genes vary along the trajectory
plot_cells(cdt,
           genes=ciliated_genes,
           label_cell_groups=FALSE,
           show_trajectory_graph=FALSE)

Now you have to select which cells you want as you starting point. A pop up will appear with options “trust browser” or cancel. Click “try again”. Then the graph should appear and you can select your starting cells.

# a helper function to identify the root principal points:
get_earliest_principal_node <- function(cds, time_bin="130-170"){
  cell_ids <- which(colData(cds)[, "embryo.time.bin"] == time_bin)
  
  closest_vertex <-
  cds@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cds), ])
  root_pr_nodes <-
  igraph::V(principal_graph(cds)[["UMAP"]])$name[as.numeric(names
  (which.max(table(closest_vertex[cell_ids,]))))]
  
  root_pr_nodes
}

cdt = order_cells(cdt, root_pr_nodes=get_earliest_principal_node(cdt))

# Visualising pseudotime on the graph when manually selecting the root principal points
plot_cells(cdt,
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=1.5)

## Warning: `select_()` was deprecated in dplyr 0.7.0.
## Please use `select()` instead.

# A helper function to identify the root principal points:
get_earliest_principal_node <- function(cdt, time_bin="130-170"){
  cell_ids <- which(colData(cdt)[, "embryo.time.bin"] == time_bin)

  closest_vertex <-
    cdt@principal_graph_aux[["UMAP"]]$pr_graph_cell_proj_closest_vertex
  closest_vertex <- as.matrix(closest_vertex[colnames(cdt), ])
  root_pr_nodes <-
    igraph::V(principal_graph(cdt)[["UMAP"]])$name[as.numeric(names
      (which.max(table(closest_vertex[cell_ids,]))))]

  root_pr_nodes
}
cdt = order_cells(cdt, root_pr_nodes=get_earliest_principal_node(cdt))

# Visualising pseudotime on the graph when the helper function decideds the root principal points
plot_cells(cdt,
           color_cells_by = "pseudotime",
           label_cell_groups=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE,
           graph_label_size=1.5)

# 3D visualisation of data
plot_cells_3d(cds_3d, color_cells_by="pseudotime")

#Differential expression analysis

ciliated_genes = c("che-1",
                   "hlh-17",
                   "nhr-6",
                   "dmd-6",
                   "ceh-36",
                   "ham-1")
cds_subset = cdt[rowData(cdt)$gene_short_name %in% ciliated_genes,]

gene_fits = fit_models(cds_subset, model_formula_str = "~embryo.time")

fit_coefs = coefficient_table(gene_fits)

emb_time_terms = fit_coefs %>% filter(term == "embryo.time")

emb_time_terms %>% filter (q_value < 0.05) %>%
  select(gene_short_name, term, q_value, estimate)

## # A tibble: 5 x 4
##   gene_short_name term          q_value estimate
##   <chr>           <chr>           <dbl>    <dbl>
## 1 ham-1           embryo.time 2.76e- 77 -0.00448
## 2 dmd-6           embryo.time 1.48e-201  0.00698
## 3 nhr-6           embryo.time 1.90e- 21  0.00540
## 4 ceh-36          embryo.time 7.70e- 17  0.00220
## 5 che-1           embryo.time 2.39e-  4 -0.00204

plot_genes_violin(cds_subset, group_cells_by="embryo.time.bin", ncol=2) +
    theme(axis.text.x=element_text(angle=45, hjust=1))

gene_fits = fit_models(cds_subset, model_formula_str = "~embryo.time + batch")
fit_coefs = coefficient_table(gene_fits)
fit_coefs %>% filter(term != "(Intercept)") %>%
  select(gene_short_name, term, q_value, estimate)

## # A tibble: 42 x 4
##    gene_short_name term                                 q_value estimate
##    <chr>           <chr>                                  <dbl>    <dbl>
##  1 ham-1           embryo.time                        3.65e- 30 -0.00331
##  2 ham-1           batchWaterston_400_minutes         2.44e-  2  0.208  
##  3 ham-1           batchWaterston_500_minutes_batch_1 7.51e-  2 -0.292  
##  4 ham-1           batchWaterston_500_minutes_batch_2 2.71e-  9 -1.34   
##  5 ham-1           batchMurray_r17                    1   e+  0 -0.0282 
##  6 ham-1           batchMurray_b01                    3.76e-  1 -0.192  
##  7 ham-1           batchMurray_b02                    1.28e-  1 -0.472  
##  8 dmd-6           embryo.time                        1.08e-124  0.00689
##  9 dmd-6           batchWaterston_400_minutes         7.33e-  4  2.78   
## 10 dmd-6           batchWaterston_500_minutes_batch_1 2.49e-  6  3.64   
## # … with 32 more rows

evaluate_fits(gene_fits)

## # A tibble: 6 x 14
##   id      gene_short_name num_cells_expres… gene_id  model  model_summary status
##   <chr>   <chr>                       <int> <chr>    <name> <named list>  <chr> 
## 1 WBGene… ham-1                        1618 WBGene0… <spee… <smmry.sp>    OK    
## 2 WBGene… dmd-6                         873 WBGene0… <spee… <smmry.sp>    OK    
## 3 WBGene… hlh-17                         51 WBGene0… <spee… <smmry.sp>    OK    
## 4 WBGene… nhr-6                         133 WBGene0… <spee… <smmry.sp>    OK    
## 5 WBGene… ceh-36                       1123 WBGene0… <spee… <smmry.sp>    OK    
## 6 WBGene… che-1                         292 WBGene0… <spee… <smmry.sp>    OK    
## # … with 7 more variables: null_deviance <dbl>, df_null <int>, logLik <lgl>,
## #   AIC <dbl>, BIC <dbl>, deviance <dbl>, df_residual <int>

time_batch_models = fit_models(cds_subset,
                               model_formula_str = "~embryo.time + batch",
                               expression_family="negbinomial")
time_models = fit_models(cds_subset,
                        model_formula_str = "~embryo.time",
                        expression_family="negbinomial")
compare_models(time_batch_models, time_models) %>% select(gene_short_name, q_value)

## # A tibble: 6 x 2
##   gene_short_name  q_value
##   <chr>              <dbl>
## 1 ham-1           3.74e-20
## 2 dmd-6           7.11e-37
## 3 hlh-17          1.04e- 5
## 4 nhr-6           5.95e- 3
## 5 ceh-36          6.23e-10
## 6 che-1           5.06e- 6

# reload the data
expression_matrix3 <- readRDS("../data/monocle-data/cao_l2_expression.rds")
cell_metadata3 <- readRDS("../data/monocle-data/cao_l2_colData.rds")
gene_annotation3 <- readRDS("../data/monocle-data/cao_l2_rowData.rds")

# Make the CDS object
cdn <- new_cell_data_set(expression_matrix3,
                         cell_metadata = cell_metadata3,
                         gene_metadata = gene_annotation3)

neurons_cds <- cdn[,grepl("neurons", colData(cdn)$assigned_cell_type, ignore.case=TRUE)]
plot_cells(neurons_cds)

plot_cells(cdt,
           color_cells_by = "cell.type",
           label_groups_by_cluster=FALSE,
           label_leaves=FALSE,
           label_branch_points=FALSE)

plot_cells(cdt, genes=c("hlh-4", "gcy-8", "dac-1", "oig-8"),
           show_trajectory_graph=FALSE,
           label_cell_groups=FALSE,
           label_leaves=FALSE)

plot_cells(cdt, show_trajectory_graph=FALSE)