Quick start guide of single-cell TI with Totem

QuickStart

This tutorial is a quick start guide with Totem, a user-friendly tree-shaped single-cell trajectory inference (TI) method (Smolander, Junttila, and Elo 2022). The minimum requirement to run Totem is a low-dimensional result, such as tSNE or UMAP, as far as this low-dimensional representation represents fairly the cell dynamic process in study, e.g., cell cycle, activation or differentiation.

Commonly one is interested in inferring cell trajectory after running an integration and/or clustering analysis. In either case the scRNA-seq data was most likely projected onto the low-dimensional space, such as PCA and UMAP, along the analysis. Under this scenario, the user might be interested in preserving these low-dimensional reductions during the cell TI analysis.

With this in mind, this quick start guide applies Totem’s TI to a scRNA-seq data set of CD34+ bone marrow cells (replicate 1) used as an example in the Palantir’s TI publication (Setty et al. 2019) and publicly available at Human Cell Atlas Portal (link: https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45). The processed data set with tSNE projection, cluster identities,palantir pseudotime, transformed log-normalized and raw data provided at human_cd34_bm_rep1.h5ad will be used for demonstration of the user-friendliness and experience of Totem’s TI analysis. The human_cd34_bm_rep1.h5ad anndata h5ad object was downloaded, parsed and distributed as a RDS SingleCellExperiment class object at data/human_cd34_bm_rep1.rds. The R script used for this object is provided at scripts/download_h5ad_to_SCE_rds_script.R.

The content of this tutorial is enumerated and described below.

Preprocessing: importing data to R and preparing it to Totem
Totem’s TI: two-step Totem’s TI workflow:

2.1. clustering & multiple spanning trees (MSTs)

2.2. smoothing selected clustering/MSTs
Visualization: visualizing Totem’s cell connectivity, clustering, and trajectory’s topology
Pseudotime: re-rooting and comparison of Totem’s pseudotime estimates to those of Palantir

(1) Preprocessing

Load the following R packages.

# Load packages
library("dplyr")
library("Totem")
library("scater")
library("ggplot2")
library("SingleCellExperiment")

What does every package?

dplyr v.1.1.0 (Wickham et al. 2022): data wrangling, e.g., piping data with %>%
ggplot2 v.3.4.1 (Wickham 2016): data visualization
Totem v.0.99.2 (Smolander, Junttila, and Elo 2022): trajectory inference
scater v.1.26.1 (McCarthy et al. 2017): Bioconductor scRNA toolkit - used for low-dimensional visualization
SingleCellExperiment v.1.20.0 (Amezquita et al. 2020): Bioconductor single-cell data object - R class object used to interact with Totem

Set seed to keep the reproducibility of the analyses generated.

# Set seed
set.seed(1204)

Import the RDS SingleCellExperiment class object storing the scRNA-seq data set with the function readRDS (file available at: data/human_cd34_bm_rep1.rds).

## Import the scRNA-seq SCE object
sce <- readRDS(file = "../data/human_cd34_bm_rep1.rds")

What is a SingleCellExperiment class object?

Totem conveniently uses the SingleCellExperiment (SCE for short) R package and class to store all the information that generates in its workflow. This can be accessed through: metadata(sce)$totem (which returns a list of objects generated in the Totem workflow).

A SCE object can store several gene expression assays (raw, log counts, etc), metadata (cell/gene) and dimensional reduction data. Please have a look into the Chapter 4 The SingleCellExperiment class of the OSCA book to learn more about the SingleCellExperiment class and functionality.

Before starting the workflow, we exclude any potential non expressed gene from our SCE object.

## Prepare data for Totem: remove non-expressed genes
sce <- PrepareTotem(object = sce)

##

## The data matrix has 14651 features.

## The data matrix has 5780 cells.

## Please note that the filtering of poor-quality cells needs to be 
##           done before Totem analysis. In addition, the cells need to be part of 
##           a single, connected, tree-shaped trajectory 
##           (not multiple disconnected trajectories.

##

## 14624/14651 features remain after filtering genes with only zero values.

##

The sce object contains the assays counts, logcounts, scaled, the cell metadata clusters, palantir_pseudotime, palantir_diff_potential, cell_types_long, cell_types_short with 14624 genes and 5780 cells.

(2) `Totem`’s TI

The two-step Totem’s TI workflow consists:

clustering & multiple spanning trees (MSTs): RunClustering(...)
smoothing selected clustering/MSTs: SelectClusterings(...) followed by RunSmoothing(...)

The only input required is the SingleCellExperiment object named sce with a low-dimensional reduction result in reducedDim(sce), which in this case is a tsne with two dimensions (run the following to see the first tSNE 6 cell coordinates - head(reducedDim(sce))).

## Totem's TI workflow: 
# (1) clustering dimensional reduction w/ CLARA (k-medoids), MSTs 
# (2) smoothing best clustering/MSTs w/ principal curves algorithm
set.seed(123) # keep reproducibility
sce <- RunClustering(object = sce) %>% 
  SelectClusterings(object = .) %>% 
  RunSmoothing(object = .)

## using tsne as the dim.reduced data

## number of clustering results: 10000

What does %>% mean/do?

The expression %>% is a pipe. Pipes are a convenient computational way of passing the output generated by the previous command to the input of the next command. If you’re not familiar with pipes (%>% coming from the package dplyr), you can run the equivalent to the previous code without using pipes.

## Totem's TI workflow: 
# (1) clustering dimensional reduction w/ CLARA (k-medoids), MSTs 
# (2) smoothing best clustering/MSTs w/ principal curves algorithm
set.seed(123) # keep reproducibility
sce <- RunClustering(object = sce) 
sce <- SelectClusterings(object = sce) 
sce <- RunSmoothing(object = sce)

What does every individual function above?

RunClustering(...): It performs clustering and MSTs. Clustering is one of the main steps in the Totem workflow and the one that takes more time and computational resources. Totem uses the dimensional latent representations provided in reducedDim(sce) to cluster the data set 10K times (N.clusterings = 10000) with the k-medoids (CLARA) algorithm using an expected number of clusters varying between 3-20 (k.range = 3:20). It filters out clusters with <5 cells (min.cluster.size = 5). All the parameters mentioned are the default and they can be changed. After filtering out clustering results due to min.cluster.size, the total number of clustering results will be lower than N.clusterings. At this stage, it is also estimated a MST for each clustering result.
SelectClusterings(...): Totem performs a large number of clustering results as part of its approach. Since they can not be all inspected manually, Totem provides 5 methods to select the top clustering results (by default 1 - selection.N.models = 1). All the methods are described in detail in the Totem vignette (see vignette). You can also see the function documentation by typing the following in the R console: ?SelectClusterings. The default method used was method 1 (selection.method = 1) which uses the Variance Ratio Criterion (VRC, aka Calinski-Harabasz score), which measures the within- and between-cluster dispersion, to select the best clustering result.
RunSmoothing(...): Smoothing the top selected MST(s) using the principle curves slingshot algorithm (Street et al. 2018) resulting in directed trajectories randomly rooted and discrete pseudotime along the lineage. The user can adjust the root later.

(3) Visualization

The Totem’s cell connectivities, clustering/MST and trajectory can be visual inspected after running Totem’s TI method.

Cell connectivity

Cell connectivity is a novel concept introduced in Totem. For a given clustering result and the respective MST, the connectivity of a cell cluster consists in the ratio of its connections to other clusters by the number of clusters. Higher the value farther the distance for the leafs/ending points of the trajectory and, thus, more likely to represent, e.g., branching points (see Fig.4 (Smolander, Junttila, and Elo 2022)).

Totem relies on the dynplot (Cannoodt and Saelens 2021) package for visualization which requires to give the dimensional reduction result as a parameter (viz.dim.red). Thus the tSNE dimensional reduction is first retrieved from the sce object with dim_red <- reducedDim(sce, "tsne") and given as parameter (`viz.dim.red = dim_red) to the function VizCellConnectivity() together with the sce object to plot cell connectivities.

## Visualization of cell connectivity
dim_red <- reducedDim(sce, "tsne") # retrieve tSNE
VizCellConnectivity(object = sce, viz.dim.red = dim_red) # plot

## Coloring by milestone

## Using milestone_percentages from trajectory

## Coloring by expression

The cell connectivity plot suggests that HSC and MyP are branching points during hematopoiesis.

Clustering/MST

Next the best clustering and MST can be inspected with the function VizMST(). The name of the best clustering and MST needs to be retrieved (select.cluster <- names(metadata(sce)$totem$slingshot_trajectory)) and given as parameter (clustering.names = select.cluster). In addition, the sce object and tSNE result (viz.dim.red = dim_red)) are also given.

In the code below it is also plotted the Palantir’s cluster and cell types abbreviated. For this end it is used the function plotReducedDim() from the scater package (read more about the function ?plotReducedDim).

## Visualization of clustering/MST
select.cluster <- names(metadata(sce)$totem$slingshot_trajectory) # retrieve the name of the best cluster
cowplot::plot_grid(
  plotReducedDim(sce, dimred = "tsne", color_by = "clusters") + 
    scale_color_manual(name= "Clusters", values=unlist(metadata(sce)$cluster_colors)) + 
    ggtitle("Palantir's clusters") + 
    theme_void() + 
    theme(legend.position = "bottom"),
  plotReducedDim(sce, dimred = "tsne", color_by = "cell_types_short") +
    scale_color_manual(name= "Cell Types", 
                       values=as.character(unlist(metadata(sce)$cluster_colors)[!duplicated(levels(sce$cell_types_short))])) +
    ggtitle("Cell types") + 
    theme_void() + 
    theme(legend.position = "bottom"),
  VizMST(object = sce, clustering.names = select.cluster, viz.dim.red = dim_red), 
  ncol=3, align = "v"
)

## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Coloring by milestone
## 
## Using milestone_percentages from trajectory

(4) Pseudotime

Rooting

At this stage the root can be defined by giving the cluster number. From the plots visualized above cluster 15 seems to be a good candidate for the root which corresponds to the more imature HSC cell type (hematopoietic stem cells).

The ChangeTrajRoot() allow us to change the root by providing the name of the clustering/MST result (traj.name = select.cluster) and the cluster number that should be root (root.cluster = root.cluster).

This step can be skipped, but, in that case, the root will be randomly assigned.

## Define the root of the cell trajectory
root.cluster <- 15
sce <- ChangeTrajRoot(object = sce, traj.name = select.cluster, root.cluster = root.cluster)

Pseudotime

The smoothed, rooted and directed trajectory for the best clustering/MST result with (plot.pseudotime = TRUE) or without (plot.pseudotime = FALSE) pseudotime can be highlighted below with the function VizSmoothedTraj().

cowplot::plot_grid(
  VizSmoothedTraj(object = sce,
                  traj.names = select.cluster,
                  viz.dim.red = dim_red, plot.pseudotime = FALSE),
  VizSmoothedTraj(object = sce,
                  traj.names = select.cluster,
                  viz.dim.red = dim_red, plot.pseudotime = TRUE), 
  ncol=2
)

## Coloring by milestone

## Using milestone_percentages from trajectory

## Pseudotime not provided, will calculate pseudotime from root milestone

## Warning in dynwrap::calculate_pseudotime(trajectory): Trajectory is not rooted.
## Add a root to the trajectory using dynwrap::add_root(). This will result in an
## error in future releases.

## root cell or milestone not provided, trying first outgoing milestone_id

## Using '15' as root

Pseudotime comparison

Totem’s pseudotime (on the left side) can be compared with Palantir’s pseudotime (on the right side). The Pearson’s correlation between pseudotimes was 0.88.

## Compare Totem pseudotime against ground-truth
cowplot::plot_grid(
  (VizSmoothedTraj(object = sce,
                traj.names = select.cluster,
                viz.dim.red = dim_red, plot.pseudotime = TRUE) + 
     ggtitle("Totem")),
  (plotReducedDim(sce, dimred = "tsne", colour_by = "palantir_pseudotime") + 
      ggtitle("Palantir's pseudotime")) + 
    theme_void() + 
    theme(legend.position = "bottom"), 
   ncol=2
   )

## Pseudotime not provided, will calculate pseudotime from root milestone

## Warning in dynwrap::calculate_pseudotime(trajectory): Trajectory is not rooted.
## Add a root to the trajectory using dynwrap::add_root(). This will result in an
## error in future releases.

## root cell or milestone not provided, trying first outgoing milestone_id

## Using '15' as root

Despite the good correlation between pseudotime obtained between Totem’s and Palantir’s and the correct prediction of terminal states once the root was provided, the lymphoid lineage was wrongly predicted. CLP (common lymphoid progenitor) should had been diverging from HSC (hematopoietic stem cells).

One of the reasons that may explain this wrong prediction is how well the low-dimensional representation used for clustering fairly represents the relationship between cell types. The CLP population appears farther apart from the HSC than the cluster MyP (myeloid progenitor) in the two-dimensional tSNE projection making more likely to branch from the latter than the former.

In general, the user should be very critical about lineages towards cell populations/clusters that do not show a continuous development in the dimensional reduction projection used for clustering.

Finally, the sce object with all the results stored on it is exported.

## Export sce object
saveRDS(object = sce, file = "../results/sce_quickstart.rds")

R packages used and respective versions

## R packages and versions used in these analyses
sessionInfo()

## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] scater_1.26.1               ggplot2_3.4.1              
##  [3] scuttle_1.8.4               SingleCellExperiment_1.20.0
##  [5] SummarizedExperiment_1.28.0 Biobase_2.58.0             
##  [7] GenomicRanges_1.50.2        GenomeInfoDb_1.34.9        
##  [9] IRanges_2.32.0              S4Vectors_0.36.2           
## [11] BiocGenerics_0.44.0         MatrixGenerics_1.10.0      
## [13] matrixStats_0.63.0          Totem_0.99.2               
## [15] dplyr_1.1.0                
## 
## loaded via a namespace (and not attached):
##   [1] dyndimred_1.0.4           babelwhale_1.1.0         
##   [3] plyr_1.8.8                igraph_1.4.1             
##   [5] proxyC_0.3.3              dynwrap_1.2.3            
##   [7] BiocParallel_1.32.5       digest_0.6.31            
##   [9] foreach_1.5.2             htmltools_0.5.4          
##  [11] viridis_0.6.2             fansi_1.0.4              
##  [13] magrittr_2.0.3            carrier_0.1.0            
##  [15] ScaledMatrix_1.6.0        cluster_2.1.4            
##  [17] tzdb_0.3.0                remotes_2.4.2            
##  [19] readr_2.1.4               graphlayouts_0.8.4       
##  [21] RcppParallel_5.1.7        dynutils_1.0.11          
##  [23] princurve_2.1.6           colorspace_2.1-0         
##  [25] ggrepel_0.9.3             xfun_0.37                
##  [27] crayon_1.5.2              RCurl_1.98-1.10          
##  [29] jsonlite_1.8.4            iterators_1.0.14         
##  [31] ape_5.7-1                 glue_1.6.2               
##  [33] polyclip_1.10-4           gtable_0.3.1             
##  [35] zlibbioc_1.44.0           XVector_0.38.0           
##  [37] DelayedArray_0.24.0       BiocSingular_1.14.0      
##  [39] kernlab_0.9-32            prabclus_2.3-2           
##  [41] DEoptimR_1.0-11           dynparam_1.0.2           
##  [43] scales_1.2.1              Rcpp_1.0.10              
##  [45] TrajectoryUtils_1.6.0     viridisLite_0.4.1        
##  [47] aricode_1.0.2             rsvd_1.0.5               
##  [49] mclust_6.0.0              RColorBrewer_1.1-3       
##  [51] fpc_2.2-10                modeltools_0.2-23        
##  [53] ellipsis_0.3.2            pkgconfig_2.0.3          
##  [55] flexmix_2.3-18            farver_2.1.1             
##  [57] nnet_7.3-18               sass_0.4.5               
##  [59] utf8_1.2.3                labeling_0.4.2           
##  [61] tidyselect_1.2.0          rlang_1.1.0              
##  [63] reshape2_1.4.4            munsell_0.5.0            
##  [65] tools_4.2.3               cachem_1.0.7             
##  [67] cli_3.6.0                 generics_0.1.3           
##  [69] evaluate_0.20             stringr_1.5.0            
##  [71] fastmap_1.1.1             yaml_2.3.7               
##  [73] processx_3.8.0            knitr_1.42               
##  [75] tidygraph_1.2.3           robustbase_0.95-0        
##  [77] lmds_0.1.0                purrr_1.0.1              
##  [79] ggraph_2.1.0              nlme_3.1-162             
##  [81] sparseMatrixStats_1.10.0  GA_3.2.3                 
##  [83] compiler_4.2.3            beeswarm_0.4.0           
##  [85] tibble_3.2.0              tweenr_2.0.2             
##  [87] bslib_0.4.2               stringi_1.7.12           
##  [89] highr_0.10                ps_1.7.2                 
##  [91] desc_1.4.2                lattice_0.20-45          
##  [93] Matrix_1.5-3              vctrs_0.5.2              
##  [95] pillar_1.8.1              lifecycle_1.0.3          
##  [97] jquerylib_0.1.4           BiocNeighbors_1.16.0     
##  [99] cowplot_1.1.1             bitops_1.0-7             
## [101] irlba_2.3.5.1             patchwork_1.1.2          
## [103] R6_2.5.1                  gridExtra_2.3            
## [105] vipor_0.4.5               codetools_0.2-19         
## [107] MASS_7.3-58.2             assertthat_0.2.1         
## [109] rprojroot_2.0.3           withr_2.5.0              
## [111] GenomeInfoDbData_1.2.9    diptest_0.76-0           
## [113] parallel_4.2.3            hms_1.1.2                
## [115] grid_4.2.3                beachmat_2.14.0          
## [117] tidyr_1.3.0               class_7.3-21             
## [119] rmarkdown_2.20            DelayedMatrixStats_1.20.0
## [121] slingshot_2.6.0           ggforce_0.4.1            
## [123] dynplot_1.1.2             ggbeeswarm_0.6.0

References

Amezquita, Robert, Aaron Lun, Etienne Becht, Vince Carey, Lindsay Carpp, Ludwig Geistlinger, Federico Marini, et al. 2020. “Orchestrating Single-Cell Analysis with Bioconductor.” Nature Methods 17: 137–45. https://www.nature.com/articles/s41592-019-0654-x.

Cannoodt, Robrecht, and Wouter Saelens. 2021. Dynplot: Visualising Single-Cell Trajectories. https://github.com/dynverse/dynplot.

McCarthy, Davis J., Kieran R. Campbell, Aaron T. L. Lun, and Quin F. Willis. 2017. “Scater: Pre-Processing, Quality Control, Normalisation and Visualisation of Single-Cell RNA-Seq Data in R.” Bioinformatics 33: 1179–86. https://doi.org/10.1093/bioinformatics/btw777.

Setty, Manu, Vaidotas Kiseliovas, Jacob Levine, Adam Gayoso, Linas Mazutis, and Dana Pe’Er. 2019. “Characterization of Cell Fate Probabilities in Single-Cell Data with Palantir.” Nature Biotechnology 37 (4): 451–60.

Smolander, Johannes, Sini Junttila, and Laura L Elo. 2022. “Totem: A User-Friendly Tool for Clustering-Based Inference of Tree-Shaped Trajectories from Single-Cell Data.” bioRxiv, 2022–09.

Street, Kelly, Davide Risso, Russell B Fletcher, Diya Das, John Ngai, Nir Yosef, Elizabeth Purdom, and Sandrine Dudoit. 2018. “Slingshot: Cell Lineage and Pseudotime Inference for Single-Cell Transcriptomics.” BMC Genomics 19: 1–16.

Wickham, Hadley. 2016. Ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. https://ggplot2.tidyverse.org.

Wickham, Hadley, Romain François, Lionel Henry, and Kirill Müller. 2022. Dplyr: A Grammar of Data Manipulation.

Quick start guide of single-cell TI with Totem

Elo lab - https://elolab.utu.fi

20/04/2023

QuickStart

(1) Preprocessing

(2) `Totem`’s TI

(3) Visualization

Cell connectivity

Clustering/MST

(4) Pseudotime

Rooting

Pseudotime

Pseudotime comparison

R packages used and respective versions

References

Quick start guide of single-cell TI with Totem

Elo lab - https://elolab.utu.fi

20/04/2023

QuickStart

(1) Preprocessing

(2) Totem’s TI

(3) Visualization

Cell connectivity

Clustering/MST

(4) Pseudotime

Rooting

Pseudotime

Pseudotime comparison

R packages used and respective versions

References

(2) `Totem`’s TI