This tutorial is a quick start guide with Totem
, a
user-friendly tree-shaped single-cell trajectory inference (TI) method
(Smolander, Junttila, and Elo 2022). The
minimum requirement to run Totem
is a low-dimensional
result, such as tSNE or UMAP, as far as this low-dimensional
representation represents fairly the cell dynamic process in study,
e.g., cell cycle, activation or differentiation.
Commonly one is interested in inferring cell trajectory after running an integration and/or clustering analysis. In either case the scRNA-seq data was most likely projected onto the low-dimensional space, such as PCA and UMAP, along the analysis. Under this scenario, the user might be interested in preserving these low-dimensional reductions during the cell TI analysis.
With this in mind, this quick start guide applies
Totem
’s TI to a scRNA-seq data set of CD34+ bone
marrow cells (replicate 1) used as an example in the
Palantir
’s TI publication (Setty et
al. 2019) and publicly available at Human Cell Atlas Portal
(link: https://data.humancellatlas.org/explore/projects/091cf39b-01bc-42e5-9437-f419a66c8a45).
The processed data set with tSNE projection, cluster
identities,palantir
pseudotime, transformed log-normalized
and raw data provided at human_cd34_bm_rep1.h5ad
will be used for demonstration of the user-friendliness and experience
of Totem
’s TI analysis. The
human_cd34_bm_rep1.h5ad
anndata
h5ad
object was downloaded, parsed and distributed as a RDS
SingleCellExperiment
class object at
data/human_cd34_bm_rep1.rds
. The R script used for this
object is provided at
scripts/download_h5ad_to_SCE_rds_script.R
.
The content of this tutorial is enumerated and described below.
Preprocessing: importing data to R and preparing it to
Totem
Totem
’s TI: two-step Totem
’s TI
workflow:
2.1. clustering & multiple spanning trees (MSTs)
2.2. smoothing selected clustering/MSTs
Visualization: visualizing Totem
’s cell
connectivity, clustering, and trajectory’s topology
Pseudotime: re-rooting and comparison of Totem
’s
pseudotime estimates to those of Palantir
Load the following R
packages.
# Load packages
library("dplyr")
library("Totem")
library("scater")
library("ggplot2")
library("SingleCellExperiment")
dplyr
v.1.1.0 (Wickham et al.
2022): data wrangling, e.g., piping data with
%>%
ggplot2
v.3.4.1 (Wickham
2016): data visualization
Totem
v.0.99.2 (Smolander,
Junttila, and Elo 2022): trajectory inference
scater
v.1.26.1 (McCarthy et
al. 2017): Bioconductor
scRNA toolkit - used for
low-dimensional visualization
SingleCellExperiment
v.1.20.0 (Amezquita et al. 2020):
Bioconductor
single-cell data object - R
class
object used to interact with Totem
Set seed to keep the reproducibility of the analyses generated.
# Set seed
set.seed(1204)
Import the RDS
SingleCellExperiment
class
object storing the scRNA-seq data set with the function
readRDS
(file available at:
data/human_cd34_bm_rep1.rds
).
## Import the scRNA-seq SCE object
sce <- readRDS(file = "../data/human_cd34_bm_rep1.rds")
SingleCellExperiment
class object?
Totem
conveniently uses the
SingleCellExperiment
(SCE
for short)
R
package and class to store all the information that
generates in its workflow. This can be accessed through:
metadata(sce)$totem
(which returns a list of objects
generated in the Totem
workflow).
A SCE
object can store several gene expression assays
(raw, log counts, etc), metadata (cell/gene) and dimensional reduction
data. Please have a look into the Chapter
4 The SingleCellExperiment class of the OSCA book to
learn more about the SingleCellExperiment
class and
functionality.
Before starting the workflow, we exclude any potential non expressed gene from our SCE object.
## Prepare data for Totem: remove non-expressed genes
sce <- PrepareTotem(object = sce)
##
## The data matrix has 14651 features.
## The data matrix has 5780 cells.
## Please note that the filtering of poor-quality cells needs to be
## done before Totem analysis. In addition, the cells need to be part of
## a single, connected, tree-shaped trajectory
## (not multiple disconnected trajectories.
##
## 14624/14651 features remain after filtering genes with only zero values.
##
The sce
object contains the assays counts, logcounts,
scaled, the cell metadata clusters, palantir_pseudotime,
palantir_diff_potential, cell_types_long, cell_types_short with 14624
genes and 5780 cells.
Totem
’s TIThe two-step Totem
’s TI workflow consists:
clustering & multiple spanning trees (MSTs):
RunClustering(...)
smoothing selected clustering/MSTs:
SelectClusterings(...)
followed by
RunSmoothing(...)
The only input required is the SingleCellExperiment
object named sce
with a low-dimensional reduction result in
reducedDim(sce)
, which in this case is a tsne
with two dimensions (run the following to see the first tSNE 6 cell
coordinates - head(reducedDim(sce))
).
## Totem's TI workflow:
# (1) clustering dimensional reduction w/ CLARA (k-medoids), MSTs
# (2) smoothing best clustering/MSTs w/ principal curves algorithm
set.seed(123) # keep reproducibility
sce <- RunClustering(object = sce) %>%
SelectClusterings(object = .) %>%
RunSmoothing(object = .)
## using tsne as the dim.reduced data
## number of clustering results: 10000
%>%
mean/do?
The expression %>%
is a pipe. Pipes are a convenient
computational way of passing the output generated by the previous
command to the input of the next command. If you’re not familiar with
pipes (%>%
coming from the package dplyr
),
you can run the equivalent to the previous code without using pipes.
## Totem's TI workflow:
# (1) clustering dimensional reduction w/ CLARA (k-medoids), MSTs
# (2) smoothing best clustering/MSTs w/ principal curves algorithm
set.seed(123) # keep reproducibility
sce <- RunClustering(object = sce)
sce <- SelectClusterings(object = sce)
sce <- RunSmoothing(object = sce)
RunClustering(...)
: It performs clustering and MSTs.
Clustering is one of the main steps in the Totem
workflow
and the one that takes more time and computational resources.
Totem
uses the dimensional latent representations provided
in reducedDim(sce)
to cluster the data set 10K times
(N.clusterings = 10000
) with the k-medoids (CLARA)
algorithm using an expected number of clusters varying between 3-20
(k.range = 3:20
). It filters out clusters with <5 cells
(min.cluster.size = 5
). All the parameters mentioned are
the default and they can be changed. After filtering out clustering
results due to min.cluster.size
, the total number of
clustering results will be lower than N.clusterings
. At
this stage, it is also estimated a MST for each clustering
result.
SelectClusterings(...)
: Totem
performs
a large number of clustering results as part of its approach. Since they
can not be all inspected manually, Totem
provides 5 methods
to select the top clustering results (by default 1 -
selection.N.models = 1
). All the methods are described in
detail in the Totem
vignette (see vignette).
You can also see the function documentation by typing the following in
the R console: ?SelectClusterings
. The default method used
was method 1 (selection.method = 1
) which uses the Variance
Ratio Criterion (VRC, aka Calinski-Harabasz score), which
measures the within- and between-cluster dispersion, to select the best
clustering result.
RunSmoothing(...)
: Smoothing the top selected MST(s)
using the principle curves slingshot
algorithm (Street et al. 2018) resulting in directed
trajectories randomly rooted and discrete pseudotime along the lineage.
The user can adjust the root later.
The Totem
’s cell connectivities, clustering/MST and
trajectory can be visual inspected after running Totem
’s TI
method.
Cell connectivity is a novel concept introduced in
Totem
. For a given clustering result and the respective
MST, the connectivity of a cell cluster consists in the ratio of its
connections to other clusters by the number of clusters. Higher the
value farther the distance for the leafs/ending points of the trajectory
and, thus, more likely to represent, e.g., branching points (see Fig.4
(Smolander, Junttila, and Elo 2022)).
Totem
relies on the dynplot
(Cannoodt and Saelens 2021) package for
visualization which requires to give the dimensional reduction result as
a parameter (viz.dim.red
). Thus the tSNE dimensional
reduction is first retrieved from the sce
object with
dim_red <- reducedDim(sce, "tsne")
and given as
parameter (`viz.dim.red = dim_red
) to the function
VizCellConnectivity()
together with the sce
object to plot cell connectivities.
## Visualization of cell connectivity
dim_red <- reducedDim(sce, "tsne") # retrieve tSNE
VizCellConnectivity(object = sce, viz.dim.red = dim_red) # plot
## Coloring by milestone
## Using milestone_percentages from trajectory
## Coloring by expression
The cell connectivity plot suggests that HSC and MyP are branching points during hematopoiesis.
Next the best clustering and MST can be inspected with the function
VizMST()
. The name of the best clustering and MST needs to
be retrieved
(select.cluster <- names(metadata(sce)$totem$slingshot_trajectory)
)
and given as parameter (clustering.names = select.cluster
).
In addition, the sce
object and tSNE result
(viz.dim.red = dim_red)
) are also given.
In the code below it is also plotted the Palantir
’s
cluster and cell types abbreviated. For this end it is used the function
plotReducedDim()
from the scater
package (read
more about the function ?plotReducedDim
).
## Visualization of clustering/MST
select.cluster <- names(metadata(sce)$totem$slingshot_trajectory) # retrieve the name of the best cluster
cowplot::plot_grid(
plotReducedDim(sce, dimred = "tsne", color_by = "clusters") +
scale_color_manual(name= "Clusters", values=unlist(metadata(sce)$cluster_colors)) +
ggtitle("Palantir's clusters") +
theme_void() +
theme(legend.position = "bottom"),
plotReducedDim(sce, dimred = "tsne", color_by = "cell_types_short") +
scale_color_manual(name= "Cell Types",
values=as.character(unlist(metadata(sce)$cluster_colors)[!duplicated(levels(sce$cell_types_short))])) +
ggtitle("Cell types") +
theme_void() +
theme(legend.position = "bottom"),
VizMST(object = sce, clustering.names = select.cluster, viz.dim.red = dim_red),
ncol=3, align = "v"
)
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Scale for colour is already present.
## Adding another scale for colour, which will replace the existing scale.
## Coloring by milestone
##
## Using milestone_percentages from trajectory
At this stage the root can be defined by giving the cluster number. From the plots visualized above cluster 15 seems to be a good candidate for the root which corresponds to the more imature HSC cell type (hematopoietic stem cells).
The ChangeTrajRoot()
allow us to change the root by
providing the name of the clustering/MST result
(traj.name = select.cluster
) and the cluster number that
should be root (root.cluster = root.cluster
).
This step can be skipped, but, in that case, the root will be randomly assigned.
## Define the root of the cell trajectory
root.cluster <- 15
sce <- ChangeTrajRoot(object = sce, traj.name = select.cluster, root.cluster = root.cluster)
The smoothed, rooted and directed trajectory for the best
clustering/MST result with (plot.pseudotime = TRUE
) or
without (plot.pseudotime = FALSE
) pseudotime can be
highlighted below with the function VizSmoothedTraj()
.
cowplot::plot_grid(
VizSmoothedTraj(object = sce,
traj.names = select.cluster,
viz.dim.red = dim_red, plot.pseudotime = FALSE),
VizSmoothedTraj(object = sce,
traj.names = select.cluster,
viz.dim.red = dim_red, plot.pseudotime = TRUE),
ncol=2
)
## Coloring by milestone
## Using milestone_percentages from trajectory
## Pseudotime not provided, will calculate pseudotime from root milestone
## Warning in dynwrap::calculate_pseudotime(trajectory): Trajectory is not rooted.
## Add a root to the trajectory using dynwrap::add_root(). This will result in an
## error in future releases.
## root cell or milestone not provided, trying first outgoing milestone_id
## Using '15' as root
Totem
’s pseudotime (on the left side) can be compared
with Palantir
’s pseudotime (on the right side). The
Pearson’s correlation between pseudotimes was 0.88.
## Compare Totem pseudotime against ground-truth
cowplot::plot_grid(
(VizSmoothedTraj(object = sce,
traj.names = select.cluster,
viz.dim.red = dim_red, plot.pseudotime = TRUE) +
ggtitle("Totem")),
(plotReducedDim(sce, dimred = "tsne", colour_by = "palantir_pseudotime") +
ggtitle("Palantir's pseudotime")) +
theme_void() +
theme(legend.position = "bottom"),
ncol=2
)
## Pseudotime not provided, will calculate pseudotime from root milestone
## Warning in dynwrap::calculate_pseudotime(trajectory): Trajectory is not rooted.
## Add a root to the trajectory using dynwrap::add_root(). This will result in an
## error in future releases.
## root cell or milestone not provided, trying first outgoing milestone_id
## Using '15' as root
Despite the good correlation between pseudotime obtained between
Totem
’s and Palantir
’s and the correct
prediction of terminal states once the root was provided, the lymphoid
lineage was wrongly predicted. CLP (common lymphoid progenitor) should
had been diverging from HSC (hematopoietic stem cells).
One of the reasons that may explain this wrong prediction is how well the low-dimensional representation used for clustering fairly represents the relationship between cell types. The CLP population appears farther apart from the HSC than the cluster MyP (myeloid progenitor) in the two-dimensional tSNE projection making more likely to branch from the latter than the former.
In general, the user should be very critical about lineages towards cell populations/clusters that do not show a continuous development in the dimensional reduction projection used for clustering.
Finally, the sce
object with all the results stored on
it is exported.
## Export sce object
saveRDS(object = sce, file = "../results/sce_quickstart.rds")
## R packages and versions used in these analyses
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] scater_1.26.1 ggplot2_3.4.1
## [3] scuttle_1.8.4 SingleCellExperiment_1.20.0
## [5] SummarizedExperiment_1.28.0 Biobase_2.58.0
## [7] GenomicRanges_1.50.2 GenomeInfoDb_1.34.9
## [9] IRanges_2.32.0 S4Vectors_0.36.2
## [11] BiocGenerics_0.44.0 MatrixGenerics_1.10.0
## [13] matrixStats_0.63.0 Totem_0.99.2
## [15] dplyr_1.1.0
##
## loaded via a namespace (and not attached):
## [1] dyndimred_1.0.4 babelwhale_1.1.0
## [3] plyr_1.8.8 igraph_1.4.1
## [5] proxyC_0.3.3 dynwrap_1.2.3
## [7] BiocParallel_1.32.5 digest_0.6.31
## [9] foreach_1.5.2 htmltools_0.5.4
## [11] viridis_0.6.2 fansi_1.0.4
## [13] magrittr_2.0.3 carrier_0.1.0
## [15] ScaledMatrix_1.6.0 cluster_2.1.4
## [17] tzdb_0.3.0 remotes_2.4.2
## [19] readr_2.1.4 graphlayouts_0.8.4
## [21] RcppParallel_5.1.7 dynutils_1.0.11
## [23] princurve_2.1.6 colorspace_2.1-0
## [25] ggrepel_0.9.3 xfun_0.37
## [27] crayon_1.5.2 RCurl_1.98-1.10
## [29] jsonlite_1.8.4 iterators_1.0.14
## [31] ape_5.7-1 glue_1.6.2
## [33] polyclip_1.10-4 gtable_0.3.1
## [35] zlibbioc_1.44.0 XVector_0.38.0
## [37] DelayedArray_0.24.0 BiocSingular_1.14.0
## [39] kernlab_0.9-32 prabclus_2.3-2
## [41] DEoptimR_1.0-11 dynparam_1.0.2
## [43] scales_1.2.1 Rcpp_1.0.10
## [45] TrajectoryUtils_1.6.0 viridisLite_0.4.1
## [47] aricode_1.0.2 rsvd_1.0.5
## [49] mclust_6.0.0 RColorBrewer_1.1-3
## [51] fpc_2.2-10 modeltools_0.2-23
## [53] ellipsis_0.3.2 pkgconfig_2.0.3
## [55] flexmix_2.3-18 farver_2.1.1
## [57] nnet_7.3-18 sass_0.4.5
## [59] utf8_1.2.3 labeling_0.4.2
## [61] tidyselect_1.2.0 rlang_1.1.0
## [63] reshape2_1.4.4 munsell_0.5.0
## [65] tools_4.2.3 cachem_1.0.7
## [67] cli_3.6.0 generics_0.1.3
## [69] evaluate_0.20 stringr_1.5.0
## [71] fastmap_1.1.1 yaml_2.3.7
## [73] processx_3.8.0 knitr_1.42
## [75] tidygraph_1.2.3 robustbase_0.95-0
## [77] lmds_0.1.0 purrr_1.0.1
## [79] ggraph_2.1.0 nlme_3.1-162
## [81] sparseMatrixStats_1.10.0 GA_3.2.3
## [83] compiler_4.2.3 beeswarm_0.4.0
## [85] tibble_3.2.0 tweenr_2.0.2
## [87] bslib_0.4.2 stringi_1.7.12
## [89] highr_0.10 ps_1.7.2
## [91] desc_1.4.2 lattice_0.20-45
## [93] Matrix_1.5-3 vctrs_0.5.2
## [95] pillar_1.8.1 lifecycle_1.0.3
## [97] jquerylib_0.1.4 BiocNeighbors_1.16.0
## [99] cowplot_1.1.1 bitops_1.0-7
## [101] irlba_2.3.5.1 patchwork_1.1.2
## [103] R6_2.5.1 gridExtra_2.3
## [105] vipor_0.4.5 codetools_0.2-19
## [107] MASS_7.3-58.2 assertthat_0.2.1
## [109] rprojroot_2.0.3 withr_2.5.0
## [111] GenomeInfoDbData_1.2.9 diptest_0.76-0
## [113] parallel_4.2.3 hms_1.1.2
## [115] grid_4.2.3 beachmat_2.14.0
## [117] tidyr_1.3.0 class_7.3-21
## [119] rmarkdown_2.20 DelayedMatrixStats_1.20.0
## [121] slingshot_2.6.0 ggforce_0.4.1
## [123] dynplot_1.1.2 ggbeeswarm_0.6.0