Dataset
To quickly illustrate the multi-level integration algorithm, we will
use in this vignette two 10X PBMCs (Peripheral Blood Mononuclear Cells)
3’ assays: V1 and V2. The datasets
have been downloaded from 10X website. The PBMC dataset
V1 corresponds to sample pbmc6k and
V2 to pbmc8k
:
Cells were annotated using the annotations provided by Korsunsky et
al., 2019 (Source Data Figure 4 file). The overall data was
downsampled to 2K cells (1K per assay) and 2K highly variable genes
selected with scran
R package. To facilitate the
reproduction of this vignette, the data is distributed through
Zenodo as a SingleCellExperiment
object, the
object (class) required by most functions in Coralysis
(see
Chapter
4 The SingleCellExperiment class - OSCA manual). The
SCE
object provided comprises counts
(raw
count data), logcounts
(log-normalized data) and cell
colData
(which includes batch and cell labels, designated
as batch
and cell_type
, respectively).
Run the code below to import the R
packages and data
required to reproduce this vignette.
# Import packages
library("Coralysis")
library("SingleCellExperiment")
# Import data from Zenodo
data.url <- "https://zenodo.org/records/14871436/files/pbmc_10Xassays.rds?download=1"
pbmc_10Xassays <- readRDS(file = url(data.url))
pbmc_10Xassays # print SCE object
## class: SingleCellExperiment
## dim: 2000 2000
## metadata(0):
## assays(2): counts logcounts
## rownames(2000): LYZ S100A9 ... IGSF8 KIR2DL1
## rowData names(8): highly_variable mean ... FDR per.block
## colnames(2000): ACATACCTGTCAAC ATTGAAACTCGTAG ... AGAGTGGCAGGGTTAG
## GACTGCGAGCGTTCCG
## colData names(3): barcode batch cell_type
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
Preprocess
The PrepareData()
function checks whether a sparse
matrix is available in the logcounts
assay (which
corresponds to log-normalized data) and removes non-expressed
features.
# Prepare data:
#checks 'logcounts' format & removes non-expressed genes
pbmc_10Xassays <- PrepareData(object = pbmc_10Xassays)
## Data in `logcounts` slot already of `dgCMatrix` class...
## 2000/2000 features remain after filtering features with only zero values.
Multi-level integration
The multi-level integration algorithm is implemented in the
RunParallelDivisiveICP()
function, the main function in
Coralysis
. It only requires a
SingleCellExperiment
object, which in this case is
pbmc_10Xassays
.
To perform integration across a batch, a batch.label
available in colData(pbmc_10Xassays)
must be provided. In
this case, it is "batch"
. The ensemble algorithm runs 50
times by default, but for illustrative purposes, this has been reduced
to 10 (L=10
).
Two threads are allocated to speed up the process
(threads=2
), though by default, the function uses all
available system threads. Specify one thread if you prefer not to use
any additional threads.
The result consists of a set of models and their respective
probability matrices (n = 40; log2(k
) * L
),
stored in metadata(pbmc_10Xassays)$coralysis$models
and
metadata(pbmc_10Xassays)$coralysis$joint.probability
,
respectively.
# Multi-level integration
set.seed(123)
pbmc_10Xassays <- RunParallelDivisiveICP(object = pbmc_10Xassays,
batch.label = "batch",
L = 10, threads = 2)
##
## Building training set...
## Training set successfully built.
##
## Computing cluster seed.
##
## Initializing divisive ICP clustering...
## | | | 0% | |======== | 11% | |================ | 22% | |======================= | 33% | |=============================== | 44% | |======================================= | 56% | |=============================================== | 67% | |====================================================== | 78% | |============================================================== | 89% | |======================================================================| 100%
##
## Divisive ICP clustering completed successfully.
##
## Predicting cell cluster probabilities using ICP models...
## Prediction of cell cluster probabilities completed successfully.
##
## Multi-level integration completed successfully.
Integrated embedding
The integrated result of Coralysis
consist of an
integrated embedding which can be obtained by running the function
RunPCA
. This integrated PCA can, in turn, be used
downstream for clustering or non-linear dimensional reduction
techniques, such as RunTSNE
or RunUMAP
. The
function RunPCA
runs by default the PCA method implemented
the R
package irlba
(pca.method="irlba"
), which requires a seed to ensure the
same PCA result.
## Divisive ICP: selecting ICP tables multiple of 4
UMAP
Compute UMAP by running the function RunUMAP()
.
Visualize batch & cell types
Finally, the integration can be visually inspected by highlighting the batch and cell type labels into the UMAP projection.
# Visualize categorical variables integrated emb.
vars <- c("batch", "cell_type")
plots <- lapply(X = vars, FUN = function(x) {
PlotDimRed(object = pbmc_10Xassays, color.by = x,
point.size = 0.25, point.stroke = 0.5,
legend.nrow = 3)
})
cowplot::plot_grid(plotlist = plots, ncol = 2, align = "vh") # join plots together
R session
# R session
sessionInfo()
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
## [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
## [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
## [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
##
## time zone: UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
## [3] Biobase_2.66.0 GenomicRanges_1.58.0
## [5] GenomeInfoDb_1.42.3 IRanges_2.40.1
## [7] S4Vectors_0.44.0 BiocGenerics_0.52.0
## [9] MatrixGenerics_1.18.1 matrixStats_1.5.0
## [11] Coralysis_1.0.0
##
## loaded via a namespace (and not attached):
## [1] rlang_1.1.5 magrittr_2.0.3 flexclust_1.4-2
## [4] compiler_4.4.2 png_0.1-8 systemfonts_1.2.1
## [7] vctrs_0.6.5 reshape2_1.4.4 stringr_1.5.1
## [10] pkgconfig_2.0.3 crayon_1.5.3 fastmap_1.2.0
## [13] XVector_0.46.0 labeling_0.4.3 scuttle_1.16.0
## [16] rmarkdown_2.29 ggbeeswarm_0.7.2 UCSC.utils_1.2.0
## [19] ragg_1.3.3 xfun_0.50 modeltools_0.2-23
## [22] bluster_1.16.0 zlibbioc_1.52.0 cachem_1.1.0
## [25] beachmat_2.22.0 jsonlite_1.8.9 DelayedArray_0.32.0
## [28] BiocParallel_1.40.0 irlba_2.3.5.1 parallel_4.4.2
## [31] aricode_1.0.3 cluster_2.1.6 R6_2.6.0
## [34] bslib_0.9.0 stringi_1.8.4 RColorBrewer_1.1-3
## [37] reticulate_1.40.0 limma_3.62.2 jquerylib_0.1.4
## [40] Rcpp_1.0.14 iterators_1.0.14 knitr_1.49
## [43] snow_0.4-4 Matrix_1.7-1 igraph_2.1.4
## [46] tidyselect_1.2.1 abind_1.4-8 yaml_2.3.10
## [49] codetools_0.2-20 doRNG_1.8.6.1 lattice_0.22-6
## [52] tibble_3.2.1 plyr_1.8.9 withr_3.0.2
## [55] askpass_1.2.1 ggrastr_1.0.2 evaluate_1.0.3
## [58] desc_1.4.3 pillar_1.10.1 rngtools_1.5.2
## [61] foreach_1.5.2 generics_0.1.3 ggplot2_3.5.1
## [64] sparseMatrixStats_1.18.0 munsell_0.5.1 scales_1.3.0
## [67] class_7.3-22 glue_1.8.0 metapod_1.14.0
## [70] pheatmap_1.0.12 LiblineaR_2.10-24 tools_4.4.2
## [73] BiocNeighbors_2.0.1 ScaledMatrix_1.14.0 SparseM_1.84-2
## [76] RSpectra_0.16-2 locfit_1.5-9.11 RANN_2.6.2
## [79] fs_1.6.5 scran_1.34.0 Cairo_1.6-2
## [82] cowplot_1.1.3 grid_4.4.2 umap_0.2.10.0
## [85] edgeR_4.4.2 colorspace_2.1-1 GenomeInfoDbData_1.2.13
## [88] beeswarm_0.4.0 BiocSingular_1.22.0 vipor_0.4.7
## [91] cli_3.6.4 rsvd_1.0.5 textshaping_1.0.0
## [94] S4Arrays_1.6.0 dplyr_1.1.4 doSNOW_1.0.20
## [97] gtable_0.3.6 sass_0.4.9 digest_0.6.37
## [100] SparseArray_1.6.1 dqrng_0.4.1 farver_2.1.2
## [103] htmltools_0.5.8.1 pkgdown_2.1.1 lifecycle_1.0.4
## [106] httr_1.4.7 statmod_1.5.0 openssl_2.3.2
References
Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.
Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023