Integration

Dataset

To illustrate the multi-level integration algorithm, this vignette uses two 10X Genomics PBMC (peripheral blood mononuclear cell) 3’ assays: V1 and V2. The datasets were downloaded from the 10X website. The PBMC dataset V1 corresponds to sample pbmc6k and V2 to pbmc8k:

V1: pbmc6k
V2: pbmc8k

Cells were annotated using the annotations provided by Korsunsky et al., 2019 (Source Data Figure 4 file). The data was downsampled to 2K cells (1K per assay), and 2K highly variable genes were selected with the scran R package. To facilitate the reproduction of this vignette, the data is distributed through Zenodo as a SingleCellExperiment (SCE) object, the class required by most functions in Coralysis (see Chapter 4 The SingleCellExperiment class - OSCA manual). The SCE object provided comprises counts (raw count data), logcounts (log-normalized data) and the cell colData (which includes the batch and cell labels, designated as batch and cell_type, respectively).

Run the code below to import the R packages and data required to reproduce this vignette.

# Packages
library("ggplot2")
library("Coralysis")
library("SingleCellExperiment")

# Import data from Zenodo
data.url <- "https://zenodo.org/records/14845751/files/pbmc_10Xassays.rds?download=1"
pbmc_10Xassays <- readRDS(file = url(data.url))
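
The layout described above (two assays plus the batch and cell_type columns in colData) can be checked before any analysis. The sketch below builds a small toy SingleCellExperiment with the same layout, so it runs without the real data. The helper name check_sce_layout and all toy values are assumptions made for illustration; they are not part of Coralysis.

```r
library("SingleCellExperiment")

# Hypothetical helper (not part of Coralysis): verify that an SCE object has
# the assays and colData columns this vignette relies on.
check_sce_layout <- function(sce, assays = c("counts", "logcounts"),
                             labels = c("batch", "cell_type")) {
    missing.assays <- setdiff(assays, assayNames(sce))
    missing.labels <- setdiff(labels, colnames(colData(sce)))
    if (length(missing.assays) > 0 || length(missing.labels) > 0) {
        stop("Missing assays: ", toString(missing.assays),
             "; missing colData columns: ", toString(missing.labels))
    }
    # Number of cells per batch
    table(sce$batch)
}

# Toy object: 5 genes x 6 cells, 3 cells per batch
set.seed(1)
toy.counts <- matrix(rpois(30, lambda = 2), nrow = 5,
                     dimnames = list(paste0("gene", 1:5), paste0("cell", 1:6)))
toy.sce <- SingleCellExperiment(
    assays = list(counts = toy.counts, logcounts = log2(toy.counts + 1)),
    colData = DataFrame(batch = rep(c("V1", "V2"), each = 3),
                        cell_type = rep(c("T", "B", "NK"), times = 2))
)
cells.per.batch <- check_sce_layout(toy.sce)
```

Applied to the imported object, check_sce_layout(pbmc_10Xassays) would be expected to report the number of cells per assay stated above (1K each), assuming the batch column is named batch as described.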

DimRed: pre-integration

The batch effect between assays can be inspected below by projecting the data onto a t-distributed Stochastic Neighbor Embedding (t-SNE). This is achieved by running the Coralysis functions RunPCA and RunTSNE sequentially. Set a seed before running each of these functions to ensure reproducibility. By default, RunPCA uses the PCA method implemented in the R package irlba (pca.method="irlba"), which requires a seed to return the same PCA result. In addition, the assay.name argument needs to be provided here; otherwise, RunPCA uses by default the probabilities, which are only available after integration (i.e., after running RunParallelDivisiveICP). Below, the assay logcounts, corresponding to the log-normalized data, and the number of principal components to use, p, are provided. In this case, the data was previously normalized, but it could have been normalized using the methods available in Bioconductor (see Chapter 7 Normalization - OSCA manual). Any categorical variable available in colData(pbmc_10Xassays), such as batch or cell_type, can be visualized in a low-dimensional embedding stored in reducedDimNames(pbmc_10Xassays) with the Coralysis function PlotDimRed.

# Compute PCA & TSNE
set.seed(123)
pbmc_10Xassays <- RunPCA(object = pbmc_10Xassays, 
                         assay.name = "logcounts", 
                         p = 30, dimred.name = "unintPCA")
set.seed(123)
pbmc_10Xassays <- RunTSNE(pbmc_10Xassays, 
                          dimred.type = "unintPCA", 
                          dimred.name = "unintTSNE")

# Plot TSNE highlighting the batch & cell type
unint.batch.plot <- PlotDimRed(object = pbmc_10Xassays, 
                               color.by = "batch", 
                               dimred = "unintTSNE",
                               point.size = 0.01, 
                               legend.nrow = 1, 
                               seed.color = 1024)
unint.cell.plot <- PlotDimRed(object = pbmc_10Xassays, 
                              color.by = "cell_type", 
                              dimred = "unintTSNE", 
                              point.size = 0.01, 
                              legend.nrow = 5, 
                              seed.color = 7)
cowplot::plot_grid(unint.batch.plot, unint.cell.plot, ncol = 2, align = "vh")
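
The role of the seed set before RunPCA can be illustrated with a small, self-contained sketch. It calls irlba::prcomp_irlba directly on a toy matrix, which is an assumption made for illustration and not necessarily how RunPCA invokes irlba internally. The point is only that resetting the same seed before each run gives matching principal component scores.

```r
library("irlba")

# Toy matrix standing in for log-normalized data: 200 cells x 50 features
set.seed(42)
toy.mat <- matrix(rnorm(200 * 50), nrow = 200)

# Resetting the same seed before each run gives matching PCA scores
set.seed(123)
pca.a <- prcomp_irlba(toy.mat, n = 5)
set.seed(123)
pca.b <- prcomp_irlba(toy.mat, n = 5)
same.seed.equal <- isTRUE(all.equal(pca.a$x, pca.b$x))
```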

Multi-level integration

Integrate the assays with the multi-level integration algorithm implemented in Coralysis by running the function RunParallelDivisiveICP. The only arguments required by this function are object and batch.label. The object argument requires a SingleCellExperiment object with the assay logcounts. The matrix in logcounts should be sparse, i.e., is(logcounts(pbmc_10Xassays), "dgCMatrix") is TRUE, and it should not contain non-expressing genes. Both conditions are ensured by running PrepareData beforehand. The batch.label argument requires the name of the column in colData(pbmc_10Xassays) holding the batch label that should be used for integration. In the absence of a batch, the same function, RunParallelDivisiveICP, can be run without providing batch.label (i.e., batch.label = NULL), in which case the data will be modeled through the algorithm to identify fine-grained populations that do not require batch correction. A higher number of threads can be provided to speed up computing time, depending on the number of cores available. For this example, the algorithm was run 10 times (L = 10), but generally this number should be higher (the default is L = 50).

# Prepare data for integration:
# remove non-expressing genes & ensure `logcounts` is of `dgCMatrix` class
pbmc_10Xassays <- PrepareData(object = pbmc_10Xassays)

## Data in `logcounts` slot already of `dgCMatrix` class...

## 2000/2000 features remain after filtering features with only zero values.
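
The two conditions that PrepareData is described as ensuring (a sparse dgCMatrix and no non-expressing genes) can also be reproduced by hand on a toy matrix. The sketch below uses only the Matrix package and is a manual illustration of those checks under that assumption, not the implementation of PrepareData.

```r
library("Matrix")

# Toy dense log-expression matrix: 4 genes x 5 cells; gene3 is never expressed
toy.dense <- matrix(c(1, 0, 2, 0, 0,
                      0, 3, 0, 0, 1,
                      0, 0, 0, 0, 0,
                      2, 0, 0, 4, 0),
                    nrow = 4, byrow = TRUE,
                    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5)))

# Convert to a column-compressed sparse matrix (class `dgCMatrix`)
toy.sparse <- as(toy.dense, "CsparseMatrix")

# Drop features with only zero values
expressed <- Matrix::rowSums(toy.sparse) > 0
toy.filtered <- toy.sparse[expressed, , drop = FALSE]

is.sparse <- is(toy.filtered, "dgCMatrix")
message(sum(expressed), "/", length(expressed),
        " features remain after filtering features with only zero values.")
```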

# Perform integration with Coralysis
set.seed(1024)
pbmc_10Xassays <- RunParallelDivisiveICP(object = pbmc_10Xassays, 
                                         batch.label = "batch", 
                                         L = 10, threads = 2)

## 
## Building training set...

## Training set successfully built.

## 
## Computing cluster seed.

## 
## Initializing divisive ICP clustering...

##   |======================================================================| 100%

## 
## Divisive ICP clustering completed successfully.

## 
## Predicting cell cluster probabilities using ICP models...

## Prediction of cell cluster probabilities completed successfully.

## 
## Multi-level integration completed successfully.
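
Since the useful number of threads depends on the cores available, one option is to derive it from the machine rather than hard-coding it. The sketch below is a minimal example using parallel::detectCores from base R; leaving one core free is an arbitrary choice made here, not a Coralysis recommendation.

```r
# Pick a thread count that leaves one core free; fall back to 1 if the
# number of cores cannot be determined (detectCores() may return NA)
n.cores <- parallel::detectCores()
threads <- if (is.na(n.cores)) 1L else max(1L, n.cores - 1L)
```

The resulting value could then be passed to the integration call as threads = threads.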

DimRed: post-integration

The integration result can be visually inspected by running the functions RunPCA and RunTSNE sequentially. The assay.name provided to RunPCA must be joint.probability (the default), the primary output of integration with Coralysis. Running RunPCA(..., assay.name = "joint.probability") on these probability matrices yields an integrated embedding, which can in turn be used downstream for clustering or for non-linear dimensional reduction techniques, such as RunTSNE. Below, the integrated PCA is named intPCA.

# Compute PCA with joint cluster probabilities & TSNE
set.seed(123)
pbmc_10Xassays <- RunPCA(pbmc_10Xassays, 
                         assay.name = "joint.probability", 
                         dimred.name = "intPCA")

## Divisive ICP: selecting ICP tables multiple of 4

set.seed(123)
pbmc_10Xassays <- RunTSNE(pbmc_10Xassays, 
                          dimred.type = "intPCA", 
                          dimred.name = "intTSNE")

# Plot TSNE highlighting the batch & cell type
int.batch.plot <- PlotDimRed(object = pbmc_10Xassays, 
                             color.by = "batch", 
                             dimred = "intTSNE", 
                             point.size = 0.01, 
                             legend.nrow = 1, 
                             seed.color = 1024)
int.cell.plot <- PlotDimRed(object = pbmc_10Xassays, 
                            color.by = "cell_type", 
                            dimred = "intTSNE", 
                            point.size = 0.01, 
                            legend.nrow = 5, 
                            seed.color = 7)
cowplot::plot_grid(int.batch.plot, int.cell.plot, 
                   ncol = 2, align = "vh")

Clustering

Run graph-based clustering with the scran function clusterCells (see Chapter 5 Clustering - OSCA manual).

# Graph-based clustering on the integrated PCA w/ 'scran' package
blusparams <- bluster::SNNGraphParam(k = 15, cluster.fun = "louvain")
set.seed(123)
pbmc_10Xassays$cluster <- scran::clusterCells(pbmc_10Xassays, 
                                              use.dimred = "intPCA", 
                                              BLUSPARAM = blusparams)

# Plot clustering
clt.plot <- PlotDimRed(object = pbmc_10Xassays, 
                       color.by = "cluster", 
                       dimred = "intTSNE", 
                       point.size = 0.01, 
                       legend.nrow = 3, 
                       seed.color = 65)
cowplot::plot_grid(int.batch.plot, int.cell.plot, 
                   clt.plot, ncol = 3, align = "h")
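
Beyond the visual comparison, the clusters can be related to the cell type annotations with a contingency table, which shows when one annotated cell type is spread over several clusters. The sketch below uses toy label vectors standing in for pbmc_10Xassays$cluster and pbmc_10Xassays$cell_type; the toy values are assumptions made for illustration only.

```r
# Toy labels standing in for the cluster and cell_type columns
toy.cluster <- factor(c(1, 1, 1, 2, 2, 3, 3, 3, 3, 2))
toy.cell.type <- c("B", "B", "B", "CD8 T", "CD8 T",
                   "CD8 T", "CD8 T", "NK", "CD8 T", "CD8 T")

# Contingency table: rows are cell types, columns are clusters
ct.by.cluster <- table(cell_type = toy.cell.type, cluster = toy.cluster)

# Fraction of each cell type assigned to each cluster (rows sum to 1)
ct.fraction <- prop.table(ct.by.cluster, margin = 1)

# Cell types spread over more than one cluster
split.types <- rownames(ct.by.cluster)[rowSums(ct.by.cluster > 0) > 1]
```

With the real object, the same table could be built with table(pbmc_10Xassays$cell_type, pbmc_10Xassays$cluster).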

Cluster markers

Identify the cluster markers by running the Coralysis function FindAllClusterMarkers. Provide the clustering.label, which in this case is the label used above, i.e., cluster. The top three positive markers per cluster are then retrieved and plotted below using the Coralysis function HeatmapFeatures.

# Cluster markers 
cluster.markers <- FindAllClusterMarkers(object = pbmc_10Xassays, clustering.label = "cluster")

## -----------------------------------
## testing cluster 1
## 1128 features left after min.pct filtering
## 1128 features left after min.diff.pct filtering
## 215 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 2
## 1203 features left after min.pct filtering
## 1203 features left after min.diff.pct filtering
## 287 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 3
## 1167 features left after min.pct filtering
## 1167 features left after min.diff.pct filtering
## 427 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 4
## 1171 features left after min.pct filtering
## 1171 features left after min.diff.pct filtering
## 443 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 5
## 1194 features left after min.pct filtering
## 1194 features left after min.diff.pct filtering
## 283 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 6
## 1130 features left after min.pct filtering
## 1130 features left after min.diff.pct filtering
## 289 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 7
## 1199 features left after min.pct filtering
## 1199 features left after min.diff.pct filtering
## 189 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 8
## 1154 features left after min.pct filtering
## 1154 features left after min.diff.pct filtering
## 392 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 9
## 1239 features left after min.pct filtering
## 1239 features left after min.diff.pct filtering
## 363 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 10
## 1473 features left after min.pct filtering
## 1473 features left after min.diff.pct filtering
## 359 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 11
## 1138 features left after min.pct filtering
## 1138 features left after min.diff.pct filtering
## 280 features left after log2fc.threshold filtering
## -----------------------------------
## -----------------------------------
## testing cluster 12
## 1208 features left after min.pct filtering
## 1208 features left after min.diff.pct filtering
## 344 features left after log2fc.threshold filtering
## -----------------------------------

# Select the top 3 positive markers per cluster 
top3.markers <- lapply(X = split(x = cluster.markers, f = cluster.markers$cluster), FUN = function(x) {
    head(x[order(x$log2FC, decreasing = TRUE),], n = 3)
})
top3.markers <- do.call(rbind, top3.markers)
# Convert via character so the ordering is numeric even if `cluster` is a factor
top3.markers <- top3.markers[order(as.numeric(as.character(top3.markers$cluster))),]

# Heatmap of the top 3 positive markers per cluster
HeatmapFeatures(object = pbmc_10Xassays, 
                clustering.label = "cluster", 
                features = top3.markers$marker, 
                seed.color = 65)
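
The marker selection logic above can be wrapped in a small reusable function and checked on a toy table. The sketch below assumes a data frame with the columns cluster, marker and log2FC, as used in this vignette. The helper name top_markers and the explicit log2FC > 0 filter are assumptions added for illustration.

```r
# Hypothetical helper: top `n` positive markers per cluster, ranked by log2FC
top_markers <- function(markers, n = 3) {
    positive <- markers[markers$log2FC > 0, , drop = FALSE]
    per.cluster <- lapply(split(positive, positive$cluster), function(x) {
        head(x[order(x$log2FC, decreasing = TRUE), , drop = FALSE], n = n)
    })
    out <- do.call(rbind, per.cluster)
    rownames(out) <- NULL
    # Order clusters numerically, whether stored as character or factor
    out[order(as.numeric(as.character(out$cluster))), , drop = FALSE]
}

# Toy marker table with three clusters
toy.markers <- data.frame(
    cluster = rep(c("1", "2", "10"), each = 3),
    marker = paste0("gene", 1:9),
    log2FC = c(2.0, -1.5, 0.5, 3.0, 1.0, 2.5, -0.2, 4.0, 0.1)
)
top2 <- top_markers(toy.markers, n = 2)
```

Sorting by the numeric value of the cluster label keeps cluster 10 after cluster 2, which a plain alphabetical sort of the labels would not.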

DGE

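Coralysis was able to separate the CD8 effector T cells into two clusters: 6 and 11. The differential gene expression (DGE) analysis below suggests that cluster 11 is more cytotoxic and more similar to NK cells (expressing GZMH and GZMB) than cluster 6. In the output, cluster 6 is group 1 (clusters.1) and cluster 11 is group 2 (clusters.2), so a negative log2FC, as for NKG7, GZMH and GZMA, indicates higher expression in cluster 11.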

# DGE analysis: cluster 6 vs 11
dge.clt6vs11 <- FindClusterMarkers(pbmc_10Xassays, 
                                   clustering.label = "cluster", 
                                   clusters.1 = "6", 
                                   clusters.2 = "11")

## testing cluster group.1
## 997 features left after min.pct filtering
## 997 features left after min.diff.pct filtering
## 303 features left after log2fc.threshold filtering

head(dge.clt6vs11[order(abs(dge.clt6vs11$log2FC), decreasing = TRUE),])

##           p.value  adj.p.value    log2FC      pct.1     pct.2  diff.pct marker
## NKG7 3.395687e-65 6.791373e-62 -4.087289 0.11403509 1.0000000 0.8859649   NKG7
## CCL5 2.349768e-63 4.699536e-60 -3.838459 0.12573099 1.0000000 0.8742690   CCL5
## GZMH 9.265926e-86 1.853185e-82 -3.170614 0.01461988 1.0000000 0.9853801   GZMH
## CST7 1.986018e-69 3.972037e-66 -2.447930 0.04970760 0.9436620 0.8939544   CST7
## GZMA 3.278546e-66 6.557091e-63 -2.417989 0.05263158 0.9154930 0.8628614   GZMA
## LTB  2.631790e-33 5.263580e-30  2.325730 0.96491228 0.2253521 0.7395602    LTB

top6.degs <- head(dge.clt6vs11[order(abs(dge.clt6vs11$log2FC), 
                                     decreasing = TRUE),"marker"])
exp.plots <- lapply(X = top6.degs, FUN = function(x) {
    PlotExpression(object = pbmc_10Xassays, color.by = x,
                   scale.values = TRUE, point.size = 0.5, point.stroke = 0.5)
})
cowplot::plot_grid(plotlist = exp.plots, align = "vh", ncol = 3)
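
To make the direction of change in the DGE table explicit, the sketch below rebuilds the six rows printed above as a plain data frame and splits the markers by the sign of log2FC. It assumes that a positive log2FC means higher expression in clusters.1 (cluster 6), which is consistent with the pct.1 and pct.2 columns. It also checks that diff.pct equals the absolute difference between pct.1 and pct.2 for these rows.

```r
# Rows copied from the DGE output above (cluster 6 = group 1, cluster 11 = group 2)
dge.top <- data.frame(
    marker = c("NKG7", "CCL5", "GZMH", "CST7", "GZMA", "LTB"),
    log2FC = c(-4.087289, -3.838459, -3.170614, -2.447930, -2.417989, 2.325730),
    pct.1 = c(0.11403509, 0.12573099, 0.01461988, 0.04970760, 0.05263158, 0.96491228),
    pct.2 = c(1.0000000, 1.0000000, 1.0000000, 0.9436620, 0.9154930, 0.2253521),
    diff.pct = c(0.8859649, 0.8742690, 0.9853801, 0.8939544, 0.8628614, 0.7395602)
)

# diff.pct matches the absolute difference between pct.1 and pct.2
diff.ok <- all(abs(abs(dge.top$pct.1 - dge.top$pct.2) - dge.top$diff.pct) < 1e-6)

# Split markers by direction of change
direction <- ifelse(dge.top$log2FC > 0, "higher_in_cluster6", "higher_in_cluster11")
markers.by.direction <- split(dge.top$marker, direction)
```

Among these six genes, only LTB is higher in cluster 6; the remaining five are higher in cluster 11.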

R session

# R session
sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
##  [3] Biobase_2.66.0              GenomicRanges_1.58.0       
##  [5] GenomeInfoDb_1.42.3         IRanges_2.40.1             
##  [7] S4Vectors_0.44.0            BiocGenerics_0.52.0        
##  [9] MatrixGenerics_1.18.1       matrixStats_1.5.0          
## [11] Coralysis_1.0.0             ggplot2_3.5.1              
## 
## loaded via a namespace (and not attached):
##   [1] rlang_1.1.5              magrittr_2.0.3           flexclust_1.4-2         
##   [4] compiler_4.4.2           systemfonts_1.2.1        vctrs_0.6.5             
##   [7] reshape2_1.4.4           stringr_1.5.1            pkgconfig_2.0.3         
##  [10] crayon_1.5.3             fastmap_1.2.0            XVector_0.46.0          
##  [13] labeling_0.4.3           scuttle_1.16.0           rmarkdown_2.29          
##  [16] ggbeeswarm_0.7.2         UCSC.utils_1.2.0         ragg_1.3.3              
##  [19] xfun_0.50                modeltools_0.2-23        bluster_1.16.0          
##  [22] zlibbioc_1.52.0          cachem_1.1.0             beachmat_2.22.0         
##  [25] jsonlite_1.8.9           DelayedArray_0.32.0      BiocParallel_1.40.0     
##  [28] irlba_2.3.5.1            parallel_4.4.2           aricode_1.0.3           
##  [31] cluster_2.1.6            R6_2.6.0                 bslib_0.9.0             
##  [34] stringi_1.8.4            RColorBrewer_1.1-3       limma_3.62.2            
##  [37] jquerylib_0.1.4          Rcpp_1.0.14              iterators_1.0.14        
##  [40] knitr_1.49               snow_0.4-4               Matrix_1.7-1            
##  [43] igraph_2.1.4             tidyselect_1.2.1         abind_1.4-8             
##  [46] yaml_2.3.10              codetools_0.2-20         doRNG_1.8.6.1           
##  [49] lattice_0.22-6           tibble_3.2.1             plyr_1.8.9              
##  [52] withr_3.0.2              ggrastr_1.0.2            Rtsne_0.17              
##  [55] evaluate_1.0.3           desc_1.4.3               pillar_1.10.1           
##  [58] rngtools_1.5.2           foreach_1.5.2            generics_0.1.3          
##  [61] sparseMatrixStats_1.18.0 munsell_0.5.1            scales_1.3.0            
##  [64] class_7.3-22             glue_1.8.0               metapod_1.14.0          
##  [67] pheatmap_1.0.12          LiblineaR_2.10-24        tools_4.4.2             
##  [70] BiocNeighbors_2.0.1      ScaledMatrix_1.14.0      SparseM_1.84-2          
##  [73] RSpectra_0.16-2          locfit_1.5-9.11          RANN_2.6.2              
##  [76] fs_1.6.5                 scran_1.34.0             Cairo_1.6-2             
##  [79] cowplot_1.1.3            grid_4.4.2               edgeR_4.4.2             
##  [82] colorspace_2.1-1         GenomeInfoDbData_1.2.13  beeswarm_0.4.0          
##  [85] BiocSingular_1.22.0      vipor_0.4.7              cli_3.6.3               
##  [88] rsvd_1.0.5               textshaping_1.0.0        viridisLite_0.4.2       
##  [91] S4Arrays_1.6.0           dplyr_1.1.4              doSNOW_1.0.20           
##  [94] gtable_0.3.6             sass_0.4.9               digest_0.6.37           
##  [97] SparseArray_1.6.1        dqrng_0.4.1              farver_2.1.2            
## [100] htmltools_0.5.8.1        pkgdown_2.1.1            lifecycle_1.0.4         
## [103] httr_1.4.7               statmod_1.5.0

References

Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.

Lun ATL, McCarthy DJ, Marioni JC (2016). “A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor.” F1000Res., 5, 2122. doi:10.12688/f1000research.9501.2.

Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023.

Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.

Compiled: 13/02/2025