Reference-mapping

Dataset

To illustrate the Coralysis reference-mapping method, we will use in this vignette two 10X PBMCs (Peripheral Blood Mononuclear Cells) 3’ assays: V1 and V2. The assay V2 will be the reference and V1 the query, i.e., the dataset that will be mapped against the reference. The datasets have been downloaded from 10X website. The PBMC dataset V1 corresponds to sample pbmc6k and V2 to pbmc8k:

V1: pbmc6k
V2: pbmc8k

Cells were annotated using the annotations provided by Korsunsky et al., 2019 (Source Data Figure 4 file). The overall data was downsampled to 2K cells (1K per assay) and 2K highly variable genes selected with scran R package. To facilitate the reproduction of this vignette, the data is distributed through Zenodo as a SingleCellExperiment object, the object (class) required by most functions in Coralysis (see Chapter 4 The SingleCellExperiment class - OSCA manual). The SCE object provided comprises counts (raw count data), logcounts (log-normalized data) and cell colData (which includes batch and cell labels, designated as batch and cell_type, respectively).

Run the code below to import the R packages and data required to reproduce this vignette.

# Packages
library("ggplot2")
library("Coralysis")
library("SingleCellExperiment")

# Import data from Zenodo
data.url <- "https://zenodo.org/records/14845751/files/pbmc_10Xassays.rds?download=1"
pbmc_10Xassays <- readRDS(file = url(data.url))

# Split the SCE object by assay 
ref <- pbmc_10Xassays[,pbmc_10Xassays$batch=="V2"] # let V2 assay batch be the reference data set
query <- pbmc_10Xassays[,pbmc_10Xassays$batch=="V1"] # let V1 be the query (unknown annotations)

Train reference

The first step in performing reference mapping with Coralysis is training a dataset that is representative of the biological system under study. In this example, ref and query correspond to the SingleCellExperiment objects of the reference and query PBMC samples, V2 and V1 3’ assays, respectively. The reference ref is trained through RunParallelDivisiveICP function without providing a batch.label. In case the reference requires integration, a batch.label should be provided. An higher number of threads can be provided to speed up computing time depending on the number of cores available. For this example, the algorithm was run 10 times (L = 10), but generally, this number should be higher (with the default being L = 50). Next, run the function RunPCA to obtain the main result required for cell type prediction later on. In addition to cell type prediction, the query dataset(s) can be projected onto UMAP. To allow this, the argument return.model should be set to TRUE in both functions RunPCA and RunUMAP.

# Train the reference 
set.seed(123)
ref <- RunParallelDivisiveICP(object = ref, L = 10, threads = 2) # runs without 'batch.label'

## WARNING: Setting 'divisive.method' to 'cluster' as 'batch.label=NULL'. 
## If 'batch.label=NULL', 'divisive.method' can be one of: 'cluster', 'random'.

## 
## Building training set...

## Training set successfully built.

## 
## Computing cluster seed.

## 
## Initializing divisive ICP clustering...

##   |                                                                              |                                                                      |   0%  |                                                                              |========                                                              |  11%  |                                                                              |================                                                      |  22%  |                                                                              |=======================                                               |  33%  |                                                                              |===============================                                       |  44%  |                                                                              |=======================================                               |  56%  |                                                                              |===============================================                       |  67%  |                                                                              |======================================================                |  78%  |                                                                              |==============================================================        |  89%  |                                                                              |======================================================================| 100%

## 
## Divisive ICP clustering completed successfully.

## 
## Predicting cell cluster probabilities using ICP models...

## Prediction of cell cluster probabilities completed successfully.

## 
## Multi-level integration completed successfully.

# Compute reference PCA
ref <- RunPCA(ref, return.model = TRUE, pca.method = "stats")

## Divisive ICP: selecting ICP tables multiple of 4

# Compute reference UMAP
set.seed(123)
ref <- RunUMAP(ref, return.model = TRUE)

Below is the UMAP plot for the reference sample with cell type annotations. In this example, we are using the annotations provided in the object. Ideally, the sample should be annotated after training by performing clustering and manual cluster annotation. Only the resulting manual cell type annotations should be used for prediction. For simplicity, we will use the annotations provided in the object.

# Vizualize reference
ref.celltype.plot <- PlotDimRed(object = ref, 
                                color.by = "cell_type", 
                                dimred = "UMAP", 
                                point.size = 0.01, 
                                legend.nrow = 6, 
                                seed.color = 7) + 
    ggtitle("reference (ground-truth)")
ref.celltype.plot

Map query

Perform reference-mapping with Coralysis by running the function ReferenceMapping. This requires to provide the trained reference (ref) with cell type annotations intended for the prediction (ref.label = "cell_type") and the query dataset (query). The label in the reference aimed to be used for prediction needs to be available on colData(ref). In this case, we are providing the cell type labels from the column cell_type available in the reference ref. Since we want to project the query onto the reference UMAP, set project.umap as TRUE. The argument dimred.name.prefix just sets the name given as prefix of the low dimensional embeddings stored in reducedDimNames(map). The SingleCellExperiment object map will contain the same information as query, with the predictions and embeddings mapped onto the reference. The predictions consist in coral_labels and coral_probability stored in colData(map). The coral_labels correspond to the cell type predictions obtained against the reference. The coral_probability represents the proportion of K neighbors from the winning class (k.nn equal 10 by default); the higher the value, the better.

## Reference-mapping
set.seed(1024)
map <- ReferenceMapping(ref = ref, query = query, ref.label = "cell_type", 
                        project.umap = TRUE, dimred.name.prefix = "ref")

Prediction accuracy

The accuracy of Coralysis reference-mapping method is presented below together with a confusion matrix between the predicted (rows) versus ground-truth cell type labels (columns).

Confusion matrix

# Confusion matrix
preds_x_truth <- table(map$coral_labels, map$cell_type)
stopifnot(all(row.names(preds_x_truth)==colnames(preds_x_truth)))

# Accuracy
acc <- sum(diag(preds_x_truth)) / sum(preds_x_truth)
#print(paste0("Prediction accuracy: ", acc*100, "%"))

# Print confusion matrix
preds_x_truth

##                 
##                  aDC B mem B naive CD4 mem CD4 naive CD8 eff CD8 T HSC
##   aDC             14     0       0       0         0       0     0   0
##   B mem            0    40      15       0         0       0     0   3
##   B naive          0     4      69       0         0       0     0   0
##   CD4 mem          0     0       0     108        39       3     6   0
##   CD4 naive        0     0       0      14       153       0    12   2
##   CD8 eff          0     1       0       2         0      93     2   0
##   CD8 T            0     0       1       3         4      11    62   0
##   HSC              0     0       0       0         0       0     0   0
##   Megakaryocyte    0     0       0       0         0       0     0   0
##   Monocyte         1     0       0       0         0       0     0   0
##   CD16+ monocyte   0     0       0       0         0       0     0   0
##   NK               0     0       0       0         0       1     0   0
##   pDC              0     0       0       0         0       0     0   0
##   Treg             0     0       0       1         0       0     0   0
##                 
##                  Megakaryocyte Monocyte CD16+ monocyte  NK pDC Treg
##   aDC                        0        0              0   0   0    0
##   B mem                      0        0              0   0   0    0
##   B naive                    0        0              0   0   1    0
##   CD4 mem                    0        0              0   0   0    8
##   CD4 naive                  0        0              0   0   0    7
##   CD8 eff                    1        0              0   1   0    0
##   CD8 T                      1        0              0   0   0    0
##   HSC                        0        0              0   0   0    0
##   Megakaryocyte              0        0              0   0   0    0
##   Monocyte                   3      173              4   1   0    0
##   CD16+ monocyte             1        5             72   0   0    0
##   NK                         0        0              0  55   0    0
##   pDC                        0        0              0   0   3    0
##   Treg                       0        0              0   0   0    0

The accuracy of Coralysis reference-mapping method was 84.2%.

DimRed

Visualize below the query cells projected onto reference UMAP and how well the predictions match the query ground-truth. The coral_probability is a prediction confidence score. Predictions with low scores (<0.5) should be carefully inspected.

# Plot query and reference UMAP side-by-side 
#with ground-truth & predicted cell labels
use.color <- c("aDC" = "#E6F5C9", 
               "B mem" = "#CCEBC5", 
               "B naive" = "#FB8072", 
               "CD4 mem" = "#A6761D", 
               "CD4 naive" = "#666666", 
               "CD8 eff" = "#80B1D3",
               "CD8 T" = "#CBD5E8", 
               "HSC" = "#E31A1C", 
               "Megakaryocyte" = "#377EB8", 
               "Monocyte" = "#FCCDE5", 
               "CD16+ monocyte" = "#A6D854", 
               "NK" = "#6A3D9A",
               "pDC" = "#E7298A", 
               "Treg" = "#FFFF33")
query.ground_truth.plot <- PlotDimRed(object = map, 
                                      color.by = "cell_type", 
                                      dimred = "refUMAP", 
                                      point.size = 0.01, 
                                      legend.nrow = 6, 
                                      seed.color = 7) + 
    ggtitle("query (ground-truth)")
query.predicted.plot <-PlotDimRed(object = map, 
                                  color.by = "coral_labels", 
                                  dimred = "refUMAP", point.size = 0.01, 
                                  legend.nrow = 6, 
                                  use.color = use.color) + 
    ggtitle("query (predicted)")
query.confidence.plot <- PlotExpression(object = map, 
                                        color.by = "coral_probability", 
                                        dimred = "refUMAP", 
                                        point.size = 0.01, 
                                        color.scale = "viridis") + 
    ggtitle("query (confidence)")
cowplot::plot_grid(ref.celltype.plot, query.ground_truth.plot, 
                   query.predicted.plot, query.confidence.plot, 
                   ncol = 2, align = "vh")

R session

# R session
sessionInfo()

## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
##  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
##  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
## [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
## 
## time zone: UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] SingleCellExperiment_1.28.1 SummarizedExperiment_1.36.0
##  [3] Biobase_2.66.0              GenomicRanges_1.58.0       
##  [5] GenomeInfoDb_1.42.3         IRanges_2.40.1             
##  [7] S4Vectors_0.44.0            BiocGenerics_0.52.0        
##  [9] MatrixGenerics_1.18.1       matrixStats_1.5.0          
## [11] Coralysis_1.0.0             ggplot2_3.5.1              
## 
## loaded via a namespace (and not attached):
##   [1] rlang_1.1.5              magrittr_2.0.3           flexclust_1.4-2         
##   [4] compiler_4.4.2           png_0.1-8                systemfonts_1.2.1       
##   [7] vctrs_0.6.5              reshape2_1.4.4           stringr_1.5.1           
##  [10] pkgconfig_2.0.3          crayon_1.5.3             fastmap_1.2.0           
##  [13] XVector_0.46.0           labeling_0.4.3           scuttle_1.16.0          
##  [16] rmarkdown_2.29           ggbeeswarm_0.7.2         UCSC.utils_1.2.0        
##  [19] ragg_1.3.3               xfun_0.50                modeltools_0.2-23       
##  [22] bluster_1.16.0           zlibbioc_1.52.0          cachem_1.1.0            
##  [25] beachmat_2.22.0          jsonlite_1.8.9           DelayedArray_0.32.0     
##  [28] BiocParallel_1.40.0      irlba_2.3.5.1            parallel_4.4.2          
##  [31] aricode_1.0.3            cluster_2.1.6            R6_2.6.0                
##  [34] bslib_0.9.0              stringi_1.8.4            RColorBrewer_1.1-3      
##  [37] reticulate_1.40.0        limma_3.62.2             jquerylib_0.1.4         
##  [40] Rcpp_1.0.14              iterators_1.0.14         knitr_1.49              
##  [43] snow_0.4-4               Matrix_1.7-1             igraph_2.1.4            
##  [46] tidyselect_1.2.1         abind_1.4-8              yaml_2.3.10             
##  [49] codetools_0.2-20         doRNG_1.8.6.1            lattice_0.22-6          
##  [52] tibble_3.2.1             plyr_1.8.9               withr_3.0.2             
##  [55] askpass_1.2.1            ggrastr_1.0.2            evaluate_1.0.3          
##  [58] desc_1.4.3               pillar_1.10.1            rngtools_1.5.2          
##  [61] foreach_1.5.2            generics_0.1.3           sparseMatrixStats_1.18.0
##  [64] munsell_0.5.1            scales_1.3.0             class_7.3-22            
##  [67] glue_1.8.0               metapod_1.14.0           pheatmap_1.0.12         
##  [70] LiblineaR_2.10-24        tools_4.4.2              BiocNeighbors_2.0.1     
##  [73] ScaledMatrix_1.14.0      SparseM_1.84-2           RSpectra_0.16-2         
##  [76] locfit_1.5-9.11          RANN_2.6.2               fs_1.6.5                
##  [79] scran_1.34.0             Cairo_1.6-2              cowplot_1.1.3           
##  [82] grid_4.4.2               umap_0.2.10.0            edgeR_4.4.2             
##  [85] colorspace_2.1-1         GenomeInfoDbData_1.2.13  beeswarm_0.4.0          
##  [88] BiocSingular_1.22.0      vipor_0.4.7              cli_3.6.3               
##  [91] rsvd_1.0.5               textshaping_1.0.0        viridisLite_0.4.2       
##  [94] S4Arrays_1.6.0           dplyr_1.1.4              doSNOW_1.0.20           
##  [97] gtable_0.3.6             sass_0.4.9               digest_0.6.37           
## [100] SparseArray_1.6.1        dqrng_0.4.1              farver_2.1.2            
## [103] htmltools_0.5.8.1        pkgdown_2.1.1            lifecycle_1.0.4         
## [106] httr_1.4.7               statmod_1.5.0            openssl_2.3.2

References

Amezquita R, Lun A, Becht E, Carey V, Carpp L, Geistlinger L, Marini F, Rue-Albrecht K, Risso D, Soneson C, Waldron L, Pages H, Smith M, Huber W, Morgan M, Gottardo R, Hicks S (2020). “Orchestrating single-cell analysis with Bioconductor.” Nature Methods, 17, 137-145. https://www.nature.com/articles/s41592-019-0654-x.

Sousa A, Smolander J, Junttila S, Elo L (2025). “Coralysis enables sensitive identification of imbalanced cell types and states in single-cell data via multi-level integration.” bioRxiv. doi:10.1101/2025.02.07.637023

Wickham H (2016). “ggplot2: Elegant Graphics for Data Analysis.” Springer-Verlag New York.

Compiled: 13/02/2025