This single-cell RNA-seq integration task describes an example of cell lines integration: Jurkat versus HEK293T versus a 50:50 mixture of Jurkat:HEK293T cell lines. The data sets were downloaded from 10X genomics website (check the R script 01_create_datasets.R
):
Jurkat (jurkat
): 3,257 cells
HEK293T (t293
): 2,854 cells
50:50 Jurkat:HEK293T (jurkat_t293_50:50
): 2,553 cells
The identity of the data set - jurkat
or t293
or jurkat_t293_50:50
- for every cell was saved in the Seurat
meta.data
column variable batch
. The ground-truth cell identities were also provided in the column variable cell_type
but avoid checking them until the end of this notebook to make these analyses more interesting.
The analyses performed in this notebook rely in the Seurat
R package (v.5.1.0).
Import the main packages used in this notebook: Seurat
(v.5.1.0), SeuratWrappers
(v.0.3.2 - integration wrappers for Seurat), dplyr
(v.1.1.4 - wrangling data), patchwork
(v.1.2.0 - visualization), scIntegrationMetrics
(v 1.1 - compute LISI integration metrics).
## Import packages
library("dplyr") # data wrangling
library("Seurat") # scRNA-seq analysis
library("patchwork") # viz
library("SeuratWrappers") # integration wrappers
library("scIntegrationMetrics") # compute LISI integration metrics
Create output directories to save intermediate results, figures, tables and R objects.
## Output directories
res.dir <- file.path("../results", "cell_lines_task", c("plots", "tables", "objects"))
for (folder in res.dir) if (!dir.exists(folder)) dir.create(path = folder, recursive = TRUE)
(5 min)
AIM: Import and explore the Seurat object data.
Import the cell lines Seurat
R object jurkat.rds
located in the folder data
.
# Import data
data.dir <- "../data"
seu <- readRDS(file = file.path(data.dir, "jurkat.rds"))
You may want to down sample this data set depending on the amount of RAM memory you have. The jurkat
data set has 8,664 cells.
## Downsample data set
downsample <- TRUE # replace to FALSE in case you don't want to down sample
prop.down <- 0.4 # proportion of cells to down sample per batch: 40% of the cells
if (downsample) {
no.cells.batch <- ceiling(table(seu$batch) * 0.4) # CTRL = 1310 and STIM = 1491
cell.idx.batch <- split(x = colnames(seu), f = seu$batch) # split into a list the cell names per batch
cell.idx.batch.down <- lapply(X = setNames(names(cell.idx.batch), names(cell.idx.batch)), FUN = function(x) {
set.seed(123)
sample(x = cell.idx.batch[[x]], size = no.cells.batch[[x]], replace = FALSE)
}) # downsample each batch cell names
cell.idx.downsample <- do.call(c, cell.idx.batch.down) # join cell name labels from the two batches into one character vector
seu <- subset(seu, cells = cell.idx.downsample)
}
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3456689 184.7 5494345 293.5 5494345 293.5
## Vcells 23302446 177.8 82194859 627.1 85002278 648.6
cat("No. of cells downsampled for `jurkat` was:", table(seu$batch)[1], "\n")
## No. of cells downsampled for `jurkat` was: 1303
cat("No. of cells downsampled for `jurkat_t293_50:50` was:", table(seu$batch)[2], "\n")
## No. of cells downsampled for `jurkat_t293_50:50` was: 1022
cat("No. of cells downsampled for `t293` was:", table(seu$batch)[3], "\n")
## No. of cells downsampled for `t293` was: 1142
Explore quickly the Seurat
seu
object.
## Explore Seurat object
# Print Seurat object
seu
## An object of class Seurat
## 32738 features across 3467 samples within 1 assay
## Active assay: RNA (32738 features, 0 variable features)
## 1 layer present: counts
# Structure
str(seu)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay5' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ layers :List of 1
## .. .. .. .. ..$ counts:Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. .. ..@ i : int [1:11492361] 40 57 67 68 70 78 81 94 105 113 ...
## .. .. .. .. .. .. ..@ p : int [1:3468] 0 4315 6707 10427 13868 17802 21103 23835 28482 31180 ...
## .. .. .. .. .. .. ..@ Dim : int [1:2] 32738 3467
## .. .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. .. ..$ : NULL
## .. .. .. .. .. .. ..@ x : num [1:11492361] 11 1 4 1 8 1 4 2 5 2 ...
## .. .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ cells :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:3467, 1] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:3467] "t293_AAACATACACTGGT-1" "t293_AAACATACAGACTC-1" "t293_AAACATTGAGGCGA-1" "t293_AAACCGTGTCCTAT-1" ...
## .. .. .. .. .. .. .. ..$ : chr "counts"
## .. .. .. .. .. ..$ dim : int [1:2] 3467 1
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:3467] "t293_AAACATACACTGGT-1" "t293_AAACATACAGACTC-1" "t293_AAACATTGAGGCGA-1" "t293_AAACCGTGTCCTAT-1" ...
## .. .. .. .. .. .. ..$ : chr "counts"
## .. .. .. ..@ features :Formal class 'LogMap' [package "SeuratObject"] with 1 slot
## .. .. .. .. .. ..@ .Data: logi [1:32738, 1] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. .. .. .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. .. .. .. ..$ : chr [1:32738] "MIR1302-10" "FAM138A" "OR4F5" "RP11-34P13.7" ...
## .. .. .. .. .. .. .. ..$ : chr "counts"
## .. .. .. .. .. ..$ dim : int [1:2] 32738 1
## .. .. .. .. .. ..$ dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:32738] "MIR1302-10" "FAM138A" "OR4F5" "RP11-34P13.7" ...
## .. .. .. .. .. .. ..$ : chr "counts"
## .. .. .. ..@ default : int 1
## .. .. .. ..@ assay.orig: chr(0)
## .. .. .. ..@ meta.data :'data.frame': 32738 obs. of 0 variables
## .. .. .. ..@ misc : list()
## .. .. .. ..@ key : chr "rna_"
## ..@ meta.data :'data.frame': 3467 obs. of 7 variables:
## .. ..$ orig.ident : chr [1:3467] "SeuratProject" "SeuratProject" "SeuratProject" "SeuratProject" ...
## .. ..$ nCount_RNA : num [1:3467] 19337 8786 16030 15967 21930 ...
## .. ..$ nFeature_RNA : int [1:3467] 4315 2392 3720 3441 3934 3301 2732 4647 2698 3078 ...
## .. ..$ batch : chr [1:3467] "t293" "t293" "t293" "t293" ...
## .. ..$ jurkat.pos.CD3D: logi [1:3467] FALSE FALSE FALSE FALSE FALSE FALSE ...
## .. ..$ t293.pos.XIST : logi [1:3467] TRUE TRUE TRUE TRUE TRUE TRUE ...
## .. ..$ cell_type : chr [1:3467] "293T" "293T" "293T" "293T" ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 1 level "SeuratProject": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:3467] "t293_AAACATACACTGGT-1" "t293_AAACATACAGACTC-1" "t293_AAACATTGAGGCGA-1" "t293_AAACCGTGTCCTAT-1" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "jurkat"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 5 0 2
## ..@ commands : list()
## ..@ tools : list()
# Check meta.data
head(seu@meta.data)
## orig.ident nCount_RNA nFeature_RNA batch
## t293_AAACATACACTGGT-1 SeuratProject 19337 4315 t293
## t293_AAACATACAGACTC-1 SeuratProject 8786 2392 t293
## t293_AAACATTGAGGCGA-1 SeuratProject 16030 3720 t293
## t293_AAACCGTGTCCTAT-1 SeuratProject 15967 3441 t293
## t293_AAACGCTGAGGCGA-1 SeuratProject 21930 3934 t293
## t293_AAACGCTGGTTGAC-1 SeuratProject 14128 3301 t293
## jurkat.pos.CD3D t293.pos.XIST cell_type
## t293_AAACATACACTGGT-1 FALSE TRUE 293T
## t293_AAACATACAGACTC-1 FALSE TRUE 293T
## t293_AAACATTGAGGCGA-1 FALSE TRUE 293T
## t293_AAACCGTGTCCTAT-1 FALSE TRUE 293T
## t293_AAACGCTGAGGCGA-1 FALSE TRUE 293T
## t293_AAACGCTGGTTGAC-1 FALSE TRUE 293T
# Check how many cells per data set
table(seu$batch)
##
## jurkat jurkat_t293_50:50 t293
## 1303 1022 1142
# Check no. of genes
nrow(seu)
## [1] 32738
# Check no. of cells
ncol(seu)
## [1] 3467
(7 min)
AIM: See how much the two data sets overlap each other in the low dimensional reductions.
Run the standard Seurat
upstream workflow to jointly compute a PCA and UMAP for the datasets:
NormalizeData()
: log1p-normalization with a scaling factor of 10K
FindVariableFeatures()
: identification of 2K HVG
ScaleData()
: standardization of the 2K HVG
RunPCA()
: computation of a PCA with the 2K HVG standardized
RunUMAP()
: computation of a UMAP using the first dims
of the previously computed PCA
## Joint analysis
# Standard Seurat upstream workflow
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu)
seu <- ScaleData(seu)
seu <- RunPCA(seu)
seu <- RunUMAP(seu, dims = 1:30, reduction = "pca", reduction.name = "umap.unintegrated")
Plot the PCA and UMAP side-by-side below.
## Plot jointly dimreds
pca.unint <- DimPlot(seu, reduction = "pca", group.by = "batch")
umap.unint <- DimPlot(seu, reduction = "umap.unintegrated", group.by = "batch")
pca.unint + umap.unint
(5 min)
AIM: Check if cells from different datasets share well-known cell-specific markers.
Plot below cell lines-specific markers: CD3D
for jurkat and XIST
for HEK193T.
## Joint celltype markers
# List of jurkat and T293 cell lines
markers.plot <- list(
"jurkat" = "CD3D",
"t293" = "XIST"
)
# Plot
jurkat.markers.unint.plot <- FeaturePlot(seu, features = markers.plot$jurkat, split.by = "batch",
max.cutoff = 3, cols = c("grey", "red"),
reduction = "umap.unintegrated", ncol = 4,
pt.size = 0.5)
t293.markers.unint.plot <- FeaturePlot(seu, features = markers.plot$t293, split.by = "batch",
max.cutoff = 3, cols = c("grey", "red"),
reduction = "umap.unintegrated", ncol = 4,
pt.size = 0.5)
## Plot jointly celltype markers
# Print
jurkat.markers.unint.plot
t293.markers.unint.plot
(10 min)
AIM: Compare different integration methods.
First, split
the layers of data by batch before performing integration. Then, apply the standard Seurat
workflow. Finally, call the function IntegrateLayers()
to integrate the datasets. In this function you can specify the method you want to run by providing the integration method function.
Seurat
provides three methods: CCA (CCAIntegration
), RPCA (RPCAIntegration
) and Harmony (HarmonyIntegration
). In addition, other methods can be called by using functions from SeuratWrappers
such as: FastMNN (FastMNNIntegration
) or scVI (scVIIntegration
) among others. Harmony (from the harmony
R package), FastMNN (from the batchelor
R package) and scVI (python package installed with conda) need to be installed independently from Seurat
.
Run the R chunk code below to run the integration methods: CCA, RPCA, Harmony and FastMNN (you can try to run scVI if you’ve it installed in your system). Join the layers back after integration to project the integrated data onto UMAP. The UMAP highlights the batch
and ground-truth cell_type
labels.
## Perform integration
# Split layers for integration
seu[["RNA"]] <- split(x = seu[["RNA"]], f = seu$batch)
# Standard workflow
seu <- NormalizeData(seu)
seu <- FindVariableFeatures(seu)
seu <- ScaleData(seu)
seu <- RunPCA(seu)
# Integrate layers
int.methods <- c("CCA" = "CCAIntegration", "RPCA" = "RPCAIntegration",
"Harmony" = "HarmonyIntegration", "FastMNN" = "FastMNNIntegration",
"scVI" = "scVIIntegration")
for (m in names(int.methods)[1:4]) {
cat("\nRunning integration method", m, "...\n")
int.dimred <- paste0("integrated.", m)
umap.dimred <- paste0("umap.", m)
# Integration
if (m=="scVI") {
seu <- IntegrateLayers(object = seu, method = get(eval(substitute(int.methods[m]))),
orig.reduction = "pca",
new.reduction = int.dimred,
conda_env = "~/miniconda3/envs/scvi-env", # substitute this by your installation
verbose = TRUE)
} else {
seu <- IntegrateLayers(object = seu, method = get(eval(substitute(int.methods[m]))),
orig.reduction = "pca",
new.reduction = int.dimred,
verbose = TRUE)
}
}
##
## Running integration method CCA ...
##
## Running integration method RPCA ...
##
## Running integration method Harmony ...
##
## Running integration method FastMNN ...
# Re-join layers after integration
seu[["RNA"]] <- JoinLayers(seu[["RNA"]])
# Run UMAP for every integration method
int.umaps.plots <- list()
for (m in names(int.methods)[1:4]) {
cat("\nRunning UMAP for", m, "integrated result...\n")
int.dimred <- paste0("integrated.", m)
umap.dimred <- paste0("umap.", m)
seu <- RunUMAP(seu, dims = 1:30, reduction = int.dimred, reduction.name = umap.dimred)
int.umaps.plots[[m]] <- DimPlot(object = seu, reduction = umap.dimred, group.by = c("batch", "cell_type"),
combine = FALSE, label.size = 2)
}
##
## Running UMAP for CCA integrated result...
##
## Running UMAP for RPCA integrated result...
##
## Running UMAP for Harmony integrated result...
##
## Running UMAP for FastMNN integrated result...
# Save Seurat object
saveRDS(object = seu, file = file.path(res.dir[3], "seu_integrated.rds"))
(15 min)
AIM: Assess integration qualitatively and quantitatively through dimensional reduction visualizations and LISI scores.
Plot the integrated embeddings below highlighting the batch
and ground-truth cell_type
labels.
## Assess integration by printing the plots using the "batch" and "cell_type" (ground-truth) labels
wrap_plots(c(int.umaps.plots$CCA, int.umaps.plots$RPCA, int.umaps.plots$Harmony, int.umaps.plots$FastMNN),
ncol = 2, byrow = TRUE)
Run the code below to compute the i/cLISI scores for every integrated embedding with the function getIntegrationMetrics()
from the package scIntegrationMetrics
(read more about the meaning of these metrics here).
## Assess quantitatively integration with scIntegrationMetrics
# Calculate metrics
int.mthds.names <- paste0("integrated.", names(int.methods)[1:4])
names(int.mthds.names) <- int.mthds.names
metrics <- list()
for (m in int.mthds.names) {
key <- gsub("integrated.", "", m)
cat("Computing i/cLISI metrics for integration method:", gsub("integrated.", "", key), "\n")
metrics[[key]] <- getIntegrationMetrics(seu, meta.label = "cell_type", meta.batch = "batch",
method.reduction = m, metrics = c("iLISI", "norm_iLISI",
#"CiLISI", "CiLISI_means",
"norm_cLISI", "norm_cLISI_means"))
}
## Computing i/cLISI metrics for integration method: CCA
## Computing i/cLISI metrics for integration method: RPCA
## Computing i/cLISI metrics for integration method: Harmony
## Computing i/cLISI metrics for integration method: FastMNN
# Join metrics
metrics <- as.data.frame(do.call(cbind, metrics))
Print the result below.
# Print table
knitr::kable(metrics)
CCA | RPCA | Harmony | FastMNN | |
---|---|---|---|---|
iLISI | 2.220135 | 1.941316 | 1.601459 | 1.936321 |
norm_iLISI | 0.6100677 | 0.4706582 | 0.3007295 | 0.4681604 |
norm_cLISI | 0.1250434 | 0.01725079 | 0.9135949 | 0.02656604 |
norm_cLISI_means | 0.1271145 | 0.01720213 | 0.9069922 | 0.02562172 |
## R packages and versions used in these analyses
sessionInfo()
## R version 4.1.0 (2021-05-18)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.5 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] scIntegrationMetrics_1.1 SeuratWrappers_0.3.2 patchwork_1.2.0
## [4] Seurat_5.1.0 SeuratObject_5.0.2 sp_2.1-4
## [7] dplyr_1.1.4
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.4 spatstat.explore_3.2-7
## [3] reticulate_1.38.0 R.utils_2.12.3
## [5] tidyselect_1.2.1 htmlwidgets_1.6.4
## [7] grid_4.1.0 BiocParallel_1.28.3
## [9] Rtsne_0.17 munsell_0.5.1
## [11] ScaledMatrix_1.2.0 codetools_0.2-18
## [13] ica_1.0-3 future_1.33.2
## [15] miniUI_0.1.1.1 withr_3.0.0
## [17] batchelor_1.10.0 spatstat.random_3.2-3
## [19] colorspace_2.1-0 progressr_0.14.0
## [21] Biobase_2.54.0 highr_0.11
## [23] knitr_1.47 rstudioapi_0.13
## [25] stats4_4.1.0 SingleCellExperiment_1.16.0
## [27] ROCR_1.0-11 tensor_1.5
## [29] listenv_0.9.1 MatrixGenerics_1.6.0
## [31] labeling_0.4.3 harmony_1.2.0
## [33] GenomeInfoDbData_1.2.7 polyclip_1.10-6
## [35] farver_2.1.2 parallelly_1.37.1
## [37] vctrs_0.6.5 generics_0.1.3
## [39] xfun_0.45 R6_2.5.1
## [41] GenomeInfoDb_1.30.1 rsvd_1.0.5
## [43] bitops_1.0-7 spatstat.utils_3.0-5
## [45] cachem_1.1.0 DelayedArray_0.20.0
## [47] assertthat_0.2.1 promises_1.3.0
## [49] scales_1.3.0 gtable_0.3.5
## [51] beachmat_2.10.0 globals_0.16.3
## [53] goftest_1.2-3 klippy_0.0.0.9500
## [55] spam_2.10-0 rlang_1.1.4
## [57] splines_4.1.0 lazyeval_0.2.2
## [59] spatstat.geom_3.2-9 BiocManager_1.30.23
## [61] yaml_2.3.8 reshape2_1.4.4
## [63] abind_1.4-5 httpuv_1.6.15
## [65] tools_4.1.0 ggplot2_3.5.1
## [67] jquerylib_0.1.4 RColorBrewer_1.1-3
## [69] BiocGenerics_0.40.0 ggridges_0.5.6
## [71] Rcpp_1.0.12 plyr_1.8.9
## [73] sparseMatrixStats_1.6.0 zlibbioc_1.40.0
## [75] purrr_1.0.2 RCurl_1.98-1.14
## [77] deldir_2.0-4 pbapply_1.7-2
## [79] cowplot_1.1.3 S4Vectors_0.32.4
## [81] zoo_1.8-12 SummarizedExperiment_1.24.0
## [83] ggrepel_0.9.5 cluster_2.1.2
## [85] magrittr_2.0.3 data.table_1.15.4
## [87] RSpectra_0.16-1 scattermore_1.2
## [89] ResidualMatrix_1.4.0 lmtest_0.9-40
## [91] RANN_2.6.1 fitdistrplus_1.1-11
## [93] matrixStats_1.1.0 mime_0.12
## [95] evaluate_0.24.0 xtable_1.8-4
## [97] RhpcBLASctl_0.23-42 fastDummies_1.7.3
## [99] IRanges_2.28.0 gridExtra_2.3
## [101] compiler_4.1.0 tibble_3.2.1
## [103] KernSmooth_2.23-20 R.oo_1.26.0
## [105] htmltools_0.5.8.1 mgcv_1.8-35
## [107] later_1.3.2 tidyr_1.3.1
## [109] DBI_1.2.3 MASS_7.3-54
## [111] Matrix_1.6-5 permute_0.9-7
## [113] cli_3.6.3 R.methodsS3_1.8.2
## [115] parallel_4.1.0 dotCall64_1.1-1
## [117] igraph_2.0.3 GenomicRanges_1.46.1
## [119] pkgconfig_2.0.3 scuttle_1.4.0
## [121] plotly_4.10.4 spatstat.sparse_3.1-0
## [123] bslib_0.7.0 XVector_0.34.0
## [125] stringr_1.5.1 digest_0.6.36
## [127] sctransform_0.4.1 RcppAnnoy_0.0.22
## [129] vegan_2.6-6.1 spatstat.data_3.1-2
## [131] rmarkdown_2.27 leiden_0.4.3.1
## [133] uwot_0.2.2 DelayedMatrixStats_1.16.0
## [135] shiny_1.8.1.1 lifecycle_1.0.4
## [137] nlme_3.1-152 jsonlite_1.8.8
## [139] BiocNeighbors_1.12.0 viridisLite_0.4.2
## [141] fansi_1.0.6 pillar_1.9.0
## [143] lattice_0.20-44 fastmap_1.2.0
## [145] httr_1.4.7 survival_3.2-11
## [147] glue_1.7.0 remotes_2.5.0
## [149] png_0.1-8 stringi_1.8.4
## [151] sass_0.4.9 RcppHNSW_0.6.0
## [153] BiocSingular_1.10.0 irlba_2.3.5.1
## [155] future.apply_1.11.2