library("Seurat")
library("dplyr")
library("ggplot2")
library("SingleCellExperiment")
library("ILoReg")
For this clustering tutorial we’ll use the data publicly available at GEO GSE115978, published by Jerby-Arnon et al., 2018. The data comprises single-cell RNA-seq (modified SMART-Seq2) profiles from resected tumors from a cohort of 31 patients (n=15, untreated; n=15, post-immunotherapy resistant; n=1, post-immunotherapy responder).
Read more about it in the Methods section “scRNA-seq cohort data collection” of the publication.
Helpfully, the authors kindly provided the data and code used to produce the analyses in the publication:
(5 min)
Below we’ll import the table with the information to download.
## Download datasets
# import table with datasets to download
data2download <- read.table("data/GEO_GSE115978_project.tsv",
sep="\t", header=TRUE)
Let’s print the table imported above.
# Print table
knitr::kable(data2download)
| filename | description | ftp_url |
|---|---|---|
| GSE115978_counts.csv.gz | raw counts | https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE115978&format=file&file=GSE115978%5Fcounts%2Ecsv%2Egz |
| GSE115978_cell.annotations.csv.gz | metadata | https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE115978&format=file&file=GSE115978%5Fcell%2Eannotations%2Ecsv%2Egz |
Now we’ll download the metadata and raw counts from the GEO repository. Both tables will be imported as data.frames. We will provide this information to the function CreateSeuratObject() to convert the data into a Seurat class object (read more about it).

The two files that we are downloading are:

- GSE115978_counts.csv.gz: raw counts with malignant and immune/stroma cells
- GSE115978_cell.annotations.csv.gz: metadata file including cell type annotations

We’ll import all the data, but we’ll only keep the immune/stroma cells for the rest of this analysis: CD4 and CD8 T cells, T and NK cells, B cells, macrophages, endothelial cells and CAFs (cancer-associated fibroblasts).
# Download
# create directory to save files
down_dir <- "data/GSE115978" # directory to save datasets
if (!dir.exists(down_dir)) dir.create(down_dir, recursive=TRUE) # create folder if it doesn't exist
# loop over the table rows & download each dataset
for (f in 1:nrow(data2download)) {
  down_filename <- data2download[f,"filename"]
  down_filepath <- paste(down_dir, down_filename, sep="/") # download file to: 'data/GSE115978'
  down_url <- data2download[f,"ftp_url"]
  cat("Downloading file", down_filename, "to", down_filepath, "\n")
  download.file(url=down_url, destfile=down_filepath)
}
## Downloading file GSE115978_counts.csv.gz to data/GSE115978/GSE115978_counts.csv.gz
## Downloading file GSE115978_cell.annotations.csv.gz to data/GSE115978/GSE115978_cell.annotations.csv.gz
# Import
counts <- read.table(gzfile(paste(down_dir, "GSE115978_counts.csv.gz", sep="/")),
sep=",", row.names=1, header=TRUE)
cell.annot <- read.table(gzfile(paste(down_dir, "GSE115978_cell.annotations.csv.gz", sep="/")),
sep=",", row.names=1, header=TRUE)
stopifnot(all(colnames(counts)==row.names(cell.annot)))
immune.cells <- row.names(cell.annot[!(cell.annot$cell.types %in% c("?", "Mal")),]) # remove malignant cells and unknown
seu <- CreateSeuratObject(counts[,immune.cells], meta.data=cell.annot[immune.cells,])
Question: How many genes and cells compose the GEO GSE115978 dataset (after selecting the immune/stroma cells)? (1 min)
Tip: type the object seu in the R console and interpret the message printed, or provide the object seu to the functions nrow() and ncol(), which count the no. of rows (=genes) and columns (=cells), respectively.
The GEO GSE115978 dataset comprises 23686 genes and 4861 cells.
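A quick way to verify this is sketched below.
# Check the dimensions of the Seurat object: genes (rows) x cells (columns)
nrow(seu) # no. of genes
ncol(seu) # no. of cells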
We’ll assume that the data has already been processed and filtered, based on the number of cells obtained above.
(5 min)
Let’s inspect the metadata (saved in the Seurat object at seu@meta.data).
## Print the first few rows of the metadata
seu@meta.data[1:10,] # first 10 rows & all cols
## orig.ident nCount_RNA nFeature_RNA
## cy82_CD45_pos_2_H12_S576_comb cy82 12102 4345
## cy81_Bulk_CD45_neg_E02_S146_comb cy81 416093 6003
## CY88CD45POS_7_C07_S223_comb CY88CD45POS 473502 7079
## cy88_cd45pos_5_C01_S217_comb cy88 3698566 9241
## cy82_CD45_pos_1_E08_S536_comb cy82 381769 5675
## cy79_p3_CD45_neg_PDL1_neg_F07_S259_comb cy79 136780 4461
## CY89CORE11_F03_S159_comb CY89CORE11 104604 5142
## cy82_CD45_neg_2_H12_S288_comb cy82 30858 5103
## CY89A_CD45_POS_10_A07_S199_comb CY89A 126199 3673
## cy79_p5_CD45_neg_PDL1_neg_C04_S796_comb cy79 119939 4172
## samples cell.types treatment.group
## cy82_CD45_pos_2_H12_S576_comb Mel82 Macrophage treatment.naive
## cy81_Bulk_CD45_neg_E02_S146_comb Mel81 Endo. treatment.naive
## CY88CD45POS_7_C07_S223_comb Mel88 Macrophage post.treatment
## cy88_cd45pos_5_C01_S217_comb Mel88 T.CD4 post.treatment
## cy82_CD45_pos_1_E08_S536_comb Mel82 T.CD4 treatment.naive
## cy79_p3_CD45_neg_PDL1_neg_F07_S259_comb Mel79 Endo. treatment.naive
## CY89CORE11_F03_S159_comb Mel89 Endo. treatment.naive
## cy82_CD45_neg_2_H12_S288_comb Mel82 CAF treatment.naive
## CY89A_CD45_POS_10_A07_S199_comb Mel89 T.CD8 treatment.naive
## cy79_p5_CD45_neg_PDL1_neg_C04_S796_comb Mel79 Endo. treatment.naive
## Cohort no.of.genes no.of.reads
## cy82_CD45_pos_2_H12_S576_comb Tirosh 4431 12102
## cy81_Bulk_CD45_neg_E02_S146_comb Tirosh 6154 416093
## CY88CD45POS_7_C07_S223_comb Tirosh 7234 473502
## cy88_cd45pos_5_C01_S217_comb Tirosh 9341 3698566
## cy82_CD45_pos_1_E08_S536_comb Tirosh 5880 381769
## cy79_p3_CD45_neg_PDL1_neg_F07_S259_comb Tirosh 4665 136780
## CY89CORE11_F03_S159_comb Tirosh 5300 104604
## cy82_CD45_neg_2_H12_S288_comb Tirosh 5251 30858
## CY89A_CD45_POS_10_A07_S199_comb Tirosh 3872 126199
## cy79_p5_CD45_neg_PDL1_neg_C04_S796_comb Tirosh 4378 119939
The metadata comprises the following fields/columns:

- orig.ident: barcode prefix id
- nCount_RNA: UMI counts
- samples: tumor samples (n=32)
- cell.types: annotated cell types: B.cell (n=818), CAF (n=106), Endo. (n=104), Macrophage (n=420), NK (n=92), T.CD4 (n=856), T.CD8 (n=1759), T.cell (n=706)
- treatment.group: treatment group, treatment.naive (n=2254) and post.treatment (n=2607)
- Cohort: New (n=2021) and Tirosh (n=2840)
- no.of.genes: no. of genes
- no.of.reads: no. of reads
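These counts can be double-checked by tabulating the categorical fields directly (a quick sketch).
# Tabulate the main categorical metadata fields
table(seu@meta.data$cell.types)
table(seu@meta.data$treatment.group)
table(seu@meta.data$Cohort)
length(unique(seu@meta.data$samples)) # no. of tumor samples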
Now that we have an understanding of the experimental design and metadata fields, we’ll project the data into low-dimensional space and inspect some of the metadata variables, such as samples, cell.types, treatment.group and Cohort.
(5 min)
To explore how some technical variables (e.g., samples, cell.types, treatment.group, Cohort) might influence the cells, we’ll run the basic Seurat workflow to obtain a non-linear dimensional reduction (UMAP) plot: log-normalization (NormalizeData()), finding the top 2K HVG (FindVariableFeatures()), scaling (ScaleData()), PCA (RunPCA()) and UMAP with the top 15 PCs (RunUMAP()).
## Dimensional reduction: workflow
set.seed(1024) # keep reproducibility
seu <- NormalizeData(seu, normalization.method="LogNormalize",
scale.factor=10000) # normalization
seu <- FindVariableFeatures(seu, selection.method="vst",
nfeatures=2000) # find HVG
seu <- ScaleData(seu) # scale data
seu <- Seurat::RunPCA(seu) # run PCA - the 'Seurat::' prefix resolves the namespace
#conflict with ILoReg's RunPCA()/RunUMAP() - the same below
seu <- Seurat::RunUMAP(seu, dims=1:15) # run UMAP w/ 15 dims (inspect by: 'ElbowPlot(seu)')
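The choice of the top 15 PCs can be inspected with an elbow plot, as noted in the comment above (a quick sketch).
# Inspect the variance explained by the top PCs to support using 15 dims in RunUMAP()
ElbowPlot(seu, ndims=30)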
Then, we’ll use the function DimPlot() to plot each one of the variables. The function plot_dimred_bygroup() created below simply loops over a list of variables and plots them all (to automate this task).
# Plot variables
# create function to automatize
plot_dimred_bygroup <- function(seu, batches) {
  # 'plot_dimred_bygroup()': plot dimensional reduction plots by group.
  # 'seu': Seurat object with the following dimensional reductions: 'umap'.
  # 'batches': a character vector of variables available in the
  #   '@meta.data' data frame of the given Seurat obj.
  qc_plots <- lapply(setNames(batches, batches), function(x) {
    seu@meta.data[,x] <- factor(seu@meta.data[,x],
                                levels=unique(seu@meta.data[,x]))
    DimPlot(seu, group.by=x, reduction="umap")
  })
  return(qc_plots)
}
# Plot
qc_dr <- plot_dimred_bygroup(seu, c("samples", "cell.types", "treatment.group", "Cohort"))
# Save
dirs2save <- paste("results/GSE115978", c("plots", "tables", "objects"), sep="/")
for (d in dirs2save) if (!dir.exists(d)) dir.create(d, recursive=TRUE)
pdf(paste(dirs2save[1], "qc_meta_vars_umap.pdf", sep="/"), width=20, height=20)
cowplot::plot_grid(plotlist=qc_dr, ncol=2)
dev.off()
## png
## 2
# Print
cowplot::plot_grid(plotlist=qc_dr, ncol=2)
Question: In your opinion, based on the three technical variables explored above (“samples”, “treatment.group”, “Cohort”), should the data be integrated/batch-corrected for any of those or not? (3 min)
Tip: integration or batch-correction aims to remove unwanted/technical noise from the data by finding the shared biological variation across batches, in order to identify shared cell types across batches/genomic layers. Ideally, the same cell types from different batches/genomic layers will be projected to the same low-dimensional space. Thus, the analysis done above tries to identify whether or not cells are being projected to similar/close regions of the low-dimensional space depending on the technical variables explored.
The data projected above into low-dimensional space (UMAP) shows a great deal of overlap/mixing between the categorical variables explored (samples, treatment.group, Cohort) across the first two axes, meaning that the data is not being projected based on potential technical noise (or covariates) but rather by true biological variability, i.e., cell type (at least across the first two axes, UMAP_1 and UMAP_2). This does not mean that these variables are not ‘distorting’ the cell types, but if they are, the effect seems too weak to be detected across these two axes. Therefore, the choice of applying integration or not can in this case be left to the researcher: by integrating we might remove true biological variability, and by not integrating we might keep some technical noise. Keep in mind that both options seem valid, or at least debatable, in this case. Since any covariate effect that exists here seems to be weak, we’ll not apply any integration (also because it is more convenient for using ILoReg below).
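For completeness, if the covariate effect (e.g., Cohort) were judged strong enough to warrant correction, a standard Seurat integration could look roughly like the commented sketch below (not run here; shown only as an illustration).
## NOT RUN: rough sketch of integrating by 'Cohort' with the standard Seurat workflow
#seu.list <- SplitObject(seu, split.by="Cohort")
#seu.list <- lapply(seu.list, function(x) {
#  x <- NormalizeData(x)
#  FindVariableFeatures(x, selection.method="vst", nfeatures=2000)
#})
#features <- SelectIntegrationFeatures(object.list=seu.list)
#anchors <- FindIntegrationAnchors(object.list=seu.list, anchor.features=features)
#seu.int <- IntegrateData(anchorset=anchors)
## 'seu.int' would then go through ScaleData()/RunPCA()/RunUMAP() on the 'integrated' assay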
Now it is time to perform clustering. To this end we’ll apply two different methods:

- Seurat: publication: Hao et al., 2021; repository: website
- ILoReg: publication: Smolander et al., 2020; repository: Bioconductor

Then we’ll compare both clustering results and discuss the similarities and differences, using the cell types provided as ‘ground truth’.
(10 min)
Seurat graph-based clustering relies on two main functions (the explanation given below is valid for the default options):

- FindNeighbors() (help): computes a shared nearest neighbor (SNN) graph based on the Euclidean distance between cells in the PCA dimensions provided
- FindClusters() (help): identifies clusters, based on the resolution value provided, with the original Louvain algorithm for modularity optimization
We don’t know how many cell clusters we have. Therefore we will provide a vector of resolution values: c(0.3, 0.5, 0.7). A value close to or higher than 1 forms smaller clusters and, thus, more cell clusters; a value lower than 1 forms larger clusters (and fewer of them). Seurat saves the clustering labels in the seu@meta.data slot with the prefix RNA_snn_res. plus the resolution value used. Thus, in our case we expect: RNA_snn_res.0.3, RNA_snn_res.0.5 and RNA_snn_res.0.7.

Run the R chunk code below to perform clustering and visualize the clusters through UMAP.
# Cluster the data w/ Seurat
set.seed(1024)
seu <- FindNeighbors(seu, reduction="pca", dims=1:15)
seu <- FindClusters(seu, resolution=c(0.3,0.5,0.7))
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4861
## Number of edges: 173796
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9159
## Number of communities: 12
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4861
## Number of edges: 173796
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8812
## Number of communities: 13
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 4861
## Number of edges: 173796
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.8538
## Number of communities: 16
## Elapsed time: 0 seconds
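The cluster labels are now stored as new columns in the metadata, one per resolution; a quick way to inspect them is sketched below.
# Inspect the cluster labels added by FindClusters()
head(seu@meta.data[, paste0("RNA_snn_res.", c(0.3, 0.5, 0.7))])
sapply(paste0("RNA_snn_res.", c(0.3, 0.5, 0.7)),
       function(r) length(unique(seu@meta.data[[r]]))) # no. of clusters per resolution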
# Plot & save UMAP: clusters
clts_seurat <- plot_dimred_bygroup(seu, paste0("RNA_snn_res.", c(0.3, 0.5, 0.7)))
pdf(paste(dirs2save[1], "seurat_clusters_umap.pdf", sep="/"), width=18, height=6)
cowplot::plot_grid(plotlist=clts_seurat, ncol=3)
dev.off()
## png
## 2
# Print
cowplot::plot_grid(plotlist=clts_seurat, ncol=3)
Question: Does the clustering change a lot across the resolution values applied? (3 min)
The clustering results are relatively similar to each other across the increasing resolution values, suggesting that the clusters are relatively stable and robust.
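This impression can be quantified, for instance, with the Adjusted Rand Index between pairs of resolutions (a sketch assuming the aricode package, which appears among the loaded namespaces in the session info at the end; any ARI implementation would do).
# Pairwise agreement between clusterings at different resolutions (1 = identical)
aricode::ARI(seu@meta.data$RNA_snn_res.0.3, seu@meta.data$RNA_snn_res.0.5)
aricode::ARI(seu@meta.data$RNA_snn_res.0.5, seu@meta.data$RNA_snn_res.0.7)
aricode::ARI(seu@meta.data$RNA_snn_res.0.3, seu@meta.data$RNA_snn_res.0.7)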
(70 min)
ILoReg is an R package for clustering single-cell RNA sequencing (scRNA-seq) data, developed by Smolander et al., 2021. It can be found on Bioconductor.
ILoReg aims to find fine-grained clusters from scRNA-seq data through a novel machine learning algorithm designated iterative clustering projection (ICP). ICP relies mainly on two consecutive steps run L times (where L is 200 by default):

1. Logistic Regression with L1 regularization (aka LASSO) to predict clusters
2. Adjusted Rand Index (aka ARI) to compare cluster classifications across runs

Step 1 does not actually aim to predict clusters but instead uses the ability of LASSO to learn from the data to find important features and translate that into an n x k probability matrix, with the likelihood of each of the n cells belonging to each of the k clusters (where k is 15 by default). In order to predict clusters, step 1 starts by using a balanced random initial clustering (S) as the training data set, where S is built from a gene expression matrix of m x n (m genes and n cells), with the n cells being partitioned into k clusters (by default 15). The gene expression table provided to ILoReg should already be normalized. The first prediction results in a projected clustering (S’) with the predicted n cells distributed across the k clusters. Then, step 2 takes place by comparing the ARI score between the S and S’ clusterings. If ARI(S’) > ARI(S), it uses the new clustering S’ as the training data set to perform the next prediction, i.e., ICP run; otherwise it keeps S. Therefore, step 2 works as a measure to direct predictions towards more stable clusters and to improve the ability of LASSO to learn from the data with that purpose in mind. Please see the figure below for a more intuitive interpretation.
After running ICP L times, the result is L probability matrices, each of dimension n x k, where each entry represents the probability of cell n belonging to each of the k clusters. These results are combined into a joint probability matrix which is fed into a Principal Component Analysis (PCA). The PCA finds a consensus matrix of p PCs for the n cells, which is then used to hierarchically cluster the n cells into a selected, expected number of clusters using Ward’s method. The Silhouette method can be applied to inform the decision regarding the number of clusters to select.
ILoReg was designed to use all the expressed genes/features of log-normalized data. ILoReg is a Bioconductor package and thus it relies on the SingleCellExperiment class object, an object similar to the Seurat object, with a different organization but with the same aim: to gather together different layers of information.
First we start by converting the Seurat object to SingleCellExperiment with the function as.SingleCellExperiment(). Then we run ICP with the function RunParallelICP(). This function has several parameters, explained below:

- k=15: number of clusters to split the data by
- d=0.3: proportion of cells down- or over-sampled from each cluster to use for the training data set during ICP. A value ranging 0-1; recommended between 0.2-0.3.
- L=50: number of ICP runs. Recommended: 200. Here we’ll run 50 because this is a relatively small data set and due to time constraints.
- r=5: a positive integer referring to the maximum number of reiterations until ICP stops.
- C=0.3: number that defines the trade-off between correct classification and regularization. Decreasing C results in the selection of fewer genes/features for the logistic regression and is, thus, more stringent.
- reg.type="L1": regression type. Use "L1" for LASSO regression and "L2" for ridge regression.
- threads=4: number of threads used to parallelize the computation.

Then the joint probabilities are used to build a PCA (ILoReg::RunPCA()), which is then used for UMAP (ILoReg::RunUMAP()) and hierarchical clustering with HierarchicalClustering(). Finally, we can specify the number of clusters that we want to select with SelectKClusters() (in our case let’s select 15).

Run the R chunk code below.
## Clustering: ILoReg
iloreg_params <- list(icp_k=15, icp_d=0.3, icp_r=5,
icp_c=0.3, icp_l=50, icp_threads=4,
icp_iter=50, sel_clts=15
)
#seu <- FindVariableFeatures(seu, selection.method="vst", nfeatures=5000)
#hvg <- VariableFeatures(seu)
sce <- as.SingleCellExperiment(seu)
#sce <- sce[hvg,]
## Prepare data to ILoReg
sce <- PrepareILoReg(sce)
## Clean DR layers
reducedDim(sce, "PCA") <- NULL
reducedDim(sce, "UMAP") <- NULL
reducedDim(sce, "TSNE") <- NULL
## Running ICP
set.seed(1024)
sce <- RunParallelICP(object=sce, k=iloreg_params[["icp_k"]],
max.iter=iloreg_params[["icp_iter"]],
d=iloreg_params[["icp_d"]], L=iloreg_params[["icp_l"]],
r=iloreg_params[["icp_r"]], C=iloreg_params[["icp_c"]],
reg.type="L1", threads=iloreg_params[["icp_threads"]])
## Running PCA and joint probability matrix
sce <- RunPCA(sce, p=50, scale=FALSE)
## Run t-SNE and UMAP DR
sce <- RunUMAP(sce)
## Hierarchical clustering
sce <- HierarchicalClustering(sce)
## Select clusters based on no. of cell types expected
sce <- SelectKClusters(sce, K=iloreg_params[["sel_clts"]])
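As mentioned above, the Silhouette method can help to inform the choice of K. A possible (optional) sketch is shown below, assuming ILoReg's CalcSilhInfo()/SilhouetteCurve() helpers; check the package vignette for the exact interface.
## OPTIONAL: average silhouette across a range of K values (not run here; we fixed K=15)
#sce <- CalcSilhInfo(sce)
#SilhouetteCurve(sce, return.plot=TRUE)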
Check the clusters found by ILoReg below.
# Plot & save UMAP: clusters
clts_iloreg <- ClusteringScatterPlot(sce, dim.reduction.type="umap",
return.plot=TRUE, title="ILoReg clusters",
show.legend=TRUE)
pdf(paste(dirs2save[1], "iloreg_clusters_umap.pdf", sep="/"), width=6, height=6)
clts_iloreg
dev.off()
## png
## 2
# Print
clts_iloreg
Question: What is the greatest difference between the projections of the Seurat and ILoReg clustering results? (1 min)
ILoReg is able to find slight differences between more similar cell clusters. The proportion of cells by cluster is also more even in ILoReg than in Seurat.
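One way to check this is to compare the cluster size distributions of the two methods directly (a minimal sketch).
# Number of cells per cluster for each method
table(seu@meta.data$RNA_snn_res.0.7)          # Seurat (resolution 0.7)
table(metadata(sce)$iloreg$clustering.manual) # ILoReg (K=15)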
(10 min)
Let’s compare the Seurat clustering result RNA_snn_res.0.7 with the ILoReg result below, finding the correspondence between clusters and cell.types using a Sankey plot.
## Compare clustering results
# Install package
if (!"networkD3" %in% installed.packages()) install.packages("networkD3", repos="https://cloud.r-project.org")
# Get & parse data
stopifnot(all(row.names(seu@meta.data)==names(metadata(sce)$iloreg$clustering.manual)))
#clt.comp <- as.data.frame(table(seu@meta.data$RNA_snn_res.0.7, metadata(sce)$iloreg$clustering.manual))
#colnames(clt.comp) <- c("source", "target", "value")
#nodes.names <- data.frame("name"=c(paste0("seu.",levels(clt.comp$source)), paste0("sce.", levels(clt.comp$target))))
#clt.comp$source <- as.integer(clt.comp$source)-1
#clt.comp$target <- as.integer(clt.comp$target)+(length(levels(seu@meta.data$RNA_snn_res.0.7))-1)
# Add cell annotations
seu.clt.cell.types <- as.data.frame(table(seu@meta.data$RNA_snn_res.0.7, seu@meta.data$cell.types))
sce.clt.cell.types <- as.data.frame(table(metadata(sce)$iloreg$clustering.manual, colData(sce)[,"cell.types"]))
colnames(seu.clt.cell.types) <- colnames(sce.clt.cell.types) <- c("source", "target", "value")
nodes.names <- data.frame("name"=c(paste0("seu.", levels(seu.clt.cell.types$source)),
levels(seu.clt.cell.types$target),
paste0("sce.", levels(sce.clt.cell.types$source))))
seu.clt.cell.types$source <- as.integer(seu.clt.cell.types$source)-1
sce.clt.cell.types$source <- as.integer(sce.clt.cell.types$source)+
max(seu.clt.cell.types$source) + length(levels(seu.clt.cell.types$target))
seu.clt.cell.types$target <- as.integer(seu.clt.cell.types$target)+max(seu.clt.cell.types$source)
sce.clt.cell.types$target <- as.integer(sce.clt.cell.types$target)+max(seu.clt.cell.types$source)
sce.clt.cell.types <- sce.clt.cell.types[,c("target", "source", "value")]
colnames(sce.clt.cell.types) <- c("source", "target", "value")
clt.comp <- do.call("rbind", list(seu.clt.cell.types, sce.clt.cell.types))
# Plot Sankey plot
snk.plot <- networkD3::sankeyNetwork(Links=clt.comp, Nodes=nodes.names, Source="source",
Target="target", Value="value", NodeID="name",
units="TWh", fontSize=12, nodeWidth=30)
# Save it
networkD3::saveNetwork(snk.plot, paste(dirs2save[1], "cluster_comparison_seurat_iloreg_sankey_plot.html", sep="/"),
selfcontained=FALSE)
# Plot it
snk.plot
Question: Which were the most robust/stable clusters across the Seurat/ILoReg clustering results? (1 min)
The cluster related to B cells (Seurat cluster 2 -> ILoReg cluster 12) was the most robust/stable cluster across both methods. The macrophage-, endothelial- and CAF-related clusters also show some degree of stability across methods.
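The visual impression from the Sankey plot can also be summarised numerically, e.g., by scoring each clustering against the provided cell.types with the Adjusted Rand Index (a sketch, again assuming the aricode package).
# Agreement between each clustering and the published cell type labels
aricode::ARI(seu@meta.data$RNA_snn_res.0.7, seu@meta.data$cell.types)
aricode::ARI(as.character(metadata(sce)$iloreg$clustering.manual),
             colData(sce)[,"cell.types"])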
TASK 1: Discuss within your group why some clusters were more stable across clustering results than others, and whether you think that any of the sub-clusters found may represent a sub-cell-type or not (based on your opinion - there isn’t any ‘ground truth’ for this). Some tips: explore the markers by cluster for both results and discuss within your group whether these make sense. You can find markers in the Seurat object by calling markers <- FindAllMarkers(seu, only.pos=TRUE, min.pct=0.25, logfc.threshold=0.25) and, after selecting the top 2 markers by cluster (with top.markers <- markers %>% group_by(cluster) %>% slice_max(n=2, order_by=avg_log2FC)), highlight them in a UMAP plot with FeaturePlot(seu, features=unique(top.markers$gene)) - use more markers if you feel these top 2 are not informative. You may need to scale the whole data (instead of only the top 2K HVG) to use some visualizations (all.genes <- row.names(seu) and seu <- ScaleData(seu, features=all.genes)). With ILoReg you can take a similar approach: markers <- FindAllGeneMarkers(sce, clustering.type="manual", test="wilcox", log2fc.threshold=0.25, min.pct=0.25, min.diff.pct=NULL, min.cells.group=3, return.thresh=0.01, only.pos=TRUE, max.cells.per.cluster=NULL); top.markers <- SelectTopGenes(markers, top.N=2, criterion.type="adj.p.value", inverse=TRUE); GeneScatterPlot(sce, genes=unique(top.markers$gene), dim.reduction.type="umap", point.size=0.5, ncol=2); see the assembled sketch below. You can also check other cell markers that you know for these cells and use other types of visualization - these were just some suggestions. Check the Seurat and ILoReg documentation for more information and other types of visualization.
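For convenience, the snippets suggested in the tips above can be assembled into a runnable starting point (a sketch; adjust the thresholds and the number of top markers as needed).
# TASK 1 starting point: markers per cluster for both methods
markers <- FindAllMarkers(seu, only.pos=TRUE, min.pct=0.25, logfc.threshold=0.25)
top.markers <- markers %>% group_by(cluster) %>% slice_max(n=2, order_by=avg_log2FC)
FeaturePlot(seu, features=unique(top.markers$gene))
ilo.markers <- FindAllGeneMarkers(sce, clustering.type="manual", test="wilcox",
                                  log2fc.threshold=0.25, min.pct=0.25, min.diff.pct=NULL,
                                  min.cells.group=3, return.thresh=0.01, only.pos=TRUE,
                                  max.cells.per.cluster=NULL)
ilo.top <- SelectTopGenes(ilo.markers, top.N=2, criterion.type="adj.p.value", inverse=TRUE)
GeneScatterPlot(sce, genes=unique(ilo.top$gene), dim.reduction.type="umap",
                point.size=0.5, ncol=2)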
(7 min)
This data set does not contain fine-grained cell type annotations. Therefore, we’ll use the recently published tool CellTypist for automated immune cell annotation using logistic regression (publication).
CellTypist requires an anndata object with log1p-normalized data (scaled with a factor of 10000). Thus we’ll export the Seurat object with the ILoReg cluster annotations. Then, this object will be converted to anndata (.h5ad extension) using the SeuratDisk function Convert().
# Install SeuratDisk if not available
# Install packages
if ( ! ("hdf5r" %in% installed.packages()) ) devtools::install_github("hhoeflin/hdf5r")
if ( ! ("SeuratDisk" %in% installed.packages()) ) remotes::install_github("mojaveazure/seurat-disk")
# Export Seurat object
seu@assays$RNA@scale.data <- matrix(ncol=0, nrow=0) # empty the scale.data slot before export
#stopifnot(all(row.names(seu@meta.data)==names(metadata(sce)$iloreg$clustering.manual)))
#seu@meta.data[["iloreg.clustering.manual"]] <- metadata(sce)$iloreg$clustering.manual
seu@meta.data <- seu@meta.data %>%
mutate_if(is.factor, as.character)
SeuratDisk::SaveH5Seurat(seu, paste(dirs2save[3], "seu2celltypist.h5Seurat", sep="/"))
SeuratDisk::Convert(paste(dirs2save[3], "seu2celltypist.h5Seurat", sep="/"), dest="h5ad")
Now we’ll perform the annotations with CellTypist using the reference model Immune_All_Low.pkl and with the majority_voting parameter.
## CellTypist cell annotation
# Import modules
import os
import random
import celltypist
import scanpy as sc
# Params
input_h5ad_file="results/GSE115978/objects/seu2celltypist.h5ad"
model="data/models/Immune_All_Low.pkl"
output_h5ad_file="results/GSE115978/objects/seu.annot.h5ad"
if not os.path.isdir(os.path.dirname(model)): os.makedirs(os.path.dirname(model))
# Define the models path (otherwise it attempts: '~/.celltypist/data/models')
celltypist.models.models_path=os.path.join(os.getcwd(), "data/models")
# Download CellTypist models
celltypist.models.download_models(force_update=True)
# Import scRNA-seq anndata object
## 📜 Retrieving model list from server https://celltypist.cog.sanger.ac.uk/models/models.json
## 📚 Total models in list: 14
## 📂 Storing models in /wrk/local2/B21027_scRNAseq_signatures/070622_embl_ebi_tcell_course/EMBL_EBI_scRNA_bioinformatics_Tcell_course_2022/workflow/data/models
## 💾 Downloading model [1/14]: Immune_All_Low.pkl
## 💾 Downloading model [2/14]: Immune_All_High.pkl
## 💾 Downloading model [3/14]: Immune_All_PIP.pkl
## 💾 Downloading model [4/14]: Immune_All_AddPIP.pkl
## 💾 Downloading model [5/14]: Adult_Mouse_Gut.pkl
## 💾 Downloading model [6/14]: COVID19_Immune_Landscape.pkl
## 💾 Downloading model [7/14]: Cells_Fetal_Lung.pkl
## 💾 Downloading model [8/14]: Cells_Intestinal_Tract.pkl
## 💾 Downloading model [9/14]: Cells_Lung_Airway.pkl
## 💾 Downloading model [10/14]: Developing_Mouse_Brain.pkl
## 💾 Downloading model [11/14]: Healthy_COVID19_PBMC.pkl
## 💾 Downloading model [12/14]: Human_Lung_Atlas.pkl
## 💾 Downloading model [13/14]: Nuclei_Lung_Airway.pkl
## 💾 Downloading model [14/14]: Pan_Fetal_Human.pkl
adata=sc.read(input_h5ad_file)
# Perform annotation: allow 'majority_voting' by clustering - major cell type should
#represent 75% of the labels for that cluster
random.seed(1024) # try to set this seed for reproducibility
predictions=celltypist.annotate(adata, model=os.path.basename(model), majority_voting=True)
## 🔬 Input data has 4861 cells and 23686 genes
## 🔗 Matching reference genes in the model
## 🧬 4312 features used for prediction
## ⚖️ Scaling input data
## 🖋️ Predicting labels
## ✅ Prediction done!
## 👀 Can not detect a neighborhood graph, will construct one before the over-clustering
## ⛓️ Over-clustering input data with resolution set to 5
## 🗳️ Majority voting the predictions
## ✅ Majority voting done!
adata=predictions.to_adata()
adata.write(output_h5ad_file)
## ... storing 'orig.ident' as categorical
## ... storing 'samples' as categorical
## ... storing 'cell.types' as categorical
## ... storing 'treatment.group' as categorical
## ... storing 'Cohort' as categorical
## ... storing 'RNA_snn_res.0.3' as categorical
## ... storing 'RNA_snn_res.0.5' as categorical
## ... storing 'RNA_snn_res.0.7' as categorical
## ... storing 'seurat_clusters' as categorical
Let’s import and visualize the annotations below.
## Explore CellTypist annotations
if (! ("anndata" %in% installed.packages()) ) install.packages("anndata", repos="https://cloud.r-project.org") # SeuratDisk is unable
#to import data processed with newer versions of anndata/scanpy; use the anndata package instead
annot <- anndata::read_h5ad(paste(dirs2save[3], "seu.annot.h5ad", sep="/"))
stopifnot(all(row.names(annot$obs)==colnames(seu)))
vars2add <- c("predicted_labels", "majority_voting")
seu@meta.data[,vars2add] <- annot$obs[,vars2add]
colData(sce)[,vars2add] <- annot$obs[,vars2add]
## Remove 'predicted_labels' levels with fewer than 10 cells
seu@meta.data[,"new.predicted_labels"] <- seu@meta.data[,"predicted_labels"]
levels(seu@meta.data[,"new.predicted_labels"])[which(table(seu@meta.data[,"new.predicted_labels"])<10)] <- NA
colData(sce)[,"new.predicted_labels"] <- colData(sce)[,"predicted_labels"]
levels(colData(sce)[,"new.predicted_labels"])[which(table(colData(sce)[,"new.predicted_labels"])<10)] <- NA
# Plot 'majority_voting'
pdf(paste(dirs2save[1], "celltypist_majority_voting_annot_umap.pdf", sep="/"), width=20, height=12)
cowplot::plot_grid(DimPlot(seu),
(DimPlot(seu, group.by="new.predicted_labels")),
DimPlot(seu, group.by="majority_voting"),
clts_iloreg,
scater::plotReducedDim(sce, dimred="UMAP", colour_by="new.predicted_labels"),
scater::plotUMAP(sce, colour_by="majority_voting"),
ncol=3, rel_widths= c(0.75, 1.25, 1))
dev.off()
## png
## 2
# Print
cowplot::plot_grid(DimPlot(seu),
(DimPlot(seu, group.by="new.predicted_labels")),
DimPlot(seu, group.by="majority_voting"),
clts_iloreg,
scater::plotReducedDim(sce, dimred="UMAP", colour_by="new.predicted_labels"),
scater::plotUMAP(sce, colour_by="majority_voting"),
ncol=3, rel_widths= c(0.75, 1.25, 1))
TASK 2: Discuss within your group the CellTypist annotations and whether or not these annotations help you reach any conclusion about the reliability of the sub-clusters obtained with either Seurat or ILoReg.
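To anchor the discussion, one can cross-tabulate the CellTypist labels against the clusters from each method (a minimal sketch).
# CellTypist majority-voting labels vs. Seurat and ILoReg clusters
table(seu@meta.data$majority_voting, seu@meta.data$RNA_snn_res.0.7)
table(seu@meta.data$majority_voting, metadata(sce)$iloreg$clustering.manual)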
(15 min)
CellTypist identified some cytotoxic T cells, which are particularly important in cancer. Also in the context of cancer, exhausted T cells (T cells characterized by the loss of effector function due to the expression of inhibitory receptors) are key to the fight against cancer. Therefore, in this section, we’ll attempt to identify potential clusters of exhausted T cells.

Below, some markers related with exhaustion ("PDCD1", "CTLA4", "CCL3", "CXCR5", "CXCL10", "CD8A", "TIGIT", "LAG3", "IL2RG") are plotted for both clustering results. Inspect the results below and see the third and last task (if you have time).
## exhausted T cells
## Scale the whole data - it might be useful for some visualizations
all.genes <- row.names(seu)
seu <- ScaleData(seu, features=all.genes)
# markers of exhaustion
markers <- c("PDCD1", "CTLA4", "CCL3", "CXCR5", "CXCL10", "CD8A", "TIGIT", "LAG3", "IL2RG")
# Plot
ilo.mrks <- ILoReg::GeneScatterPlot(sce, genes=markers, dim.reduction.type = "umap",
point.size = 0.1, ncol=3)
seu.mrks <- Seurat::FeaturePlot(seu, features=markers)
pdf(paste(dirs2save[1], "markers_exhaustion_by_cluster_violin_plots.pdf", sep="/"), width=12, height=16)
cowplot::plot_grid(ilo.mrks, seu.mrks, ncol=1)
dev.off()
## png
## 2
TASK 3: Discuss within your group which cluster(s) is(are) more likely to represent exhausted T cells. Some tips: inspect some of the exhausted T cell signatures found in the original publication (Jerby-Arnon et al., 2018) and described in Table S3 of the same publication. You can use other types of visualization, not only to find the clusters most likely to contain exhausted T cells, but also to compare whether some of these markers were more/less expressed in treatment.naive versus post.treatment samples (treatment.group metadata variable).
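A possible starting point for this comparison is to split violin plots of a few exhaustion markers by treatment group (a sketch; adjust the markers and the grouping as you prefer).
# Exhaustion markers per Seurat cluster, split by treatment group
VlnPlot(seu, features=c("PDCD1", "CTLA4", "LAG3", "TIGIT"),
        group.by="RNA_snn_res.0.7", split.by="treatment.group", pt.size=0)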
## Save objects
saveRDS(seu, paste(dirs2save[3], "seu.rds", sep="/"))
saveRDS(sce, paste(dirs2save[3], "int.rds", sep="/"))
## R packages and versions used in these analyses
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /wrk/local2/B21027_scRNAseq_signatures/070622_embl_ebi_tcell_course/EMBL_EBI_scRNA_bioinformatics_Tcell_course_2022/workflow/.snakemake/conda/2b532403d31becf1b5412328473ca837/lib/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fi_FI.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ILoReg_1.4.0 SingleCellExperiment_1.16.0
## [3] SummarizedExperiment_1.24.0 Biobase_2.54.0
## [5] GenomicRanges_1.46.1 GenomeInfoDb_1.30.0
## [7] IRanges_2.28.0 S4Vectors_0.32.3
## [9] BiocGenerics_0.40.0 MatrixGenerics_1.6.0
## [11] matrixStats_0.62.0 ggplot2_3.3.6
## [13] dplyr_1.0.9 sp_1.5-0
## [15] SeuratObject_4.1.0 Seurat_4.1.1
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.2 reticulate_1.25
## [3] tidyselect_1.1.2 htmlwidgets_1.5.4
## [5] BiocParallel_1.28.3 grid_4.1.1
## [7] Rtsne_0.16 ScaledMatrix_1.2.0
## [9] munsell_0.5.0 codetools_0.2-18
## [11] ica_1.0-2 umap_0.2.8.0
## [13] future_1.26.1 miniUI_0.1.1.1
## [15] withr_2.5.0 spatstat.random_2.2-0
## [17] colorspace_2.0-3 progressr_0.10.1
## [19] highr_0.9 knitr_1.39
## [21] rstudioapi_0.13 ROCR_1.0-11
## [23] DescTools_0.99.45 tensor_1.5
## [25] listenv_0.8.0 labeling_0.4.2
## [27] GenomeInfoDbData_1.2.7 polyclip_1.10-0
## [29] bit64_4.0.5 farver_2.1.0
## [31] pheatmap_1.0.12 rprojroot_2.0.3
## [33] parallelly_1.32.0 vctrs_0.4.1
## [35] generics_0.1.2 xfun_0.31
## [37] fastcluster_1.2.3 R6_2.5.1
## [39] ggbeeswarm_0.6.0 rsvd_1.0.5
## [41] hdf5r_1.3.5 bitops_1.0-7
## [43] spatstat.utils_2.3-1 DelayedArray_0.20.0
## [45] assertthat_0.2.1 networkD3_0.4.9000
## [47] promises_1.2.0.1 scales_1.2.0
## [49] beeswarm_0.4.0 rgeos_0.5-9
## [51] rootSolve_1.8.2.3 gtable_0.3.0
## [53] beachmat_2.10.0 globals_0.15.0
## [55] goftest_1.2-3 lmom_2.9
## [57] klippy_0.0.0.9500 rlang_1.0.2
## [59] splines_4.1.1 lazyeval_0.2.2
## [61] spatstat.geom_2.4-0 yaml_2.3.5
## [63] reshape2_1.4.4 abind_1.4-5
## [65] httpuv_1.6.5 SeuratDisk_0.0.0.9020
## [67] tools_4.1.1 ellipsis_0.3.2
## [69] spatstat.core_2.4-4 jquerylib_0.1.4
## [71] RColorBrewer_1.1-3 proxy_0.4-27
## [73] ggridges_0.5.3 Rcpp_1.0.8.3
## [75] plyr_1.8.7 sparseMatrixStats_1.6.0
## [77] zlibbioc_1.40.0 purrr_0.3.4
## [79] RCurl_1.98-1.7 openssl_2.0.2
## [81] rpart_4.1.16 deldir_1.0-6
## [83] pbapply_1.5-0 viridis_0.6.2
## [85] cowplot_1.1.1 zoo_1.8-10
## [87] ggrepel_0.9.1 cluster_2.1.3
## [89] here_1.0.1 magrittr_2.0.3
## [91] data.table_1.14.2 RSpectra_0.16-1
## [93] scattermore_0.8 SparseM_1.81
## [95] lmtest_0.9-40 RANN_2.6.1
## [97] mvtnorm_1.1-3 parallelDist_0.2.6
## [99] anndata_0.7.5.3 fitdistrplus_1.1-8
## [101] patchwork_1.1.1 mime_0.12
## [103] evaluate_0.15 xtable_1.8-4
## [105] readxl_1.4.0 gridExtra_2.3
## [107] scater_1.22.0 compiler_4.1.1
## [109] tibble_3.1.7 KernSmooth_2.23-20
## [111] crayon_1.5.1 htmltools_0.5.2
## [113] mgcv_1.8-40 later_1.3.0
## [115] snow_0.4-4 tidyr_1.2.0
## [117] expm_0.999-6 Exact_3.1
## [119] RcppParallel_5.1.5 DBI_1.1.3
## [121] rappdirs_0.3.3 MASS_7.3-57
## [123] boot_1.3-28 LiblineaR_2.10-12
## [125] data.tree_1.0.0 Matrix_1.4-1
## [127] cli_3.3.0 parallel_4.1.1
## [129] igraph_1.3.2 pkgconfig_2.0.3
## [131] scuttle_1.4.0 plotly_4.10.0
## [133] spatstat.sparse_2.1-1 foreach_1.5.2
## [135] vipor_0.4.5 bslib_0.3.1
## [137] rngtools_1.5.2 XVector_0.34.0
## [139] doRNG_1.8.2 stringr_1.4.0
## [141] digest_0.6.29 sctransform_0.3.3
## [143] RcppAnnoy_0.0.19 spatstat.data_2.2-0
## [145] rmarkdown_2.11 cellranger_1.1.0
## [147] leiden_0.4.2 gld_2.6.4
## [149] dendextend_1.15.2 uwot_0.1.11
## [151] DelayedMatrixStats_1.16.0 shiny_1.7.1
## [153] lifecycle_1.0.1 nlme_3.1-157
## [155] aricode_1.0.0 jsonlite_1.8.0
## [157] BiocNeighbors_1.12.0 askpass_1.1
## [159] viridisLite_0.4.0 fansi_1.0.3
## [161] pillar_1.7.0 lattice_0.20-45
## [163] fastmap_1.1.0 httr_1.4.3
## [165] survival_3.3-1 glue_1.6.2
## [167] png_0.1-7 iterators_1.0.14
## [169] bit_4.0.4 class_7.3-20
## [171] stringi_1.7.6 sass_0.4.1
## [173] BiocSingular_1.10.0 doSNOW_1.0.20
## [175] irlba_2.3.5 e1071_1.7-11
## [177] future.apply_1.9.0