## Import packages
library("dplyr") # data wrangling
library("ggplot2") # plotting
library("Seurat") # scRNA-seq analysis
library("ComplexHeatmap") # plot heatmaps
library("aricode") # ARI index clt comp
library("reticulate") # to work w/ python packages in R
scanorama <- import('scanorama') # import the Scanorama python package
source("scripts/helper_functions.R") # import functions adapted to automatize integration with scanorama
(7 min)
Run the R
chunk code below to import the table
data/GEO_GSE173303_project.tsv
which contains the
information to download the data sets that we’ll be using herein.
## Download datasets
# import table with datasets to download
data2download <- read.table("data/GEO_GSE173303_project.tsv",
sep="\t", header=TRUE)
Check below the content of the table imported.
# print table
knitr::kable(data2download)
filename | description | ftp_url |
---|---|---|
GSE173303_barcodes.tsv.gz | table of barcodes | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fbarcodes%2Etsv%2Egz |
GSE173303_features.tsv.gz | table of genes | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Ffeatures%2Etsv%2Egz |
GSE173303_matrix.mtx.gz | gene expression data | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fmatrix%2Emtx%2Egz |
GSE173303_subT_data_metadata.tsv.gz | cell metadata for sub T cells | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5FsubT%5Fdata%5Fmetadata%2Etsv%2Egz |
GSE173303_whole_data_metadata.tsv.gz | cell metadata for all | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fwhole%5Fdata%5Fmetadata%2Etsv%2Egz |
Now use the information from the table above to download all the data from the GEO project GSE173303 (published by Kim et al., 2022). In this study the authors used peripheral blood and/or synovial fluid samples from 20 patients with arthritis-autoimmune related adverse events.
The GEO GSE173303 repository comprises two types of files:
10X Genomics Cell Ranger (droplet based scRNA-seq technology - read more):
GSE173303_barcodes.tsv.gz
(cell barcodes)
GSE173303_features.tsv.gz
(genes)
GSE173303_matrix.mtx.gz
(gene expression
values)
Metadata (data describing cell attributes):
GSE173303_whole_data_metadata.tsv.gz
(for
all)
GSE173303_subT_data_metadata.tsv.gz
(for T
cells)
The 10X data can be directly imported into Seurat
, a
R
package developed for the analysis of scRNA-seq data (page; Hao
et al., 2021). While the metadata will be imported separately as
tab-separated tables.
Since the function used to import 10X data into R
and
Seurat
, Read10X, looks
into a given directory and searches for the following file names,
barcodes.tsv
, genes.tsv
, and
matrix.mtx
(it allows compressed .gz
files),
it is more convenient to remove the prefix file name ‘GSE173303_’ from
every file.
## Download & import data
# create directory to save files
down_dir <- "data/GSE173303" # directory to save datasets
if (!dir.exists(down_dir)) dir.create(down_dir, recursive=TRUE) # create folder if doesn't exist
# loop over the table rows & download each dataset
for (f in 1:nrow(data2download)) {
down_filename <- data2download[f,"filename"]
down_filepath <- paste(down_dir, gsub("GSE173303_", "", down_filename),
sep="/")# download file to: 'data/GSE173303'
down_url <- data2download[f,"ftp_url"]
cat("Downloading file", down_filename, "to", down_filepath, "\n")
download.file(url=down_url, destfile=down_filepath)
}
## Downloading file GSE173303_barcodes.tsv.gz to data/GSE173303/barcodes.tsv.gz
## Downloading file GSE173303_features.tsv.gz to data/GSE173303/features.tsv.gz
## Downloading file GSE173303_matrix.mtx.gz to data/GSE173303/matrix.mtx.gz
## Downloading file GSE173303_subT_data_metadata.tsv.gz to data/GSE173303/subT_data_metadata.tsv.gz
## Downloading file GSE173303_whole_data_metadata.tsv.gz to data/GSE173303/whole_data_metadata.tsv.gz
# import gene expression 10X data into Seurat
data10x <- Read10X(data.dir=down_dir) # import 10x data as sparse matrix
seu <- CreateSeuratObject(counts=data10x, project="GSE173303") # convert gene exp. sparse matrix into Seurat class object
# import metadata
metadata <- list.files(down_dir, pattern="metadata", full.names=TRUE) # get path for metadata based on 'pattern' argument
names(metadata) <- gsub("_data_metadata.tsv.gz", "", basename(metadata)) # name the paths
metadata <- lapply(setNames(metadata, names(metadata)), function(x) {
read.table(x, header=TRUE, sep="\t", stringsAsFactors=FALSE)
}) # import tables into a list
(5 min)
The data was imported into R
as a sparse matrix (RAM
efficient - most of the entries are zero) with the Seurat
R
function data10x <- Read10X()
. Then, the
sparse matrix, where genes are rows and columns are cells,
data10x
was converted in a Seurat
class object
at
seu <- CreateSeuratObject(counts=data10x, project="GSE173303")
.
The object seu
will be used downstream to analyze the
scRNA-seq data.
Look into the structure of the Seurat class object by running the code below.
# Seurat object structure
str(seu)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:142316713] 31 45 55 84 93 108 115 130 240 267 ...
## .. .. .. .. .. ..@ p : int [1:89718] 0 1024 2229 3957 5289 7330 8786 11293 14382 16908 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 21831 89717
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:21831] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
## .. .. .. .. .. .. ..$ : chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
## .. .. .. .. .. ..@ x : num [1:142316713] 1 1 1 5 1 1 2 1 1 16 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:142316713] 31 45 55 84 93 108 115 130 240 267 ...
## .. .. .. .. .. ..@ p : int [1:89718] 0 1024 2229 3957 5289 7330 8786 11293 14382 16908 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 21831 89717
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:21831] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
## .. .. .. .. .. .. ..$ : chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
## .. .. .. .. .. ..@ x : num [1:142316713] 1 1 1 5 1 1 2 1 1 16 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ scale.data : num[0 , 0 ]
## .. .. .. ..@ key : chr "rna_"
## .. .. .. ..@ assay.orig : NULL
## .. .. .. ..@ var.features : logi(0)
## .. .. .. ..@ meta.features:'data.frame': 21831 obs. of 0 variables
## .. .. .. ..@ misc : list()
## ..@ meta.data :'data.frame': 89717 obs. of 3 variables:
## .. ..$ orig.ident : Factor w/ 16 levels "45","45B","62",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ nCount_RNA : num [1:89717] 2669 3727 4039 3123 6949 ...
## .. ..$ nFeature_RNA: int [1:89717] 1024 1205 1728 1332 2041 1456 2507 3089 2526 1623 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 16 levels "45","45B","62",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "GSE173303"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 4 1 0
## ..@ commands : list()
## ..@ tools : list()
An object of Seurat
class has several layers of
information. You don’t need to know all. Let’s focus on
@assays
and @meta.data
. The former stores the
gene expression data. Assays
are further divided into
slots
of data. Particularly important are the
slots
: @counts
(unnormalized/raw counts),
@data
(normalized data, e.g., log-noramlized with 10K
scaling factor), @scale.data
(scaled data, e.g.,
standardized by Z-score). The latter, i.e., @meta.data
stores the cell metadata (number of genes expressed in the cell, cluster
identity, etc).
The object seu
imported above will be called almost in
every function during the analysis and most of the results will be saved
directly into it. Seurat
provides functions to access the
data stored in its own class.
Read more about the structure of a Seurat
object in the
Seurat wiki.
Question: How many genes and cells compose the GEO GSE173303 dataset? (1 min)
Tip: type the object
seu
in the R console and interpret the message printed or provide the objectseu
to the functionsnrow()
andncol()
which count the no. of rows (=genes) and columns (=cells).
The GEO GSE173303 dataset comprises 21831 genes and 89717 cells.
Print below the gene expression of the first 10 genes across the
first 10 cells stored in the main assay, called RNA
by
default, in the slot @counts
.
## Inspect data
seu@assays$RNA@counts[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## AL627309.1 . . . . . . . . . .
## AL669831.5 . . . . . . . . . .
## FAM87B . . . . . . . . . .
## LINC00115 . . . . . . . . . .
## FAM41C . . . . . . . . . .
## NOC2L . . . . . . . . . .
## KLHL17 . . . . . . . . . .
## PLEKHN1 . . . . . . . . . .
## HES4 . . . . . . . . . .
## ISG15 . . . . . . 1 2 2 3
The gene expression counts stored in the slot @counts
are in sparse format. Points represent zero.
Question: Which is the expression value for the 10th gene in the 10th cell? (1 min)
It is 3.
(5 min)
Let’s inspect below what the @meta.data
slot contains
(run the code).
# Print the first 6 rows of '@meta.data'
head(seu@meta.data)
## orig.ident nCount_RNA nFeature_RNA
## 45_AAACCTGAGAATGTGT 45 2669 1024
## 45_AAACCTGAGACTAGAT 45 3727 1205
## 45_AAACCTGAGACTGGGT 45 4039 1728
## 45_AAACCTGAGAGAGCTC 45 3123 1332
## 45_AAACCTGAGATATGGT 45 6949 2041
## 45_AAACCTGAGATCTGCT 45 4210 1456
Question: What do you think each column represents? (3 min)
Tip: discuss this in group.
The slot @meta.data
contains 3 columns/variables:
orig.ident
, nCount_RNA
,
nFeature_RNA
. These variables were created automatically
from the data without us providing any information.
orig.ident
refers to the identity of the cell. This matches
the prefix of the cell barcode represented as row names of the
orig.ident
. This is because the function
CreateSeuratObject()
contains two parameters:
names.delim = "_"
and names.field = 1
. These
parameters will split the cell barcode names by names.delim
and pick the first element, i.e., 45 for the first 6 cells. You will
understand the meaning of these values below. nCount_RNA
is
just the total number of UMIs in a cell (obtained by summing all the
expression values in a cell/column). nFeature_RNA
is the
number of different genes expressed in a cell (obtained by summing the
number of genes expressing a value different than zero).
Now, let’s check the metadata provided in the GEO repository that was
imported above. The object metadata
is a list with two
tables (or data frames) stored in it named: subT, whole. To access each
one of them, you can call the object with the accessor $
sign like this: metadata$subT
(to access subT
)
or metadata$whole
(to access whole
). You can
use the functions dim()
(to see the dimensions, i.e., no.
of rows x cols), summary()
(to print a summary for every
column) and head()
(to print the first 6 rows) to inspect
the tables. Do not attempt to print the whole tables, because the
thousand of rows they comprise.
For now let’s ignore the metadata$subT
data frame, since
it represents a subset of the whole
data.
# Print the first 6 lines of the 'whole' metadata
head(metadata$whole)
## orig.ident nCount_RNA nFeature_RNA percent.mito sbt_is_doublet
## 1 45 2669 1024 0.11689771 FALSE
## 2 45 3727 1205 0.07593239 FALSE
## 3 45 4039 1728 0.08294132 FALSE
## 4 45 3123 1332 0.05315402 FALSE
## 5 45 6949 2041 0.07281623 FALSE
## 6 45 4210 1456 0.08622328 FALSE
## sbt_doublet_score manually_is_doublet batch control GROUP RNA_snn_res.0.5
## 1 0.05778302 FALSE 2 SF Mono 2
## 2 0.04066337 FALSE 2 SF Mono 0
## 3 0.05618845 FALSE 2 SF Mono 3
## 4 0.06669545 FALSE 2 SF Mono 0
## 5 0.08527529 FALSE 2 SF Mono 2
## 6 0.05618845 FALSE 2 SF Mono 2
## seurat_clusters newClusterID cell.types barcode
## 1 2 8 CX3CR1hi Effector CD8 45_AAACCTGAGAATGTGT
## 2 0 3 Effector CD8 45_AAACCTGAGACTAGAT
## 3 3 5 Activated CD4 45_AAACCTGAGACTGGGT
## 4 0 3 Effector CD8 45_AAACCTGAGAGAGCTC
## 5 2 8 CX3CR1hi Effector CD8 45_AAACCTGAGATATGGT
## 6 2 8 CX3CR1hi Effector CD8 45_AAACCTGAGATCTGCT
# Look into the summary
summary(metadata$whole)
## orig.ident nCount_RNA nFeature_RNA percent.mito
## Length:89717 Min. : 353 Min. : 202 Min. :0.00000
## Class :character 1st Qu.: 2373 1st Qu.: 986 1st Qu.:0.04151
## Mode :character Median : 4260 Median :1537 Median :0.05605
## Mean : 4974 Mean :1586 Mean :0.06067
## 3rd Qu.: 6067 3rd Qu.:1988 3rd Qu.:0.07425
## Max. :87816 Max. :7709 Max. :0.14999
## sbt_is_doublet sbt_doublet_score manually_is_doublet batch
## Mode :logical Min. :0.001366 Mode :logical Min. :0.000
## FALSE:89717 1st Qu.:0.030928 FALSE:89717 1st Qu.:1.000
## Median :0.053980 Median :2.000
## Mean :0.069264 Mean :1.786
## 3rd Qu.:0.089419 3rd Qu.:3.000
## Max. :0.626984 Max. :3.000
## control GROUP RNA_snn_res.0.5 seurat_clusters
## Length:89717 Length:89717 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 2.00 1st Qu.: 2.00
## Mode :character Mode :character Median : 4.00 Median : 4.00
## Mean : 6.43 Mean : 6.43
## 3rd Qu.: 9.00 3rd Qu.: 9.00
## Max. :30.00 Max. :30.00
## newClusterID cell.types barcode
## Min. : 1.00 Length:89717 Length:89717
## 1st Qu.: 5.00 Class :character Class :character
## Median : 8.00 Mode :character Mode :character
## Mean :10.62
## 3rd Qu.:14.00
## Max. :31.00
The first three columns of metadata$whole
seem to match
the first three columns from seu@meta.data
(this can be
computational tested by doing:
all_equal(seu@meta.data, metadata$whole[,c("orig.ident", "nCount_RNA", "nFeature_RNA")])
).
If you run the previous code you’ll realize that the message printed
suggests they are different. This is because the data types/classes of
the same columns are different across tables, but the values are the
same. How can we be sure? By comparing the cell barcodes in the
two tables:
all(row.names(seu@meta.data)==metadata$whole$barcodes)
. It
returns TRUE
meaning the cell barcodes between tables are
exactly the same. This is useful, because we know now that we can
combine the metadata$whole
to seu@meta.data
in
order to access this data from our Seurat
object. This will
be convenient to plot and query the data later.
To understand the meaning of every column in
metadata$whole
you would need to read carefully the
information available in the GEO GSE173303
repository from where we downloaded the data from. For reasons of time,
this was done for you and summarized below.
orig.ident
: original identity of every cell
regarding the 8 matched donor sample from arthritis-irAE subjects (8
synovial fluid samples vs 8 PBMCs, n=16) that they came from: A7, A40,
45, A56, 62, 63, 65, 76 (synovial), A7B, A40B, 45B, A56B, 62B, 63B, 65B,
76B (PBMCs).
nCount_RNA
: total no. of UMIs expressed by
cell.
nFeature_RNA
: no. of different genes expressed by
cell.
percent.mito
: percentage of mitochondrial genes
expressed by cell.
sbt_is_doublet
: logical value checking if a cell is
a doublet
(obtained with the software scrublet
). ‘FALSE’ for all the
cells, meaning that the authors removed doublets before submitting the
data to GEO.
sbt_doublet_score
: doublet score obtained with
scrublet
.
manually_is_doublet
: similar to
sbt_is_doublet
.
batch
: batch factors/categories. It assumes the
values: 0, 1, 2, 3.
control
: sample type. One of two: Blood
(Blood/PBMC samples, n=59255) or SF
(Synovial Fluid,
n=30462).
Group
: the identity of the treatment group. One of
two: Combo
(mixture of ICI therapy medicines, n=37465) or
Mono
(monotherapy, n=52252)
RNA_snn_res.0.5
: Seurat clusters obtained by using a
resolution value of 0.5.
seurat_clusters
: Seurat clusters. The same as
RNA_snn_res.0.5
.
newClusterID
: new cluster ID.
seurat_clusters
renamed.
cell.types
: cell-type annotations.
barcode
: cell barcodes with donor sample id
prefix.
Now that we know what every column means let’s combine this table
with our Seurat
object metadata:
seu@meta.data
. Before doing so, we’ll rename the columns
from metadata$whole
to avoid conflicts in the column names
at seu@meta.data
(because some of the column names are the
same between tables).
## Combine metadata
# Rename col names from 'metadata$whole' by adding the suffix '.orig' to the col names
tmp <- metadata$whole # temporary modified table - a copy from 'metadata$whole'
colnames(tmp) <- paste0(colnames(tmp),".orig")
# Combine metadata tables
stopifnot(all(row.names(seu@meta.data)==tmp$barcode.orig)) # ensure that cell barcodes match
seu@meta.data <- cbind(seu@meta.data, tmp)
# Look how the updated Seurat meta.data looks like
head(seu@meta.data)
## orig.ident nCount_RNA nFeature_RNA orig.ident.orig
## 45_AAACCTGAGAATGTGT 45 2669 1024 45
## 45_AAACCTGAGACTAGAT 45 3727 1205 45
## 45_AAACCTGAGACTGGGT 45 4039 1728 45
## 45_AAACCTGAGAGAGCTC 45 3123 1332 45
## 45_AAACCTGAGATATGGT 45 6949 2041 45
## 45_AAACCTGAGATCTGCT 45 4210 1456 45
## nCount_RNA.orig nFeature_RNA.orig percent.mito.orig
## 45_AAACCTGAGAATGTGT 2669 1024 0.11689771
## 45_AAACCTGAGACTAGAT 3727 1205 0.07593239
## 45_AAACCTGAGACTGGGT 4039 1728 0.08294132
## 45_AAACCTGAGAGAGCTC 3123 1332 0.05315402
## 45_AAACCTGAGATATGGT 6949 2041 0.07281623
## 45_AAACCTGAGATCTGCT 4210 1456 0.08622328
## sbt_is_doublet.orig sbt_doublet_score.orig
## 45_AAACCTGAGAATGTGT FALSE 0.05778302
## 45_AAACCTGAGACTAGAT FALSE 0.04066337
## 45_AAACCTGAGACTGGGT FALSE 0.05618845
## 45_AAACCTGAGAGAGCTC FALSE 0.06669545
## 45_AAACCTGAGATATGGT FALSE 0.08527529
## 45_AAACCTGAGATCTGCT FALSE 0.05618845
## manually_is_doublet.orig batch.orig control.orig GROUP.orig
## 45_AAACCTGAGAATGTGT FALSE 2 SF Mono
## 45_AAACCTGAGACTAGAT FALSE 2 SF Mono
## 45_AAACCTGAGACTGGGT FALSE 2 SF Mono
## 45_AAACCTGAGAGAGCTC FALSE 2 SF Mono
## 45_AAACCTGAGATATGGT FALSE 2 SF Mono
## 45_AAACCTGAGATCTGCT FALSE 2 SF Mono
## RNA_snn_res.0.5.orig seurat_clusters.orig newClusterID.orig
## 45_AAACCTGAGAATGTGT 2 2 8
## 45_AAACCTGAGACTAGAT 0 0 3
## 45_AAACCTGAGACTGGGT 3 3 5
## 45_AAACCTGAGAGAGCTC 0 0 3
## 45_AAACCTGAGATATGGT 2 2 8
## 45_AAACCTGAGATCTGCT 2 2 8
## cell.types.orig barcode.orig
## 45_AAACCTGAGAATGTGT CX3CR1hi Effector CD8 45_AAACCTGAGAATGTGT
## 45_AAACCTGAGACTAGAT Effector CD8 45_AAACCTGAGACTAGAT
## 45_AAACCTGAGACTGGGT Activated CD4 45_AAACCTGAGACTGGGT
## 45_AAACCTGAGAGAGCTC Effector CD8 45_AAACCTGAGAGAGCTC
## 45_AAACCTGAGATATGGT CX3CR1hi Effector CD8 45_AAACCTGAGATATGGT
## 45_AAACCTGAGATCTGCT CX3CR1hi Effector CD8 45_AAACCTGAGATCTGCT
Now we’re ready to start our analysis with Seurat
!
(5 min)
The first step in every analysis is: Quality-Control (QC). In this case, the QC that we’ll perform is relatively small because the data was already processed by the 10X Genomics Cell Ranger. In addition, the authors filter out bad-quality cells such as doublets as well cells comprising a high percentage of mitochondrial genes (see a full description in the GEO GSE173303 repository or in the publication doi:10.1038/s41467-022-29539-3).
The percentage of mitochondrial genes was already estimated by the
authors, but let’s calculate it below to see if the values match. This
is done with the function PercentageFeatureSet()
. This
function looks into the genes with the pattern given at
pattern
to determine the percentage of these features. This
is done using regular
expressions and the pattern
given depends on the set of
features that you’re looking into and the annotation used. In our case
we’ve human data annotated against the reference GRCh38. All the
mitochondrial genes have the prefix ‘MT-’. Thus we will use this to
provide this pattern to pattern = "^MT-"
.
Question: Which and how many mitochondrial genes are present in this data set? (1 min)
Tip: use the following code
grep("^MT-", row.names(seu), value=TRUE)
to retrieve the mitochondrial genes together with the functionlength()
to get the number of mt genes.
The mitochondrial genes present are: MT-ND1, MT-ND2, MT-CO1, MT-CO2, MT-ATP8, MT-ATP6, MT-CO3, MT-ND3, MT-ND4L, MT-ND4, MT-ND5, MT-ND6, MT-CYB.
The total no. of mt genes is: 13.
## Calculate percentage of mitochondrial genes by cell
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-") # add the 'percent.mt' to the 'meta.data' slot
# if you work with mouse data usually the prefix for mitochondrial genes is '^mt-' instead
The previous percent.mito.orig
is different than the
percentage of mitochondrial genes that you had just calculated
percent.mt
. It seems that the former is in relative
abundance, with a scale 0-1, whereas your calculation is in percentage
(0-100%).
Now let’s plot the three cell properties that Seurat
calculated: nFeature_RNA
, nCount_RNA
,
percent.mt
. By default Seurat uses the
@meta.data
variable orig.ident
to set the
identity of every cell. This means that the cells will be highlighted by
orig.ident
, i.e., donor sample origin. This can be changed,
but for now it is good to look into the QC highlighted by the donor
sample origin. In addition, we’ll take opportunity of the functionality
of visualization in Seurat
to highlight the data by
control.orig
(the metadata column with the categorical
variable of the sample type: Blood
or SF
):
SF
, in light blue, and Blood
, in light red,
violins.
## QC: violin plots
# Create folder to save the results
res_dirs <- paste("results/GSE173303", c("plots", "tables"), sep="/")
for (d in res_dirs) if (!dir.exists(d)) dir.create(d, recursive=TRUE)
# Plot
qc_vln <- VlnPlot(seu, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3, split.by="control.orig")
# Seurat removes legend if there is more than 1 plot: https://github.com/satijalab/seurat/blob/a1294c4d363780548dbf9cc4a4abb3a6078a6d64/R/visualization.R#L5729
# Save
pdf(paste(res_dirs[1], "qc_violin_plots.pdf", sep="/"), width=12, height=4)
print(qc_vln)
dev.off()
## png
## 2
# Print plot
print(qc_vln)
What do you think?
Let’s get some more concrete numbers by calculating the
median()
and mean()
of every metric across the
donor samples.
## Mean & median of cell features
# across donor samples
seu@meta.data[,c("orig.ident", "control.orig", "nFeature_RNA", "nCount_RNA", "percent.mt")] %>%
group_by(orig.ident, control.orig) %>%
summarise_if(is.numeric, list("Mean"=mean, "Median"=median))
## # A tibble: 16 × 8
## # Groups: orig.ident [16]
## orig.ident control.orig nFeature_RNA_Mean nCount_RNA_Mean percent.mt_Mean
## <fct> <chr> <dbl> <dbl> <dbl>
## 1 45 SF 1595. 4792. 6.83
## 2 45B Blood 781. 2201. 9.10
## 3 62 SF 1564. 4741. 5.70
## 4 62B Blood 999. 2849. 6.47
## 5 63 SF 2024. 6580. 4.54
## 6 63B Blood 1699. 5295. 6.25
## 7 65 SF 2519. 9181. 5.16
## 8 65B Blood 1425. 4574. 6.14
## 9 76 SF 1496. 4129. 5.82
## 10 76B Blood 1582. 4944. 5.62
## 11 A40 SF 1646. 4886. 6.19
## 12 A40B Blood 1341. 4182. 6.52
## 13 A56 SF 1383. 4062. 7.57
## 14 A56B Blood 1435. 4054. 7.16
## 15 A7 SF 1956. 7134. 5.99
## 16 A7B Blood 1872. 5542. 4.14
## # … with 3 more variables: nFeature_RNA_Median <dbl>, nCount_RNA_Median <dbl>,
## # percent.mt_Median <dbl>
# across control.orig groups
seu@meta.data[,c("orig.ident", "control.orig", "nFeature_RNA", "nCount_RNA", "percent.mt")] %>%
group_by(control.orig) %>%
summarise_if(is.numeric, list("Mean"=mean, "Median"=median))
## # A tibble: 2 × 7
## control.orig nFeature_RNA_Me… nCount_RNA_Mean percent.mt_Mean nFeature_RNA_Me…
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Blood 1420. 4294. 6.22 1447
## 2 SF 1910. 6297. 5.76 1776
## # … with 2 more variables: nCount_RNA_Median <dbl>, percent.mt_Median <dbl>
There are differences between donor samples as well as across
Blood
and SF
groups, but the values are on the
same range and comparable.
Plot the relationship between nCount_RNA
and
percent.mt
or nFeature_RNA
below.
## QC: scatter plots
# Plot
qc_scatter1 <- FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "percent.mt")
qc_scatter2 <- FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
# Save
pdf(paste(res_dirs[1], "qc_scatter_plots.pdf", sep="/"), width=8, height=4)
qc_scatter1 + qc_scatter2
dev.off()
## png
## 2
# Print
print((qc_scatter1 + qc_scatter2))
The scatter plot of nCount_RNA x percent.mt
shows no
relationship/dependence as expected, whereas the plot
nCount_RNA x nFeature_RNA
shows a positive relationship
(highly correlated - 0.92), i.e., cells sequenced more deeply are
expressing more distinct genes.
(1 min)
Filtering bad quality cells based on their properties, i.e.,
nCount_RNA
, nFeature_RNA
or/and
percent.mt
, and low abundant genes will not be performed
because the authors removed this already. Although the code below is an
example that you could use to filter the bad quality cells based on
their properties.
## Filter bad-quality cells
#seu <- subset(seu, subset = nFeature_RNA > 1000 & nFeature_RNA < 4000 & percent.mt < 10) # This would select cells expressing more than 1K distinct genes and lower than 4K as well as cells expressing less than 10% mitochondrial genes.
Low abundant genes need to be filtered when creating the
Seurat
object
(CreateSeuratObject(..., min.cells=<genes exp. at least in this no. of cells>)
).
(3 min)
Since we’ve some computing limitations, we’ll remove six samples:
65
, 65B
(all from batch 0
),
A7
, A7B
(from batch 1
),
45
, 45B
(batch 2
).
## Select samples
samps <- unique(as.character(seu@meta.data$orig.ident))
samps2sel <- samps[!(samps %in% c("45", "45B", "65", "65B", "A7", "A7B"))]
seu <- subset(seu, subset = orig.ident %in% samps2sel)
seu@meta.data$orig.ident <- factor(as.character(seu@meta.data$orig.ident), levels=samps2sel)
(5 min)
First, the data is normalized to reduce the impact of distinct sequencing depth across the cells to make them comparable. There are several normalization methods (e.g. SCTransform).
Here, we’ll focus on the most common used method: log-normalization
(NormalizeData(..., normalization.method="LogNormalize")
).
This method consists in dividing the expression values by the library
size (=total no. UMIs or nCount_RNA
) for each respective
cell. In other words getting the relative abundance of gene expression
by cell. Then, the values obtained are multiplied by a factor, usually
10,000 (scale.factor=10000
). Finally this result is log1p
transformed.
## Log-normalization
seu <- NormalizeData(seu, normalization.method="LogNormalize", scale.factor=10000)
# Inspect the result: first 10 genes x 10 cells
seu@assays$RNA@data[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## AL627309.1 . . . . . . . . . .
## AL669831.5 . . . . . 0.4009989 . . . .
## FAM87B . . . . . . . . . .
## LINC00115 . . . . . . . . . .
## FAM41C . . . . . . . . . .
## NOC2L . 0.9993235 0.6851444 . . . 1.192614 . . .
## KLHL17 . . . . . . . . . .
## PLEKHN1 . . . . . . . . . .
## HES4 . . . . . . . . . .
## ISG15 . . 1.7783857 . 2.012912 . . . 1.296671 .
(5 min)
Highly variable genes or features (HVG or HVF) are a set of genes/features with a high variance across the cells. In other terms, genes that change their expression level a lot across cells. Why are we interested on these? Take the example below.
We know that roughly some subsets of T cells can be classified in CD4+ or CD8+. Let’s plot these two features one against other.
## CD4+ vs CD8+
FeatureScatter(seu, feature1="CD8A", feature2="CD4", group.by="cell.types.orig")
What do you think? Are these two genes important or not?
Now look into two genes with very low variance: PPP1R36 and CD177.
## PPP1R36 vs CD177
FeatureScatter(seu, feature1="PPP1R36", feature2="CD177", group.by="cell.types.orig")
What do you think about these two genes? Are they informative to distinguish the different cell types observed?
Now that you understand why we’re using HVG, let’s determine the top 2000 (HVG)! This value can be higher or lower. The plot below will help you to decide, but the top 2K HVG is quite often choosen.
## Determine HVG
seu <- FindVariableFeatures(seu, selection.method="vst", nfeatures=2000)
# Get the top 10 HVG
top10_hvg <- head(VariableFeatures(seu), 10)
# Plot HVG
hvg_plot1 <- VariableFeaturePlot(seu)
hvg_plot2 <- LabelPoints(plot=hvg_plot1, points=top10_hvg, repel=TRUE)
# Save
pdf(paste(res_dirs[1], "hvg_plots.pdf", sep="/"), width=12, height=6)
print((hvg_plot1 + hvg_plot2))
dev.off()
## png
## 2
# Print
print((hvg_plot1 + hvg_plot2))
Question: Do you recognize any of the top 20 HVG? (1 min)
Tip: use the function
head()
withVariableFeatures()
to retrieve the top 20 HVG (the top 10 also appeared highlighted in the plot above).
The top 20 HVG are: JCHAIN, CXCL10, IGKC, IGHA1, CCL2, SPP1, IGLV2-23, CST3, CCL8, CXCL13, IGLV2-8, IGKV1D-16, S100B, HSPA6, HSPA1A, PPBP, FSCN1, C1QB, HLA-DQA1, IGKV1-27.
Cytokines such as CCL2
or CCL8
.
(3 min)
Now that we’ve our set of highly informative genes, we’ll perform scaling. Scaling consists in standardizing the genes/features to make them comparable and avoid the highly abundant ones of standing out which is extremely important for downstream analysis such as Principal Component Analysis. There are several methods. The one that we’ll use is the Z-score which consists in subtracting to a gene its mean and divide it by its standard deviation (SD) in order to obtain mean 0 and SD of 1.
Let’s scaled our scRNA-seq dataset.
## Scaling
seu <- ScaleData(seu)
## Print the first 10 rows x 10 cols
seu@assays$RNA@scale.data[1:10,1:10]
## 62_AAACCTGAGCGTGAGT 62_AAACCTGAGTATGACA 62_AAACCTGGTAATAGCA
## HES4 -0.13065564 -0.13065564 -0.13065564
## ISG15 -0.67572425 -0.67572425 1.36133826
## TNFRSF18 -0.25452320 -0.25452320 -0.25452320
## TNFRSF4 2.28553326 -0.24922695 4.57493715
## ACOT7 -0.23131895 -0.23131895 -0.23131895
## TNFRSF9 -0.13428611 -0.13428611 -0.13428611
## ENO1 -1.19726288 1.20329131 0.16332666
## GPR157 -0.08960485 -0.08960485 -0.08960485
## RBP7 -0.18559818 -0.18559818 -0.18559818
## PGD -0.45555101 1.22870799 -0.45555101
## 62_AAACGGGAGATGGGTC 62_AAACGGGTCAGGATCT 62_AAAGATGAGAGTAATC
## HES4 -0.13065564 -0.13065564 -0.1306556
## ISG15 -0.67572425 1.62997819 -0.6757242
## TNFRSF18 -0.25452320 -0.25452320 -0.2545232
## TNFRSF4 -0.24922695 -0.24922695 -0.2492270
## ACOT7 -0.23131895 -0.23131895 -0.2313189
## TNFRSF9 -0.13428611 -0.13428611 4.8553534
## ENO1 1.62735289 1.04680414 1.2507066
## GPR157 -0.08960485 -0.08960485 10.0000000
## RBP7 -0.18559818 -0.18559818 -0.1855982
## PGD -0.45555101 1.48452581 0.7013749
## 62_AAAGATGCAGGGAGAG 62_AAAGATGTCGGCCGAT 62_AAAGCAAAGACACTAA
## HES4 -0.13065564 -0.13065564 -0.13065564
## ISG15 -0.67572425 -0.67572425 0.80955508
## TNFRSF18 -0.25452320 -0.25452320 -0.25452320
## TNFRSF4 -0.24922695 -0.24922695 5.22879325
## ACOT7 -0.23131895 -0.23131895 2.70599623
## TNFRSF9 -0.13428611 -0.13428611 -0.13428611
## ENO1 0.84740685 0.81743760 0.62719929
## GPR157 -0.08960485 -0.08960485 -0.08960485
## RBP7 -0.18559818 -0.18559818 -0.18559818
## PGD -0.45555101 1.51390688 1.72985686
## 62_AAAGCAAAGGATGCGT
## HES4 -0.13065564
## ISG15 -0.67572425
## TNFRSF18 2.51614411
## TNFRSF4 4.64515580
## ACOT7 -0.23131895
## TNFRSF9 -0.13428611
## ENO1 0.40943849
## GPR157 -0.08960485
## RBP7 -0.18559818
## PGD -0.45555101
During scaling it is possible to regress out undesirable variables,
such as nCount_RNA
or percent.mt
. With this
regression you can get the residuals and remove the unwanted variation
caused by unwanted variables (such as the ones mentioned). Here we’ll
ignore this, but this could have been done above by providing the
option: vars.to.regress="percent.mt"
(to regress out
mitochondrial genes).
(5 min)
Principal Component Analysis (PCA) is a deterministic and linear dimensional reduction method that provides the basis for other analyses. The PCA aims to reduce the dimensionality of high-dimensional data, as it is the case of scRNA-seq, maintaining most of the variation present in the data across a few dozens of components.
Run it below (it will use the scaled data obtained above).
## PCA
seu <- RunPCA(seu)
Often the first few PCs (Principal Components) show the variation
caused by the total no. of UMIs or percentage of mitochondrial genes,
cell cycling genes (cell division), rather than cell type. Let’s plot
the PCA and highlight the cells by nCount_RNA
,
nFeature_RNA
, percent.mt
.
## PCA: cell features
pca_features <- FeaturePlot(seu, reduction = "pca", features=c("nCount_RNA", "nFeature_RNA", "percent.mt"))
# Save
pdf(paste(res_dirs[1], "pca_qc_feature_plots.pdf", sep="/"))
print(pca_features)
dev.off()
## png
## 2
# Plot
print(pca_features)
By default we computed the first 50 PCs. Again, not all of these PCs are informative. Some of them comprise very low variation.
In order to decide the top most important PCs, i.e., the ones that hold more data variance, we’ll plot below an elbow plot. As the name suggests the aim is to find the elbow in the plot, i.e., the point where there is a drastic reduction of variance, and select these first PCs with more variance, and, thus more informative which we’ll use downstream.
## Elbow plot
elbow_plot <- ElbowPlot(seu, ndims=50)
# Save
pdf(paste(res_dirs[1], "elbow_plot.pdf", sep="/"))
print(elbow_plot)
dev.off()
## png
## 2
# Print
print(elbow_plot)
Question: How many principal components (PCs) do you think we should select for downstream analyses (e.g., clustering, UMAP)? (1 min)
Let’s use 30. Although other values would be equally valid such as 10, 20 or even 40.
(15 min)
PCA is the basis for other methods. Although it reduces the high-dimensional data to low-dimensional space, usually is not a good method to visualize the cell populations or clusters. For this purpose there are better methods such as non-linear dimensional reduction methods like tSNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection). UMAP is a much faster method compared to tSNE and it claims to preserve the local and most of the global structure, whereas tSNE only preserves the local structure.
Run only UMAP below (you can attempt to run tSNE but it takes more time).
## tSNE & UMAP
# tSNE
#seu <- RunTSNE(seu, dims=1:30, perplexity=30)
# UMAP
seu <- RunUMAP(seu, dims=1:30)
Let’s explore potential batch effects such as the donor
(=orig.ident
), batch (=batch.orig
), control
(=control.orig
) and group (=GROUP.orig
)
categorical variables.
## Plot variables (potential batches as well as others)
vars2plot <- c("cell.types.orig", "newClusterID.orig",
"orig.ident", "control.orig", "GROUP.orig", "batch.orig")
dr_plots[["tsne"]] <- dr_plots[["umap"]] <- dr_plots <- list()
for (v in vars2plot) {
for (mth in c("tsne", "umap")) {
if (mth %in% names(seu@reductions)) {
dr_plots[[mth]][[v]] <- DimPlot(seu, reduction=mth, group.by=v, label=TRUE) +
ggtitle(v)
}
}
}
# Save tSNE & UMAP
# pdf(paste(res_dirs[1], "unintegrated_vars_tsne_plots.pdf", sep="/"), width=24, height=16)
# cowplot::plot_grid(plotlist=dr_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "unintegrated_vars_umap_plots.pdf", sep="/"), width=24, height=16)
cowplot::plot_grid(plotlist=dr_plots[["umap"]], ncol=3)
dev.off()
## png
## 2
# Print
# cowplot::plot_grid(plotlist=dr_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_plots[["umap"]], ncol=3)
TASK 1: Discuss within your group if there is any batch in the data influencing the global cell structure. Some tips: look into the dimensional reduction plots and compare the projections between the technical variables and the
cell.types.orig
.
In the next sections we’ll cluster the data without performing any integration. Then, we’ll integrate the data using the same variable that the authors used, but with two different integration methods. Afterwards, we’ll compare the three results with the ground-truth cell-types provided by authors to assess which was the best method.
(5 min)
Clustering in Seurat
is done in two steps:
building a clustering based graph (SNN):
FindNeighbors()
find communities/populations (Louvain):
FindClusters()
The second step depends on the resolution
value given.
Here, we’ll use a resolution value of 0.5 which seems to had been what
the authors used based on the @meta.data
column name
RNA_snn_res.0.5.orig
(this is the column name that
Seurat
gives automatically and the last value corresponds
to the resolution value used to cluster).
## Clustering: unintegrated data
seu <- FindNeighbors(seu, dims=1:30)
seu <- FindClusters(seu, resolution=0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 46663
## Number of edges: 1659433
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9317
## Number of communities: 25
## Elapsed time: 10 seconds
Now let’s compare the new clustering result with the unintegrated data with the original clustering result below.
## Unintegrated clustering DR plots
vars2plot <- c("cell.types.orig", "newClusterID.orig", "seurat_clusters")
dr_clts_plots[["tsne"]] <- dr_clts_plots[["umap"]] <- dr_clts_plots <- list()
for (v in vars2plot) {
for (mth in c("tsne", "umap")) {
if (mth %in% names(seu@reductions)) {
dr_clts_plots[[mth]][[v]] <- DimPlot(seu, reduction=mth, group.by=v, label=TRUE) +
ggtitle(v)
}
}
}
# Save tSNE & UMAP
# pdf(paste(res_dirs[1], "unintegrated_clts_comp_tsne_plots.pdf", sep="/"), width=24, height=10)
# cowplot::plot_grid(plotlist=dr_clts_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "unintegrated_clts_comp_umap_plots.pdf", sep="/"), width=24, height=10)
cowplot::plot_grid(plotlist=dr_clts_plots[["umap"]], ncol=3)
dev.off()
## png
## 2
# Print
#cowplot::plot_grid(plotlist=dr_clts_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_clts_plots[["umap"]], ncol=3)
Does the new unintegrated clustering result looks similar to the
original one, i.e., newClusterID.orig
?
(3 min)
The data will be integrated by the batch variable present in the
@meta.data
table: batch.orig
.
batch.orig
comprises 4 categories/factor levels: 0, 1,
2, 3. Each batch comprises a different set of samples.
batch 0
: 65, 65B (removed above)
batch 1
: A7, A7B (removed), A13, A13B, A40,
A40B
batch 2
: 45, 45B (removed), A50, A50B, A56,
A56B
batch 3
: 62, 62B, 63, 63B, 76, 76B
The authors used the software Harmony
to integrate the
data and these batches appeared described in their github repository
with the code used for this analysis: Stnhy1/AR. The batches are
defined in the script addBatchInfo.R
use later on run-harmony.R.
(30 min)
Now let’s perform integration with Seurat
.
Seurat
provides different methods for
integration. Here, since we’re working with a
relatively large data set, we’ll use the fast integration
reciprocal PCA (RPCA) method. This page
discusses the difference between the RPCA method and the ‘traditional’
CCA based method. In sum, RPCA
is faster and useful when
the data sets being integrated are not expected to share the majority of
cell types. In this case, we have several batches, with samples coming
from blood and synovial fluid. Therefore, the RPCA
seems to be more suitable for our data set.
To perform integration using the RPCA
Seurat
method we start by normalizing
(NormalizeData()
), finding HVG shared across data sets (by
FindVariableFeatures()
and later
SelectIntegrationFeatures()
), scaling
(ScaleData()
) and performing PCA (RunPCA()
)
independently for each data set to be integrated. Then it comes the
integration part starting by finding anchors
(FindIntegrationAnchors()
) and integrating the data sets
(IntegrateData()
).
## Integration - Seurat fast RPCA
# Split Seurat object by batch samples to perform integration
seu_ls <- SplitObject(seu, split.by = "batch.orig")
# Log-normalize & find HVG across donor batch data sets
set.seed(1024)
seu_ls <- lapply(X = seu_ls, FUN = function(x) {
x <- NormalizeData(x)
x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
})
# Select robust & shared HVG across datasets
features <- SelectIntegrationFeatures(object.list = seu_ls)
seu_ls <- lapply(X = seu_ls, FUN = function(x) {
x <- ScaleData(x, features = features, verbose = FALSE)
x <- RunPCA(x, features = features, verbose = FALSE)
})
# Find integration anchors
anchors <- FindIntegrationAnchors(object.list=seu_ls,
anchor.features = features,
reduction = "rpca")
# Integrate data sets
int <- IntegrateData(anchorset=anchors)
# Change the default assay to 'integrated'
DefaultAssay(int) <- "integrated"
# Perform DR
int <- ScaleData(int)
int <- RunPCA(int, npcs=50)
#int <- RunTSNE(int, dims=1:30, perplexity=30)
int <- RunUMAP(int, reduction="pca", dims=1:30)
Inspect the DR plots below.
## Plot variables after integration (potential batches as well as others)
vars2plot <- c("cell.types.orig", "orig.ident", "newClusterID.orig", "seurat_clusters")
dr_int_plots[["tsne"]] <- dr_int_plots[["umap"]] <- dr_int_plots <- list()
for (v in vars2plot) {
for (mth in c("tsne", "umap")) {
if (mth %in% names(seu@reductions)) {
dr_int_plots[[mth]][[v]] <- DimPlot(int, reduction=mth, group.by=v, label=TRUE) +
ggtitle(v)
}
}
}
# Save tSNE & UMAP
# pdf(paste(res_dirs[1], "int_seurat_vars_tsne_plots.pdf", sep="/"), width=24, height=16)
# cowplot::plot_grid(plotlist=dr_int_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "int_seurat_vars_umap_plots.pdf", sep="/"), width=24, height=16)
cowplot::plot_grid(plotlist=dr_int_plots[["umap"]], ncol=3)
dev.off()
## png
## 2
# Print
#cowplot::plot_grid(plotlist=dr_int_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_int_plots[["umap"]], ncol=3)
Let’s cluster the result below and see how that looks like.
## Clustering: integrated data
# rename the previous unintegrated cluster
if ("seurat_clusters" %in% colnames(int@meta.data)) colnames(int@meta.data)[which(colnames(int@meta.data)=="seurat_clusters")] <- "unint_seurat_clusters"
set.seed(1024)
int <- FindNeighbors(int, dims=1:30)
int <- FindClusters(int, resolution=0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 46663
## Number of edges: 1703879
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9346
## Number of communities: 25
## Elapsed time: 11 seconds
## Integrated clustering DR plots
vars2plot <- c("cell.types.orig", "newClusterID.orig", "seurat_clusters")
dr_clts_seu_plots[["tsne"]] <- dr_clts_seu_plots[["umap"]] <- dr_clts_seu_plots <- list()
for (v in vars2plot) {
for (mth in c("tsne", "umap")) {
if (mth %in% names(seu@reductions)) {
dr_clts_seu_plots[[mth]][[v]] <- DimPlot(int, reduction=mth, group.by=v, label=TRUE) +
ggtitle(v)
}
}
}
# Save tSNE & UMAP
# pdf(paste(res_dirs[1], "int_seurat_clts_comp_tsne_plots.pdf", sep="/"), width=24, height=10)
# cowplot::plot_grid(plotlist=dr_clts_seu_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "int_seurat_clts_comp_umap_plots.pdf", sep="/"), width=24, height=10)
cowplot::plot_grid(plotlist=dr_clts_seu_plots[["umap"]], ncol=3)
dev.off()
## png
## 2
# Print
#cowplot::plot_grid(plotlist=dr_clts_seu_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_clts_seu_plots[["umap"]], ncol=3)
(35 min)
Scanorama is a
python
package developed for integration based on panorama
stitching (read more about - Hie et al.,
2019).
Contrary to Seurat
RPCA integration method above, where
we used the integrated corrected gene expression matrix
(at
int@assays$integrated@data
) to scale the data, run the PCA
and cluster, here we’ll use the corrected joint embedding
(saved as PCA
) for dimensional reduction and clustering.
The corrected gene expression matrix of Scanorama
is saved
at int_sca@assays$integrated@counts
.
We’re using the corrected joint embedding
for downstream
analyses, such as DR & clustering, because in an independent
benchmark comparison of several integration methods performed by Luecken et
al., 2021 this was the batch-corrected/integrated result that worked
better. Although you can also use the
integrated corrected gene expression matrix
for the same
purpose.
It was created a wrapper function scanorama_int()
that
calls and runs Scanorama
from R
taking as
input a list of Seurat
data sets to integrate (see how to
run it directly - page).
## Integration: Scanorama
# Run integration
set.seed(1024)
int_sca <- scanorama_int(seu_ls)
# DR
#int_sca <- RunTSNE(int_sca, dims=1:30, perplexity=30)
int_sca <- RunUMAP(int_sca, reduction="pca", dims=1:30)
# Cluster the data
int_sca <- FindNeighbors(int_sca, dims=1:30)
int_sca <- FindClusters(int_sca, resolution=0.5)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 46663
## Number of edges: 1596807
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9390
## Number of communities: 29
## Elapsed time: 9 seconds
Let’s look into the results.
## Plot variables after integration (potential batches as well as others)
vars2plot <- c("cell.types.orig", "orig.ident", "newClusterID.orig", "seurat_clusters")
dr_int_sca_plots[["tsne"]] <- dr_int_sca_plots[["umap"]] <- dr_int_sca_plots <- list()
for (v in vars2plot) {
for (mth in c("tsne", "umap")) {
if (mth %in% names(seu@reductions)) {
dr_int_sca_plots[[mth]][[v]] <- DimPlot(int_sca, reduction=mth, group.by=v, label=TRUE) +
ggtitle(v)
}
}
}
# Save tSNE & UMAP
# pdf(paste(res_dirs[1], "int_scanorama_vars_tsne_plots.pdf", sep="/"), width=24, height=16)
# cowplot::plot_grid(plotlist=dr_int_sca_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "int_scanorama_vars_umap_plots.pdf", sep="/"), width=24, height=16)
cowplot::plot_grid(plotlist=dr_int_sca_plots[["umap"]], ncol=3)
dev.off()
## png
## 2
# Print
#cowplot::plot_grid(plotlist=dr_int_sca_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_int_sca_plots[["umap"]], ncol=3)
TASK 2: Discuss within your group if the projections and, particularly, the clustering of integrated data with Seurat and Scanorama, look better now after integrate the datasets by ‘batch.ident’. Some tips: look into the comparison results in the last section.
(15 min)
Now, let’s compare the three clustering results with the
ground-truth
, i.e., newClusterID
(which
correspond to the cell types annotated): unintegrated, Seurat
integrated, Scanorama integrated.
Start by plotting the dimensional reduction plots below:
## Clustering comparison: DR plots
# pdf(paste(res_dirs[1], "clts_comp_all_tsne_plots.pdf", sep="/"), width=24, height=24)
# cowplot::plot_grid(plotlist=list(dr_clts_plots[["tsne"]][[1]], dr_clts_plots[["tsne"]][[2]], dr_clts_plots[["tsne"]][[3]],
# dr_clts_seu_plots[["tsne"]][[1]], dr_clts_seu_plots[["tsne"]][[2]], dr_clts_seu_plots[["tsne"]][[3]],
# dr_int_sca_plots[["tsne"]][[1]],dr_int_sca_plots[["tsne"]][[3]],dr_int_sca_plots[["tsne"]][[4]]
# ), ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "clts_comp_all_umap_plots.pdf", sep="/"), width=24, height=24)
cowplot::plot_grid(plotlist=list(dr_clts_plots[["umap"]][[1]], dr_clts_plots[["umap"]][[2]], dr_clts_plots[["umap"]][[3]],
dr_clts_seu_plots[["umap"]][[1]], dr_clts_seu_plots[["umap"]][[2]], dr_clts_seu_plots[["umap"]][[3]],
dr_int_sca_plots[["umap"]][[1]],dr_int_sca_plots[["umap"]][[3]],dr_int_sca_plots[["umap"]][[4]]
), ncol=3)
dev.off()
## png
## 2
# Print
# cowplot::plot_grid(plotlist=list(dr_clts_plots[["tsne"]][[1]], dr_clts_plots[["tsne"]][[2]], dr_clts_plots[["tsne"]][[3]],
# dr_clts_seu_plots[["tsne"]][[1]], dr_clts_seu_plots[["tsne"]][[2]], dr_clts_seu_plots[["tsne"]][[3]],
# dr_int_sca_plots[["tsne"]][[1]],dr_int_sca_plots[["tsne"]][[3]],dr_int_sca_plots[["tsne"]][[4]]
# ), ncol=3)
cowplot::plot_grid(plotlist=list(dr_clts_plots[["umap"]][[1]], dr_clts_plots[["umap"]][[2]], dr_clts_plots[["umap"]][[3]],
dr_clts_seu_plots[["umap"]][[1]], dr_clts_seu_plots[["umap"]][[2]], dr_clts_seu_plots[["umap"]][[3]],
dr_int_sca_plots[["umap"]][[1]],dr_int_sca_plots[["umap"]][[3]],dr_int_sca_plots[["umap"]][[4]]
), ncol=3)
In order to get a more concrete metric, let’s compare the clusterings
obtained by us with the ground-truth
provided by authors,
i.e., newClusterID.orig
and/or
cell.types.orig
. For this purpose we can check a confusion
matrix and use the Adjusted Rand Index -
ARI to compare two clustering results.
Let’s create a confusion matrix for every comparison, where we’ll
compare the clustering results obtained here with the
cell.types.orig
.
## Clustering comparison - confusion matrix
clts2celltype <- list()
clts2celltype[["orig"]] <- data.frame(rbind(table(seu@meta.data$newClusterID.orig, seu@meta.data$cell.types.orig)))
clts2celltype[["unint"]] <- data.frame(rbind(table(seu@meta.data$seurat_clusters, seu@meta.data$cell.types.orig)))
clts2celltype[["int_seu"]] <- data.frame(rbind(table(int@meta.data$seurat_clusters, int@meta.data$cell.types.orig)))
clts2celltype[["int_sca"]] <- data.frame(rbind(table(int_sca@meta.data$seurat_clusters, int_sca@meta.data$cell.types.orig)))
# save table
for (type in names(clts2celltype)) {
write.table(x=cbind("Clusters"=row.names(clts2celltype[[type]]), clts2celltype[[type]]),
file=paste(res_dirs[2], paste0("conf_mtx_", type, "_clusters_vs_celltype.tsv"), sep="/"),
row.names=FALSE, quote=FALSE, sep="\t")
}
# Print heatmaps of the 4 confusion matrices
h1 <- Heatmap(t(clts2celltype[[1]]), name="No. cells", column_title="original", cluster_rows=FALSE, cluster_columns=FALSE)
h2 <- Heatmap(t(clts2celltype[[2]]), name="No. cells", column_title="unintegrated",
cluster_rows=FALSE, cluster_columns=FALSE)
h3 <- Heatmap(t(clts2celltype[[3]]), name="No. cells", column_title="Seurat", cluster_rows=FALSE, cluster_columns=FALSE)
h4 <- Heatmap(t(clts2celltype[[4]]), name="No. cells", column_title="Scanorama", cluster_rows=FALSE, cluster_columns=FALSE)
pdf(paste(res_dirs[1], "heatmap_conf_mtx_comp.pdf", sep="/"), width=18)
(h1+h2+h3+h4)
dev.off()
## png
## 2
print((h1+h2+h3+h4))
# print clusters to cell types
knitr::kable(clts2celltype[["orig"]])
Activated.CD4 | Activated.CD8 | CD16..NK.and.NKT | CD16..NK.and.NKT.1 | CD38..CD8.T | CD38..CD8.T.1 | Classical.monocytes | CX3CR1hi.Effector.CD8 | CX3CR1lo.Effector.CD8 | Cycling.T | Effector.CD8 | Gamma.Delta.T | mDC | Megakaryocytes | Memory.B | Naive.B | Naive.T | Neutrophil | Non.classical.monocytes | pDC | Plasmablasts | Synovial.cells | Treg | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1015 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 796 |
3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5139 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 767 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | 3630 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | 0 | 2180 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
7 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5593 | 0 | 0 | 0 | 0 | 0 | 0 |
8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4824 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 393 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
10 | 0 | 0 | 0 | 0 | 241 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
11 | 0 | 0 | 0 | 0 | 0 | 415 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
12 | 0 | 0 | 0 | 6013 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
13 | 0 | 0 | 728 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1565 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 936 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 200 | 0 | 0 |
18 | 0 | 0 | 0 | 0 | 0 | 0 | 2952 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
19 | 0 | 0 | 0 | 0 | 0 | 0 | 1317 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 947 | 0 | 0 | 0 | 0 |
21 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2206 | 0 | 0 | 0 | 0 | 0 |
22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 982 | 0 | 0 | 0 | 0 | 0 |
23 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 188 | 0 | 0 | 0 | 0 | 0 |
24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 38 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
25 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1922 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
26 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 159 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
27 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 158 | 0 | 0 | 0 |
28 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 269 | 0 | 0 | 0 |
29 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 96 | 0 | 0 | 0 |
30 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 939 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
31 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 55 | 0 |
Above we’ve just printed the confusion matrix comparing the original cluster labels versus cell types given by the authors. To make the results more interpretable, the confusion matrices were printed as heatmaps.
You can open the other matrices if you want, but the ARI metric below will give us a more concrete value.
Question: Which original renamed cluster(s) corresponds to Tregs? (1 min)
Tregs (n=796) belong to cluster 2.
Print the R
chunk code below to get the
ARI metric.
## Clustering comparison - ARI
true_labels <- list("unint"=seu@meta.data$cell.types.orig,
"int_seu"=int@meta.data$cell.types.orig,
"int_sca"=int_sca@meta.data$cell.types.orig)
new_clts <- list("unint"=seu@meta.data$seurat_clusters,
"int_seu"=int@meta.data$seurat_clusters,
"int_sca"=int_sca@meta.data$seurat_clusters)
comp <- names(new_clts)
lapply(setNames(comp, comp), function(x) clustComp(true_labels[[x]], new_clts[[x]])[["ARI"]]) %>%
unlist()
## unint int_seu int_sca
## 0.5915921 0.6302659 0.5960452
TASK 3: Discuss within your group which result reproduces better the cell types/clusters found by the authors.
## Save objects
res_objs <- "results/GSE173303/objects"
if(!dir.exists(res_objs)) dir.create(res_objs, recursive=TRUE)
saveRDS(seu, paste(res_objs, "seu.rds", sep="/"))
saveRDS(int, paste(res_objs, "int.rds", sep="/"))
saveRDS(int_sca, paste(res_objs, "int_sca.rds", sep="/"))
## R packages and versions used in these analyses
sessionInfo()
## R version 4.1.1 (2021-08-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
##
## Matrix products: default
## BLAS/LAPACK: /wrk/local2/B21027_scRNAseq_signatures/070622_embl_ebi_tcell_course/EMBL_EBI_scRNA_bioinformatics_Tcell_course_2022/workflow/.snakemake/conda/2b532403d31becf1b5412328473ca837/lib/libopenblasp-r0.3.20.so
##
## locale:
## [1] LC_CTYPE=en_GB.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=fi_FI.UTF-8 LC_COLLATE=en_GB.UTF-8
## [5] LC_MONETARY=fi_FI.UTF-8 LC_MESSAGES=en_GB.UTF-8
## [7] LC_PAPER=fi_FI.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] reticulate_1.25 aricode_1.0.0 ComplexHeatmap_2.10.0
## [4] sp_1.5-0 SeuratObject_4.1.0 Seurat_4.1.1
## [7] ggplot2_3.3.6 dplyr_1.0.9
##
## loaded via a namespace (and not attached):
## [1] circlize_0.4.15 plyr_1.8.7 igraph_1.3.2
## [4] lazyeval_0.2.2 splines_4.1.1 listenv_0.8.0
## [7] scattermore_0.8 digest_0.6.29 foreach_1.5.2
## [10] htmltools_0.5.2 fansi_1.0.3 magrittr_2.0.3
## [13] tensor_1.5 cluster_2.1.3 doParallel_1.0.17
## [16] ROCR_1.0-11 globals_0.15.0 matrixStats_0.62.0
## [19] spatstat.sparse_2.1-1 colorspace_2.0-3 rappdirs_0.3.3
## [22] ggrepel_0.9.1 xfun_0.31 crayon_1.5.1
## [25] jsonlite_1.8.0 progressr_0.10.1 spatstat.data_2.2-0
## [28] survival_3.3-1 zoo_1.8-10 iterators_1.0.14
## [31] glue_1.6.2 polyclip_1.10-0 gtable_0.3.0
## [34] leiden_0.4.2 GetoptLong_1.0.5 future.apply_1.9.0
## [37] shape_1.4.6 BiocGenerics_0.40.0 abind_1.4-5
## [40] scales_1.2.0 DBI_1.1.3 spatstat.random_2.2-0
## [43] miniUI_0.1.1.1 Rcpp_1.0.8.3 viridisLite_0.4.0
## [46] xtable_1.8-4 clue_0.3-60 spatstat.core_2.4-4
## [49] klippy_0.0.0.9500 stats4_4.1.1 htmlwidgets_1.5.4
## [52] httr_1.4.3 RColorBrewer_1.1-3 ellipsis_0.3.2
## [55] ica_1.0-2 pkgconfig_2.0.3 farver_2.1.0
## [58] sass_0.4.1 uwot_0.1.11 deldir_1.0-6
## [61] utf8_1.2.2 here_1.0.1 labeling_0.4.2
## [64] tidyselect_1.1.2 rlang_1.0.2 reshape2_1.4.4
## [67] later_1.3.0 munsell_0.5.0 tools_4.1.1
## [70] cli_3.3.0 generics_0.1.2 ggridges_0.5.3
## [73] evaluate_0.15 stringr_1.4.0 fastmap_1.1.0
## [76] yaml_2.3.5 goftest_1.2-3 knitr_1.39
## [79] fitdistrplus_1.1-8 purrr_0.3.4 RANN_2.6.1
## [82] pbapply_1.5-0 future_1.26.1 nlme_3.1-157
## [85] mime_0.12 compiler_4.1.1 plotly_4.10.0
## [88] png_0.1-7 spatstat.utils_2.3-1 tibble_3.1.7
## [91] bslib_0.3.1 stringi_1.7.6 highr_0.9
## [94] RSpectra_0.16-1 rgeos_0.5-9 lattice_0.20-45
## [97] Matrix_1.4-1 vctrs_0.4.1 pillar_1.7.0
## [100] lifecycle_1.0.1 spatstat.geom_2.4-0 lmtest_0.9-40
## [103] jquerylib_0.1.4 GlobalOptions_0.1.2 RcppAnnoy_0.0.19
## [106] data.table_1.14.2 cowplot_1.1.1 irlba_2.3.5
## [109] httpuv_1.6.5 patchwork_1.1.1 R6_2.5.1
## [112] promises_1.2.0.1 KernSmooth_2.23-20 gridExtra_2.3
## [115] IRanges_2.28.0 parallelly_1.32.0 codetools_0.2-18
## [118] MASS_7.3-57 assertthat_0.2.1 rprojroot_2.0.3
## [121] rjson_0.2.21 withr_2.5.0 sctransform_0.3.3
## [124] S4Vectors_0.32.3 mgcv_1.8-40 parallel_4.1.1
## [127] rpart_4.1.16 tidyr_1.2.0 rmarkdown_2.11
## [130] Rtsne_0.16 shiny_1.7.1