Notebook

## Import packages
library("dplyr") # data wrangling
library("ggplot2") # plotting
library("Seurat") # scRNA-seq analysis
library("ComplexHeatmap") # plot heatmaps
library("aricode") # ARI index clt comp
library("reticulate") # to work w/ python packages in R
scanorama <- import('scanorama') # import the Scanorama python package
source("scripts/helper_functions.R") # import functions adapted to automatize integration with scanorama

Data

Download & Import Datasets

(7 min)

Run the R chunk code below to import the table data/GEO_GSE173303_project.tsv which contains the information to download the data sets that we’ll be using herein.

## Download datasets

# import table with datasets to download
data2download <- read.table("data/GEO_GSE173303_project.tsv", 
                sep="\t", header=TRUE)

Check below the content of the table imported.

# print table 
knitr::kable(data2download)

filename	description	ftp_url
GSE173303_barcodes.tsv.gz	table of barcodes	https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fbarcodes%2Etsv%2Egz
GSE173303_features.tsv.gz	table of genes	https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Ffeatures%2Etsv%2Egz
GSE173303_matrix.mtx.gz	gene expression data	https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fmatrix%2Emtx%2Egz
GSE173303_subT_data_metadata.tsv.gz	cell metadata for sub T cells	https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5FsubT%5Fdata%5Fmetadata%2Etsv%2Egz
GSE173303_whole_data_metadata.tsv.gz	cell metadata for all	https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fwhole%5Fdata%5Fmetadata%2Etsv%2Egz

Now use the information from the table above to download all the data from the GEO project GSE173303 (published by Kim et al., 2022). In this study the authors used peripheral blood and/or synovial fluid samples from 20 patients with arthritis-autoimmune related adverse events.

The GEO GSE173303 repository comprises two types of files:

10X Genomics Cell Ranger (droplet based scRNA-seq technology - read more):
- GSE173303_barcodes.tsv.gz (cell barcodes)
- GSE173303_features.tsv.gz (genes)
- GSE173303_matrix.mtx.gz (gene expression values)
Metadata (data describing cell attributes):
- GSE173303_whole_data_metadata.tsv.gz (for all)
- GSE173303_subT_data_metadata.tsv.gz (for T cells)

The 10X data can be directly imported into Seurat, a R package developed for the analysis of scRNA-seq data (page; Hao et al., 2021). While the metadata will be imported separately as tab-separated tables.

Since the function used to import 10X data into R and Seurat, Read10X, looks into a given directory and searches for the following file names, barcodes.tsv, genes.tsv, and matrix.mtx (it allows compressed .gz files), it is more convenient to remove the prefix file name ‘GSE173303_’ from every file.

## Download & import data

# create directory to save files
down_dir <- "data/GSE173303" # directory to save datasets 
if (!dir.exists(down_dir)) dir.create(down_dir, recursive=TRUE) # create folder if doesn't exist 

# loop over the table rows & download each dataset
for (f in 1:nrow(data2download)) {
    down_filename <- data2download[f,"filename"]
    down_filepath <- paste(down_dir, gsub("GSE173303_", "", down_filename), 
                   sep="/")# download file to: 'data/GSE173303'
    down_url <- data2download[f,"ftp_url"]
    cat("Downloading file", down_filename, "to", down_filepath, "\n")
    download.file(url=down_url, destfile=down_filepath)
}

## Downloading file GSE173303_barcodes.tsv.gz to data/GSE173303/barcodes.tsv.gz 
## Downloading file GSE173303_features.tsv.gz to data/GSE173303/features.tsv.gz 
## Downloading file GSE173303_matrix.mtx.gz to data/GSE173303/matrix.mtx.gz 
## Downloading file GSE173303_subT_data_metadata.tsv.gz to data/GSE173303/subT_data_metadata.tsv.gz 
## Downloading file GSE173303_whole_data_metadata.tsv.gz to data/GSE173303/whole_data_metadata.tsv.gz

# import gene expression 10X data into Seurat
data10x <- Read10X(data.dir=down_dir) # import 10x data as sparse matrix 
seu <- CreateSeuratObject(counts=data10x, project="GSE173303") # convert gene exp. sparse matrix into Seurat class object

# import metadata
metadata <- list.files(down_dir, pattern="metadata", full.names=TRUE) # get path for metadata based on 'pattern' argument
names(metadata) <- gsub("_data_metadata.tsv.gz", "", basename(metadata)) # name the paths
metadata <- lapply(setNames(metadata, names(metadata)), function(x) {
               read.table(x, header=TRUE, sep="\t", stringsAsFactors=FALSE)
                   }) # import tables into a list

Seurat object

(5 min)

The data was imported into R as a sparse matrix (RAM efficient - most of the entries are zero) with the Seurat R function data10x <- Read10X(). Then, the sparse matrix, where genes are rows and columns are cells, data10x was converted in a Seurat class object at seu <- CreateSeuratObject(counts=data10x, project="GSE173303"). The object seu will be used downstream to analyze the scRNA-seq data.

Look into the structure of the Seurat class object by running the code below.

# Seurat object structure
str(seu)

## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
##   ..@ assays      :List of 1
##   .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
##   .. .. .. ..@ counts       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:142316713] 31 45 55 84 93 108 115 130 240 267 ...
##   .. .. .. .. .. ..@ p       : int [1:89718] 0 1024 2229 3957 5289 7330 8786 11293 14382 16908 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 21831 89717
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:21831] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
##   .. .. .. .. .. .. ..$ : chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
##   .. .. .. .. .. ..@ x       : num [1:142316713] 1 1 1 5 1 1 2 1 1 16 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ data         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:142316713] 31 45 55 84 93 108 115 130 240 267 ...
##   .. .. .. .. .. ..@ p       : int [1:89718] 0 1024 2229 3957 5289 7330 8786 11293 14382 16908 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 21831 89717
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:21831] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
##   .. .. .. .. .. .. ..$ : chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
##   .. .. .. .. .. ..@ x       : num [1:142316713] 1 1 1 5 1 1 2 1 1 16 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ scale.data   : num[0 , 0 ] 
##   .. .. .. ..@ key          : chr "rna_"
##   .. .. .. ..@ assay.orig   : NULL
##   .. .. .. ..@ var.features : logi(0) 
##   .. .. .. ..@ meta.features:'data.frame':   21831 obs. of  0 variables
##   .. .. .. ..@ misc         : list()
##   ..@ meta.data   :'data.frame': 89717 obs. of  3 variables:
##   .. ..$ orig.ident  : Factor w/ 16 levels "45","45B","62",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ nCount_RNA  : num [1:89717] 2669 3727 4039 3123 6949 ...
##   .. ..$ nFeature_RNA: int [1:89717] 1024 1205 1728 1332 2041 1456 2507 3089 2526 1623 ...
##   ..@ active.assay: chr "RNA"
##   ..@ active.ident: Factor w/ 16 levels "45","45B","62",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
##   ..@ graphs      : list()
##   ..@ neighbors   : list()
##   ..@ reductions  : list()
##   ..@ images      : list()
##   ..@ project.name: chr "GSE173303"
##   ..@ misc        : list()
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 4 1 0
##   ..@ commands    : list()
##   ..@ tools       : list()

An object of Seurat class has several layers of information. You don’t need to know all. Let’s focus on @assays and @meta.data. The former stores the gene expression data. Assays are further divided into slots of data. Particularly important are the slots: @counts (unnormalized/raw counts), @data (normalized data, e.g., log-noramlized with 10K scaling factor), @scale.data (scaled data, e.g., standardized by Z-score). The latter, i.e., @meta.data stores the cell metadata (number of genes expressed in the cell, cluster identity, etc).

The object seu imported above will be called almost in every function during the analysis and most of the results will be saved directly into it. Seurat provides functions to access the data stored in its own class.

Read more about the structure of a Seurat object in the Seurat wiki.

Question: How many genes and cells compose the GEO GSE173303 dataset? (1 min)

Tip: type the object seu in the R console and interpret the message printed or provide the object seu to the functions nrow() and ncol() which count the no. of rows (=genes) and columns (=cells).

Answer

The GEO GSE173303 dataset comprises 21831 genes and 89717 cells.

Print below the gene expression of the first 10 genes across the first 10 cells stored in the main assay, called RNA by default, in the slot @counts.

## Inspect data
seu@assays$RNA@counts[1:10,1:10]

## 10 x 10 sparse Matrix of class "dgCMatrix"
##                               
## AL627309.1 . . . . . . . . . .
## AL669831.5 . . . . . . . . . .
## FAM87B     . . . . . . . . . .
## LINC00115  . . . . . . . . . .
## FAM41C     . . . . . . . . . .
## NOC2L      . . . . . . . . . .
## KLHL17     . . . . . . . . . .
## PLEKHN1    . . . . . . . . . .
## HES4       . . . . . . . . . .
## ISG15      . . . . . . 1 2 2 3

The gene expression counts stored in the slot @counts are in sparse format. Points represent zero.

Question: Which is the expression value for the 10th gene in the 10th cell? (1 min)

Answer

It is 3.

Metadata

(5 min)

Let’s inspect below what the @meta.data slot contains (run the code).

# Print the first 6 rows of '@meta.data'
head(seu@meta.data)

##                     orig.ident nCount_RNA nFeature_RNA
## 45_AAACCTGAGAATGTGT         45       2669         1024
## 45_AAACCTGAGACTAGAT         45       3727         1205
## 45_AAACCTGAGACTGGGT         45       4039         1728
## 45_AAACCTGAGAGAGCTC         45       3123         1332
## 45_AAACCTGAGATATGGT         45       6949         2041
## 45_AAACCTGAGATCTGCT         45       4210         1456

Question: What do you think each column represents? (3 min)

Tip: discuss this in group.

Answer

The slot @meta.data contains 3 columns/variables: orig.ident, nCount_RNA, nFeature_RNA. These variables were created automatically from the data without us providing any information. orig.ident refers to the identity of the cell. This matches the prefix of the cell barcode represented as row names of the orig.ident. This is because the function CreateSeuratObject() contains two parameters: names.delim = "_" and names.field = 1. These parameters will split the cell barcode names by names.delim and pick the first element, i.e., 45 for the first 6 cells. You will understand the meaning of these values below. nCount_RNA is just the total number of UMIs in a cell (obtained by summing all the expression values in a cell/column). nFeature_RNA is the number of different genes expressed in a cell (obtained by summing the number of genes expressing a value different than zero).

Now, let’s check the metadata provided in the GEO repository that was imported above. The object metadata is a list with two tables (or data frames) stored in it named: subT, whole. To access each one of them, you can call the object with the accessor $ sign like this: metadata$subT (to access subT) or metadata$whole (to access whole). You can use the functions dim() (to see the dimensions, i.e., no. of rows x cols), summary() (to print a summary for every column) and head() (to print the first 6 rows) to inspect the tables. Do not attempt to print the whole tables, because the thousand of rows they comprise.

For now let’s ignore the metadata$subT data frame, since it represents a subset of the whole data.

# Print the first 6 lines of the 'whole' metadata 
head(metadata$whole)

##   orig.ident nCount_RNA nFeature_RNA percent.mito sbt_is_doublet
## 1         45       2669         1024   0.11689771          FALSE
## 2         45       3727         1205   0.07593239          FALSE
## 3         45       4039         1728   0.08294132          FALSE
## 4         45       3123         1332   0.05315402          FALSE
## 5         45       6949         2041   0.07281623          FALSE
## 6         45       4210         1456   0.08622328          FALSE
##   sbt_doublet_score manually_is_doublet batch control GROUP RNA_snn_res.0.5
## 1        0.05778302               FALSE     2      SF  Mono               2
## 2        0.04066337               FALSE     2      SF  Mono               0
## 3        0.05618845               FALSE     2      SF  Mono               3
## 4        0.06669545               FALSE     2      SF  Mono               0
## 5        0.08527529               FALSE     2      SF  Mono               2
## 6        0.05618845               FALSE     2      SF  Mono               2
##   seurat_clusters newClusterID            cell.types             barcode
## 1               2            8 CX3CR1hi Effector CD8 45_AAACCTGAGAATGTGT
## 2               0            3          Effector CD8 45_AAACCTGAGACTAGAT
## 3               3            5         Activated CD4 45_AAACCTGAGACTGGGT
## 4               0            3          Effector CD8 45_AAACCTGAGAGAGCTC
## 5               2            8 CX3CR1hi Effector CD8 45_AAACCTGAGATATGGT
## 6               2            8 CX3CR1hi Effector CD8 45_AAACCTGAGATCTGCT

# Look into the summary
summary(metadata$whole)

##   orig.ident          nCount_RNA     nFeature_RNA   percent.mito    
##  Length:89717       Min.   :  353   Min.   : 202   Min.   :0.00000  
##  Class :character   1st Qu.: 2373   1st Qu.: 986   1st Qu.:0.04151  
##  Mode  :character   Median : 4260   Median :1537   Median :0.05605  
##                     Mean   : 4974   Mean   :1586   Mean   :0.06067  
##                     3rd Qu.: 6067   3rd Qu.:1988   3rd Qu.:0.07425  
##                     Max.   :87816   Max.   :7709   Max.   :0.14999  
##  sbt_is_doublet  sbt_doublet_score  manually_is_doublet     batch      
##  Mode :logical   Min.   :0.001366   Mode :logical       Min.   :0.000  
##  FALSE:89717     1st Qu.:0.030928   FALSE:89717         1st Qu.:1.000  
##                  Median :0.053980                       Median :2.000  
##                  Mean   :0.069264                       Mean   :1.786  
##                  3rd Qu.:0.089419                       3rd Qu.:3.000  
##                  Max.   :0.626984                       Max.   :3.000  
##    control             GROUP           RNA_snn_res.0.5 seurat_clusters
##  Length:89717       Length:89717       Min.   : 0.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.: 2.00   1st Qu.: 2.00  
##  Mode  :character   Mode  :character   Median : 4.00   Median : 4.00  
##                                        Mean   : 6.43   Mean   : 6.43  
##                                        3rd Qu.: 9.00   3rd Qu.: 9.00  
##                                        Max.   :30.00   Max.   :30.00  
##   newClusterID    cell.types          barcode         
##  Min.   : 1.00   Length:89717       Length:89717      
##  1st Qu.: 5.00   Class :character   Class :character  
##  Median : 8.00   Mode  :character   Mode  :character  
##  Mean   :10.62                                        
##  3rd Qu.:14.00                                        
##  Max.   :31.00

The first three columns of metadata$whole seem to match the first three columns from seu@meta.data (this can be computational tested by doing: all_equal(seu@meta.data, metadata$whole[,c("orig.ident", "nCount_RNA", "nFeature_RNA")])). If you run the previous code you’ll realize that the message printed suggests they are different. This is because the data types/classes of the same columns are different across tables, but the values are the same. How can we be sure? By comparing the cell barcodes in the two tables: all(row.names(seu@meta.data)==metadata$whole$barcodes). It returns TRUE meaning the cell barcodes between tables are exactly the same. This is useful, because we know now that we can combine the metadata$whole to seu@meta.data in order to access this data from our Seurat object. This will be convenient to plot and query the data later.

To understand the meaning of every column in metadata$whole you would need to read carefully the information available in the GEO GSE173303 repository from where we downloaded the data from. For reasons of time, this was done for you and summarized below.

orig.ident: original identity of every cell regarding the 8 matched donor sample from arthritis-irAE subjects (8 synovial fluid samples vs 8 PBMCs, n=16) that they came from: A7, A40, 45, A56, 62, 63, 65, 76 (synovial), A7B, A40B, 45B, A56B, 62B, 63B, 65B, 76B (PBMCs).
nCount_RNA: total no. of UMIs expressed by cell.
nFeature_RNA: no. of different genes expressed by cell.
percent.mito: percentage of mitochondrial genes expressed by cell.
sbt_is_doublet: logical value checking if a cell is a doublet (obtained with the software scrublet). ‘FALSE’ for all the cells, meaning that the authors removed doublets before submitting the data to GEO.
sbt_doublet_score: doublet score obtained with scrublet.
manually_is_doublet: similar to sbt_is_doublet.
batch: batch factors/categories. It assumes the values: 0, 1, 2, 3.
control: sample type. One of two: Blood (Blood/PBMC samples, n=59255) or SF (Synovial Fluid, n=30462).
Group: the identity of the treatment group. One of two: Combo (mixture of ICI therapy medicines, n=37465) or Mono (monotherapy, n=52252)
RNA_snn_res.0.5: Seurat clusters obtained by using a resolution value of 0.5.
seurat_clusters: Seurat clusters. The same as RNA_snn_res.0.5.
newClusterID: new cluster ID. seurat_clusters renamed.
cell.types: cell-type annotations.
barcode: cell barcodes with donor sample id prefix.

Now that we know what every column means let’s combine this table with our Seurat object metadata: seu@meta.data. Before doing so, we’ll rename the columns from metadata$whole to avoid conflicts in the column names at seu@meta.data (because some of the column names are the same between tables).

## Combine metadata

# Rename col names from 'metadata$whole' by adding the suffix '.orig' to the col names
tmp <- metadata$whole # temporary modified table - a copy from 'metadata$whole'
colnames(tmp) <- paste0(colnames(tmp),".orig")

# Combine metadata tables
stopifnot(all(row.names(seu@meta.data)==tmp$barcode.orig)) # ensure that cell barcodes match
seu@meta.data <- cbind(seu@meta.data, tmp)

# Look how the updated Seurat meta.data looks like
head(seu@meta.data)

##                     orig.ident nCount_RNA nFeature_RNA orig.ident.orig
## 45_AAACCTGAGAATGTGT         45       2669         1024              45
## 45_AAACCTGAGACTAGAT         45       3727         1205              45
## 45_AAACCTGAGACTGGGT         45       4039         1728              45
## 45_AAACCTGAGAGAGCTC         45       3123         1332              45
## 45_AAACCTGAGATATGGT         45       6949         2041              45
## 45_AAACCTGAGATCTGCT         45       4210         1456              45
##                     nCount_RNA.orig nFeature_RNA.orig percent.mito.orig
## 45_AAACCTGAGAATGTGT            2669              1024        0.11689771
## 45_AAACCTGAGACTAGAT            3727              1205        0.07593239
## 45_AAACCTGAGACTGGGT            4039              1728        0.08294132
## 45_AAACCTGAGAGAGCTC            3123              1332        0.05315402
## 45_AAACCTGAGATATGGT            6949              2041        0.07281623
## 45_AAACCTGAGATCTGCT            4210              1456        0.08622328
##                     sbt_is_doublet.orig sbt_doublet_score.orig
## 45_AAACCTGAGAATGTGT               FALSE             0.05778302
## 45_AAACCTGAGACTAGAT               FALSE             0.04066337
## 45_AAACCTGAGACTGGGT               FALSE             0.05618845
## 45_AAACCTGAGAGAGCTC               FALSE             0.06669545
## 45_AAACCTGAGATATGGT               FALSE             0.08527529
## 45_AAACCTGAGATCTGCT               FALSE             0.05618845
##                     manually_is_doublet.orig batch.orig control.orig GROUP.orig
## 45_AAACCTGAGAATGTGT                    FALSE          2           SF       Mono
## 45_AAACCTGAGACTAGAT                    FALSE          2           SF       Mono
## 45_AAACCTGAGACTGGGT                    FALSE          2           SF       Mono
## 45_AAACCTGAGAGAGCTC                    FALSE          2           SF       Mono
## 45_AAACCTGAGATATGGT                    FALSE          2           SF       Mono
## 45_AAACCTGAGATCTGCT                    FALSE          2           SF       Mono
##                     RNA_snn_res.0.5.orig seurat_clusters.orig newClusterID.orig
## 45_AAACCTGAGAATGTGT                    2                    2                 8
## 45_AAACCTGAGACTAGAT                    0                    0                 3
## 45_AAACCTGAGACTGGGT                    3                    3                 5
## 45_AAACCTGAGAGAGCTC                    0                    0                 3
## 45_AAACCTGAGATATGGT                    2                    2                 8
## 45_AAACCTGAGATCTGCT                    2                    2                 8
##                           cell.types.orig        barcode.orig
## 45_AAACCTGAGAATGTGT CX3CR1hi Effector CD8 45_AAACCTGAGAATGTGT
## 45_AAACCTGAGACTAGAT          Effector CD8 45_AAACCTGAGACTAGAT
## 45_AAACCTGAGACTGGGT         Activated CD4 45_AAACCTGAGACTGGGT
## 45_AAACCTGAGAGAGCTC          Effector CD8 45_AAACCTGAGAGAGCTC
## 45_AAACCTGAGATATGGT CX3CR1hi Effector CD8 45_AAACCTGAGATATGGT
## 45_AAACCTGAGATCTGCT CX3CR1hi Effector CD8 45_AAACCTGAGATCTGCT

Now we’re ready to start our analysis with Seurat!

QC

Viz

(5 min)

The first step in every analysis is: Quality-Control (QC). In this case, the QC that we’ll perform is relatively small because the data was already processed by the 10X Genomics Cell Ranger. In addition, the authors filter out bad-quality cells such as doublets as well cells comprising a high percentage of mitochondrial genes (see a full description in the GEO GSE173303 repository or in the publication doi:10.1038/s41467-022-29539-3).

The percentage of mitochondrial genes was already estimated by the authors, but let’s calculate it below to see if the values match. This is done with the function PercentageFeatureSet(). This function looks into the genes with the pattern given at pattern to determine the percentage of these features. This is done using regular expressions and the pattern given depends on the set of features that you’re looking into and the annotation used. In our case we’ve human data annotated against the reference GRCh38. All the mitochondrial genes have the prefix ‘MT-’. Thus we will use this to provide this pattern to pattern = "^MT-".

Question: Which and how many mitochondrial genes are present in this data set? (1 min)

Tip: use the following code grep("^MT-", row.names(seu), value=TRUE) to retrieve the mitochondrial genes together with the function length() to get the number of mt genes.

Answer

The mitochondrial genes present are: MT-ND1, MT-ND2, MT-CO1, MT-CO2, MT-ATP8, MT-ATP6, MT-CO3, MT-ND3, MT-ND4L, MT-ND4, MT-ND5, MT-ND6, MT-CYB.

The total no. of mt genes is: 13.

## Calculate percentage of mitochondrial genes by cell

seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-") # add the 'percent.mt' to the 'meta.data' slot

# if you work with mouse data usually the prefix for mitochondrial genes is '^mt-' instead

The previous percent.mito.orig is different than the percentage of mitochondrial genes that you had just calculated percent.mt. It seems that the former is in relative abundance, with a scale 0-1, whereas your calculation is in percentage (0-100%).

Now let’s plot the three cell properties that Seurat calculated: nFeature_RNA, nCount_RNA, percent.mt. By default Seurat uses the @meta.data variable orig.ident to set the identity of every cell. This means that the cells will be highlighted by orig.ident, i.e., donor sample origin. This can be changed, but for now it is good to look into the QC highlighted by the donor sample origin. In addition, we’ll take opportunity of the functionality of visualization in Seurat to highlight the data by control.orig (the metadata column with the categorical variable of the sample type: Blood or SF): SF, in light blue, and Blood, in light red, violins.

## QC: violin plots

# Create folder to save the results
res_dirs <- paste("results/GSE173303", c("plots", "tables"), sep="/")
for (d in res_dirs) if (!dir.exists(d)) dir.create(d, recursive=TRUE)

# Plot
qc_vln <- VlnPlot(seu, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
          ncol = 3, split.by="control.orig") 
# Seurat removes legend if there is more than 1 plot: https://github.com/satijalab/seurat/blob/a1294c4d363780548dbf9cc4a4abb3a6078a6d64/R/visualization.R#L5729

# Save
pdf(paste(res_dirs[1], "qc_violin_plots.pdf", sep="/"), width=12, height=4)
print(qc_vln)
dev.off()

## png 
##   2

# Print plot
print(qc_vln)

What do you think?

Let’s get some more concrete numbers by calculating the median() and mean() of every metric across the donor samples.

## Mean & median of cell features

# across donor samples
seu@meta.data[,c("orig.ident", "control.orig", "nFeature_RNA", "nCount_RNA", "percent.mt")] %>% 
    group_by(orig.ident, control.orig) %>% 
    summarise_if(is.numeric, list("Mean"=mean, "Median"=median))

## # A tibble: 16 × 8
## # Groups:   orig.ident [16]
##    orig.ident control.orig nFeature_RNA_Mean nCount_RNA_Mean percent.mt_Mean
##    <fct>      <chr>                    <dbl>           <dbl>           <dbl>
##  1 45         SF                       1595.           4792.            6.83
##  2 45B        Blood                     781.           2201.            9.10
##  3 62         SF                       1564.           4741.            5.70
##  4 62B        Blood                     999.           2849.            6.47
##  5 63         SF                       2024.           6580.            4.54
##  6 63B        Blood                    1699.           5295.            6.25
##  7 65         SF                       2519.           9181.            5.16
##  8 65B        Blood                    1425.           4574.            6.14
##  9 76         SF                       1496.           4129.            5.82
## 10 76B        Blood                    1582.           4944.            5.62
## 11 A40        SF                       1646.           4886.            6.19
## 12 A40B       Blood                    1341.           4182.            6.52
## 13 A56        SF                       1383.           4062.            7.57
## 14 A56B       Blood                    1435.           4054.            7.16
## 15 A7         SF                       1956.           7134.            5.99
## 16 A7B        Blood                    1872.           5542.            4.14
## # … with 3 more variables: nFeature_RNA_Median <dbl>, nCount_RNA_Median <dbl>,
## #   percent.mt_Median <dbl>

# across control.orig groups
seu@meta.data[,c("orig.ident", "control.orig", "nFeature_RNA", "nCount_RNA", "percent.mt")] %>% 
    group_by(control.orig) %>% 
    summarise_if(is.numeric, list("Mean"=mean, "Median"=median))

## # A tibble: 2 × 7
##   control.orig nFeature_RNA_Me… nCount_RNA_Mean percent.mt_Mean nFeature_RNA_Me…
##   <chr>                   <dbl>           <dbl>           <dbl>            <dbl>
## 1 Blood                   1420.           4294.            6.22             1447
## 2 SF                      1910.           6297.            5.76             1776
## # … with 2 more variables: nCount_RNA_Median <dbl>, percent.mt_Median <dbl>

There are differences between donor samples as well as across Blood and SF groups, but the values are on the same range and comparable.

Plot the relationship between nCount_RNA and percent.mt or nFeature_RNA below.

## QC: scatter plots

# Plot
qc_scatter1 <- FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "percent.mt")
qc_scatter2 <- FeatureScatter(seu, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

# Save
pdf(paste(res_dirs[1], "qc_scatter_plots.pdf", sep="/"), width=8, height=4)
qc_scatter1 + qc_scatter2
dev.off()

## png 
##   2

# Print
print((qc_scatter1 + qc_scatter2))

The scatter plot of nCount_RNA x percent.mt shows no relationship/dependence as expected, whereas the plot nCount_RNA x nFeature_RNA shows a positive relationship (highly correlated - 0.92), i.e., cells sequenced more deeply are expressing more distinct genes.

Filter cells

(1 min)

Filtering bad quality cells based on their properties, i.e., nCount_RNA, nFeature_RNA or/and percent.mt, and low abundant genes will not be performed because the authors removed this already. Although the code below is an example that you could use to filter the bad quality cells based on their properties.

## Filter bad-quality cells
#seu <- subset(seu, subset = nFeature_RNA > 1000 & nFeature_RNA < 4000 & percent.mt < 10) # This would select cells expressing more than 1K distinct genes and lower than 4K as well as cells expressing less than 10% mitochondrial genes.

Low abundant genes need to be filtered when creating the Seurat object (CreateSeuratObject(..., min.cells=<genes exp. at least in this no. of cells>)).

Select samples

(3 min)

Since we’ve some computing limitations, we’ll remove six samples: 65, 65B (all from batch 0), A7, A7B (from batch 1), 45, 45B (batch 2).

## Select samples
samps <- unique(as.character(seu@meta.data$orig.ident))
samps2sel <- samps[!(samps %in% c("45", "45B", "65", "65B", "A7", "A7B"))]
seu <- subset(seu, subset = orig.ident %in% samps2sel)
seu@meta.data$orig.ident <- factor(as.character(seu@meta.data$orig.ident), levels=samps2sel)

Normalization

(5 min)

First, the data is normalized to reduce the impact of distinct sequencing depth across the cells to make them comparable. There are several normalization methods (e.g. SCTransform).

Here, we’ll focus on the most common used method: log-normalization (NormalizeData(..., normalization.method="LogNormalize")). This method consists in dividing the expression values by the library size (=total no. UMIs or nCount_RNA) for each respective cell. In other words getting the relative abundance of gene expression by cell. Then, the values obtained are multiplied by a factor, usually 10,000 (scale.factor=10000). Finally this result is log1p transformed.

## Log-normalization
seu <- NormalizeData(seu, normalization.method="LogNormalize", scale.factor=10000)

# Inspect the result: first 10 genes x 10 cells
seu@assays$RNA@data[1:10,1:10]

## 10 x 10 sparse Matrix of class "dgCMatrix"
##                                                                            
## AL627309.1 . .         .         . .        .         .        . .        .
## AL669831.5 . .         .         . .        0.4009989 .        . .        .
## FAM87B     . .         .         . .        .         .        . .        .
## LINC00115  . .         .         . .        .         .        . .        .
## FAM41C     . .         .         . .        .         .        . .        .
## NOC2L      . 0.9993235 0.6851444 . .        .         1.192614 . .        .
## KLHL17     . .         .         . .        .         .        . .        .
## PLEKHN1    . .         .         . .        .         .        . .        .
## HES4       . .         .         . .        .         .        . .        .
## ISG15      . .         1.7783857 . 2.012912 .         .        . 1.296671 .

HVG

(5 min)

Highly variable genes or features (HVG or HVF) are a set of genes/features with a high variance across the cells. In other terms, genes that change their expression level a lot across cells. Why are we interested on these? Take the example below.

We know that roughly some subsets of T cells can be classified in CD4+ or CD8+. Let’s plot these two features one against other.

## CD4+ vs CD8+
FeatureScatter(seu, feature1="CD8A", feature2="CD4", group.by="cell.types.orig")

What do you think? Are these two genes important or not?

Now look into two genes with very low variance: PPP1R36 and CD177.

## PPP1R36 vs CD177
FeatureScatter(seu, feature1="PPP1R36", feature2="CD177", group.by="cell.types.orig")

What do you think about these two genes? Are they informative to distinguish the different cell types observed?

Now that you understand why we’re using HVG, let’s determine the top 2000 (HVG)! This value can be higher or lower. The plot below will help you to decide, but the top 2K HVG is quite often choosen.

## Determine HVG
seu <- FindVariableFeatures(seu, selection.method="vst", nfeatures=2000)

# Get the top 10 HVG
top10_hvg <- head(VariableFeatures(seu), 10)

# Plot HVG
hvg_plot1 <- VariableFeaturePlot(seu)
hvg_plot2 <- LabelPoints(plot=hvg_plot1, points=top10_hvg, repel=TRUE)

# Save
pdf(paste(res_dirs[1], "hvg_plots.pdf", sep="/"), width=12, height=6)
print((hvg_plot1 + hvg_plot2))
dev.off()

## png 
##   2

# Print
print((hvg_plot1 + hvg_plot2))

Question: Do you recognize any of the top 20 HVG? (1 min)

Tip: use the function head() with VariableFeatures() to retrieve the top 20 HVG (the top 10 also appeared highlighted in the plot above).

Answer

The top 20 HVG are: JCHAIN, CXCL10, IGKC, IGHA1, CCL2, SPP1, IGLV2-23, CST3, CCL8, CXCL13, IGLV2-8, IGKV1D-16, S100B, HSPA6, HSPA1A, PPBP, FSCN1, C1QB, HLA-DQA1, IGKV1-27.

Cytokines such as CCL2 or CCL8.

Scaling

(3 min)

Now that we’ve our set of highly informative genes, we’ll perform scaling. Scaling consists in standardizing the genes/features to make them comparable and avoid the highly abundant ones of standing out which is extremely important for downstream analysis such as Principal Component Analysis. There are several methods. The one that we’ll use is the Z-score which consists in subtracting to a gene its mean and divide it by its standard deviation (SD) in order to obtain mean 0 and SD of 1.

Let’s scaled our scRNA-seq dataset.

## Scaling
seu <- ScaleData(seu)

## Print the first 10 rows x 10 cols
seu@assays$RNA@scale.data[1:10,1:10]

##          62_AAACCTGAGCGTGAGT 62_AAACCTGAGTATGACA 62_AAACCTGGTAATAGCA
## HES4             -0.13065564         -0.13065564         -0.13065564
## ISG15            -0.67572425         -0.67572425          1.36133826
## TNFRSF18         -0.25452320         -0.25452320         -0.25452320
## TNFRSF4           2.28553326         -0.24922695          4.57493715
## ACOT7            -0.23131895         -0.23131895         -0.23131895
## TNFRSF9          -0.13428611         -0.13428611         -0.13428611
## ENO1             -1.19726288          1.20329131          0.16332666
## GPR157           -0.08960485         -0.08960485         -0.08960485
## RBP7             -0.18559818         -0.18559818         -0.18559818
## PGD              -0.45555101          1.22870799         -0.45555101
##          62_AAACGGGAGATGGGTC 62_AAACGGGTCAGGATCT 62_AAAGATGAGAGTAATC
## HES4             -0.13065564         -0.13065564          -0.1306556
## ISG15            -0.67572425          1.62997819          -0.6757242
## TNFRSF18         -0.25452320         -0.25452320          -0.2545232
## TNFRSF4          -0.24922695         -0.24922695          -0.2492270
## ACOT7            -0.23131895         -0.23131895          -0.2313189
## TNFRSF9          -0.13428611         -0.13428611           4.8553534
## ENO1              1.62735289          1.04680414           1.2507066
## GPR157           -0.08960485         -0.08960485          10.0000000
## RBP7             -0.18559818         -0.18559818          -0.1855982
## PGD              -0.45555101          1.48452581           0.7013749
##          62_AAAGATGCAGGGAGAG 62_AAAGATGTCGGCCGAT 62_AAAGCAAAGACACTAA
## HES4             -0.13065564         -0.13065564         -0.13065564
## ISG15            -0.67572425         -0.67572425          0.80955508
## TNFRSF18         -0.25452320         -0.25452320         -0.25452320
## TNFRSF4          -0.24922695         -0.24922695          5.22879325
## ACOT7            -0.23131895         -0.23131895          2.70599623
## TNFRSF9          -0.13428611         -0.13428611         -0.13428611
## ENO1              0.84740685          0.81743760          0.62719929
## GPR157           -0.08960485         -0.08960485         -0.08960485
## RBP7             -0.18559818         -0.18559818         -0.18559818
## PGD              -0.45555101          1.51390688          1.72985686
##          62_AAAGCAAAGGATGCGT
## HES4             -0.13065564
## ISG15            -0.67572425
## TNFRSF18          2.51614411
## TNFRSF4           4.64515580
## ACOT7            -0.23131895
## TNFRSF9          -0.13428611
## ENO1              0.40943849
## GPR157           -0.08960485
## RBP7             -0.18559818
## PGD              -0.45555101

During scaling it is possible to regress out undesirable variables, such as nCount_RNA or percent.mt. With this regression you can get the residuals and remove the unwanted variation caused by unwanted variables (such as the ones mentioned). Here we’ll ignore this, but this could have been done above by providing the option: vars.to.regress="percent.mt" (to regress out mitochondrial genes).

DR

PCA

(5 min)

Principal Component Analysis (PCA) is a deterministic and linear dimensional reduction method that provides the basis for other analyses. The PCA aims to reduce the dimensionality of high-dimensional data, as it is the case of scRNA-seq, maintaining most of the variation present in the data across a few dozens of components.

Run it below (it will use the scaled data obtained above).

## PCA
seu <- RunPCA(seu)

Often the first few PCs (Principal Components) show the variation caused by the total no. of UMIs or percentage of mitochondrial genes, cell cycling genes (cell division), rather than cell type. Let’s plot the PCA and highlight the cells by nCount_RNA, nFeature_RNA, percent.mt.

## PCA: cell features
pca_features <- FeaturePlot(seu, reduction = "pca", features=c("nCount_RNA", "nFeature_RNA", "percent.mt"))

# Save 
pdf(paste(res_dirs[1], "pca_qc_feature_plots.pdf", sep="/"))
print(pca_features)
dev.off()

## png 
##   2

# Plot
print(pca_features)

By default we computed the first 50 PCs. Again, not all of these PCs are informative. Some of them comprise very low variation.

In order to decide the top most important PCs, i.e., the ones that hold more data variance, we’ll plot below an elbow plot. As the name suggests the aim is to find the elbow in the plot, i.e., the point where there is a drastic reduction of variance, and select these first PCs with more variance, and, thus more informative which we’ll use downstream.

## Elbow plot
elbow_plot <- ElbowPlot(seu, ndims=50)

# Save
pdf(paste(res_dirs[1], "elbow_plot.pdf", sep="/"))
print(elbow_plot)
dev.off()

## png 
##   2

# Print
print(elbow_plot)

Question: How many principal components (PCs) do you think we should select for downstream analyses (e.g., clustering, UMAP)? (1 min)

Answer

Let’s use 30. Although other values would be equally valid such as 10, 20 or even 40.

tSNE & UMAP

(15 min)

PCA is the basis for other methods. Although it reduces the high-dimensional data to low-dimensional space, usually is not a good method to visualize the cell populations or clusters. For this purpose there are better methods such as non-linear dimensional reduction methods like tSNE (t-distributed Stochastic Neighbor Embedding) and UMAP (Uniform Manifold Approximation and Projection). UMAP is a much faster method compared to tSNE and it claims to preserve the local and most of the global structure, whereas tSNE only preserves the local structure.

Run only UMAP below (you can attempt to run tSNE but it takes more time).

## tSNE & UMAP

# tSNE
#seu <- RunTSNE(seu, dims=1:30, perplexity=30)

# UMAP
seu <- RunUMAP(seu, dims=1:30)

Let’s explore potential batch effects such as the donor (=orig.ident), batch (=batch.orig), control (=control.orig) and group (=GROUP.orig) categorical variables.

## Plot variables (potential batches as well as others)
vars2plot <- c("cell.types.orig", "newClusterID.orig", 
           "orig.ident", "control.orig", "GROUP.orig", "batch.orig")
dr_plots[["tsne"]] <- dr_plots[["umap"]] <- dr_plots <- list()
for (v in vars2plot) {
    for (mth in c("tsne", "umap")) {
      if (mth %in% names(seu@reductions)) {
                dr_plots[[mth]][[v]] <- DimPlot(seu, reduction=mth, group.by=v, label=TRUE) + 
                  ggtitle(v)
      }
    }
}

# Save tSNE & UMAP 
# pdf(paste(res_dirs[1], "unintegrated_vars_tsne_plots.pdf", sep="/"), width=24, height=16)
# cowplot::plot_grid(plotlist=dr_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "unintegrated_vars_umap_plots.pdf", sep="/"), width=24, height=16)
cowplot::plot_grid(plotlist=dr_plots[["umap"]], ncol=3)
dev.off()

## png 
##   2

# Print
# cowplot::plot_grid(plotlist=dr_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_plots[["umap"]], ncol=3)

TASK 1: Discuss within your group if there is any batch in the data influencing the global cell structure. Some tips: look into the dimensional reduction plots and compare the projections between the technical variables and the cell.types.orig.

In the next sections we’ll cluster the data without performing any integration. Then, we’ll integrate the data using the same variable that the authors used, but with two different integration methods. Afterwards, we’ll compare the three results with the ground-truth cell-types provided by authors to assess which was the best method.

Clustering

(5 min)

Clustering in Seurat is done in two steps:

building a clustering based graph (SNN): FindNeighbors()
find communities/populations (Louvain): FindClusters()

The second step depends on the resolution value given. Here, we’ll use a resolution value of 0.5 which seems to had been what the authors used based on the @meta.data column name RNA_snn_res.0.5.orig (this is the column name that Seurat gives automatically and the last value corresponds to the resolution value used to cluster).

## Clustering: unintegrated data
seu <- FindNeighbors(seu, dims=1:30)
seu <- FindClusters(seu, resolution=0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 46663
## Number of edges: 1659433
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9317
## Number of communities: 25
## Elapsed time: 10 seconds

Now let’s compare the new clustering result with the unintegrated data with the original clustering result below.

## Unintegrated clustering DR plots 
vars2plot <- c("cell.types.orig", "newClusterID.orig", "seurat_clusters")
dr_clts_plots[["tsne"]] <- dr_clts_plots[["umap"]] <- dr_clts_plots <- list()
for (v in vars2plot) {
    for (mth in c("tsne", "umap")) {
      if (mth %in% names(seu@reductions)) {
        dr_clts_plots[[mth]][[v]] <- DimPlot(seu, reduction=mth, group.by=v, label=TRUE) + 
          ggtitle(v)
        }
    }
}

# Save tSNE & UMAP 
# pdf(paste(res_dirs[1], "unintegrated_clts_comp_tsne_plots.pdf", sep="/"), width=24, height=10)
# cowplot::plot_grid(plotlist=dr_clts_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "unintegrated_clts_comp_umap_plots.pdf", sep="/"), width=24, height=10)
cowplot::plot_grid(plotlist=dr_clts_plots[["umap"]], ncol=3)
dev.off()

## png 
##   2

# Print
#cowplot::plot_grid(plotlist=dr_clts_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_clts_plots[["umap"]], ncol=3)

Does the new unintegrated clustering result looks similar to the original one, i.e., newClusterID.orig?

Integration

(3 min)

The data will be integrated by the batch variable present in the @meta.data table: batch.orig.

batch.orig comprises 4 categories/factor levels: 0, 1, 2, 3. Each batch comprises a different set of samples.

batch 0: 65, 65B (removed above)
batch 1: A7, A7B (removed), A13, A13B, A40, A40B
batch 2: 45, 45B (removed), A50, A50B, A56, A56B
batch 3: 62, 62B, 63, 63B, 76, 76B

The authors used the software Harmony to integrate the data and these batches appeared described in their github repository with the code used for this analysis: Stnhy1/AR. The batches are defined in the script addBatchInfo.R use later on run-harmony.R.

Seurat RPCA

(30 min)

Now let’s perform integration with Seurat. Seurat provides different methods for integration. Here, since we’re working with a relatively large data set, we’ll use the fast integration reciprocal PCA (RPCA) method. This page discusses the difference between the RPCA method and the ‘traditional’ CCA based method. In sum, RPCA is faster and useful when the data sets being integrated are not expected to share the majority of cell types. In this case, we have several batches, with samples coming from blood and synovial fluid. Therefore, the RPCA seems to be more suitable for our data set.

To perform integration using the RPCA Seurat method we start by normalizing (NormalizeData()), finding HVG shared across data sets (by FindVariableFeatures() and later SelectIntegrationFeatures()), scaling (ScaleData()) and performing PCA (RunPCA()) independently for each data set to be integrated. Then it comes the integration part starting by finding anchors (FindIntegrationAnchors()) and integrating the data sets (IntegrateData()).

## Integration - Seurat fast RPCA

# Split Seurat object by batch samples to perform integration
seu_ls <- SplitObject(seu, split.by = "batch.orig")

# Log-normalize & find HVG across donor batch data sets 
set.seed(1024)
seu_ls <- lapply(X = seu_ls, FUN = function(x) {
             x <- NormalizeData(x)
             x <- FindVariableFeatures(x, selection.method = "vst", nfeatures = 2000)
           })

# Select robust & shared HVG across datasets 
features <- SelectIntegrationFeatures(object.list = seu_ls)
seu_ls <- lapply(X = seu_ls, FUN = function(x) {
             x <- ScaleData(x, features = features, verbose = FALSE)
             x <- RunPCA(x, features = features, verbose = FALSE)
           })

# Find integration anchors
anchors <- FindIntegrationAnchors(object.list=seu_ls, 
                                  anchor.features = features, 
                                  reduction = "rpca")

# Integrate data sets
int <- IntegrateData(anchorset=anchors)

# Change the default assay to 'integrated'
DefaultAssay(int) <- "integrated"

# Perform DR
int <- ScaleData(int)
int <- RunPCA(int, npcs=50)
#int <- RunTSNE(int, dims=1:30, perplexity=30)
int <- RunUMAP(int, reduction="pca", dims=1:30)

Inspect the DR plots below.

## Plot variables after integration (potential batches as well as others)
vars2plot <- c("cell.types.orig", "orig.ident", "newClusterID.orig", "seurat_clusters")
dr_int_plots[["tsne"]] <- dr_int_plots[["umap"]] <- dr_int_plots <- list()
for (v in vars2plot) {
    for (mth in c("tsne", "umap")) {
          if (mth %in% names(seu@reductions)) {
            dr_int_plots[[mth]][[v]] <- DimPlot(int, reduction=mth, group.by=v, label=TRUE) +
              ggtitle(v)
        }
    }
}

# Save tSNE & UMAP 
# pdf(paste(res_dirs[1], "int_seurat_vars_tsne_plots.pdf", sep="/"), width=24, height=16)
# cowplot::plot_grid(plotlist=dr_int_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "int_seurat_vars_umap_plots.pdf", sep="/"), width=24, height=16)
cowplot::plot_grid(plotlist=dr_int_plots[["umap"]], ncol=3)
dev.off()

## png 
##   2

# Print
#cowplot::plot_grid(plotlist=dr_int_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_int_plots[["umap"]], ncol=3)

Let’s cluster the result below and see how that looks like.

## Clustering: integrated data
# rename the previous unintegrated cluster 
if ("seurat_clusters" %in% colnames(int@meta.data)) colnames(int@meta.data)[which(colnames(int@meta.data)=="seurat_clusters")] <- "unint_seurat_clusters"
set.seed(1024)
int <- FindNeighbors(int, dims=1:30)
int <- FindClusters(int, resolution=0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 46663
## Number of edges: 1703879
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9346
## Number of communities: 25
## Elapsed time: 11 seconds

## Integrated clustering DR plots 
vars2plot <- c("cell.types.orig", "newClusterID.orig", "seurat_clusters")
dr_clts_seu_plots[["tsne"]] <- dr_clts_seu_plots[["umap"]] <- dr_clts_seu_plots <- list()
for (v in vars2plot) {
    for (mth in c("tsne", "umap")) {
      if (mth %in% names(seu@reductions)) {
        dr_clts_seu_plots[[mth]][[v]] <- DimPlot(int, reduction=mth, group.by=v, label=TRUE) + 
          ggtitle(v)
      }
    }
}

# Save tSNE & UMAP 
# pdf(paste(res_dirs[1], "int_seurat_clts_comp_tsne_plots.pdf", sep="/"), width=24, height=10)
# cowplot::plot_grid(plotlist=dr_clts_seu_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "int_seurat_clts_comp_umap_plots.pdf", sep="/"), width=24, height=10)
cowplot::plot_grid(plotlist=dr_clts_seu_plots[["umap"]], ncol=3)
dev.off()

## png 
##   2

# Print
#cowplot::plot_grid(plotlist=dr_clts_seu_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_clts_seu_plots[["umap"]], ncol=3)

Scanorama

(35 min)

Scanorama is a python package developed for integration based on panorama stitching (read more about - Hie et al., 2019).

Contrary to Seurat RPCA integration method above, where we used the integrated corrected gene expression matrix (at int@assays$integrated@data) to scale the data, run the PCA and cluster, here we’ll use the corrected joint embedding (saved as PCA) for dimensional reduction and clustering. The corrected gene expression matrix of Scanorama is saved at int_sca@assays$integrated@counts.

We’re using the corrected joint embedding for downstream analyses, such as DR & clustering, because in an independent benchmark comparison of several integration methods performed by Luecken et al., 2021 this was the batch-corrected/integrated result that worked better. Although you can also use the integrated corrected gene expression matrix for the same purpose.

It was created a wrapper function scanorama_int() that calls and runs Scanorama from R taking as input a list of Seurat data sets to integrate (see how to run it directly - page).

## Integration: Scanorama
# Run integration
set.seed(1024)
int_sca <- scanorama_int(seu_ls)

# DR
#int_sca <- RunTSNE(int_sca, dims=1:30, perplexity=30)
int_sca <- RunUMAP(int_sca, reduction="pca", dims=1:30)

# Cluster the data
int_sca <- FindNeighbors(int_sca, dims=1:30)
int_sca <- FindClusters(int_sca, resolution=0.5)

## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 46663
## Number of edges: 1596807
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9390
## Number of communities: 29
## Elapsed time: 9 seconds

Let’s look into the results.

## Plot variables after integration (potential batches as well as others)
vars2plot <- c("cell.types.orig", "orig.ident", "newClusterID.orig", "seurat_clusters")
dr_int_sca_plots[["tsne"]] <- dr_int_sca_plots[["umap"]] <- dr_int_sca_plots <- list()
for (v in vars2plot) {
    for (mth in c("tsne", "umap")) {
      if (mth %in% names(seu@reductions)) {
        dr_int_sca_plots[[mth]][[v]] <- DimPlot(int_sca, reduction=mth, group.by=v, label=TRUE) + 
          ggtitle(v)
      }
    }
}

# Save tSNE & UMAP 
# pdf(paste(res_dirs[1], "int_scanorama_vars_tsne_plots.pdf", sep="/"), width=24, height=16)
# cowplot::plot_grid(plotlist=dr_int_sca_plots[["tsne"]], ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "int_scanorama_vars_umap_plots.pdf", sep="/"), width=24, height=16)
cowplot::plot_grid(plotlist=dr_int_sca_plots[["umap"]], ncol=3)
dev.off()

## png 
##   2

# Print
#cowplot::plot_grid(plotlist=dr_int_sca_plots[["tsne"]], ncol=3)
cowplot::plot_grid(plotlist=dr_int_sca_plots[["umap"]], ncol=3)

TASK 2: Discuss within your group if the projections and, particularly, the clustering of integrated data with Seurat and Scanorama, look better now after integrate the datasets by ‘batch.ident’. Some tips: look into the comparison results in the last section.

Comparison

(15 min)

Now, let’s compare the three clustering results with the ground-truth, i.e., newClusterID (which correspond to the cell types annotated): unintegrated, Seurat integrated, Scanorama integrated.

Start by plotting the dimensional reduction plots below:

## Clustering comparison: DR plots
# pdf(paste(res_dirs[1], "clts_comp_all_tsne_plots.pdf", sep="/"), width=24, height=24)
# cowplot::plot_grid(plotlist=list(dr_clts_plots[["tsne"]][[1]], dr_clts_plots[["tsne"]][[2]], dr_clts_plots[["tsne"]][[3]],
#                dr_clts_seu_plots[["tsne"]][[1]], dr_clts_seu_plots[["tsne"]][[2]], dr_clts_seu_plots[["tsne"]][[3]],
#                dr_int_sca_plots[["tsne"]][[1]],dr_int_sca_plots[["tsne"]][[3]],dr_int_sca_plots[["tsne"]][[4]]
#                ), ncol=3)
# dev.off()
pdf(paste(res_dirs[1], "clts_comp_all_umap_plots.pdf", sep="/"), width=24, height=24)
cowplot::plot_grid(plotlist=list(dr_clts_plots[["umap"]][[1]], dr_clts_plots[["umap"]][[2]], dr_clts_plots[["umap"]][[3]],
                 dr_clts_seu_plots[["umap"]][[1]], dr_clts_seu_plots[["umap"]][[2]], dr_clts_seu_plots[["umap"]][[3]],
                 dr_int_sca_plots[["umap"]][[1]],dr_int_sca_plots[["umap"]][[3]],dr_int_sca_plots[["umap"]][[4]]
                 ), ncol=3)
dev.off()

## png 
##   2

# Print
# cowplot::plot_grid(plotlist=list(dr_clts_plots[["tsne"]][[1]], dr_clts_plots[["tsne"]][[2]], dr_clts_plots[["tsne"]][[3]],
#                dr_clts_seu_plots[["tsne"]][[1]], dr_clts_seu_plots[["tsne"]][[2]], dr_clts_seu_plots[["tsne"]][[3]],
#                dr_int_sca_plots[["tsne"]][[1]],dr_int_sca_plots[["tsne"]][[3]],dr_int_sca_plots[["tsne"]][[4]]
#                ), ncol=3)
cowplot::plot_grid(plotlist=list(dr_clts_plots[["umap"]][[1]], dr_clts_plots[["umap"]][[2]], dr_clts_plots[["umap"]][[3]],
                 dr_clts_seu_plots[["umap"]][[1]], dr_clts_seu_plots[["umap"]][[2]], dr_clts_seu_plots[["umap"]][[3]],
                 dr_int_sca_plots[["umap"]][[1]],dr_int_sca_plots[["umap"]][[3]],dr_int_sca_plots[["umap"]][[4]]
                 ), ncol=3)

In order to get a more concrete metric, let’s compare the clusterings obtained by us with the ground-truth provided by authors, i.e., newClusterID.orig and/or cell.types.orig. For this purpose we can check a confusion matrix and use the Adjusted Rand Index - ARI to compare two clustering results.

Let’s create a confusion matrix for every comparison, where we’ll compare the clustering results obtained here with the cell.types.orig.

## Clustering comparison - confusion matrix
clts2celltype <- list()
clts2celltype[["orig"]] <- data.frame(rbind(table(seu@meta.data$newClusterID.orig, seu@meta.data$cell.types.orig))) 
clts2celltype[["unint"]] <- data.frame(rbind(table(seu@meta.data$seurat_clusters, seu@meta.data$cell.types.orig)))
clts2celltype[["int_seu"]] <- data.frame(rbind(table(int@meta.data$seurat_clusters, int@meta.data$cell.types.orig)))
clts2celltype[["int_sca"]] <- data.frame(rbind(table(int_sca@meta.data$seurat_clusters, int_sca@meta.data$cell.types.orig)))

# save table
for (type in names(clts2celltype)) {
    write.table(x=cbind("Clusters"=row.names(clts2celltype[[type]]), clts2celltype[[type]]), 
            file=paste(res_dirs[2], paste0("conf_mtx_", type, "_clusters_vs_celltype.tsv"), sep="/"), 
            row.names=FALSE, quote=FALSE, sep="\t")
}

# Print heatmaps of the 4 confusion matrices
h1 <- Heatmap(t(clts2celltype[[1]]), name="No. cells", column_title="original", cluster_rows=FALSE, cluster_columns=FALSE)
h2 <- Heatmap(t(clts2celltype[[2]]), name="No. cells", column_title="unintegrated", 
          cluster_rows=FALSE, cluster_columns=FALSE)
h3 <- Heatmap(t(clts2celltype[[3]]), name="No. cells", column_title="Seurat", cluster_rows=FALSE, cluster_columns=FALSE)
h4 <- Heatmap(t(clts2celltype[[4]]), name="No. cells", column_title="Scanorama", cluster_rows=FALSE, cluster_columns=FALSE)
pdf(paste(res_dirs[1], "heatmap_conf_mtx_comp.pdf", sep="/"), width=18)
(h1+h2+h3+h4)
dev.off()

## png 
##   2

print((h1+h2+h3+h4))

# print clusters to cell types
knitr::kable(clts2celltype[["orig"]])

	Activated.CD4	Activated.CD8	CD16..NK.and.NKT	CD16..NK.and.NKT.1	CD38..CD8.T	CD38..CD8.T.1	Classical.monocytes	CX3CR1hi.Effector.CD8	CX3CR1lo.Effector.CD8	Cycling.T	Effector.CD8	Gamma.Delta.T	mDC	Megakaryocytes	Memory.B	Naive.B	Naive.T	Neutrophil	Non.classical.monocytes	pDC	Plasmablasts	Synovial.cells	Treg
1	0	0	0	0	0	0	0	0	0	1015	0	0	0	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	796
3	0	0	0	0	0	0	0	0	0	0	5139	0	0	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	0	0	0	0	0	0	767	0	0	0	0	0	0	0	0	0	0	0
5	3630	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
6	0	2180	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	5593	0	0	0	0	0	0
8	0	0	0	0	0	0	0	4824	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
9	0	0	0	0	0	0	0	0	393	0	0	0	0	0	0	0	0	0	0	0	0	0	0
10	0	0	0	0	241	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
11	0	0	0	0	0	415	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
12	0	0	0	6013	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
13	0	0	728	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
14	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1565	0	0	0	0	0	0	0	0
15	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	936	0	0	0	0	0	0	0
16	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	200	0	0
18	0	0	0	0	0	0	2952	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
19	0	0	0	0	0	0	1317	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0
20	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	947	0	0	0	0
21	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	2206	0	0	0	0	0
22	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	982	0	0	0	0	0
23	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	188	0	0	0	0	0
24	0	0	0	0	0	0	0	0	0	0	0	0	38	0	0	0	0	0	0	0	0	0	0
25	0	0	0	0	0	0	0	0	0	0	0	0	1922	0	0	0	0	0	0	0	0	0	0
26	0	0	0	0	0	0	0	0	0	0	0	0	159	0	0	0	0	0	0	0	0	0	0
27	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	158	0	0	0
28	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	269	0	0	0
29	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	96	0	0	0
30	0	0	0	0	0	0	0	0	0	0	0	0	0	939	0	0	0	0	0	0	0	0	0
31	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	55	0

Above we’ve just printed the confusion matrix comparing the original cluster labels versus cell types given by the authors. To make the results more interpretable, the confusion matrices were printed as heatmaps.

You can open the other matrices if you want, but the ARI metric below will give us a more concrete value.

Question: Which original renamed cluster(s) corresponds to Tregs? (1 min)

Answer

Tregs (n=796) belong to cluster 2.

Print the R chunk code below to get the ARI metric.

## Clustering comparison - ARI
true_labels <- list("unint"=seu@meta.data$cell.types.orig, 
            "int_seu"=int@meta.data$cell.types.orig, 
            "int_sca"=int_sca@meta.data$cell.types.orig)
new_clts <- list("unint"=seu@meta.data$seurat_clusters, 
         "int_seu"=int@meta.data$seurat_clusters, 
         "int_sca"=int_sca@meta.data$seurat_clusters)
comp <- names(new_clts)
lapply(setNames(comp, comp), function(x) clustComp(true_labels[[x]], new_clts[[x]])[["ARI"]]) %>% 
    unlist()

##     unint   int_seu   int_sca 
## 0.5915921 0.6302659 0.5960452

TASK 3: Discuss within your group which result reproduces better the cell types/clusters found by the authors.

## Save objects
res_objs <- "results/GSE173303/objects"
if(!dir.exists(res_objs)) dir.create(res_objs, recursive=TRUE)
saveRDS(seu, paste(res_objs, "seu.rds", sep="/"))
saveRDS(int, paste(res_objs, "int.rds", sep="/"))
saveRDS(int_sca, paste(res_objs, "int_sca.rds", sep="/"))

R packages used and respective versions

## R packages and versions used in these analyses

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-conda-linux-gnu (64-bit)
## Running under: CentOS Linux 7 (Core)
## 
## Matrix products: default
## BLAS/LAPACK: /wrk/local2/B21027_scRNAseq_signatures/070622_embl_ebi_tcell_course/EMBL_EBI_scRNA_bioinformatics_Tcell_course_2022/workflow/.snakemake/conda/2b532403d31becf1b5412328473ca837/lib/libopenblasp-r0.3.20.so
## 
## locale:
##  [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=fi_FI.UTF-8        LC_COLLATE=en_GB.UTF-8    
##  [5] LC_MONETARY=fi_FI.UTF-8    LC_MESSAGES=en_GB.UTF-8   
##  [7] LC_PAPER=fi_FI.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] reticulate_1.25       aricode_1.0.0         ComplexHeatmap_2.10.0
## [4] sp_1.5-0              SeuratObject_4.1.0    Seurat_4.1.1         
## [7] ggplot2_3.3.6         dplyr_1.0.9          
## 
## loaded via a namespace (and not attached):
##   [1] circlize_0.4.15       plyr_1.8.7            igraph_1.3.2         
##   [4] lazyeval_0.2.2        splines_4.1.1         listenv_0.8.0        
##   [7] scattermore_0.8       digest_0.6.29         foreach_1.5.2        
##  [10] htmltools_0.5.2       fansi_1.0.3           magrittr_2.0.3       
##  [13] tensor_1.5            cluster_2.1.3         doParallel_1.0.17    
##  [16] ROCR_1.0-11           globals_0.15.0        matrixStats_0.62.0   
##  [19] spatstat.sparse_2.1-1 colorspace_2.0-3      rappdirs_0.3.3       
##  [22] ggrepel_0.9.1         xfun_0.31             crayon_1.5.1         
##  [25] jsonlite_1.8.0        progressr_0.10.1      spatstat.data_2.2-0  
##  [28] survival_3.3-1        zoo_1.8-10            iterators_1.0.14     
##  [31] glue_1.6.2            polyclip_1.10-0       gtable_0.3.0         
##  [34] leiden_0.4.2          GetoptLong_1.0.5      future.apply_1.9.0   
##  [37] shape_1.4.6           BiocGenerics_0.40.0   abind_1.4-5          
##  [40] scales_1.2.0          DBI_1.1.3             spatstat.random_2.2-0
##  [43] miniUI_0.1.1.1        Rcpp_1.0.8.3          viridisLite_0.4.0    
##  [46] xtable_1.8-4          clue_0.3-60           spatstat.core_2.4-4  
##  [49] klippy_0.0.0.9500     stats4_4.1.1          htmlwidgets_1.5.4    
##  [52] httr_1.4.3            RColorBrewer_1.1-3    ellipsis_0.3.2       
##  [55] ica_1.0-2             pkgconfig_2.0.3       farver_2.1.0         
##  [58] sass_0.4.1            uwot_0.1.11           deldir_1.0-6         
##  [61] utf8_1.2.2            here_1.0.1            labeling_0.4.2       
##  [64] tidyselect_1.1.2      rlang_1.0.2           reshape2_1.4.4       
##  [67] later_1.3.0           munsell_0.5.0         tools_4.1.1          
##  [70] cli_3.3.0             generics_0.1.2        ggridges_0.5.3       
##  [73] evaluate_0.15         stringr_1.4.0         fastmap_1.1.0        
##  [76] yaml_2.3.5            goftest_1.2-3         knitr_1.39           
##  [79] fitdistrplus_1.1-8    purrr_0.3.4           RANN_2.6.1           
##  [82] pbapply_1.5-0         future_1.26.1         nlme_3.1-157         
##  [85] mime_0.12             compiler_4.1.1        plotly_4.10.0        
##  [88] png_0.1-7             spatstat.utils_2.3-1  tibble_3.1.7         
##  [91] bslib_0.3.1           stringi_1.7.6         highr_0.9            
##  [94] RSpectra_0.16-1       rgeos_0.5-9           lattice_0.20-45      
##  [97] Matrix_1.4-1          vctrs_0.4.1           pillar_1.7.0         
## [100] lifecycle_1.0.1       spatstat.geom_2.4-0   lmtest_0.9-40        
## [103] jquerylib_0.1.4       GlobalOptions_0.1.2   RcppAnnoy_0.0.19     
## [106] data.table_1.14.2     cowplot_1.1.1         irlba_2.3.5          
## [109] httpuv_1.6.5          patchwork_1.1.1       R6_2.5.1             
## [112] promises_1.2.0.1      KernSmooth_2.23-20    gridExtra_2.3        
## [115] IRanges_2.28.0        parallelly_1.32.0     codetools_0.2-18     
## [118] MASS_7.3-57           assertthat_0.2.1      rprojroot_2.0.3      
## [121] rjson_0.2.21          withr_2.5.0           sctransform_0.3.3    
## [124] S4Vectors_0.32.3      mgcv_1.8-40           parallel_4.1.1       
## [127] rpart_4.1.16          tidyr_1.2.0           rmarkdown_2.11       
## [130] Rtsne_0.16            shiny_1.7.1

Integration: arthritis arAE under ICI therapy (Seurat & Scanorama)

António Sousa (e-mail: aggode@utu.fi)

07/07/2022

Notebook

Data

Download & Import Datasets

Seurat object

Metadata

QC

Viz

Filter cells

Select samples

Normalization

HVG

Scaling

DR

PCA

tSNE & UMAP

Clustering

Integration

Seurat RPCA

Scanorama

Comparison

R packages used and respective versions