## Import packages
library("dplyr") # data wrangling
library("ggplot2") # plotting
library("Seurat") # scRNA-seq analysis
library("ComplexHeatmap") # plot heatmaps
library("aricode") # ARI index clt comp
library("reticulate") # to work w/ python packages in R
scanorama <- import('scanorama') # import the Scanorama python package
source("scripts/helper_functions.R") # import functions adapted to automatize integration with scanorama
(7 min)
Run the R
chunk code below to import the table
data/GEO_GSE173303_project.tsv
which contains the
information to download the data sets that we’ll be using herein.
## Download datasets
# import table with datasets to download
data2download <- read.table("data/GEO_GSE173303_project.tsv",
sep="\t", header=TRUE)
Check below the content of the table imported.
# print table
knitr::kable(data2download)
filename | description | ftp_url |
---|---|---|
GSE173303_barcodes.tsv.gz | table of barcodes | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fbarcodes%2Etsv%2Egz |
GSE173303_features.tsv.gz | table of genes | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Ffeatures%2Etsv%2Egz |
GSE173303_matrix.mtx.gz | gene expression data | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fmatrix%2Emtx%2Egz |
GSE173303_subT_data_metadata.tsv.gz | cell metadata for sub T cells | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5FsubT%5Fdata%5Fmetadata%2Etsv%2Egz |
GSE173303_whole_data_metadata.tsv.gz | cell metadata for all | https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fwhole%5Fdata%5Fmetadata%2Etsv%2Egz |
Now use the information from the table above to download all the data from the GEO project GSE173303 (published by Kim et al., 2022). In this study the authors used peripheral blood and/or synovial fluid samples from 20 patients with arthritis-autoimmune related adverse events.
The GEO GSE173303 repository comprises two types of files:
10X Genomics Cell Ranger (droplet based scRNA-seq technology - read more):
GSE173303_barcodes.tsv.gz
(cell barcodes)
GSE173303_features.tsv.gz
(genes)
GSE173303_matrix.mtx.gz
(gene expression
values)
Metadata (data describing cell attributes):
GSE173303_whole_data_metadata.tsv.gz
(for
all)
GSE173303_subT_data_metadata.tsv.gz
(for T
cells)
The 10X data can be directly imported into Seurat
, a
R
package developed for the analysis of scRNA-seq data (page; Hao
et al., 2021). While the metadata will be imported separately as
tab-separated tables.
Since the function used to import 10X data into R
and
Seurat
, Read10X, looks
into a given directory and searches for the following file names,
barcodes.tsv
, genes.tsv
, and
matrix.mtx
(it allows compressed .gz
files),
it is more convenient to remove the prefix file name ‘GSE173303_’ from
every file.
## Download & import data
# create directory to save files
down_dir <- "data/GSE173303" # directory to save datasets
if (!dir.exists(down_dir)) dir.create(down_dir, recursive=TRUE) # create folder if doesn't exist
# loop over the table rows & download each dataset
for (f in 1:nrow(data2download)) {
down_filename <- data2download[f,"filename"]
down_filepath <- paste(down_dir, gsub("GSE173303_", "", down_filename),
sep="/")# download file to: 'data/GSE173303'
down_url <- data2download[f,"ftp_url"]
cat("Downloading file", down_filename, "to", down_filepath, "\n")
download.file(url=down_url, destfile=down_filepath)
}
## Downloading file GSE173303_barcodes.tsv.gz to data/GSE173303/barcodes.tsv.gz
## Downloading file GSE173303_features.tsv.gz to data/GSE173303/features.tsv.gz
## Downloading file GSE173303_matrix.mtx.gz to data/GSE173303/matrix.mtx.gz
## Downloading file GSE173303_subT_data_metadata.tsv.gz to data/GSE173303/subT_data_metadata.tsv.gz
## Downloading file GSE173303_whole_data_metadata.tsv.gz to data/GSE173303/whole_data_metadata.tsv.gz
# import gene expression 10X data into Seurat
data10x <- Read10X(data.dir=down_dir) # import 10x data as sparse matrix
seu <- CreateSeuratObject(counts=data10x, project="GSE173303") # convert gene exp. sparse matrix into Seurat class object
# import metadata
metadata <- list.files(down_dir, pattern="metadata", full.names=TRUE) # get path for metadata based on 'pattern' argument
names(metadata) <- gsub("_data_metadata.tsv.gz", "", basename(metadata)) # name the paths
metadata <- lapply(setNames(metadata, names(metadata)), function(x) {
read.table(x, header=TRUE, sep="\t", stringsAsFactors=FALSE)
}) # import tables into a list
(5 min)
The data was imported into R
as a sparse matrix (RAM
efficient - most of the entries are zero) with the Seurat
R
function data10x <- Read10X()
. Then, the
sparse matrix, where genes are rows and columns are cells,
data10x
was converted in a Seurat
class object
at
seu <- CreateSeuratObject(counts=data10x, project="GSE173303")
.
The object seu
will be used downstream to analyze the
scRNA-seq data.
Look into the structure of the Seurat class object by running the code below.
# Seurat object structure
str(seu)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
## ..@ assays :List of 1
## .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
## .. .. .. ..@ counts :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:142316713] 31 45 55 84 93 108 115 130 240 267 ...
## .. .. .. .. .. ..@ p : int [1:89718] 0 1024 2229 3957 5289 7330 8786 11293 14382 16908 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 21831 89717
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:21831] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
## .. .. .. .. .. .. ..$ : chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
## .. .. .. .. .. ..@ x : num [1:142316713] 1 1 1 5 1 1 2 1 1 16 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. .. .. .. .. ..@ i : int [1:142316713] 31 45 55 84 93 108 115 130 240 267 ...
## .. .. .. .. .. ..@ p : int [1:89718] 0 1024 2229 3957 5289 7330 8786 11293 14382 16908 ...
## .. .. .. .. .. ..@ Dim : int [1:2] 21831 89717
## .. .. .. .. .. ..@ Dimnames:List of 2
## .. .. .. .. .. .. ..$ : chr [1:21831] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
## .. .. .. .. .. .. ..$ : chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
## .. .. .. .. .. ..@ x : num [1:142316713] 1 1 1 5 1 1 2 1 1 16 ...
## .. .. .. .. .. ..@ factors : list()
## .. .. .. ..@ scale.data : num[0 , 0 ]
## .. .. .. ..@ key : chr "rna_"
## .. .. .. ..@ assay.orig : NULL
## .. .. .. ..@ var.features : logi(0)
## .. .. .. ..@ meta.features:'data.frame': 21831 obs. of 0 variables
## .. .. .. ..@ misc : list()
## ..@ meta.data :'data.frame': 89717 obs. of 3 variables:
## .. ..$ orig.ident : Factor w/ 16 levels "45","45B","62",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ nCount_RNA : num [1:89717] 2669 3727 4039 3123 6949 ...
## .. ..$ nFeature_RNA: int [1:89717] 1024 1205 1728 1332 2041 1456 2507 3089 2526 1623 ...
## ..@ active.assay: chr "RNA"
## ..@ active.ident: Factor w/ 16 levels "45","45B","62",..: 1 1 1 1 1 1 1 1 1 1 ...
## .. ..- attr(*, "names")= chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
## ..@ graphs : list()
## ..@ neighbors : list()
## ..@ reductions : list()
## ..@ images : list()
## ..@ project.name: chr "GSE173303"
## ..@ misc : list()
## ..@ version :Classes 'package_version', 'numeric_version' hidden list of 1
## .. ..$ : int [1:3] 4 1 0
## ..@ commands : list()
## ..@ tools : list()
An object of Seurat
class has several layers of
information. You don’t need to know all. Let’s focus on
@assays
and @meta.data
. The former stores the
gene expression data. Assays
are further divided into
slots
of data. Particularly important are the
slots
: @counts
(unnormalized/raw counts),
@data
(normalized data, e.g., log-noramlized with 10K
scaling factor), @scale.data
(scaled data, e.g.,
standardized by Z-score). The latter, i.e., @meta.data
stores the cell metadata (number of genes expressed in the cell, cluster
identity, etc).
The object seu
imported above will be called almost in
every function during the analysis and most of the results will be saved
directly into it. Seurat
provides functions to access the
data stored in its own class.
Read more about the structure of a Seurat
object in the
Seurat wiki.
Question: How many genes and cells compose the GEO GSE173303 dataset? (1 min)
Tip: type the object
seu
in the R console and interpret the message printed or provide the objectseu
to the functionsnrow()
andncol()
which count the no. of rows (=genes) and columns (=cells).
The GEO GSE173303 dataset comprises 21831 genes and 89717 cells.
Print below the gene expression of the first 10 genes across the
first 10 cells stored in the main assay, called RNA
by
default, in the slot @counts
.
## Inspect data
seu@assays$RNA@counts[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##
## AL627309.1 . . . . . . . . . .
## AL669831.5 . . . . . . . . . .
## FAM87B . . . . . . . . . .
## LINC00115 . . . . . . . . . .
## FAM41C . . . . . . . . . .
## NOC2L . . . . . . . . . .
## KLHL17 . . . . . . . . . .
## PLEKHN1 . . . . . . . . . .
## HES4 . . . . . . . . . .
## ISG15 . . . . . . 1 2 2 3
The gene expression counts stored in the slot @counts
are in sparse format. Points represent zero.
Question: Which is the expression value for the 10th gene in the 10th cell? (1 min)
It is 3.
(5 min)
Let’s inspect below what the @meta.data
slot contains
(run the code).
# Print the first 6 rows of '@meta.data'
head(seu@meta.data)
## orig.ident nCount_RNA nFeature_RNA
## 45_AAACCTGAGAATGTGT 45 2669 1024
## 45_AAACCTGAGACTAGAT 45 3727 1205
## 45_AAACCTGAGACTGGGT 45 4039 1728
## 45_AAACCTGAGAGAGCTC 45 3123 1332
## 45_AAACCTGAGATATGGT 45 6949 2041
## 45_AAACCTGAGATCTGCT 45 4210 1456
Question: What do you think each column represents? (3 min)
Tip: discuss this in group.
The slot @meta.data
contains 3 columns/variables:
orig.ident
, nCount_RNA
,
nFeature_RNA
. These variables were created automatically
from the data without us providing any information.
orig.ident
refers to the identity of the cell. This matches
the prefix of the cell barcode represented as row names of the
orig.ident
. This is because the function
CreateSeuratObject()
contains two parameters:
names.delim = "_"
and names.field = 1
. These
parameters will split the cell barcode names by names.delim
and pick the first element, i.e., 45 for the first 6 cells. You will
understand the meaning of these values below. nCount_RNA
is
just the total number of UMIs in a cell (obtained by summing all the
expression values in a cell/column). nFeature_RNA
is the
number of different genes expressed in a cell (obtained by summing the
number of genes expressing a value different than zero).
Now, let’s check the metadata provided in the GEO repository that was
imported above. The object metadata
is a list with two
tables (or data frames) stored in it named: subT, whole. To access each
one of them, you can call the object with the accessor $
sign like this: metadata$subT
(to access subT
)
or metadata$whole
(to access whole
). You can
use the functions dim()
(to see the dimensions, i.e., no.
of rows x cols), summary()
(to print a summary for every
column) and head()
(to print the first 6 rows) to inspect
the tables. Do not attempt to print the whole tables, because the
thousand of rows they comprise.
For now let’s ignore the metadata$subT
data frame, since
it represents a subset of the whole
data.
# Print the first 6 lines of the 'whole' metadata
head(metadata$whole)
## orig.ident nCount_RNA nFeature_RNA percent.mito sbt_is_doublet
## 1 45 2669 1024 0.11689771 FALSE
## 2 45 3727 1205 0.07593239 FALSE
## 3 45 4039 1728 0.08294132 FALSE
## 4 45 3123 1332 0.05315402 FALSE
## 5 45 6949 2041 0.07281623 FALSE
## 6 45 4210 1456 0.08622328 FALSE
## sbt_doublet_score manually_is_doublet batch control GROUP RNA_snn_res.0.5
## 1 0.05778302 FALSE 2 SF Mono 2
## 2 0.04066337 FALSE 2 SF Mono 0
## 3 0.05618845 FALSE 2 SF Mono 3
## 4 0.06669545 FALSE 2 SF Mono 0
## 5 0.08527529 FALSE 2 SF Mono 2
## 6 0.05618845 FALSE 2 SF Mono 2
## seurat_clusters newClusterID cell.types barcode
## 1 2 8 CX3CR1hi Effector CD8 45_AAACCTGAGAATGTGT
## 2 0 3 Effector CD8 45_AAACCTGAGACTAGAT
## 3 3 5 Activated CD4 45_AAACCTGAGACTGGGT
## 4 0 3 Effector CD8 45_AAACCTGAGAGAGCTC
## 5 2 8 CX3CR1hi Effector CD8 45_AAACCTGAGATATGGT
## 6 2 8 CX3CR1hi Effector CD8 45_AAACCTGAGATCTGCT
# Look into the summary
summary(metadata$whole)
## orig.ident nCount_RNA nFeature_RNA percent.mito
## Length:89717 Min. : 353 Min. : 202 Min. :0.00000
## Class :character 1st Qu.: 2373 1st Qu.: 986 1st Qu.:0.04151
## Mode :character Median : 4260 Median :1537 Median :0.05605
## Mean : 4974 Mean :1586 Mean :0.06067
## 3rd Qu.: 6067 3rd Qu.:1988 3rd Qu.:0.07425
## Max. :87816 Max. :7709 Max. :0.14999
## sbt_is_doublet sbt_doublet_score manually_is_doublet batch
## Mode :logical Min. :0.001366 Mode :logical Min. :0.000
## FALSE:89717 1st Qu.:0.030928 FALSE:89717 1st Qu.:1.000
## Median :0.053980 Median :2.000
## Mean :0.069264 Mean :1.786
## 3rd Qu.:0.089419 3rd Qu.:3.000
## Max. :0.626984 Max. :3.000
## control GROUP RNA_snn_res.0.5 seurat_clusters
## Length:89717 Length:89717 Min. : 0.00 Min. : 0.00
## Class :character Class :character 1st Qu.: 2.00 1st Qu.: 2.00
## Mode :character Mode :character Median : 4.00 Median : 4.00
## Mean : 6.43 Mean : 6.43
## 3rd Qu.: 9.00 3rd Qu.: 9.00
## Max. :30.00 Max. :30.00
## newClusterID cell.types barcode
## Min. : 1.00 Length:89717 Length:89717
## 1st Qu.: 5.00 Class :character Class :character
## Median : 8.00 Mode :character Mode :character
## Mean :10.62
## 3rd Qu.:14.00
## Max. :31.00
The first three columns of metadata$whole
seem to match
the first three columns from seu@meta.data
(this can be
computational tested by doing:
all_equal(seu@meta.data, metadata$whole[,c("orig.ident", "nCount_RNA", "nFeature_RNA")])
).
If you run the previous code you’ll realize that the message printed
suggests they are different. This is because the data types/classes of
the same columns are different across tables, but the values are the
same. How can we be sure? By comparing the cell barcodes in the
two tables:
all(row.names(seu@meta.data)==metadata$whole$barcodes)
. It
returns TRUE
meaning the cell barcodes between tables are
exactly the same. This is useful, because we know now that we can
combine the metadata$whole
to seu@meta.data
in
order to access this data from our Seurat
object. This will
be convenient to plot and query the data later.
To understand the meaning of every column in
metadata$whole
you would need to read carefully the
information available in the GEO GSE173303
repository from where we downloaded the data from. For reasons of time,
this was done for you and summarized below.
orig.ident
: original identity of every cell
regarding the 8 matched donor sample from arthritis-irAE subjects (8
synovial fluid samples vs 8 PBMCs, n=16) that they came from: A7, A40,
45, A56, 62, 63, 65, 76 (synovial), A7B, A40B, 45B, A56B, 62B, 63B, 65B,
76B (PBMCs).
nCount_RNA
: total no. of UMIs expressed by
cell.
nFeature_RNA
: no. of different genes expressed by
cell.
percent.mito
: percentage of mitochondrial genes
expressed by cell.
sbt_is_doublet
: logical value checking if a cell is
a doublet
(obtained with the software scrublet
). ‘FALSE’ for all the
cells, meaning that the authors removed doublets before submitting the
data to GEO.
sbt_doublet_score
: doublet score obtained with
scrublet
.
manually_is_doublet
: similar to
sbt_is_doublet
.
batch
: batch factors/categories. It assumes the
values: 0, 1, 2, 3.
control
: sample type. One of two: Blood
(Blood/PBMC samples, n=59255) or SF
(Synovial Fluid,
n=30462).
Group
: the identity of the treatment group. One of
two: Combo
(mixture of ICI therapy medicines, n=37465) or
Mono
(monotherapy, n=52252)
RNA_snn_res.0.5
: Seurat clusters obtained by using a
resolution value of 0.5.
seurat_clusters
: Seurat clusters. The same as
RNA_snn_res.0.5
.
newClusterID
: new cluster ID.
seurat_clusters
renamed.
cell.types
: cell-type annotations.
barcode
: cell barcodes with donor sample id
prefix.
Now that we know what every column means let’s combine this table
with our Seurat
object metadata:
seu@meta.data
. Before doing so, we’ll rename the columns
from metadata$whole
to avoid conflicts in the column names
at seu@meta.data
(because some of the column names are the
same between tables).
## Combine metadata
# Rename col names from 'metadata$whole' by adding the suffix '.orig' to the col names
tmp <- metadata$whole # temporary modified table - a copy from 'metadata$whole'
colnames(tmp) <- paste0(colnames(tmp),".orig")
# Combine metadata tables
stopifnot(all(row.names(seu@meta.data)==tmp$barcode.orig)) # ensure that cell barcodes match
seu@meta.data <- cbind(seu@meta.data, tmp)
# Look how the updated Seurat meta.data looks like
head(seu@meta.data)
## orig.ident nCount_RNA nFeature_RNA orig.ident.orig
## 45_AAACCTGAGAATGTGT 45 2669 1024 45
## 45_AAACCTGAGACTAGAT 45 3727 1205 45
## 45_AAACCTGAGACTGGGT 45 4039 1728 45
## 45_AAACCTGAGAGAGCTC 45 3123 1332 45
## 45_AAACCTGAGATATGGT 45 6949 2041 45
## 45_AAACCTGAGATCTGCT 45 4210 1456 45
## nCount_RNA.orig nFeature_RNA.orig percent.mito.orig
## 45_AAACCTGAGAATGTGT 2669 1024 0.11689771
## 45_AAACCTGAGACTAGAT 3727 1205 0.07593239
## 45_AAACCTGAGACTGGGT 4039 1728 0.08294132
## 45_AAACCTGAGAGAGCTC 3123 1332 0.05315402
## 45_AAACCTGAGATATGGT 6949 2041 0.07281623
## 45_AAACCTGAGATCTGCT 4210 1456 0.08622328
## sbt_is_doublet.orig sbt_doublet_score.orig
## 45_AAACCTGAGAATGTGT FALSE 0.05778302
## 45_AAACCTGAGACTAGAT FALSE 0.04066337
## 45_AAACCTGAGACTGGGT FALSE 0.05618845
## 45_AAACCTGAGAGAGCTC FALSE 0.06669545
## 45_AAACCTGAGATATGGT FALSE 0.08527529
## 45_AAACCTGAGATCTGCT FALSE 0.05618845
## manually_is_doublet.orig batch.orig control.orig GROUP.orig
## 45_AAACCTGAGAATGTGT FALSE 2 SF Mono
## 45_AAACCTGAGACTAGAT FALSE 2 SF Mono
## 45_AAACCTGAGACTGGGT FALSE 2 SF Mono
## 45_AAACCTGAGAGAGCTC FALSE 2 SF Mono
## 45_AAACCTGAGATATGGT FALSE 2 SF Mono
## 45_AAACCTGAGATCTGCT FALSE 2 SF Mono
## RNA_snn_res.0.5.orig seurat_clusters.orig newClusterID.orig
## 45_AAACCTGAGAATGTGT 2 2 8
## 45_AAACCTGAGACTAGAT 0 0 3
## 45_AAACCTGAGACTGGGT 3 3 5
## 45_AAACCTGAGAGAGCTC 0 0 3
## 45_AAACCTGAGATATGGT 2 2 8
## 45_AAACCTGAGATCTGCT 2 2 8
## cell.types.orig barcode.orig
## 45_AAACCTGAGAATGTGT CX3CR1hi Effector CD8 45_AAACCTGAGAATGTGT
## 45_AAACCTGAGACTAGAT Effector CD8 45_AAACCTGAGACTAGAT
## 45_AAACCTGAGACTGGGT Activated CD4 45_AAACCTGAGACTGGGT
## 45_AAACCTGAGAGAGCTC Effector CD8 45_AAACCTGAGAGAGCTC
## 45_AAACCTGAGATATGGT CX3CR1hi Effector CD8 45_AAACCTGAGATATGGT
## 45_AAACCTGAGATCTGCT CX3CR1hi Effector CD8 45_AAACCTGAGATCTGCT
Now we’re ready to start our analysis with Seurat
!
(5 min)
The first step in every analysis is: Quality-Control (QC). In this case, the QC that we’ll perform is relatively small because the data was already processed by the 10X Genomics Cell Ranger. In addition, the authors filter out bad-quality cells such as doublets as well cells comprising a high percentage of mitochondrial genes (see a full description in the GEO GSE173303 repository or in the publication doi:10.1038/s41467-022-29539-3).
The percentage of mitochondrial genes was already estimated by the
authors, but let’s calculate it below to see if the values match. This
is done with the function PercentageFeatureSet()
. This
function looks into the genes with the pattern given at
pattern
to determine the percentage of these features. This
is done using regular
expressions and the pattern
given depends on the set of
features that you’re looking into and the annotation used. In our case
we’ve human data annotated against the reference GRCh38. All the
mitochondrial genes have the prefix ‘MT-’. Thus we will use this to
provide this pattern to pattern = "^MT-"
.
Question: Which and how many mitochondrial genes are present in this data set? (1 min)
Tip: use the following code
grep("^MT-", row.names(seu), value=TRUE)
to retrieve the mitochondrial genes together with the functionlength()
to get the number of mt genes.
The mitochondrial genes present are: MT-ND1, MT-ND2, MT-CO1, MT-CO2, MT-ATP8, MT-ATP6, MT-CO3, MT-ND3, MT-ND4L, MT-ND4, MT-ND5, MT-ND6, MT-CYB.
The total no. of mt genes is: 13.
## Calculate percentage of mitochondrial genes by cell
seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-") # add the 'percent.mt' to the 'meta.data' slot
# if you work with mouse data usually the prefix for mitochondrial genes is '^mt-' instead
The previous percent.mito.orig
is different than the
percentage of mitochondrial genes that you had just calculated
percent.mt
. It seems that the former is in relative
abundance, with a scale 0-1, whereas your calculation is in percentage
(0-100%).
Now let’s plot the three cell properties that Seurat
calculated: nFeature_RNA
, nCount_RNA
,
percent.mt
. By default Seurat uses the
@meta.data
variable orig.ident
to set the
identity of every cell. This means that the cells will be highlighted by
orig.ident
, i.e., donor sample origin. This can be changed,
but for now it is good to look into the QC highlighted by the donor
sample origin. In addition, we’ll take opportunity of the functionality
of visualization in Seurat
to highlight the data by
control.orig
(the metadata column with the categorical
variable of the sample type: Blood
or SF
):
SF
, in light blue, and Blood
, in light red,
violins.
## QC: violin plots
# Create folder to save the results
res_dirs <- paste("results/GSE173303", c("plots", "tables"), sep="/")
for (d in res_dirs) if (!dir.exists(d)) dir.create(d, recursive=TRUE)
# Plot
qc_vln <- VlnPlot(seu, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"),
ncol = 3, split.by="control.orig")
# Seurat removes legend if there is more than 1 plot: https://github.com/satijalab/seurat/blob/a1294c4d363780548dbf9cc4a4abb3a6078a6d64/R/visualization.R#L5729
# Save
pdf(paste(res_dirs[1], "qc_violin_plots.pdf", sep="/"), width=12, height=4)
print(qc_vln)
dev.off()
## png
## 2
# Print plot
print(qc_vln)