Notebook


## Import packages
library("dplyr") # data wrangling
library("ggplot2") # plotting
library("Seurat") # scRNA-seq analysis
library("ComplexHeatmap") # plot heatmaps
library("aricode") # ARI index clt comp
library("reticulate") # to work w/ python packages in R
scanorama <- import('scanorama') # import the Scanorama python package
source("scripts/helper_functions.R") # import functions adapted to automatize integration with scanorama






Data


Download & Import Datasets

(7 min)


Run the R chunk code below to import the table data/GEO_GSE173303_project.tsv which contains the information to download the data sets that we’ll be using herein.

## Download datasets

# import table with datasets to download
data2download <- read.table("data/GEO_GSE173303_project.tsv", 
                sep="\t", header=TRUE)

Check below the content of the table imported.

# print table 
knitr::kable(data2download)
filename description ftp_url
GSE173303_barcodes.tsv.gz table of barcodes https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fbarcodes%2Etsv%2Egz
GSE173303_features.tsv.gz table of genes https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Ffeatures%2Etsv%2Egz
GSE173303_matrix.mtx.gz gene expression data https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fmatrix%2Emtx%2Egz
GSE173303_subT_data_metadata.tsv.gz cell metadata for sub T cells https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5FsubT%5Fdata%5Fmetadata%2Etsv%2Egz
GSE173303_whole_data_metadata.tsv.gz cell metadata for all https://ftp.ncbi.nlm.nih.gov/geo/series/GSE173nnn/GSE173303/suppl/GSE173303%5Fwhole%5Fdata%5Fmetadata%2Etsv%2Egz

Now use the information from the table above to download all the data from the GEO project GSE173303 (published by Kim et al., 2022). In this study the authors used peripheral blood and/or synovial fluid samples from 20 patients with arthritis-autoimmune related adverse events.

The GEO GSE173303 repository comprises two types of files:

  • 10X Genomics Cell Ranger (droplet based scRNA-seq technology - read more):

    • GSE173303_barcodes.tsv.gz (cell barcodes)

    • GSE173303_features.tsv.gz (genes)

    • GSE173303_matrix.mtx.gz (gene expression values)

  • Metadata (data describing cell attributes):

    • GSE173303_whole_data_metadata.tsv.gz (for all)

    • GSE173303_subT_data_metadata.tsv.gz (for T cells)


The 10X data can be directly imported into Seurat, a R package developed for the analysis of scRNA-seq data (page; Hao et al., 2021). While the metadata will be imported separately as tab-separated tables.

Since the function used to import 10X data into R and Seurat, Read10X, looks into a given directory and searches for the following file names, barcodes.tsv, genes.tsv, and matrix.mtx (it allows compressed .gz files), it is more convenient to remove the prefix file name ‘GSE173303_’ from every file.

## Download & import data

# create directory to save files
down_dir <- "data/GSE173303" # directory to save datasets 
if (!dir.exists(down_dir)) dir.create(down_dir, recursive=TRUE) # create folder if doesn't exist 

# loop over the table rows & download each dataset
for (f in 1:nrow(data2download)) {
    down_filename <- data2download[f,"filename"]
    down_filepath <- paste(down_dir, gsub("GSE173303_", "", down_filename), 
                   sep="/")# download file to: 'data/GSE173303'
    down_url <- data2download[f,"ftp_url"]
    cat("Downloading file", down_filename, "to", down_filepath, "\n")
    download.file(url=down_url, destfile=down_filepath)
}
## Downloading file GSE173303_barcodes.tsv.gz to data/GSE173303/barcodes.tsv.gz 
## Downloading file GSE173303_features.tsv.gz to data/GSE173303/features.tsv.gz 
## Downloading file GSE173303_matrix.mtx.gz to data/GSE173303/matrix.mtx.gz 
## Downloading file GSE173303_subT_data_metadata.tsv.gz to data/GSE173303/subT_data_metadata.tsv.gz 
## Downloading file GSE173303_whole_data_metadata.tsv.gz to data/GSE173303/whole_data_metadata.tsv.gz
# import gene expression 10X data into Seurat
data10x <- Read10X(data.dir=down_dir) # import 10x data as sparse matrix 
seu <- CreateSeuratObject(counts=data10x, project="GSE173303") # convert gene exp. sparse matrix into Seurat class object

# import metadata
metadata <- list.files(down_dir, pattern="metadata", full.names=TRUE) # get path for metadata based on 'pattern' argument
names(metadata) <- gsub("_data_metadata.tsv.gz", "", basename(metadata)) # name the paths
metadata <- lapply(setNames(metadata, names(metadata)), function(x) {
               read.table(x, header=TRUE, sep="\t", stringsAsFactors=FALSE)
                   }) # import tables into a list






Seurat object

(5 min)


The data was imported into R as a sparse matrix (RAM efficient - most of the entries are zero) with the Seurat R function data10x <- Read10X(). Then, the sparse matrix, where genes are rows and columns are cells, data10x was converted in a Seurat class object at seu <- CreateSeuratObject(counts=data10x, project="GSE173303"). The object seu will be used downstream to analyze the scRNA-seq data.

Look into the structure of the Seurat class object by running the code below.

# Seurat object structure
str(seu)
## Formal class 'Seurat' [package "SeuratObject"] with 13 slots
##   ..@ assays      :List of 1
##   .. ..$ RNA:Formal class 'Assay' [package "SeuratObject"] with 8 slots
##   .. .. .. ..@ counts       :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:142316713] 31 45 55 84 93 108 115 130 240 267 ...
##   .. .. .. .. .. ..@ p       : int [1:89718] 0 1024 2229 3957 5289 7330 8786 11293 14382 16908 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 21831 89717
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:21831] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
##   .. .. .. .. .. .. ..$ : chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
##   .. .. .. .. .. ..@ x       : num [1:142316713] 1 1 1 5 1 1 2 1 1 16 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ data         :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
##   .. .. .. .. .. ..@ i       : int [1:142316713] 31 45 55 84 93 108 115 130 240 267 ...
##   .. .. .. .. .. ..@ p       : int [1:89718] 0 1024 2229 3957 5289 7330 8786 11293 14382 16908 ...
##   .. .. .. .. .. ..@ Dim     : int [1:2] 21831 89717
##   .. .. .. .. .. ..@ Dimnames:List of 2
##   .. .. .. .. .. .. ..$ : chr [1:21831] "AL627309.1" "AL669831.5" "FAM87B" "LINC00115" ...
##   .. .. .. .. .. .. ..$ : chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
##   .. .. .. .. .. ..@ x       : num [1:142316713] 1 1 1 5 1 1 2 1 1 16 ...
##   .. .. .. .. .. ..@ factors : list()
##   .. .. .. ..@ scale.data   : num[0 , 0 ] 
##   .. .. .. ..@ key          : chr "rna_"
##   .. .. .. ..@ assay.orig   : NULL
##   .. .. .. ..@ var.features : logi(0) 
##   .. .. .. ..@ meta.features:'data.frame':   21831 obs. of  0 variables
##   .. .. .. ..@ misc         : list()
##   ..@ meta.data   :'data.frame': 89717 obs. of  3 variables:
##   .. ..$ orig.ident  : Factor w/ 16 levels "45","45B","62",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ nCount_RNA  : num [1:89717] 2669 3727 4039 3123 6949 ...
##   .. ..$ nFeature_RNA: int [1:89717] 1024 1205 1728 1332 2041 1456 2507 3089 2526 1623 ...
##   ..@ active.assay: chr "RNA"
##   ..@ active.ident: Factor w/ 16 levels "45","45B","62",..: 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:89717] "45_AAACCTGAGAATGTGT" "45_AAACCTGAGACTAGAT" "45_AAACCTGAGACTGGGT" "45_AAACCTGAGAGAGCTC" ...
##   ..@ graphs      : list()
##   ..@ neighbors   : list()
##   ..@ reductions  : list()
##   ..@ images      : list()
##   ..@ project.name: chr "GSE173303"
##   ..@ misc        : list()
##   ..@ version     :Classes 'package_version', 'numeric_version'  hidden list of 1
##   .. ..$ : int [1:3] 4 1 0
##   ..@ commands    : list()
##   ..@ tools       : list()



An object of Seurat class has several layers of information. You don’t need to know all. Let’s focus on @assays and @meta.data. The former stores the gene expression data. Assays are further divided into slots of data. Particularly important are the slots: @counts (unnormalized/raw counts), @data (normalized data, e.g., log-noramlized with 10K scaling factor), @scale.data (scaled data, e.g., standardized by Z-score). The latter, i.e., @meta.data stores the cell metadata (number of genes expressed in the cell, cluster identity, etc).

The object seu imported above will be called almost in every function during the analysis and most of the results will be saved directly into it. Seurat provides functions to access the data stored in its own class.

Read more about the structure of a Seurat object in the Seurat wiki.


Question: How many genes and cells compose the GEO GSE173303 dataset? (1 min)

Tip: type the object seu in the R console and interpret the message printed or provide the object seu to the functions nrow() and ncol() which count the no. of rows (=genes) and columns (=cells).

Answer

The GEO GSE173303 dataset comprises 21831 genes and 89717 cells.


Print below the gene expression of the first 10 genes across the first 10 cells stored in the main assay, called RNA by default, in the slot @counts.

## Inspect data
seu@assays$RNA@counts[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
##                               
## AL627309.1 . . . . . . . . . .
## AL669831.5 . . . . . . . . . .
## FAM87B     . . . . . . . . . .
## LINC00115  . . . . . . . . . .
## FAM41C     . . . . . . . . . .
## NOC2L      . . . . . . . . . .
## KLHL17     . . . . . . . . . .
## PLEKHN1    . . . . . . . . . .
## HES4       . . . . . . . . . .
## ISG15      . . . . . . 1 2 2 3

The gene expression counts stored in the slot @counts are in sparse format. Points represent zero.


Question: Which is the expression value for the 10th gene in the 10th cell? (1 min)

Answer

It is 3.






Metadata

(5 min)


Let’s inspect below what the @meta.data slot contains (run the code).

# Print the first 6 rows of '@meta.data'
head(seu@meta.data)
##                     orig.ident nCount_RNA nFeature_RNA
## 45_AAACCTGAGAATGTGT         45       2669         1024
## 45_AAACCTGAGACTAGAT         45       3727         1205
## 45_AAACCTGAGACTGGGT         45       4039         1728
## 45_AAACCTGAGAGAGCTC         45       3123         1332
## 45_AAACCTGAGATATGGT         45       6949         2041
## 45_AAACCTGAGATCTGCT         45       4210         1456


Question: What do you think each column represents? (3 min)

Tip: discuss this in group.

Answer

The slot @meta.data contains 3 columns/variables: orig.ident, nCount_RNA, nFeature_RNA. These variables were created automatically from the data without us providing any information. orig.ident refers to the identity of the cell. This matches the prefix of the cell barcode represented as row names of the orig.ident. This is because the function CreateSeuratObject() contains two parameters: names.delim = "_" and names.field = 1. These parameters will split the cell barcode names by names.delim and pick the first element, i.e., 45 for the first 6 cells. You will understand the meaning of these values below. nCount_RNA is just the total number of UMIs in a cell (obtained by summing all the expression values in a cell/column). nFeature_RNA is the number of different genes expressed in a cell (obtained by summing the number of genes expressing a value different than zero).

Now, let’s check the metadata provided in the GEO repository that was imported above. The object metadata is a list with two tables (or data frames) stored in it named: subT, whole. To access each one of them, you can call the object with the accessor $ sign like this: metadata$subT (to access subT) or metadata$whole (to access whole). You can use the functions dim() (to see the dimensions, i.e., no. of rows x cols), summary() (to print a summary for every column) and head() (to print the first 6 rows) to inspect the tables. Do not attempt to print the whole tables, because the thousand of rows they comprise.

For now let’s ignore the metadata$subT data frame, since it represents a subset of the whole data.

# Print the first 6 lines of the 'whole' metadata 
head(metadata$whole)
##   orig.ident nCount_RNA nFeature_RNA percent.mito sbt_is_doublet
## 1         45       2669         1024   0.11689771          FALSE
## 2         45       3727         1205   0.07593239          FALSE
## 3         45       4039         1728   0.08294132          FALSE
## 4         45       3123         1332   0.05315402          FALSE
## 5         45       6949         2041   0.07281623          FALSE
## 6         45       4210         1456   0.08622328          FALSE
##   sbt_doublet_score manually_is_doublet batch control GROUP RNA_snn_res.0.5
## 1        0.05778302               FALSE     2      SF  Mono               2
## 2        0.04066337               FALSE     2      SF  Mono               0
## 3        0.05618845               FALSE     2      SF  Mono               3
## 4        0.06669545               FALSE     2      SF  Mono               0
## 5        0.08527529               FALSE     2      SF  Mono               2
## 6        0.05618845               FALSE     2      SF  Mono               2
##   seurat_clusters newClusterID            cell.types             barcode
## 1               2            8 CX3CR1hi Effector CD8 45_AAACCTGAGAATGTGT
## 2               0            3          Effector CD8 45_AAACCTGAGACTAGAT
## 3               3            5         Activated CD4 45_AAACCTGAGACTGGGT
## 4               0            3          Effector CD8 45_AAACCTGAGAGAGCTC
## 5               2            8 CX3CR1hi Effector CD8 45_AAACCTGAGATATGGT
## 6               2            8 CX3CR1hi Effector CD8 45_AAACCTGAGATCTGCT
# Look into the summary
summary(metadata$whole)
##   orig.ident          nCount_RNA     nFeature_RNA   percent.mito    
##  Length:89717       Min.   :  353   Min.   : 202   Min.   :0.00000  
##  Class :character   1st Qu.: 2373   1st Qu.: 986   1st Qu.:0.04151  
##  Mode  :character   Median : 4260   Median :1537   Median :0.05605  
##                     Mean   : 4974   Mean   :1586   Mean   :0.06067  
##                     3rd Qu.: 6067   3rd Qu.:1988   3rd Qu.:0.07425  
##                     Max.   :87816   Max.   :7709   Max.   :0.14999  
##  sbt_is_doublet  sbt_doublet_score  manually_is_doublet     batch      
##  Mode :logical   Min.   :0.001366   Mode :logical       Min.   :0.000  
##  FALSE:89717     1st Qu.:0.030928   FALSE:89717         1st Qu.:1.000  
##                  Median :0.053980                       Median :2.000  
##                  Mean   :0.069264                       Mean   :1.786  
##                  3rd Qu.:0.089419                       3rd Qu.:3.000  
##                  Max.   :0.626984                       Max.   :3.000  
##    control             GROUP           RNA_snn_res.0.5 seurat_clusters
##  Length:89717       Length:89717       Min.   : 0.00   Min.   : 0.00  
##  Class :character   Class :character   1st Qu.: 2.00   1st Qu.: 2.00  
##  Mode  :character   Mode  :character   Median : 4.00   Median : 4.00  
##                                        Mean   : 6.43   Mean   : 6.43  
##                                        3rd Qu.: 9.00   3rd Qu.: 9.00  
##                                        Max.   :30.00   Max.   :30.00  
##   newClusterID    cell.types          barcode         
##  Min.   : 1.00   Length:89717       Length:89717      
##  1st Qu.: 5.00   Class :character   Class :character  
##  Median : 8.00   Mode  :character   Mode  :character  
##  Mean   :10.62                                        
##  3rd Qu.:14.00                                        
##  Max.   :31.00

The first three columns of metadata$whole seem to match the first three columns from seu@meta.data (this can be computational tested by doing: all_equal(seu@meta.data, metadata$whole[,c("orig.ident", "nCount_RNA", "nFeature_RNA")])). If you run the previous code you’ll realize that the message printed suggests they are different. This is because the data types/classes of the same columns are different across tables, but the values are the same. How can we be sure? By comparing the cell barcodes in the two tables: all(row.names(seu@meta.data)==metadata$whole$barcodes). It returns TRUE meaning the cell barcodes between tables are exactly the same. This is useful, because we know now that we can combine the metadata$whole to seu@meta.data in order to access this data from our Seurat object. This will be convenient to plot and query the data later.

To understand the meaning of every column in metadata$whole you would need to read carefully the information available in the GEO GSE173303 repository from where we downloaded the data from. For reasons of time, this was done for you and summarized below.

  • orig.ident: original identity of every cell regarding the 8 matched donor sample from arthritis-irAE subjects (8 synovial fluid samples vs 8 PBMCs, n=16) that they came from: A7, A40, 45, A56, 62, 63, 65, 76 (synovial), A7B, A40B, 45B, A56B, 62B, 63B, 65B, 76B (PBMCs).

  • nCount_RNA: total no. of UMIs expressed by cell.

  • nFeature_RNA: no. of different genes expressed by cell.

  • percent.mito: percentage of mitochondrial genes expressed by cell.

  • sbt_is_doublet: logical value checking if a cell is a doublet (obtained with the software scrublet). ‘FALSE’ for all the cells, meaning that the authors removed doublets before submitting the data to GEO.

  • sbt_doublet_score: doublet score obtained with scrublet.

  • manually_is_doublet: similar to sbt_is_doublet.

  • batch: batch factors/categories. It assumes the values: 0, 1, 2, 3.

  • control: sample type. One of two: Blood (Blood/PBMC samples, n=59255) or SF (Synovial Fluid, n=30462).

  • Group: the identity of the treatment group. One of two: Combo (mixture of ICI therapy medicines, n=37465) or Mono (monotherapy, n=52252)

  • RNA_snn_res.0.5: Seurat clusters obtained by using a resolution value of 0.5.

  • seurat_clusters: Seurat clusters. The same as RNA_snn_res.0.5.

  • newClusterID: new cluster ID. seurat_clusters renamed.

  • cell.types: cell-type annotations.

  • barcode: cell barcodes with donor sample id prefix.


Now that we know what every column means let’s combine this table with our Seurat object metadata: seu@meta.data. Before doing so, we’ll rename the columns from metadata$whole to avoid conflicts in the column names at seu@meta.data (because some of the column names are the same between tables).

## Combine metadata

# Rename col names from 'metadata$whole' by adding the suffix '.orig' to the col names
tmp <- metadata$whole # temporary modified table - a copy from 'metadata$whole'
colnames(tmp) <- paste0(colnames(tmp),".orig")

# Combine metadata tables
stopifnot(all(row.names(seu@meta.data)==tmp$barcode.orig)) # ensure that cell barcodes match
seu@meta.data <- cbind(seu@meta.data, tmp)

# Look how the updated Seurat meta.data looks like
head(seu@meta.data)
##                     orig.ident nCount_RNA nFeature_RNA orig.ident.orig
## 45_AAACCTGAGAATGTGT         45       2669         1024              45
## 45_AAACCTGAGACTAGAT         45       3727         1205              45
## 45_AAACCTGAGACTGGGT         45       4039         1728              45
## 45_AAACCTGAGAGAGCTC         45       3123         1332              45
## 45_AAACCTGAGATATGGT         45       6949         2041              45
## 45_AAACCTGAGATCTGCT         45       4210         1456              45
##                     nCount_RNA.orig nFeature_RNA.orig percent.mito.orig
## 45_AAACCTGAGAATGTGT            2669              1024        0.11689771
## 45_AAACCTGAGACTAGAT            3727              1205        0.07593239
## 45_AAACCTGAGACTGGGT            4039              1728        0.08294132
## 45_AAACCTGAGAGAGCTC            3123              1332        0.05315402
## 45_AAACCTGAGATATGGT            6949              2041        0.07281623
## 45_AAACCTGAGATCTGCT            4210              1456        0.08622328
##                     sbt_is_doublet.orig sbt_doublet_score.orig
## 45_AAACCTGAGAATGTGT               FALSE             0.05778302
## 45_AAACCTGAGACTAGAT               FALSE             0.04066337
## 45_AAACCTGAGACTGGGT               FALSE             0.05618845
## 45_AAACCTGAGAGAGCTC               FALSE             0.06669545
## 45_AAACCTGAGATATGGT               FALSE             0.08527529
## 45_AAACCTGAGATCTGCT               FALSE             0.05618845
##                     manually_is_doublet.orig batch.orig control.orig GROUP.orig
## 45_AAACCTGAGAATGTGT                    FALSE          2           SF       Mono
## 45_AAACCTGAGACTAGAT                    FALSE          2           SF       Mono
## 45_AAACCTGAGACTGGGT                    FALSE          2           SF       Mono
## 45_AAACCTGAGAGAGCTC                    FALSE          2           SF       Mono
## 45_AAACCTGAGATATGGT                    FALSE          2           SF       Mono
## 45_AAACCTGAGATCTGCT                    FALSE          2           SF       Mono
##                     RNA_snn_res.0.5.orig seurat_clusters.orig newClusterID.orig
## 45_AAACCTGAGAATGTGT                    2                    2                 8
## 45_AAACCTGAGACTAGAT                    0                    0                 3
## 45_AAACCTGAGACTGGGT                    3                    3                 5
## 45_AAACCTGAGAGAGCTC                    0                    0                 3
## 45_AAACCTGAGATATGGT                    2                    2                 8
## 45_AAACCTGAGATCTGCT                    2                    2                 8
##                           cell.types.orig        barcode.orig
## 45_AAACCTGAGAATGTGT CX3CR1hi Effector CD8 45_AAACCTGAGAATGTGT
## 45_AAACCTGAGACTAGAT          Effector CD8 45_AAACCTGAGACTAGAT
## 45_AAACCTGAGACTGGGT         Activated CD4 45_AAACCTGAGACTGGGT
## 45_AAACCTGAGAGAGCTC          Effector CD8 45_AAACCTGAGAGAGCTC
## 45_AAACCTGAGATATGGT CX3CR1hi Effector CD8 45_AAACCTGAGATATGGT
## 45_AAACCTGAGATCTGCT CX3CR1hi Effector CD8 45_AAACCTGAGATCTGCT


Now we’re ready to start our analysis with Seurat!






QC


Viz

(5 min)


The first step in every analysis is: Quality-Control (QC). In this case, the QC that we’ll perform is relatively small because the data was already processed by the 10X Genomics Cell Ranger. In addition, the authors filter out bad-quality cells such as doublets as well cells comprising a high percentage of mitochondrial genes (see a full description in the GEO GSE173303 repository or in the publication doi:10.1038/s41467-022-29539-3).

The percentage of mitochondrial genes was already estimated by the authors, but let’s calculate it below to see if the values match. This is done with the function PercentageFeatureSet(). This function looks into the genes with the pattern given at pattern to determine the percentage of these features. This is done using regular expressions and the pattern given depends on the set of features that you’re looking into and the annotation used. In our case we’ve human data annotated against the reference GRCh38. All the mitochondrial genes have the prefix ‘MT-’. Thus we will use this to provide this pattern to pattern = "^MT-".

Question: Which and how many mitochondrial genes are present in this data set? (1 min)

Tip: use the following code grep("^MT-", row.names(seu), value=TRUE) to retrieve the mitochondrial genes together with the function length() to get the number of mt genes.

Answer

The mitochondrial genes present are: MT-ND1, MT-ND2, MT-CO1, MT-CO2, MT-ATP8, MT-ATP6, MT-CO3, MT-ND3, MT-ND4L, MT-ND4, MT-ND5, MT-ND6, MT-CYB.

The total no. of mt genes is: 13.

## Calculate percentage of mitochondrial genes by cell

seu[["percent.mt"]] <- PercentageFeatureSet(seu, pattern = "^MT-") # add the 'percent.mt' to the 'meta.data' slot

# if you work with mouse data usually the prefix for mitochondrial genes is '^mt-' instead

The previous percent.mito.orig is different than the percentage of mitochondrial genes that you had just calculated percent.mt. It seems that the former is in relative abundance, with a scale 0-1, whereas your calculation is in percentage (0-100%).

Now let’s plot the three cell properties that Seurat calculated: nFeature_RNA, nCount_RNA, percent.mt. By default Seurat uses the @meta.data variable orig.ident to set the identity of every cell. This means that the cells will be highlighted by orig.ident, i.e., donor sample origin. This can be changed, but for now it is good to look into the QC highlighted by the donor sample origin. In addition, we’ll take opportunity of the functionality of visualization in Seurat to highlight the data by control.orig (the metadata column with the categorical variable of the sample type: Blood or SF): SF, in light blue, and Blood, in light red, violins.

## QC: violin plots

# Create folder to save the results
res_dirs <- paste("results/GSE173303", c("plots", "tables"), sep="/")
for (d in res_dirs) if (!dir.exists(d)) dir.create(d, recursive=TRUE)

# Plot
qc_vln <- VlnPlot(seu, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), 
          ncol = 3, split.by="control.orig") 
# Seurat removes legend if there is more than 1 plot: https://github.com/satijalab/seurat/blob/a1294c4d363780548dbf9cc4a4abb3a6078a6d64/R/visualization.R#L5729

# Save
pdf(paste(res_dirs[1], "qc_violin_plots.pdf", sep="/"), width=12, height=4)
print(qc_vln)
dev.off()
## png 
##   2
# Print plot
print(qc_vln)