In this vignette we will be simulating data that resembles the data you would use in your REACTOR analysis and give you a step by step guide on how to use the REACTOR package for your analysis!
Let’s start by simulating data that resembles the data from SCENIC’s [1] binarize()-function, which is just a table consisting of binary values. The table should also contain a column that is used to identify the single-cell sample the values come from (this should also be found in the SCENIC output by default).
# Single cell sample identifiers for 2 cases, 2 controls. 10 single cell samples each
cellID <- c("Covid-1-1","Covid-1-2","Covid-1-3","Covid-1-4","Covid-1-5", "Covid-1-6","Covid-1-7","Covid-1-8","Covid-1-9","Covid-1-10",
"Covid-2-1","Covid-2-2","Covid-2-3","Covid-2-4","Covid-2-5", "Covid-2-6","Covid-2-7","Covid-2-8","Covid-2-9","Covid-2-10",
"Healthy-1-1","Healthy-1-2","Healthy-1-3","Healthy-1-4","Healthy-1-5", "Healthy-1-6","Healthy-1-7","Healthy-1-8","Healthy-1-9","Healthy-1-10",
"Healthy-2-1","Healthy-2-2","Healthy-2-3","Healthy-2-4","Healthy-2-5", "Healthy-2-6","Healthy-2-7","Healthy-2-8","Healthy-2-9","Healthy-2-10")
# Lets also name a few regulons our single cell experiments might have found and add "CellID" to the beginning of this list to get the final column names for our table.
colNames <- c("CellID","NFIL3", "MAFB", "MYCN", "FOXK2", "PPARG", "STAT2", "JUNB", "JUND", "RELB", "ETV1")
# Now simulating the binary matrix
set.seed(1234)
binary_matrix <- matrix(sample(0:1, 40 * 10, replace = TRUE), nrow = 40, ncol = 10)
# Lets add the cell identifier column we made earlier and set the column names
binary_matrix <- cbind.data.frame(cellID, binary_matrix)
colnames(binary_matrix) <- colNames
# Lets see what our table should look like
head(binary_matrix)
## CellID NFIL3 MAFB MYCN FOXK2 PPARG STAT2 JUNB JUND RELB ETV1
## 1 Covid-1-1 1 1 0 0 0 1 0 0 0 0
## 2 Covid-1-2 1 0 1 1 0 0 1 1 1 1
## 3 Covid-1-3 1 1 0 1 0 1 0 1 1 1
## 4 Covid-1-4 1 1 1 0 1 0 0 0 1 0
## 5 Covid-1-5 0 1 0 1 1 0 1 0 1 1
## 6 Covid-1-6 1 0 1 0 1 0 0 0 1 0
Now let’s simulate the clustering our e.g. Coralysis [1] or SEURAT [3] analysis might have made for us:
# Lets use the single cell sample identifiers we made earlier and assign them some cell-type clusters
cellTypeClusters <- rep(c("CD 14 Monocyte", "CD 16 Monocyte", "CD8m T", "NK"), 10)
clustering <- cbind.data.frame(cellID, cellTypeClusters)
# This is what our clustering table should look like
head(clustering)
## cellID cellTypeClusters
## 1 Covid-1-1 CD 14 Monocyte
## 2 Covid-1-2 CD 16 Monocyte
## 3 Covid-1-3 CD8m T
## 4 Covid-1-4 NK
## 5 Covid-1-5 CD 14 Monocyte
## 6 Covid-1-6 CD 16 Monocyte
And let’s finally simulate the studyDesign table we should also have in order to run REACTOR analysis:
# Donors IDs
donor <- c(rep("Covid1",10), rep("Covid2",10), rep("Healthy1",10), rep("Healthy2",10))
# Status of the samples
status <- c(rep("COVID",20), rep("Healthy",20))
# Combine into study design table
studyDesign <- cbind.data.frame(cellID, donor, status)
# This is what your study design should roughly look like. It should contain the information from which donor the sample came from and their status.
head(studyDesign)
## cellID donor status
## 1 Covid-1-1 Covid1 COVID
## 2 Covid-1-2 Covid1 COVID
## 3 Covid-1-3 Covid1 COVID
## 4 Covid-1-4 Covid1 COVID
## 5 Covid-1-5 Covid1 COVID
## 6 Covid-1-6 Covid1 COVID
Now that we have some data to work with we can start running the REACTOR analysis. The first step in the REACTOR workflow is to create the activity matrix for the differential activity analysis. This can be done using the REACTOR processData-function.
The parameters required for the analysis are as follows:
| Parameter | Explanation | Value to be used in this vignette |
|---|---|---|
| minCells | minimum number of cells present in a cell-type cluster within a donor | We will use 0 here as the simulated dataset is rather small |
| RBM | Regulon Binary Matrix. This is produced by SCENIC’s binarize-function! (1st column should represent the single cell sample IDs) | The binary_matrix we created earlier |
| Study Design | Study design dataframe. Should contain information (as columns) from which sample and which condition the single cell sample came and the 1st column should represent the single cell sample IDs | The studyDesign dataframe we created earlier |
| Clustering | Clustering dataframe (1st column should represent the single cell sample IDs) | The clustering dataframe we created earlier |
| cluster_cName | Column name of the clustering to use from the Clustering dataframe | “cellTypeClusters” |
| condition_cName | Name of the column of conditions to be contrasted from the StudyDesign dataframe (i.e COVID or Healthy) | “status” |
| donor_cName | Name of the donor column from the StudyDesign dataframe | “donor” |
# Please ensure that the first column of each of the input data matrices represents the single-cell sample ID so we can join them!
donor_cname = "donor"
cluster_cname = "cellTypeClusters"
condition_cname = "status"
minCells = 0
# processData returns the regulon activity table in a format processable by ROTS.
regulonActivity <- REACTOR::processData(minCells = minCells, RBM = binary_matrix,
StudyDesign = studyDesign, Clustering = clustering,
condition_cName = condition_cname, donor_cName = donor_cname,
cluster_cName = cluster_cname)
## Joining with `by = join_by(CellID)`
## Joining with `by = join_by(CellID)`
head(regulonActivity)
## Covid1 Covid2 Healthy1 Healthy2
## CD 14 Monocyte | NFIL3 33.33333 100.00000 100.00000 0.00000
## CD 14 Monocyte | MAFB 66.66667 50.00000 33.33333 50.00000
## CD 14 Monocyte | MYCN 33.33333 50.00000 33.33333 100.00000
## CD 14 Monocyte | FOXK2 66.66667 100.00000 33.33333 50.00000
## CD 14 Monocyte | PPARG 33.33333 50.00000 33.33333 50.00000
## CD 14 Monocyte | STAT2 66.66667 100.00000 33.33333 0.00000
With the processed data the differential analysis can be performed using the differentialActivityAnalysis(). REACTOR uses ROTS [2] to conduct the analysis.
The input parameters are as follows:
| Parameter | Explanation | Value to be used in this vignette |
|---|---|---|
| data | Dataframe containing the proportional counts of binary regulon activity. This is the output produced by the REACTOR::processData-function. | regulonActivity dataframe we created earlier using REACTOR |
| groups | Vector specifying the experimental groups (i.e. COVID, Healthy) as integers | In this case it is c(1,1,2,2). You can check this by viewing the regulonActivity dataframe and checking the column names and comparing them to the metadata you have |
| Max Zeros | Maximum number of zero values present in a row of the input data frame. Rows that contain more zero values than this parameter will be filtered before the ROTS analysis. | We will use the default parameter (NA) in this case as our dataset is rather small |
| … | Parameters passed onto ROTS. See ROTS | We will be setting seed to 1234 and K (top list size) to 100 |
groups <- c(rep(1,2), rep(2,2))
# Differential activity analysis using reactor
DAA_out <- REACTOR::differentialActivityAnalysis(data = regulonActivity, groups = groups, seed=1234, K=100)
## Warning in ROTS(data = data_filtered, groups = groups, ...): Top list size K is
## more than 90% of the data. Be cautious with reproducibility estimates.
## Bootstrapping samples
## Optimizing parameters
## Calculating p-values
## Warning in pvalue(observed, permuted): subscript out of bounds (index 80000 >=
## vector size 80000)
## Warning in pvalue(observed, permuted): subscript out of bounds (index 80000 >=
## vector size 80000)
## Warning in pvalue(observed, permuted): subscript out of bounds (index 80000 >=
## vector size 80000)
## Calculating FDR
# The outputs from the analysis are as follows: at index 1 you have the ROTS object and at index 2 you have simplified results table
ROTS_obj <- DAA_out[[1]]
ROTS_results <- DAA_out[[2]]
head(ROTS_results)
## Covid1 Covid2 Healthy1 Healthy2 d
## CD 14 Monocyte | FOXK2 66.66667 100.00000 33.33333 50.00000 -2.236068
## CD 14 Monocyte | STAT2 66.66667 100.00000 33.33333 0.00000 -2.828428
## CD 16 Monocyte | STAT2 0.00000 0.00000 66.66667 50.00000 6.999999
## CD 16 Monocyte | RELB 66.66667 100.00000 33.33333 50.00000 -2.236068
## NK | MYCN 100.00000 66.66667 50.00000 33.33333 -2.236068
## CD 14 Monocyte | MAFB 66.66667 50.00000 33.33333 50.00000 -1.414214
## p fdr fc
## CD 14 Monocyte | FOXK2 0.1327375 0.6000000 41.66667
## CD 14 Monocyte | STAT2 0.0742625 0.6000000 66.66667
## CD 16 Monocyte | STAT2 0.0334375 0.6000000 -58.33333
## CD 16 Monocyte | RELB 0.1327375 0.6000000 41.66667
## NK | MYCN 0.1327375 0.6000000 41.66667
## CD 14 Monocyte | MAFB 0.2989125 0.8181818 16.66667
Based on this analysis we have one significant regulon-celltype combination based on p-value 0.05 cutoff (CD16 Monocyte | STAT2, where CD16 Monocyte is the celltype cluster and STAT2 the regulon). The FDR values are not significant due to the nature in which we simulated the data and the dataset being rather small. With the ROTS object you can easily do some plotting. Let’s explore a few options (more can be found in ROTS documentation)
plot(ROTS_obj, type="pvalue")
plot(ROTS_obj, type="reproducibility")
plot(ROTS_obj, type="volcano", fdr=0.61, labels = T)
[1] S. Aibar et al., “SCENIC: single-cell regulatory network inference and clustering”, Nat Methods 14, 1083–1086, Oct. 2017. https://doi.org/10.1038/nmeth.4463.
[2] T. Suomi et al., “ROTS: An R package for reproducibility-optimized statistical testing”, PLOS Comput. Biol., vol. 13, no. 5, p. e1005562, May 2017, doi: 10.1371/journal.pcbi.1005562.
[3] Y. Hao et al., “Dictionary learning for integrative, multimodal and scalable single-cell analysis”, Nat Biotechnol. Feb. 2024;42(2):293-304. doi: 10.1038/s41587-023-01767-y.