Simulating data for analysis

In this vignette we will be simulating data that resembles the data you would use in your REACTOR analysis and give you a step by step guide on how to use the REACTOR package for your analysis!

Let’s start by simulating data that resembles the data from SCENIC’s [1] binarize()-function, which is just a table consisting of binary values. The table should also contain a column that is used to identify the single-cell sample the values come from (this should also be found in the SCENIC output by default).

# Single cell sample identifiers for 2 cases, 2 controls. 10 single cell samples each
cellID <- c("Covid-1-1","Covid-1-2","Covid-1-3","Covid-1-4","Covid-1-5", "Covid-1-6","Covid-1-7","Covid-1-8","Covid-1-9","Covid-1-10", 
            "Covid-2-1","Covid-2-2","Covid-2-3","Covid-2-4","Covid-2-5", "Covid-2-6","Covid-2-7","Covid-2-8","Covid-2-9","Covid-2-10", 
          "Healthy-1-1","Healthy-1-2","Healthy-1-3","Healthy-1-4","Healthy-1-5", "Healthy-1-6","Healthy-1-7","Healthy-1-8","Healthy-1-9","Healthy-1-10", 
                    "Healthy-2-1","Healthy-2-2","Healthy-2-3","Healthy-2-4","Healthy-2-5", "Healthy-2-6","Healthy-2-7","Healthy-2-8","Healthy-2-9","Healthy-2-10")

# Lets also name a few regulons our single cell experiments might have found and add "CellID" to the beginning of this list to get the final column names for our table.
colNames <- c("CellID","NFIL3", "MAFB", "MYCN", "FOXK2", "PPARG", "STAT2", "JUNB", "JUND", "RELB", "ETV1")

# Now simulating the binary matrix
set.seed(1234)
binary_matrix <- matrix(sample(0:1, 40 * 10, replace = TRUE), nrow = 40, ncol = 10)
# Lets add the cell identifier column we made earlier and set the column names
binary_matrix <- cbind.data.frame(cellID, binary_matrix)
colnames(binary_matrix) <- colNames
# Lets see what our table should look like
head(binary_matrix)
##      CellID NFIL3 MAFB MYCN FOXK2 PPARG STAT2 JUNB JUND RELB ETV1
## 1 Covid-1-1     1    1    0     0     0     1    0    0    0    0
## 2 Covid-1-2     1    0    1     1     0     0    1    1    1    1
## 3 Covid-1-3     1    1    0     1     0     1    0    1    1    1
## 4 Covid-1-4     1    1    1     0     1     0    0    0    1    0
## 5 Covid-1-5     0    1    0     1     1     0    1    0    1    1
## 6 Covid-1-6     1    0    1     0     1     0    0    0    1    0

Now let’s simulate the clustering our e.g. Coralysis [1] or SEURAT [3] analysis might have made for us:

# Lets use the single cell sample identifiers we made earlier and assign them some cell-type clusters
cellTypeClusters <- rep(c("CD 14 Monocyte", "CD 16 Monocyte", "CD8m T", "NK"), 10)
clustering <- cbind.data.frame(cellID, cellTypeClusters)
# This is what our clustering table should look like
head(clustering)
##      cellID cellTypeClusters
## 1 Covid-1-1   CD 14 Monocyte
## 2 Covid-1-2   CD 16 Monocyte
## 3 Covid-1-3           CD8m T
## 4 Covid-1-4               NK
## 5 Covid-1-5   CD 14 Monocyte
## 6 Covid-1-6   CD 16 Monocyte

And let’s finally simulate the studyDesign table we should also have in order to run REACTOR analysis:

# Donors IDs
donor <- c(rep("Covid1",10), rep("Covid2",10), rep("Healthy1",10), rep("Healthy2",10))
# Status of the samples
status <- c(rep("COVID",20), rep("Healthy",20))

# Combine into study design table
studyDesign <- cbind.data.frame(cellID, donor, status)
# This is what your study design should roughly look like. It should contain the information from which donor the sample came from and their status.
head(studyDesign)
##      cellID  donor status
## 1 Covid-1-1 Covid1  COVID
## 2 Covid-1-2 Covid1  COVID
## 3 Covid-1-3 Covid1  COVID
## 4 Covid-1-4 Covid1  COVID
## 5 Covid-1-5 Covid1  COVID
## 6 Covid-1-6 Covid1  COVID

Running REACTOR

Now that we have some data to work with we can start running the REACTOR analysis. The first step in the REACTOR workflow is to create the activity matrix for the differential activity analysis. This can be done using the REACTOR processData-function.

Processing the input data

The parameters required for the analysis are as follows:

Parameter Explanation Value to be used in this vignette
minCells minimum number of cells present in a cell-type cluster within a donor We will use 0 here as the simulated dataset is rather small
RBM Regulon Binary Matrix. This is produced by SCENIC’s binarize-function! (1st column should represent the single cell sample IDs) The binary_matrix we created earlier
Study Design Study design dataframe. Should contain information (as columns) from which sample and which condition the single cell sample came and the 1st column should represent the single cell sample IDs The studyDesign dataframe we created earlier
Clustering Clustering dataframe (1st column should represent the single cell sample IDs) The clustering dataframe we created earlier
cluster_cName Column name of the clustering to use from the Clustering dataframe “cellTypeClusters”
condition_cName Name of the column of conditions to be contrasted from the StudyDesign dataframe (i.e COVID or Healthy) “status”
donor_cName Name of the donor column from the StudyDesign dataframe “donor”
# Please ensure that the first column of each of the input data matrices represents the single-cell sample ID so we can join them!


donor_cname      = "donor"
cluster_cname     = "cellTypeClusters"
condition_cname   = "status"

minCells = 0

# processData returns the regulon activity table in a format processable by ROTS.
regulonActivity  <- REACTOR::processData(minCells = minCells, RBM = binary_matrix,
StudyDesign = studyDesign, Clustering = clustering,
condition_cName = condition_cname, donor_cName = donor_cname,
cluster_cName = cluster_cname)
## Joining with `by = join_by(CellID)`
## Joining with `by = join_by(CellID)`
head(regulonActivity)
##                           Covid1    Covid2  Healthy1  Healthy2
## CD 14 Monocyte | NFIL3  33.33333 100.00000 100.00000   0.00000
## CD 14 Monocyte | MAFB   66.66667  50.00000  33.33333  50.00000
## CD 14 Monocyte | MYCN   33.33333  50.00000  33.33333 100.00000
## CD 14 Monocyte | FOXK2  66.66667 100.00000  33.33333  50.00000
## CD 14 Monocyte | PPARG  33.33333  50.00000  33.33333  50.00000
## CD 14 Monocyte | STAT2  66.66667 100.00000  33.33333   0.00000

Running differential activity analysis

With the processed data the differential analysis can be performed using the differentialActivityAnalysis(). REACTOR uses ROTS [2] to conduct the analysis.

The input parameters are as follows:

Parameter Explanation Value to be used in this vignette
data Dataframe containing the proportional counts of binary regulon activity. This is the output produced by the REACTOR::processData-function. regulonActivity dataframe we created earlier using REACTOR
groups Vector specifying the experimental groups (i.e. COVID, Healthy) as integers In this case it is c(1,1,2,2). You can check this by viewing the regulonActivity dataframe and checking the column names and comparing them to the metadata you have
Max Zeros Maximum number of zero values present in a row of the input data frame. Rows that contain more zero values than this parameter will be filtered before the ROTS analysis. We will use the default parameter (NA) in this case as our dataset is rather small
Parameters passed onto ROTS. See ROTS We will be setting seed to 1234 and K (top list size) to 100
groups <- c(rep(1,2), rep(2,2)) 

# Differential activity analysis using reactor
DAA_out <- REACTOR::differentialActivityAnalysis(data = regulonActivity, groups = groups, seed=1234, K=100)
## Warning in ROTS(data = data_filtered, groups = groups, ...): Top list size K is
## more than 90% of the data. Be cautious with reproducibility estimates.
## Bootstrapping samples
## Optimizing parameters
## Calculating p-values
## Warning in pvalue(observed, permuted): subscript out of bounds (index 80000 >=
## vector size 80000)
## Warning in pvalue(observed, permuted): subscript out of bounds (index 80000 >=
## vector size 80000)
## Warning in pvalue(observed, permuted): subscript out of bounds (index 80000 >=
## vector size 80000)
## Calculating FDR
# The outputs from the analysis are as follows: at index 1 you have the ROTS object and at index 2 you have simplified results table 
ROTS_obj <- DAA_out[[1]]
ROTS_results <- DAA_out[[2]]

head(ROTS_results)
##                           Covid1    Covid2 Healthy1 Healthy2         d
## CD 14 Monocyte | FOXK2  66.66667 100.00000 33.33333 50.00000 -2.236068
## CD 14 Monocyte | STAT2  66.66667 100.00000 33.33333  0.00000 -2.828428
## CD 16 Monocyte | STAT2   0.00000   0.00000 66.66667 50.00000  6.999999
## CD 16 Monocyte | RELB   66.66667 100.00000 33.33333 50.00000 -2.236068
## NK | MYCN              100.00000  66.66667 50.00000 33.33333 -2.236068
## CD 14 Monocyte | MAFB   66.66667  50.00000 33.33333 50.00000 -1.414214
##                                p       fdr        fc
## CD 14 Monocyte | FOXK2 0.1327375 0.6000000  41.66667
## CD 14 Monocyte | STAT2 0.0742625 0.6000000  66.66667
## CD 16 Monocyte | STAT2 0.0334375 0.6000000 -58.33333
## CD 16 Monocyte | RELB  0.1327375 0.6000000  41.66667
## NK | MYCN              0.1327375 0.6000000  41.66667
## CD 14 Monocyte | MAFB  0.2989125 0.8181818  16.66667

Based on this analysis we have one significant regulon-celltype combination based on p-value 0.05 cutoff (CD16 Monocyte | STAT2, where CD16 Monocyte is the celltype cluster and STAT2 the regulon). The FDR values are not significant due to the nature in which we simulated the data and the dataset being rather small. With the ROTS object you can easily do some plotting. Let’s explore a few options (more can be found in ROTS documentation)

plot(ROTS_obj, type="pvalue")

plot(ROTS_obj, type="reproducibility")

plot(ROTS_obj, type="volcano", fdr=0.61, labels = T)

References

[1] S. Aibar et al., “SCENIC: single-cell regulatory network inference and clustering”, Nat Methods 14, 1083–1086, Oct. 2017. https://doi.org/10.1038/nmeth.4463.

[2] T. Suomi et al., “ROTS: An R package for reproducibility-optimized statistical testing”, PLOS Comput. Biol., vol. 13, no. 5, p. e1005562, May 2017, doi: 10.1371/journal.pcbi.1005562.

[3] Y. Hao et al., “Dictionary learning for integrative, multimodal and scalable single-cell analysis”, Nat Biotechnol. Feb. 2024;42(2):293-304. doi: 10.1038/s41587-023-01767-y.